Skip to content
GitHub stars

Data Storage

Storage Layers

LayerRoleLocation
SQLiteSystem of record~/.msgvault/msgvault.db
ParquetAnalytics cache~/.msgvault/analytics/
AttachmentsContent-addressed files~/.msgvault/attachments/
TokensOAuth credentials~/.msgvault/tokens/

SQLite (System of Record)

All message data (metadata, labels, participants, and raw MIME) lives in a single SQLite database. This is the authoritative copy of your archive.

Core Tables

sources — Gmail accounts with sync state.

ColumnTypeDescription
idINTEGER PKAuto-increment
source_typeTEXTgmail
identifierTEXTEmail address
display_nameTEXTAccount display name
sync_cursorTEXTGmail history ID for incremental sync
last_sync_atDATETIMELast sync timestamp

conversations — Gmail thread abstraction.

ColumnTypeDescription
idINTEGER PKAuto-increment
source_idINTEGER FKReferences sources
source_conversation_idTEXTGmail thread ID
conversation_typeTEXTemail_thread
message_countINTEGERDenormalized count
last_message_atDATETIMELatest message timestamp

messages — Message metadata. Foreign key to conversations.

ColumnTypeDescription
idINTEGER PKAuto-increment
conversation_idINTEGER FKReferences conversations
source_idINTEGER FKReferences sources
source_message_idTEXTGmail message ID
message_typeTEXTemail
sent_atDATETIMESend timestamp
sender_idINTEGER FKReferences participants
subjectTEXTEmail subject
body_textTEXTPlain text content
snippetTEXTPreview excerpt
size_estimateINTEGERApproximate size in bytes
has_attachmentsBOOLEANAttachment flag
deleted_atDATETIMESoft delete timestamp

message_raw — Raw MIME blob storage, compressed with zlib.

ColumnTypeDescription
message_idINTEGER PK/FKReferences messages
raw_dataBLOBCompressed MIME data
compressionTEXTzlib

participants — Unified contacts.

ColumnTypeDescription
idINTEGER PKAuto-increment
email_addressTEXTEmail address (unique index)
display_nameTEXTContact name
domainTEXTExtracted domain

message_recipients — From/To/Cc/Bcc mapping.

ColumnTypeDescription
message_idINTEGER FKReferences messages
participant_idINTEGER FKReferences participants
recipient_typeTEXTfrom, to, cc, bcc

labels / message_labels — Gmail labels (many-to-many).

TableKey Columns
labelsid, source_id, source_label_id, name, label_type
message_labelsmessage_id, label_id

attachments — Content-addressed attachment metadata.

ColumnTypeDescription
idINTEGER PKAuto-increment
message_idINTEGER FKReferences messages
filenameTEXTOriginal filename
mime_typeTEXTMIME type
sizeINTEGERSize in bytes
content_hashTEXTSHA-256 hash
storage_pathTEXTRelative path: ab/abcd1234...

sync_runs / sync_checkpoints — Sync state for resumability.

TablePurpose
sync_runsTrack each sync operation (start, end, counts, errors)
sync_checkpointsResume point per source (message ID, page token)

FTS5 Virtual Table

CREATE VIRTUAL TABLE messages_fts USING fts5(
subject,
body_text,
content='messages',
content_rowid='id'
);

Powers full-text search via msgvault search. Indexed on subject and body_text columns.

Relationships

sources ─┬─< conversations ─< messages ─┬─< message_recipients ─> participants
│ ├─< message_labels ─> labels
│ ├── message_raw
│ └─< attachments
└─< labels

Parquet (Analytics Cache)

The TUI needs to aggregate across your entire archive (top senders, domains, labels, time series) and return results instantly as you drill down. SQLite JOINs across normalized tables cannot do this at interactive speeds on large archives. msgvault solves this with denormalized Parquet files queried by an embedded DuckDB engine, delivering aggregate queries hundreds of times faster than SQLite.

The Parquet cache is disposable and can be rebuilt at any time. The TUI automatically builds or updates it on launch when new messages are detected.

Terminal window
# Manual build
msgvault build-cache
# Full rebuild (discard existing)
msgvault build-cache --full-rebuild

Directory structure:

analytics/
├── messages/
│ ├── year=2020/
│ ├── year=2021/
│ └── ...
├── participants/
├── message_recipients/
├── labels/
├── attachments/
├── sources/
└── _last_sync.json

Messages are partitioned by year for efficient time-range queries. The entire analytics cache is typically a few MB even for hundreds of thousands of messages, compared to the much larger SQLite database with full message bodies.

Content-Addressed Attachments

Every attachment from every message (PDFs, photos, documents, spreadsheets, archives) is extracted and stored as a plain file on your local filesystem. No more digging through Gmail’s web UI or hitting API rate limits to retrieve your own files. Once synced, they are yours to browse, back up, or process however you like.

Attachments are deduplicated by SHA-256 hash:

attachments/
├── ab/
│ └── abcd1234567890... # full SHA-256 hash as filename
├── cd/
│ └── cdef9876543210...
└── ...

The first two characters of the hash form the subdirectory (sharding). Multiple messages referencing the same attachment share one file.

Token Storage

OAuth tokens are stored as JSON files per account:

tokens/
├── personal@gmail.com.json
└── work@company.com.json

Tokens refresh automatically. Protect this directory; tokens grant read/write access to Gmail.

Compression

DataFormatRatio
Raw MIMEzlib in SQLite BLOB~3-5x compression
ParquetSnappy (DuckDB default)~10x vs raw SQLite
AttachmentsStored as-is (already compressed formats)