SQL Queries

The msgvault query command lets you run SQL directly against your archive’s Parquet analytics cache using an in-memory DuckDB engine. This gives you full SQL expressiveness for ad-hoc analysis, custom reports, and data exploration beyond what the built-in analytics commands provide.

Basic Usage

msgvault query "SELECT count(*) AS total FROM messages"

The command takes a single argument: the SQL string. If the analytics cache is stale, it is automatically rebuilt before the query runs (progress is printed to stderr).

Output Formats

Control output format with the --format flag:

# JSON (default)
msgvault query "SELECT count(*) AS total FROM messages"

# CSV
msgvault query --format csv "SELECT from_email, message_count FROM v_senders LIMIT 5"

# Aligned text table
msgvault query --format table "SELECT from_email, message_count FROM v_senders LIMIT 5"

Format	Description
`json`	JSON object with `columns`, `rows`, and `row_count` fields (default)
`csv`	Standard CSV with a header row
`table`	Aligned text table with separator line and `(N rows)` footer

Available Views

Base views

These map directly to the Parquet files in ~/.msgvault/analytics/.

View	Key Columns
`messages`	id, source_id, source_message_id, conversation_id, subject, snippet, sent_at, size_estimate, has_attachments, attachment_count, sender_id, message_type, year, month, deleted_from_source_at
`participants`	id, email_address, domain, display_name, phone_number
`message_recipients`	message_id, participant_id, recipient_type (from/to/cc/bcc), display_name
`labels`	id, name
`message_labels`	message_id, label_id
`attachments`	message_id, filename, size
`conversations`	id, source_conversation_id, title, conversation_type
`sources`	id, source_type

Convenience views

Pre-joined and aggregated views for common queries.

View	Description
`v_messages`	Messages with resolved sender (from_email, from_name, from_domain, from_phone) and labels as a JSON array
`v_senders`	Per-sender aggregates: from_email, from_name, from_domain, message_count, total_size, attachment_size, attachment_count, first_message_at, last_message_at
`v_domains`	Per-domain aggregates: domain, message_count, total_size, sender_count
`v_labels`	Per-label: name, message_count, total_size
`v_threads`	Per-conversation: conversation_id, source_conversation_id, conversation_title, conversation_type, message_count, first_message_at, last_message_at, participant_emails (JSON array)

Example Queries

Top senders by message count

msgvault query "
  SELECT from_email, from_name, message_count, total_size
  FROM v_senders
  ORDER BY message_count DESC
  LIMIT 20
"

Domain breakdown

msgvault query "
  SELECT domain, message_count, sender_count,
         total_size / (1024*1024) AS size_mb
  FROM v_domains
  ORDER BY message_count DESC
  LIMIT 20
"

Messages per month for a given year

msgvault query "
  SELECT month, count(*) AS messages
  FROM messages
  WHERE year = 2024
  GROUP BY month
  ORDER BY month
"

Label statistics

msgvault query "
  SELECT name, message_count, total_size / (1024*1024) AS size_mb
  FROM v_labels
  ORDER BY message_count DESC
"

Largest attachments

msgvault query --format table "
  SELECT a.filename, a.size / (1024*1024) AS size_mb,
         m.subject, m.sent_at
  FROM attachments a
  JOIN messages m ON a.message_id = m.id
  ORDER BY a.size DESC
  LIMIT 20
"

Thread activity

msgvault query "
  SELECT conversation_title, message_count,
         first_message_at, last_message_at
  FROM v_threads
  ORDER BY message_count DESC
  LIMIT 10
"

Filter by label

The labels column in v_messages is a JSON array string. Use DuckDB’s list_contains to filter:

msgvault query "
  SELECT subject, from_email, sent_at
  FROM v_messages
  WHERE list_contains(labels::VARCHAR[], 'INBOX')
  ORDER BY sent_at DESC
  LIMIT 20
"

Or join through the base tables for more control:

msgvault query "
  SELECT m.subject, m.sent_at
  FROM messages m
  JOIN message_labels ml ON m.id = ml.message_id
  JOIN labels l ON ml.label_id = l.id
  WHERE l.name = 'INBOX'
  ORDER BY m.sent_at DESC
  LIMIT 20
"

Tips

Partition pruning. The messages view is hive-partitioned by year. Adding WHERE year = 2024 to queries on messages lets DuckDB skip irrelevant Parquet files, which speeds up queries on large archives.

Auto-cache rebuild. If the analytics cache is stale or missing, msgvault query rebuilds it automatically before running your SQL. The rebuild progress prints to stderr, so it does not interfere with piping query output.

Pipe-friendly. JSON and CSV output modes are designed for piping into other tools (jq, csvkit, xsv, etc.). Use --format csv for spreadsheet workflows or --format json for programmatic consumption.

Full DuckDB SQL. You have access to DuckDB’s full SQL dialect, including window functions, CTEs, UNNEST, list_contains, and all built-in functions. See the DuckDB documentation for the full SQL reference.