Vector Search

Semantic search finds messages by meaning, not just keyword overlap: a query like “planning offsite agenda” can surface a message titled “Q2 team kickoff” if the bodies discuss the same topic, even when none of the query words appear in the result. msgvault builds that capability on top of the default keyword (FTS5) search by sending message text to an embedding endpoint you configure, then storing the vectors locally in vectors.db alongside your archive.

When vector search is enabled, the search command, the HTTP /api/v1/search endpoint, and the MCP search_messages tool all accept mode=vector (pure semantic) and mode=hybrid (BM25 + vector fused with Reciprocal Rank Fusion). A separate MCP tool, find_similar_messages, returns nearest-neighbor messages for a given seed. The vectors and archive stay local, but embedding work is performed by the endpoint in your config. If that endpoint is hosted by a third party, message text and semantic query text are sent there; use a local or self-hosted endpoint when you need the workflow to stay on your own machine or network.
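The fusion step behind mode=hybrid can be illustrated with a minimal Python sketch of Reciprocal Rank Fusion. It assumes the standard formulation, in which each message scores the sum of 1/(k + rank) over the ranked lists it appears in, with 1-based ranks. The function name rrf_fuse and the sample message IDs are hypothetical, and msgvault's own implementation may differ in detail.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked ID lists with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best-first (e.g. BM25 hits, vector hits).
    k: RRF constant; larger values flatten the differences between ranks.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_hits = ["m1", "m2", "m3"]    # hypothetical keyword ranking
vector_hits = ["m3", "m1", "m4"]  # hypothetical semantic ranking
fused = rrf_fuse([bm25_hits, vector_hits])
```

A message that ranks well in both lists (m1 and m3 here) outscores one that appears in only a single list, which is why hybrid mode tends to surface results that match on both wording and meaning.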

Prerequisites

  1. A running OpenAI-compatible embedding endpoint. msgvault does not host a model. Point it at a local, self-hosted, or hosted endpoint that you trust. Common local options include Ollama, llama.cpp’s server, and LM Studio. The endpoint must accept POST /embeddings with an OpenAI-style JSON body and return indexed data rows such as {"data": [{"index": 0, "embedding": [...]}]}.

  2. A build with sqlite_vec support. The standard make build target already passes -tags "fts5 sqlite_vec". If you see errors mentioning “binary was built without -tags sqlite_vec”, rebuild via make build (or go build -tags "fts5 sqlite_vec" if you are invoking go build directly).
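The endpoint contract in the first prerequisite can be sketched in Python as follows. This is an illustration only: the request body uses the OpenAI-style model and input fields, the helper names build_request_body and parse_embeddings are hypothetical, and the validation shown is an assumption about what a careful client would check, not a description of msgvault's code.

```python
import json


def build_request_body(model, texts):
    """OpenAI-style embeddings payload: one input string per message."""
    return json.dumps({"model": model, "input": list(texts)})


def parse_embeddings(response_json, expected_count, dimension):
    """Return vectors ordered by their `index` field, validating the shape."""
    rows = sorted(json.loads(response_json)["data"], key=lambda r: r["index"])
    if [r["index"] for r in rows] != list(range(expected_count)):
        raise ValueError("response indices do not cover every input")
    vectors = [r["embedding"] for r in rows]
    for vec in vectors:
        if len(vec) != dimension:
            raise ValueError(f"expected {dimension} dims, got {len(vec)}")
    return vectors


body = build_request_body("nomic-embed-text", ["hello", "world"])
# A made-up response with rows out of order and a toy dimension of 3.
fake = '{"data": [{"index": 1, "embedding": [0, 0, 1]}, {"index": 0, "embedding": [1, 0, 0]}]}'
vectors = parse_embeddings(fake, expected_count=2, dimension=3)
```

Sorting by index matters because the rows identify their input by position, so a client should not rely on the order in which the server returns them.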

Windows source builds

The sqlite-vec CGo binding needs sqlite3.h at compile time, and the MinGW 15 toolchain needs two extra flags to link arrow-go/v18’s helpers. The easiest path is powershell -File scripts/build.ps1, which wires everything up automatically. To invoke go build yourself from PowerShell:

Terminal window
C:\msys64\usr\bin\pacman.exe -S --noconfirm --needed mingw-w64-x86_64-sqlite3
$env:CGO_ENABLED = "1"
$env:CGO_CFLAGS = "-IC:/msys64/mingw64/include -fgnu89-inline"
$env:CGO_LDFLAGS = "-Wl,--allow-multiple-definition"
go build -tags "fts5 sqlite_vec" -o msgvault.exe ./cmd/msgvault

Enable

Add a [vector] block to ~/.msgvault/config.toml:

[vector]
enabled = true
backend = "sqlite-vec"
# db_path defaults to <data_dir>/vectors.db when empty.
# db_path = "/path/to/vectors.db"
[vector.embeddings]
endpoint = "http://tailnet-host:11434/v1"
api_key_env = "OLLAMA_API_KEY" # optional; omit for anonymous endpoints
model = "nomic-embed-text"
dimension = 768
batch_size = 32 # embeddings per HTTP call
timeout = "30s"
max_retries = 3
max_input_chars = 2000 # see sizing guidance below
[vector.preprocess]
strip_quotes = true # drop quoted reply blocks before embedding
strip_signatures = true # drop common `-- ` signature blocks
[vector.search]
rrf_k = 60 # RRF constant; higher flattens score differences
k_per_signal = 100 # candidate pool size per signal (BM25 or vector)
subject_boost = 2.0 # score boost when a query term hits the subject
max_page_size_hybrid = 50 # hard cap on vector/hybrid page_size
[vector.embed.schedule]
cron = "*/5 * * * *" # embed worker cron (5-field); empty disables cron
run_after_sync = true # run a pass after every successful scheduled sync

The [vector] section only takes effect when enabled = true and the binary was built with sqlite_vec. If either is missing, msgvault behaves as before. Disabled vector search returns vector_not_enabled from server surfaces; a binary built without sqlite_vec reports a rebuild-with-sqlite-vec error when vector features are requested.

Matching max_input_chars to your embedder’s context window

max_input_chars is an upper bound in characters; the embedder tokenizes the text on its own, so the limit in tokens depends on the model. Set max_input_chars low enough that the longest input stays within the embedder’s maximum context, or full-length messages can fail with HTTP 400 during msgvault build-embeddings.

Practical guidance:

  • 2k-token embedding models: start around max_input_chars = 2000 and raise only after confirming the endpoint accepts longer inputs.
  • 8k-token embedding models: start around max_input_chars = 24000.
  • Self-hosted models: match the actual context window exposed by your server, not just the upstream model card.

If msgvault build-embeddings fails with repeated HTTP 400 warnings, check the embedder’s logs. A message such as “the input length exceeds the context length” confirms that you need to lower max_input_chars.

Initial Embedding

Once vector search is enabled and your archive has synced or imported messages, embed it:

Terminal window
msgvault build-embeddings --full-rebuild --yes

This creates a new building generation, seeds the pending queue with every non-deleted message in your archive, drains the queue in batches through your configured embedder, and atomically activates the generation once every pending row has been embedded. During the first build, when no active generation exists yet, HTTP and MCP vector/hybrid search return index_building; use mode=fts for the interim.

The initial embed is the largest and longest operation. Runtime is roughly proportional to archive size divided by embedding throughput.

Keeping the Index Up to Date

After the initial rebuild, new messages arriving via email sync need to be embedded as well. msgvault handles this in two ways depending on how you run it.

CLI workflow (manual syncs)

If you run msgvault sync-full or msgvault sync (alias: sync-incremental) by hand, new Gmail and IMAP messages are auto-enqueued into every non-retired generation during the sync. In steady state that means the active generation; during a rebuild it means both the old active generation and the new building generation. Run msgvault build-embeddings (no --full-rebuild) to drain the queue:

Terminal window
# Sync new messages (auto-enqueues them for embedding)
msgvault sync you@gmail.com
# Drain the embedding queue into the active generation
msgvault build-embeddings

msgvault build-embeddings without --full-rebuild is a short, incremental operation: it picks up the configured active generation, drains any pending rows, and exits. You can schedule it via cron, run it after every sync, or chain it (sync && build-embeddings).

Daemon workflow (msgvault serve)

In daemon mode the scheduler can run both pieces automatically. The [vector.embed.schedule] section controls the embed worker independently from the sync scheduler:

[vector.embed.schedule]
cron = "*/5 * * * *" # run every 5 minutes
run_after_sync = true # and opportunistically after every scheduled sync

With run_after_sync = true, every successful scheduled sync triggers an immediate embed pass against the queue it just populated. The standalone cron ensures the queue drains even when syncs are quiet (e.g. overnight). An empty cron = "" disables the standalone schedule (useful if you only want the post-sync trigger).

What auto-enqueues

| Ingest path | Auto-enqueues? |
| --- | --- |
| sync-full / sync (Gmail, IMAP) | Yes |
| Scheduled syncs in msgvault serve | Yes |
| import-emlx (Apple Mail backup) | No. Re-run --full-rebuild after large imports |
| import-mbox / import (mbox, eml) | No. Re-run --full-rebuild after large imports |
| Chat imports (iMessage, WhatsApp, Google Voice) | No. Run a full rebuild after importing if you want chats included |

For ingest paths that do not auto-enqueue, running msgvault build-embeddings --full-rebuild --yes rebuilds the index over the full archive including the newly-imported messages. A same-model full rebuild is atomic from the searcher’s perspective: vector and hybrid queries keep answering from the previous active generation until the new one is ready. If the rebuild changes the configured model or dimension, vector and hybrid queries return index_stale until the new generation activates.

CLI:

Terminal window
msgvault search "planning offsite agenda" --mode hybrid
msgvault search "planning offsite agenda" --mode vector --explain
msgvault search "..." --json --mode hybrid # JSON output with scores

CLI vector and hybrid modes run against the local archive. If [remote].url is configured, msgvault search --mode vector|hybrid is rejected; call the remote server’s HTTP /api/v1/search endpoint directly for remote vector search.

HTTP:

Terminal window
curl "http://localhost:8080/api/v1/search?q=planning+offsite&mode=hybrid"
curl "http://localhost:8080/api/v1/search?q=planning+offsite&mode=vector&explain=1"

Response shape differs from the FTS path; see the Web Server reference for details. Pagination is not supported for vector/hybrid responses; bump page_size (capped at max_page_size_hybrid) instead.
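The curl calls above translate to a small Python client along these lines. It is a sketch under stated assumptions: page_size is passed as a query parameter, the X-API-Key header is sent only when a key is supplied, and the helper names build_search_url and search are hypothetical. The response is returned as parsed JSON without interpreting its fields.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_search_url(base_url, query, mode="hybrid", page_size=None, explain=False):
    """Build the /api/v1/search URL with properly encoded parameters."""
    params = {"q": query, "mode": mode}
    if page_size is not None:
        params["page_size"] = page_size  # assumed parameter name, per the text above
    if explain:
        params["explain"] = 1
    return f"{base_url}/api/v1/search?{urlencode(params)}"


def search(base_url, query, api_key=None, **kwargs):
    """Send the request and return the decoded JSON body."""
    req = Request(build_search_url(base_url, query, **kwargs))
    if api_key:
        req.add_header("X-API-Key", api_key)
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)


url = build_search_url("http://localhost:8080", "planning offsite", mode="vector", explain=True)
```

Because vector and hybrid responses are not paginated, a client that wants more results raises page_size on a single request rather than looping over pages.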

mode=vector and mode=hybrid require at least one free-text term: the free text is what gets embedded as the query vector. A query that is purely operators (e.g. from:alice label:IMPORTANT) is rejected; HTTP and MCP return missing_free_text. Use mode=fts for those.
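A client can avoid missing_free_text by checking the query before choosing a mode. The sketch below assumes that operators look like key:value tokens, as in the from:alice label:IMPORTANT example; the regular expression and the helper names are hypothetical and do not reproduce msgvault's actual query parser.

```python
import re

# Hypothetical operator pattern: a lowercase key, a colon, then a value (e.g. from:alice).
OPERATOR = re.compile(r"^[a-z_]+:\S+$")


def free_text_terms(query):
    """Return the tokens that are not key:value operators."""
    return [tok for tok in query.split() if not OPERATOR.match(tok)]


def pick_mode(query, preferred="hybrid"):
    """Fall back to fts when the query has nothing to embed."""
    return preferred if free_text_terms(query) else "fts"
```

For example, pick_mode("from:alice label:IMPORTANT") selects fts, while pick_mode("from:alice offsite agenda") keeps the preferred hybrid mode.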

MCP tools:

  • search_messages accepts mode (fts/vector/hybrid) and explain arguments.
  • find_similar_messages takes a seed message_id and returns its nearest neighbors (excluding the seed itself). It accepts optional account, after, before, and has_attachment filters.

Model Rotation

To switch models or dimensions, update [vector.embeddings].model and/or .dimension in your config, then run:

Terminal window
msgvault build-embeddings --full-rebuild --yes

This builds a new generation with the new fingerprint and activates it atomically when the build completes. While the rebuild is in flight, mode=vector and mode=hybrid return index_stale (the previously-active generation no longer matches the configured fingerprint, so search refuses to serve potentially-mismatched results). Use mode=fts until the new generation activates; it does not depend on the vector index. Once msgvault build-embeddings reports the new generation activated, vector and hybrid modes resume.
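The rules above reduce to a small decision function. The following Python sketch is a simplified model rather than msgvault's code: it treats the fingerprint as the pair of model and dimension, and the Fingerprint class and vector_search_status function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fingerprint:
    model: str
    dimension: int


def vector_search_status(enabled: bool, configured: Fingerprint, active: Optional[Fingerprint]) -> str:
    """Map index state to the outcome a vector or hybrid query would see."""
    if not enabled:
        return "vector_not_enabled"
    if active is None:
        return "index_building"  # first build: nothing has been activated yet
    if active != configured:
        return "index_stale"     # active generation predates the config change
    return "ok"


old = Fingerprint("nomic-embed-text", 768)
new = Fingerprint("another-model", 1024)  # made-up model name for illustration
during_rotation = vector_search_status(True, configured=new, active=old)
```

During a rotation the status stays index_stale until the new generation activates, at which point the active and configured fingerprints match again.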

Troubleshooting

Common HTTP/MCP error codes and fixes. The CLI reports equivalent conditions as command errors rather than structured codes.

| Error | Meaning | Recovery |
| --- | --- | --- |
| vector_not_enabled | The server or MCP process did not wire a vector backend, usually because [vector] enabled = false. | Set enabled = true, configure [vector.embeddings], and start with a sqlite_vec build. |
| index_stale | Active generation’s model/dimension doesn’t match the configured [vector.embeddings] fingerprint. | Run msgvault build-embeddings --full-rebuild --yes. |
| index_building | No active generation yet; one is being built. | Finish running msgvault build-embeddings or wait for the scheduler. Use mode=fts for the interim. |
| missing_free_text | mode=vector or mode=hybrid used with a filter-only query (no free text to embed). | Add free-text terms to q, or switch to mode=fts. |
| pagination_unsupported | Request asked for page>1 with mode=vector or mode=hybrid. | Raise page_size (capped at max_page_size_hybrid) instead of paging. |
| invalid_mode | mode= value other than fts, vector, hybrid. | Pick one of those. |
| embedding_timeout | The embedding endpoint did not respond before the request deadline (transient: slow/cold model, network blip). | Retry; if persistent, raise [vector.embeddings].timeout or use a faster endpoint. |

If msgvault build-embeddings repeatedly logs “embed batch failed ... HTTP 400” and aborts after 5 consecutive failures, check the embedder’s logs. If they report that the input length exceeds the context length (Ollama) or an equivalent token-limit error, lower max_input_chars to match the model’s context window. See the sizing guidance above.

To confirm the binary was built with vector support:

Terminal window
msgvault search "probe" --mode vector

A clear “rebuild with sqlite_vec” error indicates the tag is missing. A different error (vector_not_enabled, index_stale, etc.) means the command moved past the build-tag check and is now waiting on config or backfill.

Check index health via the stats endpoint:

Terminal window
curl -H "X-API-Key: ..." http://localhost:8080/api/v1/stats | jq .vector_search

The active_generation.message_count should roughly match total_messages. pending_embeddings_total shows how many rows still need embedding (either because a rebuild is in flight or because recent syncs have not yet been drained).
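The same check can be scripted. The Python sketch below works on an already-decoded stats document and rests on assumptions about its layout: total_messages is taken to be a top-level field, and the 1% tolerance for "roughly match" is arbitrary. The function name index_health and the sample numbers are hypothetical.

```python
def index_health(stats, tolerance=0.01):
    """Summarize embedding coverage from a decoded /api/v1/stats response."""
    vs = stats["vector_search"]
    total = stats["total_messages"]  # assumed location of the archive-wide count
    embedded = vs["active_generation"]["message_count"]
    pending = vs["pending_embeddings_total"]
    coverage = embedded / total if total else 1.0
    return {
        "coverage": coverage,
        "pending": pending,
        "healthy": coverage >= 1 - tolerance and pending == 0,
    }


sample = {
    "total_messages": 1000,
    "vector_search": {
        "active_generation": {"message_count": 990},
        "pending_embeddings_total": 10,
    },
}
report = index_health(sample)
```

A non-zero pending count is not an error by itself; it usually means a rebuild is in flight or the next embed pass has not run yet.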

What Gets Embedded

The embedder produces one vector per message. The input for each message is assembled from subject and body_text after preprocessing (configurable in [vector.preprocess]):

  • Optional stripping of quoted-reply blocks (> ... lines and common reply-preamble markers).
  • Optional stripping of trailing signatures (lines after -- ).
  • Truncation at max_input_chars at a UTF-8 rune boundary.

Messages deleted at the source (deleted_from_source_at IS NOT NULL) are skipped entirely. Messages without a body_text fall back to HTML-to-text conversion of body_html so HTML-only messages still contribute full-body embeddings. Messages with neither body field use the subject only; if the subject is also empty, the embedder receives an empty string for that row.
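The preprocessing steps listed above can be approximated in a few lines of Python. This is a simplified sketch, not msgvault's implementation: it treats any line starting with > as quoted text, cuts at the first line that is exactly "-- ", joins subject and body with a blank line (an assumed format), and counts max_input_chars in characters. All function names are hypothetical.

```python
def strip_quotes(body):
    """Drop lines that start with '>' (quoted reply text)."""
    return "\n".join(
        line for line in body.splitlines() if not line.lstrip().startswith(">")
    )


def strip_signature(body):
    """Cut everything from the first '-- ' delimiter line onward."""
    lines = body.splitlines()
    for i, line in enumerate(lines):
        if line == "-- ":
            return "\n".join(lines[:i])
    return body


def embedding_input(subject, body, max_input_chars=2000):
    """Assemble subject and cleaned body, then truncate to the character limit."""
    cleaned = strip_signature(strip_quotes(body)).strip()
    joined = f"{subject}\n\n{cleaned}" if cleaned else subject
    # Python slices strings by code point, so a multi-byte character is never split.
    return joined[:max_input_chars]


body = "Sounds good.\n> earlier reply\n-- \nAlice"
text = embedding_input("Re: offsite", body)
```

A byte-oriented implementation would have to back up to the nearest character boundary before cutting, which is the case the rune-boundary rule addresses.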

See Also