
SentrySearch splits videos into overlapping chunks, embeds each chunk directly as video using either Google's Gemini Embedding API or a local Qwen3-VL model, and stores the vectors in a local ChromaDB database. When you search, your text query is embedded into the same vector space and matched against the stored chunks. The top match is automatically trimmed and saved as a clip.
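At the default settings (30s chunks, 5s overlap), chunk start times advance by 25s, so an hour of footage yields 144 chunks. A minimal sketch of the windowing arithmetic; the function name is illustrative, not SentrySearch's actual API:

```python
def chunk_windows(duration_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0):
    """Yield (start, end) times for overlapping chunks covering the whole video."""
    stride = chunk_s - overlap_s  # 25 s between chunk starts at the defaults
    start = 0.0
    while start < duration_s:
        yield start, min(start + chunk_s, duration_s)
        start += stride

# A 1-hour video at the defaults -> 144 overlapping windows.
print(len(list(chunk_windows(3600))))  # 144
```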
How it works
The key insight: both Gemini Embedding 2 and Qwen3-VL can natively embed video. Raw video pixels are projected into the same vector space as text queries. No transcription, no frame captioning, no text middleman. A query like "red truck at a stop sign" is directly comparable to a 30-second video clip at the vector level.
This is what makes sub-second semantic search over hours of footage practical.
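A toy sketch of why the search step is so cheap once everything lives in one vector space. The vectors below are random stand-ins; in SentrySearch they would come from the configured backend (Gemini Embedding or Qwen3-VL), with video chunks and text queries landing in the same space:

```python
import numpy as np

rng = np.random.default_rng(0)
chunk_vecs = rng.standard_normal((144, 768))                     # one row per 30 s chunk
chunk_vecs /= np.linalg.norm(chunk_vecs, axis=1, keepdims=True)  # unit length

query_vec = rng.standard_normal(768)
query_vec /= np.linalg.norm(query_vec)

# With unit vectors, cosine similarity reduces to a dot product; the best
# chunk is simply the argmax over all chunk scores.
scores = chunk_vecs @ query_vec
best = int(np.argmax(scores))
print(f"best chunk: {best}, score: {scores[best]:.3f}")
```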
Language
Python. Uses ChromaDB for vector storage.
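A minimal sketch of what the ChromaDB layer could look like, assuming 768-dimensional embeddings have already been computed; the collection name, database path, and metadata keys are illustrative rather than SentrySearch's actual schema:

```python
import chromadb
import numpy as np

client = chromadb.PersistentClient(path="./sentrysearch_db")
chunks = client.get_or_create_collection(name="video_chunks")

# Stand-in vectors; real ones come from the embedding backend.
rng = np.random.default_rng(0)
vec_a, vec_b, query = (rng.standard_normal(768).tolist() for _ in range(3))

# Index: one vector per chunk, with start/end times kept as metadata so the
# top hit can later be trimmed into a clip.
chunks.add(
    ids=["footage.mp4:0", "footage.mp4:1"],
    embeddings=[vec_a, vec_b],
    metadatas=[{"file": "footage.mp4", "start": 0.0, "end": 30.0},
               {"file": "footage.mp4", "start": 25.0, "end": 55.0}],
)

# Search: the text query is embedded with the same backend, then matched by
# nearest-neighbour lookup.
hits = chunks.query(query_embeddings=[query], n_results=1)
print(hits["metadatas"][0][0])
```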
Cost
Indexing 1 hour of footage costs ~$2.84 with Gemini (default: 30s chunks, 5s overlap). Qwen3-VL-Embedding costs nothing beyond local compute. Qwen supports Matryoshka Representation Learning: only the first 768 dimensions of each embedding are kept and L2-normalized.
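The Matryoshka truncation itself is simple: slice off the leading dimensions, then re-normalize so dot-product comparisons stay valid. A sketch; the full-size vector below is a random stand-in for a Qwen3-VL embedding:

```python
import numpy as np

def mrl_truncate(vec, dims: int = 768) -> np.ndarray:
    """Keep the first `dims` Matryoshka dimensions and L2-normalize the result."""
    v = np.asarray(vec, dtype=np.float32)[:dims]
    return v / np.linalg.norm(v)

full = np.random.default_rng(0).standard_normal(2048)  # stand-in for a full-size embedding
small = mrl_truncate(full)
print(small.shape, round(float(np.linalg.norm(small)), 3))  # (768,) 1.0
```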
Value
Very relevant for video search use cases. The "no transcription needed" approach is a big deal for surveillance, media libraries, or any footage where speech isn't the primary content. Direct video-to-vector embedding is the future of video search.