SentrySearch

Semantic search over videos using native video embeddings from Gemini or Qwen3-VL.

SentrySearch splits videos into overlapping chunks, embeds each chunk natively as video using either Google's Gemini Embedding API or a local Qwen3-VL model, and stores the vectors in a local ChromaDB database. At query time, your text query is embedded into the same vector space and matched against the chunk vectors. The top match is automatically trimmed and saved as a clip.
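The chunking step can be sketched as follows. This is an illustrative helper, not the actual SentrySearch API: with the defaults of 30 s chunks and 5 s overlap, each chunk starts 25 s after the previous one.

```python
def chunk_spans(duration_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0):
    """Yield (start, end) times covering the whole video with overlapping chunks."""
    stride = chunk_s - overlap_s  # 30 s chunks, 5 s overlap -> 25 s stride
    start = 0.0
    while start < duration_s:
        yield (start, min(start + chunk_s, duration_s))
        start += stride

# A 90-second video splits into four overlapping chunks:
spans = list(chunk_spans(90))
# [(0.0, 30.0), (25.0, 55.0), (50.0, 80.0), (75.0, 90.0)]
```

The overlap means an event straddling a chunk boundary is still fully contained in at least one chunk.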

How it works

The key insight: both Gemini Embedding 2 and Qwen3-VL can natively embed video. Raw video pixels are projected into the same vector space as text queries. No transcription, no frame captioning, no text middleman. A query like "red truck at a stop sign" is directly comparable to a 30-second video clip at the vector level.

This is what makes sub-second semantic search over hours of footage practical.

Language

Python. Uses ChromaDB for vector storage.

Cost

Indexing 1 hour of footage costs ~$2.84 with Gemini at the default settings (30 s chunks, 5 s overlap). Qwen3-VL-Embedding is free but runs locally. Qwen's embeddings support Matryoshka Representation Learning: only the first 768 dimensions are kept, then L2-normalized.
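The Matryoshka step is just truncate-then-renormalize. A sketch in pure Python (toy dimension counts; the real vectors are truncated to 768 dims):

```python
import math

def mrl_truncate(vec, dims=768):
    """Keep the first `dims` Matryoshka dimensions and L2-normalize the result."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy 4-dim vector truncated to its first 2 dims: [3, 4] normalizes to [0.6, 0.8].
v = mrl_truncate([3.0, 4.0, 100.0, -7.0], dims=2)
```

Because MRL-trained models front-load information into the leading dimensions, the truncated vector remains usable for similarity search at a fraction of the storage cost.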

Value

Very relevant for video search use cases. The "no transcription needed" approach is a big deal for surveillance, media libraries, or any footage where speech isn't the primary content. Direct video-to-vector embedding is the future of video search.
