Knowledge & Memory
Local Brain
A semantic knowledge OS that transforms your entire codebase into searchable, queryable memory through vector embeddings, local LLMs, and an MCP bridge that compresses files by up to 100x before they ever reach the cloud.
In Plain English
This is like a private search engine that lives on your computer and actually understands what you are looking for. It uses AI to read the meaning behind your words instead of just matching keywords, so searching for "that script that fixes WiFi issues" actually finds it even if the file is called something completely different. All of this runs locally, so nothing leaves your machine.
Problem
Every developer who works with AI assistants faces a fundamental tension: cloud LLMs are powerful reasoners, but they know nothing about your specific codebase. Each new conversation starts from scratch, and the only way to give context is to paste in entire files, burning through token budgets at alarming speed. A single 50KB log file can easily consume more than ten thousand tokens, and the LLM still might not find the one error line that matters.
The obvious workaround is to manually copy-paste relevant snippets, but that shifts the cognitive burden back onto the developer. You have to know which files are relevant before you can ask the question, which defeats the purpose of having an AI assistant in the first place. What is really needed is a layer that sits between the human and the cloud, one that already knows your codebase, understands the semantic relationships between files, and can compress thousands of lines into a tight summary that preserves exactly the information the LLM needs.
Local Brain solves this by running entirely on local hardware. It crawls your projects, chunks every file into overlapping segments, generates 768-dimensional vector embeddings using nomic-embed-text through Ollama, and stores everything in a SQLite database. When a query arrives, it performs cosine similarity search across those vectors to find the most relevant chunks, then uses a local LLM to compress the results into a concise answer. A 10,000-line log file becomes a 200-token error summary. A sprawling project directory becomes a 500-token context brief. The cloud LLM never sees the raw files; it sees the distilled intelligence.
Architecture
Features
Semantic Search
768-dim vectors
Every file in your codebase is chunked into overlapping 1500-character segments and embedded into 768-dimensional vectors using nomic-embed-text through Ollama. When you search for "that WiFi fix script," the system performs cosine similarity across all stored vectors and returns the most relevant chunks, regardless of filenames or folder structure. The result is conceptual search that understands meaning, not just keywords.
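A minimal sketch of what that lookup can look like, assuming a simple chunks table with embeddings stored as float32 blobs; the table layout and helper names here are illustrative, not the actual brain.db schema:

```python
import sqlite3

import numpy as np
import requests


def embed(text: str) -> np.ndarray:
    # Ollama's embeddings endpoint; nomic-embed-text returns a 768-dim vector.
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
    )
    return np.array(resp.json()["embedding"], dtype=np.float32)


def search(db_path: str, query: str, top_k: int = 5) -> list[tuple[float, str, int]]:
    q = embed(query)
    q /= np.linalg.norm(q)
    rows = sqlite3.connect(db_path).execute(
        "SELECT path, chunk_index, embedding FROM chunks"
    ).fetchall()
    scored = []
    for path, idx, blob in rows:
        v = np.frombuffer(blob, dtype=np.float32)
        # Cosine similarity between the query vector and the stored chunk vector.
        scored.append((float(np.dot(q, v / np.linalg.norm(v))), path, idx))
    return sorted(scored, reverse=True)[:top_k]


print(search("brain.db", "that script that fixes WiFi issues"))
```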
Token Compression
10-100x savings
The core insight driving Local Brain is that cloud LLMs do not need to see raw files. A 50KB log file can be compressed to a 200-token error summary. A sprawling project directory becomes a 500-token context brief. The MCP bridge intercepts file-read requests and routes them through local models that extract only the relevant information, achieving compression ratios from 10x for code reviews up to 100x for log analysis. Your cloud token budget stretches dramatically further.
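The arithmetic behind those ratios is straightforward. A rough sketch using the common ~4 characters-per-token estimate (the exact tokenizer math will differ):

```python
# Rough token math using the ~4 characters-per-token heuristic.
def est_tokens(num_chars: int) -> int:
    return num_chars // 4

raw_log_chars = 50 * 1024               # a 50KB log file
raw_tokens = est_tokens(raw_log_chars)  # ~12,800 tokens if pasted verbatim
summary_tokens = 200                    # a compressed error summary
print(raw_tokens // summary_tokens)     # ~64x fewer tokens sent to the cloud
```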
Intelligent Routing
6 specialized tools
Not every request needs the same treatment. The classify_task tool uses Hermes3 to determine whether a query should go to the summarizer, the code reviewer, or the error extractor. Code review requests route to Qwen2.5-Coder for its deep understanding of programming patterns. Log analysis routes to Hermes3 with severity and time filters. Each request type has a specific model assignment, token budget, and output format optimized for its purpose.
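One way to picture that routing table: each request type carries a model, a token budget, and an output format. The model names match the stack below, but the request-type keys, budgets, and formats in this sketch are assumptions:

```python
# Illustrative routing matrix: request type -> model, token budget, output format.
ROUTING = {
    "summarize":    {"model": "hermes3",       "max_tokens": 500, "format": "summary"},
    "classify":     {"model": "hermes3",       "max_tokens": 100, "format": "json"},
    "analyze_log":  {"model": "hermes3",       "max_tokens": 300, "format": "json"},
    "review_code":  {"model": "qwen2.5-coder", "max_tokens": 800, "format": "markdown"},
    "explain_diff": {"model": "qwen2.5-coder", "max_tokens": 400, "format": "markdown"},
}

def route(request_type: str) -> dict:
    # Unknown request types fall back to general-purpose summarization.
    return ROUTING.get(request_type, ROUTING["summarize"])
```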
Persistent Crawling
15+ file types
The Python-side crawler watches your project directories using watchdog for real-time filesystem events. It parses Python, TypeScript, Markdown, JSON, Lua, PowerShell, YAML, PDF, and DOCX files, among others, then chunks and embeds them into the vector store. A state tracker based on file content hashes ensures only changed files get re-indexed, making incremental updates near-instant even across large codebases.
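A stripped-down version of that watcher loop, where index_file is a placeholder standing in for the real chunk-and-embed step:

```python
import time
from pathlib import Path

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

WATCHED = (".py", ".ts", ".md", ".json", ".lua", ".ps1", ".yaml", ".pdf", ".docx")

def index_file(path: str) -> None:
    print(f"re-indexing {path}")  # placeholder for chunk -> embed -> store

class ReindexHandler(FileSystemEventHandler):
    def on_modified(self, event):
        # Only re-index files with extensions the crawler understands.
        if not event.is_directory and event.src_path.endswith(WATCHED):
            index_file(event.src_path)

observer = Observer()
observer.schedule(ReindexHandler(), path=str(Path("~/projects").expanduser()), recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()
```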
How It Works
Crawl and Index
The Python crawler scans configured project directories, parsing files across 15+ formats. Each file is split into 1500-character chunks with a 150-character overlap to preserve context across boundaries. The chunker generates a content hash for each file, so subsequent crawls skip unchanged files entirely. Every chunk is embedded into a 768-dimensional vector by nomic-embed-text through Ollama and stored in brain.db alongside its source path, chunk index, and timestamp.
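A sketch of the chunk-and-hash step; the constants match the numbers above, while the helper names are illustrative and the embed-and-store call is omitted:

```python
import hashlib

CHUNK_SIZE = 1500
OVERLAP = 150

def content_hash(text: str) -> str:
    # An unchanged hash lets the next crawl skip the file entirely.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def chunk(text: str) -> list[str]:
    # 1500-character windows that overlap by 150 characters.
    step = CHUNK_SIZE - OVERLAP
    return [text[i:i + CHUNK_SIZE] for i in range(0, len(text), step)]

# Each chunk would then be embedded by nomic-embed-text and written to
# brain.db with its source path, chunk index, and timestamp.
```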
Query Arrives via MCP
When Claude Code, Claude Desktop, or any HTTP client needs file content, the request arrives at the TypeScript MCP bridge server on port 8420. The bridge exposes six specialized tools: read_smart for compressed file reading, analyze_log for error extraction, build_context for project overviews, classify_task for agent routing, review_code for code feedback, and digest_directory for folder summaries. Each tool accepts structured parameters including source paths, intent hints, severity filters, and token budgets.
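Because the bridge also accepts plain HTTP clients, a request can be as simple as a POST to port 8420. The endpoint path and field names below are assumptions based on the tool descriptions, not a documented API:

```python
import requests

resp = requests.post(
    "http://localhost:8420/tools/analyze_log",  # hypothetical route
    json={
        "source": "/var/log/myapp/server.log",  # hypothetical log path
        "severity": "error",                    # severity filter
        "max_tokens": 300,                      # token budget for the reply
    },
    timeout=60,
)
print(resp.json())
```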
Cache Check and Model Selection
Before hitting Ollama, every request passes through the cache layer. Cache keys are generated from a hash of the file path, request type, and filter parameters, with a default TTL of five minutes. On a cache hit, the stored response returns instantly. On a miss, the system consults its model selection matrix: Hermes3 handles summarization, classification, and general compression, while Qwen2.5-Coder handles code review and diff explanation. Each request type also carries a maximum token budget to keep responses tight.
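A minimal sketch of that cache layer, assuming an in-memory dict; the key fields and five-minute TTL follow the description above, the rest is illustrative:

```python
import hashlib
import json
import time

TTL_SECONDS = 300  # five-minute default TTL
_cache: dict[str, tuple[float, str]] = {}

def cache_key(path: str, request_type: str, filters: dict) -> str:
    # Hash of file path, request type, and filter parameters.
    payload = json.dumps(
        {"path": path, "type": request_type, "filters": filters}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def get_cached(key: str) -> str | None:
    entry = _cache.get(key)
    if entry and time.time() - entry[0] < TTL_SECONDS:
        return entry[1]   # hit: return the stored response instantly
    return None           # miss: fall through to model selection

def put_cached(key: str, response: str) -> None:
    _cache[key] = (time.time(), response)
```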
Local LLM Processing
The selected model receives the raw file content along with a task-specific prompt. For log analysis, the prompt instructs the model to extract only errors matching the severity filter within the specified time range. For code review, it focuses on bugs, memory leaks, or whatever the task hint specifies. For context building, it produces a structured overview of the project's architecture. The model runs entirely through Ollama on local hardware, so no data leaves the machine.
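A sketch of the generation call for the log-analysis case, using Ollama's generate endpoint; the prompt wording is illustrative rather than the project's actual template:

```python
import requests

def compress_log(raw_log: str, severity: str = "error", window: str = "last 24h") -> str:
    prompt = (
        f"Extract only {severity}-level entries from the {window} of this log. "
        "Return a short JSON list of objects with timestamp, message, and count.\n\n"
        + raw_log
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "hermes3", "prompt": prompt, "stream": False},
    )
    return resp.json()["response"]
```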
Compressed Response
The model's output is formatted according to the requested output type (summary, JSON, markdown, list, or raw), wrapped in metadata that includes the compression ratio and cache status, and returned to the caller. A 10,000-line log file that would have consumed 40,000 tokens becomes a 200-token structured JSON listing just the errors. The cloud LLM receives this compressed intelligence instead of the raw file, saving both money and context window space while preserving the information that actually matters.
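The envelope might look roughly like this, with the compression ratio estimated from character counts; the field names are assumptions:

```python
def wrap_response(raw_text: str, compressed: str, output_format: str, cached: bool) -> dict:
    # Estimate token counts at ~4 characters per token.
    raw_tokens = max(len(raw_text) // 4, 1)
    out_tokens = max(len(compressed) // 4, 1)
    return {
        "format": output_format,  # summary | json | markdown | list | raw
        "content": compressed,
        "meta": {
            "compression_ratio": round(raw_tokens / out_tokens, 1),
            "cached": cached,
        },
    }
```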
Tech Stack
MCP Bridge
TypeScript + @modelcontextprotocol/sdk
Ingestion
Python + watchdog + pypdf + python-docx
Models
Hermes3, Qwen2.5-Coder via Ollama
Embeddings
nomic-embed-text (768-dim)
Storage
SQLite (brain.db) with NumPy cosine similarity
Caching
Hash-keyed file cache with 5-min TTL