
RAG Pipeline

The server provides a per-session Retrieval-Augmented Generation pipeline that indexes local codebases, chunks source files by language boundaries, and retrieves relevant context using hybrid BM25 + vector search with Reciprocal Rank Fusion.

Architecture Overview

The RAG pipeline has three stages: indexing (scanning files and splitting them into chunks), storage (keeping chunks in an in-memory knowledge base with BM25 and vector indexes), and retrieval (finding the most relevant chunks for a user's query using hybrid search).

Directory  -->  FileIndexer  -->  CodeChunker  -->  KnowledgeBase (in-memory)
                                                          |
User Query  -->  HybridRetriever  -->  [BM25Stream 60%]  -+  RRF  -->  Results
                                  -->  [VectorStream 40%] -+

Each session gets its own RagService, lazily created by SessionManager.getRag(sessionId). The service is thread-safe: indexing is serialized via a ReentrantLock, while reads (search) run concurrently.
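A minimal sketch of that lazy per-session lookup, assuming a ConcurrentHashMap.computeIfAbsent underneath. SessionManagerSketch and RagServiceStub are illustrative stand-ins, not the real SessionManager and RagService classes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative stand-in for the real RagService.
class RagServiceStub {
    final String sessionId;
    RagServiceStub(String sessionId) { this.sessionId = sessionId; }
}

class SessionManagerSketch {
    private final Map<String, RagServiceStub> services = new ConcurrentHashMap<>();

    // computeIfAbsent gives atomic lazy creation: each session id maps to
    // exactly one service instance, created on first access.
    RagServiceStub getRag(String sessionId) {
        return services.computeIfAbsent(sessionId, RagServiceStub::new);
    }
}
```

Repeated calls with the same session id return the same instance, so per-session state (indexes, hashes) survives across requests.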

RagService

The central orchestrator for a session's RAG pipeline.

RagService rag = sessionManager.getRag("my-session");

// Index a directory
rag.indexDirectory(Path.of("/project/src"), progress -> {
    System.out.printf("Indexed %d/%d: %s%n",
        progress.indexedFiles(), progress.totalFiles(), progress.currentFile());
});

// Search
List<SearchResult> results = rag.search("authentication middleware", 5);

// Build augmented prompt (auto-prepends context)
String prompt = rag.buildContextPrompt("How does auth work?", 5);

// Document management
String docId = rag.addDocument("Custom knowledge...", Map.of("source", "manual"));
rag.removeDocument(docId);
List<RagService.DocumentInfo> docs = rag.listDocuments();

The hybrid retriever is configured at construction with BM25 at 60% weight and the vector knowledge base at 40%:

this.hybridRetriever = HybridRetriever.builder()
    .stream(bm25Stream, 0.6)
    .stream(new KnowledgeBaseStream(knowledgeBase), 0.4)
    .build();

FileIndexer

The FileIndexer recursively walks a directory, identifies source files by extension, splits them into chunks using CodeChunker, and stores the chunks in the knowledge base. It supports incremental indexing so only changed files are re-processed on subsequent runs.

Supported Extensions (28+)

The indexer recognizes 28+ file extensions covering most popular programming languages and configuration formats.

java, ts, tsx, js, jsx, py, md, json, yml, yaml, xml, html, css, sh, sql, go, rs, rb, kt, scala, c, cpp, h -- plus language aliases (kts, bash, zsh, markdown, htm, cc, cxx, hpp, sc).

Filtering

The indexer automatically skips build artifacts, dependency directories, and files matching your .gitignore patterns to avoid polluting the knowledge base with irrelevant content.

  • Skipped directories: .git, node_modules, build, dist, target, .idea, .vscode, .gradle, __pycache__, vendor, .next, out, coverage, .svn, .hg
  • Ignore files: Reads .gitignore and .tnsignore from the root, converting glob patterns to Java PathMatcher instances
  • Size limit: Files larger than 512 KB or empty files are skipped
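The glob-to-PathMatcher conversion can be sketched with java.nio's built-in glob support. fromGitignoreLine is a hypothetical helper, and real .gitignore semantics (negation, anchoring, trailing slashes) are richer than this:

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;

// Illustrative only: turning a .gitignore-style glob line into a
// java.nio PathMatcher, as the indexer is described to do.
class GlobFilter {
    static PathMatcher fromGitignoreLine(String line) {
        // "glob:" syntax supports ** for crossing directory boundaries
        return FileSystems.getDefault().getPathMatcher("glob:" + line);
    }
}
```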

Incremental Indexing

To avoid re-processing unchanged files, the indexer computes a SHA-256 hash of each file's content and stores it in a ConcurrentHashMap<String, String>. On re-index:

  1. If the hash matches the previous run, the file is skipped
  2. If the file changed, old chunks are removed from both KnowledgeBase and BM25Stream
  3. New chunks are generated and added

Call fileIndexer.clearHashes() to force a full re-index.
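The hash-and-compare step might look like the following sketch; IncrementalCheck and needsReindex are illustrative names, not the real FileIndexer API:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class IncrementalCheck {
    private final Map<String, String> hashes = new ConcurrentHashMap<>();

    static String sha256(String content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(content.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);  // SHA-256 is always available
        }
    }

    /** Returns true if the file is new or changed, and records the new hash. */
    boolean needsReindex(String path, String content) {
        String hash = sha256(content);
        return !hash.equals(hashes.put(path, hash));
    }
}
```

Clearing the map (as clearHashes() is described to do) makes every file look new again, forcing a full re-index.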

CodeChunker

The CodeChunker splits source files into semantically meaningful chunks -- for example, by class or function boundaries in Java/TypeScript, or by headings in Markdown. This ensures that search results return coherent, self-contained code blocks rather than arbitrary line ranges.

Chunking Strategies

The chunker picks a strategy based on the file's language. Languages with known structure get smarter splitting; everything else falls back to fixed-size line groups.

Language                Strategy                   Boundary Detection
Java, Kotlin, Scala     Class/method boundaries    Regex: class/interface/enum/record declarations + method signatures
TypeScript, JavaScript  Function/class boundaries  Regex: export/function/class/const arrow declarations
Markdown                Heading boundaries         Regex: #{1,6} heading lines
Everything else         Fixed line groups          Max 100 lines per chunk

Small files (100 lines or fewer) are always kept as a single chunk. Large boundary-detected chunks are sub-split into 100-line groups.
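The fixed-size fallback (and the sub-splitting of oversized boundary-detected chunks) reduces to grouping lines in blocks of at most 100, roughly as sketched here:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the fixed-size fallback: split a file's lines into groups of
// at most 100 lines. Files of 100 lines or fewer yield a single chunk.
class LineChunker {
    static final int MAX_LINES = 100;

    static List<List<String>> chunk(List<String> lines) {
        List<List<String>> chunks = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += MAX_LINES) {
            int end = Math.min(start + MAX_LINES, lines.size());
            chunks.add(lines.subList(start, end));
        }
        return chunks;
    }
}
```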

Each chunk becomes a Document with metadata:

Document.builder()
    .id("src/auth/Middleware.java:15-45")
    .content(chunkContent)
    .metadata("file", "src/auth/Middleware.java")
    .metadata("startLine", 15)
    .metadata("endLine", 45)
    .metadata("language", "java")
    .build();

BM25Stream

The BM25Stream provides keyword-based search using the Okapi BM25 algorithm, which is the same ranking function used by search engines like Elasticsearch. It scores documents based on how well their terms match the query, accounting for term frequency and document length.

Parameters

These BM25 parameters control how the scoring behaves. The defaults work well for code search and rarely need tuning.

Parameter  Value  Description
K1         1.2    Term frequency saturation
B          0.75   Document length normalization
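With those defaults, the per-term Okapi BM25 score can be written as a small function. This is a textbook formulation under assumed names (tf, df, n, dl, avgdl); the actual implementation may differ in details such as IDF smoothing:

```java
// Per-term Okapi BM25 score with k1 = 1.2 and b = 0.75.
// tf: term frequency in the document; df: documents containing the term;
// n: corpus size; dl: document length; avgdl: average document length.
class Bm25 {
    static final double K1 = 1.2, B = 0.75;

    static double termScore(double tf, double df, double n, double dl, double avgdl) {
        double idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
        double norm = tf * (K1 + 1) / (tf + K1 * (1 - B + B * dl / avgdl));
        return idf * norm;
    }
}
```

Higher term frequency raises the score with diminishing returns (K1 controls the saturation point), while B penalizes long documents toward the corpus average.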

Text Processing Pipeline

Before scoring, queries and documents go through a text processing pipeline that normalizes, tokenizes, and stems terms. This improves recall by matching different forms of the same word.

  1. Tokenization: Lowercase, strip non-alphanumeric (except _), split on whitespace, drop tokens with 1 character or fewer
  2. Stop word removal: 50 common English stop words
  3. Stemming: Suffix-stripping rules for 15 suffixes (-ies, -ing, -tion, -sion, -ment, -ness, -able, -ous, -ful, -less, -ly, -ed, -er, -es, -s)
  4. Synonym expansion (query-time only): 20 coding-domain synonym pairs
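The tokenization step can be sketched as a short stream pipeline (illustrative, not the exact implementation): lowercase, strip non-alphanumerics except underscores, split on whitespace, drop short tokens.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class Tokenizer {
    static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase()
                .replaceAll("[^a-z0-9_\\s]", " ")  // keep letters, digits, _
                .split("\\s+"))
            .filter(t -> t.length() > 1)           // drop 1-char tokens
            .collect(Collectors.toList());
    }
}
```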

Synonym Pairs

At query time, common coding abbreviations are expanded to their full forms (and vice versa) so that searching for "auth" also finds documents containing "authentication".

Term    Synonyms
db      database
auth    authentication, authorization
config  configuration
perf    performance
impl    implementation
req     request
res     response
err     error
msg     message
fn      function
param   parameter
repo    repository
env     environment
async   asynchronous
sync    synchronous

HybridRetriever

The HybridRetriever combines results from multiple search strategies (like BM25 keyword search and vector similarity search) into a single ranked list. This hybrid approach gives better results than either method alone because keyword search finds exact term matches while vector search captures semantic similarity.

Fusion Algorithm

The retriever merges results using Reciprocal Rank Fusion (RRF), which combines rankings without needing normalized scores. For each document appearing in any stream's results:

score(doc) = SUM over streams: weight(stream) / (K + rank(doc, stream) + 1)

Where K = 60 (the RRF constant). Documents are then sorted by fused score.
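A sketch of weighted RRF as defined above, using zero-based ranks so each stream contributes weight / (K + rank + 1) per document. WeightedRanking is an illustrative type, not the real stream interface:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Rrf {
    static final int K = 60;  // the RRF constant

    record WeightedRanking(List<String> docIds, double weight) {}

    static Map<String, Double> fuse(List<WeightedRanking> streams) {
        Map<String, Double> scores = new HashMap<>();
        for (WeightedRanking s : streams) {
            for (int rank = 0; rank < s.docIds().size(); rank++) {
                // Each stream's contribution decays with the document's rank.
                scores.merge(s.docIds().get(rank), s.weight() / (K + rank + 1), Double::sum);
            }
        }
        return scores;
    }
}
```

Because only ranks (not raw scores) enter the formula, BM25 and vector scores never need to be normalized onto a common scale.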

Diversification

To prevent a single large file from dominating search results, the retriever limits output to a maximum of 3 chunks per source file. This ensures the agent sees context from multiple relevant files.
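The diversification pass is a single ordered walk over the fused ranking, keeping at most 3 chunks per file; ScoredChunk is an illustrative type standing in for the real result class:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Diversifier {
    record ScoredChunk(String file, String chunkId) {}

    static final int MAX_PER_FILE = 3;

    static List<ScoredChunk> diversify(List<ScoredChunk> ranked) {
        Map<String, Integer> perFile = new HashMap<>();
        List<ScoredChunk> out = new ArrayList<>();
        for (ScoredChunk c : ranked) {
            // Count occurrences per source file; drop everything past the cap.
            int seen = perFile.merge(c.file(), 1, Integer::sum);
            if (seen <= MAX_PER_FILE) out.add(c);
        }
        return out;
    }
}
```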

HybridRetriever retriever = HybridRetriever.builder()
    .stream(bm25Stream, 0.6)       // 60% weight
    .stream(vectorStream, 0.4)      // 40% weight
    .build();

List<SearchResult> results = retriever.retrieve("authentication flow", 10);

Context Prompt Format

When the agent asks a question, RagService.buildContextPrompt searches for relevant code and prepends it to the user's query. This gives the LLM the codebase context it needs to answer accurately.

[Relevant code context]
--- file: src/auth/Middleware.java (lines 15-45) ---
public class AuthMiddleware {
    private final TokenValidator validator;
    ...
}

--- file: src/auth/TokenValidator.java (lines 1-30) ---
public class TokenValidator {
    ...
}

[User question]
How does the authentication middleware work?

If no context is found (empty knowledge base or no matches), the original query is returned unchanged.
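Assembling that format is plain string building; ContextChunk and build are assumptions standing in for the real RagService internals, not its API:

```java
import java.util.List;

class PromptBuilder {
    record ContextChunk(String file, int startLine, int endLine, String content) {}

    static String build(String query, List<ContextChunk> chunks) {
        if (chunks.isEmpty()) return query;  // no context: query passes through unchanged
        StringBuilder sb = new StringBuilder("[Relevant code context]\n");
        for (ContextChunk c : chunks) {
            sb.append("--- file: ").append(c.file())
              .append(" (lines ").append(c.startLine()).append('-').append(c.endLine())
              .append(") ---\n").append(c.content()).append("\n\n");
        }
        return sb.append("[User question]\n").append(query).toString();
    }
}
```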

Document Management API

Beyond automatic directory indexing, you can manually add, list, and remove documents in the knowledge base. This is useful for injecting custom knowledge (like deployment procedures or domain-specific documentation) that is not part of the codebase.

// Add a document with metadata
String docId = rag.addDocument("Custom knowledge content",
    Map.of("source", "user", "topic", "deployment"));

// List documents (returns preview, length, metadata)
List<RagService.DocumentInfo> docs = rag.listDocuments();
// DocumentInfo(id, preview (first 100 chars), contentLength, metadata)

// Get a specific document
Optional<Document> doc = rag.getDocument(docId);

// Remove
boolean removed = rag.removeDocument(docId);

// Clear everything
rag.clear();

Documents added via addDocument are tracked separately and appear in listDocuments(). Both manually added documents and file-indexed chunks are searchable through the same hybrid retriever.
