How to Design a Hybrid RAG Stack with pgvector + Elasticsearch
16 min read
Wed Feb 18 2026


A production retrieval blueprint covering index design, parallel query planning, rank fusion, reranking, and offline evaluation before prompt tuning.

RAG
pgvector
Elasticsearch
AI Systems

How to Optimize a Hybrid RAG Stack

In production RAG, the objective is not "best embedding similarity." The objective is: maximize answer quality under token budget and latency constraints. Hybrid retrieval works because semantic and lexical systems fail differently.

Vector retrieval captures intent-level similarity. BM25 captures exact terms and rare tokens. A robust stack queries both, fuses candidates, reranks top-k, and only then builds prompt context.

Retrieval pipeline stages

  1. Chunk and embed documents with stable IDs and version markers.
  2. Issue vector and lexical searches in parallel.
  3. Fuse candidates with reciprocal-rank fusion (RRF).
  4. Apply reranker to top fused set.
  5. Pack context with citation IDs and source metadata.
hybrid-retrieval.ts
// Assumed app-level clients: `vectorStore`, `elastic`, and `rerank` are
// project-specific wrappers that return results already mapped to the
// Candidate shape below (rank is the 1-based position in each result list).
type Candidate = {
  chunkId: string;
  source: "vector" | "bm25";
  rank: number;
  score: number;
};

// Reciprocal-rank fusion: each list contributes 1 / (k + rank) per chunk,
// so agreement between the vector and lexical legs compounds. k = 60 is the
// damping constant from the original RRF paper.
function reciprocalRankFusion(groups: Candidate[][], k = 60) {
  const scoreMap = new Map<string, number>();

  for (const group of groups) {
    for (const item of group) {
      const prev = scoreMap.get(item.chunkId) ?? 0;
      scoreMap.set(item.chunkId, prev + 1 / (k + item.rank));
    }
  }

  return [...scoreMap.entries()]
    .map(([chunkId, fusedScore]) => ({ chunkId, fusedScore }))
    .sort((a, b) => b.fusedScore - a.fusedScore);
}

export async function hybridRetrieve(query: string) {
  // Run both legs in parallel so latency is bounded by the slower leg,
  // not the sum of the two.
  const [vectorHits, bm25Hits] = await Promise.all([
    vectorStore.search(query, { topK: 40 }),
    elastic.search(query, { size: 40 }),
  ]);

  // Fuse both candidate lists, then rerank only a top slice to cap reranker cost.
  const fused = reciprocalRankFusion([vectorHits, bm25Hits]);
  return rerank(query, fused.slice(0, 25));
}
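The retrieval code above covers stages 1 through 4; stage 5, context packing, is where the token budget from the opening paragraph actually gets enforced. Here is a minimal sketch of budget-aware packing with citation IDs. The RerankedHit shape and the countTokens helper are assumptions, not part of the stack described above; swap in your reranker's output type and your model's tokenizer.

context-pack.ts
type RerankedHit = {
  chunkId: string;
  documentId: string;
  body: string;
  rerankScore: number;
};

// Hypothetical tokenizer wrapper; replace with your model's tokenizer.
declare function countTokens(text: string): number;

export function packContext(hits: RerankedHit[], tokenBudget: number) {
  const blocks: string[] = [];
  let used = 0;

  for (const hit of hits) {
    // Tag each chunk with its citation ID and source so answers can be traced.
    const block = `[${hit.chunkId}] (doc:${hit.documentId})\n${hit.body}`;
    const cost = countTokens(block);
    if (used + cost > tokenBudget) break;
    blocks.push(block);
    used += cost;
  }

  return blocks.join("\n\n");
}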

Data Model and Indexing Choices

Index design decisions dominate recall and latency. For pgvector, choose the index type by scale and update profile: IVFFlat builds faster and uses less memory, but it trains its lists on existing rows, so recall degrades on append-heavy tables until you reindex; HNSW offers a better speed-recall tradeoff and tolerates incremental inserts, at the cost of higher memory and longer build times.

hybrid-index.sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE rag_chunks (
  chunk_id UUID PRIMARY KEY,
  document_id UUID NOT NULL,
  tenant_id UUID NOT NULL,
  body TEXT NOT NULL,
  embedding VECTOR(1536) NOT NULL,
  -- Generated tsvector keeps the lexical index in sync with body automatically.
  lexical_tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED,
  metadata JSONB NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- HNSW index for cosine similarity; m and ef_construction trade build time
-- and memory for recall.
CREATE INDEX rag_chunks_embedding_hnsw
ON rag_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

-- GIN index serves the lexical leg of hybrid retrieval.
CREATE INDEX rag_chunks_lexical_gin
ON rag_chunks
USING gin (lexical_tsv);

-- Composite index for tenant-scoped filters and per-document lookups.
CREATE INDEX rag_chunks_tenant_idx
ON rag_chunks (tenant_id, document_id);
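To connect this schema to the parallel-query step, here is a sketch of both searches issued directly against rag_chunks with node-postgres. It assumes a Pool configured from environment variables, a precomputed query embedding, and a tenant filter; the embedding is passed as a pgvector text literal via JSON.stringify. In the stack described above the lexical leg would go to Elasticsearch instead, but the GIN index supports the same query in Postgres.

pg-hybrid-query.ts
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from PG* env vars

// Vector leg: cosine distance served by the HNSW index, scoped to a tenant.
// Recall can be tuned per session with the hnsw.ef_search setting if needed.
export async function vectorSearch(embedding: number[], tenantId: string, topK = 40) {
  const { rows } = await pool.query(
    `SELECT chunk_id, 1 - (embedding <=> $1::vector) AS score
       FROM rag_chunks
      WHERE tenant_id = $2
      ORDER BY embedding <=> $1::vector
      LIMIT $3`,
    [JSON.stringify(embedding), tenantId, topK] // '[0.1,0.2,...]' is a valid vector literal
  );
  return rows.map((r, i) => ({
    chunkId: r.chunk_id,
    source: "vector" as const,
    rank: i + 1,
    score: Number(r.score),
  }));
}

// Lexical leg: tsvector match ranked by ts_rank. Note that ts_rank is not
// true BM25; the "bm25" label only matches the Candidate type above.
export async function lexicalSearch(query: string, tenantId: string, topK = 40) {
  const { rows } = await pool.query(
    `SELECT chunk_id, ts_rank(lexical_tsv, websearch_to_tsquery('english', $1)) AS score
       FROM rag_chunks
      WHERE tenant_id = $2
        AND lexical_tsv @@ websearch_to_tsquery('english', $1)
      ORDER BY score DESC
      LIMIT $3`,
    [query, tenantId, topK]
  );
  return rows.map((r, i) => ({
    chunkId: r.chunk_id,
    source: "bm25" as const,
    rank: i + 1,
    score: Number(r.score),
  }));
}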

Evaluation Before Prompt Tuning

Most teams tune prompts before they measure retrieval. This is backwards. Run offline retrieval evaluation first: MRR, nDCG, and hit@k on labeled query-document pairs, so you know whether a bad answer came from the retriever or the prompt.

retrieval-eval.ts
interface LabeledQuery {
  query: string;
  relevantChunkIds: string[];
}

export async function evaluateRetrieval(dataset: LabeledQuery[]) {
  let hitAt5 = 0;
  let reciprocalRankSum = 0;

  for (const item of dataset) {
    const hits = await hybridRetrieve(item.query);
    const top5 = hits.slice(0, 5).map((h) => h.chunkId);

    // hit@5: did any labeled-relevant chunk land in the top five?
    if (top5.some((id) => item.relevantChunkIds.includes(id))) {
      hitAt5 += 1;
    }

    // MRR: reciprocal rank of the first relevant chunk, 0 if none was retrieved.
    const rrIndex = hits.findIndex((h) => item.relevantChunkIds.includes(h.chunkId));
    reciprocalRankSum += rrIndex === -1 ? 0 : 1 / (rrIndex + 1);
  }

  return {
    hitAt5: hitAt5 / dataset.length,
    mrr: reciprocalRankSum / dataset.length,
  };
}
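The paragraph above lists nDCG alongside MRR and hit@k, but the harness only computes the latter two. A sketch of binary-relevance nDCG@k that could be folded into evaluateRetrieval, using the same names as the types above:

ndcg.ts
// Binary-relevance nDCG@k: gain is 1 for a labeled-relevant chunk, 0 otherwise.
export function ndcgAtK(rankedChunkIds: string[], relevantChunkIds: string[], k = 5) {
  const relevant = new Set(relevantChunkIds);

  // DCG of the actual ranking: relevant hits discounted by log2 of position.
  const dcg = rankedChunkIds
    .slice(0, k)
    .reduce((sum, id, i) => sum + (relevant.has(id) ? 1 / Math.log2(i + 2) : 0), 0);

  // Ideal DCG: all relevant chunks ranked first.
  const idealHits = Math.min(relevant.size, k);
  let idcg = 0;
  for (let i = 0; i < idealHits; i++) {
    idcg += 1 / Math.log2(i + 2);
  }

  return idcg === 0 ? 0 : dcg / idcg;
}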

Deployment rule

Ship retriever changes behind feature flags and log retrieval traces. Without per-query traces, you cannot debug why a good prompt produced a bad answer.
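A useful trace captures, at minimum, the candidates from each leg, the fused order, and what was actually packed into the prompt. A minimal sketch of that record, assuming a hypothetical logTrace sink (stdout, a table, or your observability pipeline); the field names are illustrative, not a fixed schema:

retrieval-trace.ts
interface RetrievalTrace {
  traceId: string;
  query: string;
  retrieverVersion: string;    // ties the trace to a feature-flagged retriever build
  vectorChunkIds: string[];    // ranked output of the vector leg
  bm25ChunkIds: string[];      // ranked output of the lexical leg
  fusedChunkIds: string[];     // order after RRF
  rerankedChunkIds: string[];  // order after the reranker
  packedChunkIds: string[];    // what actually made it into the prompt
  timestamp: string;
}

// Hypothetical sink: replace with your logger or a Postgres/Elasticsearch write.
export async function logTrace(trace: RetrievalTrace) {
  console.log(JSON.stringify(trace));
}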