Building BM25 Keyword Search In Postgres

Published August 16, 2025 by Toran Billups

It turns out you can build a surprisingly powerful keyword search, ranked by an algorithm known as BM25, with just a handful of SQL functions. What follows is my journey: what I've learned, how this algorithm works in simple terms, and how I implemented it without any external dependencies.

The Foundation: From TF-IDF to BM25

Before diving into BM25, it's helpful to understand TF-IDF (Term Frequency-Inverse Document Frequency). It's a simple but powerful idea for measuring how important a word is to one document within a wider collection of documents.

  • Term Frequency (TF): This is just what it sounds like. How often does a term appear in a single document? The intuition is that if the word "data" appears 5 times in a 100-word article, it's probably more relevant to that article than a word that appears only once in that same document.
  • Inverse Document Frequency (IDF): This measures how rare a word is across *all* documents. Common words like "the" or "a" will appear everywhere and have a low IDF score, effectively penalizing them. Rare words will appear in fewer documents and should have a higher IDF score, signaling that they are more significant.

By multiplying these two scores, you get the TF-IDF score, which gives you a weighted value for how important a word is to a specific document. This was the starting point for my exploration.
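
To make this concrete, here is TF-IDF on a toy corpus, sketched in Python (the documents and tokenization are simplified stand-ins, not the real pipeline):

```python
import math

# Toy corpus: each document is a list of pre-tokenized terms.
docs = [
    ["data", "data", "pipeline", "data"],
    ["cooking", "pasta", "recipe"],
    ["data", "storage"],
]

def tf(term, doc):
    # Term frequency: how often the term appears in one document.
    return doc.count(term)

def idf(term, docs):
    # Inverse document frequency: rare terms (appearing in few
    # documents) score higher. Assumes the term appears somewhere.
    doc_count = sum(1 for d in docs if term in d)
    return math.log(len(docs) / doc_count)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)
```

Here "pipeline" (rare, one document) gets a higher IDF than "data" (two of three documents), and the first document's three occurrences of "data" push its TF-IDF above the third document's single occurrence.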

BM25 is essentially a more refined version of this. It builds on the core concepts of TF-IDF but adds a couple of clever improvements:

  • Term Frequency Saturation: BM25 recognizes that a word's relevance doesn't increase linearly with how often it appears. The difference between a word appearing zero times and one time is huge. The difference between it appearing 20 times and 21 times is negligible. BM25 provides a simple configuration knob (k1) so you can control how quickly the term frequency score "saturates."
  • Document Length Normalization: Longer documents naturally have an advantage because they have more opportunities for a search term to appear. BM25 provides a simple configuration knob (b) so you can penalize longer documents to level the playing field, ensuring that relevance isn't just a side effect of verbosity.
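
Both improvements fall out of a single formula. A Python sketch of BM25's term-frequency component (parameter names follow the conventional k1/b notation) makes the behavior easy to see:

```python
def bm25_tf(tf, doc_len, avg_len, k1=1.2, b=0.75):
    # k1 controls how quickly repeated occurrences stop adding score;
    # b controls how strongly longer-than-average documents are penalized.
    norm = 1 - b + b * (doc_len / avg_len)
    return (tf * (k1 + 1)) / (tf + k1 * norm)

# Saturation: the jump from 0 -> 1 occurrence is far bigger
# than the jump from 20 -> 21 occurrences.
gain_first = bm25_tf(1, 100, 100) - bm25_tf(0, 100, 100)
gain_late = bm25_tf(21, 100, 100) - bm25_tf(20, 100, 100)
```

With the same term frequency, a 50-word document outscores a 200-word one, which is exactly the length normalization at work.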

The Implementation

The beauty of this approach is that the entire search algorithm lives within Postgres and can be called from any tech stack.

Here's a breakdown of the core components:

Text Processing and Tokenization

The first step in any search system is to break raw text down into meaningful terms, or tokens. I created a tokenize_and_count function to handle this. It takes a block of text and:

  • converts it to lowercase,
  • removes common stop words using the built-in English stop words list (like 'a', 'the', 'is'),
  • uses stemming to reduce words to their root form (e.g., "canceling" and "canceled" both become "cancel"),
  • filters out blank tokens, and
  • finally returns a table of unique terms and their frequencies in the text.

    CREATE OR REPLACE FUNCTION tokenize_and_count(input_text TEXT)
    RETURNS TABLE (term TEXT, count INTEGER) AS $$
    BEGIN
        -- Get stemmed tokens with stop words removed
        RETURN QUERY
        WITH tokens AS (
            SELECT word
            FROM ts_parse('default', lower(input_text)) AS t(tokid, word)
            WHERE tokid != 12  -- Filter out blank tokens
        ),
        processed_tokens AS (
            SELECT
                CASE
                    WHEN ts_lexize('public.simple_dict', word) = '{}'
                    THEN NULL  -- It's a stop word
                    ELSE COALESCE(
                        (ts_lexize('public.english_stem', word))[1],
                        word
                    )
                END AS processed_word
            FROM tokens
        )
        SELECT
            processed_word as term,
            COUNT(*)::INTEGER
        FROM processed_tokens
        WHERE processed_word IS NOT NULL
        GROUP BY processed_word;
    END;
    $$ LANGUAGE plpgsql;
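
Conceptually, the function behaves like this Python sketch (with a toy stopword list and a deliberately naive suffix-stripping stemmer standing in for Postgres's real dictionaries):

```python
import re
from collections import Counter

STOP_WORDS = {"a", "the", "is", "and", "was"}  # toy list; Postgres has a full one

def toy_stem(word):
    # Crude suffix stripping; Postgres's english_stem is a real Snowball stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize_and_count(text):
    # Lowercase, split into words, drop stop words, stem, then count.
    words = re.findall(r"[a-z]+", text.lower())
    terms = [toy_stem(w) for w in words if w not in STOP_WORDS]
    return dict(Counter(terms))
```

For example, "The canceling was canceled" collapses to a single stemmed term, cancel, with a count of 2.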
  

Storing Statistics for Performance

Calculating these scores on the fly for every document during every search would be incredibly slow. To optimize this, I created two key tables:

  • verse_stats: This table stores the pre-calculated term frequencies for each document. It holds the document's ID, its total length, and a JSONB object mapping each term to its count.
  • term_stats: This table stores global statistics about each term, such as how many documents it appears in. This is crucial for calculating the IDF score efficiently.

I also created a materialized view, global_stats, to keep track of the total number of documents and the average document length across the entire corpus, which is needed for the BM25 calculation.
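
In Python terms, the three structures hold roughly the following (the shapes mirror the tables described above, though the exact column layout is my assumption):

```python
# verse_stats: one entry per document, with its length and a
# term -> count mapping (stored as JSONB in Postgres).
verse_stats = {
    1: {"length": 4, "terms": {"jesus": 1, "wept": 1, "lazarus": 2}},
    2: {"length": 3, "terms": {"jesus": 1, "said": 2}},
}

# term_stats: for each term, how many documents contain it.
# This is what makes the IDF calculation cheap at query time.
term_stats = {}
for doc in verse_stats.values():
    for term in doc["terms"]:
        term_stats[term] = term_stats.get(term, 0) + 1

# global_stats: corpus-wide aggregates (the materialized view).
global_stats = {
    "total_docs": len(verse_stats),
    "avg_length": sum(d["length"] for d in verse_stats.values()) / len(verse_stats),
}
```

The key point is that everything BM25 needs at query time (term counts, document counts, average length) is precomputed, so a search only reads these structures.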

The BM25 Scoring Function

With the data structures in place, I implemented the core BM25 logic. The calculate_idf function uses the BM25 formula to determine a term's weight. It's slightly more robust than the standard TF-IDF version to ensure scores are non-negative.

    CREATE OR REPLACE FUNCTION calculate_idf(term_doc_count INTEGER, total_docs INTEGER)
    RETURNS FLOAT AS $$
    BEGIN
      -- Modified BM25 IDF formula to ensure non-negative scores
      RETURN ln(1 + (total_docs - term_doc_count + 0.5) /
                    (term_doc_count + 0.5));
    END;
    $$ LANGUAGE plpgsql;
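
A quick Python translation makes the non-negativity property easy to verify: the classic BM25 IDF, ln((N - n + 0.5) / (n + 0.5)), goes negative once a term appears in more than half the documents, while the ln(1 + ...) variant used above never does:

```python
import math

def classic_idf(doc_count, total_docs):
    # Standard BM25 IDF: negative for very common terms.
    return math.log((total_docs - doc_count + 0.5) / (doc_count + 0.5))

def calculate_idf(doc_count, total_docs):
    # Mirrors the SQL function: the extra "1 +" keeps scores positive.
    return math.log(1 + (total_docs - doc_count + 0.5) / (doc_count + 0.5))
```

For a term in 950 of 1000 documents, the classic formula yields about -2.94, while the modified one stays slightly above zero; rare terms still score much higher than common ones.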
  

The bm25_term_score function brings it all together, combining the term frequency, document length, and IDF score to produce the final score for a single term in a single document.
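
The post doesn't list bm25_term_score itself, but from the way it's invoked in search_verses below, its body presumably looks something like this (a Python sketch of the standard BM25 per-term formula, not the author's exact code):

```python
def bm25_term_score(term_freq, doc_length, idf, avg_length, k1=1.2, b=0.75):
    # Length normalization: documents longer than average are penalized
    # in proportion to b.
    norm = 1 - b + b * (doc_length / avg_length)
    # Saturating term frequency, scaled by the term's IDF weight.
    return idf * (term_freq * (k1 + 1)) / (term_freq + k1 * norm)
```

A term that doesn't appear contributes zero, and each additional occurrence adds progressively less, which is the saturation behavior described earlier.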

Putting it all Together

The final piece is the search_verses function, which orchestrates the entire process:

  • It takes a user's query text and tokenizes it with tokenize_and_count.
  • To handle typos, it uses the pg_trgm extension to replace each (potentially misspelled) query term with the most similar term found in term_stats.
  • It then joins the query terms against the verse_stats and term_stats tables, calculates the bm25_term_score for each matching term in each document, sums them up to get a final relevance score, and returns the top results.

    CREATE OR REPLACE FUNCTION search_verses(
        query_text TEXT,
        k1 FLOAT DEFAULT 1.2,
        b FLOAT DEFAULT 0.75,
        limit_val INTEGER DEFAULT 10,
        similarity_threshold FLOAT DEFAULT 0.3
    ) RETURNS TABLE (
        verse_id BIGINT,
        score FLOAT,
        content TEXT
    ) AS $$
    DECLARE
        v_total_docs INTEGER;
        v_avg_length FLOAT;
    BEGIN
        SELECT gs.total_docs, gs.avg_length
        INTO v_total_docs, v_avg_length
        FROM global_stats gs;
        
        RETURN QUERY
        WITH raw_query_terms AS (
            SELECT term
            FROM tokenize_and_count(query_text)
        ),
        query_terms AS (
            SELECT DISTINCT corrected_term AS term
            FROM raw_query_terms rqt
            CROSS JOIN LATERAL (
                SELECT ts.term AS corrected_term
                FROM term_stats ts
                WHERE similarity(rqt.term, ts.term) >= similarity_threshold
                ORDER BY similarity(rqt.term, ts.term) DESC
                LIMIT 1
            ) AS best_match
        ),
        term_scores AS (
            SELECT
                d.verse_id,
                bm25_term_score(
                    (d.terms->>t.term)::INTEGER,
                    d.length,
                    calculate_idf(ts.doc_count, v_total_docs),
                    v_avg_length,
                    k1,
                    b
                ) AS term_score,
                v.text AS doc_text
            FROM
                verse_stats d
            JOIN
                verses v ON v.id = d.verse_id
            JOIN
                query_terms t ON d.terms ? t.term
            JOIN
                term_stats ts ON ts.term = t.term
        )
        SELECT
            ts.verse_id,
            SUM(ts.term_score) AS score,
            ts.doc_text AS content
        FROM
            term_scores ts
        GROUP BY
            ts.verse_id,
            ts.doc_text
        ORDER BY
            score DESC
        LIMIT limit_val;
    END;
    $$ LANGUAGE plpgsql;
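
As a sanity check of the whole pipeline, the scoring path can be replayed on a toy corpus in Python (the data and numbers here are purely illustrative, and the trigram fuzzy-matching step is elided):

```python
import math

# Toy index mirroring verse_stats / term_stats / global_stats.
verse_stats = {
    1: {"length": 2, "terms": {"jesus": 1, "wept": 1}},
    2: {"length": 3, "terms": {"jesus": 1, "said": 2}},
}
term_stats = {"jesus": 2, "wept": 1, "said": 1}
total_docs, avg_length = 2, 2.5

def idf(doc_count):
    return math.log(1 + (total_docs - doc_count + 0.5) / (doc_count + 0.5))

def term_score(tf, doc_len, term_idf, k1=1.2, b=0.75):
    norm = 1 - b + b * (doc_len / avg_length)
    return term_idf * (tf * (k1 + 1)) / (tf + k1 * norm)

def search(query_terms, limit=10):
    # Sum per-term BM25 scores per document, highest total first.
    scores = {}
    for term in query_terms:
        if term not in term_stats:
            continue  # (the real version falls back to trigram matching)
        for vid, doc in verse_stats.items():
            if term in doc["terms"]:
                scores[vid] = scores.get(vid, 0.0) + term_score(
                    doc["terms"][term], doc["length"], idf(term_stats[term])
                )
    return sorted(scores.items(), key=lambda kv: -kv[1])[:limit]
```

Searching for both "jesus" and "wept" ranks document 1 first: it matches both terms, and "wept" is the rarer (higher-IDF) of the two.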
  

Highlights

With everything fully operational, I summarized what I'd built and benchmarked it against something much simpler to get a feel for how this relevance scoring works in practice.

  • BM25 Ranking: Industry-standard relevance scoring with proper IDF calculation and document length normalization
  • PostgreSQL-Native: Zero external dependencies, using built-in text search and JSONB
  • Fuzzy Matching: Trigram-based typo correction with configurable similarity threshold
  • Text Analysis: Complete pipeline with tokenization, stemming and stopword removal
  • Incremental Updates: Efficient document re-indexing with proper term statistics maintenance
  • Optimized Storage: JSONB term frequencies with GIN indexing for fast lookups
  • Materialized Statistics: Pre-computed global metrics for efficient scoring

To benchmark this I wrote a simple evaluation script using BEIR, and it showed a clear improvement in relevance ranking over a naive ILIKE search.

NDCG@10 (relevance)

  • BM25: 0.2940
  • ILIKE: 0.1105
  • BM25 leads by +166.1%

Precision@10 (accuracy of the top 10 results)

  • BM25: 0.2806
  • ILIKE: 0.1747
  • BM25 leads by +60.6%

Recall (found relevant documents)

  • BM25: 66.7%
  • ILIKE: 28.1%
  • BM25 leads by +38.6pp

The result is a fast, efficient, and typo-tolerant search engine built without any external dependencies. It was a fascinating journey into the mechanics of search, and it's incredibly satisfying to see it working so well. Hopefully, this sheds some light on how modern keyword search works and inspires you to see what you can build with the tools you already have.

