Building BM25 Keyword Search In Postgres

Published August 16, 2025 by Toran Billups

It turns out you can build surprisingly powerful keyword search, using the BM25 ranking algorithm, with just a handful of PL/pgSQL functions. What follows is my journey: what I've learned, how this search algorithm works in simple terms, and how I implemented it from scratch.

The Foundation: From TF-IDF to BM25

Before diving into BM25, it's helpful to understand its predecessor, TF-IDF (Term Frequency-Inverse Document Frequency). It's a simple but powerful idea that helps determine a word's importance to a document within a collection of documents.

  • Term Frequency (TF): This is just what it sounds like. How often does a term appear in a single document? The intuition is that if the word "data" appears 5 times in a 100-word article, it's probably more relevant to that article than a word that appears only once.
  • Inverse Document Frequency (IDF): This measures how common or rare a word is across *all* documents. Common words like "the" or "a" will appear everywhere and have a low IDF score, effectively penalizing them. Rare words will appear in fewer documents and thus have a higher IDF score, signaling that they are more significant.

By multiplying these two scores, you get the TF-IDF score, which gives you a weighted value for how important a word is to a specific document. This was the starting point for my exploration.
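
To make this concrete, here's a rough worked example in SQL using the numbers from the intuition above (the word "data" appearing 5 times in a 100-word article). The corpus figures of 1,000 documents, 50 of which contain the word, are made up purely for illustration:

    -- Classic TF-IDF for a single term in a single document (illustrative numbers only)
    SELECT
        5 / 100.0                     AS tf,      -- 5 occurrences in a 100-word document
        ln(1000 / 50.0)               AS idf,     -- 1,000 documents, 50 contain the term
        (5 / 100.0) * ln(1000 / 50.0) AS tf_idf;  -- roughly 0.05 * 3.0 = 0.15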

BM25 is essentially a more refined version of this. It builds on the core concepts of TF-IDF but adds a couple of clever improvements:

  • Term Frequency Saturation: BM25 recognizes that a word's relevance doesn't increase linearly. The difference between a word appearing zero times and one time is huge. The difference between it appearing 20 times and 21 times is negligible. It uses a parameter k1 to control how quickly the term frequency score "saturates."
  • Document Length Normalization: Longer documents naturally have an advantage because they have more opportunities for a search term to appear. BM25 penalizes longer documents to level the playing field, ensuring that relevance isn't just a side effect of verbosity. A second parameter, b, controls how strong this normalization is; both parameters appear in the formula sketched below.
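
Putting those two ideas together, the standard per-term BM25 score (with the common defaults of k1 ≈ 1.2 and b ≈ 0.75) works out to:

    score(term, doc) = IDF(term) * tf * (k1 + 1)
                       / (tf + k1 * (1 - b + b * doc_length / avg_doc_length))

As tf grows, the numerator and denominator grow together, so the score flattens out (that's the saturation). When doc_length is larger than the corpus average, the denominator grows and the score shrinks (that's the length normalization).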

The Implementation: A PostgreSQL Deep Dive

The beauty of this approach is that the entire search algorithm lives within PostgreSQL and can be called from any application.

Here's a breakdown of the core components:

1. Text Processing and Tokenization

The first step in any search system is to break down raw text into meaningful terms, or tokens. I created a tokenize_and_count function to handle this. It takes a block of text and:

  • converts it to lowercase,
  • removes common stop words (like 'a', 'the', 'is') using PostgreSQL's built-in English stop words list,
  • uses stemming to reduce words to their root form (e.g., "canceling" and "canceled" both become "cancel"),
  • filters out blank tokens, and
  • finally returns a table of unique terms and their frequencies in the text.

    CREATE OR REPLACE FUNCTION tokenize_and_count(input_text TEXT)
    RETURNS TABLE (term TEXT, count INTEGER) AS $$
    BEGIN
        -- Get stemmed tokens with stop words removed
        RETURN QUERY
        WITH tokens AS (
            SELECT word
            FROM ts_parse('default', lower(input_text)) AS t(tokid, word)
            WHERE tokid != 12  -- Filter out blank tokens
        ),
        processed_tokens AS (
            SELECT
                CASE
                    WHEN ts_lexize('public.simple_dict', word) = '{}'
                    THEN NULL  -- It's a stop word
                    ELSE COALESCE(
                        (ts_lexize('public.english_stem', word))[1],
                        word
                    )
                END AS processed_word
            FROM tokens
        )
        SELECT
            processed_word as term,
            COUNT(*)::INTEGER
        FROM processed_tokens
        WHERE processed_word IS NOT NULL
        GROUP BY processed_word;
    END;
    $$ LANGUAGE plpgsql;
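
With the tokenizer in place, you can sanity-check it directly. Assuming the public.simple_dict and public.english_stem dictionaries referenced above exist in your database, a call looks something like this:

    -- Illustrative usage; the exact rows depend on how the dictionaries are configured
    SELECT * FROM tokenize_and_count('Canceling the flight after the flight was canceled');
    -- "canceling" and "canceled" both stem to "cancel" and come back with a count of 2,
    -- "flight" comes back with a count of 2, and stop words like "the" are dropped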
  

2. Storing Statistics for Performance

Calculating these scores on the fly for every document during every search would be incredibly slow. To optimize this, I created two key tables:

  • verse_stats: This table stores the pre-calculated term frequencies for each document. It holds the document's ID, its total length, and a JSONB object mapping each term to its count.
  • term_stats: This table stores global statistics about each term, such as how many documents it appears in (doc_count). This is crucial for calculating the IDF score efficiently.

I also created a materialized view, global_stats, to keep track of the total number of documents and the average document length across the entire corpus, both of which are needed for the BM25 calculation. It's important to note that this materialized view needs to be refreshed whenever documents are added or updated to ensure accurate statistics. The migration includes a helper function, update_modified_verses, that handles this refresh automatically.
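
The exact schemas from the migration aren't reproduced here, but a minimal sketch of what these structures might look like (the column names are my own, chosen for illustration) is:

    -- Hypothetical schemas; column names are illustrative, not copied from the migration
    CREATE TABLE verse_stats (
        verse_id    INTEGER PRIMARY KEY,
        doc_length  INTEGER NOT NULL,   -- total number of tokens in the document
        term_counts JSONB   NOT NULL    -- e.g. {"cancel": 2, "flight": 1}
    );

    CREATE TABLE term_stats (
        term      TEXT    PRIMARY KEY,
        doc_count INTEGER NOT NULL      -- number of documents containing the term
    );

    -- Corpus-wide numbers needed by BM25; refresh after documents change
    CREATE MATERIALIZED VIEW global_stats AS
    SELECT
        COUNT(*)::INTEGER      AS total_docs,
        AVG(doc_length)::FLOAT AS avg_doc_length
    FROM verse_stats;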

3. The BM25 Scoring Functions

With the data structures in place, I implemented the core BM25 logic. The calculate_idf function uses the BM25 formula to determine a term's weight. It's slightly more complex than the standard TF-IDF version but ensures scores are non-negative.

    CREATE OR REPLACE FUNCTION calculate_idf(term_doc_count INTEGER, total_docs INTEGER)
    RETURNS FLOAT AS $$
    BEGIN
      -- Modified BM25 IDF formula to ensure non-negative scores
      RETURN ln(1 + (total_docs - term_doc_count + 0.5) /
                    (term_doc_count + 0.5));
    END;
    $$ LANGUAGE plpgsql;
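
The effect is easy to see by comparing a common term with a rare one in a hypothetical corpus of 1,000 documents:

    -- A term found in 900 of 1,000 documents scores near zero,
    -- while a term found in only 5 documents scores much higher
    SELECT
        calculate_idf(900, 1000) AS common_term,  -- roughly 0.11
        calculate_idf(5, 1000)   AS rare_term;    -- roughly 5.2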
  

The bm25_term_score function brings it all together, combining the term frequency, document length, and IDF score to produce the final score for a single term in a single document.
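
That function isn't reproduced here, but a minimal sketch of what it might look like, assuming the conventional defaults of k1 = 1.2 and b = 0.75 and the per-term formula shown earlier, is:

    -- Hypothetical sketch; parameter names and defaults are illustrative
    CREATE OR REPLACE FUNCTION bm25_term_score(
        tf             INTEGER,  -- how often the term appears in the document
        doc_length     INTEGER,  -- total tokens in the document
        avg_doc_length FLOAT,    -- average document length across the corpus
        idf            FLOAT,    -- output of calculate_idf for this term
        k1             FLOAT DEFAULT 1.2,
        b              FLOAT DEFAULT 0.75
    )
    RETURNS FLOAT AS $$
    BEGIN
        RETURN idf * (tf * (k1 + 1)) /
               (tf + k1 * (1 - b + b * (doc_length / avg_doc_length)));
    END;
    $$ LANGUAGE plpgsql;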

4. Putting it all Together: The Search Function

The final piece is the search_verses function. This function orchestrates the entire process. It takes a user's query text and tokenizes it; to handle typos, it uses the pg_trgm extension to find terms in term_stats that are similar to the (potentially misspelled) query terms. It then joins the query terms against the verse_stats and term_stats tables, calculates the bm25_term_score for each matching term in each document, sums those scores into a final relevance score, and returns the top results.
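
The full function isn't shown here, but the core of the scoring query might look roughly like the sketch below, reusing the hypothetical column names from the schema sketch above (the 0.3 similarity threshold and the LIMIT are illustrative):

    -- Simplified sketch of the scoring logic inside a function like search_verses
    WITH query_terms AS (
        SELECT term FROM tokenize_and_count('kingdom of heavn')  -- note the typo
    ),
    matched_terms AS (
        -- Typo tolerance: map each query term to its closest known term via pg_trgm
        SELECT DISTINCT ON (q.term) ts.term, ts.doc_count
        FROM query_terms q
        JOIN term_stats ts ON similarity(q.term, ts.term) > 0.3
        ORDER BY q.term, similarity(q.term, ts.term) DESC
    )
    SELECT
        vs.verse_id,
        SUM(
            bm25_term_score(
                (vs.term_counts ->> mt.term)::INTEGER,  -- term frequency in this document
                vs.doc_length,
                gs.avg_doc_length,
                calculate_idf(mt.doc_count, gs.total_docs)
            )
        ) AS score
    FROM matched_terms mt
    JOIN verse_stats vs ON vs.term_counts ? mt.term     -- only documents containing the term
    CROSS JOIN global_stats gs
    GROUP BY vs.verse_id
    ORDER BY score DESC
    LIMIT 10;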

One important setup detail: the migration needs to enable the pg_trgm extension for the typo tolerance to work. This should be added at the beginning of your migration:

    CREATE EXTENSION IF NOT EXISTS pg_trgm;
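
You can sanity-check the typo tolerance on its own; similarity() returns a value between 0 and 1, and 0.3 happens to be pg_trgm's default threshold for its % operator:

    -- Trigram similarity between a misspelling and the intended term
    SELECT similarity('forgivness', 'forgiveness');  -- comfortably above the 0.3 cutoff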
  

The result is a fast, efficient, and typo-tolerant search engine built without any external dependencies. It was a fascinating journey into the mechanics of search, and it's incredibly satisfying to see it working with just a bit of clever SQL. Hopefully, this sheds some light on how modern keyword search works and inspires you to see what you can build with the tools you already have.

