Full text search in milliseconds with PostgreSQL

At Lateral we use PostgreSQL to store documents for our visualiser. Each document consists of a text column and a JSON column containing metadata such as a title, date and URL. We wanted to create a fast search experience for the visualiser that lets you search the full text of documents as well as their titles, to quickly find a document to get Lateral recommendations for.

To achieve this we assessed our options: we could use an open source solution such as Apache Solr or Elasticsearch, a managed solution such as Elastic or Algolia, or PostgreSQL’s built-in full text search. We decided to use PostgreSQL for the following reasons:

  • There’s no need to install extra software or libraries
  • We can use the same database interface we use in our app to perform search queries (in our case ActiveRecord)
  • We don’t need to provision any additional servers
  • There’s no extra cost to us
  • The data is stored in one place that we control
  • There’s no requirement to keep our database rows in sync with another source

The main arguments against using PostgreSQL search are accuracy and speed at scale. PostgreSQL’s ranking accuracy is not the best: in the examples that follow I explain how to weight titles over text, but we still found a few cases where typing the exact title of a document wouldn’t return that document as the first result. This isn’t ideal, and please let us know if you have any feedback about it. We’ve also had issues in the past with the speed of PostgreSQL searches on a table of 23 million records; in our experience, speed starts to deteriorate at around 1 to 2 million rows.

So even though the dedicated solutions could probably work a bit better, we decided that for this use case they weren’t necessary.

How to do it

To get optimum performance with PostgreSQL full text search you need to create a column to store the tsvector values and put an index on it. Then you need to populate that column with the tsv values and create a trigger to keep it up to date on INSERT and UPDATE. After that you’ll be able to query the table quickly.
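
For reference, the examples below assume a documents table roughly like this (a minimal sketch; your actual column names and types may differ):

CREATE TABLE documents (
  id serial PRIMARY KEY,
  text text,   -- the full document text
  meta jsonb   -- metadata such as the title, date and URL
);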

To create the column for the tsvector values run:

ALTER TABLE documents ADD COLUMN tsv tsvector;

And then to create the index on this column run:

CREATE INDEX tsv_idx ON documents USING gin(tsv);

Now that you have somewhere to store the tsvectors, you can populate the column. In this example, I am assuming a table with the structure I mentioned above, with a text column and a meta column that contains a JSON object with a title. To populate the tsv column with tsvectors, run the following:

UPDATE documents SET tsv = setweight(to_tsvector(coalesce(meta->>'title','')), 'A') || setweight(to_tsvector(coalesce(text,'')), 'D');

This query gets the title from the meta JSON column and gives it the highest weight of A. Then it gets the text value and weights it D. Finally it combines the two tsvectors and writes them to the tsv column we just created.
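
If you want to see what the weighted result looks like before writing anything to the table, you can run the same expression on a couple of sample strings (an illustrative snippet only, the values are made up):

SELECT setweight(to_tsvector('Full text search in milliseconds'), 'A') ||
       setweight(to_tsvector('PostgreSQL stores the documents for our visualiser'), 'D');

Lexemes from the first argument carry the A label (e.g. 'search':3A), while D is the default weight and isn’t shown in the output.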

At this point, if your data was static you could stop and start querying. But we want all future rows and updates to have an up-to-date tsv column, so we need to create a trigger to do this. Firstly, you’ll need to create a trigger function that computes the tsv column from the row being inserted or updated:

CREATE FUNCTION documents_search_trigger() RETURNS trigger AS $$
begin
  new.tsv :=
    setweight(to_tsvector(coalesce(new.meta->>'title','')), 'A') ||
    setweight(to_tsvector(coalesce(new.text,'')), 'D');
  return new;
end
$$ LANGUAGE plpgsql;

This basically does the same as the update query above but as a function. Now you need to create a trigger to execute that function when any row is updated or inserted:

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON documents FOR EACH ROW EXECUTE PROCEDURE documents_search_trigger();
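
To check the trigger is doing its job, you can insert a row and look at the generated tsv value (a quick sanity check, using the sample table sketched above):

INSERT INTO documents (text, meta)
VALUES ('Some example body text', '{"title": "An example title"}');

SELECT tsv FROM documents ORDER BY id DESC LIMIT 1;

The lexemes from the title should show up with the A weight attached.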

Now you can search the database. Replace both occurrences of YOUR QUERY with the text you want to search for and run the following query:

SELECT id, meta->>'title' as title, meta FROM (
  SELECT id, meta, tsv
  FROM documents, plainto_tsquery('YOUR QUERY') AS q
  WHERE (tsv @@ q)
) AS t1 ORDER BY ts_rank_cd(t1.tsv, plainto_tsquery('YOUR QUERY')) DESC LIMIT 5;

This statement is really two queries: the inner one performs the search against the indexed tsv column, and the outer one ranks and sorts those matches with ts_rank_cd and returns the top 5.

This query takes about 50ms! We experimented with returning the document text with the results and found that returning the full text of each document added about 350ms to the query time, which seems to be network overhead rather than a slower query. If we returned only the first 200 characters of the text, it added around 100ms.
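
If you do want a short snippet of the text with each result, one option is to truncate it in the outer query so the payload stays small (a sketch of one way to do it, not necessarily how we did it):

SELECT id, meta->>'title' AS title, left(text, 200) AS snippet FROM (
  SELECT id, meta, text, tsv
  FROM documents, plainto_tsquery('YOUR QUERY') AS q
  WHERE (tsv @@ q)
) AS t1 ORDER BY ts_rank_cd(t1.tsv, plainto_tsquery('YOUR QUERY')) DESC LIMIT 5;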

There is some added overhead because we’re using Rails to query PostgreSQL, so it’s not as fast as something like Algolia but it’s good enough for our use case.

See the search in action in one of our recommender demos.

Comments


  1. Hi Max,

    Nice article.

    I tried to do the same thing with:

    \d+ products;

                              Table "public.products"
     Column  |   Type   |                       Modifiers                       | Storage  | Stats target | Description
    ---------+----------+-------------------------------------------------------+----------+--------------+-------------
     id      | integer  | not null default nextval('product_id_seq'::regclass) | plain    |              |
     product | jsonb    |                                                       | extended |              |
     tsv     | tsvector |                                                       | extended |              |
    Indexes:
        "name_idx" btree ((product ->> 'name'::text))
        "product_index" gin (product jsonb_path_ops)
        "tsv_idx" gin (tsv)

    I have some dummy entries like this:

       id    |             product              |    tsv
    ---------+----------------------------------+------------
     5504620 | {"name": "Andrew", "price": 225} | 'andrew':1
     5504622 | {"name": "Andrew", "price": 225} | 'andrew':1
     5504624 | {"name": "Andrew", "price": 225} | 'andrew':1
     5504626 | {"name": "Andrew", "price": 225} | 'andrew':1
     5504628 | {"name": "Andrew", "price": 225} | 'andrew':1

    So I execute:

    explain select * from products where tsv @@ to_tsquery('Andrew');

                                      QUERY PLAN
    --------------------------------------------------------------------------------
     Bitmap Heap Scan on products  (cost=21047.77..189149.65 rows=2283325 width=74)
       Recheck Cond: (tsv @@ to_tsquery('Andrew'::text))
       ->  Bitmap Index Scan on tsv_idx  (cost=0.00..20476.94 rows=2283325 width=0)
             Index Cond: (tsv @@ to_tsquery('Andrew'::text))

    This tells me that it uses the index.

    But when I execute the query, it takes about 13 seconds for 4 million rows.

    True, the entries are pretty much the same (lots of Andrews)

    I’d appreciate it if you have any suggestions.

    Thanks,
    Florin

    1. Hey Florin,

      Hmm. I think there are a few things you could do to speed this up. Firstly, you can select only the 20 best results, which should speed things up a bit. Secondly, you could fetch only IDs when running the search query, and then join those IDs in an outer query so you only pull the full data for the rows you want.

      What happens when you run something like this:


      SELECT p.id, p.product FROM (
        SELECT id FROM products, to_tsquery('Andrew') AS q
        WHERE tsv @@ q ORDER BY ts_rank_cd(tsv, q) DESC LIMIT 20
      ) AS ids INNER JOIN products p ON p.id = ids.id;

      So what that is doing is getting the IDs for the 20 most relevant results – this is a bit silly in this example as all the results are just ‘Andrew’ so there aren’t going to be any more or less relevant ones.

      Next, using those 20 IDs, the outer query fetches the full data (here the product column) for each result row.

      There could be performance issues because you’ve got the same thing in every row. Maybe try again with some more realistic data?

  2. thank you for sharing this, it was very helpful!

    i’m wondering, though, whether it’s actually necessary to add a column to store the vectorized representation – in theory the index should be enough?

    wouldn’t it be possible to update the index with the same function and do away with the column?

    1. Hi Tom,

      Yes you are correct. This article was posted on Hacker News where there was some discussion about this. Something like this:


      CREATE INDEX tsv_idx ON documents USING gin(to_tsvector('english', text));

      Would be fine I think.
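
      Just bear in mind that with an expression index like that, the query has to repeat the same expression for the planner to use the index, so the search would look something like:

      SELECT id, meta->>'title' AS title
      FROM documents
      WHERE to_tsvector('english', text) @@ plainto_tsquery('english', 'YOUR QUERY');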

  3. Which version of PostgreSQL do you use? I’m not sure about this, but I think ranking speed was greatly improved in 9.5 because now the index contains information to rank results without hitting the row stored in the heap.

    1. Hey Nicolas, we’re using 9.4. I’ve heard that 9.5 includes improvements to the ranking speed but didn’t know why exactly. Thanks for letting us know! We’ll have to run some experiments to try it out. The ranking is the bottleneck right now, so that could greatly increase performance for tables with lots of rows.

      1. Yes, you can probably expect a massive performance increase with PostgreSQL 9.5. In 9.4, when you use ts_rank, PostgreSQL has to read each matched row, which are stored in the heap, which involves a lot of random accesses. In 9.5, all the data necessary to ts_rank are already stored in the index, and thus you avoid all the expensive random accesses. This was contributed by Oleg Bartunov and other developers that brought FTS to PostgreSQL.

        By the way, if you can report your results here with 9.5, it would be great! 🙂

        1. Do you have to write the query in a specific manner so that the ts_rank value from the index is used? I did a quick test with 9.5 but couldn’t notice any difference in performance.

          1. The documentation for 9.5 still says: “Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow.”
             http://www.postgresql.org/d

             So I guess it’s not included in 9.5.

  4. Hi @maxnovakovic, we are in a similar boat. You mentioned “Performance degrades when the rows reach to 1-2 million usually for text search.”

    Would you please tell me what size the DB box is, and how much memory it has?

    1. The database has 4 vCPUs and 15 GB memory. However, I think that when I wrote this article the database was running on a different machine and I can’t remember what size that was unfortunately.

  5. Hi @maxnovakovic. Great article. I have a question: how long is the text stored in the “text” column? I have the same solution, but the text that needs to be indexed and kept in tsv is too long. I get a NOTICE: word is too long to be indexed; DETAIL: Words longer than 2047 characters are ignored. As a result, the tsv stays empty in that situation. (I need to index office documents stored in PostgreSQL.)

  6. Can anyone please describe the structure of the “documents” table and show what a row of it would look like? I am a beginner and I would like to understand how the meta column should be structured in order to get this working on my system.
    Thank you.

  7. Hi @maxnovakovic
    Nice and well explained example. I was in a situation where I needed FTS, or more precisely ranked tokenized search, and implemented something very similar to what you described. The benchmarks, though, were not very convincing: I seeded a DB with 20k rows and performed queries on it. A single request went fine, and so did 2, 3, etc. in parallel, but as the number of parallel requests grew I noticed a lot of added latency; this query seems to be very intensive on memory (rightfully so). I tested it to the point that 40 parallel requests would take a total of 8 seconds to resolve. As I required a lot of future scalability I went with an additional tool for indexing (ES). What are your thoughts on all this?
