Full text search in milliseconds with PostgreSQL

At Lateral we use PostgreSQL to store documents for our visualiser. Each document consists of a text column and a JSON column containing metadata such as a title, date and URL. We wanted to create a fast search experience for the visualiser that lets you search the full text of documents as well as their titles, to quickly find a document to get Lateral recommendations for.

To achieve this we assessed our options: we could use an open source solution such as Apache Solr or Elasticsearch, a managed solution such as Elastic or Algolia, or PostgreSQL’s built-in full text search. We decided to use PostgreSQL for the following reasons:

  • There’s no need to install extra software or libraries
  • We can use the same database interface we use in our app to perform search queries (in our case ActiveRecord)
  • We don’t need to provision any additional servers
  • There’s no extra cost to us
  • The data is stored in one place that we control
  • There’s no requirement to keep our database rows in sync with another source

The main arguments against using PostgreSQL search are accuracy and speed at scale. PostgreSQL’s ranking accuracy is not the best: in the examples that follow I explain how to weight titles over text, but we still found a few cases where typing the exact title of a document wouldn’t return that document as the first result. This isn’t ideal, and please let us know if you have any feedback about it. We’ve also had issues in the past with the speed of PostgreSQL searches on a table of 23 million records; in our experience, speed starts to deteriorate at around 1 to 2 million rows.

So even though the dedicated solutions could probably work a bit better, we decided that for this use case they weren’t necessary.

How to do it

To get optimum performance with PostgreSQL full text search you need to create a column to store the tsvector values and put an index on it. Then you need to populate that column with the tsv values and create a trigger to keep it up to date on INSERT and UPDATE. After that you’ll be able to query the table quickly.
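
For reference, the examples below assume a documents table roughly like this (a minimal sketch; your actual column names and types may differ):

CREATE TABLE documents (
  id serial PRIMARY KEY,
  text text,   -- the full document text
  meta jsonb   -- metadata such as the title, date and URL
);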

To create the column for the tsvector values run:

ALTER TABLE documents ADD COLUMN tsv tsvector;

And then to create the index on this column run:

CREATE INDEX tsv_idx ON documents USING gin(tsv);

Now that you have somewhere to store the tsvectors, you can populate the column. In this example, I am assuming a table with the structure I mentioned above, with a text column and a meta column that contains a JSON object with a title. To populate the tsv column with tsvectors, run the following:

UPDATE documents SET tsv = setweight(to_tsvector(coalesce(meta->>'title','')), 'A') || setweight(to_tsvector(coalesce(text,'')), 'D');

This query gets the title from the meta JSON column and gives it the highest weight of A. Then it gets the text value and weights it D. Finally it combines the two tsvectors and writes them to the tsv column we just created.
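
If you want to see what the weighted result looks like before writing anything to the table, you can run the same expression on a couple of sample strings (an illustrative snippet only, the values are made up):

SELECT setweight(to_tsvector('Full text search in milliseconds'), 'A') ||
       setweight(to_tsvector('PostgreSQL stores the documents for our visualiser'), 'D');

Lexemes from the first argument carry the A label (e.g. 'search':3A), while D is the default weight and isn’t shown in the output.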

At this point, if your data was static you could stop and start querying. But we want all future rows and updates to have an up-to-date tsv column, so we need to create a trigger to do this. Firstly, you’ll need to create a trigger function that computes the tsv column from the row being inserted or updated:

CREATE FUNCTION documents_search_trigger() RETURNS trigger AS $$
begin
  new.tsv :=
    setweight(to_tsvector(coalesce(new.meta->>'title','')), 'A') ||
    setweight(to_tsvector(coalesce(new.text,'')), 'D');
  return new;
end
$$ LANGUAGE plpgsql;

This basically does the same as the update query above but as a function. Now you need to create a trigger to execute that function when any row is updated or inserted:

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON documents FOR EACH ROW EXECUTE PROCEDURE documents_search_trigger();
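
To check the trigger is doing its job, you can insert a row and look at the generated tsv value (a quick sanity check, using the sample table sketched above):

INSERT INTO documents (text, meta)
VALUES ('Some example body text', '{"title": "An example title"}');

SELECT tsv FROM documents ORDER BY id DESC LIMIT 1;

The lexemes from the title should show up with the A weight attached.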

Now you can search the database. Replace both occurrences of YOUR QUERY with the text you want to search for and run the following query:

SELECT id, meta->>'title' as title, meta FROM (
  SELECT id, meta, tsv
  FROM documents, plainto_tsquery('YOUR QUERY') AS q
  WHERE (tsv @@ q)
) AS t1 ORDER BY ts_rank_cd(t1.tsv, plainto_tsquery('YOUR QUERY')) DESC LIMIT 5;

This statement is really two queries: the inner one performs the search against the indexed tsv column, and the outer one ranks and sorts those matches with ts_rank_cd and returns the top 5.

This query takes about 50ms! We experimented with returning the document text with the results and found that returning the full text of each document added about 350ms to the query time, which seems to be network overhead rather than a slower query. If we returned only the first 200 characters of the text, it added around 100ms.
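
If you do want a short snippet of the text with each result, one option is to truncate it in the outer query so the payload stays small (a sketch of one way to do it, not necessarily how we did it):

SELECT id, meta->>'title' AS title, left(text, 200) AS snippet FROM (
  SELECT id, meta, text, tsv
  FROM documents, plainto_tsquery('YOUR QUERY') AS q
  WHERE (tsv @@ q)
) AS t1 ORDER BY ts_rank_cd(t1.tsv, plainto_tsquery('YOUR QUERY')) DESC LIMIT 5;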

There is some added overhead because we’re using Rails to query PostgreSQL, so it’s not as fast as something like Algolia but it’s good enough for our use case.

See the search in action in one of our recommender demos.

Comments


  1. Hi Max,

    Nice article.

    I tried to do the same thing with:

    \d+ products;

                              Table "public.products"
     Column  |   Type   |                       Modifiers                       | Storage  | Stats target | Description
    ---------+----------+-------------------------------------------------------+----------+--------------+-------------
     id      | integer  | not null default nextval('product_id_seq'::regclass) | plain    |              |
     product | jsonb    |                                                       | extended |              |
     tsv     | tsvector |                                                       | extended |              |
    Indexes:
        "name_idx" btree ((product ->> 'name'::text))
        "product_index" gin (product jsonb_path_ops)
        "tsv_idx" gin (tsv)

    I have some dummy entries like this:

       id    |             product              |    tsv
    ---------+----------------------------------+------------
     5504620 | {"name": "Andrew", "price": 225} | 'andrew':1
     5504622 | {"name": "Andrew", "price": 225} | 'andrew':1
     5504624 | {"name": "Andrew", "price": 225} | 'andrew':1
     5504626 | {"name": "Andrew", "price": 225} | 'andrew':1
     5504628 | {"name": "Andrew", "price": 225} | 'andrew':1

    So I execute:

    explain select * from products where tsv @@ to_tsquery('Andrew');

                                      QUERY PLAN
    --------------------------------------------------------------------------------
     Bitmap Heap Scan on products  (cost=21047.77..189149.65 rows=2283325 width=74)
       Recheck Cond: (tsv @@ to_tsquery('Andrew'::text))
       ->  Bitmap Index Scan on tsv_idx  (cost=0.00..20476.94 rows=2283325 width=0)
             Index Cond: (tsv @@ to_tsquery('Andrew'::text))

    This tells me that it uses the index.

    But when I execute the query, it takes about 13 seconds for 4 million rows.

    True, the entries are pretty much the same (lots of Andrews)

    I’d appreciate it if you have any suggestions.

    Thanks,
    Florin

    1. Hey Florin,

      Hmm. I think there are a few things you could do to speed this up. Firstly, you can select only the 20 best results, which should speed things up a bit. Secondly, you could fetch only IDs when running the search query, and then join those IDs in an outer query so you only pull the full data for the rows you want.

      What happens when you run something like this:


      SELECT p.id, p.product FROM (
        SELECT id FROM products, to_tsquery('Andrew') AS q
        WHERE tsv @@ q ORDER BY ts_rank_cd(tsv, q) DESC LIMIT 20
      ) AS ids INNER JOIN products p ON p.id = ids.id;

      So what that is doing is getting the IDs for the 20 most relevant results – this is a bit silly in this example as all the results are just ‘Andrew’ so there aren’t going to be any more or less relevant ones.

      Next, using those 20 IDs, the outer query fetches the full data (here the product column) for each result row.

      There could be performance issues because you’ve got the same thing in every row. Maybe try again with some more realistic data?

  2. thank you for sharing this, it was very helpful!

    i’m wondering, though, whether it’s actually necessary to add a column to store the vectorized representation – in theory the index should be enough?

    wouldn’t it be possible to update the index with the same function and do away with the column?

    1. Hi Tom,

      Yes you are correct. This article was posted on Hacker News where there was some discussion about this. Something like this:


      CREATE INDEX tsv_idx ON documents USING gin(to_tsvector('english', text));

      Would be fine I think.
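
      Just bear in mind that with an expression index like that, the query has to repeat the same expression for the planner to use the index, so the search would look something like:

      SELECT id, meta->>'title' AS title
      FROM documents
      WHERE to_tsvector('english', text) @@ plainto_tsquery('english', 'YOUR QUERY');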

  3. Which version of PostgreSQL do you use? I’m not sure about this, but I think ranking speed was greatly improved in 9.5 because now the index contains information to rank results without hitting the row stored in the heap.

    1. Hey Nicolas, we’re using 9.4. I’ve heard that 9.5 includes improvements to the ranking speed but didn’t know why exactly. Thanks for letting us know! We’ll have to run some experiments to try it out. The ranking is the bottleneck right now, so that could greatly increase performance for tables with lots of rows.

      1. Yes, you can probably expect a massive performance increase with PostgreSQL 9.5. In 9.4, when you use ts_rank, PostgreSQL has to read each matched row, which are stored in the heap, which involves a lot of random accesses. In 9.5, all the data necessary to ts_rank are already stored in the index, and thus you avoid all the expensive random accesses. This was contributed by Oleg Bartunov and other developers that brought FTS to PostgreSQL.

        By the way, if you can report your results here with 9.5, it would be great! 🙂

        1. Do you have to write the query in a specific manner so that the ts_rank value from the index is used? I did a quick test with 9.5 but couldn’t notice any difference in performance.

          1. The documentation for 9.5 still says: “Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow.”
             http://www.postgresql.org/d

             So I guess it’s not included in 9.5.

  4. Hi @maxnovakovic, we are in a similar boat. You mentioned “Performance degrades when the rows reach to 1-2 million usually for text search.”

    Would you please tell me what size the DB box is, and how much memory it has?

    1. The database has 4 vCPUs and 15 GB memory. However, I think that when I wrote this article the database was running on a different machine and I can’t remember what size that was unfortunately.

  5. Hi @maxnovakovic. Great article. I have a question: how long is the text stored in the “text” column? I have the same solution, but the text that needs to be indexed and kept in tsv is too long. I get a NOTICE: word is too long to be indexed; DETAIL: Words longer than 2047 characters are ignored. As a result, the tsv stays empty in that situation. (I need to index office documents stored in PostgreSQL.)

  6. Can anyone please describe the structure of the “documents” table and show what a row of it would look like? I am a beginner and I would like to understand how the meta column should be structured in order to get this working on my system.
    Thank you.

  7. Hi @maxnovakovic
    Nice and well explained example. I was in a situation where I needed FTS, or more precisely ranked tokenized search, and implemented something very similar to what you described. The benchmarks, though, were not very convincing: I seeded a DB with 20k rows and performed queries on it. A single request went fine, and so did 2, 3, etc. in parallel, but as the number of parallel requests grew I noticed a lot of added latency; this query seems to be very intensive on memory (rightfully so). I tested it to the point that 40 parallel requests would take a total of 8 seconds to resolve. As I required a lot of future scalability I went with an additional tool for indexing (ES). What are your thoughts on all this?
