A common feature that a software application often needs to support is efficient text search. Search is simply expected to be available, and users want it to be fast. This article explores how efficient text search support can be provided by the underlying application database. We will look at Postgres and its built-in support for efficient indexed text search. But other database systems provide similar functionality.
Notice that I earlier provided a brief introduction to Postgres.
The Enron dataset
To have some data to test on I have used the Enron dataset. A large number of emails from Enron employees was made available to the public after the Enron scandal. This section covers how to import the data into Postgres.
The original dataset is simply a set of folders and files, where folders are organized by users and emails stored in individual files. The dataset (or parts thereof) has also been made available in a MySQL format. We will use this as the starting point for importing it into Postgres.
First however, we will create a database to hold the data:
> createdb enron
The MySQL format is different from Postgres, so to make it easy we will manually create the Postgres tables that are part of the MySQL dataset:
-- Table structure for table 'employeelist' CREATE TABLE employeelist ( eid SERIAL, firstName varchar(31) NOT NULL default '', lastName varchar(31) NOT NULL default '', Email_id varchar(31) NOT NULL default '', PRIMARY KEY (eid), UNIQUE (Email_id) ); -- Table structure for table 'message' CREATE TABLE message ( mid int4 NOT NULL default '0', sender varchar(127) NOT NULL default '', date timestamp default NULL, message_id varchar(127) default NULL, subject text, body text, folder varchar(127) NOT NULL default '', PRIMARY KEY (mid) ); -- Table structure for table 'recipientinfo' CREATE TYPE address_format AS ENUM ('TO', 'CC', 'BCC'); CREATE TABLE recipientinfo ( rid int4 NOT NULL default '0', mid int4 NOT NULL default '0', rtype address_format default NULL, rvalue varchar(127) default NULL, dater timestamp default NULL, PRIMARY KEY (rid) ); -- Table structure for table 'referenceinfo' CREATE TABLE referenceinfo ( rfid int4 NOT NULL default '0', mid int4 NOT NULL default '0', reference text, PRIMARY KEY (rfid) );
Once this is done we will transform the MySQL data dump by:
- Removing unnecessary and unsupported statements
- Remove all the CREATE TABLE statements that are MySQL specific (we have already created the same tables above)
To do this we will execute the following command in the terminal:1
> sed -e '/USE \`enron\`;/d' -e '/CREATE TABLE/,/;/d' < enron-mysqldump.sql > enron-psqldump-insert-only.psql
enron-mysqldump.sql file is the file downloaded from the MySQL dump, and the resulting file
enron-psqldump-insert-only.psql we will use to insert data into our Postgres database:
> psql -d enron -f enron-psqldump-insert-only.psql
After this is done we can inspect the data imported:
SELECT COUNT(*) FROM message; count ---------- 252759
If successful, we should get as response that there are 252,759 messages in the database. With the data available in the database, we can start exploring it.
A relational database consists of tables and columns with rows of data. Some columns contain text that should be searchable while others do not.
The application might want the user to be able to search the email subject and body columns only.
A simple way of accomplishing this is to use SQL
LIKE queries. For example, if the user searches for the term ‘debt’ or the term ‘loan’, the following queries can be used to finds such emails:
-- Executing query: SELECT * FROM message WHERE body ILIKE '%debt%'; Total query runtime: 13231 ms. 4198 rows retrieved. -- Executing query: SELECT * FROM message WHERE body ILIKE '%debt%' OR subject ILIKE '%loan%'; Total query runtime: 21072 ms. 4387 rows retrieved. -- Executing query: SELECT * FROM message WHERE body ILIKE '%debt%' OR subject ILIKE '%loan%' LIMIT 100; Total query runtime: 861 ms. 100 rows retrieved.
Some of these queries take over 20 seconds, which is not very responsive. This kind of search has severe drawbacks:
- They are slow because there is no indexing support. Each time the query is issued we have to go through all the rows in the tables being queried. This works for small databases, but becomes untenable when the dataset grows.
- The results cannot be ranked. All we know is whether or not a particular row matched the query.
- The query is based on evaluating regular expressions which makes it difficult to take derived words or synonyms into consideration during the search. For example, if the user searches for the word city, tasks with the word cities would also be expected to be found. Handling this manually quickly becomes unpractical.
Modern relational databases such as Postgres has built-in support to deal with efficient text search that avoids these kinds of problems.
Full text search
Let’s take a closer look at how Postgres supports full text search and how we can make use of it for more efficient search across large datasets.
What the user really is searching for are called documents. A document might be a blog post, an email, a task in a tasking application, or a report. The content of the document (or what should be searchable) might be stored in multiple columns or tables in the database.
Documents can be created by referencing and concatenating multiple columns like this:
SELECT coalesce(subject,'') || ' ' || coalesce(body, '') AS document FROM message LIMIT 5
We are not doing anything special here, just concatenating columns using the coalesce function to avoid getting into trouble when columns contain
From a high-level perspective, the following process is gone through to create text indexes for efficient performance.
- Convert documents into tokens (words)
- Covert tokens into lexemes (normalize words, remove stop words), based on a given dictionary (e.g. English)
- Store the lexemes in a format that are optimized for searching
Search types and text matching
The basic Postgres type for dealing with text is tsvector. The documentation:
For text search purposes, each document must be reduced to the preprocessed tsvector format. Searching and ranking are performed entirely on the tsvector representation of a document — the original text need only be retrieved when the document has been selected for display to a user. We therefore often speak of the tsvector as being the document, but of course it is only a compact representation of the full document.”
The following exemplifies how we can tokenize a string by converting it into a tsvector type:
SELECT 'The quick brown fox jumps over the lazy dog'::tsvector;
The result is:
"'The' 'brown' 'dog' 'fox' 'jumps' 'lazy' 'over' 'quick' 'the'"
This breakdown of the sentence is just based on tokens; we do not know what language the sentence is in and do not know, e.g., that “the” is only a non-interesting stop word. But we can do better. From the documentation:
For most English-text-searching applications the above words would be considered non-normalized, but tsvector doesn’t care. Raw document text should usually be passed through to_tsvector to normalize the words appropriately for searching
SELECT to_tsvector('english', 'The quick brown fox jumps over the lazy dog');
The result is:
"'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2"
Here we are using the built-in support for English, using a dictionary and support for stop words. The result is a set of lexemes: the tokens have been normalized and the stop words have been removed. The two ‘the’ words are removed and ‘jumps’ is represented by the stem word ‘jump’ etc.
We can now do some basic text matching using the match operator
SELECT 'The quick brown fox jumps over the lazy dog'::tsvector @@ 'jump & over'::tsquery; ?column? ---------- f
The match returns false because the word ‘jump’ does not appear in the tokenized word set. However, if we use the lexemes from the English dictionary, we do get a match:
SELECT to_tsvector('english', 'The quick brown fox jumps over the lazy dog') @@ to_tsquery('jump & over'); ?column? ---------- t
Query without an index
We can do these kind of queries on the Enron dataset too, in this case we are searching for the word ‘debt’ in the body of the email messages:
-- Executing query: SELECT body FROM message WHERE to_tsvector('english', body) @@ to_tsquery('english', 'debt') LIMIT 100; Total query runtime: 12786 ms. 100 rows retrieved.
Here we only look for 100 records, and it took over 12 seconds. The problem is we have a lot of data and it is not indexed.
Full text search using an index
To improve text search performance we will now use Postgres’ built-in indexing support to speed up our text search queries.
We first introduce a new column to our database table to hold the tsvector representation of our document.
-- Executing query: ALTER TABLE message ADD COLUMN textsearchable_index_col tsvector; Query returned successfully with no result in 32 ms.
We then populate the new column with our document using an UPDATE query:
-- Executing query: UPDATE message SET textsearchable_index_col = to_tsvector('english', coalesce(subject,'') || ' ' || coalesce(body,'')); Query returned successfully: 252759 rows affected, 341102 ms execution time.
As can be seen, this update took a long time, but it was a one time operation on all the data we had in the table. When new data is inserted into this table there should be a trigger setup that automatically updates the tsvector column. How to create triggers for this purpose will not be covered here, but is well documented.
Next we want to create our index. There are two types of indexes supported in Postgres with different performance characteristics: GIN and GIST:
As a rule of thumb, GIN indexes are best for static data because lookups are faster. For dynamic data, GiST indexes are faster to update. Specifically, GiST indexes are very good for dynamic data and fast if the number of unique words (lexemes) is under 100,000, while GIN indexes will handle 100,000+ lexemes better but are slower to update.
Since we are dealing with static data here, we will as suggested create a GIN index.
-- Executing query: CREATE INDEX textsearch_idx ON message USING gin(textsearchable_index_col) Query returned successfully with no result in 483200 ms.
When this index has been created we don’t have to worry anymore. If the triggers have been setup, the tsvector column (which keeps track of the lexemes of the document) will be updated when new data is inserted. The created index is automatically updated and maintained by Postgres.
It can be helpful to give a reasonable name to the created indexes, something that describes what they are for. Index names can easily be changed at a later time:
-- Executing query: ALTER INDEX IF EXISTS textsearch_idx RENAME TO message_text_idx; Query returned successfully with no result in 12 ms.
Query using an index
We can now query our Enron dataset using the newly created index. Let’s search for messages that contain the word ‘debt’.
-- Executing query: SELECT * FROM message WHERE textsearchable_index_col @@ to_tsquery('debt') ORDER BY DATE LIMIT 100; Total query runtime: 445 ms. 100 rows retrieved.
Notice that we are not querying any of the old columns but instead we are querying our indexed column that contain the lexemes.
Here is another example searching for all messages containing the words ‘debt’ or ‘loan’, with the results ordered by date (latest first):
-- Executing query: SELECT * FROM message WHERE textsearchable_index_col @@ to_tsquery('debt | loan') ORDER BY date DESC; Total query runtime: 4602 ms. 5151 rows retrieved.
As can be seen we are significantly improving our search performance using an index.
There are lot more things that can be done with Postgres and full text search of course. It is for example possible to completely configure the dictionaries to specialize the search for particular domains, and use different sets of stop-words etc. You can use dictionaries to:
- Define stop words that should not be indexed.
- Map synonyms to a single word using Ispell.
- Map phrases to a single word using a thesaurus.
- Map different variations of a word to a canonical form using an Ispell dictionary.
- Map different variations of a word to a canonical form using Snowball stemmer rules.
Full text search is not difficult and it is built right into Postgres, so there is really no excuse not to use it.
I’m here assuming a Unix-like environment. ↩