Recently I have been studying PostgreSQL for building a search engine. In short, a search engine is built around an inverted index. The first step is tokenization: an article or sentence is split into N words, each carrying a certain weight. There is a lot of room for optimization in this step, because how accurately the text is split into meaningful words has a big impact on later searches; see the TF-IDF or TextRank algorithms. The second step is building the inverted index, that is, associating each word with its positions in the article. The third step, with the inverted index in place, is searching: the input is tokenized in the same way, assembled into search conditions, and sent to the engine. The fourth step is processing the search results.
However, the industry-standard search engine is Elasticsearch (ES), so why do I want to use PG? For massive data, ES may well be the better solution, but for the common case ES is too heavyweight. Can PG handle the common case? If both our business data and our search can be handled by PG, we don't need to synchronize between the database and ES, and the overall complexity drops considerably.
This is the reason why I made this attempt.
PG Internal Support
To use full-text indexes in PostgreSQL (PG for short), you need a built-in data type called `tsvector`, which stores the tokenized word vector. Let's look at a simple example.
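A minimal illustration (the sample text is my own; the output format follows the PG documentation):

```sql
-- Tokenize with the built-in 'simple' configuration, which splits on whitespace
SELECT to_tsvector('simple', 'hello world hello');
-- → 'hello':1,3 'world':2

-- Attach a weight to every word in the vector
SELECT setweight(to_tsvector('simple', 'hello world'), 'B');
-- → 'hello':1B 'world':2B
```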
As you can see, this is exactly the inverted index described earlier: each entry consists of a word and the positions where it appears. The letter after a position (`B` here) is the word's weight. PG has four weights, `A`, `B`, `C`, and `D`, with `A` the highest and `D` the lowest. The higher the weight, the higher a match can rank in subsequent searches.
Building the inverted index
After we get the article, the first step is tokenization. I use `sego`, because `gojieba` kept panicking for reasons I never tracked down, so for simplicity I went with `sego` first. If we later find that `sego` doesn't leave enough room for optimization, we can wrap the Python version of `jieba` as a standalone service.
My database is designed as follows: a `description` column stores the article or description, and a `tokens` column stores the tokenized word vectors.
This should cover most scenarios, such as searching articles, products, lyrics, or posts.
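A minimal sketch of such a table (the table and column names are my illustration, not necessarily the original schema):

```sql
CREATE TABLE items (
    id          bigserial PRIMARY KEY,
    description text NOT NULL,  -- the article or description
    tokens      tsvector        -- the tokenized word vectors
);
```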
The next step is to write the tokenization results into the `tokens` column. `to_tsvector` builds the vector; passing `simple` as the first argument means the input is simply split on spaces, and if no configuration is given, the default one (typically `english`) is used. The `setweight` function sets the weight of the result of `to_tsvector`; the options are `A`, `B`, `C`, and `D`, as described above. The `||` operator merges multiple word vectors.
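Putting this together, an update might look like the following (the `items` table and the title/body split are my illustration; the words would come from application-side tokenization):

```sql
UPDATE items
SET tokens =
        setweight(to_tsvector('simple', 'postgres search'), 'A')        -- e.g. title words
     || setweight(to_tsvector('simple', 'full text search demo'), 'B')  -- e.g. body words
WHERE id = 1;
```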
This approach controls tokenization entirely at the application level, whereas most of the solutions you can find online are based on compiled PG plugins. I prefer application-level tokenization, which has the following advantages.
- Simple to maintain, no compilation needed. A cloud-hosted PG may not even allow loading self-compiled plugins.
- Easy to scale: scaling at the application level is much easier than scaling the database. Tokenization itself is a CPU-intensive task, so it quickly becomes a bottleneck inside the database.
- Fast to update: rolling out new features, or updates to the tokenizer and its dictionary, is very easy in the application.
At this point we have the word vectors stored in `tokens`. Next we still need to create an index. I use a GIN index; GiST is the other option. See the differences at: https://www.postgresql.org/docs/current/textsearch-indexes.html.
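Creating the index is a one-liner (again assuming the hypothetical `items` table above):

```sql
CREATE INDEX idx_items_tokens ON items USING GIN (tokens);
```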
After saving the data, the next thing we have to do is to search.
`to_tsquery` parses a query expression with the following syntax (see the documentation):

- `&` means both conditions must be satisfied
- `|` means either condition may be satisfied
- `!` is the NOT operation, meaning the term must not match
- `<->` means word A is immediately followed by word B, much like a phrase search
- `blabla:*` matches by prefix
`ts_rank` computes a score based on the weights, which is convenient for the subsequent `ORDER BY` ranking.
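A search query might then look like this (a sketch against the hypothetical `items` table; the search terms are examples):

```sql
SELECT id, description,
       ts_rank(tokens, query) AS rank
FROM items, to_tsquery('simple', 'postgres & search') AS query
WHERE tokens @@ query
ORDER BY rank DESC
LIMIT 10;
```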
I repeatedly imported all the articles from my blog until I had accumulated 100 million tokens; here is the actual test data.
As you can see, there are 360,000 articles in total, and about 100 million words after tokenization. The search response time is basically proportional to the number of results returned, and the response is very fast when there are few matches. I think PG is entirely sufficient for the common scenarios. There is still plenty to optimize, for example: removing common particles, punctuation, and special characters; dropping most useless words; and using TF-IDF to extract keywords, giving them higher weight while lowering the weight of the remaining words. With these optimizations, overall performance should be much better.
This article summarizes my experience experimenting with PG as a search engine. After this verification, PG is fully capable of handling common scenarios. When I build things of my own that need search in the future, I believe this solution can keep the overall system simpler.