Recently I have been studying PostgreSQL for building a search engine. In short, a search engine revolves around the inverted index. First, an article or sentence is segmented into N words, each carrying a certain weight; there is a lot of room for optimization in this step, because how accurately the text is cut into meaningful words has a big impact on later searches (see the TF-IDF and TextRank algorithms). Second, the inverted index is built, associating each word with its positions in the article. Third, with the inverted index in place, we can search: the query is segmented the same way, assembled into search conditions, and fed to the engine. Fourth, the search results are processed.

However, the industry-standard search engine is Elasticsearch (ES), so why would I want to use PG? For massive data volumes ES is probably the better solution, but for ordinary cases ES is too heavy. Can PG handle the common cases? If both our business data and our search can live in PG, there is no need to synchronize between the database and ES, and the overall complexity drops a lot.

This is the reason why I made this attempt.

PG Internal Support

To use full-text indexing in PostgreSQL (PG for short), you need a built-in data type provided by PG, called tsvector.

A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different
variants of the same word (see Chapter 12 for details). Sorting and duplicate-elimination are done automatically
during input, as shown in this example:

SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
                      tsvector
----------------------------------------------------
 'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'

tsvector is used to store the segmented-word vector; let’s look at a simple example.

# SELECT tokens FROM cargos;

'一':66B,70B '一部分':68B,72B '上':35B '不是':62B '两极':29B '两极分化':31B '中国':44B,54B '丰':19B '丰衣足食':22B...

As you can see, this is the inverted index we described earlier: each entry is a word together with its positions in the article. The A and B suffixes are the word’s weights; PG has four weights, A through D, with A the highest and D the lowest. The higher the weight, the higher the word can rank in subsequent searches.
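You can try this yourself without any table; a minimal psql sketch of how setweight attaches weights to positions (the output should look roughly like this):

# SELECT setweight(to_tsvector('simple', 'fat cat fat'), 'A');
       setweight
-----------------------
 'cat':2A 'fat':1A,3A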

Building the inverted index

After we get an article, the first thing to do is segment it. I use sego, because gojieba kept panicking for reasons I never tracked down, so for simplicity I went with sego first. If we later find that sego leaves too little room for optimization, we can wrap the Python version of jieba as a standalone service.

My table is designed as follows:

  • title — the title
  • description — the article body or description
  • tokens — the segmented-word vector (tsvector)

This should work for most scenarios, such as searching for articles, products, lyrics, posts, etc. The code is as follows.

var (
    titleWords       []string
    descriptionWords []string
)
// Segment the title; search mode (true) produces finer-grained tokens.
for _, i := range sego.SegmentsToSlice(segmenter.Segment([]byte(cargo.Title)), true) {
    // Skip single-byte tokens (punctuation, lone ASCII characters).
    if len(i) <= 1 {
        continue
    }

    titleWords = append(titleWords, i)
}
// Segment the description the same way.
for _, i := range sego.SegmentsToSlice(segmenter.Segment([]byte(cargo.Description)), true) {
    if len(i) <= 1 {
        continue
    }

    descriptionWords = append(descriptionWords, i)
}

The next step is to write the segmentation results back into the tokens column.

sql := `UPDATE cargos SET tokens = setweight(to_tsvector('simple', $1), 'A') || setweight(to_tsvector('simple', $2), 'B') WHERE id = $3`

_, err = tx.Exec(ctx, sql, strings.Join(titleWords, " "), strings.Join(descriptionWords, " "), cargo.ID)

to_tsvector builds the vector; the leading 'simple' configuration means the input is split on whitespace only (if omitted, the configuration from default_text_search_config is used, typically english). setweight sets the weight on the result of to_tsvector; the options are A through D, as described above. || concatenates multiple vectors.
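For example, here is a minimal sketch of what the concatenation produces. Note that || shifts the positions of the second vector so they continue after the first; the output should look roughly like this:

# SELECT setweight(to_tsvector('simple', 'quick cat'), 'A') || setweight(to_tsvector('simple', 'lazy cat'), 'B');
             ?column?
-----------------------------------
 'cat':2A,4B 'lazy':3B 'quick':1A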

In this mode the segmentation is controlled entirely at the application level, whereas most of the examples you can find online are based on a compiled PG extension. I prefer application-level segmentation, which has the following advantages.

  • Simple to maintain, no compilation. A cloud-hosted PG may not even allow loading self-compiled extensions.
  • Easy to scale: scaling at the application level is much easier than scaling the database. Segmentation itself is CPU-intensive, so it quickly becomes a bottleneck inside the database.
  • Fast to update: shipping new dictionary entries or segmentation features in the application is much easier than upgrading a database plugin.

At this point, the segmented vectors are stored. Next we still need to create an index.

-- pg_trgm is not needed here: a GIN index on a tsvector column
-- uses the built-in tsvector_ops operator class.
CREATE INDEX IF NOT EXISTS cargos_token_idx ON cargos USING GIN(tokens);

I use a GIN index; GiST is the other option. See the differences at: https://www.postgresql.org/docs/current/textsearch-indexes.html.
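To confirm the index is actually used, EXPLAIN should show a Bitmap Index Scan on cargos_token_idx rather than a sequential scan. A sketch (the plan is abbreviated, and its exact shape depends on your data and PG version):

# EXPLAIN SELECT id FROM cargos, to_tsquery('simple', 'Linux') query WHERE tokens @@ query;
 Nested Loop
   ->  Function Scan on to_tsquery query
   ->  Bitmap Heap Scan on cargos
         Recheck Cond: (tokens @@ query.query)
         ->  Bitmap Index Scan on cargos_token_idx
               Index Cond: (tokens @@ query.query)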

Searching

After saving the data, the next thing we have to do is to search.

// Segment the user's query with the same segmenter used at index time.
var queryWords []string
for _, i := range sego.SegmentsToSlice(segmenter.Segment([]byte(query)), true) {
    // Skip single-byte tokens (punctuation, lone ASCII characters).
    if len(i) <= 1 {
        continue
    }

    queryWords = append(queryWords, i)
}

sql := `SELECT id, created_at, updated_at, title, ts_rank(tokens, query) AS score FROM cargos, to_tsquery('simple', $1) query WHERE tokens @@ query ORDER BY score DESC`

// Join the words with | (OR) so a document matching any term is returned.
rows, err := tx.Query(ctx, sql, strings.Join(queryWords, " | "))
// ...

to_tsquery parses the query expression. Its syntax is as follows (refer to the documentation).

  • & means both conditions must be satisfied (AND)
  • | means either condition may be satisfied (OR)
  • ! negates a term (NOT)
  • <-> means word B immediately follows word A, i.e. an "A B" phrase
  • blabla:* matches any word with the prefix blabla

ts_rank computes a score from the weights, which is convenient for the subsequent ORDER BY ranking.
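These operators are easy to try directly in psql; a few self-contained examples using the simple configuration (no table required, expected results in comments):

# SELECT to_tsvector('simple', 'fat cat sat on a mat') @@ to_tsquery('simple', 'fat & cat');   -- t
# SELECT to_tsvector('simple', 'fat cat sat on a mat') @@ to_tsquery('simple', 'fat <-> mat'); -- f, not adjacent
# SELECT to_tsvector('simple', 'fat cat sat on a mat') @@ to_tsquery('simple', 'fat <-> cat'); -- t, adjacent
# SELECT to_tsvector('simple', 'fat cat sat on a mat') @@ to_tsquery('simple', 'ca:*');        -- t, prefix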

Performance testing

I imported all of my blog’s articles repeatedly until I had accumulated about 100 million tokens. Here is the actual test data.

# \timing on
Timing is on.
# SELECT id, ts_rank(tokens, query) AS score FROM cargos, to_tsquery('simple', '共同 & 富裕') query WHERE tokens @@ query ORDER BY score DESC;
Time: 1.775 ms  (1 row)
# SELECT id, ts_rank(tokens, query) AS score FROM cargos, to_tsquery('simple', 'Linux') query WHERE tokens @@ query ORDER BY score DESC;
Time: 317.306 ms  (47775 rows)
# SELECT id, ts_rank(tokens, query) AS score FROM cargos, to_tsquery('simple', 'Go & 并发') query WHERE tokens @@ query ORDER BY score DESC;
Time: 111.316 ms  (16170 rows)
# SELECT id, ts_rank(tokens, query) AS score FROM cargos, to_tsquery('simple', 'Windows & 虚拟机') query WHERE tokens @@ query ORDER BY score DESC;
Time: 57.231 ms  (8085 rows)

# SELECT COUNT(*) FROM cargos;
 count  
--------
 360152
(1 row)

Time: 48.547 ms
# SELECT SUM(LENGTH(tokens)) FROM cargos;
    sum    
-----------
 101115491
(1 row)

Time: 477.441 ms

As you can see, there are 360,000 articles in total, and about 100 million tokens after segmentation. When searching, the response time is roughly proportional to the number of rows returned, and responses are very fast when there are few results. I think PG is entirely sufficient for common scenarios. There is still plenty to optimize: removing stop words, punctuation, special characters, and other useless words; using TF-IDF to extract keywords and give them higher weight while lowering the weight of the rest. After these optimizations the overall performance should be much better.
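For instance, stop words can be stripped before building the vector at the application level, or after the fact in SQL with ts_delete. A minimal sketch (the stop-word list here is just for illustration; note that the remaining positions are preserved):

# SELECT ts_delete(to_tsvector('simple', '的 一个 中国 发展'), ARRAY['的', '一个']);
     ts_delete
--------------------
 '中国':3 '发展':4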

Summary

This article summarizes my experience of experimenting with PG as a search engine. After this verification, PG is fully capable of handling common scenarios. In the future, when I build something of my own that needs search, I believe this approach will keep the overall system much simpler.