从零构建搜索引擎 build demo search engine from scratch

我们每天都会使用搜索引擎：打开google等搜索引擎，输入关键词，检索出结果，这是一次搜索；当打开历史记录旁边的🔍按钮，输入关键词，检索出结果，这也是一次搜索。
本文将从原理开始到最终实现一个搜索demo，探究搜索引擎背后是如何实现“一瞬间”从海量数据中检索出想要的结果的。

配合本文，实现了一个本文讲述的搜索引擎的demo用于检索内容，支持AND、包含、排除查询等常见的查询，也附带上简单的benchmark对比，可见我的github仓库：https://github.com/578223592/go-small-practices/tree/main/search_engine

全文搜索

全文搜索可以简单的定义为：根据输入的查询语句（query），在一组文档（document）中，筛选出符合条件的文档。

过程

要实现全文搜索，我们可以想象到最简单的方式就是循环去匹配，like如下的伪代码，想象一下这种循环会在文档数量增加和查询复杂后产生多少次循环匹配计算：

    a = ["just", "do"] #查询条件count = 0  #匹配数量for d in tokenized_documents:  #遍历文档for t in d:  #遍历文档中的每个语句for c in a: # 遍历查询条件if c == t: # 如果匹配，则count++count += 1return count

因此在现实世界，会对文档进行预处理（分析），分析之后产生索引。查询的时候根据索引去匹配定位到源文件，比如下图这样的管道：

实际上索引是不同查询引擎里面影响性能的关键，我们通常可以用如下指标来评价一个搜索索引的好坏：

计算索引的速度
压缩率
可拓展性
查询性能

由于本人目前对搜索索引的了解也不是特别深入，因此这里就不卖弄了。

关键词

document：一组单词。

Do it. Just do it. Don’t let your dreams be dreams. Yesterday, you said tomorrow. So just do it. Make your dreams come true. Just do it. Some people dream of success, while you’re gonna wake up and work hard at it. Nothing is impossible. You should get to the point where anyone else would quit, and you’re not gonna stop there. No, what are you waiting for? Do it! Just do it! Yes you can. Just do it.

term：词。

just

token：用户输入的查询词（或其变体）在被检索文档中出现的每一个具体匹配实例。

Just do it. .... So just do it. Make your dreams come true. Just do it. ... Just do .... Just do it.

query：指定特定格式的查询语句

以下举例的是本文构建出来的搜索引擎支持的查询格式，不同引擎的语句可能是不同的。

just：查询包含just的文档

just do：查询包含just或者包含do的文档

just AND do：查询同时包含just和do的文档

"just do"：查询顺序出现just和do的文档

just do+it：出现it的文档中，查询包含just或者包含do的文档

just do -it：查询包含just或者包含do的文档，再剔除掉包含it的文档。

score：文档匹配的分数，如果多个文档都符合查询条件，那么文档的列表会按照分数从高到低给出。本文搜索引擎是按照文档中匹配到的词的个数来评分，现实世界中的评分会参考更多因素，比如说地区、性别等等。

index：索引，搜索的索引为inverted-index（倒序索引），其可以根据关键词寻找关键词出现的位置，通常的结构如下。这是搜索引擎如此之快的关键奥秘！通过inverted-index，我们寻找关键词不再需要通过遍历文档，而是可以直接通过类似key->value查找的方式，速度非常快。

just -> [(1, 0), (1, 5), (1, 20), (1, 28), (1, 76), (1, 82)] #含义是just出现的位置是document1的第0个位置，document1的第5个位置，document1的第20个位置。。。
do -> [(0, 94), (0, 399), (0, 455), (0, 493), (1, 1), (1, 3), (1, 6), (1, 21), (1, 29), (1, 74), (1, 77), (1, 83), (2, 15), (2, 33), (2, 44)] #含义是do出现的位置是。。。

搜索索引称为inverted-index的原因是通常的索引是根据位置去找内容，而inverted-index是根据内容去找位置，是inverted版本的普通索引。

关键算法实现

创建倒序索引：

func (a *Analyser) AnalyseAndIndex(documents []string) {for docId, oneDocument := range documents {tokens := regexp.MustCompile(`\w+`).FindAllString(oneDocument, -1) // 使用正则将文档拆解成termfor index, token := range tokens {token = strings.ToLower(token)local_ := local{DocId: int64(docId), Indexes: int64(index)} //use inverted index to index termsif a.index == nil {a.index = make(map[string][]local)}a.index[token] = append(a.index[token], local_)}}
}

‍

搜索：

我们将文档检索分成四种模式，如下代码块所示，默认来说是OR模式，如果出现其他模式的关键词，那么就会调整对应的搜索模式：

const (AND     searchModeExp = "AND"OR      searchModeExp = "OR"INCLUDE searchModeExp = "+"EXCLUDE searchModeExp = "-"
)

我们创建了两个结果数组，分别用于记录结果和记录必须要排除掉的文档（EXCLUDE searchModeExp = "-")：

includeDocScoresMap := make(map[int64]float64)excludeDocMap := make(map[int64]any)

此外，还需要注意的是，为了支持"just do"这样保证顺序的查找，我们需要匹配成对出现的"，然后根据其中的内容去文档中固定匹配内容，而不能直接简单组合不同文档的内容。

整个搜索过程关键算法如下：

func (a *Analyser) Search(keyWords string) []SearchRes {includeDocScoresMap := make(map[int64]float64)excludeDocMap := make(map[int64]any)queryExpressions := a.parseQuery(keyWords)var searchMode = OR        // for i := 0; i < len(queryExpressions); i++ {var locals []localif isSearchMode(queryExpressions[i]) { //调整搜索模式searchMode = searchModeExp(queryExpressions[i])continue} else if queryExpressions[i] == Quote {endIndex := stringsIndex(queryExpressions[i+1:], Quote)if endIndex == -1 {continue // 缺少结束引号，忽略}if endIndex == 0 {continue // 引号中间没有任何exp，忽略}quoteQueryExpressions := queryExpressions[i+1 : i+1+endIndex]i += endIndex + 1for i2 := range quoteQueryExpressions {quoteQueryExpressions[i2] = strings.ToLower(quoteQueryExpressions[i2])}firstExp, otherExp := quoteQueryExpressions[0], quoteQueryExpressions[1:]var ok boollocals, ok = a.index[firstExp]if !ok {// not found,early continuecontinue}for _, exp := range otherExp {locals = a.findExpInIndex(locals, exp)}} else {//	searchlocals = a.index[queryExpressions[i]]}switch searchMode {case AND:for _, l := range locals {if _, ok := includeDocScoresMap[l.DocId]; ok {includeDocScoresMap[l.DocId] += 1}}case OR:for _, l := range locals {includeDocScoresMap[l.DocId] += 1}case INCLUDE:locsMap := make(map[int64]float64, len(includeDocScoresMap))for _, l := range locals {locsMap[l.DocId] = includeDocScoresMap[l.DocId] + 1}includeDocScoresMap = locsMapcase EXCLUDE:for _, l := range locals {excludeDocMap[l.DocId] = nil}}searchMode = OR}res := make([]SearchRes, 0)for k, v := range includeDocScoresMap {if _, ok := excludeDocMap[k]; ok {continue}res = append(res, SearchRes{DocId: k, Score: v})}slices.SortFunc(res, func(a, b SearchRes) int {return int(b.Score - a.Score) // order by score desc})return res
}