Search Engine

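Search and recommendation both deliver the information a user needs. Search is "pull": the user drives retrieval by issuing a query. Recommendation is mostly "push": the system decides what to surface.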

In some business scenarios, user feedback on query-to-item (q2i) results is used for ranking, which further improves business metrics.

1. requirements

Product / features / use cases

Is it a general-purpose search engine (like Google) or a specialized one (like Amazon product search)?
What are the specific use cases and scenarios where it will be applied?
What are the system requirements (such as response time, accuracy, scalability, and integration with existing systems or platforms)?
How many languages need to be supported?
What types of items (products) are available on the platform, and what attributes are associated with them?
What are the common user search behaviors and patterns? Do users frequently use filters, sort options, or advanced search features?
Are there search challenges specific to the use case (e-commerce), such as handling product availability, pricing, and customer reviews?

Objectives

Constraints

Is there any data available? In what format?
response time, accuracy, scalability (50M DAU)
budget limitations, hardware limitations, or legal and privacy constraints
What is the expected scale of the system in terms of data and user interactions?

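A search system first preprocesses the documents and builds an index. At query time it looks up matches in the index, ranks them, and returns the results.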
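The index, match, and rank steps above can be sketched with a toy inverted index. This is a minimal illustration only: the whitespace tokenizer, the term-frequency score, and the names `tokenize`, `build_index`, and `search` are assumptions made for the example, not part of any particular engine.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real tokenizer)."""
    return text.lower().split()


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, query):
    """Return ids of docs containing every query term, best score first."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(t, {}) for t in terms]
    # Matching: intersect the posting lists of all query terms.
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= set(p)
    # Ranking: summed term frequency (assumed score), ties broken by doc id.
    return sorted(candidates, key=lambda d: (-sum(p[d] for p in postings), d))


docs = {
    1: "beijing hot spring resort",
    2: "famous hot spring in beijing hot spring guide",
    3: "shanghai city guide",
}
index = build_index(docs)
print(search(index, "beijing hot spring"))
```

A production system would replace the raw term-frequency score with something like BM25 or a learned ranker, but the three stages stay the same.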

The six core components of a search engine: crawling, parsing, indexing, link analysis, query processing, and ranking.

query understanding

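Preprocessing: drop invalid characters and emoji, truncate to the first n characters, and so on.
Spelling correction: covers error detection and automatic correction. Current approaches include noisy channel models, sequence labeling models, and seq2seq models.
Word segmentation and part-of-speech tagging: usually done with an off-the-shelf segmenter plus a user dictionary.
Term weighting: estimate the importance of each term, usually computed offline from user click logs.
Synonyms
Intent analysis: to keep the set of intents extensible, this is usually framed as many binary classification tasks. There are many methods; CNNs are the most common, and BERT distilled into a CNN is also used.
Entity recognition: identify the entity terms in the query, usually with a sequence labeling model such as BiLSTM+CRF, or BERT distilled into a BiLSTM.
Term dropping: search engines today still recall documents mostly by text matching, so semantically unimportant terms in the query are dropped, often in several rounds. For example, for "北京著名的温泉" ("famous hot springs in Beijing"), recall first drops "的" and intersects the three terms "北京, 著名, 温泉" with the document set. If that gives no good results, another term is dropped and "北京, 温泉" is intersected with the document set. This step is usually also done with sequence labeling.
Query rewriting: term dropping and spelling correction are themselves forms of rewriting, but here rewriting means finding queries that are equivalent or close to the original query. Rule-based methods are the most common, and seq2seq is also used.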
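The multi-round term dropping described in the list above can be sketched as follows. The notes say the terms to drop are usually chosen by a sequence labeling model; this sketch swaps in a simpler rule (drop the term with the lowest weight) so it stays self-contained. The weights, the toy index, and the names `recall` and `recall_with_term_dropping` are all hypothetical.

```python
def recall(index, terms):
    """Doc ids that contain every term (posting-list intersection)."""
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for t in terms[1:]:
        result &= index.get(t, set())
    return result


def recall_with_term_dropping(index, terms, weights, min_results=1):
    """Retry recall, dropping the lowest-weight term each round."""
    terms = list(terms)
    while terms:
        docs = recall(index, terms)
        if len(docs) >= min_results:
            return terms, docs
        if len(terms) == 1:
            break  # nothing left to drop
        weakest = min(terms, key=lambda t: weights.get(t, 0.0))
        terms.remove(weakest)
    return terms, set()


# Toy posting lists and made-up term weights (assumptions for illustration).
index = {
    "北京": {1, 2, 3},
    "著名": {4},
    "温泉": {2, 3, 5},
}
weights = {"北京": 0.8, "著名": 0.3, "的": 0.0, "温泉": 0.9}

kept, docs = recall_with_term_dropping(
    index, ["北京", "著名", "的", "温泉"], weights
)
print(kept, sorted(docs))
```

With this toy data, "的" is dropped first, then "著名", and the remaining terms "北京" and "温泉" match documents 2 and 3.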

The overall search recall flow is roughly as follows, using the query "北京著名的温泉" as an example:

The ranking was good if the user clicked on some link and spent significant time reading the page.
The ranking was not so good if the user clicked on a link to a result and then hit “back” quickly.
The ranking was bad if the user clicked on the “next page” link.
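The three feedback rules above can be turned into a simple session labeler. This is a rough sketch: the `Session` fields, the 30-second and 5-second thresholds, and the choice to let a "next page" click override everything else are assumptions, since the notes do not define "significant time" or "quickly".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    clicked: bool                    # user clicked some result
    dwell_seconds: Optional[float]   # time on the clicked page, None if no click
    went_to_next_page: bool          # user clicked the "next page" link


def judge_ranking(s, long_dwell=30.0, quick_back=5.0):
    """Label one search session using the three heuristics (assumed thresholds)."""
    if s.went_to_next_page:
        return "bad"
    if s.clicked and s.dwell_seconds is not None:
        if s.dwell_seconds >= long_dwell:
            return "good"
        if s.dwell_seconds <= quick_back:
            return "not so good"
    # Anything in between carries no clear signal under these rules.
    return "unknown"


print(judge_ranking(Session(True, 120.0, False)))
print(judge_ranking(Session(True, 2.0, False)))
print(judge_ranking(Session(False, None, True)))
```

Labels like these are noisy for a single session, so they are typically aggregated over many sessions before being used as an evaluation metric or training signal.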

offline

online

A/B test
