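Elastic Learned Sparse EncodeR (ELSER) is an NLP model trained by Elastic that lets you perform semantic search using a sparse vector representation. Instead of matching search terms literally, semantic search retrieves results based on the intent and contextual meaning of the search query.

This tutorial shows you how to use ELSER to perform semantic search on your data.

Tip: During semantic search with ELSER v1, only the first 512 extracted tokens of each field are considered. For details, see the ELSER documentation.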
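
To make the 512-token limit concrete, here is a minimal sketch. It assumes whitespace splitting as a stand-in for the real tokenizer (ELSER does not tokenize this way), and `split_visible` is a hypothetical helper, not part of any Elastic API. It only illustrates that text beyond the limit does not contribute to the search.

```python
from typing import List, Tuple

MAX_TOKENS = 512  # limit stated in the tip above for ELSER v1


def split_visible(text: str, limit: int = MAX_TOKENS) -> Tuple[List[str], List[str]]:
    """Split text into tokens within the limit and tokens beyond it.

    Whitespace splitting is an assumption made for illustration only.
    """
    tokens = text.split()
    return tokens[:limit], tokens[limit:]


# A field with 600 whitespace-separated words.
seen, ignored = split_visible("word " * 600)
print(len(seen), len(ignored))
```

If long fields matter for your use case, one option is to split them into shorter passages before indexing.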

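Requirements

To perform semantic search with ELSER, you must have the NLP model deployed in your cluster. See the ELSER documentation to learn how to download and deploy the model.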

Create the index mapping

First, create the mapping of the destination index, the index that will contain the tokens the model creates from your text. The destination index must have a field of the rank_features type to index the ELSER output.

PUT my-index
{
  "mappings": {
    "properties": {
      "ml.tokens": {
        "type": "rank_features"
      },
      "text_field": {
        "type": "text"
      }
    }
  }
}

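Notes:

ml.tokens is the rank_features field that holds the model's predictions.
text_field is the text field from which the sparse vector representation is created.

For more on using the rank_features field, see the article “Elasticsearch：Rank feature query – 排名功能查询”.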

Create an ingest pipeline with an inference processor

Create an ingest pipeline with an inference processor so that ELSER runs inference on the data ingested through the pipeline.

PUT _ingest/pipeline/elser-v1-test
{
  "processors": [
    {
      "inference": {
        "model_id": ".elser_model_1",
        "target_field": "ml",
        "field_map": {
          "text": "text_field"
        },
        "inference_config": {
          "text_expansion": {
            "results_field": "tokens"
          }
        }
      }
    }
  ]
}

The text_expansion inference type must be used in the inference ingest processor.
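
The way field_map, target_field, and results_field fit together can be sketched in plain Python. This is only an illustration under assumptions: `toy_model` is a hypothetical stand-in that weights words by relative frequency (ELSER produces learned weights and expanded tokens), and `apply_inference` is a made-up helper, not Elasticsearch code. The point is the direction of the mapping: the document field text is passed to the model under the input name text_field, and the output lands under ml.tokens.

```python
from typing import Callable, Dict


def toy_model(text: str) -> Dict[str, float]:
    """Hypothetical stand-in for ELSER: weight each word by relative frequency."""
    words = text.lower().split()
    return {w: words.count(w) / len(words) for w in set(words)}


def apply_inference(
    doc: dict,
    model: Callable[[str], Dict[str, float]],
    model_id: str,
    field_map: Dict[str, str],
    target_field: str,
    results_field: str,
) -> dict:
    """Mimic the shape of the inference processor's output for one document."""
    # field_map: document field name -> input name the model expects
    model_input = {model_name: doc[doc_field] for doc_field, model_name in field_map.items()}
    enriched = dict(doc)
    enriched[target_field] = {
        results_field: model(model_input["text_field"]),
        "model_id": model_id,
    }
    return enriched


doc = {"id": 1, "text": "rest your muscles rest"}
result = apply_inference(
    doc, toy_model, ".elser_model_1", {"text": "text_field"}, "ml", "tokens"
)
print(result["ml"]["tokens"])
```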

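Load data

In this step, you load the data that you will later run through the inference ingest pipeline to extract tokens from it.

Use the msmarco-passagetest2019-top1000 dataset, a subset of the MS MARCO Passage Ranking dataset. It consists of 200 queries, each accompanied by a list of relevant text passages. All unique passages, along with their IDs, have been extracted from that dataset and compiled into a tsv file.

Download the file and upload it to your cluster using the Data Visualizer in the Machine Learning UI. Assign the name id to the first column and text to the second. The index name is test-data. Once the upload is complete, you can see an index named test-data with 182469 documents.

For a walkthrough of how to load this data, see the article “Elasticsearch：如何部署 NLP：文本嵌入和向量搜索”.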
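
If you would rather script the upload than use the Data Visualizer, the tsv file can be turned into a request body for the bulk API. The sketch below assumes the file has no header row and exactly two tab-separated columns (id, then text). `tsv_to_bulk_lines` is a hypothetical helper, and the second sample row is invented for illustration. It only builds the lines and leaves sending them to your client of choice.

```python
import csv
import io
import json
from typing import Iterator, TextIO


def tsv_to_bulk_lines(tsv_file: TextIO, index_name: str = "test-data") -> Iterator[str]:
    """Yield bulk API lines: one action line followed by one document line per row."""
    # QUOTE_NONE keeps quotation marks inside passages as literal text.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip malformed rows (assumption: two columns expected)
        passage_id, text = row
        yield json.dumps({"index": {"_index": index_name}})
        yield json.dumps({"id": int(passage_id), "text": text})


# Sample rows; the second one is invented for illustration.
sample = io.StringIO(
    "7361587\tFor example, if you go for a run, you will mostly use the muscles in your lower body.\n"
    '42\tA passage with "quotes" in it.\n'
)
lines = list(tsv_to_bulk_lines(sample))
bulk_body = "\n".join(lines) + "\n"  # the bulk API expects a trailing newline
print(bulk_body)
```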

Ingest the data through the inference ingest pipeline

Create the tokens from the text by reindexing the data through the inference pipeline that uses ELSER as the inference model.

POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "test-data"
  },
  "dest": {
    "index": "my-index",
    "pipeline": "elser-v1-test"
  }
}

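The call returns a task ID that you can use to monitor progress:

GET _tasks/<task_id>

You can also open the Trained Models UI and select the Pipelines tab under ELSER to follow the progress. The process may take several minutes to complete.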
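
When polling the task from a script, the response can be reduced to a single progress figure. The sketch below is a rough estimate only: it assumes the usual shape of a reindex task response (a top-level completed flag and counters under task.status). `reindex_progress` is a hypothetical helper and the sample numbers are invented.

```python
def reindex_progress(task_response: dict) -> float:
    """Return the fraction of documents processed so far, between 0.0 and 1.0."""
    status = task_response["task"]["status"]
    total = status["total"]
    if total == 0:
        # Nothing counted yet, or nothing to do.
        return 1.0 if task_response.get("completed") else 0.0
    done = status["created"] + status["updated"] + status["deleted"]
    return done / total


# Invented sample shaped like a GET _tasks/<task_id> response.
sample_response = {
    "completed": False,
    "task": {
        "action": "indices:data/write/reindex",
        "status": {"total": 182469, "created": 50000, "updated": 0, "deleted": 0},
    },
}
progress = reindex_progress(sample_response)
print(f"{progress:.1%} done")
```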

You can inspect the indexed documents with the following command:

GET my-index/_search

Semantic search with the text_expansion query

To perform semantic search, use the text_expansion query, and provide the query text and the ELSER model ID. The example below uses the query text “How to avoid muscle soreness after running?”:

GET my-index/_search
{
  "query": {
    "text_expansion": {
      "ml.tokens": {
        "model_id": ".elser_model_1",
        "model_text": "How to avoid muscle soreness after running?"
      }
    }
  }
}
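
If you issue the search from application code, the request body can be assembled from its three variable parts: the tokens field, the model ID, and the query text. This is a minimal sketch; `text_expansion_query` is a hypothetical helper name, and it only builds the JSON body without sending it.

```python
import json


def text_expansion_query(field: str, model_id: str, model_text: str) -> dict:
    """Build the body of a text_expansion search request."""
    return {
        "query": {
            "text_expansion": {
                field: {"model_id": model_id, "model_text": model_text}
            }
        }
    }


body = text_expansion_query(
    "ml.tokens", ".elser_model_1", "How to avoid muscle soreness after running?"
)
print(json.dumps(body, indent=2))
```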

The search above returns the following:

The result is the top 10 documents from the my-index index whose meaning is closest to your query text, sorted by relevance. The result also contains the extracted tokens and their weights for each relevant search result.

"hits":[
  {
    "_index":"my-index",
    "_id":"978UAYgBKCQMet06sLEy",
    "_score":18.612831,
    "_ignored":[
      "text.keyword"
    ],
    "_source":{
      "id":7361587,
      "text":"For example, if you go for a run, you will mostly use the muscles in your lower body. Give yourself 2 days to rest those muscles so they have a chance to heal before you exercise them again. Not giving your muscles enough time to rest can cause muscle damage, rather than muscle development.",
      "ml":{
        "tokens":{
          "muscular":0.075696334,
          "mostly":0.52380747,
          "practice":0.23430172,
          "rehab":0.3673556,
          "cycling":0.13947526,
          "your":0.35725075,
          "years":0.69484913,
          "soon":0.005317828,
          "leg":0.41748235,
          "fatigue":0.3157955,
          "rehabilitation":0.13636169,
          "muscles":1.302141,
          "exercises":0.36694175,
          (...)
        },
        "model_id":".elser_model_1"
      }
    }
  },
  (...)
]
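
The token weights in the hit give a rough intuition for how a sparse-vector match is scored. The sketch below assumes the score behaves like a dot product over the tokens that the query and the document share. That is a simplification of what Elasticsearch does, so do not expect it to reproduce the _score above. The document weights are copied from the hit; the query weights are invented for illustration.

```python
from typing import Dict


def sparse_dot(query_tokens: Dict[str, float], doc_tokens: Dict[str, float]) -> float:
    """Sum the products of weights for tokens present in both query and document."""
    return sum(
        weight * doc_tokens[token]
        for token, weight in query_tokens.items()
        if token in doc_tokens
    )


# A few document token weights taken from the hit shown above.
doc_tokens = {
    "muscles": 1.302141,
    "leg": 0.41748235,
    "fatigue": 0.3157955,
    "exercises": 0.36694175,
}
# Invented query expansion; the real one comes from ELSER at search time.
query_tokens = {"muscles": 1.5, "soreness": 1.2, "fatigue": 0.5}

score = sparse_dot(query_tokens, doc_tokens)
print(score)
```

Tokens that appear on only one side, such as "soreness" here, contribute nothing to the total.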

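Combine semantic search with other queries

You can combine text_expansion with other queries in a compound query, for example by using a filter clause in a Boolean query, or a full text query that may or may not use the same query text as the text_expansion query. This lets you combine the search results from both queries.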

Search hits from the text_expansion query tend to score higher than those from other Elasticsearch queries. You can regularize the scores by using the boost parameter to raise or lower the relevance score of each query. Recall on text_expansion queries can be high when there is a long tail of less relevant results. Use the min_score parameter to prune those less relevant documents.

GET my-index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "text_expansion": {
            "ml.tokens": {
              "model_text": "How to avoid muscle soreness after running?",
              "model_id": ".elser_model_1",
              "boost": 1
            }
          }
        },
        {
          "query_string": {
            "query": "toxins",
            "boost": 4
          }
        }
      ]
    }
  },
  "min_score": 10
}
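
To see how boost and min_score interact in a query like this, the following sketch combines per-clause scores the way a bool query with should clauses is generally understood to: each clause's score is multiplied by its boost, the results are summed per document, and documents below min_score are dropped. All scores and document IDs are invented, and `combine_should` is a hypothetical helper, not an Elasticsearch API.

```python
from typing import Dict


def combine_should(
    clause_scores: Dict[str, Dict[str, float]],
    boosts: Dict[str, float],
    min_score: float,
) -> Dict[str, float]:
    """Sum boosted clause scores per document and keep those at or above min_score."""
    totals: Dict[str, float] = {}
    for clause, scores in clause_scores.items():
        for doc_id, score in scores.items():
            totals[doc_id] = totals.get(doc_id, 0.0) + boosts[clause] * score
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {doc_id: total for doc_id, total in ranked if total >= min_score}


# Invented raw scores per clause, keyed by document ID.
clause_scores = {
    "text_expansion": {"a": 18.6, "b": 9.0, "c": 4.0},
    "query_string": {"b": 2.0, "c": 1.0},
}
boosts = {"text_expansion": 1, "query_string": 4}

ranked = combine_should(clause_scores, boosts, min_score=10)
print(ranked)
```

In this toy data, document b only clears the threshold because of the boosted query_string match, while document c is pruned even though both clauses matched it.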