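This post continues from the previous one: https://segmentfault.com/a/11...
4. most_fields queries
most_fields is field-centric: it favors documents that match in the greatest number of fields.
Suppose we let users search for addresses, and the index holds these two documents: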
PUT /test_index/_create/1
{
  "street": "5 Poland Street",
  "city": "Poland",
  "country": "United W1V",
  "postcode": "W1V 3DG"
}
PUT /test_index/_create/2
{
  "street": "5 Poland Street W1V",
  "city": "London",
  "country": "United Kingdom",
  "postcode": "3DG"
}
Querying in the most_fields style, written out as a bool query:
GET /test_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "street": "Poland Street W1V"
          }
        },
        {
          "match": {
            "city": "Poland Street W1V"
          }
        },
        {
          "match": {
            "country": "Poland Street W1V"
          }
        },
        {
          "match": {
            "postcode": "Poland Street W1V"
          }
        }
      ]
    }
  }
}
Repeating the query string for every field quickly becomes verbose. multi_match expresses the same thing more compactly:
GET /test_index/_search
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "most_fields",
      "fields": ["street", "city", "country", "postcode"]
    }
  }
}
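The relationship between the two requests above can be sketched in Python. This is only an illustration: expand_most_fields is a hypothetical helper name, not part of any Elasticsearch client, and it just builds the request body as a plain dict.

```python
import json


def expand_most_fields(query_text, fields):
    """Build the bool/should body that repeats one match clause per field.

    Hypothetical helper for illustration; it only constructs a dict.
    """
    return {
        "query": {
            "bool": {
                "should": [{"match": {field: query_text}} for field in fields]
            }
        }
    }


body = expand_most_fields(
    "Poland Street W1V", ["street", "city", "country", "postcode"]
)
print(json.dumps(body, indent=2))
```

The printed body has the same shape as the long bool query shown earlier, one match clause per field.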
Result:
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 2.3835402,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.3835402,
        "_source" : {
          "street" : "5 Poland Street",
          "city" : "Poland",
          "country" : "United W1V",
          "postcode" : "W1V 3DG"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.99938464,
        "_source" : {
          "street" : "5 Poland Street W1V",
          "city" : "London",
          "country" : "United Kingdom",
          "postcode" : "3DG"
        }
      }
    ]
  }
}
With best_fields, doc 2 ranks ahead of doc 1:
GET /test_index/_search
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "best_fields",
      "fields": ["street", "city", "country", "postcode"]
    }
  }
}
Result:
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.99938464,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.99938464,
        "_source" : {
          "street" : "5 Poland Street W1V",
          "city" : "London",
          "country" : "United Kingdom",
          "postcode" : "3DG"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6931472,
        "_source" : {
          "street" : "5 Poland Street",
          "city" : "Poland",
          "country" : "United W1V",
          "postcode" : "W1V 3DG"
        }
      }
    ]
  }
}
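Why the order flips can be pictured with a small Python sketch. It assumes most_fields adds up the per-field scores while best_fields keeps only the best one (plus an optional tie_breaker share of the rest). The per-field numbers are made up for illustration and are not Elasticsearch output; doc_a stands for a document that matches weakly in many fields, doc_b for one that matches strongly in a single field.

```python
# Hypothetical per-field match scores, NOT real Elasticsearch output.
field_scores = {
    "doc_a": {"street": 0.5, "city": 0.6, "country": 0.6, "postcode": 0.4},
    "doc_b": {"street": 1.0},
}


def most_fields_score(scores):
    """Sum the scores of every matching field."""
    return sum(scores.values())


def best_fields_score(scores, tie_breaker=0.0):
    """Keep the best field; other fields count only via tie_breaker."""
    best = max(scores.values())
    return best + tie_breaker * (sum(scores.values()) - best)


def rank(score_fn):
    """Return document ids ordered from highest to lowest score."""
    return sorted(
        field_scores, key=lambda doc: score_fn(field_scores[doc]), reverse=True
    )


print("most_fields:", rank(most_fields_score))
print("best_fields:", rank(best_fields_score))
```

Under these assumed numbers the many-fields document wins with most_fields and the single-strong-field document wins with best_fields, which is the same pattern as in the two results above.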
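Problems with most_fields
(1) It is designed to find the largest number of fields matching any of the words, rather than the words that best match across all fields.
(2) operator and minimum_should_match are applied to each field separately rather than across fields, so they cannot be used to trim the long tail of low-relevance results.
(3) Term frequencies differ from field to field and can interfere with one another, producing a poor ranking.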
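5. All-fields search with the copy_to parameter
The previous section listed the problems with most_fields. The first way to address them is the copy_to parameter.
copy_to lets us combine several fields into a single field.
Create the index as follows: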
DELETE /test_index
PUT /test_index
{
  "mappings": {
    "properties": {
      "street": {
        "type": "text",
        "copy_to": "full_address"
      },
      "city": {
        "type": "text",
        "copy_to": "full_address"
      },
      "country": {
        "type": "text",
        "copy_to": "full_address"
      },
      "postcode": {
        "type": "text",
        "copy_to": "full_address"
      },
      "full_address": {
        "type": "text"
      }
    }
  }
}
Index the same documents as before:
PUT /test_index/_create/1
{
  "street": "5 Poland Street",
  "city": "Poland",
  "country": "United W1V",
  "postcode": "W1V 3DG"
}
PUT /test_index/_create/2
{
  "street": "5 Poland Street W1V",
  "city": "London",
  "country": "United Kingdom",
  "postcode": "3DG"
}
Query:
GET /test_index/_search
{
  "query": {
    "match": {
      "full_address": "Poland Street W1V"
    }
  }
}
Result:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.68370587,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.68370587,
        "_source" : {
          "street" : "5 Poland Street",
          "city" : "Poland",
          "country" : "United W1V",
          "postcode" : "W1V 3DG"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.5469647,
        "_source" : {
          "street" : "5 Poland Street W1V",
          "city" : "London",
          "country" : "United Kingdom",
          "postcode" : "3DG"
        }
      }
    ]
  }
}
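Once the fields are merged into the single full_address field, the search is an ordinary single-field match, so the most_fields problems described above no longer apply.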
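What the combined field ends up containing can be approximated in Python. This is a rough sketch, not how Elasticsearch works internally: analyze is a stand-in that lowercases and splits on whitespace (an assumption that happens to be adequate for these simple strings), and full_address_terms is a hypothetical helper name.

```python
from collections import Counter

SOURCE_FIELDS = ["street", "city", "country", "postcode"]

docs = {
    "1": {"street": "5 Poland Street", "city": "Poland",
          "country": "United W1V", "postcode": "W1V 3DG"},
    "2": {"street": "5 Poland Street W1V", "city": "London",
          "country": "United Kingdom", "postcode": "3DG"},
}


def analyze(text):
    """Simplified stand-in for the analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def full_address_terms(doc):
    """Count the terms of every source field as if copied into one field."""
    terms = Counter()
    for field in SOURCE_FIELDS:
        terms.update(analyze(doc[field]))
    return terms


query_terms = analyze("Poland Street W1V")
for doc_id, doc in docs.items():
    terms = full_address_terms(doc)
    print(doc_id, {term: terms[term] for term in query_terms})
```

In this approximation doc 1 contains poland and w1v twice each while doc 2 contains every query term once, which is at least consistent with doc 1 scoring higher in the result above.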
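6. cross_fields queries
The second way to address the most_fields problems is the cross_fields query type.
If we can set up copy_to (or, on older Elasticsearch versions, rely on the _all field) before indexing documents, there is nothing more to do. Elasticsearch also offers a search-time solution, though: the cross_fields type. cross_fields takes a term-centric approach, which is very different from the field-centric approach of best_fields and most_fields. It treats all the fields as one big field and looks for each term in any of them.
The difference between field-centric and term-centric is explained below.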
Field-centric
Running this query:
GET /test_index/_validate/query?explain
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "best_fields",
      "fields": ["street", "city", "country", "postcode"]
    }
  }
}
returns:
{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "valid" : true,
  "explanations" : [
    {
      "index" : "test_index",
      "valid" : true,
      "explanation" : "((postcode:poland postcode:street postcode:w1v) | (country:poland country:street country:w1v) | (city:poland city:street city:w1v) | (street:poland street:street street:w1v))"
    }
  ]
}
((postcode:poland postcode:street postcode:w1v) |
(country:poland country:street country:w1v) |
(city:poland city:street city:w1v) |
(street:poland street:street street:w1v))
This is the rule that gets applied.
Setting operator to and turns it into:
((+postcode:poland +postcode:street +postcode:w1v) |
(+country:poland +country:street +country:w1v) |
(+city:poland +city:street +city:w1v) |
(+street:poland +street:street +street:w1v))
which means all three terms must appear in the same field.
Term-centric
Running this query:
GET /test_index/_validate/query?explain
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "cross_fields",
      "operator": "and",
      "fields": ["street", "city", "country", "postcode"]
    }
  }
}
returns:
{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "valid" : true,
  "explanations" : [
    {
      "index" : "test_index",
      "valid" : true,
      "explanation" : "+blended(terms:[postcode:poland, country:poland, city:poland, street:poland]) +blended(terms:[postcode:street, country:street, city:street, street:street]) +blended(terms:[postcode:w1v, country:w1v, city:w1v, street:w1v])"
    }
  ]
}
+blended(terms:[postcode:poland, country:poland, city:poland, street:poland])
+blended(terms:[postcode:street, country:street, city:street, street:street])
+blended(terms:[postcode:w1v, country:w1v, city:w1v, street:w1v])
This is the rule. In other words, every term must appear in at least one of the fields.
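The two rules with operator set to and can be compared on the sample documents with a small Python sketch. It is a simplification: analyze lowercases and splits on whitespace as a stand-in for the real analyzer, and the two function names are hypothetical. It only decides whether a document matches, not how it is scored.

```python
FIELDS = ["street", "city", "country", "postcode"]

docs = {
    "1": {"street": "5 Poland Street", "city": "Poland",
          "country": "United W1V", "postcode": "W1V 3DG"},
    "2": {"street": "5 Poland Street W1V", "city": "London",
          "country": "United Kingdom", "postcode": "3DG"},
}


def analyze(text):
    """Simplified stand-in for the analyzer: lowercase, split on whitespace."""
    return set(text.lower().split())


def field_centric_and(doc, query):
    """All query terms must occur together in at least one single field."""
    terms = analyze(query)
    return any(terms <= analyze(doc[field]) for field in FIELDS)


def term_centric_and(doc, query):
    """Every query term must occur in at least one field, any field."""
    return all(
        any(term in analyze(doc[field]) for field in FIELDS)
        for term in analyze(query)
    )


query = "Poland Street W1V"
for doc_id, doc in docs.items():
    print(doc_id, field_centric_and(doc, query), term_centric_and(doc, query))
```

Under the field-centric rule only doc 2 matches, because its street field holds all three terms. Under the term-centric rule both documents match, since doc 1 has poland and street in street and w1v in country and postcode.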
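The cross_fields type first analyzes the query string into a list of terms and then searches for each term in any of the fields. It also blends the inverse document frequencies of the fields, which evens out the term-frequency differences between them. Together these address the problems with most_fields.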
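The idea of blending can be pictured with a rough Python sketch. Two assumptions are made purely for illustration: the blend gives a term the largest document frequency it has in any field, and IDF follows the BM25-style formula ln(1 + (N - n + 0.5) / (n + 0.5)). The actual Elasticsearch implementation differs in detail, and analyze is again a simplified stand-in for the analyzer.

```python
import math

FIELDS = ["street", "city", "country", "postcode"]

docs = {
    "1": {"street": "5 Poland Street", "city": "Poland",
          "country": "United W1V", "postcode": "W1V 3DG"},
    "2": {"street": "5 Poland Street W1V", "city": "London",
          "country": "United Kingdom", "postcode": "3DG"},
}


def analyze(text):
    """Simplified stand-in for the analyzer: lowercase, split on whitespace."""
    return set(text.lower().split())


def doc_freq(term, field):
    """Number of documents whose given field contains the term."""
    return sum(term in analyze(doc[field]) for doc in docs.values())


def bm25_idf(df, n_docs):
    """BM25-style inverse document frequency (assumed formula)."""
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))


term = "poland"
per_field_df = {field: doc_freq(term, field) for field in FIELDS}
# Assumed blend: every field shares the highest document frequency.
blended_df = max(per_field_df.values())

print("per-field df:", per_field_df)
print("idf in city alone:", bm25_idf(per_field_df["city"], len(docs)))
print("blended idf:", bm25_idf(blended_df, len(docs)))
```

Scored per field, poland looks rare in city and therefore carries a high IDF there, even though it is common in street. With the blended value the term gets the same, lower weight whichever field it is found in.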
Compared with copy_to, cross_fields has the advantage that individual fields can be boosted at query time.
Example:
GET /test_index/_search
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "cross_fields",
      "fields": ["street^2", "city", "country", "postcode"]
    }
  }
}
Here the street field has a boost of 2 and every other field has a boost of 1.
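The field^boost notation can be read as in the Python sketch below. parse_field_boosts is a hypothetical helper written for this post, not an Elasticsearch API; it only shows how the strings in the fields array map to boost values, with 1 as the value for fields that carry no caret.

```python
def parse_field_boosts(fields):
    """Map each 'name' or 'name^boost' entry to its numeric boost.

    Hypothetical helper for illustration; unboosted fields get 1.0.
    """
    boosts = {}
    for spec in fields:
        name, caret, boost = spec.partition("^")
        boosts[name] = float(boost) if caret else 1.0
    return boosts


print(parse_field_boosts(["street^2", "city", "country", "postcode"]))
```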