Elasticsearch Study Notes, Advanced Series (11): Multi-Field Search (Part 2)

Category: Backend · Published: 6 years ago

This post continues from the previous one: https://segmentfault.com/a/11...

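4. The most_fields query

most_fields is field-centric: it favors the documents that match in the largest number of fields.

Suppose we let users search for addresses, and the index holds the following two documents: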

PUT /test_index/_create/1
{
    "street":   "5 Poland Street",
    "city":     "Poland",
    "country":  "United W1V",
    "postcode": "W1V 3DG"
}

PUT /test_index/_create/2
{
    "street":   "5 Poland Street W1V",
    "city":     "London",
    "country":  "United Kingdom",
    "postcode": "3DG"
}

A most_fields search is equivalent to a bool query with one match clause per field:

GET /test_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "street": "Poland Street W1V"
          }
        },
        {
          "match": {
            "city": "Poland Street W1V"
          }
        },
        {
          "match": {
            "country": "Poland Street W1V"
          }
        },
        {
          "match": {
            "postcode": "Poland Street W1V"
          }
        }
      ]
    }
  }
}

Repeating the query string for every field quickly becomes verbose, so we can simplify it with multi_match:

GET /test_index/_search
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "most_fields", 
      "fields": ["street", "city", "country", "postcode"]
    }
  }
}
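
The relationship between the long form and the multi_match form can be sketched in Python. The helper name `expand_most_fields` is hypothetical (it is not part of any Elasticsearch client). It only builds the request body as a plain dict, assuming one match clause per field with the same query string.

```python
def expand_most_fields(query, fields):
    """Build the long-form bool/should body that a most_fields search stands for.

    Hypothetical helper for illustration only: it returns a plain dict
    and never contacts Elasticsearch.
    """
    return {
        "query": {
            "bool": {
                # One match clause per field, all sharing the same query string.
                "should": [{"match": {field: query}} for field in fields]
            }
        }
    }


body = expand_most_fields(
    "Poland Street W1V", ["street", "city", "country", "postcode"]
)
print(len(body["query"]["bool"]["should"]))
```

The resulting dict has the same shape as the hand-written bool query, which is why the multi_match version is only a shorthand and not a different kind of search.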

Result:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 2.3835402,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.3835402,
        "_source" : {
          "street" : "5 Poland Street",
          "city" : "Poland",
          "country" : "United W1V",
          "postcode" : "W1V 3DG"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.99938464,
        "_source" : {
          "street" : "5 Poland Street W1V",
          "city" : "London",
          "country" : "United Kingdom",
          "postcode" : "3DG"
        }
      }
    ]
  }
}

With best_fields, doc 2 ranks ahead of doc 1:

GET /test_index/_search
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "best_fields", 
      "fields": ["street", "city", "country", "postcode"]
    }
  }
}

Result:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.99938464,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.99938464,
        "_source" : {
          "street" : "5 Poland Street W1V",
          "city" : "London",
          "country" : "United Kingdom",
          "postcode" : "3DG"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6931472,
        "_source" : {
          "street" : "5 Poland Street",
          "city" : "Poland",
          "country" : "United W1V",
          "postcode" : "W1V 3DG"
        }
      }
    ]
  }
}
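
Why the two types order the documents differently can be illustrated with a small Python sketch. The per-field scores below are hypothetical numbers chosen for illustration, not values taken from the responses above, and the function names are assumptions. The sketch relies only on the fact that most_fields adds the per-field scores together while best_fields keeps the best one (with tie_breaker defaulting to 0.0).

```python
def most_fields_score(field_scores):
    # most_fields adds up the score of every field that matched.
    return sum(field_scores.values())


def best_fields_score(field_scores, tie_breaker=0.0):
    # best_fields keeps the best field's score; the remaining fields
    # only contribute through tie_breaker (0.0 by default).
    ranked = sorted(field_scores.values(), reverse=True)
    return ranked[0] + tie_breaker * sum(ranked[1:])


# Hypothetical per-field scores, NOT taken from the responses above.
docs = {
    "1": {"street": 0.6, "city": 0.7, "country": 0.5, "postcode": 0.5},
    "2": {"street": 1.0},  # doc 2 only matches in the street field
}


def rank(score_fn):
    # Return document ids ordered from highest to lowest score.
    return sorted(docs, key=lambda doc_id: score_fn(docs[doc_id]), reverse=True)


print(rank(most_fields_score))  # doc 1 first
print(rank(best_fields_score))  # doc 2 first
```

In the real responses doc 2 has the same score under both types, which is what one would expect when only a single field contributes to its score.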

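Problems with most_fields

(1) It is designed to find the largest number of fields that match any of the words, rather than the words that match best across all fields.

(2) The operator and minimum_should_match parameters are applied to each field separately, so they cannot be used to trim the long tail of low-relevance results.

(3) Term frequencies differ from field to field and can interfere with one another, which leads to poorly ordered results.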
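
Problem (1) can be made concrete with a rough Python sketch. It assumes that lowercasing and splitting on whitespace is close enough to the analyzer for this sample data, and the function names are hypothetical. It counts how many fields contain at least one query term, which is what a field-centric query rewards.

```python
def tokens(text):
    # Rough stand-in for the analyzer: lowercase and split on whitespace.
    return text.lower().split()


def fields_matching_any_term(doc, query):
    # List the fields that contain at least one of the query terms.
    terms = set(tokens(query))
    return [name for name, value in doc.items() if terms & set(tokens(value))]


doc1 = {
    "street": "5 Poland Street",
    "city": "Poland",
    "country": "United W1V",
    "postcode": "W1V 3DG",
}
doc2 = {
    "street": "5 Poland Street W1V",
    "city": "London",
    "country": "United Kingdom",
    "postcode": "3DG",
}

query = "Poland Street W1V"
print(fields_matching_any_term(doc1, query))
print(fields_matching_any_term(doc2, query))
```

Doc 1 matches in all four fields while doc 2 matches only in street, even though street in doc 2 is the only single field that contains the whole query.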

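5. Searching all fields with the copy_to parameter

The previous section described the problems with most_fields. The first way to solve them is the copy_to parameter.

copy_to lets us combine several fields into a single field.

Create the following index: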

DELETE /test_index
PUT /test_index
{
  "mappings": {
    "properties": {
      "street": {
        "type": "text",
        "copy_to": "full_address"
      },
      "city": {
        "type": "text",
        "copy_to": "full_address"
      },
      "country": {
        "type": "text",
        "copy_to": "full_address"
      },
      "postcode": {
        "type": "text",
        "copy_to": "full_address"
      },
      "full_address": {
        "type": "text"
      }
    }
  }
}

Index the same documents as before:

PUT /test_index/_create/1
{
    "street":   "5 Poland Street",
    "city":     "Poland",
    "country":  "United W1V",
    "postcode": "W1V 3DG"
}

PUT /test_index/_create/2
{
    "street":   "5 Poland Street W1V",
    "city":     "London",
    "country":  "United Kingdom",
    "postcode": "3DG"
}

Query:

GET /test_index/_search
{
  "query": {
    "match": {
      "full_address": "Poland Street W1V"
    }
  }
}

Result:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.68370587,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.68370587,
        "_source" : {
          "street" : "5 Poland Street",
          "city" : "Poland",
          "country" : "United W1V",
          "postcode" : "W1V 3DG"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.5469647,
        "_source" : {
          "street" : "5 Poland Street W1V",
          "city" : "London",
          "country" : "United Kingdom",
          "postcode" : "3DG"
        }
      }
    ]
  }
}

Once the fields are merged into the single field full_address, the problems with most_fields go away.
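
What copy_to does to the sample documents can be approximated in Python. This is only a sketch: it assumes whitespace tokenization with lowercasing, and `full_address_terms` is a hypothetical name. It gathers the tokens of all four source fields into one bag, as if they had been indexed into full_address, and counts how often each query term occurs.

```python
from collections import Counter

ADDRESS_FIELDS = ["street", "city", "country", "postcode"]


def full_address_terms(doc):
    # Mimic copy_to: collect the tokens of every source field into one bag.
    terms = []
    for field in ADDRESS_FIELDS:
        terms.extend(doc[field].lower().split())
    return Counter(terms)


doc1 = {
    "street": "5 Poland Street",
    "city": "Poland",
    "country": "United W1V",
    "postcode": "W1V 3DG",
}
doc2 = {
    "street": "5 Poland Street W1V",
    "city": "London",
    "country": "United Kingdom",
    "postcode": "3DG",
}

query_terms = "Poland Street W1V".lower().split()
for doc in (doc1, doc2):
    combined = full_address_terms(doc)
    # Term frequency of each query term inside the combined field.
    print({term: combined[term] for term in query_terms})
```

In this toy count doc 1 contains poland and w1v twice in the combined field, which is consistent with it scoring higher than doc 2 in the response above. The actual score is still computed by Elasticsearch's similarity function, not by this count.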

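6. The cross_fields query

The second way to solve the problems with most_fields is the cross_fields query.

If we can rely on the _all field (only available in older Elasticsearch versions) or define copy_to before indexing the documents, there is no problem. Elasticsearch also offers a search-time solution, though: the cross_fields query. cross_fields takes a term-centric approach, which differs considerably from the field-centric approach used by best_fields and most_fields. It treats all the fields as one big field and looks for each term in any of them.

The difference between field-centric and term-centric matching is explained below.

Field-centric

Run the query: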

GET /test_index/_validate/query?explain
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "best_fields",
      "fields": ["street", "city", "country", "postcode"]
    }
  }
}

Which returns:

{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "valid" : true,
  "explanations" : [
    {
      "index" : "test_index",
      "valid" : true,
      "explanation" : "((postcode:poland postcode:street postcode:w1v) | (country:poland country:street country:w1v) | (city:poland city:street city:w1v) | (street:poland street:street street:w1v))"
    }
  ]
}

((postcode:poland postcode:street postcode:w1v) |

(country:poland country:street country:w1v) |

(city:poland city:street city:w1v) |

(street:poland street:street street:w1v))

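This is the rule the query is rewritten into: every field is searched for all the terms, and the per-field groups are then combined.

Setting operator to and changes it to: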

((+postcode:poland +postcode:street +postcode:w1v) |

(+country:poland +country:street +country:w1v) |

(+city:poland +city:street +city:w1v) |

(+street:poland +street:street +street:w1v))

This means all three terms must appear in the same field.
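
The field-centric rule with operator set to and can be imitated with a short Python check. It is a sketch under the assumption of whitespace tokenization with lowercasing, and `matches_field_centric_and` is a hypothetical name.

```python
def tokens(text):
    # Rough stand-in for the analyzer: lowercase and split on whitespace.
    return text.lower().split()


def matches_field_centric_and(doc, query, fields):
    # True only if at least one single field contains every query term.
    terms = set(tokens(query))
    return any(terms <= set(tokens(doc[field])) for field in fields)


fields = ["street", "city", "country", "postcode"]
doc1 = {
    "street": "5 Poland Street",
    "city": "Poland",
    "country": "United W1V",
    "postcode": "W1V 3DG",
}
doc2 = {
    "street": "5 Poland Street W1V",
    "city": "London",
    "country": "United Kingdom",
    "postcode": "3DG",
}

print(matches_field_centric_and(doc1, "Poland Street W1V", fields))
print(matches_field_centric_and(doc2, "Poland Street W1V", fields))
```

Under this rule doc 1 is rejected, because its terms are spread over several fields, while doc 2 is accepted because street alone holds all three terms.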

Term-centric

Run the query:

GET /test_index/_validate/query?explain
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "cross_fields", 
      "operator": "and", 
      "fields": ["street", "city", "country", "postcode"]
    }
  }
}

Which returns:

{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "valid" : true,
  "explanations" : [
    {
      "index" : "test_index",
      "valid" : true,
      "explanation" : "+blended(terms:[postcode:poland, country:poland, city:poland, street:poland]) +blended(terms:[postcode:street, country:street, city:street, street:street]) +blended(terms:[postcode:w1v, country:w1v, city:w1v, street:w1v])"
    }
  ]
}

+blended(terms:[postcode:poland, country:poland, city:poland, street:poland]) +blended(terms:[postcode:street, country:street, city:street, street:street]) +blended(terms:[postcode:w1v, country:w1v, city:w1v, street:w1v])

This is the rule. In other words, every term must appear in at least one of the fields, and it does not matter which one.
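
The term-centric rule can be imitated in the same spirit. The sketch below assumes whitespace tokenization with lowercasing. `matches_term_centric_and` is a hypothetical name, and `doc3` is a made-up document that is not part of the index above.

```python
def tokens(text):
    # Rough stand-in for the analyzer: lowercase and split on whitespace.
    return text.lower().split()


def matches_term_centric_and(doc, query, fields):
    # True if every query term is found in at least one field, whichever it is.
    all_tokens = set()
    for field in fields:
        all_tokens |= set(tokens(doc[field]))
    return set(tokens(query)) <= all_tokens


fields = ["street", "city", "country", "postcode"]
doc1 = {
    "street": "5 Poland Street",
    "city": "Poland",
    "country": "United W1V",
    "postcode": "W1V 3DG",
}
# Hypothetical document, not in the index above: it lacks the term w1v.
doc3 = {
    "street": "5 Poland Street",
    "city": "London",
    "country": "United Kingdom",
    "postcode": "3DG",
}

print(matches_term_centric_and(doc1, "Poland Street W1V", fields))
print(matches_term_centric_and(doc3, "Poland Street W1V", fields))
```

Doc 1 now qualifies even though no single field holds all three terms, while the made-up doc 3 is rejected because w1v is missing everywhere.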

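The cross_fields type first analyzes the query string into a list of terms, then searches for each term in any of the fields. It blends the inverse document frequencies of the fields to solve the term frequency problem, which resolves the problems with most_fields.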
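
The idea of blending document frequencies can be sketched as follows. This is a deliberate simplification: it assumes the blended value is just the highest per-field document frequency (equivalently, the lowest IDF), whereas the actual blending inside Elasticsearch is more involved. Both function names are hypothetical.

```python
def document_frequency(docs, field, term):
    # Number of documents whose given field contains the term.
    return sum(1 for doc in docs if term in doc[field].lower().split())


def blended_document_frequency(docs, fields, term):
    # Simplifying assumption: every field shares the highest
    # per-field document frequency for this term.
    return max(document_frequency(docs, field, term) for field in fields)


fields = ["street", "city", "country", "postcode"]
docs = [
    {
        "street": "5 Poland Street",
        "city": "Poland",
        "country": "United W1V",
        "postcode": "W1V 3DG",
    },
    {
        "street": "5 Poland Street W1V",
        "city": "London",
        "country": "United Kingdom",
        "postcode": "3DG",
    },
]

per_field = {field: document_frequency(docs, field, "poland") for field in fields}
print(per_field)
print(blended_document_frequency(docs, fields, "poland"))
```

Without blending, poland would look rarer in city than in street and would therefore weigh more there. Sharing one frequency across the fields removes that distortion.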

Compared with copy_to, cross_fields has the advantage that individual fields can be boosted at query time.

Example:

GET /test_index/_search
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "cross_fields", 
      "fields": ["street^2", "city", "country", "postcode"]
    }
  }
}

This gives the street field a boost of 2, while every other field keeps a boost of 1.
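
When the boosts come from configuration rather than being hard-coded, the fields list can be generated. The Python sketch below uses a hypothetical helper, `boosted_fields`, and only builds the request body as a dict. The `field^boost` syntax is the one used in the request above.

```python
def boosted_fields(boosts):
    # A boost of 1 is the default, so it is left implicit.
    return [
        name if boost == 1 else f"{name}^{boost}"
        for name, boost in boosts.items()
    ]


body = {
    "query": {
        "multi_match": {
            "query": "Poland Street W1V",
            "type": "cross_fields",
            "fields": boosted_fields(
                {"street": 2, "city": 1, "country": 1, "postcode": 1}
            ),
        }
    }
}
print(body["query"]["multi_match"]["fields"])
```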

