许吉友 - 运维

2016年美国总统选举辩论数据检索实战

数据来源:http://dataju.cn/Dataju/web/datasetInstanceDetail/268

下载下来之后是一个 CSV 数据,通过 kibana 的 Data Visualizer 功能导入到 Elasticsearch 中,之后就可以查询了。

简单的查询

一个简单的 bool 查询:

GET /race-for-president-2016/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "Text": "New York"}}
      ],
      "filter": [
        {"term": { "Speaker": "Holt" }}
      ]
    }
  }
}

这样子查询出来的文档时带有 _score 即分数的,默认是按分数倒排序的。

如果人为加上排序:

GET /race-for-president-2016/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "Text": "New York"}}
      ],
      "filter": [
        {"term": { "Speaker": "Holt" }}
      ]
    }
  },
  "sort": [
    {"Line": { "order": "desc" }}
  ]
}

就不会有分数了。

当然也可以按分数正排:

GET /race-for-president-2016/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "Text": "New York"}}
      ],
      "filter": [
        {"term": { "Speaker": "Holt" }}
      ]
    }
  },
  "sort": [
    {"_score": { "order": "asc" }}
  ]
}

如果想按 Line 排序,也想要分数怎么办,参见:https://www.elastic.co/guide/en/elasticsearch/guide/current/_sorting.html

可以这样:

GET /race-for-president-2016/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "Text": "New York"}}
      ],
      "filter": [
        {"term": { "Speaker": "Holt" }}
      ]
    }
  },
  "sort": [
    {"_score": { "order": "asc" }},
    {"Line": { "order": "desc" }}
  ]
}

高亮

如果想高亮显示某个单词怎么办,参考:https://www.elastic.co/guide/en/elasticsearch/reference/current/highlighting.html

POST /race-for-president-2016/_search
{
  "query": {
    "match": {
      "Text": "New York"
    }
  },
  "highlight": {
    "fields": {
      "Text": {}
    }
  }
}

部分返回如下:

{
        "_index" : "race-for-president-2016",
        "_type" : "_doc",
        "_id" : "Jsmf13MBam6_7L8uDJbD",
        "_score" : 8.121967,
        "_source" : {
          "Line" : 167,
          "@timestamp" : "2016-09-26T00:00:00.000+08:00",
          "Text" : "Your two -- your two minutes expired, but I do want to follow up. Stop-and-frisk was ruled unconstitutional in New York, because it largely singled out black and Hispanic young men.",
          "Date" : "9/26/16",
          "Speaker" : "Holt"
        },
        "highlight" : {
          "Text" : [
            "Stop-and-frisk was ruled unconstitutional in <em>New</em> <em>York</em>, because it largely singled out black and Hispanic"
          ]
        }
}

可以发现,在 highlight 中有 em,并且只显示与关键词相关的一句话,多余的部分不会显示。

上边的搜索有一部分会将 New York 两个单词分开,如果想连为一体查询,可以使用短语搜索:

POST /race-for-president-2016/_search
{
  "query": {
    "match_phrase": {
      "Text": "New York"
    }
  },
  "highlight": {
    "fields": {
      "Text": {}
    }
  }
}

分页查询

查询主持人的发言,主持人名为 Holt

POST /race-for-president-2016/_search
{
  "query": {
    "bool": {
      "filter": [
        {"term": {"Speaker": "Holt"}}
      ]
    }
  }
}

这样查出来是分数是 0.0,如果想要带分数,可以使用:

POST /race-for-president-2016/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {"Speaker": "Holt"}
      }
    }
  }
}

这样查出来的分数是 1.0

上边的俩查询是可以显示文档数量的,但是默认只会显示十条文档,如果想显示多一些,可以这样:

POST /race-for-president-2016/_search
{
  "query": {
    "bool": {
      "filter": [
        {"term": {"Speaker": "Holt"}}
      ]
    }
  },
  "size": 20
}

不过从性能上考虑,一次查询不要返回过多文档,如果文档过多可以利用分页,比如从第二个文档开始查询:

POST /race-for-president-2016/_search
{
  "query": {
    "bool": {
      "filter": [
        {"term": {"Speaker": "Holt"}}
      ]
    }
  },
  "size": 20,
  "from": 2
}

如果只想看看主持人发言的总数量,可以这样:

参见:https://www.elastic.co/guide/en/elasticsearch/reference/current/search-count.html

POST /race-for-president-2016/_count
{
  "query": {
    "bool": {
      "filter": [
        {"term": {"Speaker": "Holt"}}
      ]
    }
  }
}

以上就是分页查询的全部内容了。

聚合查询

查看有多少发言人:

参见:https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html

POST /race-for-president-2016/_search
{
  "size": 0,
  "aggs": {
    "Number of speakers": {
      "cardinality": {
        "field": "Speaker"
      }
    }
  }
}

部分返回如下:

  "aggregations" : {
    "Number of speakers" : {
      "value" : 11
    }
  }

发现总共有 11 个发言人。

聚合查看每个发言人都发言了多少次:

参见:https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html

POST /race-for-president-2016/_search
{
  "size": 0,
  "aggs": {
    "Speakers and number of speeches": {
      "terms": {
        "field": "Speaker"
      }
    }
  }
}

部分返回如下:

  "aggregations" : {
    "Speakers and number of speeches" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 8,
      "buckets" : [
        {
          "key" : "Trump",
          "doc_count" : 224
        },
        {
          "key" : "Clinton",
          "doc_count" : 159
        },
        {
          "key" : "Pence",
          "doc_count" : 134
        },
        {
          "key" : "Kaine",
          "doc_count" : 124
        },
        {
          "key" : "Holt",
          "doc_count" : 98
        },
        {
          "key" : "Quijano",
          "doc_count" : 76
        },
        {
          "key" : "Cooper",
          "doc_count" : 75
        },
        {
          "key" : "Raddatz",
          "doc_count" : 62
        },
        {
          "key" : "CANDIDATES",
          "doc_count" : 37
        },
        {
          "key" : "Audience",
          "doc_count" : 29
        }
      ]
    }
  }

川建国同志最能说了。