许吉友 - 运维

170 万篇论文数据实战

数据地址:https://www.kaggle.com/Cornell-University/arxiv

大小:5GB左右

使用 Logstash 将 json 文件导入 Elasticsearch。

Logstash 配置如下:

input {
  file {
    path => "/opt/source/arxiv-metadata-oai-snapshot.json"
    codec => json {
        charset => "UTF-8"
    }
    start_position => "beginning"
    tags => "arxiv-metadata-oai-snapshot"
  }
}

output {
    if "arxiv-metadata-oai-snapshot" in [tags] {
        elasticsearch {
            index => "arxiv-metadata-oai-snapshot"
            hosts => ["192.168.112.157:9200"]
        }
    }
}

查找固定题目

想先查询一个完全匹配的题目,可以用短语搜索:

参见:https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html

POST /arxiv-metadata-oai-snapshot/_search
{
  "query": {
    "match_phrase": {
      "title": "The Stability Region of the Two-User Interference Channel"
    }
  }
}