许吉友 (Xu Jiyou) - Ops

Prometheus Configuration

Official documentation: https://prometheus.io/docs/prometheus/latest/configuration/configuration/

Prometheus is configured through command-line flags and a configuration file.

Run ./prometheus -h to see the available command-line flags.

Prometheus can reload its configuration at runtime; if the new configuration is not well-formed, the reload fails and the running configuration stays in effect.

A reload is triggered by sending SIGHUP to the Prometheus process, or via the HTTP API with POST /-/reload (which requires --web.enable-lifecycle to be enabled).
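For example, assuming Prometheus runs on localhost:9090 and was started with --web.enable-lifecycle, either of the following triggers a reload:

```shell
# Reload by signal:
kill -HUP "$(pidof prometheus)"

# Reload via the lifecycle endpoint:
curl -X POST http://localhost:9090/-/reload
```

If the new configuration fails to parse, the endpoint returns an error and the old configuration remains active.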

Command-line flags

--config.file="prometheus.yml"

Path to the configuration file.

--web.listen-address="0.0.0.0:9090"

Address to listen on for the web interface, API, and telemetry.

--web.read-timeout=5m

Read timeout for HTTP requests.

--web.max-connections=512

Maximum number of simultaneous connections.

--web.external-url=

The URL under which Prometheus is externally reachable (for example, if it is served behind a reverse proxy). Used to generate relative and absolute links back to Prometheus itself. If the URL has a path component, it is used to prefix all HTTP endpoints served by Prometheus. If omitted, the relevant URL components are derived automatically.

--web.route-prefix=

Prefix for the internal routes of web endpoints. Defaults to path of --web.external-url

--web.user-assets=

Path to static asset directory, available at /user

--web.enable-lifecycle

Enable shutdown and reload of Prometheus via HTTP requests.

--web.enable-admin-api

Enable API endpoints for admin control actions.

--web.console.templates="consoles"

Path to the console template (HTML) files.

--web.console.libraries="console_libraries"

Path to the console library files the templates depend on.

--web.page-title="Prometheus Time Series Collection and Processing Server"

Title of the web UI pages.

--web.cors.origin=".*"

Regex for origins allowed to make CORS requests.

--storage.tsdb.path="data/"

Directory in which to store the time series data.

--storage.tsdb.retention.time=STORAGE.TSDB.RETENTION.TIME

How long to retain data; defaults to 15d. Supported units: y, w, d, h, m, s, ms.

--storage.tsdb.retention.size=STORAGE.TSDB.RETENTION.SIZE

[EXPERIMENTAL] Maximum size of stored data. Supported units: B, KB, MB, GB, TB, PB, EB.

--storage.tsdb.no-lockfile

Do not create a lockfile in the data directory.

--storage.tsdb.allow-overlapping-blocks

[EXPERIMENTAL] Allow overlapping blocks, which enables vertical compaction and vertical query merging.

--storage.tsdb.wal-compression

Compress the tsdb WAL files.

--storage.remote.flush-deadline=

How long to wait for pending data to flush when shutting down or reloading the configuration; the flush is aborted once this deadline passes.

--storage.remote.read-sample-limit=5e7

Maximum overall number of samples a single query may return via the remote read interface. 0 means no limit. Ignored for streamed response types.

--storage.remote.read-concurrent-limit=10

Maximum number of concurrent remote read calls. 0 means no limit.

--storage.remote.read-max-bytes-in-frame=1048576

Maximum number of bytes in a single frame for streamed remote read response types, before marshalling. Note that the client may also have a frame-size limit. Defaults to 1 MB, as recommended by protobuf.

--rules.alert.for-outage-tolerance=1h

Maximum Prometheus outage to tolerate when restoring the "for" state of alerts.

Configuration file

The configuration file is specified with the --config.file flag and is written in YAML.

An example configuration file: https://github.com/prometheus/prometheus/blob/release-2.16/config/testdata/conf.good.yml

In the snippets below, square brackets mark optional parameters, and angle brackets (e.g. <duration>, <string>) are placeholders for values of the given type.

The global configuration specifies parameters that are valid in all other configuration contexts, and its settings also serve as defaults for the other sections.

The global configuration is as follows:

global:
  # How frequently to scrape targets by default.
  [ scrape_interval: <duration> | default = 1m ]

  # How long until a scrape request times out.
  [ scrape_timeout: <duration> | default = 10s ]

  # How frequently to evaluate rules; alerting rules are recomputed at this interval.
  [ evaluation_interval: <duration> | default = 1m ]

  # The labels to add to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    [ <labelname>: <labelvalue> ... ]

  # File to which PromQL queries are logged.
  [ query_log_file: <string> ]

# List of rule files.
rule_files:
  [ - <filepath_glob> ... ]

# List of scrape configurations.
scrape_configs:
  [ - <scrape_config> ... ]

# Alerting settings.
alerting:
  alert_relabel_configs:
    [ - <relabel_config> ... ]
  alertmanagers:
    [ - <alertmanager_config> ... ]

# Settings related to the remote write feature,
# e.g. forwarding samples to Kafka through a remote-write adapter.
remote_write:
  [ - <remote_write> ... ]

# Settings related to the remote read feature.
remote_read:
  [ - <remote_read> ... ]
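Filling in the skeleton above, a minimal working prometheus.yml might look like this (job name, external label, and rule glob are illustrative):

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: demo              # illustrative external label

rule_files:
  - "rules/*.yml"              # illustrative glob

scrape_configs:
  # Prometheus scraping its own /metrics endpoint.
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
```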

scrape_config

A scrape_config section defines a set of targets to scrape and the parameters for scraping them. Targets may be listed statically or discovered dynamically through one of the service discovery mechanisms.

The fields are as follows:

# The job name assigned to scraped metrics.
job_name: <job_name>

# How frequently to scrape targets from this job.
[ scrape_interval: <duration> | default = <global_config.scrape_interval> ]

# Per-scrape timeout when scraping this job.
[ scrape_timeout: <duration> | default = <global_config.scrape_timeout> ]

# The HTTP resource path on which to fetch metrics from targets.
[ metrics_path: <path> | default = /metrics ]

# honor_labels controls how Prometheus handles conflicts between labels that are
# already present in scraped data and labels that Prometheus would attach
# server-side ("job" and "instance" labels, manually configured target
# labels, and labels generated by service discovery implementations).
#
# If honor_labels is set to "true", label conflicts are resolved by keeping label
# values from the scraped data and ignoring the conflicting server-side labels.
#
# If honor_labels is set to "false", label conflicts are resolved by renaming
# conflicting labels in the scraped data to "exported_<original-label>" (for
# example "exported_instance", "exported_job") and then attaching server-side
# labels.
#
# Setting honor_labels to "true" is useful for use cases such as federation and
# scraping the Pushgateway, where all labels specified in the target should be
# preserved.
#
# Note that any globally configured "external_labels" are unaffected by this
# setting. In communication with external systems, they are always applied only
# when a time series does not have a given label yet and are ignored otherwise.
[ honor_labels: <boolean> | default = false ]

# honor_timestamps controls whether Prometheus respects the timestamps present
# in scraped data.
#
# If honor_timestamps is set to "true", the timestamps of the metrics exposed
# by the target will be used.
#
# If honor_timestamps is set to "false", the timestamps of the metrics exposed
# by the target will be ignored.
[ honor_timestamps: <boolean> | default = true ]

# Configures the protocol scheme used for requests.
[ scheme: <scheme> | default = http ]

# Optional HTTP URL parameters.
params:
  [ <string>: [<string>, ...] ]

# Sets the `Authorization` header on every scrape request with the
# configured username and password.
# password and password_file are mutually exclusive.
basic_auth:
  [ username: <string> ]
  [ password: <secret> ]
  [ password_file: <string> ]

# Sets the `Authorization` header on every scrape request with
# the configured bearer token. It is mutually exclusive with `bearer_token_file`.
[ bearer_token: <secret> ]

# Sets the `Authorization` header on every scrape request with the bearer token
# read from the configured file. It is mutually exclusive with `bearer_token`.
[ bearer_token_file: /path/to/bearer/token/file ]

# Configures the scrape request's TLS settings.
tls_config:
  [ <tls_config> ]

# Optional proxy URL.
[ proxy_url: <string> ]

# List of Azure service discovery configurations.
azure_sd_configs:
  [ - <azure_sd_config> ... ]

# List of Consul service discovery configurations.
consul_sd_configs:
  [ - <consul_sd_config> ... ]

# List of DNS service discovery configurations.
dns_sd_configs:
  [ - <dns_sd_config> ... ]

# List of EC2 service discovery configurations.
ec2_sd_configs:
  [ - <ec2_sd_config> ... ]

# List of OpenStack service discovery configurations.
openstack_sd_configs:
  [ - <openstack_sd_config> ... ]

# List of file-based service discovery configurations.
file_sd_configs:
  [ - <file_sd_config> ... ]

# List of GCE service discovery configurations.
gce_sd_configs:
  [ - <gce_sd_config> ... ]

# List of Kubernetes service discovery configurations.
kubernetes_sd_configs:
  [ - <kubernetes_sd_config> ... ]

# List of Marathon service discovery configurations.
marathon_sd_configs:
  [ - <marathon_sd_config> ... ]

# List of AirBnB's Nerve service discovery configurations.
nerve_sd_configs:
  [ - <nerve_sd_config> ... ]

# List of Zookeeper Serverset service discovery configurations.
serverset_sd_configs:
  [ - <serverset_sd_config> ... ]

# List of Triton service discovery configurations.
triton_sd_configs:
  [ - <triton_sd_config> ... ]

# Statically configured targets; the most common choice when starting out.
static_configs:
  [ - <static_config> ... ]

# List of target relabel configurations.
relabel_configs:
  [ - <relabel_config> ... ]

# List of metric relabel configurations.
metric_relabel_configs:
  [ - <relabel_config> ... ]

# Per-scrape limit on number of scraped samples that will be accepted.
# If more than this number of samples are present after metric relabelling
# the entire scrape will be treated as failed. 0 means no limit.
[ sample_limit: <int> | default = 0 ]
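As a concrete sketch of the fields above, a statically configured job with relabeling (targets, label names, and the dropped metric are all illustrative):

```yaml
scrape_configs:
  - job_name: node
    scrape_interval: 30s
    metrics_path: /metrics
    static_configs:
      - targets: ["10.0.0.1:9100", "10.0.0.2:9100"]
        labels:
          env: prod                  # attached to every series from these targets
    relabel_configs:
      # Before scraping: copy the target address into a custom label.
      - source_labels: [__address__]
        target_label: node_addr
    metric_relabel_configs:
      # After scraping: drop a noisy metric family.
      - source_labels: [__name__]
        regex: "go_gc_duration_seconds.*"
        action: drop
```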

tls_config

# CA certificate to validate API server certificate with.
[ ca_file: <filename> ]

# Certificate and key files for client cert authentication to the server.
[ cert_file: <filename> ]
[ key_file: <filename> ]

# ServerName extension to indicate the name of the server.
# https://tools.ietf.org/html/rfc4366#section-3.1
[ server_name: <string> ]

# Disable validation of the server certificate.
[ insecure_skip_verify: <boolean> ]
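For example, a sketch of a scrape job that talks to its targets over TLS with a client certificate (all file paths and names are illustrative):

```yaml
scrape_configs:
  - job_name: secure-app
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/certs/ca.crt
      cert_file: /etc/prometheus/certs/client.crt
      key_file: /etc/prometheus/certs/client.key
      server_name: app.internal.example
      insecure_skip_verify: false
```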

kubernetes_sd_config

Below we focus on kubernetes_sd_config; the other service discovery mechanisms are skipped.

Kubernetes service discovery retrieves scrape targets from the Kubernetes REST API and stays synchronized with the cluster state.

One of the following role types can be configured to discover targets:

node

One target per cluster node, with the address defaulting to the Kubelet's HTTP port. Each discovered target carries a set of meta labels (see the official documentation for the full __meta_kubernetes_* list).

service

pod

endpoints

ingress

# The information to access the Kubernetes API.

# The API server addresses. If left empty, Prometheus is assumed to run inside
# of the cluster and will discover API servers automatically and use the pod's
# CA certificate and bearer token file at /var/run/secrets/kubernetes.io/serviceaccount/.
[ api_server: <host> ]

# The Kubernetes role of entities that should be discovered.
role: <role>

# Optional authentication information used to authenticate to the API server.
# Note that `basic_auth`, `bearer_token` and `bearer_token_file` options are
# mutually exclusive.
# password and password_file are mutually exclusive.

# Optional HTTP basic authentication information.
basic_auth:
  [ username: <string> ]
  [ password: <secret> ]
  [ password_file: <string> ]

# Optional bearer token authentication information.
[ bearer_token: <secret> ]

# Optional bearer token file authentication information.
[ bearer_token_file: <filename> ]

# Optional proxy URL.
[ proxy_url: <string> ]

# TLS configuration.
tls_config:
  [ <tls_config> ]

# Optional namespace discovery. If omitted, all namespaces are used.
namespaces:
  names:
    [ - <string> ]
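For example, a sketch of in-cluster pod discovery limited to two namespaces (namespace names are illustrative; with api_server left unset, Prometheus uses the pod's service account credentials):

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - default
            - monitoring
```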


Defining recording rules

Prometheus has two kinds of rules: recording rules and alerting rules. The rule_files field lists the files in which both kinds are defined; rule files are written in YAML.

Checking rule syntax:

The binary release ships an executable called promtool, which can check rule files for syntax errors:

$ ./promtool check rules prometheus.rules.yml 
Checking prometheus.rules.yml
  SUCCESS: 1 rules found
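promtool can validate the main configuration file in the same way (this also checks any rule files the configuration references):

```shell
$ ./promtool check config prometheus.yml
```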

Recording rules

Recording rules precompute frequently needed or computationally expensive expressions and save the result as a new set of time series. Querying the precomputed series is much faster than evaluating the original expression each time, which is particularly useful for dashboards that re-run the same queries on every refresh.

Every rule file contains a groups block:

groups:
  [ - <rule_group> ]

A simple example rule:

groups:
  - name: example
    rules:
    - record: job:http_inprogress_requests:sum
      expr: sum(http_inprogress_requests) by (job)

Each group contains a list of rules.

rule

# The name of the time series to output to. Must be a valid metric name.
record: <string>

# The PromQL expression to evaluate. Every evaluation cycle this is
# evaluated at the current time, and the result recorded as a new set of
# time series with the metric name as given by 'record'.
expr: <string>

# Labels to add or overwrite before storing the result.
labels:
  [ <labelname>: <labelvalue> ]

Alerting rules

Alerting rules define alert conditions in the Prometheus expression language and send notifications about firing alerts to an external service.

Defining alerting rules

Alerting rules are configured much the same way as recording rules. An example:

groups:
- name: example
  rules:
  - alert: HighRequestLatency
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: High request latency

for: optional evaluation wait time. The alert is only sent once the condition has held for this duration; while waiting, a newly triggered alert is in the pending state.

Templates

Templates make alerts easier to read: both label values and annotation values can be templated.

By convention, the summary annotation carries a one-line description of the alert and description carries the details; the Alertmanager UI displays both. To improve readability, Prometheus supports templating the values of labels and annotations: {{ $labels.<labelname> }} accesses the value of a label on the current alert instance, and {{ $value }} is the evaluated sample value of the PromQL expression.

For example:

groups:
- name: example
  rules:

  # Alert for any instance that is unreachable for >5 minutes.
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

  # Alert for any instance that has a median request latency >1s.
  - alert: APIHighRequestLatency
    expr: api_http_request_latencies_second{quantile="0.5"} > 1
    for: 10m
    annotations:
      summary: "High request latency on {{ $labels.instance }}"
      description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"

Viewing active alerts

Active alerts are shown at http://localhost:9090/alerts

Alerts can also be queried with an expression:

ALERTS{alertname="<alert name>", alertstate="pending|firing", <additional alert labels>}

A sample value of 1 means the alert is active (pending or firing); when an alert transitions from active to inactive, its sample value becomes 0.
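For example, to select only the firing instances of the InstanceDown alert defined above:

```
ALERTS{alertname="InstanceDown", alertstate="firing"}
```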

For finer-grained control over how alerts are routed and delivered, use Alertmanager.

Template reference

Prometheus templates are Go templates, the same templating language Helm uses, so they are not covered in detail here.

The variables and functions available in Prometheus templates are documented at: https://prometheus.io/docs/prometheus/latest/configuration/template_reference/

Unit testing rules

Rules can be unit-tested with promtool:

$ # For a single test file.
$ ./promtool test rules test.yml

$ # If you have multiple test files, say test1.yml, test2.yml, test3.yml
$ ./promtool test rules test1.yml test2.yml test3.yml

Test file format:

# This is a list of rule files to consider for testing. Globs are supported.
rule_files:
  [ - <file_name> ]

# optional, default = 1m
evaluation_interval: <duration>

# The order in which group names are listed below will be the order of evaluation of
# rule groups (at a given evaluation time). The order is guaranteed only for the groups mentioned below.
# All the groups need not be mentioned below.
group_eval_order:
  [ - <group_name> ]

# All the tests are listed here.
tests:
  [ - <test_group> ]

test_group:

# Series data
interval: <duration>
input_series:
  [ - <series> ]

# Unit tests for the above data.

# Unit tests for alerting rules. We consider the alerting rules from the input file.
alert_rule_test:
  [ - <alert_test_case> ]

# Unit tests for PromQL expressions.
promql_expr_test:
  [ - <promql_test_case> ]

# External labels accessible to the alert template.
external_labels:
  [ <labelname>: <string> ... ]

series

# This follows the usual series notation '<metric name>{<label name>=<label value>, ...}'
# Examples:
#      series_name{label1="value1", label2="value2"}
#      go_goroutines{job="prometheus", instance="localhost:9090"}
series: <string>

# This uses expanding notation.
# Expanding notation:
#     'a+bxc' becomes 'a a+b a+(2*b) a+(3*b) … a+(c*b)'
#     'a-bxc' becomes 'a a-b a-(2*b) a-(3*b) … a-(c*b)'
# Examples:
#     1. '-2+4x3' becomes '-2 2 6 10'
#     2. ' 1-2x4' becomes '1 -1 -3 -5 -7'
values: <string>
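Putting it together, a sketch of a complete test file for the InstanceDown alert defined earlier (file name and timings are illustrative):

```yaml
# test.yml -- run with: ./promtool test rules test.yml
rule_files:
  - alerts.yml                  # assumed to contain the InstanceDown rule

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # 'up' stays at 0 (down) for the whole window, using the expanding notation above.
      - series: 'up{job="myjob", instance="host:9100"}'
        values: '0x10'
    alert_rule_test:
      - eval_time: 10m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: page
              job: myjob
              instance: host:9100
            exp_annotations:
              summary: "Instance host:9100 down"
              description: "host:9100 of job myjob has been down for more than 5 minutes."
```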