Prometheus Monitoring Setup (continuously updated)

Prometheus, a Cloud Native Computing Foundation project, is a systems and service monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts when specified conditions are observed.

The features that distinguish Prometheus from other metrics and monitoring systems are:

  • A multi-dimensional data model (time series defined by metric name and set of key/value dimensions)
  • PromQL, a powerful and flexible query language to leverage this dimensionality
  • No dependency on distributed storage; single server nodes are autonomous
  • An HTTP pull model for time series collection
  • Pushing time series is supported via an intermediary gateway for batch jobs
  • Targets are discovered via service discovery or static configuration
  • Multiple modes of graphing and dashboarding support
  • Support for hierarchical and horizontal federation

Configuring Prometheus

Download

wget https://github.com/prometheus/prometheus/releases/download/v2.14.0/prometheus-2.14.0.linux-amd64.tar.gz

tar zxf prometheus-2.14.0.linux-amd64.tar.gz
mv prometheus-2.14.0.linux-amd64 /usr/local/prometheus

Create a user

groupadd --system prometheus
useradd --system -g prometheus -s /sbin/nologin -c "Prometheus Monitoring System" prometheus

Grant ownership

chown -R prometheus:prometheus /usr/local/prometheus

Create the data directory

mkdir -p /data/prometheus
chown -R prometheus:prometheus /data/prometheus

Create the Prometheus service

cat > /usr/lib/systemd/system/prometheus.service <<'EOF'
[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/prometheus/prometheus \
  --config.file=/usr/local/prometheus/prometheus.yml \
  --storage.tsdb.path=/data/prometheus \
  --storage.tsdb.retention=30d \
  --storage.tsdb.retention.size=512MB \
  --web.enable-admin-api \
  --web.enable-lifecycle \
  --web.external-url=http://monitor.example.com
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

With Type set to notify, the service restarts endlessly (Prometheus does not send the systemd readiness notification), so use simple.

--storage.tsdb.path is optional; by default data is stored in ./data under the working directory.

--storage.tsdb.retention sets how long data is kept.

--storage.tsdb.retention.size is the maximum number of bytes the storage blocks may use (note this does not include the WAL size, which can be substantial). The oldest data is deleted first. Defaults to 0, i.e. disabled. This flag is experimental and may change in future releases. Supported units: KB, MB, GB, PB, e.g. "512MB".

--web.enable-admin-api enables access to the admin API.

--web.enable-lifecycle enables hot-reloading the configuration remotely.

--web.external-url=http://localhost:9090/ is the externally reachable address of the Prometheus host; if omitted, the GeneratorURL in alerts will be wrong.
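With --web.enable-lifecycle and --web.enable-admin-api set, the running server can be managed over plain HTTP. A minimal sketch of the relevant endpoints (the base URL is an example; adjust it to your deployment):

```shell
# Endpoints exposed by the lifecycle API (base URL is an example).
BASE="http://localhost:9090"
RELOAD_URL="$BASE/-/reload"     # re-read prometheus.yml and rule files
HEALTHY_URL="$BASE/-/healthy"   # liveness check
echo "$RELOAD_URL"
# Against a live server:
# curl -X POST "$RELOAD_URL"
# curl -s "$HEALTHY_URL"
```

This is what makes "reload without restart" possible later in this guide: edit the config, then POST to /-/reload.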

Create alerting rule files

A recommended site with many ready-made alerting rules: https://awesome-prometheus-alerts.grep.to/

Linux server liveness alerts

vi rules/basis.yml
groups:
- name: host-status-alerts
  rules:
  - alert: HostStatus
    expr: up == 0
    for: 1m
    labels:
      status: critical
    annotations:
      summary: "{{ $labels.instance }}: server is down"
      description: "{{ $labels.instance }}: scrapes have been failing for more than 1 minute"

  - alert: CpuUsage
    expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
    for: 1m
    labels:
      status: warning
    annotations:
      summary: "{{ $labels.instance }} CPU usage is high!"
      description: "{{ $labels.instance }} CPU usage is above 60% (current: {{ $value }}%)"

  - alert: MemoryUsage
    expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{ $labels.instance }} memory usage is high!"
      description: "{{ $labels.instance }} memory usage is above 80% (current: {{ $value }}%)"

  - alert: DiskIO
    expr: avg(irate(node_disk_io_time_seconds_total[1m])) by (instance) * 100 > 60
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{ $labels.instance }} disk I/O utilization is high!"
      description: "{{ $labels.instance }} disk I/O utilization is above 60% (current: {{ $value }})"

  - alert: NetworkIn
    expr: ((sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo.*'}[5m])) by (instance)) / 100) > 102400
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{ $labels.instance }} inbound network bandwidth is high!"
      description: "{{ $labels.instance }} inbound bandwidth has exceeded the threshold for 1 minute. RX rate: {{ $value }}"

  - alert: NetworkOut
    expr: ((sum(rate(node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo.*'}[5m])) by (instance)) / 100) > 102400
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{ $labels.instance }} outbound network bandwidth is high!"
      description: "{{ $labels.instance }} outbound bandwidth has exceeded the threshold for 1 minute. TX rate: {{ $value }}"

  - alert: TcpEstablished
    expr: node_netstat_Tcp_CurrEstab > 1000
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{ $labels.instance }} TCP_ESTABLISHED count is high!"
      description: "{{ $labels.instance }} TCP_ESTABLISHED is above 1000 (current: {{ $value }})"

  - alert: DiskSpace
    expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{ $labels.mountpoint }} partition usage is high!"
      description: "{{ $labels.mountpoint }} partition usage is above 80% (current: {{ $value }}%)"

Windows server liveness alerts

groups:
- name: windows-host-alerts
  rules:
  - alert: WindowsServerCollectorError
    expr: windows_exporter_collector_success == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Windows Server collector error (instance {{ $labels.instance }})
      description: "Collector {{ $labels.collector }} was not successful\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

  - alert: WindowsServerServiceStatus
    expr: windows_service_status{status="ok"} != 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Windows Server service status (instance {{ $labels.instance }})
      description: "Windows service state is not OK\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

  - alert: WindowsServerCpuUsage
    expr: 100 - (avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[2m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Windows Server CPU usage (instance {{ $labels.instance }})
      description: "CPU usage is more than 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

  - alert: WindowsServerMemoryUsage
    expr: 100 - ((windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100) > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Windows Server memory usage (instance {{ $labels.instance }})
      description: "Memory usage is more than 90%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

  - alert: WindowsServerDiskSpaceUsage
    expr: 100.0 - 100 * ((windows_logical_disk_free_bytes / 1024 / 1024) / (windows_logical_disk_size_bytes / 1024 / 1024)) > 80
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Windows Server disk space usage (instance {{ $labels.instance }})
      description: "Disk usage is more than 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

  - alert: NetworkIn
    expr: (irate(windows_net_bytes_received_total{nic!~'isatap.*|VPN.*'}[5m]) * 8 / 1000) > 5120
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{ $labels.instance }} inbound (download) bandwidth is high!"
      description: "{{ $labels.instance }} inbound (download) bandwidth is above 5M. RX rate: {{ $value }}"

  - alert: NetworkOut
    expr: (irate(windows_net_bytes_sent_total{nic!~'isatap.*|VPN.*'}[5m]) * 8 / 1000) > 5120
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{ $labels.instance }} outbound (upload) bandwidth is high!"
      description: "{{ $labels.instance }} outbound (upload) bandwidth is above 5M. TX rate: {{ $value }}"

HTTP monitoring alerts

groups:
- name: blackbox_network_stats
  rules:
  - alert: blackbox_network_stats
    expr: probe_success == 0
    for: 1m # fire if the probe stays at 0 for 1 minute
    labels:
      severity: critical
    annotations:
      description: 'Site/endpoint {{ $labels.instance }} in job {{ $labels.job }} has been down for more than one minute.'
      summary: 'Site/endpoint {{ $labels.instance }} is down!'

  - alert: BlackboxProbeHttpFailure
    expr: probe_http_status_code <= 199 or probe_http_status_code >= 400
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Blackbox probe HTTP failure (instance {{ $labels.instance }})
      description: "HTTP status code is not 200-399\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

  - alert: BlackboxSslCertificateWillExpireSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
      description: "SSL certificate expires in 30 days\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

Check that the rules are valid

./promtool check rules rules/basis.yml
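To validate every rule file at once before reloading, a small loop helps (promtool ships in the Prometheus tarball; the path here is an example matching this guide's layout):

```shell
# Check each rule file with promtool before asking Prometheus to reload.
PROMTOOL=/usr/local/prometheus/promtool
for f in rules/*.yml; do
  echo "checking $f"
  # "$PROMTOOL" check rules "$f" || exit 1   # run this on the Prometheus host
done
```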

Update the Prometheus configuration

vi prometheus.yml
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"

Start

systemctl enable prometheus.service
systemctl start prometheus.service

Configure an Nginx proxy with HTTP Basic Auth

Prometheus does not provide any authentication support. With Nginx as a reverse proxy in front of it, however, HTTP Basic Auth is easy to add.

Then, in /usr/local/nginx/conf/ (adjust the path if your Nginx config directory lives elsewhere), create a user file with the htpasswd tool from apache2-utils, entering a username and password:

htpasswd -c /usr/local/nginx/conf/.htpasswd admin
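What htpasswd protects is the standard Basic scheme: the client sends the credentials base64-encoded in an Authorization header. A small illustration (the admin/secret pair is a made-up example, not a credential from this guide):

```shell
# Build the Authorization header a client sends for HTTP Basic Auth.
USER=admin
PASS=secret   # example credentials only
TOKEN=$(printf '%s:%s' "$USER" "$PASS" | base64)
echo "Authorization: Basic $TOKEN"
# Test the protected proxy once nginx is running:
# curl -u "$USER:$PASS" http://monitor.example.com/
```

Note that base64 is encoding, not encryption, so Basic Auth should ideally sit behind TLS.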

Configure Nginx

server {
    listen 80;
    server_name monitor.example.com;

    location / {
        auth_basic "Prometheus";
        auth_basic_user_file ".htpasswd";
        proxy_pass http://localhost:9090/;
    }
}

Visit http://monitor.example.com and enter the username and password.

Prometheus service discovery

Service discovery lets a monitoring system add and remove monitored resources dynamically. Zabbix, for example, can auto-discover hosts and resources; Prometheus, a monitoring system on par with Zabbix, naturally has its own discovery mechanisms.

file_sd_configs

file_sd_configs can be used to add and remove targets dynamically.

Update the Prometheus configuration file

- job_name: 'node'
  file_sd_configs:
  - refresh_interval: 1m
    files:
    - targets/nodes/*.yml

Create the scanned target file, nodes.yml

- targets:
  - '172.19.179.239:9100'
  - '172.19.179.240:9100'
  - '172.19.179.244:9100'
  - '172.19.179.253:9100'
  - '172.19.179.254:9100'
  labels:
    server: linux
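Because the directory is rescanned every refresh_interval, a target can be added by simply dropping in (or editing) a file; no restart or reload is needed. A sketch using a temporary directory as a stand-in for targets/nodes/ (the host address is a placeholder):

```shell
# Write an additional target file; Prometheus picks it up automatically
# within refresh_interval.
TARGET_DIR=$(mktemp -d)   # stands in for targets/nodes/
cat > "$TARGET_DIR/extra.yml" <<'EOF'
- targets:
  - '172.19.179.2:9100'
  labels:
    server: linux
EOF
cat "$TARGET_DIR/extra.yml"
```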

consul_sd_configs

Consul is an open-source tool written in Go that provides service registration, service discovery, and configuration management for distributed, service-oriented systems, including health checks, key/value storage, multi-datacenter support, and distributed consistency guarantees. So far, adding a target to Prometheus has meant changing configuration on the server; even with file_sd_configs you still have to log in and edit the corresponding file, which is cumbersome. Prometheus officially supports several service-discovery mechanisms, and Consul is one of them.

This configuration requires a running Consul service.

Update the Prometheus configuration file

- job_name: 'consul-prometheus'
  consul_sd_configs:
  - server: '172.30.12.167:8500'
    services: []
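With an empty services list, Prometheus scrapes every service registered in Consul. A new node can then be added just by registering it through Consul's HTTP API; a sketch of the registration payload (the ID, name, and addresses are examples):

```shell
# JSON payload for Consul's /v1/agent/service/register endpoint.
PAYLOAD='{"ID": "node-exporter-239", "Name": "node-exporter", "Address": "172.19.179.239", "Port": 9100}'
echo "$PAYLOAD"
# Register against a live Consul agent:
# curl -X PUT -d "$PAYLOAD" http://172.30.12.167:8500/v1/agent/service/register
```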

Run in a container

docker run -d \
  -p 9090:9090 \
  -v "/prom/prometheus.yml:/etc/prometheus/prometheus.yml" \
  -v "/prom/rules:/etc/prometheus/rules" \
  -v "/prom/targets:/etc/prometheus/targets" \
  prom/prometheus

Configure node_exporter to monitor hosts

node_exporter is an exporter for machine metrics that runs on *nix and Linux systems.

node_exporter exposes metrics to Prometheus, including CPU load, memory usage, network statistics, and more.

Download

wget https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz

tar zxf node_exporter-0.18.1.linux-amd64.tar.gz
mv node_exporter-0.18.1.linux-amd64 /usr/local/node_exporter

Create the node_exporter service

cat > /usr/lib/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
After=network.target

[Service]
ExecStart=/usr/local/node_exporter/node_exporter

[Install]
WantedBy=multi-user.target
EOF

Start

systemctl enable node_exporter.service
systemctl start node_exporter.service

Configure prometheus.yml

Add node_exporter under scrape_configs and restart Prometheus.

- job_name: 'node'
  static_configs:
  - targets:
    - '172.19.179.239:9100'
    - '172.19.179.240:9100'

Run in a container

docker run -d \
  --net="host" \
  -v "/:/host:ro,rslave" \
  prom/node-exporter \
  --path.rootfs=/host \
  --collector.filesystem.ignored-mount-points="^/(sys|proc|dev|host|etc)($|/)"

Configure Grafana

Download

wget https://dl.grafana.com/oss/release/grafana-6.5.2.linux-amd64.tar.gz

tar -zxf grafana-6.5.2.linux-amd64.tar.gz
mv grafana-6.5.2 /usr/local/grafana

Create the Grafana service

cat > /usr/lib/systemd/system/grafana-server.service <<EOF
[Unit]
Description=Grafana
After=network.target
[Service]
Type=notify
ExecStart=/usr/local/grafana/bin/grafana-server -homepath /usr/local/grafana
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF

Start

systemctl enable grafana-server.service
systemctl start grafana-server.service

Configure the data source

Add a data source

Click Add data source, choose Prometheus, enter http://localhost:9090 in the URL field, and click Save & test. A green confirmation means the configuration works; otherwise check the address, port, and other settings.

Download a dashboard

Download https://grafana.com/grafana/dashboards/9276 or https://grafana.com/grafana/dashboards/8919

Import the dashboard


Configure Nginx

Add the Nginx configuration. The trailing "/" after proxy_pass is required (it strips the /grafana/ prefix from the proxied request).

server {
    listen 80;
    server_name localhost;

    location /grafana/ {
        proxy_pass http://localhost:3000/;
        proxy_buffering off;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        access_log off;
    }
}

Update the Grafana configuration (grafana.ini); remember to remove the leading ";" to uncomment the lines:

[server]
domain = <your-domain>
root_url = %(protocol)s://%(domain)s/grafana/

Run in a container

docker run -d \
-p 3000:3000 \
-e TZ=Asia/Shanghai \
-e GF_DATABASE_TYPE=mysql \
-e GF_DATABASE_HOST=127.0.0.1:3306 \
-e GF_DATABASE_NAME=grafana \
-e GF_DATABASE_USER=root \
-e GF_DATABASE_PASSWORD=root \
-e GF_PLUGINS_ENABLE_ALPHA=true \
-e "GF_INSTALL_PLUGINS=grafana-piechart-panel,grafana-simple-json-datasource" \
grafana/grafana

Configure Alertmanager

Download

wget https://github.com/prometheus/alertmanager/releases/download/v0.20.0/alertmanager-0.20.0.linux-amd64.tar.gz

tar zxf alertmanager-0.20.0.linux-amd64.tar.gz
mv alertmanager-0.20.0.linux-amd64 /usr/local/prometheus/alertmanager

Download the DingTalk alerting plugin

wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v1.4.0/prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz

tar zxf prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
mv prometheus-webhook-dingtalk-1.4.0.linux-amd64 /usr/local/prometheus/alertmanager/prometheus-webhook-dingtalk

Configure config.yml

targets:
  webhook:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
    secret: SEC000000000000000000000
    template: contrib/templates/legacy/dingtalk.tmpl

Configure message templates

A few templates are provided here:

https://github.com/bwcxyk/tools_file/tree/master/prometheus/alertmanager/dingtalk/templates

Create the service

cat > /usr/lib/systemd/system/prometheus-webhook-dingtalk.service <<'EOF'
[Unit]
Description=prometheus-webhook-dingtalk
After=network-online.target

[Service]
Restart=on-failure
ExecStart=/usr/local/prometheus/alertmanager/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk \
  --config.file=/usr/local/prometheus/alertmanager/prometheus-webhook-dingtalk/config.yml

[Install]
WantedBy=multi-user.target
EOF

Start the DingTalk alerting plugin

systemctl enable prometheus-webhook-dingtalk.service
systemctl start prometheus-webhook-dingtalk.service
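Once the plugin is running, it can be exercised directly with a hand-built payload in Alertmanager's webhook format, without waiting for a real alert. The alert content below is made up for testing:

```shell
# Minimal Alertmanager-style webhook payload for a manual test.
PAYLOAD='{"alerts": [{"status": "firing", "labels": {"alertname": "TestAlert", "severity": "warning"}, "annotations": {"summary": "manual test"}}]}'
echo "$PAYLOAD"
# Send it to the plugin; a message should appear in the DingTalk group:
# curl -H 'Content-Type: application/json' -d "$PAYLOAD" http://localhost:8060/dingtalk/webhook/send
```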

Update the Alertmanager configuration

vi alertmanager.yml
global:
  resolve_timeout: 5m
# Routing tree: root node
route:
  receiver: webhook
  # Grouping dimension
  group_by: [alertname]
  # Wait before sending a newly created group: 30s batching window
  group_wait: 30s
  # When new alerts join an existing group, wait 5m between sends
  group_interval: 5m
  # Wait 3h before re-sending an already-delivered alert
  repeat_interval: 3h
  routes:
  - receiver: webhook
    group_wait: 10s
# Receivers
receivers:
- name: webhook
  webhook_configs:
  - url: http://localhost:8060/dingtalk/webhook/send
    send_resolved: true
# Inhibition:
# among alerts with the same alertname, cluster, and service,
# a firing critical alert suppresses the warning one
inhibit_rules:
- equal: ['alertname', 'cluster', 'service']
  source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'

Create the Alertmanager service

cat > /usr/lib/systemd/system/alertmanager.service <<EOF
[Unit]
Description=Alertmanager
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/prometheus/alertmanager/alertmanager --web.external-url=http://example.com:9093 --config.file=/usr/local/prometheus/alertmanager/alertmanager.yml --storage.path=/data/prometheus/alertmanager/data
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

Start the service

systemctl enable alertmanager.service
systemctl start alertmanager.service

Configure Nginx

server {
    listen 80;
    server_name alertmanager.example.com;

    location / {
        auth_basic "Alertmanager";
        auth_basic_user_file ".htpasswd";
        proxy_pass http://localhost:9093/;
    }
}

Run in a container

docker run -d \
-p 9093:9093 \
-v "/prom/alertmanager.yml:/etc/alertmanager/alertmanager.yml" \
prom/alertmanager

Blackbox_exporter

Download

wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.17.0/blackbox_exporter-0.17.0.linux-amd64.tar.gz

tar zxf blackbox_exporter-0.17.0.linux-amd64.tar.gz
mv blackbox_exporter-0.17.0.linux-amd64 /usr/local/prometheus/blackbox_exporter

Configuration

Edit the blackbox.yml file

vi blackbox.yml
modules:
  http_2xx: # HTTP GET module; every Blackbox Exporter probe is configured as a module
    prober: http
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2"]
      valid_status_codes: [200] # pinning the status code makes the result explicit when graphing in Grafana
      method: GET
      preferred_ip_protocol: "ip4"
  http_post_2xx: # HTTP POST module
    prober: http
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2"]
      method: POST
      preferred_ip_protocol: "ip4"

Update the Prometheus configuration to add jobs using file-based discovery

metrics_path defaults to /metrics in the source code, so it must be overridden here.

- job_name: "blackbox-http"
  metrics_path: /probe # /probe, not /metrics
  params:
    module: [http_2xx] # use the http_2xx module
  file_sd_configs:
  - refresh_interval: 1m
    files:
    - targets/blackbox/http_2xx.yml
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: 127.0.0.1:9115 # blackbox_exporter address

- job_name: "blackbox-http-post"
  metrics_path: /probe # /probe, not /metrics
  params:
    module: [http_post_2xx] # use the http_post_2xx module
  file_sd_configs:
  - refresh_interval: 1m
    files:
    - targets/blackbox/http_post_2xx.yml
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: 127.0.0.1:9115 # blackbox_exporter address

Create the targets/blackbox/http_2xx.yml file

- targets:
  - https://baidu.com
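A module/target combination can be verified against the exporter directly, before wiring it into Prometheus. The sketch below only assembles the probe URL (the same request Prometheus issues after relabeling); run the commented curl on the exporter host:

```shell
# Build the /probe request that Prometheus sends after relabeling.
MODULE=http_2xx
TARGET=https://baidu.com
PROBE_URL="http://127.0.0.1:9115/probe?module=${MODULE}&target=${TARGET}"
echo "$PROBE_URL"
# On the exporter host; probe_success 1 means the check passed:
# curl -s "$PROBE_URL" | grep probe_success
```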

Create the system service

cat > /usr/lib/systemd/system/blackbox.service <<EOF
[Unit]
Description=blackbox_exporter
After=network.target

[Service]
User=root
Type=simple
ExecStart=/usr/local/prometheus/blackbox_exporter/blackbox_exporter --config.file=/usr/local/prometheus/blackbox_exporter/blackbox.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

Start the service

systemctl daemon-reload
systemctl start blackbox.service
systemctl enable blackbox.service

Reload Prometheus

curl -X POST "http://127.0.0.1:9090/-/reload"

Grafana dashboard

Import https://grafana.com/grafana/dashboards/9965

Alerting configuration

Create the rules/blackbox_exporter.yml file

groups:
- name: blackbox_network_stats
  rules:
  - alert: blackbox_network_stats
    expr: probe_success == 0
    for: 1m # fire if the probe stays at 0 for 1 minute
    labels:
      severity: critical
    annotations:
      description: 'Site/endpoint {{ $labels.instance }} in job {{ $labels.job }} has been down for more than one minute.'
      summary: 'Site/endpoint {{ $labels.instance }} is down!'

  - alert: BlackboxProbeHttpFailure
    expr: probe_http_status_code <= 199 or probe_http_status_code >= 400
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Blackbox probe HTTP failure (instance {{ $labels.instance }})
      description: "HTTP status code is not 200-399\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

  - alert: BlackboxSslCertificateWillExpireSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
      description: "SSL certificate expires in 30 days\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

Miscellaneous notes

To delete the metrics of certain jobs or instances, use the following commands (this requires --web.enable-admin-api):

curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="kubernetes"}'
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="10.244.2.158:9090"}'
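delete_series only marks the matching samples with tombstones; to reclaim disk space immediately rather than waiting for the next compaction, the admin API also provides clean_tombstones. A sketch (base URL is an example):

```shell
# Build the admin API URLs for deleting a job's series and purging tombstones.
BASE="http://localhost:9090/api/v1/admin/tsdb"
DELETE_URL="$BASE/delete_series?match[]={job=\"kubernetes\"}"
CLEAN_URL="$BASE/clean_tombstones"
echo "$CLEAN_URL"
# Against a live server started with --web.enable-admin-api:
# curl -X POST -g "$DELETE_URL"
# curl -X POST "$CLEAN_URL"
```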

Reference: deleting data (metrics) in Prometheus

Grafana dashboards

ES Nginx Logs