Prometheus Monitoring Installation (Continuously Updated)
Prometheus, a Cloud Native Computing Foundation project, is a systems and service monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts when specified conditions are observed.
The features that distinguish Prometheus from other metrics and monitoring systems are:
- A multi-dimensional data model (time series defined by metric name and set of key/value dimensions)
- PromQL, a powerful and flexible query language to leverage this dimensionality
- No dependency on distributed storage; single server nodes are autonomous
- An HTTP pull model for time series collection
- Pushing time series is supported via an intermediary gateway for batch jobs
- Targets are discovered via service discovery or static configuration
- Multiple modes of graphing and dashboarding support
- Support for hierarchical and horizontal federation
1 Configuring Prometheus
1.1 Download
wget https://github.com/prometheus/prometheus/releases/download/v2.14.0/prometheus-2.14.0.linux-amd64.tar.gz
tar zxf prometheus-2.14.0.linux-amd64.tar.gz
mv prometheus-2.14.0.linux-amd64 /usr/local/prometheus
1.2 Create the user
groupadd --system prometheus
useradd --system -g prometheus -s /sbin/nologin -c "Prometheus Monitoring System" prometheus
1.3 Set ownership
chown -R prometheus:prometheus /usr/local/prometheus
1.4 Create the data directory
mkdir -p /data/prometheus
chown -R prometheus:prometheus /data/prometheus
1.5 Create the Prometheus service
cat > /usr/lib/systemd/system/prometheus.service <<EOF
[Unit]
Description=Prometheus
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/prometheus/prometheus \
--config.file=/usr/local/prometheus/prometheus.yml \
--storage.tsdb.path=/data/prometheus \
--storage.tsdb.retention=30d \
--storage.tsdb.retention.size=512MB \
--web.enable-admin-api \
--web.enable-lifecycle \
--web.external-url=http://monitor.example.com
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
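If Type is set to notify, the service keeps restarting, so Type=simple is used here.
--storage.tsdb.path
Optional. By default the data directory is ./data under the working directory.
--storage.tsdb.retention
How long to keep data. Newer releases deprecate this flag in favor of --storage.tsdb.retention.time.
--storage.tsdb.retention.size
The maximum number of bytes that storage blocks may use (this does not include the WAL, which can be large). The oldest data is deleted first. Defaults to 0, meaning disabled. This flag is experimental and may change in future releases. Supported units: KB, MB, GB, PB. Example: "512MB".
--web.enable-admin-api
Enables access to the admin API.
--web.enable-lifecycle
Enables reloading the configuration (and shutting down) through HTTP requests, for example a POST to /-/reload.
--web.external-url=http://localhost:9090/
The externally reachable address of the Prometheus host. If it is omitted, the GeneratorURL in alerts will be wrong.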
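To make the size units of --storage.tsdb.retention.size concrete, here is a small sketch that converts a size string to bytes. It assumes the units are powers of two (1KB = 1024 bytes), which is how the Prometheus documentation describes them; size_to_bytes is a hypothetical helper, not part of Prometheus.

```shell
# Convert a size such as 512MB to bytes, assuming base-2 units (1KB = 1024 B).
# Hypothetical helper for illustration only; it handles KB, MB and GB.
size_to_bytes() {
    case "$1" in
        *KB) echo $(( ${1%KB} * 1024 )) ;;
        *MB) echo $(( ${1%MB} * 1024 * 1024 )) ;;
        *GB) echo $(( ${1%GB} * 1024 * 1024 * 1024 )) ;;
        *) echo "unsupported unit: $1" >&2; return 1 ;;
    esac
}

# Usage: size_to_bytes 512MB
```

With the value used in the unit file above, 512MB corresponds to 536870912 bytes of block storage before the oldest blocks are removed.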
1.6 Create alert rule files
A useful site with many ready-made alert rules: https://awesome-prometheus-alerts.grep.to/
1.6.1 Linux server alerts
vi rules/basis.yml
groups:
- name: Host status alerts
  rules:
  - alert: HostDown
    expr: up == 0
    for: 1m
    labels:
      status: critical
    annotations:
      summary: "{{$labels.instance}}: server down"
      description: "{{$labels.instance}}: server has been unreachable for more than 1 minute"
  - alert: HostCpuUsage
    expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
    for: 1m
    labels:
      status: warning
    annotations:
      summary: "{{$labels.instance}} CPU usage is too high!"
      description: "{{$labels.instance}} CPU usage is above 60% (current: {{$value}}%)"
  - alert: HostMemoryUsage
    expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{$labels.instance}} memory usage is too high!"
      description: "{{$labels.instance}} memory usage is above 80% (current: {{$value}}%)"
  - alert: HostDiskIO
    expr: avg(irate(node_disk_io_time_seconds_total[1m])) by (instance) * 100 > 60
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{$labels.instance}} disk IO utilization is too high!"
      description: "{{$labels.instance}} disk IO utilization is above 60% (current: {{$value}}%)"
  - alert: HostNetworkReceive
    expr: sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance) > 10240000
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{$labels.instance}} inbound network bandwidth is too high!"
      description: "{{$labels.instance}} inbound traffic has stayed above 10240000 bytes/s (about 10 MB/s) for 1 minute. Current RX rate: {{$value}} bytes/s"
  - alert: HostNetworkTransmit
    expr: sum(rate(node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance) > 10240000
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{$labels.instance}} outbound network bandwidth is too high!"
      description: "{{$labels.instance}} outbound traffic has stayed above 10240000 bytes/s (about 10 MB/s) for 1 minute. Current TX rate: {{$value}} bytes/s"
  - alert: HostTcpEstablished
    expr: node_netstat_Tcp_CurrEstab > 1000
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{$labels.instance}} has too many ESTABLISHED TCP connections!"
      description: "{{$labels.instance}} has more than 1000 ESTABLISHED TCP connections (current: {{$value}})"
  - alert: HostDiskUsage
    expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{$labels.instance}} {{$labels.mountpoint}} disk partition usage is too high!"
      description: "{{$labels.instance}} {{$labels.mountpoint}} disk partition usage is above 80% (current: {{$value}}%)"
1.6.2 Windows server alerts
groups:
- name: Windows host status alerts
  rules:
  - alert: WindowsServerCollectorError
    expr: windows_exporter_collector_success == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Windows Server collector Error (instance {{ $labels.instance }})"
      description: "Collector {{ $labels.collector }} was not successful\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  - alert: WindowsServerServiceStatus
    expr: windows_service_status{status="ok"} != 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Windows Server service Status (instance {{ $labels.instance }})"
      description: "Windows Service state is not OK\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  - alert: WindowsServerCpuUsage
    expr: 100 - (avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[2m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Windows Server CPU Usage (instance {{ $labels.instance }})"
      description: "CPU Usage is more than 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  - alert: WindowsServerMemoryUsage
    expr: 100 - ((windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100) > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Windows Server memory Usage (instance {{ $labels.instance }})"
      description: "Memory usage is more than 90%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  - alert: WindowsServerDiskSpaceUsage
    expr: 100.0 - 100 * ((windows_logical_disk_free_bytes / 1024 / 1024) / (windows_logical_disk_size_bytes / 1024 / 1024)) > 80
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Windows Server disk Space Usage (instance {{ $labels.instance }})"
      description: "Disk usage is more than 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  - alert: WindowsNetworkReceive
    expr: (irate(windows_net_bytes_received_total{nic!~'isatap.*|VPN.*'}[5m]) * 8 / 1000) > 5120
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{$labels.instance}} inbound (download) network bandwidth is too high!"
      description: "{{$labels.instance}} inbound (download) bandwidth has stayed above 5120 kbit/s (about 5 Mbit/s) for 1 minute. Current RX rate: {{$value}} kbit/s"
  - alert: WindowsNetworkTransmit
    expr: (irate(windows_net_bytes_sent_total{nic!~'isatap.*|VPN.*'}[5m]) * 8 / 1000) > 5120
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{$labels.instance}} outbound (upload) network bandwidth is too high!"
      description: "{{$labels.instance}} outbound (upload) bandwidth has stayed above 5120 kbit/s (about 5 Mbit/s) for 1 minute. Current TX rate: {{$value}} kbit/s"
1.6.3 HTTP monitoring alerts
groups:
- name: blackbox_network_stats
  rules:
  - alert: blackbox_network_stats
    expr: probe_success == 0
    for: 1m  # fire if the value stays at 0 for 1 minute
    labels:
      severity: critical
    annotations:
      description: 'Site/endpoint {{ $labels.instance }} in job {{ $labels.job }} has been down for more than one minute.'
      summary: 'Site/endpoint {{ $labels.instance }} is down!'
  - alert: BlackboxProbeHttpFailure
    expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Blackbox probe HTTP failure (instance {{ $labels.instance }})"
      description: "HTTP status code is not 200-399\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  - alert: BlackboxSslCertificateWillExpireSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})"
      description: "SSL certificate expires in 30 days\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
Check that the rules are valid:
./promtool check rules rules/basis.yml
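With several rule files under rules/, it can help to check them all in one pass. The sketch below wraps the promtool command shown above in a loop; the PROMTOOL default path and the check_all_rules name are assumptions based on the install location used in this guide.

```shell
# Validate every rule file in a directory and stop at the first failure.
# PROMTOOL is an assumed path; point it at your own promtool binary.
PROMTOOL="${PROMTOOL:-/usr/local/prometheus/promtool}"

check_all_rules() {
    rules_dir="$1"
    for f in "$rules_dir"/*.yml; do
        # Skip the unexpanded pattern when the directory has no .yml files.
        [ -e "$f" ] || continue
        "$PROMTOOL" check rules "$f" || return 1
    done
}

# Usage: check_all_rules /usr/local/prometheus/rules
```

Running it before every reload keeps a broken rule file from reaching the running server.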
1.7 Edit the Prometheus configuration
vi prometheus.yml
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"
1.8 Start
systemctl daemon-reload
systemctl enable prometheus.service
systemctl start prometheus.service
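After starting the service, it is worth confirming that Prometheus is actually ready to serve queries. The following is a minimal sketch that polls the /-/ready endpoint; the function name, the default URL and the retry count are assumptions.

```shell
# Poll the Prometheus readiness endpoint until it answers, or give up.
# The base URL and the number of attempts are assumed defaults.
wait_for_prometheus() {
    base_url="${1:-http://localhost:9090}"
    attempts="${2:-30}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f makes curl exit non-zero on HTTP errors such as 503.
        if curl -fsS -o /dev/null "$base_url/-/ready"; then
            echo "ready"
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "not ready after $attempts attempts" >&2
    return 1
}

# Usage: wait_for_prometheus http://localhost:9090 30
```

If it times out, systemctl status prometheus.service and journalctl -u prometheus.service are the first places to look.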
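1.9 Configure an nginx proxy with HTTP Basic Auth
The Prometheus release used here has no built-in authentication. With Nginx as a reverse proxy in front of it, however, HTTP Basic Auth is easy to add.
In /usr/local/nginx/conf/ (your Nginx configuration directory may be elsewhere; adjust accordingly), use the htpasswd tool from apache2-utils to create a user file. It prompts for the password of the given user:
htpasswd -c /usr/local/nginx/conf/.htpasswd admin
Configure nginx:
server {
    listen 80;
    server_name monitor.example.com;
    location / {
        auth_basic "Prometheus";
        auth_basic_user_file ".htpasswd";
        proxy_pass http://localhost:9090/;
    }
}
Open http://monitor.example.com and log in with the username and password.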
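A quick way to confirm that the proxy really enforces authentication is to send a request without credentials and expect HTTP 401. This is only a sketch; check_basic_auth is a hypothetical helper and the URL in the usage line assumes the server_name from the nginx configuration above.

```shell
# Succeed only if an unauthenticated request is rejected with 401.
check_basic_auth() {
    url="$1"
    # -w prints just the status code; the response body is discarded.
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
    if [ "$code" = "401" ]; then
        echo "protected"
    else
        echo "unexpected status $code"
        return 1
    fi
}

# Usage: check_basic_auth http://monitor.example.com/
# An authenticated request can then be made with: curl -u admin http://monitor.example.com/
```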
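1.10 Prometheus service discovery
Service discovery makes it easy to add and remove monitored resources dynamically. Zabbix, for example, can automatically discover hosts and resources to monitor. Prometheus, a monitoring system on a par with Zabbix, naturally has service discovery mechanisms of its own.
1.10.1 file_sd_configs
file_sd_configs lets you add and remove targets dynamically.
Edit the Prometheus configuration file and add the job under scrape_configs:
  - job_name: 'node'
    file_sd_configs:
    - refresh_interval: 1m
      files:
      - targets/nodes/*.yml
Create the file to be scanned, targets/nodes/nodes.yml:
- targets:
  - '172.19.179.239:9100'
  - '172.19.179.240:9100'
  - '172.19.179.244:9100'
  - '172.19.179.253:9100'
  - '172.19.179.254:9100'
  labels:
    server: linux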
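Because the target file is plain YAML, it can be generated from a host list instead of being edited by hand. The sketch below prints one target group in the format shown above; make_targets is a hypothetical helper, and port 9100 is assumed because it is the node_exporter port used in this guide.

```shell
# Print a file_sd target group: first argument is the value of the
# "server" label, remaining arguments are host addresses (port 9100 assumed).
make_targets() {
    server_label="$1"
    shift
    echo "- targets:"
    for host in "$@"; do
        echo "    - '$host:9100'"
    done
    echo "  labels:"
    echo "    server: $server_label"
}

# Usage: make_targets linux 172.19.179.239 172.19.179.240 > targets/nodes/nodes.yml
```

Prometheus picks up the rewritten file on its own, either through file watches or at the next refresh_interval, so no reload is needed.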
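1.10.2 consul_sd_configs
Consul is an open-source tool written in Go that provides service registration, service discovery and configuration management for distributed, service-oriented systems. Its features include service registration/discovery, health checks, key/value storage, multi-datacenter support and distributed consistency guarantees. So far, adding a new target meant changing a configuration file on the Prometheus server; even with file_sd_configs you still have to log in to the server and edit the corresponding JSON or YAML file, which is tedious. Prometheus natively supports several service discovery mechanisms, and Consul is one of them.
consul_sd_configs requires a running Consul service.
Edit the Prometheus configuration file:
  - job_name: 'consul-prometheus'
    consul_sd_configs:
    - server: '172.30.12.167:8500'
      services: []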
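For Prometheus to discover anything, targets first have to be registered in Consul. A minimal sketch of that step follows, using the Consul agent HTTP endpoint /v1/agent/service/register; the service name node-exporter, the ID scheme and the consul_service_json helper are assumptions for illustration.

```shell
# Build a Consul service registration payload.
# Arguments: service name, address, port. The ID is derived as name-address.
consul_service_json() {
    name="$1"
    address="$2"
    port="$3"
    printf '{"ID":"%s-%s","Name":"%s","Address":"%s","Port":%s}\n' \
        "$name" "$address" "$name" "$address" "$port"
}

# Usage (assumes the Consul agent from the config above):
# consul_service_json node-exporter 172.19.179.239 9100 |
#     curl -X PUT --data @- http://172.30.12.167:8500/v1/agent/service/register
```

With services: [] in the scrape job, every service registered this way becomes a target.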
1.11 Running in a container
docker run -d \
-p 9090:9090 \
-v "/prom/prometheus.yml:/etc/prometheus/prometheus.yml" \
-v "/prom/rules:/etc/prometheus/rules" \
-v "/prom/targets:/etc/prometheus/targets" \
prom/prometheus
2 Configuring node_exporter to monitor hosts
node_exporter is an exporter for machine metrics that runs on *NIX and Linux systems.
It exposes metrics to Prometheus, including CPU load, memory usage, network statistics and more.
2.1 Download
wget https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz
tar zxf node_exporter-0.18.1.linux-amd64.tar.gz
mv node_exporter-0.18.1.linux-amd64 /usr/local/node_exporter
2.2 Create the node_exporter service
cat > /usr/lib/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
After=network.target
[Service]
ExecStart=/usr/local/node_exporter/node_exporter
[Install]
WantedBy=multi-user.target
EOF
2.3 Start
systemctl daemon-reload
systemctl enable node_exporter.service
systemctl start node_exporter.service
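2.4 Configure prometheus.yml
Add a node_exporter job under scrape_configs, then restart Prometheus. Job names must be unique, so if the file_sd_configs job named 'node' from 1.10.1 is already in place, use one or the other.
  - job_name: 'node'
    static_configs:
    - targets:
      - '172.19.179.239:9100'
      - '172.19.179.240:9100'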
2.5 Running in a container
docker run -d \
--net="host" \
-v "/:/host:ro,rslave" \
prom/node-exporter:v0.18.1 \
--path.rootfs=/host \
--collector.filesystem.ignored-mount-points='^/(sys|proc|dev|host|etc)($|/)'
3 Configuring Alertmanager
3.1 Download
wget https://github.com/prometheus/alertmanager/releases/download/v0.20.0/alertmanager-0.20.0.linux-amd64.tar.gz
tar zxf alertmanager-0.20.0.linux-amd64.tar.gz
mv alertmanager-0.20.0.linux-amd64 /usr/local/prometheus/alertmanager
3.2 Download the DingTalk alert plugin
wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v1.4.0/prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
tar zxf prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
mv prometheus-webhook-dingtalk-1.4.0.linux-amd64 /usr/local/prometheus/prometheus-webhook-dingtalk
Configure config.yml:
targets:
webhook:
url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
secret: SEC000000000000000000000
template: contrib/templates/legacy/dingtalk.tmpl
Configure the message template
A few templates are available here:
https://github.com/bwcxyk/config_file/tree/master/prometheus/alertmanager/dingtalk/templates
3.3 Create the service
cat > /usr/lib/systemd/system/prometheus-webhook-dingtalk.service <<EOF
[Unit]
Description=prometheus-webhook-dingtalk
After=network-online.target
[Service]
Restart=on-failure
ExecStart=/usr/local/prometheus/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk \
--config.file=/usr/local/prometheus/prometheus-webhook-dingtalk/config.yml \
--web.enable-lifecycle
[Install]
WantedBy=multi-user.target
EOF
3.4 Start the DingTalk alert plugin
systemctl daemon-reload
systemctl enable prometheus-webhook-dingtalk.service
systemctl start prometheus-webhook-dingtalk.service
3.5 Edit the Alertmanager configuration
vi alertmanager.yml
global:
  resolve_timeout: 5m
# Routing tree: root node
route:
  receiver: webhook
  # Grouping dimension
  group_by: [alertname]
  # Wait 30s before sending the first notification for a new group
  group_wait: 30s
  # Wait 5m before notifying about new alerts added to an existing group
  group_interval: 5m
  # Wait 3h before re-sending a notification that was already sent successfully
  repeat_interval: 3h
  routes:
  - receiver: webhook
    group_wait: 10s
# Receivers
receivers:
- name: webhook
  webhook_configs:
  - url: http://localhost:8060/dingtalk/webhook/send
    send_resolved: true
# Inhibition
# For alerts with the same alertname, cluster and service,
# a firing critical alert suppresses the matching warning alert
inhibit_rules:
- equal: ['alertname', 'cluster', 'service']
  source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
3.6 Create the Alertmanager service
cat > /usr/lib/systemd/system/alertmanager.service <<EOF
[Unit]
Description=Alertmanager
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/prometheus/alertmanager/alertmanager --web.external-url=http://example.com:9093 --config.file=/usr/local/prometheus/alertmanager/alertmanager.yml --storage.path=/data/prometheus/alertmanager/data
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
3.7 Start the service
systemctl daemon-reload
systemctl enable alertmanager.service
systemctl start alertmanager.service
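Once Alertmanager is running, the whole notification path (routing, the webhook receiver and DingTalk) can be exercised by posting a hand-made alert to the v2 API. The sketch below only builds the payload; the alert name, severity and summary text are test values, and test_alert_json is a hypothetical helper.

```shell
# Build a one-alert JSON payload for the Alertmanager v2 API.
# Alert name and severity are caller-supplied test values.
test_alert_json() {
    alertname="$1"
    severity="$2"
    printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"manual test alert"}}]\n' \
        "$alertname" "$severity"
}

# Usage (assumes Alertmanager listens on localhost:9093):
# test_alert_json TestAlert warning |
#     curl -s -X POST -H 'Content-Type: application/json' --data @- \
#         http://localhost:9093/api/v2/alerts
```

The notification should arrive after the group_wait period of the matching route.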
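Configure nginx. Use a server_name (or listen port) different from the Prometheus server block in 1.9, because nginx ignores a second server block with the same name and port; alertmanager.example.com below is a placeholder:
server {
    listen 80;
    server_name alertmanager.example.com;
    location / {
        auth_basic "Prometheus";
        auth_basic_user_file ".htpasswd";
        proxy_pass http://localhost:9093/;
    }
}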
3.8 Running in a container
docker run -d \
-p 9093:9093 \
-v "/prom/alertmanager.yml:/etc/alertmanager/alertmanager.yml" \
prom/alertmanager
4 Blackbox_exporter
4.1 Download
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.17.0/blackbox_exporter-0.17.0.linux-amd64.tar.gz
tar zxf blackbox_exporter-0.17.0.linux-amd64.tar.gz
mv blackbox_exporter-0.17.0.linux-amd64 /usr/local/prometheus/blackbox_exporter
4.2 Configuration
Edit blackbox.yml:
vi blackbox.yml
modules:
  http_2xx:  # HTTP probe module; every probe in blackbox_exporter is configured as a module
    prober: http
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2"]
      valid_status_codes: [200]  # best to state the expected status code explicitly; it makes Grafana graphs clearer
      method: GET
      preferred_ip_protocol: "ip4"
  http_post_2xx:  # HTTP POST probe module
    prober: http
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2"]
      method: POST
      preferred_ip_protocol: "ip4"
Edit the Prometheus configuration and add jobs that use file-based service discovery.
metrics_path defaults to /metrics in the source code, so remember to change it to /probe.
  - job_name: "blackbox-http"
    metrics_path: /probe  # /probe, not /metrics
    params:
      module: [http_2xx]  # use the http_2xx module
    file_sd_configs:
    - refresh_interval: 1m
      files:
      - targets/blackbox/http_2xx.yml
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 127.0.0.1:9115  # blackbox_exporter address
  - job_name: "blackbox-http-post"
    metrics_path: /probe  # /probe, not /metrics
    params:
      module: [http_post_2xx]  # use the http_post_2xx module
    file_sd_configs:
    - refresh_interval: 1m
      files:
      - targets/blackbox/http_post_2xx.yml
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 127.0.0.1:9115  # blackbox_exporter address
Create the file targets/blackbox/http_2xx.yml:
- targets:
  - https://baidu.com
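The relabeling above makes Prometheus call blackbox_exporter as /probe?module=...&target=..., and the same request can be made by hand to debug a target. The sketch below reads the probe output and reports whether probe_success is 1; probe_succeeded is a hypothetical helper, and the address 127.0.0.1:9115 is taken from the job configuration.

```shell
# Read blackbox_exporter /probe output on stdin.
# Exit 0 only if a probe_success sample is present and equal to 1.
probe_succeeded() {
    awk '$1 == "probe_success" { found = 1; ok = ($2 == 1) }
         END { exit !(found && ok) }'
}

# Usage (blackbox_exporter must be running on 127.0.0.1:9115):
# curl -s 'http://127.0.0.1:9115/probe?module=http_2xx&target=https://baidu.com' |
#     probe_succeeded && echo "probe OK"
```

Appending &debug=true to the probe URL makes blackbox_exporter print the probe log, which helps when the result is unexpectedly 0.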
4.3 Create the systemd service
cat > /usr/lib/systemd/system/blackbox.service <<EOF
[Unit]
Description=blackbox_exporter
After=network.target
[Service]
User=root
Type=simple
ExecStart=/usr/local/prometheus/blackbox_exporter/blackbox_exporter --config.file=/usr/local/prometheus/blackbox_exporter/blackbox.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
4.4 Start the service
systemctl daemon-reload
systemctl start blackbox.service
systemctl enable blackbox.service
4.5 Reload Prometheus
curl -X POST "http://127.0.0.1:9090/-/reload"
4.6 Grafana dashboard
Import https://grafana.com/grafana/dashboards/9965
4.7 Alert configuration
Create the file rules/blackbox_exporter.yml:
groups:
- name: blackbox_network_stats
  rules:
  - alert: blackbox_network_stats
    expr: probe_success == 0
    for: 1m  # fire if the value stays at 0 for 1 minute
    labels:
      severity: critical
    annotations:
      description: 'Site/endpoint {{ $labels.instance }} in job {{ $labels.job }} has been down for more than one minute.'
      summary: 'Site/endpoint {{ $labels.instance }} is down!'
  - alert: BlackboxProbeHttpFailure
    expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Blackbox probe HTTP failure (instance {{ $labels.instance }})"
      description: "HTTP status code is not 200-399\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
  - alert: BlackboxSslCertificateWillExpireSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})"
      description: "SSL certificate expires in 30 days\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
5 Other notes
To delete the metrics of a particular job or instance, use the commands below (they require --web.enable-admin-api):
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="kubernetes"}'
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="10.244.2.158:9090"}'
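The delete calls above only mark series as deleted; disk space is reclaimed at a later compaction, or right away through the clean_tombstones admin endpoint. The sketch below builds the delete URL for an arbitrary selector and shows both calls in the usage comments; delete_series_url is a hypothetical helper.

```shell
# Build the delete_series URL for a given base URL and series selector.
delete_series_url() {
    base_url="$1"
    selector="$2"
    printf '%s/api/v1/admin/tsdb/delete_series?match[]=%s\n' "$base_url" "$selector"
}

# Usage (-g stops curl from interpreting [] and {} as glob patterns):
# curl -X POST -g "$(delete_series_url http://localhost:9090 '{job="kubernetes"}')"
# curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones
```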