Prometheus Monitoring Setup (continuously updated)

Prometheus, a Cloud Native Computing Foundation project, is a systems and service monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts when specified conditions are observed.

The features that distinguish Prometheus from other metrics and monitoring systems are:

  • A multi-dimensional data model (time series defined by metric name and set of key/value dimensions)
  • PromQL, a powerful and flexible query language to leverage this dimensionality
  • No dependency on distributed storage; single server nodes are autonomous
  • An HTTP pull model for time series collection
  • Pushing time series is supported via an intermediary gateway for batch jobs
  • Targets are discovered via service discovery or static configuration
  • Multiple modes of graphing and dashboarding support
  • Support for hierarchical and horizontal federation

bash

wget https://github.com/prometheus/prometheus/releases/download/v2.14.0/prometheus-2.14.0.linux-amd64.tar.gz

tar zxf prometheus-2.14.0.linux-amd64.tar.gz
mv prometheus-2.14.0.linux-amd64 /usr/local/prometheus

bash

groupadd --system prometheus
useradd --system -g prometheus -s /sbin/nologin -c "Prometheus Monitoring System" prometheus

bash

chown -R prometheus:prometheus /usr/local/prometheus

bash

mkdir -p /data/prometheus
chown -R prometheus:prometheus /data/prometheus

bash

cat > /usr/lib/systemd/system/prometheus.service <<EOF
[Unit]
Description=Prometheus
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/prometheus/prometheus \
--config.file=/usr/local/prometheus/prometheus.yml \
--storage.tsdb.path=/data/prometheus \
--storage.tsdb.retention=30d \
--storage.tsdb.retention.size=512MB \
--web.enable-admin-api \
--web.enable-lifecycle \
--web.external-url=http://monitor.example.com
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF

If Type is set to notify, the service restarts endlessly, because Prometheus never sends the systemd readiness notification; use simple.

--storage.tsdb.path is optional; by default, data is stored in ./data under the working directory.

--storage.tsdb.retention sets how long data is kept (deprecated in newer releases in favor of --storage.tsdb.retention.time).

--storage.tsdb.retention.size is the maximum number of bytes the storage blocks may use (note that this does not include the WAL, which can be large). The oldest data is deleted first. Defaults to 0, i.e. disabled. This flag is experimental and may change in future releases. Supported units: B, KB, MB, GB, TB, PB. Example: "512MB".

--web.enable-admin-api enables access to the admin API.

--web.enable-lifecycle enables hot-reloading the configuration remotely.

--web.external-url is the externally reachable address of the Prometheus host; if it is omitted, the GeneratorURL in alerts will be wrong.

A recommended site with many ready-made alerting rules: https://awesome-prometheus-alerts.grep.to/

bash

vi rules/basis.yml

yaml

groups:
- name: 主机状态-监控告警
  rules:
  - alert: 主机状态
    expr: up == 0
    for: 1m
    labels:
      status: 非常严重
    annotations:
      summary: "{{$labels.instance}}:服务器宕机"
      description: "{{$labels.instance}}:服务器延时超过5分钟"
  
  - alert: CPU使用情况
    expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance) * 100) > 60
    for: 1m
    labels:
      status: 一般告警
    annotations:
      summary: "{{$labels.instance}} CPU使用率过高!"
      description: "{{$labels.instance}} CPU使用大于60%(目前使用:{{$value}}%)"

  - alert: 内存使用
    expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
    for: 1m
    labels:
      status: 严重告警
    annotations:
      summary: "{{$labels.instance}} 内存使用率过高!"
      description: "{{$labels.instance}} 内存使用大于80%(目前使用:{{$value}}%)"

  - alert: IO性能
    expr: (avg(irate(node_disk_io_time_seconds_total[1m])) by(instance) * 100) > 60
    for: 1m
    labels:
      status: 严重告警
    annotations:
      summary: "{{$labels.instance}} 磁盘IO使用率过高!"
      description: "{{$labels.instance}} 磁盘IO使用大于60%(目前使用:{{$value}}%)"

  - alert: 网络流入
    expr: ((sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo.*'}[5m])) by (instance)) / 100) > 102400
    for: 1m
    labels:
      status: 严重告警
    annotations:
      summary: "{{$labels.instance}} 流入网络带宽过高!"
      description: "{{$labels.instance}} 流入网络带宽持续1分钟高于100M. RX带宽使用率{{$value}}"

  - alert: 网络流出
    expr: ((sum(rate(node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo.*'}[5m])) by (instance)) / 100) > 102400
    for: 1m
    labels:
      status: 严重告警
    annotations:
      summary: "{{$labels.instance}} 流出网络带宽过高!"
      description: "{{$labels.instance}} 流出网络带宽持续1分钟高于100M. TX带宽使用率{{$value}}"

  - alert: TCP会话
    expr: node_netstat_Tcp_CurrEstab > 1000
    for: 1m
    labels:
      status: 严重告警
    annotations:
      summary: "{{$labels.instance}} TCP_ESTABLISHED过高!"
      description: "{{$labels.instance}} TCP_ESTABLISHED大于1000(目前:{{$value}})"

  - alert: 磁盘容量
    expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
    for: 1m
    labels:
      status: 严重告警
    annotations:
      summary: "{{$labels.mountpoint}} 磁盘分区使用率过高!"
      description: "{{$labels.mountpoint}} 磁盘分区使用大于80%(目前使用:{{$value}}%)"

yaml

groups:
- name: Windows主机状态-监控告警
  rules:
  - alert: WindowsServerCollectorError
    expr: windows_exporter_collector_success == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Windows Server collector Error (instance {{ $labels.instance }})
      description: Collector {{ $labels.collector }} was not successful\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}

  - alert: WindowsServerServiceStatus
    expr: windows_service_status{status="ok"} != 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Windows Server service Status (instance {{ $labels.instance }})
      description: Windows Service state is not OK\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}

  - alert: WindowsServerCpuUsage
    expr: 100 - (avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[2m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Windows Server CPU Usage (instance {{ $labels.instance }})
      description: CPU Usage is more than 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}

  - alert: WindowsServerMemoryUsage
    expr: 100 - ((windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100) > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Windows Server memory Usage (instance {{ $labels.instance }})
      description: Memory usage is more than 90%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}

  - alert: WindowsServerDiskSpaceUsage
    expr: 100.0 - 100 * ((windows_logical_disk_free_bytes / 1024 / 1024 ) / (windows_logical_disk_size_bytes / 1024 / 1024)) > 80
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Windows Server disk Space Usage (instance {{ $labels.instance }})
      description: Disk usage is more than 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}

  - alert: 网络流入
    expr: (irate(windows_net_bytes_received_total{nic!~'isatap.*|VPN.*'}[5m]) * 8 / 1000) > 5120
    for: 1m
    labels:
      status: 严重告警
    annotations:
      summary: "{{$labels.instance}} 流入(下载)网络带宽过高!"
      description: "{{$labels.instance}} 流入(下载)网络带宽持续1分钟高于5M. RX带宽使用率{{$value}}"

  - alert: 网络流出
    expr: (irate(windows_net_bytes_sent_total{nic!~'isatap.*|VPN.*'}[5m]) * 8 / 1000) > 5120
    for: 1m
    labels:
      status: 严重告警
    annotations:
      summary: "{{$labels.instance}} 流出(上传)网络带宽过高!"
      description: "{{$labels.instance}} 流出(上传)网络带宽持续1分钟高于5M. TX带宽使用率{{$value}}"
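The rules above assume Windows hosts are already being scraped. A scrape job sketch for prometheus.yml (the target address is a placeholder; windows_exporter listens on port 9182 by default):

```yaml
  - job_name: 'windows'
    static_configs:
    - targets:
      - '192.0.2.10:9182'  # placeholder Windows host running windows_exporter
```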

yaml

groups:
- name: blackbox_network_stats
  rules:
  - alert: blackbox_network_stats
    expr: probe_success == 0
    for: 1m  # fire if the probe has been failing for 1 minute
    labels:
      severity: critical
    annotations:
      description: 'Job {{ $labels.job }} 中的网站/接口 {{ $labels.instance }} 已经down掉超过一分钟.'
      summary: '网站/接口 {{ $labels.instance }} down ! ! !'

  - alert: BlackboxProbeHttpFailure
    expr: probe_http_status_code <= 199 or probe_http_status_code >= 400
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Blackbox probe HTTP failure (instance {{ $labels.instance }})
      description: HTTP status code is not 200-399\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}

  - alert: BlackboxSslCertificateWillExpireSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
      description: SSL certificate expires in 30 days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}

Check that the rules are valid:

bash

./promtool check rules rules/basis.yml
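Beyond syntax checking, promtool can also unit-test rule behaviour. A sketch of a test file (the file name rules/basis_test.yml and the sample instance are assumptions) that feeds three "down" samples to the 主机状态 rule and expects it to fire:

```yaml
# rules/basis_test.yml -- run with: ./promtool test rules rules/basis_test.yml
rule_files:
  - basis.yml          # the rule file under test

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'up{job="node", instance="192.0.2.10:9100"}'
        values: '0 0 0'  # target down at t=0m, 1m, 2m
    alert_rule_test:
      - eval_time: 2m    # the rule has "for: 1m", so it is firing by now
        alertname: 主机状态
        exp_alerts:
          - exp_labels:
              status: 非常严重
              job: node
              instance: 192.0.2.10:9100
            exp_annotations:
              summary: "192.0.2.10:9100:服务器宕机"
              description: "192.0.2.10:9100:服务器延时超过5分钟"
```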

bash

vi prometheus.yml

yaml

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
       - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"

bash

systemctl enable prometheus.service
systemctl start prometheus.service

Prometheus itself provides no authentication. With Nginx as a reverse proxy, however, HTTP Basic Auth is easy to add.

Then, in the /usr/local/nginx/conf/ directory (adjust the path if your Nginx configuration lives elsewhere), create a user file with the htpasswd tool from apache2-utils, supplying a username and password:

bash

htpasswd -c /usr/local/nginx/conf/.htpasswd admin
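If apache2-utils is not available, an htpasswd-compatible entry can also be produced with openssl (a sketch; "admin" and "s3cret" are placeholder credentials):

```shell
# generate an APR1 (htpasswd-style MD5) password hash; the salt is random each run
hash=$(openssl passwd -apr1 's3cret')

# print the user:hash line in the format nginx's auth_basic_user_file expects
printf '%s:%s\n' admin "$hash"
```

Append the printed line to /usr/local/nginx/conf/.htpasswd.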

Configure Nginx:

server {
    listen 80;
    server_name monitor.example.com;

    location / {
        auth_basic "Prometheus";
        auth_basic_user_file ".htpasswd";
        proxy_pass http://localhost:9090/;
    }
}

Browse to http://monitor.example.com and log in with the account created above.

A discovery mechanism makes it easy to add and remove monitored resources dynamically. Zabbix, for example, can automatically discover hosts and resources; Prometheus, a monitoring system on par with Zabbix, naturally has its own service discovery mechanisms.

file_sd_configs can be used to add and remove targets dynamically.

Edit the Prometheus configuration file:

yaml

  - job_name: 'node'
    file_sd_configs:
    - refresh_interval: 1m
      files:
      - targets/nodes/*.yml

Create the file to be scanned, targets/nodes/nodes.yml:

yaml

- targets:
  - '172.19.179.239:9100'
  - '172.19.179.240:9100'
  - '172.19.179.244:9100'
  - '172.19.179.253:9100'
  - '172.19.179.254:9100'
  labels:
    server: linux

Consul is an open-source tool written in Go that provides service registration, service discovery, and configuration management for distributed, service-oriented systems. It offers service registration/discovery, health checks, Key/Value storage, multi-datacenter support, and distributed consistency guarantees. So far, adding a new target to Prometheus has meant changing configuration on the server; even with file_sd_configs you still have to log in and edit the corresponding file, which is tedious. Prometheus officially supports several service discovery mechanisms, Consul among them.

Consul-based discovery requires a running Consul service.

Edit the Prometheus configuration file:

yaml

  - job_name: 'consul-prometheus'
    consul_sd_configs:
    - server: '172.30.12.167:8500'
      services: []  
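With services: [] every service registered in Consul is scraped. A relabel_configs sketch (assuming the relevant services carry a "prometheus" tag in Consul, which is an assumption of this example) that keeps only tagged services, using the __meta_consul_tags label exposed by Consul SD:

```yaml
  - job_name: 'consul-prometheus'
    consul_sd_configs:
    - server: '172.30.12.167:8500'
      services: []
    relabel_configs:
    # __meta_consul_tags looks like ",tag1,tag2,"; keep only targets tagged "prometheus"
    - source_labels: [__meta_consul_tags]
      regex: '.*,prometheus,.*'
      action: keep
```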

bash

docker run -d \
-p 9090:9090 \
-v "/prom/prometheus.yml:/etc/prometheus/prometheus.yml" \
-v "/prom/rules:/etc/prometheus/rules" \
-v "/prom/targets:/etc/prometheus/targets" \
prom/prometheus

node_exporter is an exporter of machine metrics for *NIX and Linux systems.

node_exporter exposes metrics to Prometheus, including CPU load, memory usage, network statistics, and more.

bash

wget https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz

tar zxf node_exporter-0.18.1.linux-amd64.tar.gz
mv node_exporter-0.18.1.linux-amd64 /usr/local/node_exporter

bash

cat > /usr/lib/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
After=network.target

[Service]
ExecStart=/usr/local/node_exporter/node_exporter

[Install]
WantedBy=multi-user.target
EOF

bash

systemctl enable node_exporter.service
systemctl start node_exporter.service

Add the node_exporter targets under scrape_configs and restart Prometheus.

yaml

  - job_name: 'node'
    static_configs:
    - targets:
      - '172.19.179.239:9100'
      - '172.19.179.240:9100'

bash

docker run -d \
--net="host" \
--pid="host" \
-v "/:/host:ro,rslave" \
prom/node-exporter \
--path.rootfs=/host \
--collector.filesystem.ignored-mount-points="^/(sys|proc|dev|host|etc)($|/)"

bash

wget https://github.com/prometheus/alertmanager/releases/download/v0.20.0/alertmanager-0.20.0.linux-amd64.tar.gz

tar zxf alertmanager-0.20.0.linux-amd64.tar.gz
mv alertmanager-0.20.0.linux-amd64 /usr/local/prometheus/alertmanager

bash

wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v1.4.0/prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz

tar zxf prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
mv prometheus-webhook-dingtalk-1.4.0.linux-amd64 /usr/local/prometheus/prometheus-webhook-dingtalk

Configure config.yml:

yaml

templates:
  - contrib/templates/legacy/dingtalk.tmpl

targets:
  webhook:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
    secret: SEC000000000000000000000

Configure the message template.

A few ready-made templates are available here:

https://github.com/bwcxyk/config_file/tree/master/prometheus/alertmanager/dingtalk/templates

bash

cat > /usr/lib/systemd/system/prometheus-webhook-dingtalk.service <<EOF
[Unit]
Description=prometheus-webhook-dingtalk
After=network-online.target

[Service]
Restart=on-failure
ExecStart=/usr/local/prometheus/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk \
 --config.file=/usr/local/prometheus/prometheus-webhook-dingtalk/config.yml \
 --web.enable-lifecycle

[Install]
WantedBy=multi-user.target
EOF

bash

systemctl enable prometheus-webhook-dingtalk.service
systemctl start prometheus-webhook-dingtalk.service

bash

vi alertmanager.yml

yaml

global:
  resolve_timeout: 5m
# routing tree: root node
route:
  receiver: webhook
  # grouping dimension
  group_by: [alertname]
  # wait 30s before sending a newly created group, to batch alerts
  group_wait: 30s
  # when a new alert joins an existing group, wait 5m between sends
  group_interval: 5m
  # wait 3h before re-sending an already-delivered alert
  repeat_interval: 3h
  routes:
  - receiver: webhook
    group_wait: 10s
# receivers
receivers:
- name: webhook
  webhook_configs:
  - url: http://localhost:8060/dingtalk/webhook/send
    send_resolved: true
# inhibition:
# among alerts with identical alertname, cluster, and service,
# a firing critical alert suppresses the matching warning alerts
inhibit_rules:
- equal: ['alertname', 'cluster', 'service']
  source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'

bash

cat > /usr/lib/systemd/system/alertmanager.service <<EOF
[Unit]
Description=Alertmanager
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/prometheus/alertmanager/alertmanager --web.external-url=http://example.com:9093 --config.file=/usr/local/prometheus/alertmanager/alertmanager.yml --storage.path=/data/prometheus/alertmanager/data
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

bash

systemctl enable alertmanager.service
systemctl start alertmanager.service

Configure Nginx:

server {
    listen 80;
    server_name alert.example.com;

    location / {
        auth_basic "Alertmanager";
        auth_basic_user_file ".htpasswd";
        proxy_pass http://localhost:9093/;
    }
}

bash

docker run -d \
-p 9093:9093 \
-v "/prom/alertmanager.yml:/etc/alertmanager/alertmanager.yml" \
prom/alertmanager

bash

wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.17.0/blackbox_exporter-0.17.0.linux-amd64.tar.gz

tar zxf blackbox_exporter-0.17.0.linux-amd64.tar.gz
mv blackbox_exporter-0.17.0.linux-amd64 /usr/local/prometheus/blackbox_exporter

Edit the blackbox.yml file:

bash

vi blackbox.yml

yaml

modules:
  http_2xx:  # HTTP GET probe; every Blackbox Exporter probe is configured as a module
    prober: http
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]  # pinning the expected status code makes Grafana graphs clearer
      method: GET
      preferred_ip_protocol: "ip4"
  http_post_2xx: # HTTP POST probe
    prober: http
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      method: POST
      preferred_ip_protocol: "ip4"
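Other probers follow the same module pattern. For example, a TCP connect module (a sketch, not part of the original file) checks bare port reachability, which is useful for databases and other non-HTTP services:

```yaml
  tcp_connect:  # succeeds if a TCP connection to the target's port can be established
    prober: tcp
    timeout: 10s
```

Targets are then probed with module=tcp_connect and an address of the form host:port.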

Modify the Prometheus configuration to add jobs that use file-based discovery.

Note that metrics_path defaults to /metrics in the source code; it must be changed here.

yaml

  - job_name: "blackbox-http"
    metrics_path: /probe  # /probe, not /metrics
    params:
      module: [http_2xx]  # use the http_2xx module
    file_sd_configs:
    - refresh_interval: 1m
      files:
      - targets/blackbox/http_2xx.yml
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 127.0.0.1:9115  # address of the blackbox_exporter service

  - job_name: "blackbox-http-post"
    metrics_path: /probe  # /probe, not /metrics
    params:
      module: [http_post_2xx]  # use the http_post_2xx module
    file_sd_configs:
    - refresh_interval: 1m
      files:
      - targets/blackbox/http_post_2xx.yml
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 127.0.0.1:9115  # address of the blackbox_exporter service

Create the targets/blackbox/http_2xx.yml file:

yaml

- targets:
  - https://baidu.com
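The blackbox-http-post job reads targets/blackbox/http_post_2xx.yml in the same format; the URL below is only a placeholder:

```yaml
- targets:
  - 'https://example.com/api/login'  # placeholder endpoint probed with POST
```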

bash

cat > /usr/lib/systemd/system/blackbox.service <<EOF
[Unit]
Description=blackbox_exporter
After=network.target

[Service]
User=root
Type=simple
ExecStart=/usr/local/prometheus/blackbox_exporter/blackbox_exporter --config.file=/usr/local/prometheus/blackbox_exporter/blackbox.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

bash

systemctl daemon-reload
systemctl start blackbox.service
systemctl enable blackbox.service

bash

curl -X POST "http://127.0.0.1:9090/-/reload"

Import the dashboard https://grafana.com/grafana/dashboards/9965 into Grafana.

Create the rules/blackbox_exporter.yml file:

yaml

groups:
- name: blackbox_network_stats
  rules:
  - alert: blackbox_network_stats
    expr: probe_success == 0
    for: 1m  # fire if the probe has been failing for 1 minute
    labels:
      severity: critical
    annotations:
      description: 'Job {{ $labels.job }} 中的网站/接口 {{ $labels.instance }} 已经down掉超过一分钟.'
      summary: '网站/接口 {{ $labels.instance }} down ! ! !'

  - alert: BlackboxProbeHttpFailure
    expr: probe_http_status_code <= 199 or probe_http_status_code >= 400
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Blackbox probe HTTP failure (instance {{ $labels.instance }})
      description: HTTP status code is not 200-399\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}

  - alert: BlackboxSslCertificateWillExpireSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
      description: SSL certificate expires in 30 days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}
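The SSL rule is plain epoch arithmetic: probe_ssl_earliest_cert_expiry is a Unix timestamp, so the alert fires once the certificate is less than 86400 × 30 seconds (30 days) away. A shell sketch of the same comparison, using a hypothetical expiry time:

```shell
# threshold used by the rule: 30 days expressed in seconds
threshold=$((86400 * 30))

now=$(date +%s)
expiry=$((now + 86400 * 10))  # hypothetical certificate expiring in 10 days

# same comparison as: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
if [ $((expiry - now)) -lt "$threshold" ]; then
  echo "alert would fire: certificate expires in $(( (expiry - now) / 86400 )) days"
fi
```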

To delete the metric data of certain jobs or instances, use the commands below (this requires the --web.enable-admin-api flag enabled earlier). Deleted series are only removed from disk after the clean_tombstones endpoint is called.

bash

curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="kubernetes"}'
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="10.244.2.158:9090"}'
curl -X POST 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'

Reference: Prometheus — deleting metric data.