Configuring Prometheus

Download

wget https://github.com/prometheus/prometheus/releases/download/v2.14.0/prometheus-2.14.0.linux-amd64.tar.gz
tar zxf prometheus-2.14.0.linux-amd64.tar.gz
mv prometheus-2.14.0.linux-amd64 /usr/local/prometheus

Create the user

groupadd --system prometheus
useradd --system -g prometheus -s /sbin/nologin -c "Prometheus Monitoring System" prometheus

Set ownership

chown -R prometheus:prometheus /usr/local/prometheus

Create the data directory

mkdir -p /data/prometheus
chown -R prometheus:prometheus /data/prometheus
Create the Prometheus service

cat > /usr/lib/systemd/system/prometheus.service <<'EOF'
[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/prometheus/prometheus \
  --config.file=/usr/local/prometheus/prometheus.yml \
  --storage.tsdb.path=/data/prometheus \
  --storage.tsdb.retention=30d \
  --storage.tsdb.retention.size=512MB \
  --web.enable-admin-api \
  --web.enable-lifecycle \
  --web.external-url=http://monitor.example.com
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
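If Type is set to notify, the service restarts continuously, so use simple.

--storage.tsdb.path is optional; by default the data directory is ./data under the working directory.

--storage.tsdb.retention sets how long data is kept. Since Prometheus 2.7 this flag is deprecated in favor of --storage.tsdb.retention.time, although it is still accepted.

--storage.tsdb.retention.size sets the maximum number of bytes that storage blocks may use (note that this does not include the WAL, which can be substantial). The oldest data is removed first. It defaults to 0, meaning disabled. This flag is experimental and may change in future releases. Supported units: KB, MB, GB, PB. Example: "512MB".

--web.enable-admin-api enables access to the admin API.

--web.enable-lifecycle enables reloading the configuration file remotely over HTTP.

--web.external-url=http://localhost:9090/ is the externally reachable address of the Prometheus host. If it is omitted, the GeneratorURL in alerts will be wrong.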
Create the alert rule files

A recommended site with many ready-made alert rules: https://awesome-prometheus-alerts.grep.to/

Linux server liveness alerts

groups:
- name: host-status-alerts
  rules:
  - alert: HostStatus
    expr: up == 0
    for: 1m
    labels:
      status: critical
    annotations:
      summary: "{{ $labels.instance }}: server is down"
      description: "{{ $labels.instance }}: server has been unreachable for more than 1 minute"
  - alert: CpuUsage
    expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
    for: 1m
    labels:
      status: warning
    annotations:
      summary: "{{ $labels.instance }} CPU usage too high!"
      description: "{{ $labels.instance }} CPU usage is above 60% (current: {{ $value }}%)"
  - alert: MemoryUsage
    expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{ $labels.instance }} memory usage too high!"
      description: "{{ $labels.instance }} memory usage is above 80% (current: {{ $value }}%)"
  - alert: DiskIO
    # Fires when the average disk busy time exceeds 60%, matching the description.
    expr: (avg(irate(node_disk_io_time_seconds_total[1m])) by (instance) * 100) > 60
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{ $labels.instance }} disk I/O utilization too high!"
      description: "{{ $labels.instance }} disk I/O utilization is above 60% (current: {{ $value }})"
  - alert: Network
    expr: ((sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) > 102400
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{ $labels.instance }} inbound network bandwidth too high!"
      description: "{{ $labels.instance }} inbound network bandwidth is persistently above 100M. RX bandwidth: {{ $value }}"
  - alert: Network
    expr: ((sum(rate(node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 100) > 102400
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{ $labels.instance }} outbound network bandwidth too high!"
      description: "{{ $labels.instance }} outbound network bandwidth is persistently above 100M. TX bandwidth: {{ $value }}"
  - alert: TcpSessions
    expr: node_netstat_Tcp_CurrEstab > 1000
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{ $labels.instance }} too many TCP_ESTABLISHED connections!"
      description: "{{ $labels.instance }} TCP_ESTABLISHED is above 1000 (current: {{ $value }})"
  - alert: DiskCapacity
    expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{ $labels.mountpoint }} disk partition usage too high!"
      description: "{{ $labels.mountpoint }} disk partition usage is above 80% (current: {{ $value }}%)"
Windows server liveness alerts

groups:
- name: windows-host-status-alerts
  rules:
  - alert: WindowsServerCollectorError
    expr: windows_exporter_collector_success == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Windows Server collector Error (instance {{ $labels.instance }})"
      description: "Collector {{ $labels.collector }} was not successful\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: WindowsServerServiceStatus
    expr: windows_service_status{status="ok"} != 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Windows Server service Status (instance {{ $labels.instance }})"
      description: "Windows Service state is not OK\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: WindowsServerCpuUsage
    expr: 100 - (avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[2m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Windows Server CPU Usage (instance {{ $labels.instance }})"
      description: "CPU Usage is more than 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: WindowsServerMemoryUsage
    expr: 100 - ((windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100) > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Windows Server memory Usage (instance {{ $labels.instance }})"
      description: "Memory usage is more than 90%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: WindowsServerDiskSpaceUsage
    expr: 100.0 - 100 * ((windows_logical_disk_free_bytes / 1024 / 1024) / (windows_logical_disk_size_bytes / 1024 / 1024)) > 80
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Windows Server disk Space Usage (instance {{ $labels.instance }})"
      description: "Disk usage is more than 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: Network
    expr: (irate(windows_net_bytes_received_total{nic!~'isatap.*|VPN.*'}[5m]) * 8 / 1000) > 5120
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{ $labels.instance }} inbound (download) network bandwidth too high!"
      description: "{{ $labels.instance }} inbound (download) network bandwidth is persistently above 5M. RX bandwidth: {{ $value }}"
  - alert: Network
    expr: (irate(windows_net_bytes_sent_total{nic!~'isatap.*|VPN.*'}[5m]) * 8 / 1000) > 5120
    for: 1m
    labels:
      status: severe
    annotations:
      summary: "{{ $labels.instance }} outbound (upload) network bandwidth too high!"
      description: "{{ $labels.instance }} outbound (upload) network bandwidth is persistently above 5M. TX bandwidth: {{ $value }}"
HTTP monitoring alerts

groups:
- name: blackbox_network_stats
  rules:
  - alert: blackbox_network_stats
    expr: probe_success == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      description: 'Site/endpoint {{ $labels.instance }} in job {{ $labels.job }} has been down for more than one minute.'
      summary: 'Site/endpoint {{ $labels.instance }} is down!'
  - alert: BlackboxProbeHttpFailure
    expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Blackbox probe HTTP failure (instance {{ $labels.instance }})"
      description: "HTTP status code is not 200-399\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: BlackboxSslCertificateWillExpireSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})"
      description: "SSL certificate expires in 30 days\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
Check that the rules are valid

./promtool check rules rules/basis.yml
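
With several rule files in place, checking them one at a time gets tedious. The following is a minimal sketch that runs promtool over every .yml file in a directory; the function name check_all_rules, the PROMTOOL variable, and the directory layout are assumptions for illustration, not part of the original setup.

```shell
# Sketch: validate every rule file in a directory with promtool.
# PROMTOOL defaults to the install path used in this guide (an assumption; adjust as needed).
PROMTOOL="${PROMTOOL:-/usr/local/prometheus/promtool}"

check_all_rules() {
  dir="$1"
  failed=0
  for f in "$dir"/*.yml; do
    [ -e "$f" ] || continue          # skip the literal glob when nothing matches
    if ! "$PROMTOOL" check rules "$f"; then
      echo "invalid rule file: $f" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Usage: check_all_rules /usr/local/prometheus/rules
```

The function returns non-zero if any file fails, so it can gate a reload or restart of Prometheus.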
Update the Prometheus configuration

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

rule_files:
  - "rules/*.yml"

Start

systemctl enable prometheus.service
systemctl start prometheus.service
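Configuring an nginx proxy with HTTP Basic Auth

Prometheus does not provide any authentication support of its own. With Nginx in front of it as a reverse proxy, however, HTTP Basic Auth is easy to add.

In /usr/local/nginx/conf/ (adjust the path if your Nginx configuration directory is elsewhere), use the htpasswd tool from apache2-utils to create a user file. You pass the user name on the command line and are prompted for the password:

htpasswd -c /usr/local/nginx/conf/.htpasswd admin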
Configure nginx

server {
    listen 80;
    server_name monitor.example.com;
    location / {
        auth_basic "Prometheus";
        auth_basic_user_file ".htpasswd";
        proxy_pass http://localhost:9090/;
    }
}
Visit http://monitor.example.com (the nginx proxy, not port 9090 directly) and enter the user name and password.
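
To see what the browser actually sends, the sketch below builds the Authorization header used by HTTP Basic Auth, which is the string "Basic " followed by the base64 encoding of user:password. The function name basic_auth_header and the password "secret" are hypothetical.

```shell
# Sketch: build the header sent by HTTP Basic Auth.
# Arguments: <user> <password>
basic_auth_header() {
  printf 'Authorization: Basic %s\n' "$(printf '%s:%s' "$1" "$2" | base64)"
}

# Checking the proxy from the command line (host from the nginx config; password is hypothetical):
#   curl -s -o /dev/null -w '%{http_code}\n' http://monitor.example.com/graph
#   curl -s -o /dev/null -w '%{http_code}\n' -u admin:secret http://monitor.example.com/graph
```

Without credentials nginx should answer 401; with valid credentials the request is passed through to Prometheus.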
Prometheus service discovery

Service discovery makes it easy to add and remove monitored resources dynamically. Zabbix, for example, can discover hosts and resources automatically. Prometheus, a monitoring system on par with Zabbix, has its own discovery mechanisms.

file_sd_configs

file_sd_configs can be used to add and remove targets dynamically.

Edit the Prometheus configuration file:

- job_name: 'node'
  file_sd_configs:
  - refresh_interval: 1m
    files:
    - targets/nodes/*.yml

Create the scanned file nodes.yml:

- targets:
  - '172.19.179.239:9100'
  - '172.19.179.240:9100'
  - '172.19.179.244:9100'
  - '172.19.179.253:9100'
  - '172.19.179.254:9100'
  labels:
    server: linux
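
Because Prometheus re-reads these files on its own, a target file can be generated rather than edited by hand. The sketch below prints a target group in the same shape as nodes.yml; the function name make_targets is an assumption for illustration.

```shell
# Sketch: print a file_sd target group.
# Arguments: <server-label> <host:port>...
make_targets() {
  label="$1"; shift
  echo "- targets:"
  for t in "$@"; do
    printf "  - '%s'\n" "$t"
  done
  echo "  labels:"
  echo "    server: $label"
}

# Usage:
#   make_targets linux 172.19.179.239:9100 172.19.179.240:9100 > targets/nodes/nodes.yml
```

Writing to a temporary file and renaming it into place avoids Prometheus reading a half-written file.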
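consul_sd_configs

Consul is an open-source tool written in Go that provides service registration, service discovery, and configuration management for distributed, service-oriented systems. Its features include service registration and discovery, health checks, a key/value store, multi-datacenter support, and distributed consistency guarantees. With the setups above, adding a new target means changing configuration on the server; even with file_sd_configs you still have to log in and edit the target file, which is tedious. Prometheus officially supports several service discovery mechanisms, and Consul is one of them.

consul_sd_configs requires a running Consul service.

Edit the Prometheus configuration file:

- job_name: 'consul-prometheus'
  consul_sd_configs:
  - server: '172.30.12.167:8500'
    services: []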
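
For Prometheus to discover anything, targets first have to be registered in Consul. A minimal sketch of registering a node_exporter instance through Consul's agent HTTP API (PUT /v1/agent/service/register) follows; the function name consul_service_json and the service ID and name are hypothetical.

```shell
# Sketch: print the JSON body for Consul's /v1/agent/service/register endpoint.
# Arguments: <service-id> <service-name> <address> <port>
consul_service_json() {
  printf '{"ID":"%s","Name":"%s","Address":"%s","Port":%s}\n' "$1" "$2" "$3" "$4"
}

# Usage (Consul address taken from the configuration above):
#   consul_service_json node-239 node-exporter 172.19.179.239 9100 |
#     curl -s -X PUT --data @- http://172.30.12.167:8500/v1/agent/service/register
```

With services: [] in the Prometheus job, every registered service is scraped, so relabeling or an explicit service list may be needed to limit the job to exporters.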
Run in a container

docker run -d \
  -p 9090:9090 \
  -v "/prom/prometheus.yml:/etc/prometheus/prometheus.yml" \
  -v "/prom/rules:/etc/prometheus/rules" \
  -v "/prom/targets:/etc/prometheus/targets" \
  prom/prometheus
Configuring node_exporter to monitor hosts

node_exporter is an exporter for machine metrics that runs on *nix and Linux systems.

Its main job is to expose metrics to Prometheus, including CPU load, memory usage, network statistics, and more.

Download

wget https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz
tar zxf node_exporter-0.18.1.linux-amd64.tar.gz
mv node_exporter-0.18.1.linux-amd64 /usr/local/node_exporter

Create the node_exporter service

cat > /usr/lib/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
After=network.target

[Service]
ExecStart=/usr/local/node_exporter/node_exporter

[Install]
WantedBy=multi-user.target
EOF

Start

systemctl enable node_exporter.service
systemctl start node_exporter.service
Configure prometheus.yml

Add node_exporter under scrape_configs, then restart Prometheus.

- job_name: 'node'
  static_configs:
  - targets:
    - '172.19.179.239:9100'
    - '172.19.179.240:9100'
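Run in a container

# With --net="host" the exporter listens on the host's port 9100 directly;
# Docker ignores -p in host network mode, so it is omitted here.
docker run -d \
  --net="host" \
  -v "/:/host:ro,rslave" \
  prom/node-exporter \
  --path.rootfs=/host \
  --collector.filesystem.ignored-mount-points "^/(sys|proc|dev|host|etc)($|/)"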
Configuring Grafana

Download

wget https://dl.grafana.com/oss/release/grafana-6.5.2.linux-amd64.tar.gz
tar -zxf grafana-6.5.2.linux-amd64.tar.gz
mv grafana-6.5.2 /usr/local/grafana

Create the Grafana service

cat > /usr/lib/systemd/system/grafana-server.service <<EOF
[Unit]
Description=Grafana
After=network.target

[Service]
Type=notify
ExecStart=/usr/local/grafana/bin/grafana-server -homepath /usr/local/grafana
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

Start

systemctl enable grafana-server.service
systemctl start grafana-server.service
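Configure the data source

Add a data source

Click Add data source, choose Prometheus, enter http://localhost:9090 in the URL field, and click Save & Test. If a green success message appears, the configuration is valid; otherwise the address, port, or another setting is probably wrong and needs to be corrected.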
Download a dashboard template

Download https://grafana.com/grafana/dashboards/9276 or https://grafana.com/grafana/dashboards/8919

Import the template

Result: [dashboard screenshot]
Configure nginx

Add the Nginx configuration below. The proxy_pass URL must end with "/" so that the matched /grafana/ prefix is stripped before the request reaches Grafana.

server {
    listen 80;
    server_name localhost;

    location /grafana/ {
        proxy_pass http://localhost:3000/;
        proxy_buffering off;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        access_log off;
    }
}

Edit the Grafana configuration (grafana.ini); remove the leading ";" from these lines to uncomment them:

[server]
domain = <your domain>
root_url = %(protocol)s://%(domain)s/grafana/
Run in a container

docker run -d \
  -p 3000:3000 \
  grafana/grafana
Configuring Alertmanager

Download

wget https://github.com/prometheus/alertmanager/releases/download/v0.20.0/alertmanager-0.20.0.linux-amd64.tar.gz
tar zxf alertmanager-0.20.0.linux-amd64.tar.gz
# Move to the path that the service files below expect.
mv alertmanager-0.20.0.linux-amd64 /usr/local/prometheus/alertmanager

Download the DingTalk alert plugin

wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v1.4.0/prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
tar zxf prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
mv prometheus-webhook-dingtalk-1.4.0.linux-amd64 /usr/local/prometheus/alertmanager/prometheus-webhook-dingtalk
Configure config.yml

# The indentation was lost in the source; placing "template" at the top level is an assumption.
targets:
  webhook:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
    secret: SEC000000000000000000000
template: contrib/templates/legacy/dingtalk.tmpl
Configure the message template

Two templates are provided here:

https://github.com/bwcxyk/tools_file/raw/master/Prometheus/dingtalk_custom_tempalte.tmpl
https://github.com/bwcxyk/tools_file/raw/master/Prometheus/dingtalk.tmpl
Create the service

cat > /usr/lib/systemd/system/prometheus-webhook-dingtalk.service <<'EOF'
[Unit]
Description=prometheus-webhook-dingtalk
After=network-online.target

[Service]
Restart=on-failure
ExecStart=/usr/local/prometheus/alertmanager/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk \
  --config.file=/usr/local/prometheus/alertmanager/prometheus-webhook-dingtalk/config.yml

[Install]
WantedBy=multi-user.target
EOF

Start the DingTalk alert plugin

systemctl enable prometheus-webhook-dingtalk.service
systemctl start prometheus-webhook-dingtalk.service
Update the Alertmanager configuration

global:
  resolve_timeout: 5m

route:
  receiver: webhook
  group_by: [alertname]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  routes:
  - receiver: webhook
    group_wait: 10s

receivers:
- name: webhook
  webhook_configs:
  - url: http://localhost:8060/dingtalk/webhook/send
    send_resolved: true

inhibit_rules:
- equal: ['alertname', 'cluster', 'service']
  source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
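
Once Alertmanager is running, the route from Alertmanager to DingTalk can be exercised without waiting for a real alert by posting a test alert to Alertmanager's v2 API. The sketch below builds the request body; the function name test_alert_json and the alert name and summary are hypothetical.

```shell
# Sketch: print a one-element alert list for Alertmanager's POST /api/v2/alerts.
# Arguments: <alertname> <severity> <summary>
test_alert_json() {
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"%s"}}]\n' "$1" "$2" "$3"
}

# Usage:
#   test_alert_json TestAlert warning "manual test" |
#     curl -s -X POST -H 'Content-Type: application/json' --data @- http://localhost:9093/api/v2/alerts
```

The alert should reach the webhook receiver after the group_wait interval configured in the route. The configuration file itself can also be validated with amtool check-config alertmanager.yml, which ships in the Alertmanager tarball.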
Create the Alertmanager service

cat > /usr/lib/systemd/system/alertmanager.service <<EOF
[Unit]
Description=Alertmanager
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/prometheus/alertmanager/alertmanager --web.external-url=http://example.com:9093 --config.file=/usr/local/prometheus/alertmanager/alertmanager.yml --storage.path=/data/prometheus/alertmanager/data
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

Start the service

systemctl enable alertmanager.service
systemctl start alertmanager.service
Configure nginx

server {
    listen 80;
    server_name monitor.example.com;
    location / {
        auth_basic "Prometheus";
        auth_basic_user_file ".htpasswd";
        proxy_pass http://localhost:9093/;
    }
}
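Run in a container

# The prom/alertmanager image reads /etc/alertmanager/alertmanager.yml by default,
# so the host file is mounted at that path.
docker run -d \
  -p 9093:9093 \
  -v "/prom/alertmanager.yaml:/etc/alertmanager/alertmanager.yml" \
  prom/alertmanager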
Blackbox_exporter

Download

wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.17.0/blackbox_exporter-0.17.0.linux-amd64.tar.gz
tar zxf blackbox_exporter-0.17.0.linux-amd64.tar.gz
# Move to the path that the service file below expects.
mv blackbox_exporter-0.17.0.linux-amd64 /usr/local/prometheus/blackbox_exporter
Configuration

Edit the blackbox.yml file:

modules:
  http_2xx:
    prober: http
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2"]
      valid_status_codes: [200]
      method: GET
      preferred_ip_protocol: "ip4"
  http_post_2xx:
    prober: http
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2"]
      method: POST
      preferred_ip_protocol: "ip4"
Update the Prometheus configuration by adding jobs that use file-based service discovery. Note that metrics_path defaults to /metrics in the source code, so it must be changed to /probe here.

- job_name: "blackbox-http"
  metrics_path: /probe
  params:
    module: [http_2xx]
  file_sd_configs:
  - refresh_interval: 1m
    files:
    - targets/blackbox/http_2xx.yml
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: 127.0.0.1:9115

- job_name: "blackbox-http-post"
  metrics_path: /probe
  params:
    module: [http_post_2xx]
  file_sd_configs:
  - refresh_interval: 1m
    files:
    - targets/blackbox/http_post_2xx.yml
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: 127.0.0.1:9115
Create the file targets/blackbox/http_2xx.yml:

- targets:
  - https://baidu.com
Create the system service

cat > /usr/lib/systemd/system/blackbox.service <<EOF
[Unit]
Description=blackbox_exporter
After=network.target

[Service]
User=root
Type=simple
ExecStart=/usr/local/prometheus/blackbox_exporter/blackbox_exporter --config.file=/usr/local/prometheus/blackbox_exporter/blackbox.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

Start the service

systemctl daemon-reload
systemctl start blackbox.service
systemctl enable blackbox.service

Reload Prometheus

curl -X POST "http://127.0.0.1:9090/-/reload"
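
If a target shows as down after the reload, it can help to query the exporter directly, which is the same request Prometheus makes after relabeling. The sketch below builds the /probe URL; the function name probe_url is an assumption, and the target is not URL-encoded, so it only suits simple URLs.

```shell
# Sketch: build a blackbox_exporter probe URL.
# Arguments: <exporter host:port> <module> <target>
probe_url() {
  printf 'http://%s/probe?module=%s&target=%s\n' "$1" "$2" "$3"
}

# Usage:
#   curl -s "$(probe_url 127.0.0.1:9115 http_2xx https://baidu.com)" | grep '^probe_success'
```

A probe_success value of 1 means the module's checks passed for that target; 0 means the probe failed.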
Grafana dashboard

Import https://grafana.com/grafana/dashboards/9965
Alert configuration

Create the file rules/blackbox_exporter.yml:

groups:
- name: blackbox_network_stats
  rules:
  - alert: blackbox_network_stats
    expr: probe_success == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      description: 'Site/endpoint {{ $labels.instance }} in job {{ $labels.job }} has been down for more than one minute.'
      summary: 'Site/endpoint {{ $labels.instance }} is down!'
  - alert: BlackboxProbeHttpFailure
    expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Blackbox probe HTTP failure (instance {{ $labels.instance }})"
      description: "HTTP status code is not 200-399\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: BlackboxSslCertificateWillExpireSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})"
      description: "SSL certificate expires in 30 days\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
Other notes

To delete the metric data of a particular job or instance, use the commands below (they require --web.enable-admin-api):

curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="kubernetes"}'
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="10.244.2.158:9090"}'
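
delete_series only marks the matching data with tombstones; the disk space is reclaimed at a later compaction or when clean_tombstones is called explicitly. The sketch below builds the delete URL for one label matcher and shows the cleanup call; the function name delete_series_url is an assumption for illustration.

```shell
# Sketch: build a delete_series URL for a single label="value" matcher.
# Arguments: <prometheus base URL> <label name> <label value>
delete_series_url() {
  printf '%s/api/v1/admin/tsdb/delete_series?match[]={%s="%s"}\n' "$1" "$2" "$3"
}

# Usage:
#   curl -X POST -g "$(delete_series_url http://localhost:9090 job kubernetes)"
#   curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones
```

The -g flag stops curl from interpreting the brackets and braces in the URL as glob patterns.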
Reference: "Prometheus 删除数据指标" (Deleting metric data in Prometheus)

Grafana dashboard templates: ES Nginx Logs