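Preface
Continuing from the previous post: prometheus is installed and collecting monitoring data. The next step is to alert on that data and deliver the alerts to a chat platform such as Feishu or DingTalk. This post uses Feishu for testing.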
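Environment
Component           Version
Operating system    Ubuntu 22.04.4 LTS
docker              24.0.7
alertmanager        v0.27.0
Download the manifests
All manifests used in this post are in this repository:
- ▶ cd /tmp && git clone git@github.com:wilsonchai8/installations.git && cd installations/prometheus
Install alertmanager
alertmanager handles the alerts that prometheus sends to it, including delivering and inhibiting them. Apply the manifest from the manifest directory (skip the cd if you are already there):
- ▶ cd installations/prometheus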
- ▶ kubectl apply -f alertmanager.yaml
Check that it started:
- ▶ kubectl -n prometheus get pod -owide | grep alertmanager
- alertmanager-5b6d594f6c-2swpw 1/1 Running 0 69s 10.244.0.17 minikube <none> <none>
Access the web UI
- ▶ kubectl get node -owide
- NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
- minikube Ready control-plane 6d2h v1.26.3 192.168.49.2 <none> Ubuntu 20.04.5 LTS 6.8.0-45-generic docker://23.0.2
- ▶ kubectl -n prometheus get svc | grep alertmanager
- alertmanager NodePort 10.110.182.95 <none> 9093:30297/TCP 70s
Open http://192.168.49.2:30297 in a browser (the node's INTERNAL-IP plus the service's NodePort).
Testing alertmanager
1. Define a test deployment
- ▶ kubectl create deployment busybox-test --image=registry.cn-beijing.aliyuncs.com/wilsonchai/busybox:latest -- sleep 33333
- deployment.apps/busybox-test created
- ▶ kubectl get pod
- NAME READY STATUS RESTARTS AGE
- busybox-test-fcb69d5f9-tn8vx 1/1 Running 0 6s
2. Define the alert rule
The rule fires when a deployment's replica count is 0. Edit the prometheus configmap and append a new entry at the bottom. It acts as an additional config file dedicated to alert rules, and it is picked up by the rule_files glob.
- apiVersion: v1
- kind: ConfigMap
- metadata:
-   name: prometheus-cm
-   labels:
-     name: prometheus-cm
-   namespace: prometheus
- data:
-   prometheus.yml: |-
-     global:
-       scrape_interval: 5s
-       evaluation_interval: 5s
-     alerting:
-       alertmanagers:
-       - static_configs:
-         - targets: ['alertmanager:9093']
-     rule_files:
-       - /etc/prometheus/*.rules
-     scrape_configs:
-       - job_name: 'prometheus'
-         static_configs:
-         - targets: ['localhost:9090']
-       - job_name: "prometheus-kube-state-metrics"
-         static_configs:
-         - targets: ["kube-state-metrics.kube-system:8080"]
-       - job_name: 'kubernetes-nodes'
-         tls_config:
-           ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
-         bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
-         kubernetes_sd_configs:
-         - role: node
-         relabel_configs:
-         - source_labels: [__address__]
-           regex: '(.*):10250'
-           replacement: '${1}:9100'
-           target_label: __address__
-           action: replace
-         - action: labelmap
-           regex: __meta_kubernetes_node_label_(.+)
-   # Everything from here down is newly added
-   prometheus.rules: |-
-     groups:
-     - name: test alert
-       rules:
-       - alert: deployment replicas is 0
-         expr: kube_deployment_spec_replicas == 0
-         for: 30s
-         labels:
-           severity: slack
-         annotations:
-           summary: deployment replicas is 0
Then restart prometheus and check that the alert rule has been loaded.
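One way to confirm the rule was loaded without opening the UI is to ask the Prometheus HTTP API for its rule groups. The sketch below is only an illustration: the base URL is an assumption (for example a port-forward to port 9090), and the helper names `fetch_rules` and `find_alert_rule` are hypothetical. The sample response mimics the shape of `/api/v1/rules` so the lookup can be tried offline.

```python
import json
import urllib.request

# Assumption: prometheus is reachable here, e.g. through a port-forward.
PROMETHEUS_URL = "http://localhost:9090"


def fetch_rules(base_url=PROMETHEUS_URL):
    """Download the loaded rule groups from the Prometheus HTTP API."""
    with urllib.request.urlopen(base_url + "/api/v1/rules", timeout=5) as resp:
        return json.load(resp)


def find_alert_rule(payload, alert_name):
    """Return the alerting rule with the given name, or None if it is absent."""
    for group in payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("type") == "alerting" and rule.get("name") == alert_name:
                return rule
    return None


# Hand-written sample in the shape of a /api/v1/rules response (not real output).
sample_response = {
    "status": "success",
    "data": {
        "groups": [
            {
                "name": "test alert",
                "rules": [
                    {
                        "name": "deployment replicas is 0",
                        "type": "alerting",
                        "query": "kube_deployment_spec_replicas == 0",
                        "state": "inactive",
                    }
                ],
            }
        ]
    },
}

rule = find_alert_rule(sample_response, "deployment replicas is 0")
print("rule loaded:", rule is not None)
```

Against a live server, `find_alert_rule(fetch_rules(), "deployment replicas is 0")` performs the same check. A `None` result suggests the rules file was not mounted or not matched by `rule_files`.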
3. Trigger the alert
- ▶ kubectl scale --replicas=0 deploy busybox-test
Wait a moment, then check the alertmanager page.
The alert has fired.
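The same check can be scripted against alertmanager's v2 API, which returns the current alerts as a JSON list. This is a rough sketch rather than part of the original setup: the base URL reuses the NodePort address shown earlier, the helper names are hypothetical, and the sample list is hand-written in the shape of a `/api/v2/alerts` response.

```python
import json
import urllib.request

# Assumption: the NodePort address from the "kubectl get svc" output above.
ALERTMANAGER_URL = "http://192.168.49.2:30297"


def fetch_alerts(base_url=ALERTMANAGER_URL):
    """Fetch the alerts currently known to alertmanager."""
    with urllib.request.urlopen(base_url + "/api/v2/alerts", timeout=5) as resp:
        return json.load(resp)


def summarize_active(alerts):
    """Return (alertname, deployment, startsAt) for every alert in state 'active'."""
    rows = []
    for alert in alerts:
        if alert.get("status", {}).get("state") != "active":
            continue  # skip suppressed or unprocessed alerts
        labels = alert.get("labels", {})
        rows.append(
            (
                labels.get("alertname", ""),
                labels.get("deployment", ""),
                alert.get("startsAt", ""),
            )
        )
    return rows


# Hand-written sample data (not real output); timestamps are placeholders.
sample_alerts = [
    {
        "labels": {
            "alertname": "deployment replicas is 0",
            "deployment": "busybox-test",
            "namespace": "default",
            "severity": "slack",
        },
        "annotations": {"summary": "deployment replicas is 0"},
        "startsAt": "2024-01-01T00:00:00Z",
        "status": {"state": "active", "silencedBy": [], "inhibitedBy": []},
    },
    {
        "labels": {"alertname": "something silenced"},
        "startsAt": "2024-01-01T00:00:00Z",
        "status": {"state": "suppressed", "silencedBy": ["abc"], "inhibitedBy": []},
    },
]

for row in summarize_active(sample_alerts):
    print(row)
```

With the cluster running, `summarize_active(fetch_alerts())` should list the busybox-test alert once the `for: 30s` window has passed.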
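Sending to Feishu
An alert now exists, but nothing delivers it as a notification yet. It needs to be sent to Feishu.
1. Create a Feishu alert group, add a bot to it, and copy the bot's webhook
webhook:
- https://open.feishu.cn/open-apis/bot/v2/hook/*******************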
2. Create the message-sending service
A small Python tornado web service receives the alert payload that alertmanager posts and forwards it to Feishu.
- from tornado.ioloop import IOLoop
- import tornado.httpserver as httpserver
- import tornado.web
- import requests
- import json
- # Feishu bot webhook; replace the masked part with your own token
- WEBHOOK_URL = 'https://open.feishu.cn/open-apis/bot/v2/hook/********'
- def send_to_feishu(msg_raw):
-     # Send one Feishu text message per alert in the alertmanager payload
-     headers = { 'Content-Type': 'application/json' }
-     for alert in msg_raw['alerts']:
-         msg = '## Alert firing ##\n'
-         msg += '\n'
-         msg += 'Alert: {}\n'.format(alert['labels']['alertname'])
-         msg += 'Time: {}\n'.format(alert['startsAt'])
-         msg += 'Severity: {}\n'.format(alert['labels']['severity'])
-         msg += 'Details:\n'
-         msg += '  deploy: {}\n'.format(alert['labels']['deployment'])
-         msg += '  namespace: {}\n'.format(alert['labels']['namespace'])
-         msg += '  content: {}\n'.format(alert['annotations']['summary'])
-         data = {
-             'msg_type': 'text',
-             'content': {
-                 'text': msg
-             }
-         }
-         # A timeout keeps a slow Feishu endpoint from blocking the handler forever
-         res = requests.post(url=WEBHOOK_URL, headers=headers, json=data, timeout=10)
-         print(res.json())
- class SendmsgFlow(tornado.web.RequestHandler):
-     # Receives the webhook POST from alertmanager
-     def post(self, *args, **kwargs):
-         send_to_feishu(json.loads(self.request.body.decode('utf-8')))
- def applications():
-     urls = []
-     urls.append([r'/sendmsg', SendmsgFlow])
-     return tornado.web.Application(urls)
- def main():
-     app = applications()
-     server = httpserver.HTTPServer(app)
-     # Listen on port 10000 on all interfaces, single process
-     server.bind(10000, '0.0.0.0')
-     server.start(1)
-     IOLoop.current().start()
- if __name__ == "__main__":
-     try:
-         main()
-     except KeyboardInterrupt:
-         IOLoop.current().stop()
-     finally:
-         IOLoop.current().close()
This script has been uploaded to the repository.
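The script above indexes labels such as `deployment` and `namespace` directly, so an alert without them would raise a KeyError. As a hedged illustration only (not the repository's code), the message-building step could be separated into pure functions that fall back to a placeholder. The names `format_alert` and `build_feishu_payload` are hypothetical, and the sample body is hand-written in the general shape of an alertmanager webhook request.

```python
def format_alert(alert):
    """Build the text for one alert, using 'unknown' for any missing field."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    lines = [
        "## Alert firing ##",
        "",
        "Alert: {}".format(labels.get("alertname", "unknown")),
        "Time: {}".format(alert.get("startsAt", "unknown")),
        "Severity: {}".format(labels.get("severity", "unknown")),
        "Details:",
        "  deploy: {}".format(labels.get("deployment", "unknown")),
        "  namespace: {}".format(labels.get("namespace", "unknown")),
        "  content: {}".format(annotations.get("summary", "unknown")),
    ]
    return "\n".join(lines) + "\n"


def build_feishu_payload(text):
    """Wrap text in the same JSON body the script above posts to the bot webhook."""
    return {"msg_type": "text", "content": {"text": text}}


# Hand-written sample webhook body (placeholder timestamps, not real output).
sample_webhook_body = {
    "version": "4",
    "status": "firing",
    "receiver": "default",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "deployment replicas is 0",
                "severity": "slack",
                "deployment": "busybox-test",
                "namespace": "default",
            },
            "annotations": {"summary": "deployment replicas is 0"},
            "startsAt": "2024-01-01T00:00:00Z",
        },
        # An alert with no deployment/namespace labels still formats cleanly.
        {
            "status": "firing",
            "labels": {"alertname": "node down", "severity": "slack"},
            "annotations": {},
            "startsAt": "2024-01-01T00:00:00Z",
        },
    ],
}

payloads = [build_feishu_payload(format_alert(a)) for a in sample_webhook_body["alerts"]]
for payload in payloads:
    print(payload["content"]["text"])
```

Keeping the formatting free of network calls also makes it easy to check the message layout before pointing the service at a real webhook.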
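3. Edit the alertmanager configmap
Edit the alertmanager configmap and point webhook_configs at the sendmsg API address. Note that 127.0.0.1 resolves inside the alertmanager pod, so this URL only works if the sendmsg service is reachable on that pod's loopback (for example as a container in the same pod); otherwise use an address the pod can reach.
- apiVersion: v1
- kind: ConfigMap
- metadata:
-   name: alertmanager-config
-   namespace: prometheus
- data:
-   alertmanager.yml: |-
-     global:
-       resolve_timeout: 5m
-     route:
-       group_by: ['alertname', 'cluster']
-       group_wait: 30s
-       group_interval: 5m
-       repeat_interval: 5m
-       receiver: default
-     receivers:
-     - name: 'default'
-       webhook_configs:
-       - url: 'http://127.0.0.1:10000/sendmsg'
Restart alertmanager.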
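To exercise the new receiver without scaling a deployment again, a synthetic alert can be pushed straight into alertmanager through its v2 API. The following is a sketch under assumptions: the base URL is the NodePort address used earlier, the alert name `manual-test` and the helper names are made up for illustration, and the labels include `deployment` and `namespace` only because the sendmsg script reads them.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumption: the NodePort address from the "kubectl get svc" output above.
ALERTMANAGER_URL = "http://192.168.49.2:30297"


def build_test_alert(name="manual-test", minutes=5, now=None):
    """Build a one-element alert list that expires after the given minutes."""
    now = now or datetime.now(timezone.utc)
    return [
        {
            "labels": {
                "alertname": name,
                "severity": "slack",
                "deployment": "busybox-test",
                "namespace": "default",
            },
            "annotations": {"summary": "manual test alert"},
            "startsAt": now.isoformat(),
            "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
        }
    ]


def post_alerts(alerts, base_url=ALERTMANAGER_URL):
    """POST the alerts to alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        base_url + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as resp:
        return resp.status


# Build (but do not send) an alert with a fixed timestamp so the output is stable.
fixed_now = datetime(2024, 1, 1, tzinfo=timezone.utc)
alerts = build_test_alert(now=fixed_now)
print(json.dumps(alerts, indent=2))
```

Calling `post_alerts(build_test_alert())` against the running cluster should make the alert appear on the alertmanager page. With `group_wait: 30s`, the webhook call to sendmsg would be expected roughly half a minute later.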
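4. Check Feishu
That completes a simple alerting pipeline.
Contact me
That is the end of this post.
My knowledge is limited, so if you spot any mistakes or omissions, corrections are very welcome.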