Kubernetes Monitoring Handbook 04 - Monitoring Kube-Proxy


Introduction

First, please read "Kubernetes Monitoring Handbook 01 - Architecture Overview" to refresh your memory of the Kubernetes architecture: Kube-Proxy runs on every worker node.
By default, Kube-Proxy exposes two ports. Port 10249 serves monitoring metrics in the Prometheus exposition format on the /metrics endpoint:
[root@tt-fc-dev01.nj lib]# curl -s http://localhost:10249/metrics | head -n 10
# HELP apiserver_audit_event_total [ALPHA] Counter of audit events generated and sent to the audit backend.
# TYPE apiserver_audit_event_total counter
apiserver_audit_event_total 0
# HELP apiserver_audit_requests_rejected_total [ALPHA] Counter of apiserver requests rejected due to an error in audit logging backend.
# TYPE apiserver_audit_requests_rejected_total counter
apiserver_audit_requests_rejected_total 0
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 2.5307e-05
go_gc_duration_seconds{quantile="0.25"} 2.8884e-05
Port 10256 is the health-check port; a request to its /healthz endpoint returns two timestamps:
[root@tt-fc-dev01.nj lib]# curl -s http://localhost:10256/healthz | jq .
{
  "lastUpdated": "2022-11-09 13:14:35.621317865 +0800 CST m=+4802354.950616250",
  "currentTime": "2022-11-09 13:14:35.621317865 +0800 CST m=+4802354.950616250"
}
So all we need to do is collect metrics from http://localhost:10249/metrics. Since the data is already in Prometheus format, Categraf's input.prometheus plugin handles it.
Categraf prometheus plugin

The configuration file is conf/input.prometheus/prometheus.toml; just add the Kube-Proxy address to it:
interval = 15
[[instances]]
urls = [
     "http://localhost:10249/metrics"
]
labels = { job="kube-proxy" }
The urls field lists the endpoints, i.e. every interface that exposes metrics data. Let's run a quick test with the following command:
[work@tt-fc-dev01.nj categraf]$ ./categraf --test --inputs prometheus | grep kubeproxy_sync_proxy_rules
2022/11/09 13:30:17 main.go:110: I! runner.binarydir: /home/work/go/src/categraf
2022/11/09 13:30:17 main.go:111: I! runner.hostname: tt-fc-dev01.nj
2022/11/09 13:30:17 main.go:112: I! runner.fd_limits: (soft=655360, hard=655360)
2022/11/09 13:30:17 main.go:113: I! runner.vm_limits: (soft=unlimited, hard=unlimited)
2022/11/09 13:30:17 config.go:33: I! tracing disabled
2022/11/09 13:30:17 provider.go:63: I! use input provider: [local]
2022/11/09 13:30:17 agent.go:87: I! agent starting
2022/11/09 13:30:17 metrics_agent.go:93: I! input: local.prometheus started
2022/11/09 13:30:17 prometheus_scrape.go:14: I! prometheus scraping disabled!
2022/11/09 13:30:17 agent.go:98: I! agent started
13:30:17 kubeproxy_sync_proxy_rules_endpoint_changes_pending agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics 0
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_count agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics 319786
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_sum agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics 17652.749911909214
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=+Inf 319786
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=0.001 0
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=0.002 0
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=0.004 0
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=0.008 0
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=0.016 0
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=0.032 0
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=0.064 274815
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=0.128 316616
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=0.256 319525
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=0.512 319776
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=1.024 319784
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=2.048 319784
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=4.096 319784
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=8.192 319784
13:30:17 kubeproxy_sync_proxy_rules_duration_seconds_bucket agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics le=16.384 319786
13:30:17 kubeproxy_sync_proxy_rules_service_changes_pending agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics 0
13:30:17 kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics 1.6668536394083393e+09
13:30:17 kubeproxy_sync_proxy_rules_iptables_restore_failures_total agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics 0
13:30:17 kubeproxy_sync_proxy_rules_endpoint_changes_total agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics 219139
13:30:17 kubeproxy_sync_proxy_rules_last_timestamp_seconds agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics 1.6679718066295934e+09
13:30:17 kubeproxy_sync_proxy_rules_service_changes_total agent_hostname=tt-fc-dev01.nj instance=http://localhost:10249/metrics 512372
In the Kubernetes architecture, Kube-Proxy is responsible for syncing rules from the APIServer and then updating the iptables or ipvs configuration, so the rule-sync metrics are the critical ones; that's why I grep'd for them as the sample above.
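To give a concrete idea of how these metrics are used, here is a minimal PromQL sketch (assuming the data ends up in a PromQL-compatible store such as Prometheus or Nightingale; adjust the label set to your pipeline):

# p99 duration of a single rule sync over the last 5 minutes
histogram_quantile(0.99,
  sum by (le) (rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket[5m])))

# how many rule syncs per second kube-proxy is performing
rate(kubeproxy_sync_proxy_rules_duration_seconds_count[5m])

A sustained rise in the p99 usually points to the rule set on that node growing large.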
Once --test shows this output, the data is being collected correctly. You could repeat this on every worker node by editing Categraf's configuration there. That works and is very direct, just a bit tedious: whenever a new Node is added, you would also have to update the Categraf scrape configuration with that node's Kube-Proxy /metrics address. If you drive this with scripts it's manageable; done by hand it becomes a chore. Instead, we can run Categraf as a DaemonSet and stop worrying about scaling out: a DaemonSet is automatically scheduled onto every Node.
Deploying Categraf as a DaemonSet

To run Categraf as a DaemonSet, first create a namespace; the related ConfigMaps, DaemonSet, and other objects will all belong to it. If we only want to monitor Kube-Proxy, Categraf needs just the main config.toml plus prometheus.toml. Let's walk through it.
Create the namespace
[work@tt-fc-dev01.nj categraf]$ kubectl create namespace flashcat
namespace/flashcat created
[work@tt-fc-dev01.nj categraf]$ kubectl get ns | grep flashcat
flashcat                                 Active   29s
Create the ConfigMaps

The ConfigMaps hold the contents of config.toml and prometheus.toml. I've prepared the YAML for you; save it as categraf-configmap-v1.yaml:
---
kind: ConfigMap
metadata:
  name: categraf-config
apiVersion: v1
data:
  config.toml: |
    [global]
    hostname = "$HOSTNAME"
    interval = 15
    providers = ["local"]
    [writer_opt]
    batch = 2000
    chan_size = 10000
    [[writers]]
    url = "http://10.206.0.16:19000/prometheus/v1/write"
    timeout = 5000
    dial_timeout = 2500
    max_idle_conns_per_host = 100
---
kind: ConfigMap
metadata:
  name: categraf-input-prometheus
apiVersion: v1
data:
  prometheus.toml: |
    [[instances]]
    urls = ["http://127.0.0.1:10249/metrics"]
    labels = { job="kube-proxy" }
The address 10.206.0.16:19000 above is only an example; change it to your own n9e-server address. And if you'd rather not push the metrics to Nightingale, that's fine too: any other time-series database that accepts the remote write protocol will do. The hostname = "$HOSTNAME" setting uses a $ placeholder; when we create the DaemonSet below, the HOSTNAME environment variable is injected so Categraf picks it up automatically.
Now let's create the ConfigMaps:
[work@tt-fc-dev01.nj yamls]$ kubectl apply -f categraf-configmap-v1.yaml -n flashcat
configmap/categraf-config created
configmap/categraf-input-prometheus created
[work@tt-fc-dev01.nj yamls]$ kubectl get configmap -n flashcat
NAME                        DATA   AGE
categraf-config             1      19s
categraf-input-prometheus   1      19s
kube-root-ca.crt            1      22m
Create the DaemonSet

With the configuration in place, let's create the DaemonSet. Note how the HOSTNAME environment variable is injected. The YAML is below; save it as categraf-daemonset-v1.yaml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: categraf-daemonset
  name: categraf-daemonset
spec:
  selector:
    matchLabels:
      app: categraf-daemonset
  template:
    metadata:
      labels:
        app: categraf-daemonset
    spec:
      containers:
      - env:
        - name: TZ
          value: Asia/Shanghai
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: HOSTIP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        image: flashcatcloud/categraf:v0.2.18
        imagePullPolicy: IfNotPresent
        name: categraf
        volumeMounts:
        - mountPath: /etc/categraf/conf
          name: categraf-config
        - mountPath: /etc/categraf/conf/input.prometheus
          name: categraf-input-prometheus
      hostNetwork: true
      restartPolicy: Always
      tolerations:
      - effect: NoSchedule
        operator: Exists
      volumes:
      - configMap:
          name: categraf-config
        name: categraf-config
      - configMap:
          name: categraf-input-prometheus
        name: categraf-input-prometheus
Apply the DaemonSet file:
[work@tt-fc-dev01.nj yamls]$ kubectl apply -f categraf-daemonset-v1.yaml -n flashcat
daemonset.apps/categraf-daemonset created
[work@tt-fc-dev01.nj yamls]$ kubectl get ds -o wide -n flashcat
NAME                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE     CONTAINERS   IMAGES                           SELECTOR
categraf-daemonset   6         6         6       6            6           <none>          2m20s   categraf     flashcatcloud/categraf:v0.2.17   app=categraf-daemonset
[work@tt-fc-dev01.nj yamls]$ kubectl get pods -o wide -n flashcat
NAME                       READY   STATUS    RESTARTS   AGE     IP            NODE          NOMINATED NODE   READINESS GATES
categraf-daemonset-4qlt9   1/1     Running   0          2m10s   10.206.0.7    10.206.0.7    <none>           <none>
categraf-daemonset-s9bk2   1/1     Running   0          2m10s   10.206.0.11   10.206.0.11   <none>           <none>
categraf-daemonset-w77lt   1/1     Running   0          2m10s   10.206.16.3   10.206.16.3   <none>           <none>
categraf-daemonset-xgwf5   1/1     Running   0          2m10s   10.206.0.16   10.206.0.16   <none>           <none>
categraf-daemonset-z9rk5   1/1     Running   0          2m10s   10.206.16.8   10.206.16.8   <none>           <none>
categraf-daemonset-zdp8v   1/1     Running   0          2m10s   10.206.0.17   10.206.0.17   <none>           <none>
Everything looks fine. Now let's check in Nightingale and confirm the relevant metrics have arrived.
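If you prefer checking with a query rather than browsing the metric explorer, a sketch like the following (assuming the ident label carries the node name, as discussed in the dashboard section below) should return one series per worker node:

# one series per node means every kube-proxy instance is being scraped
count by (ident) (kubeproxy_sync_proxy_rules_last_timestamp_seconds)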

Metric descriptions

Kong Fei (孔飞) compiled descriptions of the Kube-Proxy metrics some time ago; I've moved them into this chapter for reference (a couple of example queries built on these metrics follow the list):
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
GC pause time
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
Number of goroutines
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
Number of OS threads
# HELP kubeproxy_network_programming_duration_seconds [ALPHA] In Cluster Network Programming Latency in seconds
# TYPE kubeproxy_network_programming_duration_seconds histogram
Time from a Service or Pod change until kube-proxy finishes syncing the rules; the exact semantics are fairly involved, see https://github.com/kubernetes/community/blob/master/sig-scalability/slos/network_programming_latency.md
# HELP kubeproxy_sync_proxy_rules_duration_seconds [ALPHA] SyncProxyRules latency in seconds
# TYPE kubeproxy_sync_proxy_rules_duration_seconds histogram
Time taken to sync the rules
# HELP kubeproxy_sync_proxy_rules_endpoint_changes_pending [ALPHA] Pending proxy rules Endpoint changes
# TYPE kubeproxy_sync_proxy_rules_endpoint_changes_pending gauge
Number of pending rule syncs triggered by Endpoint changes
# HELP kubeproxy_sync_proxy_rules_endpoint_changes_total [ALPHA] Cumulative proxy rules Endpoint changes
# TYPE kubeproxy_sync_proxy_rules_endpoint_changes_total counter
Total number of rule syncs triggered by Endpoint changes
# HELP kubeproxy_sync_proxy_rules_iptables_restore_failures_total [ALPHA] Cumulative proxy iptables restore failures
# TYPE kubeproxy_sync_proxy_rules_iptables_restore_failures_total counter
Total number of iptables restore failures on this node
# HELP kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds [ALPHA] The last time a sync of proxy rules was queued
# TYPE kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds gauge
Timestamp of the most recently requested (queued) rule sync; if it is much larger than the next metric, kubeproxy_sync_proxy_rules_last_timestamp_seconds, the sync is hung
# HELP kubeproxy_sync_proxy_rules_last_timestamp_seconds [ALPHA] The last time proxy rules were successfully synced
# TYPE kubeproxy_sync_proxy_rules_last_timestamp_seconds gauge
Timestamp of the most recently completed rule sync
# HELP kubeproxy_sync_proxy_rules_service_changes_pending [ALPHA] Pending proxy rules Service changes
# TYPE kubeproxy_sync_proxy_rules_service_changes_pending gauge
Number of pending rule syncs triggered by Service changes
# HELP kubeproxy_sync_proxy_rules_service_changes_total [ALPHA] Cumulative proxy rules Service changes
# TYPE kubeproxy_sync_proxy_rules_service_changes_total counter
Total number of rule syncs triggered by Service changes
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
Use this metric to compute CPU usage
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
Maximum number of file descriptors the process may open
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
Number of file descriptors the process currently has open
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
Memory usage
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
Process start timestamp
# HELP rest_client_request_duration_seconds [ALPHA] Request latency in seconds. Broken down by verb and URL.
# TYPE rest_client_request_duration_seconds histogram
Latency of requests to the apiserver (broken down by url and verb)
# HELP rest_client_requests_total [ALPHA] Number of HTTP requests, partitioned by status code, method, and host.
# TYPE rest_client_requests_total counter
Total number of requests to the apiserver (broken down by code, method, host)
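Two of the notes above translate directly into queries. The following is a rough PromQL sketch (the 60-second threshold is illustrative only, and the job="kube-proxy" label comes from the scrape configuration shown earlier):

# approximate CPU usage of kube-proxy, in percent
rate(process_cpu_seconds_total{job="kube-proxy"}[5m]) * 100

# sync looks hung: a sync was queued well after the last successful sync finished
kubeproxy_sync_proxy_rules_last_queued_timestamp_seconds
  - kubeproxy_sync_proxy_rules_last_timestamp_seconds > 60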
Import the dashboard

Since the monitoring approach above is based on a DaemonSet, the metrics of the individual Kube-Proxy instances are distinguished by the ident label rather than the instance label. I found a shared dashboard on the Grafana site (linked here) and adapted it; the adapted dashboard is here and can be imported straight into Nightingale.
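For reference, a hypothetical before/after for one panel query illustrates the kind of change this implies (a sketch, not a literal diff of the dashboard):

# original panel, keyed by instance
sum by (instance) (rate(kubeproxy_sync_proxy_rules_duration_seconds_count[5m]))

# adapted for data pushed by Categraf, keyed by ident
sum by (ident) (rate(kubeproxy_sync_proxy_rules_duration_seconds_count[5m]))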

About the author

This article was written by Qin Xiaohui (秦晓辉), a partner at Flashcat (快猫星云). The content is the distilled experience of the Flashcat technical team, edited and organized by the author. We will keep publishing articles on monitoring and stability assurance. The articles may be reproduced; please credit the source and respect the work of the technical staff.
