I spent over half a month in Changsha recently, going over project metrics with the client, so there have been no updates for a while.
Since getting back, I have continued working out how to monitor HAMi vGPU virtualization with Grafana: the contract requires showing GPU resource limits, compute sharing, and per-card resource-sharing monitoring.
First, why HAMi? One important reason is that a colleague introduced me to the project's author, so I can ask the author questions directly.
HAMi is a Chinese open-source project for virtualizing GPUs and domestic accelerator cards (for the supported models and their specific features, see the project site: https://github.com/Project-HAMi/HAMi/) in Kubernetes-based container environments. Originally named "k8s-vGPU-scheduler"
and first open-sourced by my company, it has since gained traction both in China and internationally as middleware for managing heterogeneous devices in Kubernetes. It can manage different types of heterogeneous devices (GPU, NPU, etc.), share them between Pods, and make better scheduling decisions based on device topology and scheduling policies. For brevity, this article presents just one workable approach: scrape the monitoring metrics with Prometheus as the data source, and display them with Grafana.
This article assumes a Kubernetes cluster and HAMi are already deployed. All components below are installed inside the Kubernetes cluster; the relevant component and software versions are:
| Component | Version | Notes |
| --- | --- | --- |
| Kubernetes cluster | v1.23.1 | AMD64 servers |
| HAMi | v1.23.1 (see notes) | According to the author, HAMi's release scheme is not yet mature; treat the `scheduler.kubeScheduler.imageTag` value used at install time as its version, and keep it aligned with the Kubernetes version. Project: https://github.com/Project-HAMi/HAMi/ |
| kube-prometheus stack | prom/prometheus:v2.27.1 | For the monitoring stack install, see the earlier prometheus+grafana deployment post on CSDN |
| dcgm-exporter | nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04 | |

HAMi installs via Helm by default. Add the Helm repository:

```shell
helm repo add hami-charts https://project-hami.github.io/HAMi/
```
Check your Kubernetes version and install HAMi accordingly (the server here runs v1.23.1):

```shell
helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag=v1.23.1 -n kube-system
```

Verify that HAMi installed successfully:

```shell
kubectl get pods -n kube-system
```

If hami-device-plugin and hami-scheduler are both in the Running state, the installation succeeded.
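That check can also be scripted. The sketch below greps a pod listing for non-Running hami pods; the listing is a stand-in for real `kubectl get pods -n kube-system` output, and the pod-name suffixes are placeholders:

```shell
# Stand-in for `kubectl get pods -n kube-system` output (names are placeholders).
cat > /tmp/pods.txt <<'EOF'
hami-device-plugin-xxxxx   2/2   Running   0   2m
hami-scheduler-xxxxx       2/2   Running   0   2m
EOF
# Fail loudly if any hami pod is not Running.
if grep '^hami-' /tmp/pods.txt | grep -qv 'Running'; then
  echo "hami pods not ready"
else
  echo "hami pods Running"
fi
```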
To deploy from plain manifests instead, render the Helm install into hami-install.yaml:

```shell
helm template hami hami-charts/hami --set scheduler.kubeScheduler.imageTag=v1.23.1 -n kube-system > hami-install.yaml
```

The rendered file, which can be deployed directly, looks like this:
```yaml
---
# Source: hami/templates/device-plugin/monitorserviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: hami-device-plugin
  namespace: "kube-system"
  labels:
    app.kubernetes.io/component: "hami-device-plugin"
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
---
# Source: hami/templates/scheduler/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: hami-scheduler
  namespace: "kube-system"
  labels:
    app.kubernetes.io/component: "hami-scheduler"
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
---
# Source: hami/templates/device-plugin/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-device-plugin
  labels:
    app.kubernetes.io/component: hami-device-plugin
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
data:
  config.json: |
    {
      "nodeconfig": [
        {
          "name": "m5-cloudinfra-online02",
          "devicememoryscaling": 1.8,
          "devicesplitcount": 10,
          "migstrategy": "none",
          "filterdevices": {
            "uuid": [],
            "index": []
          }
        }
      ]
    }
---
# Source: hami/templates/scheduler/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-scheduler
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
data:
  config.json: |
    {
      "kind": "Policy",
      "apiVersion": "v1",
      "extenders": [
        {
          "urlPrefix": "https://127.0.0.1:443",
          "filterVerb": "filter",
          "bindVerb": "bind",
          "enableHttps": true,
          "weight": 1,
          "nodeCacheCapable": true,
          "httpTimeout": 30000000000,
          "tlsConfig": {
            "insecure": true
          },
          "managedResources": [
            {
              "name": "nvidia.com/gpu",
              "ignoredByScheduler": true
            },
            {
              "name": "nvidia.com/gpumem",
              "ignoredByScheduler": true
            },
            {
              "name": "nvidia.com/gpucores",
              "ignoredByScheduler": true
            },
            {
              "name": "nvidia.com/gpumem-percentage",
              "ignoredByScheduler": true
            },
            {
              "name": "nvidia.com/priority",
              "ignoredByScheduler": true
            },
            {
              "name": "cambricon.com/vmlu",
              "ignoredByScheduler": true
            },
            {
              "name": "hygon.com/dcunum",
              "ignoredByScheduler": true
            },
            {
              "name": "hygon.com/dcumem",
              "ignoredByScheduler": true
            },
            {
              "name": "hygon.com/dcucores",
              "ignoredByScheduler": true
            },
            {
              "name": "iluvatar.ai/vgpu",
              "ignoredByScheduler": true
            }
          ],
          "ignoreable": false
        }
      ]
    }
---
# Source: hami/templates/scheduler/configmapnew.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-scheduler-newversion
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: false
    profiles:
    - schedulerName: hami-scheduler
    extenders:
    - urlPrefix: "https://127.0.0.1:443"
      filterVerb: filter
      bindVerb: bind
      nodeCacheCapable: true
      weight: 1
      httpTimeout: 30s
      enableHTTPS: true
      tlsConfig:
        insecure: true
      managedResources:
      - name: nvidia.com/gpu
        ignoredByScheduler: true
      - name: nvidia.com/gpumem
        ignoredByScheduler: true
      - name: nvidia.com/gpucores
        ignoredByScheduler: true
      - name: nvidia.com/gpumem-percentage
        ignoredByScheduler: true
      - name: nvidia.com/priority
        ignoredByScheduler: true
      - name: cambricon.com/vmlu
        ignoredByScheduler: true
      - name: hygon.com/dcunum
        ignoredByScheduler: true
      - name: hygon.com/dcumem
        ignoredByScheduler: true
      - name: hygon.com/dcucores
        ignoredByScheduler: true
      - name: iluvatar.ai/vgpu
        ignoredByScheduler: true
---
# Source: hami/templates/scheduler/device-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-scheduler-device
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
data:
  device-config.yaml: |-
    nvidia:
      resourceCountName: nvidia.com/gpu
      resourceMemoryName: nvidia.com/gpumem
      resourceMemoryPercentageName: nvidia.com/gpumem-percentage
      resourceCoreName: nvidia.com/gpucores
      resourcePriorityName: nvidia.com/priority
      overwriteEnv: false
      defaultMemory: 0
      defaultCores: 0
      defaultGPUNum: 1
      deviceSplitCount: 10
      deviceMemoryScaling: 1
      deviceCoreScaling: 1
    cambricon:
      resourceCountName: cambricon.com/vmlu
      resourceMemoryName: cambricon.com/mlu.smlu.vmemory
      resourceCoreName: cambricon.com/mlu.smlu.vcore
    hygon:
      resourceCountName: hygon.com/dcunum
      resourceMemoryName: hygon.com/dcumem
      resourceCoreName: hygon.com/dcucores
    metax:
      resourceCountName: "metax-tech.com/gpu"
    mthreads:
      resourceCountName: "mthreads.com/vgpu"
      resourceMemoryName: "mthreads.com/sgpu-memory"
      resourceCoreName: "mthreads.com/sgpu-core"
    iluvatar:
      resourceCountName: iluvatar.ai/vgpu
      resourceMemoryName: iluvatar.ai/vcuda-memory
      resourceCoreName: iluvatar.ai/vcuda-core
    vnpus:
    - chipName: 910B
      commonWord: Ascend910A
      resourceName: huawei.com/Ascend910A
      resourceMemoryName: huawei.com/Ascend910A-memory
      memoryAllocatable: 32768
      memoryCapacity: 32768
      aiCore: 30
      templates:
        - name: vir02
          memory: 2184
          aiCore: 2
        - name: vir04
          memory: 4369
          aiCore: 4
        - name: vir08
          memory: 8738
          aiCore: 8
        - name: vir16
          memory: 17476
          aiCore: 16
    - chipName: 910B3
      commonWord: Ascend910B
      resourceName: huawei.com/Ascend910B
      resourceMemoryName: huawei.com/Ascend910B-memory
      memoryAllocatable: 65536
      memoryCapacity: 65536
      aiCore: 20
      aiCPU: 7
      templates:
        - name: vir05_1c_16g
          memory: 16384
          aiCore: 5
          aiCPU: 1
        - name: vir10_3c_32g
          memory: 32768
          aiCore: 10
          aiCPU: 3
    - chipName: 310P3
      commonWord: Ascend310P
      resourceName: huawei.com/Ascend310P
      resourceMemoryName: huawei.com/Ascend310P-memory
      memoryAllocatable: 21527
      memoryCapacity: 24576
      aiCore: 8
      aiCPU: 7
      templates:
        - name: vir01
          memory: 3072
          aiCore: 1
          aiCPU: 1
        - name: vir02
          memory: 6144
          aiCore: 2
          aiCPU: 2
        - name: vir04
          memory: 12288
          aiCore: 4
          aiCPU: 4
---
# Source: hami/templates/device-plugin/monitorrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: hami-device-plugin-monitor
rules:
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - get
      - create
      - watch
      - list
      - update
      - patch
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - get
      - update
      - list
      - patch
---
# Source: hami/templates/device-plugin/monitorrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hami-device-plugin
  labels:
    app.kubernetes.io/component: "hami-device-plugin"
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  #name: cluster-admin
  name: hami-device-plugin-monitor
subjects:
  - kind: ServiceAccount
    name: hami-device-plugin
    namespace: "kube-system"
---
# Source: hami/templates/scheduler/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hami-scheduler
  labels:
    app.kubernetes.io/component: "hami-scheduler"
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: hami-scheduler
    namespace: "kube-system"
---
# Source: hami/templates/device-plugin/monitorservice.yaml
apiVersion: v1
kind: Service
metadata:
  name: hami-device-plugin-monitor
  labels:
    app.kubernetes.io/component: hami-device-plugin
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  externalTrafficPolicy: Local
  selector:
    app.kubernetes.io/component: hami-device-plugin
  type: NodePort
  ports:
    - name: monitorport
      port: 31992
      targetPort: 9394
      nodePort: 31992
---
# Source: hami/templates/scheduler/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: hami-scheduler
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  type: NodePort
  ports:
    - name: http
      port: 443
      targetPort: 443
      nodePort: 31998
      protocol: TCP
    - name: monitor
      port: 31993
      targetPort: 9395
      nodePort: 31993
      protocol: TCP
  selector:
    app.kubernetes.io/component: hami-scheduler
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
---
# Source: hami/templates/device-plugin/daemonsetnvidia.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hami-device-plugin
  labels:
    app.kubernetes.io/component: hami-device-plugin
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-device-plugin
      app.kubernetes.io/name: hami
      app.kubernetes.io/instance: hami
  template:
    metadata:
      labels:
        app.kubernetes.io/component: hami-device-plugin
        hami.io/webhook: ignore
        app.kubernetes.io/name: hami
        app.kubernetes.io/instance: hami
    spec:
      imagePullSecrets:
        []
      serviceAccountName: hami-device-plugin
      priorityClassName: system-node-critical
      hostPID: true
      hostNetwork: true
      containers:
        - name: device-plugin
          image: projecthami/hami:latest
          imagePullPolicy: "IfNotPresent"
          lifecycle:
            postStart:
              exec:
                command: ["/bin/sh","-c", "cp -f /k8s-vgpu/lib/nvidia/* /usr/local/vgpu/"]
          command:
            - nvidia-device-plugin
            - --config-file=/device-config.yaml
            - --mig-strategy=none
            - --disable-core-limit=false
            - -v=false
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: NVIDIA_MIG_MONITOR_DEVICES
              value: all
            - name: HOOK_PATH
              value: /usr/local
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
              add: ["SYS_ADMIN"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: lib
              mountPath: /usr/local/vgpu
            - name: usrbin
              mountPath: /usrbin
            - name: deviceconfig
              mountPath: /config
            - name: hosttmp
              mountPath: /tmp
            - name: device-config
              mountPath: /device-config.yaml
              subPath: device-config.yaml
        - name: vgpu-monitor
          image: projecthami/hami:latest
          imagePullPolicy: "IfNotPresent"
          command: ["vGPUmonitor"]
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
              add: ["SYS_ADMIN"]
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
            - name: NVIDIA_MIG_MONITOR_DEVICES
              value: all
            - name: HOOK_PATH
              value: /usr/local/vgpu
          volumeMounts:
            - name: ctrs
              mountPath: /usr/local/vgpu/containers
            - name: dockers
              mountPath: /run/docker
            - name: containerds
              mountPath: /run/containerd
            - name: sysinfo
              mountPath: /sysinfo
            - name: hostvar
              mountPath: /hostvar
      volumes:
        - name: ctrs
          hostPath:
            path: /usr/local/vgpu/containers
        - name: hosttmp
          hostPath:
            path: /tmp
        - name: dockers
          hostPath:
            path: /run/docker
        - name: containerds
          hostPath:
            path: /run/containerd
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: lib
          hostPath:
            path: /usr/local/vgpu
        - name: usrbin
          hostPath:
            path: /usr/bin
        - name: sysinfo
          hostPath:
            path: /sys
        - name: hostvar
          hostPath:
            path: /var
        - name: deviceconfig
          configMap:
            name: hami-device-plugin
        - name: device-config
          configMap:
            name: hami-scheduler-device
      nodeSelector:
        gpu: "on"
---
# Source: hami/templates/scheduler/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hami-scheduler
  labels:
    app.kubernetes.io/component: hami-scheduler
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-scheduler
      app.kubernetes.io/name: hami
      app.kubernetes.io/instance: hami
  template:
    metadata:
      labels:
        app.kubernetes.io/component: hami-scheduler
        app.kubernetes.io/name: hami
        app.kubernetes.io/instance: hami
        hami.io/webhook: ignore
    spec:
      imagePullSecrets:
        []
      serviceAccountName: hami-scheduler
      priorityClassName: system-node-critical
      containers:
        - name: kube-scheduler
          image: registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.31.0
          imagePullPolicy: "IfNotPresent"
          command:
            - kube-scheduler
            - --config=/config/config.yaml
            - -v=4
            - --leader-elect=true
            - --leader-elect-resource-name=hami-scheduler
            - --leader-elect-resource-namespace=kube-system
          volumeMounts:
            - name: scheduler-config
              mountPath: /config
        - name: vgpu-scheduler-extender
          image: projecthami/hami:latest
          imagePullPolicy: "IfNotPresent"
          env:
          command:
            - scheduler
            - --http_bind=0.0.0.0:443
            - --cert_file=/tls/tls.crt
            - --key_file=/tls/tls.key
            - --scheduler-name=hami-scheduler
            - --metrics-bind-address=:9395
            - --node-scheduler-policy=binpack
            - --gpu-scheduler-policy=spread
            - --device-config-file=/device-config.yaml
            - --debug
            - -v=4
          ports:
            - name: http
              containerPort: 443
              protocol: TCP
          volumeMounts:
            - name: tls-config
              mountPath: /tls
            - name: device-config
              mountPath: /device-config.yaml
              subPath: device-config.yaml
      volumes:
        - name: tls-config
          secret:
            secretName: hami-scheduler-tls
        - name: scheduler-config
          configMap:
            name: hami-scheduler-newversion
        - name: device-config
          configMap:
            name: hami-scheduler-device
---
# Source: hami/templates/scheduler/webhook.yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: hami-webhook
webhooks:
  - admissionReviewVersions:
      - v1beta1
    clientConfig:
      service:
        name: hami-scheduler
        namespace: kube-system
        path: /webhook
        port: 443
    failurePolicy: Ignore
    matchPolicy: Equivalent
    name: vgpu.hami.io
    namespaceSelector:
      matchExpressions:
        - key: hami.io/webhook
          operator: NotIn
          values:
            - ignore
    objectSelector:
      matchExpressions:
        - key: hami.io/webhook
          operator: NotIn
          values:
            - ignore
    reinvocationPolicy: Never
    rules:
      - apiGroups:
          - ""
        apiVersions:
          - v1
        operations:
          - CREATE
        resources:
          - pods
        scope: '*'
    sideEffects: None
    timeoutSeconds: 10
---
# Source: hami/templates/scheduler/job-patch/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: hami-admission
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
---
# Source: hami/templates/scheduler/job-patch/clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: hami-admission
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
rules:
  - apiGroups:
      - admissionregistration.k8s.io
    resources:
      #- validatingwebhookconfigurations
      - mutatingwebhookconfigurations
    verbs:
      - get
      - update
---
# Source: hami/templates/scheduler/job-patch/clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hami-admission
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: hami-admission
subjects:
  - kind: ServiceAccount
    name: hami-admission
    namespace: "kube-system"
---
# Source: hami/templates/scheduler/job-patch/role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: hami-admission
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
rules:
  - apiGroups:
      - ""
    resources:
      - secrets
    verbs:
      - get
      - create
---
# Source: hami/templates/scheduler/job-patch/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: hami-admission
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade,post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: hami-admission
subjects:
  - kind: ServiceAccount
    name: hami-admission
    namespace: "kube-system"
---
# Source: hami/templates/scheduler/job-patch/job-createSecret.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hami-admission-create
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
spec:
  template:
    metadata:
      name: hami-admission-create
      labels:
        helm.sh/chart: hami-2.4.0
        app.kubernetes.io/name: hami
        app.kubernetes.io/instance: hami
        app.kubernetes.io/version: "2.4.0"
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/component: admission-webhook
        hami.io/webhook: ignore
    spec:
      imagePullSecrets:
        []
      containers:
        - name: create
          image: liangjw/kube-webhook-certgen:v1.1.1
          imagePullPolicy: IfNotPresent
          args:
            - create
            - --cert-name=tls.crt
            - --key-name=tls.key
            - --host=hami-scheduler.kube-system.svc,127.0.0.1
            - --namespace=kube-system
            - --secret-name=hami-scheduler-tls
      restartPolicy: OnFailure
      serviceAccountName: hami-admission
      securityContext:
        runAsNonRoot: true
        runAsUser: 2000
---
# Source: hami/templates/scheduler/job-patch/job-patchWebhook.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hami-admission-patch
  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: hami-2.4.0
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
    app.kubernetes.io/version: "2.4.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
spec:
  template:
    metadata:
      name: hami-admission-patch
      labels:
        helm.sh/chart: hami-2.4.0
        app.kubernetes.io/name: hami
        app.kubernetes.io/instance: hami
        app.kubernetes.io/version: "2.4.0"
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/component: admission-webhook
        hami.io/webhook: ignore
    spec:
      imagePullSecrets:
        []
      containers:
        - name: patch
          image: liangjw/kube-webhook-certgen:v1.1.1
          imagePullPolicy: IfNotPresent
          args:
            - patch
            - --webhook-name=hami-webhook
            - --namespace=kube-system
            - --patch-validating=false
            - --secret-name=hami-scheduler-tls
      restartPolicy: OnFailure
      serviceAccountName: hami-admission
      securityContext:
        runAsNonRoot: true
        runAsUser: 2000
```
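One knob worth calling out from the manifest above: `devicememoryscaling` (1.8 in the device-plugin node config; the device config defaults to `deviceMemoryScaling: 1`) oversubscribes device memory, so the memory a node advertises for vGPU requests is physical memory times the scaling factor. A quick check of the arithmetic, using a 24576 MiB card as an assumed example size:

```shell
# Advertised vGPU memory = physical device memory * devicememoryscaling.
# A 24576 MiB (24 GiB) card is assumed here purely for illustration;
# with the 1.8 factor from the node config it advertises:
awk 'BEGIN { printf "%d MiB\n", 24576 * 1.8 }'
# With the default deviceMemoryScaling: 1 a node advertises exactly its
# physical memory.
```

With scaling above 1, the sum of all pods' `nvidia.com/gpumem` requests on a card can exceed its physical memory, so this is a deliberate oversubscription trade-off.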
Next, deploy dcgm-exporter:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "3.6.1"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "3.6.1"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "3.6.1"
      name: "dcgm-exporter"
    spec:
      containers:
        - image: "nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04"
          env:
            - name: "DCGM_EXPORTER_LISTEN"
              value: ":9400"
            - name: "DCGM_EXPORTER_KUBERNETES"
              value: "true"
          name: "dcgm-exporter"
          ports:
            - name: "metrics"
              containerPort: 9400
          securityContext:
            runAsNonRoot: false
            runAsUser: 0
            capabilities:
              add: ["SYS_ADMIN"]
          volumeMounts:
            - name: "pod-gpu-resources"
              readOnly: true
              mountPath: "/var/lib/kubelet/pod-resources"
      volumes:
        - name: "pod-gpu-resources"
          hostPath:
            path: "/var/lib/kubelet/pod-resources"
---
kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "3.6.1"
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "3.6.1"
  ports:
    - name: "metrics"
      port: 9400
```
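Once the DaemonSet is up, each GPU node serves metrics on port 9400 in the Prometheus text format. The shape of those lines can be previewed offline; the sample line below is illustrative (`DCGM_FI_DEV_GPU_UTIL` is one of dcgm-exporter's default counters, and the label values are made up):

```shell
# Parse one sample dcgm-exporter line (Prometheus text exposition format):
# metric_name{label="value",...} value
sample='DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-7666e9de"} 37'
name=${sample%%\{*}   # everything before the first '{'
value=${sample##* }   # everything after the last space
echo "$name=$value"
```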
dcgm-exporter is now installed.
Next, download the panel JSON for the hami-vgpu dashboard
(hami-vgpu-dashboard on Grafana Labs). Importing it creates a dashboard named "hami-vgpu-dashboard" in Grafana, but some panels on this page, such as vGPUCorePercentage, have no data yet.
ServiceMonitor is a custom resource of the Prometheus Operator for monitoring services in Kubernetes. It provides:
1. Automatic discovery
A ServiceMonitor lets Prometheus automatically discover and monitor services in Kubernetes. By defining one, you tell Prometheus which service endpoints to scrape.
2. Scrape configuration
A ServiceMonitor carries the scrape parameters, for example:
- Scrape interval: how often Prometheus scrapes (e.g. every 30 s).
- Timeout: how long a scrape request may take.
- Label selectors: which services to monitor, so Prometheus only scrapes data from the relevant services.
Two ServiceMonitors need to be configured, one per HAMi component:
hami-device-plugin-svc-monitor.yaml
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-device-plugin-svc-monitor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-device-plugin
  namespaceSelector:
    matchNames:
      - kube-system
  endpoints:
    - path: /metrics
      port: monitorport
      interval: "15s"
      honorLabels: false
      relabelings:
        - sourceLabels: [__meta_kubernetes_endpoints_name]
          regex: hami-.*
          replacement: $1
          action: keep
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          regex: (.*)
          targetLabel: node_name
          replacement: ${1}
          action: replace
        - sourceLabels: [__meta_kubernetes_pod_host_ip]
          regex: (.*)
          targetLabel: ip
          replacement: $1
          action: replace
```
hami-scheduler-svc-monitor.yaml
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-scheduler-svc-monitor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-scheduler
  namespaceSelector:
    matchNames:
      - kube-system
  endpoints:
    - path: /metrics
      port: monitor
      interval: "15s"
      honorLabels: false
      relabelings:
        - sourceLabels: [__meta_kubernetes_endpoints_name]
          regex: hami-.*
          replacement: $1
          action: keep
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          regex: (.*)
          targetLabel: node_name
          replacement: ${1}
          action: replace
        - sourceLabels: [__meta_kubernetes_pod_host_ip]
          regex: (.*)
          targetLabel: ip
          replacement: $1
          action: replace
```
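The `action: keep` relabeling in both ServiceMonitors drops any discovered endpoint whose name does not match `hami-.*`, so only the HAMi endpoints survive. The same filter, emulated offline on a few illustrative endpoint names:

```shell
# Emulate the 'keep' relabeling: only endpoint names matching hami-.* survive.
printf '%s\n' hami-device-plugin-monitor hami-scheduler dcgm-exporter \
  | grep '^hami-'
```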
Confirm that the ServiceMonitors were created.
Then start a GPU pod to test:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-1
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.2.1
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 1000
          nvidia.com/gpucores: 10
```
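In this request, `nvidia.com/gpumem: 1000` asks for device memory in MiB and `nvidia.com/gpucores: 10` for a 10% share of the card's compute (units as assumed here from HAMi's resource naming; check the project docs for your version). Expressed in bytes, that memory limit is:

```shell
# nvidia.com/gpumem is a device-memory request in MiB (assumed unit);
# 1000 MiB in bytes, at 1048576 bytes per MiB:
awk 'BEGIN { printf "%.0f\n", 1000 * 1048576 }'
```

The same 1048576-bytes-per-MiB scale shows up later in the values of the scheduler's vGPUPodsDeviceAllocated metric.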
If the pod stays stuck in Pending,
check the nodes. If a node reports 0 GPUs as below,
you need to do the following:
For Docker:
1. Download and install the nvidia-docker2 package.
2. Add the NVIDIA runtime to /etc/docker/daemon.json:

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```
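A malformed daemon.json (a stray trailing comma is a common slip) stops dockerd from starting, so it is worth validating the file before restarting Docker. A minimal check with Python's stdlib JSON parser, writing to a temporary path here for illustration:

```shell
# Write the runtime config (to /tmp here; /etc/docker/daemon.json in practice)
# and validate it before restarting dockerd.
cat > /tmp/daemon.json <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
python3 -m json.tool /tmp/daemon.json > /dev/null && echo "daemon.json: valid JSON"
```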
For Kubernetes:
1. Pull the k8s-device-plugin image.
2. Write nvidia-device-plugin.yml to create the driver pods,
using this manifest:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      priorityClassName: "system-node-critical"
      containers:
        - image: nvidia/k8s-device-plugin:1.11
          name: nvidia-device-plugin-ctr
          env:
            - name: FAIL_ON_INIT_ERROR
              value: "false"
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
```
After the GPU pod starts, exec in and take a look: the GPU memory seen inside the pod matches the configured limit, so the setting took effect.

Now open {scheduler node ip}:31993/metrics.

The last two lines of the output are:

```
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="40",deviceuuid="GPU-7666e9de-679b-a768-51c6-260b81cd00ec",nodename="192.168.110.126",podname="gpu-pod-1",podnamespace="default",zone="vGPU"} 1.048576e+10
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="40",deviceuuid="GPU-7666e9de-679b-a768-51c6-260b81cd00ec",nodename="192.168.110.126",podname="gpu-pod-2",podnamespace="default",zone="vGPU"} 1.048576e+10
```

Different pods report the same deviceuuid, showing that the same physical GPU is shared between them.
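Grouping those lines by `deviceuuid` makes the sharing explicit; a quick pass over the scraped output (the sample file below just holds the two lines shown above):

```shell
# Count allocations per physical GPU from vGPUPodsDeviceAllocated lines.
cat > /tmp/vgpu-metrics.txt <<'EOF'
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="40",deviceuuid="GPU-7666e9de-679b-a768-51c6-260b81cd00ec",nodename="192.168.110.126",podname="gpu-pod-1",podnamespace="default",zone="vGPU"} 1.048576e+10
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="40",deviceuuid="GPU-7666e9de-679b-a768-51c6-260b81cd00ec",nodename="192.168.110.126",podname="gpu-pod-2",podnamespace="default",zone="vGPU"} 1.048576e+10
EOF
# Pull out the deviceuuid label of each series and count occurrences:
grep -o 'deviceuuid="[^"]*"' /tmp/vgpu-metrics.txt | sort | uniq -c
```

A count greater than 1 for a UUID means that card is allocated to more than one pod.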
Exec into the hami-device-plugin DaemonSet pod and run `nvidia-smi -L` to see every GPU on the machine:

```
root@node126:/# nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-7666e9de-679b-a768-51c6-260b81cd00ec)
GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-9f32af29-1a72-6e47-af2c-72b1130a176b)
root@node126:/#
```
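The UUIDs in that listing are exactly what HAMi reports as `deviceuuid`, so when scripting against nodes it is handy to extract them; a sketch using the captured output above as a stand-in for live `nvidia-smi -L` output:

```shell
# Extract GPU UUIDs from nvidia-smi -L output (sample captured above).
cat > /tmp/smi.txt <<'EOF'
GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-7666e9de-679b-a768-51c6-260b81cd00ec)
GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-9f32af29-1a72-6e47-af2c-72b1130a176b)
EOF
sed -n 's/.*UUID: \(GPU-[0-9a-f-]*\)).*/\1/p' /tmp/smi.txt
```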
The two ServiceMonitors created earlier scrape the /metrics endpoints of the workloads labeled
`app.kubernetes.io/component: hami-scheduler` and `app.kubernetes.io/component: hami-device-plugin`.
Once the GPU pods are running, check the hami-vgpu-metrics-dashboard again.