How to run the Dolphins model on k8s

渣渣兔 · 2024-11-4 22:35:03

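Following up on the introduction in the previous post, this post actually runs the Dolphins model and records the common pitfalls of getting a model to run.

1 Create a pod on k8s

Mount the external data into the pod and request GPU resources. Also update the corresponding data paths in the code.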

# dolphins.yaml
apiVersion: v1
kind: Pod
metadata:
  name: czl-test-pod-dolphins
  labels:
    app: czl-dolphins
spec:
  containers:
    - name: czl-1-container
      image: harbor.yoocar.com.cn/deeplearning/pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel
      #imagePullPolicy: Always
      command: ['sh', '-c', 'sleep infinity;']
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
      volumeMounts:
        - name: data
          mountPath: /mount/bev
        - name: dshm
          mountPath: /dev/shm
  volumes:
    - name: data
      hostPath:
        path: "/root/data/pjp/dolphins"
        type: Directory
    - name: dshm
      emptyDir:
        medium: Memory
        sizeLimit: 1000Gi
  restartPolicy: Never

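Create the pod from the YAML file. The later commands all pass -n test, so the pod has to be created in the test namespace:

kubectl apply -f dolphins.yaml -n test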
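Before copying any code into the pod, it is worth confirming that the pod is really running and that the GPU request was accepted. Here is a minimal sketch, assuming you feed it the JSON printed by `kubectl get pod czl-test-pod-dolphins -n test -o json`. The helper name `pod_summary` and the trimmed sample object are made up for illustration.

```python
import json


def pod_summary(pod: dict) -> dict:
    """Extract phase, readiness and total GPU limit from a parsed Pod object."""
    status = pod.get("status", {})
    container_statuses = status.get("containerStatuses", [])
    gpus = 0
    for container in pod.get("spec", {}).get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        # kubectl prints resource quantities as strings, e.g. "1"
        gpus += int(limits.get("nvidia.com/gpu", 0))
    ready = bool(container_statuses) and all(
        c.get("ready", False) for c in container_statuses
    )
    return {"phase": status.get("phase", "Unknown"), "ready": ready, "gpus": gpus}


# Hypothetical, heavily trimmed sample of what kubectl returns for the pod
SAMPLE = json.loads("""
{
  "spec": {"containers": [
    {"name": "czl-1-container",
     "resources": {"limits": {"nvidia.com/gpu": "1"}}}
  ]},
  "status": {"phase": "Running",
             "containerStatuses": [{"name": "czl-1-container", "ready": true}]}
}
""")

print(pod_summary(SAMPLE))
```

If the phase stays at Pending, the usual suspects are the GPU request or the hostPath directory, and `kubectl describe pod` shows the scheduling events.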

2 Download Dolphins from GitHub

https://github.com/SaFoLab-WISC/Dolphins/tree/main
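
2.1 Modify the source: dependencies

The goal here is to avoid install errors, such as the same package being pinned to two different versions: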

ERROR: Cannot install einops==0.6.1 and einops==0.7.0 because these package versions have conflicting dependencies.

Edit the dependency file, requirement.txt, directly:

# requirements.txt after updating the dependencies, with some versions pinned
open_clip_torch==2.16.0
opencv_python_headless==4.5.5.64
#einops==0.6.1
einops_exts==0.0.4
transformers==4.28.1
accelerate==0.31.0
deepspeed==0.9.3
huggingface_hub
inflection==0.5.1
nltk==3.8.1
numpy==1.23.5
#torch==2.0.0
#torchvision==0.15.1
tqdm==4.65.0
fastapi>=0.95.2
gradio==3.34
braceexpand==0.1.7
einops==0.7.0
fastapi==0.104.1
#horovod==0.27.0
huggingface_hub==0.14.0
ijson==3.2.3
importlib_metadata==6.6.0
inflection==0.5.1
markdown2==2.4.8
natsort==8.4.0
nltk==3.8.1
#numpy==1.26.2
openai==1.3.7
orjson==3.9.10
packaging==23.2
Pillow==10.1.0
pycocoevalcap==1.2
pycocotools==2.0.7
Requests==2.31.0
uvicorn==0.24.0.post1
webdataset==0.2.79
wandb
datasets
mmengine
peft
pandas
h5py
# https://github.com/gradio-app/gradio/issues/4306
httpx==0.24.1
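The list above still names a few packages twice, so a quick scan for repeated entries helps to catch conflicts like the einops one before pip does. Here is a minimal sketch using only the standard library. The function name `find_repeated_requirements` and the short sample text are made up for illustration, and the parser only handles simple `name<specifier>` lines like the ones in this file.

```python
import re
from collections import defaultdict


def find_repeated_requirements(text: str) -> dict:
    """Map each package listed more than once to the specifiers it was given."""
    seen = defaultdict(list)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        match = re.match(r"([A-Za-z0-9][A-Za-z0-9._-]*)\s*(.*)", line)
        if not match:
            continue
        # Treat "-", "_" and "." as equivalent and ignore case in package names
        name = re.sub(r"[-_.]+", "-", match.group(1)).lower()
        seen[name].append(match.group(2).strip())
    return {name: specs for name, specs in seen.items() if len(specs) > 1}


# Hypothetical sample that reproduces the einops conflict
SAMPLE = """
einops==0.6.1
einops_exts==0.0.4
fastapi>=0.95.2
einops==0.7.0
fastapi==0.104.1
#numpy==1.26.2
numpy==1.23.5
"""

for name, specs in find_repeated_requirements(SAMPLE).items():
    print(name, specs)
```

A repeated entry is not always an error: `fastapi>=0.95.2` and `fastapi==0.104.1` can both be satisfied, while two different `==` pins cannot.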

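2.2 Modify the source: data paths

Normally, load_pretrained_modoel downloads its data from Hugging Face. If the download is not possible, the only option is to fetch the files yourself. I keep all of them in one place and mount that directory into the pod under /mount/bev/. The files I gathered are the ones referenced by the paths in the code below.

Change the data paths in the source code. The changed lines are marked with comments: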

def load_pretrained_modoel():
    peft_config, peft_model_id = None, None
    peft_config = LoraConfig(**openflamingo_tuning_config)
    model, image_processor, tokenizer = create_model_and_transforms(
        clip_vision_encoder_path="ViT-L-14-336",
        clip_vision_encoder_pretrained="openai",
        clip_vision_encoder_cache_dir="/mount/bev/clip",  # changed: cache directory for the CLIP vision encoder, so ViT-L-14-336 is looked up here
        lang_encoder_path="/mount/bev/anas-awadalla/mpt-7b",  # changed path (Hugging Face id: anas-awadalla/mpt-7b)
        tokenizer_path="/mount/bev/anas-awadalla/mpt-7b",  # changed path (Hugging Face id: anas-awadalla/mpt-7b)
        cross_attn_every_n_layers=4,
        use_peft=True,
        peft_config=peft_config,
    )
    checkpoint_path = "/mount/bev/huggingface/gray311/Dolphins/checkpoint.pt"  # changed path
    model.load_state_dict(torch.load(checkpoint_path), strict=False)
    model.half().cuda()
    return model, image_processor, tokenizer
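Loading the model takes a while, so a wrong mount path is cheaper to catch up front. Here is a minimal sketch of a pre-flight check to run inside the pod. The helper name `missing_paths` is made up for illustration, and the three paths are simply the ones used in the function above.

```python
import os


def missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


# Paths used by the modified load_pretrained_modoel
REQUIRED = [
    "/mount/bev/clip",
    "/mount/bev/anas-awadalla/mpt-7b",
    "/mount/bev/huggingface/gray311/Dolphins/checkpoint.pt",
]

if __name__ == "__main__":
    missing = missing_paths(REQUIRED)
    if missing:
        print("Missing before model load:")
        for path in missing:
            print("  ", path)
    else:
        print("All model paths are present.")
```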

3 Upload the code from the local machine to the pod

kubectl cp Dolphins-main czl-test-pod-dolphins:/workspace/Dolphins-main -n test

4 Enter the pod, install the dependencies, and run the model

kubectl exec -it czl-test-pod-dolphins -n test -- bash
cd /workspace/Dolphins-main
pip install -r requirement.txt
python inference.py

This is where the series of errors starts.

5 Fixing the errors

Error 1:

Fix 1: switch the package index

pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple/

Error 2:

Fix 2: install ffmpeg, libsm6, and libxext6

apt-get install ffmpeg libsm6 libxext6 -y

If the install above failed with yet another error, run the following first. If it succeeded, skip this step:

apt update
apt-get install software-properties-common

Then install again:

apt-get install ffmpeg libsm6 libxext6 -y
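To confirm that the system libraries are now visible to Python, you can ask the loader directly. Here is a minimal sketch, assuming a Debian or Ubuntu based image where `ctypes.util.find_library` can query the linker cache. `SM` and `Xext` are the short names of the libraries shipped by the libsm6 and libxext6 packages, and the helper name `check_shared_libs` is made up for illustration.

```python
from ctypes.util import find_library


def check_shared_libs(names):
    """Map each short library name to the file the loader would pick, or None."""
    return {name: find_library(name) for name in names}


for name, found in check_shared_libs(["SM", "Xext"]).items():
    print(f"lib{name}: {found or 'NOT FOUND'}")
```

A `NOT FOUND` line after the install suggests that the apt step did not complete.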

6 Results

Option 1:

python inference.py

Option 2:

This option needs a k8s Service that exposes the pod outside the cluster. The port I expose is 30066.

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: czl-dolphins-svc
spec:
  selector:
    app: czl-dolphins
  type: NodePort
  ports:
    - protocol: TCP
      port: 7862
      targetPort: 7862
      nodePort: 30066

Create the Service:

kubectl apply -f service.yaml -n test
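Once the web server from the startup commands in this section is running, a plain TCP probe from outside the cluster tells you whether the NodePort is reachable before you open a browser. Here is a minimal sketch. The helper name `port_is_open` is made up for illustration, and `NODE_IP` is a placeholder that you need to replace with the address of one of your own cluster nodes.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and unreachable hosts
        return False


if __name__ == "__main__":
    NODE_IP = "127.0.0.1"  # placeholder: use the address of one of your cluster nodes
    print(f"{NODE_IP}:30066 reachable:", port_is_open(NODE_IP, 30066))
```

If the probe fails while the pod is healthy, check that the Service selector `app: czl-dolphins` matches the pod label and that the web server listens on 0.0.0.0.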

Next come the startup commands, run in this order:

python -m serve.controller --host 0.0.0.0 --port 10000

CUDA_VISIBLE_DEVICES=0 python -m serve.model_worker --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model_name dolphins --use_lora --num_gpus 1 --limit_model_concurrency 200

python -m serve.gradio_web_server_video --controller http://localhost:10000 --port 7862 --host 0.0.0.0 --share

Remember to include --host 0.0.0.0 in this command.
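The three startup commands above share the controller URL, and the web port has to match the targetPort in service.yaml, so it can help to generate them from one place. Here is a minimal sketch that only assembles and prints the command lines. The function name `build_commands` is made up for illustration, and the flags are copied from the commands above, not checked against the Dolphins source.

```python
import shlex


def build_commands(controller_port=10000, worker_port=40000, web_port=7862):
    """Return the controller, worker and web server commands as argument lists."""
    controller_url = f"http://localhost:{controller_port}"
    controller = ["python", "-m", "serve.controller",
                  "--host", "0.0.0.0", "--port", str(controller_port)]
    # The post runs the worker with CUDA_VISIBLE_DEVICES=0 set in the environment
    worker = ["python", "-m", "serve.model_worker",
              "--controller", controller_url,
              "--port", str(worker_port),
              "--worker", f"http://localhost:{worker_port}",
              "--model_name", "dolphins", "--use_lora",
              "--num_gpus", "1", "--limit_model_concurrency", "200"]
    web = ["python", "-m", "serve.gradio_web_server_video",
           "--controller", controller_url,
           "--port", str(web_port), "--host", "0.0.0.0", "--share"]
    return [controller, worker, web]


for command in build_commands():
    print(shlex.join(command))
```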

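At this point, opening the cluster address plus the port exposed through service.yaml brings up the Dolphins web page. If the page does not render as expected, the gradio version is probably wrong. I use version 3.34.0. Every other version I tried either raised errors or rendered a broken web UI.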
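Since the web UI depends on one specific gradio release, a quick version check inside the pod can save a debugging round. Here is a minimal sketch using the standard library's `importlib.metadata`. The helper names are made up for illustration, and the comparison only treats versions as equal up to trailing `.0` components, so `3.34` and `3.34.0` match.

```python
from importlib import metadata

REQUIRED_GRADIO = "3.34"  # the pin from requirement.txt; the post reports 3.34.0 as working


def gradio_version_ok(installed, required=REQUIRED_GRADIO):
    """True if `installed` equals `required`, ignoring trailing ".0" components."""
    def normalize(version):
        parts = version.split(".")
        while len(parts) > 1 and parts[-1] == "0":
            parts.pop()
        return parts

    return installed is not None and normalize(installed) == normalize(required)


def installed_gradio():
    """Return the installed gradio version string, or None if it is not installed."""
    try:
        return metadata.version("gradio")
    except metadata.PackageNotFoundError:
        return None


version = installed_gradio()
print("gradio:", version, "| matches pin:", gradio_version_ok(version))
```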

7 Summary

Before running a model, check whether the machine can run it at all, whether it needs GPU resources, and how much GPU memory is available.
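A rough estimate makes the GPU memory point concrete. Here is a minimal sketch, assuming a nominal 7 billion parameters for mpt-7b (taken from the model name, not measured) stored in 16-bit after `model.half()`. The CLIP vision encoder, the LoRA weights, and the activations come on top of that figure. The second helper parses the output of `nvidia-smi --query-gpu=memory.total,memory.used --format=csv,noheader,nounits`, which reports MiB. Both function names and the sample line are made up for illustration.

```python
def fp16_weight_gib(n_params: float) -> float:
    """Memory needed for the weights alone at 2 bytes per parameter, in GiB."""
    return n_params * 2 / 2**30


def parse_nvidia_smi(csv_text: str):
    """Parse "total, used" lines (MiB) into a list of (total_mib, used_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        total, used = (int(field) for field in line.split(","))
        rows.append((total, used))
    return rows


needed = fp16_weight_gib(7e9)
print(f"fp16 weights for 7e9 parameters: about {needed:.1f} GiB")

# Hypothetical nvidia-smi output for a single 24 GiB card
for total_mib, used_mib in parse_nvidia_smi("24576, 512\n"):
    free_gib = (total_mib - used_mib) / 1024
    print(f"free: {free_gib:.1f} GiB, enough for the weights: {free_gib > needed}")
```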
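Workflow:

Download the model source code from GitHub
Prepare the data: find the data the model uses, store it alongside the source, and update the paths in the code
Run the model: install the dependencies, then use the startup commands from the model's GitHub page
Fix the errors as they come up: environment problems, dependency problems, and sometimes reading and modifying the source code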
