Deploying the full quantized DeepSeek-R1 671B with KTransformers + OpenWebUI, plus multi-GPU (multi-card) setup
ktransformers is a framework aimed at running the 671B DeepSeek model on a limited budget, using a hybrid GPU + CPU inference scheme. The official benchmark uses dual Xeon® Gold 6454S (64 cores total), 1TB of DDR5 RAM, and an RTX 4090 24GB, reaching a generation speed of 13.69 tokens/s with the 671B 4-bit quantized model. ktransformers uses the Intel AMX instruction extension, which noticeably speeds up prefill. My environment: Intel 6133 ×2 (40 cores) + 256GB DDR4 + 4090D ×4 + 2TB SSD + 16TB HDD, Ubuntu 22.04, CUDA 12.6, Python 3.11.
Note: as of 2025-02-21, the officially supported models are as follows:
https://i-blog.csdnimg.cn/direct/1eac5cd461aa4f8a8b922c9c48d23ade.png
Other model versions will produce garbled output when loaded. I use DeepSeek-R1-Q2_K_XS below; adjust according to whatever the latest release supports.
1. Preparation
Detailed steps are at https://kvcache-ai.github.io/ktransformers/index.html
Here is a brief summary.
First you need CUDA 12.1 or newer (V0.3 seems to require 12.5); just install the latest if you can.
1. Add CUDA environment variables
Run vim ~/.bashrc and append at the end:
# Adding CUDA to PATH
if [ -d "/usr/local/cuda/bin" ]; then
export PATH=$PATH:/usr/local/cuda/bin
fi
if [ -d "/usr/local/cuda/lib64" ]; then
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
# Or you can add it to /etc/ld.so.conf and run ldconfig as root:
# echo "/usr/local/cuda-12.x/lib64" | sudo tee -a /etc/ld.so.conf
# sudo ldconfig
fi
if [ -d "/usr/local/cuda" ]; then
export CUDA_PATH=$CUDA_PATH:/usr/local/cuda
fi
Then save with :wq and exit.
2. Install the build toolchain
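To confirm the CUDA toolchain is now visible in the current shell (a quick sanity check, assuming the default /usr/local/cuda install path used above):
source ~/.bashrc   # reload the variables added above
nvcc --version     # should print the CUDA compiler version, e.g. 12.6
nvidia-smi         # should list your GPU(s) and the driver's CUDA version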
sudo apt-get update
sudo apt-get install build-essential cmake ninja-build
3. Create a conda environment
Look up a conda installation tutorial yourself; there are plenty online.
conda create --name ktransformers python=3.11
conda activate ktransformers # you may need to run ‘conda init’ and reopen shell first
conda install -c conda-forge libstdcxx-ng # Anaconda provides a package called `libstdcxx-ng` that includes a newer version of `libstdc++`, which can be installed via `conda-forge`.
strings ~/anaconda3/envs/ktransformers/lib/libstdc++.so.6 | grep GLIBCXX
4. Install PyTorch and other dependencies
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
pip3 install packaging ninja cpufeature numpy
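Before continuing, it is worth verifying that this PyTorch build can actually see the GPU (a minimal check, not part of the official guide):
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
# expect something like: 2.6.0+cu126 12.6 True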
2. Install the latest stable release
1. Clone the code and initialize submodules
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule init
git submodule update
2. Install
bash install.sh
Note: if you have a dual-socket CPU and your RAM is more than twice the model size, run:
# Make sure your system has dual sockets and double size RAM than the model's size (e.g. 1T RAM for 512G model)
export USE_NUMA=1
This makes ktransformers use both CPUs, but the model weights then get loaded twice (once per NUMA node). So if your RAM is less than twice the model size, do not set this variable, otherwise it will actually slow things down dramatically. A quick way to check your layout is sketched below.
Then run:
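If you are not sure whether your machine really has two sockets (two NUMA nodes) or enough RAM, check first (plain Linux tools, nothing ktransformers-specific):
lscpu | grep -i "numa node"   # "NUMA node(s): 2" indicates a dual-socket layout
free -h                       # total RAM should exceed 2x the GGUF size before setting USE_NUMA=1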
bash install.sh
# or `make dev_install` — this step can take a long time and may even fail with errors…
3. Install flash-attention
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.3cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
pip install flash_attn-2.7.4.post1+cu12torch2.3cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
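A quick way to confirm the prebuilt wheel matches your environment (Python 3.11 and the CUDA 12 / torch ABI encoded in the filename) is to import it:
python -c "import flash_attn; print(flash_attn.__version__)"
# should print 2.7.4.post1 with no ImportError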
2 (alternative). Install the V0.3 preview (requires AMX support)
Because this requires a Xeon CPU with the AMX instruction set, the rest of this post proceeds with version 0.2.1 instead.
AMX is available on 4th-gen (Sapphire Rapids) and 5th-gen (Emerald Rapids) Xeon Scalable CPUs.
wget https://github.com/kvcache-ai/ktransformers/releases/download/v0.1.4/ktransformers-0.3.0rc0+cu126torch26fancy-cp311-cp311-linux_x86_64.whl
pip install ./ktransformers-0.3.0rc0+cu126torch26fancy-cp311-cp311-linux_x86_64.whl
python -m ktransformers.local_chat --model_path <your model path> --gguf_path <your gguf path> --prompt_file <your prompt txt file> --cpu_infer 40 --max_new_tokens 1000
<when you see chat, then press enter to load the text prompt_file>
3. Download the model files from Hugging Face
For example, download the model files from
https://hf-mirror.com/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-Q2_K_XS
or
https://www.modelscope.cn/models/unsloth/DeepSeek-R1-GGUF/files
into a directory of your choice.
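If you prefer the command line, a minimal sketch using huggingface-cli with the hf-mirror endpoint (the target directory /data/DeepSeek-R1 is just an example; match it to the paths used later):
pip install -U "huggingface_hub[cli]"
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download unsloth/DeepSeek-R1-GGUF --include "DeepSeek-R1-Q2_K_XS/*" --local-dir /data/DeepSeek-R1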
Then copy all of the files from
https://github.com/ubergarm/r1-ktransformers-guide
into that same directory.
https://i-blog.csdnimg.cn/direct/7cfe0a33ad8641b1bc30dbca94597845.png
4. Test run
Create a prompt file:
touch p.txt
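You can also put a short test prompt in it, for example:
echo "Hello, please introduce yourself briefly." > p.txt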
Change the paths below to your own, and set --cpu_infer to a value slightly below your total CPU core count:
python ./ktransformers/ktransformers/local_chat.py \
--gguf_path "/data/DeepSeek-R1/DeepSeek-R1-Q2_K_XS/" \
--model_path "/data/DeepSeek-R1/DeepSeek-R1-Q2_K_XS/" \
--prompt_file ./p.txt \
--cpu_infer 38 \
--max_new_tokens 1024 \
--force_think true
At this point you may hit an error:
ImportError: /home/user/anaconda3/envs/ktransformers/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /home/user/anaconda3/envs/ktransformers/lib/python3.11/site-packages/cpuinfer_ext.cpython-311-x86_64-linux-gnu.so)
Fix:
Method 1:
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install --only-upgrade libstdc++6
conda install -c conda-forge libstdcxx-ng
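Afterwards you can re-run the check from the preparation step to confirm that the conda environment's libstdc++ now exports the missing symbol:
strings ~/anaconda3/envs/ktransformers/lib/libstdc++.so.6 | grep GLIBCXX_3.4.30
# should now print GLIBCXX_3.4.30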
Method 2:
https://www.cnblogs.com/michaelcjl/p/18432886
https://blog.csdn.net/goodsirlee/article/details/106231821
Once loaded, you will see the chat: prompt:
https://i-blog.csdnimg.cn/direct/03f6e6adcd094c668bfd2de17f85ccff.png
5. Running the web / API server
ktransformers \
--gguf_path "/home/user/r1_gguf/DeepSeek-R1-Q2_K_XS" \
--model_path "/home/user/r1_gguf/DeepSeek-R1-Q2_K_XS" \
--cpu_infer 38 \
--no_flash_attn false \
--total_context 2048 \
--cache_q4 true \
If you only need the API, add:
--port 10002
If you also want the built-in web UI, add:
--port 10002
--web True
Once it starts successfully you will see:
https://i-blog.csdnimg.cn/direct/685bf35e2c9646ec9c06cdacd0d1a1ba.png
Then open http://localhost:10002/web/index.html#/chat
6. Connecting OpenWebUI
First test locally with curl:
curl -X 'POST' \
'http://localhost:10002/api/generate' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "DeepSeek-R1-Q2_K_XS",
"prompt": "hello.",
"stream": true
}'
As long as data comes back, it works.
Remember to open the port in your firewall.
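Since OpenWebUI talks to the OpenAI-compatible endpoint rather than /api/generate, it can also help to test /v1/chat/completions directly (a sketch; use whatever model name the server reports under /v1/models):
curl http://localhost:10002/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "DeepSeek-R1-Q2_K_XS",
"messages": [{"role": "user", "content": "hello"}],
"stream": false
}'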
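On Ubuntu with ufw, for example (assuming the port 10002 used above):
sudo ufw allow 10002/tcp
sudo ufw status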
An OpenWebUI installation guide is here: https://blog.csdn.net/qq_26123545/article/details/145723607
Then in OpenWebUI:
https://i-blog.csdnimg.cn/direct/144f58c03b564d5cb9e05e98c4806c41.png
Add an OpenAI API connection here and fill in
http://host.docker.internal:10002/v1
or
http://127.0.0.1:10002/v1
and that's it.
You should then see the model you just added:
https://i-blog.csdnimg.cn/direct/c883acf9cafa44b9a2259d56a3633ab1.png
Also remember to disable title and tag generation here, or point it at a lightweight model. Otherwise, after each answer OpenWebUI sends several extra requests asking the model to generate a title and tags, making it reason about irrelevant content and wasting a lot of resources (I switched this to a 1.5B model).
https://i-blog.csdnimg.cn/direct/04468737edb24597b451a1911dcddeb4.png
7. Multi-GPU setup
For multi-GPU you need to refer to the files below:
https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules
https://github.com/ubergarm/r1-ktransformers-guide
You will probably need to edit the YAML rules yourself; the official docs do not explain this part in much detail.
For now I configured KExpertsMarlin, but it seems to throw errors:
- match:
    name: "^model.embed_tokens"
  replace:
    class: "default"
    kwargs:
      generate_device: "cpu"
      prefill_device: "cpu"

# GPU 0: layers 0-5
- match:
    name: "^model\\.layers\\.([0-5])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cuda:0"
      generate_op: "KExpertsMarlin"
  recursive: False

# GPU 1: layers 6-11
- match:
    name: "^model\\.layers\\.([6-9]|1[0-1])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cuda:1"
      generate_op: "KExpertsMarlin"
  recursive: False

# GPU 2: layers 12-17
- match:
    name: "^model\\.layers\\.(1[2-7])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cuda:2"
      generate_op: "KExpertsMarlin"
  recursive: False

# GPU 3: layers 18-23
- match:
    name: "^model\\.layers\\.(1[8-9]|2[0-3])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cuda:3"
      generate_op: "KExpertsMarlin"
  recursive: False
Or use the file
ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-4.yaml
from the repo.
Then run:
ktransformers \
--gguf_path "/home/user/r1_gguf/DeepSeek-R1-Q2_K_XS" \
--model_path "/home/user/r1_gguf/DeepSeek-R1-Q2_K_XS" \
--cpu_infer 38 \
--port 10002 \
--no_flash_attn false \
--total_context 2048 \
--cache_q4 true \
--optimize_config_path /home/user/r1_gguf/DeepSeek-R1-Q2_K_XS/custom-multi-gpu-4.yaml
Note: this feature is still immature. The team is actively improving the multi-GPU implementation, so please wait for later releases or a future update to this post.
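To confirm that the expert layers actually land on the intended GPUs, you can watch per-card memory usage in a second terminal while the model loads (a generic check, not ktransformers-specific):
watch -n 1 nvidia-smi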