备份vllm+qwen2摆设！ - Powered by Discuz! Archiver

盛世宏图 发表于 2024-8-31 11:03:50

vllm+qwen2摆设！

准备好qwen2模子：去huggingface镜像、魔搭都可下载：
HF-Mirror、魔搭社区

创建conda环境：
conda create -n name python==3.10 （python环境肯定要3.10 后面有用！）

激活环境：
conda activate name

更换镜像源：
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

安装所需依靠：
pip install modelscope==1.11.0
pip install openai==1.17.1
<strong>pip/pip3 install torch torchvision torchaudio</strong>
pip install tqdm==4.64.1
pip install transformers==4.39.3 安装flash-attn依靠包的时间有坑！
需要先安装nijia这个包：
pip install ninja
检查ninja是否安装乐成：
echo $?
https://i-blog.csdnimg.cn/direct/930674c9ebe54cce8ddab629be1c1509.png
返回0代表安装乐成！

此时再次安装flash-attn：
MAX_JOBS=8 pip install flash-attn --no-build-isolation

还是报错，加上代理再次安装！
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.2/flash_attn-2.5.2+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
如果超时可以设置参数 --timeout=250（具体多少根据实际情况定）

参考：安装flash-attention失败的终极解决方案_building wheels for collected packages: flash-attn-CSDN博客

安装乐成！
https://i-blog.csdnimg.cn/direct/375cf237d54a4e5e9b1af656c273183e.png

pip install vllm

启动openai风格接口：
python -m vllm.entrypoints.openai.api_server --model /dfs/data/autodl-tmp/qwen/Qwen2-7B-Instruct --served-model-name Qwen2-7B-Instruct --max-model-len=2048

--dtype=half （我当前显卡为esla V100-PCIE-32GB GPU具有计算能力7.0，不够8.0，所以需要设置半精度，使用float16（half precision）而非Bfloat16进行计算，这样可以低落算力要求）

若想启动多Gpu再设置以下两个参数：
CUDA_VISIBLE_DEVICES=0,1,2,3

并行计算参数：
--tensor-parallel-size=2（张量并行参数设置）
--pipeline-parallel-size=4（管道并行参数设置）

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server --model /root/autodl-tmp/qwen/Qwen2-7B-Instruct --served-model-name Qwen2-7B-Instruct --max-model-len=2048

乐成启动服务！

https://i-blog.csdnimg.cn/direct/6a968d809b514eaea9abaf778d3e0e2e.png

免责声明：如果侵犯了您的权益，请联系站长，我们会及时删除侵权内容，谢谢合作！更多信息从访问主页：qidao123.com:ToB企服之家，中国第一个企服评测及商务社交产业平台。

页: [1]

IT评测·应用市场-qidao123.com技术社区's Archiver

vllm+qwen2摆设！