1. Introduction
Building a Docker image from the vLLM source code has several notable benefits. First, building from source ensures we get the latest features and avoids problems caused by version mismatches. Second, customizing the build lets us optimize the image for specific needs, such as removing unnecessary dependencies or adding custom configuration, improving the image's performance and security. In addition, containerization makes deployment more flexible and simplifies environment isolation and management: developers can quickly reproduce an identical runtime environment on any platform that supports Docker, keeping development and production consistent. Finally, Docker's versioning and reproducibility make it easy to track images and roll back to earlier versions, which is especially valuable when debugging and testing.
2. Terminology
2.1. Docker
Docker is an open-source containerization platform that lets developers package an application and its dependencies into lightweight, portable containers. These containers run in any environment that supports Docker, ensuring the application behaves consistently across environments. Docker streamlines development, testing, and deployment, making application delivery more efficient and flexible, and it also supports microservice architectures. Through isolation and resource management, Docker makes applications easier to scale and maintain.
2.2. hub.docker.com
Docker Hub is Docker's official container image registry and sharing platform, where users can store, share, and manage Docker images. It provides a simple interface through which developers can upload their own images or pull public images for use in their projects.
2.3. vLLM
vLLM is an open-source inference acceleration framework for large language models. By managing the cached attention tensors efficiently with PagedAttention, it achieves 14-24x higher throughput than HuggingFace Transformers.
3. Prerequisites
3.1. Base Environment and Prerequisites
1) Operating system: CentOS 7
2) GPU: Tesla V100-SXM2-32GB, CUDA Version: 12.2
3.2. Installing Docker
1. Update the system to make sure it is up to date:
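The update command itself is not reproduced in the original (only its output is); on CentOS a typical invocation would be the following, an assumption rather than the author's exact command:

```bash
# Refresh the package metadata and apply any pending updates.
sudo yum update -y
```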
Execution result:
Last metadata expiration check: 2:32:22 ago on Sun 06 Oct 2024 09:48:43 AM CST.
Dependencies resolved.
Nothing to do.
Complete!
2. Install the required dependencies:
- sudo yum install -y yum-utils device-mapper-persistent-data lvm2
Execution result:
Installed:
device-mapper-event-8:1.02.177-10.el8.x86_64 device-mapper-event-libs-8:1.02.177-10.el8.x86_64 device-mapper-persistent-data-0.9.0-4.el8.x86_64 libaio-0.3.112-1.el8.x86_64 lvm2-8:2.03.12-10.el8.x86_64
lvm2-libs-8:2.03.12-10.el8.x86_64
Complete!
3. Set up the Docker repository:
- sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
Execution result:
Adding repo from: https://download.docker.com/linux/centos/docker-ce.repo
4. Install Docker:
- sudo yum install -y docker-ce docker-ce-cli containerd.io
5. Start the Docker service and enable it to start at boot:
- sudo systemctl start docker
- sudo systemctl enable docker
6. Verify the Docker installation:
- sudo docker run hello-world
If you see the "Hello from Docker!" message, Docker has been installed successfully.
7. (Optional) Add your user to the docker group:
If you want to run Docker commands as a non-root user, add your user to the docker group:
- sudo usermod -aG docker $USER
After adding the user, log out and log back in, or run `newgrp docker` for the change to take effect.
3.3. Download the vLLM Source Code
Release tags for vllm-project/vllm ("A high-throughput and memory-efficient inference and serving engine for LLMs") are listed at https://github.com/vllm-project/vllm/tags
- (base) [root@gpu tmp]# git clone --branch v0.6.2 --single-branch https://github.com/vllm-project/vllm.git
- Cloning into 'vllm'...
- remote: Enumerating objects: 28985, done.
- remote: Total 28985 (delta 0), reused 0 (delta 0), pack-reused 28985 (from 1)
- Receiving objects: 100% (28985/28985), 17.32 MiB | 6.84 MiB/s, done.
- Resolving deltas: 100% (21960/21960), done.
- Note: switching to '7193774b1ff8603ad5bf4598e5efba0d9a39b436'.
- You are in 'detached HEAD' state. You can look around, make experimental
- changes and commit them, and you can discard any commits you make in this
- state without impacting any branches by switching back to a branch.
- If you want to create a new branch to retain commits you create, you may
- do so (now or later) by using -c with the switch command. Example:
- git switch -c <new-branch-name>
- Or undo this operation with:
- git switch -
- Turn off this advice by setting config variable advice.detachedHead to false
4. Technical Implementation
4.1. Requirements
Background: with the latest public vLLM image, the Qwen2-VL-7B-Instruct model fails to run. The error is shown below:
- (base) [root@gpu ~]# docker run --runtime nvidia --gpus all \
- > -p 9000:9000 \
- > --ipc=host \
- > -v /data/model/qwen2-vl-7b-instruct:/qwen2-vl-7b-instruct \
- > -it --rm \
- > vllm/vllm-openai:latest \
- > --model /qwen2-vl-7b-instruct --dtype float16 --max-parallel-loading-workers 1 --max-model-len 8192 --enforce-eager --host 0.0.0.0 --port 9000
- INFO 10-09 22:34:49 api_server.py:526] vLLM API server version 0.6.1.dev238+ge2c6e0a82
- INFO 10-09 22:34:49 api_server.py:527] args: Namespace(host='0.0.0.0', port=9000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/qwen2-vl-7b-instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='float16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8192, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=1, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
- INFO 10-09 22:34:49 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/11e808d4-b43b-40e9-9201-01436c4f6469 for IPC Path.
- INFO 10-09 22:34:49 api_server.py:177] Started engine process with PID 22
- Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
- Traceback (most recent call last):
- File "<frozen runpy>", line 198, in _run_module_as_main
- File "<frozen runpy>", line 88, in _run_code
- File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 571, in <module>
- uvloop.run(run_server(args))
- File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
- return __asyncio.run(
- ^^^^^^^^^^^^^^
- File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
- return runner.run(main)
- ^^^^^^^^^^^^^^^^
- File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
- return self._loop.run_until_complete(task)
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
- File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
- return await main
- ^^^^^^^^^^
- File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 538, in run_server
- async with build_async_engine_client(args) as engine_client:
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
- return await anext(self.gen)
- ^^^^^^^^^^^^^^^^^^^^^
- File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 105, in build_async_engine_client
- async with build_async_engine_client_from_engine_args(
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
- return await anext(self.gen)
- ^^^^^^^^^^^^^^^^^^^^^
- File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 182, in build_async_engine_client_from_engine_args
- engine_config = engine_args.create_engine_config()
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 874, in create_engine_config
- model_config = self.create_model_config()
- ^^^^^^^^^^^^^^^^^^^^^^^^^^
- File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 811, in create_model_config
- return ModelConfig(
- ^^^^^^^^^^^^
- File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 207, in __init__
- self.max_model_len = _get_and_verify_max_len(
- ^^^^^^^^^^^^^^^^^^^^^^^^
- File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 1746, in _get_and_verify_max_len
- assert "factor" in rope_scaling
- ^^^^^^^^^^^^^^^^^^^^^^^^
- AssertionError
- Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
- Process SpawnProcess-1:
- Traceback (most recent call last):
- File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
- self.run()
- File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
- self._target(*self._args, **self._kwargs)
- File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
- engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 134, in from_engine_args
- engine_config = engine_args.create_engine_config()
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 874, in create_engine_config
- model_config = self.create_model_config()
- ^^^^^^^^^^^^^^^^^^^^^^^^^^
- File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 811, in create_model_config
- return ModelConfig(
- ^^^^^^^^^^^^
- File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 207, in __init__
- self.max_model_len = _get_and_verify_max_len(
- ^^^^^^^^^^^^^^^^^^^^^^^^
- File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 1746, in _get_and_verify_max_len
- assert "factor" in rope_scaling
- ^^^^^^^^^^^^^^^^^^^^^^^^
- AssertionError
Goal: build a custom vLLM image so that Qwen2-VL-7B-Instruct can be tried out quickly in a Docker container.
4.2. Adding Custom Requirements
4.2.1. Change the transformers Version
By default, the vLLM source pins transformers >= 4.45.0. To run Qwen2-VL-7B-Instruct later on, transformers needs to be pinned to a specific version, as shown below.
Edit the requirements-common.txt file
and remove the transformers dependency:
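The original shows this edit as a screenshot; as a sketch, the line to delete in requirements-common.txt looks roughly like the following (the exact version constraint and comment in your checkout may differ):

```diff
-transformers >= 4.45.0  # Required for Llama 3.2.
```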
Edit the Dockerfile so that transformers is installed from source:
- RUN --mount=type=cache,target=/root/.cache/pip python3 -m pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830
4.2.2. Add the qwen-vl-utils Dependency
Edit requirements-common.txt and add the qwen-vl-utils dependency:
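As with the previous edit, the original shows a screenshot; adding the package on its own line is enough (no version pin is given here, which is an assumption):

```diff
+qwen-vl-utils
```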
4.2.3. Problem Fixes
Problem 1: ModuleNotFoundError: No module named 'distutils'
Details:
- 832.4 from pip._internal.cli.main import main as _main # isort:skip # noqa
- 832.4 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- 832.4 File "/usr/lib/python3/dist-packages/pip/_internal/cli/main.py", line 10, in <module>
- 832.4 from pip._internal.cli.autocompletion import autocomplete
- 832.4 File "/usr/lib/python3/dist-packages/pip/_internal/cli/autocompletion.py", line 9, in <module>
- 832.4 from pip._internal.cli.main_parser import create_main_parser
- 832.4 File "/usr/lib/python3/dist-packages/pip/_internal/cli/main_parser.py", line 7, in <module>
- 832.4 from pip._internal.cli import cmdoptions
- 832.4 File "/usr/lib/python3/dist-packages/pip/_internal/cli/cmdoptions.py", line 19, in <module>
- 832.4 from distutils.util import strtobool
- 832.4 ModuleNotFoundError: No module named 'distutils'
- ------
- Dockerfile:148
- --------------------
- 147 | # Install Python and other dependencies
- 148 | >>> RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
- 149 | >>> && echo 'tzdata tzdata/Zones/America select Los_Angeles' | debconf-set-selections \
- 150 | >>> && apt-get update -y \
- 151 | >>> && apt-get install -y ccache software-properties-common git curl sudo vim python3-pip \
- 152 | >>> && apt-get install -y ffmpeg libsm6 libxext6 libgl1 \
- 153 | >>> && add-apt-repository ppa:deadsnakes/ppa \
- 154 | >>> && apt-get update -y \
- 155 | >>> && apt-get install -y python${PYTHON_VERSION} python${PYTHON_VERSION}-dev python${PYTHON_VERSION}-venv libibverbs-dev \
- 156 | >>> && update-alternatives --install /usr/bin/python3 python3 /usr/bin/python${PYTHON_VERSION} 1 \
- 157 | >>> && update-alternatives --set python3 /usr/bin/python${PYTHON_VERSION} \
- 158 | >>> && ln -sf /usr/bin/python${PYTHON_VERSION}-config /usr/bin/python3-config \
- 159 | >>> && curl -sS https://bootstrap.pypa.io/get-pip.py | python${PYTHON_VERSION} \
- 160 | >>> && python3 --version && python3 -m pip --version
- 161 |
- --------------------
- ERROR: failed to solve: process "/bin/sh -c echo 'tzdata tzdata/Areas select America' | debconf-set-selections && echo 'tzdata tzdata/Zones/America select Los_Angeles' | debconf-set-selections && apt-get update -y && apt-get install -y ccache software-properties-common git curl sudo vim python3-pip && apt-get install -y ffmpeg libsm6 libxext6 libgl1 && add-apt-repository ppa:deadsnakes/ppa && apt-get update -y && apt-get install -y python${PYTHON_VERSION} python${PYTHON_VERSION}-dev python${PYTHON_VERSION}-venv libibverbs-dev && update-alternatives --install /usr/bin/python3 python3 /usr/bin/python${PYTHON_VERSION} 1 && update-alternatives --set python3 /usr/bin/python${PYTHON_VERSION} && ln -sf /usr/bin/python${PYTHON_VERSION}-config /usr/bin/python3-config && curl -sS https://bootstrap.pypa.io/get-pip.py | python${PYTHON_VERSION} && python3 --version && python3 -m pip --version" did not complete successfully: exit code: 1
Fix: edit the Dockerfile and add the python3-distutils dependency, as sketched below.
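The original shows this change as a screenshot. A minimal sketch of the fix, assuming `python3-distutils` is simply appended to the package list of the RUN instruction that failed (Dockerfile:148 in the log above); everything else in that instruction stays unchanged:

```dockerfile
# Excerpt of the "Install Python and other dependencies" RUN instruction.
# Appending python3-distutils lets pip import distutils again (assumed fix,
# mirroring the apt-get line quoted in the build error above).
    && apt-get install -y ccache software-properties-common git curl sudo vim \
       python3-pip python3-distutils \
```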
4.3. Building the Image
- DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-qwen2-vl --build-arg max_jobs=8 --build-arg nvcc_threads=2
Parameter explanation:
1. `DOCKER_BUILDKIT=1`: enables Docker BuildKit for this build. BuildKit offers faster builds, better caching, and additional features.
2. `docker build`: the Docker command that builds an image.
3. `.`: the build context. Docker looks in the current directory for the Dockerfile and any other files and directories the build needs.
4. `--target vllm-openai`: the target stage to build. In a multi-stage build there may be several stages; with this flag Docker builds only the stage named `vllm-openai`.
5. `--tag vllm/vllm-qwen2-vl`: the name of the resulting image, in `repository:tag` form. Here `vllm` is the namespace (repository) and `vllm-qwen2-vl` is the image name; since no tag is given, `latest` is used. This name can be used when pushing to Docker Hub or another registry.
6. `--build-arg max_jobs=8`: sets the build-time variable `max_jobs` to `8`, which the Dockerfile reads via `ARG max_jobs`. It is typically used to tune the build, e.g. the number of parallel compile jobs.
7. `--build-arg nvcc_threads=2`: like the previous flag, sets the build-time variable `nvcc_threads` to `2`, which controls the number of threads used for NVIDIA CUDA (nvcc) compilation.
A lengthy build process follows.
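Once the build finishes, the custom image can be substituted for `vllm/vllm-openai:latest` in the command from section 4.1. A sketch reusing the same model path, flags, and port (adjust them to your environment):

```bash
docker run --runtime nvidia --gpus all \
  -p 9000:9000 \
  --ipc=host \
  -v /data/model/qwen2-vl-7b-instruct:/qwen2-vl-7b-instruct \
  -it --rm \
  vllm/vllm-qwen2-vl \
  --model /qwen2-vl-7b-instruct --dtype float16 --max-parallel-loading-workers 1 \
  --max-model-len 8192 --enforce-eager --host 0.0.0.0 --port 9000
```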
5. Additional Notes
5.1. Cleaning Up Docker Temporary Files
In Docker, a failed build can leave behind temporary files and intermediate layers that take up disk space.
1. Prune unused build cache:
docker builder prune
Use the `-f` option to skip the confirmation prompt:
docker builder prune -f
2. Perform a more thorough cleanup, including unused containers, networks, images, and build cache:
docker system prune
Use the `-f` option to skip the confirmation prompt:
docker system prune -f
To remove all unused images (including those not referenced by any container), add the `--all` (`-a`) option:
docker system prune -a -f
3. Check disk usage, with detailed information on images, containers, volumes, and build cache:
docker system df