INFO 10-21 16:29:04 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-21 16:29:04 selector.py:115] Using XFormers backend.
/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
INFO 10-21 16:30:51 model_runner.py:1071] Loading model weights took 13.0675 GB
INFO 10-21 16:30:57 gpu_executor.py:122] # GPU blocks: 9932, # CPU blocks: 11702
INFO 10-21 16:30:57 gpu_executor.py:126] Maximum concurrency for 32768 tokens per request: 4.85x
INFO 10-21 16:31:03 model_runner.py:1402] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 10-21 16:31:03 model_runner.py:1406] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 10-21 16:31:28 model_runner.py:1530] Graph capturing finished in 24 secs.
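As the log above suggests, CUDA graph capture can be skipped and the KV-cache budget trimmed when GPU memory is tight. A minimal sketch using the offline LLM API, with a placeholder model path and illustrative values:

from vllm import LLM

# Eager mode skips graph capture and its extra 1~3 GiB per GPU;
# a lower gpu_memory_utilization and max_num_seqs further reduce memory use.
llm = LLM(
    model="/path/to/model",        # placeholder path
    enforce_eager=True,            # disable CUDA graphs, as the log recommends
    gpu_memory_utilization=0.85,   # default is 0.9
    max_num_seqs=128,              # cap concurrent sequences
)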
INFO 10-21 16:34:36 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-21 16:34:36 selector.py:115] Using XFormers backend.
/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
INFO 10-21 16:35:02 model_runner.py:1071] Loading model weights took 12.1953 GB
INFO 10-21 16:35:08 gpu_executor.py:122] # GPU blocks: 10952, # CPU blocks: 2340
INFO 10-21 16:35:08 gpu_executor.py:126] Maximum concurrency for 32768 tokens per request: 5.35x
INFO 10-21 16:35:09 model_runner.py:1402] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 10-21 16:35:09 model_runner.py:1406] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 10-21 16:35:36 model_runner.py:1530] Graph capturing finished in 27 secs.
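The "Maximum concurrency" figures reported above follow from the KV-cache block counts: assuming vLLM's default block size of 16 tokens per block, the GPU cache holds (number of GPU blocks) x 16 tokens, and dividing by the 32768-token max model length gives the reported ratio:

# Assuming the default KV-cache block size of 16 tokens per block
block_size = 16
max_model_len = 32768
print(9932 * block_size / max_model_len)    # ~4.85 (first run)
print(10952 * block_size / max_model_len)   # ~5.35 (second run)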
5.1. Issue: ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla V100S-PCIE-32GB GPU has compute capability 7.0. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.
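Solution: the Tesla V100 (compute capability 7.0) has no hardware bfloat16 support, so the data type must be forced to float16. On the command line this is the --dtype=half flag mentioned in the error; a minimal sketch with the Python API, using a placeholder model path:

from vllm import LLM

# Force float16 on pre-Ampere GPUs such as the V100;
# this overrides a bfloat16 torch_dtype in the model config.
llm = LLM(
    model="/path/to/model",  # placeholder path
    dtype="half",            # equivalent to "float16"
)

For reference, the relevant LLM constructor parameters (from the vLLM docstring) are listed below.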
model: The name or path of a HuggingFace Transformers model.
tokenizer: The name or path of a HuggingFace Transformers tokenizer.
tokenizer_mode: The tokenizer mode. "auto" will use the fast tokenizer if available, and "slow" will always use the slow tokenizer.
skip_tokenizer_init: If true, skip initialization of tokenizer and detokenizer. Expect valid prompt_token_ids and None for prompt from the input.
trust_remote_code: Trust remote code (e.g., from HuggingFace) when downloading the model and tokenizer.
tensor_parallel_size: The number of GPUs to use for distributed execution with tensor parallelism.
dtype: The data type for the model weights and activations. Currently, we support `float32`, `float16`, and `bfloat16`. If `auto`, we use the `torch_dtype` attribute specified in the model config file. However, if the `torch_dtype` in the config is `float32`, we will use `float16` instead.
quantization: The method used to quantize the model weights. Currently, we support "awq", "gptq", and "fp8" (experimental). If None, we first check the `quantization_config` attribute in the model config file. If that is None, we assume the model weights are not quantized and use `dtype` to determine the data type of the weights.
revision: The specific model version to use. It can be a branch name, a tag name, or a commit id.
tokenizer_revision: The specific tokenizer version to use. It can be a branch name, a tag name, or a commit id.
seed: The seed to initialize the random number generator for sampling.
gpu_memory_utilization: The ratio (between 0 and 1) of GPU memory to reserve for the model weights, activations, and KV cache. Higher values will increase the KV cache size and thus improve the model's throughput. However, if the value is too high, it may cause out-of-memory (OOM) errors.
swap_space: The size (GiB) of CPU memory per GPU to use as swap space. This can be used for temporarily storing the states of the requests when their `best_of` sampling parameters are larger than 1. If all requests will have `best_of=1`, you can safely set this to 0. Otherwise, too small values may cause out-of-memory (OOM) errors.
cpu_offload_gb: The size (GiB) of CPU memory to use for offloading the model weights. This virtually increases the GPU memory space you can use to hold the model weights, at the cost of CPU-GPU data transfer for every forward pass.
enforce_eager: Whether to enforce eager execution. If True, we will disable CUDA graph and always execute the model in eager mode. If False, we will use CUDA graph and eager execution in hybrid.
max_context_len_to_capture: Maximum context len covered by CUDA graphs. When a sequence has context length larger than this, we fall back to eager mode (DEPRECATED. Use `max_seq_len_to_capture` instead).
max_seq_len_to_capture: Maximum sequence len covered by CUDA graphs. When a sequence has context length larger than this, we fall back to eager mode. Additionally for encoder-decoder models, if the sequence length of the encoder input is larger than this, we fall back to the eager mode.
disable_custom_all_reduce: See ParallelConfig.
**kwargs: Arguments for :class:`~vllm.EngineArgs`.
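These parameters map directly onto the LLM constructor (and onto the corresponding --flags of the CLI). A minimal sketch combining several of them, with a placeholder model id and illustrative values:

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder HuggingFace model id
    dtype="half",                      # float16 for Volta/Turing GPUs
    tensor_parallel_size=1,            # number of GPUs for tensor parallelism
    gpu_memory_utilization=0.90,       # fraction of GPU memory for weights, activations, KV cache
    swap_space=4,                      # GiB of CPU swap space per GPU
    max_seq_len_to_capture=8192,       # longer sequences fall back to eager mode
    enforce_eager=False,               # keep CUDA graphs enabled
    trust_remote_code=True,            # required by some model repos
    seed=0,                            # RNG seed for sampling
)

outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)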