DeepSeek Model Quantization

美食家大橙子 · 2025-2-23 03:16:44

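Technical Background

A large language model (LLM) can be quantized to reduce its memory/VRAM footprint and its communication overhead, which in turn speeds up inference. The most common approach converts Float16 weights into low-precision integers such as Int4. At the extreme, parameters can be reduced to binary values (only 0 and 1), but quantization that aggressive can noticeably degrade the model's output quality. A common rule of thumb is Q8 for models below 70B parameters and Q4 for models of 70B and above. The underlying theory, including symmetric and asymmetric quantization, is not covered here; this post focuses on the engineering side, using llama.cpp to perform the quantization.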
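
For readers who want a concrete picture of the two schemes named above, here is a minimal sketch in plain Python. It is an illustration only and not how llama.cpp stores weights (llama.cpp quantizes in small blocks, each with its own scale); the function names are hypothetical, and the asymmetric variant is simplified in that the zero point is not clamped to the integer range.

```python
def quantize_symmetric(values, bits=4):
    """Map floats to signed integers in [-qmax, qmax] using a single scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize_symmetric(q, scale):
    """Recover approximate floats from symmetric integer codes."""
    return [x * scale for x in q]


def quantize_asymmetric(values, bits=4):
    """Map floats to unsigned integers in [0, qmax] using a scale and a zero point."""
    qmax = 2 ** bits - 1                            # 15 for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)                 # integer code that represents 0.0
    q = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize_asymmetric(q, scale, zero_point):
    """Recover approximate floats from asymmetric integer codes."""
    return [(x - zero_point) * scale for x in q]


# Made-up example weights, not taken from any real model.
weights = [0.12, -0.53, 0.98, -1.40, 0.07, 0.66]
q_sym, s_sym = quantize_symmetric(weights)
q_asym, s_asym, zp = quantize_asymmetric(weights)
print("symmetric :", q_sym, "scale =", s_sym)
print("asymmetric:", q_asym, "scale =", s_asym, "zero point =", zp)
```

The symmetric form needs only a scale, while the asymmetric form spends an extra zero point to cover ranges that are not centered on zero.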

Installing llama.cpp

Here we build from source on Ubuntu. First, clone the repository from GitHub:

$ git clone https://github.com/ggerganov/llama.cpp.git
Cloning into 'llama.cpp'...
remote: Enumerating objects: 43657, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 43657 (delta 3), reused 5 (delta 1), pack-reused 43642 (from 3)
Receiving objects: 100% (43657/43657), 88.26 MiB | 8.30 MiB/s, done.
Resolving deltas: 100% (31409/31409), done.

It is best to create a virtual environment to avoid dependency conflicts; Python 3.10 is recommended:

# Create the virtual environment
$ conda create -n llama python=3.10
# Activate the virtual environment
$ conda activate llama

Enter the cloned llama.cpp directory and install all the dependencies:

$ cd llama.cpp/
$ python3 -m pip install -e .

Create a build directory and run the build commands:

$ mkdir build
$ cd build/
$ cmake ..
-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.25.1")
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- Including CPU backend
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native
-- Configuring done
-- Generating done
-- Build files have been written to: /datb/DeepSeek/llama/llama.cpp/build
$ cmake --build . --config Release
Scanning dependencies of target ggml-base
[ 0%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml.c.o
[ 1%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml-alloc.c.o
[100%] Linking CXX executable ../../bin/llama-vdot
[100%] Built target llama-vdot

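At this point the CPU version of llama.cpp is built and ready to use. If you need the GPU-accelerated version, see the next subsection; if that sounds like too much trouble, feel free to skip it.

CUDA Acceleration for llama.cpp

Building the GPU version of llama.cpp requires a few dependencies first: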

$ sudo apt install curl libcurl4-openssl-dev

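The main difference from the CPU build is the cmake configure command (if you have already built the CPU version, it is best to clear the files in the build directory first):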

$ cmake .. -DCMAKE_CUDA_COMPILER=/usr/bin/nvcc -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=ON -DCMAKE_CUDA_STANDARD=17

The extra flag -DCMAKE_CUDA_STANDARD=17 works around a problem reported in an issue on the llama.cpp repository. Without it, you may see an error like this:

CMake Error in ggml/src/ggml-cuda/CMakeLists.txt:
Target "ggml-cuda" requires the language dialect "CUDA17" (with compiler
extensions), but CMake does not know the compile flags to use to enable it.

If all goes well, run the following command; if it compiles without errors, the build has succeeded:

$ cmake --build . --config Release

If, like me, you hit errors instead, they have to be dealt with one at a time.

/datb/DeepSeek/llama/llama.cpp/ggml/src/ggml-cuda/vendors/cuda.h:6:10: fatal error: cuda_bf16.h: No such file or directory
#include <cuda_bf16.h>
^~~~~~~~~~~~~
compilation terminated.

This error says that a header file cannot be found. Running find / -name cuda_bf16.h in my environment showed that the header does in fact exist:

/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/nvidia/cuda_runtime/include/cuda_bf16.h
/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/triton/backends/nvidia/include/cuda_bf16.h

The fix is to add that path to CPATH:

$ export CPATH=$CPATH:/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/nvidia/cuda_runtime/include/

If you see this error:

/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/nvidia/cuda_runtime/include/cuda_fp16.h:4100:10: fatal error: nv/target: No such file or directory
#include <nv/target>
^~~~~~~~~~~
compilation terminated.

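it means the nv/target header cannot be found. If a copy exists somewhere on your machine, you can add the include directory that contains it to CPATH as well: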

$ export CPATH=/home/dechin/anaconda3/pkgs/cupy-core-13.3.0-py310h5da974a_2/lib/python3.10/site-packages/cupy/_core/include/cupy/_cccl/libcudacxx/:$CPATH

If you see errors like these:

/datb/DeepSeek/llama/llama.cpp/ggml/src/ggml-cuda/common.cuh(138): error: identifier "cublasGetStatusString" is undefined
/datb/DeepSeek/llama/llama.cpp/ggml/src/ggml-cuda/common.cuh(417): error: A __device__ variable cannot be marked constexpr
/datb/DeepSeek/llama/llama.cpp/ggml/src/ggml-cuda/common.cuh(745): error: identifier "CUBLAS_TF32_TENSOR_OP_MATH" is undefined
3 errors detected in the compilation of "/tmp/tmpxft_000a126f_00000000-9_acc.compute_75.cpp1.ii".
make[2]: *** [ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/build.make:82: ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/acc.cu.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:1964: ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/all] Error 2
make: *** [Makefile:160: all] Error 2

then the cause is most likely the cuda-toolkit version; try installing CUDA 12:

$ conda install nvidia::cuda-toolkit

If conda fails during installation with something like this:

Collecting package metadata (current_repodata.json): failed
# >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<
Traceback (most recent call last):
File "/home/dechin/anaconda3/lib/python3.8/site-packages/conda/gateways/repodata/__init__.py", line 132, in conda_http_errors
yield
File "/home/dechin/anaconda3/lib/python3.8/site-packages/conda/gateways/repodata/__init__.py", line 101, in repodata
response.raise_for_status()
File "/home/dechin/anaconda3/lib/python3.8/site-packages/requests/models.py", line 1024, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://conda.anaconda.org/defaults/linux-64/current_repodata.json

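then the conda channel configuration is probably at fault. Remove the old channels, then either use the default channels or configure a mirror that is reachable from mainland China: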

$ conda config --remove-key channels
$ conda config --remove-key default_channels
$ conda config --append channels conda-forge

After reinstalling, the path to nvcc changes, so remember to update the -DCMAKE_CUDA_COMPILER argument when configuring the build:

$ cmake .. -DCMAKE_CUDA_COMPILER=/home/dechin/anaconda3/envs/llama/bin/nvcc -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=ON -DCMAKE_CUDA_STANDARD=17

If the following error appears:

-- Unable to find cuda_runtime.h in "/home/dechin/anaconda3/envs/llama/include" for CUDAToolkit_INCLUDE_DIR.
-- Could NOT find CUDAToolkit (missing: CUDAToolkit_INCLUDE_DIR)
CMake Error at ggml/src/ggml-cuda/CMakeLists.txt:151 (message):
CUDA Toolkit not found
-- Configuring incomplete, errors occurred!
See also "/datb/DeepSeek/llama/llama.cpp/build/CMakeFiles/CMakeOutput.log".
See also "/datb/DeepSeek/llama/llama.cpp/build/CMakeFiles/CMakeError.log".

CMake cannot locate CUDAToolkit_INCLUDE_DIR. Adding the include path to the cmake command fixes it:

$ cmake .. -DCMAKE_CUDA_COMPILER=/home/dechin/anaconda3/envs/llama/bin/nvcc -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=ON -DCMAKE_CUDA_STANDARD=17 -DCUDAToolkit_INCLUDE_DIR=/home/dechin/anaconda3/envs/llama/targets/x86_64-linux/include/ -DCURL_LIBRARY=/usr/lib/x86_64-linux-gnu/

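If errors remain after all of the above, my advice is to use Docker instead, or simply run the quantization with the CPU build and serve the model through Ollama, which is more convenient.

Downloading the Hugging Face Model

Many GGUF files that have already been quantized cannot sensibly be quantized a second time (llama-quantize only allows it with --allow-requantize, and its help text warns that quality can drop severely). It is therefore better to download the safetensors model files in Hugging Face format, convert the HF model to GGUF with a Python script that ships with llama.cpp, and then quantize the result with llama.cpp.

As for the download itself, access to Hugging Face is sometimes restricted, so the first choice here is the ModelScope platform. To download from ModelScope, install the modelscope Python package: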

$ python3 -m pip install modelscope
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: modelscope in /home/dechin/anaconda3/lib/python3.8/site-packages (1.22.3)
Requirement already satisfied: requests>=2.25 in /home/dechin/.local/lib/python3.8/site-packages (from modelscope) (2.25.1)
Requirement already satisfied: urllib3>=1.26 in /home/dechin/.local/lib/python3.8/site-packages (from modelscope) (1.26.5)
Requirement already satisfied: tqdm>=4.64.0 in /home/dechin/anaconda3/lib/python3.8/site-packages (from modelscope) (4.67.1)
Requirement already satisfied: certifi>=2017.4.17 in /home/dechin/.local/lib/python3.8/site-packages (from requests>=2.25->modelscope) (2021.5.30)
Requirement already satisfied: chardet<5,>=3.0.2 in /home/dechin/.local/lib/python3.8/site-packages (from requests>=2.25->modelscope) (4.0.0)
Requirement already satisfied: idna<3,>=2.5 in /home/dechin/.local/lib/python3.8/site-packages (from requests>=2.25->modelscope) (2.10)

Then download the model with modelscope:

$ modelscope download --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B

If an error like the following appears (if there is no error, nothing needs to be done except waiting for the download to finish):

safetensors integrity check failed, expected sha256 signature is xxx

you can try a different download route. First install git-lfs:

$ sudo apt install git-lfs

Then download the model:

$ git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B.git
Cloning into 'DeepSeek-R1-Distill-Qwen-32B'...
remote: Enumerating objects: 52, done.
remote: Counting objects: 100% (52/52), done.
remote: Compressing objects: 100% (37/37), done.
remote: Total 52 (delta 17), reused 42 (delta 13), pack-reused 0
Unpacking objects: 100% (52/52), 2.27 MiB | 2.62 MiB/s, done.
Filtering content: 100% (8/8), 5.02 GiB | 912.00 KiB/s, done.
Encountered 8 file(s) that may not have been copied correctly on Windows:
model-00005-of-000008.safetensors
model-00004-of-000008.safetensors
model-00008-of-000008.safetensors
model-00002-of-000008.safetensors
model-00007-of-000008.safetensors
model-00003-of-000008.safetensors
model-00006-of-000008.safetensors
model-00001-of-000008.safetensors
See: `git lfs help smudge` for more details.

This step takes a long time, so be patient until the download finishes. Afterwards, inspect the directory:

$ cd DeepSeek-R1-Distill-Qwen-32B/
$ ll
total 63999072
drwxrwxr-x 4 dechin dechin 4096 Feb 12 19:22 ./
drwxrwxr-x 3 dechin dechin 4096 Feb 12 17:46 ../
-rw-rw-r-- 1 dechin dechin 664 Feb 12 17:46 config.json
-rw-rw-r-- 1 dechin dechin 73 Feb 12 17:46 configuration.json
drwxrwxr-x 2 dechin dechin 4096 Feb 12 17:46 figures/
-rw-rw-r-- 1 dechin dechin 181 Feb 12 17:46 generation_config.json
drwxrwxr-x 9 dechin dechin 4096 Feb 12 19:22 .git/
-rw-rw-r-- 1 dechin dechin 1519 Feb 12 17:46 .gitattributes
-rw-rw-r-- 1 dechin dechin 1064 Feb 12 17:46 LICENSE
-rw-rw-r-- 1 dechin dechin 8792578462 Feb 12 19:22 model-00001-of-000008.safetensors
-rw-rw-r-- 1 dechin dechin 8776906899 Feb 12 19:03 model-00002-of-000008.safetensors
-rw-rw-r-- 1 dechin dechin 8776906927 Feb 12 19:18 model-00003-of-000008.safetensors
-rw-rw-r-- 1 dechin dechin 8776906927 Feb 12 18:56 model-00004-of-000008.safetensors
-rw-rw-r-- 1 dechin dechin 8776906927 Feb 12 18:38 model-00005-of-000008.safetensors
-rw-rw-r-- 1 dechin dechin 8776906927 Feb 12 19:19 model-00006-of-000008.safetensors
-rw-rw-r-- 1 dechin dechin 8776906927 Feb 12 19:15 model-00007-of-000008.safetensors
-rw-rw-r-- 1 dechin dechin 4073821536 Feb 12 19:02 model-00008-of-000008.safetensors
-rw-rw-r-- 1 dechin dechin 64018 Feb 12 17:46 model.safetensors.index.json
-rw-rw-r-- 1 dechin dechin 18985 Feb 12 17:46 README.md
-rw-rw-r-- 1 dechin dechin 3071 Feb 12 17:46 tokenizer_config.json
-rw-rw-r-- 1 dechin dechin 7031660 Feb 12 17:46 tokenizer.json

The download has succeeded.
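
Since git warned above that some shards "may not have been copied correctly", a quick sanity check can be worthwhile before converting. The sketch below is only a rough check, not an integrity verification. It assumes the usual layout of model.safetensors.index.json (a "weight_map" object mapping tensor names to shard file names) and the standard first line of a git-lfs pointer file; the function name is hypothetical.

```python
import json
from pathlib import Path

# A git-lfs pointer file is a tiny text file that starts with this line prefix.
LFS_POINTER_PREFIX = b"version https://git-lfs"


def suspicious_shards(model_dir):
    """List shards named in the index that are missing or are still git-lfs pointers."""
    model_dir = Path(model_dir)
    index_path = model_dir / "model.safetensors.index.json"
    index = json.loads(index_path.read_text(encoding="utf-8"))

    problems = []
    # Many tensors share one shard, so deduplicate the file names first.
    for name in sorted(set(index["weight_map"].values())):
        shard = model_dir / name
        if not shard.is_file():
            problems.append((name, "missing"))
            continue
        with shard.open("rb") as fh:
            head = fh.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            problems.append((name, "git-lfs pointer, not real data"))
    return problems
```

Calling suspicious_shards("/datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B") should return an empty list when every shard is present and contains real data. It does not check the sha256 signatures that the modelscope error message refers to.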

Converting the HF Model to GGUF

Locate the Python script in the llama/llama.cpp/ directory and check its usage first:

$ python3 convert_hf_to_gguf.py --help
usage: convert_hf_to_gguf.py [-h] [--vocab-only] [--outfile OUTFILE] [--outtype {f32,f16,bf16,q8_0,tq1_0,tq2_0,auto}] [--bigendian] [--use-temp-file] [--no-lazy]
[--model-name MODEL_NAME] [--verbose] [--split-max-tensors SPLIT_MAX_TENSORS] [--split-max-size SPLIT_MAX_SIZE] [--dry-run]
[--no-tensor-first-split] [--metadata METADATA] [--print-supported-models]
[model]
Convert a huggingface model to a GGML compatible file
positional arguments:
model directory containing model file
options:
-h, --help show this help message and exit
--vocab-only extract only the vocab
--outfile OUTFILE path to write to; default: based on input. {ftype} will be replaced by the outtype.
--outtype {f32,f16,bf16,q8_0,tq1_0,tq2_0,auto}
output format - use f32 for float32, f16 for float16, bf16 for bfloat16, q8_0 for Q8_0, tq1_0 or tq2_0 for ternary, and auto for the highest-
fidelity 16-bit float type depending on the first loaded tensor type
--bigendian model is executed on big endian machine
--use-temp-file use the tempfile library while processing (helpful when running out of memory, process killed)
--no-lazy use more RAM by computing all outputs before writing (use in case lazy evaluation is broken)
--model-name MODEL_NAME
name of the model
--verbose increase output verbosity
--split-max-tensors SPLIT_MAX_TENSORS
max tensors in each split
--split-max-size SPLIT_MAX_SIZE
max size per split N(M|G)
--dry-run only print out a split plan and exit, without writing any new files
--no-tensor-first-split
do not add tensors to the first split (disabled by default)
--metadata METADATA Specify the path for an authorship metadata override file
--print-supported-models
Print the supported models

Then build the GGUF file:

$ python3 convert_hf_to_gguf.py /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B --outfile /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B.gguf
INFO:hf-to-gguf:Set model quantization version
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B.gguf: n_tensors = 771, total_size = 65.5G
Writing: 100%|██████████████████████████████████████████████████████████████| 65.5G/65.5G [19:42<00:00, 55.4Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B.gguf

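After the conversion, a GGUF file is written to the specified path; this is the all-in-one model file. Note that the output here is 16-bit rather than fp32: a total size of 65.5G for a roughly 32B-parameter model is about 2 bytes per parameter, consistent with the script's f16 default output type. This file is the input for the quantization step that follows.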
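
Before spending time on quantization, it can help to confirm that the converted file really is a GGUF file. The sketch below reads only the fixed-size header. It assumes the little-endian GGUF layout of version 2 and later as I understand it (4-byte magic "GGUF", uint32 version, uint64 tensor count, uint64 metadata key/value count); the function name is hypothetical.

```python
import struct


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a little-endian GGUF file."""
    with open(path, "rb") as fh:
        header = fh.read(24)                 # 4 (magic) + 4 (version) + 8 + 8
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # "<" selects little-endian with no padding: uint32, uint64, uint64.
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:])
    return version, tensor_count, kv_count
```

For the file written above, read_gguf_header("/datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B.gguf") would be expected to report a tensor count matching the n_tensors = 771 line in the conversion log. A file converted with --bigendian would need ">" instead of "<".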

Quantizing the GGUF Model

The quantization executable is in the build/bin/ directory of the compiled llama.cpp:

$ ./llama-quantize --help
usage: ./llama-quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights] [--output-tensor-type] [--token-embedding-type] [--override-kv] model-f32.gguf [model-quant.gguf] type [nthreads]
--allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
--leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
--pure: Disable k-quant mixtures and quantize all tensors to the same type
--imatrix file_name: use data in file_name as importance matrix for quant optimizations
--include-weights tensor_name: use importance matrix for this/these tensor(s)
--exclude-weights tensor_name: use importance matrix for this/these tensor(s)
--output-tensor-type ggml_type: use this ggml_type for the output.weight tensor
--token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor
--keep-split: will generate quantized model in the same shards as input
--override-kv KEY=TYPE:VALUE
Advanced option to override model metadata by key in the quantized model. May be specified multiple times.
Note: --include-weights and --exclude-weights cannot be used together
Allowed quantization types:
2 or Q4_0 : 4.34G, +0.4685 ppl @ Llama-3-8B
3 or Q4_1 : 4.78G, +0.4511 ppl @ Llama-3-8B
8 or Q5_0 : 5.21G, +0.1316 ppl @ Llama-3-8B
9 or Q5_1 : 5.65G, +0.1062 ppl @ Llama-3-8B
19 or IQ2_XXS : 2.06 bpw quantization
20 or IQ2_XS : 2.31 bpw quantization
28 or IQ2_S : 2.5 bpw quantization
29 or IQ2_M : 2.7 bpw quantization
24 or IQ1_S : 1.56 bpw quantization
31 or IQ1_M : 1.75 bpw quantization
36 or TQ1_0 : 1.69 bpw ternarization
37 or TQ2_0 : 2.06 bpw ternarization
10 or Q2_K : 2.96G, +3.5199 ppl @ Llama-3-8B
21 or Q2_K_S : 2.96G, +3.1836 ppl @ Llama-3-8B
23 or IQ3_XXS : 3.06 bpw quantization
26 or IQ3_S : 3.44 bpw quantization
27 or IQ3_M : 3.66 bpw quantization mix
12 or Q3_K : alias for Q3_K_M
22 or IQ3_XS : 3.3 bpw quantization
11 or Q3_K_S : 3.41G, +1.6321 ppl @ Llama-3-8B
12 or Q3_K_M : 3.74G, +0.6569 ppl @ Llama-3-8B
13 or Q3_K_L : 4.03G, +0.5562 ppl @ Llama-3-8B
25 or IQ4_NL : 4.50 bpw non-linear quantization
30 or IQ4_XS : 4.25 bpw non-linear quantization
15 or Q4_K : alias for Q4_K_M
14 or Q4_K_S : 4.37G, +0.2689 ppl @ Llama-3-8B
15 or Q4_K_M : 4.58G, +0.1754 ppl @ Llama-3-8B
17 or Q5_K : alias for Q5_K_M
16 or Q5_K_S : 5.21G, +0.1049 ppl @ Llama-3-8B
17 or Q5_K_M : 5.33G, +0.0569 ppl @ Llama-3-8B
18 or Q6_K : 6.14G, +0.0217 ppl @ Llama-3-8B
7 or Q8_0 : 7.96G, +0.0026 ppl @ Llama-3-8B
1 or F16 : 14.00G, +0.0020 ppl @ Mistral-7B
32 or BF16 : 14.00G, -0.0050 ppl @ Mistral-7B
0 or F32 : 26.00G @ 7B
COPY : only copy tensors, no quantizing

This lists every quantization type that is supported. For example, we can quantize the 32B model to q4_0:

$ ./llama-quantize /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B.gguf /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B-Q4_0.gguf q4_0

Comparison of the resulting files (the Q8_0 file here was downloaded directly from a model repository, already quantized by someone else):

-rw-rw-r-- 1 dechin dechin 65535969184 Feb 13 09:33 DeepSeek-R1-Distill-Qwen-32B.gguf
-rw-rw-r-- 1 dechin dechin 18640230304 Feb 13 09:51 DeepSeek-R1-Distill-Qwen-32B-Q4_0.gguf
-rw-rw-r-- 1 dechin dechin 34820884384 Feb 9 01:44 DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf

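From the 16-bit file (about 65.5 GB) to Q8_0 (about 34.8 GB) to Q4_0 (about 18.6 GB), the footprint drops markedly. You can choose the quantization level according to the hardware resources you have locally.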
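
The file sizes in the listing can be turned into an approximate bits-per-weight figure. The sketch below uses only the three byte counts shown above. The parameter count is an assumption derived from the unquantized file (about 2 bytes per parameter, ignoring metadata), so the results are estimates.

```python
# Byte counts copied from the directory listing above.
SIZES = {
    "16-bit": 65535969184,   # DeepSeek-R1-Distill-Qwen-32B.gguf
    "Q8_0": 34820884384,     # DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf
    "Q4_0": 18640230304,     # DeepSeek-R1-Distill-Qwen-32B-Q4_0.gguf
}

# Assumption: the unquantized file stores about 2 bytes per parameter.
n_params = SIZES["16-bit"] / 2


def bits_per_weight(file_bytes, params):
    """Average number of bits the file spends per model parameter."""
    return file_bytes * 8 / params


for name, size in SIZES.items():
    bpw = bits_per_weight(size, n_params)
    ratio = size / SIZES["16-bit"]
    print(f"{name:>6}: {size / 1e9:6.1f} GB  {bpw:5.2f} bits/weight  {ratio:.0%} of 16-bit")
```

The estimates land a little above 8.5 and 4.5 bits per weight. That fits a block format that stores one 16-bit scale per 32 weights, plus the fact that a quantized file may keep some tensors at a higher precision.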

Once quantization is done and the model has been imported into Ollama, ollama list shows all the local models:

$ ollama list
NAME ID SIZE MODIFIED
deepseek-r1:32b-q2k 8d2a0c19f6e0 12 GB 5 seconds ago
deepseek-r1:32b-q40 13c7c287f615 18 GB 3 minutes ago
deepseek-r1:32b 91f2de3dd7fd 34 GB 42 hours ago
nomic-embed-text-v1.5:latest 5b3683392ccb 274 MB 43 hours ago
deepseek-r1:14b ea35dfe18182 9.0 GB 7 days ago

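The q2k entry is a Q2_K model that was also quantized locally. Going from Q4_0 to Q2_K saves comparatively little (18 GB down to 12 GB), so many people stop at Q4_0, which balances performance and accuracy.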
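
The import step itself is not shown in this post. As a rough sketch, a local GGUF file is usually imported by writing an Ollama Modelfile with a FROM line and then running ollama create. The helper names below are hypothetical, and the tag and path are simply the ones that appear elsewhere in this post.

```python
from pathlib import Path


def write_modelfile(gguf_path, modelfile_path="Modelfile"):
    """Write a one-line Ollama Modelfile that points at a local GGUF file."""
    modelfile = Path(modelfile_path)
    modelfile.write_text(f"FROM {gguf_path}\n", encoding="utf-8")
    return modelfile


def ollama_create_command(tag, modelfile_path):
    """Build the argument list for `ollama create <tag> -f <Modelfile>`."""
    return ["ollama", "create", tag, "-f", str(modelfile_path)]
```

For example, write_modelfile("/datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B-Q4_0.gguf") followed by subprocess.run(ollama_create_command("deepseek-r1:32b-q40", "Modelfile"), check=True) would register the Q4_0 file under the tag seen in the listing, assuming Ollama is installed and running.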

Other Errors

If running the llama-quantize executable produces this error:

./xxx/llama-quantize: error while loading shared libraries: libllama.so: cannot open shared object file: No such file or directory

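then the dynamic library search path LD_LIBRARY_PATH has not been set to include the directory that contains libllama.so. Alternatively, change into the bin/ directory and run the executable from there.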
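
When llama-quantize is driven from a script, the library path can be set for that one child process instead of the whole shell. The sketch below first searches the build tree for libllama.so, because its exact location depends on how the build was configured. Both helper names are hypothetical.

```python
import os
from pathlib import Path


def find_library_dir(build_dir, name="libllama.so"):
    """Return the first directory under build_dir that contains the library, or None."""
    for candidate in Path(build_dir).rglob(name):
        return candidate.parent
    return None


def env_with_library_dir(lib_dir, base_env=None):
    """Return a copy of the environment with lib_dir prepended to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    current = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = f"{lib_dir}{os.pathsep}{current}" if current else str(lib_dir)
    return env
```

A call such as subprocess.run([".../llama-quantize", "--help"], env=env_with_library_dir(lib_dir), check=True) then runs the tool with the library directory visible, where lib_dir is whatever find_library_dir returned for your build directory.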

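Summary

This post covered the use of llama.cpp, a tool for working with large models. Because Ollama is already being used to run the models, it only covered llama.cpp for converting HF models to GGUF and for quantizing them. Parameter quantization makes it possible to run DeepSeek's distilled models locally, even on hardware bought on a limited budget.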

Reposted from: Dechin's blog
Original post: DeepSeek Model Quantization - DECHIN - 博客园 (cnblogs)