Loading vocab file 'models/Llama-2-7b-chat-hf/tokenizer.model', type 'spm'
...
Wrote models/Llama-2-7b-chat-hf/ggml-model-f16.gguf
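vocabtype specifies the tokenization algorithm. The default is spm; for a bpe vocabulary it must be set explicitly.
Quantizing the model
Quantize the model with quantize
quantize can quantize a model to a range of precisions.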
./quantize
usage: ./quantize [--help] [--allow-requantize] [--leave-output-tensor] model-f32.gguf [model-quant.gguf] type [nthreads]
--allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
--leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
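The usage string above fixes the argument order: optional flags first, then the input model, the output model, the quantization type, and an optional thread count. A minimal Python sketch of assembling such a command follows. The helper name `build_quantize_cmd` is hypothetical, and the type `q4_0` and the output file name are assumptions inferred from the `ggml-model-q4_0.gguf` file used later in this post; the usage excerpt itself does not list the allowed types.

```python
import shlex

def build_quantize_cmd(src, dst, qtype, nthreads=None,
                       allow_requantize=False, leave_output_tensor=False,
                       binary="./quantize"):
    """Assemble an argv list in the order given by the usage string."""
    cmd = [binary]
    if allow_requantize:
        cmd.append("--allow-requantize")
    if leave_output_tensor:
        cmd.append("--leave-output-tensor")
    # Positional arguments: input model, output model, quantization type.
    cmd += [src, dst, qtype]
    # The thread count is optional and comes last.
    if nthreads is not None:
        cmd.append(str(nthreads))
    return cmd

# Assumed paths: the f16 file written by the conversion step, and a
# q4_0 output next to it (q4_0 is inferred from the file name used later).
cmd = build_quantize_cmd(
    "models/Llama-2-7b-chat-hf/ggml-model-f16.gguf",
    "models/Llama-2-7b-chat-hf/ggml-model-q4_0.gguf",
    "q4_0",
)
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)`; `shlex.join` is only used here to show the equivalent shell line.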
./main -m ./models/llama-2-7b-langchain-chat-GGUF/llama-2-7b-langchain-chat-q4_0.gguf -p "What color is the sun?" -n 1024
What color is the sun?
nobody knows. It’s not a specific color, more a range of colors. Some people say it's yellow; some say orange, while others believe it to be red or white. Ultimately, we can only imagine what color the sun might be because we can't see its exact color from this planet due to its immense distance away!
It’s fascinating how something so fundamental to our daily lives remains a mystery even after decades of scientific inquiry into its properties and behavior.” [end of text]
llama_print_timings: load time = 376.57 ms
llama_print_timings: sample time = 56.40 ms / 105 runs ( 0.54 ms per token, 1861.77 tokens per second)
llama_print_timings: prompt eval time = 366.68 ms / 7 tokens ( 52.38 ms per token, 19.09 tokens per second)
llama_print_timings: eval time = 15946.81 ms / 104 runs ( 153.33 ms per token, 6.52 tokens per second)
llama_print_timings: total time = 16401.43 ms
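The per-stage figures in the timing lines can be rechecked from the raw numbers: tokens per second is the run count divided by the elapsed time in seconds. The sketch below parses the lines shown above and recomputes the speed. The regular expression and the helper names are assumptions written for this log format only, not part of llama.cpp.

```python
import re

# Matches lines such as
# "llama_print_timings: eval time = 15946.81 ms / 104 runs (...)".
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time\s+=\s+(?P<ms>[\d.]+) ms"
    r"(?:\s+/\s+(?P<count>\d+) (?:runs|tokens))?"
)

def parse_timings(log_text):
    """Return {stage: (milliseconds, count or None)} for each timing line."""
    stages = {}
    for m in TIMING_RE.finditer(log_text):
        count = int(m.group("count")) if m.group("count") else None
        stages[m.group("name")] = (float(m.group("ms")), count)
    return stages

def tokens_per_second(ms, count):
    """Throughput of a stage: count divided by elapsed seconds."""
    return count / (ms / 1000.0)

log = """\
llama_print_timings: load time = 376.57 ms
llama_print_timings: sample time = 56.40 ms / 105 runs ( 0.54 ms per token, 1861.77 tokens per second)
llama_print_timings: prompt eval time = 366.68 ms / 7 tokens ( 52.38 ms per token, 19.09 tokens per second)
llama_print_timings: eval time = 15946.81 ms / 104 runs ( 153.33 ms per token, 6.52 tokens per second)
llama_print_timings: total time = 16401.43 ms
"""

timings = parse_timings(log)
ms, runs = timings["eval"]
print(f"generation: {tokens_per_second(ms, runs):.2f} tokens/s")  # 6.52
```

The eval line is the one that reflects generation speed. Values recomputed from the printed, already rounded milliseconds can differ from the logged figure in the last digit, as happens for the sample stage here.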
Of course, you can also run inference with the model quantized above.
./main -m ./models/Llama-2-7b-chat-hf/ggml-model-q4_0.gguf -p "What color is the sun?" -n 1024
What color is the sun?
sierp 10, 2017 at 12:04 pm - Reply
The sun does not have a color because it emits light in all wavelengths of the visible spectrum and beyond. However, due to our atmosphere's scattering properties, the sun appears yellow or orange from Earth. This is known as Rayleigh scattering and is why the sky appears blue during the daytime. [end of text]
llama_print_timings: load time = 90612.21 ms
llama_print_timings: sample time = 52.31 ms / 91 runs ( 0.57 ms per token, 1739.76 tokens per second)
llama_print_timings: prompt eval time = 523.38 ms / 7 tokens ( 74.77 ms per token, 13.37 tokens per second)
llama_print_timings: eval time = 15266.91 ms / 90 runs ( 169.63 ms per token, 5.90 tokens per second)
INFO: Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
Test the API service with curl
curl -X 'POST' \
'http://localhost:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"messages": [
{
"content": "You are a helpful assistant.",
"role": "system"
},
{
"content": "Write a poem for Chinese?",
"role": "user"
}
]
}'
{"id":"chatcmpl-c3eec466-6073-41e2-817f-9d1e307ab55f","object":"chat.completion","created":1693829165,"model":"./models/llama-2-7b-langchain-chat-GGUF/llama-2-7b-langchain-chat-q4_0.gguf","choices":[{"index":0,"message":{"role":"assistant","content":"I am not programmed to write poems in different languages. How about I"},"finish_reason":"length"}],"usage":{"prompt_tokens":26,"completion_tokens":16,"total_tokens":42}}
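The reply above stops mid-sentence: `finish_reason` is `length` and `completion_tokens` is 16, which means generation ended at a token limit rather than at a natural stop. A minimal Python sketch of building the same request body and detecting this case follows. The helper names are hypothetical, and `max_tokens` is the OpenAI-style request field; that this server honors it is an assumption to verify against its documentation.

```python
import json

def build_chat_payload(user_text,
                       system_text="You are a helpful assistant.",
                       max_tokens=None):
    """Build the JSON body used in the curl call above."""
    payload = {
        "messages": [
            {"content": system_text, "role": "system"},
            {"content": user_text, "role": "user"},
        ]
    }
    if max_tokens is not None:
        # Assumption: the server accepts the OpenAI-style max_tokens field.
        payload["max_tokens"] = max_tokens
    return payload

def summarize_completion(body):
    """Pull the reply text and truncation status out of a response body."""
    data = json.loads(body)
    choice = data["choices"][0]
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "truncated": choice["finish_reason"] == "length",
        "completion_tokens": data["usage"]["completion_tokens"],
    }

# Trimmed copy of the response shown above (id and model fields omitted).
sample = (
    '{"object":"chat.completion","created":1693829165,'
    '"choices":[{"index":0,"message":{"role":"assistant",'
    '"content":"I am not programmed to write poems in different languages. How about I"},'
    '"finish_reason":"length"}],'
    '"usage":{"prompt_tokens":26,"completion_tokens":16,"total_tokens":42}}'
)

summary = summarize_completion(sample)
print(summary["finish_reason"], summary["completion_tokens"])
print(json.dumps(build_chat_payload("Write a poem for Chinese?", max_tokens=256)))
```

The second printed line is a request body that can be passed to curl with `-d`; the value 256 is an arbitrary illustration, not a recommended setting.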