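This article walks through downloading the Llama-2-13b-chat-hf model, running single-GPU inference and multi-GPU inference, and the test results for each.
1. Download the model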
- from huggingface_hub import snapshot_download
- snapshot_download(repo_id='meta-llama/Llama-2-13b-chat-hf',
- repo_type='model',
- local_dir='./Llama-2-13b-chat-hf',
- resume_download=True,
- token="your token")
Keeping only the following files is enough:
- Llama-2-13b-chat-hf/
- ├── LICENSE.txt
- ├── README.md
- ├── Responsible-Use-Guide.pdf
- ├── USE_POLICY.md
- ├── config.json
- ├── generation_config.json
- ├── pytorch_model-00001-of-00003.bin
- ├── pytorch_model-00002-of-00003.bin
- ├── pytorch_model-00003-of-00003.bin
- ├── pytorch_model.bin.index.json
- ├── special_tokens_map.json
- ├── tokenizer.json
- ├── tokenizer.model
- └── tokenizer_config.json
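Before moving on to inference, it can help to confirm that the download directory really contains these files. The sketch below is one way to do that with the standard library only. The helper name `missing_files` is hypothetical, and the assumption that the license and documentation files are not needed for loading the model is ours, so they are left out of the check.

```python
from pathlib import Path

# File names copied from the listing above (license and docs omitted,
# on the assumption that they are not needed to load the model).
REQUIRED_FILES = [
    "config.json",
    "generation_config.json",
    "pytorch_model-00001-of-00003.bin",
    "pytorch_model-00002-of-00003.bin",
    "pytorch_model-00003-of-00003.bin",
    "pytorch_model.bin.index.json",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer.model",
    "tokenizer_config.json",
]


def missing_files(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("./Llama-2-13b-chat-hf")
    print("missing:", missing if missing else "none")
```

An empty result means every weight shard, the shard index, and the tokenizer files are in place.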
2. Single-GPU inference
- tee torch_infer.py <<-'EOF'
- import os
- import gc
- from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
- import torch
- import time
- import numpy as np
- torch.cuda.empty_cache()
- gc.collect()
- os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
- device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
- model_name = "./Llama-2-13b-chat-hf"
- import json
- import torch
- from torch.utils.data import Dataset, DataLoader
- class TextGenerationDataset(Dataset):
-     def __init__(self, json_data):
-         self.data = json.loads(json_data)
-     def __len__(self):
-         return len(self.data)
-     def __getitem__(self, idx):
-         item = self.data[idx]
-         input_text = item['input']
-         expected_output = item['expected_output']
-         return input_text, expected_output
- # Test data used to create the Dataset instance
- json_data = r'''
- [
- {"input": "Write a calculator program using Python", "expected_output": "TODO"}
- ]
- '''
- def get_gpu_mem_usage():
-     # All values are in MiB
-     allocated_memory = torch.cuda.memory_allocated(device) / (1024 ** 2)
-     max_allocated_memory = torch.cuda.max_memory_allocated(device) / (1024 ** 2)
-     cached_memory = torch.cuda.memory_reserved(device) / (1024 ** 2)
-     max_cached_memory = torch.cuda.max_memory_reserved(device) / (1024 ** 2)
-     return np.array([allocated_memory, max_allocated_memory, cached_memory, max_cached_memory])
- def load_model_fp16():
-     model = AutoModelForCausalLM.from_pretrained(model_name).half().to(device)
-     return model
- def predict(model, tokenizer, test_dataloader):
-     global device
-     dataloader_iter = iter(test_dataloader)
-     input_text, expected_output = next(dataloader_iter)
-     inputs = tokenizer(input_text, return_tensors="pt").to(device)
-     # Time single-token generation three times; the last run's time is reported
-     for _ in range(3):
-         torch.manual_seed(42)
-         start_time = time.time()
-         with torch.no_grad():
-             outputs = model.generate(**inputs, max_new_tokens=1)
-         first_token_time = time.time() - start_time
-         first_token = tokenizer.decode(outputs[0], skip_special_tokens=True)
-     # Time one full generation with the model's default generation settings
-     torch.manual_seed(42)
-     start_time = time.time()
-     with torch.no_grad():
-         outputs = model.generate(**inputs)
-     total_time = time.time() - start_time
-     generated_tokens = len(outputs[0]) - len(inputs["input_ids"][0])
-     tokens_per_second = generated_tokens / total_time
-     response = tokenizer.decode(outputs[0], skip_special_tokens=True)
-     print("\n\n---------------------------------------- Response -------------------------------------")
-     print(f"{response}")
-     print("---------------------------------------------------------------------------------------")
-     print(f"Time taken for first token: {first_token_time:.4f} seconds")
-     print(f"Total time taken: {total_time:.4f} seconds")
-     print(f"Number of tokens generated: {generated_tokens}")
-     print(f"Tokens per second: {tokens_per_second:.2f}")
- test_dataset = TextGenerationDataset(json_data)
- test_dataloader = DataLoader(test_dataset, batch_size=1, shuffle=False)
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = load_model_fp16()
- mem_usage_0 = get_gpu_mem_usage()
- predict(model, tokenizer, test_dataloader)
- mem_usage_1 = get_gpu_mem_usage()
- print(f"BEFORE MA: {mem_usage_0[0]:.2f} MMA: {mem_usage_0[1]:.2f} CA: {mem_usage_0[2]:.2f} MCA: {mem_usage_0[3]:.2f}")
- print(f"AFTER MA: {mem_usage_1[0]:.2f} MMA: {mem_usage_1[1]:.2f} CA: {mem_usage_1[2]:.2f} MCA: {mem_usage_1[3]:.2f}")
- diff = mem_usage_1 - mem_usage_0
- print(f"DIFF MA: {diff[0]:.2f} MMA: {diff[1]:.2f} CA: {diff[2]:.2f} MCA: {diff[3]:.2f}")
- EOF
- python3 torch_infer.py
Output
- ---------------------------------------- Response -------------------------------------
- Write a calculator program using Python to calculate the total area of a rectangle.
- Here is the code for the calculator program:
- ```
- # Define the function to calculate the area of a rectangle
- def calculate_area(length, width):
-     # Calculate the area of the rectangle
-     area = length * width
-     # Return the area
-     return area
- # Define the main program
- def main():
-     # Get the length and width of the rectangle from the user
-     length = float(input("Enter the length of the rectangle: "))
-     width = float(input("Enter the width of the rectangle: "))
-     # Calculate and display the area of the rectangle
-     area = calculate_area(length, width)
-     print("The area of the rectangle is:", area)
- # Start the main program
- main()
- ```
- This program first defines a function called `calculate_area` that takes two arguments, `length` and `width`, and calculates the area of a rectangle using the formula `area = length * width`. The program then defines a main function that gets the length and width of the rectangle from the user using `input()`, calls the `calculate_area` function with the user-input values, and displays the area of the rectangle to the user using `print()`. Finally, the program starts the main function by calling it.
- Here's an example of how the program would work:
- 1. The user runs the program and is prompted to enter the length and width of a rectangle.
- 2. The user enters the length and width (e.g., 5 and 3).
- 3. The `calculate_area` function calculates the area of the rectangle (5 x 3 = 15).
- 4. The main function displays the area of the rectangle to the user (e.g., "The area of the rectangle is: 15").
- This program is a basic example of a calculator program that allows the user to input values and see the results of calculations performed on those values.
- ---------------------------------------------------------------------------------------
- Time taken for first token: 0.0490 seconds
- Total time taken: 21.2933 seconds
- Number of tokens generated: 442
- Tokens per second: 20.76
- BEFORE MA: 24948.81 MMA: 24948.81 CA: 24950.00 MCA: 24950.00
- AFTER MA: 24980.81 MMA: 25682.97 CA: 25968.00 MCA: 25968.00
- DIFF MA: 32.00 MMA: 734.16 CA: 1018.00 MCA: 1018.00
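The BEFORE MA figure (about 24,949 MiB allocated right after loading) can be sanity-checked against the size of the fp16 weights. The sketch below counts the parameters of a Llama-style decoder from its configuration and multiplies by 2 bytes per parameter. The function name `llama_param_count` is hypothetical, and the configuration values are assumptions taken from the model's published config.json rather than from this article.

```python
def llama_param_count(hidden, intermediate, layers, vocab):
    """Count weights in a Llama-style decoder without biases or tied embeddings."""
    attn = 4 * hidden * hidden          # q, k, v and o projections
    mlp = 3 * hidden * intermediate     # gate, up and down projections
    norms = 2 * hidden                  # two RMSNorm weights per layer
    per_layer = attn + mlp + norms
    embeddings = 2 * vocab * hidden     # input embedding plus output head
    return layers * per_layer + embeddings + hidden  # plus the final RMSNorm


# Assumed values (from the published config.json of Llama-2-13b-chat-hf).
params = llama_param_count(hidden=5120, intermediate=13824, layers=40, vocab=32000)
fp16_mib = params * 2 / 1024 ** 2   # 2 bytes per fp16 parameter

measured_mib = 24948.81             # BEFORE MA from the output above
print(f"parameters: {params:,}")
print(f"fp16 weights: {fp16_mib:.2f} MiB")
print(f"measured minus estimate: {measured_mib - fp16_mib:.2f} MiB")
```

The estimate lands within roughly 0.5% of the measured value. The small remainder presumably comes from non-parameter buffers and allocator rounding, which this sketch does not model.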
3. DeepSpeed inference
- tee ds_infer.py <<-'EOF'
- import os
- import gc
- from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
- import torch
- import time
- import numpy as np
- import deepspeed
- torch.cuda.empty_cache()
- gc.collect()
- os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
- # WORLD_SIZE and LOCAL_RANK are set by the deepspeed launcher
- world_size = int(os.getenv("WORLD_SIZE", "1"))
- local_rank = int(os.getenv('LOCAL_RANK', '0'))
- device = f"cuda:{local_rank}"
- model_name = "./Llama-2-13b-chat-hf"
- import json
- import torch
- from torch.utils.data import Dataset, DataLoader
- class TextGenerationDataset(Dataset):
-     def __init__(self, json_data):
-         self.data = json.loads(json_data)
-     def __len__(self):
-         return len(self.data)
-     def __getitem__(self, idx):
-         item = self.data[idx]
-         input_text = item['input']
-         expected_output = item['expected_output']
-         return input_text, expected_output
- # Test data used to create the Dataset instance
- json_data = r'''
- [
- {"input": "Write a calculator program using Python", "expected_output": "TODO"}
- ]
- '''
- def get_gpu_mem_usage():
-     # All values are in MiB
-     allocated_memory = torch.cuda.memory_allocated(device) / (1024 ** 2)
-     max_allocated_memory = torch.cuda.max_memory_allocated(device) / (1024 ** 2)
-     cached_memory = torch.cuda.memory_reserved(device) / (1024 ** 2)
-     max_cached_memory = torch.cuda.max_memory_reserved(device) / (1024 ** 2)
-     return np.array([allocated_memory, max_allocated_memory, cached_memory, max_cached_memory])
- def load_model_fp16():
-     model = AutoModelForCausalLM.from_pretrained(model_name)
-     # Shard the model across world_size GPUs and inject DeepSpeed inference kernels
-     ds_engine = deepspeed.init_inference(model,
-                                          tensor_parallel={"tp_size": world_size},
-                                          dtype=torch.half,
-                                          replace_method="auto",
-                                          replace_with_kernel_inject=True)
-     model = ds_engine#.module
-     return model
- def predict(model, tokenizer, test_dataloader):
-     global device
-     dataloader_iter = iter(test_dataloader)
-     input_text, expected_output = next(dataloader_iter)
-     inputs = tokenizer(input_text, return_tensors="pt").to(device)
-     # Time single-token generation three times; the last run's time is reported
-     for _ in range(3):
-         torch.manual_seed(42)
-         start_time = time.time()
-         with torch.no_grad():
-             outputs = model.generate(**inputs, max_new_tokens=1, use_cache=False)
-         first_token_time = time.time() - start_time
-         first_token = tokenizer.decode(outputs[0], skip_special_tokens=True)
-     # Time one full generation with the model's default generation settings
-     torch.manual_seed(42)
-     start_time = time.time()
-     with torch.no_grad():
-         outputs = model.generate(**inputs, use_cache=False)
-     total_time = time.time() - start_time
-     generated_tokens = len(outputs[0]) - len(inputs["input_ids"][0])
-     tokens_per_second = generated_tokens / total_time
-     response = tokenizer.decode(outputs[0], skip_special_tokens=True)
-     # Only rank 0 prints the response and timings
-     if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
-         print("\n\n---------------------------------------- Response -------------------------------------")
-         print(f"{response}")
-         print("---------------------------------------------------------------------------------------")
-         print(f"Time taken for first token: {first_token_time:.4f} seconds")
-         print(f"Total time taken: {total_time:.4f} seconds")
-         print(f"Number of tokens generated: {generated_tokens}")
-         print(f"Tokens per second: {tokens_per_second:.2f}")
- test_dataset = TextGenerationDataset(json_data)
- test_dataloader = DataLoader(test_dataset, batch_size=1, shuffle=False)
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = load_model_fp16()
- mem_usage_0 = get_gpu_mem_usage()
- predict(model, tokenizer, test_dataloader)
- mem_usage_1 = get_gpu_mem_usage()
- torch.cuda.synchronize()
- # Stagger the per-rank printing so the lines do not interleave
- time.sleep(local_rank)
- print(f"RANK:{local_rank} BEFORE MA: {mem_usage_0[0]:.2f} MMA: {mem_usage_0[1]:.2f} CA: {mem_usage_0[2]:.2f} MCA: {mem_usage_0[3]:.2f}")
- print(f"RANK:{local_rank} AFTER MA: {mem_usage_1[0]:.2f} MMA: {mem_usage_1[1]:.2f} CA: {mem_usage_1[2]:.2f} MCA: {mem_usage_1[3]:.2f}")
- diff = mem_usage_1 - mem_usage_0
- print(f"RANK:{local_rank} DIFF MA: {diff[0]:.2f} MMA: {diff[1]:.2f} CA: {diff[2]:.2f} MCA: {diff[3]:.2f}")
- EOF
- deepspeed --num_gpus 1 ds_infer.py
- deepspeed --num_gpus 4 ds_infer.py
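The script passes `tensor_parallel={"tp_size": world_size}`, which asks DeepSpeed to split the model's weight matrices across the launched GPUs. The toy sketch below illustrates only the underlying idea on a single matrix: each rank keeps a slice of the columns, computes its part of the output, and the parts are concatenated. It is a pure-Python illustration under our own assumptions (the helper names `matmul` and `split_columns` are hypothetical), not DeepSpeed's actual implementation, which also uses fused kernels and collective communication.

```python
def matmul(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, tp_size):
    """Split w column-wise into tp_size equal shards, one per rank."""
    cols = len(w[0])
    assert cols % tp_size == 0, "columns must divide evenly across ranks"
    step = cols // tp_size
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(tp_size)]


x = [1.0, 2.0, 3.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0]]

full = matmul(x, w)                                   # result on one device
shards = split_columns(w, tp_size=4)                  # each rank holds 1/4 of w
partials = [matmul(x, shard) for shard in shards]     # each rank's local result
gathered = [v for part in partials for v in part]     # concatenate across ranks
print(full == gathered)
```

Because each rank stores only its own shard, the weight memory per GPU shrinks roughly in proportion to `tp_size`.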
Output (1 GPU)
- ---------------------------------------- Response -------------------------------------
- Write a calculator program using Python to calculate the total area of a rectangle.
- Here is the code for the calculator program:
- ```
- # Define the function to calculate the area of a rectangle
- def calculate_area(length, width):
-     # Calculate the area of the rectangle
-     area = length * width
-     # Return the area
-     return area
- # Define the main program
- def main():
-     # Get the length and width of the rectangle from the user
-     length = float(input("Enter the length of the rectangle: "))
-     width = float(input("Enter the width of the rectangle: "))
-     # Calculate and display the area of the rectangle
-     area = calculate_area(length, width)
-     print("The area of the rectangle is:", area)
- # Start the main program
- main()
- ```
- This program first defines a function called `calculate_area` that takes two arguments, `length` and `width`, and calculates the area of a rectangle using the formula `area = length * width`. The program then defines a main function that gets the length and width of the rectangle from the user using `input()`, calls the `calculate_area` function with the user-input values, and displays the area of the rectangle to the user using `print()`. Finally, the program starts the main function by calling it.
- Here's an example of how the program would work:
- 1. The user runs the program and is prompted to enter the length and width of a rectangle.
- 2. The user enters the length and width (e.g., 5 and 3).
- 3. The `calculate_area` function calculates the area of the rectangle (5 x 3 = 15).
- 4. The main function displays the area of the rectangle to the user (e.g., "The area of the rectangle is: 15").
- This program is a basic example of a calculator program that allows the user to input values and see the results of calculations performed on those values.
- ---------------------------------------------------------------------------------------
- Time taken for first token: 0.0217 seconds
- Total time taken: 12.8229 seconds
- Number of tokens generated: 442
- Tokens per second: 34.47
- RANK:0 BEFORE MA: 25265.87 MMA: 25265.87 CA: 26440.00 MCA: 26440.00
- RANK:0 AFTER MA: 25297.87 MMA: 25439.42 CA: 26444.00 MCA: 26444.00
- RANK:0 DIFF MA: 32.00 MMA: 173.55 CA: 4.00 MCA: 4.00
- [2024-05-27 04:57:40,353] [INFO] [launch.py:349:main] Process 439917 exits successfully.
Output (4 GPUs)
- ---------------------------------------- Response -------------------------------------
- Write a calculator program using Python to calculate the total area of a rectangle.
- Here is the code for the calculator program:
- ```
- # Define the function to calculate the area of a rectangle
- def calculate_area(length, width):
-     # Calculate the area of the rectangle
-     area = length * width
-     # Return the area
-     return area
- # Define the main program
- def main():
-     # Get the length and width of the rectangle from the user
-     length = float(input("Enter the length of the rectangle: "))
-     width = float(input("Enter the width of the rectangle: "))
-     # Calculate and display the area of the rectangle
-     area = calculate_area(length, width)
-     print("The area of the rectangle is:", area)
- # Start the main program
- main()
- ```
- This program first defines a function called `calculate_area` that takes two arguments, `length` and `width`, and calculates the area of a rectangle using the formula `area = length * width`. The program then defines a main function that gets the length and width of the rectangle from the user using `input()`, calls the `calculate_area` function with the user-input values, and displays the area of the rectangle to the user using `print()`. Finally, the program starts the main function by calling it.
- Here's an example of how the program would work:
- 1. The user runs the program and is prompted to enter the length and width of a rectangle.
- 2. The user enters the length and width (e.g., 5 and 3).
- 3. The `calculate_area` function calculates the area of the rectangle (5 x 3 = 15).
- 4. The main function displays the area of the rectangle to the user (e.g., "The area of the rectangle is: 15").
- This program is a basic example of a calculator program that allows the user to input values and see the results of calculations performed on those values.
- ---------------------------------------------------------------------------------------
- Time taken for first token: 0.0202 seconds
- Total time taken: 12.5792 seconds
- Number of tokens generated: 442
- Tokens per second: 35.14
- RANK:0 BEFORE MA: 6783.06 MMA: 6783.06 CA: 7460.00 MCA: 7460.00
- RANK:0 AFTER MA: 6815.06 MMA: 6957.23 CA: 7464.00 MCA: 7464.00
- RANK:0 DIFF MA: 32.00 MMA: 174.17 CA: 4.00 MCA: 4.00
- RANK:1 BEFORE MA: 6783.06 MMA: 6783.06 CA: 7460.00 MCA: 7460.00
- RANK:1 AFTER MA: 6815.06 MMA: 6957.23 CA: 7464.00 MCA: 7464.00
- RANK:1 DIFF MA: 32.00 MMA: 174.17 CA: 4.00 MCA: 4.00
- [2024-05-27 05:03:10,889] [INFO] [launch.py:349:main] Process 440888 exits successfully.
- RANK:2 BEFORE MA: 6783.06 MMA: 6783.06 CA: 7460.00 MCA: 7460.00
- RANK:2 AFTER MA: 6815.06 MMA: 6957.23 CA: 7464.00 MCA: 7464.00
- RANK:2 DIFF MA: 32.00 MMA: 174.17 CA: 4.00 MCA: 4.00
- [2024-05-27 05:03:11,891] [INFO] [launch.py:349:main] Process 440889 exits successfully.
- RANK:3 BEFORE MA: 6783.06 MMA: 6783.06 CA: 7460.00 MCA: 7460.00
- RANK:3 AFTER MA: 6815.06 MMA: 6957.23 CA: 7464.00 MCA: 7464.00
- RANK:3 DIFF MA: 32.00 MMA: 174.17 CA: 4.00 MCA: 4.00
- [2024-05-27 05:03:12,893] [INFO] [launch.py:349:main] Process 440890 exits successfully.
- [2024-05-27 05:03:13,895] [INFO] [launch.py:349:main] Process 440891 exits successfully.
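To compare the three runs side by side, the reported numbers can be put into a small table. The sketch below recomputes tokens per second from the logged token counts and total times, and derives the speedup over the plain PyTorch run and the per-GPU memory relative to it. All input figures are copied from the outputs above; the helper name `summarize` and the dictionary layout are our own.

```python
# Figures copied from the three outputs above (memory is BEFORE MA, in MiB).
runs = {
    "torch, 1 GPU": {"tokens": 442, "seconds": 21.2933, "mib_per_gpu": 24948.81},
    "deepspeed, 1 GPU": {"tokens": 442, "seconds": 12.8229, "mib_per_gpu": 25265.87},
    "deepspeed, 4 GPUs": {"tokens": 442, "seconds": 12.5792, "mib_per_gpu": 6783.06},
}


def summarize(run, baseline):
    """Return throughput, speedup over baseline, and memory relative to baseline."""
    return {
        "tokens_per_second": run["tokens"] / run["seconds"],
        "speedup": baseline["seconds"] / run["seconds"],
        "memory_ratio": run["mib_per_gpu"] / baseline["mib_per_gpu"],
    }


baseline = runs["torch, 1 GPU"]
summary = {name: summarize(run, baseline) for name, run in runs.items()}

for name, s in summary.items():
    print(f"{name:18s} {s['tokens_per_second']:6.2f} tok/s  "
          f"speedup {s['speedup']:.2f}x  memory/GPU {s['memory_ratio']:.2f}x")
```

On these numbers, most of the speedup comes from switching to DeepSpeed on a single GPU, while going from 1 to 4 GPUs mainly reduces the memory held on each GPU.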