Llama-2-13b-chat-hf单卡、多卡推理

张裕  论坛元老 | 2024-10-6 16:09:50 | 显示全部楼层 | 阅读模式
打印 上一主题 下一主题

主题 1553|帖子 1553|积分 4659

马上注册,结交更多好友,享用更多功能,让你轻松玩转社区。

您需要 登录 才可以下载或查看,没有账号?立即注册

x
本文演示了Llama-2-13b-chat-hf模子如何下载、单卡推理、多卡推理的步骤及测试效果
一.下载模子

  1. from huggingface_hub import snapshot_download
  2. snapshot_download(repo_id='meta-llama/Llama-2-13b-chat-hf',
  3.                   repo_type='model',
  4.                   local_dir='./Llama-2-13b-chat-hf',
  5.                   resume_download=True,
  6.                   token="your token")
复制代码
保留以下文件即可:
  1. Llama-2-13b-chat-hf/
  2. ├── LICENSE.txt
  3. ├── README.md
  4. ├── Responsible-Use-Guide.pdf
  5. ├── USE_POLICY.md
  6. ├── config.json
  7. ├── generation_config.json
  8. ├── pytorch_model-00001-of-00003.bin
  9. ├── pytorch_model-00002-of-00003.bin
  10. ├── pytorch_model-00003-of-00003.bin
  11. ├── pytorch_model.bin.index.json
  12. ├── special_tokens_map.json
  13. ├── tokenizer.json
  14. ├── tokenizer.model
  15. └── tokenizer_config.json
复制代码
二.单卡推理

  1. tee torch_infer.py <<-'EOF'
  2. import os
  3. import gc
  4. from transformers import AutoModelForCausalLM, AutoTokenizer,BitsAndBytesConfig
  5. import torch
  6. import time
  7. import numpy as np
  8. torch.cuda.empty_cache()
  9. gc.collect()
  10. os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
  11. device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  12. model_name = "./Llama-2-13b-chat-hf"
  13. import json
  14. import torch
  15. from torch.utils.data import Dataset, DataLoader
  16. class TextGenerationDataset(Dataset):
  17.     def __init__(self, json_data):
  18.         self.data = json.loads(json_data)
  19.     def __len__(self):
  20.         return len(self.data)
  21.     def __getitem__(self, idx):
  22.         item = self.data[idx]
  23.         input_text = item['input']
  24.         expected_output = item['expected_output']
  25.         return input_text, expected_output
  26. # 创建 Dataset 实例
  27. json_data =r'''
  28. [
  29.     {"input": "Write a calculator program using Python", "expected_output": "TODO"}
  30. ]
  31. '''
  32. def get_gpu_mem_usage():
  33.     allocated_memory = torch.cuda.memory_allocated(device) / (1024 ** 2)
  34.     max_allocated_memory = torch.cuda.max_memory_allocated(device) / (1024 ** 2)
  35.     cached_memory = torch.cuda.memory_reserved(device) / (1024 ** 2)   
  36.     max_cached_memory = torch.cuda.max_memory_reserved(device) / (1024 ** 2)
  37.     return np.array([allocated_memory,max_allocated_memory,cached_memory,max_cached_memory])
  38. def load_model_fp16():
  39.     model = AutoModelForCausalLM.from_pretrained(model_name).half().to(device)
  40.     return model
  41. def predict(model,tokenizer,test_dataloader):
  42.     global device
  43.     dataloader_iter = iter(test_dataloader)
  44.     input_text, expected_output=next(dataloader_iter)
  45.     inputs = tokenizer(input_text, return_tensors="pt").to(device)
  46.     for _ in range(3):
  47.         torch.manual_seed(42)
  48.         start_time = time.time()
  49.         with torch.no_grad():
  50.             outputs = model.generate(**inputs, max_new_tokens=1)
  51.         first_token_time = time.time() - start_time
  52.         first_token = tokenizer.decode(outputs[0], skip_special_tokens=True)
  53.         torch.manual_seed(42)
  54.         start_time = time.time()
  55.         with torch.no_grad():
  56.             outputs = model.generate(**inputs)
  57.         total_time = time.time() - start_time
  58.         generated_tokens = len(outputs[0]) - len(inputs["input_ids"][0])
  59.         tokens_per_second = generated_tokens / total_time
  60.     response = tokenizer.decode(outputs[0], skip_special_tokens=True)
  61.     print("\n\n---------------------------------------- Response -------------------------------------")
  62.     print(f"{response}")
  63.     print("---------------------------------------------------------------------------------------")
  64.     print(f"Time taken for first token: {first_token_time:.4f} seconds")
  65.     print(f"Total time taken: {total_time:.4f} seconds")
  66.     print(f"Number of tokens generated: {generated_tokens}")
  67.     print(f"Tokens per second: {tokens_per_second:.2f}")
  68. test_dataset = TextGenerationDataset(json_data)
  69. test_dataloader = DataLoader(test_dataset, batch_size=1, shuffle=False)
  70. tokenizer = AutoTokenizer.from_pretrained(model_name)
  71. model=load_model_fp16()
  72. mem_usage_0=get_gpu_mem_usage()
  73. predict(model,tokenizer,test_dataloader)
  74. mem_usage_1=get_gpu_mem_usage()
  75. print(f"BEFORE MA: {mem_usage_0[0]:.2f} MMA: {mem_usage_0[1]:.2f} CA: {mem_usage_0[2]:.2f} MCA: {mem_usage_0[3]:.2f}")
  76. print(f"AFTER  MA: {mem_usage_1[0]:.2f} MMA: {mem_usage_1[1]:.2f} CA: {mem_usage_1[2]:.2f} MCA: {mem_usage_1[3]:.2f}")
  77. diff=mem_usage_1-mem_usage_0
  78. print(f"DIFF   MA: {diff[0]:.2f} MMA: {diff[1]:.2f} CA: {diff[2]:.2f} MCA: {diff[3]:.2f}")
  79. EOF
  80. python3 torch_infer.py
复制代码
输出
  1. ---------------------------------------- Response -------------------------------------
  2. Write a calculator program using Python to calculate the total area of a rectangle.
  3. Here is the code for the calculator program:
  4. ​```
  5. # Define the function to calculate the area of a rectangle
  6. def calculate_area(length, width):
  7.     # Calculate the area of the rectangle
  8.     area = length * width
  9.     # Return the area
  10.     return area
  11. # Define the main program
  12. def main():
  13.     # Get the length and width of the rectangle from the user
  14.     length = float(input("Enter the length of the rectangle: "))
  15.     width = float(input("Enter the width of the rectangle: "))
  16.     # Calculate and display the area of the rectangle
  17.     area = calculate_area(length, width)
  18.     print("The area of the rectangle is:", area)
  19. # Start the main program
  20. main()
  21. ​```
  22. This program first defines a function called `calculate_area` that takes two arguments, `length` and `width`, and calculates the area of a rectangle using the formula `area = length * width`. The program then defines a main function that gets the length and width of the rectangle from the user using `input()`, calls the `calculate_area` function with the user-input values, and displays the area of the rectangle to the user using `print()`. Finally, the program starts the main function by calling it.
  23. Here's an example of how the program would work:
  24. 1. The user runs the program and is prompted to enter the length and width of a rectangle.
  25. 2. The user enters the length and width (e.g., 5 and 3).
  26. 3. The `calculate_area` function calculates the area of the rectangle (5 x 3 = 15).
  27. 4. The main function displays the area of the rectangle to the user (e.g., "The area of the rectangle is: 15").
  28. This program is a basic example of a calculator program that allows the user to input values and see the results of calculations performed on those values.
  29. ---------------------------------------------------------------------------------------
  30. Time taken for first token: 0.0490 seconds
  31. Total time taken: 21.2933 seconds
  32. Number of tokens generated: 442
  33. Tokens per second: 20.76
  34. BEFORE MA: 24948.81 MMA: 24948.81 CA: 24950.00 MCA: 24950.00
  35. AFTER  MA: 24980.81 MMA: 25682.97 CA: 25968.00 MCA: 25968.00
  36. DIFF   MA: 32.00 MMA: 734.16 CA: 1018.00 MCA: 1018.00
复制代码
三.deepspeed推理

  1. tee ds_infer.py <<-'EOF'
  2. import os
  3. import gc
  4. from transformers import AutoModelForCausalLM, AutoTokenizer,BitsAndBytesConfig
  5. import torch
  6. import time
  7. import numpy as np
  8. import deepspeed
  9. torch.cuda.empty_cache()
  10. gc.collect()
  11. os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
  12. world_size = int(os.getenv("WORLD_SIZE", "1"))
  13. local_rank = int(os.getenv('LOCAL_RANK', '0'))
  14. device = f"cuda:{local_rank}"
  15. model_name = "./Llama-2-13b-chat-hf"
  16. import json
  17. import torch
  18. from torch.utils.data import Dataset, DataLoader
  19. class TextGenerationDataset(Dataset):
  20.     def __init__(self, json_data):
  21.         self.data = json.loads(json_data)
  22.     def __len__(self):
  23.         return len(self.data)
  24.     def __getitem__(self, idx):
  25.         item = self.data[idx]
  26.         input_text = item['input']
  27.         expected_output = item['expected_output']
  28.         return input_text, expected_output
  29. # 创建 Dataset 实例
  30. json_data =r'''
  31. [
  32.     {"input": "Write a calculator program using Python", "expected_output": "TODO"}
  33. ]
  34. '''
  35. def get_gpu_mem_usage():
  36.     allocated_memory = torch.cuda.memory_allocated(device) / (1024 ** 2)
  37.     max_allocated_memory = torch.cuda.max_memory_allocated(device) / (1024 ** 2)
  38.     cached_memory = torch.cuda.memory_reserved(device) / (1024 ** 2)   
  39.     max_cached_memory = torch.cuda.max_memory_reserved(device) / (1024 ** 2)
  40.     return np.array([allocated_memory,max_allocated_memory,cached_memory,max_cached_memory])
  41. def load_model_fp16():
  42.     model = AutoModelForCausalLM.from_pretrained(model_name)
  43.     ds_engine = deepspeed.init_inference(model,
  44.                                      tensor_parallel={"tp_size": world_size},
  45.                                      dtype=torch.half,
  46.                                      replace_method="auto",
  47.                                      replace_with_kernel_inject=True)
  48.     model = ds_engine#.module
  49.     return model
  50. def predict(model,tokenizer,test_dataloader):
  51.     global device
  52.     dataloader_iter = iter(test_dataloader)
  53.     input_text, expected_output=next(dataloader_iter)
  54.     inputs = tokenizer(input_text, return_tensors="pt").to(device)
  55.     for _ in range(3):
  56.         torch.manual_seed(42)
  57.         start_time = time.time()
  58.         with torch.no_grad():
  59.             outputs = model.generate(**inputs, max_new_tokens=1,use_cache=False)
  60.         first_token_time = time.time() - start_time
  61.         first_token = tokenizer.decode(outputs[0], skip_special_tokens=True)
  62.         torch.manual_seed(42)
  63.         start_time = time.time()
  64.         with torch.no_grad():
  65.             outputs = model.generate(**inputs,use_cache=False)
  66.         total_time = time.time() - start_time
  67.         generated_tokens = len(outputs[0]) - len(inputs["input_ids"][0])
  68.         tokens_per_second = generated_tokens / total_time
  69.     response = tokenizer.decode(outputs[0], skip_special_tokens=True)
  70.     if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
  71.         print("\n\n---------------------------------------- Response -------------------------------------")
  72.         print(f"{response}")
  73.         print("---------------------------------------------------------------------------------------")
  74.         print(f"Time taken for first token: {first_token_time:.4f} seconds")
  75.         print(f"Total time taken: {total_time:.4f} seconds")
  76.         print(f"Number of tokens generated: {generated_tokens}")
  77.         print(f"Tokens per second: {tokens_per_second:.2f}")
  78. test_dataset = TextGenerationDataset(json_data)
  79. test_dataloader = DataLoader(test_dataset, batch_size=1, shuffle=False)
  80. tokenizer = AutoTokenizer.from_pretrained(model_name)
  81. model=load_model_fp16()
  82. mem_usage_0=get_gpu_mem_usage()
  83. predict(model,tokenizer,test_dataloader)
  84. mem_usage_1=get_gpu_mem_usage()
  85. torch.cuda.synchronize()
  86. time.sleep(local_rank)
  87. print(f"RANK:{local_rank} BEFORE MA: {mem_usage_0[0]:.2f} MMA: {mem_usage_0[1]:.2f} CA: {mem_usage_0[2]:.2f} MCA: {mem_usage_0[3]:.2f}")
  88. print(f"RANK:{local_rank} AFTER  MA: {mem_usage_1[0]:.2f} MMA: {mem_usage_1[1]:.2f} CA: {mem_usage_1[2]:.2f} MCA: {mem_usage_1[3]:.2f}")
  89. diff=mem_usage_1-mem_usage_0
  90. print(f"RANK:{local_rank} DIFF   MA: {diff[0]:.2f} MMA: {diff[1]:.2f} CA: {diff[2]:.2f} MCA: {diff[3]:.2f}")
  91. EOF
  92. deepspeed --num_gpus 1 ds_infer.py
  93. deepspeed --num_gpus 4 ds_infer.py
复制代码
输出
  1. ---------------------------------------- Response -------------------------------------
  2. Write a calculator program using Python to calculate the total area of a rectangle.
  3. Here is the code for the calculator program:
  4. ​```
  5. # Define the function to calculate the area of a rectangle
  6. def calculate_area(length, width):
  7.     # Calculate the area of the rectangle
  8.     area = length * width
  9.     # Return the area
  10.     return area
  11. # Define the main program
  12. def main():
  13.     # Get the length and width of the rectangle from the user
  14.     length = float(input("Enter the length of the rectangle: "))
  15.     width = float(input("Enter the width of the rectangle: "))
  16.     # Calculate and display the area of the rectangle
  17.     area = calculate_area(length, width)
  18.     print("The area of the rectangle is:", area)
  19. # Start the main program
  20. main()
  21. ​```
  22. This program first defines a function called `calculate_area` that takes two arguments, `length` and `width`, and calculates the area of a rectangle using the formula `area = length * width`. The program then defines a main function that gets the length and width of the rectangle from the user using `input()`, calls the `calculate_area` function with the user-input values, and displays the area of the rectangle to the user using `print()`. Finally, the program starts the main function by calling it.
  23. Here's an example of how the program would work:
  24. 1. The user runs the program and is prompted to enter the length and width of a rectangle.
  25. 2. The user enters the length and width (e.g., 5 and 3).
  26. 3. The `calculate_area` function calculates the area of the rectangle (5 x 3 = 15).
  27. 4. The main function displays the area of the rectangle to the user (e.g., "The area of the rectangle is: 15").
  28. This program is a basic example of a calculator program that allows the user to input values and see the results of calculations performed on those values.
  29. ---------------------------------------------------------------------------------------
  30. Time taken for first token: 0.0217 seconds
  31. Total time taken: 12.8229 seconds
  32. Number of tokens generated: 442
  33. Tokens per second: 34.47
  34. RANK:0 BEFORE MA: 25265.87 MMA: 25265.87 CA: 26440.00 MCA: 26440.00
  35. RANK:0 AFTER  MA: 25297.87 MMA: 25439.42 CA: 26444.00 MCA: 26444.00
  36. RANK:0 DIFF   MA: 32.00 MMA: 173.55 CA: 4.00 MCA: 4.00
  37. [2024-05-27 04:57:40,353] [INFO] [launch.py:349:main] Process 439917 exits successfully.
复制代码
  1. ---------------------------------------- Response -------------------------------------
  2. Write a calculator program using Python to calculate the total area of a rectangle.
  3. Here is the code for the calculator program:
  4. ​```
  5. # Define the function to calculate the area of a rectangle
  6. def calculate_area(length, width):
  7.     # Calculate the area of the rectangle
  8.     area = length * width
  9.     # Return the area
  10.     return area
  11. # Define the main program
  12. def main():
  13.     # Get the length and width of the rectangle from the user
  14.     length = float(input("Enter the length of the rectangle: "))
  15.     width = float(input("Enter the width of the rectangle: "))
  16.     # Calculate and display the area of the rectangle
  17.     area = calculate_area(length, width)
  18.     print("The area of the rectangle is:", area)
  19. # Start the main program
  20. main()
  21. ​```
  22. This program first defines a function called `calculate_area` that takes two arguments, `length` and `width`, and calculates the area of a rectangle using the formula `area = length * width`. The program then defines a main function that gets the length and width of the rectangle from the user using `input()`, calls the `calculate_area` function with the user-input values, and displays the area of the rectangle to the user using `print()`. Finally, the program starts the main function by calling it.
  23. Here's an example of how the program would work:
  24. 1. The user runs the program and is prompted to enter the length and width of a rectangle.
  25. 2. The user enters the length and width (e.g., 5 and 3).
  26. 3. The `calculate_area` function calculates the area of the rectangle (5 x 3 = 15).
  27. 4. The main function displays the area of the rectangle to the user (e.g., "The area of the rectangle is: 15").
  28. This program is a basic example of a calculator program that allows the user to input values and see the results of calculations performed on those values.
  29. ---------------------------------------------------------------------------------------
  30. Time taken for first token: 0.0202 seconds
  31. Total time taken: 12.5792 seconds
  32. Number of tokens generated: 442
  33. Tokens per second: 35.14
  34. RANK:0 BEFORE MA: 6783.06 MMA: 6783.06 CA: 7460.00 MCA: 7460.00
  35. RANK:0 AFTER  MA: 6815.06 MMA: 6957.23 CA: 7464.00 MCA: 7464.00
  36. RANK:0 DIFF   MA: 32.00 MMA: 174.17 CA: 4.00 MCA: 4.00
  37. RANK:1 BEFORE MA: 6783.06 MMA: 6783.06 CA: 7460.00 MCA: 7460.00
  38. RANK:1 AFTER  MA: 6815.06 MMA: 6957.23 CA: 7464.00 MCA: 7464.00
  39. RANK:1 DIFF   MA: 32.00 MMA: 174.17 CA: 4.00 MCA: 4.00
  40. [2024-05-27 05:03:10,889] [INFO] [launch.py:349:main] Process 440888 exits successfully.
  41. RANK:2 BEFORE MA: 6783.06 MMA: 6783.06 CA: 7460.00 MCA: 7460.00
  42. RANK:2 AFTER  MA: 6815.06 MMA: 6957.23 CA: 7464.00 MCA: 7464.00
  43. RANK:2 DIFF   MA: 32.00 MMA: 174.17 CA: 4.00 MCA: 4.00
  44. [2024-05-27 05:03:11,891] [INFO] [launch.py:349:main] Process 440889 exits successfully.
  45. RANK:3 BEFORE MA: 6783.06 MMA: 6783.06 CA: 7460.00 MCA: 7460.00
  46. RANK:3 AFTER  MA: 6815.06 MMA: 6957.23 CA: 7464.00 MCA: 7464.00
  47. RANK:3 DIFF   MA: 32.00 MMA: 174.17 CA: 4.00 MCA: 4.00
  48. [2024-05-27 05:03:12,893] [INFO] [launch.py:349:main] Process 440890 exits successfully.
  49. [2024-05-27 05:03:13,895] [INFO] [launch.py:349:main] Process 440891 exits successfully.
复制代码
免责声明:如果侵犯了您的权益,请联系站长,我们会及时删除侵权内容,谢谢合作!更多信息从访问主页:qidao123.com:ToB企服之家,中国第一个企服评测及商务社交产业平台。
回复

使用道具 举报

0 个回复

倒序浏览

快速回复

您需要登录后才可以回帖 登录 or 立即注册

本版积分规则

张裕

论坛元老
这个人很懒什么都没写!
快速回复 返回顶部 返回列表