
Title: [Karpathy's build-nanogpt] Take Away Notes

Author: 何小豆儿在此    Time: 2024-07-18 22:42
Title: [Karpathy's build-nanogpt] Take Away Notes
Bilibili translated video: LINK
Personal Note

Andrej rebuilds GPT-2 in PyTorch.
Take Away Points

SAMPLE CODE FOR DDP (DistributedDataParallel)

  # torchrun --standalone --nproc_per_node=<num_gpu_per_node> <your_training_script.py> <script_arguments>
  # Above only applies to single-node training.
  import os
  import time

  import torch
  from torch.distributed import init_process_group, destroy_process_group
  from torch.nn.parallel import DistributedDataParallel as DDP

  # SETTINGS FOR EACH DIFFERENT RANK
  ddp = int(os.environ.get('RANK', -1)) != -1 # torchrun sets RANK, so this detects a DDP launch
  if ddp:
    assert torch.cuda.is_available()
    init_process_group(backend='nccl')
    ddp_rank = int(os.environ['RANK'])             # global rank, unique for every process
    ddp_local_rank = int(os.environ['LOCAL_RANK']) # local rank within the local machine (node)
    ddp_world_size = int(os.environ['WORLD_SIZE']) # how many GPUs (processes) in total
    device = f'cuda:{ddp_local_rank}'
    torch.cuda.set_device(device)
    master_process = ddp_rank == 0 # only the master process logs / saves checkpoints
  else:
    ddp_rank = 0
    ddp_local_rank = 0
    ddp_world_size = 1
    master_process = True
    device = "cpu"
    if torch.cuda.is_available():
      device = "cuda"
    elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
      device = "mps"
    print(f"using device: {device}")
  # IF YOU USE GRAD ACCUMULATION
  total_batch_size = 524288 # desired batch size, measured in number of tokens
  B = 16   # micro batch size for each process
  T = 1024 # sequence length
  assert total_batch_size % (B * T * ddp_world_size) == 0
  grad_accum_steps = total_batch_size // (B * T * ddp_world_size)
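  # Worked example (assuming an 8-GPU node): each micro step processes
  # B * T * ddp_world_size = 16 * 1024 * 8 = 131072 tokens, so
  # grad_accum_steps = 524288 // 131072 = 4 micro steps per optimizer step.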
  # SET DATALOADER
  train_loader = DataLoader(*args, ddp_world_size, ddp_rank) # MUST make each process read a different part of the dataset; a sketch of such a loader follows the code block
  # CREATE MODEL
  model = create_model() # placeholder for your model constructor, e.g. GPT(config)
  model.to(device)
  model = torch.compile(model)
  if ddp:
    model = DDP(model, device_ids=[ddp_local_rank]) # this must be ddp_local_rank, not ddp_rank
  raw_model = model.module if ddp else model # DDP wraps the model, so keep a handle to the underlying module
  # FIX SEED
  seed = 1337 # your lucky number (manual_seed needs an int)
  torch.manual_seed(seed)
  if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)

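  # OPTIMIZER — a minimal sketch, not the exact code from the video; the video builds it
  # through a configure_optimizers helper that adds weight decay groups and fused AdamW.
  optimizer = torch.optim.AdamW(raw_model.parameters(), lr=3e-4)
  max_steps = 50 # placeholder; set this to however many optimizer steps you plan to run
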
  # TRAIN
  for step in range(max_steps):
    t0 = time.time()

    model.train()
    optimizer.zero_grad()
    loss_accum = 0.0
    for micro_step in range(grad_accum_steps):
      x, y = train_loader.next_batch()
      x, y = x.to(device), y.to(device)
      # autocast expects the device *type* ("cuda"), not a specific index like "cuda:0"
      with torch.autocast(device_type="cuda" if device.startswith("cuda") else "cpu", dtype=torch.bfloat16):
        logits, loss = model(x, y)
      loss = loss / grad_accum_steps # scale so the accumulated gradient is the average over micro batches
      loss_accum += loss.detach()
      if ddp:
        # Syncing gradients on every micro step wastes time; only the last backward in an
        # accumulation cycle needs the all-reduce. See the no_sync() context manager for the
        # official way, or toggle the flag directly as shown here.
        model.require_backward_grad_sync = (micro_step == grad_accum_steps - 1)
      loss.backward()
    if ddp:
      torch.distributed.all_reduce(loss_accum, op=torch.distributed.ReduceOp.AVG) # average the logged loss across ranks
    norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) # clip the global grad norm to 1.0
    optimizer.step() # update the weights once per accumulation cycle

    if step % 100 == 0:
      # start evaluation
      model.eval()
      with torch.no_grad():
        pass # SOME EVALUATION CODE

  if ddp:
    destroy_process_group()
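
The DataLoader(*args, ddp_world_size, ddp_rank) line above is pseudocode; what matters is that each rank reads a disjoint slice of the data. Below is a minimal sketch of such a rank-aware loader, loosely following the DataLoaderLite idea from the video: every process starts at its own offset and advances by B * T * num_processes each batch. The class name and the tokens argument are illustrative assumptions, not the exact code from the repo.

  class ShardedLoaderSketch:
    # Hypothetical loader: each rank reads a disjoint, strided slice of one token tensor.
    def __init__(self, tokens, B, T, process_rank, num_processes):
      self.tokens = tokens # 1-D LongTensor of token ids
      self.B, self.T = B, T
      self.process_rank = process_rank
      self.num_processes = num_processes
      self.current_position = B * T * process_rank # each rank starts at its own offset

    def next_batch(self):
      B, T = self.B, self.T
      buf = self.tokens[self.current_position : self.current_position + B * T + 1]
      x = buf[:-1].view(B, T) # inputs
      y = buf[1:].view(B, T)  # targets: inputs shifted by one token
      # advance by the global batch so the ranks keep reading disjoint chunks
      self.current_position += B * T * self.num_processes
      # wrap around when the next read would run past the end of the data
      if self.current_position + B * T * self.num_processes + 1 > len(self.tokens):
        self.current_position = B * T * self.process_rank
      return x, y

Each rank then builds its loader as ShardedLoaderSketch(tokens, B=16, T=1024, process_rank=ddp_rank, num_processes=ddp_world_size), so no two processes train on the same tokens within a pass.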
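
On the gradient-sync comment inside the training loop: the officially documented route is DDP's no_sync() context manager, which suppresses the gradient all-reduce for backward passes run inside it. A sketch of the same accumulation loop written that way, reusing the names from the sample above and assuming a CUDA device (nullcontext keeps a single code path when gradients should sync or when DDP is off):

  from contextlib import nullcontext

  for micro_step in range(grad_accum_steps):
    x, y = train_loader.next_batch()
    x, y = x.to(device), y.to(device)
    last_micro_step = (micro_step == grad_accum_steps - 1)
    # suppress the all-reduce on every micro step except the last one
    sync_ctx = model.no_sync() if (ddp and not last_micro_step) else nullcontext()
    with sync_ctx:
      with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits, loss = model(x, y)
      loss = loss / grad_accum_steps
      loss.backward()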
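
One loose end in the sample: t0 = time.time() is recorded at the top of each step but never read back. The companion logging usually sits at the end of the step, right after optimizer.step(); a sketch under that assumption (the exact print format is illustrative — the video reports loss, grad norm, and tokens per second from the master process only):

  if torch.cuda.is_available():
    torch.cuda.synchronize() # wait for queued GPU work to finish before reading the clock
  t1 = time.time()
  dt = t1 - t0
  tokens_processed = B * T * grad_accum_steps * ddp_world_size
  if master_process:
    print(f"step {step} | loss {loss_accum.item():.6f} | norm {norm:.4f} | {tokens_processed / dt:.0f} tok/s")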



