[Karpathy's build-nanogpt] Takeaway Notes



Bilibili translation: LINK
Personal Note

Andrej rebuilds GPT-2 in PyTorch.
Takeaway Points


  • Before entering serious training, he uses Shakespeare's text as a small debugging dataset to check that the model can overfit it. Overfitting here is the expected, desired behavior.
  • If we use TF32 or BF16 (PyTorch defaults to FP32, which also costs more memory and bandwidth), the GPU can run matrix multiplications many times faster. Because the tensor cores compute so quickly, most of the time (often more than 40%) is spent waiting on memory allocation/transfer rather than on compute. Under the hood, large matmuls are decomposed into small tile operations (e.g. 4x4 matrix multiplications) on the tensor cores.
  • When timing GPU code, remember to call torch.cuda.synchronize() before reading the clock, since CUDA kernels launch asynchronously (see the timing/precision sketch after this list).
  • Watch GPU utilization with: watch -n 0.1 nvidia-smi
  • torch.set_float32_matmul_precision('high') is an easy way to activate TF32 mode:

    • default is 'highest' -> FP32
    • 'high' -> TF32 if available (depends on the GPU)

  • Simply using torch.set_float32_matmul_precision('high') should in theory give about an 8x speedup, but in practice we only get about 3x, because we are still memory-bound: moving data around still costs a lot.
  • BF16 autocast can only be used on Ampere (or newer) GPUs:
    wrap the forward pass with torch.autocast(device_type=device, dtype=torch.bfloat16). Inside this context some CUDA ops are autocast to BF16 while many others stay in FP32; matrix multiplications run in BF16 (also covered in the first sketch after this list).
  • One debug technique: import code; code.interact(local=locals())
  • torch.compile! model = torch.compile(model)
  • Flash Attention (and FlashAttention-2) uses an online softmax so the full attention matrix never has to be materialized in HBM.
    Use F.scaled_dot_product_attention(q, k, v, is_causal=True) instead of the manual attention implementation (see the attention sketch after this list).
  • Look for ugly numbers and round them up to nice numbers with many powers of 2 (e.g. padding the vocab size from 50257 to 50304). Even though the FLOPs increase, the wall-clock time decreases.
  • ALL OF THE ABOVE CHANGES TOGETHER MAKE TRAINING ROUGHLY 10x FASTER!!
  • Linear warmup + cosine decay learning-rate schedule with a minimum learning rate; see the GPT-3 paper for details (a schedule sketch follows after this list).
  • In the first stage of training, the model is not yet telling individual tokens apart; it is mostly learning which tokens can appear at all and driving the probabilities of tokens that never occur toward zero. That is why a small batch size is fine early in training: the gradients would not look very different if you used the full batch size.
  • Split parameters into those that should be weight-decayed and those that should not. Weight decay: all weight matrices and embeddings (p.dim() >= 2); no weight decay: all biases and LayerNorm parameters (p.dim() < 2). See the optimizer sketch after this list.
  • AdamW's fused option (fused=True) accelerates the optimizer step on CUDA.
  • As model size goes up, the learning rate goes down and the batch size goes up.
  • Gradient accumulation: remember to normalize the loss, loss /= grad_accum_steps, before each backward().
  • During evaluation/sampling, create a torch.Generator object and pass it to torch.multinomial(..., generator=...), so that the generation process does not disturb the global random number generator used for training (see the sampling sketch after this list).
  • However, torch.compile has to be disabled if you want to sample from the model during the training run; the generation code path does not play well with the compiled model.
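
A minimal sketch that ties the timing and mixed-precision points together (not Andrej's exact code; it assumes a CUDA GPU and a hypothetical model whose forward returns (logits, loss), as in the DDP sample further down):

import time
import torch

torch.set_float32_matmul_precision("high")  # "highest" = FP32 matmuls, "high" = TF32 where the GPU supports it

def timed_step(model, x, y, optimizer):
    # one illustrative training step, timed correctly
    t0 = time.time()
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits, loss = model(x, y)  # matmuls run in BF16, many other ops stay in FP32
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()  # wait for all queued kernels; without this the timing is meaningless
    return time.time() - t0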
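
The Flash Attention point boils down to replacing the hand-written attention (matmul, mask, softmax, matmul) with the fused kernel. A sketch of a causal self-attention forward, where c_attn, c_proj, and n_head are hypothetical module/parameter names:

import torch.nn.functional as F

def causal_self_attention(x, c_attn, c_proj, n_head):
    # x: (B, T, C); c_attn is the fused qkv projection, c_proj the output projection
    B, T, C = x.size()
    q, k, v = c_attn(x).split(C, dim=2)
    q = q.view(B, T, n_head, C // n_head).transpose(1, 2)  # (B, n_head, T, head_size)
    k = k.view(B, T, n_head, C // n_head).transpose(1, 2)
    v = v.view(B, T, n_head, C // n_head).transpose(1, 2)
    # fused kernel with online softmax; the (T, T) attention matrix is never materialized
    y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    y = y.transpose(1, 2).contiguous().view(B, T, C)
    return c_proj(y)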
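
A sketch of the linear-warmup + cosine-decay schedule with a floor; max_lr, min_lr, warmup_steps and max_steps are placeholder values, not the exact ones from the video:

import math

max_lr = 6e-4
min_lr = max_lr * 0.1
warmup_steps = 10
max_steps = 50

def get_lr(step):
    # 1) linear warmup
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # 2) after max_steps, hold at the minimum learning rate
    if step > max_steps:
        return min_lr
    # 3) cosine decay from max_lr down to min_lr in between
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)

Inside the training loop you would call lr = get_lr(step) and write it into every param_group of the optimizer.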
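
A sketch of the weight-decay split plus the fused AdamW; weight_decay, lr and the betas are placeholder values:

import inspect
import torch

def configure_optimizer(model, weight_decay=0.1, lr=6e-4, device="cuda"):
    params = [p for p in model.parameters() if p.requires_grad]
    decay = [p for p in params if p.dim() >= 2]     # weight matrices + embeddings
    no_decay = [p for p in params if p.dim() < 2]   # biases, LayerNorm parameters
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    # use the fused CUDA kernel only if this PyTorch build exposes it
    fused_available = "fused" in inspect.signature(torch.optim.AdamW).parameters
    use_fused = fused_available and device.startswith("cuda")
    return torch.optim.AdamW(groups, lr=lr, betas=(0.9, 0.95), eps=1e-8, fused=use_fused)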
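
A sketch of sampling with a dedicated torch.Generator so generation never touches the global RNG state used by training. It assumes the model's forward returns (logits, loss) even when no targets are passed; the top-k value of 50 and the seed are arbitrary:

import torch
import torch.nn.functional as F

@torch.no_grad()
def sample(model, tokens, max_new_tokens, device, seed=42):
    # a separate RNG: sampling does not disturb torch's global generator
    rng = torch.Generator(device=device)
    rng.manual_seed(seed)
    x = tokens.to(device)  # (B, T) prompt tokens
    for _ in range(max_new_tokens):
        logits, _ = model(x)                         # (B, T, vocab_size)
        probs = F.softmax(logits[:, -1, :], dim=-1)  # distribution for the last position
        topk_probs, topk_idx = torch.topk(probs, 50, dim=-1)
        ix = torch.multinomial(topk_probs, 1, generator=rng)
        next_tok = torch.gather(topk_idx, -1, ix)
        x = torch.cat((x, next_tok), dim=1)
    return x
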
SAMPLE CODE FOR DDP

# torchrun --standalone --nproc_per_node=<num_gpu_per_node> <your_training_script.py> <script_arguments>
# The launch command above applies to single-node training only.
import os
import time
import torch
from torch.distributed import init_process_group, destroy_process_group
from torch.nn.parallel import DistributedDataParallel as DDP

# SETTINGS FOR EACH DIFFERENT RANK
ddp = int(os.environ.get('RANK', -1)) != -1
if ddp:
    assert torch.cuda.is_available()
    init_process_group(backend='nccl')
    ddp_rank = int(os.environ['RANK'])              # global rank, unique for each process
    ddp_local_rank = int(os.environ['LOCAL_RANK'])  # local rank within this machine (node)
    ddp_world_size = int(os.environ['WORLD_SIZE'])  # how many GPUs (processes) in total
    device = f'cuda:{ddp_local_rank}'
    torch.cuda.set_device(device)
    master_process = ddp_rank == 0
else:
    ddp_rank = 0
    ddp_local_rank = 0
    ddp_world_size = 1
    master_process = True
    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda"
    elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        device = "mps"
    print(f"using device: {device}")
device_type = "cuda" if device.startswith("cuda") else "cpu"  # autocast wants "cuda"/"cpu", not "cuda:0"

# IF YOU USE GRAD ACCUMULATION
total_batch_size = 524288  # batch size measured in tokens
B = 16                     # micro batch size for each process
T = 1024                   # sequence length
assert total_batch_size % (B * T * ddp_world_size) == 0
grad_accum_steps = total_batch_size // (B * T * ddp_world_size)

# SET DATALOADER
dataloader = DataLoader(*args, ddp_world_size, ddp_rank)  # MUST make each process read a different part of the dataset

# CREATE MODEL
model = createmodel()  # placeholder for model construction
model.to(device)
model = torch.compile(model)
if ddp:
    model = DDP(model, device_ids=[ddp_local_rank])  # this must be ddp_local_rank, not ddp_rank
raw_model = model.module if ddp else model  # unwrap DDP to reach the underlying model

# FIX SEED
seed = 1337  # any fixed integer ("your lucky number")
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)

# TRAIN (assumes `optimizer` and `max_steps` are defined, e.g. as in the sketches above)
for step in range(max_steps):
    t0 = time.time()

    model.train()
    optimizer.zero_grad()
    loss_accum = 0.0
    for micro_step in range(grad_accum_steps):
        x, y = dataloader.next_batch()
        x, y = x.to(device), y.to(device)
        with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
            logits, loss = model(x, y)
        loss = loss / grad_accum_steps  # grad accumulation: normalize the loss
        loss_accum += loss.detach()
        if ddp:
            # Syncing gradients on every micro step wastes time; only the last backward()
            # of an accumulation cycle needs to sync. See the ddp.no_sync() context manager
            # for the officially recommended way, or set the flag directly as done here.
            model.require_backward_grad_sync = (micro_step == grad_accum_steps - 1)
        loss.backward()
    if ddp:
        torch.distributed.all_reduce(loss_accum, op=torch.distributed.ReduceOp.AVG)
    norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()

    if step % 100 == 0:
        # start evaluation
        model.eval()
        with torch.no_grad():
            # SOME EVALUATION CODE
            pass

if ddp:
    destroy_process_group()
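
To launch, use the torchrun command from the top of the sample, e.g. torchrun --standalone --nproc_per_node=8 <your_training_script.py> for a single node with 8 GPUs; running the script directly with plain python falls back to the non-DDP branch.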