
Title: Large Model Series: LLAMA-O1 Replication Code Walkthrough

Author: 河曲智叟    Time: 2025-3-12 01:26
Title: Large Model Series: LLAMA-O1 Replication Code Walkthrough
1. The Pre-trained Model

The base model used is qq8933/OpenLongCoT-Base-Gemma2-2B, described as follows:
This model is a fine-tuned version of google/gemma-2-2b-it on the OpenLongCoT dataset.
This model can read and output o1-like LongCoT which targeting work with LLaMA-O1 runtime frameworks.
gemma-2-2b-it is described as follows:
Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. They are text-to-text, decoder-only large language models, available in English, with open weights for both pre-trained variants and instruction-tuned variants. Gemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop or your own cloud infrastructure, democratizing access to state of the art AI models and helping foster innovation for everyone.

The training parameters are as follows:

Looking at the qq8933/OpenLongCoT-Pretrain dataset, which contains 126K samples, a single record looks like this:
<start_of_father_id>-1<end_of_father_id><start_of_local_id>0<end_of_local_id><start_of_thought><problem>The average speed for an hour drive is 66 miles per hour. If Felix wanted to drive twice as fast for 4 hours, how many miles will he cover? <end_of_thought> <start_of_father_id>0<end_of_father_id><start_of_local_id>1<end_of_local_id><start_of_thought>Since Felix wants to drive twice as fast, he will drive at 2*66=<<2*66=132>>132 miles per hour. <end_of_thought><start_of_rating><positive_rating><end_of_rating> <start_of_father_id>1<end_of_father_id><start_of_local_id>2<end_of_local_id><start_of_thought> If he drives for 4 hours, he will have driven for 4*132=<<4*132=528>>528 miles. <end_of_thought><start_of_rating><positive_rating><end_of_rating> <start_of_father_id>1<end_of_father_id><start_of_local_id>3<end_of_local_id><start_of_thought><critic> Felix wants to drive twice as fast as his original speed of 66 miles per hour. Multiplying 66 by 2 gives 132 miles per hour. This calculation is correct.<end_of_thought><start_of_rating><unknown_rating><end_of_rating> <start_of_father_id>1<end_of_father_id><start_of_local_id>4<end_of_local_id><start_of_thought><critic> If Felix drives at 132 miles per hour for 4 hours, the total distance he covers can be calculated by multiplying his speed by the time. 132 miles per hour * 4 hours = 528 miles. This calculation is correct.<end_of_thought><start_of_rating><unknown_rating><end_of_rating>
Judging from the data, this looks like a straightforward incremental pre-training fine-tune on this tag format.
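To make the tag format concrete, here is a small parsing sketch (not part of the repository; NODE_PATTERN and parse_longcot_sample are hypothetical names) that splits one sample into per-node records:

import re

# Hypothetical helper, not from the repo: split one OpenLongCoT sample into node records.
NODE_PATTERN = re.compile(
    r"<start_of_father_id>(-?\d+)<end_of_father_id>"
    r"<start_of_local_id>(\d+)<end_of_local_id>"
    r"<start_of_thought>(.*?)<end_of_thought>"
    r"(?:\s*<start_of_rating>(.*?)<end_of_rating>)?",
    re.DOTALL,
)

def parse_longcot_sample(sample: str):
    """Return a list of dicts, one per node of the reasoning tree."""
    nodes = []
    for father_id, local_id, thought, rating in NODE_PATTERN.findall(sample):
        nodes.append({
            "father_id": int(father_id),
            "local_id": int(local_id),
            "thought": thought.strip(),
            "rating": rating or "<unknown_rating>",
        })
    return nodes

Each record carries its parent pointer, so the flat string encodes a tree of thoughts, critiques and refinements with optional ratings.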
2. Analysis of the Main Functions

Different prompt templates are defined:

hint = '<hint> Try generate a reasonable rationale solution that can got final answer {GT}</hint>'
# hint = ''
hint_for_critics = f"<hint> Point out the potential flaws in the current solution. </hint>"
hint_for_refine = f"<hint> Try to refine the current solution for higher quality. </hint>"
hint_for_conclusion = "<hint> Try to summarize the current solution and draw a conclusion. Final answer should bracket in \\box{answer} </hint>"
hint_for_divide_and_conquer = f"<hint> Try divide the problem into smaller easier sub-problems and solve them divide-and-conquer. </hint>"
Analysis of compute_policy_head

# Main policy-generation function
@torch.no_grad()
def compute_policy_head(model, tokenizer, selected_node, num_candidates=3, meta="", envoirment=None):
    local_id = get_max_node_id_in_tree(selected_node) + 1
    hint_text = {
        "<conclusion>": hint_for_critics,
        "<problem>": hint_for_divide_and_conquer,
        "<critic>": hint_for_critics,
        "<refine>": hint_for_refine,
    }.get(meta, hint.format(GT=envoirment.get_ground_truth(selected_node)))
    inputs_string = policy_head_template(selected_node, local_id, meta, hint_text)
    with set_left_truncate(tokenizer):
        inputs = tokenizer(
            inputs_string,
            return_tensors="pt",
            truncation=True,
            padding='longest',
            max_length=CUT_OFF_LEN
        )
    inputs = {k: v.to(accelerator.device) for k, v in inputs.items()}
    outputs = accelerator.unwrap_model(model).generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_new_tokens=GENERATE_MAX_NEW_TOKENS,
        do_sample=True,
        num_return_sequences=num_candidates,
        return_dict_in_generate=True,
        output_scores=True,
        temperature=1.5,
        output_logits=True,
        stop_strings=policy_head_stopping_criteria,
        tokenizer=tokenizer,
    )
    generated_sequences = outputs.sequences[:, inputs['input_ids'].size(1):]
    generated_sequences_mask = generated_sequences != tokenizer.pad_token_id
    generated_texts = tokenizer.batch_decode(generated_sequences, skip_special_tokens=True)
    logits = torch.stack(outputs.logits, dim=1)
    normalized_log_probs, normalized_entropy, varentropy = length_normed_log_probs(
        generated_sequences, logits, attention_mask=generated_sequences_mask, return_entropy=True, return_varentropy=True
    )
    normalized_probs = torch.exp(normalized_log_probs)
    generated_texts = [meta + clean_generated_text(text) for text in generated_texts]
    for i, generated_text in enumerate(generated_texts):
        if not generated_text.startswith(meta):
            generated_texts[i] = meta + generated_text
    return generated_texts, normalized_probs.tolist(), normalized_entropy.tolist(), varentropy.tolist(), [meta,] * num_candidates
def policy_head_template(selected_node, local_id, meta="", hint=""):
    return (
        path_to_string(selected_node)
        + f"{hint}\n<start_of_father_id>{selected_node.index if selected_node else -1}<end_of_father_id><start_of_local_id>{local_id}<end_of_local_id><start_of_thought>{meta}"
    )

def path_to_string(node):
    path = []
    while node:
        path.append(node)
        node = node.parent
    string = "\n".join(
        [
            f"<start_of_father_id>{node.parent.index if node.parent else -1}<end_of_father_id><start_of_local_id>{node.index}<end_of_local_id><start_of_thought>{node.state}<end_of_thought><start_of_rating>{value_to_rating_token(node.value)}<end_of_rating>"
            for node in path[::-1]
        ]
    )
    return string
The logic of this function: depending on the meta tag type, it performs a different generation task; if meta is not one of the listed tags, it falls back to the default hint, which embeds the ground-truth answer. It returns several candidate continuations, each prefixed with the meta tag, along with their length-normalized probabilities, entropies and varentropies.
Note that when the prompt is constructed, the path context above selected_node is included. Each node on the path is serialized in this format: <start_of_thought>{node.state}<end_of_thought><start_of_rating>{value_to_rating_token(node.value)}<end_of_rating>
This again confirms that the base model here is the incrementally pre-trained model from Section 1, since it is prompted directly in that tag format.
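To see what such a prompt looks like, here is a toy illustration. ToyNode and the threshold inside value_to_rating_token are stand-ins invented for this snippet (the repo's real TreeNode and rating helper are more elaborate); only policy_head_template, path_to_string and hint_for_critics come from the code above:

# Toy illustration only: minimal stand-ins so the prompt builders above can run.
class ToyNode:
    def __init__(self, index, state, value, parent=None):
        self.index, self.state, self.value, self.parent = index, state, value, parent

def value_to_rating_token(value):
    # stand-in with an arbitrary threshold; the repo's helper maps node.value to a rating token
    return "<positive_rating>" if value > 0 else "<negative_rating>"

root = ToyNode(0, "<problem>How many miles in 4 hours at 132 mph?", 0.9)
child = ToyNode(1, "132 * 4 = 528 miles.", 0.8, parent=root)

prompt = policy_head_template(child, local_id=2, meta="<critic>", hint=hint_for_critics)
# `prompt` is the serialized root-to-node path (one tagged line per node), directly followed
# by the hint and then the header of the new child node, ending in "<start_of_thought><critic>",
# which the model is expected to continue.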
compute_value_head

# Value-head scoring function
@torch.no_grad()
def compute_value_head(model, tokenizer, node):
    text_for_value = value_head_template(node) + '<positive_rating>'
    with set_left_truncate(tokenizer):
        inputs = tokenizer(text_for_value, return_tensors="pt", truncation=True, padding='longest', max_length=CUT_OFF_LEN)
    inputs = {k: v.to(accelerator.device) for k, v in inputs.items()}
    outputs = model(**inputs, return_dict=True)
    logits = outputs.logits
    last_logits = logits[:, -2, :]
    positive_token_id = tokenizer.convert_tokens_to_ids("<positive_rating>")
    negative_token_id = tokenizer.convert_tokens_to_ids("<negative_rating>")
    positive_logit = last_logits[:, positive_token_id]
    negative_logit = last_logits[:, negative_token_id]
    value_logits = torch.stack([positive_logit, negative_logit], dim=1)
    probs, log_probs = robust_softmax(value_logits)
    return log_probs[:, 0].item()
This is a nice design: append <positive_rating> to the prompt, then take the output logits at positive_token_id and negative_token_id, and finally apply a softmax over just these two logits to get the positive/negative probabilities. This yields a value estimate for the node.
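robust_softmax is not shown in the excerpt. A minimal sketch of what it plausibly does, i.e. a numerically stable softmax that returns both probabilities and log-probabilities (an assumption; the repo's version may differ):

import torch
import torch.nn.functional as F

def robust_softmax(logits: torch.Tensor):
    # log_softmax uses the log-sum-exp trick internally, so this stays stable for large logits
    log_probs = F.log_softmax(logits, dim=-1)
    return log_probs.exp(), log_probs

With value_logits of shape [1, 2], log_probs[:, 0] is then the log-probability of the <positive_rating> token, which compute_value_head returns as the node's value.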

# Meta-policy generation function
@torch.no_grad()
def meta_compute_policy_head(model, tokenizer, selected_node, num_candidates=3, meta_ratio=0.5, envoirment=None):
    metas = sampling_meta_action(selected_node, num_candidates)
    generated_texts, policy_probs, normalized_entropys, varentropys = [], [], [], []
    for meta in metas:
        # generate one candidate per sampled meta action
        # (unpack into fresh names so the accumulator lists above are not shadowed)
        texts, probs, normalized_entropy, varentropy, _ = compute_policy_head(
            model, tokenizer, selected_node, num_candidates=1, meta=meta, envoirment=envoirment
        )
        generated_texts.append(texts[0])
        policy_probs.append(probs[0])
        normalized_entropys.append(normalized_entropy[0])
        varentropys.append(varentropy[0])
    return generated_texts, policy_probs, normalized_entropys, varentropys, metas
Main idea: decide which meta action to take next, what its probability is, and what content gets generated for it. The candidate meta actions are obtained by sampling (sampling_meta_action); how that works is examined below.
Analysis of cal_meta_transition_probs

def cal_meta_transition_probs(node):
    num_meta_actions = len(meta_action_types)
    # Flatten the tree: get parent meta indices, child meta indices, and the corresponding values
    parents, children, values = flatten_tree(node)
    # Initialize the transition matrix
    TransitionProbs = np.zeros((num_meta_actions, num_meta_actions))
    # Use NumPy fancy indexing with accumulation to update the matrix
    if len(parents) > 0:
        np.add.at(TransitionProbs, (parents, children), values)
    return TransitionProbs
>>> TransitionProbs = np.zeros((5,5))
>>> np.add.at(TransitionProbs, ([0,0,0,0,1],[1,2,3,4,3]), [0.1, 0.2, 0.1, 0.3,0.01])
>>> TransitionProbs
array([[0.  , 0.1 , 0.2 , 0.1 , 0.3 ],
       [0.  , 0.  , 0.  , 0.01, 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.  ]])
>>> TransitionProbs = np.zeros((5,5))
>>> np.add.at(TransitionProbs, ([0,0,0,0,1,1],[1,2,3,4,3,4]), [0.1, 0.2, 0.1, 0.3,0.01,0.02])
>>> TransitionProbs
array([[0.  , 0.1 , 0.2 , 0.1 , 0.3 ],
       [0.  , 0.  , 0.  , 0.01, 0.02],
       [0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.  ]])
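flatten_tree itself is not shown. Judging from the call site, it presumably walks the tree and returns, for every parent-child edge, the parent's meta-action index, the child's meta-action index, and the child's value. A rough sketch under that assumption (meta_action_type_to_index and the node fields are taken from the surrounding module):

import numpy as np

def flatten_tree(node):
    # Sketch (assumption): collect (parent_meta_idx, child_meta_idx, child_value) for every edge.
    parents, children, values = [], [], []
    stack = [node]
    while stack:
        current = stack.pop()
        for child in current.children:
            parents.append(meta_action_type_to_index[current.meta])
            children.append(meta_action_type_to_index[child.meta])
            values.append(child.value)
            stack.append(child)
    return np.array(parents), np.array(children), np.array(values)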
Analysis of sampling_meta_action

@lru_cache()
def sampling_meta_action(node, num=1, TransitionProbs=None):
    if TransitionProbs is None:
        root = get_root(node)
        TransitionProbs = cal_meta_transition_probs(root)
    # Row-wise softmax of the transition matrix
    transition_probs_softmax = np_softmax(TransitionProbs)
    i = meta_action_type_to_index[node.meta]
    p = transition_probs_softmax[i]
    # Sample the next meta actions
    meta_actions = np.random.choice(meta_action_types, size=num, p=p)
    return meta_actions
(1) Feeding the TransitionProbs matrix from the previous test into np_softmax shows that the probabilities are computed row by row.
(2) Take node.meta, the meta type of the current node, select the corresponding row to get the distribution over the next meta action, and sample num actions from that distribution.
>>> def np_softmax(x):
...     # apply softmax to each row of the matrix
...     max_vals = np.max(x, axis=1, keepdims=True)
...     e_x = np.exp(x - max_vals)
...     sum_e_x = np.sum(e_x, axis=1, keepdims=True)
...     return e_x / sum_e_x
...
>>> np_softmax(TransitionProbs)
array([[0.1729624 , 0.19115301, 0.21125675, 0.19115301, 0.23347482],
       [0.19879722, 0.19879722, 0.19879722, 0.20079516, 0.20281319],
       [0.2       , 0.2       , 0.2       , 0.2       , 0.2       ],
       [0.2       , 0.2       , 0.2       , 0.2       , 0.2       ],
       [0.2       , 0.2       , 0.2       , 0.2       , 0.2       ]])
 
Analysis of TreeNode

(1) get_path_reward: the average of value along the path to the current node, i.e. the sum of values divided by the path length.
(2) get_child_policy_prob: the probability of a given child; here self.policy is a key-value dict whose values are logits.
(3) get_child_policy_entropy: same pattern, but over the stored entropies.
(4) get_child_policy_varentropy: same pattern, but over the stored varentropies (a simplified sketch of these four methods follows this list).
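The TreeNode class itself is not reproduced in this post; the following simplified sketch shows one plausible shape of the four methods above. The field names policy, policy_entropy and policy_varentropy, and the softmax normalization over children, are assumptions:

import numpy as np

class TreeNodeSketch:
    def __init__(self, value=0.0, parent=None):
        self.value = value
        self.parent = parent
        self.children = []
        self.policy = {}             # child -> policy logit
        self.policy_entropy = {}     # child -> entropy of the child's generation
        self.policy_varentropy = {}  # child -> varentropy of the child's generation

    def get_path_reward(self):
        # average of node.value along the path back to the root
        total, length, node = 0.0, 0, self
        while node is not None:
            total += node.value
            length += 1
            node = node.parent
        return total / length

    def _softmax_over_children(self, scores, child):
        vals = np.array(list(scores.values()), dtype=np.float64)
        exp = np.exp(vals - vals.max())
        probs = exp / exp.sum()
        return probs[list(scores.keys()).index(child)]

    def get_child_policy_prob(self, child):
        return self._softmax_over_children(self.policy, child)

    def get_child_policy_entropy(self, child):
        return self._softmax_over_children(self.policy_entropy, child)

    def get_child_policy_varentropy(self, child):
        return self._softmax_over_children(self.policy_varentropy, child)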
Analysis of MCTS

(1) search(self, root_node). Run the simulation num_simulations times, then walk over the identified leaves and rectify the values along each leaf's path: successful leaves with 0, failed leaves with np.log(self.reward_epsilon).
    def search(self, root_node):
        if not root_node.children:
            root_node.value = 0
        for _ in tqdm(range(self.num_simulations)):
            self.simulate(root_node)
            max_reward, path_len = find_max_reward_path(root_node)
            print(f'find max reward path: {max_reward} with {path_len} steps.')
            if self.patient <= 0:
                break
        for leaf in self.identify_leaf(root_node):
            if leaf.leaf_type == "successful":
                self.rectify_values_from_leaf(leaf, 0)
            else:
                self.rectify_values_from_leaf(leaf, np.log(self.reward_epsilon))
        return root_node

def find_max_reward_path(node):
    path = 0
    reward = 0
    while node:
        reward += node.value
        path += 1
        if not node.children:
            break
        node = max(node.children, key=lambda x: x.value)
    return math.exp(reward), path

(2) simulate(self, node). If node is a leaf (or should be expanded), expand it; otherwise recursively simulate its best child. The returned value is discounted and folded into node.value as a running average over visits (the quick check after the code illustrates this).
    def simulate(self, node):
        if node.is_leaf() or node.should_expand():
            value = self.expand_node(node) * self.discount_factor
        else:
            best_child = self.select_action(node)
            value = self.simulate(best_child) * self.discount_factor
        node.visits += 1
        node.value += (value - node.value) / node.visits
        return node.value
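The update node.value += (value - node.value) / node.visits is simply an incremental (running) mean of all values ever backed up through the node; a quick check:

# The incremental update reproduces the arithmetic mean of the backed-up values.
values = [0.4, 0.7, 0.1, 0.9]
node_value, visits = 0.0, 0
for v in values:
    visits += 1
    node_value += (v - node_value) / visits
print(node_value, sum(values) / len(values))  # both print 0.525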
(3) expand_node(self, node). It relies on the two helpers shown below: length_normed_log_probs for scoring the generated candidates, and compute_value_head (repeated from above) for estimating each new child's value.

normalized_log_probs, normalized_entropy, varentropy = length_normed_log_probs(
    generated_sequences, logits, attention_mask=generated_sequences_mask, return_entropy=True, return_varentropy=True
)
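length_normed_log_probs is not shown in the excerpt. A sketch under the assumption that it averages per-token log-probabilities (and, when requested, token-level entropy and varentropy) over the non-padding generated tokens:

import torch
import torch.nn.functional as F

def length_normed_log_probs(sequences, logits, attention_mask=None,
                            return_entropy=False, return_varentropy=False):
    # Sketch (assumption): length-normalized sequence log-prob, entropy and varentropy.
    log_probs = F.log_softmax(logits, dim=-1)                                     # [B, T, V]
    token_log_probs = log_probs.gather(2, sequences.unsqueeze(-1)).squeeze(-1)    # [B, T]
    probs = log_probs.exp()
    token_entropy = -(probs * log_probs).sum(-1)                                  # [B, T]
    token_varentropy = (probs * (log_probs + token_entropy.unsqueeze(-1)) ** 2).sum(-1)
    if attention_mask is None:
        attention_mask = torch.ones_like(sequences)
    mask = attention_mask.float()
    lengths = mask.sum(-1).clamp(min=1)
    normed_log_prob = (token_log_probs * mask).sum(-1) / lengths
    normed_entropy = (token_entropy * mask).sum(-1) / lengths
    normed_varentropy = (token_varentropy * mask).sum(-1) / lengths
    out = (normed_log_prob,)
    if return_entropy:
        out += (normed_entropy,)
    if return_varentropy:
        out += (normed_varentropy,)
    return out if len(out) > 1 else out[0]

The value estimate for each newly expanded child then comes from compute_value_head, already shown above and repeated here for reference: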

# Value-head scoring function
@torch.no_grad()
def compute_value_head(model, tokenizer, node):
    text_for_value = value_head_template(node) + '<positive_rating>'
    with set_left_truncate(tokenizer):
        inputs = tokenizer(text_for_value, return_tensors="pt", truncation=True, padding='longest', max_length=CUT_OFF_LEN)
    inputs = {k: v.to(accelerator.device) for k, v in inputs.items()}
    outputs = model(**inputs, return_dict=True)
    logits = outputs.logits
    last_logits = logits[:, -2, :]
    positive_token_id = tokenizer.convert_tokens_to_ids("<positive_rating>")
    negative_token_id = tokenizer.convert_tokens_to_ids("<negative_rating>")
    positive_logit = last_logits[:, positive_token_id]
    negative_logit = last_logits[:, negative_token_id]
    value_logits = torch.stack([positive_logit, negative_logit], dim=1)
    probs, log_probs = robust_softmax(value_logits)
    return log_probs[:, 0].item()

PrioritizedReplayBuffer

Samples experiences from the buffer according to priority, and also provides persistence methods.
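A minimal sketch of such a buffer, with proportional prioritized sampling, importance-sampling weights and simple pickle persistence (an assumption; the repo's implementation may differ in detail):

import pickle
import numpy as np

class PrioritizedReplayBufferSketch:
    def __init__(self, capacity=10000, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.buffer, self.priorities = [], []

    def add(self, experience, priority):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(experience)
        self.priorities.append(float(priority) + 1e-6)  # avoid zero priority

    def sample(self, batch_size, beta=0.4):
        scaled = np.array(self.priorities) ** self.alpha
        probs = scaled / scaled.sum()
        idx = np.random.choice(len(self.buffer), size=batch_size, p=probs)
        # importance-sampling weights correct for the non-uniform sampling
        weights = (len(self.buffer) * probs[idx]) ** (-beta)
        weights /= weights.max()
        return [self.buffer[i] for i in idx], idx, weights

    def update_priorities(self, idx, new_priorities):
        for i, p in zip(idx, new_priorities):
            self.priorities[i] = float(p) + 1e-6

    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump((self.buffer, self.priorities), f)

    def load(self, path):
        with open(path, "rb") as f:
            self.buffer, self.priorities = pickle.load(f)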
RLSPTrainer

(1) self_play
(2) collect_experience. This has several parts:

    def collect_experience(self, root_node):
        """Traverse the MCTS tree to collect experiences and store them in the replay buffer."""
        # Collect training data from the tree
        for node in traverse_tree(root_node):
            if node == root_node:
                continue
            reward = node.true_value_from_tree if node.true_value_from_tree is not None else node.value
            advantage = compute_gae_from_node(node)
            policy_input = tokenize_policy_predict([node,], self.tokenizer)
            advantage_tensor = torch.tensor([advantage], dtype=torch.float32).unsqueeze(0)
            value_input = tokenize_value_predict(node, self.tokenizer)
            value_target = torch.tensor([reward], dtype=torch.float32).unsqueeze(0)
            # Store the experience with initial priority
            experience = {
                'advantage': advantage_tensor,
                'value_target': value_target,
                **policy_input,
                **value_input,
            }
            # Use absolute advantage as initial priority
            priority = abs(advantage_tensor.item())
            self.replay_buffer.add(experience, priority)
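compute_gae_from_node is not shown. A sketch assuming a standard Generalized Advantage Estimation pass along the chain of best children below the node; gamma and lam are hypothetical hyperparameters, and the reward definition mirrors the one used in collect_experience above:

def compute_gae_from_node(node, gamma=0.99, lam=0.95):
    # Sketch (assumption): GAE accumulated along the chain below `node`,
    # following the highest-value child at each step.
    chain = [node]
    while chain[-1].children:
        chain.append(max(chain[-1].children, key=lambda c: c.value))
    advantage = 0.0
    # walk the chain backwards, accumulating discounted TD residuals
    for t in reversed(range(len(chain) - 1)):
        cur, nxt = chain[t], chain[t + 1]
        reward = nxt.true_value_from_tree if nxt.true_value_from_tree is not None else nxt.value
        delta = reward + gamma * nxt.value - cur.value
        advantage = delta + gamma * lam * advantage
    return advantage

tokenize_policy_predict, which builds the policy inputs stored in each experience, is shown next: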
 
def tokenize_policy_predict(nodes, tokenizer):
    with set_left_truncate(tokenizer):
        text_for_policys = [policy_head_template(node.parent, node.index) + node.state for node in nodes]
        targets = [node.state for node in nodes]
        # with set_left_padding(tokenizer):
        inputs = tokenizer(text_for_policys, return_tensors="pt", truncation=True, padding='longest', max_length=CUT_OFF_LEN)
        target = tokenizer(targets, return_tensors="pt", truncation=True, padding='longest', max_length=CUT_OFF_LEN)
    ret = {'input_ids': inputs['input_ids'], 'attention_mask': inputs['attention_mask'], 'target': target['input_ids'], 'target_attention_mask': target['attention_mask']}
    return ret
(3) compute_loss(self, model, inputs, return_outputs=False)
    def compute_loss(self, model, inputs, return_outputs=False):
        """Compute the loss, incorporating importance-sampling weights."""
        # Compute policy loss using PPO
        new_policy_log_probs = forward_policy_predict(self.model, self.tokenizer, inputs)
        with torch.no_grad():
            old_policy_log_probs = forward_policy_predict(self.model.get_base_model(), self.tokenizer, inputs).detach()
        target_mask = inputs['target_attention_mask']
        advantage = inputs['advantage']
        epsilon = 0.2  # PPO clip parameter
        ratio = (new_policy_log_probs - old_policy_log_probs).exp() * target_mask[:,1:]
        surr1 = ratio * advantage.unsqueeze(-1)
        surr2 = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantage.unsqueeze(-1)
        policy_loss = -torch.min(surr1, surr2).mean()

        # Compute value loss
        value_prediction = forward_value_predict(self.model, self.tokenizer, inputs)
        value_target = inputs['value_target']
        clamp_positive_rating_prob = torch.exp(torch.clamp(
            value_target, math.log(1e-6), 0
        ))
        clamp_negative_rating_prob = 1 - clamp_positive_rating_prob
        target_probs = torch.concat(
            [clamp_positive_rating_prob.unsqueeze(-1), clamp_negative_rating_prob.unsqueeze(-1)], dim=1
        )
        value_loss = F.binary_cross_entropy_with_logits(
            value_prediction, target_probs.to(self.accelerator.device)
        )
        # Combine losses
        total_loss = policy_loss + value_loss
        if total_loss == 0:
            return total_loss
        # Apply importance-sampling weights
        weights = torch.tensor(inputs['weights'], dtype=torch.float32).to(total_loss.device)
        total_loss = total_loss * weights
        td_error = total_loss.sum(dim=-1).detach().abs().cpu().numpy()
        total_loss = total_loss.mean()
        print(f'Policy Loss: {policy_loss}, Value Loss: {value_loss}, Total Loss: {total_loss}')
        if return_outputs:
            return total_loss, td_error
        else:
            return total_loss
(1) new_policy_log_probs: for the given input, this computes the log-probabilities assigned to the target output token ids (gathering them from the output logits). A fairly clever design.
(2) The new and old policy log-probs are combined into the PPO clipped surrogate to get policy_loss: ratio = (new_policy_log_probs - old_policy_log_probs).exp() * target_mask[:,1:], then policy_loss = -torch.min(surr1, surr2).mean() (a toy example of the clipping follows this list).
(3) value_loss is computed with F.binary_cross_entropy_with_logits(value_prediction, target_probs.to(self.accelerator.device)).
(4) total_loss is the sum of the two, total_loss = policy_loss + value_loss, subsequently scaled by the importance-sampling weights.
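As a quick sanity check of the clipped surrogate: the clamp only limits how much the probability ratio can improve the objective, which is what makes PPO updates conservative. A toy example:

import torch

advantage = torch.tensor([1.0, 1.0, -1.0])
ratio = torch.tensor([0.5, 1.5, 1.5])   # new/old policy probability ratios
epsilon = 0.2
surr1 = ratio * advantage
surr2 = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantage
print(-torch.min(surr1, surr2))         # tensor([-0.5000, -1.2000,  1.5000])

forward_policy_predict, used above for both the new and the old policy, is shown next: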
def forward_policy_predict(model, tokenizer, inputs):
    inputs = {k: v.to(accelerator.device) for k, v in inputs.items()}
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]
    target_ids = inputs["target"]
    target_mask = inputs["target_attention_mask"]
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, return_dict=True)
    logits = outputs.logits[:,:-1,:][:, -target_ids[:,1:].shape[-1] :]
    log_probs = F.log_softmax(logits, dim=-1)
    seleted_log_probs = log_probs.gather(2, target_ids[:,1:].unsqueeze(-1)).squeeze(-1)
    return seleted_log_probs
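forward_value_predict is not shown either. Given how compute_value_head reads the rating logits and how compute_loss feeds the result into a two-class BCE, it presumably returns the [<positive_rating>, <negative_rating>] logits per example. A sketch under that assumption; the input key names value_input_ids / value_attention_mask are hypothetical (whatever tokenize_value_predict stores):

def forward_value_predict(model, tokenizer, inputs):
    # Sketch (assumption): mirror compute_value_head, but return the raw two-class logits.
    input_ids = inputs['value_input_ids'].to(accelerator.device)             # hypothetical key
    attention_mask = inputs['value_attention_mask'].to(accelerator.device)   # hypothetical key
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, return_dict=True)
    last_logits = outputs.logits[:, -2, :]
    positive_token_id = tokenizer.convert_tokens_to_ids("<positive_rating>")
    negative_token_id = tokenizer.convert_tokens_to_ids("<negative_rating>")
    return torch.stack(
        [last_logits[:, positive_token_id], last_logits[:, negative_token_id]], dim=1
    )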
(5) train(self, num_iterations, beta_start=0.4, beta_frames=100000, **kwargs)

Notes on the Paper

LLMs face significant challenges in domains that require strategic and logical reasoning.
In addition, we introduce a dynamic pruning strategy combined with an improved Upper Confidence Bound (UCB) formula (Srinivas et al., 2009) to optimize the exploration-exploitation balance for effective decision-making in high-stakes tasks.
This work advances the application of LLMs to complex reasoning challenges. It lays the groundwork for future innovations in integrating AI techniques to improve the accuracy and reliability of decision-making and reasoning in LLM-driven applications.
To better evaluate the effectiveness of our method, we choose LLaMA-3.1-8B-Instruct (Meta, 2024b) as the base model for the SR-MCTS search, without any additional training. We train a Gemma2-2b-Instruct model (Google, 2024) as the PPRM to provide reward signals during the search.
We suspect the reason is that on lower-difficulty benchmarks, mathematical reasoning performance depends mainly on the model's inherent reasoning ability, whereas on more complex benchmarks its performance depends largely on its self-refinement ability.
   
Our method shows remarkable performance not only on mathematical reasoning problems but also on a wide range of science and engineering problems.

 
