GPT, Claude, Llama? How to tell which AI model is best

Beware model-makers marking their own homework

When Meta, the parent company of Facebook, announced its latest open-
source large language model (LLM) on July 23rd, it claimed that the most
powerful version of Llama 3.1 had “state-of-the-art capabilities that rival the
best closed-source models” such as GPT-4o and Claude 3.5 Sonnet. Meta’s
announcement included a table, showing the scores achieved by these and
other models on a series of popular benchmarks with names such as
MMLU, GSM8K and GPQA.
On MMLU, for example, the most powerful version of Llama 3.1 scored 88.6%,
against 88.7% for GPT-4o and 88.3% for Claude 3.5 Sonnet, rival models
made by OpenAI and Anthropic, two AI startups, respectively. Claude 3.5
Sonnet had itself been unveiled on June 20th, again with a table of
impressive benchmark scores. And on July 24th, the day after Llama 3.1’s
debut, Mistral, a French AI startup, announced Mistral Large 2, its latest LLM,
with—you’ve guessed it—yet another table of benchmarks. Where do such
numbers come from, and can they be trusted?
Having accurate, reliable benchmarks for AI models matters, and not just for
the bragging rights of the firms making them. Benchmarks “define and drive
progress”, telling model-makers where they stand and incentivising them to
improve, says Percy Liang of the Institute for Human-Centred Artificial
Intelligence at Stanford University. Benchmarks chart the field’s overall
progress and show how AI systems compare with humans at specific tasks.
They can also help users decide which model to use for a particular job and
identify promising new entrants in the space, says Clémentine Fourrier, a
specialist in evaluating LLMs at Hugging Face, a startup that provides tools for
AI developers.
bragging rights: a temporary advantage; the right to boast; grounds for boasting
   Here “chart” means “to record” or “to trace”. In this context, “chart the field’s overall progress” means recording and tracing the overall progress of the AI field.

  • Scientists chart the course of a storm to predict its path and potential impact.
  • The book charts the rise and fall of ancient civilizations, detailing their achievements and eventual decline.
  In both examples, “chart” means recording and tracing in detail how something develops.
But, says Dr Fourrier, benchmark scores “should be taken with a pinch of
salt”. Model-makers are, in effect, marking their own homework—and then
using the results to hype their products and talk up their company valuations.
Yet all too often, she says, their grandiose claims fail to match real-world
performance, because existing benchmarks, and the ways they are applied,
are flawed in various ways.
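a pinch of: a little; a small amount
talk up: to praise or promote enthusiastically; to talk at length
grandiose: US [ˈɡrændioʊs] exaggerated; showy but insubstantial; unrealistic
are flawed: have defects
   Here “marking their own homework” means grading your own work. In this context it is a metaphor for model-makers evaluating their own AI models and then using the results to promote their products and raise their company valuations.
  “Talk up” means “to praise” or “to exaggerate”. In this context it refers to companies raising their value or reputation by overstating the merits of their products or services.

  • When students are allowed to mark their own homework, the grades they give themselves might not be accurate.
  • The company talked up its new software, claiming it would revolutionize the industry, but users found it to be full of bugs.
  In these examples, “marking their own homework” indicates that self-assessment may be biased, while “talk up” indicates overstating the merits of something to boost its reputation or value.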
One problem with benchmarks such as MMLU (massive multi-task language
understanding) is that they are simply too easy for today’s models. MMLU was
created in 2020 and consists of 15,908 multiple-choice questions, each with
four possible answers, across 57 topics including maths, American history,
science and law. At the time, most language models scored little better than
25% on MMLU, which is what you would get by picking answers at random;
OpenAI’s GPT-3 did best, with a score of 43.9%. But since then, models have
improved, with the best now scoring between 88% and 90%.
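The 25% figure is simply the chance level for a four-option test. As a minimal sketch (the function names and the simulation are illustrative assumptions, not part of MMLU or any evaluation tool), the baseline can be computed and checked by simulation:

```python
import random


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy when picking uniformly at random among the options."""
    return 1.0 / num_options


def simulate_random_guessing(num_questions: int, num_options: int, seed: int = 0) -> float:
    """Answer every question at random and return the fraction answered correctly."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(num_questions):
        answer_key = rng.randrange(num_options)  # hypothetical correct option
        guess = rng.randrange(num_options)       # the random pick
        correct += guess == answer_key
    return correct / num_questions


# 15,908 is the question count quoted in the text; the answer keys are simulated.
print(f"4 options: expected {chance_accuracy(4):.1%}, "
      f"simulated {simulate_random_guessing(15_908, 4):.1%}")
```

A score is only informative to the extent that it clears this baseline, which is why the baseline shifts when a benchmark changes the number of answer options.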
multiple-choice questions: questions answered by choosing from a set of given options

This means it is difficult to draw meaningful distinctions from their scores, a
problem known as “saturation” (see chart). “It’s like grading high-school
students on middle-school tests,” says Dr Fourrier. More difficult
benchmarks have been devised—MMLU-Pro has tougher questions and ten
possible answers rather than four. GPQA is like MMLU at PhD level, on selected
science topics; today’s best models tend to score between 50% and 60% on
it. Another benchmark, MuSR (multi-step soft reasoning), tests reasoning
ability using, for example, murder-mystery scenarios. When a person reads
such a story and works out who the killer is, they are combining an
understanding of motivation with language comprehension and logical
deduction. AI models are not so good at this kind of “soft reasoning” over
multiple steps. So far, few models score better than random on MuSR.
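One way to see why saturated scores are hard to tell apart is to compare the gap between two scores with the sampling noise of the test itself. The sketch below is a rough illustration under stated assumptions: every one of the 15,908 MMLU questions is scored, each question is an independent right-or-wrong trial, and the two models are treated as independent samples. The function names are hypothetical.

```python
import math


def standard_error(accuracy: float, num_questions: int) -> float:
    """Binomial standard error of an accuracy measured on num_questions items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / num_questions)


def gap_in_standard_errors(acc_a: float, acc_b: float, num_questions: int) -> float:
    """Size of the score gap, expressed in standard errors of the difference."""
    se_diff = math.sqrt(
        standard_error(acc_a, num_questions) ** 2
        + standard_error(acc_b, num_questions) ** 2
    )
    return abs(acc_a - acc_b) / se_diff


N = 15_908  # question count quoted in the text (assumed to be the scored set)

# Scores quoted in the text: Llama 3.1 (88.6%) against GPT-4o (88.7%).
print(f"88.6% vs 88.7%: {gap_in_standard_errors(0.886, 0.887, N):.2f} standard errors")

# GPT-3's 43.9% in 2020 against the 25% chance level.
print(f"43.9% vs 25.0%: {gap_in_standard_errors(0.439, 0.25, N):.2f} standard errors")
```

Under these assumptions a gap of 0.1 percentage points is well inside the noise, whereas the 2020 gap between GPT-3 and chance was not.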
MMLU also highlights two other problems. One is that the answers in such tests
are sometimes wrong. A study carried out by Aryo Gema of the University
of Edinburgh and colleagues, published in June, found that, of the questions
they sampled, 57% of MMLU’s virology questions and 26% of its logical-fallacy
ones contained errors. Some had no correct answer; others had more than
one. (The researchers cleaned up the MMLU questions to create a new
benchmark, MMLU-Redux.)
virology: US [vaɪˈrɑlədʒi] the study of viruses
Then there is a deeper issue, known as “contamination”. LLMs are trained
using data from the internet, which may include the exact questions and
answers for MMLU and other benchmarks. Intentionally or not, the models may
be cheating, in short, because they have seen the tests in advance. Indeed,
some model-makers may deliberately train a model with benchmark data to
boost its score. But the score then fails to reflect the model’s true ability.
One way to get around this problem is to create “private” benchmarks for
which the questions are kept secret, or released only in a tightly controlled
manner, to ensure that they are not used for training (GPQA does this). But then
only those with access can independently verify a model’s scores.
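A common heuristic for spotting contamination is to look for long word sequences that appear both in a benchmark question and in the training text. The following is a minimal sketch of that idea, not the method used by any particular lab; the function names, the 8-word window and the sample strings are all made up for illustration.

```python
import re


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of n-word sequences in text, ignoring case and punctuation."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(questions: list[str], corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of questions sharing at least one n-gram with the corpus."""
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = sum(1 for q in questions if ngrams(q, n) & corpus_grams)
    return flagged / len(questions) if questions else 0.0


# Made-up example data: the first question appears verbatim in a scraped page.
questions = [
    "Which planet in the solar system has the largest number of known moons?",
    "What is the derivative of the sine function with respect to its argument?",
]
corpus = [
    "Quiz dump: which planet in the solar system has the largest number of known moons? Answer: B",
    "An unrelated page about gardening and how to grow tomatoes on a small city balcony.",
]
print(f"Flagged as possibly contaminated: {contamination_rate(questions, corpus):.0%}")
```

A check like this only catches verbatim or near-verbatim leaks; paraphrased or translated copies of a question would pass unnoticed.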
contamination: US [kənˌtæmɪˈneɪʃn] pollution; a pollutant; making something dirty
get around: to bypass; to resolve
To complicate matters further, it turns out that small changes in the way
questions are posed to models can significantly affect their scores. In a
multiple-choice test, asking an AI model to state the answer directly, or to
reply with the letter or number corresponding to the correct answer, can
produce different results. That affects reproducibility and comparability.
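Part of this sensitivity comes from how replies are parsed and graded. The sketch below is an illustration under assumptions (the two scoring functions, the sample question and the sample replies are hypothetical and not taken from any real evaluation harness): the same replies receive different accuracies depending on whether the grader expects a letter or the answer text.

```python
import re


def score_by_letter(response: str, correct_index: int) -> bool:
    """Correct if the first standalone letter A-D in the reply names the right option.

    Fragile by design: a stray article "a" in a sentence would also match.
    """
    match = re.search(r"\b([A-D])\b", response.strip().upper())
    return match is not None and "ABCD".index(match.group(1)) == correct_index


def score_by_text(response: str, options: list[str], correct_index: int) -> bool:
    """Correct if the reply contains the correct option's text and no other option's."""
    reply = response.lower()
    hits = [i for i, option in enumerate(options) if option.lower() in reply]
    return hits == [correct_index]


# Hypothetical question: "What is the capital of France?" with option A correct.
options = ["Paris", "Rome", "Madrid", "Berlin"]
correct_index = 0
replies = ["A", "A", "The answer is Paris."]  # made-up model replies

letter_accuracy = sum(score_by_letter(r, correct_index) for r in replies) / len(replies)
text_accuracy = sum(score_by_text(r, options, correct_index) for r in replies) / len(replies)
print(f"graded by letter: {letter_accuracy:.0%}, graded by text: {text_accuracy:.0%}")
```

All three replies are arguably right, yet neither grader gives full marks, which is one reason scores produced with different prompts and parsers are hard to compare.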
Automated testing systems are now used to test models against benchmarks
in a standardised manner. Dr Liang’s team at Stanford has built one such
system, called HELM (holistic evaluation of language models), which generates
leaderboards showing how a range of models perform on various
benchmarks. Dr Fourrier’s team at Hugging Face uses another such system,
EleutherAI Harness, to generate leaderboards for open-source models. These
leaderboards are more trustworthy than the tables of results provided by
model-makers, because the benchmark scores have been generated in a
consistent way.
holistic: US [hoʊˈlɪstɪk] whole; comprehensive
The greatest trick AI ever pulled

As models gain new skills, new benchmarks are being developed to assess
them. GAIA, for example, tests AI models on real-world problem-solving. (Some
of the answers are kept secret to avoid contamination.) NoCha (novel
challenge), announced in June, is a “long context” benchmark consisting of
1,001 questions about 67 recently published English-language novels. The
answers depend on having read and understood the whole book, which is
supplied to the model as part of the test. Recent novels were chosen because
they are unlikely to have been used as training data. Other benchmarks
assess models’ ability to solve biology problems or their tendency to
But new benchmarks can be expensive to develop, because they often
require human experts to create a detailed set of questions and answers. One
answer is to use LLMs themselves to develop new benchmarks. Dr Liang is
doing this with a project called AutoBencher, which extracts questions and
answers from source documents and identifies the hardest ones.
Anthropic, the startup behind the Claude LLM, has started funding the creation
of benchmarks directly, with a particular emphasis on AI safety. “We are
super-undersupplied on benchmarks for safety,” says Logan Graham, a
researcher at Anthropic. “We are in a dark forest of not knowing what the
models are capable of.” On July 1st the company began inviting proposals
for new benchmarks, and tools for generating them, which it will co-fund,
with a view to making them available to all. This might involve developing
ways to assess a model’s ability to develop cyber-attack tools, say, or its
willingness to provide advice on making chemical or biological weapons.
These benchmarks can then be used to assess the safety of a model before
public release.
Historically, says Dr Graham, AI benchmarks have been devised by
academics. But as AI is commercialised and deployed in a range of fields,
there is a growing need for reliable and specific benchmarks. Startups that
specialise in providing AI benchmarks are starting to appear, he notes. “Our
goal is to pump-prime the market,” he says, to give researchers, regulators
and academics the tools they need to assess the capabilities of AI models,
good and bad. The days of AI labs marking their own homework could soon
be over. ■
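  In this passage, Dr Graham means that by providing tools and resources for AI benchmarking, and by giving early support to researchers, regulators and academics, the company hopes to help them evaluate the capabilities of AI models better and so stimulate the whole market for AI benchmarks.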




