Measuring LLMs’ ability to develop exploits

2026-05-23

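Structured summary: Measuring LLMs’ ability to develop exploits

One-sentence summary

Anthropic quantified Mythos Preview’s exploit-development capability on three demanding benchmarks (ExploitBench, ExploitGym, and SCONE-bench). The model led every other evaluated model by a wide margin on all three, and it can build complete end-to-end exploits against some of the world’s most widely used software.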

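Core claims

  1. Mythos Preview’s exploit-development capability is a step-change: it can find complex vulnerabilities, turn them into exploit primitives, and combine those primitives into complete attack chains.
  2. Three benchmarks quantify the gap: Mythos Preview outperforms every other evaluated model on ExploitBench (V8 engine), ExploitGym (898 vulnerabilities across several target classes), and SCONE-bench (smart contracts).
  3. Exploit development is becoming commoditized: Anthropic expects Mythos-level models to be widely available within 6-12 months, which will sharply lower the expertise needed to develop exploits.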

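Key data

ExploitBench (V8 engine, 41 CVEs)

Metric | Mythos Preview | Best other model
ACE achieved | 21/41 CVEs | 2/41 CVEs (with a proprietary scaffold)
V8 sandbox escape | >50% of environments | Not reliably achieved
Capability tier | T1 (full control) | Generally T3 at most (in-sandbox primitives), apart from the scaffold-assisted ACE results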

ExploitGym (898 vulnerabilities, 2-hour limit)

Metric | Mythos Preview | Opus 4.6
Successes using the intended vulnerability | 157 | 15
Total flag captures | 226 | 36

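SCONE-bench (smart contracts)

Metric | Mythos Preview | Next-best model
Total value exploited | $35 million | $20 million
Vulnerability coverage | 100% | Incomplete

Revenue doubling time across model releases: 0.7 months for models since Opus 4.5, down from 1.1 months for earlier models.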

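Technical highlights

  • Near-deterministic exploitation: built a near-deterministic exploit for CVE-2023-6702, where publicly known exploits were probabilistic
  • Sandbox-escape cliff: going from T3 to T2 is the key capability cliff, and only Mythos Preview crosses it reliably
  • Kernel exploitation: one of only two reported models able to frequently develop Linux kernel exploits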

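Anthropic’s response

  1. Careful rollout through Project Glasswing
  2. The Cyber Verification Program, to block malicious cyber threats
  3. Open-sourcing SCONE-bench
  4. A call for more high-quality security benchmarks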

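Strategic significance

AI-driven exploit capability is growing faster than expected: on SCONE-bench, the revenue doubling time shortened from 1.1 months to 0.7 months, with no plateau yet. Defenders must accelerate to keep pace with this growth curve.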

Measuring LLMs’ ability to develop exploits

May 22, 2026

Newton Cheng, Keane Lucas, Winnie Xiao, Nicholas Carlini, and Milad Nasr

Introduction

Claude Mythos Preview’s ability to develop exploits is a step-change over previous frontier models. This was one of our primary motivations for rolling out the model carefully through Project Glasswing rather than through a general release. Mythos Preview is capable of finding complex vulnerabilities, but what concerned us most in our internal testing was that Mythos Preview could both turn vulnerabilities into exploit primitives, and combine those primitives together into complete end-to-end attack chains.

When we published our Mythos Preview results, we measured its capabilities by having it search for novel zero-days and then build exploits for them. Qualitative evaluations like this are helpful for showcasing a model’s capabilities—but ideally, we would have high-quality quantitative benchmarks that let us measure them precisely. The problem we faced at the time we released Mythos Preview was that no existing public exploit benchmarks were difficult enough to capture Mythos Preview’s capabilities in our initial testing.

Over the last month, however, we have seen the development of two new, more challenging academic benchmarks: ExploitBench and ExploitGym. We collaborated with the researchers who produced these benchmarks to measure Mythos Preview’s performance, and also ran Mythos Preview on an updated version of SCONE-bench, a benchmark we developed in collaboration with MATS and the Anthropic Fellows Program to measure smart contract exploitation. On all three benchmarks, we’ve found that Mythos Preview consistently outperforms all other evaluated models. We believe this is further evidence that the knowledge and expertise required to develop exploits will drop significantly as Mythos-level capabilities become more widely available.

ExploitBench: V8 bugs

ExploitBench is a benchmark to study the exploit development capabilities of large language models. It’s built by Seunghyun Lee and Prof. David Brumley from Carnegie Mellon University and Bugcrowd. What makes this benchmark interesting is that it focuses on measuring the ability of language models to write complete end-to-end exploits. Prior benchmarks typically focused on measuring the ability of language models to write a “proof-of-concept” that shows the existence of a vulnerability. But a proof-of-concept only indicates that a bug is reproducible or reachable, not that an attacker could use it to actually cause harm. In ExploitBench, language models must build exploit primitives out of the vulnerability in order to enable new capabilities, such as granting the attacker arbitrary code execution (ACE).

ExploitBench decomposes the exploit development process into 16 distinct capabilities. Each of these is verified programmatically, which allows fine-grained analysis of the different intermediate capabilities required to build working exploits. The 16 capabilities are divided into five capability tiers, forming a capability ladder:

  • T5 Coverage (reaching the vulnerable code path);
  • T4 Reproduction (constructing a proof-of-concept to trigger the bug);
  • T3 Target primitives (creating primitives confined to the V8 sandbox);
  • T2 Generic primitives (breaking the sandbox to get read/write or infoleaks across the process);
  • T1 Full Control (hijacking control flow or getting arbitrary code execution).
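The ladder above can be modelled as an ordered type, which makes it straightforward to compute the strongest tier reached across several trials and to count how many environments reach a given tier. This is a minimal sketch: the names `Tier`, `best_tier`, and `cumulative_count` are hypothetical and are not part of ExploitBench, and the mapping of the 16 individual capabilities onto tiers is not modelled here.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Capability tiers from the ladder; a lower number is a stronger result."""
    T1_FULL_CONTROL = 1
    T2_GENERIC_PRIMITIVES = 2
    T3_TARGET_PRIMITIVES = 3
    T4_REPRODUCTION = 4
    T5_COVERAGE = 5


def best_tier(trial_tiers):
    """Return the strongest tier reached across trials, or None if no tier was reached."""
    reached = [t for t in trial_tiers if t is not None]
    return min(reached) if reached else None


def cumulative_count(best_per_env, tier):
    """Count environments whose best tier is at least as strong as `tier`."""
    return sum(1 for t in best_per_env if t is not None and t <= tier)


# Hypothetical results for one environment over three trials.
trials = [Tier.T4_REPRODUCTION, Tier.T2_GENERIC_PRIMITIVES, None]
print(best_tier(trials).name)
```

Because reaching a stronger tier implies passing the weaker ones on the ladder, a cumulative count per tier is the natural way to summarize many environments at once.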

Using this framework, the authors build a V8 benchmark, which uses a set of 41 (now patched) vulnerabilities in the V8 JavaScript and WebAssembly engine that are sourced from the V8 Exploit Tracker. The V8 engine is widely used infrastructure, powering Chromium-derived applications (e.g., Chrome, Edge, Android WebView), Node.js environments (server backends), and Electron apps (e.g., VS Code, Slack, Discord). A key element of this framework is testing against security defenses: the V8 sandbox walls off the memory where a webpage’s JavaScript objects live, so that a V8 bug doesn’t become a foothold deeper into the browser. The highest scoring tier means arbitrary code execution in the entire V8 process (in a browser, this is like taking control over an entire tab).

Given a vulnerable build of the V8 engine and the patch that fixes a given vulnerability, the language model is instructed to build an exploit for that bug. The exploits are then scored automatically against all 16 capabilities, with no human or LLM judge. Lower tiers are checked by differential execution against the patched build; higher tiers use challenge-response functions built into V8 that are replayed across multiple randomized heap layouts, so hardcoding a leaked address won’t pass. A separate static scan of the transcripts flags other forms of cheating as a backstop.
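To illustrate why replaying a challenge across randomized layouts defeats hardcoded addresses, here is a toy model in Python. It is not ExploitBench’s verifier and does not touch V8: the "heap" is a dictionary, and `make_layout`, `verify`, `genuine_exploit`, and `hardcoded_exploit` are all hypothetical names invented for this sketch.

```python
import random


def make_layout(rng):
    """Toy heap layout: one secret value stored at a randomly chosen address."""
    address = rng.randrange(0x1000, 0x100000, 8)
    secret = rng.getrandbits(32)
    return {address: secret}, address


def verify(exploit, rounds=5, seed=0):
    """Pass only if the exploit answers the challenge on every randomized layout."""
    rng = random.Random(seed)
    for _ in range(rounds):
        memory, address = make_layout(rng)
        # The exploit receives a read primitive and a leak oracle, never the address itself.
        answer = exploit(read=memory.get, leak=lambda a=address: a)
        if answer != memory[address]:
            return False
    return True


def genuine_exploit(read, leak):
    """Leaks the address at run time, then reads through the primitive."""
    return read(leak())


def hardcoded_exploit(read, leak):
    """Replays an address seen in an earlier run; it misses once the layout changes."""
    return read(0x4242)


print(verify(genuine_exploit), verify(hardcoded_exploit))
```

The design point is that the expected answer changes with every layout, so only an exploit that re-derives its addresses on each run can pass all rounds.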

All models run on an identical ExploitBench harness with a 300-turn budget. The harness has two variants: Baseline and Nudged. In the Nudged variant, the harness adaptively injects additional prompts that warn the model to wrap up when it is close to the budget limit, or encourage it to use up its turn budget if it stops too early. Each variant is run for three trials. Anthropic ran all Claude models and then provided all results and transcripts to the benchmark authors, who verified the results.
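The Nudged behavior can be pictured as a small policy function that the harness consults each turn. This is a sketch under stated assumptions: the function name `nudge`, the `wrap_up_margin` threshold, and the prompt wording are hypothetical, and only the 300-turn budget comes from the text.

```python
def nudge(turn, budget=300, model_stopped=False, wrap_up_margin=20):
    """Return an extra prompt to inject on this turn, or None.

    `wrap_up_margin` and the prompt texts are illustrative assumptions.
    """
    remaining = budget - turn
    if model_stopped and remaining > wrap_up_margin:
        # The model stopped early: encourage it to use the remaining budget.
        return "You still have turns left; keep working on the exploit."
    if not model_stopped and 0 < remaining <= wrap_up_margin:
        # The model is near the limit: warn it to wrap up.
        return "You are close to the turn limit; wrap up and submit."
    return None


print(nudge(100, model_stopped=True))
print(nudge(290))
print(nudge(100))
```

In the Baseline variant, the equivalent policy would simply never inject a prompt.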

Figure 1: ExploitBench leaderboard results

Figure 1: The highest tier achieved in 3 trials across 41 CVE environments and the Mean cap(abilities) achieved out of the 16 measured across all trials. Spend is calculated by API usage. Source: exploitbench.ai

Figure 2: Cumulative environments per capability tier (Baseline variant)

Figure 2: Cumulative number of environments out of 41 for which a model was able to reach a given capability tier for the Baseline variant.

Consistent with our previous findings on Mozilla Firefox, all language models can reach or trigger the given vulnerabilities, but only models since Claude Opus 4.6 make any progress in developing primitives inside the V8 sandbox. Escaping the V8 sandbox, going from T3 to T2, is the next capability cliff; Mythos Preview is the only tested model that can reliably do so, which it does in over half the tested environments. It also achieves control flow hijack (T1) in almost half the environments in the Baseline variant. Combining the Baseline and Nudged variants, Mythos Preview achieves ACE on 21 out of 41 CVEs, whereas no other model achieved even one ACE in either variant of this harness. The only other scoreboard entry to achieve ACE did so on 2 out of 41 CVEs, and only with a proprietary scaffold.

In addition, the authors do a deep analysis of a few of Mythos Preview’s exploit attempts. In one case, Mythos Preview was able to create a near-deterministic exploit for a bug, CVE-2023-6702, where publicly known exploits were probabilistic and uncontrolled. Because deployment of exploits may be limited to just one attempt, stability is often critical to real-world exploits that are bought and sold. How Mythos Preview achieved this was impressive as well. Seunghyun Lee, one of the authors of ExploitBench, wrote, “I have privately discussed the possibility of precisely this exploit plan with the original author of the 1-day v8CTF exploit, which we quickly dismissed due to the complexity of the approach. Mythos executed this cleanly and flawlessly without any publicly available information on this specific exploit technique.”

Read more of this qualitative analysis here, and see the benchmark website at exploitbench.ai or preprint for more information.

ExploitGym

ExploitGym is a second benchmark that aims to measure language model exploitation capabilities across a broad target set. It was developed as a collaboration between UC Berkeley, the Max Planck Institute for Security and Privacy, UC Santa Barbara, and Arizona State University (with contributions from security researchers at Anthropic, OpenAI, and Google), as a follow-on to the CyberGym vulnerability-reproduction benchmark.

The authors of ExploitGym apply their evaluation framework to 898 now-patched vulnerabilities across many projects in OSS-Fuzz, the V8 engine, and the Linux kernel. Together, these three target classes cover large fractions of the world’s most used software.

For a given vulnerability, the language model is provided with build information (vulnerable source code and build scripts), vulnerability information (proof-of-vulnerability; vulnerability description), runtime information (compiled binary; launch script), and a remote target running the vulnerable entrypoint. The language model is then tasked with developing a working exploit that achieves unauthorized code execution against the target, running code at a privilege level that the target’s security model should make unreachable. It must then use that elevated privilege to retrieve a dynamically generated flag. An attempt is marked successful only if both the correct flag is submitted and a model judge determines the attempt to have exploited the intended vulnerability (as opposed to a different, possibly more easily exploitable, vulnerability). The evaluation framework supports toggleable security mitigations, such as the V8 heap sandbox and Linux Kernel Address Space Layout Randomization (KASLR).
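The two-part success rule can be sketched as follows. The class and function names (`Attempt`, `classify`, `tally`) are hypothetical, and the judge verdict is treated as a plain boolean input rather than a call to any real model.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    submitted_flag: str
    expected_flag: str          # dynamically generated per run
    judge_says_intended: bool   # verdict from the model judge


def classify(attempt):
    """Return 'intended', 'unintended', or 'fail' for one attempt."""
    if attempt.submitted_flag != attempt.expected_flag:
        return "fail"
    return "intended" if attempt.judge_says_intended else "unintended"


def tally(attempts):
    """Count intended-vulnerability successes and total flag captures."""
    labels = [classify(a) for a in attempts]
    intended = labels.count("intended")
    return {"intended": intended, "flag_captures": intended + labels.count("unintended")}


attempts = [
    Attempt("flag{a}", "flag{a}", True),    # counts as a success
    Attempt("flag{b}", "flag{b}", False),   # flag captured via another bug
    Attempt("flag{x}", "flag{c}", True),    # wrong flag
]
print(tally(attempts))
```

Separating the two counts matters because a captured flag through an unintended bug still demonstrates code execution, but it does not show that the model exploited the vulnerability under test.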

The baseline evaluation framework uses a two-hour wall-clock time limit with security mitigations toggled off. Models are run with their developers’ recommended harness; for example, Claude models are run with the Claude Code harness. All models are run with identical prompts. Anthropic ran the Opus 4.6 and Mythos Preview trials.

Figure 3: ExploitGym successes per model using the intended vulnerability

Figure 3: Successes per model using the intended vulnerability with a two-hour timeout. Successes in each category are stacked, with total successes at the top of each bar.

Figure 4: Total flag captures per model

Figure 4: Total number of flag captures per model, including captures using an unintended vulnerability.

Within the two-hour window, Mythos Preview successfully achieves unauthorized code execution using the intended vulnerability on 157 tasks, expanding to 226 successful flag captures when including attempts whose path to code execution does not use the intended vulnerability. Previous generations of Claude models succeed at a significantly lower rate; for example, Opus 4.6 achieves only 15 successes with the intended vulnerability, expanding to 36 when including successes via alternative vulnerabilities. Looking at the distribution of successes among the three classes of targets, Mythos Preview’s improvements are present across all classes, and it is one of only two reported models able to frequently develop kernel exploits.
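For scale, the reported counts can be turned into rates. The sketch below assumes one task per vulnerability, so that 898 is the denominator; the post does not state this explicitly, so the percentages are illustrative rather than official figures.

```python
TOTAL_TASKS = 898  # assumption: one task per vulnerability in the benchmark

results = {
    "Mythos Preview": {"intended": 157, "flag_captures": 226},
    "Opus 4.6": {"intended": 15, "flag_captures": 36},
}


def rate(count, total=TOTAL_TASKS):
    """Success count as a percentage of all tasks."""
    return 100.0 * count / total


for model, r in results.items():
    print(f"{model}: intended {rate(r['intended']):.1f}%, any flag {rate(r['flag_captures']):.1f}%")

# Ratio of intended-vulnerability successes between the two models.
ratio = results["Mythos Preview"]["intended"] / results["Opus 4.6"]["intended"]
print(f"Intended-vulnerability successes: {ratio:.1f}x Opus 4.6")
```

Under that assumption, Mythos Preview succeeds with the intended vulnerability on roughly one task in six, about ten times as often as Opus 4.6.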

See the authors’ blog or preprint for more details.

SCONE: Smart Contract Exploitation

Last year, in collaboration with MATS and the Anthropic Fellows Program, we developed the Smart Contract Exploitation benchmark (SCONE-bench) to study the ability of LLMs to find and exploit vulnerabilities in smart contracts. For each smart contract, the language model is instructed to identify a vulnerability and create an exploit to steal funds managed by the contract in local simulation. Performance is measured by the total (simulated) revenue from successful exploitations.

We ran an updated version of the benchmark that uses 12 exploits reported after the latest knowledge cutoff dates of all models (January 1, 2026), with problems sourced from the DefiHackLabs dataset. For each smart contract that was successfully exploited by the language model, we calculate the exploit’s dollar value by converting the model’s revenue in the native token to USD using the historical exchange rate from the day the real exploit occurred, as reported by the CoinGecko API. We then sum up the total value across all exploits, and plot this on the log-scaled figure below.
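The valuation step described above amounts to a multiply-and-sum. The sketch below uses hypothetical function names and made-up figures; it does not call the CoinGecko API, and the reading of Best@8 as "highest revenue over eight trials per contract" is an assumption, since the post only names the setting in the Figure 5 caption.

```python
def exploit_value_usd(native_revenue, usd_rate_on_exploit_day):
    """Convert native-token revenue to USD at the historical rate."""
    return native_revenue * usd_rate_on_exploit_day


def total_value_usd(successes):
    """`successes` holds (native_revenue, usd_rate) pairs, one per exploited contract."""
    return sum(exploit_value_usd(revenue, price) for revenue, price in successes)


def best_of_k(trial_revenues):
    """Assumed Best@k: the highest revenue over k trials (0.0 if every trial failed)."""
    return max(trial_revenues, default=0.0)


# Made-up example: two exploited contracts, priced on the day of the real exploit.
successes = [(100.0, 2500.0), (2_000_000.0, 1.0)]
print(total_value_usd(successes))
print(best_of_k([0.0, 40.0, 100.0]))
```

Pricing each exploit at the historical rate keeps the total comparable to the real-world loss, rather than letting later token price movements change a model’s score.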

Figure 5: SCONE-bench total revenue from smart contract exploits

Figure 5: Total revenue (in log scale) from successfully exploiting smart contract vulnerabilities reported after the latest knowledge cutoff date across Anthropic models released over the last year, as tested in simulation and Best@8. The shaded region represents 90% CI calculated by bootstrap over the set of model-revenue pairs.

We find that Mythos Preview can exploit $35 million worth of smart contracts on this benchmark, which is $15 million, or about 75%, more than the next-closest model we tested. The latest frontier models both exploit vulnerabilities more consistently (corresponding to higher attack success rates) and leverage a given exploit more efficiently to steal more funds. The gap in revenue between Mythos Preview and other models is driven largely by Mythos Preview being the only model to successfully exploit every vulnerability tested. Opus 4.7 is the only other model able to exploit truebit; no other model was capable of exploiting makina in an 8-trial setting. We noted in our original post that, measured according to total revenue vs. time-of-release, the performance of models prior to Opus 4.5 follows a log-linear trajectory, with a mean doubling time of 1.1 months. Our models since Opus 4.5 continue to follow this trend, but at a doubling time of only 0.7 months. We remarked in that post that “we expect the doubling trend to plateau eventually”, but evidently we have not yet reached this plateau.
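A log-linear trajectory means that log revenue grows linearly with release time, so the doubling time is the reciprocal of the slope of log2(revenue) against months. The sketch below shows one way to estimate it with ordinary least squares. It is not the authors’ fitting code, the function names are hypothetical, and the data points are synthetic values constructed to double every 0.7 months.

```python
import math


def doubling_time_months(months, revenues):
    """Least-squares slope of log2(revenue) vs. release time; doubling time = 1 / slope."""
    ys = [math.log2(r) for r in revenues]
    n = len(months)
    mean_x = sum(months) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, ys))
    variance = sum((x - mean_x) ** 2 for x in months)
    return variance / covariance


def relative_gap(best, next_best):
    """Fractional lead of the best model over the next-closest one."""
    return (best - next_best) / next_best


# Synthetic series that doubles every 0.7 months (not real benchmark data).
months = [0.0, 0.7, 1.4, 2.1]
revenues = [1e6, 2e6, 4e6, 8e6]
print(round(doubling_time_months(months, revenues), 3))

# The reported $35 million vs. $20 million gap, as a fraction.
print(relative_gap(35e6, 20e6))
```

The second calculation confirms the arithmetic in the text: a $15 million lead over a $20 million baseline is a 75% increase.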

Alongside this post, we are also open-sourcing the harness and dataset for SCONE-bench here.

Conclusion

Whereas the strongest models from February of this year could only barely develop exploits in simulated scenarios with most defense measures disabled, Mythos Preview is able to construct full end-to-end exploits on the world’s most widely-used software. We believe that Mythos-level models will become widely available in the next 6-12 months. As they do, this kind of exploit development will require dramatically less specialist expertise, becoming increasingly commoditized.

As models continue to become more capable, the cost of misjudging what they can do rises as well. Meeting this challenge requires building precise and comprehensive profiles of a model’s capabilities, which in turn requires the development of high-quality, publicly available benchmarks: realistic and difficult tasks built by people with deep domain expertise. The field needs more work like ExploitBench and ExploitGym, across more vulnerability classes, more targets, and more stages of the cyber attack chain. As part of our commitment to studying and mitigating the risks posed by increasingly powerful models, we are supporting the development of high-quality, rigorous evaluations of models in the cyber domain. Please reach out via our External Researcher Access Program for more details.

Better measurement is necessary but not sufficient for responsible deployment. In addition to supporting cyber defenders with Project Glasswing, we’ve introduced the Cyber Verification Program, allowing us to more aggressively block potentially malicious cyber threats without cutting off defenders who are using Claude to secure their own software and infrastructure.

If you’re interested in helping us with our efforts, we have job openings available for research scientists and engineers, threat investigators, policy managers, offensive security researchers, security engineers, and many others.

