OpenAI unveils EVMbench to benchmark AI for smart-contract security
Overview
OpenAI announced EVMbench, a public benchmark designed to evaluate how AI models handle smart contracts running on Ethereum-style virtual machines. Developed in partnership with Paradigm, the suite simulates realistic conditions using previously observed bug patterns and exploit scenarios. The launch signals a move from informal experiments to a structured testing regimen for models applied to blockchain code.
The aim is simple: measure, compare, repeat.
What the benchmark measures
EVMbench evaluates three distinct abilities: pinpointing security flaws, generating controlled exploits that validate them, and producing corrected code that preserves contract behavior. Each ability is scored independently, so progress on one axis does not mask a regression on another. The dataset draws on audit findings and security competitions, prioritizing cases with real economic consequences.
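OpenAI has not published the benchmark's internal schema, but independent per-axis scoring can be pictured with a short sketch. The record and function names below are illustrative assumptions, not EVMbench's actual API.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    # Hypothetical per-task record; field names are illustrative,
    # not taken from any published EVMbench schema.
    flaw_found: bool      # did the model pinpoint the vulnerability?
    exploit_valid: bool   # did its controlled exploit actually fire?
    patch_correct: bool   # did its fix preserve contract behavior?

def score(results: list[TaskResult]) -> dict[str, float]:
    """Score each ability independently, so a gain on one axis
    cannot hide a regression on another."""
    n = len(results)
    return {
        "detection":    sum(r.flaw_found for r in results) / n,
        "exploitation": sum(r.exploit_valid for r in results) / n,
        "patching":     sum(r.patch_correct for r in results) / n,
    }
```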
Tests run against live-like bytecode and source-code variants to assess whether a model's output would hold up in practical audits or offensive research. That approach forces models to demonstrate both analytical depth and precision when changing sensitive on-chain logic.
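The behavior-preservation requirement suggests differential testing: replay the same transactions against the original and patched contracts and require identical observable results everywhere except the exploit path. The sketch below assumes a caller-supplied `execute` runner (a local EVM node or simulator) and an `exploited` flag in its results; neither reflects EVMbench's actual harness.

```python
from typing import Callable

Tx = dict      # transaction description; the shape is illustrative
Result = dict  # observable outcome: return data, logs, state diff

def patch_preserves_behavior(
    execute: Callable[[bytes, Tx], Result],  # caller-supplied EVM runner
    original: bytes,
    patched: bytes,
    benign_txs: list[Tx],
    exploit_tx: Tx,
) -> bool:
    """True if benign transactions behave identically on both versions
    and the exploit succeeds only against the unpatched contract."""
    for tx in benign_txs:
        if execute(original, tx) != execute(patched, tx):
            return False
    return (execute(original, exploit_tx).get("exploited", False)
            and not execute(patched, exploit_tx).get("exploited", False))
```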
Why this matters
Smart contracts currently secure billions of dollars in user assets, and a long history of costly exploits makes systematic evaluation timely. By codifying success criteria, EVMbench gives toolmakers, auditors, and regulators a shared reference for judging AI-driven security tooling. The collaboration with a crypto research investor like Paradigm suggests the benchmark balances academic rigor with field relevance.
Adoption could accelerate the integration of AI into security workflows, speed up audits, and change how teams triage vulnerabilities. It may also fuel an arms race in which defensive models improve while attackers tune their own models to evade or exploit them, raising the bar for continuous evaluation.