Leaderboard
| System | Speedup Ratio |
|---|---|
We evaluate models using Speedup Ratio (SR), which compares a model's achieved speedup against the expert's: SR = (LM speedup) / (Expert speedup). An SR of 1.0× means the model matches the human expert's speedup, and values above 1.0× mean surpassing expert performance.
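As a concrete illustration, the sketch below computes SR from wall-clock times. The function names and the timing numbers are illustrative assumptions, not the benchmark's actual harness.

```python
def speedup(baseline_s: float, optimized_s: float) -> float:
    """Speedup of an optimized run over the baseline (higher is better)."""
    return baseline_s / optimized_s

def speedup_ratio(lm_speedup: float, expert_speedup: float) -> float:
    """SR = (LM speedup) / (Expert speedup); 1.0 matches the expert."""
    return lm_speedup / expert_speedup

# Hypothetical timings: baseline 10 s; expert patch 2.5 s (4.0x speedup);
# LM patch 8 s (1.25x speedup).
lm = speedup(10.0, 8.0)        # 1.25
expert = speedup(10.0, 2.5)    # 4.0
print(f"SR = {speedup_ratio(lm, expert):.2f}x")  # SR = 0.31x
```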
Overview
SWE-fficiency (pronounced swee-FISH-uhn-see) challenges language models to optimize the runtime of real repos on real workloads. It contains 498 tasks across nine popular Python repos, including numpy, pandas, and scipy.
Given a complete codebase and a real workload, agents must investigate code semantics, localize bottlenecks and relevant tests, and produce a patch that matches or exceeds expert speedup while passing the same unit tests. This challenges models to perform the investigative, pass-to-pass workflow of real performance engineering.
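Concretely, evaluating a patch amounts to timing the workload and re-running the covering tests. The sketch below assumes each task exposes a workload.py script and a pytest selection; those names and commands are illustrative assumptions, not the benchmark's actual harness.

```python
import subprocess
import time

def time_workload(repo_dir: str, script: str = "workload.py") -> float:
    """Wall-clock one run of the task's workload script (illustrative)."""
    start = time.perf_counter()
    subprocess.run(["python", script], cwd=repo_dir, check=True)
    return time.perf_counter() - start

def tests_pass(repo_dir: str, test_selection: str) -> bool:
    """Re-run the unit tests that cover the edited code paths."""
    result = subprocess.run(["pytest", "-q", test_selection], cwd=repo_dir)
    return result.returncode == 0

# Pass-to-pass: a patch only counts if the same tests pass before and
# after the edit; its speedup is baseline_time / patched_time on the
# same workload.
```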
Our evaluation reveals significant underperformance of current state-of-the-art agents. On average, top frontier models achieve less than 0.15× the expert speedup, struggling with localizing optimization opportunities, reasoning about execution across functions, and maintaining correctness in proposed edits.