# Tables & Resources
This page contains statistical tables and resources from our comprehensive survey on Issue Resolution in Software Engineering.
## 📊 Evaluation Datasets Overview
A comprehensive survey and statistical overview of issue resolution datasets. We categorize these datasets based on programming language, modality support, source repositories, data scale (Amount), and the availability of reproducible execution environments.
| Dataset | Language | Multimodal | Repos | Amount | Environment | Link |
|---|---|---|---|---|---|---|
| SWE-bench-train | Python | ❌ | 37 | 19k | ❌ | |
| SWE-bench | Python | ❌ | 12 | 2,294 | ✅ | |
| SWE-bench Lite | Python | ❌ | 12 | 300 | ✅ | |
| SWE-bench Verified | Python | ❌ | / | 500 | ✅ | |
| SWE-bench-java | Java | ❌ | 19 | 1,797 | ✅ | |
| SWE-bench Multimodal | JS, TS, HTML, CSS | ✅ | 17 | 619 | ✅ | |
| SWE-bench-extra | Python | ❌ | 2k | 6.38k | ✅ | |
| Visual SWE-bench | Python | ✅ | 11 | 133 | ✅ | |
| SWE-Lancer | JS, TS | ❌ | / | 1,488 | ✅ | |
| Multi-SWE-bench | Java, JS, TS, Go, Rust, C, C++ | ❌ | 76 | 4,723 | ✅ | |
| R2E-Gym | Python | ❌ | 10 | 8,135 | ✅ | |
| SWE-PolyBench | Python, Java, JS, TS | ❌ | 21 | 2,110 | ✅ | |
| Loc-Bench | Python | ❌ | / | 560 | ❌ | |
| SWE-smith | Python | ❌ | 128 | 50k | ✅ | |
| SWE-bench Multilingual | C, C++, Go, Java, JS, TS, Rust, Python, Ruby, PHP | ❌ | 42 | 300 | ✅ | |
| SWE-Fixer | Python | ❌ | 856 | 115,406 | ❌ | |
| OmniGIRL | Python, TS, Java, JS | ✅ | 15 | 959 | ✅ | |
| SWE-rebench | Python | ❌ | 30,000 | 21,336 | ✅ | |
| SWE-bench-Live | Python | ❌ | 93 | 1,319 | ✅ | |
| SWE-Gym | Python | ❌ | 11 | 2,438 | ✅ | |
| SWE-Flow | Python | ❌ | 74 | 18,081 | ✅ | |
| SWE-Factory | Python, Java, JS, TS | ❌ | 12 | 430 | ✅ | |
| SWE-Bench-CL | Python | ❌ | 8 | 273 | ✅ | |
| Skywork-SWE | Python | ❌ | 2,531 | 10,169 | ✅ | / |
| SWE-MERA | Python | ❌ | 200 | 300 | ✅ | |
| SWE-Perf | Python | ❌ | 12 | 140 | ✅ | |
| RepoForge | Python | ❌ | / | 7.3k | ✅ | / |
| SWE-Mirror | Python, Rust, Go | ❌ | 40 | 60k | ✅ | / |
| SWE-Bench Pro | Go, TS, Python | ❌ | 41 | 1,865 | ✅ | |
| SWE-InfraBench | Python, TS | ❌ | / | 100 | ✅ | / |
| SWE-Sharp-Bench | C# | ❌ | 17 | 150 | ✅ | |
| SWE-fficiency | Python, Cython | ❌ | 9 | 498 | ✅ | |
| SWE-Compass | Python, JS, TS, Java, C, C++, Go, Rust, Kotlin, C# | ❌ | / | 2,000 | ✅ | |
| SWE-bench++ | Python, Go, TS, JS, Ruby, PHP, Java, Rust, C++, C#, C | ❌ | 3,971 | 1,782 | ✅ | |
| SWE-EVO | Python | ❌ | 7 | 48 | ✅ | |
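Most of these datasets are distributed through the Hugging Face Hub and share a similar task-instance schema. As a concrete example, the sketch below loads SWE-bench with the `datasets` library; the `princeton-nlp/SWE-bench` identifier and the field names follow the public SWE-bench release (the Lite and Verified variants live under `princeton-nlp/SWE-bench_Lite` and `princeton-nlp/SWE-bench_Verified`), while other rows in the table may use different hosts or schemas.

```python
from datasets import load_dataset  # pip install datasets

# Load the full SWE-bench test split from the Hugging Face Hub.
swe_bench = load_dataset("princeton-nlp/SWE-bench", split="test")
print(len(swe_bench))  # 2,294 task instances, matching the table above

# Each instance pairs a real GitHub issue with the gold patch that resolved it.
task = swe_bench[0]
print(task["repo"])               # source repository, e.g. "astropy/astropy"
print(task["problem_statement"])  # issue text given to the model or agent
print(task["patch"])              # gold reference patch, hidden at inference time
```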
## 🎯 Training Trajectory Datasets
A survey of trajectory datasets used for agent training or analysis. We list the programming language, number of source repositories, and total trajectories for each dataset.
| Dataset | Language | Repos | Amount | Link |
|---|---|---|---|---|
| R2E-Gym | Python | 10 | 3,321 | |
| SWE-Gym | Python | 11 | 491 | |
| SWE-Synth | Python | 11 | 3,018 | |
| SWE-Fixer | Python | 856 | 69,752 | |
| SWE-Factory | Python | 10 | 2,809 | |
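The listed datasets differ in storage format, but a trajectory is essentially the same object everywhere: an ordered sequence of agent turns plus an outcome label. The sketch below is an illustrative schema (the field names are ours, not any listed dataset's official ones) showing how such records are typically flattened into chat messages for training.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One agent turn: reasoning, the action taken, and the environment's reply."""
    thought: str       # the model's reasoning text
    action: str        # e.g. a shell command or file edit
    observation: str   # e.g. command output or test results

@dataclass
class Trajectory:
    """A complete issue-resolution rollout, as collected for agent training."""
    instance_id: str                                 # the benchmark task being solved
    steps: list[Step] = field(default_factory=list)
    resolved: bool = False                           # did the final patch pass the tests?

def to_messages(traj: Trajectory) -> list[dict]:
    """Flatten a rollout into the chat format consumed by SFT pipelines."""
    messages = []
    for s in traj.steps:
        messages.append({"role": "assistant", "content": f"{s.thought}\n{s.action}"})
        messages.append({"role": "user", "content": s.observation})
    return messages
```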
## 🔧 Supervised Fine-Tuning (SFT) Models
Overview of SFT-based methods for issue resolution. This table categorizes models by their base model, architecture, and training scaffold, sorted by performance (Res. %).
| Model Name | Base Model | Size | Arch. | Training Scaffold | Res. (%) | Code | Data | Model |
|---|---|---|---|---|---|---|---|---|
| Devstral | Mistral Small 3 | 22B | Dense | OpenHands | 46.8 | / | | |
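The common recipe behind such models is to render resolved trajectories as plain transcripts and fine-tune the base model with standard next-token prediction. Below is a minimal sketch using Hugging Face `transformers`; the base-model choice and the toy transcript are placeholders, not Devstral's actual training setup.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "Qwen/Qwen2.5-Coder-0.5B"  # placeholder: any base model from the tables
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Toy stand-in for resolved agent trajectories rendered as plain transcripts.
transcripts = [{"text": "ISSUE: off-by-one in pagination...\nPATCH: ..."}]

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=4096)

train_ds = Dataset.from_list(transcripts).map(tokenize, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-ckpt", per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=1e-5),
    train_dataset=train_ds,
    # mlm=False => causal LM objective: predict the next token of the transcript.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```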
## 🤖 Reinforcement Learning (RL) Models
A comprehensive overview of specialized models for issue resolution, categorized by parameter size. The table details each model's base architecture, the training scaffold used for rollout, the type of reward signal employed (Outcome vs. Process), and their performance results (Res. %) on issue resolution benchmarks.
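The two entries in the Reward column correspond to different signal granularities: an outcome reward scores only the final patch (typically, whether it flips the failing tests to passing), while a process reward also scores intermediate steps of the trajectory. A minimal sketch of the distinction, where `run_tests` and `score_step` are hypothetical stand-ins for a benchmark's test harness and a step-level reward model:

```python
# Illustrative only: `run_tests` and `score_step` are hypothetical stand-ins
# for a benchmark's test harness and a learned step-level reward model.
def run_tests(patch: str, fail_to_pass: list[str]) -> bool:
    return bool(patch)  # placeholder: a real harness applies the patch and runs the tests

def score_step(step: str) -> float:
    return 0.5  # placeholder: a real reward model scores the step's quality

def outcome_reward(patch: str, fail_to_pass: list[str]) -> float:
    """Sparse signal: 1.0 iff the final patch flips the failing tests to passing."""
    return 1.0 if run_tests(patch, fail_to_pass) else 0.0

def process_reward(steps: list[str]) -> float:
    """Dense signal: average a per-step score over the whole trajectory."""
    scores = [score_step(s) for s in steps]
    return sum(scores) / max(len(scores), 1)
```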
| Model Name | Base Model | Size | Arch. | Train. Scaffold | Reward | Res. (%) | Code | Data | Model |
|---|---|---|---|---|---|---|---|---|---|
| **560B Models (MoE)** | | | | | | | | | |
| LongCat-Flash-Think | LongCat-Flash-Base | 560B-A27B | MoE | R2E-Gym | Outcome | 60.4 | | / | |
| **72B Models** | | | | | | | | | |
| Kimi-Dev | Qwen2.5-72B-Base | 72B | Dense | BugFixer + TestWriter | Outcome | 60.4 | | / | |
| Multi-turn RL (Nebius) | Qwen2.5-72B-Instruct | 72B | Dense | SWE-agent | Outcome | 39.0 | / | / | / |
| Agent-RLVR-RM-72B | Qwen2.5-Coder-72B | 72B | Dense | Localization + Repair | Outcome | 27.8 | / | / | / |
| Agent-RLVR-72B | Qwen2.5-Coder-72B | 72B | Dense | Localization + Repair | Outcome | 22.4 | / | / | / |
| **70B Models** | | | | | | | | | |
| SWE-RL | Llama-3.3-70B-Instruct | 70B | Dense | Agentless-mini | Outcome | 41.0 | | / | / |
| **36B Models** | | | | | | | | | |
| FoldAgent | Seed-OSS-36B-Instruct | 36B | Dense | FoldAgent | Process | 58.0 | | | / |
| **32B Models** | | | | | | | | | |
| OpenHands Critic | Qwen2.5-Coder-32B | 32B | Dense | SWE-Gym | / | 66.4 | | / | |
| KAT-Dev-32B | Qwen3-32B | 32B | Dense | / | / | 62.4 | / | / | |
| SWE-Swiss-32B | Qwen2.5-32B-Instruct | 32B | Dense | / | Outcome | 60.2 | | | |
| SeamlessFlow-32B | Qwen3-32B | 32B | Dense | SWE-agent | Outcome | 45.8 | | / | / |
| DeepSWE | Qwen3-32B | 32B | Dense | R2E-Gym | Outcome | 42.2 | | | |
| SA-SWE-32B | / | 32B | Dense | SkyRL-Agent | / | 39.4 | / | / | / |
| OpenHands LM v0.1 | Qwen2.5-Coder-32B | 32B | Dense | SWE-Gym | / | 37.2 | | / | |
| SWE-Dev-32B | Qwen2.5-Coder-32B | 32B | Dense | OpenHands | Outcome | 36.6 | | / | |
| Satori-SWE | Qwen2.5-Coder-32B | 32B | Dense | Retriever + Code editor | Outcome | 35.8 | | | |
| SoRFT-32B | Qwen2.5-Coder-32B | 32B | Dense | Agentless | Outcome | 30.8 | / | / | / |
| Agent-RLVR-32B | Qwen2.5-Coder-32B | 32B | Dense | Localization + Repair | Outcome | 21.6 | / | / | / |
| **14B Models** | | | | | | | | | |
| Agent-RLVR-14B | Qwen2.5-Coder-14B | 14B | Dense | Localization + Repair | Outcome | 18.0 | / | / | / |
| SEAlign-14B | Qwen2.5-Coder-14B | 14B | Dense | OpenHands | Process | 17.7 | / | / | / |
| **9B Models** | | | | | | | | | |
| SWE-Dev-9B | GLM-4-9B | 9B | Dense | OpenHands | Outcome | 13.6 | | / | |
| **8B Models** | | | | | | | | | |
| SeamlessFlow-8B | Qwen3-8B | 8B | Dense | SWE-agent | Outcome | 27.4 | | / | / |
| SWE-Dev-8B | Llama-3.1-8B | 8B | Dense | OpenHands | Outcome | 18.0 | | / | |
| **7B Models** | | | | | | | | | |
| SWE-Dev-7B | Qwen2.5-Coder-7B | 7B | Dense | OpenHands | Outcome | 23.4 | | / | |
| SoRFT-7B | Qwen2.5-Coder-7B | 7B | Dense | Agentless | Outcome | 21.4 | / | / | / |
| SEAlign-7B | Qwen2.5-Coder-7B | 7B | Dense | OpenHands | Process | 15.0 | / | / | / |
## 🌟 General Foundation Models
Overview of general foundation models evaluated on issue resolution. The table details the specific inference scaffolds (e.g., OpenHands, Agentless) used during evaluation to obtain the reported results.
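Despite the different names, the scaffolds in the Inf. Scaffold column all elaborate on the same outer loop: show the model the issue, execute the action it proposes in the repository, feed back the observation, and repeat until a patch is submitted. A minimal sketch follows; `llm` and `env` are hypothetical interfaces, not any specific scaffold's API.

```python
def resolve_issue(llm, env, issue: str, max_turns: int = 50) -> str:
    """Generic agent loop that scaffolds such as SWE-agent or OpenHands refine."""
    history = [{"role": "user", "content": issue}]
    for _ in range(max_turns):
        action = llm(history)              # model proposes e.g. a shell command
        if action.strip().startswith("submit"):
            break                          # model decides it is done
        observation = env.execute(action)  # run the action inside the repo sandbox
        history.append({"role": "assistant", "content": action})
        history.append({"role": "user", "content": observation})
    return env.diff()                      # final patch, sent to the test harness
```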
| Model Name | Size | Arch. | Inf. Scaffold | Reward | Res. (%) | Code | Model |
|---|---|---|---|---|---|---|---|
| KAT-Coder | / | / | Claude Code | Outcome | 73.4 | / | |
| DeepSeek V3.2 | 671B-A37B | MoE | Claude Code, RooCode | / | 73.1 | | |
| Kimi-K2-Instruct | 1T | MoE | Agentless | Outcome | 71.6 | / | |
| Qwen3-Coder | 480B-A35B | MoE | OpenHands | Outcome | 69.6 | | |
| gpt-oss-120b | 116.8B-A5.1B | MoE | Internal tool | Outcome | 62.0 | | |
| MiniMax M2 | 230B-A10B | MoE | R2E-Gym | Outcome | 61.0 | | |
| GLM-4.5-Air | 106B-A12B | MoE | OpenHands | Outcome | 57.6 | / | / |
| MiniMax M1-80k | 456B-A45.9B | MoE | Agentless | Outcome | 56.0 | | |
| MiniMax M1-40k | 456B-A45.9B | MoE | Agentless | Outcome | 55.6 | | |
| Llama 4 Maverick | 400B-A17B | MoE | mini-SWE-agent | Outcome | 21.0 | | |
| Llama 4 Scout | 109B-A17B | MoE | mini-SWE-agent | Outcome | 9.1 | | |