Tables & Resources

This page contains statistical tables and resources from our comprehensive survey on Issue Resolution in Software Engineering.

Evaluation & Training Datasets

A statistical overview of issue resolution datasets used for evaluation and training. Datasets are categorized by programming language, modality support, number of source repositories (Repos), data scale (Amount), and the availability of a reproducible execution environment.

Dataset	Language	Multimodal	Repos	Amount	Environment	Link
Single-PL Datasets
SWE-Fixer	Python	❌	856	115,406	❌
SWE-smith	Python	❌	128	50k	✅
SWE-Lego	Python	❌	3,251	32,119	✅
SWE-rebench	Python	❌	3,468	21,336	✅
SWE-bench-train	Python	❌	37	19k	❌
SWE-Flow	Python	❌	74	18,081	✅
Skywork-SWE	Python	❌	2,531	10,169	✅	-
R2E-Gym	Python	❌	10	8,135	✅
RepoForge	Python	❌	-	7.3k	✅	-
SWE-bench-extra	Python	❌	2k	6.38k	✅
SWE-Gym	Python	❌	11	2,438	✅
SWE-bench	Python	❌	12	2,294	✅
SWE-bench-java	Java	❌	19	1,797	✅
FEA-bench	Python	❌	83	1,401	✅
SWE-bench-Live	Python	❌	164	1,565	✅
Loc-Bench	Python	❌	-	560	❌
SWE-bench Verified	Python	❌	-	500	✅
SWE-bench Lite	Python	❌	12	300	✅
SWE-MERA	Python	❌	200	300	✅
SWE-Bench-CL	Python	❌	8	273	✅
SWE-Sharp-Bench	C#	❌	17	150	✅
SWE-Perf	Python	❌	12	140	✅
Visual SWE-bench	Python	✅	11	133	✅
SWE-EVO	Python	❌	7	48	✅
Multi-PL Datasets
SWE-Mirror	Python, Rust, Go	❌	40	60k	✅	-
Multi-SWE-bench	Java, JS, TS, Go, Rust, C, C++	❌	76	4,723	✅
Swing-Bench	Python, Go, C++, Rust	❌	400	2,300	✅	-
SWE-PolyBench	Python, Java, JS, TS	❌	21	2,110	✅
SWE-Compass	Python, JS, TS, Java, C, C++, Go, Rust, Kotlin, C#	❌	-	2,000	✅
SWE-Bench Pro	Python, Go, TS	❌	41	1,865	✅
SWE-bench++	Python, Go, TS, JS, Ruby, PHP, Java, Rust, C++, C#, C	❌	3,971	1,782	✅
SWE-Lancer	JS, TS	❌	-	1,488	✅
OmniGIRL	Python, TS, Java, JS	✅	15	959	✅
SWE-bench Multimodal	JS, TS, HTML, CSS	✅	17	619	✅
SWE-fficiency	Python, Cython	❌	9	498	✅
SWE-Factory	Python, Java, JS, TS	❌	12	430	✅
SWE-bench-Live-MultiLang & Windows	Python, JS, TS, C, C++, C#, Java, Go, Rust	❌	238	418	✅
SWE-bench Multilingual	C, C++, Go, Java, JS, TS, Rust, Python, Ruby, PHP	❌	42	300	✅
SWE-InfraBench	Python, TS	❌	-	100	✅	-
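
The Amount column mixes exact counts (`115,406`), rounded values (`50k`, `6.38k`), and missing entries (`-`), which makes direct comparison awkward. The following is a minimal Python sketch, assuming the cells are read as plain strings; `parse_amount` is a hypothetical helper name, not part of any dataset's tooling, and values written with a `k` suffix remain approximations after conversion.

```python
import re
from typing import Optional


def parse_amount(text: str) -> Optional[int]:
    """Convert an Amount cell such as '115,406', '50k', or '6.38k' to an int.

    Returns None for a missing value ('-' or an empty cell).
    """
    cleaned = text.strip().lower().replace(",", "")
    if cleaned in ("", "-"):
        return None
    # An integer or decimal number, optionally followed by a 'k' (thousands) suffix.
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(k?)", cleaned)
    if match is None:
        raise ValueError(f"unrecognized amount: {text!r}")
    number, suffix = match.groups()
    scale = 1000 if suffix == "k" else 1
    # round() absorbs floating-point noise from values like 6.38 * 1000.
    return round(float(number) * scale)
```

With such a helper, rows can be sorted numerically, for example `sorted(rows, key=lambda r: parse_amount(r[1]) or 0, reverse=True)` for hypothetical `(name, amount)` tuples.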

Training Trajectory Datasets

Trajectory datasets used for agent training or analysis. For each dataset, we list the programming language, the number of source repositories (Repos), and the total number of trajectories (Amount).

Dataset	Language	Repos	Amount
SWE-Fixer	Python	856	69,752
SWE-rebench	Python	1,823	67,074
R2E-Gym	Python	10	3,321
SWE-Synth	Python	11	3,018
SWE-Factory	Python	10	2,809
SWE-Gym	Python	11	491
SWE-Lego	Python	3,251	14.6k

SFT-based Methods

Overview of SFT-based (supervised fine-tuning) methods for issue resolution. The table lists each model's base model, size, architecture, and training scaffold, sorted by resolution rate (Res. %) in descending order.

Model Name	Base Model	Size	Arch.	Training Scaffold	Res.(%)	Code	Data	Model
SWE-rebench-openhands-Qwen3-235B-A22B	Qwen3-235B-A22B	235B-A22B	MoE	OpenHands	59.9	-
SWE-Lego-Qwen3-32B	Qwen3-32B	32B	Dense	OpenHands	57.6
CGM-SWE-PY	Qwen2.5-Coder-72B	72B	Dense	Graph RAG	50.4		-
SWE-rebench-openhands-Qwen3-30B-A3B	Qwen3-30B-A3B	30B-A3B	MoE	OpenHands	49.7	-
Devstral	Mistral Small 3	22B	Dense	OpenHands	46.8	-
Co-PatcheR	Qwen2.5-Coder-14B	3×14B	Dense	PatchPilot-mini	46		-
SWE-Swiss-32B	Qwen2.5-32B-Instruct	32B	Dense	Agentless	45
SWE-Lego-Qwen3-8B	Qwen3-8B	8B	Dense	OpenHands	44.4
Lingma SWE-GPT	Qwen2.5-72B-Instruct	72B	Dense	SWESynInfer	30.2		-	-
SWE-Gym-Qwen-32B	Qwen2.5-Coder-32B	32B	Dense	OpenHands, MoatlessTools	20.6		-
Lingma SWE-GPT	Qwen2.5-Coder-7B	7B	Dense	SWESynInfer	18.2		-	-
SWE-Gym-Qwen-14B	Qwen2.5-Coder-14B	14B	Dense	OpenHands, MoatlessTools	16.4		-
SWE-Gym-Qwen-7B	Qwen2.5-Coder-7B	7B	Dense	OpenHands, MoatlessTools	10.6		-

RL-based Methods

Overview of RL-based models for issue resolution, grouped by parameter size. The table details each model's base model, architecture, the training scaffold used for rollouts, the type of reward signal (Outcome vs. Process), and the resolution rate (Res. %) on issue resolution benchmarks.

Model Name	Base Model	Size	Arch.	Train. Scaffold	Reward	Res.(%)	Code	Data	Model
560B Models (MoE)
LongCat-Flash-Think	LongCatFlash-Base	560B-A27B	MoE	R2E-Gym	Outcome	60.4		-
72B Models
Kimi-Dev	Qwen2.5-72B-Base	72B	Dense	BugFixer + TestWriter	Outcome	60.4		-
SWE-RL	Llama-3.3-70B-Instruct	70B	Dense	Agentless-mini	Outcome	41.0		-	-
Multi-turn RL (Nebius)	Qwen2.5-72B-Instruct	72B	Dense	SWE-agent	Outcome	39.0	-	-	-
Agent-RLVR-RM-72B	Qwen2.5-Coder-72B	72B	Dense	Localization + Repair	Outcome	27.8	-	-	-
Agent-RLVR-72B	Qwen2.5-Coder-72B	72B	Dense	Localization + Repair	Outcome	22.4	-	-	-
32B Models
OpenHands Critic	Qwen2.5-Coder-32B	32B	Dense	SWE-Gym	-	66.4		-
KAT-Dev-32B	Qwen3-32B	32B	Dense	-	-	62.4	-	-
SWE-Swiss-32B	Qwen2.5-32B-Instruct	32B	Dense	-	Outcome	60.2
FoldAgent	Seed-OSS-36B-Instruct	36B	Dense	FoldAgent	Process	58.0			-
SeamlessFlow-32B	Qwen3-32B	32B	Dense	SWE-agent	Outcome	45.8		-	-
DeepSWE	Qwen3-32B	32B	Dense	R2E-Gym	Outcome	42.2
SA-SWE-32B	-	32B	Dense	SkyRL-Agent	-	39.4	-	-	-
OpenHands LM v0.1	Qwen2.5-Coder-32B	32B	Dense	SWE-Gym	-	37.2		-
SWE-Dev-32B	Qwen2.5-Coder-32B	32B	Dense	OpenHands	Outcome	36.6		-
Satori-SWE	Qwen2.5-Coder-32B	32B	Dense	Retriever + Code editor	Outcome	35.8
SoRFT-32B	Qwen2.5-Coder-32B	32B	Dense	Agentless	Outcome	30.8	-	-	-
Agent-RLVR-32B	Qwen2.5-Coder-32B	32B	Dense	Localization + Repair	Outcome	21.6	-	-	-
14B Models
Agent-RLVR-14B	Qwen2.5-Coder-14B	14B	Dense	Localization + Repair	Outcome	18.0	-	-	-
SEAlign-14B	Qwen2.5-Coder-14B	14B	Dense	OpenHands	Process	17.7	-	-	-
7-8B Models
SeamlessFlow-8B	Qwen3-8B	8B	Dense	SWE-agent	Outcome	27.4		-	-
SWE-Dev-7B	Qwen2.5-Coder-7B	7B	Dense	OpenHands	Outcome	23.4		-
SoRFT-7B	Qwen2.5-Coder-7B	7B	Dense	Agentless	Outcome	21.4	-	-	-
SWE-Dev-8B	Llama-3.1-8B	8B	Dense	OpenHands	Outcome	18.0		-
SEAlign-7B	Qwen2.5-Coder-7B	7B	Dense	OpenHands	Process	15.0	-	-	-
SWE-Dev-9B	GLM-4-9B	9B	Dense	OpenHands	Outcome	13.6		-
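
The Reward column distinguishes outcome rewards from process rewards. As a rough illustration only, and not the reward used by any model listed above, the sketch below assumes an outcome reward is a binary signal derived from whether the final patch passes the tests, and a process reward aggregates scores assigned to intermediate steps. The function names and the averaging rule are hypothetical choices made for this example.

```python
from typing import Sequence


def outcome_reward(tests_passed: bool) -> float:
    """Outcome reward (assumed form): 1.0 only if the final patch passes the tests."""
    return 1.0 if tests_passed else 0.0


def process_reward(step_scores: Sequence[float]) -> float:
    """Process reward (assumed form): mean of per-step scores, each in [0, 1].

    An empty trajectory receives 0.0 so that the function is always defined.
    """
    if not step_scores:
        return 0.0
    return sum(step_scores) / len(step_scores)
```

Under these assumptions, an outcome reward says nothing about which step of a failed trajectory went wrong, whereas a process reward gives partial credit to trajectories with some useful intermediate steps.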

General Foundation Models

Overview of general foundation models evaluated on issue resolution. The table lists the inference scaffold (e.g., OpenHands, Agentless) used during evaluation to obtain each reported result.

Model Name	Size	Arch.	Inf. Scaffold	Reward	Res.(%)	Code	Model
MiMo-V2-Flash	309B-A15B	MoE	Agentless	Outcome	73.4
KAT-Coder	-	-	Claude Code	Outcome	73.4	-
DeepSeek V3.2	671B-A37B	MoE	Claude Code, RooCode	-	73.1
Kimi-K2-Instruct	1T	MoE	Agentless	Outcome	71.6	-
Qwen3-Coder	480B-A35B	MoE	OpenHands	Outcome	69.6
GLM-4.6	355B-A32B	MoE	OpenHands	Outcome	68.0	-
gpt-oss-120b	116.8B-A5.1B	MoE	Internal tool	Outcome	62.0
MiniMax M2	230B-A10B	MoE	R2E-Gym	Outcome	61.0
gpt-oss-20b	20.9B-A3.6B	MoE	Internal tool	Outcome	60.0
GLM-4.5-Air	106B-A12B	MoE	OpenHands	Outcome	57.6	-	-
MiniMax M1-80k	456B-A45.9B	MoE	Agentless	Outcome	56.0
MiniMax M1-40k	456B-A45.9B	MoE	Agentless	Outcome	55.6
Seed1.5-Thinking	200B-A20B	MoE	-	Outcome	47.0		-
Llama 4 Maverick	400B-A17B	MoE	mini-SWE-agent	Outcome	21.0
Llama 4 Scout	109B-A17B	MoE	mini-SWE-agent	Outcome	9.1
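
Size cells for MoE models in this table pair a total parameter count with an activated parameter count (for example, `309B-A15B`), while other cells give a single figure (`1T`) or no value (`-`). The following is a minimal Python sketch for splitting such cells, assuming that reading of the notation; `ModelSize` and `parse_size` are hypothetical names, and the `A` prefix is treated as optional.

```python
import re
from typing import NamedTuple, Optional


class ModelSize(NamedTuple):
    total_b: float             # total parameters, in billions
    active_b: Optional[float]  # activated parameters, in billions (None if not given)


# Multipliers that convert a unit suffix to billions of parameters.
_UNIT_TO_BILLIONS = {"B": 1.0, "T": 1000.0}

_SIZE_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)([BT])"             # total count, e.g. 309B or 1T
    r"(?:-A?(\d+(?:\.\d+)?)([BT]))?"     # optional activated count, e.g. -A15B
)


def parse_size(text: str) -> Optional[ModelSize]:
    """Parse a Size cell; returns None when the size is not reported ('-')."""
    cleaned = text.strip()
    if cleaned in ("", "-"):
        return None
    match = _SIZE_PATTERN.fullmatch(cleaned)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    total, total_unit, active, active_unit = match.groups()
    total_b = float(total) * _UNIT_TO_BILLIONS[total_unit]
    active_b = None
    if active is not None:
        active_b = float(active) * _UNIT_TO_BILLIONS[active_unit]
    return ModelSize(total_b, active_b)
```

Cells in other notations, such as the `3×14B` entry in the SFT table, are outside the scope of this sketch and raise a `ValueError`.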