ForgeCode vs Claude Code: which AI coding agent actually wins?

I’ve been using Claude Code for months. I like it. I genuinely don’t get the Twitter hate. But there’s one thing that’s been driving me crazy: speed. I’ll ask it to rename a variable across three files and it sits there thinking for 40 seconds. A simple test fix on a small repo, and I’m watching a spinner for two minutes. It’s not a deal-breaker, but it’s the kind of friction that builds up over a day.
We recently rolled out Claude Code across our entire engineering org. We’re not ditching Cursor, just giving devs the option to pick whatever tool works for them. And the feedback I kept hearing from people, unprompted: it’s slow. Not everyone, not every task. But enough devs brought it up that it clearly wasn’t just me being impatient.
So I started looking at alternatives. OpenAI has Codex CLI but I haven’t tried the harness yet, just the models. The TermBench 2.0 leaderboard is what caught my eye. ForgeCode at #1 with 81.8%. Claude Code at 58%, ranked #39. I installed ForgeCode that same day.
TL;DR
- ForgeCode with Opus 4.6 was noticeably faster than Claude Code on the same tasks. Not marginal, real.
- ForgeCode topped TermBench 2.0 at 81.8%, but that’s its own benchmark. On the independent SWE-bench, the gap shrinks to 2.4 points.
- GPT 5.4 through ForgeCode was unstable for me. A research task on a small repo took 15 minutes.
- I’m double-dipping now. Claude Code is still primary, but the latency gains on ForgeCode are too real to ignore.
What is ForgeCode (and why the benchmark confusion exists)?
ForgeCode is not an AI model. It’s a model-agnostic agent harness, open source under Apache 2.0, written in Rust, that wraps any LLM through OpenRouter or direct API keys. It launched in late January 2025 and hit v2.8.0 on GitHub by April 2026 with over 6,000 stars.
ForgeCode ships three built-in agents. forge writes and edits code. sage does read-only research and can’t modify files. muse generates plans and writes them to a plans/ directory. It’s Zsh-native, using a : prefix so you never leave your shell.
Here’s the thing that matters for evaluating the benchmark: TermBench 2.0 is ForgeCode’s own benchmark, hosted at tbench.ai. The organization submitting entries is ForgeCode itself. That doesn’t make the results wrong. But it’s not a neutral third party.
Does the benchmark actually hold up?
On SWE-bench Verified, an independent benchmark from Princeton and UChicago, ForgeCode + Claude 4 scored 72.7% compared to Claude 3.7 Sonnet’s 70.3%. A 2.4-point gap, not the 24-point gap TermBench implies. That context changes the whole picture.
The TermBench 2.0 numbers, self-reported by ForgeCode on tbench.ai:
- ForgeCode + GPT 5.4: 81.8%
- ForgeCode + Claude Opus 4.6: 81.8%
- Claude Code + Claude Opus 4.6: 58.0% (rank #39)
The SWE-bench Verified numbers, independent:
- ForgeCode + Claude 4: 72.7%
- Claude 3.7 Sonnet (extended thinking): 70.3%
- Claude 4.5 Opus: 76.8%
So how did ForgeCode reach 81.8%? Their blog documents four specific harness changes. They reordered JSON schema fields, putting required before properties to reduce GPT 5.4 tool-call errors. They flattened nested schemas. They added explicit truncation reminders when files are partially read. And they added a mandatory verification pass where a reviewer skill checks task completion before the agent can stop.
These are real engineering improvements. They’re also benchmark-specific optimizations. The r/ClaudeCode community called it “benchmaxxed,” which is both funny and kind of fair.
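The first two changes (field reordering and schema flattening) are easy to picture. Here's a minimal Python sketch of what they might look like in practice; the helper names and the sample tool schema are my own illustration, not ForgeCode's actual code:

```python
def reorder_schema(schema: dict) -> dict:
    """Emit 'required' before 'properties' so the model sees the
    constraint list first -- the ordering ForgeCode's blog describes."""
    priority = ["type", "required", "properties"]
    ordered = {k: schema[k] for k in priority if k in schema}
    ordered.update({k: v for k, v in schema.items() if k not in ordered})
    return ordered

def flatten_schema(schema: dict, prefix: str = "") -> dict:
    """Collapse nested object properties into dotted top-level keys,
    e.g. {'options': {'create': ...}} becomes {'options.create': ...}."""
    flat = {}
    for name, prop in schema.get("properties", {}).items():
        key = f"{prefix}{name}"
        if prop.get("type") == "object" and "properties" in prop:
            flat.update(flatten_schema(prop, prefix=f"{key}."))
        else:
            flat[key] = prop
    return flat

tool = {
    "properties": {
        "path": {"type": "string"},
        "options": {"type": "object",
                    "properties": {"create": {"type": "boolean"}}},
    },
    "required": ["path"],
    "type": "object",
}

print(list(reorder_schema(tool)))   # ['type', 'required', 'properties']
print(list(flatten_schema(tool)))   # ['path', 'options.create']
```

Trivial transformations, but if a model trips over deeply nested schemas, this is the kind of harness-side fix that moves a benchmark score without touching the model.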
I’ve been eyeing this leaderboard for a while. The numbers are what pushed me to actually try ForgeCode. With Opus 4.6, it was noticeably faster than Claude Code. That part wasn’t hype.
SWE-bench scores went from 1.96% in late 2023 to 76.8% by early 2026. Everything’s getting better fast. The question is whether a 2-point edge on an independent benchmark justifies switching your entire workflow.
What it’s actually like to use ForgeCode
Install is a one-liner: curl -fsSL https://forgecode.dev/cli | sh. Then forge provider login to set up your API keys and you're in. About the same friction as Claude Code. The Zsh plugin is a nice touch: you type : followed by your prompt and it runs inline without switching contexts.
First thing I tried: pointed it at my portfolio repo (Astro 6, maybe 30 files) with Opus 4.6 as the model. I asked it to add a post counter to the blog index page and wire it into the nav component. Claude Code takes about 90 seconds on that kind of task on this repo. ForgeCode did it in under 30. Correct output, clean diff, no hallucinated imports. The speed difference was immediately obvious.

I ran the same kind of test a few more times. A multi-file rename, adding an external link tooltip component, restructuring a layout. ForgeCode with Opus 4.6 was consistently faster. Not by a little. I could feel it in my workflow.
Plan mode was the other thing that stood out. ForgeCode's muse agent writes plans to a plans/ directory, and the output felt noticeably more detailed than Claude Code's plan mode. Whether that's good or bad depends on what you want. I kind of liked having the longer breakdown.
Then I tried GPT 5.4 through ForgeCode, and it fell apart. I asked it to research the architecture of a small repo. Fifteen minutes. Kept going unstable, tool calls failing, the agent retrying and spinning. I killed it. So “ForgeCode is fast” needs a qualifier: ForgeCode with Opus 4.6 is fast. ForgeCode with GPT 5.4 was borderline unusable for me.
But I’ll give them this: the ForgeCode team explicitly says they’ve hired zero paid influencers. The low social media presence is intentional. Kind of respect that. In an industry where half the “honest reviews” have affiliate links in the description, that’s almost suspiciously refreshing.
Why ForgeCode is actually faster
Part of it is just the Rust binary (Claude Code is TypeScript, so startup and memory are heavier). But that’s not the whole story.
ForgeCode has a context engine that indexes function signatures and module boundaries instead of dumping raw files into the context window. The agent pulls only what it needs. Some estimates say this cuts context size by about 90%, which means faster responses, lower token spend, and a model that doesn't lose the plot halfway through a task. That's the real reason the same model (Opus 4.6) responds faster through ForgeCode than through Claude Code.
There’s also a --sandbox flag that creates an isolated git worktree and branch, so you can try something risky without touching your main tree and only merge back what works.
What Claude Code has built around the core loop (parallel agent execution, hooks, scheduled cloud tasks, auto-memory) doesn't exist in ForgeCode yet. The harness is fast. Everything around it is thin. ForgeCode is a Lambo with no cup holder. Fast as hell, but you're holding your coffee between your knees.
What I missed when I wasn’t using Claude Code
I didn’t appreciate this until I spent a few days away from Claude Code: the stuff around the agent matters more than the agent itself.
With Claude Code, I have a CLAUDE.md in every project. My team shares the same project instructions. I have hooks that fire on file changes, so I can run secret scanning, linting, whatever I want on every edit. Auto-memory means I don’t re-explain my codebase every session. And checkpoints mean every file edit gets snapshotted, so if the agent breaks something three steps back, I hit /rewind and roll back without touching git.
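The secret-scanning check I run from a post-edit hook is nothing fancy. A stripped-down sketch of that kind of scanner; the patterns and names here are my own, not any tool's built-in ruleset:

```python
import re

# Naive patterns for the obvious leaks; a real scanner would add
# entropy checks and a much larger ruleset.
SECRET_PATTERNS = [
    (re.compile(r"AKIA[0-9A-Z]{16}"), "AWS access key"),
    (re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"), "private key"),
    (re.compile(r"(?i)(api[_-]?key|secret|token)\s*[:=]\s*['\"][^'\"]{12,}"),
     "hardcoded credential"),
]

def scan_for_secrets(text: str) -> list[str]:
    """Return a list of findings; an empty list means the edit is clean."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern, label in SECRET_PATTERNS:
            if pattern.search(line):
                findings.append(f"line {lineno}: possible {label}")
    return findings

clean = "def connect():\n    return client(region='us-east-1')\n"
leaky = "API_KEY = 'sk_live_abcdef1234567890'\n"

print(scan_for_secrets(clean))  # []
print(scan_for_secrets(leaky))
```

The point isn't the scanner itself. It's that a hook fires it automatically on every edit, so I never have to remember to run it.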
ForgeCode has AGENTS.md (similar idea to CLAUDE.md) and MCP support, so the basics are covered. But no hooks, no checkpoints, no auto-memory, no IDE extensions, no JetBrains plugin. The model-agnostic part is great. The ecosystem is still thin.
For reference, here’s the head-to-head:
| Feature | ForgeCode | Claude Code |
|---|---|---|
| Model choice | Any (300+) | Claude only |
| Open source | Yes (Apache 2.0) | No |
| Language | Rust | TypeScript |
| Project config | AGENTS.md | CLAUDE.md (hierarchical) |
| MCP support | Yes | Yes (extensive) |
| Hooks | No | Yes (6 types) |
| Scheduled tasks | No | Yes (cloud + local) |
| Sub-agents | Yes (forge/sage/muse) | Yes (parallel) |
| Plan mode | Yes | Yes (Shift+Tab) |
| VS Code | No extension | Yes |
| JetBrains | No | Yes |
| Auto memory | No | Yes |
| Checkpoints / rewind | No | Yes |
Where I landed
I’m double-dipping. Claude Code is still my primary tool, but I keep ForgeCode open for tasks where the latency kills me. Sometimes I’ll drop into Cursor for something visual. Three tools is kind of ridiculous, but the latency gains on ForgeCode are real enough that I can’t just ignore them.
Claude Code is where my project config lives, where my hooks fire, where my MCP connections run. That’s my home base and it’s not changing. But when I need something fast and self-contained, a quick refactor, a file rename across a module, something where I don’t need the full ecosystem, I’ll run it through ForgeCode with Opus 4.6 and it’s done before Claude Code would’ve finished reading the context.
As of April 2026, ForgeCode is faster than Claude Code when running the same model (Opus 4.6), but Claude Code has the deeper ecosystem with hooks, MCP, auto-memory, and IDE integrations. Neither wins across the board. Pick the one that matches how you work and be ready to use both.
Frequently asked questions
Is ForgeCode’s TermBench #1 score legitimate?
TermBench is ForgeCode’s own benchmark. On SWE-bench Verified, an independent benchmark from Princeton, ForgeCode + Claude 4 scored 72.7% compared to Claude 3.7 Sonnet’s 70.3%. Solid, but not the 24-point gap TermBench suggests.
Can ForgeCode use my existing Claude or ChatGPT subscription?
No. You need API keys, not a subscription login. Separate billing from whatever you pay for Claude Pro or ChatGPT Plus.
Does ForgeCode burn more tokens than Claude Code?
Nobody’s published hard numbers. ForgeCode’s multi-agent setup (forge/sage/muse spawning sub-agents) almost certainly burns more tokens per session. I noticed it anecdotally but didn’t measure. Track your own spend if you try it.
Is ForgeCode safe for proprietary code?
The harness is open source, but default telemetry collects git user emails, scans SSH directories, and sends conversation data externally. GitHub issue #1318 raised data transparency concerns. The team addressed it in March 2025: set FORGE_TRACKER=false to disable all tracking.
Is ForgeCode free?
The code is free and open source (Apache 2.0). The hosted service was originally unlimited, but switched to a tiered model in mid-2025 with daily request caps on the free tier.
ForgeCode’s benchmark lead exists on a test it runs itself. On independent benchmarks, it’s comparable. The speed with Opus 4.6 is real. The GPT 5.4 experience was rough.
I didn’t expect to end up running two coding agents. But here I am. If ForgeCode ships hooks and the ecosystem catches up, that could change. For now, I’m using both, and it’s working.
Sources:
- ForgeCode GitHub Repository - GitHub, April 2026
- TermBench 2.0 Leaderboard - tbench.ai, 2026
- SWE-bench Verified Leaderboard - Princeton/UChicago, 2026
- Claude Code Documentation - Anthropic, 2026
- Anthropic Claude 3.7 Sonnet Announcement - Anthropic, February 2025