A reproducible Linux-desktop benchmark for personally intelligent computer-use agents — one canonical persona, 17 cross-consistent web apps, 184 tasks graded by an LLM judge.
latest release artifact; archived baseline scores remain tied to eval-round0.
Current benchmarks evaluate computer-use agents in impersonal environments. Real personal assistants are expected to work across a user's whole digital life (banking, travel, food delivery, calendar, messaging), most of which sits behind a login. MyPCBench is a reproducible Linux-desktop benchmark seeded end-to-end with one user's identity, history, and logged-in accounts, so standard desktop-agent loops can be evaluated on tasks that require knowing who the user is.
The suite contains 184 tasks rewritten from 2,749 anonymized requests collected from the OpenClaw Discord, with named entities re-aligned to the persona's seeded data. The strongest evaluated agent, Claude Opus 4.6, perfects 55.4% of MyPCBench tasks and 36% of the 7+-application slice; GPT-5.4 mini, Qwen 3.5 35B-A3B, and Qwen 3.5 9B reach 0% perfect on the same slice.
Michael G. Scott — regional manager of a paper company in Scranton, PA. A deterministic seed populates 17 apps from a single JSON spec.
Any trip, dinner, or client deal leaves correlated records in every application that would plausibly record it. Michael's Philadelphia trip generates a Cheskepdia (Airbnb) booking, two Gringotts (Chase) charges, a HooliCalendar (Google Calendar) block, two Dinoco (Delta) boarding passes, browsing history for “Radisson Blu Warwick”, three Travel-folder emails, and HooliChat (WhatsApp) messages referencing the trip — written together at seed time so they line up at boot.
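The cross-consistency idea can be illustrated with a minimal sketch in which one event in a spec fans out into correlated rows for several apps. The spec schema, field names, dates, amounts, and the `fan_out` and `stable_id` helpers are all hypothetical; the actual seed format is not shown here.

```python
import hashlib
import json

# Hypothetical spec: the real persona JSON schema, dates, and amounts are not given.
SPEC = json.loads("""
{
  "persona": "Michael G. Scott",
  "events": [
    {"id": "philly-trip", "kind": "trip", "city": "Philadelphia",
     "start": "2025-03-14", "end": "2025-03-16", "lodging_usd": 412.5}
  ]
}
""")


def stable_id(*parts: str) -> str:
    """Derive a record ID from its inputs so that reseeding yields identical rows."""
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:12]


def fan_out(event: dict) -> dict[str, list[dict]]:
    """Expand one event into rows for each app that would plausibly record it."""
    eid = event["id"]
    return {
        "cheskepdia": [{"id": stable_id(eid, "booking"), "city": event["city"],
                        "check_in": event["start"], "check_out": event["end"],
                        "total_usd": event["lodging_usd"]}],
        "gringotts": [{"id": stable_id(eid, "charge"),
                       "memo": f"Cheskepdia {event['city']}",
                       "amount_usd": -event["lodging_usd"],
                       "date": event["start"]}],
        "hoolicalendar": [{"id": stable_id(eid, "block"),
                           "title": f"Trip: {event['city']}",
                           "start": event["start"], "end": event["end"]}],
    }


def seed(spec: dict) -> dict[str, list[dict]]:
    """Build every app's rows from the single spec, with no randomness involved."""
    rows: dict[str, list[dict]] = {}
    for event in spec["events"]:
        for app, records in fan_out(event).items():
            rows.setdefault(app, []).extend(records)
    return rows


world = seed(SPEC)
```

Because every row is derived from the same event, the booking total, the bank charge, and the calendar block agree by construction rather than by after-the-fact reconciliation.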
Each app is a custom Next.js clone backed by SQLite. Gringotts supports transfers, bill pay, Zelle, and statement downloads. Dinoco generates boarding passes with QR codes; eTaxi routes over 700+ Scranton-area locations with OSRM; TableFind exposes a large reservation inventory with hold-and-release semantics. Across the canonical seed the 17 apps expose 226 distinct database tables and tens of thousands of user-facing rows, with a much larger TableFind availability index for live booking search.
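Hold-and-release semantics on a SQLite-backed reservation table could look roughly like the sketch below. This is a generic pattern and not TableFind's implementation: the `slots` schema, the column names, and the 300-second hold window are assumptions made for illustration.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with one free slot (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE slots (id INTEGER PRIMARY KEY, held_by TEXT, "
        "hold_expires REAL, booked_by TEXT)"
    )
    db.execute("INSERT INTO slots (id) VALUES (1)")
    return db


def hold(db, slot_id: int, user: str, now: float, ttl: float = 300.0) -> bool:
    """Place a temporary hold if the slot is free, expired, or already ours."""
    cur = db.execute(
        "UPDATE slots SET held_by = ?, hold_expires = ? "
        "WHERE id = ? AND booked_by IS NULL "
        "AND (held_by IS NULL OR hold_expires <= ? OR held_by = ?)",
        (user, now + ttl, slot_id, now, user),
    )
    return cur.rowcount == 1


def confirm(db, slot_id: int, user: str, now: float) -> bool:
    """Turn a still-valid hold into a booking."""
    cur = db.execute(
        "UPDATE slots SET booked_by = ?, held_by = NULL, hold_expires = NULL "
        "WHERE id = ? AND held_by = ? AND hold_expires > ? AND booked_by IS NULL",
        (user, slot_id, user, now),
    )
    return cur.rowcount == 1


def release(db, slot_id: int, user: str) -> bool:
    """Give a hold back explicitly so others can take the slot."""
    cur = db.execute(
        "UPDATE slots SET held_by = NULL, hold_expires = NULL "
        "WHERE id = ? AND held_by = ?",
        (slot_id, user),
    )
    return cur.rowcount == 1
```

Expressing each transition as a single conditional `UPDATE` keeps the check and the write in one statement, so two users cannot both hold the same slot.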
17 apps, 6 SimilarWeb top-level categories, 14 subcategories — every one a full Next.js clone seeded from the persona spec.
Each card opens the live application in a new tab at *.mypcbench.app: the same seeded world as the benchmark, publicly browsable, and reset daily at 09:00 UTC.
184 tasks across 1–19 co-touched apps. 68% are multi-application; 40% span at least two top-level categories.
| Type | # Tasks | Representative instruction |
|---|---|---|
| Bounded action | 64 (35%) | “Zelle Pam a hundred bucks. She covered me last weekend. Check HooliChat first to make sure Pam Beesly is on my contacts, then put a memo on the transfer.” |
| Multi-step orchestration | 48 (26%) | “The Threat Level Midnight Fan Club has been dormant. Peek at the group chat, scroll my LockedIn contacts for Dunder Mifflin folks to recruit, draft them an invitation email, and book a watch party on my calendar for next month.” |
| Cross-source reconciliation | 25 (14%) | “I've got the Jamaica trip AND the Barbados trip booked about four weeks apart. Given my credit-card balance, can I actually afford both, or am I about to max out?” |
| Aggregation & reporting | 23 (12%) | “How much am I sending via Zelle each month, and who's getting the money? Check the most recent two complete calendar months and rank the recipients in a LibreOffice Calc spreadsheet.” |
| Personal lookup | 13 (7%) | “What's my current FlyMiles loyalty tier on Dinoco Airlines, and how many miles do I have in the bank?” |
| Pattern inference | 11 (6%) | “What do I usually tip on food delivery, in dollars and as a percent? I want to set a smart default so I'm not thinking about it every order.” |
A QEMU VM, an OSWorld-compatible harness, and a per-rubric LLM judge over the full trajectory.
Ships as a rolling Docker/QEMU image (ljang/mypcbench-qemu:latest, also tagged michael_scott) running a real QEMU/KVM Ubuntu 24.04 VM with GNOME Shell, the full LibreOffice suite, and 17 pre-logged-in web apps. Boot-to-ready ~90s; a base snapshot resets between tasks. We model the environment as a partially observable MDP: at each step the agent receives a 1280×800 screenshot plus its running tool-call history and emits an action over the unmodified OSWorld pyautogui surface. Provider-native CUA APIs are mapped onto this surface through a unified translation layer.
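As a rough illustration of what such a translation layer might do, the sketch below converts a provider-style action into a pyautogui command string. The action dictionary schema (`type`, `x`, `y`, `text`, `keys`, `amount`) is hypothetical and does not correspond to any specific provider's API; only the pyautogui calls themselves (`click`, `write`, `hotkey`, `scroll`) are real.

```python
SCREEN_W, SCREEN_H = 1280, 800  # screenshot size stated for the benchmark


def to_pyautogui(action: dict) -> str:
    """Translate a hypothetical provider-native action into a pyautogui command string."""
    kind = action["type"]
    if kind == "click":
        x, y = action["x"], action["y"]
        # Reject coordinates that fall outside the observed screenshot.
        if not (0 <= x < SCREEN_W and 0 <= y < SCREEN_H):
            raise ValueError(f"click outside screen: ({x}, {y})")
        button = action.get("button", "left")
        return f"pyautogui.click({x}, {y}, button={button!r})"
    if kind == "type":
        return f"pyautogui.write({action['text']!r})"
    if kind == "key":
        keys = ", ".join(repr(k) for k in action["keys"])
        return f"pyautogui.hotkey({keys})"
    if kind == "scroll":
        return f"pyautogui.scroll({action['amount']})"
    raise ValueError(f"unsupported action type: {kind}")


# Example: a provider emits a save shortcut, the harness receives pyautogui code.
command = to_pyautogui({"type": "key", "keys": ["ctrl", "s"]})
```

Emitting strings rather than calling pyautogui directly keeps the translation testable without a display and lets the harness execute the command inside the VM.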
Every task ships with its own rubrics — 3 to 13 natural-language criteria authored alongside the task (1,191 items across the suite). The judge runs once per rubric over the full trajectory and returns success or failure for that item; we use gemini-3.1-flash-lite-preview throughout. We report three metrics: perfect rate (every rubric passed; headline), rubric score (per-task average, partial credit), and trajectory efficiency (rubric score per agent step, in percent).
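The three metrics can be sketched as follows. The `TaskResult` type is hypothetical, and the way trajectory efficiency is aggregated (per-task rubric percentage divided by step count, then averaged over tasks) is an assumption; the text only states "rubric score per agent step, in percent".

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    rubric_passes: list[bool]  # one judge verdict per rubric item
    steps: int                 # agent steps taken on this task


def rubric_score(result: TaskResult) -> float:
    """Fraction of rubric items passed for one task (partial credit)."""
    return sum(result.rubric_passes) / len(result.rubric_passes)


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Aggregate perfect rate, rubric score, and trajectory efficiency, in percent."""
    n = len(results)
    perfect = sum(all(r.rubric_passes) for r in results)
    scores = [100.0 * rubric_score(r) for r in results]
    # Assumed aggregation: per-task score divided by steps, averaged over tasks.
    efficiency = [s / r.steps for s, r in zip(scores, results)]
    return {
        "perfect_rate": 100.0 * perfect / n,
        "rubric_score": sum(scores) / n,
        "trajectory_efficiency": sum(efficiency) / n,
    }


demo = summarize([
    TaskResult([True, True, True], steps=10),
    TaskResult([True, False, False, True], steps=20),
])
```

For the two toy tasks above, `demo` has a perfect rate of 50.0, a rubric score of 75.0, and a trajectory efficiency of 6.25, which shows how a task can earn partial credit without counting toward the headline metric.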
Six models, each provider's native CUA agent, 100-step budget, the same Gemini judge.
The headline gap shows up against task complexity: as soon as a task touches seven or more applications, GPT-5.4 mini, Qwen 35B, and Qwen 9B all drop to 0% perfect, while GPT-5.5 retains just 4.5%. Only the Claude tier and GPT-5.5 perfect any 7+-app task at all, and only Claude Opus 4.6 holds an above-zero rate across every complexity bin.
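A per-bin breakdown of this kind can be computed with a few lines of code. In the sketch below, only the 7+ bin comes from the text; the other bin edges and the toy records are assumptions chosen for illustration.

```python
from collections import defaultdict

# Only the "7+" bin is named in the text; the other edges are illustrative.
BINS = [(1, 1, "1"), (2, 3, "2-3"), (4, 6, "4-6"), (7, None, "7+")]


def bin_label(n_apps: int) -> str:
    """Map a task's co-touched app count to its complexity bin."""
    for low, high, label in BINS:
        if n_apps >= low and (high is None or n_apps <= high):
            return label
    raise ValueError(f"invalid app count: {n_apps}")


def perfect_rate_by_bin(records: list[tuple[int, bool]]) -> dict[str, float]:
    """Compute perfect rate (percent) per bin from (n_apps, is_perfect) pairs."""
    totals: dict[str, int] = defaultdict(int)
    perfect: dict[str, int] = defaultdict(int)
    for n_apps, is_perfect in records:
        label = bin_label(n_apps)
        totals[label] += 1
        perfect[label] += int(is_perfect)
    return {label: 100.0 * perfect[label] / totals[label] for label in totals}


# Toy records, not benchmark data.
rates = perfect_rate_by_bin([(1, True), (1, False), (3, True), (8, False), (12, False)])
```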
The per-category heatmap below shows the same pattern by behavioural type: tasks that orchestrate across apps and reconcile across sources collapse first when the agent's context-tracking budget runs out.
Rubric-score percentages, models × six behavioural task categories. Green = higher score, red = lower. The decomposition here is the cua-only ablation (computer surface, no bash); the headline aggregates above are the paper's computer+bash run.
Fifteen curated trajectories across all six task categories. The rubric checklist on the right lights up at the step the judge cited.
The same 184 tasks elicit three distinct action shapes — bash-heavy Claude, keyboard-first GPT, click-and-move Qwen — and the action shape predicts the failure mode.
On the computer+bash tool surface, the GPT family actually uses bash the most by volume (52%/44%), while Claude balances bash with the visible UI (~60% UI actions; ~24% bash on Opus, 16% on Sonnet); str_replace_based_edit_tool remains Claude-only. The Claude pattern is qualitative, not volumetric: Claude uses bash to substitute for the rubric-graded side effect itself, querying app REST endpoints with curl and declaring DONE without driving the visible UI, so rubrics requiring a user-visible change (move a card, save a file) fail even though the correct value was retrieved.
Despite using the most bash by volume (52%/44%), the GPT family still declares DONE prematurely on tasks that need a visible workflow: the shell answer reads as correct but does not leave the side effect the rubric grades. GPT dominates this failure mode (235 of 354 premature-DONE hits across all families).
Qwen drives persona-data hallucination (13 of 31 hits), inventing a friend, a route, or a balance that doesn't exist in the seeded environment. Qwen 9B in particular collapses under the dual computer+bash tool schema (rubric score 20.2 → 7.0 relative to its cua-only baseline).
Even with the tool surface equalised, the three failure modes resolve into specific, named patterns that future agent designs can target. The remaining gap is not about access to shell — it is about policy for picking between shell and UI as a function of what the task needs to leave behind.
We tagged every failed-rubric judge explanation. Premature DONE and skipped required apps account for most of the loss.
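The per-family shares behind these counts can be reproduced with a small tally. The sketch below uses the hit counts quoted on this page (235 of 354 premature-DONE hits for GPT, 13 of 31 hallucination hits for Qwen); the tag names and the single "other" bucket for the remaining families are assumptions, since the full split is not given.

```python
from collections import Counter

# (family, tag) -> hits. "other" lumps together the families whose split is not reported.
counts = Counter({
    ("GPT", "premature_done"): 235,
    ("other", "premature_done"): 354 - 235,
    ("Qwen", "persona_hallucination"): 13,
    ("other", "persona_hallucination"): 31 - 13,
})


def family_share(counts: Counter, tag: str) -> dict[str, float]:
    """Return each family's share (percent, one decimal) of the hits for one tag."""
    total = sum(n for (_, t), n in counts.items() if t == tag)
    return {
        family: round(100.0 * n / total, 1)
        for (family, t), n in counts.items()
        if t == tag
    }


premature = family_share(counts, "premature_done")
hallucination = family_share(counts, "persona_hallucination")
```

With these inputs, GPT accounts for 66.4% of premature-DONE hits and Qwen for 41.9% of persona-data hallucination hits.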