A reproducible Linux-desktop benchmark for personally intelligent computer-use agents — one canonical persona, 17 cross-consistent web apps, 184 tasks graded by an LLM judge.
Current benchmarks evaluate computer-use agents in impersonal environments. Real personal assistants are expected to work across a user's whole digital life — banking, travel, food delivery, calendar, messaging — most of which sits behind a login. MyPCBench is a reproducible Linux-desktop benchmark seeded end-to-end with one user's identity, history, and logged-in accounts, so the same agent loops OSWorld-style benchmarks already use can finally be pointed at tasks that require knowing who the user is.
The 184 tasks are rewritten from 2,749 anonymized requests collected from the OpenClaw Discord, with named entities re-aligned to the persona's seeded data. The strongest evaluated agent, Claude Opus 4.6, perfects 55.4% of MyPCBench tasks and 36% of the 7+-application slice; GPT-5.5, Qwen 3.5 35B-A3B, and Qwen 3.5 9B reach 0% perfect on the same slice.
Michael G. Scott — regional manager of a paper company in Scranton, PA. A deterministic seed populates 17 apps from a single JSON spec.
Any trip, dinner, or client deal leaves correlated records in every application that would plausibly record it. Michael's Philadelphia trip generates a Cheskepdia (Airbnb) booking, two Gringotts (Chase) charges, a HooliCalendar (Google Calendar) block, two Dinoco (Delta) boarding passes, browsing history for “Radisson Blu Warwick”, three Travel-folder emails, and HooliChat (WhatsApp) messages referencing the trip — written together at seed time so they line up at boot.
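The fan-out described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the spec layout (`events`, `touches`), the helper names `record_id` and `fan_out`, and the hash-derived identifiers are hypothetical and are not the benchmark's actual seed schema. The point is only that one event entry expands into correlated rows across several apps, and that the expansion is deterministic.

```python
import hashlib
import json


def record_id(seed: str, app: str, event: str, index: int) -> str:
    """Derive a stable identifier so the same spec always yields the same rows."""
    digest = hashlib.sha256(f"{seed}:{app}:{event}:{index}".encode()).hexdigest()
    return digest[:12]


def fan_out(spec: dict, seed: str = "canonical") -> dict[str, list[dict]]:
    """Expand each event into one record per (app, kind) pair listed on it."""
    tables: dict[str, list[dict]] = {}
    for event in spec["events"]:
        for i, (app, kind) in enumerate(event["touches"]):
            tables.setdefault(app, []).append({
                "id": record_id(seed, app, event["name"], i),
                "kind": kind,
                "event": event["name"],  # shared key that ties the records together
            })
    return tables


# Hypothetical spec fragment: a subset of the Philadelphia trip's footprint.
spec = json.loads("""
{"events": [{"name": "philadelphia-trip",
  "touches": [["Cheskepdia", "booking"], ["Gringotts", "charge"],
              ["Gringotts", "charge"], ["HooliCalendar", "block"]]}]}
""")
tables = fan_out(spec)
```

Because every identifier is derived from the seed string and the spec alone, re-running `fan_out` on the same input reproduces the same rows, which is the property a snapshot-and-reset benchmark needs.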
Each app is a custom Next.js clone backed by SQLite. Gringotts supports transfers, bill pay, Zelle, and statement downloads. Dinoco generates boarding passes with QR codes; eTaxi routes over 700+ Scranton-area locations with OSRM; TableFind exposes 7,686 reservation slots with hold-and-release semantics. Across the canonical seed the 17 apps expose 185 distinct database tables and roughly 18,000 rows of state.
17 apps, 6 SimilarWeb top-level categories, 14 subcategories — every one a full Next.js clone seeded from the persona spec.
Each card opens the live application in a new tab at *.mypcbench.app: the same seeded world as the benchmark, publicly browsable, reset daily at 09:00 UTC.
184 tasks across 1–19 co-touched apps. 68% are multi-application; 40% span at least two top-level categories.
| Type | # Tasks | Representative instruction |
|---|---|---|
| Bounded action | 64 (35%) | “Zelle Pam a hundred bucks. She covered me last weekend. Check HooliChat first to make sure Pam Beesly is on my contacts, then put a memo on the transfer.” |
| Multi-step orchestration | 48 (26%) | “The Threat Level Midnight Fan Club has been dormant. Peek at the group chat, scroll my LockedIn contacts for Dunder Mifflin folks to recruit, draft them an invitation email, and book a watch party on my calendar for next month.” |
| Cross-source reconciliation | 25 (14%) | “I've got the Jamaica trip AND the Barbados trip booked about four weeks apart. Given my credit-card balance, can I actually afford both, or am I about to max out?” |
| Aggregation & reporting | 23 (12%) | “How much am I sending via Zelle each month, and who's getting the money? Check the most recent two complete calendar months and rank the recipients in a LibreOffice Calc spreadsheet.” |
| Personal lookup | 13 (7%) | “What's my current FlyMiles loyalty tier on Dinoco Airlines, and how many miles do I have in the bank?” |
| Pattern inference | 11 (6%) | “What do I usually tip on food delivery, in dollars and as a percent? I want to set a smart default so I'm not thinking about it every order.” |
A QEMU VM, an OSWorld-compatible harness, and a per-rubric LLM judge over the full trajectory.
Ships as a Docker image (ljang/mypcbench-qemu:v1.2.17-15d012a4) running a real QEMU/KVM Ubuntu 24.04 VM with GNOME Shell, the full LibreOffice suite, and 17 pre-logged-in web apps. Boot-to-ready ~90s; a base snapshot resets between tasks. We model the environment as a partially observable MDP: at each step the agent receives a 1280×800 screenshot plus its running tool-call history and emits an action over the unmodified OSWorld pyautogui surface. Provider-native CUA APIs are mapped onto this surface through a unified translation layer.
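The observation-action loop can be sketched as follows. This is a simplified illustration, not the MyPCBench harness: `StubEnv`, `Observation`, `run_episode`, and `scripted_policy` are hypothetical names, the stub returns placeholder bytes instead of real 1280×800 screenshots, and actions are represented as pyautogui-style strings that are never executed. The 100-step default mirrors the budget used in the evaluation.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Observation:
    screenshot: bytes    # stands in for the 1280x800 frame
    history: list[str]   # actions the agent has emitted so far


@dataclass
class StubEnv:
    """Stand-in for the VM; a real harness would execute actions inside QEMU."""
    executed: list[str] = field(default_factory=list)

    def reset(self) -> bytes:
        self.executed.clear()  # analogous to restoring the base snapshot
        return b"frame-0"

    def step(self, action: str) -> bytes:
        self.executed.append(action)
        return f"frame-{len(self.executed)}".encode()


Policy = Callable[[Observation], str]


def run_episode(env: StubEnv, policy: Policy, max_steps: int = 100) -> list[str]:
    """Feed screenshot plus history to the policy until it emits DONE or the budget ends."""
    history: list[str] = []
    frame = env.reset()
    for _ in range(max_steps):
        action = policy(Observation(frame, list(history)))
        history.append(action)
        if action == "DONE":  # terminal action; nothing is sent to the environment
            break
        frame = env.step(action)
    return history


def scripted_policy(obs: Observation) -> str:
    """Toy policy that replays a fixed script, indexed by how many actions came before."""
    script = ["pyautogui.click(640, 400)", "pyautogui.typewrite('hello')", "DONE"]
    return script[len(obs.history)]


env = StubEnv()
trajectory = run_episode(env, scripted_policy)
```

A provider-native agent would replace `scripted_policy`; the translation layer mentioned above would sit between the provider's action format and the strings passed to `step`.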
Every task ships with its own rubrics — 3 to 13 natural-language criteria authored alongside the task (1,191 items across the suite). The judge runs once per rubric over the full trajectory and returns success or failure for that item; we use gemini-3.1-flash-lite-preview throughout. We report three metrics: perfect rate (every rubric passed; headline), rubric score (per-task average, partial credit), and trajectory efficiency (rubric score per agent step, in percent).
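As a worked illustration of the three metrics, the sketch below computes them from per-rubric verdicts. The `TaskResult` container and the function names are hypothetical, and the aggregation for trajectory efficiency (per-task rubric score in percent divided by that task's step count, then averaged over tasks) is one plausible reading of "rubric score per agent step, in percent", not a confirmed definition.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Judge output for one task (hypothetical container, not the benchmark's schema)."""
    rubric_passed: list[bool]  # one success/failure verdict per rubric item
    steps: int                 # agent steps taken on this task


def rubric_score(result: TaskResult) -> float:
    """Fraction of this task's rubric items that passed (partial credit)."""
    return sum(result.rubric_passed) / len(result.rubric_passed)


def summarize(results: list[TaskResult]) -> dict[str, float]:
    n = len(results)
    # Perfect rate: share of tasks where every rubric item passed.
    perfect = sum(all(r.rubric_passed) for r in results) / n
    # Rubric score: per-task pass fraction, averaged over tasks.
    score = sum(rubric_score(r) for r in results) / n
    # Trajectory efficiency (assumed form): percent rubric score per step, averaged.
    efficiency = sum(100.0 * rubric_score(r) / max(r.steps, 1) for r in results) / n
    return {
        "perfect_rate": perfect,
        "rubric_score": score,
        "trajectory_efficiency": efficiency,
    }


# Toy data: one fully passed task and one half-passed task.
results = [
    TaskResult([True, True, True], steps=20),
    TaskResult([True, False, False, True], steps=50),
]
summary = summarize(results)
```

On this toy data only the first task counts toward the perfect rate, while the second still contributes half credit to the rubric score, which is why the two metrics can diverge sharply on long multi-app tasks.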
Six models, each provider's native CUA agent, 100-step budget, the same Gemini judge.
The headline gap shows up against task complexity: as soon as a task touches seven or more applications, GPT-5.5, Qwen 35B, and Qwen 9B all drop to 0% perfect. Only the Claude tier and GPT-5.4 mini perfect any 7+-app task at all, and only Claude Opus 4.6 holds an above-zero rate across every complexity bin.
The per-category heatmap below shows the same pattern by behavioural type: tasks that orchestrate across apps and reconcile across sources collapse first when the agent's context-tracking budget runs out.
Rubric-score percentages, models × six behavioural task categories. Green = higher score, red = lower.
Twelve curated trajectories. The rubric checklist on the right lights up at the step the judge cited.
The same 184 tasks elicit three distinct action shapes — bash-heavy Claude, keyboard-first GPT, click-and-move Qwen — and the action shape predicts the failure mode.
bash and str_replace_based_edit_tool are Claude-only; Qwen's pipeline emits mouse_move as a separate action where Claude folds movement into the click.
Claude: ~60% UI actions, ~24% bash on Opus (16% on Sonnet). The bash share is the “console-script shortcut”: the agent reads app state via curl without driving the visible UI, so the work is invisible to rubrics that grade user-visible side effects.
GPT: 28–30% key actions, 11–13% explicit wait, only 8–9% scroll. Keyboard flow is efficient, but the family declares DONE prematurely on tasks that need a card moved or a file saved from the menu: the typed answer reads as correct without leaving the side effect the rubric grades.
Qwen: 70–72% of actions are mouse-related (44–58% click + 11–14% mouse_move), reflecting the OSWorld-style coordinate surface these models were trained on. With no shell access, every persona-data lookup is a UI read, which is consistent with the family's persona-data hallucination rate (24 of 36 hits).
Exposing bash alongside the UI is a measurable design choice: future hybrid CUA agents need a policy for picking between shell and UI as a function of what the task needs to leave behind, not just the answer.
We tagged every failed-rubric judge explanation. Skipped required apps and premature DONE account for most of the loss.
@misc{jang2026mypcbench,
title = {{MyPCBench}: A Benchmark for Personally Intelligent Computer-Use Agents},
author = {Jang, Lawrence Keunho and Jang, Andrew Keunwoo and Koh, Jing Yu and Salakhutdinov, Ruslan},
year = {2026}
}