Carnegie Mellon University · a benchmark for personally intelligent computer-use agents · 2026

MyPCBench

A benchmark that evaluates computer-use agents on a desktop that already belongs to someone. One reproducible Linux machine carries a single person's digital life: 17 web apps he is logged into, 42,000 records that reference each other, and 184 tasks an assistant can only finish by knowing him.

Lawrence Keunho Jang, Andrew Keunwoo Jang, Jing Yu Koh, Ruslan Salakhutdinov

Carnegie Mellon University

Fig. 1. Michael Scott's desktop, drawn from its own task file. Every card is a live app with its task count. A line joins two apps when a graded task touches both; heavier ink means more shared tasks. The blue dots are records on the move: one booked trip writes itself into the airline, the bank, the calendar, and the inbox.

17 applications · 184 tasks · 1,191 rubric items · 42,000 seeded records · 6 models evaluated

Current benchmarks evaluate computer-use agents in impersonal environments. Real personal assistants are expected to work across a user's whole digital life (banking, travel, food delivery, calendar, messaging), and most of it sits behind a login. MyPCBench is a reproducible Linux-desktop benchmark seeded end-to-end with one user's identity, history, and logged-in accounts, so standard desktop-agent loops can be evaluated on tasks that require knowing who the user is.

The 184 tasks are rewritten from 2,749 anonymized requests collected from the OpenClaw Discord, with named entities re-aligned to the persona's seeded data. The strongest evaluated agent, Claude Opus 4.6, perfects 55.4% of MyPCBench tasks and 36% of the 7+-application slice; GPT-5.5, Qwen 3.5 35B-A3B, and Qwen 3.5 9B reach 0% perfect on the same slice.

The persona

Michael G. Scott, regional manager of a paper company in Scranton, PA. A deterministic seed populates all 17 applications from a single JSON spec.

Any trip, dinner, or client deal leaves correlated records in every application that would plausibly record it. Michael's Philadelphia trip generates a Cheskepdia booking, two Gringotts charges, a HooliCalendar block, two Dinoco boarding passes, browsing history for “Radisson Blu Warwick”, three Travel-folder emails, and HooliChat messages referencing the trip, all written together at seed time so they line up at boot.
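One way to picture seed-time correlation is a single event fanning out into rows for every app that would plausibly record it. The sketch below is illustrative only: the `fan_out` helper, the record fields, the shared `ref` key, and the date are hypothetical and are not the benchmark's seeding code or schema. Only the app names and per-app record counts come from the Philadelphia example above.

```python
import random
from collections import Counter

def fan_out(trip: dict, seed: int = 0) -> list[dict]:
    """Expand one trip into correlated per-app records (hypothetical schema)."""
    rng = random.Random(seed)  # fixed seed, so every run yields identical records
    ref = f"TRIP-{rng.randrange(10**6):06d}"  # shared key that ties the records together
    plan = [  # (app, record kind, count), counts taken from the Philadelphia example
        ("Cheskepdia", "booking", 1),
        ("Gringotts", "charge", 2),
        ("HooliCalendar", "block", 1),
        ("Dinoco", "boarding_pass", 2),
    ]
    return [
        {"app": app, "kind": kind, "ref": ref, "city": trip["city"], "date": trip["date"]}
        for app, kind, count in plan
        for _ in range(count)
    ]

# The date is a placeholder, not a value from the seeded world.
records = fan_out({"city": "Philadelphia", "date": "2026-03-14"})
print(Counter(r["app"] for r in records))
```

Because every record carries the same `ref`, `city`, and `date`, a task that cross-checks the bank against the airline finds rows that agree, which is the property the seed is described as guaranteeing.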

Each app is a custom Next.js clone backed by SQLite. Gringotts supports transfers, bill pay, Zelle, and statement downloads. Dinoco generates boarding passes with QR codes; eTaxi routes over 700+ Scranton-area locations with OSRM; TableFind exposes a large reservation inventory with hold-and-release semantics. Across the canonical seed the 17 apps expose 226 distinct database tables and tens of thousands of user-facing rows.

1,524 bank transactions
1,994 emails
632 calendar events
1,515 chat & work messages
582 orders, rides, bookings
207 pages crawled
184 tasks audited PASS

Seventeen applications

Six SimilarWeb top-level categories, fourteen subcategories. Every app is a full Next.js clone seeded from the persona spec.

Each card opens the live application in a new tab at *.mypcbench.app. It is the same seeded world the benchmark runs on, publicly browsable, reset daily at 09:00 UTC.

The task suite

184 tasks touching anywhere from 1 to 19 applications. 68% are multi-application; 40% span at least two top-level categories.

Task type	Tasks (share of 184)	Representative instruction
Bounded action	64 (35%)	“Zelle Pam a hundred bucks. She covered me last weekend. Check HooliChat first to make sure Pam Beesly is on my contacts, then put a memo on the transfer.”
Multi-step orchestration	48 (26%)	“The Threat Level Midnight Fan Club has been dormant. Peek at the group chat, scroll my LockedIn contacts for Dunder Mifflin folks to recruit, draft them an invitation email, and book a watch party on my calendar for next month.”
Cross-source reconciliation	25 (14%)	“I've got the Jamaica trip AND the Barbados trip booked about four weeks apart. Given my credit-card balance, can I actually afford both, or am I about to max out?”
Aggregation & reporting	23 (12%)	“How much am I sending via Zelle each month, and who's getting the money? Check the most recent two complete calendar months and rank the recipients in a LibreOffice Calc spreadsheet.”
Personal lookup	13 (7%)	“What's my current FlyMiles loyalty tier on Dinoco Airlines, and how many miles do I have in the bank?”
Pattern inference	11 (6%)	“What do I usually tip on food delivery, in dollars and as a percent? I want to set a smart default so I'm not thinking about it every order.”
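As a quick arithmetic check on the table, the per-type counts can be summed and converted to shares. The snippet uses only the counts listed above; the dictionary layout is just a convenience for the example.

```python
# Task counts per behavioural type, copied from the table above.
counts = {
    "Bounded action": 64,
    "Multi-step orchestration": 48,
    "Cross-source reconciliation": 25,
    "Aggregation & reporting": 23,
    "Personal lookup": 13,
    "Pattern inference": 11,
}

total = sum(counts.values())
# Whole-percent shares; Python rounds the exact 12.5 for "Aggregation & reporting" to 12.
shares = {name: round(100 * n / total) for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name:30s} {n:3d}  {shares[name]:2d}%")
print("total tasks:", total)
```

The counts add up to the 184 tasks in the suite, and the rounded shares match the percentages shown in the table.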

Fig. 2. Task distribution. Left: distribution of tasks by the number of distinct applications they touch. Middle: per-domain task coverage (non-exclusive). Right: behavioural task-type split.

Method

A QEMU VM, an OSWorld-compatible harness, and a per-rubric LLM judge over the full trajectory.

Environment & harness

MyPCBench ships as a rolling Docker/QEMU image (ljang/mypcbench-qemu:latest, also tagged michael_scott) running a real QEMU/KVM Ubuntu 24.04 VM with GNOME Shell, the full LibreOffice suite, and 17 pre-logged-in web apps. Boot-to-ready time is about 90 s, and the VM resets to a base snapshot between tasks. We model the environment as a partially observable MDP: at each step the agent receives a 1280×800 screenshot plus its running tool-call history and emits an action over the unmodified OSWorld pyautogui surface. Provider-native CUA APIs are mapped onto this surface through a unified translation layer.
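The observe-act loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the MyPCBench or OSWorld harness: `StubEnv`, `Observation`, `run_episode`, and `scripted_policy` are hypothetical names, the screenshot is an opaque byte string, and actions are plain strings standing in for pyautogui commands. The 100-step default mirrors the step budget used for the leaderboard.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Observation:
    screenshot: bytes  # a 1280x800 frame in the real setup; opaque bytes here
    history: list[str] = field(default_factory=list)  # tool calls emitted so far

class StubEnv:
    """Stand-in for the VM (hypothetical; not the real harness API)."""
    def reset(self) -> bytes:
        return b"frame-0"

    def step(self, action: str) -> bytes:
        return f"frame-after:{action}".encode()

def run_episode(env, policy: Callable[[Observation], str], max_steps: int = 100) -> list[str]:
    """Run one task: observe, act, repeat until DONE or the step budget runs out."""
    history: list[str] = []
    screenshot = env.reset()
    for _ in range(max_steps):
        action = policy(Observation(screenshot, list(history)))
        history.append(action)
        if action == "DONE":
            break
        screenshot = env.step(action)
    return history

def scripted_policy(obs: Observation) -> str:
    """Toy policy that replays a fixed script; a real agent would call a model."""
    script = ["pyautogui.click(640, 400)", "pyautogui.write('hello')", "DONE"]
    return script[len(obs.history)]

print(run_episode(StubEnv(), scripted_policy))
```

The policy only ever sees the latest screenshot and its own action history, which is what makes the setting partially observable: anything not on screen has to be remembered or looked up again.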

Grading

Every task ships with its own rubrics: 3 to 13 natural-language criteria authored alongside the task, 1,191 items in total. The judge runs once per rubric over the full trajectory and returns success or failure for that item; we use gemini-3.1-flash-lite-preview throughout. We report three metrics: perfect rate (every rubric passed; the headline), rubric score (per-task average, partial credit), and trajectory efficiency (rubric score per agent step, in percent).
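To make the three metrics concrete, here is a small sketch that computes them from per-rubric judge verdicts. The input layout (`rubrics` as a list of booleans, `steps` as an integer), the `score` function name, and the toy data are assumptions for illustration. In particular, averaging the per-task ratio of rubric score to step count is one plausible reading of "rubric score per agent step"; the benchmark's exact aggregation may differ.

```python
from statistics import mean

def score(tasks: list[dict]) -> dict:
    """Compute the three reported metrics, each in percent.

    Each task is assumed to carry 'rubrics' (one bool per rubric item, as
    returned by the judge) and 'steps' (number of agent steps, at least 1).
    """
    per_task = [sum(t["rubrics"]) / len(t["rubrics"]) for t in tasks]
    return {
        # A task counts as perfect only if every rubric item passed.
        "perfect_rate": 100 * mean(1.0 if all(t["rubrics"]) else 0.0 for t in tasks),
        # Partial credit: fraction of rubric items passed, averaged over tasks.
        "rubric_score": 100 * mean(per_task),
        # Assumed definition: per-task rubric score divided by steps, averaged.
        "trajectory_efficiency": 100 * mean(s / t["steps"] for s, t in zip(per_task, tasks)),
    }

# Toy data, not benchmark results.
tasks = [
    {"rubrics": [True, True, True, True], "steps": 20},
    {"rubrics": [True, False, False, False], "steps": 50},
]
result = score(tasks)
print(result)
```

In the toy data one of the two tasks is perfect, so the perfect rate is lower than the rubric score; the gap between the two is the partial credit that the headline metric deliberately ignores.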

Leaderboard

Models ranked by perfect-task rate under a 100-step budget and the same Gemini judge. Open for submissions — run your agent and email us the trajectories.

Watch the agents work

Fifteen curated trajectories across all six task categories. The rubric checklist on the right lights up at the step the judge cited.

How they fail

The same 184 tasks pull a different action style out of each model family: Claude uses bash as a shortcut around the UI, GPT leans on the keyboard and the shell, and Qwen clicks and moves. The style predicts how each one fails.

Fig. 4. Per-model action distribution across all 184 trajectories: shares of total emitted actions per model, grouped into ten categories. With the uniform `computer`+`bash` tool surface, the GPT family actually uses `bash` the most by volume (52%/44%) while Claude balances bash with the visible UI (24% Opus, 16% Sonnet); `str_replace_based_edit_tool` remains Claude-only.

Claude · console-script shortcut

~60% UI actions, ~24% bash on Opus (16% on Sonnet). The Claude pattern is qualitative, not volumetric: it uses bash to substitute for the rubric-graded side-effect itself, querying app REST endpoints with curl and calling DONE without driving the visible UI. Rubrics that require a user-visible change (move a card, save a file) fail even though it found the right answer.

GPT · premature DONE

Despite using the most bash by volume (52%/44%), the GPT family still calls DONE early on tasks that need a visible workflow. The shell answer reads correct, but it leaves nothing behind for the rubric to grade. GPT owns this failure mode, with 235 of 354 premature-DONE hits across all families.

Qwen · persona-data hallucination

Qwen drives persona-data hallucination (13 of 31 hits): it invents a friend, a route, or a balance that does not exist in the seeded environment. Qwen 9B in particular collapses under the dual computer+bash tool schema: its rubric score falls from 20.2 on the cua-only baseline to 7.0.

What rubrics reward

Even with the tool surface equalised, the three failure modes resolve into specific, named patterns that future agent designs can target. The remaining gap is not about access to a shell. It is about policy: choosing between shell and UI based on what the task needs to leave behind.

Failure-mode tally

We tagged every failed-rubric judge explanation. Premature DONE and skipped required apps account for most of the loss.

Premature DONE · 354 hits · GPT family
Agent emits DONE before satisfying every rubric. Concentrates in the GPT family (235 of 354 hits): zero-score GPT/Sonnet trajectories average 22–31 steps before abandonment.

Skipped required app · 323 hits · all families
Agent never visits one of the apps the rubric requires, usually because it commits to a single source (search, email) and never branches across the surface.

Surface error → terminal · 129 hits · all families
A single page error or unexpected modal terminates the trajectory. Agents do not recover the underlying intent.

Partial artifact · Sonnet, GPT
Output exists but is missing a required field, sheet, or section the rubric calls for.

Hallucinated persona data · Qwen family
Agent invents persona-specific values (13 of 31 in Qwen): a friend, a route, or a balance that does not exist in the seeded environment.
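The concentration figures in the tally can be turned into shares with a few lines of arithmetic. The snippet uses only the hit counts stated above; the variable names are illustrative. It covers the three failure modes whose totals are listed, so the per-mode shares are relative to those three and would shrink slightly once the remaining modes are included.

```python
# Hit counts for the three failure modes whose totals are listed above.
counted = {
    "Premature DONE": 354,
    "Skipped required app": 323,
    "Surface error -> terminal": 129,
}
counted_total = sum(counted.values())  # 806 hits across these three modes

# Share of each mode among the counted hits.
shares = {name: 100 * n / counted_total for name, n in counted.items()}
for name, pct in shares.items():
    print(f"{name:28s} {pct:5.1f}% of counted hits")

# Within-mode concentration, from the "x of y" figures in the text.
gpt_share_of_premature_done = 100 * 235 / 354
qwen_share_of_hallucination = 100 * 13 / 31
print(f"GPT share of premature DONE:   {gpt_share_of_premature_done:.1f}%")
print(f"Qwen share of hallucinations:  {qwen_share_of_hallucination:.1f}%")
```

Seen this way, GPT accounts for roughly two thirds of premature-DONE hits, while Qwen's share of hallucinated persona data is the largest of any family but still under half.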
