MyPCBench

A reproducible Linux-desktop benchmark for personally intelligent computer-use agents — one canonical persona, 17 cross-consistent web apps, 184 tasks graded by an LLM judge.

Lawrence Keunho Jang, Andrew Keunwoo Jang, Jing Yu Koh, Ruslan Salakhutdinov

Carnegie Mellon University

Figure 1. Overview of MyPCBench. 17 pre-logged-in web apps plus the full LibreOffice suite. The current polished OSS image tracks the daily latest release artifact; archived baseline scores remain tied to eval-round0.

Apps: 17 web applications

Tasks: 184, six behavioural types

Records: 42K cross-linked entries

Models: 6, closed + open weights

Abstract

Current benchmarks evaluate computer-use agents in impersonal environments. Real personal assistants are expected to work across a user's whole digital life — banking, travel, food delivery, calendar, messaging — most of which sits behind a login. MyPCBench is a reproducible Linux-desktop benchmark seeded end-to-end with one user's identity, history, and logged-in accounts, so standard desktop-agent loops can be evaluated on tasks that require knowing who the user is.

The 184 tasks are rewritten from 2,749 anonymized requests collected from the OpenClaw Discord, with named entities re-aligned to the persona's seeded data. The strongest evaluated agent, Claude Opus 4.6, perfects 55.4% of MyPCBench tasks and 36% of the 7+-application slice; GPT-5.4 mini, Qwen 3.5 35B-A3B, and Qwen 3.5 9B reach 0% perfect on the same slice.

Persona

Michael G. Scott — regional manager of a paper company in Scranton, PA. A deterministic seed populates 17 apps from a single JSON spec.

Any trip, dinner, or client deal leaves correlated records in every application that would plausibly record it. Michael's Philadelphia trip generates a Cheskepdia (Airbnb) booking, two Gringotts (Chase) charges, a HooliCalendar (Google Calendar) block, two Dinoco (Delta) boarding passes, browsing history for “Radisson Blu Warwick”, three Travel-folder emails, and HooliChat (WhatsApp) messages referencing the trip — written together at seed time so they line up at boot.
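As a rough illustration of how a single event can fan out into cross-consistent records, the sketch below expands one hypothetical trip entry into booking, bank, calendar, and email rows that share a confirmation code. The spec fields, app keys, dates, amounts, and the `seed_event` helper are all assumptions made for illustration; they are not MyPCBench's actual schema or seeding code.

```python
import hashlib
import json
import random

# Hypothetical event spec; field names and values are illustrative only.
TRIP = {
    "id": "philadelphia-trip",
    "city": "Philadelphia",
    "start": "2025-03-14",
    "end": "2025-03-16",
    "lodging_usd": 412.50,
}


def seed_event(event: dict, seed: int = 0) -> dict[str, list[dict]]:
    """Expand one event into correlated records keyed by a (hypothetical) app name."""
    # Derive the RNG from the seed and the event id so reruns give identical output.
    digest = hashlib.sha256(f"{seed}:{event['id']}".encode()).hexdigest()
    rng = random.Random(int(digest, 16))
    confirmation = f"BK{rng.randrange(10**6):06d}"
    return {
        "booking": [{"confirmation": confirmation, "city": event["city"],
                     "check_in": event["start"], "check_out": event["end"]}],
        "bank": [{"memo": f"Lodging {confirmation}",
                  "amount": -event["lodging_usd"], "date": event["start"]}],
        "calendar": [{"title": f"Trip to {event['city']}",
                      "start": event["start"], "end": event["end"]}],
        "email": [{"folder": "Travel",
                   "subject": f"Your {event['city']} booking {confirmation}"}],
    }


records = seed_event(TRIP)
print(json.dumps(records, indent=2))
```

Because every record is derived from the same event and the same seeded generator, the confirmation code in the bank memo matches the one in the booking and the email, and a second run with the same seed reproduces the rows exactly.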

Each app is a custom Next.js clone backed by SQLite. Gringotts supports transfers, bill pay, Zelle, and statement downloads. Dinoco generates boarding passes with QR codes; eTaxi routes over 700+ Scranton-area locations with OSRM; TableFind exposes a large reservation inventory with hold-and-release semantics. Across the canonical seed the 17 apps expose 226 distinct database tables and tens of thousands of user-facing rows, with a much larger TableFind availability index for live booking search.

Record type	Count
Bank txns	1,614
Emails	2,493
Calendar evts	112
Chat/work msgs	2,526
Orders/rides/bookings	703
Pages crawled	207
Task PASS	184

Applications

17 apps, 6 SimilarWeb top-level categories, 14 subcategories — every one a full Next.js clone seeded from the persona spec.

Each application is publicly browsable at *.mypcbench.app, in the same seeded world as the benchmark, and is reset daily at 09:00 UTC.

Task suite

184 tasks across 1–19 co-touched apps. 68% are multi-application; 40% span at least two top-level categories.

Type	# Tasks	Representative instruction
Bounded action	64 (35%)	“Zelle Pam a hundred bucks. She covered me last weekend. Check HooliChat first to make sure Pam Beesly is on my contacts, then put a memo on the transfer.”
Multi-step orchestration	48 (26%)	“The Threat Level Midnight Fan Club has been dormant. Peek at the group chat, scroll my LockedIn contacts for Dunder Mifflin folks to recruit, draft them an invitation email, and book a watch party on my calendar for next month.”
Cross-source reconciliation	25 (14%)	“I've got the Jamaica trip AND the Barbados trip booked about four weeks apart. Given my credit-card balance, can I actually afford both, or am I about to max out?”
Aggregation & reporting	23 (12%)	“How much am I sending via Zelle each month, and who's getting the money? Check the most recent two complete calendar months and rank the recipients in a LibreOffice Calc spreadsheet.”
Personal lookup	13 (7%)	“What's my current FlyMiles loyalty tier on Dinoco Airlines, and how many miles do I have in the bank?”
Pattern inference	11 (6%)	“What do I usually tip on food delivery, in dollars and as a percent? I want to set a smart default so I'm not thinking about it every order.”

Figure 3. Left: distribution of tasks by the number of distinct applications they touch. Middle: per-domain task coverage (non-exclusive). Right: behavioural task-type split.

Methodology

A QEMU VM, an OSWorld-compatible harness, and a per-rubric LLM judge over the full trajectory.

Environment & harness

Ships as a rolling Docker/QEMU image (ljang/mypcbench-qemu:latest, also tagged michael_scott) running a real QEMU/KVM Ubuntu 24.04 VM with GNOME Shell, the full LibreOffice suite, and 17 pre-logged-in web apps. Boot-to-ready ~90s; a base snapshot resets between tasks. We model the environment as a partially observable MDP: at each step the agent receives a 1280×800 screenshot plus its running tool-call history and emits an action over the unmodified OSWorld pyautogui surface. Provider-native CUA APIs are mapped onto this surface through a unified translation layer.
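The observe-act loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `StubEnv`, `Observation`, `run_episode`, and `scripted_policy` are hypothetical names, the stub returns empty frames instead of real 1280×800 screenshots, and actions are plain strings of pyautogui code that are logged rather than executed. It is not the benchmark's harness.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_STEPS = 100  # step budget quoted for the main results


@dataclass
class Observation:
    screenshot: bytes                                  # a 1280x800 frame in the benchmark
    history: list[str] = field(default_factory=list)   # running tool-call history


class StubEnv:
    """Hypothetical stand-in for the VM: returns blank frames and logs actions."""

    def __init__(self) -> None:
        self.executed: list[str] = []

    def reset(self) -> bytes:
        self.executed.clear()
        return b""

    def step(self, action: str) -> bytes:
        # A real harness would run the pyautogui snippet inside the VM here.
        self.executed.append(action)
        return b""


def run_episode(env: StubEnv, policy: Callable[[Observation], str]) -> list[str]:
    """Run one task: the agent sees a screenshot plus its history, then acts."""
    obs = Observation(screenshot=env.reset())
    for _ in range(MAX_STEPS):
        action = policy(obs)
        obs.history.append(action)
        if action == "DONE":
            break
        obs.screenshot = env.step(action)
    return obs.history


def scripted_policy(obs: Observation) -> str:
    """Toy policy that replays a fixed script, then signals completion."""
    script = ["pyautogui.click(640, 400)", "pyautogui.typewrite('hello')", "DONE"]
    return script[min(len(obs.history), len(script) - 1)]


print(run_episode(StubEnv(), scripted_policy))
```

The agent never sees the underlying application state, only the latest frame and its own past actions, which is what makes the setting partially observable.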

Grading

Every task ships with its own rubrics — 3 to 13 natural-language criteria authored alongside the task (1,191 items across the suite). The judge runs once per rubric over the full trajectory and returns success or failure for that item; we use gemini-3.1-flash-lite-preview throughout. We report three metrics: perfect rate (every rubric passed; headline), rubric score (per-task average, partial credit), and trajectory efficiency (rubric score per agent step, in percent).
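To make the three metrics concrete, here is a small sketch that computes them from per-rubric verdicts. The `TaskResult` type and function names are hypothetical, the data is a toy example, and the efficiency formula (per-task rubric score in percent divided by the number of agent steps, then averaged over tasks) is one plausible reading of the definition above rather than a confirmed implementation.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    verdicts: list[bool]  # one judge verdict per rubric item
    steps: int            # number of agent steps in the trajectory


def rubric_score(result: TaskResult) -> float:
    """Share of rubric items passed for one task, in percent (partial credit)."""
    return 100.0 * sum(result.verdicts) / len(result.verdicts)


def is_perfect(result: TaskResult) -> bool:
    """A task is perfect only when every rubric item passed."""
    return all(result.verdicts)


def summarize(results: list[TaskResult]) -> dict[str, float]:
    n = len(results)
    return {
        "perfect_rate": 100.0 * sum(is_perfect(r) for r in results) / n,
        "rubric_score": sum(rubric_score(r) for r in results) / n,
        # Assumed reading: rubric score (percent) per agent step, averaged over tasks.
        "trajectory_efficiency": sum(
            rubric_score(r) / max(r.steps, 1) for r in results
        ) / n,
    }


# Toy data: one task passes 3 of 3 rubrics in 10 steps, another 2 of 4 in 20 steps.
toy = [
    TaskResult([True, True, True], steps=10),
    TaskResult([True, False, False, True], steps=20),
]
print(summarize(toy))
```

On the toy data the perfect rate is 50%, while the rubric score is 75%, which shows how partial credit can sit well above the headline number.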

Main results

Six models, each provider's native CUA agent, 100-step budget, the same Gemini judge.

Where the gap opens

The headline gap shows up against task complexity: as soon as a task touches seven or more applications, GPT-5.4 mini, Qwen 35B, and Qwen 9B all drop to 0% perfect, while GPT-5.5 retains just 4.5%. Only the Claude tier and GPT-5.5 perfect any 7+-app task at all, and only Claude Opus 4.6 holds an above-zero rate across every complexity bin.

The per-category heatmap below shows the same pattern by behavioural type: tasks that orchestrate across apps and reconcile across sources collapse first when the agent's context-tracking budget runs out.

Figure 5. Perfect-task rate vs. the number of distinct applications a task touches.
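A breakdown of perfect rate by app count could be computed along these lines. Only the 7+ threshold comes from the text; the other bin edges, the function names, and the toy input are assumptions for illustration.

```python
from collections import defaultdict


def bin_label(n_apps: int) -> str:
    """Map an app count to a complexity bin (only the 7+ edge is from the text)."""
    if n_apps >= 7:
        return "7+"
    if n_apps >= 4:
        return "4-6"
    return "1-3"


def perfect_rate_by_bin(tasks: list[tuple[int, bool]]) -> dict[str, float]:
    """tasks holds (apps_touched, perfect) pairs; returns percent perfect per bin."""
    totals: dict[str, int] = defaultdict(int)
    perfect: dict[str, int] = defaultdict(int)
    for n_apps, ok in tasks:
        label = bin_label(n_apps)
        totals[label] += 1
        perfect[label] += int(ok)
    return {label: 100.0 * perfect[label] / totals[label] for label in totals}


# Toy input, not benchmark data.
toy_tasks = [(1, True), (2, False), (5, True), (7, False), (9, False), (8, True)]
print(perfect_rate_by_bin(toy_tasks))
```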

Per-category breakdown

Rubric-score percentages, models × six behavioural task categories. Green = higher score, red = lower. The decomposition here is the cua-only ablation (computer surface, no bash); the headline aggregates above are the paper's computer+bash run.

Sample trajectories

Fifteen curated trajectories across all six task categories. The rubric checklist on the right lights up at the step the judge cited.

Per-family behaviour

The same 184 tasks elicit three distinct action shapes — bash-heavy Claude, keyboard-first GPT, click-and-move Qwen — and the action shape predicts the failure mode.

Figure 6. Per-model action distribution across all 184 trajectories: shares of total emitted actions per model, grouped into ten categories. With the uniform `computer`+`bash` tool surface, the GPT family actually uses `bash` the most by volume (52%/44%) while Claude balances bash with the visible UI (24% Opus, 16% Sonnet); `str_replace_based_edit_tool` remains Claude-only.

Claude · console-script shortcut

~60% UI actions, ~24% bash on Opus (16% on Sonnet). The Claude pattern is qualitative, not volumetric: Claude uses bash to substitute for the rubric-graded side-effect itself, querying app REST endpoints with curl and emitting DONE without driving the visible UI. Rubrics that require a user-visible change (move a card, save a file) therefore fail even though the correct value was retrieved.

GPT · premature DONE

Despite using the most bash by volume (52%/44%), the GPT family still emits DONE prematurely on tasks that need a visible workflow: the shell answer reads as correct but leaves none of the side-effects the rubric grades. GPT dominates this failure mode (235 of 354 premature-DONE hits across all families).

Qwen · persona-data hallucination

Qwen leads persona-data hallucination (13 of 31 hits): inventing a friend, a route, or a balance that doesn't exist in the seeded environment. Qwen 9B in particular collapses under the dual computer+bash tool schema, with its rubric score falling from 20.2 on the cua-only baseline to 7.0.

What rubrics reward

Even with the tool surface equalised, the three failure modes resolve into specific, named patterns that future agent designs can target. The remaining gap is not about access to shell — it is about policy for picking between shell and UI as a function of what the task needs to leave behind.

Failure mode tally

We tagged every failed-rubric judge explanation. Premature DONE and skipped required apps account for most of the loss.

Premature DONE (354 hits; GPT family)

Agent emits DONE before satisfying every rubric. Concentrates in the GPT family (235 of 354 hits): zero-score GPT/Sonnet trajectories average 22–31 steps before abandonment.

Skipped required app (323 hits; all families)

Agent never visits one of the apps the rubric requires, typically because it commits to a single source (search, email) and never branches across the surface.

Surface error → terminal (129 hits; all families)

A single page error or unexpected modal terminates the trajectory. Agents do not recover the underlying intent.

Partial artifact (Sonnet, GPT)

Output exists but is missing a required field, sheet, or section the rubric calls for.

Hallucinated persona data (Qwen family)

Agent invents persona-specific values (13 of 31 hits in Qwen): a friend, a route, a balance that doesn't exist in the seeded environment.
