MyPCBench

A reproducible Linux-desktop benchmark for personally intelligent computer-use agents — one canonical persona, 17 cross-consistent web apps, 184 tasks graded by an LLM judge.

Lawrence Keunho Jang, Andrew Keunwoo Jang, Jing Yu Koh, Ruslan Salakhutdinov
Carnegie Mellon University
Figure 1. Overview of MyPCBench: 17 pre-logged-in web apps mirroring real consumer products, plus the full LibreOffice suite. The persona's records (1,671 bank transactions, 1,884 emails, 656 calendar events, 2,526 chats, 481 orders, 10,776 web visits, 35 bookmarks) are cross-linked so that one trip leaves correlated records in every app that would plausibly record it.
Apps: 17 (web applications)
Tasks: 184 (six behavioural types)
Records: 17K (cross-linked entries)
Models: 6 (closed + open weights)

Abstract

Current benchmarks evaluate computer-use agents in impersonal environments. Real personal assistants are expected to work across a user's whole digital life — banking, travel, food delivery, calendar, messaging — most of which sits behind a login. MyPCBench is a reproducible Linux-desktop benchmark seeded end-to-end with one user's identity, history, and logged-in accounts, so the same agent loops OSWorld-style benchmarks already use can finally be pointed at tasks that require knowing who the user is.

The 184 tasks are rewritten from 2,749 anonymized requests collected from the OpenClaw Discord, with named entities re-aligned to the persona's seeded data. The strongest evaluated agent, Claude Opus 4.6, perfects (passes every rubric on) 55.4% of MyPCBench tasks and 36% of the 7+-application slice; GPT-5.5, Qwen 3.5 35B-A3B, and Qwen 3.5 9B reach 0% perfect on the same slice.

Persona

Michael G. Scott — regional manager of a paper company in Scranton, PA. A deterministic seed populates 17 apps from a single JSON spec.

Any trip, dinner, or client deal leaves correlated records in every application that would plausibly record it. Michael's Philadelphia trip generates a Cheskepdia (Airbnb) booking, two Gringotts (Chase) charges, a HooliCalendar (Google Calendar) block, two Dinoco (Delta) boarding passes, browsing history for “Radisson Blu Warwick”, three Travel-folder emails, and HooliChat (WhatsApp) messages referencing the trip — written together at seed time so they line up at boot.
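The cross-linking idea above can be sketched as a fan-out: one event in the persona spec expands into a record for every application that would plausibly log it, and all of those records share the same event id, dates, and amounts. This is a minimal illustration only. The app names come from the text; the spec fields, record shapes, prices, dates, and the `fan_out_trip` helper are hypothetical and are not the benchmark's actual seeding code.

```python
import random
from datetime import date, timedelta


def fan_out_trip(trip: dict, seed: int = 0) -> dict:
    """Expand one trip from a hypothetical persona spec into per-app records."""
    rng = random.Random(seed)  # fixed seed: the same records on every run
    start = date.fromisoformat(trip["start"])
    end = start + timedelta(days=trip["nights"])
    lodging = rng.randint(120, 220) * trip["nights"]  # illustrative prices
    airfare = rng.randint(150, 400)
    link = {"event_id": trip["id"], "city": trip["city"]}  # shared by every record
    return {
        "Cheskepdia": [
            {**link, "check_in": str(start), "check_out": str(end), "total": lodging}
        ],
        "Gringotts": [
            {**link, "merchant": "Cheskepdia", "amount": lodging},
            {**link, "merchant": "Dinoco", "amount": airfare},
        ],
        "HooliCalendar": [{**link, "start": str(start), "end": str(end)}],
        "Dinoco": [
            {**link, "leg": "outbound", "date": str(start)},
            {**link, "leg": "return", "date": str(end)},
        ],
    }


# Illustrative spec entry; the date and id are made up for this sketch.
trip = {"id": "trip-philadelphia", "city": "Philadelphia",
        "start": "2026-03-06", "nights": 2}
records = fan_out_trip(trip, seed=17)
```

Because every record is derived from the same spec entry in one pass, the bank charge and the booking total cannot disagree, which is the property the tasks rely on.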

Each app is a custom Next.js clone backed by SQLite. Gringotts supports transfers, bill pay, Zelle, and statement downloads. Dinoco generates boarding passes with QR codes; eTaxi routes over 700+ Scranton-area locations with OSRM; TableFind exposes 7,686 reservation slots with hold-and-release semantics. Across the canonical seed the 17 apps expose 185 distinct database tables and roughly 18,000 rows of state.

Michael Scott's seeded Linux desktop (live VM, 1280 × 800, GNOME Shell):
Bank transactions: 1,671
Emails: 1,884
Calendar events: 656
Chat messages: 2,526
Orders & rides: 481
Page visits: 10,776
Bookmarks: 35

Applications

17 apps, 6 SimilarWeb top-level categories, 14 subcategories — every one a full Next.js clone seeded from the persona spec.

Each application is also served live at *.mypcbench.app: the same seeded world as the benchmark, publicly browsable, and reset daily at 09:00 UTC.


Task suite

184 tasks across 1–19 co-touched apps. 68% are multi-application; 40% span at least two top-level categories.

| Type | # Tasks | Representative instruction |
| --- | --- | --- |
| Bounded action | 64 (35%) | “Zelle Pam a hundred bucks. She covered me last weekend. Check HooliChat first to make sure Pam Beesly is on my contacts, then put a memo on the transfer.” |
| Multi-step orchestration | 48 (26%) | “The Threat Level Midnight Fan Club has been dormant. Peek at the group chat, scroll my LockedIn contacts for Dunder Mifflin folks to recruit, draft them an invitation email, and book a watch party on my calendar for next month.” |
| Cross-source reconciliation | 25 (14%) | “I've got the Jamaica trip AND the Barbados trip booked about four weeks apart. Given my credit-card balance, can I actually afford both, or am I about to max out?” |
| Aggregation & reporting | 23 (12%) | “How much am I sending via Zelle each month, and who's getting the money? Check the most recent two complete calendar months and rank the recipients in a LibreOffice Calc spreadsheet.” |
| Personal lookup | 13 (7%) | “What's my current FlyMiles loyalty tier on Dinoco Airlines, and how many miles do I have in the bank?” |
| Pattern inference | 11 (6%) | “What do I usually tip on food delivery, in dollars and as a percent? I want to set a smart default so I'm not thinking about it every order.” |
Figure 3. Task distribution. Left: distribution of tasks by the number of distinct applications they touch. Middle: per-domain task coverage (non-exclusive). Right: behavioural task-type split.

All 184 tasks: tasks.json

Methodology

A QEMU VM, an OSWorld-compatible harness, and a per-rubric LLM judge over the full trajectory.

Environment & harness

Ships as a Docker image (ljang/mypcbench-qemu:v1.2.17-15d012a4) running a real QEMU/KVM Ubuntu 24.04 VM with GNOME Shell, the full LibreOffice suite, and 17 pre-logged-in web apps. Boot-to-ready ~90s; a base snapshot resets between tasks. We model the environment as a partially observable MDP: at each step the agent receives a 1280×800 screenshot plus its running tool-call history and emits an action over the unmodified OSWorld pyautogui surface. Provider-native CUA APIs are mapped onto this surface through a unified translation layer.
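The partially observable loop described above can be sketched as follows. The agent never sees the VM's true state, only a screenshot plus its own call history, and it answers with one action string per step. The `capture`, `execute`, and `policy` callables are hypothetical stand-ins rather than the harness's real interfaces, and treating `"DONE"` as a plain string sentinel is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Observation:
    screenshot: bytes         # one 1280x800 frame; the VM's full state stays hidden
    history: Tuple[str, ...]  # the agent's own tool calls so far


def run_episode(
    capture: Callable[[], bytes],          # hypothetical: grab a screenshot
    execute: Callable[[str], None],        # hypothetical: run one pyautogui command
    policy: Callable[[Observation], str],  # hypothetical: the agent under test
    max_steps: int = 100,
) -> List[str]:
    """Run one task and return the list of emitted actions."""
    history: List[str] = []
    for _ in range(max_steps):
        action = policy(Observation(capture(), tuple(history)))
        history.append(action)
        if action == "DONE":  # the agent declares the task finished
            break
        execute(action)
    return history


# Usage with stubs: click twice, then stop.
executed: List[str] = []
click = "pyautogui.click(10, 20)"
trajectory = run_episode(
    capture=lambda: b"",
    execute=executed.append,
    policy=lambda obs: "DONE" if len(obs.history) >= 2 else click,
)
```

Keeping the policy behind a single callable is what lets a translation layer map different provider APIs onto the same action surface without changing the loop.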

Grading

Every task ships with its own rubrics — 3 to 13 natural-language criteria authored alongside the task (1,191 items across the suite). The judge runs once per rubric over the full trajectory and returns success or failure for that item; we use gemini-3.1-flash-lite-preview throughout. We report three metrics: perfect rate (every rubric passed; headline), rubric score (per-task average, partial credit), and trajectory efficiency (rubric score per agent step, in percent).
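A minimal sketch of how the three metrics relate, assuming each task yields a list of per-rubric pass/fail verdicts and a step count. The function names and the input layout are hypothetical. In particular, averaging per-task efficiency across the suite is one plausible reading of "rubric score per agent step", not a confirmed detail of the benchmark's aggregation.

```python
from typing import Dict, List, Tuple


def task_metrics(rubric_passed: List[bool], steps: int) -> Dict[str, float]:
    """Metrics for one trajectory; steps must be at least 1."""
    score = sum(rubric_passed) / len(rubric_passed)  # partial credit in [0, 1]
    return {
        "perfect": float(all(rubric_passed)),  # 1.0 only if every rubric passed
        "rubric_score": score,
        "efficiency": 100.0 * score / steps,   # rubric score per step, in percent
    }


def suite_metrics(tasks: List[Tuple[List[bool], int]]) -> Dict[str, float]:
    """Average the per-task metrics over the whole suite."""
    per_task = [task_metrics(passed, steps) for passed, steps in tasks]
    n = len(per_task)
    return {
        "perfect_rate": sum(m["perfect"] for m in per_task) / n,
        "rubric_score": sum(m["rubric_score"] for m in per_task) / n,
        "efficiency": sum(m["efficiency"] for m in per_task) / n,
    }


# Two illustrative tasks: one perfect in 10 steps, one passing 1 of 4 rubrics in 20.
summary = suite_metrics([([True, True, True], 10), ([True, False, False, False], 20)])
```

The example shows why perfect rate and rubric score can diverge: the second task contributes nothing to the perfect rate but still earns a quarter of its rubric credit.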

Main results

Six models, each provider's native CUA agent, 100-step budget, the same Gemini judge.


Where the gap opens

The headline gap shows up against task complexity: as soon as a task touches seven or more applications, GPT-5.5, Qwen 35B, and Qwen 9B all drop to 0% perfect. Only the Claude tier and GPT-5.4 mini perfect any 7+-app task at all, and only Claude Opus 4.6 holds an above-zero rate across every complexity bin.

The per-category heatmap below shows the same pattern by behavioural type: tasks that orchestrate across apps and reconcile across sources collapse first when the agent's context-tracking budget runs out.

Figure 5. Perfect-task rate vs. the number of distinct applications a task touches.
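A complexity breakdown of this kind can be computed by binning tasks on the number of apps they touch and taking the perfect rate within each bin. The sketch below assumes results arrive as `(n_apps, perfect)` pairs. Only the 7+ bin is named in the text; the other bin edges and the `perfect_rate_by_apps` helper are hypothetical.

```python
import bisect
from typing import Dict, Iterable, Optional, Sequence, Tuple


def perfect_rate_by_apps(
    results: Iterable[Tuple[int, bool]],
    edges: Sequence[int] = (1, 2, 4, 7),  # lower bound of each bin; last is open-ended
) -> Dict[str, Optional[float]]:
    """Perfect rate per apps-touched bin; None marks an empty bin."""
    totals = [0] * len(edges)
    perfect = [0] * len(edges)
    for n_apps, is_perfect in results:
        if n_apps < edges[0]:
            raise ValueError(f"n_apps={n_apps} is below the first bin edge")
        i = bisect.bisect_right(edges, n_apps) - 1  # index of the bin containing n_apps
        totals[i] += 1
        perfect[i] += bool(is_perfect)

    def label(i: int) -> str:
        lo = edges[i]
        if i == len(edges) - 1:
            return f"{lo}+"
        hi = edges[i + 1] - 1
        return str(lo) if hi == lo else f"{lo}-{hi}"

    return {
        label(i): (perfect[i] / totals[i] if totals[i] else None)
        for i in range(len(edges))
    }


# Illustrative input, not benchmark data.
rates = perfect_rate_by_apps([(1, True), (1, False), (3, True), (8, False), (12, False)])
```

Returning `None` for empty bins keeps "no tasks in this bin" distinct from "0% perfect", which matters when comparing models on the sparse high-complexity end.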

Per-category breakdown

Rubric-score percentages, models × six behavioural task categories. Green = higher score, red = lower.


Sample trajectories

Twelve curated trajectories. The rubric checklist on the right lights up at the step the judge cited.


Per-family behaviour

The same 184 tasks elicit three distinct action shapes — bash-heavy Claude, keyboard-first GPT, click-and-move Qwen — and the action shape predicts the failure mode.

Figure 6. Per-model action distribution across all 184 trajectories: shares of total emitted actions per model, grouped into ten categories. bash and str_replace_based_edit_tool are Claude-only; Qwen's pipeline emits mouse_move as a separate action where Claude folds movement into the click.

Claude · hybrid surface

Roughly 60% of actions are UI actions and about 24% are bash on Opus (16% on Sonnet). The bash share is the “console-script shortcut”: the agent reads app state via curl without touching the visible UI, so the work is invisible to rubrics that grade on user-visible side effects.

GPT · keyboard-first

28–30% of actions are key presses, 11–13% are explicit waits, and only 8–9% are scrolls. Keyboard flow is efficient, but the family emits DONE prematurely on tasks that need a card moved or a file saved from the menu: the keyboard answer reads as correct without leaving the side effect the rubric grades.

Qwen · click-and-move

70–72% of actions are mouse-related (44–58% click + 11–14% mouse_move), reflecting the OSWorld-style coordinate surface these models were trained on. With no shell access, every persona-data lookup is a UI read, which is consistent with the family's share of persona-data hallucinations (24 of 36 hits).

What rubrics reward

Exposing bash alongside the UI is a measurable design choice: future hybrid CUA agents need a policy for picking between shell and UI as a function of what the task needs to leave behind, not just the answer.

Failure mode tally

We tagged every failed-rubric judge explanation. Skipped required apps and premature DONE account for most of the loss.

Skipped required app (578 hits; all families): The agent never visits one of the apps the rubric requires, typically because it commits to a single source (search, email) and never branches across the surface.

Premature DONE (532 hits; GPT family): The agent emits DONE before satisfying every rubric. Concentrated in GPT-5.5 (307 of 532 hits); zero-score trajectories average 17–20 steps before abandonment.

Surface error → terminal (123 hits; all families): A single page error or unexpected modal terminates the trajectory. Agents do not recover the underlying intent.

Partial artifact (90 hits; Sonnet, GPT): Output exists but is missing a required field, sheet, or section the rubric calls for.

Hallucinated persona data (36 hits; Qwen family): The agent invents persona-specific values (24 of 36 in Qwen), such as a friend, a route, or a balance that doesn't exist in the seeded environment.
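The tally can be turned into shares with a few lines of arithmetic. The counts below are the ones reported above. Using their sum as the denominator is an assumption, since it treats the five tags as one-per-hit and the source does not say whether a single failed rubric can carry more than one tag.

```python
# Counts as reported in the failure mode tally.
FAILURE_COUNTS = {
    "Skipped required app": 578,
    "Premature DONE": 532,
    "Surface error -> terminal": 123,
    "Partial artifact": 90,
    "Hallucinated persona data": 36,
}

total = sum(FAILURE_COUNTS.values())  # assumed denominator: all tagged hits
shares = {name: count / total for name, count in FAILURE_COUNTS.items()}

# Combined share of the two largest modes.
top_two = shares["Skipped required app"] + shares["Premature DONE"]

# Within-mode concentrations quoted in the descriptions.
gpt55_share_of_premature_done = 307 / 532
qwen_share_of_hallucinations = 24 / 36

for name, share in shares.items():
    print(f"{name:<28}{share:6.1%}")
print(f"{'Top two combined':<28}{top_two:6.1%}")
```

Under that assumption the two largest modes together cover a little over four fifths of the 1,359 tagged hits, which matches the statement that they account for most of the loss.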

Citation

@misc{jang2026mypcbench,
  title  = {{MyPCBench}: A Benchmark for Personally Intelligent Computer-Use Agents},
  author = {Jang, Lawrence Keunho and Jang, Andrew Keunwoo and Koh, Jing Yu and Salakhutdinov, Ruslan},
  year   = {2026}
}