If you’ve ever asked ChatGPT to rewrite an email, check your grammar, or summarise an article, you’re not alone. Yet surprisingly, most AI benchmarks rarely test these everyday tasks. Instead, tech companies publish scores based on academic exams, coding puzzles, and logic games—designed to show which model is smartest. But few users ever ask their AI assistant to solve high school math or translate ancient Greek.
In our new study, Evaluating LLM Metrics Through Real-World Capabilities, we argue there’s a clear mismatch between how AI is evaluated and how people actually use it. Benchmarks like MMLU, GPQA, and HumanEval focus on measurable tasks like multiple-choice quizzes or code generation—useful for comparing raw capabilities, but not reflective of everyday use. Common tasks like giving feedback on writing, formatting documents, or summarising reports are largely ignored.
What people actually use AI for
To understand real-world usage, we analysed two large datasets. The first was a survey of over 18,000 Danish workers, which identified the jobs that use AI and the tasks most amenable to automation. The second came from Anthropic, creators of Claude.ai, and included over four million real prompts, each mapped to job tasks from the U.S. Department of Labor’s O*NET database.
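For intuition, here is a minimal sketch of the kind of aggregation involved, assuming a hypothetical prompts.csv in which each row is one prompt already labelled with its O*NET task; the file and column names are illustrative, not the study’s actual data schema.

```python
import pandas as pd

# Hypothetical input: one row per prompt, already mapped to an O*NET task.
prompts = pd.read_csv("prompts.csv")

# Count prompts per task and take the 100 most common tasks.
task_counts = prompts["onet_task"].value_counts()
top_100 = task_counts.head(100)

# Share of all prompts covered by the top 100 tasks.
share = top_100.sum() / task_counts.sum()
print(f"Top 100 tasks cover {share:.1%} of all prompts")
```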
By focusing on the top 100 most common tasks—accounting for more than half of all prompts—we found that every task could be grouped into one or more of six core capabilities:
- Technical Assistance: Helping users solve problems, especially with tech or software
  e.g. “Why won’t my code run?”
- Reviewing Work: Giving feedback or evaluating quality
  e.g. “Check my report for tone and clarity”
- Generation: Creating original content
  e.g. “Write a product description for my shop”
- Summarisation: Condensing content into shorter form
  e.g. “Summarise this article in 3 bullet points”
- Information Retrieval: Looking up facts or background knowledge
  e.g. “Who wrote Macbeth?”
- Data Structuring: Formatting or organising content
  e.g. “Convert this reference list to APA style”
The most frequent capabilities were Technical Assistance (65.1%) and Reviewing Work (58.9%); because a single task can involve more than one capability, these shares sum to more than 100%. Generation, Summarisation, and Information Retrieval were moderately common, while Data Structuring, though important, appeared less often. Notably, neither Reviewing Work nor Data Structuring is tested by any existing benchmark. The short sketch below illustrates how such overlapping shares arise.
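A minimal, purely illustrative example of the multi-label counting: the tasks, capability labels, and prompt counts below are made up for the sketch and are not figures from the study.

```python
from collections import Counter

# Hypothetical mapping from an O*NET task to one or more of the six capabilities.
task_capabilities = {
    "Debug software issues": {"Technical Assistance", "Reviewing Work"},
    "Edit written reports": {"Reviewing Work", "Generation"},
    "Prepare summaries of research": {"Summarisation"},
}

# Hypothetical prompt counts per task.
prompt_counts = {
    "Debug software issues": 50,
    "Edit written reports": 30,
    "Prepare summaries of research": 20,
}

# A prompt contributes to every capability its task involves,
# which is why the resulting shares can sum to more than 100%.
capability_totals = Counter()
for task, capabilities in task_capabilities.items():
    for capability in capabilities:
        capability_totals[capability] += prompt_counts[task]

total_prompts = sum(prompt_counts.values())
for capability, count in capability_totals.most_common():
    print(f"{capability}: {count / total_prompts:.1%}")
```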
Where current benchmarks fall short
We reviewed the benchmarks used in model evaluations by companies including OpenAI, Google, Meta, xAI (Grok), DeepSeek, Anthropic, and Alibaba. Most focus on tasks that are easy to score, like coding tests or factual questions, but neglect more collaborative or conversational uses like editing or formatting.
However, a few benchmarks stand out for aligning better with real-world use. These use open-ended prompts, human raters, and realistic task formats:
- WebDev Arena tests Technical Assistance through practical web development challenges (e.g. “Clone a WhatsApp-style interface”). Models are judged by humans in head-to-head comparisons, offering a more realistic measure of utility than standard coding tests.
- SimpleQA is best for Information Retrieval. It evaluates how well models answer short factual questions in free text—not multiple choice—and covers diverse domains like politics, science, art, and geography.
- FACTS Grounding measures Summarisation by assessing whether AI-generated summaries are accurate and grounded in source documents. Human annotators score the results, ensuring reliability.
- Chatbot Arena: Creative Writing benchmarks Generation through storytelling and creative tasks. Human voters compare model outputs side by side, capturing subjective but meaningful quality differences.
These four benchmarks are the best available for measuring AI’s practical usefulness. But crucially, no major benchmark currently evaluates two of the most commonly used capabilities: Reviewing Work and Data Structuring.
This gap skews development priorities. Companies optimise for what’s easy to benchmark—like coding or factual recall—while overlooking tasks like editing a report or reformatting a citation list. As a result, benchmark scores often give an incomplete picture of a model’s real-world value.
Which models do best?
Using our framework, we evaluated top AI models across the four benchmarked capabilities. As of May 2025, Gemini 2.5 leads in Technical Assistance, Summarisation, and Generation, while GPT-4.5 performs best in factual question answering (Information Retrieval). Newer models generally score higher, a pattern consistent with “recency bias”: companies tune their latest releases to the benchmarks that already exist.
What should benchmarks look like instead?
To close the gap between evaluation and real use, we recommend:
- Multi-turn interactions: Benchmarks should involve back-and-forth dialogue, not just single-turn prompts.
- Human-in-the-loop evaluation: For subjective tasks like reviewing work, human judgment is essential.
- Open and transparent protocols: Prompt sets and rating guidelines should be public and reproducible.
- Tailored benchmarks: Each capability should have its own specialised benchmark—one size does not fit all.
Why this matters
The question isn’t just “How intelligent is this model?” but “How useful is it in real-life tasks?”
Current benchmarks don’t reflect how millions of people use AI: collaboratively, iteratively, and across diverse workflows. If AI is to truly support all professionals—from teachers and marketers to administrators and analysts—we need evaluation methods that reflect that reality.
The future of AI should be measured not by abstract scores, but by how well it helps us get real work done.
Article by Justin Miller, based on collaborative research with Dr Wenjia Tang.