What are people actually doing with AI? A better way to measure performance

If you’ve ever asked ChatGPT to rewrite an email, check your grammar, or summarise an article, you’re not alone. Yet AI benchmarks rarely test these everyday tasks. Instead, tech companies publish scores based on academic exams, coding puzzles, and logic games designed to show which model is smartest. But few users ever ask their AI assistant to solve high school math or translate ancient Greek.

In our new study, Evaluating LLM Metrics Through Real-World Capabilities, we argue there’s a clear mismatch between how AI is evaluated and how people actually use it. Benchmarks like MMLU, GPQA, and HumanEval focus on measurable tasks like multiple-choice quizzes or code generation—useful for comparing raw capabilities, but not reflective of everyday use. Common tasks like giving feedback on writing, formatting documents, or summarising reports are largely ignored.


What people actually use AI for

To understand real-world usage, we analysed two large datasets. The first was a survey of over 18,000 Danish workers, which identified which jobs use AI and which tasks are most automatable. The second came from Anthropic, creators of Claude.ai, and included over four million real prompts. Each prompt had been mapped to job tasks from the U.S. Department of Labor’s O*NET database.

By focusing on the top 100 most common tasks—accounting for more than half of all prompts—we found that every task could be grouped into one or more of six core capabilities:

  • Technical Assistance: Helping users solve problems, especially with tech or software
    e.g. “Why won’t my code run?”
  • Reviewing Work: Giving feedback or evaluating quality
    e.g. “Check my report for tone and clarity”
  • Generation: Creating original content
    e.g. “Write a product description for my shop”
  • Summarisation: Condensing content into shorter form
    e.g. “Summarise this article in 3 bullet points”
  • Information Retrieval: Looking up facts or background knowledge
    e.g. “Who wrote Macbeth?”
  • Data Structuring: Formatting or organising content
    e.g. “Convert this reference list to APA style”

The most frequent capabilities were Technical Assistance (65.1%) and Reviewing Work (58.9%); because a single task can map to more than one capability, these shares sum to more than 100%. Generation, Summarisation, and Information Retrieval were moderately common, while Data Structuring, though important, appeared less often. Yet notably, Reviewing Work and Data Structuring aren’t tested by any existing benchmarks.
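
To make that tally concrete, the sketch below shows one way to compute capability prevalence from a multi-label mapping of tasks to capabilities, in the spirit of our analysis. The task names, prompt shares, and mappings are illustrative placeholders, not the study’s data.

```python
from collections import defaultdict

# Illustrative only: each task carries its share of prompts and the
# capabilities it was mapped to (a task can map to more than one).
tasks = [
    {"task": "Debug a script",          "prompt_share": 0.04, "capabilities": ["Technical Assistance"]},
    {"task": "Edit a report draft",     "prompt_share": 0.03, "capabilities": ["Reviewing Work", "Generation"]},
    {"task": "Summarise meeting notes", "prompt_share": 0.02, "capabilities": ["Summarisation", "Reviewing Work"]},
    {"task": "Reformat references",     "prompt_share": 0.01, "capabilities": ["Data Structuring"]},
]

# Sum prompt share per capability; because tasks are multi-label,
# the resulting percentages can add up to more than 100%.
prevalence = defaultdict(float)
total_share = sum(t["prompt_share"] for t in tasks)
for t in tasks:
    for capability in t["capabilities"]:
        prevalence[capability] += t["prompt_share"]

for capability, share in sorted(prevalence.items(), key=lambda kv: -kv[1]):
    print(f"{capability}: {share / total_share:.1%} of covered prompts")
```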


Where current benchmarks fall short

We reviewed the benchmarks used in model evaluations by companies like OpenAI, Google, Meta, xAI, DeepSeek, Anthropic, and Alibaba. Most focus on tasks that are easy to score—like coding tests or factual questions—but neglect more collaborative or conversational uses like editing or formatting.

However, a few benchmarks stand out for aligning better with real-world use. These use open-ended prompts, human raters, and realistic task formats:

  • WebDev Arena tests Technical Assistance through practical web development challenges (e.g. “Clone a WhatsApp-style interface”). Models are judged by humans in head-to-head comparisons, offering a more realistic measure of utility than standard coding tests.
  • SimpleQA is best for Information Retrieval. It evaluates how well models answer short factual questions in free text—not multiple choice—and covers diverse domains like politics, science, art, and geography.
  • FACTS Grounding measures Summarisation by assessing whether AI-generated summaries are accurate and grounded in source documents. Human annotators score the results, ensuring reliability.
  • Chatbot Arena: Creative Writing benchmarks Generation through storytelling and creative tasks. Human voters compare model outputs side by side, capturing subjective but meaningful quality differences.
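
Arena-style benchmarks such as WebDev Arena and Chatbot Arena aggregate many pairwise human votes into a single rating per model, typically with an Elo-style update. The snippet below is a generic, minimal sketch of that idea using made-up vote data; it is not the arenas’ actual scoring code.

```python
# Minimal Elo-style aggregation of pairwise human votes (illustrative data only).
votes = [
    ("model_a", "model_b"),  # (winner, loser) for one head-to-head comparison
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_c", "model_a"),
]

K = 32  # update step size
ratings = {model: 1000.0 for pair in votes for model in pair}

for winner, loser in votes:
    # Expected probability that the winner beats the loser under current ratings
    expected = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += K * (1 - expected)
    ratings[loser] -= K * (1 - expected)

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```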

These four benchmarks are the best available for measuring AI’s practical usefulness. But crucially, no major benchmark currently evaluates two of the most commonly used capabilities: Reviewing Work and Data Structuring.

This gap skews development priorities. Companies optimise for what’s easy to benchmark—like coding or factual recall—while overlooking tasks like editing a report or reformatting a citation list. As a result, benchmark scores often give an incomplete picture of a model’s real-world value.


Which models do best?

Using our framework, we evaluated top AI models across the four benchmarked capabilities. As of May 2025, Gemini 2.5 leads in Technical Assistance, Summarisation, and Generation, while GPT-4.5 performs best in factual question answering (Information Retrieval). Newer models generally score higher, consistent with a “recency bias”: companies focus their tuning on the benchmarks that already exist.


What should benchmarks look like instead?

To close the gap between evaluation and real use, we recommend:

  • Multi-turn interactions: Benchmarks should involve back-and-forth dialogue, not just single-turn prompts.
  • Human-in-the-loop evaluation: For subjective tasks like reviewing work, human judgment is essential.
  • Open and transparent protocols: Prompt sets and rating guidelines should be public and reproducible.
  • Tailored benchmarks: Each capability should have its own specialised benchmark—one size does not fit all.
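
As a concrete, purely hypothetical illustration of these recommendations, a capability-specific benchmark item might bundle a multi-turn prompt, a published rubric for human raters, and the capability it targets. The schema below is our own sketch, not a format defined in the paper or used by any existing benchmark.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    """One task in a hypothetical capability-specific, human-rated benchmark."""
    capability: str               # e.g. "Reviewing Work" or "Data Structuring"
    turns: list[str]              # multi-turn user prompts, in order
    rubric: list[str]             # public rating criteria for human judges
    reference: str | None = None  # optional source document or gold output

example = BenchmarkItem(
    capability="Reviewing Work",
    turns=[
        "Check my report for tone and clarity.",
        "Now tighten the executive summary to 100 words.",
    ],
    rubric=[
        "Feedback is specific and actionable",
        "Suggested edits preserve the author's meaning",
    ],
)
print(f"{example.capability}: {len(example.turns)} turns, {len(example.rubric)} rubric criteria")
```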

Why this matters

The question isn’t just “How intelligent is this model?” but “How useful is it in real-life tasks?”

Current benchmarks don’t reflect how millions of people use AI: collaboratively, iteratively, and across diverse workflows. If AI is to truly support all professionals—from teachers and marketers to administrators and analysts—we need evaluation methods that reflect that reality.

The future of AI should be measured not by abstract scores, but by how well it helps us get real work done.

Article by Justin Miller, based on collaborative research with Dr Wenjia Tang.

Read the full paper: Evaluating LLM Metrics Through Real-World Capabilities
