What are people actually doing with AI? A better way to measure performance

If you’ve ever asked ChatGPT to rewrite an email, check your grammar, or summarise an article, you’re not alone. Yet AI benchmarks rarely test these everyday tasks. Instead, tech companies publish scores based on academic exams, coding puzzles, and logic games, all designed to show which model is smartest. But few users ever ask their AI assistant to solve high school math or translate ancient Greek.

In our new study, Evaluating LLM Metrics Through Real-World Capabilities, we argue there’s a clear mismatch between how AI is evaluated and how people actually use it. Benchmarks like MMLU, GPQA, and HumanEval focus on measurable tasks like multiple-choice quizzes or code generation—useful for comparing raw capabilities, but not reflective of everyday use. Common tasks like giving feedback on writing, formatting documents, or summarising reports are largely ignored.


What people actually use AI for

To understand real-world usage, we analysed two large datasets. The first was a survey of over 18,000 Danish workers, which identified the jobs that use AI and the tasks that are most automatable. The second came from Anthropic, creators of Claude.ai, and included over four million real prompts. Each prompt had been mapped to job tasks from the U.S. Department of Labor’s O*NET database.

By focusing on the top 100 most common tasks—accounting for more than half of all prompts—we found that every task could be grouped into one or more of six core capabilities:

  • Technical Assistance: Helping users solve problems, especially with tech or software
    e.g. “Why won’t my code run?”
  • Reviewing Work: Giving feedback or evaluating quality
    e.g. “Check my report for tone and clarity”
  • Generation: Creating original content
    e.g. “Write a product description for my shop”
  • Summarisation: Condensing content into shorter form
    e.g. “Summarise this article in 3 bullet points”
  • Information Retrieval: Looking up facts or background knowledge
    e.g. “Who wrote Macbeth?”
  • Data Structuring: Formatting or organising content
    e.g. “Convert this reference list to APA style”

The most frequent capabilities were Technical Assistance (65.1%) and Reviewing Work (58.9%). Generation, Summarisation, and Information Retrieval were moderately common. Data Structuring, while important, appeared less often. Yet notably, Reviewing Work and Data Structuring aren’t tested by any existing benchmarks.
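Because a task can belong to more than one capability, the shares above need not sum to 100%. The sketch below illustrates that kind of multi-label tally. It assumes the percentages are shares of tasks tagged with each capability, which this article does not spell out, and the task names and labels are invented for illustration rather than taken from the study's data.

```python
from collections import Counter

# Hypothetical task-to-capability mapping (illustrative only; not the
# study's data). A task may carry more than one capability label.
TASK_CAPABILITIES = {
    "Debug a failing script": {"Technical Assistance"},
    "Review and fix a draft report": {"Reviewing Work", "Generation"},
    "Summarise and reformat meeting notes": {"Summarisation", "Data Structuring"},
    "Explain an error and check the fix": {"Technical Assistance", "Reviewing Work"},
}


def capability_shares(task_capabilities):
    """Return the percentage of tasks that involve each capability."""
    counts = Counter()
    for capabilities in task_capabilities.values():
        counts.update(capabilities)  # one count per label on the task
    total_tasks = len(task_capabilities)
    return {cap: 100 * n / total_tasks for cap, n in counts.items()}


shares = capability_shares(TASK_CAPABILITIES)
for capability, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{capability}: {share:.1f}%")
```

In this toy example two capabilities each cover half the tasks, and the shares add up to more than 100% because of the overlapping labels.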


Where current benchmarks fall short

We reviewed the benchmarks used in model evaluations by companies like OpenAI, Google, Meta, xAI (maker of Grok), DeepSeek, Anthropic, and Alibaba. Most focus on tasks that are easy to score, such as coding tests or factual questions, but neglect more collaborative or conversational uses like editing or formatting.

However, a few benchmarks stand out for aligning better with real-world use. These use open-ended prompts, human raters, and realistic task formats:

  • WebDev Arena tests Technical Assistance through practical web development challenges (e.g. “Clone a WhatsApp-style interface”). Models are judged by humans in head-to-head comparisons, offering a more realistic measure of utility than standard coding tests.
  • SimpleQA is best for Information Retrieval. It evaluates how well models answer short factual questions in free text—not multiple choice—and covers diverse domains like politics, science, art, and geography.
  • Facts-Grounding measures Summarisation by assessing whether AI-generated summaries are accurate and grounded in source documents. Human annotators score the results, ensuring reliability.
  • Chatbot Arena: Creative Writing benchmarks Generation through storytelling and creative tasks. Human voters compare model outputs side by side, capturing subjective but meaningful quality differences.
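Head-to-head human votes of the kind used by WebDev Arena and Chatbot Arena have to be aggregated into a ranking somehow. The sketch below uses a simple Elo-style update as one common way to do that. It is not necessarily how either arena computes its leaderboard, and the model names, starting ratings, K-factor, and votes are all hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(ratings, winner, loser, k=32):
    """Shift ratings after one human vote; k (assumed) controls step size."""
    surprise = 1 - expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * surprise
    ratings[loser] -= k * surprise


# Hypothetical models and votes: each tuple is (preferred model, other model).
ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]

for winner, loser in votes:
    update_elo(ratings, winner, loser)

ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

With these three votes, model_a ranks first and model_c last. An unexpected win moves ratings more than an expected one, which is why pairwise voting can separate models even when raters never assign absolute scores.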

These four benchmarks are the best available for measuring AI’s practical usefulness. But crucially, no major benchmark currently evaluates two of the six capabilities: Reviewing Work, the second most common, and Data Structuring.

This gap skews development priorities. Companies optimise for what’s easy to benchmark—like coding or factual recall—while overlooking tasks like editing a report or reformatting a citation list. As a result, benchmark scores often give an incomplete picture of a model’s real-world value.


Which models do best?

Using our framework, we evaluated top AI models across the four benchmarked capabilities. As of May 2025, Gemini 2.5 leads in Technical Assistance, Summarisation, and Generation. GPT-4.5 performs best in Information Retrieval, measured as factual question answering. Generally, newer models perform better, a sign of “recency bias,” as companies focus their tuning on existing benchmarks.


What should benchmarks look like instead?

To close the gap between evaluation and real use, we recommend:

  • Multi-turn interactions: Benchmarks should involve back-and-forth dialogue, not just single-turn prompts.
  • Human-in-the-loop evaluation: For subjective tasks like reviewing work, human judgment is essential.
  • Open and transparent protocols: Prompt sets and rating guidelines should be public and reproducible.
  • Tailored benchmarks: Each capability should have its own specialised benchmark—one size does not fit all.

Why this matters

The question isn’t just “How intelligent is this model?” but “How useful is it in real-life tasks?”

Current benchmarks don’t reflect how millions of people use AI: collaboratively, iteratively, and across diverse workflows. If AI is to truly support all professionals—from teachers and marketers to administrators and analysts—we need evaluation methods that reflect that reality.

The future of AI should be measured not by abstract scores, but by how well it helps us get real work done.

Article by Justin Miller based on collaborative research with Dr Wenjia Tang

Evaluating LLM Metrics Through Real-World Capabilities
