A New Era for Clustering Short Text: Making Sense of Millions with LLMs

Human-interpretable clustering of short text using large language models

Authors: Justin K. Miller and Tristram Alexander

In the age of big data, researchers and analysts are often overwhelmed by the scale and complexity of short text datasets: millions of tweets, headlines, reviews, or search queries that are difficult to meaningfully summarize or interpret.

Newly published research by Justin K. Miller and Tristram Alexander highlights how large language models (LLMs) are redefining what’s possible in short text clustering. By using semantic embeddings generated by LLMs, their method reduces sprawling, unstructured datasets to just ten concise, human-readable topics, bridging the gap between machine learning output and human understanding.

Unlike traditional clustering techniques that struggle with sparse and ambiguous text, this approach captures deep semantic patterns, allowing for distinctive and interpretable clusters. More importantly, the study introduces a novel use of generative LLMs to label and validate clusters, providing a transparent and scalable alternative to time-consuming human coding.
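To make the idea concrete, here is a minimal sketch of the embed, cluster, then name pipeline described above. It is an illustration under stated assumptions, not the authors' implementation: `embed` is a hypothetical stand-in that builds a bag-of-words vector where a real pipeline would call an LLM embedding model, `name_cluster` is a hypothetical stand-in that picks the most frequent word where a real pipeline would prompt a generative LLM, and k-means is used only because it is easy to write from scratch (this article does not say which clustering algorithm the paper uses). The six toy texts and the choice of two clusters, rather than ten, are also invented for the example.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "is", "my", "i", "and", "of", "to", "this", "was"}

def tokenize(text):
    """Lowercase, split on whitespace, drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def embed(text, vocab):
    """ASSUMPTION: stand-in for an LLM embedding (unit-length bag of words)."""
    counts = Counter(tokenize(text))
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        )
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def name_cluster(cluster_texts):
    """ASSUMPTION: stand-in for a generative LLM (most frequent word wins)."""
    counts = Counter(w for t in cluster_texts for w in tokenize(t))
    return counts.most_common(1)[0][0]

# Invented toy data standing in for millions of short texts.
texts = [
    "football fan and football coach",
    "proud football dad",
    "football player",
    "coffee lover and coffee roaster",
    "coffee addict",
    "barista coffee first",
]

vocab = sorted({w for t in texts for w in tokenize(t)})
vectors = [embed(t, vocab) for t in texts]
labels = kmeans(vectors, k=2)
names = {
    j: name_cluster([t for t, lab in zip(texts, labels) if lab == j])
    for j in sorted(set(labels))
}

for text, lab in zip(texts, labels):
    print(f"{names[lab]:>8} | {text}")
```

The structure is the point here: every text becomes a vector, nearby vectors are grouped, and each group receives a short human-readable name that a person can check against the texts inside it.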

“The insights gained from this study not only demonstrate a way to make clustering more robust and interpretable,” says Data Scientist Justin Miller, “but also introduces new ways of cluster validation. As LLMs continue to evolve, they offer the potential to transform how we approach text clustering and interpretation: making it faster, more accurate, and more aligned with how humans naturally categorise information.”

The most interesting discovery? LLMs can not only create meaningful groupings of millions of texts but also generate cluster names that are comparable to, and sometimes better than, the names humans come up with, offering a new standard for evaluating clustering quality.

This paper is especially relevant for those working with large-scale textual data in fields like policy analysis, media monitoring, customer feedback, and academic research: anyone who wants to cut through the noise and surface interpretable insights from massive datasets.

📄 Read the research paper: https://royalsocietypublishing.org/doi/10.1098/rsos.241692

About the Authors:

Justin Miller

Justin Miller is a data scientist specializing in natural language processing. He is pursuing a PhD in social media data science, where he is using large language models to explore the complex relationships between identity and behavior on social media platforms. With a background in English literature, psychology, and data science, Justin brings a multidisciplinary approach to his research and is dedicated to advancing our understanding of this rapidly evolving field. He is supervised by Tristram Alexander and Eduardo Altmann.

Tristram Alexander

Tristram Alexander is an Associate Professor in Physics at the University of Sydney. He is a physicist with expertise in the modelling of nonlinear dynamical systems with many interacting elements, including social media dynamics. He has developed a suite of processing tools to identify and analyse communities in Twitter stream data.
