A New Era for Clustering Short Text: Making Sense of Millions with LLMs

Human-interpretable clustering of short text using large language models

Authors: Justin K. Miller and Tristram Alexander

In the age of big data, researchers and analysts are often overwhelmed by the scale and complexity of short text datasets: millions of tweets, headlines, reviews, or search queries that are difficult to meaningfully summarize or interpret.

Newly published research by Justin K. Miller and Tristram Alexander highlights how large language models (LLMs) are redefining what’s possible in short text clustering. By using powerful semantic embeddings generated by LLMs, their method distils sprawling, unstructured datasets into just ten concise, human-readable topics, bridging the gap between machine learning output and human understanding.

Unlike traditional clustering techniques that struggle with sparse and ambiguous text, this approach captures deep semantic patterns, allowing for distinctive and interpretable clusters. More importantly, the study introduces a novel use of generative LLMs to label and validate clusters, providing a transparent and scalable alternative to time-consuming human coding.
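The overall pipeline the paper describes can be sketched in three steps: embed each short text with an LLM, cluster the embeddings, then send representative texts from each cluster to a generative model to obtain a name. The sketch below is illustrative only, not the authors' implementation: the `embed` function is a deterministic stand-in for a real LLM embedding model, the k-means routine is a minimal version of a standard clustering step, and the "labelling" step simply selects the texts a generative model would be prompted with.

```python
import numpy as np

def embed(texts, dim=64):
    """Stand-in for an LLM embedding model (a real pipeline would call a
    sentence encoder or an embeddings API). Returns unit vectors keyed on
    the text so the sketch runs without any model."""
    vectors = []
    for t in texts:
        rng = np.random.default_rng(abs(hash(t)) % (2**32))
        v = rng.normal(size=dim)
        vectors.append(v / np.linalg.norm(v))
    return np.array(vectors)

def kmeans(X, k, iters=50, seed=0):
    """Minimal k-means over the embedding matrix X (n_samples x dim)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid, then re-estimate.
        labels = np.argmin(((X[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids

def representative_texts(texts, X, labels, centroids, cluster, top_n=3):
    """Texts nearest a cluster's centroid: these are what a generative LLM
    would be prompted with to produce a human-readable cluster name."""
    idx = np.where(labels == cluster)[0]
    dists = ((X[idx] - centroids[cluster]) ** 2).sum(-1)
    return [texts[i] for i in idx[np.argsort(dists)][:top_n]]

# Toy corpus standing in for millions of tweets, headlines, or queries.
texts = [f"short text number {i}" for i in range(100)]
X = embed(texts)
labels, centroids = kmeans(X, k=10)
print(len(set(labels.tolist())), "clusters found")
```

In the published method the final step replaces `representative_texts` output with a prompt to a generative LLM asking for a concise topic name; the specific models, prompts, and validation procedure are described in the paper itself.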

“The insights gained from this study not only demonstrate a way to make clustering more robust and interpretable,” says Data Scientist Justin Miller, “but also introduce new ways of cluster validation. As LLMs continue to evolve, they offer the potential to transform how we approach text clustering and interpretation: making it faster, more accurate, and more aligned with how humans naturally categorise information.”

The most interesting discovery? That LLMs can not only create meaningful groupings of millions of texts but also generate cluster names that rival, and sometimes surpass, the names humans come up with, offering a new standard for evaluating clustering quality.

This paper is especially relevant for those working with large-scale textual data in fields like policy analysis, media monitoring, customer feedback, and academic research: anyone who wants to cut through the noise and surface interpretable insights from massive datasets.

📄 Read the research paper: https://royalsocietypublishing.org/doi/10.1098/rsos.241692

About the Authors:

Justin Miller

Justin Miller is a data scientist specializing in natural language processing. He is pursuing a PhD in social media data science, where he is using large language models to explore the complex relationships between identity and behavior on social media platforms. With a background in English literature, psychology, and data science, Justin brings a multidisciplinary approach to his research, and is dedicated to advancing our understanding of this rapidly evolving field. He is supervised by Tristram Alexander and Eduardo Altmann.

Tristram Alexander

Tristram Alexander is an Associate Professor in Physics at the University of Sydney. He is a physicist with expertise in the modelling of nonlinear dynamical systems with many interacting elements, including social media dynamics. He has developed a suite of processing tools to identify and analyse communities in Twitter stream data.
