A New Era for Clustering Short Text: Making Sense of Millions with LLMs

Human-interpretable clustering of short text using large language models

Authors: Justin K. Miller and Tristram Alexander

In the age of big data, researchers and analysts are often overwhelmed by the scale and complexity of short text datasets: millions of tweets, headlines, reviews, or search queries that are difficult to meaningfully summarize or interpret.

Newly published research by Justin K. Miller and Tristram Alexander highlights how large language models (LLMs) are redefining what’s possible in short text clustering. By using semantic embeddings generated by LLMs, their method reduces sprawling, unstructured datasets to just ten concise, human-readable topics, bridging the gap between machine learning output and human understanding.

Unlike traditional clustering techniques that struggle with sparse and ambiguous text, this approach captures deep semantic patterns, allowing for distinctive and interpretable clusters. More importantly, the study introduces a novel use of generative LLMs to label and validate clusters, providing a transparent and scalable alternative to time-consuming human coding.
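For readers who want a feel for the pipeline described above (embed each text, cluster the embeddings, then name each cluster), here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: this article does not say which embedding model or clustering algorithm the study used, so `embed` is a toy hashed bag-of-words stand-in for an LLM embedding, `kmeans` is a plain k-means chosen for simplicity, and `name_cluster` is a hypothetical placeholder for the generative-LLM naming step. The demo uses two clusters on four made-up headlines; the study reduces its data to ten topics.

```python
import math
import random
import zlib
from collections import Counter


def embed(text, dim=64):
    """Toy stand-in for an LLM embedding (assumption): hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid dividing by zero for empty text
    return [x / norm for x in vec]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means (assumed algorithm); returns one cluster index per input vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def name_cluster(members, top_n=2):
    """Hypothetical placeholder for LLM-generated names: the most frequent words."""
    counts = Counter(w for t in members for w in t.lower().split())
    return " / ".join(w for w, _ in counts.most_common(top_n))


# Made-up example headlines, for illustration only.
texts = [
    "flood warning issued for river towns",
    "river flood closes roads",
    "team wins cup final",
    "cup final goes to penalties",
]
labels = kmeans([embed(t) for t in texts], k=2)
names = {
    c: name_cluster([t for t, lab in zip(texts, labels) if lab == c])
    for c in sorted(set(labels))
}
for c, name in names.items():
    print(c, name)
```

In a real setting, `embed` would call an LLM embedding model and `name_cluster` would prompt a generative LLM with sample texts from each cluster; the surrounding structure would stay much the same.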

“The insights gained from this study not only demonstrate a way to make clustering more robust and interpretable,” says Data Scientist Justin Miller, “but also introduces new ways of cluster validation. As LLMs continue to evolve, they offer the potential to transform how we approach text clustering and interpretation: making it faster, more accurate, and more aligned with how humans naturally categorise information.”

The most interesting discovery? LLMs can not only create meaningful groupings of millions of texts but also generate cluster names that resemble, and are sometimes better than, the names humans come up with, offering a new standard for evaluating clustering quality.

This paper is especially relevant for those working with large-scale textual data in fields like policy analysis, media monitoring, customer feedback, and academic research, and for anyone who wants to cut through the noise and surface interpretable insights from massive datasets.

📄 Read the research paper: https://royalsocietypublishing.org/doi/10.1098/rsos.241692

About the Authors:

Justin Miller

Justin Miller is a data scientist specializing in natural language processing. He is pursuing a PhD in social media data science, where he is using large language models to explore the complex relationships between identity and behavior on social media platforms. With a background in English literature, psychology, and data science, Justin brings a multidisciplinary approach to his research and is dedicated to advancing our understanding of this rapidly evolving field. He is supervised by Tristram Alexander and Eduardo Altmann.

Tristram Alexander

Tristram Alexander is an Associate Professor in Physics at the University of Sydney. He is a physicist with expertise in the modelling of nonlinear dynamical systems with many interacting elements, including social media dynamics. He has developed a suite of processing tools to identify and analyse communities in Twitter stream data.
