A New Era for Clustering Short Text: Making Sense of Millions with LLMs

Human-interpretable clustering of short text using large language models

Authors: Justin K. Miller and Tristram Alexander

In the age of big data, researchers and analysts are often overwhelmed by the scale and complexity of short text datasets: millions of tweets, headlines, reviews, or search queries that are difficult to meaningfully summarize or interpret.

Newly published research by Justin K. Miller and Tristram Alexander highlights how large language models (LLMs) are redefining what’s possible in short text clustering. By using semantic embeddings generated by LLMs, their method distils sprawling, unstructured datasets into just ten concise, human-readable topics—bridging the gap between machine learning output and human understanding.

Unlike traditional clustering techniques that struggle with sparse and ambiguous text, this approach captures deep semantic patterns, allowing for distinctive and interpretable clusters. More importantly, the study introduces a novel use of generative LLMs to label and validate clusters, providing a transparent and scalable alternative to time-consuming human coding.
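The pipeline described above — embed each short text with an LLM, cluster the embedding vectors, then have a generative model name each cluster — can be sketched with toy data. The `kmeans` helper below and the random vectors standing in for LLM embeddings are illustrative assumptions, not the authors' actual models or algorithm choices:

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Plain k-means with farthest-point initialisation."""
    # start from the first row, then repeatedly pick the point
    # farthest from the centres chosen so far
    centers = [X[0]]
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[dists.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        # assign every point to its nearest centre
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centre to the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Stand-in for LLM embeddings: two well-separated blobs of toy vectors.
# In practice these rows would come from an embedding model, and each
# resulting cluster would be passed to a generative LLM for naming.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 8)),
               rng.normal(5.0, 0.1, (20, 8))])
labels = kmeans(X, k=2)
```

With the blobs this far apart, the first 20 rows land in one cluster and the last 20 in the other; the paper's contribution lies in doing this at the scale of millions of texts and validating the clusters with LLM-generated names.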

“The insights gained from this study not only demonstrate a way to make clustering more robust and interpretable,” says data scientist Justin Miller, “but also introduce new ways of cluster validation. As LLMs continue to evolve, they offer the potential to transform how we approach text clustering and interpretation: making it faster, more accurate, and more aligned with how humans naturally categorise information.”

The most interesting discovery? That LLMs can not only create meaningful groupings of millions of texts but also generate cluster names comparable to those humans devise—and sometimes better—offering a new standard for evaluating clustering quality.

This paper is especially relevant for those working with large-scale textual data in fields like policy analysis, media monitoring, customer feedback, and academic research: anyone who wants to cut through the noise and surface interpretable insights from massive datasets.

📄 Read the research paper: https://royalsocietypublishing.org/doi/10.1098/rsos.241692

About the Authors:

Justin Miller

Justin Miller is a data scientist specializing in natural language processing. He is pursuing a PhD in social media data science, where he is using large language models to explore the complex relationships between identity and behavior on social media platforms. With a background in English literature, psychology, and data science, Justin brings a multidisciplinary approach to his research, and is dedicated to advancing our understanding of this rapidly evolving field. He is supervised by Tristram Alexander and Eduardo Altmann.

Tristram Alexander

Tristram Alexander is an Associate Professor in Physics at the University of Sydney. He is a physicist with expertise in the modelling of nonlinear dynamical systems with many interacting elements, including social media dynamics. He has developed a suite of processing tools to identify and analyse communities in Twitter stream data.
