A New Era for Clustering Short Text: Making Sense of Millions with LLMs

Human-interpretable clustering of short text using large language models

Authors: Justin K. Miller and Tristram Alexander

In the age of big data, researchers and analysts are often overwhelmed by the scale and complexity of short text datasets: millions of tweets, headlines, reviews, or search queries that are difficult to meaningfully summarize or interpret.

Newly published research by Justin K. Miller and Tristram Alexander highlights how large language models (LLMs) are redefining what’s possible in short text clustering. By using powerful semantic embeddings generated by LLMs, their method reduces sprawling, unstructured datasets into just ten concise, human-readable topics—bridging the gap between machine learning output and human understanding.

Unlike traditional clustering techniques that struggle with sparse and ambiguous text, this approach captures deep semantic patterns, allowing for distinctive and interpretable clusters. More importantly, the study introduces a novel use of generative LLMs to label and validate clusters, providing a transparent and scalable alternative to time-consuming human coding.
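The embed–cluster–label pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `embed_texts` and `label_cluster` are hypothetical placeholders for an LLM embedding call and a generative-LLM naming prompt, and the clustering here is a plain k-means with k=10, mirroring the ten topics mentioned above.

```python
import numpy as np

def embed_texts(texts):
    # Placeholder for an LLM embedding call (hypothetical).
    # In the paper's setting these would be semantic embeddings from a
    # large language model; random vectors keep the sketch runnable.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 32))

def kmeans(X, k=10, iters=50, seed=0):
    # Plain k-means: assign each point to its nearest centroid, then
    # move each centroid to the mean of its assigned points.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels

def label_cluster(members):
    # Placeholder for the generative-LLM labelling step: in practice one
    # would prompt a model with sample texts from the cluster and ask
    # for a short, human-readable topic name.
    return f"topic ({len(members)} texts)"

texts = [f"short text {i}" for i in range(100)]
X = embed_texts(texts)
labels = kmeans(X, k=10)
names = {j: label_cluster([t for t, l in zip(texts, labels) if l == j])
         for j in sorted(set(labels))}
```

In the study itself the embeddings and the cluster names both come from large language models; the toy functions here only stand in so the three-stage structure is visible end to end.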

“The insights gained from this study not only demonstrate a way to make clustering more robust and interpretable,” says Data Scientist Justin Miller, “but also introduce new ways of cluster validation. As LLMs continue to evolve, they offer the potential to transform how we approach text clustering and interpretation: making it faster, more accurate, and more aligned with how humans naturally categorise information.”

The most interesting discovery? LLMs can not only create meaningful groupings of millions of texts, but also generate cluster names that are similar to, and sometimes judged better than, the names humans come up with, offering a new standard for evaluating clustering quality.

This paper is especially relevant for those working with large-scale textual data in fields like policy analysis, media monitoring, customer feedback, and academic research: anyone who wants to cut through the noise and surface interpretable insights from massive datasets.

📄 Read the research paper: https://royalsocietypublishing.org/doi/10.1098/rsos.241692

About the Authors:

Justin Miller

Justin Miller is a data scientist specializing in natural language processing. He is pursuing a PhD in social media data science, where he is using large language models to explore the complex relationships between identity and behavior on social media platforms. With a background in English literature, psychology, and data science, Justin brings a multidisciplinary approach to his research, and is dedicated to advancing our understanding of this rapidly evolving field. He is supervised by Tristram Alexander and Eduardo Altmann.

Tristram Alexander

Tristram Alexander is an Associate Professor in Physics at the University of Sydney. He is a physicist with expertise in the modelling of nonlinear dynamical systems with many interacting elements, including social media dynamics. He has developed a suite of processing tools to identify and analyse communities in Twitter stream data.

