AAAI 2025. Information-Theoretic Generative Clustering of Documents

Clustering is a fundamental technique in machine learning and data mining, offering a powerful lens to understand self-organizing patterns in the real world. At its core, clustering is inherently information-theoretic: it involves formulating the simplest possible hypothesis—how many clusters exist, and which document belongs to which cluster—while minimizing the information loss incurred during this assignment.

Over the past decade, clustering methods have shifted away from this information-theoretic foundation. Instead of modeling documents as probability distributions (e.g., word frequencies), recent approaches represent documents as dense vectors using powerful neural models like BERT. While effective, these vector representations lack a natural probabilistic interpretation, causing the information-theoretic perspective to fade into the background.

In this research, we revisit and revitalize the classic information-theoretic view by leveraging generative language models. Specifically, we use the Doc2Query model to represent each document as a probability distribution over all possible generated texts. Although this space is infinite and discrete, we estimate the relevant distributions and their KL divergences using regularized importance sampling.

In essence, our method performs clustering and statistical estimation hand in hand. The results are striking: across four standard document clustering benchmarks, our method consistently outperforms strong embedding-based baselines—often by a significant margin.

poster Download

References

Xin Du and Kumiko Tanaka-Ishii. Information-Theoretical Generative Clustering of Documents. In Proceedings of AAAI, 2025. [link]

Categorized in:

Featured Language Machine learning Uncategorized

Tagged in:

clustering, information theory, language modeling, machine learning

AAAI 2025. Information-Theoretic Generative Clustering of Documents

References

Leave a Reply Cancel reply

Other Stories

🏆ACL Outstanding Paper Award

Complexity of Language and Its Relation to Inference

🏆ACL Outstanding Paper Award

Complexity of Language and Its Relation to Inference

Acquiring Stock Vectors from News Text and Application to Investment

🏆ACL Outstanding Paper Award

Strahler number of natural language sentences

Correlation dimension of natural language in a statistical manifold

Modeling of financial markets under extreme risks

Quantification of structural complexity underlying real world time series

🏆ACL Outstanding Paper Award

Complexity of Language and Its Relation to Inference

Acquiring Stock Vectors from News Text and Application to Investment

Strahler number of natural language sentences

ICML 2024. Bottleneck-minimal indexing for generative document retrieval

Acquiring Stock Vectors from News Text and Application to Investment

Co-Training Realized Volatility Prediction Model with Neural Distributional Transformation

Influence of textual data and communication structure on financial prices

Modeling of financial markets under extreme risks

Press ESC to close

Or check our Popular Categories...

References

Leave a Reply Cancel reply

Related Articles

Other Stories

🏆ACL Outstanding Paper Award

Complexity of Language and Its Relation to Inference