The correlation dimension of natural language is measured by applying the Grassberger-Procaccia algorithm to high-dimensional sequences produced by a large-scale language model. This method, previously studied only in Euclidean space, is reformulated in a statistical manifold via the Fisher-Rao distance. Language exhibits a multifractal structure, with global self-similarity and a universal dimension around 6.5, which is smaller than those of simple discrete random sequences but larger than that of a Barabási-Albert process. Long memory is the key to producing self-similarity. Our method is applicable to any probabilistic model of real-world discrete sequences, and we show an application to music data.
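Below is a minimal sketch of the measurement pipeline, not the paper's implementation: it assumes the language model's next-token distributions are already available as an (N, V) array of probabilities, one row per text position (the Dirichlet samples in the usage line are only a stand-in for real model outputs). It uses two standard facts: the Fisher-Rao distance between categorical distributions p and q is 2 arccos Σᵢ √(pᵢqᵢ), and the correlation dimension ν is the slope of log C(ε) against log ε, where C(ε) is the fraction of point pairs within distance ε (the Grassberger-Procaccia correlation integral).

```python
import numpy as np

def pairwise_fisher_rao(points):
    # points: (N, V) array, each row a categorical distribution.
    # Fisher-Rao distance: d(p, q) = 2 * arccos(sum_i sqrt(p_i * q_i)).
    root = np.sqrt(points)
    bc = np.clip(root @ root.T, 0.0, 1.0)  # Bhattacharyya coefficients
    return 2.0 * np.arccos(bc)

def correlation_dimension(points, eps_range):
    # Grassberger-Procaccia estimate of the correlation dimension
    # on the statistical manifold of categorical distributions.
    d = pairwise_fisher_rao(points)
    pair_d = d[np.triu_indices(len(points), k=1)]  # distinct pairs only
    log_eps, log_c = [], []
    for eps in eps_range:
        c = np.mean(pair_d < eps)  # correlation integral C(eps)
        if c > 0:
            log_eps.append(np.log(eps))
            log_c.append(np.log(c))
    # nu is the slope of log C(eps) vs log eps in the scaling region.
    slope, _ = np.polyfit(log_eps, log_c, 1)
    return slope

# Stand-in data: i.i.d. Dirichlet draws, NOT real language-model outputs.
rng = np.random.default_rng(0)
pts = rng.dirichlet(np.ones(50), size=400)
print(correlation_dimension(pts, np.geomspace(0.05, 2.0, 20)))
```

In practice the slope should be fit only over the scaling region where log C(ε) is linear in log ε; the single global fit above is a simplification.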

Reference

  • Xin Du and Kumiko Tanaka-Ishii. Correlation dimension of natural language in a statistical manifold. Physical Review Research 6, L022028, 2024.