![](https://ml-waseda.jp/wp-content/uploads/2024/05/Key-image.png)
The correlation dimension of natural language is measured by applying the Grassberger-Procaccia algorithm to high-dimensional sequences produced by a large-scale language model. The algorithm, previously applied only in Euclidean space, is reformulated on a statistical manifold via the Fisher-Rao distance. Language is found to exhibit multifractality, with global self-similarity and a universal correlation dimension of around 6.5, which is smaller than those of simple discrete random sequences and larger than that of a Barabási-Albert process. Long memory is the key to producing this self-similarity. The method applies to any probabilistic model of real-world discrete sequences, and an application to music data is also shown.
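The core idea can be sketched in a few lines: the Grassberger-Procaccia correlation integral C(r) counts the fraction of point pairs closer than r, and the correlation dimension is the slope of log C(r) versus log r. The sketch below is a minimal illustration, not the paper's exact pipeline: it uses the closed-form Fisher-Rao distance between categorical distributions (twice the Bhattacharyya angle on the probability simplex) and random Dirichlet samples as stand-ins for the next-token distributions a language model would produce.

```python
import numpy as np

def fisher_rao(p, q):
    """Fisher-Rao distance between two categorical distributions:
    twice the Bhattacharyya angle, d(p, q) = 2 arccos(sum_i sqrt(p_i q_i))."""
    bc = np.sum(np.sqrt(p * q))
    return 2.0 * np.arccos(np.clip(bc, 0.0, 1.0))

def correlation_integral(points, r, dist=fisher_rao):
    """Grassberger-Procaccia correlation integral C(r): the fraction of
    point pairs whose distance is below r."""
    n = len(points)
    count = sum(1 for i in range(n) for j in range(i + 1, n)
                if dist(points[i], points[j]) < r)
    return 2.0 * count / (n * (n - 1))

def correlation_dimension(points, radii, dist=fisher_rao):
    """Estimate the correlation dimension as the slope of log C(r)
    against log r over the given radii (zero counts are skipped)."""
    pairs = [(np.log(r), np.log(c)) for r in radii
             if (c := correlation_integral(points, r, dist)) > 0]
    xs, ys = zip(*pairs)
    slope, _ = np.polyfit(xs, ys, 1)
    return slope

# Toy demo: 200 random categorical distributions over 10 outcomes.
# A real analysis would instead use the per-token predictive distributions
# emitted by a language model over a long text.
rng = np.random.default_rng(0)
points = rng.dirichlet(np.ones(10), size=200)
radii = np.linspace(0.5, 1.5, 8)
print(correlation_dimension(points, radii))
```

In this toy setting the estimate simply reflects the dimensionality of the Dirichlet cloud; the paper's finding is that real language, measured this way, collapses onto a much lower dimension (~6.5) than comparable random sequences.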
![](https://ml-waseda.jp/wp-content/uploads/2024/05/xandp-1024x613.png)
![](https://ml-waseda.jp/wp-content/uploads/2024/05/phi-1024x529.png)
![](https://ml-waseda.jp/wp-content/uploads/2024/05/language.jpg)
![](https://ml-waseda.jp/wp-content/uploads/2024/05/modelsize-1024x804.png)
![](https://ml-waseda.jp/wp-content/uploads/2024/05/domain-1024x457.jpg)
References
- Xin Du and Kumiko Tanaka-Ishii. Correlation dimension of natural language in a statistical manifold. Physical Review Research 6, L022028, 2024.