Entropy rate of human symbolic sequences

We explore the complexity underlying human symbolic sequences via entropy rate estimation. Suppose the number of possible time series of length n grows as $2^{hn}$, where the parameter h is the entropy rate. For a random binary series consisting of half ones and half zeros, h = 1. For the 26 characters in English, however, the number of possibilities is not $26^n$, because of various constraints such as “q” being followed only by “u”. Shannon estimated a value of about h = 1.3 bits per character, but the question of acquiring a true h for human language is difficult to answer and remains unsolved: in fact, it is unknown whether h is even positive. Therefore, we study ways to compute an upper bound of h for various kinds of data, including music, programs, and market data, in addition to natural language.
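The gap between the unconstrained count $26^n$ and the constrained count $2^{hn}$ can be illustrated with a small sketch. It is not the estimation method of the papers listed below. It only assumes that the size of a compressed text, in bits per character, gives a rough upper estimate of h. The function name `bits_per_symbol`, the choice of `zlib`, and the toy `sample` string are all illustrative assumptions. The sample is highly repetitive, so its rate is far lower than that of real text.

```python
import math
import zlib


def bits_per_symbol(text: str) -> float:
    """Compressed size in bits divided by the number of characters.

    Hypothetical helper: a crude upper estimate of the entropy rate,
    not the extrapolation procedure used in the referenced papers.
    """
    if not text:
        raise ValueError("text must be non-empty")
    compressed = zlib.compress(text.encode("utf-8"), 9)
    return 8 * len(compressed) / len(text)


# With no constraints, each of the 26 letters carries log2(26) bits,
# which corresponds to 26^n possible sequences of length n.
uniform_bits = math.log2(26)  # about 4.70 bits per character

# Toy input (an assumption for illustration): a sentence repeated many times.
sample = "the quick brown fox jumps over the lazy dog " * 200
rate = bits_per_symbol(sample)

print(f"log2(26)            = {uniform_bits:.2f} bits/char")
print(f"zlib rate on sample = {rate:.2f} bits/char")
```

On real corpora a general-purpose compressor typically gives a looser bound than the figure quoted above, and the bound tends to tighten as the amount of data grows, which is why extrapolation to large data sizes matters.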

References

Ryosuke Takahira, Kumiko Tanaka-Ishii, and Łukasz Dębowski. Entropy Rate Estimates for Natural Language—A New Extrapolation of Compressed Large-Scale Corpora. Entropy, 2016, 18.10: 364. [link]
Geng Ren, Shuntaro Takahashi, and Kumiko Tanaka-Ishii. Entropy Rate Estimation for English via a Large Cognitive Experiment Using Mechanical Turk. Entropy, 2019, 21.12: 1201. [link]

Categorized in:

Complex System, Language

Tagged in:

computational linguistics, entropy rate, music, newspaper, programming language, time-series
