ACL 2026. Repeated Sequences Reveal Gaps between Large Language Models and Natural Language

Evaluating whether large language models (LLMs) capture the structure of natural language beyond local fluency remains an open challenge. Existing evaluation methods, largely based on task performance or short-context behavior, provide limited insight into the long-range statistical organization of generated text. We propose a complementary evaluation framework based on repeated subsequences. By analyzing their distribution across scales and relating it to higher-order Rényi entropies, we probe how texts reuse previously established structure under finite-length conditions.
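
As an illustration of how repeated subsequences connect to a higher-order Rényi entropy, the sketch below estimates the order-2 (collision) entropy of length-n blocks from the number of pairs of identical blocks in a token sequence. It is a minimal sketch under stated assumptions, not the paper's estimator: the choice of order 2, the overlapping-block counting, and the function names `block_counts` and `collision_entropy` are all hypothetical. It relies on the standard fact that the fraction of matching pairs among all pairs of samples estimates the sum of squared probabilities.

```python
from collections import Counter
import math


def block_counts(tokens, n):
    """Count every contiguous (overlapping) block of length n in the sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def collision_entropy(tokens, n):
    """Estimate the order-2 Renyi entropy of n-blocks, in nats.

    Assumption (illustrative only): the collision probability sum_w p(w)^2 is
    estimated by the fraction of block pairs that are identical. Returns
    math.inf when no block of length n repeats, i.e. when the text is too
    short to say anything at this scale.
    """
    counts = block_counts(tokens, n)
    total = sum(counts.values())
    if total < 2:
        return math.inf
    # Ordered pairs of distinct positions that hold the same block.
    repeated_pairs = sum(c * (c - 1) for c in counts.values())
    if repeated_pairs == 0:
        return math.inf
    return -math.log(repeated_pairs / (total * (total - 1)))


if __name__ == "__main__":
    text = "the cat sat on the mat and the cat sat on the hat".split()
    for n in (1, 2, 3, 4):
        print(n, collision_entropy(text, n))
```

Because the estimate depends only on blocks that occur at least twice, it becomes undefined once n exceeds the longest repeat in the text, which is one way to see why finite length limits the accessible range of block lengths.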

Experiments on human-written texts and length-matched GPT-generated texts show that, while power-law models can describe restricted ranges of block length, the observed entropy growth is often equally or better characterized by logarithmic–power forms. Across datasets, natural language exhibits stable entropy-growth patterns over accessible ranges, with consistent average behavior despite variability across individual texts. In contrast, GPT-generated texts show systematic and statistically significant shifts in estimated exponents with model size. These results demonstrate that repeated-subsequence entropy provides a quantitative structural diagnostic that reveals systematic differences in long-range organization, distinguishing natural language from state-of-the-art LLM outputs beyond surface-level fluency.
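
The comparison between power-law and logarithmic–power descriptions of entropy growth can be sketched as two straight-line fits in transformed coordinates. The functional forms used here, H(n) = a·n^β and H(n) = a·(ln n)^β, are assumptions made for illustration, since the abstract does not spell out the exact parameterization; the helper names `fit_line` and `compare_growth_forms` are likewise hypothetical, and the residual comparison is a rough stand-in for whatever model-selection procedure the paper uses.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, residual_sum_of_squares).
    """
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, rss


def compare_growth_forms(block_lengths, entropies):
    """Fit two assumed growth forms to entropy estimates H(n).

    power:     log H = log a + beta * log n
    log_power: log H = log a + beta * log(log n)
    Requires every block length n >= 2 and every entropy H > 0.
    """
    log_h = [math.log(h) for h in entropies]
    log_n = [math.log(n) for n in block_lengths]
    log_log_n = [math.log(x) for x in log_n]
    return {
        "power": fit_line(log_n, log_h),
        "log_power": fit_line(log_log_n, log_h),
    }


if __name__ == "__main__":
    # Synthetic data generated from the assumed log-power form, for illustration only.
    ns = list(range(2, 40))
    hs = [1.5 * math.log(n) ** 2.0 for n in ns]
    for name, (beta, log_a, rss) in compare_growth_forms(ns, hs).items():
        print(name, round(beta, 3), round(math.exp(log_a), 3), rss)
```

The fitted slope plays the role of the estimated exponent in each form; comparing residuals over the same range of block lengths gives a first indication of which form describes the data better there.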

References

Kumiko Tanaka-Ishii. Repeated Sequences Reveal Gaps between Large Language Models and Natural Language. Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), to appear in July 2026.
