TACL. Understanding Benchmark Language Under Weakened Formal Semantics

State-of-the-art NLP benchmarks require interpretation of natural language that specifies conditions, procedures, and exceptions, often relying on implicit assumptions and external knowledge. Constructing complete semantic representations with proof-theoretic guarantees is frequently impractical at scale, and purely text-based reasoning offers limited means of inspection. This paper asks how much understanding of benchmark language can be achieved when formal semantic guarantees are weakened.

We investigate this question by extracting computables: executable representations whose runtime behavior provides operational evidence of semantic adequacy, including executability, execution traces, and runtime failures. We induce and iteratively refine computables for benchmark instances using retrieval from external knowledge. Across mathematical reasoning, multi-step reasoning, causal inference, and rule- and exception-heavy legal and biomedical benchmarks, we find that the proposed approach consistently exceeds text-only reasoning and one-shot code execution. Beyond accuracy, our analyses show that these computables provide scalable, inspectable semantic evidence: they expose conditions and exceptions benchmark language forces into executable form, offering a practical bridge between proof-oriented semantics and purely textual reasoning.

References

Haoyang Chen and Kumiko Tanaka-Ishii. Understanding Benchmark Language Under Weakened Formal Semantics. Transactions of the Association for Computational Linguistics (TACL), in press, to appear in 2026.

Categorized in:

Inference, Reasoning Language Machine learning

References

Leave a Reply Cancel reply

Other Stories

ACL 2026. Repeated Sequences Reveal Gapsbetween Large Language Models and Natural Language

ACL 2026. Repeated Sequences Reveal Gapsbetween Large Language Models and Natural Language

DH 2026. Retrieval-Augmented Description Generation for Ceramic Artworks— Effectiveness of Knowledge-Enhancement by the MuseumMetadata—

ICML 2026. Escaping Mode Collapse in LLM Generation via Geometric Regulation

🏆ACL 2025 Outstanding Paper Award. New Formulation of Zipf’s Meaning-Frequency Law

AAAI 2025. Information-Theoretic Generative Clustering of Documents

JSTAT 2023. Strahler number of natural language sentences in comparison with random trees

Physical Review Research 2024. Correlation dimension of natural language in a statistical manifold

Knowledge-Based Systems 2022. Modeling of financial markets under extreme risks

ACL 2026. Repeated Sequences Reveal Gapsbetween Large Language Models and Natural Language

DH 2026. Retrieval-Augmented Description Generation for Ceramic Artworks— Effectiveness of Knowledge-Enhancement by the MuseumMetadata—

ICML 2026. Escaping Mode Collapse in LLM Generation via Geometric Regulation

NeurIPS 2025. Correlation Dimension of Autoregressive Large Language Models

🏆ACL 2025 Outstanding Paper Award. New Formulation of Zipf’s Meaning-Frequency Law

ACL 2020. Stock Embeddings Acquired from News Articles and Price History, and an Application to Portfolio Optimization

ACM ICAIF 2023. Co-Training Realized Volatility Prediction Model with Neural Distributional Transformation

ACL 2020. Influence of textual data and communication structure on financial prices

Knowledge-Based Systems 2022. Modeling of financial markets under extreme risks

Press ESC to close

Or check our Popular Categories...

References

Leave a Reply Cancel reply

Related Articles

Other Stories

ACL 2026. Repeated Sequences Reveal Gapsbetween Large Language Models and Natural Language