In a twist no one saw coming, Polish has outperformed both English and Chinese in long-context evaluations for large language models (LLMs). According to a new academic benchmark, language structure, not the size of the training data, plays a much bigger role once context windows stretch past 64,000 tokens.
Polish leads new LLM benchmark focused on long sequences

The study comes from the OneRuler benchmark, presented in a COLM 2025 paper, which tested 26 languages on tasks involving retrieval and aggregation in extremely long documents. When the models had to dig deep, processing up to 128,000 tokens, Polish topped the charts with an average accuracy of 88%. English? It fell to sixth place. Chinese dropped near the bottom.
This shift highlights how some languages handle long-form tokenization better than others. The results suggest that linguistic structure, especially the type of script and how words are tokenized, starts to matter more than raw training exposure as sequences get longer.
LLM tests: script and tokenization drive surprising results
OneRuler’s results point to a striking trend: languages written in Latin-based scripts, like Polish, French, and Spanish, consistently outperform languages written in logographic scripts (like Chinese) or abugida scripts (like Tamil). And the longer the document, the wider the gap gets.
Here’s what the study revealed:
- At 8,000 tokens, the performance gap between the best and worst languages was just 11%.
- At 128,000 tokens, that gap ballooned to 34%.
- Small changes in instructions, such as allowing “none” as an answer, reduced English accuracy by 32% at long context lengths.
This strongly suggests that current models, while trained on English-dominant datasets, struggle when long-context reasoning requires structural clarity and tight token grouping, something Latin-script languages seem to handle better.
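To make the task format concrete, here is a minimal Python sketch of a needle-in-a-haystack retrieval test of the kind described above, including the optional “none” instruction from the list. The filler sentence, the needle, and the query_model() stub are hypothetical placeholders, not OneRuler’s actual tasks or prompts.

```python
# Hypothetical sketch of a long-context retrieval check. The filler
# sentence, needle, and query_model() stub are illustrative placeholders,
# not material from the OneRuler benchmark itself.

FILLER = "The archive committee met again and adjourned without a decision. "
NEEDLE = "The secret access code is 7419. "


def build_haystack(n_repeats: int, needle_position: float = 0.5) -> str:
    """Repeat filler text and bury one needle at a relative depth."""
    chunks = [FILLER] * n_repeats
    chunks.insert(int(n_repeats * needle_position), NEEDLE)
    return "".join(chunks)


def build_prompt(haystack: str, allow_none: bool) -> str:
    """Ask for the buried fact; optionally permit 'none' as an answer."""
    instruction = "What is the secret access code mentioned in the text above?"
    if allow_none:
        # The instruction tweak the study links to a sharp drop in
        # English accuracy at long context lengths.
        instruction += " If no code is mentioned, answer 'none'."
    return f"{haystack}\n\n{instruction}"


def query_model(prompt: str) -> str:
    """Stand-in for whichever LLM API is being evaluated."""
    raise NotImplementedError("plug in a real model call here")


if __name__ == "__main__":
    prompt = build_prompt(build_haystack(n_repeats=20_000), allow_none=True)
    print(f"Prompt length: {len(prompt):,} characters")
    # answer = query_model(prompt)
    # print("correct" if "7419" in answer else "wrong")
```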
English isn’t the long-context gold standard anymore
LLMs are typically tested in English because of its sheer volume in training data. But this study flips that assumption. Once models are forced to retrieve buried details from long documents, it’s not about familiarity; it’s about efficiency.
Polish, with its compact morphology and Latin alphabet, appears to offer an ideal structure for such tasks. Meanwhile, Chinese and similar scripts, where each character often becomes a separate token, lose traction fast as documents grow.
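A quick way to see this per-script token cost is to run roughly equivalent sentences through a tokenizer and count tokens. In the sketch below, the choice of tiktoken’s cl100k_base encoding and the sample sentences are assumptions made for illustration; OneRuler’s models and texts may tokenize quite differently.

```python
# Illustrative token-count comparison across scripts. The tokenizer
# (tiktoken's cl100k_base) and the sample sentences are assumptions for
# demonstration, not the tokenizers or texts used in the study.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "The committee approved the proposal after a long debate.",
    "Polish": "Komisja zatwierdziła wniosek po długiej debacie.",
    "Chinese": "委员会经过长时间的辩论后批准了该提案。",
}

for language, text in samples.items():
    n_tokens = len(enc.encode(text))
    # Fewer tokens for the same content means more of a document fits in a
    # fixed context window before retrieval targets get pushed out.
    print(f"{language:8s} {n_tokens:3d} tokens for {len(text)} characters")
```

The exact counts depend on the tokenizer, but the gap between scripts is the effect the study says compounds as documents stretch toward 128,000 tokens.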
Long-context LLM tests expose deeper language bias
The results from OneRuler challenge how we define LLM strength. If benchmarks stay limited to short documents or English-only tests, they miss this deeper shift. As context windows get longer and models are used for more complex research, legal, or archival tasks, language structure will matter more, not less.
Token count is cheap. Accuracy buried 100,000 tokens deep? That’s where real differences start to show.

