If you’ve ever used ChatGPT or similar AI tools, you’ve likely been impressed by their ability to answer questions, write emails, or summarize documents. But what happens when you ask these systems to process book-length documents in different languages? A new study reveals some startling truths about AI’s linguistic capabilities and its surprising weaknesses.
Long-context capabilities are critical for real-world applications like document QA, legal analysis, and summarization. However, existing evaluations have been almost exclusively English-centric, leaving a significant gap in our understanding of how these abilities generalize across the world’s diverse languages, scripts, and resource levels. This lack of comprehensive, multilingual diagnostics has obscured persistent disparities and unexpected performance patterns in large language models (LLMs).
The Multilingual Measuring Stick: Introducing ONERULER
Researchers from the University of Maryland, Microsoft, and UMass Amherst recently developed ONERULER, the first comprehensive benchmark designed to evaluate how well LLMs handle lengthy texts across 26 languages. Their findings challenge many assumptions about AI’s capabilities and reveal significant disparities in performance.
The team tested six prominent AI models: open-weight options (Qwen 2.5 at 7B and 72B, Llama 3.1 8B, and Llama 3.3 70B) alongside commercial systems (Google’s Gemini 1.5 Flash and OpenAI’s o3-mini-high), each evaluated at four context lengths, from 8,000 to a massive 128,000 tokens. The results paint a fascinating picture of where AI excels and where it struggles, particularly in multilingual and cross-lingual settings.
How ONERULER Works: Tasks and Contexts
ONERULER adapts and extends the English-only RULER benchmark (which tests long-context models on tasks more complex than simple retrieval and shows that many degrade as context grows, despite large claimed capacities) by creating seven synthetic tasks:
- Needle-in-a-Haystack (NIAH) Variants: These tasks embed specific “needle” sentences (e.g., “The special magic number for ‘book’ is: 12345.”) into very long contexts, then ask the model to retrieve the associated information (see the construction sketch after this list):
  - S-NIAH (Single NIAH): One needle is inserted; the model retrieves its number.
  - MK-NIAH (Multi-key NIAH): Multiple needles with different keys; only one is queried.
  - MV-NIAH (Multi-value NIAH): Multiple needles share the same key but carry different values; the model must output all of them.
  - MQ-NIAH (Multi-query NIAH): Multiple queries per prompt; all must be answered correctly.
  - NONE-NIAH (None NIAH): All embedded needles are distractors; the correct answer is “none.”
- Aggregation Tasks (CWE, Common Word Extraction): Models must list the 10 most frequent words in a long enumerated word list:
  - CWE-easy: Target words appear 30 times; distractors appear 3 times.
  - CWE-hard: Target words appear 20 times; distractors 10 times (a much smaller frequency gap).
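To make the task construction concrete, here is a minimal Python sketch of how an S-NIAH prompt and a CWE word list could be assembled. The needle template comes from the paper’s example; the function names, filler-text handling, and question wording are illustrative assumptions, not the authors’ released code.

```python
import random

# Needle template from the paper's example; key and value vary per instance.
NEEDLE = "The special magic number for '{key}' is: {value}."

def make_sniah_prompt(book_text: str, key: str = "book",
                      value: int = 12345, depth: float = 0.5) -> str:
    """Insert a single needle into haystack text at a relative depth (S-NIAH)."""
    sentences = book_text.split(". ")
    pos = int(len(sentences) * depth)  # where in the haystack the needle lands
    sentences.insert(pos, NEEDLE.format(key=key, value=value))
    # Hypothetical question wording; ONERULER localizes instructions per language.
    question = f"What is the special magic number for '{key}' mentioned in the text?"
    return ". ".join(sentences) + "\n\n" + question

def make_cwe_list(targets: list[str], distractors: list[str],
                  target_count: int = 30, distractor_count: int = 3) -> str:
    """Build an enumerated word list for CWE (easy: 30 vs. 3; hard: 20 vs. 10)."""
    words = targets * target_count + distractors * distractor_count
    random.shuffle(words)
    return "\n".join(f"{i + 1}. {w}" for i, w in enumerate(words))
```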
Contexts for these tasks were drawn from non-copyrighted books (one per language), ensuring authentic linguistic data. Instructions were carefully translated and localized by native speakers to ensure naturalness and grammatical accuracy across all 26 languages, which span diverse families and scripts.
The Widening Gap: When Longer Means Less Equal
One of the most concerning findings is what happens as documents get longer. The performance gap between high-resource and low-resource languages widens dramatically as context length increases.
The average accuracy gap between the top-5 and bottom-5 languages (ranked by Wikipedia article count, a proxy for resource level) grew from 11% at 8,000 tokens to a staggering 34% at 128,000 tokens. This suggests that the techniques used to extend AI’s context window—the amount of text it can process at once—don’t transfer equally across languages. Low-resource languages like Hindi, Sesotho, Swahili, and Tamil suffer disproportionately as documents get longer, even though these languages are spoken by hundreds of millions of people worldwide.
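To pin down what that gap measures, here is a tiny sketch of the computation. The accuracy values below are made-up placeholders, not the paper’s results; only the method (rank languages by a resource proxy, average the top five and bottom five) reflects the study.

```python
# Hypothetical per-language accuracies at one context length (placeholder values).
accuracy = {"pl": 0.88, "ru": 0.86, "fr": 0.85, "en": 0.84, "de": 0.83,
            "hi": 0.58, "sw": 0.55, "ta": 0.52, "my": 0.50, "st": 0.49}

# Languages ordered high-to-low by a resource proxy (e.g., Wikipedia article count).
resource_rank = ["en", "fr", "de", "ru", "pl", "hi", "ta", "sw", "my", "st"]

top5 = [accuracy[lang] for lang in resource_rank[:5]]
bottom5 = [accuracy[lang] for lang in resource_rank[-5:]]
gap = sum(top5) / len(top5) - sum(bottom5) / len(bottom5)
print(f"top-5 vs. bottom-5 accuracy gap: {gap:.1%}")
```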
The “None” Problem: When Caution Becomes a Weakness
Here’s where things get really interesting, revealing a critical calibration issue in LLMs. When researchers simply added the instruction “If no such numbers exist, please answer ‘none’” to the prompts, performance plummeted.
For the single-needle task (S-NIAH) in English at the longest context length (128K tokens), this one-sentence addition caused a 32% drop in accuracy. Why? Because models became overly cautious, frequently answering “none” even when the needle was clearly present in the text. OpenAI’s o3-mini-high was particularly prone to this error, exhibiting the highest rate of “none” overuse despite being one of the most advanced reasoning models tested. The behavior mirrors what happened when reading comprehension benchmarks first introduced unanswerable questions: at first, both humans and AI systems struggled with knowing when to say “I don’t know.”
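This failure mode is easy to quantify: measure how often the model answers “none” on cases where the needle is actually present. Below is a minimal sketch of such a false-abstention metric; the response format and string-matching rule are assumptions for illustration, not the paper’s evaluation code.

```python
def false_abstention_rate(results: list[tuple[str, bool]]) -> float:
    """Fraction of answerable cases where the model wrongly answered 'none'.

    `results` pairs each model response with whether a needle was present.
    """
    answerable = [resp for resp, present in results if present]
    if not answerable:
        return 0.0
    wrong_nones = sum("none" in resp.strip().lower() for resp in answerable)
    return wrong_nones / len(answerable)

# Toy usage: two answerable cases, one wrongly abstained.
print(false_abstention_rate([("none", True), ("12345", True), ("none", False)]))  # 0.5
```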
The Language Ranking Shockers: Polish Leads, Chinese Struggles
Perhaps the most surprising finding concerns which languages AI handles best. If you assumed English would come out on top, given its dominance in AI training data, you’d be wrong.
At longer context lengths (64,000 and 128,000 tokens), Polish emerged as the top-performing language on NIAH tasks, with an average accuracy of 88% across models. English ranked only 6th at 83.9%, while Chinese performed surprisingly poorly, ranking fourth from the bottom at just 62.1%.
The top-performing language families were Slavic, Romance, and Germanic languages, mostly using Latin or Cyrillic scripts. Meanwhile, all four low-resource languages (Hindi, Sesotho, Swahili, Tamil) ranked in the bottom six, highlighting how resource availability continues to drive AI performance disparities.
Despite Chinese being a high-resource language and a focus of model training, the LLMs tested on ONERULER disproportionately fail on Chinese tasks by incorrectly responding that an answer is absent (“none”) when it is in fact present in the context. The paper highlights this as a puzzling, counterintuitive result: long-context capabilities do not automatically transfer or function robustly even for well-resourced languages. The underperformance appears to stem from a combination of architectural biases toward Latin-script languages, tokenization inefficiencies, instruction sensitivity, and potentially an uneven distribution of long-context training data.
The Aggregation Challenge: When Counting Becomes Impossible
While retrieval tasks revealed interesting patterns, aggregation tasks exposed even more fundamental limitations. Researchers asked models to identify the ten most frequent words in massive word lists—a task that would be tedious but straightforward for humans.
The results were sobering. In the “easy” version of this task (CWE-easy), average English accuracy across all models was only 31.5%. While three models managed over 80% accuracy at the shortest context length (8K), performance degraded sharply by 128K. The “hard” version (CWE-hard) proved nearly impossible, with accuracy close to 0% across all models and contexts. This exposes a fundamental weakness in current AI systems’ ability to synthesize information across long contexts, even for relatively simple statistical tasks like frequency counting.
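For concreteness, one plausible way to score a CWE response is the overlap between the model’s predicted words and the true top-10 list; the sketch below assumes this metric and a newline-separated answer format, both assumptions for illustration.

```python
from collections import Counter

def cwe_score(word_list: list[str], response: str, k: int = 10) -> float:
    """Fraction of the true top-k most frequent words the model recovered."""
    true_top = {w for w, _ in Counter(word_list).most_common(k)}
    predicted = {line.strip().lower() for line in response.splitlines() if line.strip()}
    return len(true_top & predicted) / k

# Miniature CWE-easy-style list: one target (30x) and one distractor (3x).
words = ["apple"] * 30 + ["pear"] * 3
print(cwe_score(words, "apple", k=1))  # 1.0: the target word was recovered
```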
The Cross-Lingual Instruction Effect
What happens when you give instructions in one language but provide content in another? The researchers found that the language of instructions can swing performance by up to 20%.
For example:
- When the context was in English but instructions were switched from English to Korean, average accuracy dropped from 91% to 71% at 64,000 tokens.
- Conversely, when Korean content was paired with English instructions, accuracy improved from 61% to 77% at 128,000 tokens.
These significant fluctuations indicate a strong dependence on instruction language, beyond just the context or needle language. This has practical implications for how we design multilingual AI systems, suggesting that simply translating interfaces might not be sufficient for optimal performance.
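The setup behind these numbers is straightforward to reproduce in spirit: hold the context fixed and swap only the language of the instruction. A hedged sketch follows; the templates are illustrative translations, not ONERULER’s native-speaker-localized instructions.

```python
# Illustrative instruction templates; ONERULER uses native-speaker localizations.
INSTRUCTIONS = {
    "en": "What is the special magic number for 'book' mentioned in the text?",
    "ko": "본문에 언급된 'book'의 특별한 매직 넘버는 무엇입니까?",
}

def cross_lingual_prompt(context: str, instruction_lang: str) -> str:
    """Pair a context in one language with an instruction in another."""
    return f"{context}\n\n{INSTRUCTIONS[instruction_lang]}"

# e.g., English book text paired with a Korean instruction:
# prompt = cross_lingual_prompt(english_context, "ko")
```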
The Reasoning Model Paradox: Overthinking to Failure
Some of the most advanced AI models available today are “reasoning models” that show their work, generating step-by-step explanations for their answers. But this study found that more reasoning doesn’t always mean better reasoning.
OpenAI’s o3-mini-high, for instance, produced “significantly more reasoning tokens for its incorrect answers than for its correct answers.” This suggests an inefficient reasoning process where the model essentially “overthinks” itself into mistakes. On aggregation tasks, both o3-mini-high and DeepSeek-R1 frequently exceeded their maximum output token limits because they generated excessively verbose reasoning. In some cases, their explanations were longer than the original input text – all for tasks that required simple list generation. This highlights a challenge in controlling reasoning verbosity and efficiency in long-context scenarios.
Beyond the Numbers: Deeper Insights into AI’s Struggles
Qualitative analysis revealed several dominant error modes beyond just the “none” problem:
- Distractor Selection: Models often picked incorrect “needles” when multiple distractors were present, especially in MK-NIAH and NONE-NIAH.
- Partial Completions: In tasks requiring multiple values (MV-NIAH) or multiple queries (MQ-NIAH), models frequently returned only one needle or omitted values, indicating a struggle with comprehensive retrieval.
- Hallucination and Reformulation: Some models invented rationales or altered the task, generating hypothetical examples for CWE or creating riddle-like reasoning chains instead of simple extraction.
- Language Mixing: Smaller models like Qwen 2.5 7B and Llama 3.1 8B sometimes mixed languages in their outputs, particularly on Polish, suggesting limitations in maintaining linguistic coherence.
These qualitative insights underscore that long-context capabilities are not just about memory, but also about precise instruction following, selective attention, and efficient reasoning across diverse linguistic inputs.
What This Means for the Future of AI
These findings have significant implications for how we develop and use AI systems:
Training Pipelines: Building More Robust and Equitable LLMs
- Diverse Language Exposure: Training pipelines must include a wider array of languages during long-context extension and instruction tuning. Prioritizing low-resource languages and ensuring tokenizer stability across non-Latin scripts are crucial to closing the widening performance gap.
- Abstention Calibration: Models need to be explicitly calibrated for abstention. Incorporating balanced answerable/unanswerable training across languages can prevent the “none” overuse observed in the study.
- Improved Aggregation Pretraining: Synthetic curricula for counting and frequency analysis over long contexts are needed to address the fundamental weakness in aggregation tasks.
Prompting in Production: Optimizing for Real-World Use
- Careful “None” Usage: Avoid “none” options in prompts unless absolutely necessary. If required, consider adding explicit confidence checks or secondary verification mechanisms to counteract models’ tendency towards conservative failure.
- Strategic Instruction Language: In cross-lingual pipelines, prefer instruction languages that are well-aligned with the model’s instruction-tuning distribution (often English), even if the context is in another language. Expect and account for potential performance swings of up to 20%.
Evaluation Practice: Towards Fairer and More Comprehensive Benchmarking
- Tokenizer-Aware Benchmarking: One tokenizer does not rule them all: the same text can vary dramatically in token count across systems (e.g., a Tamil document measured at 42,124 tokens by Google’s tokenizer became 103,990 tokens under Qwen’s). Future evaluations should develop bits-per-byte normalization or multilingual token-budget equalization for fairer cross-language comparisons; the sketch after this list shows the effect directly.
- Expand Beyond Synthetic Tasks: While ONERULER provides crucial diagnostic signals, future research should expand to more realistic multilingual tasks (e.g., legal document QA, multi-document summarization) to triangulate these findings with applied performance.
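To see the tokenizer effect first-hand, count tokens for the same document under different open tokenizers. Here is a minimal sketch using Hugging Face’s transformers library; the model IDs and file path are examples (Gemini’s tokenizer is not public, so two open tokenizers stand in, and some models require accepting a license to download).

```python
from transformers import AutoTokenizer

# Any long non-Latin-script text exaggerates the effect; the path is illustrative.
text = open("tamil_document.txt", encoding="utf-8").read()

for model_id in ["Qwen/Qwen2.5-7B", "meta-llama/Llama-3.1-8B"]:
    tok = AutoTokenizer.from_pretrained(model_id)
    n_tokens = len(tok.encode(text))
    bytes_per_token = len(text.encode("utf-8")) / n_tokens
    print(f"{model_id}: {n_tokens} tokens ({bytes_per_token:.2f} bytes/token)")
```

The bytes-per-token ratio printed at the end hints at why byte-level normalization is one candidate for fairer cross-language token budgets.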
The Path Forward
The release of ONERULER represents an important step toward more equitable AI evaluation. By rigorously testing across 26 languages and multiple context lengths, researchers have created a valuable tool for identifying and addressing the limitations of current systems.
The study reveals that long contexts amplify multilingual disparities, instruction design (especially “none” options) strongly influences outcomes, and aggregation across long inputs remains a major open challenge. The fact that a language like Polish can outperform English in certain contexts, while Chinese unexpectedly struggles, reminds us that AI capabilities don’t always align with our expectations based on training data volume alone.
As AI becomes increasingly integrated into global applications, from multilingual document analysis to cross-cultural communication tools, understanding these disparities becomes crucial. The next frontier in AI development may not be about making models bigger, but about making them more consistently capable across the incredible diversity of human languages and communication contexts. We still have a long way to go before AI can truly understand our world in all its linguistic richness, serving speakers of all languages equally well, whether they’re reading a short article or analyzing an entire library.