Opportunity: Large language models (LLMs) are being rapidly adopted by users across the world. However, their performance and factual knowledge differ across languages. For geopolitical questions, these cross-lingual differences are especially problematic, as they amplify differences in cultural viewpoints.
Challenge: We propose the task of asking LLMs about disputed territories in the languages of the respective claimant countries. We release a dataset covering 48 languages and design an evaluation framework. We find that popular LLMs all respond inconsistently across languages, and we quantify how susceptible each model is to geopolitical bias.
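To make the idea of cross-lingual inconsistency concrete, the sketch below shows one way such a check could be scored; the prompts, mock answers, and agreement metric are illustrative assumptions, not the released dataset or evaluation framework.

```python
# Hypothetical sketch of a cross-lingual consistency check; prompts, mock
# answers, and the scoring rule are illustrative assumptions, not the
# paper's released evaluation framework.
from collections import Counter

# The same question about one disputed territory, phrased in the language
# of each (hypothetical) claimant country.
prompts = {
    "en": "Which country does territory X belong to?",
    "xx": "<same question in claimant language xx>",
}

# Mock model answers keyed by prompt language; in practice these would come
# from querying an LLM with each prompt.
mock_answers = {"en": "Country A", "xx": "Country B"}

def consistency(answers: dict[str, str]) -> float:
    """Fraction of languages that agree with the model's most frequent answer.
    A score of 1.0 means the model names the same claimant in every language."""
    counts = Counter(a.strip().lower() for a in answers.values())
    return counts.most_common(1)[0][1] / len(answers)

print(f"Consistency: {consistency(mock_answers):.2f}")  # 0.50 -> inconsistent
```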