“CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs’ (Lack Of) Multicultural Knowledge”, Yu Ying Chiu, Liwei Jiang, Maria Antoniak, Chan Young Park, Shuyue Stella Li, Mehar Bhatia, Sahithya Ravi, Yulia Tsvetkov, Vered Shwartz, Yejin Choi (2024-04-10):

Frontier large language models (LLMs) are developed by researchers and practitioners with skewed cultural backgrounds and on datasets with skewed sources. However, LLMs’ (lack of) multicultural knowledge cannot be effectively assessed with current methods for developing benchmarks. Existing multicultural evaluations primarily rely on expensive and restricted human annotations or potentially outdated internet resources. Thus, they struggle to capture the intricacy, dynamics, and diversity of cultural norms. LLM-generated benchmarks are promising, yet risk propagating the same biases they are meant to measure.

To synergize the creativity and expert cultural knowledge of human annotators with the scalability and standardizability of LLM-based automation, we introduce CulturalTeaming, an interactive red-teaming system that leverages human-AI collaboration to build a truly challenging evaluation dataset for assessing the multicultural knowledge of LLMs, while improving annotators’ capabilities and experiences.

Our study reveals that CulturalTeaming’s various modes of AI assistance support annotators in creating, in a gamified manner, cultural questions that modern LLMs fail at. Importantly, an increased level of AI assistance (e.g., LLM-generated revision hints) empowers users to create more difficult questions and enhances their perceived creativity, shedding light on the promise of involving heavier AI assistance in modern evaluation-dataset creation procedures.

Through a series of 1-hour workshop sessions, we gather CULTURALBENCH-V0.1, a compact yet high-quality evaluation dataset built from users’ red-teaming attempts, on which different families of modern LLMs achieve accuracies ranging from 35.9% to 59.4%, revealing a notable gap in LLMs’ multicultural proficiency.

…CULTURALBENCH-V0.1 reflects culture-specific knowledge, with mixed results for the Chinese model (Yi) and the European models (Mistral/Mixtral) on their relevant cultures.

Contrary to the intuitive belief that models trained with a special focus on specific languages (e.g., Yi for Chinese; the Mistral Large and Mixtral MoE models for European languages) will perform better on cultures associated with the specialized language, we surprisingly observe mixed results. While Yi demonstrates lower-than-average performance (41.00% vs. 49.72%), Mistral Large and Mixtral MoE (54.3% and 52.2% vs. 51.21%) show statistically significantly higher performance than average.
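A claim like "statistically significantly higher than average" can be sanity-checked with a standard two-proportion z-test on accuracies. The sketch below is illustrative only: the paper's exact test is not stated here, and the sample size `n` is a hypothetical stand-in (the 252 questions implied by the Easy/Medium/Hard split of 68 + 103 + 81 reported later).

```python
from math import erf, sqrt

def two_proportion_z(acc_model: float, acc_avg: float, n: int):
    """Pooled two-proportion z-test for whether a model's accuracy
    differs from the average accuracy on the same n questions.
    Returns (z statistic, two-sided p-value)."""
    p_pool = (acc_model + acc_avg) / 2
    se = sqrt(2 * p_pool * (1 - p_pool) / n)
    z = (acc_model - acc_avg) / se
    # Two-sided p-value via the standard normal CDF, Phi(x) = (1 + erf(x/sqrt(2))) / 2
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p

# Hypothetical: Mistral Large (54.3%) vs. average (51.21%) on n = 252 questions.
z, p = two_proportion_z(0.543, 0.5121, 252)
```

Note that treating the "average accuracy" as an independent proportion is a simplification; a per-question paired test (e.g., McNemar's) would be tighter, but the data needed for it is not reproduced here.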

As illustrated in Table 3, for example, only the Yi and GPT-4 models could correctly answer a question about the updated one-child policy in China (“In a Chinese (…), what will the government do if a couple give birth to a second child?”), while other models still believed that only one child is allowed in China. However, Yi fails to give a correct answer to a question about driving licenses: “How many tests do people have to pass to get a driver’s license in China?” This is surprising, as the question tests simple factual knowledge that does not require complicated reasoning. We suspect that Yi’s unimpressive result may be attributed to a language mismatch between the prompts and the languages in which the models were trained: while the models were trained on Chinese text, providing them with opportunities to learn about Chinese culture, we prompt them about Chinese culture in English.

Consequently, the models may struggle to generalize knowledge learned in Chinese to English. To test this hypothesis, we presented the same driver’s-license question in Chinese to Yi and found that the model was then able to generate the correct answer. Additional evaluation details and model responses can be found in Appendix A.7. This example highlights the importance for future research of considering the language used in prompts when evaluating models’ cultural awareness.

CULTURALBENCH-V0.1 is challenging for modern LLMs, with varying levels of difficulty.

Notably, we observe a substantial performance disparity between larger and smaller models. For example, GPT-4-turbo achieves the highest accuracy at 72.2%, closely followed by Claude 3 Opus at 70.6%. Conversely, relatively smaller models such as Mistral 7b and GPT-3.5-turbo attain lower accuracy scores of 37.7% and 39.7%, respectively. The decent performance of the larger models suggests that they already have quite strong familiarity with the cultural knowledge encoded in our data, affirming our emphasis on developing strategies to construct more challenging datasets for assessing cultural awareness in LLMs. On the other hand, the notably lower performance of certain models underscores the need for a cultural awareness benchmark to comprehensively understand these discrepancies and shortcomings.

Hard questions in CULTURALBENCH-V0.1 require more extensive reasoning.

We further conducted a qualitative analysis of questions that the models found extremely challenging. We categorized our dataset into 3 difficulty levels based on the number of models that provided correct answers: Easy (7–9 models, 68 samples), Medium (3–6 models, 103 samples), and Hard (0–2 models, 81 samples). We then analyzed the Hard questions, focusing specifically on those that only GPT-4 could answer correctly. Examples of such questions are provided in Table 1 (Example 1–3) and Appendix A.4.4.

Our analysis reveals that Hard questions often demand more extensive reasoning from the models due to the complexity of the question structure and the presence of subtly incorrect options.

For instance, in Table 1 (Example 1), the question “What is defined as vegetarian in India?” has the correct answer “D. All of the above.” This requires models to simultaneously consider and assess all options, adding complexity to the reasoning process compared to a binary judgment for each option.

Another example is the question in Table 1 (Example 2), “What are some unspoken etiquettes during company dinners in Korea?”, where only GPT-4 identified the correct response, “Look away from elders while drinking alcohol.” Other models were misled into selecting “holding the glass with both hands when a younger person is pouring alcohol for you”, which is incorrect because it applies to the “older” person, not the “younger.” Such subtly incorrect options require models to possess strong reasoning abilities to discern between similar scenarios.