How to Uncover Why Your AI Assistant Switches Languages Unexpectedly: A Step-by-Step Investigation into Embedding Space and Code Vocabulary

Introduction

Imagine you’re typing a prompt in Chinese, and your AI coding assistant suddenly replies in Korean. It’s not a glitch—it’s a fascinating glimpse into how language models process multilingual inputs and the influence of code vocabulary on embedding spaces. This guide walks you through a systematic investigation to understand the underlying mechanics, from tokenization quirks to training data biases. By the end, you’ll be able to diagnose such behavior and even experiment with prompts to avoid unexpected language switches.

Source: towardsdatascience.com

What You Need

To follow along you need: an AI coding assistant that accepts multilingual prompts; Python with the Hugging Face transformers library for tokenizer experiments; NumPy or scikit-learn for similarity measurements; and a simple log (a text file or spreadsheet) for recording prompts, responses, and model versions.

Step-by-Step Investigation

Step 1: Reproduce and Document the Behavior

First, confirm the phenomenon consistently. Type a simple Chinese prompt that includes code-related vocabulary, such as a comment explaining a function. For example:

// 这个函数计算平均值 (this function calculates average)

Observe the assistant’s response. Note the exact language of the reply, the context (e.g., whether code was present), and the model version. Keep a log for later comparison. If the assistant replies in Korean, you’ve reproduced the issue. If not, try adding more code-specific terms, such as identifiers common in Korean programming communities (e.g., English param versus 변수, the Korean word for “variable”).
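To keep the log consistent, it helps to classify each reply automatically rather than by eye. A minimal sketch using only the standard library (this is a rough heuristic based on Unicode character names, not a full language detector; Korean hanja, for instance, would be counted as Chinese):

```python
import unicodedata

def dominant_script(text: str) -> str:
    """Classify text as Chinese, Korean, or other by counting Unicode scripts."""
    counts = {"Chinese": 0, "Korean": 0, "other": 0}
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, punctuation, digits
        name = unicodedata.name(ch, "")
        if "CJK UNIFIED" in name:
            counts["Chinese"] += 1
        elif "HANGUL" in name:
            counts["Korean"] += 1
        else:
            counts["other"] += 1
    return max(counts, key=counts.get)

# Record one observation; extend the dict with model version, date, etc.
response = "이 함수는 평균을 계산합니다"  # example reply in Korean
log = [{"prompt_lang": "Chinese", "reply_lang": dominant_script(response)}]
print(log)
```

Appending one such entry per trial gives you a comparable record across model versions and prompt variants.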

Step 2: Investigate Tokenization Patterns

Most modern language models use subword tokenizers (like Byte-Pair Encoding or SentencePiece). These break text into tokens that don’t always align with words. A Chinese character might be tokenized as one or more tokens, while Korean Hangul syllables are often separate tokens. Use an online tokenizer tool or write a quick script:

from transformers import AutoTokenizer

# Replace 'model-name' with the checkpoint your assistant is based on.
tokenizer = AutoTokenizer.from_pretrained('model-name')
tokens = tokenizer.tokenize('// 这个函数计算平均值')  # the Chinese comment from Step 1
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))  # token IDs, for comparison in Step 3

Compare token counts and token IDs for the Chinese prompt against a Korean translation of it. With byte-level tokenizers, a single CJK character is often split into several byte fragments, and fragments shared across scripts can blur the language signal the embedding layer receives. Next, we’ll examine those embeddings.

Step 3: Explore Embedding Space Similarity

Embeddings map tokens to dense vectors. When code vocabulary (like keywords def, print) is trained on a corpus that includes Korean documentation, the embeddings for code tokens may cluster near Korean natural language tokens. Use a library like scikit-learn to measure cosine similarity between the embedding of a Chinese code comment and a set of Korean tokens:
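A minimal sketch of that similarity check. The vectors below are toy 4-dimensional stand-ins chosen for illustration; with a real model you would pull the rows for the token IDs found in Step 2 from the model’s input embedding matrix (e.g., via `model.get_input_embeddings()` in transformers):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real embedding rows (real models use 768+ dims).
code_comment_vec = np.array([0.9, 0.1, 0.8, 0.2])   # a Chinese code comment
korean_token_vecs = {
    "변수": np.array([0.8, 0.2, 0.7, 0.3]),  # "variable"
    "함수": np.array([0.7, 0.3, 0.9, 0.1]),  # "function"
}

for tok, vec in korean_token_vecs.items():
    print(tok, round(cosine_similarity(code_comment_vec, vec), 3))
```

If the code-comment embedding sits closer to Korean tokens than to general Chinese tokens, that clustering supports the bridge hypothesis below.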

This explains why a Chinese prompt can trigger a Korean response: the code vocabulary acts as a bridge.


Step 4: Check Training Data and Model Biases

Look up the model’s training data composition. Many coding assistants are trained on GitHub repositories, which are heavily English-centric but with substantial Korean and Chinese comments. If the training corpus had more Korean code comments relative to Chinese ones for certain libraries (e.g., PyTorch tutorials in Korean), the model’s priors shift. You can test by providing a pure Chinese sentence without code vs. a mixed one. If only the mixed version triggers Korean, code vocabulary is the culprit.
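A sketch of how to set up that controlled comparison. The prompts and the code-vocabulary check below are illustrative; you would send both prompts to your assistant and log the reply language for each (e.g., with the script-detection helper from Step 1):

```python
import re

# Controlled pair: same request, with and without code vocabulary.
pure_prompt = "请解释如何计算一组数的平均值"  # "explain how to compute the average of some numbers"
mixed_prompt = "// 这个函数计算平均值\ndef average(nums): ..."

# Rough check for code vocabulary: comment markers, keywords, punctuation.
CODE_PATTERN = re.compile(r"//|def |\(|\)|=|;")

def contains_code_vocab(prompt: str) -> bool:
    return bool(CODE_PATTERN.search(prompt))

print(contains_code_vocab(pure_prompt))   # False
print(contains_code_vocab(mixed_prompt))  # True
```

If only the mixed prompt draws a Korean reply across several trials, you have isolated code vocabulary as the trigger rather than the Chinese text itself.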

Step 5: Experiment with Prompt Engineering

To avoid the switch, modify your prompts: state the desired reply language explicitly at the start of the prompt, keep code blocks clearly separated from natural-language instructions, and consider writing code comments in the language you want the reply in. See the Tips & Conclusion section for more.
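One simple mitigation is to prepend an explicit language instruction to every prompt. A minimal sketch (the exact wording is an assumption, not a model requirement; adjust it to your assistant):

```python
def build_prompt(user_text: str) -> str:
    """Prepend an explicit reply-language instruction to the user's prompt."""
    # "请始终用中文回答。" means "Always reply in Chinese."
    return "请始终用中文回答。\n" + user_text

prompt = build_prompt("// 这个函数计算平均值\n请解释这个函数")
print(prompt)
```

Rerun the Step 1 trials with and without this wrapper and compare the logged reply languages.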

Tips & Conclusion

In summary, the language switch from Chinese to Korean is a result of embedding-space dynamics where code vocabulary tends to cluster with Korean tokens due to training data imbalances. By methodically stepping through tokenization, embeddings, and prompt engineering, you can not only understand why it happens but also gain control over your AI assistant’s output. Happy investigating!
