“Probing the Decision Boundaries of In-Context Learning in Large Language Models”, Siyan Zhao, Tung Nguyen, Aditya Grover (2024-06-17):

In-context learning is a key paradigm in large language models (LLMs) that enables them to generalize to new tasks and domains by simply prompting these models with a few exemplars without explicit parameter updates. Many attempts have been made to understand in-context learning in LLMs as a function of model scale, pretraining data, and other factors.

In this work, we propose a new mechanism to probe and understand in-context learning from the lens of decision boundaries for in-context binary classification. Decision boundaries are straightforward to visualize and provide important information about the qualitative behavior of the inductive biases of standard classifiers.

To our surprise, we find that the decision boundaries learned by current LLMs in simple binary classification tasks are often irregular and non-smooth, regardless of linear separability in the underlying task. This paper investigates the factors influencing these decision boundaries and explores methods to enhance their generalizability.

We assess various approaches, including training-free and fine-tuning methods for LLMs, the impact of model architecture, and the effectiveness of active prompting techniques for smoothing decision boundaries in a data-efficient manner.

Our findings provide a deeper understanding of in-context learning dynamics and offer practical improvements for enhancing robustness and generalizability of in-context learning.

Figure 1: Decision boundaries of LLMs and traditional machine learning models on a linearly separable binary classification task. The background colors represent the model’s predictions, while the points represent the in-context or training examples. LLMs exhibit non-smooth decision boundaries compared to the classical models. See Appendix E for model hyperparameters.

…In contrast to existing approaches, our study introduces a fresh perspective by viewing in-context learning in large language models (LLMs) as a unique machine learning algorithm. This conceptual framework enables us to leverage a classical tool from machine learning—analyzing decision boundaries in binary classification tasks. By visualizing these decision boundaries, both in linear and non-linear contexts, we gain invaluable insights into the performance and behavior of in-context learning. This method allows us to probe the inductive biases and generalization capabilities of LLMs and offers a unique assessment of the robustness of their in-context learning performance. Consequently, this approach provides a comprehensive means to qualitatively analyze the underlying mechanisms that govern in-context learning and suggest ways to improve its performance in LLMs.
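Concretely, probing a decision boundary amounts to serializing each labeled 2D point as text, prompting the model with those exemplars, and querying its label for every point on a dense grid. The serialization below is a minimal illustrative sketch; the exact textual format (field names, precision, label strings) is an assumption, not the paper's verbatim prompt template.

```python
def build_prompt(examples, query):
    """Serialize labeled 2D points plus one query point into a prompt.

    `examples` is a list of ((x1, x2), label) pairs; the "Input:/Label:"
    format here is an illustrative assumption, not the paper's template.
    """
    lines = []
    for (x1, x2), label in examples:
        lines.append(f"Input: {x1:.2f} {x2:.2f}\nLabel: {label}")
    qx1, qx2 = query
    lines.append(f"Input: {qx1:.2f} {qx2:.2f}\nLabel:")  # model completes this
    return "\n".join(lines)

# Querying every point on a dense grid with prompts like this one, and
# coloring each grid point by the label the model returns, produces the
# decision-boundary visualizations described in the text.
examples = [((0.1, 0.2), "Foo"), ((0.9, 0.8), "Bar")]
print(build_prompt(examples, (0.5, 0.5)))
```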

To our surprise, we found that recent LLMs fail to produce smooth decision boundaries on any of the classification tasks we considered, regardless of model size, the number and ordering of in-context examples, and the semantics of the label format. This issue persists even for simple binary linear classification tasks, where classical methods such as SVMs easily achieve smooth boundaries with fewer examples, as shown in Figure 1.
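For contrast, even the simplest classical learner yields a perfectly smooth boundary on separable data. The sketch below is a minimal stand-in for the classical baselines (a plain perceptron rather than a max-margin SVM solver): on linearly separable points it converges to a single linear, hence smooth, boundary.

```python
def train_perceptron(points, labels, epochs=100):
    """Classic perceptron: converges to a separating hyperplane
    (a perfectly smooth, linear boundary) when the data are separable."""
    w0 = w1 = b = 0.0
    for _ in range(epochs):
        for (x0, x1), y in zip(points, labels):  # y in {-1, +1}
            if y * (w0 * x0 + w1 * x1 + b) <= 0:  # misclassified: update
                w0 += y * x0
                w1 += y * x1
                b += y
    return w0, w1, b

points = [(0.0, 0.0), (0.0, 1.0), (2.0, 2.0), (2.0, 3.0)]
labels = [-1, -1, 1, 1]
w0, w1, b = train_perceptron(points, labels)
preds = [1 if w0 * x0 + w1 * x1 + b > 0 else -1 for (x0, x1) in points]
print(preds)  # → [-1, -1, 1, 1], all training points correctly separated
```

The boundary is the single line w0·x0 + w1·x1 + b = 0; it cannot fragment the plane the way the LLM boundaries in Figure 1 do.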

This observation raises questions about the factors that influence the decision boundaries of LLMs.

To explore this, we experimented with a series of open-source LLMs, including LLaMA-2-7b, LLaMA-2-13b, and Llama3-8b (Touvron et al. 2023), Mistral-7b (Jiang et al. 2023), and a pruned LLaMA-2-1.3b (Xia et al. 2023), as well as the state-of-the-art closed-source LLMs GPT-4o and GPT-3.5-Turbo (Brown et al. 2020).

We then explore methods to smooth the decision boundary, including fine-tuning and adaptive prompting strategies.


Our work provides valuable practical insights for understanding and improving in-context learning in LLMs through a new perspective. Our contributions can be summarized as follows:

…As the number of in-context examples increases, LLMs can achieve high accuracy on linear and non-linear classification tasks. But how reliable are these in-context classifiers? We probe their decision boundaries to find out.

By visualizing the decision boundaries, we show that SOTA LLMs, from 1B-parameter open-source models to large closed-source models such as GPT-3.5-Turbo and GPT-4o, all exhibit non-smooth, irregular decision boundaries, even on simple linearly separable tasks.

Figure 2: Visualizations of decision boundaries for various LLMs, ranging in size from 1.3B to 13B parameters, on a linearly separable binary classification task. The in-context data points are shown as scatter points, and the colors indicate the label assigned by each model. These decision boundaries are obtained using 128 in-context examples. The visualization highlights that the decision boundaries of these language models are not smooth.

How do these irregularities arise? We study various factors that impact decision boundary smoothness in LLMs, including in-context example count, quantization level, label semantics, and example order. We then identify methods to improve the smoothness of the boundaries.
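One simple way to put a number on "non-smoothness" (our own illustrative metric, not one defined in the paper) is the fraction of adjacent grid cells whose predicted labels disagree: a single clean boundary touches few cell pairs, while a fragmented boundary touches many.

```python
def boundary_roughness(grid):
    """Fraction of horizontally/vertically adjacent cell pairs whose
    predicted labels differ. One straight boundary scores low; a
    fragmented, checkerboard-like boundary scores high."""
    rows, cols = len(grid), len(grid[0])
    disagreements = total = 0
    for r in range(rows):
        for c in range(cols):
            if r + 1 < rows:  # vertical neighbor
                total += 1
                disagreements += grid[r][c] != grid[r + 1][c]
            if c + 1 < cols:  # horizontal neighbor
                total += 1
                disagreements += grid[r][c] != grid[r][c + 1]
    return disagreements / total

smooth = [[0, 0, 1, 1]] * 4          # one clean vertical boundary
ragged = [[0, 1, 0, 1],
          [1, 0, 1, 0],
          [0, 1, 0, 1],
          [1, 0, 1, 0]]              # checkerboard: maximally fragmented
print(boundary_roughness(smooth), boundary_roughness(ragged))  # ≈0.167 vs 1.0
```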

First, increasing in-context examples does not guarantee smoother decision boundaries. While classification accuracy improves with more in-context examples, the decision boundary remains fragmented.

Decision boundaries are sensitive to label names, example order and quantization.

Shuffling in-context examples and labels changes the model’s decision boundaries, suggesting they depend on the LLM’s semantic prior knowledge of the labels and are not permutation invariant.

Reducing precision from 8-bit to 4-bit affects regions near the boundary, where uncertainty is high: varying the quantization level can flip the LLM’s decisions in these uncertain regions.

Can we improve decision boundary smoothness in LLMs through training? We show that fine-tuning on simple linearly separable tasks can improve the smoothness of decision boundaries and generalize to more complex non-linear, multi-class tasks, enhancing robustness.

Further, we show that fine-tuning the token embedding and attention layers can lead to smoother decision boundaries. However, fine-tuning the linear prediction head alone does not improve smoothness.
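In a PyTorch-style fine-tuning loop, this ablation reduces to choosing which parameter names to leave unfrozen (`requires_grad=True`) before training. The name filter below is an illustrative sketch using hypothetical LLaMA-style parameter names, not the paper's training code.

```python
# Hypothetical LLaMA-style parameter-name prefixes; the split mirrors
# the finding that tuning embeddings + attention helps, while tuning
# the prediction head (lm_head) alone does not.
TRAINABLE_PREFIXES = ("model.embed_tokens", "self_attn")

def select_trainable(param_names):
    """Return the parameter names that would be left unfrozen
    (requires_grad=True in a PyTorch loop) under this scheme."""
    return [n for n in param_names
            if any(p in n for p in TRAINABLE_PREFIXES)]

names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.gate_proj.weight",
    "lm_head.weight",
]
print(select_trainable(names))  # embeddings + attention only; lm_head frozen
```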

We also explore uncertainty-aware active learning. By adding labels for the most uncertain points to the in-context dataset, we can smooth the decision boundary more efficiently.
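The selection loop can be sketched as follows. For a runnable example we use a 1-nearest-neighbor margin as a cheap proxy for the LLM's predictive uncertainty (an assumption for illustration; the paper queries the model itself): a pool point whose nearest positive and nearest negative exemplars are about equally far sits near the boundary and gets labeled first.

```python
def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def uncertainty(point, labeled):
    """Margin between the nearest example of each class under a 1-NN
    proxy classifier; a small margin means the point sits near the
    boundary. Stand-in for the LLM's own predictive uncertainty."""
    d_pos = min(dist(point, p) for p, y in labeled if y == 1)
    d_neg = min(dist(point, p) for p, y in labeled if y == 0)
    return abs(d_pos - d_neg)

def active_step(pool, labeled, oracle):
    """Label the pool point the current classifier is least sure about
    and append it to the in-context set."""
    x = min(pool, key=lambda p: uncertainty(p, labeled))
    pool.remove(x)
    labeled.append((x, oracle(x)))
    return x

oracle = lambda p: 1 if p[0] + p[1] > 1 else 0  # ground-truth boundary
labeled = [((0.0, 0.0), 0), ((1.0, 1.0), 1)]
pool = [(0.2, 0.9), (0.1, 0.1), (0.9, 0.9)]
picked = active_step(pool, labeled, oracle)
print(picked)  # → (0.2, 0.9), the point nearest the true boundary
```

Random sampling would be equally likely to pick the two confidently classified points, which is why active selection reaches a smooth boundary with fewer added examples.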

Figure 11: Comparison of active learning and random sampling methods. We plot the decision boundaries and uncertainty maps across different numbers of in-context examples (32–256), where in-context examples are gradually added to the prompt by active or random selection. Test set accuracies are shown in the subplot titles. Active sampling yields a smoother decision boundary, with the uncertain points concentrated along it.

…As shown in Figure 11, this uncertainty-aware active sampling method results in a smoother decision boundary over iterations compared to random sampling. The iterative refinement enhances the model’s generalization capabilities, leading to higher test set accuracies and greater sample efficiency, requiring fewer additional in-context examples to achieve performance gains. These findings indicate that leveraging the LLM’s uncertainty measurements is valuable for selecting new in-context examples in resource-constrained settings where labeled data is scarce. We show more examples in Appendix F.

Lastly, we explore the effect of language pretraining. Compared to pretrained LLMs, we find that transformers trained from scratch on synthetic classification tasks can learn smooth in-context decision boundaries for unseen classification problems.