Attention sorting: A simple trick to boost your language model's long context skills
Is your language model forgetting crucial information buried deep in its context? Do its responses degrade as more distractor text is added? Attention sorting can help!
What is attention sorting? It's a nifty inference-time technique that re-orders your model's context by attention, pushing the most relevant information to the positions where it's most likely to be used.
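Here's roughly what one pass looks like in code. This is a minimal sketch, not the authors' released implementation: it assumes a Hugging Face causal LM run with eager attention so per-layer attention maps are available, and the helper names (`score_passages`, `attention_sort`) are illustrative.

```python
# Minimal sketch of a single attention-sorting pass.
# Assumption: a Hugging Face causal LM run with eager attention so that
# per-layer attention maps are returned; helper names are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, attn_implementation="eager")
model.eval()

def score_passages(passages, question):
    """Score each passage by how much attention the final prompt position
    (the one about to generate the answer) pays to that passage's tokens."""
    # Tokenize pieces separately so we know which positions belong to which passage.
    piece_ids = [tokenizer(p + "\n\n", add_special_tokens=False)["input_ids"]
                 for p in passages]
    tail_ids = tokenizer(f"Question: {question}\nAnswer:",
                         add_special_tokens=False)["input_ids"]

    input_ids, spans, pos = [], [], 0
    for ids in piece_ids:
        spans.append((pos, pos + len(ids)))
        input_ids.extend(ids)
        pos += len(ids)
    input_ids.extend(tail_ids)

    with torch.no_grad():
        out = model(torch.tensor([input_ids]), output_attentions=True)

    # out.attentions: one (batch, heads, seq, seq) tensor per layer.
    # Average over layers and heads, then take the last query position's row.
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0, -1]  # shape: (seq,)
    return [attn[start:end].sum().item() for start, end in spans]

def attention_sort(passages, question):
    """Re-order passages so the most-attended ones land last, i.e. closest to
    the question, where the model is most likely to use them."""
    scores = score_passages(passages, question)
    return [p for _, p in sorted(zip(scores, passages), key=lambda x: x[0])]
```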
The benefits are clear:
- Significant accuracy improvements in long context QA - over 2x in some cases! Together Llama goes from 50% to 90% on 30k-token contexts.
- Allows smaller 7B models to match proprietary 100B+ models like Claude and GPT-3.5, closing the capability gap.
- Particularly effective for models not specialized for long contexts, but still provides gains for state-of-the-art models.
- Simple to implement - no training or fine-tuning needed. Just plug it into your inference pipeline.
Best of all, the more you sort, the bigger the boost in performance! Iterative sorting focuses the model's attention on what matters.
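To make "the more you sort" concrete, here's a hedged sketch of the iterative loop built on the `attention_sort` helper sketched above. The round budget and the stop-when-stable rule are illustrative choices for the sketch, not a claim about the exact procedure.

```python
# Iterative attention sorting: repeat the decode-measure-reorder pass until the
# order stabilizes or a round budget is hit. Both the budget and the early-stop
# rule are illustrative choices.
def iterative_attention_sort(passages, question, rounds=3):
    for _ in range(rounds):
        reordered = attention_sort(passages, question)
        if reordered == passages:  # order is stable; more sorting won't change it
            break
        passages = reordered
    return passages

# Usage: generate the final answer from the sorted context.
# docs_sorted = iterative_attention_sort(docs, "Who chaired the committee?")
# prompt = "\n\n".join(docs_sorted) + "\n\nQuestion: Who chaired the committee?\nAnswer:"
```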
Give your model a better memory and stop critical information from being lost in the middle. Boost your QA, search, and summarization today with attention sorting!
Comparison with other methods:
- Efficient attention methods like FlashAttention make it feasible to scale to longer sequences, but they don't directly address recency bias in how the model uses its context.
- Retrieval augmented generation (RAG) combines a retriever with a generator; performance depends on retriever recall and on context length. Attention sorting could improve how the generator uses the retrieved context.
- Methods like chain-of-thought prompting and scratchpads manipulate the prompt in text space; attention sorting instead uses the model's own attention weights to re-order the context.
- kNN methods like the Memorizing Transformer use efficient retrieval over a long context history instead of full attention. That sparsity may improve context usage in a way similar to attention sorting.
- Attention truncation methods apply nucleus-style pruning to the attention distribution, but without fine-tuning they hurt performance in this setting.
- Some work uses attention for document retrieval, similar to attention sorting's re-ranking. But attention sorting operates at the granularity of passages rather than full documents.
Compared to these approaches, attention sorting is a straightforward plug-in method: it requires no architectural changes, no training, and no prompt engineering expertise, and it operates purely at inference time. The ability to sort iteratively and stop at an optimal point is also unique. The strong results suggest that, despite its simplicity, attention sorting is a promising technique, and combining it with other methods like dense retrieval may yield further improvements.
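As a closing illustration of that last point, here's a hedged sketch of stacking attention sorting on top of a dense retriever. The sentence-transformers model name is just an example choice, and `iterative_attention_sort` is the sketch defined earlier, not an official API.

```python
# Sketch: dense retrieval followed by attention sorting of the retrieved passages.
# The retriever model name is an example; iterative_attention_sort is the sketch
# defined above.
from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_and_sort(corpus, question, k=8):
    # 1. Dense retrieval: pick the k passages most similar to the question.
    k = min(k, len(corpus))
    doc_emb = retriever.encode(corpus, convert_to_tensor=True)
    q_emb = retriever.encode(question, convert_to_tensor=True)
    top_k = util.cos_sim(q_emb, doc_emb)[0].topk(k).indices.tolist()
    retrieved = [corpus[i] for i in top_k]

    # 2. Attention sorting: re-order the retrieved passages so the ones the
    #    model attends to most sit closest to the question.
    return iterative_attention_sort(retrieved, question)
```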