Attention sorting: A simple trick to boost your language model's long context skills
Is your language model forgetting crucial information buried deep in its context? Do its responses degrade as more distractor text is added? Attention sorting can help!
What is attention sorting? It's a nifty inference-time technique that re-orders your model's context by attention, pushing the most relevant information to the positions where it's most likely to be used.
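Here's roughly what one pass looks like in code. This is a minimal sketch, not the authors' released implementation: it assumes a Hugging Face causal LM run with eager attention so per-layer attention maps are available, and the helper names (`score_passages`, `attention_sort`) are illustrative.

```python
# Minimal sketch of a single attention-sorting pass.
# Assumption: a Hugging Face causal LM run with eager attention so that
# per-layer attention maps are returned; helper names are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, attn_implementation="eager")
model.eval()

def score_passages(passages, question):
    """Score each passage by how much attention the final prompt position
    (the one about to generate the answer) pays to that passage's tokens."""
    # Tokenize pieces separately so we know which positions belong to which passage.
    piece_ids = [tokenizer(p + "\n\n", add_special_tokens=False)["input_ids"]
                 for p in passages]
    tail_ids = tokenizer(f"Question: {question}\nAnswer:",
                         add_special_tokens=False)["input_ids"]

    input_ids, spans, pos = [], [], 0
    for ids in piece_ids:
        spans.append((pos, pos + len(ids)))
        input_ids.extend(ids)
        pos += len(ids)
    input_ids.extend(tail_ids)

    with torch.no_grad():
        out = model(torch.tensor([input_ids]), output_attentions=True)

    # out.attentions: one (batch, heads, seq, seq) tensor per layer.
    # Average over layers and heads, then take the last query position's row.
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0, -1]  # shape: (seq,)
    return [attn[start:end].sum().item() for start, end in spans]

def attention_sort(passages, question):
    """Re-order passages so the most-attended ones land last, i.e. closest to
    the question, where the model is most likely to use them."""
    scores = score_passages(passages, question)
    return [p for _, p in sorted(zip(scores, passages), key=lambda x: x[0])]
```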
The benefits are clear:
- Significant accuracy improvements in long context QA - over 2x in some cases! Together Llama goes from 50% to 90% on 30k-token contexts.
- Allows smaller 7B models to match proprietary 100B+ models like Claude and GPT-3.5, closing the capability gap.
- Particularly effective for models not specialized for long contexts, but still provides gains for state-of-the-art models.
- Simple to implement - no training or fine-tuning needed. Just plug it into your inference pipeline.
Best of all, the more you sort, the bigger the boost in performance! Iterative sorting focuses the model's attention on what matters.
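To make "the more you sort" concrete, here's a hedged sketch of the iterative loop built on the `attention_sort` helper sketched above. The round budget and the stop-when-stable rule are illustrative choices for the sketch, not a claim about the exact procedure.

```python
# Iterative attention sorting: repeat the decode-measure-reorder pass until the
# order stabilizes or a round budget is hit. Both the budget and the early-stop
# rule are illustrative choices.
def iterative_attention_sort(passages, question, rounds=3):
    for _ in range(rounds):
        reordered = attention_sort(passages, question)
        if reordered == passages:  # order is stable; more sorting won't change it
            break
        passages = reordered
    return passages

# Usage: generate the final answer from the sorted context.
# docs_sorted = iterative_attention_sort(docs, "Who chaired the committee?")
# prompt = "\n\n".join(docs_sorted) + "\n\nQuestion: Who chaired the committee?\nAnswer:"
```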
Give your model a better memory and stop critical information from being lost in the middle. Boost your QA, search, and summarization today with attention sorting!
Comparison with other methods:
- Efficient attention methods like FlashAttention make it feasible to scale to longer sequences, but they don't directly address recency bias in how the model uses its context.
- Retrieval augmented generation (RAG) combines a retriever with a generator; performance depends on retriever recall and on context length. Attention sorting could improve how the generator uses the retrieved context.
- Methods like chain-of-thought prompting and scratchpads manipulate the prompt in text space; attention sorting instead uses the model's own attention weights to re-order the context.
- kNN methods like the Memorizing Transformer use efficient retrieval over a long context history instead of full attention. That sparsity may improve context usage in a way similar to attention sorting.
- Attention truncation methods apply nucleus-style pruning to the attention distribution, but without fine-tuning they hurt performance in this setting.
- Some work uses attention for document retrieval, similar to attention sorting's re-ranking. But attention sorting operates at the granularity of passages rather than full documents.
Compared to these approaches, attention sorting is a straightforward plug-in method: it requires no architectural changes, no training, and no prompt engineering expertise, and it operates purely at inference time. The ability to sort iteratively and stop at an optimal point is also unique. The strong results suggest that, despite its simplicity, attention sorting is a promising technique, and combining it with other methods like dense retrieval may yield further improvements.
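As a closing illustration of that last point, here's a hedged sketch of stacking attention sorting on top of a dense retriever. The sentence-transformers model name is just an example choice, and `iterative_attention_sort` is the sketch defined earlier, not an official API.

```python
# Sketch: dense retrieval followed by attention sorting of the retrieved passages.
# The retriever model name is an example; iterative_attention_sort is the sketch
# defined above.
from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_and_sort(corpus, question, k=8):
    # 1. Dense retrieval: pick the k passages most similar to the question.
    k = min(k, len(corpus))
    doc_emb = retriever.encode(corpus, convert_to_tensor=True)
    q_emb = retriever.encode(question, convert_to_tensor=True)
    top_k = util.cos_sim(q_emb, doc_emb)[0].topk(k).indices.tolist()
    retrieved = [corpus[i] for i in top_k]

    # 2. Attention sorting: re-order the retrieved passages so the ones the
    #    model attends to most sit closest to the question.
    return iterative_attention_sort(retrieved, question)
```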