“Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?”, 2024-06-19:
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. Leveraging LCLMs’ ability to natively ingest and process entire corpora of information offers numerous advantages. It enhances user-friendliness by eliminating the need for specialized knowledge of tools, provides robust end-to-end modeling that minimizes cascading errors in complex pipelines, and allows for the application of sophisticated prompting techniques across the entire system.
To assess this paradigm shift, we introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs’ performance on in-context retrieval and reasoning. Our findings reveal LCLMs’ surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
However, LCLMs still face challenges in areas such as compositional reasoning, which is required for SQL-like tasks. Notably, prompting strategies significantly influence performance, emphasizing the need for continued research as context lengths grow.
Overall, LOFT provides a rigorous testing ground for LCLMs, showcasing their potential to supplant existing paradigms and tackle novel tasks as model capabilities scale.
…3.3 Discussion on Efficiency: Encoding a 1-million-token context can be slow and computationally expensive. One key advantage of CiC prompting is its compatibility with prefix caching in autoregressive language models, since the query appears at the end of the prompt. This means the corpus only needs to be encoded once, similar to the indexing process in traditional information retrieval. As demonstrated in §5, encoding the corpus as the prefix in this way does not lead to a performance drop in LCLMs…The presentation of document IDs also affects performance. In particular, replacing monotonic numerical IDs with random ones (Alphanumeric IDs) negatively impacts performance on most datasets. This could possibly be due to the way in which numbers are tokenized, with fewer tokens for certain numbers. [This could also reflect the IDs being meaningfully ordered; cf. AUNN, dynamic evaluation.] Placing the IDs only at the front of each document, instead of at both the front and the back (Without ID Echo), also resulted in a 5% performance drop, confirming that repeating text can compensate for missing context in autoregressive language models.
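The two design points above (a query-last prompt that keeps the corpus as a cacheable prefix, and IDs echoed at both ends of each document) can be sketched as follows. This is a minimal illustration, not LOFT's actual prompt template: the helper names, the `ID:`/`END ID:` markers, and the instruction text are all assumptions made here for concreteness.

```python
# Sketch of Corpus-in-Context (CiC) prompt assembly. Function names and the
# exact formatting are illustrative; LOFT's real templates differ in detail.

def format_document(doc_id: str, text: str, echo_id: bool = True) -> str:
    """Render one document; optionally repeat ("echo") the ID after the text,
    so a causal LM re-encounters the ID once the content has been read."""
    parts = [f"ID: {doc_id} | {text}"]
    if echo_id:
        parts.append(f"END ID: {doc_id}")  # the "ID Echo" ablated in the paper
    return "\n".join(parts)

def build_cic_prompt(corpus, query: str) -> tuple[str, str]:
    """Return (prefix, suffix) for a CiC prompt.

    The corpus prefix is byte-identical across queries, so an autoregressive
    LM can compute its key/value states once (prefix caching) and reuse them,
    analogous to building an index in traditional information retrieval.
    Only the short suffix must be re-encoded per query.
    """
    prefix = "\n\n".join(format_document(doc_id, text) for doc_id, text in corpus)
    suffix = f"\n\nQuery: {query}\nAnswer with the relevant document ID."
    return prefix, suffix

corpus = [
    ("0001", "The Eiffel Tower is in Paris."),
    ("0002", "The Colosseum is in Rome."),
]
prefix, suffix = build_cic_prompt(corpus, "Where is the Colosseum?")
# `prefix` is reusable (and cacheable) across queries; only `suffix` changes.
```

In a serving stack with automatic prefix caching, concatenating `prefix + suffix` for each query would then hit the cache for the entire corpus span, which is why putting the query at the end, rather than the beginning, matters for efficiency.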