Want to know how our team at @neeva seamlessly combined LLMs with our search stack? And how we reduced time to first byte from 8 seconds to less than 1.5 seconds? Read on below…🧵
To get answers on NeevaAI, there are 4 phases: 1/ Retrieve and rank pages from our 4-billion-page index (soon to be 6B) 2/ Compute extractive and abstractive summaries for the top pages 3/ Send back a search result page w/ a skeleton NeevaAI answer 4/ Stream the AI answer back
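Roughly, that pipeline has the shape sketched below. This is a hypothetical, minimal version: every function name, value, and stub is illustrative, not Neeva's actual code.

```python
# Hypothetical sketch of the 4 phases; all names, values, and stubs are illustrative.
from typing import Iterator, List

def retrieve_and_rank(query: str, top_k: int = 8) -> List[str]:
    # 1/ Retrieve and rank pages from the index (~500ms in practice).
    return [f"page {i} about {query}" for i in range(top_k)]

def summarize(page: str) -> str:
    # 2/ Extractive + abstractive summary for one page (the expensive step).
    return page[:80]

def generate_answer(query: str, summaries: List[str]) -> Iterator[str]:
    # Stand-in for the LLM that composes the final answer from the summaries.
    yield from f"Answer to '{query}' built from {len(summaries)} summaries.".split()

def answer_stream(query: str) -> Iterator[str]:
    pages = retrieve_and_rank(query)
    summaries = [summarize(p) for p in pages]
    # 3/ Send back the search result page with a skeleton NeevaAI answer first...
    yield "<skeleton answer>"
    # 4/ ...then stream the AI answer back token by token.
    yield from generate_answer(query, summaries)

for chunk in answer_stream("how do solar panels work"):
    print(chunk)
```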
Step 1 can be done in about 500ms, but step 2 took >7 seconds. That is an unacceptable delay for any search engine! Our team had to get into model brain surgery. While deep learning has traditionally favored model symmetry, we find it can be troublesome for efficient inference.
We used an open-source T5 model for first-level abstractive summarization, meaning it has encoder and decoder stacks of equal size. While the encoder runs once on the input text, the decoder is called once for every token generated.
Our summaries are ~100 tokens, which means model inference is dominated by the cost of the decoder. We exploit this asymmetry: prune the decoder while keeping the encoder fixed. This gives us a 2.5x speedup.
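In Hugging Face transformers terms, that kind of decoder-only pruning can be sketched roughly as below; the base checkpoint, the number of layers kept, and the lack of any post-pruning fine-tuning are all illustrative assumptions, not Neeva's actual recipe.

```python
# Illustrative sketch: keep the full T5 encoder, prune the decoder stack.
# Layer count is an assumption; a pruned model would normally be fine-tuned/distilled afterwards.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-base")  # 12 encoder + 12 decoder blocks
tokenizer = T5Tokenizer.from_pretrained("t5-base")

keep = 3  # keep only the first 3 decoder blocks (block 0 holds the relative position bias)
model.decoder.block = torch.nn.ModuleList(list(model.decoder.block)[:keep])
model.config.num_decoder_layers = keep

inputs = tokenizer("summarize: The decoder runs once per generated token, "
                   "so shrinking it cuts most of the inference cost.",
                   return_tensors="pt")
ids = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```

Because the encoder runs only once per request, all of the savings come out of the per-token decode loop.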

We then used the FasterTransformer library to optimize our models for serving on the Ampere GPU series. That reduced latency by another factor of 3. All of this brought our average latencies down to less than a second! Wait, there’s more…
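FasterTransformer itself is a library of custom CUDA kernels that you build and load separately, so it doesn't fit in a sketch; as a rough stand-in for the kind of GPU-side optimization involved, here's a plain PyTorch FP16 serving setup (illustrative only, not the FasterTransformer API):

```python
# Illustrative stand-in only: FP16 GPU inference, not the actual FasterTransformer kernels.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = T5ForConditionalGeneration.from_pretrained("t5-base").to(device).eval()
if device == "cuda":
    model = model.half()  # FP16 maps well onto Ampere tensor cores

tokenizer = T5Tokenizer.from_pretrained("t5-base")
inputs = tokenizer("summarize: A shorter decode loop plus faster kernels "
                   "is what gets the summaries under a second.",
                   return_tensors="pt").to(device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```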
We still found that our 90+%-ile latencies tended to go above 1-2s (!!!) at times. Using a combination of our benchmarks and telemetry from production deployments, we tracked this down to two factors…
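Tail latency is easy to miss if you only look at averages; a toy way to see it in a benchmark (the workload and sample count here are stand-ins) is:

```python
# Toy benchmark: look at p50 vs p90/p99, not just the mean. The workload is a stand-in.
import random
import time
import numpy as np

def timed(fn, n: int = 200):
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return np.percentile(latencies, [50, 90, 99])

# e.g. fn = lambda: model.generate(**inputs, max_new_tokens=100) against a real model;
# here a sleep with occasional slow outliers stands in for long-document requests.
p50, p90, p99 = timed(lambda: time.sleep(0.01 if random.random() < 0.9 else 0.1))
print(f"p50={p50*1000:.0f}ms  p90={p90*1000:.0f}ms  p99={p99*1000:.0f}ms")
```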
1️⃣ Some models (e.g., Q&A model types) do not truncate input sequences to a max length when run, which can lead to long inference times on longer documents 2️⃣ On some requests we had a lot of web pages to summarize. So we did a few more things…
We truncated long documents, disabled server-side queueing, and relied on load balancers with client-side retries. This allowed us to tune our product-side setup (# of docs we pick for each summarizer, # of tokens to summarize per document) to bring down our 99%-ile latency by 3x!
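On the input side, those fixes boil down to bounding the work per request. A minimal sketch, assuming a Hugging Face tokenizer and with caps chosen purely for illustration (not Neeva's production values):

```python
# Sketch of bounding per-request work; the caps are illustrative.
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

MAX_DOCS_PER_QUERY = 5    # how many pages we hand to the summarizer per request
MAX_TOKENS_PER_DOC = 512  # truncate long documents so encoder cost is bounded

def prepare_batch(documents):
    docs = documents[:MAX_DOCS_PER_QUERY]
    return tokenizer(
        ["summarize: " + d for d in docs],
        truncation=True,               # hard cap on input length
        max_length=MAX_TOKENS_PER_DOC,
        padding=True,
        return_tensors="pt",
    )
```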
Net net: large language models running in production at scale, delivering interactive & delightful experiences to all of you! Want to check it out? Create a free account to access NeevaAI and try it out for yourself at neeva.com. No ads, just answers.
We have more ideas for knowledge distillation and pseudolabeling that are going to make inference even faster. Stay tuned for future updates on @neeva 👀
Replying to @RamaswmySridhar
Very astute observations. Typically multi headed (task) decoders also require such optimisations. Thanks for sharing the steps in public!