To get answers on NeevaAI, there are 4 phases:
1/ Retrieve and rank pages from our 4-billion-page index (soon to be 6 billion)
2/ Compute extractive and abstractive summaries for the top pages
3/ Send back a search result page w/a skeleton NeevaAI answer
4/ Stream the AI answer back
Step 1 takes about 500 ms, but step 2 took more than 7 seconds.
That is an unacceptable delay for any search engine!
Our team had to get into model brain surgery!
While deep learning has traditionally favored model symmetry, we find this can be troublesome for efficient inference.
Our summaries are ~100 tokens, which means inference cost is dominated by the decoder, since it runs once per generated token.
We exploit this asymmetry by pruning the decoder while keeping the encoder fixed.
This gives us a 2.5x speedup.
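To see why pruning only the decoder pays off, here is a toy cost model. It is a rough sketch, not our real profiler: the layer counts and per-layer timings are made-up placeholders, and the only thing it encodes is that the encoder runs once per input while the decoder runs once per output token.

```python
# Toy latency model for an encoder-decoder summarizer.
# Every number here is a hypothetical placeholder, not a NeevaAI measurement.

def seq2seq_latency_ms(enc_layers, dec_layers, out_tokens,
                       enc_layer_ms=5.0, dec_layer_step_ms=1.0):
    """Encoder runs once over the input; decoder runs once per output token."""
    encoder_ms = enc_layers * enc_layer_ms
    decoder_ms = dec_layers * dec_layer_step_ms * out_tokens
    return encoder_ms + decoder_ms

OUT_TOKENS = 100  # summaries are ~100 tokens

# Symmetric model: 12 encoder layers, 12 decoder layers (assumed sizes).
baseline = seq2seq_latency_ms(enc_layers=12, dec_layers=12, out_tokens=OUT_TOKENS)

# Asymmetric model: same encoder, decoder pruned to 4 layers (assumed size).
pruned = seq2seq_latency_ms(enc_layers=12, dec_layers=4, out_tokens=OUT_TOKENS)

# Fraction of baseline time spent in the decoder.
decoder_share = (12 * 1.0 * OUT_TOKENS) / baseline
speedup = baseline / pruned

print(f"decoder share of baseline: {decoder_share:.0%}")
print(f"speedup from decoder-only pruning: {speedup:.2f}x")
```

In this toy setup almost all of the time sits in the decoder, so shrinking it moves total latency far more than touching the encoder would.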
Feb 4, 2023 · 1:57 PM UTC
We then used NVIDIA's FasterTransformer library to optimize our models for serving on Ampere-series GPUs.
That reduced latency by another factor of 3.
Together, these changes brought our average latency down to less than a second!
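A quick back-of-the-envelope check on those numbers. This sketch assumes the two speedups multiply cleanly and takes 7 seconds as the starting point, which is the lower bound quoted above.

```python
# Back-of-the-envelope latency check using the figures from this thread.
# Assumption: the two speedups compose multiplicatively.

baseline_s = 7.0        # step 2 originally took a bit over 7 s
pruning_speedup = 2.5   # from pruning the decoder
serving_speedup = 3.0   # from the optimized serving stack

after_pruning_s = baseline_s / pruning_speedup
after_serving_s = after_pruning_s / serving_speedup

print(f"after pruning: {after_pruning_s:.2f} s")
print(f"after serving optimizations: {after_serving_s:.2f} s")
```

That lands at roughly 2.8 s after pruning and about 0.93 s after the serving optimizations, which matches the sub-second average.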
Wait, there’s more…
We still found that our 90+%-ile latencies tended to go above 1-2s (!!!) at times.
Using a combination of our benchmarks and telemetry from production deployments, we tracked this down to two factors…
1️⃣ Some models (e.g. Q&A model types) do not truncate input sequences to a max length, which can lead to long inference times on longer documents
2️⃣ On some requests we had a lot of web pages to summarize
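A small illustration of how a healthy average can hide a slow tail. The latency sample below is invented for the example (it is not production telemetry), and the percentile helper is a plain nearest-rank implementation.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical request latencies in seconds: most requests are fast,
# but a minority hit long documents or many pages to summarize.
latencies_s = [0.7] * 88 + [2.2] * 12

mean_s = sum(latencies_s) / len(latencies_s)
p90_s = percentile(latencies_s, 90)
p99_s = percentile(latencies_s, 99)

print(f"mean={mean_s:.2f}s p90={p90_s:.1f}s p99={p99_s:.1f}s")
```

Here the mean stays under a second while p90 and p99 sit above 2 s, which is why we looked at tail percentiles and not only the average.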
So we did a few more things…
We truncated long documents, disabled server-side queueing, and relied on load balancers with client-side retries.
This allowed us to tune our product-side setup (# of docs we pick for each summarizer, # of tokens to summarize per document) to bring down our 99%-ile latency by 3x!
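Roughly what that combination looks like, as a minimal sketch. The `Replica` class, the `Busy` error, the 512-token cap, and the retry count are all hypothetical stand-ins for illustration, not our serving code.

```python
import random

MAX_INPUT_TOKENS = 512  # hypothetical cap; the real limit is a tuning knob

def truncate(tokens, max_len=MAX_INPUT_TOKENS):
    """Bound inference time by capping the input length."""
    return tokens[:max_len]

class Busy(Exception):
    """Raised by a replica that is already serving a request (no queueing)."""

class Replica:
    """Stand-in for one model server that rejects work instead of queueing it."""
    def __init__(self, name):
        self.name = name
        self.busy = False

    def summarize(self, tokens):
        if self.busy:
            raise Busy(self.name)
        return f"{self.name}: summary of {len(tokens)} tokens"

def summarize_with_retries(replicas, tokens, max_attempts=3):
    """Client-side retry: try up to max_attempts distinct replicas in random order."""
    tokens = truncate(tokens)
    candidates = random.sample(replicas, k=min(max_attempts, len(replicas)))
    for replica in candidates:
        try:
            return replica.summarize(tokens)
        except Busy:
            continue  # fail fast and move on to another replica
    return None  # caller can drop this document from the answer

replicas = [Replica("a"), Replica("b")]
replicas[0].busy = True
result = summarize_with_retries(replicas, list(range(2000)))
print(result)
```

The idea is that a busy server rejects immediately, so a request never waits in a queue behind a slow one; the client just tries another replica.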
Net net: large language models running in production at scale, delivering interactive & delightful experiences to all of you!
Want to check it out? Create a free account to access NeevaAI and try it out for yourself. No ads, just answers.
We have more ideas around knowledge distillation and pseudolabeling that are going to make inference even faster.
Stay tuned for future updates on @neeva 👀