Want to know how our team at @neeva seamlessly combined LLMs with our search stack? And how we reduced time to first byte from 8 seconds to less than 1.5 seconds? Read on below…🧵
To get answers on NeevaAI, there are 4 phases: 1/ Retrieve and rank pages from our 4-billion-page index (soon to be 6B) 2/ Compute extractive and abstractive summaries for the top pages 3/ Send back a search result page w/ a skeleton NeevaAI answer 4/ Stream the AI answer back
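Roughly, that pipeline has the shape sketched below. This is a hypothetical, minimal version: every function name, value, and stub is illustrative, not Neeva's actual code.

```python
# Hypothetical sketch of the 4 phases; all names, values, and stubs are illustrative.
from typing import Iterator, List

def retrieve_and_rank(query: str, top_k: int = 8) -> List[str]:
    # 1/ Retrieve and rank pages from the index (~500ms in practice).
    return [f"page {i} about {query}" for i in range(top_k)]

def summarize(page: str) -> str:
    # 2/ Extractive + abstractive summary for one page (the expensive step).
    return page[:80]

def generate_answer(query: str, summaries: List[str]) -> Iterator[str]:
    # Stand-in for the LLM that composes the final answer from the summaries.
    yield from f"Answer to '{query}' built from {len(summaries)} summaries.".split()

def answer_stream(query: str) -> Iterator[str]:
    pages = retrieve_and_rank(query)
    summaries = [summarize(p) for p in pages]
    # 3/ Send back the search result page with a skeleton NeevaAI answer first...
    yield "<skeleton answer>"
    # 4/ ...then stream the AI answer back token by token.
    yield from generate_answer(query, summaries)

for chunk in answer_stream("how do solar panels work"):
    print(chunk)
```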
Step 1 can be done in about 500ms, but step 2 took >7 seconds. That is an unacceptable delay for any search engine! Our team had to get into model brain surgery. While deep learning has traditionally favored model symmetry, we find it can be troublesome for efficient inference.
We used an open-source T5 model for first-level abstractive summarization, meaning it has encoder and decoder stacks of equal size. While the encoder runs once on the input text, the decoder is called once for every token generated.
Our summaries are ~100 tokens, which means model inference is dominated by the cost of the decoder. We exploit this asymmetry: prune the decoder while keeping the encoder fixed. This gives us a 2.5x speedup.
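In Hugging Face transformers terms, that kind of decoder-only pruning can be sketched roughly as below; the base checkpoint, the number of layers kept, and the lack of any post-pruning fine-tuning are all illustrative assumptions, not Neeva's actual recipe.

```python
# Illustrative sketch: keep the full T5 encoder, prune the decoder stack.
# Layer count is an assumption; a pruned model would normally be fine-tuned/distilled afterwards.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-base")  # 12 encoder + 12 decoder blocks
tokenizer = T5Tokenizer.from_pretrained("t5-base")

keep = 3  # keep only the first 3 decoder blocks (block 0 holds the relative position bias)
model.decoder.block = torch.nn.ModuleList(list(model.decoder.block)[:keep])
model.config.num_decoder_layers = keep

inputs = tokenizer("summarize: The decoder runs once per generated token, "
                   "so shrinking it cuts most of the inference cost.",
                   return_tensors="pt")
ids = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```

Because the encoder runs only once per request, all of the savings come out of the per-token decode loop.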

We then used the FasterTransformer library to optimize our models for serving on the Ampere GPU series. That reduced latency by another factor of 3. All of this brought our average latencies down to less than a second! Wait, there’s more…
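FasterTransformer itself is a library of custom CUDA kernels that you build and load separately, so it doesn't fit in a sketch; as a rough stand-in for the kind of GPU-side optimization involved, here's a plain PyTorch FP16 serving setup (illustrative only, not the FasterTransformer API):

```python
# Illustrative stand-in only: FP16 GPU inference, not the actual FasterTransformer kernels.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = T5ForConditionalGeneration.from_pretrained("t5-base").to(device).eval()
if device == "cuda":
    model = model.half()  # FP16 maps well onto Ampere tensor cores

tokenizer = T5Tokenizer.from_pretrained("t5-base")
inputs = tokenizer("summarize: A shorter decode loop plus faster kernels "
                   "is what gets the summaries under a second.",
                   return_tensors="pt").to(device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```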
We still found that our 90+%-ile latencies tended to go above 1-2s (!!!) at times. Using a combination of our benchmarks and telemetry from production deployments, we tracked this down to two factors…
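Tail latency is easy to miss if you only look at averages; a toy way to see it in a benchmark (the workload and sample count here are stand-ins) is:

```python
# Toy benchmark: look at p50 vs p90/p99, not just the mean. The workload is a stand-in.
import random
import time
import numpy as np

def timed(fn, n: int = 200):
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return np.percentile(latencies, [50, 90, 99])

# e.g. fn = lambda: model.generate(**inputs, max_new_tokens=100) against a real model;
# here a sleep with occasional slow outliers stands in for long-document requests.
p50, p90, p99 = timed(lambda: time.sleep(0.01 if random.random() < 0.9 else 0.1))
print(f"p50={p50*1000:.0f}ms  p90={p90*1000:.0f}ms  p99={p99*1000:.0f}ms")
```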
1️⃣ Some models (e.g., Q&A model types) do not truncate input sequences to a max length when run, which can lead to long inference times on longer documents 2️⃣ On some requests we had a lot of web pages to summarize. So we did a few more things…
We truncated long documents, disabled server-side queueing, and relied on load balancers with client-side retries. This allowed us to tune our product-side setup (# of docs we pick for each summarizer, # of tokens to summarize per document) to bring down our 99%-ile latency by 3x!
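On the input side, those fixes boil down to bounding the work per request. A minimal sketch, assuming a Hugging Face tokenizer and with caps chosen purely for illustration (not Neeva's production values):

```python
# Sketch of bounding per-request work; the caps are illustrative.
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

MAX_DOCS_PER_QUERY = 5    # how many pages we hand to the summarizer per request
MAX_TOKENS_PER_DOC = 512  # truncate long documents so encoder cost is bounded

def prepare_batch(documents):
    docs = documents[:MAX_DOCS_PER_QUERY]
    return tokenizer(
        ["summarize: " + d for d in docs],
        truncation=True,               # hard cap on input length
        max_length=MAX_TOKENS_PER_DOC,
        padding=True,
        return_tensors="pt",
    )
```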
Net net: large language models running in production at scale, delivering interactive & delightful experiences to all of you! Want to check it out? Create a free account to access NeevaAI and try it out for yourself at neeva.com. No ads, just answers.
We have more ideas for knowledge distillation and pseudolabeling that are going to make inference even faster. Stay tuned for future updates on @neeva 👀
Replying to @RamaswmySridhar
Very astute observations. Typically multi headed (task) decoders also require such optimisations. Thanks for sharing the steps in public!