Correctly answering questions from a 14,000 word scientific paper… pointing my chatbot to a doc store with embeddings index

Jan 8, 2023 · 4:53 PM UTC

Indexing was surprisingly fast. I indexed per sentence. For retrieval I surface the best 3 matches and then put each, together with the sentences before and after it (for more context), into a prompt that determines how to answer.
For long docs, this makes answering detailed questions work very well. But if I want to answer general questions pertaining to the entire doc (e.g., how many times is XYZ mentioned in the doc?) I'd need to look at more results, or also embed summaries.
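A minimal Python sketch of the "match plus neighboring sentences" step described above. The splitter is deliberately naive, the function names and prompt wording are hypothetical, and the matched sentence indices are assumed to come from a separate embedding search; this is an illustration, not the code behind the bot.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def with_context(sentences, idx):
    """Return the matched sentence joined with its previous and next sentence."""
    start = max(idx - 1, 0)
    end = min(idx + 2, len(sentences))
    return " ".join(sentences[start:end])

def build_prompt(question, sentences, match_indices):
    """Assemble a prompt from the top matches (hypothetical prompt wording)."""
    passages = [with_context(sentences, i) for i in match_indices]
    numbered = "\n".join(f"[{n}] {p}" for n, p in enumerate(passages, 1))
    return (
        "Answer the question using only the passages below.\n\n"
        f"{numbered}\n\nQuestion: {question}\nAnswer:"
    )
```

Widening the window beyond one sentence on each side trades a longer prompt for more surrounding context.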
I also indexed long speeches, company FAQs and other docs. This turns them into a Q&A bot with good results.
A lot of people have been playing with this (embeddings + dynamic prompts) to implement document Q&A for a while. GPTIndex @jerryjliu0 and LangChain @hwchase17 are two libraries that can help with this.
Try it here… sloppyjoe.com/summarize (the first question builds the index, so be prepared to wait or to retry if there is an error)
Answering questions about Little Red Riding Hood
Btw, I'm not using any vector DB. I just embed the question, pull all the indexed vectors, compute cosine similarity, sort, and pull the top 3. There are hundreds of embeddings per doc. By far the slowest step is then asking GPT-3 to compute an answer given the top results and the question.
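The brute-force lookup described above can be sketched in plain Python as follows. The vectors are assumed to be lists of floats already produced by some embedding model, and `cosine_similarity` and `top_k` are hypothetical helper names. With only hundreds of vectors per doc, a linear scan like this is cheap.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(question_vec, indexed_vecs, k=3):
    """Score every indexed vector against the question and return the k best indices."""
    scored = [
        (cosine_similarity(question_vec, vec), i)
        for i, vec in enumerate(indexed_vecs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]
```

The returned indices point back into the list of indexed sentences, so they can be used to look up the matching text.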
Replying to @jheitzeb
How do you extract text from PDF in a way that gives you good formatting?
I don't handle PDF. One potentially foolproof approach I'd consider: 1) render each page as an image, 2) extract the text using tesseract or the AWS OCR service