Joe Heitzeberg · Jan 8, 2023 · 4:53 PM UTC

Correctly answering questions from a 14,000-word scientific paper… pointing my chatbot to a doc store with an embeddings index

Joe Heitzeberg · Jan 8, 2023 · 4:55 PM UTC

Joe Heitzeberg

@jheitzeb

Indexing was surprisingly fast. I indexed per sentence. For retrieval, I surface the best 3 matches and then put each, together with the sentences before and after it (for more context), into a prompt that determines how to answer.
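A minimal sketch of the context step described in this tweet, assuming a naive regex sentence splitter and that the indices of the best matches are already known. The function names and the prompt wording are hypothetical, not taken from the thread.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def with_neighbors(sentences: list[str], idx: int) -> str:
    """Join the matched sentence with the one before and the one after it."""
    start = max(idx - 1, 0)
    end = min(idx + 2, len(sentences))
    return " ".join(sentences[start:end])


def build_prompt(question: str, sentences: list[str], match_indices: list[int]) -> str:
    """Assemble a prompt from the matched sentences plus their neighbors.

    The prompt template is an assumption made for illustration.
    """
    passages = [with_neighbors(sentences, i) for i in match_indices]
    context = "\n\n".join(f"Passage {n}: {p}" for n, p in enumerate(passages, 1))
    return (
        "Answer the question using only the passages below.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Each sentence would be embedded on its own at indexing time; the neighbors are only pulled in when the prompt is built, so the index stays fine-grained while the model still sees some surrounding text.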

Joe Heitzeberg · Jan 8, 2023 · 4:57 PM UTC

For long docs, this makes answering detailed questions work very well. But to answer general questions about the entire doc (e.g., how many times is XYZ mentioned in the doc?), I’d need to look at more results, or also embed summaries.

Joe Heitzeberg · Jan 8, 2023 · 4:58 PM UTC

I also indexed long speeches, company FAQs and other docs. This turns them into a Q&A bot with good results.

Joe Heitzeberg · Jan 8, 2023 · 5:00 PM UTC

A lot of people have been playing with this (embeddings + dynamic prompts) to implement document Q&A for a while. GPTIndex @jerryjliu0 and LangChain @hwchase17 are two libraries that can help with this.

Joe Heitzeberg · Jan 8, 2023 · 5:02 PM UTC

Try it here… sloppyjoe.com/summarize (the first question builds the index, so be prepared to wait or to retry if there is an error)

AI Summarizer

Summarize any text or URL with our AI summarizer. It's free and easy to use.

sloppyjoe.com

Joe Heitzeberg · Jan 8, 2023 · 5:17 PM UTC

Answering questions about Little Red Riding Hood

Joe Heitzeberg · Jan 8, 2023 · 6:45 PM UTC

Btw, I’m not using any vector DB. I just embed the question, pull all the indexed vectors, compute cosine similarity, sort, and pull the top 3. There are hundreds of embeddings per doc. By far the slowest step is then asking GPT-3 to compute an answer given the top results and the question.
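A minimal sketch of that brute-force ranking, assuming the embeddings are already available as plain lists of floats. The function names are hypothetical, and the embedding call itself is left out.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed_vecs, k=3):
    """Score every indexed vector against the query and keep the k best.

    Returns (index, score) pairs, highest score first.
    """
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(indexed_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]
```

With only hundreds of vectors per document, a linear scan like this is cheap compared with the model call that follows, which fits the observation that answer generation is the slow step.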

Joe Heitzeberg · Jan 10, 2023 · 4:38 AM UTC

Just made it faster and more resilient: sloppyjoe.com/summarize


Chanchana 🐳🧙‍♂️ · Jan 9, 2023 · 6:20 AM UTC

Chanchana 🐳🧙‍♂️ @off99555

Replying to @jheitzeb

How do you extract text from PDF in a way that gives you good formatting?

Joe Heitzeberg · Jan 9, 2023 · 8:54 PM UTC

I don't handle PDF. One potentially foolproof approach I'd consider: 1) render each page as an image, 2) extract the text using Tesseract or the AWS OCR service.
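A rough sketch of that two-step idea, assuming the third-party `pdf2image` package (which needs Poppler) for rendering and `pytesseract` (which needs the Tesseract binary) for OCR. Both library choices are assumptions; the thread names only Tesseract and an AWS OCR service. The `ocr_pdf` function and its parameters are hypothetical, and the two steps are injectable so either can be swapped for another renderer or OCR backend.

```python
def ocr_pdf(path, render_pages=None, ocr_image=None, dpi=300):
    """Return one OCR'd text string per page of the PDF at `path`.

    render_pages: callable(path) -> iterable of page images (step 1).
    ocr_image:    callable(image) -> str (step 2).
    The defaults below are assumed library choices, imported lazily so the
    function can also be used with other backends.
    """
    if render_pages is None:
        from pdf2image import convert_from_path  # assumption: pdf2image + Poppler installed

        def render_pages(p):
            return convert_from_path(p, dpi=dpi)

    if ocr_image is None:
        import pytesseract  # assumption: pytesseract + Tesseract binary installed

        ocr_image = pytesseract.image_to_string

    # Step 1: render each page as an image. Step 2: OCR each image.
    return [ocr_image(page).strip() for page in render_pages(path)]
```

Calling `ocr_pdf("paper.pdf")` would use the assumed defaults. OCR output loses layout such as columns and tables, so the text may still need cleanup before sentence-level indexing.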