Ok, let's go with one more insane AI application. As in the French app I showed the other day, I coded this in one evening, knowing basically nothing ex-ante about Flask or AJAX. This time, let's build your personal research index!


Here's the idea - I want an LLM to be able to answer any question I ask, in any form, about any of my own research papers, including those that aren't even online. I want to do it with <100 lines of code, for a couple cents per session, with no costly fine-tuning, and with high accuracy.
The code is two parts. Part 1 grabs every PDF, LaTeX, or text file in a folder you specify, "reads" them (more on this in a sec), and gets everything set up for your queries. You run it once; it takes about five seconds per paper on my laptop, and costs about a penny per paper.
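Here's roughly what that ingestion step looks like in Python - a minimal sketch, not the exact code on my Git; the helper names are illustrative and I'm assuming pypdf for the PDF extraction:

```python
import os
from pypdf import PdfReader  # assumption: pypdf for PDF text extraction

def read_document(path):
    """Pull raw text out of a .pdf, .tex, or .txt file."""
    if path.endswith(".pdf"):
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    # .tex and .txt files are already plain text
    with open(path, encoding="utf-8", errors="ignore") as f:
        return f.read()

def load_folder(folder):
    """Read every supported file in a folder into a {filename: text} dict."""
    return {
        name: read_document(os.path.join(folder, name))
        for name in os.listdir(folder)
        if name.endswith((".pdf", ".tex", ".txt"))
    }
```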
Now the fun part. To make things simple, I just used a single paper, with @AmirSariri and Mitch Hoffman. It's an experiment showing that applicants do poorly at finding the highest-quality startups to apply to, and that credible quality info shifts behavior a ton: kevinbryanecon.com/BryanHoff…
What's the main new contribution of the paper? This is right on.
Some factual details about the setup, or qualitative questions about treatment effect heterogeneity? Got it.
What papers did the lit review think were most related? Where do we go from here? Can you answer some details about the experimental design? No problem.
Does it BS? No. We didn't discuss heterogeneity by ethnic background in the paper. And "KevAIn", my virtual assistant, correctly notes this when asked. By the way - these answers you are seeing? Not cherry-picked. Literally my first practice run through the 'app'.
The cool thing here is there is no fine-tuning of an LLM or anything else that complex. We're just doing 3 things. First - take all of the text of all the articles you give it, and split it into overlapping 200-word chunks. Embed the 'paragraphs' in high-dimensional space.
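A minimal sketch of that chunk-and-embed step (the 50-word overlap and the function names are my illustrative choices; text-embedding-ada-002 was the standard OpenAI embedding model at the time):

```python
import openai  # pre-1.0 openai client; assumes openai.api_key is set

def chunk_text(text, size=200, overlap=50):
    """Split text into overlapping chunks of roughly `size` words."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def embed_chunks(chunks):
    """Map each chunk to a point in high-dimensional space."""
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=chunks)
    return [item["embedding"] for item in resp["data"]]
```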
Now, for anything you ask, embed that query in the same high-dimensional space. Find the cosine similarity of the query and all of those chunks of text in your library. Grab the most similar handful of chunks by this measure, and pass the query and the chunks to ChatGPT's API.
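In code, that retrieval step is just a dot product - again a sketch, with `top_chunks` and k=5 as my illustrative choices:

```python
import numpy as np
import openai  # pre-1.0 openai client; assumes openai.api_key is set

def top_chunks(query, chunks, chunk_embeddings, k=5):
    """Embed the query, then return the k most cosine-similar chunks."""
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=[query])
    q = np.array(resp["data"][0]["embedding"])
    M = np.array(chunk_embeddings)
    # cosine similarity of the query against every stored chunk
    sims = (M @ q) / (np.linalg.norm(M, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[-k:][::-1]]
```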
Why do this chunking/embedding? Well, you can't pass 100 pages directly to GPT, and even if you could, it would be expensive. So, your options: fine-tune an LLM to your use case ($$), or be clever about pre-processing what you send the AI! It's all just posterior-distribution shifts in the LLM.
Of course, we'll also tell the LLM we want truthful answers, to act like a PhD research assistant, and to say "I don't know" if it can't find the answer, and we'll set temperature=0 to keep GPT on just the facts. That's it! Now just spit the response back with js, and Queryable You!
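Putting it together, the final call looks something like this - reusing `top_chunks` from the sketch above, and with a system prompt that's my paraphrase of the recipe, not the exact wording I used:

```python
import openai  # pre-1.0 openai client; assumes openai.api_key is set

def ask(query, chunks, chunk_embeddings):
    """Send the query plus its most relevant chunks to ChatGPT."""
    context = "\n\n".join(top_chunks(query, chunks, chunk_embeddings))
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,  # keep GPT on just the facts
        messages=[
            {"role": "system", "content":
                "You are a truthful PhD research assistant. Answer using "
                "only the excerpts provided; if the answer isn't in them, "
                "say 'I don't know'."},
            {"role": "user",
             "content": f"Excerpts:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp["choices"][0]["message"]["content"]
```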
So what can't this do? Think about how it works. It's not going to find a specific number in a table, do calculations, understand any equations, and so on...though this is doable with a bit of work a la @elicitorg.
But absolutely it can answer your question even if you use words that never appeared in the papers you're searching (the embedding handles that!) and even if your question requires some inference on the text (GPT handles that!). You can see both in the examples above.
I should also note: almost by definition, this is trivial. I'm an economist, not a computer scientist. There is a whole subfield on information retrieval and natural language queries, with multi-billion dollar valuations for companies doing this.
That said, the fact that you can get quality this high with an evening of work and literally the 25 cents I spent setting this up? Insane. The future use case is what I said was coming in a thread in Nov.: custom queryable tutors for every course. Any qualitative course = can do it today!
(One more followup. I made a few mods to the code, all on Git. It now ingests any .tex, .txt, or .pdf files (the latter is a bit wonky) and is really good at knowing which paper to try to draw on. See below, for a folder w/ all of my papers. Again, first try: this isn't cherry-picked.)