Me: Our best fuzzy matching algorithms still leave a lot on the table. GPT: Hold my beer.

May 9, 2023 · 1:20 PM UTC

The conversation went like this: GPT: Maybe you should try a Levenshtein algorithm; here's some Python code that might be helpful. Me: Can't you just look at the names and do it? GPT: Ok fine, here you go.
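For the record, the baseline GPT first offered looks something like this: a minimal sketch of an edit-distance matcher, using only the standard library, with made-up example names.

```python
# Minimal edit-distance fuzzy matcher -- the kind of baseline GPT suggested
# before being asked to just "look at the names". Standard library only;
# the example names below are made up.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def closest_match(name: str, candidates: list[str]) -> str:
    """Return the candidate with the smallest edit distance to `name`."""
    return min(candidates, key=lambda c: levenshtein(name.lower(), c.lower()))

print(closest_match("Jon Smyth", ["John Smith", "Jane Smithers", "Joan Smit"]))
# -> "John Smith"
```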
Replying to @paulnovosad
How many tokens was the dataset? I worry about hitting limits.
I think the strategy would be to do it a few chunks at a time. TBH this is probably the ideal use case where you have small groups. With thousands of obs on one side of the merge I think it would have a harder time. Though I suppose you could also break that job into pieces...
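Roughly what the chunking idea looks like in code, as a sketch only: the first-letter blocking key and batch size here are illustrative assumptions, not something from the original thread.

```python
# Sketch of the chunking idea: pair each small group of left-side names with
# its plausible right-side candidates and build one prompt per group.
# Blocking by first letter is just an illustrative assumption.
from collections import defaultdict

def blocks(names):
    """Group names by a crude blocking key (here: lowercased first letter)."""
    groups = defaultdict(list)
    for n in names:
        groups[n[0].lower()].append(n)
    return groups

def build_prompts(left_names, right_names, batch_size=20):
    """Yield one matching prompt per small batch of left-side names."""
    right_blocks = blocks(right_names)
    for key, group in blocks(left_names).items():
        candidates = right_blocks.get(key, [])
        for i in range(0, len(group), batch_size):
            batch = group[i:i + batch_size]
            yield (
                "Match each name in list A to the most likely name in list B, "
                "or say 'no match'. Explain each choice briefly.\n"
                f"A: {batch}\nB: {candidates}"
            )
```

The blocking key is what keeps each candidate list small enough for GPT to eyeball.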
Replying to @paulnovosad
Do you have any idea how to scale this? I have 14 million names, and I know GPT does a good job on them if I break them into small chunks, but I'm not sure how best to feed this to GPT through the API without having a conversation first and without having to split it into chunks.
Work manually with small samples to refine the prompt until you really like the result. Then send your data to it in chunks using the API. If you have 14 million names, it might be too costly to be worthwhile — maybe wait 6 months for a version that'll run on your laptop.
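In code, that workflow might look something like the sketch below. It assumes the openai Python package's pre-1.0 ChatCompletion interface from around mid-2023; the model name, prompt wording, batch size, and placeholder data are assumptions rather than a tested recipe.

```python
# Sketch of the chunked API workflow: refine the prompt by hand first, then
# push the data through in small batches. Assumes the openai package's
# pre-1.0 ChatCompletion interface (circa mid-2023); model name, prompt
# wording, and batch size are placeholders.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

SYSTEM_PROMPT = (
    "You are matching noisy name spellings against a reference list. "
    "For each input name, return the best match or 'no match', "
    "with a one-line reason."
)

def chunks_of(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def match_chunk(names, reference, model="gpt-4"):
    """Send one small batch of names plus the reference list to the model."""
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Names: {list(names)}\nReference list: {reference}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

# Placeholder data standing in for the real merge inputs.
all_names = ["Jon Smyth", "Maria Gonsales", "R. Kumar"]
reference_list = ["John Smith", "Maria Gonzalez", "Rajesh Kumar"]

results = [match_chunk(batch, reference_list)
           for batch in chunks_of(all_names, size=25)]
```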
Replying to @paulnovosad
I had success experimenting with this, but I was trying to figure out how to deal with so many potential matches and how to add other variables to help the match. Did you do anything to nudge it that way?
I haven't looked into it much; I guess it would take some specific tuning for every version of the problem...
Replying to @paulnovosad
Is this the fancy ChatGPT with plugins that let you upload data, etc.? Or just copy-paste into the plain vanilla window? Seems similar to finding the closest match among strings, but with an explanation for each, which is nice for comments and replication. But you still need to check correctness ex post.