Me: Our best fuzzy matching algorithms still leave a lot on the table. GPT: Hold my beer.

May 9, 2023 · 1:20 PM UTC

The conversation went like this: GPT: Maybe you should try a Levenshtein algorithm; here's some Python code that might be helpful. Me: Can't you just look at the names and do it? GPT: Ok fine, here you go.
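For the record, the baseline GPT first offered looks something like this: a minimal sketch of an edit-distance matcher, using only the standard library, with made-up example names.

```python
# Minimal edit-distance fuzzy matcher -- the kind of baseline GPT suggested
# before being asked to just "look at the names". Standard library only;
# the example names below are made up.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def closest_match(name: str, candidates: list[str]) -> str:
    """Return the candidate with the smallest edit distance to `name`."""
    return min(candidates, key=lambda c: levenshtein(name.lower(), c.lower()))

print(closest_match("Jon Smyth", ["John Smith", "Jane Smithers", "Joan Smit"]))
# -> "John Smith"
```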
Replying to @paulnovosad
How many tokens was the dataset? I worry about hitting limits.
I think the strategy would be to do it a few chunks at a time. TBH this is probably the ideal use case where you have small groups. With thousands of obs on one side of the merge I think it would have a harder time. Though I suppose you could also break that job into pieces...
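Roughly what the chunking idea looks like in code, as a sketch only: the first-letter blocking key and batch size here are illustrative assumptions, not something from the original thread.

```python
# Sketch of the chunking idea: pair each small group of left-side names with
# its plausible right-side candidates and build one prompt per group.
# Blocking by first letter is just an illustrative assumption.
from collections import defaultdict

def blocks(names):
    """Group names by a crude blocking key (here: lowercased first letter)."""
    groups = defaultdict(list)
    for n in names:
        groups[n[0].lower()].append(n)
    return groups

def build_prompts(left_names, right_names, batch_size=20):
    """Yield one matching prompt per small batch of left-side names."""
    right_blocks = blocks(right_names)
    for key, group in blocks(left_names).items():
        candidates = right_blocks.get(key, [])
        for i in range(0, len(group), batch_size):
            batch = group[i:i + batch_size]
            yield (
                "Match each name in list A to the most likely name in list B, "
                "or say 'no match'. Explain each choice briefly.\n"
                f"A: {batch}\nB: {candidates}"
            )
```

The blocking key is what keeps each candidate list small enough for GPT to eyeball.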
Replying to @paulnovosad
Do you have any idea how to scale this? I have 14 million names, and I know GPT does a good job on them if I break them into small chunks, but I'm not sure how best to feed this to GPT through the API without having a conversation first and without having to split it into chunks.
Work manually with small samples to refine the prompt until you really like the result. Then send your data to it in chunks using the API. If you have 14 million names, it might be too costly to be worthwhile — maybe wait 6 months for a version that'll run on your laptop.
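In code, that workflow might look something like the sketch below. It assumes the openai Python package's pre-1.0 ChatCompletion interface from around mid-2023; the model name, prompt wording, batch size, and placeholder data are assumptions rather than a tested recipe.

```python
# Sketch of the chunked API workflow: refine the prompt by hand first, then
# push the data through in small batches. Assumes the openai package's
# pre-1.0 ChatCompletion interface (circa mid-2023); model name, prompt
# wording, and batch size are placeholders.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

SYSTEM_PROMPT = (
    "You are matching noisy name spellings against a reference list. "
    "For each input name, return the best match or 'no match', "
    "with a one-line reason."
)

def chunks_of(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def match_chunk(names, reference, model="gpt-4"):
    """Send one small batch of names plus the reference list to the model."""
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Names: {list(names)}\nReference list: {reference}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

# Placeholder data standing in for the real merge inputs.
all_names = ["Jon Smyth", "Maria Gonsales", "R. Kumar"]
reference_list = ["John Smith", "Maria Gonzalez", "Rajesh Kumar"]

results = [match_chunk(batch, reference_list)
           for batch in chunks_of(all_names, size=25)]
```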
Replying to @paulnovosad
I had success experimenting with this, but I was trying to figure out how to deal with so many potential matches and how to add other variables to help the match. Did you do anything to nudge it that way?
I haven't looked into it much; I guess it would take some specific tuning for every version of the problem...
Replying to @paulnovosad
Is this the fancy ChatGPT with plugins that let you upload data, etc.? Or just copy-paste into the plain vanilla window? Seems similar to finding the closest match among strings, but with an explanation for each, which is nice for comments and replication. But you still need to check correctness ex post.