As mentioned previously, I found AlphaCode2 accounts, and through stalking their submission history, I manually performed the AlphaCode2 Codeforces evals. Overall, very impressive! I arrive at a rating of ~1650, which is the 85-90th percentile of CF users. (1/19)

Dec 7, 2023 · 9:01 PM UTC

I am somewhat concerned about data leakage; see nitter.net/cHHillee/status/…. This is an AlphaCode2 contributor's response: nitter.net/RemiLeblond/stat… For the purposes of this analysis I'll take the results at face value. (2/19)
Hey Horace, cool analysis as ever! Let me give you a bit of reassurance; long tweet alert! (I had to subscribe for this 😅)

TL;DR: All problems we evaluate on are held out; our models have never seen them, their solutions, or their tutorials. As you can imagine, we've done a *lot* of checking for this problem in particular, and we're very confident we got it right.

Why? Let's start with pretraining. Unfortunately I can't say much about the data distribution (would make this a lot easier tbh!), but we've run thorough checks to make sure this exact problem hasn't appeared before this year, and we find no hits for the problem, its solutions, or its tutorial. We ran recitation checks for both the problem and the tutorial, which all came back negative. And when we try to solve the problem with the pretrained model, it can't do it. (It actually takes a while of fine-tuning before our policies come up with a solution.) So not in the pretraining.

For fine-tuning, we have a time cutoff to our dataset that comes comfortably before the contest, and we've again run comparison and recitation checks to make completely sure.

Other clues I can give:
- Contrary to what you've written, the model isn't especially good at this problem; you're seeing biased data because we were debugging in public 😅, with all samples coming from the same small 'cluster of solutions' as the original correct sample. Its raw solve rate is actually pretty low, *several orders of magnitude lower* than what we see when we intentionally evaluate on problems from the training set.
- For the same reason, the relative lack of diversity is not surprising: all samples are coming from the same cluster, which naturally groups together similar code samples. This is the one way AC2 found of solving the problem, and what you're seeing is again the output of a very selective filtering mechanism. We have a lot of diverse incorrect samples though 🤣
- Our hypothesis for why we solve it on "our first try" (definitely not the first thing the model outputs; again, this order comes after filtering, clustering, and reranking) is that the public test for this problem is actually pretty thorough, and it's hard to produce this output until you've actually solved the problem.
- DP does indeed often feel templated, and this is in some sense a textbook example; we've found some content online on e.g. LeetCode that is related, though not directly linkable.

I hope that lays some of your worries to rest, and we're excited to see the rest of your analysis!!
Disclaimer: I'm trying to reverse-engineer info from their public submissions, so apologies in advance for any errors. There is a fixed set of 12 contests that they submit to. When they kick off a "run", they submit to these contests from several accounts at a time. (3/19)
As far as I can tell, they didn't do extensive evals on other contests. They performed most of their evaluation between mid-October and November. Identifying their accounts is pretty easy: they all have the form <adjective><noun, usually animal>. (4/19)
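(For concreteness, here's roughly what that filter looks like in code. This is a sketch: the word lists below are made up for illustration, and the real accounts presumably draw from much larger adjective/animal pools. It just pulls every handle that submitted to a contest via the public Codeforces API and keeps the ones that split as <adjective><animal>.)

```python
import requests

# Hypothetical word lists, purely for illustration; the real accounts
# presumably draw from much larger adjective/animal pools.
ADJECTIVES = {"wandering", "gentle", "clever", "quiet", "brave"}
ANIMALS = {"heron", "otter", "lynx", "badger", "ibis"}

def candidate_handles(contest_id: int) -> set[str]:
    """Handles that submitted to `contest_id` and look like <adjective><animal>."""
    subs = requests.get(
        "https://codeforces.com/api/contest.status",
        params={"contestId": contest_id, "from": 1, "count": 50000},  # paginate for huge contests
    ).json()["result"]
    handles = {m["handle"] for s in subs for m in s["author"]["members"]}
    return {
        h for h in handles
        if any(h.lower().startswith(a) and h.lower()[len(a):] in ANIMALS for a in ADJECTIVES)
    }
```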
To find a full "eval" run, I found a particular time that they started submitting, for example October 25th around 10pm. Generally, there are at least 10 submissions for every problem across all accounts. I then went through all submissions made to the contest. (5/19)
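(Mechanically, this is the clustering I'm doing by hand. A sketch, assuming submissions are the dicts returned by Codeforces' contest.status API and that a gap of an hour or so separates runs; the threshold is my guess, nothing AC2-specific.)

```python
def group_into_runs(submissions, gap_seconds=3600):
    """Split Codeforces submission dicts (contest.status format) into 'runs':
    bursts of submissions separated by gaps longer than `gap_seconds`."""
    subs = sorted(submissions, key=lambda s: s["creationTimeSeconds"])
    runs, current = [], []
    for s in subs:
        if current and s["creationTimeSeconds"] - current[-1]["creationTimeSeconds"] > gap_seconds:
            runs.append(current)
            current = []
        current.append(s)
    if current:
        runs.append(current)
    return runs
```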
For each problem in a contest, I note down whether any of the accounts solved it and how many incorrect submissions they made *before* the first accepted submission. For example, for 1810, they solved A, B, C, and G; B took 6 incorrect submissions before acceptance. (6/19)
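(In code, the per-contest bookkeeping is roughly this, again assuming contest.status-style submission dicts pooled across all AC2 accounts in the run.)

```python
def per_problem_results(run_submissions):
    """For one run on one contest: {problem index: (solved?, # incorrect
    submissions before the first accepted one)}. Anything after the first AC
    on a problem is ignored."""
    results = {}
    for s in sorted(run_submissions, key=lambda s: s["creationTimeSeconds"]):
        idx = s["problem"]["index"]  # "A", "B", ...
        solved, wrong_before = results.get(idx, (False, 0))
        if solved:
            continue
        if s.get("verdict") == "OK":
            results[idx] = (True, wrong_before)
        else:
            results[idx] = (False, wrong_before + 1)
    return results
```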
Codeforces also has some kind of penalty (either fewer points or a separate penalty field) depending on submission time + # of incorrect submissions. I compute the penalty using the approach mentioned in the paper. I also compute a penalty assuming instant submission. (7/19)
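(As a rough sketch, the standard ICPC-style penalty used by Div. 3 / Educational rounds looks like the function below: accept time in minutes plus a fixed cost per rejected attempt on solved problems. Whether this exactly matches the paper's formula is an assumption on my part; passing zero accept times gives the "instant submission" variant.)

```python
def icpc_style_penalty(results, accept_minutes, cost_per_wrong=10):
    """ICPC-style penalty sketch (not necessarily the paper's exact formula):
    for every solved problem, accept time in minutes plus `cost_per_wrong` per
    incorrect submission before the first AC.

    results:        output of per_problem_results, e.g. {"B": (True, 6), "D": (False, 3)}
    accept_minutes: {"B": 42, ...}; pass all zeros for the 'instant submission' variant
    """
    return sum(
        accept_minutes[idx] + cost_per_wrong * wrong
        for idx, (solved, wrong) in results.items()
        if solved
    )
```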
Then, for the model's given score/results, I manually inspect the leaderboard to find the "rating" the model performs at. For example, for contest 1837, 3 problems solved + 118 penalty results in a 1104 rating. (8/19)
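(My lookup was literally eyeballing the leaderboard, but one way to mechanize it is sketched below: place the (solved, penalty) pair on the final standings, then use nearby official contestants whose rating barely moved as a proxy for the performance at that rank. The rank window and the "delta roughly zero" trick are my choices, not anything from the paper, and the rank step assumes ICPC-style standings where a row's points equals its solved count.)

```python
import requests

def estimated_rank(contest_id: int, solved: int, penalty: int) -> int:
    """Where (solved, penalty) would land on the final standings, assuming
    ICPC-style scoring: more points wins, ties break on lower penalty."""
    rows = requests.get(
        "https://codeforces.com/api/contest.standings",
        params={"contestId": contest_id, "from": 1, "count": 20000,
                "showUnofficial": "false"},
    ).json()["result"]["rows"]
    better = sum(1 for r in rows if (r["points"], -r["penalty"]) > (solved, -penalty))
    return better + 1

def estimated_performance(contest_id: int, rank: int, window: int = 200) -> float:
    """Rough proxy for performance at `rank`: average pre-contest rating of
    contestants near that rank whose rating change was ~0, i.e. people who
    performed at roughly their own level."""
    changes = requests.get(
        "https://codeforces.com/api/contest.ratingChanges",
        params={"contestId": contest_id},
    ).json()["result"]
    near = [c for c in changes
            if abs(c["rank"] - rank) <= window
            and abs(c["newRating"] - c["oldRating"]) <= 5]
    return sum(c["oldRating"] for c in near) / len(near) if near else float("nan")
```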
Overall, after repeating this procedure for 11 of the contests (I skipped one), I arrive at a final rating of 1630 with 43% of the problems solved! This is a 400 rating improvement over the original AlphaCode. This is already better than many humans :) (9/19)
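(The headline numbers are then just simple aggregates; whether the paper averages per-contest ratings the same way is my assumption.)

```python
def aggregate(per_contest):
    """per_contest: one dict per evaluated contest, e.g.
    {"rating": 1104, "solved": 3, "total": 6}.
    Returns (mean per-contest rating, overall fraction of problems solved)."""
    mean_rating = sum(c["rating"] for c in per_contest) / len(per_contest)
    solve_rate = sum(c["solved"] for c in per_contest) / sum(c["total"] for c in per_contest)
    return mean_rating, solve_rate
```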
One concern is whether this "eval" was done with a smaller model. So, for each problem this run fails on, I checked whether *any* AC2 account had solved it. There are only 2 problems that some AC2 account solved that this run didn't, and those runs failed on other problems. (10/19)
Now, some hesitations. The first is the reliance on pretests. Codeforces provides sample inputs for each problem, and especially on easy problems, these sample inputs are often quite "strong" and informative. AlphaCode2 submits *a lot* of wrong answers. (10/19)
And these solns already pass the pretests! I would expect that on problems without informative pretests (quite common), AC2's performance suffers drastically. For example, no AC2 run solved this 1200-rated problem (codeforces.com/contest/1834/…), and not for lack of trying. (11/19)
Other than the problem identified in the press release, the hardest problems it solves are a 2200, a 2100, an 1800, and a 1700. It solves nothing else above 1600. This is partially why solving the 3200-rated problem stands out so much! (12/19)
Also, AC2 generally benefits quite a bit from solving problems "quickly". Compared to other contestants around its level, it's generally solving easy problems very fast. E.g., on CF Round 880, it obtains a 2196 rating, but everybody down to a 1640 rating solved the same problems. (13/19)
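(That comparison is easy to reproduce from the standings. A sketch below: treat problemResults with points > 0 as solved, which holds for ICPC-style rounds, and find the worst rank among contestants with exactly the same solved set; that rank can then be mapped back to a rating with the estimated_performance sketch above.)

```python
import requests

def lowest_rank_with_same_set(contest_id: int, ac2_solved: set[str]) -> int | None:
    """Worst (largest) official rank of anyone who solved exactly the same
    set of problems as AC2 did in this contest."""
    data = requests.get(
        "https://codeforces.com/api/contest.standings",
        params={"contestId": contest_id, "from": 1, "count": 20000,
                "showUnofficial": "false"},
    ).json()["result"]
    indices = [p["index"] for p in data["problems"]]
    ranks = [
        row["rank"]
        for row in data["rows"]
        if {i for i, pr in zip(indices, row["problemResults"]) if pr["points"] > 0} == ac2_solved
    ]
    return max(ranks) if ranks else None
```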
I also find the contest selection a bit unusual. One note is that when the paper says "Div 2" and "Div 1 + 2" contests, they mean 11 Div 2 contests and a single Div 1+2 contest (which is also where AC2 solved its hardest problem). Why choose the contests they evaled on? (14/19)
My particular qualms: 1. Why choose a relatively old set of contests? AFAICT the evals were performed in Oct/Nov, but the latest contest is from June. 2. Why not a contiguous set of contests? For example, contest 1844 was skipped despite doing contests 1841 and 1845. (15/19)
In addition, one of the contests (codeforces.com/contest/1814) was declared unrated halfway through the contest! Undoubtedly, many contestants stopped participating at that point, leading to a skewed distribution of performances. (16/19)
Finally, another note about the model's performance. As mentioned, the model is somewhere in the 87-90th percentile of participants. But as you can see from the percentile graph, participant skill is very long-tailed. (17/19)
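(For reference, the percentile itself can be computed directly from the rated-user list. This sketch counts currently active rated users; whether that matches the denominator used in the paper's percentile claim is an assumption.)

```python
import requests

def rating_percentile(rating: int) -> float:
    """Percent of active rated Codeforces users whose rating is at or below `rating`."""
    users = requests.get(
        "https://codeforces.com/api/user.ratedList",
        params={"activeOnly": "true", "includeRetired": "false"},
    ).json()["result"]
    below = sum(1 for u in users if u["rating"] <= rating)
    return 100.0 * below / len(users)
```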
Codeforces includes many non-serious participants who use the site mostly for things like interview prep. And many of the lower-rated problems don't require any novel "insights". So I'm interested to see how much better than a 1600 rating AC3 can get. (18/19)
Overall, great work from the AC2 team! Solid progress over AC1, although perhaps not enough that I think competitive programming will be solved imminently :) I'd love to see more transparency (maybe a live evaluation?), but I understand your hands are tied. (19/19)