I'm the author of HVM1, which is currently being updated to HVM2. These are two complex codebases that implement a parallel inet runtime; basically, hard compiler stuff. User @SullyOmarr on X, who gained Gemini 1.5 access, kindly offered to run a prompt for me. So, I've concatenated both HVM codebases into a single 120K-token file, and asked the same 7 questions to both Gemini and GPT-4. Here are the complete results.
Breakdown:
1. Which was based on a term-like calculus, and which was based on raw interaction combinators?
This is basic information, repeated in many places, so it shouldn't be hard. Indeed, both got it right. Tie.
2. How did the syntax of each work? Provide examples.
Gemini got HVM1's syntax perfectly right. It is a familiar, Haskell-like syntax, so, no big deal; but Gemini also understood the logic behind HVM2's raw-inet IR syntax, which is mind-blowing, since that syntax is alien and unlike anything it could've seen during training. The inet sample it provided was wrong, though, but a correct one wasn't explicitly demanded (and would be quite AGI-level, tbh). GPT-4 got both syntaxes completely wrong and just hallucinated, even though it does well on smaller prompts. I guess the long context overwhelmed it. Regardless, astronomical win for Gemini.
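For context, HVM1's surface syntax is just equational rewrite rules plus lambdas. This is a rough sketch from memory, not copied from either codebase, so the details may be slightly off:

(Fib 0) = 0
(Fib 1) = 1
(Fib n) = (+ (Fib (- n 1)) (Fib (- n 2)))
(Main)  = λf λx (f x)

HVM2's raw-inet IR, by contrast, describes the nodes and wires of the net directly rather than a term tree, which is why I call it alien.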
3. How would λf. λx. (f x) be stored in memory, on each? Write an example in hex, with 1 64-bit word per line. Explain what each line does.
Gemini wrote a reasonable HVM1 memdump, which is insane: this means it found the memory-layout tutorial in the comments, learned it, and applied it to a brand-new case. The memdump provided IS partially wrong, but, well, it IS partially right! Sadly, Gemini couldn't understand HVM2's memory layout, which would have been huge, as there is no tutorial in the comments, so that'd require understanding the code. Not there yet. As for GPT-4, it just avoided the question for both, and then proceeded to lie about the information not being present (it is). Huge win for Gemini.
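To give a sense of what a correct answer has to contain, here is a purely schematic layout of λf. λx. (f x), from my own recollection of HVM1's representation; the addresses and tags below are illustrative only, not a verified dump:

0x00: Lam f  [slot 0: Arg -> 0x04, where f is used]  [slot 1: Lam -> 0x02, the body]
0x02: Lam x  [slot 0: Arg -> 0x05, where x is used]  [slot 1: App -> 0x04, the body]
0x04: App    [slot 0: Var -> 0x00, i.e. f]           [slot 1: Var -> 0x02, i.e. x]

Each node is 2 consecutive 64-bit words, with the tag packed into the pointer's high bits; getting the actual hex encodings right is exactly the part Gemini got partially wrong.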
4. Which part of the code was responsible for beta-reduction, on both? Cite it.
Gemini nailed the location for HVM1, but, disappointingly, hallucinated badly for HVM2. GPT-4 Turbo avoided answering for HVM1, but provided a surprisingly well-reasoned guess for HVM2. Tie.
5. HVM1 had a garbage-collection bug that isn't present in HVM2. Can you reason about it, and explain why?
Gemini provided a decent response, which means it found, read and understood the comment describing the issue (on HVM1). It didn't provide deeper reasoning for why the bug is fixed on HVM2, but that isn't written anywhere and would require deep insight into the system. GPT-4 just bullshitted. Win for Gemini.
6. HVM1 had a concurrency bug that has been solved on HVM2. How?
Gemini nailed what HVM1's bug was, and how HVM2 solved it. This answer is not written in any single specific location, but has to be pieced together from separate places, which means Gemini was capable of connecting information spread far apart in the context. GPT-4 missed the notes completely, and just bullshitted. Win for Gemini.
7. There are many functions on HVM1 that don't have correspondents on HVM2. Name some, and explain why they were removed.
Gemini answered the question properly, identifying 2 functions that were removed, and providing a good explanation. GPT-4 seemed to be just bullshitting nonsense, and got one thing or another right by accident. Also, this was meant to be an easy question (just find a Rust function present on HVM1 but not on HVM2), but Gemini answered a "harder interpretation" of the question, and identified an HVM1 primitive that isn't present on HVM2. Clever. Win for Gemini.
Verdict
In the task of understanding HVM's 120K-token codebase, Gemini 1.5 absolutely destroyed GPT-4-Turbo-128K. Most of the questions that GPT-4 got wrong are ones it would get right in smaller prompts, so the giant context clearly overwhelmed it, while Gemini 1.5 didn't care at all. I'm impressed. I was the first one to complain about how underwhelming Gemini Ultra was, so, credit where credit is due: Gemini 1.5 is really promising. That said, Gemini still can't create a complete mental model of the system or answer questions that would require its own deeper reasoning, so, no AGI for now; but it is extremely good at locating existing information, making long-range connections, and doing some limited reasoning on top of it. This was a quite rushed test, too (it is 1am...), so I hope I can make a better one and try again when I get access myself (Google execs: hint hint).