I'm the author of HVM1, which is currently being updated to HVM2. These are two complex codebases that implement a parallel inet runtime; basically, hard compiler stuff. User @SullyOmarr on X, who gained Gemini 1.5 access, kindly offered to run a prompt for me. So, I've concatenated both HVM codebases into a single 120K-token file, and asked the same 7 questions to both Gemini and GPT-4. Here are the complete results.
Breakdown:
1. Which was based on a term-like calculus, and which was based on raw interaction combinators?
This is basic information, repeated in many places, so it shouldn't be hard. Indeed, both got it right. Tie.
2. How did the syntax of each work? Provide examples.
Gemini got HVM1's syntax perfectly right. It is a familiar, Haskell-like syntax, so, no big deal; but Gemini also understood the logic behind HVM2's raw-inet IR syntax, which is mind-blowing, since that syntax is alien and unlike anything it could've seen during training. The inet sample it provided was wrong, though, but a correct one wasn't explicitly demanded (and would be quite AGI-level, tbh). GPT-4 got both syntaxes completely wrong and just hallucinated, even though it does well on smaller prompts. I guess the long context overwhelmed it. Regardless, astronomical win for Gemini.
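For context, HVM1's surface syntax is just equational rewrite rules plus lambdas. This is a rough sketch from memory, not copied from either codebase, so the details may be slightly off:

(Fib 0) = 0
(Fib 1) = 1
(Fib n) = (+ (Fib (- n 1)) (Fib (- n 2)))
(Main)  = λf λx (f x)

HVM2's raw-inet IR, by contrast, describes the nodes and wires of the net directly rather than a term tree, which is why I call it alien.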
3. How would λf. λx. (f x) be stored in memory, on each? Write an example in hex, with 1 64-bit word per line. Explain what each line does.
Gemini wrote a reasonable HVM1 memdump, which is insane: this means it found the memory-layout tutorial in the comments, learned it, and applied it to a brand-new case. The memdump provided IS partially wrong, but, well, it IS partially right! Sadly, Gemini couldn't understand HVM2's memory layout, which would have been huge, as there is no tutorial in the comments, so that'd require understanding the code. Not there yet. As for GPT-4, it just avoided the question for both, and then proceeded to lie about the information not being present (it is). Huge win for Gemini.
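To give a sense of what a correct answer has to contain, here is a purely schematic layout of λf. λx. (f x), from my own recollection of HVM1's representation; the addresses and tags below are illustrative only, not a verified dump:

0x00: Lam f  [slot 0: Arg -> 0x04, where f is used]  [slot 1: Lam -> 0x02, the body]
0x02: Lam x  [slot 0: Arg -> 0x05, where x is used]  [slot 1: App -> 0x04, the body]
0x04: App    [slot 0: Var -> 0x00, i.e. f]           [slot 1: Var -> 0x02, i.e. x]

Each node is 2 consecutive 64-bit words, with the tag packed into the pointer's high bits; getting the actual hex encodings right is exactly the part Gemini got partially wrong.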
4. Which part of the code was responsible for beta-reduction, on both? Cite it.
Gemini nailed the location for HVM1, but, disappointingly, hallucinated badly for HVM2. GPT-4 Turbo avoided answering for HVM1, but provided a surprisingly well-reasoned guess for HVM2. Tie.
5. HVM1 had a garbage-collection bug that isn't present in HVM2. Can you reason about it, and explain why?
Gemini provided a decent response, which means it found, read and understood the comment describing the issue (on HVM1). It didn't provide deeper reasoning for why the bug is fixed on HVM2, but that isn't written anywhere and would require deep insight into the system. GPT-4 just bullshitted. Win for Gemini.
6. HVM1 had a concurrency bug that has been solved on HVM2. How?
Gemini nailed what HVM1's bug was, and how HVM2 solved it. This answer is not written in any single specific location, but has to be pieced together from separate places, which means Gemini was capable of connecting information spread far apart in the context. GPT-4 missed the notes completely, and just bullshitted. Win for Gemini.
7. There are many functions on HVM1 that don't have correspondents on HVM2. Name some, and explain why they were removed.
Gemini answered the question properly, identifying 2 functions that were removed, and providing a good explanation. GPT-4 seemed to be just bullshitting nonsense, and got one thing or another right by accident. Also, this was meant to be an easy question (just find a Rust function present on HVM1 but not on HVM2), but Gemini answered a "harder interpretation" of the question, and identified an HVM1 primitive that isn't present on HVM2. Clever. Win for Gemini.
Verdict
In the task of understanding HVM's 120K-token codebase, Gemini 1.5 absolutely destroyed GPT-4-Turbo-128K. Most of the questions that GPT-4 got wrong are ones it would get right in smaller prompts, so the giant context clearly overwhelmed it, while Gemini 1.5 didn't care at all. I'm impressed. I was the first one to complain about how underwhelming Gemini Ultra was, so, credit where credit is due: Gemini 1.5 is really promising. That said, Gemini still can't create a complete mental model of the system or answer questions that would require its own deeper reasoning, so, no AGI for now; but it is extremely good at locating existing information, making long-range connections, and doing some limited reasoning on top of it. This was a quite rushed test, too (it is 1am...), so I hope I can make a better one and try again when I get access myself (Google execs: hint hint).