All LLMs are converging towards the same point

ArsNeph · 2024-10-01T16:05:09+00:00

They're converging, simply in the wrong direction. After GPT 3.5 and GPT4 came out, the vast majority of Open source fine tunes were simply trained on synthetic data from the GPT family. Llama 2 is clearly trained on that same synthetic data, leading to the same dry manner of speech and GPT isms. This was not a problem at the time, as open source was way, way behind OpenAI at that point. However, this heavy reliance on synthetic data led to most models effectively being a distilled version of the GPTs. Newer models like llama 3 and Gemma 2 seem to fix the problem of the dry manner of speech of GPT-4, but in reality, they were simply made more likable through DPO and other methods, while still heavily relying on synthetic data. Models that compete with GPT4o came out, Claude and Mistral Large, but Mistral still seems to keep the trend of synthetic data, and Claude has its own problems with Claudisms. Hence what we really have in the open source space is a couple large models, and a bunch of pseudo distilled models. This results in the plague known as GPT slop being incredibly widespread, and a distinct lack of originality between models.

Feztopia · 2024-10-01T16:01:05+00:00

Someone at openai said this. That if you train enough they all represent the training data no matter what the architecture is. This is why I have high hopes for stuff like mamba or rwkv (they should also converge to the same model while being more efficient to run).

AnomalyNexus · 2024-10-01T17:28:27+00:00

Outside of deepseek

Interesting...was doing some janky testing on my own testing framework yesterday night and DS ended up being outlier too, so interesting to see another mention of it <24 hrs later

I wonder if its because the train with chinese content too and are benefitting from the diversity that injects

FishermanEuphoric687 · 2024-10-01T14:09:35+00:00

One thing I notice is that they all associated themselves with ‘Echo’ nickname at a point.

ortegaalfredo · 2024-10-01T19:29:30+00:00

They are converging because there is only one internet to train them.

DisasterNarrow4949 · 2024-10-01T14:09:51+00:00

I swear if I’ll be trying to generate fantasy content and any LLM got me a response about “Whispering” crap again, I quit.

On-The-Red-Team · 2024-10-01T14:23:40+00:00

Yes. This contextually is also why tuners and those whom have run thousands of different LLMs can see through SLOP patterns.

Everlier · 2024-10-01T16:28:26+00:00

You might find klmbr interesting

carnyzzle · 2024-10-01T19:59:29+00:00

and for a minute I thought it was just me noticing that LLMS seem to output things nearly the same even if they're from entirely different organizations

Sabin_Stargem · 2024-10-01T20:47:06+00:00

I would speculate that all base models are fed their fundamental datasets in a certain order during their training. Kinda like as if you gave a classroom of children a specific set of books for kindergarten, with no change in order of reading. Since AI can't bring their 'outside' experiences to change their perception of this education, it becomes rote.

Maybe something like the DRuGs sampler, but for training, could help? It injects noise into layers while generating output - the AI can overcome the noise, but deviates the response in a small but random way. Doing that during training might allow for simulating 'outside' randomness. The 'missed' entries of data that were suppressed by input noise can be injected again later into the model with another noise session, allowing the model to gradually receive all aspects of the education, but viewing the new information with a different lense of preceding information.

Kinda like shuffling a deck of cards, having a person pull out a bunch, shuffle again, and keep doing that until you run out of cards. The initial deck of cards might be the same, but the order of the card pile will be quite different.

Flaky_Solution_8272 · 2024-10-02T12:57:47+00:00

The Platonic Representation Hypothesis

Quote: “We argue that representations in AI models, particularly deep networks, are converging.”

M34L · 2024-10-01T13:30:45+00:00

You'd need to be more specific for anyone to be able to tell if you're having a point or just finding a relatively niche thing printed hard through the datasets.

If it's supposed to be 100 "random" addresses or ZIPs or phone number then yeah no shit, the dataset bias is gonna be massive.

chengzi9 · 2024-10-01T17:21:29+00:00

Deepseek is good indeed

qrios · 2024-10-01T18:57:35+00:00

Yeah I've noticed that if you ask the models for a dataset of the numbers from 1-100 in ascending order, they all seem to give very similar (often identical) answers. Very spooky.

hold_my_fish · 2024-10-01T22:30:46+00:00

You didn't mention Claude 3.5 Sonnet. It has an edge on everything else out there, in my experience. There's a lot I don't like about Anthropic's company culture, but I have to respect their model creation ability.

StuccoGecko · 2024-10-02T01:18:03+00:00

do you think that companies desire to differentiate their LLMs will lead to the sponsorship of MORE human made content? to then train the LLMs on...otherwise i don't know how you avoid all the LLMs indeed becoming the same.

Dudensen · 2024-10-02T01:49:20+00:00

Different LLMs solve some tasks I give them in the same wrong way so that's probably at least somewhat correct.

FitPop9249 · 2024-10-02T04:05:41+00:00

I notice similar patterns all the time now in articles online

silenceimpaired · 2024-10-01T12:52:42+00:00

I think your statement is too vague and if you’re asking for information the only thing that may be able to vary is the yapping and formatting.

I’m curious if you tried qwen. I’m loving it.

moarmagic · 2024-10-01T16:38:03+00:00

At the end of the day, isn't this sorta inevitable? Like in the sense that an LLM autocomplete on steriods, trained to generate the statistically probable next token, at some point if you leverage enough text in a given language they will all find the same statistically probable next token, because the average of all English text would be roughly the same.

I know you can adjust temperature and try to compensate. And there probably would be value their in trying to remove the lowest useful data sets that might increase the average 'correct' answers to some types of questions, information to others. But like, the statistical average name is probably going to always be john, james, will or something similar right?

unlikely_ending · 2024-10-01T23:07:39+00:00

They're all almost identical GPT architectures being trained on almost identical datasets

So yes, kind of inevitable

Glittering_Voice3143 · 2024-10-02T02:40:37+00:00

Intelligence is relatively flat, features are soaring.. cot is a feature that makes models better. Simply put the foundation models are doing at a large scale what the little guys have been doing to maximize llms for the last year. IMHO.

2024-10-02T03:27:56+00:00

Although this is a shot in the dark, but it could be randomness injected into these models is not real randomness but algorithmic randomness, and that algorithmic randomness has this kind of hidden property that is making these models converge. But I could be completely wrong, but I would have liked to experiment by tinkering with source of randomness when training these models.

Eheheh12 · 2024-10-02T05:55:51+00:00

I had once the same problem in coding; I think I had a bug but when I asked ChatGPT and Claude I got the same fix which wasn't correct. Interestingly, when I googled it, there was one stackoverflow solution that's the same as the ones that the LLMs gave me. It was a clearly different problem, but that's the closest one (in term of wording) in the internet.

LLMs are really good in data compression. I feel if there is one single solution out there in the internet, LLMs will be able to bring it as a solution.

TheManicProgrammer · 2024-10-02T08:34:54+00:00

As is with evolution that everything becomes a crab, so too do llms in that they become GPTs

cosmicr · 2024-10-02T10:35:54+00:00

I use mistral, phi and llama for image prompts and they are very different from each other. (mistral and llama 3.2 are closer but phi is majorly different.)

Mysterious-Rent7233 · 2024-10-02T13:50:35+00:00

https://arxiv.org/abs/2405.07987

We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways by which different neural networks represent data are becoming more aligned. Next, we demonstrate convergence across data modalities: as vision models and language models get larger, they measure distance between datapoints in a more and more alike way. We hypothesize that this convergence is driving toward a shared statistical model of reality, akin to Plato's concept of an ideal reality. We term such a representation the platonic representation and discuss several possible selective pressures toward it. Finally, we discuss the implications of these trends, their limitations, and counterexamples to our analysis.

BiteFancy9628 · 2024-10-03T03:12:58+00:00

They’re not only trained on the same data, they’re trained on each other. They all use the best at the moment to validate their quality, generate synthetic training data, and otherwise basically reverse engineer whatever the top model is doing.

FeltSteam · 2024-10-03T05:06:34+00:00

https://arxiv.org/abs/2405.07987

ResidentPositive4122 · 2024-10-01T14:18:59+00:00

WIthout knowing how you prompted them, this is of little value tbh. These models have been tuned to 0shot simple, stupid, "aligned" prompts. Of course they will all be equally bad, after you've seen it day in and day out.

That's why the 1m personas dataset is interesting. Use some of that dataset to prompt them, and see where that leads. Then use that same dataset for synthetic generation, fine-tune your own models (loras are cheap af now) and see where that leads.

Open-source and open-weights is more than download -> prompt for 0shot or count_the_rs -> whine on reddit.

2024-10-01T17:23:59+00:00

Makes sense when under the hood they all follow the same architecture and as the model becomes more mature they all get trained on similar datasets

Fresh-Tutor-6982 · 2024-10-01T20:02:41+00:00

of course they won't point to ASI. You are using 2023 paradigm, tech and scaling, which is basically GPT-4 paradigm.

Defiant-Mood6717 · 2024-10-02T17:37:13+00:00

This means absolutely nothing regarding ASI. You are testing domain knowledge, you think that's related to reasoning, planning, adaptation?

All this means is that in the world, those 6 items are the most likely to appear as possibilities. LLMs are exposed to data on the internet. It has nothing to do with them using the same data, it simply has to do that the data describes the same world.

NikoKun · 2024-10-02T21:21:34+00:00

I've seen this topic brought up before. They're simply converging on the same model of reality and reasoning, because their data comes from our reality. I don't think their convergence says anything about whether they're heading for ASI or not. I think that trend will simply be the result of training on synthetic data created through AI Agents using reasoning to build more high quality data as they improve.

mindplaydk · 2024-10-03T05:50:59+00:00

May I ask, when you say it "generated a data set", does this mean you were prompting it to extract some sort of world knowledge?

You weren't feeding it your own data and asking it to extract something from that?

If you're extracting world knowledge, every base model was trained on largely the same data foundation, so they're going to have largely the same world knowledge. (with minor variations due to model size and other constraints.)

World knowledge is not where I see noteworthy differences between models - the interesting differences are in their behaviour, which depends on their fine tuning, which tends to be proprietary.

LocalLLaMA

MODERATORS

LocalLLaMA

MODERATORS

Welcome to Reddit,

Want to add to the discussion?