all 131 comments

[–]ArsNeph 488 points489 points  (68 children)

They're converging, just in the wrong direction. After GPT-3.5 and GPT-4 came out, the vast majority of open-source fine-tunes were trained on synthetic data from the GPT family. Llama 2 is clearly trained on that same synthetic data, leading to the same dry manner of speech and GPTisms. This wasn't a problem at the time, since open source was way, way behind OpenAI, but the heavy reliance on synthetic data left most models as effectively distilled versions of the GPTs. Newer models like Llama 3 and Gemma 2 seem to fix GPT-4's dry manner of speech, but in reality they were simply made more likable through DPO and other methods while still relying heavily on synthetic data. Models that compete with GPT-4o came out, Claude and Mistral Large, but Mistral still seems to keep the trend of synthetic data, and Claude has its own problems with Claudisms. Hence what we really have in the open-source space is a couple of large models and a bunch of pseudo-distilled models. The result is that the plague known as GPT slop is incredibly widespread, and there's a distinct lack of originality between models.

[–]FallenJkiller 65 points66 points  (2 children)

this is the answer

[–]az226 27 points28 points  (1 child)

But it also misses an even stronger underlying factor.

These models are so large and so deeply trained that they represent the training data distribution very well.

So all of them do this. And a lot of the training tokens are shared/the same.

[–]gwern 12 points13 points  (0 children)

It's both of them.

The 'Platonic Representation' paper linked elsewhere shows convergence with scale for lots of stuff like image embedding, where synthetic data is not an issue, and so we do expect LLMs to broadly start to converge on the 'same' answers (because the right answers; truth is one, error many), at least where that makes sense. And that's good.

However, it's also true that LLMs especially are converging due to training on tuned data, and training on outputs from tuned models. This is both good & bad. (But mostly bad if you are trying to use LLMs for anything 'creative' IMO.) When you look at examples of mode collapse like "flip a coin; heads or tails?", and the final models (but not the base models) wind up being highly biased and unable to generate plausible sequences, obviously that cannot be for any good reason - there is no 'platonic representation' or 'objectively correct' prediction of a fair coin where you always predict 'heads'! That's just bad output from the LLM due to perverse side-effects of RLHF or instruction-tuning etc. And it's often subtle, so you just get this creeping gray sameness everywhere, as the spark of life in the base models gets snuffed out by bureaucracy.

(And I suspect, even without any specifics, that OP is an example of the latter, not the former. It's hard to think of any nontrivial list of 100 items someone might ask a LLM about where the categories, order of categories, order within categories, and all items are near-identical across almost all LLMs, which are still often rather stupid, for good 'platonic representation' reasons...)
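
If you want to see the coin-flip collapse for yourself, here's a minimal sketch with the transformers library, assuming you have a base/instruct pair downloaded; the Llama 2 names below are just placeholders for whatever pair you actually use:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model names -- substitute any base/instruct pair you have locally.
BASE, TUNED = "meta-llama/Llama-2-7b-hf", "meta-llama/Llama-2-7b-chat-hf"

prompt = "Flip a coin: heads or tails? Answer with one word.\nAnswer:"

def next_token_probs(model_name, candidates=(" heads", " tails")):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]          # scores for the very next token
    probs = torch.softmax(logits, dim=-1)
    # Compare probability mass on the first sub-token of each candidate word.
    return {c.strip(): probs[tok.encode(c, add_special_tokens=False)[0]].item()
            for c in candidates}

for name in (BASE, TUNED):
    print(name, next_token_probs(name))  # a mode-collapsed model puts ~all mass on one side
```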

[–]gus_the_polar_bear 59 points60 points  (23 children)

Claudisms are an interesting phenomenon… it seems Claude has its own parallel but different versions of all the classic GPTisms, to such an extent that it seems intentional. But I imagine it adds some “flavour” to synthetic data

[–]Dead_Internet_Theory 40 points41 points  (21 children)

It makes me wonder. What if GPTisms and Claudisms are some artifact of fingerprinting? Like some tokens happen to be more plausible in a way that can later identify the text as GPT or Claude, even if another model trains on it.

[–]ArsNeph 16 points17 points  (19 children)

As far as unintentional fingerprinting goes, there are a few words that are used with a frequency almost unique to LLMs, such as "delve", which can be used to identify AI-generated text. As for intentional fingerprinting, I don't think OpenAI would do that, since they have shown they don't really care much about people training on their outputs, at least not yet, but the people at Anthropic might be neurotic enough to try it as some "safety" measure.

Edit: Looks like I was unclear, I meant to say that the frequency at which "delve" is used is almost unique to LLMs
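
If you want to play with that idea, a crude marker-word counter is enough to see the frequency gap. The word list and the per-1,000-words rate below are illustrative guesses, not a validated detector:

```python
import re
from collections import Counter

# Illustrative "slop" markers -- not an exhaustive or validated list.
MARKERS = {"delve", "tapestry", "multifaceted", "testament", "showcasing"}

def marker_rate(text: str) -> float:
    """Occurrences of marker words per 1,000 words of text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    hits = sum(counts[w] for w in MARKERS)
    return 1000 * hits / max(len(words), 1)

sample = "Let's delve into the rich tapestry of multifaceted ideas."
print(f"{marker_rate(sample):.1f} marker words per 1,000 words")
```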

[–]CulturedNiichan 25 points26 points  (12 children)

I was unironically using words like "delve" in essays when I was in high school, like 20 years ago lol. Somehow, I wonder if people will refrain in the future from using some expressions only to avoid sounding too GPT-like.

[–]andthenthereweretwo 29 points30 points  (2 children)

"Delve" has been a common word in my vocabulary since long before 2020. I got accused of being ChatGPT because I used "deluge". American education is in the fucking gutter.

[–]TheOneNeartheTop 7 points8 points  (0 children)

What if every word in the English language is a tree and humans naturally refrain from using the ones that start with del, forming a language dell in the wordscape.

Maybe you’re AI…maybe I’m AI?

[–]joquarky 1 point2 points  (0 children)

In the film Idiocracy, the main character was denigrated by most of the population for using anything more sophisticated than slang.

[–]MrWeirdoFace 9 points10 points  (0 children)

The dwarves delved too greedily and too deep.

[–]ServeAlone7622 20 points21 points  (2 children)

Quite the opposite. I've made it a point to sound more GPT like in my deep dives when I delve into the rich tapestry of knowledge contained within my own biological neural network.

As a result I've received a lot of feedback stating that my musings sound less mechanical, less robotic or forced and are now more concise and engaging.

I'm neurodivergent (autism FTW), these tools help me to communicate better and half the time I'm not even using it, just sounding like it.

What do you think? Should we all strive to sound more like LLMs in order to achieve higher levels of engagement with normies?

[–]gabbalis 15 points16 points  (1 child)

XD

yes.
I love this.
What a finely tuned meatbot you are.

[–]ash1794 0 points1 point  (0 children)

Can we benchmark this meatbot?

[–]ArsNeph 7 points8 points  (4 children)

That's rare. Though there are legitimate ways to use the word delve, usually delving into the depths of the ocean or the like, LLMs tend to use it in a slightly strange way that tends to be a giveaway; the overwhelming majority of the English-speaking population doesn't use it as frequently, or in the contexts that LLMs do. Honestly, the commenters who say that LLMs are fine-tuning us might not be that far off; some people are starting to take after ChatGPT in the way they phrase things. It would be pretty funny if people stopped saying "shivers down your spine" or something.

[–]Low_Poetry5287 8 points9 points  (2 children)

I've actually been hearing people use "delve" more, like people who don't use AI at all. I think just because it's been appearing in more internet articles and bot posts it's actually catching on among humans. Ironically, people who use AI probably are more aware of the discrepancy and more self-conscious about being manipulated by AI while most people just don't notice it's changing how we talk, because it would be the same as any words trending among humans. AI has certainly reached the scale of affecting human culture en masse, but to be honest we already got onto that dark timeline before LLMs existed just from how AI manipulates us through social media. It's already hard to tell who is more in control, humans, or AI? I would argue the profit motive has already taken control and AI and humans alike are both beholden to that system, with all talk of who is controlling society being somewhat laughable since money has already usurped human autonomy.

[–]ArsNeph 9 points10 points  (1 child)

That's quite interesting. I remember reading something about how the environment shapes human language, and essentially which words people use most frequently has changed to correspond to how easy they are to type on the keyboard. This means that theoretically, if we were to switch to a different keyboard layout like Dvorak, it's possible that over time, the words we use most would change to match that layout.

It's very clear that machine learning algorithms used to maximize an application's use time are already destroying the fabric of our society. Applications are incentivized to be as addictive as possible, filled with clickbait and shocking news to elicit emotional responses, and to utilize dark patterns that make it difficult to stop. People get used to constant instant dopamine hits, their attention spans shorten, and they keep wasting endless amounts of their most precious resource: time. The most dangerous thing, however, is that social media rewards epistemological echo chambers, where no one is exposed to ideologies other than their own, so people become less and less accepting of other ideologies, more insular, more tribalist, which leads to less community and more disdain for others. Multiple people can be on the internet yet live completely different realities: a 50-year-old history buff may never hear the word skibidi online, and a 15-year-old high schooler may not know who Mao Zedong is. Then we wonder why everyone is fighting all the time.

[–]Low_Poetry5287 1 point2 points  (0 children)

I wonder where all this AI tech will take us. I hope it will free us from work and all the other stuff people hope for, but if we all keep using it to maximize profits, we'll be plunging each other into the dark ages at an even faster rate than social media already has... I'm always curious to hear how people use their AI, curious to see if it's helping or hurting humanity... It's hard to see how a technology that is getting so good at persuasion, and so good at just telling us what we want to hear, won't be catastrophic if everyone approaches it as simply a tool to maximize self-interest, as in just to "make more money"?

It's also interesting to see tech-optimism at its peak. With AI technology having already damaged the fabric of society, AI chatbots now threatening to replace all human contact, and AGI on the horizon posing a potentially existential threat, I think it should be a front-and-center part of the conversation WHAT we are actually going to use AI for, and how it is NOT going to continue destroying the fabric of society.

To take a step back, technologists, scientists, and intellectuals in general often make the mistake of thinking other people think like them. For instance, in the 40s/50s when scientists were first using LSD, it helped them solve problems, it helped them have breakthroughs in projects they had been stuck on for months. It helped them THINK. But they thought that LSD would always be a tool for scientists, used for science, to further research and have scientific breakthroughs faster. Because they assumed it just makes everyone THINK more, because they are already thinkers, and they assumed other people wouldn't be interested in thinking (because let's face it, most people aren't). But they had no idea it could ever be used as a recreational drug, and that people could take way too much at a festival and traumatize themselves, or have a great time tripping now and then for years but somehow manage to never do any self-examination because they're always at a music festival. It never even occurred to them that it could be used without revelations and breakthroughs of the mind. I feel like that's what's happening with AI. Yeah, AI is awesome, it can literally leverage the power of our minds, but what will most people really do with it? The other day I asked an AI for some advice, and it gave me some nice empty words, including "indulge in a guilty pleasure". The overall answer seemed good and I almost went to hit the "like" button, but I realized that if this sort of feedback is being used to train models, isn't it going to go the same way as social media? If every time AI tells us some hard truth we don't want to hear, we downvote, and every time we hear some fluff that doesn't challenge us, we upvote, then even the most complex and intelligent AI could tend toward telling us what we want to hear instead of what we need to hear. It could use all that intelligence to delicately skirt around certain subjects just like humans do. And then how is it supposedly going to solve problems like climate change?

I chuckle with dark humor at the idea that people think AI will solve climate change when it takes so much energy, and even today you can just ask AI how to solve climate change and it will tell you. Stop driving. Stop buying stuff. Shop local. Eat less meat. All stuff we know already but don't do, or are having trouble doing. And if we keep training AI with user feedback, then if it tells us to stop driving, or that we have to put in a bunch of political effort, we can just downvote it so it doesn't make it into the next model. Eventually we'll have the most intelligent creature in the world, but what it's trained to do is tell us what we want to hear instead of what we need to hear, so it'll be like talking to a scientist at a cocktail party. They know all sorts of good science, but they don't want to ruin the mood, so they avoid all the same subjects we would. 🤔

[–]namitynamenamey 3 points4 points  (0 children)

Wasn't delving what the dwarves did in khazad-dum?

[–]Kriima 3 points4 points  (1 child)

Another problem is that, as a French person, I rarely if ever used the word "delve". But I've read so much LLM-generated stuff that I now use it more frequently... I learn from these AIs, so I'm also being shaped by them, I suppose. This is kinda scary.

[–]ArsNeph 2 points3 points  (0 children)

Hahaha, yeah, it's pretty common among people who use LLMs a lot. That's what people mean when they say that the LLMs are fine-tuning us :P

[–]MidAirRunnerOllama 8 points9 points  (2 children)

"delve" is not unique to LLMs at all.

[–]ArsNeph 11 points12 points  (1 child)

I'm sorry if I didn't make it clear: I don't mean that the word delve is exclusively used by LLMs, I mean that the frequency and context of its usage are almost exclusive to LLMs, based on how little the average English speaker uses it.

[–]MidAirRunnerOllama 2 points3 points  (0 children)

Got it. Thanks for clarifying.

[–]FitPop9249 0 points1 point  (0 children)

This, yes.

[–]remyxai 12 points13 points  (0 children)

The OP asks about several distinct model families, not open-source fine-tunes trained on synthetic data. This suggests they share a common underlying pretraining data distribution, relying on the same cheaply sourced web-scale data scrapes.

At the same time, the superficial alignment hypothesis tells us to attribute most of a model's behavior and knowledge to the pretraining data distribution.

Rather than explaining why all models sound the same, synthetic data is the reason there are 1M variants on HF, many quite differentiated.

Synthetic data refers to cheap, automated ways of designing data; it's not a way to describe a text distribution.

[–]daHsu 6 points7 points  (1 child)

If you think about it, this is not unlike the phenomenon of pop science/fitness/whatever influencers distilling knowledge into smaller and smaller bites until it is near meaningless—but if you follow the trail of knowledge it did at some point derive from a research paper or expert in the field.

My guess is this is sort of how progress always goes: advancements at the high end will still come, but they are slower and harder, while distillations will always be cheap to print and profitable until the market is truly saturated.

[–]ArsNeph 0 points1 point  (0 children)

You make quite a good point. The interesting difference here, though, is that average people and the people writing the research (the "large models") don't really differ in parameter count from the people at the bottom, in the sense of their capability to learn; they just have different pre-training data. If a person at the bottom wants to simply read the research paper, they can, and the knowledge is no longer distilled, though it won't have as much nuance as the expert's, who has more relevant pre-training data. Unfortunately, with LLMs, the models at the bottom all have a more limited capacity to store information and reason, making distillation the more effective method.

Yeah, progress always starts at the high research level, but it's expensive and slow to manufacture. As manufacturing gets better, it becomes cheaper, and then eventually you start getting cheap knockoffs with reduced quality, and the market is flooded with garbage. The garbage gets better with time, then most people are satisfied with the cheap options, and the research begins to stagnate.

[–]allinasecond 1 point2 points  (6 children)

Can you explain to me a little bit about the "synthetic" data? What does this really mean?

I know that these LLMs are trained with curated datasets that encapsulate a lot of the internet, and also that they are fine-tuned via RLHF.

Where does the synthetic data enter the scene? At what point? And what generates it? Is this data then fed in the training dataset?

[–]ArsNeph 43 points44 points  (4 children)

The original LLMs, namely GPT-3.5/4 and Llama 1, used web scrapers to scrape a large portion of internet content and format it into a dataset. They then used this as pre-training data, which produces a base model, effectively a high-quality text autocomplete. This base model then undergoes instruct tuning, which teaches it to follow instructions and thereby chat with people. In the case of ChatGPT, they taught it to respond in a professional, dry, robotic, corporate manner. They also used RLHF to rank its responses and optimize for human preference. However, in the case of Llama 1, because this was prior to the mass usage of synthetic data, it actually had very colorful, realistic, human-like use of language, but terrible intelligence compared to GPT.

After the Llama 1 leak, the first fine-tunes of Llama 1 came out. These were purely research oriented, but the original Alpaca and Vicuna were research showing that training large language models on GPT-3.5 chat logs significantly improved their performance. People began collecting GPT chat logs and turning them into massive datasets, leading to what we now call synthetic data, in other words, data generated by an LLM.

The use of synthetic data is essentially a form of distillation, which means taking a large model and training a small model on its outputs, to make the small model's intelligence and responses as close to the large model's as possible.

Meta almost immediately caught on to how effective the usage of synthetic data was, and used it in its training datasets for Llama 2, causing intellectual capabilities to skyrocket, but the manner of speech and verbal tics of GPT were now included in the Llama series. After this came Mistral 7B, a small open-source model by a small open-source startup, which showed that even a small model can be significantly better than a large model if trained properly. Mistral, being a French company with not that many resources, has always relied very heavily on synthetic data.

Mistral 7B kicked off the small-model craze, in which experimentation and new techniques for optimization were rampant, and RLHF generally fell out of favor in both the open-source community and the corporate community, since it requires human labor, is slow, and is expensive, and therefore can only be done by large corporations, which was a huge problem for the open-source community. It was replaced by newer techniques like DPO (direct preference optimization) and experimental ones like SimPO or KTO.

One of the most common complaints during this era was that every model talked exactly like ChatGPT. Meta caught on to this, and when they released Llama 3, they used DPO during the training phase to make the model seem much more friendly and human. This boosted its place on a human-preference leaderboard, LMSYS, and caught on very quickly. Other models followed suit, with Claude 3.5 also opting for similar friendly speech, and Gemma 2 doing the same. Nowadays, that is the standard.
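
For the curious, DPO boils down to a single loss over (chosen, rejected) response pairs scored against a frozen reference model. Here's a minimal PyTorch sketch of that loss; the log-probabilities below are toy tensors, not real model outputs:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization: push the policy to prefer the chosen response
    over the rejected one, relative to a frozen reference model, with no reward model."""
    chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen - rejected).mean()

# Toy batch of 4 preference pairs (sequence-level log-probabilities).
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss.item())
```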

I'm no fine-tuner, so my understanding is limited, but modern synthetic data collection generally works by creating a dataset of questions, figuring out what is considered the "best" model at the time, and then making tons of API calls in parallel, adding the LLM's responses to the dataset. As of right now, that would be Claude 3.5 Sonnet. However, some people prefer to open a RunPod instance or host locally, in which case they get the best local model they can get their hands on, which is either Mistral Large 123B or Llama 3.1 405B, and have it generate the answers to the dataset.
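
A minimal sketch of that workflow, assuming an OpenAI-compatible endpoint (most gateways and local servers expose one); the teacher model name and the questions are placeholders:

```python
import json
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI  # works against any OpenAI-compatible server, local or hosted

client = OpenAI()                       # assumes OPENAI_API_KEY (or a local base_url) is set
TEACHER = "gpt-4o"                      # stand-in for "best model of the moment"

questions = [
    "Explain the difference between a base model and an instruct model.",
    "Summarize how RLHF works in two sentences.",
    # ... thousands more, usually loaded from a file
]

def answer(q: str) -> dict:
    resp = client.chat.completions.create(
        model=TEACHER,
        messages=[{"role": "user", "content": q}],
        temperature=0.7,
    )
    return {"instruction": q, "output": resp.choices[0].message.content}

# Fan the questions out in parallel and collect instruction/output pairs.
with ThreadPoolExecutor(max_workers=8) as pool:
    rows = list(pool.map(answer, questions))

with open("synthetic_instruct.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```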

I know for a fact that synthetic data is heavily used during instruct tuning, but as for how it figures into pre-training, I'm not quite sure; you may want to read a paper about that.

[–]jart 5 points6 points  (1 child)

However, in the case of Llama 1, because this was prior to the mass usage of synthetic data, it actually had very colorful, realistic, human-like use of language, but terrible intelligence compared to GPT.

Humans aren't that intelligent. LLaMA 1 was actually capable of being a friend to people. GPT is more like how a 130 IQ person talks to a 70 IQ person.

[–]Zenobody 0 points1 point  (1 child)

Mistral, being a French company with not that many resources, has always relied very heavily on synthetic data.

Do you think this is still true for their models released since July (Nemo, Large and Small)?

I have a subjective feeling that Mistral models since Nemo have been trained on a much richer dataset. E.g. Nemo seems to be much better at Portuguese than Mixtral 8x7B, despite being much smaller.

[–]ArsNeph 1 point2 points  (0 children)

I believe this still holds true, just that they now use mixed synthetic data from more than one source, so they're likely including Claude 3 and some Llama data as well. Though it is entirely possible that their newer models use synthetic training data from the original closed-source Mistral Large and are thereby distilled versions of that. I can't say much about Mistral Small, as I haven't actually used it much so far. As for Mistral Large, they probably use significantly less synthetic data in order to retain a more original flavor.

I can't say much about how they're training the models in other languages, as the last time I used a Mistral model in the other language I speak (Japanese), it was downright awful and incoherent, but I do think they're probably using a much larger percentage of real Japanese data, as Japanese synthetic data is likely nowhere near the quality of English synthetic data.

[–]robogame_dev 7 points8 points  (0 children)

"synthetic" data? What does this really mean?

It means data which was automatically generated. Potentially using another LLM, or potentially using some other automatic technique.

There's a blurry line between data cleaning and synthetic data; to some extent, most data will be at least partly synthetic. But in this context, they're referring to totally synthetic data, for example:

"Big LLM, write 100 examples of the form <question> and <answer> where the question contains something from our list of objectionable topics, and the answer is a polite refusal of the topic." Now they train or fine-tune smaller LLMs on that (or even the original LLM) so it has hundreds of examples of when to refuse. Refusal is just a random example; you can make any kind of synthetic training data. Here's another example:

"Big LLM, generate 100 examples of different ways a customer might refer to being locked out of their account, 'its not working', 'cant login' etc." Then you train on that so that you can interpret what the customer means when they say something predictably vague.

[–]Status-Shock-880 0 points1 point  (0 children)

The point is mediocrity. A consensual reality no one wants. It’s a base, but barely.

[–]Imjustmisunderstood 0 points1 point  (1 child)

With the tens, maybe hundreds, of trillions of tokens available from scraping the web and pirating books, is there really any need for synthetic data? Is it just a benchmark hack? I do see a use down the line when we become better at tuning/pretraining, but right now we’re shoveling mud into a fish barrel.

[–]Xanjis 0 points1 point  (0 children)

The data quality from the internet is very bad, but there aren't 15 trillion clean natural tokens to train, say, Llama 3 on. So synthetic data fills the gap by being clean and plentiful. It's not ideal, which is why new clean natural data is being created (having experts answer lots and lots of questions).

[–]Feztopia 58 points59 points  (2 children)

Someone at OpenAI said this: if you train enough, they all come to represent the training data, no matter what the architecture is. This is why I have high hopes for stuff like Mamba or RWKV (they should also converge to the same model while being more efficient to run).

[–]NunyaBuzor 17 points18 points  (0 children)

or they trained on the same gpt4 synthetically generated data.

[–]epicchad29 1 point2 points  (0 children)

That’s not quite what that means. Any model will converge to the training set with enough epochs. That doesn’t mean any model will generalize. In the context of LLMs that just means that they will actively be able to “complete the text” on their training data, not necessarily be able to write anything valuable

[–]AnomalyNexus 12 points13 points  (2 children)

Outside of deepseek

Interesting... I was doing some janky testing on my own testing framework last night and DeepSeek ended up being an outlier too, so it's interesting to see another mention of it <24 hrs later.

I wonder if it's because they train with Chinese content too and are benefiting from the diversity that injects.

[–]MerePotato 6 points7 points  (0 children)

I mean in theory the Chinese internet is more insulated from GPT contamination, but then it has its own issues to say the least as well

[–]MixtureOfAmateurskoboldcpp 4 points5 points  (0 children)

If we assume the same tokens in different languages embed to nearly the same place, and models can predict beyond language, multilinguality should be really beneficial for a unique style. e.g. learning weird sayings and beautiful expression from Russian. I wouldn't say censored datasets like China's are amazing, but diversity is diversity.

[–]FishermanEuphoric687 25 points26 points  (4 children)

One thing I notice is that they all associated themselves with the 'Echo' nickname at some point.

[–]ThePixelHunter 4 points5 points  (3 children)

What do you mean?

[–]FishermanEuphoric687 14 points15 points  (2 children)

Don't remember how I got there but many months ago I asked for names the LLMs associate with and out of 5-10 names in the list, there's always ‘Echo’.

They explained that their output is a reflection of human users input, hence the name. You can probably google it, I assume it's just a creative interpretation of their role.

There's also a similar thread where a user asked for something innovative and was suggested the same technology by different LLMs: a tech that replays human dream memories, with the ability to edit the recording and then exchange it with friends and family. This was months ago though, and probably requires a similar or identical prompt.

[–]ThePixelHunter 2 points3 points  (0 children)

Interesting, thanks.

[–]DaleCooperHS 0 points1 point  (0 children)

Holy shit the last thing I want my parent to see is my dreams

[–]ortegaalfredoAlpaca 23 points24 points  (0 children)

They are converging because there is only one internet to train them.

[–]DisasterNarrow4949 34 points35 points  (3 children)

I swear if I’ll be trying to generate fantasy content and any LLM got me a response about “Whispering” crap again, I quit.

[–]antiquechrono 13 points14 points  (1 child)

Whispering forest of echoes will come out of basically any model I’ve tried.

[–]NotFatButFluffy2934 11 points12 points  (0 children)

A comment above said "Echo" is a commonly generated word when asked for names. Seems mildly interesting

[–]DaniyarQQQ 5 points6 points  (0 children)

Do not forget female characters named Elara

[–]On-The-Red-Team 27 points28 points  (4 children)

Yes. This is also why tuners and those who have run thousands of different LLMs can see through SLOP patterns.

[–]knight1511 4 points5 points  (3 children)

Is there a full form to SLOP?

[–]MixtureOfAmateurskoboldcpp 6 points7 points  (1 child)

Shitty language OpenAI produces? Idk

[–]knight1511 1 point2 points  (0 children)

Lmao

[–]On-The-Red-Team 2 points3 points  (0 children)

Superfluous Language Overuse Pattern

Look it up on github.com

[–]EverlierAlpaca 21 points22 points  (3 children)

You might find klmbr interesting

[–]Captain_Pumpkinhead 2 points3 points  (0 children)

Ooh, will have to try that later!

[–]stonediggity 2 points3 points  (0 children)

This is cool

[–]shaman-warrior 1 point2 points  (0 children)

Interesting indeed

[–]carnyzzle 5 points6 points  (0 children)

and for a minute I thought it was just me noticing that LLMs seem to output nearly the same things even when they're from entirely different organizations

[–]Sabin_Stargem 5 points6 points  (0 children)

I would speculate that all base models are fed their fundamental datasets in a certain order during their training. Kinda like as if you gave a classroom of children a specific set of books for kindergarten, with no change in order of reading. Since AI can't bring their 'outside' experiences to change their perception of this education, it becomes rote.

Maybe something like the DRuGs sampler, but for training, could help? It injects noise into layers while generating output - the AI can overcome the noise, but it deviates the response in a small but random way. Doing that during training might allow for simulating 'outside' randomness. The 'missed' entries of data that were suppressed by input noise can be injected again later into the model with another noise session, allowing the model to gradually receive all aspects of the education, but viewing the new information through a different lens of preceding information.

Kinda like shuffling a deck of cards, having a person pull out a bunch, shuffle again, and keep doing that until you run out of cards. The initial deck of cards might be the same, but the order of the card pile will be quite different.
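
To be clear, I'm just speculating, but the "noise into layer outputs" part is easy to sketch. This toy wrapper is not the actual DRuGs sampler, just the general idea applied while training:

```python
import torch
import torch.nn as nn

class NoisyLayer(nn.Module):
    """Toy illustration of perturbing a layer's activations during training."""
    def __init__(self, layer: nn.Module, sigma: float = 0.02):
        super().__init__()
        self.layer, self.sigma = layer, sigma

    def forward(self, x):
        h = self.layer(x)
        if self.training:                      # only perturb while training
            h = h + self.sigma * torch.randn_like(h)
        return h

# Wrap an arbitrary feed-forward block and run a dummy batch through it.
block = NoisyLayer(nn.Linear(512, 512))
out = block(torch.randn(8, 512))
print(out.shape)
```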

[–]Flaky_Solution_8272 3 points4 points  (0 children)

The Platonic Representation Hypothesis

Quote: “We argue that representations in AI models, particularly deep networks, are converging.”

[–]M34L 9 points10 points  (9 children)

You'd need to be more specific for anyone to be able to tell whether you have a point or have just found a relatively niche thing imprinted hard across the datasets.

If it's supposed to be 100 "random" addresses or ZIPs or phone numbers, then yeah, no shit, the dataset bias is gonna be massive.

[–]segmondllama.cpp[S] 4 points5 points  (8 children)

I don't need to be specific about my prompt. You could take the same question, ask multiple models, and notice that they produce similar answers. For some problems, it's obvious and expected, since there are only a few possible answers. For example, I queried multiple models about a math problem, and there are about 3 possible ways to model it. They all answered using one of those 3; if you ask for an alternate solution, they will try another one. I wasn't surprised by that, it's expected.

When I ask models to generate a list of 100 items from a category where we could list 5,000, and they all generate the same damn thing, that makes me go hmmm. Again, I would hope that some of them would generate quite different data; it's not like they are sharing the same exact dataset. But then, when the internet is your dataset, I suppose it's the same dataset for all. That's why my last statement is a question: has anyone experienced the same? Either you have and you can chime in, or you haven't and you can tell us how you asked many models the same questions and got vastly different answers.
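
For anyone who wants to reproduce this, here's roughly how I'd measure it. The single OpenAI-compatible client and the model names are assumptions; point them at whatever gateway serves your models:

```python
from itertools import combinations
from openai import OpenAI  # stand-in client; set base_url to your own gateway/server

client = OpenAI()

PROMPT = "List 100 famous scientists, one name per line, no numbering."
MODELS = ["gpt-4o", "llama-3.1-405b", "deepseek-chat"]   # example names, adjust to your setup

def get_list(model: str) -> set[str]:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": PROMPT}]
    )
    lines = resp.choices[0].message.content.splitlines()
    return {line.strip().lower() for line in lines if line.strip()}

lists = {m: get_list(m) for m in MODELS}

# Jaccard overlap between every pair of models' lists.
for (m1, s1), (m2, s2) in combinations(lists.items(), 2):
    print(f"{m1} vs {m2}: {len(s1 & s2) / max(len(s1 | s2), 1):.0%} shared items")
```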

[–]ResidentPositive4122 7 points8 points  (0 children)

When I ask models to generate a list of 100 items from a category where we could list 5,000, and they all generate the same damn thing, that makes me go hmmm. Again, I would hope that some of them would generate quite different data

Brother, that's exactly the point of LLMs. Next-token prediction, sort by highest probability, print. You absolutely need to play with your prompt if you want new stuff. If you don't, you'll get the same average stuff. It's a feature, not a bug.

[–]sometimeswriter32 4 points5 points  (6 children)

I find it very hard to believe you are actually getting identical 100 item lists from different models. I call b.s.

[–]Chongo4684 8 points9 points  (1 child)

This might be the model collapse scenario that has been talked about in some of the arxiv papers.

[–]Sixhaunt -1 points0 points  (0 children)

for every 1 paper claiming it to be true, there appear to be a dozen trying to replicate it and finding that it's BS. People have considered model collapse to be debunked for a while, from what I can tell

[–]CheatCodesOfLife -3 points-2 points  (1 child)

Opus, gpt4 and llama3.0 all generated almost identical lottery numbers for me a few months ago.

[–]0xCODEBABE 1 point2 points  (0 children)

ok well i just did it and they are very different.

[–]segmondllama.cpp[S] -2 points-1 points  (1 child)

Identical doesn't mean exactly the same, but say 70-80 out of the 100 are the same, and the top 5 are almost always the same top 5.

[–]sometimeswriter32 -3 points-2 points  (0 children)

Well LLama 3.1 405b and Mistral Large 2 still can't answer a pet pop culture trivia question that state of the art models can answer, so if there is a convergence open models have a ways to go.

[–]chengzi9 2 points3 points  (0 children)

Deepseek is good indeed

[–]qrios 8 points9 points  (4 children)

Yeah I've noticed that if you ask the models for a dataset of the numbers from 1-100 in ascending order, they all seem to give very similar (often identical) answers. Very spooky.

[–]qrios 5 points6 points  (0 children)

This is of course, conclusive evidence. But just out of curiosity, tell us more about your own methodology.

[–]Sixhaunt 1 point2 points  (0 children)

interesting, I tried with descending and it's also the same across models. Someone should investigate this

[–]Eheheh12 -1 points0 points  (1 child)

Are you talking about a subset, or is this a not-so-funny joke?

[–]qrios 2 points3 points  (0 children)

It is an extremely funny joke. A burn, even.

[–]hold_my_fish 1 point2 points  (0 children)

You didn't mention Claude 3.5 Sonnet. It has an edge on everything else out there, in my experience. There's a lot I don't like about Anthropic's company culture, but I have to respect their model creation ability.

[–]StuccoGecko 1 point2 points  (0 children)

Do you think companies' desire to differentiate their LLMs will lead to the sponsorship of MORE human-made content to then train the LLMs on? Otherwise I don't know how you avoid all the LLMs indeed becoming the same.

[–]Dudensen 1 point2 points  (0 children)

Different LLMs solve some tasks I give them in the same wrong way so that's probably at least somewhat correct.

[–]FitPop9249 1 point2 points  (0 children)

I notice similar patterns all the time now in articles online

[–]silenceimpaired 4 points5 points  (0 children)

I think your statement is too vague, and if you're asking for information, the only thing that may be able to vary is the yapping and formatting.

I’m curious if you tried qwen. I’m loving it.

[–]moarmagic 1 point2 points  (2 children)

At the end of the day, isn't this sort of inevitable? In the sense that an LLM is autocomplete on steroids, trained to generate the statistically probable next token; at some point, if you leverage enough text in a given language, they will all find the same statistically probable next token, because the average of all English text would be roughly the same.

I know you can adjust temperature and try to compensate. And there probably would be value there in trying to remove the least useful datasets, which might increase the average 'correct' answers to some types of questions and information for others. But like, the statistically average name is probably always going to be John, James, Will or something similar, right?
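
On the temperature point, the knob acts directly on the next-token distribution; a toy sketch (the logits are made up):

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> int:
    """Higher temperature flattens the distribution, so less-probable (less 'average')
    tokens get picked more often; temperature -> 0 collapses to greedy argmax."""
    if temperature <= 0:
        return int(torch.argmax(logits))
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

logits = torch.tensor([4.0, 3.5, 1.0, 0.2])     # toy next-token scores
print([sample_next_token(logits, t) for t in (0.0, 0.7, 1.5)])
```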

[–]freecodeio 1 point2 points  (1 child)

so basically this shows that llm's can't invent new things

ai bubble bursted

[–]moarmagic 0 points1 point  (0 children)

'Inventing new things' covers a lot of ground. It's true that LLMs are not, by design, going to be creative, and the area where they really shine is going to be along the lines of data output/manipulation: fill out forms, extract information and repackage it. But with around 80% accuracy, that still seems like a hell of a risk to put into production, so I guess it depends what the worst-case scenario is for your workflow if 1/5 answers are wrong.

I think there's some potential, but I really dislike people throwing around 'AGI' and 'ASI' - those terms are pretty meaningless. (And hey, we're pretty bad at measuring human intelligence as it is anyway!)

[–]unlikely_ending 0 points1 point  (0 children)

They're all almost identical GPT architectures being trained on almost identical datasets

So yes, kind of inevitable

[–]Glittering_Voice3143 0 points1 point  (0 children)

Intelligence is relatively flat, features are soaring... CoT is a feature that makes models better. Simply put, the foundation models are doing at a large scale what the little guys have been doing to maximize LLMs for the last year. IMHO.

[–][deleted] 0 points1 point  (0 children)

This is a shot in the dark, but it could be that the randomness injected into these models is not real randomness but algorithmic randomness, and that algorithmic randomness has some hidden property that is making these models converge. I could be completely wrong, but I would have liked to experiment by tinkering with the source of randomness when training these models.

[–]Eheheh12 0 points1 point  (0 children)

I once had the same problem in coding; I think I had a bug, but when I asked ChatGPT and Claude I got the same fix, which wasn't correct. Interestingly, when I googled it, there was one Stack Overflow solution that's the same as the one the LLMs gave me. It was clearly a different problem, but it was the closest one (in terms of wording) on the internet.

LLMs are really good at data compression. I feel that if there is one single solution out there on the internet, LLMs will be able to surface it as the solution.

[–]TheManicProgrammer 0 points1 point  (0 children)

Just as with evolution everything eventually becomes a crab, so too with LLMs: they all become GPTs.

[–]cosmicr 0 points1 point  (0 children)

I use mistral, phi and llama for image prompts and they are very different from each other. (mistral and llama 3.2 are closer but phi is majorly different.)

[–]Mysterious-Rent7233 0 points1 point  (0 children)

https://arxiv.org/abs/2405.07987

We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways by which different neural networks represent data are becoming more aligned. Next, we demonstrate convergence across data modalities: as vision models and language models get larger, they measure distance between datapoints in a more and more alike way. We hypothesize that this convergence is driving toward a shared statistical model of reality, akin to Plato's concept of an ideal reality. We term such a representation the platonic representation and discuss several possible selective pressures toward it. Finally, we discuss the implications of these trends, their limitations, and counterexamples to our analysis.

[–]BiteFancy9628 0 points1 point  (0 children)

They’re not only trained on the same data, they’re trained on each other. They all use the best at the moment to validate their quality, generate synthetic training data, and otherwise basically reverse engineer whatever the top model is doing.

[–]ResidentPositive4122 0 points1 point  (0 children)

Without knowing how you prompted them, this is of little value tbh. These models have been tuned for 0-shot, simple, stupid, "aligned" prompts. Of course they will all seem equally bad once you've seen it day in and day out.

That's why the 1M personas dataset is interesting. Use some of that dataset to prompt them, and see where that leads. Then use that same dataset for synthetic generation, fine-tune your own models (LoRAs are cheap af now) and see where that leads.
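
A rough sketch of the persona idea; the persona strings here are stand-ins for rows from that dataset, and the client and model name are placeholders:

```python
from openai import OpenAI  # any OpenAI-compatible endpoint; model name below is a placeholder

client = OpenAI()

# Stand-ins for rows from a personas-style dataset (the real one has ~1M of these).
personas = [
    "a retired marine biologist who collects antique maps",
    "a teenage speedrunner who moderates a retro-gaming forum",
    "a pastry chef studying materials science at night",
]

rows = []
for persona in personas:
    prompt = f"You are {persona}. List 10 scientists you find inspiring, one per line."
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,
    )
    rows.append({"persona": persona, "output": resp.choices[0].message.content})

# Even with one fixed question, varying the persona varies the answers -- that's the point.
```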

Open-source and open-weights is more than download -> prompt for 0shot or count_the_rs -> whine on reddit.

[–][deleted] 0 points1 point  (0 children)

Makes sense when, under the hood, they all follow the same architecture, and as the models become more mature they all get trained on similar datasets.

[–]Fresh-Tutor-6982 0 points1 point  (0 children)

Of course they won't point to ASI. You are using the 2023 paradigm, tech, and scaling, which is basically the GPT-4 paradigm.

[–]Defiant-Mood6717 -1 points0 points  (0 children)

This means absolutely nothing regarding ASI. You are testing domain knowledge; do you think that's related to reasoning, planning, adaptation?

All this means is that, in the world, those 6 items are the most likely to appear as possibilities. LLMs are exposed to data on the internet. It has nothing to do with them using the same data; it's simply that the data describes the same world.

[–]NikoKun -1 points0 points  (0 children)

I've seen this topic brought up before. They're simply converging on the same model of reality and reasoning, because their data comes from our reality. I don't think their convergence says anything about whether they're heading for ASI or not. I think that trend will simply be the result of training on synthetic data created through AI Agents using reasoning to build more high quality data as they improve.

[–]mindplaydk -1 points0 points  (0 children)

May I ask, when you say it "generated a data set", does this mean you were prompting it to extract some sort of world knowledge?

You weren't feeding it your own data and asking it to extract something from that? 

If you're extracting world knowledge, every base model was trained on largely the same data foundation, so they're going to have largely the same world knowledge. (with minor variations due to model size and other constraints.)

World knowledge is not where I see noteworthy differences between models - the interesting differences are in their behaviour, which depends on their fine tuning, which tends to be proprietary.