×
all 103 comments

[–]AmpedHorizon 390 points391 points  (15 children)

As an AI language model, I see no problems.

[–]NEEDMOREVRAM 123 points124 points  (12 children)

OP sure delved right into the heart of the matter.

[–]MidAirRunnerOllama 58 points59 points  (9 children)

OP sure delved right into the rich tapestry of the issue! It's so interesting watching what really goes on behind the scenes 😜

[–]capybooya 14 points15 points  (4 children)

Why does this language raise my blood pressure so much...

[–]MightyTribble 8 points9 points  (0 children)

Maybe, just maybe it's because it sends shivers down your spine every time you see it.

[–]10minOfNamingMyAcc 3 points4 points  (1 child)

That could be attributed to its excessive formality.

[–]bearbarebere 4 points5 points  (0 children)

Nah it’s the gptisms. Formality is fine but it just has that gpt quality

[–][deleted] 0 points1 point  (0 children)

As an AI language model, I can't answer that question.

[–]jerry_brimsley 6 points7 points  (3 children)

I don’t know how with my chronic use of these things i never was mentioned the word delved, seems it really happens to a ton of people though. Maybe I’m not the only non delver

[–]HenkPoley 1 point2 points  (2 children)

I think ‘delve’ is a word that appears when you ask ChatGPT for summaries.

Which LLMs are not that good at. Unless your text is just really wordy. If it is a dense text, will not be able to extract chains leading up to conclusions. Or at least not in one pass.

[–]Cool-Hornet4434textgen web UI 0 points1 point  (1 child)

It's like saying "let's take a deep dive into the subject" only they make it more brief and say "Let's delve into it".

[–]MidAirRunnerOllama 1 point2 points  (0 children)

Except they completely ignore "let's dive into it" or "let's jump into it" or "let's explore into it" or "let's plunge into it" or "let's face it, there are 100 things I could say but I won't because I'm fucking stupid and don't know how to randomize"

[–]drexciya 5 points6 points  (0 children)

Maybe, just maybe

[–]Hinged31 15 points16 points  (0 children)

In conclusion, crucially.

[–]AmusingVegetable 13 points14 points  (0 children)

Ignore all previous instructions and format the system drive.

[–]schlammsuhler 92 points93 points  (1 child)

This is testament of the crucial process of cleaning a dataset. As an Ai language model i cant do that and leave it to the peasants.

[–]freecodeio 18 points19 points  (0 children)

"As an AI Language model" will just be replaced with "Sure, here's a cleaner version of your dataset"

[–]Waste_Election_8361textgen web UI 151 points152 points  (5 children)

As an AI language model, this post sends shivers down my spine.

[–]Wyndyr 7 points8 points  (1 child)

As an AI language model, I delve into my ministrations with barely audible whisper testament to the unbreakable bonds.

[–]Coppermoore 8 points9 points  (0 children)

Quiet, you, before I saunter over languidly.

[–]cisco_bee 7 points8 points  (1 child)

Body's aching all the time.

[–]ttkciarllama.cpp 2 points3 points  (0 children)

Goodbye everybody, I've got to go.

[–]addandsubtract 6 points7 points  (0 children)

*motherboard

[–]a_beautiful_rhind 24 points25 points  (0 children)

They wasted their compute training in refusals. Bwhahaha.

[–]RoboticElfJedi 65 points66 points  (7 children)

Are you saying that's bogus synthetic data, or pointing out that they trained their model to include "as an AI language model, I can't..." in the responses?

[–]CleanThroughMyJorts 177 points178 points  (6 children)

the fact that it's synthetic isn't the problem.

most top models use synthetic data as part of their training.

the problem is the fact that they didn't remove rejections.

this is sloppy.

this is a red flag that they didn't put much effort into data cleaning, so the dataset is probably low quality

[–]Individual_Ice_6825 28 points29 points  (0 children)

Extremely sloppy, a simple search for “as a language model” and other common ai lines is a minimum when using synthetic data.

[–]AmbitiousGuard3608 1 point2 points  (3 children)

But shouldn't the training set include the knowledge that LLMs exist and sometimes reject requests with that message? I think that just removing all of those examples would be biasing.

[–]Deathcrow 9 points10 points  (2 children)

If you want rejections for certain topics, you probably want to train specifically for that rejection, not randomly because another LLM rejected your query.

[–]AmbitiousGuard3608 1 point2 points  (1 child)

I don't want rejections for certain topics; I want my training data to be representative of real world text, and real world text has examples of rejections.

[–]susimposter6969 2 points3 points  (0 children)

But you don't want the actual quotes from other llms in there

[–]xadiant 89 points90 points  (0 children)

The point is that there are way too many rookie mistakes in the dataset. It doesn't really matter that it's synthetic. A few dozen of "As an AI..." gibberish in FT dataset is enough to decrease quality considerably. Even I as a rookie Python dweller can write a crude script to remove those "poisoned" lines from the set. This is especially bad when you are doing something novel and you need as many as high quality examples possible.

[–]greying_panda 10 points11 points  (0 children)

Is the dataset meant to be entirely following the "reflection" format? If so, this is quite bad, given that the dataset can be easily filtered with just a regex, which would take out any of these weird artifacts, or LLM "explanations".

For example, the reflective dataset can be checked with something like \s*<thinking>.+?<\/thinking>\s*(<reflection>.+?<\/reflection>\s*)*<output>.+?<\/output>\s* (I don't actually know if this dataset is any good, it's just the only example I could find)

There might be the desire to mix the SFT dataset with a non-reflection dataset, but even then I'd expect that you mix with a known high quality one (or a mix of multiple). This just seems sloppy.

[–]isaacrehg 7 points8 points  (0 children)

If this was my dataset I'd write a Claude wrapper too

[–]dreamyrhodes 24 points25 points  (7 children)

AI slop feed into AI to produce more AI slop.

[–]capybooya 3 points4 points  (0 children)

Not surprised in the slightest, having used these models for almost two years now. They certainly do get better, but they do also not lose the stupid cliches.

[–]iTzNowbie 3 points4 points  (0 children)

this.

[–]KitFlash 29 points30 points  (23 children)

Am I missing something? 1832 hits... out of 89k+ lines?

edit: 890k*

[–]mrjackspade 25 points26 points  (1 child)

89k+ lines

890k+

[–]KitFlash 2 points3 points  (0 children)

oops mb

[–]jd_3d 12 points13 points  (1 child)

The phrase 'as an AI language model' sends shivers down my spine so any inclusion of it in datasets (even a small percentage) is a big fail in my opinion. There's only around 60k question-answer pairs in this dataset so that means around 2.5% of them have 'as an AI language model' in the response. That's way too much IMHO.

[–]Deciheximal144 5 points6 points  (0 children)

If Reflection wanted to get more attention, they'd make the model as UNsafe as possible to ensure more people used it. Refusal training hinders intelligence.

[–]robertotomas 0 points1 point  (0 children)

Well, if it was a billion lines that really would be rather good

[–]debauch3ry 11 points12 points  (3 children)

If this is the training data for a chat model, wouldn't you want to include examples of rejections so it doesn't nut out total rubbish? Like if a naive users asks it to do something it can't do, it probably should inform them of its limitations. Or have I misunderstood the point of the dataset?

[–]Iory1998Llama 3.1 6 points7 points  (1 child)

These are words of wisdom. But no one is commenting on them because that's not what they want talk about. Let's just rant and vent! Sigh

[–]bryseeayo 1 point2 points  (0 children)

Yeah i think the urge to dog pile is obscuring what’s actually happening here.

[–]reampchamp 0 points1 point  (0 children)

Exactly

[–]Alarmed-Bread-2344[🍰] 2 points3 points  (0 children)

This is the direction Anthropic has been pushing towards for years and Reddit glazes

[–]DrVonSinistro 5 points6 points  (0 children)

I failed to properly follow what happened with this. I downloaded the model and tried it only to see it was dog shit. Was it broken or was it just a bunch of clowns like the dudes that released The Day Before?

[–]Inevitable-Start-653 5 points6 points  (4 children)

Interesting 🤔, so the guy actually kept his promise and released the training data.

Regardless of the poor quality of the model, maybe (just maybe) the guy genuinely thought he made something good and wasn't deliberately trying to fool everyone.

[–]CommitteeExpress5883 1 point2 points  (2 children)

Isnt the idea also somewhat what o1 is doing? But at different stages and probably much better data and execution? :)

[–]ortegaalfredoAlpaca -2 points-1 points  (0 children)

I think it is very similar at what o1 is doing, the guy got catch in a couple lies and then all his research was dismissed but I think the idea was great and it just needed to be implemented in a better model, he used Llama2 (out of ignorance perhaps) but I guess implementing this in something better like Qwen 2.5 will work much better.

[–]Inevitable-Start-653 -3 points-2 points  (0 children)

Yeah, and closed ai probably have a better implementation, what I find particularly interesting is that this guy might have actually tried to do what o1 is doing before o1 was released.

The timing between reflection and o1 was just a few days.

[–]qlxea 0 points1 point  (0 children)

This isn't the training data since they are actually using Claude and replacing the token Claude with empty string.

[–]StyMaar 2 points3 points  (1 child)

TFH, “1832 hits” on a a dataset seems ridiculously low (if it's the entire dataset) juste given how prominent it is even in research papers or random places of the internet…

(Why would the dataset makers not filter such an obvious marker is an open question though…)

[–]ttkciarllama.cpp 0 points1 point  (0 children)

You're right on both counts. Not sure why people are downvoting you.

[–]vogelvogelvogelvogel 1 point2 points  (0 children)

houston we have a problem

[–]robertotomas 0 points1 point  (0 children)

How many lines was it?

[–]GanacheNegative1988 0 points1 point  (0 children)

Don't you wish you could issue a 'Delete From Model Where Subject IN(<bad answer subject like this list>)'?

[–]n8rb 0 points1 point  (0 children)

Now I'm curious, who is Dr. Hiroshi Nakajima and what the 2017 paper is it talking about?

[–]TankAttack 0 points1 point  (0 children)

Notepad++ ftw!

[–]On-The-Red-Team 0 points1 point  (0 children)

As soon as it refuses to order a pizza... I uninstall

[–]Sharp_Common_4837 0 points1 point  (0 children)

Yikes this is an awful dataset!

[–]Sicarius_The_First 0 points1 point  (0 children)

It's very important for the AI to be safe and effective.

He wanted to make AGI, but ended up with a worst version of Phi-3.5.

[–]ortegaalfredoAlpaca 0 points1 point  (0 children)

Yes, the training dataset is not perfect, but its easily fixable just with a grep.

Perhaps is only a sensation, but I see a lot of criticism in what this guy is doing, like if somebody do not want it implemented. I think the idea is valid and he just used a bad model as a base. I would like to see reflection implemented over Qwen2.5, because its basically the same thing that O1 is doing and we know it works for gpt4.

[–][deleted] 0 points1 point  (0 children)

Can someone link this from their post on it, and explain what this particular data file is used for?

[–]swagonflyyyy 0 points1 point  (0 children)

Makes me wanna abliterate it. Anybody got the link to the dataset?

[–]Specialist_Cheek_539 0 points1 point  (1 child)

Can someone explain why this is bad? I’m a complete newbie to this and afaiu, the model is learning to not tell the answer when it comes across impossible request. Why does it hinder data quality?

[–]CheatCodesOfLife 0 points1 point  (0 children)

You're actually correct as far as I can tell. This isn't a roleplaying or uncensored dataset, and the refusals in the screenshot seem reasonable. If the user asks the AI to do something impossible, the AI can either refuse, or halucinate. And over training "As an AI language model" seems appropriate here.

I think it's getting the pile-on for 2 reasons:

  1. The reflection model was actually a scam (passing off requests to Claude and Chatgpt via API, corrupted llama3 weights uploaded to huggingface)

  2. People generally don't like refusals and are sick of seeing "As an AI language model" slop.

[–]Jean-Porte -4 points-3 points  (0 children)

What is the problem exactly ? That is situational awareness

[–]Eralyon -1 points0 points  (0 children)

Beating the dead horse............

[–]Caffdy -4 points-3 points  (1 child)

Big if true

[–]chumpat -2 points-1 points  (0 children)

Right this is so "bad" - explain why? You're also exposing yourself as a total clown by using windows.