all 37 comments

[–]Ouitos 16 points (2 children)

Goodhart's Law at its best.

Hopefully, in the not-too-distant future, there will be some form of multi-company-and-university consortium for proper model evaluation, one that doesn't rely on good faith and makes it hard to identify which model is which.

[–]lostmsu 1 point (1 child)

If a model evaluation method does not require good faith, why does it need a consortium?

[–]Ouitos 0 points (0 children)

I'd say a good benchmark needs money, especially if you want it to be robust against potential cheaters.

But a good benchmark is also a good way to prove the value of your model. Having a consortium means that all competitors have agreed to abide by the same rules, which means no category is more profitable for any particular competitor.

You do rely on the good faith of competitors, especially given the possibility of cartels; that's why I think universities need to be in the equation.

I do believe many other industries have developed the same kind of truly neutral benchmark as the result of consensus between competitors and universities.

I found this read pretty interesting on the matter (you know where to look to read it for free): https://www.sciencedirect.com/science/article/abs/pii/S014829631000233X

Note that it's also possible for an independent company to perform this kind of benchmark.

That is the case, for example, with https://www.dxomark.fr/ for image quality, or to some extent with Giskard for LLMs: https://www.giskard.ai/

[–]LoganKilpatrick1 10 points (1 child)

> If I'm doing it, then someone else is also doing it. Especially companies like Google. They own the internet!

No, we don't do that; it would defeat the purpose of the arena. We 100% don't do that. Can't speak for the rest of the internet, though, clearly.

[–]cwl1907 7 points (4 children)

[–]lostmsu 6 points (2 children)

Wow, I'm not sure removing the post was such a good thing in this case. How do we know lmarena's statement is true? It is likely they had protections, but it is possible the OP was able to circumvent them.

> Python script won't be enough

Even the phrasing here implies that they don't actually know. They just assume that their protection worked, but they explicitly did not verify the original poster's claim.

But the worst part is that the post was removed at their request. Even if the OP was wrong and was shadowbanned, the topic still deserves discussion, and their original account of events matters.
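For what it's worth, the kind of vote filtering being debated here can be sketched in a few lines. This is a purely hypothetical illustration (toy data model, made-up thresholds), not lmarena's actual pipeline:

```python
from collections import Counter

def flag_suspicious_voters(votes, min_votes=20, bias_threshold=0.95):
    """Flag voter IDs whose votes are overwhelmingly one-sided.

    `votes` is a list of (voter_id, winner_model) pairs -- a toy
    schema invented for this sketch, not lmarena's real data model.
    """
    per_voter = {}
    for voter, winner in votes:
        per_voter.setdefault(voter, Counter())[winner] += 1

    flagged = set()
    for voter, counts in per_voter.items():
        total = sum(counts.values())
        # High-volume voters who almost always pick the same model
        # look like scripted vote streams rather than organic users.
        if total >= min_votes and max(counts.values()) / total >= bias_threshold:
            flagged.add(voter)
    return flagged
```

Of course, a real defense would lean on IP/browser fingerprints and rate limits rather than raw vote ratios alone, and a careful attacker can stay under any fixed threshold, which is exactly why outsiders can't verify from the outside whether the protection actually worked.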

[–]gwern 4 points (1 child)

> Even the phrasing here implies that they don't actually know. They just assume that their protection worked, but they explicitly did not verify the original poster's claim.

Yeah, what I noticed about this statement is that they don't say they blocked this attack, even though it's a very specific attack where OP gave every detail you could possibly need to ID it. They only say that the attacker 'may not notice' their votes being filtered out or "We'll release a test showing this kind of attack fails". They don't say, 'yeah, we already knew about it and had been blocking it while it was happening, and if the votes suddenly happened to go in the attacker's favor, well, it was just a sheer coincidence, maybe the attacker has good taste in LLMs and got lucky, it happens'. (Also, are some CAPTCHAs now considered amazing security...?)

I've read many organizations' responses to news about hacks, and when the response is to pound the table about how many defenses you have, insist the attack couldn't have happened, and demand the claims be deleted, that usually means the attack succeeded and they're in denial.

[–]osmarks 2 points (0 children)

It was always somewhat problematic anyway, in that the median user has wrong opinions, is quite sensitive to style, and does not really push the limits of the models.

[–]kjunhot 5 points (0 children)

this is wild

[–]H4RZ3RK4S3 7 points (6 children)

One can bet on the rankings in the LM arena?!?! How f**ked up is this world we're currently living in (and how stupid is the other side of such a bet)???

[–]derfw 1 point (5 children)

what's the problem

[–]H4RZ3RK4S3 1 point (4 children)

I think it's very weird and a sign of a very unhealthy society, where everyone is purely looking for their own gain over others (like in a zero-sum game) and everything is only about making more and more money. I understand that people try to game these systems. There is just no overall benefit from it: no economic value created, no scientific or societal progress gained. Just selfish money hoarding.

[–]osmarks 0 points (1 child)

What? Prediction markets serve a very useful purpose (predicting things).

[–]farmingvillein 2 points (0 children)

Yes, although that gets muddied when they warp incentives (like perhaps here).

(Although sometimes that's good! Owning stock creates an incentive to make the stock go up, which is generally a good thing, etc.)

[–]Pink_fagg 1 point (0 children)

That would be the same as if they put the benchmark data in the training set. We can only assume there are no bad actors.

[–]ganzzahl -5 points (11 children)

If this is true, it was an absolutely unethical thing to do, to the point that I can hardly bring myself to imagine you as anything but a self-consumed ass.

At the very latest, you should have stopped and notified the research community when your attempts were not detected.

[–]andarmanik -3 points (1 child)

You say they should notify people, but that's what this post is.

[–]ganzzahl -1 points (0 children)

They did it on two markets and made $15k in profit before informing anyone...

[–]ath3nA47 -1 points (0 children)

bro made 10k, cashed out, started a war between OAI vs Google for the votes, and single-handedly proved LM arena is not accurate on their ranking system. Absolute chad lol
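To put a rough number on how far a one-sided vote stream can move a rating: here's a toy simulation under classic sequential Elo with an assumed K-factor of 4. (lmarena actually fits a Bradley-Terry model over all battles, so treat this only as an order-of-magnitude sketch, not their real math.)

```python
def elo_update(r_a, r_b, score_a, k=4.0):
    """One sequential Elo update; score_a is 1.0 if model A wins, 0.0 if it loses."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two hypothetical models start tied; an attacker submits 300 votes
# that always favour model A.
r_a = r_b = 1200.0
for _ in range(300):
    r_a, r_b = elo_update(r_a, r_b, 1.0)

print(f"rating gap after 300 one-sided votes: {r_a - r_b:.0f}")
```

Even with a small K-factor, a few hundred scripted votes open a gap of several hundred points between two otherwise-equal models, which is far more than the margins separating the top of the leaderboard.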

[–]lostmsu -5 points (0 children)

Coincidentally, I am building an alternative to LM arena that should be much less prone to gaming like this, because it doesn't require humans in the loop.

You could briefly describe the mechanism as a Turing test battle royale: https://trashtalk.borg.games/

The main difference is that you have no direct way to tell opposing models to do something.