all 37 comments

[–]Ouitos 16 points (2 children)

Goodhart's Law at its best.

Hopefully, in the not-too-distant future, there will be some form of multi-company-and-university consortium for proper model evaluation, one that doesn't rely on good faith and makes it hard to identify which model is which.

[–]lostmsu 1 point (1 child)

If a model evaluation method does not require good faith, why does it need a consortium?

[–]Ouitos 0 points (0 children)

I'd say a good benchmark needs money, especially if you want it to be robust against potential cheaters.

But a good benchmark is also a good way to prove the value of your model. Having a consortium means that all competitors have agreed to abide by the same rules, which means no category is more profitable for any particular competitor.

You do rely on the good faith of competitors, especially given the possibility of cartels; that's why I think universities need to be in the equation.

I do believe many other industries have developed the same kind of truly neutral benchmark as the result of consensus between competitors and universities.

I found this read pretty interesting on the matter (you know where to look to read it for free): https://www.sciencedirect.com/science/article/abs/pii/S014829631000233X

Note that it's also possible for an independent company to perform this kind of benchmark.

That is the case, for example, with https://www.dxomark.fr/ for image quality, or to some extent with Giskard for LLMs: https://www.giskard.ai/

[–]LoganKilpatrick1 10 points (1 child)

> If I'm doing it, then someone else is also doing it. Especially companies like Google. They own the internet!

No, we don't do that; it would defeat the purpose of the arena. We 100% don't do that. Can't speak for the rest of the internet, though, clearly.

[–]cwl1907 7 points (4 children)

[–]lostmsu 6 points (2 children)

Wow, I'm not sure removing the post was such a good thing in this case. How do we know lmarena's statement is true? It is likely they had protections, but it is possible the OP was able to circumvent them.

> Python script won't be enough

Even the phrasing here implies that they don't actually know. They just assume that their protection worked, but they explicitly did not verify the original poster's claim.

But the worst part is that the post was removed at their request. Even if the OP was wrong and was shadowbanned, the topic still deserves discussion, and their original account of events matters.
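For what it's worth, the kind of vote filtering being debated here can be sketched in a few lines. This is a purely hypothetical illustration (toy data model, made-up thresholds), not lmarena's actual pipeline:

```python
from collections import Counter

def flag_suspicious_voters(votes, min_votes=20, bias_threshold=0.95):
    """Flag voter IDs whose votes are overwhelmingly one-sided.

    `votes` is a list of (voter_id, winner_model) pairs -- a toy
    schema invented for this sketch, not lmarena's real data model.
    """
    per_voter = {}
    for voter, winner in votes:
        per_voter.setdefault(voter, Counter())[winner] += 1

    flagged = set()
    for voter, counts in per_voter.items():
        total = sum(counts.values())
        # High-volume voters who almost always pick the same model
        # look like scripted vote streams rather than organic users.
        if total >= min_votes and max(counts.values()) / total >= bias_threshold:
            flagged.add(voter)
    return flagged
```

Of course, a real defense would lean on IP/browser fingerprints and rate limits rather than raw vote ratios alone, and a careful attacker can stay under any fixed threshold, which is exactly why outsiders can't verify from the outside whether the protection actually worked.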

[–]gwern 4 points (1 child)

> Even the phrasing here implies that they don't actually know. They just assume that their protection worked, but they explicitly did not verify the original poster's claim.

Yeah, what I noticed about this statement is that they don't say they blocked this attack, even though it's a very specific attack where OP gave every detail you could possibly need to ID it. They only say that the attacker 'may not notice' their votes being filtered out or "We'll release a test showing this kind of attack fails". They don't say, 'yeah, we already knew about it and had been blocking it while it was happening, and if the votes suddenly happened to go in the attacker's favor, well, it was just a sheer coincidence, maybe the attacker has good taste in LLMs and got lucky, it happens'. (Also, are some CAPTCHAs now considered amazing security...?)

I've read many organizations' responses to news about hacks, and when the response is to pound the table about how many defenses you have, insist the attack couldn't have happened, and demand the claims be deleted, that usually means the attack succeeded and they're in denial.

[–]osmarks 2 points (0 children)

It was always somewhat problematic anyway, in that the median user has wrong opinions, is quite sensitive to style, and does not really push the limits of the models.

[–]kjunhot 5 points (0 children)

this is wild

[–]H4RZ3RK4S3 7 points (6 children)

One can bet on the rankings in the LM arena?!?! How f**ked up is this world we're currently living in (and how stupid is the other side of such a bet)???

[–]derfw 1 point (5 children)

what's the problem

[–]H4RZ3RK4S3 1 point (4 children)

I think it's very weird and a sign of a very unhealthy society, where everyone is purely looking for their own gain over others (like in a zero-sum game) and everything is only about making more and more money. I understand that people try to game these systems. There is just no overall benefit from it: no economic value created, no scientific or societal progress gained. Just selfish money hoarding.

[–]osmarks 0 points (1 child)

What? Prediction markets serve a very useful purpose (predicting things).

[–]farmingvillein 2 points (0 children)

Yes, although that gets muddied when they warp incentives (like perhaps here).

(Although sometimes that's good! Owning stock creates an incentive to make the stock go up, which is generally a good thing, etc.)

[–]Pink_fagg 1 point (0 children)

That would be the same as if they put the benchmark data in the training set. We can only assume there are no bad actors.

[–]ganzzahl -5 points (11 children)

If this is true, it was an absolutely unethical thing to do, to the point that I can hardly bring myself to imagine you as anything but a self-consumed ass.

At the very latest, you should have stopped and notified the research community when your attempts were not detected.

[–]andarmanik -3 points (1 child)

You say they should notify people, but that's what this post is.

[–]ganzzahl -1 points (0 children)

They did it on two markets and made $15k in profit before informing anyone...

[–]ath3nA47 -1 points (0 children)

bro made 10k, cashed out, started a war between OAI vs Google for the votes, and single-handedly proved LM arena is not accurate on their ranking system. Absolute chad lol
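To put a rough number on how far a one-sided vote stream can move a rating: here's a toy simulation under classic sequential Elo with an assumed K-factor of 4. (lmarena actually fits a Bradley-Terry model over all battles, so treat this only as an order-of-magnitude sketch, not their real math.)

```python
def elo_update(r_a, r_b, score_a, k=4.0):
    """One sequential Elo update; score_a is 1.0 if model A wins, 0.0 if it loses."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two hypothetical models start tied; an attacker submits 300 votes
# that always favour model A.
r_a = r_b = 1200.0
for _ in range(300):
    r_a, r_b = elo_update(r_a, r_b, 1.0)

print(f"rating gap after 300 one-sided votes: {r_a - r_b:.0f}")
```

Even with a small K-factor, a few hundred scripted votes open a gap of several hundred points between two otherwise-equal models, which is far more than the margins separating the top of the leaderboard.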

[–]lostmsu -5 points (0 children)

Coincidentally, I am building an alternative to LM arena that should be much less prone to gaming like this, because it doesn't require humans in the loop.

You could briefly describe the mechanism as a Turing test battle royale: https://trashtalk.borg.games/

The main difference is that you have no direct way to tell opposing models to do something.