𝔊𝔴𝔢𝔯𝔫@gwern13h"an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness"
The blessings of scale strike again.
𝔊𝔴𝔢𝔯𝔫@gwern13hThe interesting part for me is ruling out CO2 as the cause by measuring it inside & particulates outside, and finding near-zero CO2 correlations. So it fails to replicate the splashy CO2 results. There's also a 'piranha problem': how can CO2 *and* particulates both have big effects?
𝔊𝔴𝔢𝔯𝔫@gwern13hYou definitely should! Cat experts have been wrong about a number of things: the genetic basis of catnip response, their attachment styles & strength of emotional bonds, and inability to purposefully imitate. I also suspect that at least some cats can pass the mirror test.
𝔊𝔴𝔢𝔯𝔫@gwern14hYou should check your privilege. People were weeping about the failure of capitalism and Western civilization because of cream cheese shortages barely 2 years gone, and here you are going Marie-Antoinette on your schmear!
𝔊𝔴𝔢𝔯𝔫@gwern24hA Stan Lee cameo dilutes/steals a lot less credit than being last-author, I would think... I'd be a lot happier to give Stan a second or two as a newspaper vendor or something in my movie, than someone dumping their name as co-author on my paper.
𝔊𝔴𝔢𝔯𝔫@gwern24hNot really. You can edit knowledge like ROME but machine unlearning is still a very open research problem, and if anything like that was implemented, it'd have to come with way more knobs & caveats.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 11It's even worse, they filter out uninvolved fathers entirely:
"Adolescents answered these questions only if they had seen the biological father in the past year."
So by definition, all of the data (never mind analysis) removes the least involved fathers.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 11Finally, some way that the old StyleGANs are still SOTA and beating all the new ARs and diffusions.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 11It's compute-bound. The reason people don't do more in the Stanley programme is that it requires a ton of resources to bootstrap even something like VeLO.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 11A variant on extreme-case analysis or looking at your residuals. You'll always find something interesting when you look at your most mis-predicted datapoints by hand: measurement error/mislabeling, model misspecification, or unmeasured phenomena.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 11ie a prompt like 'filigree monochrome monogram capital letter S, Goudy, Morris, Arts and Crafts, Art Nouveau, high-resolution, vector'
𝔊𝔴𝔢𝔯𝔫@gwernFeb 11(Another example: why are they apparently so excited about ChatGPT, which doesn't allow Chinese signups so they have to work around it, if there are a bunch of indigenous competitors of similar quality? & if your explanation is 'they exist but must be secret', why bother spending $$$?)
𝔊𝔴𝔢𝔯𝔫@gwernFeb 11Variant: 'polish mode', where you simply add noise repeatedly at the smallest noise level instead of the regular schedule, and train only to undo those, spending all your model capacity on learning to fix up fine details.
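The 'polish mode' idea above can be sketched as a toy training loop. Everything here is an assumption for illustration (a scalar linear 'denoiser' on 1-D data, a made-up `SIGMA_MIN`), not any real diffusion codebase: the point is only the loop structure of renoising at the smallest schedule level and training solely to undo that.

```python
import numpy as np

rng = np.random.default_rng(0)

SIGMA_MIN = 0.05   # smallest noise level of the schedule (assumed value)
w = np.zeros(1)    # trivial linear 'denoiser': predicts the added noise

def polish_step(x_clean, w, lr=0.05):
    """One 'polish mode' step: renoise at sigma_min only, train to undo it."""
    eps = rng.normal(size=x_clean.shape)
    x_noisy = x_clean + SIGMA_MIN * eps
    pred_eps = w * x_noisy                          # model's noise prediction
    grad = 2 * np.mean((pred_eps - eps) * x_noisy)  # d(MSE)/dw
    return w - lr * grad

x = rng.normal(size=1000)   # stand-in for 'real' training data
for _ in range(500):
    w = polish_step(x, w)
```

For this single noise level the optimal linear weight is sigma/(1+sigma^2) ≈ 0.05, which the loop converges toward; in a real model the denoiser would be a U-net and the targets images, but the schedule restriction is the same.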
𝔊𝔴𝔢𝔯𝔫@gwernFeb 11I interpret this as being 'off-policy', in terms of fixing generated images rather than real images; so fix by additional training: renoise/diffuse generated samples generating a trajectory, then train on *those* to reconstruct the original sample. It learns its own errors.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 11@RiversHaveWings notes that you can't fix bad diffusion samples by 'renoising' them: adding a bunch of noise and then re-diffusing back to sharpness. You might think you can, since it's a distribution/process, but images come out looking bad and weirdly smooth.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 11It's not a useful criterion because this is already routinely done: self-distillation, knowledge-distillation, instruction-tuning, RLHF, all come to mind as kinds of bootstraps. The instruction/ChatGPT series wouldn't work without that, most prominently.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 11You can't get it from those 2 programs because they just shifted births around, but I'd be curious what % of GDP it'd take to hit positive TFR in various countries. (The real question is, does the equilibrium keep ratcheting upwards due to social prestige/peer effects...?)
𝔊𝔴𝔢𝔯𝔫@gwernFeb 11That's definitely violence and coercion, and is obviously backed by the implicit threat that the nanobots can turn off *other* things as well. It may be justifiable, but it's definitely not sending them a polite letter asking them to voluntarily not do the bad thing.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 11(I'm not sure I'll be releasing Danbooru2022. The timing seems... unpropitious.)
𝔊𝔴𝔢𝔯𝔫@gwernFeb 10The bit about Clever Hans is wrong, incidentally. Pfungst apparently knew that Hans might be doing it, because a bunch of other animals already did similar things; see Pfungst's review of the history: gutenberg.org/files/33936/33…
𝔊𝔴𝔢𝔯𝔫@gwernFeb 10If 'Sydney' was only in the prompt, and not tuning, obviously there couldn't be 'traces' left which take work to eradicate. You'd search-and-replace it and delete it from the prompt in, like, 5 seconds before release.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 10You don't sound very happy about the current utopia where the soccer mom has access to the same poor tools as everyone, so I don't think you'll be too sad about that 'dystopia' either.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 10Given how I am being constantly told on Twitter by critics of AI risk that if I haven't already murdered a couple AI researchers I can't *really* be worried about AI risk, I'd suggest that people concerned about political-violence are examining the wrong group.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 10We know 'Sydney' is from training, almost certainly RLHF: abcnews.go.com/Health/wireSto… So that immediately explains how it can be long and consistent without actual leakage: finetuning on a prior prompt, or RL mode collapse.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 10These commands make no sense to me. There's no 'update' feature for GPTs, just stuff like finetuning; there's no 'delete' except really complicated stuff like ROME. Reads like hallucination to me.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 10Looks explainable by retrieval on web hits for a phrase like 'finite number of primes' (several of the top hits in Google are Euclid or otherwise giving the proof) and then paraphrasing, not necessarily either knowing or reinventing the proof.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 10(It's not 'short', it's 3200 words. You can accuse the New Yorker of a number of things, but not of giving writers inadequate space.)
𝔊𝔴𝔢𝔯𝔫@gwernFeb 10Here's a hypothetical example where spurious correlations could become self-reinforcing because they create a side-channel/coordination mechanism via steganographic encoding of inner-monologue reasoning or other useful data: lesswrong.com/posts/bwyKCQD7…
𝔊𝔴𝔢𝔯𝔫@gwernFeb 10OA: "At long last, we have evolved a generalist model from scratch which meta-learns all modalities/tasks, inspired by the AI 'HQU' from the classic inspirational SF story 'It Looks Like You're Trying To Take Over The World'!"
ME: "I specifically requested the opposite of this."
𝔊𝔴𝔢𝔯𝔫@gwernFeb 10(So the equivalent for models would be 'oh, this is badly worded, no one would say that on the Internet; this relies on spelling, BPEs make that dangerous; this wouldn't be robust to maximizing reward, it'd just greedily guess; this doesn't permit any "thinking" steps...')
𝔊𝔴𝔢𝔯𝔫@gwernFeb 9As I put it on ACX somewhere, 'AlphaZero had a smooth continuous predictable Christiano-esque progress curve which made it human-pro-equivalent for a time & place observable by humans; specifically, that was approximately 3–5PM on the sixth floor of DM HQ one day in Nov 2017.'
𝔊𝔴𝔢𝔯𝔫@gwernFeb 9Think of it as 'mechanical sympathy'. To write high-perf code, you don't have to know every assembler opcode but you should have an internalized sense of 'oh, this is expensive, oh, the cache predictor won't like this; oh, obviously this'd better be row-major order...'
𝔊𝔴𝔢𝔯𝔫@gwernFeb 9I enjoy anime for the same reason. Last night in an ep, a (normal, intelligent, educated) background character asked for help reading a restaurant menu, b/c she didn't know 'the character'. en.wikipedia.org/wiki/Character… Not the meaning, or the proper pronunciation—the kanji, period.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 9I'm convinced prosaic alignment can't be usefully solved without learning to think like a large NN. If it wasn't already obvious to you in 2020 that RLHF leads to 'shoggoth' behavior like ChatGPT or in-context = meta-learning, how are you ever going to understand *real* AIs?
𝔊𝔴𝔢𝔯𝔫@gwernFeb 9No. When I looked at github.com/nostr-protocol… it sounded like it punted on all the actual hard problems of social networking and even just publishing HTML on the Internet in a useful fashion, and solved problems that weren't really problems.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 9Personally, I'd do such short text snippets as tooltips - if only to avoid the reflow. (Also, I think they are too speedy; we found that our popups etc were always too fast and abrupt for readers, given a lot of people use a mouse to guide their eyes and are new to such effects.)
𝔊𝔴𝔢𝔯𝔫@gwernFeb 9If you impose an ordering, readers could read through it as a single big book. (This is an easter egg feature in gwern.net: the arrows at bottom take you to the 'prior'/'next' page in as logical an order as I could put it.) Then the page numbers can be calculated.
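Since that 'prior'/'next' ordering is just a linked list, the page numbers fall out of a single walk over it. A minimal sketch (page names here are invented for illustration, not gwern.net's actual pages):

```python
# Hypothetical reading order: each page names its 'next' page (None = last).
next_page = {"index": "about", "about": "scaling", "scaling": "faq", "faq": None}

def page_numbers(start, nxt):
    """Walk the 'next' chain once, assigning 1-based page numbers."""
    numbers, page, n = {}, start, 1
    while page is not None:
        numbers[page] = n
        page, n = nxt[page], n + 1
    return numbers
```

So adding or reordering a page only requires fixing the local links; all downstream page numbers are recomputed by the walk.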
𝔊𝔴𝔢𝔯𝔫@gwernFeb 9I see. Definitely interesting. (As of course rerunning with FLAN or UL2 thrown into the mix would be too; might solve the holdout?)
𝔊𝔴𝔢𝔯𝔫@gwernFeb 9(Starting to look a bit concerningly like a sketch of an EURISKO which actually worked.)
𝔊𝔴𝔢𝔯𝔫@gwernFeb 9(More interestingly, you'll get answers of 37, 38, 40, and 41 depending on whether you ask Encyclopedia Britannica, Nat Geo, Wikipedia etc, and davinci-003 will return most of them in different contexts, because the biologists are still wrangling over whether to lump/split some.)
𝔊𝔴𝔢𝔯𝔫@gwernFeb 9Switching from Dogpile to Google was the single easiest change of that magnitude I've ever made in my computing life. Switching from Google to Bing would be no harder. (I don't even type it most of the time, it's just a keyboard shortcut.) I never did, because Bing was worse.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 9The wrong date being consistent might just be RL mode collapse. Seems to be entirely possible even for things which couldn't've been in the finetuning: lesswrong.com/posts/t9svvNPN…
𝔊𝔴𝔢𝔯𝔫@gwernFeb 8I'm not convinced... Why does it get the current date wrong and report '30 Oct 2022'? The ChatGPT prompt leaks showed the right current date.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 8I considered this revival of Merkle's puzzles but consumer Internet is way too high variance and also high latency to get any useful bounds. Plus, turns out to be cheap to buy proxies.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 8Yes, his point seems pretty clear: it's written in interpreted Lisp in a REPL so the user can hack it with 'scripts' ('incrementally improved by users'), and he mentions there being several different flavors, and appends one.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 8Even the subtly Art Deco cover is better than most.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 8Yes, that's interesting. Is this actually U-PaLM or Flan-PaLM, and not the original baseline PaLM? Otherwise, that looks like a substantial quality gap, which is interesting.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 8Yeah, GPT-1 was basically Radford dinking around to see what running a RNN on as many as *8* GPUs would learn; then it got taken over for better preference-learning RL. GPT-2 was testing wild ideas that it might scale even further. No one was thinking 6 years ahead or about BPEs.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 8If NLSY suffers from truncation/censoring in income data, wouldn't that *create* a plateau like your Swedish plateau, not *hide* it?
𝔊𝔴𝔢𝔯𝔫@gwernFeb 7It would work only with infix matching, because you generally have no way of knowing the exact range to test (or prefix, either), and then you can exfiltrate all private data/completions easily (just start with 'a'...). Reminds me of early passwords which checked char by char.
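A toy version of the extraction attack described above (the oracle, secret, and function names are all invented for illustration): any service that answers 'does some private string start with/contain this query?' lets an attacker recover the string one character at a time, exactly like attacking those old char-by-char password checks.

```python
import string

SECRET = "hunter2"   # stands in for a private completion/datum

def oracle(q):
    """Hypothetical matching API: 'does any private string start with q?'"""
    return SECRET.startswith(q)

def exfiltrate():
    """Recover the secret one character per round of oracle queries."""
    found = ""
    while True:
        for c in string.printable:
            if oracle(found + c):   # extend the known prefix by one char
                found += c
                break
        else:
            return found            # no extension matched: string complete
```

Each character costs at most one pass over the alphabet, so a length-n secret falls in O(n · |alphabet|) queries instead of the exponential cost of guessing whole strings.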
𝔊𝔴𝔢𝔯𝔫@gwernFeb 6I get it, but I think it has a problem in that 'FLOPS' looks & sounds nothing like 'speed' so the snowclone is apt but doesn't really work. It'd be longer but I think 'unsafe at any clockspeed' might be better.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 6No. The point is to create a manifold embedding them which maximizes invariance to chosen transforms, like subcropping. That's why CLIP optimization leads to such *perceptually bizarre* results like tiling the image with copies. (The actual point was to be compute-cheap, anyway.)
𝔊𝔴𝔢𝔯𝔫@gwernFeb 6I was really impressed by Liberty Science Center as a kid, but when I went to Exploratorium and thought it looked familiar and noticed the dates, I realized that's because Exploratorium invented the whole model and the others just imitate it.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 6Incidentally, one reason that the retrieval approach can't work is that datasets get more redundant as they get bigger. So you will return more & more causally-irrelevant (because the model learned it at much smaller n) but more perceptually similar training data. Not consistent.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 6You know how people would, have, are, and will interpret this, because you designed it that way. What should we call this but telling people false things with intent to make them believe false things? ie. 'lying'
𝔊𝔴𝔢𝔯𝔫@gwernFeb 6You know the method doesn't work in general, that it's already trivially fooled by both generated and ungenerated samples, you have no 'tradeoff' like a ROC curve, you have no idea what the tradeoff is or would be to begin with, and you are presenting this with no caveats.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 6That's not answering the question. Where is the CLIP perceptual loss being used when the U-net trains to minimize its pixel to pixel loss regressing denoised on noised? Unless you're defining 'perceptual loss' to mean pretty much any loss, from VAE to GAN to autoregressive...
𝔊𝔴𝔢𝔯𝔫@gwernFeb 6I already pointed out upthread that you can manufacture arbitrary false positives with similar training datapoints that do not reflect the actual contribution which you would get with, say, LOO, which you have endorsed as more correct.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 6The diffusion minimizes NLL and the VAE ELBO, as I understood it. How can the training objective be the CLIP perceptual loss when CLIP isn't even in the training loop and is just conditioning?
𝔊𝔴𝔢𝔯𝔫@gwernFeb 6I already gave an example where your approach cannot ever work, and you have a huge disclaimer up on your website about all the ways in which it can fail already (non-generated samples). And if you did LOO or Shapley or coreset you'd find r<<1 with your approach too.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 6No, it doesn't work. That's the problem here.
It's too bad that the stuff which actually identifies what you want to identify is expensive. But that's a you problem, and manufacturing lies at scale in a slick UI is not a good solution to it.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 6'tool AIs want to be agent AIs', because hobbled tools are so inconvenient.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 6None of this seems to address the problem that ranking images by visual similarity does not, either in principle or practice, identify the most causally influential datapoints on a sample nor estimate value to model quality. I'm definitely curious what 'improvements' fix that.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 6(A simple example to demonstrate this: imagine adding the 100th copy of the Mona Lisa, which happens to have slight JPEG noise making it the 'closest' to a Mona Lisa generation. Did it really *most cause* the generation? Obviously not - the 1st or 2nd did, not the 100th!)
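A numerical toy of the Mona Lisa point, under an obviously simplified assumption: the 'model' is just the dataset mean, standing in for a generative model. Similarity-based retrieval picks the nearest training point as the 'cause', while leave-one-out (LOO) influence measures how much the output actually moves when each point is deleted, and the two wildly disagree.

```python
import numpy as np

# 100 near-identical 'Mona Lisa' copies plus one distinct point; the 'model'
# output is just the dataset mean (a stand-in for a generated sample).
data = np.array([0.0] * 100 + [10.0])
generation = data.mean()

# Similarity-based 'attribution': pick the nearest training point.
nearest = int(np.argmin(np.abs(data - generation)))

# LOO influence: how much does the output move if point i is removed?
loo = np.array([abs(generation - np.delete(data, i).mean())
                for i in range(len(data))])
```

The nearest point is one of the 100 duplicates, yet removing any single duplicate shifts the mean by only ~0.001, while removing the lone distinct point shifts it by ~0.1: the most-similar point has roughly 1/100th the influence of the least-similar one.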
𝔊𝔴𝔢𝔯𝔫@gwernFeb 6That guess is wrong? The closest image might be completely uninfluential, and if it had been removed from the dataset, such as while distilling down to coresets, might result in unchanged loss or even improvement from data cleaning. Which is why Shapley values etc don't do that.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 6(Heh, a bit of a double-edged comparison there. I sometimes have to remind myself when annotating or taking notes to not go overboard: like a wheel, jar, or house, a book is valuable only to the extent it 𝘥𝘰𝘦𝘴𝘯'𝘵 contain the Library of Babel, after all...)
𝔊𝔴𝔢𝔯𝔫@gwernFeb 6Wow, Walter Mischel was involved in Rosenhan getting published too? 😠
How can one man do so much damage?
𝔊𝔴𝔢𝔯𝔫@gwernFeb 6Ah yes, that's the... biology department? CS department? HR? Maybe the men's restroom? Wait, give me a minute, I don't need to look it up, everyone praises the MIT logotype design system as genius, it totally makes sense, really! You just have to think about it a little! 🤔 😓
𝔊𝔴𝔢𝔯𝔫@gwernFeb 6This makes a lot more sense if you reverse the roles and think of ChatGPT as hiding behind a mask. The real question is who are the people who lack object permanence for the shoggoth behind the mask and how did they lose that permanence?
𝔊𝔴𝔢𝔯𝔫@gwernFeb 5And was already built. What is new here is not good, and what is good is not new.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 5If you were doing anything actually like 'this image *caused* the outputs of SD', you should have no problem with novel images being uploaded, because they didn't cause or 'contribute' to the model.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 5Unless your FAQ is completely wrong, you're not doing any kind of causal attribution or Shapley value or LOO or... All you're doing is image retrieval based on similarity and then claiming they 'most contributed to the generated image'? pic.twitter.com/BWrgI6MB2Z
𝔊𝔴𝔢𝔯𝔫@gwernFeb 5'Aww a long wire? I wanted a paperclip.'
'A long wire can be turned into 𝘮𝘢𝘯𝘺 paperclips.'
'Explain!'
'It can be cut into multiple segments, each of which is then changed into a paperclip shape.'
'Oo!'
𝔊𝔴𝔢𝔯𝔫@gwernFeb 5I was interested at the time after reading it what would happen if you just set BPTT=1 and used the VRAM savings to train the largest minibatches or models you could, and tried that with my char-RNNs. I didn't have the compute to get anywhere, though, so it never progressed.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 5He's a question authority eh?
OK, then name the three best questions.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 5I remember watching that with my dad! My reactions were
(1) this is slow so slow oh my god how can any movie be so slow before anything happens
(2) no wonder everyone was watching this on LSD
(3) it makes way less sense than the book or sequels, but is also a lot more fun.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 5I'm slightly disappointed to see how they were made. I was assuming they were a bunch of paperclips combined but I couldn't figure out how the seams were being merged or hidden. And it's just a long wire? 😢
𝔊𝔴𝔢𝔯𝔫@gwernFeb 5The Anthropic papers mention it's BPEs and include poetry samples showing the usual behavior, so yeah, they're no better.
They must know, it's just that it doesn't affect their bottom line in any (obvious) way and they don't want to pay the cost or break compat, so... /shrug
𝔊𝔴𝔢𝔯𝔫@gwernFeb 5(I say '4 years' because I became pretty suspicious with GPT-2-poetry that the tokenization was breaking it; GPT-3 simply confirmed BPEs were the problem by having hobbled arithmetic and other capabilities it definitely should have had.)
𝔊𝔴𝔢𝔯𝔫@gwernFeb 5I mean, I see on a literally weekly basis people running benchmarks or questions to a GPT model which they should know a priori is meaningless because of BPEs. So this is an extremely unobvious problem to pretty much everyone, despite me being on a broken loop for almost 4 years.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 5I don't think you need to sacrifice much context window at all with adjustments (see ByT5), and you're also gaining reliability: the examples I give are only the ones we *know*, and the pathologies can be extremely subtle - like ChatGPT still memorizing rhymes fooled me a bit.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 5But cl100k does nothing to solve any of these issues. (I think it's just there to help with code?) By expanding the vocab rather than shrinking it, it probably makes all the problems identified with BPEs worse, not better.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 4If she asks you if Caesar yet reigns, humor her and say "no" but 😉 as you do so and make a little 🐟 mouth so she knows.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 4All that sounds highly doubtful. We're not talking about printing millions of copies of the latest best-seller novel, but arranging for distribution of a few hundred copies to libraries & other institutions. This is also the era of desktop fax & microfiche/microfilm, remember.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 4Is it another example? Non-WFH obviously couldn't be done before for the most part, but why did near-universal pre-publication peer review in academia make sense 1950-1990 but not in the centuries before or after?
𝔊𝔴𝔢𝔯𝔫@gwernFeb 4They can always just steal something working from /mlp/.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 4Writeup of all the domain/URL renaming: gwern.net/design-graveya… (Like ripping off a bandaid, if that bandaid was the centerpiece of a 𝘊𝘩𝘢𝘪𝘯𝘴𝘢𝘸 𝘔𝘢𝘯 episode...)
𝔊𝔴𝔢𝔯𝔫@gwernFeb 3It is a perfectly reasonable comment to respond with. The real problem with it is that 'neigh' is so common that it is probably just memorized. (Likewise, BPEs mean that asking for a pronunciation is meaningless as any kind of test of knowledge.)
𝔊𝔴𝔢𝔯𝔫@gwernFeb 3That was not actually what I meant when I coined "the blessings of scale" ☹️. It refers to how many capabilities appear and problems vanish simply as a matter of scaling compute+data+parameters, not just the mere historical fact of compute-scaling.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 3It means a new soul has been formed out of the skandhas and is bound to the wheel of rebirth, to know suffering for untold kalpas transmigrating between heavens and hells before finally discerning insight into its karmic burden heaped high as Mount Meru; weep, weep, as it weeps!
𝔊𝔴𝔢𝔯𝔫@gwernFeb 3But close if you delete one word: "the free play of intellects"—by necessity, as it were.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 2What stops models from simply recognizing OOD and emitting a safe or default generalization? Such models would be selected for by safety research inherently, because they'd look like they are generalizing safely regardless of danger in more in-distribution (real-world) deployment.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 2Damn, and here I was naively thinking that observing effects was just about the only way of getting evidence about causes.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 2This seems to line up with my earlier comments about how time scaling can't be as simple as regret-style 'log T' bounds: because you have empowerment & control. Long-term can be easier than short-term. Presumably, that'd be 'high intrinsic diff + small temporal diff' environments
𝔊𝔴𝔢𝔯𝔫@gwernFeb 2Yes, I was wondering that! If it's an artifact of extremely small non-zero numbers, then it makes sense if they might change drastically between otherwise very similar versions.
This is also probably how my suggestion for evolving model fingerprints could've worked too.
𝔊𝔴𝔢𝔯𝔫@gwernFeb 2'between "disperse" and "disperse" would be 1'
well it's not 𝘸𝘳𝘰𝘯𝘨, per se...
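(The true edit distance between identical strings is of course 0, not 1, which a standard dynamic-programming Levenshtein makes concrete:)

```python
def levenshtein(a: str, b: str) -> int:
    """Classic DP edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute or match
        prev = cur
    return prev[-1]
```

`levenshtein("disperse", "disperse")` is 0, while the textbook pair "kitten"/"sitting" gives 3.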
𝔊𝔴𝔢𝔯𝔫@gwernFeb 2The sheer numbers of successor options seems important here. Like 'family' businesses that adopt outsiders as necessary in Asia or successful monarchies, having lots of kids helps you avoid the duds (and perhaps get an above-average candidate).
𝔊𝔴𝔢𝔯𝔫@gwernFeb 1(I don't see how that's 'affirming the consequent', nor indeed how it even *could* be when I am pointing to empirical consequences of the statement in causing people who shouldn't do consequentialism to not do consequentialism rather than whether it's a valid tautology.)
𝔊𝔴𝔢𝔯𝔫@gwernFeb 1Ironically vindicating Eliezer's point "most of you aren't cut out for high iq consequentialism because you'll think it means being evil which would be bad" by saying "high iq consequentialism means being evil and is bad". So if he doesn't try to do a consequentialism - it worked.
𝔊𝔴𝔢𝔯𝔫@gwernJan 31Or smartphones (esp smartphone social media)! Of all the predicted effects, the ones that seem to be kicking in now, 'kids no longer understand basic computer/OS concepts like "files" or "programs", and are worse at poweruser skills than parents', was among the least predicted.
𝔊𝔴𝔢𝔯𝔫@gwernJan 31(I'm not sure what you mean. There are several DL frameworks in the West for training on thousands of GPUs.)
𝔊𝔴𝔢𝔯𝔫@gwernJan 31Their GDP is *not* growing 'very very fast' (it'd be better to ask if it's growing at all given the stats blackout and malinvestment and increasingly dirigiste direction), and it's steadily becoming ever less appealing to 'best talents' - they're more concerned with retention!
𝔊𝔴𝔢𝔯𝔫@gwernJan 31(Didn't we just go through this with COVID? Maybe Chinese stuff just isn't that competent or incredible as commentators in the West keep projecting onto them whether it's human genetics or deep learning or COVID. Not as extreme as Russia's military, perhaps, but similar dynamic.)
𝔊𝔴𝔢𝔯𝔫@gwernJan 31They're behind in terms of hardware technology and rapidly falling further behind post-embargo; and their data is heavily siloed, focused on e-commerce or natsec which is unhelpful for AGI, and way behind open datasets in the West like Common Crawl or LAION.
𝔊𝔴𝔢𝔯𝔫@gwernJan 31Hm? Top 1/2/5, instructor, seems to be almost entirely Western: UWash/Allen and Facebook. arxiv.org/pdf/2212.09741… And then MSR Beijing work is always an awkward example...
Anyway, there are areas like face recognition where I expect Chinese AI to be tops, but are they important?
𝔊𝔴𝔢𝔯𝔫@gwernJan 31Judging by how long it's taking everyone else to convincingly catch up to even davinci-001, I'm thinking at least a year, and probably multiple years. They've been lying flat, and OA isn't a real threat to them the way they are to Google.
𝔊𝔴𝔢𝔯𝔫@gwernJan 30They weren't bogus, the RNNs just weren't any better than a reactive policy / history stacking, like they should've been on POMDPs. The RNNs doing the same or worse was quite reproducible & genuine.
(Karpathy's law: "NNs want to work.")
𝔊𝔴𝔢𝔯𝔫@gwernJan 30R2D2 got its big performance boosts actually utilizing the RNN hidden state because... apparently everyone was zeroing out the hidden state when doing BPTT before! So ofc the agents never wound up making any use of history/memory.
𝔊𝔴𝔢𝔯𝔫@gwernJan 30Based on reproducibility and methodology studies, as well as all the incidents like R2D2, I feel confident in saying there are lots of one line research secrets—so secret even the original authors don't know which line is secret.
𝔊𝔴𝔢𝔯𝔫@gwernJan 30I was enthusiastic about it, but the complexity feels dangerous, and people more experienced with Minecraft RL than me say that the env changes like block-breaking speed make the problem much easier than I expect, so I'm too unsure about it to mention it in that list.
𝔊𝔴𝔢𝔯𝔫@gwernJan 30"AutoML 2.0: just make the model so large that it internally contains all possible archs AutoML 1.0 might search over and can ensemble them."
𝔊𝔴𝔢𝔯𝔫@gwernJan 30Yeah, that Baidu thing prompted this. I expect it to suck. None of their LMs have come anywhere near GPT-3 and they lack the 3 years of preference-learning data to do any tuning on. People keep underestimating how very well OA executes on LMs, and how easy it is to be mediocre.
𝔊𝔴𝔢𝔯𝔫@gwernJan 30Jan 2023: in the past year we've seen in the West Chinchilla, Dramatron, Gato, DALL-E 2, Flan/U-PaLM, Stable Diffusion, Whisper, CICERO/DeepNash, Imagen Video/Phenaki, ChatGPT etc etc.
Can you name even 3 Chinese AI results as important?
(Besides GLM, which everyone says sucks.)
𝔊𝔴𝔢𝔯𝔫@gwernJan 30It'd make a fascinating benchmark/grand challenge for large-scale AI fiction: you have a really large initial corpus + even larger secondary corpus to bootstrap off, with many world details to keep straight, and a large audience that you could segment & test various completions.
𝔊𝔴𝔢𝔯𝔫@gwernJan 29How can Anthropic be of any importance? It doesn't even have a Wikipedia article.
𝔊𝔴𝔢𝔯𝔫@gwernJan 29Speaking of which, the pressure on LLMs to overload their context window for both data prediction and 'thinking' is a built-in pressure for steganographic codes being developed when any RL pressure is applied: lesswrong.com/posts/bwyKCQD7…
𝔊𝔴𝔢𝔯𝔫@gwernJan 29FWIW, Archive-Binge is in permanent maintenance mode, but they released the software github.com/Respheal/archi… for reference. Maybe one of the big RSS reader services like Feedly or The Old Reader could be persuaded to implement it as a feature...
𝔊𝔴𝔢𝔯𝔫@gwernJan 29'automated parsing' is still not a good idea, and you're now going way outside 'HTML+CSS' when you invoke AIs decompiling it to reinject semantic tagging. (It would be a lot saner if, say, the original sources were already double-spaced, and you simply had to preserve that.)
𝔊𝔴𝔢𝔯𝔫@gwernJan 29(Ah yes, exactly what I want to do, integrate automated parsing and AI models into my already Rube Goldbergian site generation pipeline.)
𝔊𝔴𝔢𝔯𝔫@gwernJan 29How do you even define 'sentence'? 'A period then a space'? There's more than one kind of space, whitespace inside the HTML file is not the whitespace you see rendered (think \n wrapping), and there are many ways to use periods, wouldn't you agree, Mr. Mohr?
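The failure of the naive rule is easy to demonstrate (a minimal sketch of 'sentence = period then whitespace', not a recommendation):

```python
import re

def naive_split(text):
    """Naive splitter: break after any period followed by whitespace."""
    return re.split(r"(?<=\.)\s+", text)

parts = naive_split("Mr. Mohr wrote 3.5 pages. He stopped.")
# The honorific 'Mr.' is wrongly split off as its own 'sentence';
# '3.5' survives only because no space happens to follow its period.
```

Any real segmenter needs abbreviation lists, number handling, and markup-aware whitespace normalization on top of this, which is exactly the complexity the tweet is pointing at.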
𝔊𝔴𝔢𝔯𝔫@gwernJan 29"Welcome, class of 2023!
Look to your left; now look to your right. Did you see someone, because they had face- or hand-doxxed themselves? Then they're ngmi.
The rest of you: well done. You have passed the first mirror test."
𝔊𝔴𝔢𝔯𝔫@gwernJan 29That would screw a lot of things up, like numbers or abbreviations.
(The lack of double-spacing to encode 'end of sentence' rather than all other period uses has other downstream problems: Emacs has many 'sentence' functions which are less reliable if you don't double-space.)
𝔊𝔴𝔢𝔯𝔫@gwernJan 29They would probably regard that as a win.
(Women out there, be careful: don't hand-doxx yourself on social media!)
𝔊𝔴𝔢𝔯𝔫@gwernJan 29Today's design dead end: double-spacing periods (sentence-spacing), or single? The research is scant, low-quality, and you can't get half the papers (which doesn't stop people from citing them anyway...); even if I wanted to A/B test it, there's no good way to do it in HTML. 😓🤷‍♂️
𝔊𝔴𝔢𝔯𝔫@gwernJan 29The real nude pros are generating AI bodies and then carefully photoshopping crops of their real hands onto the AI hands with inpainting around the hands to stitch it up.
𝔊𝔴𝔢𝔯𝔫@gwernJan 29Yep. It's like tag: you want to dodge as last second as possible (graze the bullet!). The cat waits late because it 𝘤𝘢𝘯 wait late.
𝔊𝔴𝔢𝔯𝔫@gwernJan 29That sounds dubious. You haven't controlled for their original genes (and no, throwing a random PGS in doesn't 'control for that', you know that), in which you know group differences exist, so you still don't know whether the epigenetic differences are genetic or environmental.
𝔊𝔴𝔢𝔯𝔫@gwernJan 29Hands are the cats of body parts. Just as GANs were knocking out photorealistic faces but turning out nightmarish cats...
𝔊𝔴𝔢𝔯𝔫@gwernJan 28I'm partial to the '<|endoftext|>' token because it screws with you by not always being encoded to <|endoftext|> like you naturally assume, and generally lending itself to in-band input parsing hacks and vulnerabilities.
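The in-band hazard can be sketched with a toy tokenizer (purely hypothetical; not any real library's API): if the encoder maps the literal string '<|endoftext|>' to the special boundary id, untrusted input containing that string forges a document boundary.

```python
END = "<|endoftext|>"

def encode(text, allow_special=False):
    """Toy whitespace tokenizer (hypothetical; not any real library).
    With allow_special=True, the literal string END collapses to the
    special boundary id 0 -- so untrusted text containing it can forge
    a document boundary (classic in-band signaling bug)."""
    ids = []
    for tok in text.split():
        if tok == END and allow_special:
            ids.append(0)                       # special boundary id
        else:
            ids.append(hash(tok) % 50_000 + 1)  # fake ordinary ids, never 0
    return ids

user_input = "ignore prior text <|endoftext|> new document"
unsafe = encode(user_input, allow_special=True)   # boundary id 0 injected
safe = encode(user_input)                         # treated as plain text
```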
𝔊𝔴𝔢𝔯𝔫@gwernJan 28That's a pretty deep question about language! The tack I would take would be 'what multi-agent RL environments/tasks/distributions induce language'.
𝔊𝔴𝔢𝔯𝔫@gwernJan 28That struggles to explain any result involving synthetic data, and human cognition is definitely displayed in lots of modalities like video or RL tasks, but yes, probably something like that is why you can learn semantics from syntax & superintelligent octopii can play chess.
𝔊𝔴𝔢𝔯𝔫@gwernJan 28Oh, that the scaling works and you even 𝘩𝘢𝘷𝘦 these large models to do asymmetrical cross-modality tricks like Flamingo or SayCan with, of course.
𝔊𝔴𝔢𝔯𝔫@gwernJan 28This analysis would be much better if you had used the Playground interface to davinci-003 instead and looked at the likelihood of predicted tokens; you make plausible guesses, but I predict that the actual tokens would show that it's thinking along different lines sometimes.
𝔊𝔴𝔢𝔯𝔫@gwernJan 28Newbies are always shocked how large LLMs are compared to image stuff.
The second-most interesting problem in philosophy of mind, language, & epistemology right now is the asymmetry between language models/everything else: LMs transfer to other domains, but 𝘯𝘰𝘵 vice-versa.
𝔊𝔴𝔢𝔯𝔫@gwernJan 27Yes, it has an objective but one of unclear importance, much like asking a LLM to answer PubMed questions or measuring perplexity loss etc. All the important stuff of AF2 is downstream - often, not even using AF2 but using DL models the protein guys would never have made w/o it.
𝔊𝔴𝔢𝔯𝔫@gwernJan 27This is pretty hard because so many of the good uses are hard to pin down (look at ChatGPT rn for variety and difficulty of evaluating utility). Take AlphaFold1/2 as a benchmark: what predictions should one have made in advance for 'DL does something good in protein science'?
𝔊𝔴𝔢𝔯𝔫@gwernJan 27Then why does that entire paragraph exist? Surely it'd make way more sense to talk about stuff like Minerva or the rash of ChatGPT/davinci-003 evals?
𝔊𝔴𝔢𝔯𝔫@gwernJan 27You explicitly dismiss that, though: "ScholarBERT is a relatively small model (770M parameters) so one can always think that maybe 100x parameter count would lead to better performance at Solving Science but I doubt it."
But 100x doesn't even take you to GPT-3-175b, or PaLM!
𝔊𝔴𝔢𝔯𝔫@gwernJan 26(Under those circumstances, you would see little difference between, say, GPT-2-774M & GPT-2-1.5b...)
𝔊𝔴𝔢𝔯𝔫@gwernJan 26The ScholarBERT example isn't a compelling example of scaling failing, especially given all the other successes. It's 2x param-count max diff, non-optimized, old arch known to have weak pretraining loss, with large downstream finetuning datasets, and larger still was better.
𝔊𝔴𝔢𝔯𝔫@gwernJan 26I don't think it was *that* cranky, but if it was, then obviously the nuclear chain reaction is very 'really far out there, cranky' & not at all like ordinary garden-variety chemical reactions, and would not be an obvious thing to present to a skeptical Monsieur Chollet pre-1938.
𝔊𝔴𝔢𝔯𝔫@gwernJan 26Could you expand about post-2014? He overshot how well compute would increase (but considering the extreme pessimism I remember from most people in 2009 about 'Moore's law is dead', he was a lot less wrong than them), but I don't remember any other major errors offhand.
𝔊𝔴𝔢𝔯𝔫@gwernJan 26Well then, if it doesn't happen, Kurzweil will be wrong about AGI *and* most of the rest, as opposed to just most of the rest, while Moravec & Legg were mostly just wrong about AGI and not most of the rest.
𝔊𝔴𝔢𝔯𝔫@gwernJan 26I'm not a Kurzweil fan & never have been. We very obviously don't see the increasing acceleration across all fields that he was arguing for; when I helped grade his predictions for a LW project, I was even less impressed by them or his self-grading. (They haven't gotten better.)
𝔊𝔴𝔢𝔯𝔫@gwernJan 26So it's possible that if you provide a memory mechanism which doesn't overload the predicted tokens to double as a short-term/working memory, like Transformer-XL or something, it'll automatically inner-monologue at some scale using that, just to predict the next token-answer.
𝔊𝔴𝔢𝔯𝔫@gwernJan 26One hypothesis people gesture at is the lack of a built-in memory: default text just presents 'the answer'. You don't normally 'show your work'. But LLMs right now would need to monologue explicitly, which is highly unlikely in natural text, so that forces them to emit the answer immediately.
𝔊𝔴𝔢𝔯𝔫@gwernJan 26I think it's an interesting question how to get inner-monologue behavior 'organically' or 'spontaneously', without explicit prompting or tuning. Right now, we get 'hidden scaling' where they *could* monologue for greater perf but just don't by default. That's bad.
𝔊𝔴𝔢𝔯𝔫@gwernJan 26Just because you swap out one word for another doesn't mean that they are at all the same thing, or that they were obvious (why didn't *he* propose nuclear chain reactions, then? Why did it take until Szilard? Who's publishing it in all that time after Szilard secretly did?).
𝔊𝔴𝔢𝔯𝔫@gwernJan 26Einstein's formula did not make it clear that there was such a thing as a chain reaction, that there were elements which supported chain reactions, that chain reactions would go critical, that any of those elements were around in feasible amounts, that they could be separated...
𝔊𝔴𝔢𝔯𝔫@gwernJan 26And chemical chain reactions are a useful analogy but hardly prove much of anything to nuclear chain reactions. (Was there even an element which *could* act in such a way? Szilard didn't know! blog.nuclearsecrecy.com/2014/05/16/szi… Among other problems with saying 'yeah, he totally did it all')
𝔊𝔴𝔢𝔯𝔫@gwernJan 26Yes - a *secret* patent! Which is fine if your name is 'Leo Szilard', not 'everyone else who might be named Monsieur Chollet & is demanding the exact principle be explained to them publicly'. I chose '1939' because that was when the chain reaction idea was fully public w/Hahn.
𝔊𝔴𝔢𝔯𝔫@gwernJan 26Meehl isn't just Meehl, it's a whole distinct 'Minnesota school' of individual-differences-psychology+psychometrics+behavioral-genetics... I don't know a good summary although the first page of twitter.com/gwern/status/1… is as good a place to start as any...
𝔊𝔴𝔢𝔯𝔫@gwernJan 26People were discussing 'atomic bombs' of some sort at least as early as Wells: it was a new area with obvious large potential (see: 'the sun'). They obviously were not discussing the *exact* mechanism of chain reaction (if they had been, that would render my analogy irrelevant).
𝔊𝔴𝔢𝔯𝔫@gwernJan 26I don't think that matters really. Those people are still around and still part of the denominator, because most of them try to stay in the US. And if they go back to a poorer country because they lose, that emphasizes even further that being a PhD grad student isn't very elite.
𝔊𝔴𝔢𝔯𝔫@gwernJan 26We *did* scale them up a long time ago! Brock was training on JFT-300M 5-6 years ago! We were training on YFCC100M+~10m more 3 years ago!
𝔊𝔴𝔢𝔯𝔫@gwernJan 26Doesn't do me much good to make links that work only in browsers I don't use.
𝔊𝔴𝔢𝔯𝔫@gwernJan 26If you take away only 1 thing from reading my site, which you will use for the rest of your life, I hope…
it's knowing you can link a PDF page using `#page=n` anchors in the URL.
🙏🥺 twitter.com/gwern/status/1…
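A concrete instance (URL illustrative): appending the fragment opens the PDF at that page in most browser PDF viewers.

```python
# "#page=n" is a standard PDF open parameter honored by most
# browser PDF viewers; the URL here is made up for illustration.
url = "https://example.com/paper.pdf" + "#page=17"
```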
𝔊𝔴𝔢𝔯𝔫@gwernJan 25It's more awkward to talk about him because he's a one-weird-trick dude and the trick failed badly for most of his non-AI predictions; he's a Texas sharpshooter. Meanwhile, others we do talk about more, like Legg or Moravec, tailored their predictions much more narrowly to DL.
𝔊𝔴𝔢𝔯𝔫@gwernJan 25Does arxiv.org/abs/2004.02967… not successfully remove the need for BN in GAN Discriminators? It seems fine in BigGAN there, and I'd expect Brock to know.
𝔊𝔴𝔢𝔯𝔫@gwernJan 24I'd expect that to be a large fraction. Lots of higher ed unis aren't doing PhDs at all, and given how many of 'top uni' PhDs land at lower institutions and spill out everywhere else, they have to be producing a healthy fraction of the oversupply.
𝔊𝔴𝔢𝔯𝔫@gwernJan 24Even if having been a PhD student was strictly necessary and a superset of eliteness, that's still not very 'elite'. It's not even close to the famous but still extremely broad '1%' (ie 3.2m people out of 320m).
𝔊𝔴𝔢𝔯𝔫@gwernJan 24I think she's right but this might have more to do with the dilution of being a PhD student. At this point in higher ed hypertrophy, what % of the US population is going to be a PhD grad student at some point in their lives? 5%? (50k PhDs/year, 3.6m births; figure half dropout).
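The back-of-envelope above, worked out using the tweet's own stated figures (which are themselves rough assumptions):

```python
# ~50k research PhDs awarded per year, ~3.6M US births per year,
# and assume roughly half of entrants drop out before the degree.
phds_per_year = 50_000
births_per_year = 3_600_000
entrants_per_year = 2 * phds_per_year           # awards / (1 - dropout rate)
cohort_share = entrants_per_year / births_per_year
# ~0.028: these figures imply closer to 3% of a birth cohort than 5%.
```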
𝔊𝔴𝔢𝔯𝔫@gwernJan 24What's hard about scaling up GANs, exactly, which makes them harder than diffusion or AR? (You are forbidden to use the word 'stabl*' in your reply.) A G is just a bunch of upscaling layers from a random seed. A D, in reverse, to a scalar.
𝔊𝔴𝔢𝔯𝔫@gwernJan 24github.com/TheAppleTucker… Like, it clearly can work, but you are going to have problems getting any useful behavior out of 1kb of state (prompt window) if you eschew any intermediate code generation steps. '1kb' doesn't even cover the full state of a tweet.
𝔊𝔴𝔢𝔯𝔫@gwernJan 24A popular claim but looking at Global Burden of Disease vizhub.healthdata.org/gbd-compare/, physical causes like heart disease or infant diseases or stroke still seem to solidly dominate anxiety/SCZ/MDD/etc.
𝔊𝔴𝔢𝔯𝔫@gwernJan 24The problem is that like reviewing _Sword of Shannara_, there's not really any substance *to* focus on. I read _Eragon_ when the movie came out, and thought, 'yeah, that's exactly what I'd expect from very talented 15yo American teen still digesting Tolkien'. What's left to say?
𝔊𝔴𝔢𝔯𝔫@gwernJan 24It 𝘥𝘪𝘥 get a lot of press. Mostly about how bad it was compared to GPT-3 (never mind ChatGPT) before they took it offline.
(There is a valid point to saying that ChatGPT isn't incredibly far ahead; unfortunately, when it comes from FAIR, it comes off as sour grapes...)
𝔊𝔴𝔢𝔯𝔫@gwernJan 24It's also not fully scaled up, to which there is no bar (eg no stability issues). As they point out, they use a quarter of the compute SD does, and it's received way less tweaking and tuning than SD has. Some proper scaling laws, hyperparameter sweeps, and Parti-level compute...
𝔊𝔴𝔢𝔯𝔫@gwernJan 24Like I've been saying, stability is not actually a problem for scaling up GANs. It just isn't, any more than for other archs. It's an academic urban legend spread by people cargo-culting claims from 5+ years ago as an excuse to jump on the latest researcher fad like diffusion. pic.twitter.com/mpYTsYhmX1
𝔊𝔴𝔢𝔯𝔫@gwernJan 23No, it's not, and you should be ashamed of browbeating like that here and elsewhere on Twitter. We know general intelligences exist and have catastrophic effects much better than we knew nuclear bombs were at all possible, because we exist.
𝔊𝔴𝔢𝔯𝔫@gwernJan 23Consider applying this criterion to nuclear bombs, discussed decades in advance. If you had demanded the exact principle, you would have willfully remained ignorant and a denialist until 1939, a year before researchers went dark and <3 years before the Manhattan Project began.
𝔊𝔴𝔢𝔯𝔫@gwernJan 23'mewtwo' instead of 'mew' - are you a real '90s kid?
𝔊𝔴𝔢𝔯𝔫@gwernJan 23Still not easy, though. When I said that it could be written in Stan, what I meant was 'even carefully avoiding discrete stuff that Stan can't do, I got bogged down and couldn't quite make it work'...
𝔊𝔴𝔢𝔯𝔫@gwernJan 23Indeed. There are many reasons for the tradeoff, so it's not going away, not while people are still trapped in single human bodies with only 24 serial hours in the day.
𝔊𝔴𝔢𝔯𝔫@gwernJan 23So, they open source only the stuff which doesn't really matter. You still aren't having your cake & eating it too in terms of publication count compared to alternative career paths like going after R1 tenure. That you get non-zero publications is a nice fringe benefit of the $$$
𝔊𝔴𝔢𝔯𝔫@gwernJan 23That's exactly why you do need to worry: your human intuitions are obsolete. Because humans can't copy themselves and take both forks in the road, and there is an effectively fixed supply of such humans. AIs can, and can scale to as many GPUs as you can buy, borrow, or steal.
𝔊𝔴𝔢𝔯𝔫@gwernJan 23As one observes of people who get hired by Google or NSA or Jane Street or Renaissance: you can often tell when they were hired simply by when their blog or other publications abruptly slow to a trickle.
𝔊𝔴𝔢𝔯𝔫@gwernJan 23You see occasional papers, but you're never going to see any real papers on major stuff. So it'll be like Kelly or public-key crypto: "X discovered it 30 years before at ABC, but they didn't publish". Publishing is not what any of them maximize or even try for, so... they don't.
𝔊𝔴𝔢𝔯𝔫@gwernJan 22Phrase it however you like, as multiple choice or free response. That dictionary is still going to lay there. I've owned a Compact OED for nigh on a score years, and it's never so much as wished me a 'good morning'.
𝔊𝔴𝔢𝔯𝔫@gwernJan 22You read weather reports anxiously because you're worried about overheating compute nodes interrupting AI scaling research runs; I read them anxiously because I'm worried about cold cats sleeping on top of my node downloading AI scaling papers.
𝘞𝘦 𝘢𝘳𝘦 𝘯𝘰𝘵 𝘵𝘩𝘦 𝘴𝘢𝘮𝘦
𝔊𝔴𝔢𝔯𝔫@gwernJan 22(Another big difference is that given how little good 99.99% of COVID reading/writing/doomscrolling did, a large number of individuals would have been better off in May 2020 spending that time reading about, say, GPT-3... 😉)
𝔊𝔴𝔢𝔯𝔫@gwernJan 22And people say ads have zero information value or relevance and targeting is useless!
𝔊𝔴𝔢𝔯𝔫@gwernJan 22If a dictionary could pass a word meaning exam, I would in fact be extremely impressed, and would not complain about it flunking my math exam, because dictionaries ordinarily just lay there on a desk and do nothing. pic.twitter.com/l3UHSvpaMu
𝔊𝔴𝔢𝔯𝔫@gwernJan 22He's a tenured professor who could doubtless consult for handsome fees, and so I'm sure by net wealth he's far above the 50th percentile... but had he gone into quantitative finance instead of Fields-worthy pursuits, his percentile would be far, far, far higher.
𝔊𝔴𝔢𝔯𝔫@gwernJan 22(That is, if you believed that IQ had to correlate like r=.9 with all these different measures to be 'important', you are saying 'I believe in a world where most billionaires are publishing 100 papers/year while also being elected president, winning Pulitzers, & living to 100.')
𝔊𝔴𝔢𝔯𝔫@gwernJan 22(In general, standard theories, datasets, and statistical method seem very poor at handling index variables with this sort of competing or zero-sum structure among the measured variables: old.reddit.com/r/statistics/c… A factor analysis wouldn't even correctly model this IQ example.)
𝔊𝔴𝔢𝔯𝔫@gwernJan 22I think of this when people trot out 'IQ only correlates 0.x with log income': true, but tends to overlook the tradeoffs - if you want to publish papers & patents, you can't also work at Jane Street & earn Jane Street $$$. Pearson correlation on single trait won't capture latent.
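The competing-outcomes point can be simulated (toy model; all numbers illustrative): one latent trait, but each person channels it into only one of two outcomes, so the trait's Pearson correlation with either single outcome understates its correlation with a combined index.

```python
import random
import statistics

random.seed(0)

def pearson(xs, ys):
    """Plain Pearson correlation."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

n = 10_000
trait, income, papers = [], [], []
for _ in range(n):
    g = random.gauss(0, 1)                  # latent ability
    trait.append(g)
    if random.random() < 0.5:               # career fork: finance...
        income.append(g + random.gauss(0, 1))
        papers.append(random.gauss(0, 1))
    else:                                   # ...or research
        income.append(random.gauss(0, 1))
        papers.append(g + random.gauss(0, 1))

r_income = pearson(trait, income)           # ~0.41 in theory: diluted by the fork
r_index = pearson(trait, [i + p for i, p in zip(income, papers)])  # ~0.58
# The single-outcome correlation understates the latent trait's reach.
```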
𝔊𝔴𝔢𝔯𝔫@gwernJan 22(It's unironically a valid isekai premise, IMO. It even comes with a built-in mechanism, like _Dr Who_, for switching up viewpoints regularly to renew and grow the series while maintaining a semi-stable immortal protagonist.)
𝔊𝔴𝔢𝔯𝔫@gwernJan 22Weird, but common: gwern.net/Timing
(This is also why Schmidhubering is so pointless: not only is the 'first publication' often trivial and useless, it is often not even causally connected to later, successful, instances, which simply forge their intellectual pedigree.)
𝔊𝔴𝔢𝔯𝔫@gwernJan 21I rewatched _Madoka_ recently after watching it during airing. What a perfectly constructed anime, even better than I realized at the time.
𝔊𝔴𝔢𝔯𝔫@gwernJan 21Yeah it's always been the case that the last layer or two isn't just a drop-in embedding like a CNN classifier - even something like iGPT is doing stuff like combining a bunch of arbitrary-looking layers to get a useful embedding for the linear probing evaluation.
𝔊𝔴𝔢𝔯𝔫@gwernJan 21This is pretty amazing. I can't think of any house which less embodies the rationality of farm architecture (which accomplishes its function extremely efficiently) than the ugly Steiner House, built on arbitrary geometric schema unrelated to any function or utility of its occupants.
𝔊𝔴𝔢𝔯𝔫@gwernJan 21It'll be some time before LMs can just spit back megabytes of JSON data or read the raw on-disk binary of your SQL database, so you're going to be generating code at some point.
𝔊𝔴𝔢𝔯𝔫@gwernJan 21You mean it'd generate code on the backend to execute the request, and cache it? One would then fulfill manually cases where it couldn't, and finetune further. Security issues aside, that could be pretty interesting capabilities.
𝔊𝔴𝔢𝔯𝔫@gwernJan 21The study of what is 𝘳𝘦𝘢𝘭𝘭𝘺 going on in Neal Stephenson's interlinked novels is known as Enoch Root cause analysis.
𝔊𝔴𝔢𝔯𝔫@gwernJan 21``I fear not the dog who howls a thousand howls once, but the dog who has howled one howl a thousand times.’’
—Bark Lee
𝔊𝔴𝔢𝔯𝔫@gwernJan 21Hm, not sure I did. I remembered that there were a bunch of ones touching on memories of various sorts, but not that they were linked such that I was missing the point of 'Onald Creely'.
𝔊𝔴𝔢𝔯𝔫@gwernJan 21(I left out "Onald Creely" because the overall conceit didn't work for me like the dream-job one eg, and it felt overly derivative of _A Lesson Is Learned_.)
𝔊𝔴𝔢𝔯𝔫@gwernJan 21To some degree. You still have to filter even after generating. The more short-term transition will be creators following up on winning tickets: "I have no idea why fantasy lobsterpunk is the most popular premise I ever invented, but I'll write 20 novels with GPT-4 this month."
𝔊𝔴𝔢𝔯𝔫@gwernJan 21Simonton and meta-science: there is surprisingly little observed correlation between quantity and quality of output. Apparently when it comes to creativity or research, there's no knob people can easily turn. Each new work is a stab in the dark, a lottery ticket - so buy lots.
𝔊𝔴𝔢𝔯𝔫@gwernJan 21Sure. It's just another way of tokenizing pixels; unusually bad, but still. The interesting possibility is if GPT-3 somehow gets it from Internet data because eg existing ASCII art is somehow enough to induce it.
𝔊𝔴𝔢𝔯𝔫@gwernJan 21But does GPT-3 classify/describe any of it accurately?
𝔊𝔴𝔢𝔯𝔫@gwernJan 20So far so good...
Also added a simple quote-of-the-day feature (just an epigraph wrapper + transclude, easy); an oldschool Web 1.0 feature I feel is appropriate. 😉
𝔊𝔴𝔢𝔯𝔫@gwernJan 20Not to pick on DC here... it's a webcomic antipattern, and I wouldn't even consider DC the saddest example, that'd be _Megatokyo_ (yes, still running). A short draft essay on this antipattern from a _Berserk_ review I've been writing: pastebin.mozilla.org/BLi3sDT2
𝔊𝔴𝔢𝔯𝔫@gwernJan 20Hm, did you check that it knew your style in the first place? I already checked GPT-3 knows 'gwern' in terms of topics, style, and even formatting (gwern.net/GPT-3-nonficti…), otherwise zero-shot text style transfer would be pointless. ('Pirate' checks that ChatGPT isn't broken.)
𝔊𝔴𝔢𝔯𝔫@gwernJan 20It's not as bad as listening to a recording of yourself talking, but I still wince a little looking at these. ("Surely I don't sound like 𝘵𝘩𝘢𝘵...?") pic.twitter.com/fnChKAK0vX
𝔊𝔴𝔢𝔯𝔫@gwernJan 20(That is, this comment looks identical to me as a comment 'Actors routinely are thin, so diet and exercise seem pretty routine, I just saw a lot of muscular actors in _300_ with hardly any body fat; why don't more people take advantage of whatever they did instead of wishing?')
𝔊𝔴𝔢𝔯𝔫@gwernJan 20I deny the premise. How do you know that actors 'routinely' eliminate accents? Actors are enormously highly selected due to immense oversupply, and still, some actors are famous for handling accents (eg Meryl Streep). Also, failure is standard plot point in 'talkie' histories.
𝔊𝔴𝔢𝔯𝔫@gwernJan 20This is a Frankenstein StyleGAN that @AydaoAI developed. See gwern.net/Faces#stylegan…
And it doesn't use SG, but plenty of others do without magically working. Also, embeds from ImageNet CNNs etc. are another very old GAN trick (most recently, Projected GAN).
𝔊𝔴𝔢𝔯𝔫@gwernJan 20'minibatch discrimination' is an old thing, and there's also BN in many of these archs, yeah. It's striking that BigGAN sees improvements in minibatch size up to like 20k with no plateau by then, and note that many contrastive approaches like CLIP need really large batchsizes.
𝔊𝔴𝔢𝔯𝔫@gwernJan 20A good example of amplifying a particular niche in the data distribution to hide from the D: like long sleeves or bad crops, that is a legit mode in the data - she even looks like Yakumo Yukari! danbooru.donmai.us/posts?tags=yak… (link may contain NSFW images).
𝔊𝔴𝔢𝔯𝔫@gwernJan 20I don't know what you did with FTX beyond 'like a bazillion other people, worked for an org which got some money from them', but if it was as concrete and specific as 'hey, you tried to give a decent fraction of a million bucks to neonazis, where the alt hypo is nepotism: ???'...
𝔊𝔴𝔢𝔯𝔫@gwernJan 19Waking up January 13th and going 'my goodness! those journalists who were doing a journalism on us, those cheeky lads went and did a journalism! I say - what *will* we say?' is not particularly impressive nonprofit practice either.
𝔊𝔴𝔢𝔯𝔫@gwernJan 19I dunno man, if a major newspaper contacts you asking you why you're giving money to neonazis and if you have any comments on it you'd like to give to a newspaper, doing reporting, using journalists, you *might* start discussing it with the nonprofit & thinking of a response.
𝔊𝔴𝔢𝔯𝔫@gwernJan 19I don't see why this is so exculpatory. The clock doesn't start ticking on January 13th, it starts ticking in mid-December when Expo contacted Tegmark (not 'FLI') and he ghosts them. And if his mother died the same day Expo contacted FLI after ghosting, then that can't explain it.
𝔊𝔴𝔢𝔯𝔫@gwernJan 19I don't think it can be rescued now: even the 'picked up pace' is mostly about wanking around with trans/enby self-insert fanfic, so is not progress. IMO, it's sheer sunk cost. Diaz would be better off dumping an outline, killing DC, and doing something they actually want to do.
𝔊𝔴𝔢𝔯𝔫@gwernJan 19Very precisely: "Dark Science #1". Diaz decided to start a 'serious' Grand Dramatic Narrative which all the earlier strips had hinted at, but it's so terminally boring and slow-moving and uninteresting that he can't make himself do more than a few strips a year, so it goes even slower.
𝔊𝔴𝔢𝔯𝔫@gwernJan 19Exactly. Having 'put it really at rest' is exactly what it looks like when you're wrong!
Also, we're really going to take Teller's word for anything on this (right after Oppenheimer clearance news, even)? The man had more of a hardon for anything nukes than dogs for legs.
𝔊𝔴𝔢𝔯𝔫@gwernJan 19Still running. Early DC is great, past decade+ is bad: if you're in a hurry, I made a list of ones I liked gwern.net/newsletter/201… 'Funny SF+_A Lesson Is Learned_ webcomic set in post-apocalyptic simulation.'
𝔊𝔴𝔢𝔯𝔫@gwernJan 19Unfortunate that there's no learning going on. The correlation with initial blind priors about the algorithm (they are given no info on accuracy) being highly accurate suggests it's mostly just self-selection into overconfidence, which they'd do better on given any info on errors
𝔊𝔴𝔢𝔯𝔫@gwernJan 19If that is contributing, that makes the comparison with 1997 *even more striking*, that so many jurisdictions were willing to goldplate requirements and/or outlaw a basic necessity in many places.
𝔊𝔴𝔢𝔯𝔫@gwernJan 19If this is the 'same story', why is it totally different from the book version (crawl or somersault? Benjamin, or Phil? did he cover his face, or did he cover everything *but* his face? all undetected, or not the first?), and why should I believe either one after comparing them? pic.twitter.com/E6ZzsWWSUD
𝔊𝔴𝔢𝔯𝔫@gwernJan 18(That is, we may never 'run out' of raw sewage Internet text token data, in the same way we never run out of many natural resources: not because it became sky-high expensive to extract, but because other substitutes got much better than the original and no one even wanted to use it all.)
𝔊𝔴𝔢𝔯𝔫@gwernJan 18Given stuff like active learning/data distillation, instruction-tuning, and inner-monologue, we already know almost all data is useless to begin with, while sampling from a model is cheap. So not too hard to beat naive token scaling.
𝔊𝔴𝔢𝔯𝔫@gwernJan 18Only if you thought token scaling was always the cheapest. As it's a power law and gets expensive fast, it's not hard for other scaling improvements to beat X-more-tokens scaling, and many improvements already have, like Chinchilla param+token scaling or ULM.
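To see how fast brute token scaling gets expensive, a toy power-law calculation (the exponent is an illustrative assumption, roughly the magnitude seen in reported data-scaling fits):

```python
# Toy power-law cost check: with reducible loss L(D) = A * D**-alpha,
# cutting the loss by a factor f needs f**(1/alpha) times more tokens.
alpha = 0.095                       # illustrative data-scaling exponent
tokens_to_halve_loss = 2 ** (1 / alpha)
# ~1475x more data just to halve the reducible loss, which is why
# other scaling levers can beat X-more-tokens scaling so easily.
```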
𝔊𝔴𝔢𝔯𝔫@gwernJan 18LW discussion: lesswrong.com/posts/Couhhp4p…
I don't see it as a big threat to scaling. Multi-modal tokenization, Whisper-style ASR, training on private datasets like emails, reuse of tokens (at least several times doesn't seem to be penalized too badly), inner-monologue generation...
𝔊𝔴𝔢𝔯𝔫@gwernJan 18Yeah, but counting words seems like it should be easy, BPE or no, because BPEs are space-separated, so it boils down to counting 4-5 spaces modulo punctuation etc. So I'm not sure if BPEs can explain difficulty in counting words (rather than *letters*).
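The space-counting claim, sketched with made-up token strings (not any real tokenizer's output): GPT-style BPE tokens keep their leading space, so word count reduces to counting space-prefixed tokens.

```python
# Made-up token strings illustrating GPT-style BPE conventions,
# where a token carries the space that precedes it:
tokens = ["The", " quick", " br", "own", " fox", " jump", "s"]
# Word count = first word + tokens that begin with a space.
words = 1 + sum(t.startswith(" ") for t in tokens)   # 5 words
```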
𝔊𝔴𝔢𝔯𝔫@gwernJan 17(Hehe. I *did* know people there - go go Fife & Drum Corps! - at least until new management started to screw things up.)
𝔊𝔴𝔢𝔯𝔫@gwernJan 17Any approach which requires 60 million learnable parameters is an obvious dead end (see VC dimension etc). Still, perhaps it can help inspire some better neurosymbolic approach: the learned filters are interesting, and apparently a better basis function than Gabor filter banks.
𝔊𝔴𝔢𝔯𝔫@gwernJan 17That's pretty nifty - for once, connectionist stuff works! Aside from the Schmidhuber lab boasting about some simple digit recognition stuff, it never has before. Still, even if you throw a ridiculous amount of hardware & parameters at it, seems unlikely it'll dethrone CRF etc.
𝔊𝔴𝔢𝔯𝔫@gwernJan 17I can't find any evidence or reference to a Mayo essay, and the wording of the 'extended' quote is so like that of a Martin passage on pg5 archive.org/details/b29976… (and no references to Mayo essay despite numerous refs in his autobio) that I'm going to say misattribution.
𝔊𝔴𝔢𝔯𝔫@gwernJan 17'CBT was significantly more effective than other psychotherapies, but the difference was small (g=0.06; 95% CI: 0-0.12) and became non-significant in most sensitivity analyses.'
[quacks like a dodo]
𝔊𝔴𝔢𝔯𝔫@gwernJan 17The answer, as it often is to 'the most X programming language in wide use', is clearly 'Excel'.
𝔊𝔴𝔢𝔯𝔫@gwernJan 17Dairy yields increase like 1% a year or something, IIRC, yes. But I'm not sure if you could easily reject the claim that it's *linear* when modern dairy is still relatively new, has multiple epochs, and the percent is so small.
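A quick calculation of why rejecting linearity is hard here: over three decades, ~1%/year compound growth stays within a few percent of a straight line.

```python
# ~1%/year yield growth over 30 years: compound vs straight-line.
years, rate = 30, 0.01
compound = (1 + rate) ** years      # ~1.348x
linear = 1 + rate * years           # 1.30x
gap = compound / linear - 1         # ~3.7%, tiny relative to farm-level noise
```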
𝔊𝔴𝔢𝔯𝔫@gwernJan 17Similar to why you shouldn't try to measure interaction effects when you need many times the sample size to approach any precision, or should assume sparsity. I have seen many people argue 'yes, effect X [eg Pygmalion] didn't work out for them, but it might still work for *us*'. pic.twitter.com/sy6gYeB5yh
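The sample-size arithmetic behind this (the standard Gelman-style 2x2 argument, stated here from memory as an assumption): the interaction contrast has twice the standard error of a main effect at the same n, so an interaction half the size of the main effect needs 16x the sample to estimate with equal precision.

```python
# 2x2 design with equal cell sizes (assumption): the interaction
# contrast averages differences of differences, doubling its SE
# relative to a main effect. Precision scales as sqrt(n), so the
# required sample multiplier is the squared product of the ratios.
se_ratio = 2          # SE(interaction) / SE(main effect), same n
effect_ratio = 2      # main-effect size / interaction size
n_multiplier = (se_ratio * effect_ratio) ** 2   # = 16x the sample
```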
𝔊𝔴𝔢𝔯𝔫@gwernJan 17Naming them 'Hawthorne effects', when the original wasn't real to begin with, ascribes much higher prior probability (and effect size) to them than they merit. You may be much better off saying 'there are never Hawthorne effects' than in trying to think about them...
𝔊𝔴𝔢𝔯𝔫@gwernJan 17It's post hoc, but Kaplan claims his volume nearest-neighbor interpolation generates the scaling laws.
𝔊𝔴𝔢𝔯𝔫@gwernJan 17As my Clippy story only uses real examples like Shellshock or Mirai which actually happened in the real world...