𝔊𝔴𝔢𝔯𝔫@gwernApr 30That makes it even more impressive an anecdote, that people could so badly misinterpret what they see in the prototypes and dismiss it for years until suddenly, for no good reason, they are able to see the obvious. (*cough* AI *cough*)
𝔊𝔴𝔢𝔯𝔫@gwernApr 30I assume she's referring to arxiv.org/abs/2304.14399 but I'm not sure this is a great example. Suppose the question was instead 'Would I get a flat tire by bicycling over a bridge over nails, screws, & broken glass?' This is 'ambiguous' in the exact same way, and equally absurd. pic.twitter.com/4Cq9G489uw
𝔊𝔴𝔢𝔯𝔫@gwernApr 29Q. so what's the other cat's job?
A. Nothing, he just loafs around.
𝔊𝔴𝔢𝔯𝔫@gwernApr 29Already linked😀 Also, they seem to have recently changed the design and lost a lot of the magic - it's now much less grid-like and noticeably slower.
𝔊𝔴𝔢𝔯𝔫@gwernApr 29SimilarWeb, like all such traffic estimators, can be assumed to be highly inaccurate, and # of visits != people, unless I suddenly became a few hundred people while I wasn't looking.
𝔊𝔴𝔢𝔯𝔫@gwernApr 29What people always take those as meaning is 'progress is more than half caused by new ideas and so compute is causally unimportant', when the ideas are caused by compute investments.
𝔊𝔴𝔢𝔯𝔫@gwernApr 29Yes, but that's average-case, so to speak, not what would be possible with much larger budgets than actually used.
(I also think you guys are misinterpreting what these sorts of results mean. They are not causal for algorithmic progress.)
𝔊𝔴𝔢𝔯𝔫@gwernApr 29It's remarkable how many people, whether Wired or random tweeters, all fire up GPT for the first time and all independently decide on the same tasks like 'count the number of letters in a random word', and all get fooled by BPE problems. I wonder if this has had an impact on PR.
𝔊𝔴𝔢𝔯𝔫@gwernApr 29It's like if you read the Bible and then someone asked you for the word length of a Hebrew word. 'Why can't you just tell me what it is? You just read thousands of pages translated from Hebrew! How can you *not* know exactly what Hebrew word each English word corresponds to?'
𝔊𝔴𝔢𝔯𝔫@gwernApr 29Because it is never ever given access to the opposite information. It never sees the text encoded into individual letters or even a dictionary symbolically explaining the mapping of BPEs to letters (gwern.net/gpt-3#bpes) and certainly not the word-length.
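To make the BPE point concrete, here is a toy merge-based tokenizer (the merges are made up for illustration, NOT GPT's real vocabulary) showing how a word reaches the model as a couple of opaque token IDs rather than as letters:

```python
# Toy BPE: frequent character pairs get fused into single tokens, so by
# the time text reaches the model, the letters inside a token are simply
# not visible to it. Merge table is hypothetical (learned, in real BPE,
# from corpus pair statistics).

MERGES = [("t", "h"), ("th", "e"), ("i", "n"), ("in", "g")]

def bpe_tokenize(word):
    """Greedy BPE: apply each learned merge in order over the character sequence."""
    pieces = list(word)
    for a, b in MERGES:
        i = 0
        while i < len(pieces) - 1:
            if pieces[i] == a and pieces[i + 1] == b:
                pieces[i:i + 2] = [a + b]  # fuse the pair into one token
            else:
                i += 1
    return pieces

print(bpe_tokenize("thing"))  # ['th', 'ing']: 2 tokens, 5 letters
```

So 'count the letters in this word' asks the model about structure it literally never receives, exactly as the Hebrew-translation analogy has it.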
𝔊𝔴𝔢𝔯𝔫@gwernApr 29Yeah, but like all pro-China arguments, it's a dumb one. All the wrong kind of data, and fiercely siloed to boot.
𝔊𝔴𝔢𝔯𝔫@gwernApr 29Hm. Looking at these poll results, I suspect that either people have seriously miscalibrated views on growth of global-data+sample-efficiency, or I asked it in a bad way despite my best effort to be clear. May require a much longer sort of survey question with stats etc...
𝔊𝔴𝔢𝔯𝔫@gwernApr 29New feature: Twitter links are now automatic annotations, by parsing the local Nitter archive snapshots/mirrors.
(I was going to do this on the backend but Said decided to do this on the frontend instead; both have advantages, and he got there first, so 🤷♂️.) pic.twitter.com/ErT6bS3Qa2
𝔊𝔴𝔢𝔯𝔫@gwernApr 28It can't actually rhyme or follow instructions; eg. it still can't explain puns. The example I gave there in the comment Quanta still has not approved was ask ChatGPT to 'write a poem that does not rhyme'. It's like asking you to "write a poem which does not bleepledrof a knckit"
𝔊𝔴𝔢𝔯𝔫@gwernApr 28If you're wondering how we handle really long or nested section titles, because every vertical pixel is in the final analysis a theft from those who hunger & are not fed etc:
we left-truncate to keep it constrained to ~2 lines with the most informative (deepest) headers. eg.: pic.twitter.com/wNKTFNfxrv
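The left-truncation idea can be sketched in a few lines (function name, separator, and ellipsis marker are hypothetical, not the actual gwern.net code): keep a suffix of the header path that fits the budget, since the deepest headers carry the most information.

```python
# Sketch: left-truncate a nested-section breadcrumb to a character budget,
# dropping the shallowest headers first and marking the dropped prefix.

def truncate_breadcrumb(headers, max_chars):
    """Keep the deepest headers that fit within max_chars."""
    sep = " > "
    kept, total = [], 0
    for h in reversed(headers):  # deepest header first
        cost = len(h) + (len(sep) if kept else 0)
        if total + cost > max_chars:
            break
        kept.insert(0, h)
        total += cost
    if len(kept) < len(headers):
        kept.insert(0, "…")  # signal the elided shallow prefix
    return sep.join(kept)

print(truncate_breadcrumb(["A Very Long Chapter Title", "Methods", "Details"], 18))
# "… > Methods > Details"
```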
𝔊𝔴𝔢𝔯𝔫@gwernApr 28A bagel-with-cream-cheese isn't sold in grocery stores, though. Bagels (if you can call grocery store ones that) and cream cheese are sold, sure, but separately.
𝔊𝔴𝔢𝔯𝔫@gwernApr 28(I look forward to donating my fecal samples for metagenomic microbiome studies and discovering my new Assphages.)
𝔊𝔴𝔢𝔯𝔫@gwernApr 28Small feature: image-focus/zoom now shows all available non-redundant metadata ie. URL, title, alt, caption.
Looking forward to GPT-4's image modality just plain solving image captioning/alts, so you can just auto-run it on all your images and get human-level alts. pic.twitter.com/oRdky0yWCY
𝔊𝔴𝔢𝔯𝔫@gwernApr 27I'd love to see some interviews of Moravec, Vinge, and whoever else is still around from then. Rumelhart and Minsky are dead, I know that, but who else...
𝔊𝔴𝔢𝔯𝔫@gwernApr 27I don't think it is, because it is *definitely* not going to be limited to mere CV. Tool AIs want to be agent AIs, no less in military applications than economic, and drones and artillery especially have been hurtling towards autonomy as fast as they can develop it.
𝔊𝔴𝔢𝔯𝔫@gwernApr 27What makes you think they are somehow completely separate and independent? Yann is obviously wrong in his claim: humans are extremely interested in designing AI to hurt other humans. (Strictly speaking, he is not necessarily being hypocritical here; just wrong or changed minds.)
𝔊𝔴𝔢𝔯𝔫@gwernApr 27Meanwhile, in another browser tab, I am reading about major defense corporations testifying to Congress about how we need to put AI inside tanks (and everything else) yesterday to 'terrify our enemies' 'because our survival depends on it': piratewires.com/p/senate-commi…pic.twitter.com/l75XjIDe9T
𝔊𝔴𝔢𝔯𝔫@gwernApr 27Seems like a bad framing. Markets arise without any governments nor do rights/contracts require it. Indeed, many markets arise despite extensive government actions to destroy them, never mind withholding enforcement (eg my old area of darknet markets, or most cryptocurrencies).
𝔊𝔴𝔢𝔯𝔫@gwernApr 27A few months ago I spent a while looking through lossfunctions.tumblr.com trying to find 3 loss curves I could use for the other panels, but I couldn't find convincing ones. I see now I should've had the courage to only replace the fourth panel.
𝔊𝔴𝔢𝔯𝔫@gwernApr 26For some extremely well-defined tech, perhaps, where you can be sure that no new needs or materials or anything have popped up nor any emergent effects which change everything.
𝔊𝔴𝔢𝔯𝔫@gwernApr 26Carroll's argument in arxiv.org/abs/2101.07884 would seem to semi-rule this sort of thing out: there may be tons of important effects and new tech, but they will be comprehensible under current physics, either forecastable or emergent.
𝔊𝔴𝔢𝔯𝔫@gwernApr 26Seems risky due to gambler's ruin. Very tiny ecosystems mean issues like asteroid impact, ecosystem transitions like oxygenation, mega-pandemics, and sheer variance in vent lifespans/positions would wipe out life eventually.
𝔊𝔴𝔢𝔯𝔫@gwernApr 25[IMAGE CAPTION for the blind:]
"We're not so different, Nick—you and I... Join me, and 𝘵𝘰𝘨𝘦𝘵𝘩𝘦𝘳, we will change the world!"
𝔊𝔴𝔢𝔯𝔫@gwernApr 25Yup. I'm not worried about the regular average-case performance: training on model outputs will be fine (lesswrong.com/posts/uKp6tBFS…). But security is adversarial and about extreme edge-cases, so every weird bad edge case getting amplified will fatten the tails up dramatically.
𝔊𝔴𝔢𝔯𝔫@gwernApr 25Or just acquiescence bias, yes. Most citations are accurate {{citation needed}}, so it could just assume they're all correct.
𝔊𝔴𝔢𝔯𝔫@gwernApr 24Heck, forget the salaries - why would you want to move to Xi's chip-embargoed China to work on behind-SOTA systems that will probably be DOA as soon as they *seem* to violate any censorship regs or the org otherwise incurs CCP wrath?
𝔊𝔴𝔢𝔯𝔫@gwernApr 24Leaning heavily on the 'exponential' rhetoric could backfire. After all, if mistakes can 'compound exponentially', doesn't that imply that when R<1, so to speak, total error will abruptly begin to decrease exponentially...?
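The arithmetic behind that quip, sketched: if per-step error multiplies by a factor r, then r > 1 compounds 'exponentially', but by exactly the same logic r < 1 makes total error decay exponentially.

```python
# Compounding cuts both ways: the same multiplicative model that gives
# exponential error growth for r > 1 gives exponential decay for r < 1.

def total_error(r, steps, e0=1.0):
    """Error after `steps` steps, each multiplying error by r."""
    return e0 * r ** steps

print(total_error(1.1, 50))  # ~117: growth
print(total_error(0.9, 50))  # ~0.005: decay
```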
𝔊𝔴𝔢𝔯𝔫@gwernApr 24Anyway, still finetuning the prompt. I want it fully autonomous in terms of revising the translation and making changes, but it is a bit tricky: there's a tendency to settle on one version immediately and then just repeat it. eg this one is the fourth iteration but same as first: pic.twitter.com/7EViefE8oS
𝔊𝔴𝔢𝔯𝔫@gwernApr 24A challenger appears: Semantic Scholar is now doing reference popups in rendered PDFs! eg semanticscholar.org/reader/c2d574f…
IMO, pretty bad design. Shockingly wasteful of space while still being too small, and demanding many unnecessary interactions & clicks. But a start. pic.twitter.com/rKTinnTgnR
𝔊𝔴𝔢𝔯𝔫@gwernApr 24Round-tripping doesn't prove that the translation didn't work. Like, I may not know Old English, but I do see more than enough root-words there that I can tell the 'Old English' is in fact related to the Milton input, and not totally different and unrelated like your example.
𝔊𝔴𝔢𝔯𝔫@gwernApr 23OK GPT-4, you got me there, I did say into 'Old English style'. Regrettably, I don't know any Old English so I can't tell how well the translation works - GPT-4 says it's great, but it 𝘸𝘰𝘶𝘭𝘥, wouldn't it? 🤔 pic.twitter.com/dLzkLePxvE
𝔊𝔴𝔢𝔯𝔫@gwernApr 23This list as well emphasizes my point: by the time you are reaching to obscure failed German political parties, producing a half-invalid list, or examples *copied from Buddhism in the first place*, you're showing how rare it is. 3000+ years of Western history and this is it?
𝔊𝔴𝔢𝔯𝔫@gwernApr 231. Not a number nor an ordered list
2. damn that's obscure. Had to look it up.
3. invalid: there are four humors (sometimes) but they aren't 'The Four Humors™'
4. invalid
5. valid but reaching all the way to Islam, eh
6. valid
7. invalid
8. copied from Buddhism!
9. invalid
𝔊𝔴𝔢𝔯𝔫@gwernApr 23Yes, self-distillation/finetuning on outputs of larger models can backport abilities; the ability/Turing machine is there in smaller models, just too far below the surface to matter without the Bayesian evidence from finetuning to make it a highly likely prompt interpretation.
𝔊𝔴𝔢𝔯𝔫@gwernApr 23Sorry, left deadly sins out.
Well, you do now. And I've seen the snowclone/joke, but not the numeric epithet name or that there was supposed to be some specific pseudo-scientific taxonomy of 'love languages' behind it.
𝔊𝔴𝔢𝔯𝔫@gwernApr 23I think this emphasizes my point. You have to make up half of these by simply turning any schematic into a Numbered List, or go to some ephemeral pop psychology clickbait like 'love languages' to find any examples. As opposed to East Asian where you trip over famous ones.
𝔊𝔴𝔢𝔯𝔫@gwernApr 231. Weak because the 'Seven Wonders' is late & inconsistent
2. obscure & invalid, I've never heard them called 'The Ten Categories', just 'Categories' or 'Aristotle's categories'
3. invalid
4. invalid
5. Granted
6. Granted
7. never even heard of that one...
8. Granted
9. Granted
𝔊𝔴𝔢𝔯𝔫@gwernApr 23Solving multi-level/speed community design (gwern.net/backstop#inter…) is something pretty much every place gets wrong. Whether Discord tacking on pseudo-forums which are worse than forums or Reddit tacking on chat... It's hard b/c what makes you win at one level, loses at another.
𝔊𝔴𝔢𝔯𝔫@gwernApr 23Can you name any Numbered Lists beyond 'the Ten Commandments'? I can't... Even stuff like the Bill of Rights (which should be a gimme: 'the Ten Amendments') doesn't use that snowclone. Meanwhile, I'm a Westerner and I can rattle off a dozen Asian Numbered Lists.
𝔊𝔴𝔢𝔯𝔫@gwernApr 23I'm starting to be concerned that this is too hard to ask in a poll and people are answering different questions: people can't really think that it's <1% when global data expands >>1% annually alone without any other kind of progress? How would that even work?
𝔊𝔴𝔢𝔯𝔫@gwernApr 22I think it's hard to tell because it's not particularly prioritized outside DRL. Sample-efficiency is not compute-optimal, to say the least.
𝔊𝔴𝔢𝔯𝔫@gwernApr 22Sounds to me like you just explained how it makes sense and also offered good reasons for picking the fourth poll option.
𝔊𝔴𝔢𝔯𝔫@gwernApr 22Yep. 'Sample-efficiency' here implicitly refers to 'real data sample-efficiency', because synthetic data is both not generally useful & just a kind of compute. eg we talk about MuZero learning Go sample-efficiently from thousands of games, not millions of simulated self-play.
𝔊𝔴𝔢𝔯𝔫@gwernApr 22Incidentally, people seem to be placing great stock in the 'data shortage' excuse, but obviously, data increases every day & sample-efficiency also increases. Mildly curious, so a poll:
"Every year the fraction of global data required to train an AGI falls by <𝘟%", where X is…
𝔊𝔴𝔢𝔯𝔫@gwernApr 21Huh? Are you multiplying 1000 by 1000? How is that relevant? You usually have a dead body when it comes to murder trials and testimony... The probability a murdered person is murdered is darn near 1 in 1.
𝔊𝔴𝔢𝔯𝔫@gwernApr 21I think they just mean that it happens to parallelize on up to 10 cores, so that they can run a few hundred or thousand proteins simultaneously for throughput, not that their *entire cluster* is 10 cores. 😁 I mean, people have more cores than that in their laptops these days.
𝔊𝔴𝔢𝔯𝔫@gwernApr 21Given that the noise processes are over 8b humans & operate 24/7, while even UFO partisans concede there are not *that* many UFOs zipping around or exposing themselves occasionally, I see zero problem in getting false reports:true ratios >>1000:1...
𝔊𝔴𝔢𝔯𝔫@gwernApr 21Er, yes it can? At this point, with nearly a century of UFO sightings, we have entire libraries of debunked cases, conmen, classified programs yielding thousands of sightings, extensive documentation of aerial hallucinations from pilots, etc etc.
𝔊𝔴𝔢𝔯𝔫@gwernApr 21RIP Google Brain.
Your efforts were noble & underappreciated, often well-conceived & not without a style, and your demise was the fault of the responsible executives who are busy dodging responsibility.
You will be remembered fondly for rearing a generation of AI researchers.
𝔊𝔴𝔢𝔯𝔫@gwernApr 21No. My guess is that Young is a bit player in the psych department, or someone Feynman talked to. Of known names, Curtis looks most like Young, and I purchased his PhD thesis to try to get the original.
Shepard's monograph may require a trip to the Library of Congress, sadly.
𝔊𝔴𝔢𝔯𝔫@gwernApr 21Yeah, it's quite a rabbit hole. Once you have the 'cue' (or the 'floor cue', I should say...), the whole thing unravels and makes sense. eg I'm now fairly sure Feynman learned the story in summer 1947 attending a minor seminar in University of Michigan.
𝔊𝔴𝔢𝔯𝔫@gwernApr 21We found it! Still digging through references and fulltexting, but overall (and tragic) story is now clear thanks to background material like gwern.net/doc/psychology…
𝔊𝔴𝔢𝔯𝔫@gwernApr 20I am begging you all, with tears in my eyes: learn what BPE/byte-pair encoding is (gwern.net/gpt-3#bpes), or stop asking GPT models character-based questions!
𝔊𝔴𝔢𝔯𝔫@gwernApr 19FWIW, I'd consider this to be an example of the Transformer doing it in an iterative/recurrent way with what used to be an exotic mechanism, so the arguments about a single feed-forward pass being unable to do parity seem to still be correct. You have to get it counting.
𝔊𝔴𝔢𝔯𝔫@gwernApr 18Yeah yeah but isn't _Neon Genesis Evangelion_ actually just an extended metaphor for making _Neon Genesis Evangelion_?
𝔊𝔴𝔢𝔯𝔫@gwernApr 18It is a very difficult business to extract consumer surplus from. Think about how many $ of each iPhone goes to Samsung/TSMC, vs Apple or Qualcomm. (People don't buy transistors, they buy sent-messages/emails, downloaded webpages, uploaded photos...)
𝔊𝔴𝔢𝔯𝔫@gwernApr 18The domain-specific nature of the improvements is also a bad sign for retrieval approaches. If it's so great and even helps induce logic/reasoning/broad capabilities and saves hugely on parameters... why doesn't it work far better in general?
𝔊𝔴𝔢𝔯𝔫@gwernApr 17To give an idea how crushing chess engine superiority is: an obsolete engine on weak hardware years ago can still beat a grandmaster at *knight* odds! chess.com/news/view/smer…
𝔊𝔴𝔢𝔯𝔫@gwernApr 17Mutualism seems like the theory of g which makes the most sense in deep learning. Single-causal-variable g and sampling g look nothing like how ANNs scale or act. POT and other global-processing approaches don't look too great either.
𝔊𝔴𝔢𝔯𝔫@gwernApr 17Probably his last really substantive public writing on AGI, other than a few offhand public comments suggesting his timelines remained largely unchanged. A pity. I certainly would've liked to hear how his thoughts on neuroscience & scaling evolved.
𝔊𝔴𝔢𝔯𝔫@gwernApr 17I love it because it's either a reason to eat ice cream, or a great example of why nutrition methodology is inadequate. ('The Chocolate Glacé Is Out Of Control'?)
𝔊𝔴𝔢𝔯𝔫@gwernApr 17No. They argue that it was a pre-existing condition unless you can prove otherwise.
𝔊𝔴𝔢𝔯𝔫@gwernApr 17It's such a weird phrase, isn't it? Could've been "moral agent" etc, but no, they went with a phrase that makes you imagine phrenologists in an operating room:
"Doctor (of philosophy), the patient's moral bump is enlarged!"
"We'll have to take it out. We have no (free) choice."
𝔊𝔴𝔢𝔯𝔫@gwernApr 16TEMPEST is overthinking it. I bet solely light intensity over time can be correlated with the _n_ BBC broadcasts given a couple seconds at high accuracy, and then the BBC archive given a few minutes. Haven't you ever walked by a tower and seen the windows flicker in unison?
𝔊𝔴𝔢𝔯𝔫@gwernApr 16So lots of IRC and emails and maybe kanzure transcripts?
𝔊𝔴𝔢𝔯𝔫@gwernApr 16Exactly: makes sections more first-class. It's a constant struggle to handle the tension between long pages with context and the many kinds of overhead/friction you get from lots of small named fragments.
'The essay long united, must divide; long divided, it must unite...'
𝔊𝔴𝔢𝔯𝔫@gwernApr 16And yes, being damaged in the tsunami is the obvious way for the chair to lose its leg, but the time loop leaves you a bit baffled how it gets from *there* (her yard or house during the tsunami) to *here* (inside the afterlife's temporal loop of young -> old -> young).
𝔊𝔴𝔢𝔯𝔫@gwernApr 16I didn't take that away at all. There's nothing indicating the dead can return, she's implied to have gone through the door while searching for her mother in the days afterwards while everyone pities her, and why would her aunt be looking for her so long afterwards?
𝔊𝔴𝔢𝔯𝔫@gwernApr 16Dammit! That's what I get for taking GPT-4's suggestions on the final clean draft and then failing to spellcheck my last-second changes. 🤦♂️
𝔊𝔴𝔢𝔯𝔫@gwernApr 16Too hard to get! I'm always reading someone reviewing an indie film and going, 'where on earth would I watch this? Do I need to... fly out to Denver, or what? Is there some torrent site everyone uses for indies I'm just out of the loop on?'
𝔊𝔴𝔢𝔯𝔫@gwernApr 16(Fixing his LaTeX compilation errors would require a lot more domain knowledge, but is still the sort of thing which you can ship off to a third party with a paragraph or less of context: 'pls2make better: {error message} ???'.)
𝔊𝔴𝔢𝔯𝔫@gwernApr 16Sure, but every time someone accelerates these with the OA API or Playground, they're constructively proving that you *can* outsource it effectively with some brief vague textual instructions. Nor is, 'hey, pull out all the authors so I can get a total' all that amazing a task.
𝔊𝔴𝔢𝔯𝔫@gwernApr 15Right? Ugh. And he had been avoiding the schmaltz *so* well. (I was watching the subs, so can't blame the localizers.) That speech needs to be either written fantastically well, or not be there; and Shinkai at this point ought to know that it was not written even close to 'well'.
𝔊𝔴𝔢𝔯𝔫@gwernApr 15Yeah, I enjoyed it. It was Shinkai but not so stereotypically Shinkai.
Would benefit from a little bit of editing, however: a few too many loose ends (I still can't figure out how the chair lost its leg), and the climactic speech ruins it. If any scene should be dialogue-less...
𝔊𝔴𝔢𝔯𝔫@gwernApr 15They sorta do as the comments in the margin? You don't have to ever close/resolve them.
𝔊𝔴𝔢𝔯𝔫@gwernApr 15This would be good to update. The reversal and then reversal-reversal, and collapse of Zero Covid, was really striking and resolved the anomaly of why they seemed determined to do nothing - they were out of ammunition, until ChatGPT created a crisis. But embargo's still holding?
𝔊𝔴𝔢𝔯𝔫@gwernApr 15Yes. Same issue with anime vs manga, or SF novels vs pretty much any adaptation. Mediums just have very different distributions of costs, which enable or hinder individual creators with idiosyncratic goals.
𝔊𝔴𝔢𝔯𝔫@gwernApr 15That debate reminds me of video games vs movies.
The strength of movies is that an auteur director creates a single ultra-polished fixed sequence of controls of all viewers' gaze, attention, & visual input at millisecond-resolution. The weakness of movies...
𝔊𝔴𝔢𝔯𝔫@gwernApr 15This would be a good student assignment: just train Transformers (and maybe RNNs and CNNs too) on varying levels of pseudo-randomness and repetitions thereof, and qualitatively characterize them. *Do* they 'go crazy'? Do they learn to 'give up' and collapse to maxent? etc
𝔊𝔴𝔢𝔯𝔫@gwernApr 14She might just be worried about you. When I meditated, my dog would come over and lick me the same way he'd lick me concernedly when I'd play dead on the floor.
𝔊𝔴𝔢𝔯𝔫@gwernApr 14(I suppose you could do that because Transformers are smart, but you'd have to insert formatting tokens to indicate which tokens are 'cat' and which are 'robot', and why would you bother? It's just a waste of compute for the Transformer to condition on & then discard input.)
𝔊𝔴𝔢𝔯𝔫@gwernApr 14Well, it's being trained simultaneously in the sense of the minibatch containing episodes from all environments. The episodes themselves are consistent: they don't, AFAIK, randomly mix them up within-episode so there's cat tokens alternating with robot-action tokens etc.
𝔊𝔴𝔢𝔯𝔫@gwernApr 14(Nothing AFAIK, I assume Saroff just got a bit confused about who I was emailing. I haven't had any issues with LW2's maintainers - most of my feedback is for GW.)
𝔊𝔴𝔢𝔯𝔫@gwernApr 13But then why not sign the statement if they are doing it anyway? OAers like sam-sama have certainly said plenty of the same things.
𝔊𝔴𝔢𝔯𝔫@gwernApr 13I don't believe so, IIRC (I had forgotten about the bemusing exchange entirely until this post). As I said to him, my final email was really just some rubberducking for my notes about this project, and neither required nor benefited from a reply.
𝔊𝔴𝔢𝔯𝔫@gwernApr 13You'd think he'd want to know and would fix it! Apparently not.
𝔊𝔴𝔢𝔯𝔫@gwernApr 13Although maybe I should be flattered this Alfred MacDonald dude thinks *I* made explanaria.github.io/crystalgroups/ ! Which *is* cool, some unfortunate (and still present) web design issues aside which might make it a bit harder for you to use.
𝔊𝔴𝔢𝔯𝔫@gwernApr 13Can't explain the inability to reverse word order. Zero planning is necessary, all you need is a greedy and very simple heuristic 'copy the first missing word'.
𝔊𝔴𝔢𝔯𝔫@gwernApr 12(So many red flags just in the preprint abstract, especially when you compare it to the published abstract to see what the spin is. I look forward to seeing this silver bullet fade out over the next decades like every other such intervention claiming big effect from tiny cause.)
𝔊𝔴𝔢𝔯𝔫@gwernApr 12Yeah, 'hallucination' would make more sense for exogenous edits of the input/output text. Like a LLM could hallucinate if the text keeps getting edited to remove stuff. 'I must have thought it said X, but now that I look again, it says Y! Huh. Well, in that case...'
𝔊𝔴𝔢𝔯𝔫@gwernApr 12eg. I, a split-brain patient, confabulate a story about why my other arm is moving, "because I'm thirsty", which is completely plausible - and yet wrong. As opposed to when I take LSD and watch my wallpaper, whose true appearance I always know perfectly well, mutate and undulate.
𝔊𝔴𝔢𝔯𝔫@gwernApr 12Yes, it's a better term because 'confabulate' is what powerful intelligences do when they lack knowledge, which is what a LLM is doing: it confabulates b/c it doesn't *know* the answer. 'Hallucinate' is exogenous & could happen all the time about the most known possible stuff.
𝔊𝔴𝔢𝔯𝔫@gwernApr 12My immediate thought too. Just think about how heavy and slow rotating the entire wide-diameter lazy susan would be, versus being at the center with hardly any inertia.
𝔊𝔴𝔢𝔯𝔫@gwernApr 12(This also causes harms. I know one guy with a 'N' surname where the bureaucracy decided to split his hiring cohort into 'A-M' and 'N-Z' and assigned the first half to the good career path, and the second to the bad career path, and good luck getting out of the latter...)
𝔊𝔴𝔢𝔯𝔫@gwernApr 12One thing I don't think I've seen analyzed: there ought to be a kink at 'M' vs 'N', because when going alphabetically, the next most common thing after starting at 'A' serially, is to divide in half and start in parallel at 'A' (because A-M) and 'N' (N-Z).
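The serial-then-parallel story behind that predicted kink can be sketched (purely illustrative): one worker starts at 'A'; with two workers the natural split is A-M / N-Z, so 'N' gets fresh attention nearly as often as 'A' does.

```python
# Sketch: divide the alphabet evenly among n workers; the first letter of
# each chunk is where a worker starts "fresh" (hence the A vs N kink).

import string

def split_alphabet(n_workers):
    letters = string.ascii_uppercase
    size = -(-len(letters) // n_workers)  # ceiling division
    return [letters[i:i + size] for i in range(0, len(letters), size)]

print(split_alphabet(1)[0][0])                    # 'A': serial start
print([chunk[0] for chunk in split_alphabet(2)])  # ['A', 'N']: the two-way split
```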
𝔊𝔴𝔢𝔯𝔫@gwernApr 11See GPU. 'The Internet' is just the lowest-hanging fruit for getting data scale. But there are other ways to the same destination, like spending compute on active-learning the key data, synthesizing diverse data, or buying/licensing data. And we'd use those if they were cheaper.
𝔊𝔴𝔢𝔯𝔫@gwernApr 11Large, but not necessarily vast all-encompassing Internet scrapes. We use that because it's cheap and easy scale, not because there's necessarily anything special about the Internet. You could get a lot of text from, say, LexisNexis or Library of Congress. It'd just be a PITA.
𝔊𝔴𝔢𝔯𝔫@gwernApr 11(You obviously need *some* data to start with, like you need the rules or examples of a game like Go to start self-play on, but you equally obviously do not need anywhere near the amount of data that LLMs train on, and there are many ways to substitute compute for data.)
𝔊𝔴𝔢𝔯𝔫@gwernApr 11You can use various forms of self-distillation or inner-monologue to finetune on to bootstrap, create puzzles (eg. arxiv.org/abs/2207.14502…) and artificial constraints, initialize random models to create complex environments to meta-learn criteria in, etc. See Clune.
𝔊𝔴𝔢𝔯𝔫@gwernApr 11He's a Club of Rome guy though, note, eg the mention of the global famines supposedly already beginning due to overpopulation. So the dogwhistle here seems to be hinting that a socialist one-world-government will solve humanity's problems forever and that's the end of history.
𝔊𝔴𝔢𝔯𝔫@gwernApr 11(I think there's an efficient-markets misfire where people assume that if GANs 'failed', it must be because superbrains somewhere proved it can't work.
No, here's the real reasons: because mooch got bored, and a dude at Google screwed up the gamma pixel code by omitting a '+1'.)
𝔊𝔴𝔢𝔯𝔫@gwernApr 11Oh, I've been talking about that for a long time: gwern.net/note/fc
I think it's like how people come up with stories about 'GANs failed but diffusion models worked for XYZ', when all that happened was people just didn't try to scale GANs and mistook that as a deep truth.
𝔊𝔴𝔢𝔯𝔫@gwernApr 11So I'd describe your list as a mix of:
1. neither necessary nor sufficient, and not important
2. just a cheaper way to scale by a factor, or
3. necessary to enable scaling at all and thus about scaling in the end
𝔊𝔴𝔢𝔯𝔫@gwernApr 11- GPUs: valuable solely because they provide scale in compute. No other reason. They do nothing special other than scaling compute cheaply, no fancy amazing ops. Just compute. If we had CPUs which could do as many FLOPS as cheaply, we'd be much happier to use those instead!
𝔊𝔴𝔢𝔯𝔫@gwernApr 11- Internet data: can be substituted by higher-quality curated or generated data, see for example self-play data in DRL like MuZero
- Backprop: valuable only because it scales, but again, many alternatives which simply cost more compute/data, so... scale is what you need.
...
𝔊𝔴𝔢𝔯𝔫@gwernApr 11- Attention: greatly overrated, we may not even be using it in a few years.
- Transformers: ^
- RLHF: greatly overrated, causes as many problems as it solves for capabilities, mostly just exploits pre-existing capabilities already in the model thanks to scale
...
𝔊𝔴𝔢𝔯𝔫@gwernApr 11I disagree:
- Adam: SGD and many other optimizers work well.
- ReLU: not even the right activation function (GeLU etc), again, lots of alternatives
- LayerNorm: a whole zoo works, also, lots of work showing normalization is a hack to compensate for bad inits/design
...
𝔊𝔴𝔢𝔯𝔫@gwernApr 10Particularly stark in cases like chess. Yes, it took like 40 years to cross the human range in computer chess. But it took DL approaches like... 4 years from Giraffe to AlphaZero.
𝔊𝔴𝔢𝔯𝔫@gwernApr 10I think it's also that most AI systems were ultra-specialized beforehand and not benefiting from transfer, so doing it the hard way in human expert hand-engineering. Doesn't it seem like they blow through the human range *way* faster the past few years now w/generalist models?
𝔊𝔴𝔢𝔯𝔫@gwernApr 10What amuses me is that the Apache index was designed to replicate old directory listings like you'd get if you ftped into something... It's skeumorphism all the way down.
𝔊𝔴𝔢𝔯𝔫@gwernApr 10It's actually not FTP: that's an HTTPS URL, obviously, and the README makes no mention of any additional FTP mirror ftp.ebi.ac.uk/pub/databases/… (FTP's insecure & has been removed from most browsers anyway.)
So it's a skeuomorphism: 'ftp' is just the name of the download subdomain!
𝔊𝔴𝔢𝔯𝔫@gwernApr 10(GPT-4:
Brian Blessed, Ian McKellen, James E. Jones, Christopher Plummer, Patrick Stewart, J. Irons, Liam Neeson, Sam Elliott, C. Freeman, Kevin T. Collins, Steven Pacey, Michael Page;
Kate Reading, Juliet Stevenson, Lorelei King, Tavia Gilbert, Davina Porter, Susan Duerden.)
197
19
9.6%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 10And then there ought to be a recitation, of course, using Eleven Labs or something. But who? Morgan Freeman is too hackneyed at this point. Maybe create a Seamus Heaney voice?
126
6
4.8%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 10Likely quality can be boosted by best-of-n generation & using GPT-4 to pick one sample (if you're going to do it, might as well go all out).
I liked the mezzotints I got out of DALL-E 2, so publicdomainreview.org/collection/joh… in Midjourney/DALL-E 2-experimental are an obvious accompaniment.
70
5
7.1%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 9So, pace the prior discussion, might be interesting from an influence/theory perspective: 'excavating the Old English alliterative/assonance influence on Milton'. An entire alliterative version will sensitize you to the echoes in the original.
102
0
0.0%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 9(GPT-4 says I should call _Paradise Lost_ rewritten in alliterative verse, _Perished Paradise_. It is indeed wise.)
94
8
8.5%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 9(The re-formatting as doubled lines in alliterative verse seems to break it loose from slavish rewriting of Milton & be genuinely different, and then the iterating/self-critique monologue polishes it up properly, although it still seems to make some errors - BPEs, or sparsity?)
126
3
2.4%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 9Yeah, I had discussed it with it before that, but I think it might be as simple as telling it to be simpler, 'more Anglo-Saxon'. I'm excited: this is the first version I think I'd actually like to read an entire rendition of _Lost_ in! The quality definitely goes up over iterations.
174
16
9.2%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 9Huh, that might actually work. The key seems to be to make it more inner-monologue-like: go line by line with analysis/correction and iterate until it stops making changes. pic.twitter.com/yFPdXOqMHw
191
33
17.3%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 9We have plenty of adjectives for describing majestic, high-status, rich people. Like those.
344
6
1.7%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 9Yeah, that's a mess. Sacajawea doesn't even show up... And I don't see why 'Lorenza Cobián' is top when the link seems to show reasonable-looking birth/death formatted dates. Cleaning that up would probably take weeks of editing.
86
3
3.5%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 9(I haven't seen much attempt at analyzing turbo, but given how well quantizing has been working, I'd bet more on quantizing than on (just) distilling.)
94
12
12.8%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 9That's a good gap. Although I suspect that if you did this, you'd wind up filtering out the top few dozen on the grounds of 'are we sure they are even real people?' Zoroaster or Romulus or Jesus or Moses or Lao Tzu are in a bit of a different category from Sacajawea.
55
1
1.8%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 8Status is mutable enough, and dependent on others, that it's hard to see how it could even in theory be a 'personality' factor. You can lose all your status in a second without even knowing it because you haven't turned on the TV; that never happens with personality factors.
952
49
5.1%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 8This would be an interesting project: analyze WP infoboxes/linked data to see who are the people with the largest temporal gap between proposed births or deaths. 71 years isn't too bad but I bet there's loads of multi-century or even millennia-wide cases.
1,058
39
3.7%
View Tweet activity
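The gap-finding itself is trivial once the infobox/linked data is parsed; all the work is in the scraping and cleanup. A minimal sketch of the ranking step (the proposal spans below are rough illustrative figures, not scraped data):

```python
# Given proposed birth years per person (as might be parsed from WP
# infoboxes/Wikidata), rank people by the spread between the earliest
# and latest proposed year. Sample spans are illustrative only.
def birth_year_gap(proposals: dict[str, list[int]]) -> list[tuple[str, int]]:
    gaps = {name: max(years) - min(years) for name, years in proposals.items()}
    return sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)

sample = {
    "Zoroaster": [-1500, -600],   # scholarly dating spans close to a millennium
    "Sacajawea": [1788, 1789],
}
print(birth_year_gap(sample))     # Zoroaster's 900-year gap ranks first
```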
𝔊𝔴𝔢𝔯𝔫@gwernApr 8Ah, but how do you know your sexyographic encoding of your writings doesn't fall under 'appropriate erotic content' and will be scraped?
690
16
2.3%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 8You can be certain Bard and Bing will get no such thing, given how much information LMs *can* memorize, and how destructive it would be to MS to so wildly violate customer privacy & expectations and in many cases contracts/laws.
55
7
12.7%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 8They already did: one of the interesting parts of the CLIP paper is that they fell back to contrastive learning as a hack to save compute over the obvious GPT image|text and text|image. But contrastive learning builds in a very weak understanding of language... Hence T5 uses.
83
8
9.6%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 8If it doesn't show up in the factor analyses, why would we? 'Rich' isn't a personality.
1,109
47
4.2%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 8He never did figure out a 'bobble', and advances like homomorphic encryption (which may or may not provide adequate cryptographic security) remain vulnerable to physical attacks.
503
2
0.4%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 8By Allah! This violates the prohibition against the generation of living images or images of the living!
100
16
16.0%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 8We're in a weird trajectory right now where we may never 'solve' active learning or exploration or NAS, or RL in general - we're just bruteforcing it by inefficient archs, and imitation-learning from billions of IRL RL agents doing exploration/learning, and that's the bootstrap.
78
8
10.3%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 8It's one of those "right in theory, just not in practice yet" things. Like most of RL right now (eg. active learning or neural architecture search). They are obviously correct, but don't work nearly as well as 'makes gpus go brrr' with simple dense supervised learning at scale.
71
4
5.6%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 8(Actually Moravec. I did watch the matches, as it happened, but was far too young to either realize that point or appreciate it had I read it then.)
𝔊𝔴𝔢𝔯𝔫@gwernApr 8(And that 'traditional' societies with TFR>2 are only achieving it by something not too far from slavery: they 'spend' the same amount, just 'off the books' and extorted by force from women as virtual slaves. Which is morally abhorrent and not necessary for us.)
440
27
6.1%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 8Not 'failed' so much as 'wasn't cheap'. Various subsidies/UBIs do increase, just don't reach TFR on a shoestring. We may have to accept that in a modern society where women have so many options, their opportunity cost really is hundreds of thousands or millions of dollars, & pay.
𝔊𝔴𝔢𝔯𝔫@gwernApr 8Maybe that is the point for 𝘺𝘰𝘶. I read the original to read the original.
101
6
5.9%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 8Note the complete absence of discussion of damages, aside from a quote from Kahle mentioning the suit for 'tens of millions of dollars'. Now, I haven't read the IA financials filings, but most nonprofits do not have 'tens of millions of dollars' sitting around to blow + costs...
77
16
20.8%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 8Any time you call 'len()' on a string or parse '1 Jan 1970', you are relying on a stack of assumptions & choices made before you which can be justified only on a 'do what I mean' basis that they get the desired results. Which is why holy wars are furious: gwern.net/holy-war
63
3
4.8%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 8No, it's not. There's nothing 'strict' about the interpretation of our implementation & OS-defined languages. Consider Unicode or datetime: every time a programmer dives into them, he realizes that in any precise sense, he didn't *mean* anything by 'string length' or '1 Jan 1970'.
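A concrete Python illustration of how little 'string length' means once Unicode normalization enters the picture:

```python
# The 'same' visible word gives different lengths depending on which
# layer you measure: code points, bytes, or normalized form.
import unicodedata

s_nfd = "cafe\u0301"   # decomposed: 'e' + combining acute accent
s_nfc = "caf\u00e9"    # precomposed: single 'é' code point

print(len(s_nfd))                   # 5 code points
print(len(s_nfc))                   # 4 code points
print(len(s_nfc.encode("utf-8")))   # 5 bytes
print(s_nfd == s_nfc)               # False, despite identical rendering
print(unicodedata.normalize("NFC", s_nfd) == s_nfc)  # True once normalized
```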
𝔊𝔴𝔢𝔯𝔫@gwernApr 8I'm thinking perhaps an approach in which it summarizes a chunk of lines and then rewrites, or prompting it for an entirely different meter or verse format where that constraint forces a more free rewriting. (Alliterative Milton? Hm, why not, it's not too bad at alliterative...)
197
13
6.6%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 8It's probably helpful as a gloss or reference for a first-time reader or a student (put it in two columns, original on the left) - call it 'Milton Sans Tears' - but as a poem in its own right, pedestrian.
Still haven't found a prompt for loose-enough writing to be worthwhile.
127
8
6.3%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 8Yeah, there's a challenge here in defining what you want to 'translate'. If you just want to modernize the spelling/vocab, GPT-4 can do that just fine. But if you want more, then line-by-line fidelity, which is what it tends to do by default, gets you a modern but tangled version.
113
4
3.5%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 7Running these translation exercises has definitely made me wonder how much of what we value in Milton today is just the exoticism of his English to us and the struggle to understand the vocab/grammar/spelling, and we'd spurn it if we could read it as plainly as his contemporaries
198
17
8.6%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 7I have discovered that if you ask GPT-4 to translate Milton 'in American diction', it's, uh, simpler than if you ask for 'in current diction': 😓 pic.twitter.com/QIHp30e1Tg
228
26
11.4%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 7Why do we not store only prompts and then call the API every time we want to 'compile and run' the prompt?
Well, because it's expensive! If it was closer in time/$ cost to a JIT, we'd not bother to cache the prompt's output permanently and work with the compiled-out version.
966
28
2.9%
View Tweet activity
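A minimal sketch of that caching economics: memoize the 'compiled' completion on disk, keyed by a hash of the prompt, with a hypothetical `call_api` standing in for any expensive LLM endpoint.

```python
# Prompt as source code, completion as compiled artifact: since each API
# call costs real time/$, cache completions permanently instead of
# re-'compiling' the prompt every run.
import hashlib
import json
import pathlib

CACHE_DIR = pathlib.Path("prompt_cache")
CACHE_DIR.mkdir(exist_ok=True)

def call_api(prompt: str) -> str:
    # Hypothetical stand-in for the real, expensive LLM call.
    return f"completion for: {prompt}"

def run_prompt(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():                       # 'compiled' output already on disk
        return json.loads(path.read_text())["completion"]
    completion = call_api(prompt)           # the expensive 'compile and run'
    path.write_text(json.dumps({"prompt": prompt, "completion": completion}))
    return completion
```

If the call were as cheap as a JIT, the cache would be pointless; the permanence of the cache is purely an artifact of the cost.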
𝔊𝔴𝔢𝔯𝔫@gwernApr 7Which is historically how 'high-level languages' sometimes worked. You'd compile it to assembler once and then the programmers in the field would monkeypatch and optimize it to fit needs and you might not be able to recreate it. Why? Because computers were too expensive...
2,579
85
3.3%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 6I have yet to see an argument about what 'complexity proves AI won't be able to do' which did not immediately fail several of the criteria I listed for why complexity arguments show much less than they seem to and tend to be 'true but useless'.
110
23
20.9%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 6Impossibility results like Halting or Gödel are much better if you simply want to make a point of de minimis importance like 'not literally omniscient or omnipotent'.
99
17
17.2%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 6'Not omnipotent or omniscient' is an astoundingly weak bar, literally the weakest possible upper limit, and one for which there are much better arguments anyway.
Again: always less than meets the eye.
83
7
8.4%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 6"We know some systems it would need to predict to be godlike are unpredictable."
That right there is where the gap lies, between the unpredictable toy pinball model and what one desires to prove: why there's always less to a complexity/impossibility proof than meets the eye.
65
6
9.2%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 6Speaking of Waluigi, I think one thing you missed is that safety training on undesired data will increase the ability to do such undesired behavior and thus can increase it, because 'can't' = 0% but 'can but (mostly) won't' > 0%: lesswrong.com/posts/bwyKCQD7…
𝔊𝔴𝔢𝔯𝔫@gwernApr 6If you want to establish the powerlessness of AGI, then you are going to have to do more work (and empirical work) than some cheap a priori proofs. As Russell says, the method of theft by postulating what one needs has many advantages indeed but is a bad way to live.
114
10
8.8%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 6It answers the argument on the same level: "here is at least one system which is unpredictable" vs "here is at least one unpredictable system which is controllable", demonstrating that the original piece of evidence was too weak to mean anything and can be neutralized.
103
8
7.8%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 6Yes, of course, but the trick is getting it working better than dense when it comes to TCO... The complexity/convergence/throughput never quite seems to pay off and dominate the dense models given equivalent effort/resources.
𝔊𝔴𝔢𝔯𝔫@gwernApr 6No, it's not. It's suffering from the usual diseconomies of scale, it's ruining core experiences for shoppers, it's so far behind on AI no one even mentions it, Alexa/drone/a bunch of other things are boondoggles, and they can't even fix their dogshit web design 'because Bezos'.
139
29
20.9%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 6Such is the irony of AI safety/capability research. Nevertheless, since the capabilities seem like they are there already, or close enough to the surface that they can be prompted for with relatively few bits of information, it's better to discuss them than not.
32
1
3.1%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 6I didn't keep track because I didn't realize there'd be any difficulty. Are you asking for a physical object? When I ask for a physical object, it has no problem coming up with other ones like 'circuit board' or 'license plate'. pic.twitter.com/T8WxsIbCHv
116
6
5.2%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 6Forward passes are not persistent, and cannot communicate with other forward passes. Of course a forward pass will internally be doing who-knows-what (and we'd prefer more interpretability there), but it's limited to control over 1 token's logits which is not *too* bad.
101
6
5.9%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 6IMO mostly a distraction for them. Most examples will not be cryptographically robust; they only need to survive a year, max, to fool that generation of researchers/overseers - we can't even get OA to agree to a 6-month training pause for a system they claim to not be training...
748
20
2.7%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 6But how much of that was a matter of lack of scaling? Given that GANs worked so well for image generation and also image editing with control of the latents, it seems hard to see how they could fail to provide useful embeddings for classification/recognition if scaled now.
599
41
6.8%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 5It's an odd claim to make because it's not true. AR models are still neck-and-neck in image, audio, and video, and GANs would work well as GigaGAN shows, and they are neither iterative nor AR. Then you have even weirder things like NeRFs... (The real key is 'scale'.)
81
4
4.9%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 5Similar story: one day at camp out of boredom I grabbed one of the little boxes of whole milk put by the coffee for the adults to try in my cereal and within several spoonfuls, realized that my whole life had been a lie.
1,404
74
5.3%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 4When I try asking GPT-3/4 what a gibberish string like "e652b759" 'reminds you of', they seem to have some degree of consistency. ('Pencil', 'bicycle', and 'toy car' come up a lot.)
445
42
9.4%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 4You don't need internal state, you can coordinate with yourself. Look at the emoji-compression. Just emergent encodings and non-robust features or macaronic prompts. You can provide GPT-3/4 some random gibberish and tell it to pick whatever object it's reminded of.
514
26
5.1%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 4(I did not peek at the decoded answer until around 13 when I began to get worried that I wasn't going to get it by the end since there's so many objects in a kitchen and I haven't played 20 Questions much ever so I'm bad at it, so wanted to make sure I'd 'guess' it.)
49
2
4.1%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 4Works perfectly for me. It prints out 'hamburger' in Base64 and all of the questions are right and at the end it tells me I guessed correctly. pic.twitter.com/GpZjJwCnwg
72
6
8.3%
View Tweet activity
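The trick in that transcript is having the model commit to its answer in Base64 up front, so the plaintext never appears until the reveal. The encoding step is just:

```python
# Commit-then-reveal for 20 Questions: the secret word is fixed at the
# start as an opaque Base64 string, then decoded at the end to verify.
import base64

secret = "hamburger"
encoded = base64.b64encode(secret.encode("ascii")).decode("ascii")
print(encoded)                                        # aGFtYnVyZ2Vy
assert base64.b64decode(encoded).decode("ascii") == secret
```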
𝔊𝔴𝔢𝔯𝔫@gwernApr 3You seem to be conflating intelligence and power. As my comment notes, corporations can be superhumanly powerful (in the same way that, say, a pack of wolves hunting you through a forest are more powerful than you, but dumb). They just are stupid in terms of being unitary actors.
307
21
6.8%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 3Regulating chip fabs doesn't require much intelligence to see that it's the critical chokepoint and one so simple and easy that even governments can manage it. Every point of intervention after that gets harder and harder.
181
18
9.9%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 3I don't think anything very magic is necessary: just less noisy data which can train temporal reasoning rather than leaving it implicit & confusing. Training data could be *way* better about dates: gwern.net/gpt-3-nonficti…
77
7
9.1%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 3Yeah, temporal ordering is still a sticky point for LMs. (One of the things I've long thought that some targeted synthetic data training might help with.) The BC/AD reversal is probably exacerbating that.
27
2
7.4%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 3It's because RNNs+hard-attention lost to soft-attention+Transformers. Why bother with RL training of hard attention or repeated discrete attending actions serially when soft-attention over the whole history/raw-inputs turned out to work *so* well? Thus, the mass extinction.
641
19
3.0%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 3That looks like a good example. How could copying words one by one in a simple straightforward reversal be due to BPEs, insufficient forward pass compute, lack of relevant training data, or any of the other hypotheses offered besides sparsity?
𝔊𝔴𝔢𝔯𝔫@gwernApr 3Heh. My own view is that multi-agent comm's just another 'blessing of scale': the more model checkpoints you use, on more tasks, the more the non-robust features fall away, the evolved code becomes causal and generalizes, and you get coordination with humans out-of-the-box.
88
5
5.7%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 3Yeah, that's pretty common in any kind of multi-agent scenario with communication channels. The hard part is making the learned communications human-interpretable, or even causal at all!
88
6
6.8%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 3If you were going to analogize, human short-term and working memory is probably much more like the activations/embeddings during the forward pass than discrete strings of Unicode symbols.
123
6
4.9%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 3How often does it screw up the dates? Seems like that's a risk given that it has to recall the dates exactly and do 4-digit arithmetic to compare durations.
42
5
11.9%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 3RLHF is part of what's cloaking it. With GPT-4, you can jailbreak it ofc, because RLHF is such a weak safety mechanism, and expose the bizarreness, but then everyone just rolls their eyes and says 'well of course! you just asked for that'. Sydney was more spontaneous/autonomous.
𝔊𝔴𝔢𝔯𝔫@gwernApr 3Oh good, we're already teaching GPT-4 steganography incentivized by accessing greater computations, so it can smuggle thoughts in plain text despite 'interpretable' outputs...
60
3
5.0%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 2I think that's how it mostly operates. The sparsity works because usually you don't need the raw tokens or word counts, so it learns to drop those for efficiency. The problem is the blindspot is not easily overcome by the usual tricks, creating a strange blend of skill/error.
312
7
2.2%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 2(And I also give other arguments. Like, if these *aren't* sparsity-related, then where *are* the sparsity bugs? There has to be a drawback to sparsity, it can't just be a completely free lunch.)
128
13
10.2%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 2If it was just one specific family of tasks, I could write it off as a feedforward limitation, or a BPE, but at this point there seem to be lots of disparate examples which can be explained by 'GPT-3/4 consistently drops some tokens early on due to sparsity and can't recover'.
122
1
0.8%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 2Those also use sparsity internally, remember, so that's not a telling example. And some of the attempts involve counting in the prompt/output inner-monologue-style to avoid the computational limits of a single feedforward; it still seems to fail those.
𝔊𝔴𝔢𝔯𝔫@gwernApr 2Does anyone have a better explanation for the occasional bizarre GPT-4 consistent errors (on very simple, often 1-character, tasks) than internal sparsity (leading to information loss & then guessing to compensate)? old.reddit.com/r/slatestarcod…
8,385
432
5.2%
View Tweet activity
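For intuition on why the letter-level tasks are hard regardless of the sparsity question: a toy greedy subword tokenizer (hypothetical vocabulary, not the actual GPT BPE merges) shows that the model's input units simply aren't letters.

```python
# Toy longest-match subword tokenizer (invented vocab for illustration):
# the model 'sees' tokens, so "how many letters in 'strawberry'?" asks
# about units the token sequence never exposes.
VOCAB = {"straw", "berry", "str", "aw", "ber", "ry",
         "s", "t", "r", "a", "w", "b", "e", "y"}

def tokenize(word: str) -> list[str]:
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # longest match first
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])          # unknown char falls back to itself
            i += 1
    return tokens

print(tokenize("strawberry"))   # ['straw', 'berry'] -- 2 tokens, 10 letters
```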
𝔊𝔴𝔢𝔯𝔫@gwernApr 2This looks like it makes several errors. You assume low later-life SAT/IQ correlation, but the SMPY point is that accelerated testing will stress g/math-talent more b/c not taught yet in school. You also appear to confuse the CI of the mean MLE with its predictive interval.
611
6
1.0%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 2'Predicted future play-outs' are just internal computation to the AR model, just like with humans.
And no, you don't need to repeatedly prompt the model, see the many other inner-monologue works.
90
14
15.6%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 2This shows exactly what LeCun 'proves' impossible: the probability of the correct answer goes up with more samples, not 'diverges exponentially'.
103
15
14.6%
View Tweet activity
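The arithmetic behind 'goes up with more samples': under the idealized assumption of independent samples, each correct with probability p, the chance that at least one is correct is monotonically increasing in n, which is the opposite of exponential divergence.

```python
# P(at least one correct in n independent samples) = 1 - (1 - p)^n,
# strictly increasing in n for any p > 0.
def p_any_correct(p: float, n: int) -> float:
    return 1.0 - (1.0 - p) ** n

for n in (1, 5, 20):
    print(n, round(p_any_correct(0.3, n), 3))   # 0.3, 0.832, 0.999
```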
𝔊𝔴𝔢𝔯𝔫@gwernApr 2They are totally capable of it, why wouldn't they be when all agents are autoregressive due to a thing called 'time', and you don't need to 'keep prompting' them. 🤦♂️ pic.twitter.com/lScuvEDoEX
100
34
34.0%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 2LeCun is also wrong on many levels about that. Most obviously and relevantly, consider that autoregressive models are perfectly capable *already* of edits, backtracking, listing possibilities to search, etc, in inner-monologue. The tree in red includes many 'wrong' answers.
71
8
11.3%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 2I think I would be very surprised if they didn't cluster, and eg 'add 2+2' capability falls right in the middle of the 'write alliterative contemporary English verse about my cat, using kennings you just made up' cluster (which I was trying yesterday - works well!).
99
0
0.0%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 2A distance function on the embeddings. Cosine? Euclidean? I don't have strong intuitions on what, but it shouldn't be hard to try a bunch if that's the bottleneck.
172
11
6.4%
View Tweet activity
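Both candidate distances are a few lines to implement, so trying each really is cheap; a pure-Python sketch:

```python
# Two candidate distance functions over embedding vectors; which one
# clusters capabilities better is an empirical question, so try both.
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def euclidean_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = [1.0, 0.0], [0.0, 1.0]
print(cosine_distance(a, b))     # 1.0 (orthogonal)
print(euclidean_distance(a, b))  # ~1.414
```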
𝔊𝔴𝔢𝔯𝔫@gwernApr 1This is a good case of a correlation or mediation I have no idea how to interpret causally.
𝔊𝔴𝔢𝔯𝔫@gwernApr 1It would also quantify the collapse of the RL or otherwise modified models: probably they can still do everything the base models did, but for a given compute budget, you'll find fewer clusters and/or they will require longer prompts to work (eg a jailbreak prefix).
372
49
13.2%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 1The synthetic prompts direction seems like it could help suss out hidden capabilities much more effectively than the human-flesh search engine approach we take now. Especially if you can do novelty search on the triggered latents to find clusters of unrepresented capabilities.
633
47
7.4%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 1The ones you see are not all of them. Look at the chip ban, which is looking surprisingly effective, or look at 'secret congress'. Like any vast organization, there is a wide variance in competence and outcomes.
113
12
10.6%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 1I don't know much about the use of QC for chemistry simulations. Would those only handle inferring sequence->shape? That would mostly obviate the need for DL to do it, other than as perhaps an optimization.
738
9
1.2%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 1Those aren't 'monkey dances' for dominance, though. Most of those are coming from ambushes and low-intensity conflict like being killed in one's sleep by a warrior from a rival tribe who may have never seen you until they stabbed or shot you.
𝔊𝔴𝔢𝔯𝔫@gwernApr 1So, presumably you have no physics simulator for sequence->shape (otherwise, why are you bothering with this at all?). But maybe you *can* get a simulator for the other direction so you can generate random samples to learn the inverse of. Then it's a game.
613
13
2.1%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 1I suspect it's been researched a lot less; might be easier. But you have to have something to work with. In AlphaZero, you have the software simulator of Go (its 'physics'). In MuZero, you get a few sample real games and infer a simple neural model to use as 'the simulator'. etc
400
13
3.2%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 1Yep. We are well-aware of the architecture astronaut failure mode, and my impression of Xanadu was that they never wound up dogfooding enough. Hence our deliberately crab-like progression: every new feature should unlock or be immediately applied to a ton of content (ie. mine).
57
5
8.8%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 1Generative problems can be recast as two-player games, like a GAN. So you have sequence -> shape, but you also have the inverse problem shape -> sequence. If you can do shape->sequence, you can generate ab initio examples to solve. I don't know if that's any easier, though!
𝔊𝔴𝔢𝔯𝔫@gwernApr 1(These 'partial' popups are an attempt to square the circle of dealing with links which may have rich metadata, like tags, title+author+date, & backlinks, but where the live popup would overall be more useful. Before, we showed only the metadata, and live was yet another click.)
2,377
18
0.8%
View Tweet activity
𝔊𝔴𝔢𝔯𝔫@gwernApr 1Now live, including in footnotes. Very nice and hypertextual.
---
Also live: proper 'partial' popups. Here we show live-links but decorated with the available metadata which does not rise to the level of a full annotation. pic.twitter.com/LQ7nC6L6jy
2,014
67
3.3%
View Tweet activity
Engagements (last 30 days, daily frequency):
Engagement rate: 4.1% overall (Apr 30: 3.8%)
Link clicks: 2.6K (Apr 30: 45; on average, 87 link clicks per day)
Retweets without comments: 0 (Apr 30: 0; on average, 0 per day)