Gwern on GPT-3

On GPT-3: “Language Models are Few-Shot Learners”, Brown et al 2020 (poems, compare my finetuned GPT-2 poetry; random samples; OpenAI API with real-world demos)

Learning to learn. OA releases the long-awaited followup to GPT-2, one model to rule them all: a 117× larger 175b-parameter model with far more powerful language generation, which lets it solve a wide variety of problems from arithmetic to English translation to unscrambling anagrams to SAT analogies—purely from being prompted with text examples, without any specialized training or finetuning whatsoever, merely next-word prediction training on a big Internet text corpus. This implies GPT-3’s attention mechanisms serve as “fast weights” that have “learned to learn” by training on sufficiently varied data, forcing it to do more than just learn ordinary textual relationships. Like OpenAI’s Jukebox just weeks ago, the announcement of GPT-3 appears to have sunk almost without a trace, so I will go into more depth than usual.
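To make the “prompted with text examples” point concrete, here is a minimal sketch (mine, not OpenAI’s interface) of what a few-shot prompt for one of the paper’s tasks, anagram unscrambling, looks like; `query_model` is a hypothetical placeholder for whatever completion backend one has access to:

```python
def query_model(prompt: str) -> str:
    """Hypothetical placeholder for a text-completion call (an API endpoint
    or a locally loaded language model); returns the model's continuation."""
    raise NotImplementedError("plug in a completion backend here")

# Task description + a few demonstrations + the test case, all as plain text:
# no gradient updates, no finetuning, just next-word prediction on this context.
few_shot_prompt = (
    "Unscramble the letters into an English word.\n"
    "\n"
    "Scrambled: kciyrt\n"
    "Word: tricky\n"
    "\n"
    "Scrambled: nanaba\n"
    "Word: banana\n"
    "\n"
    "Scrambled: aqoupe\n"
    "Word:"
)

if __name__ == "__main__":
    print(few_shot_prompt)                       # exactly what the model sees
    # completion = query_model(few_shot_prompt)  # expected continuation: " opaque"
```

The entire “task specification” lives in the context window; this is what the paper means by few-shot learning at runtime rather than via weight updates.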

“Attacks only get better.” 2 years ago, GPT-1 was interestingly useful pretraining and adorable with its “sentiment neuron”. 1 year ago, GPT-2 was impressive with its excellent text generation & finetuning capabilities. This year, GPT-3 is scary because it’s an obsolete model, small & shallow compared to what’s possible, with a simple uniform architecture trained in the dumbest way possible (unidirectional prediction of the next text token) on a single impoverished modality (random Internet HTML text dumps) on tiny data (fits on a laptop), sampled in a dumb way, and yet, the first version already manifests crazy runtime meta-learning—and the scaling curves still are not bending! The samples are also better than ever, whether it’s GPT-3 inventing new penis jokes or writing (mostly working) JavaScript tutorials about rotating arrays.

Not the whole picture, but a big part. Does it set SOTA on every task? No, of course not. But the question is not whether we can lawyerly find any way in which it might not work, but whether there is any way in which it might work. And there are many ways it might work better (see the “Limitations” section for just a few). Does GPT-3 do anything like steer a robot around SF shooting lasers and rockets at humans? No, of course not. It is ‘just’ a text prediction model, an idiot savant of text; but an idiot savant, we should remember, is only a genetic mutation or bit of brain damage away from a normal human. If RL is the cherry on top of the supervised learning frosting on the unsupervised learning cake, well, the bakers are getting pretty good.

Scaling still working. I was surprised, as I had expected closer to 100b parameters, and I thought that the performance of CTRL / Meena / MegatronLM / T5 / Turing-NLG / GPipe suggested that, the scaling papers notwithstanding, the scaling curves had started to bend and that by 100b it might be hard to justify further scaling. However, GPT-3 hits twice that without noticeable change in scaling factors: its scaling continues to be roughly logarithmic/power-law, as it was for much smaller models & as forecast, and it has not hit a regime where gains effectively halt or start to require increases vastly beyond feasibility. That suggests that it would be both possible and useful to head to trillions of parameters (which are still well within available compute & budgets, requiring merely thousands of GPUs & perhaps $10–$100m budgets, assuming no improvements, which of course there will be; see Hernandez & Brown 2020 etc in this issue), and eyeballing the graphs, many benchmarks like the Winograd-schema WinoGrande would fall by 10t parameters.
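For intuition about what “roughly logarithmic/power-law” scaling implies for extrapolation, here is a toy calculation; the constant and exponent below are illustrative stand-ins loosely in the spirit of published language-model scaling-law fits, not the actual GPT-3 numbers:

```python
# Toy power-law scaling curve: L(N) = (N_c / N) ** alpha.
# n_c and alpha are illustrative values, not an exact fitted result.

def loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Validation loss modeled as a power law in parameter count N."""
    return (n_c / n_params) ** alpha

# GPT-2-sized, GPT-3-sized, and hypothetical 1t/10t-parameter models:
for n in (1.5e9, 1.75e11, 1e12, 1e13):
    print(f"N = {n:8.2e} params -> loss ~ {loss(n):.3f}")

# Each 10x in parameters removes a roughly constant fraction of the remaining
# loss: a straight line on log-log axes, which is why "no noticeable change in
# scaling factors" at 175b parameters is the interesting observation.
```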

GPT-3 is an extraordinarily expensive model by the standards of machine learning: it is estimated that training it may require the annual cost of more machine learning researchers than you can count on one hand (~$5m), up to $30 of hard drive space to store the model (500–800GB), and multiple pennies of electricity per 100 pages of output (0.4 kWh). Researchers are concerned about the prospects for scaling: can ML afford to run projects which cost more than 0.1 milli-Manhattan-Projects? Would it be worthwhile, even if it represented another large leap in AI capabilities, to spend up to 10 milli-Manhattan-Projects to scale GPT-3 100x to achieve human-like performance in some domains? Many researchers feel that such a suggestion is absurd and refutes the entire idea of scaling machine learning research further, and that the field would be more productive if it instead focused on research which can be conducted by an impoverished goatherder on an old laptop running off solar panels. Nevertheless, I think we can expect further scaling.
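The storage and electricity figures are easy to sanity-check with back-of-envelope arithmetic; the unit prices below are my own rough assumptions, not numbers from the paper:

```python
# Rough sanity check of the storage & electricity figures; prices are assumed.
n_params = 175e9            # GPT-3 parameter count
hdd_usd_per_gb = 0.04       # assumed commodity hard-drive price
kwh_per_100_pages = 0.4     # electricity per ~100 pages of output (from the text)
usd_per_kwh = 0.12          # assumed retail electricity price

for bytes_per_param in (2, 4):          # fp16 vs fp32 weights
    gb = n_params * bytes_per_param / 1e9
    print(f"{bytes_per_param}-byte weights: ~{gb:.0f} GB, ~${gb * hdd_usd_per_gb:.0f} of disk")

print(f"100 pages of output: ~${kwh_per_100_pages * usd_per_kwh:.2f} of electricity")
# ~350-700 GB of weights (the quoted 500-800 GB presumably also counts
# checkpoints or optimizer state), a few tens of dollars of disk, and a few
# cents of electricity per 100 pages.
```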

“Extrapolating the spectacular performance of GPT-3 into the future suggests that the answer to life, the universe and everything is just 4.398 trillion parameters.”

Geoff Hinton, joking around—right?

We don’t know how to train NNs. As I keep saying, “NNs are lazy” and can do far more than we make them do when we push them beyond easy answers & cheap shortcuts. The bitter lesson is: the harder and bigger, the better. (Besides GPT-3, one could mention recent progress in semi-supervised learning & the model-based DRL renaissance.)

Blessings of scale: stability→generalization→meta-learning. GPT-3 is hamstrung by its training & data, but DL enjoys an unreasonably effective blessing of dimensionality—just simply training a big model on a lot of data induces better properties like meta-learning without even the slightest bit of that architecture being built in; and in general, training on more and harder tasks creates ever more human-like performance, generalization, and robustness. The GPT models, and iGPT for images, show that simply scaling up models & datasets without any supervision produces results competitive with the best (and most complex) alternatives, using the same simple architecture. OA5 does not just scale to, but stabilizes at, minibatches of millions due to gradient noise. OA5-like, BigGAN stabilizes at large-scale image datasets like JFT-300M & benefits from unusually large minibatches, while classifier CNNs like BiT transfer & robustify with human-like errors, multimodal learning produces better representations on less data (eg ViLBERT / VideoBERT, motivating OA’s interest), and RNNs can predict videos. AlphaStar reaches human-level with hundreds of competing self-players to cover possible strategies. Imitation-learning DRL like MetaMimic generalizes at hundreds of tasks to train a deep net. Disentanglement emerges in StyleGAN with sufficiently deep w embeddings, or in relational networks / GQN / Transformers with enough samples to force factorization. Training Dactyl on millions of domain randomizations induced similar implicit meta-learning where, during each runtime invocation, the RNN probes its environment and encodes its understanding of robot hand control into its hidden state; and DD-PPO outperforms classical robot planners by scaling 2 orders. Or in Procgen, training on hundreds of levels trains agents individually, but at thousands of levels, they begin to generalize to unseen levels. AlphaZero demonstrated truly superhuman Go without ‘delusions’ just by training a bigger model on a richer signal & pro-level play without any search—and MuZero, for that matter, demonstrated that just training an RNN end-to-end to predict a reward on enough data is enough to obsolete even AlphaZero and learn tree search implicitly (but better). And on and on.

The scaling hypothesis looks increasingly plausible: once we find a scalable architecture like self-attention or convolutions, we can simply train ever larger NNs, and ever more sophisticated behavior will emerge naturally as the easiest way to optimize for all the tasks & data.

Keeping track. GPT-3 in 2020 makes as good a point as any to take a look back on the past decade. In 2010, one could easily fit everyone in the world who genuinely believed in deep learning into a moderate-sized conference room (assisted slightly by the fact that 3 of them were busy founding DeepMind). Someone interested in machine learning in 2010 might have read about some stuff in recognizing hand-written digits using all of 1–2 million parameters, or some modest neural tweaks to standard hidden Markov model voice-recognition. In 2010, who would have predicted that over the next 10 years, deep learning would undergo a Cambrian explosion causing a mass extinction of alternative approaches throughout machine learning, that models would scale up to 175,000 million parameters, and that these enormous models would just spontaneously develop all these capabilities, aside from a few diehard connectionists written off as willfully-deluded old-school fanatics by the rest of the AI community (never mind the world), such as Moravec, Schmidhuber, Sutskever, Legg, & Amodei?

Hindsight is 20/20. Even in 2015, the scaling hypothesis seemed highly dubious: you needed something to scale, after all, and it was all too easy to look at flaws in existing systems and imagine that they would never go away and progress would sigmoid any month now, soon. Like the genomics revolution, where a few far-sighted seers extrapolated that the necessary n for GWASes would increase exponentially & deliver powerful PGSes soon, while sober experts wrung their hands over “missing heritability” & the miraculous complexity of biology & scoffed about how such n requirements proved GWAS was a failed paradigm, the future arrived at first slowly and then quickly. Yet, here we are: all honor to the fanatics, and shame and humiliation to the critics! If only one could go back 10 years, or even 5, to watch every AI researcher’s head explode reading this paper… Unfortunately, few heads appear to be exploding now, because human capacity for hindsight & excuses is boundless (“I can get that much with finetuning, anyway I predicted it all along, how boring”) and “there is no fire alarm”. (If you are still certain that there is near-zero probability of AGI in the next few decades, why? Did you predict—in writing—capabilities like GPT-3? Is this how you expect AI failure to look in the decades beforehand? What specific task, what specific number, would convince you otherwise? How would the world look different than it does now if these crude prototype insect-brain-sized DL systems were not on a path to success?)

Authority without accountability. What should we think about the experts? Projections of failure were made by eminent, respectable, serious people. They spoke in considered tones of why AI hype was excessive and might trigger an “AI winter”, and the fundamental flaws of fashionable approaches and why brute force could not work. These statements were made routinely in 2014, 2015, 2016… And they were wrong. I am aware of few issuing a mea culpa or reflecting on it. It is a puzzling failure, and I’ve reflected on it before .

Phatic, not predictive. There is, however, a certain tone of voice the bien pensant all speak in, whose sound is the same whether right or wrong; a tone shared with many statements in January to March of this year; a tone we can also find in a 1940 Scientific American article authoritatively titled, “Don’t Worry—It Can’t Happen” , which advised the reader to not be concerned about it any longer “and get sleep”. (‘It’ was the atomic bomb, about which certain scientists had stopped talking, raising public concerns; not only could it happen, the British bomb project had already begun, and 5 years later it did happen.)

The iron law of bureaucracy: Cathedral gothic. This tone of voice is the voice of authority .

The voice of authority insists on calm, and people not “panicking” (the chief of sins).

The voice of authority assures you that it won’t happen (because it can’t happen).

The voice utters simple arguments about why the status quo will prevail, and considers only how the wild new idea could fail (and not all the possible options).

The voice is not, and does not deal in, uncertainty; things will either happen or they will not, and since it will not happen, there is no need to take any precautions (and you should not worry because it can’t happen).

The voice does not believe in drawing lines on graphs (it is rank numerology).

The voice does not issue any numerical predictions (which could be falsified).

The voice will not share its source code (for complicated reasons which cannot be explained to the laity).

The voice is opposed to unethical things like randomized experiments on volunteers (but will overlook the insult).

The voice does not have a model of the future (because a model implies it does not already know the future).

The voice is concerned about its public image (and unkind gossip about it by other speakers of the voice).

The voice is always sober, respectable, and credentialed (the voice would be pleased to write an op-ed for your national magazine and/or newspaper).

The voice speaks, and is not spoken to (you cannot ask the voice what objective fact would change its mind).

The voice never changes its mind (until it does).

The voice is never surprised by events in the world (only disappointed).

The voice advises you to go back to sleep (right now).

When someone speaks about future possibilities, what is the tone of their voice?

