GPT-3: The New Mighty Language Model from OpenAI
Pushing Deep Learning to the Limit with 175B Parameters
Introduction
OpenAI recently released a pre-print of its new mighty language model, GPT-3. It is a much bigger and better version of its predecessor, GPT-2. In fact, with close to 175B trainable parameters, GPT-3 is much bigger than anything else out there. Here is a comparison of the number of parameters of recent popular pre-trained NLP models; GPT-3 clearly stands out.
What’s New?
After the success of BERT, the field of NLP is increasingly moving in the direction of creating pre-trained language models, trained on huge text corpora (in an unsupervised way), which are later fine-tuned on specific tasks such as translation, question answering, etc. using much smaller task-specific datasets.
While this type of transfer learning obviates the need for task-specific model architectures, you still need task-specific datasets, which are a pain to collect, to achieve good performance.
Humans, by contrast, learn in a very different way and have the ability to pick up a new task from very few examples. GPT-3 aims to address this specific pain point: it is a task-agnostic model that needs zero to very few examples to do well and achieve close to state-of-the-art performance on a number of NLP tasks.
Terminologies
Before we deep dive, it may be useful to define some commonly used terminologies:
- NLP Tasks: These are tasks which have something to do with human languages, for example Language Translation, Text Classification (e.g. sentiment extraction), Reading Comprehension, and Named Entity Recognition (e.g. recognizing person, location, and company names in text)
- Language Models: These are models which can predict the most likely next words (and their probabilities) given a set of words (think of something like Google query auto-complete). It turns out these types of models are useful for a host of other tasks, even though they are trained on the mundane objective of next-word prediction (see the short sketch after this list)
- Zero / One / Few shot learning: Refers to a model's ability to learn a new task after seeing zero / one / a few examples of that task
- Transfer Learning: Refers to the notion in Deep Learning where you train a model for one task (for example object detection in images) and then leverage and build upon that for a different task (for example assessing MRI scans). After massive success in Computer Vision, it is in vogue in NLP these days.
- Transformer Models: A family of deep learning models, used primarily in NLP, which forms the basic building block of most state-of-the-art NLP architectures these days.
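To make the "Language Models" bullet above concrete, here is a minimal sketch of next-word prediction. It uses the Hugging Face transformers library and the publicly available GPT-2 weights as a stand-in, since GPT-3's weights are not released; the prompt is just an illustrative example.

```python
# Minimal sketch: next-word prediction with a pre-trained causal language model.
# Assumes `pip install torch transformers`; GPT-2 is used as a stand-in,
# since GPT-3's weights are not publicly available.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(input_ids)[0]  # shape: (1, seq_len, vocab_size)

# Probability distribution over the vocabulary for the *next* token
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)

for prob, token_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode([token_id.item()]):>10s}  p={prob.item():.3f}")
```

The same "predict the next word" interface is all GPT-3 exposes; everything else described below is built on top of it.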
The Approach
The model is built using the standard concepts of Transformers, Attention, etc. and trained on the typical Common Crawl, Wikipedia, Books and some additional data sources. A lot of things (pre-training, model, data) are similar to GPT-2, but everything (model size, data size, training time) is just a lot bigger. In fact, its humongous size is what drives most of the benefits of the model.
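As a refresher on the attention building block mentioned above, here is a minimal, framework-agnostic sketch of scaled dot-product attention, the core operation repeated across GPT-3's attention layers. The array sizes are toy values for illustration, not GPT-3's actual dimensions.

```python
# Minimal sketch of scaled dot-product attention, the core Transformer operation.
# Shapes are illustrative only; GPT-3 stacks 96 such (multi-head, masked) layers
# with far larger dimensions.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of value vectors

# Toy example: 4 tokens with 8-dimensional representations
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```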
The following graph shows the gain in accuracy for various zero / one / few-shot tasks as a function of the number of model parameters; clearly, the major gains come from the scaled-up size.
Most of the numbers involved in the model are so huge (for example 96 attention layers, a batch size of 3.2M, 175B parameters) that they are unlike anything in the past. The model is ~10x larger in number of parameters than the next closest thing (Microsoft Turing-NLG with 17B parameters).
There is no need to do gradient / parameter updates (fine-tuning) to use the GPT-3 model for various tasks. One can just interact with the model using natural language and/or provide some examples of the task at hand, and the model will do it!
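To illustrate this in-context, few-shot style of use, here is a hedged sketch of what such a prompt looks like (the English-to-French examples follow the format shown in the GPT-3 paper). Since GPT-3 itself is not publicly downloadable, the sketch feeds the prompt to GPT-2 via the transformers library purely to show the mechanics; GPT-2 will not match GPT-3's few-shot quality, and no fine-tuning or gradient updates are involved.

```python
# Sketch of few-shot prompting: the task is described entirely inside the prompt,
# and the model is only asked to continue the text. No gradient updates happen.
# GPT-2 is used as a stand-in, since GPT-3's weights are not public.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)

completion = generator(prompt, max_length=60, num_return_sequences=1)
print(completion[0]["generated_text"])
```

The key point is that the "training" for the task lives entirely in the prompt text, which is exactly how the zero / one / few-shot settings in the paper differ: zero, one, or a few worked examples are placed before the query.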
What Does All this Mean?
Not requiring large, custom, task-specific datasets, in addition to not requiring task-specific model architectures, is a huge step in the direction of making cutting-edge NLP more accessible.
While GPT-3 delivers great performance on a lot of NLP tasks (for example word prediction and common sense reasoning), it doesn't do equally well on everything. For instance, it doesn't do great on things like text synthesis and some reading comprehension tasks. In addition, it suffers from biases in its training data, which may lead the model to generate stereotyped or prejudiced content. So there is more work to be done here.
On top of all this, the huge size of GPT-3 makes it out of reach for almost everyone except a select few companies and research labs in the world. As per the authors, the model is very versatile and contains a very wide range of skills not needed for specific tasks, so there might be scope for creating smaller, more manageable task-specific models using the concept of distillation (a rough sketch follows below).
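The paper does not spell out how such distillation would be done, so the following is only a rough, hypothetical illustration of the general technique: a smaller student model is trained to match the softened output distribution of a larger teacher model. The temperature, loss weighting, and toy tensor sizes are arbitrary placeholders, not values from the GPT-3 paper.

```python
# Rough sketch of knowledge distillation: a small "student" is trained to match
# the softened output distribution of a large "teacher".
# All values and shapes are placeholders, not anything from the GPT-3 paper.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened distributions
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the true labels / next tokens
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy example: a batch of 4 predictions over a vocabulary of 10 tokens
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```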
It will be exciting to see how this evolves in the future.