GPT-3: The New Mighty Language Model from OpenAI


Pushing Deep Learning to the Limit with 175B Parameters

Introduction

OpenAI recently released a preprint describing its new mighty language model, GPT-3. It is a much bigger and better version of its predecessor, GPT-2. In fact, with close to 175B trainable parameters, GPT-3 is much bigger than anything else out there. Compared with recent popular pre-trained NLP models (the largest GPT-2 has 1.5B parameters, and Microsoft's Turing-NLG has 17B), GPT-3 clearly stands out.

What’s New?

After the success of BERT, the field of NLP has increasingly moved toward creating pre-trained language models, trained on huge text corpora (in an unsupervised way), which are later fine-tuned on specific tasks such as translation and question answering using much smaller task-specific datasets.

While this type of transfer learning obviates the need for task-specific model architectures, you still need task-specific datasets, which are a pain to collect, to achieve good performance.
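To make this pipeline concrete, here is a minimal sketch of the fine-tuning step, assuming the Hugging Face transformers library, a pre-trained BERT checkpoint, and a toy two-example sentiment dataset (a real task would need thousands of labelled examples):

```python
# A minimal sketch of the fine-tuning step described above, assuming the
# Hugging Face transformers library and a pre-trained BERT checkpoint.
# The two-example "dataset" is a toy placeholder; a real task would use
# thousands of labelled examples.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # e.g. negative / positive sentiment
)

texts = ["great movie, loved it", "utterly boring"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for step in range(3):  # a few gradient steps; real fine-tuning runs epochs
    out = model(**batch, labels=labels)  # returns a loss when labels are given
    optimizer.zero_grad()
    out.loss.backward()
    optimizer.step()
    print(step, out.loss.item())
```

The point of the recipe is that only this last, cheap step is task-specific; the expensive unsupervised pre-training is done once and reused everywhere.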

Humans, by contrast, learn in a very different way and can pick up a new task from very few examples. GPT-3 aims to address this specific pain point: it is a task-agnostic model that needs zero to very few examples to do well, achieving close to state-of-the-art performance on a number of NLP tasks.

Terminology

Before we dive in, it may be useful to define some commonly used terms:

  • NLP Tasks: Tasks that involve human language, for example language translation, text classification (e.g. sentiment extraction), reading comprehension, and named entity recognition (e.g. recognizing person, location, or company names in text)
  • Language Models: Models that can predict the most likely next words (and their probabilities) given a sequence of words (think Google query auto-complete). It turns out these models are useful for a host of other tasks, even though they are trained on mundane next-word prediction (see the sketch after this list)
  • Zero / One / Few-shot Learning: A model's ability to learn a new task after seeing zero, one, or a few examples of that task
  • Transfer Learning: The notion in deep learning of training a model for one task (for example, object detection in images) and then leveraging and building on it for a different task (for example, assessing MRI scans). After massive success in computer vision, it is in vogue in NLP these days
  • Transformer Models: A family of deep learning models, used primarily in NLP, that form the basic building block of most state-of-the-art NLP architectures these days
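To make the language-model definition above concrete, here is a minimal sketch of next-word prediction. GPT-3 itself is not publicly downloadable, so this sketch assumes the Hugging Face transformers library and the small public GPT-2 checkpoint as a stand-in:

```python
# A minimal sketch of next-word prediction with a small public language
# model (GPT-2). GPT-3 itself is not publicly downloadable, so GPT-2
# stands in here to illustrate the idea.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Probability distribution over the vocabulary for the *next* token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
for p, i in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(i.item()):>10s}  {p.item():.3f}")
```

For a prompt like the one above, the top candidates should include " Paris" with high probability; this mundane objective is exactly what GPT-3 scales up.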

The Approach

The model is built using the standard concepts of Transformers, attention, etc., and trained on the typical Common Crawl, Wikipedia, and Books data plus some additional sources. A lot of things (pre-training, model, data) are similar to GPT-2, but everything (model size, data size, training time) is just a lot bigger. In fact, its humongous size is what drives most of the benefits of the model.

The following graph shows the gain in accuracy on various zero / one / few-shot tasks as a function of the number of model parameters; clearly, the major gains are achieved due to the scaled-up size.

(Figure: accuracy vs. number of model parameters; source: the GPT-3 paper)

Everything about the model is huge, for example 96 attention layers, a batch size of 3.2M tokens, and 175B parameters, unlike anything in the past. The model is roughly 10x larger, in number of parameters, than the next closest thing (Microsoft's Turing-NLG, with 17B parameters).
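As a back-of-the-envelope sanity check on that 175B figure, a common approximation for a decoder-only Transformer is about 12 · n_layers · d_model² parameters; plugging in the configuration reported in the paper (96 layers, hidden size 12288) lands within a rounding error of the headline number:

```python
# Back-of-the-envelope parameter count for a decoder-only Transformer,
# using the common ~12 * n_layers * d_model**2 approximation
# (attention projections plus MLP, ignoring embeddings and biases).
n_layers = 96     # number of layers reported in the GPT-3 paper
d_model = 12288   # hidden size reported in the GPT-3 paper

approx_params = 12 * n_layers * d_model ** 2
print(f"~{approx_params / 1e9:.0f}B parameters")  # ~174B, close to the reported 175B
```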

There is no need to do gradient / parameter updates (fine-tuning) to use GPT-3 for various tasks. You can just interact with the model using natural language and/or provide some examples of the task you are trying to do, and the model will do it!
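The paper calls this in-context learning. The sketch below only assembles a few-shot prompt string in the spirit of the paper's English-to-French examples; the prompt format is a plausible reconstruction rather than the paper's exact template, and the complete() call at the end is a hypothetical stand-in for whatever completion interface you have access to:

```python
# Sketch of few-shot "in-context learning": the task is specified entirely
# in the prompt text, with no gradient updates. The example pairs below are
# from the paper's English-to-French demonstration; the prompt format is a
# plausible reconstruction, not the paper's exact template.
def build_few_shot_prompt(task_description, examples, query):
    lines = [task_description, ""]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model is asked to continue from here
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
)
print(prompt)
# completion = complete(prompt)  # hypothetical model call, not a real public API
```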

(Figure from the GPT-3 paper)

What Does All This Mean?

Not requiring large custom task-specific datasets, in addition to not requiring task-specific model architectures, is a huge step toward making cutting-edge NLP more accessible.

While GPT-3 delivers great performance on many NLP tasks, for example word prediction and common-sense reasoning, it doesn't do equally well on everything. For instance, it struggles with text synthesis and some reading-comprehension tasks. It also inherits bias from its training data, which may lead the model to generate stereotyped or prejudiced content. So there is more work to be done here.

On top of all this, GPT-3's huge size puts it out of reach of almost everyone except a select few companies and research labs in the world. As the authors note, the model is very versatile and contains a wide range of skills not needed for specific tasks, so there may be scope for creating smaller, more manageable task-specific models through distillation.
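The paper only floats distillation as future work, but the basic mechanics are well established (Hinton et al., 2015). Here is a generic sketch of the soft-target distillation loss, with random logits standing in for real teacher and student outputs:

```python
# A generic sketch of knowledge distillation (Hinton et al., 2015), the
# technique the authors suggest might yield smaller task-specific models.
# `teacher_logits` would come from the large frozen model; the student is
# trained to match the teacher's softened output distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions, then penalise their KL divergence.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2

# Toy example with random logits over a 10-way output:
teacher_logits = torch.randn(4, 10)                       # from the (frozen) big model
student_logits = torch.randn(4, 10, requires_grad=True)   # from the small model
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```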

It will be exciting to see how this evolves in the future.

