Evolution of Language Models: N-Grams, Word Embeddings, Attention & Transformers


In this post, I thought it would be nice to collate some research on the advancements of Natural Language Processing (NLP) over the years.


You’d be surprised at how young this domain really is.

I know I was.

But first and foremost, let’s lay the foundations on what a Language Model is.

Language Models are simply models that assign probabilities to sequences of words.

These range from something as simple as N-Grams all the way to Neural Language Models.
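To make the definition concrete, here is a minimal sketch of a bigram (2-gram) language model in Python; the toy corpus, smoothing choice, and function names are my own illustrations, not from the original post.

```python
from collections import Counter, defaultdict

# Toy corpus (illustrative only)
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

# Count unigrams and bigrams, padding each sentence with <s> and </s>
unigram_counts = Counter()
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigram_counts.update(tokens)
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[prev][curr] += 1

def bigram_prob(prev, curr, alpha=1.0):
    """P(curr | prev) with add-alpha smoothing so unseen pairs get non-zero probability."""
    vocab_size = len(unigram_counts)
    return (bigram_counts[prev][curr] + alpha) / (unigram_counts[prev] + alpha * vocab_size)

def sentence_prob(sentence):
    """Probability the bigram model assigns to a whole sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, curr)
    return prob

print(sentence_prob("the cat sat on the mat"))   # relatively likely
print(sentence_prob("mat the on sat cat the"))   # much less likely
```

The same idea scales up to trigrams and beyond; only the counting window changes.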

Even pretrained word embeddings are derived from language modelling, e.g. Word2Vec, GloVe, SVD, LSA.

I tend to think of Language Models as the larger umbrella under which a whole bunch of things fall.

With that, let’s start from the beginning. :)

Note: Bear with me till the 2000s. It gets more interesting from there on.

Before 1948-1980 — Birth of N-Grams and Rule Systems

Photo by Tim Bish on Unsplash

By and large, the majority of NLP systems in this period were based on rules, and the first few language models came in the form of N-Grams.

It’s unclear from my research who coined this term.

However, the first references to N-Grams came from Claude Shannon's paper "A Mathematical Theory of Communication", published in 1948.

Shannon references N-Grams a total of 3 times in this paper.

This suggests that the concept of N-Grams was probably formulated by someone else before 1948.

1980-1990 — The Rise of Compute Power and the Birth of the RNN

A diagram for a one-unit recurrent neural network (RNN), 19 June 2017 by fdeloche (source)

During this decade, the majority of NLP research focused on statistical models capable of making probabilistic decisions.

In 1982, John Hopfield introduced the Recurrent Neural Network (RNN) for use on sequence data, i.e. text or voice.
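As a rough illustration of the recurrence idea (not Hopfield's original formulation), the sketch below runs a single Elman-style recurrent unit over a toy sequence in numpy; all dimensions and parameter names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16          # illustrative sizes

# Randomly initialised parameters of a simple recurrent unit
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """One recurrence: the new hidden state mixes the current input with the previous state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Run over a toy sequence of 5 time steps, carrying the hidden state forward
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = rnn_step(x_t, h)
print(h.shape)  # (16,)
```

The carried-forward hidden state is what lets the network condition on earlier parts of the sequence.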

By 1986, the first ideas of representing words as vectors had emerged. These studies were conducted by Geoffrey Hinton, one of the Godfathers of modern-day AI research (Hinton et al., 1986; Rumelhart et al., 1986).

1990-2000 — The Rise of NLP Research and the Birth of LSTM

A diagram for a one-unit Long Short-Term Memory (LSTM), 20 June 2017 by fdeloche (source)

In the 1990s, NLP analysis began to grow in popularity.

N-Grams became extremely useful in making sense of textual data.

In 1997, Long Short-Term Memory (LSTM) networks were introduced by Hochreiter & Schmidhuber (1997).

However, there was still not enough compute power in this period to use these neural language models to their full potential.

2003 — The First Neural Language Model

In 2003, the very first feed-forward neural network language model was proposed by Bengio et al. (2003).

The Bengio et al. (2003) model consisted of a single-hidden-layer feed-forward network used to predict the next word of a sequence.

The first neural language model by Bengio et al. 2003 (source)

Although feature vectors already existed by this time, Bengio et al. (2003) were the ones who brought the concept to the masses.

Today, we know them as Word Embeddings. :)
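A minimal sketch of a Bengio-style feed-forward language model, written here in PyTorch with illustrative layer sizes (the class and parameter names are mine, not from the paper): the previous context words are embedded, concatenated, passed through one hidden layer, and mapped to a distribution over the next word.

```python
import torch
import torch.nn as nn

class FFNNLanguageModel(nn.Module):
    """Embed the context words, pass them through one hidden layer, and
    predict the next word, in the spirit of Bengio et al. (2003)."""
    def __init__(self, vocab_size, context_size=3, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, context_size) indices of the previous words
        embedded = self.embeddings(context_ids).flatten(start_dim=1)
        hidden = torch.tanh(self.hidden(embedded))
        return self.output(hidden)  # logits over the next word

# Toy usage: a batch of 2 three-word contexts over a 1000-word vocabulary
model = FFNNLanguageModel(vocab_size=1000)
context = torch.randint(0, 1000, (2, 3))
logits = model(context)
print(logits.shape)  # torch.Size([2, 1000])
```

The rows of the embedding table are exactly the learned word vectors the post refers to.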

Note: There was plenty of other research in this decade as well, such as multi-task learning with neural networks (Collobert & Weston, 2008).

2013 — Birth of Widespread Pretrained Word Embeddings (Word2Vec by Google)

In 2013, Google introduced Word2Vec (Mikolov et al., 2013).

The goal of Mikolov et al. (2013) was to introduce novel techniques for learning high-quality word embeddings from huge corpora that would be transferable across NLP applications.

These techniques were:

  • Continuous bag-of-words (CBOW)
  • Skip-gram

Word2Vec models. The CBOW architecture predicts the current word based on the context, and the Skip-gram predicts surrounding words given the current word. By Mikolov et al. 2013 (source)
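To show how these embeddings are typically trained in practice, here is a hedged sketch using the gensim library (gensim 4.x parameter names assumed); the toy sentences are purely illustrative.

```python
from gensim.models import Word2Vec

# Toy tokenised corpus (illustrative only; real corpora have millions of sentences)
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "animals"],
]

# sg=1 selects Skip-gram; sg=0 would select CBOW
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

vector = model.wv["cat"]                      # 50-dimensional embedding for "cat"
print(vector.shape)
print(model.wv.most_similar("cat", topn=3))   # nearest neighbours in embedding space
```

On a corpus this small the neighbours are meaningless; the point is only the API shape of training and querying embeddings.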

The results of Mikolov et al. (2013) pretrained word embeddings paved the way for a multitude of NLP applications for years to come.

To this day, people still use pretrained word embeddings for various NLP applications.

It was also in this period that LSTMs, RNNs, and Gated Recurrent Units (GRUs) started to be widely adopted for many different NLP applications.

2014 — Stanford: Global Vectors (GloVe)

A year after Word2Vec was introduced, Pennington et al. (2014) from Stanford University presented GloVe.

GloVe was a set of pretrained word embeddings trained on different corpora with a different technique.

Pennington et al. (2014) found that word embeddings could be learned from co-occurrence matrices and showed that their method could outperform Word2Vec on word similarity tasks and Named Entity Recognition (NER).

Overall accuracy on the word analogy task: GloVe vs. CBOW vs. Skip-gram, by Pennington et al. 2014 (source)

As an anecdote, I believe more applications use GloVe than Word2Vec.
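As a practical aside, here is a minimal sketch of loading pretrained GloVe vectors from the plain-text format used by the Stanford downloads; the file name glove.6B.100d.txt refers to one of the published files and is assumed to be available locally.

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe text file: each line is a word followed by its vector components."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

# Assumes the 100-dimensional GloVe file has been downloaded locally
glove = load_glove("glove.6B.100d.txt")

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(glove["king"], glove["queen"]))   # relatively high similarity
print(cosine(glove["king"], glove["carrot"]))  # lower similarity
```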

2015 — The Comeback: SVD and LSA Word Embeddings & The Birth of Attention Models

Photo by Science in HD on Unsplash

At the time, trendy neural network models seemed to be outperforming traditional models on word similarity and analogy detection tasks.

It was here that Levy et al. (2015) conducted a study of these trending methodologies to learn how they stacked up against traditional statistical methods.

Levy et al. (2015) found that, with proper tuning, classic matrix factorization methods like SVD and LSA attained results similar to Word2Vec or GloVe.

They concluded that there were insignificant performance differences between the old and new methods and that there was no evidence of an advantage to any single approach over the others.
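To illustrate the "old" approach, the sketch below derives word vectors by truncated SVD of a word-word co-occurrence matrix, roughly in the spirit of the count-based methods Levy et al. (2015) compared; the toy corpus, window size, and dimensionality are made up, and a real setup would usually apply a PPMI weighting before factorizing.

```python
import numpy as np

# Toy tokenised corpus (illustrative only)
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# Build a symmetric word-word co-occurrence matrix with a window of 2
vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}
cooc = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 2), min(len(sent), i + 3)):
            if i != j:
                cooc[index[w], index[sent[j]]] += 1

# Truncated SVD: keep the top-k singular directions as dense word embeddings
U, S, Vt = np.linalg.svd(cooc)
k = 4
word_vectors = U[:, :k] * S[:k]          # each row is a k-dimensional word embedding
print(dict(zip(vocab, word_vectors.round(2))))
```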

I guess the lesson here is that new shiny toys aren’t always better than old (not so shiny) toys.

The Birth of the Attention Model

In previous studies, the problem with RNN-based Neural Machine Translation (NMT) was that the models tended to "forget" what was learned earlier if the sentences got too long.

This was noted as the problem of “long-term dependencies”.

As such, Bahdanau et al. (2015) proposed the attention mechanism to address this issue.

Rather than having a model remember an entire input sequence before translation, the attention mechanism replicates how humans would go about a translation task.

The mechanism allows the model to focus only on the words that best help it translate the current word correctly.
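A minimal numpy sketch of additive (Bahdanau-style) attention with made-up dimensions: each encoder state is scored against the current decoder state, the scores are softmaxed into weights, and the context vector is the weighted sum of encoder states.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, enc_dim, dec_dim, attn_dim = 6, 16, 16, 32   # illustrative sizes

# Encoder hidden states for a 6-word source sentence, and one decoder state
encoder_states = rng.normal(size=(seq_len, enc_dim))
decoder_state = rng.normal(size=dec_dim)

# Parameters of the additive scoring function: score_i = v . tanh(W_e h_i + W_d s)
W_e = rng.normal(scale=0.1, size=(attn_dim, enc_dim))
W_d = rng.normal(scale=0.1, size=(attn_dim, dec_dim))
v = rng.normal(scale=0.1, size=attn_dim)

scores = np.tanh(encoder_states @ W_e.T + decoder_state @ W_d.T) @ v   # (seq_len,)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                     # softmax: how much to "attend" to each source word
context = weights @ encoder_states           # weighted sum of encoder states

print(weights.round(3), context.shape)
```

In a trained model these weights tend to line up with the source words most relevant to the word being generated, which is what lets the decoder cope with long sentences.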

