Evolution of Language Models: N-Grams, Word Embeddings, Attention & Transformers


In this post, I thought it would be nice to collate some research on the advancements of Natural Language Processing (NLP) over the years.


You’d be surprised at how young this domain really is.

I know I was.

But first and foremost, let’s lay the foundations of what a Language Model is.

Language Models are simply models that assign probabilities to sequences of words.

They range from something as simple as N-Grams all the way to Neural Language Models.

Even pretrained word embeddings are derived from language modelling, e.g. Word2Vec, GloVe, SVD, LSA.

I tend to think of Language Models as the larger umbrella in which a whole bunch of things fall under.
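
To make that concrete, here’s a minimal sketch of a count-based bigram language model, the simplest kind of N-Gram model. The toy corpus and the add-one smoothing are purely my own illustrative choices; neural language models do the same job of scoring sequences, just with learned parameters instead of raw counts.

```python
from collections import defaultdict

# A minimal bigram language model: P(sentence) = product of P(word_i | word_{i-1}).
# The tiny corpus and add-one smoothing below are illustrative assumptions.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]

unigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, curr in zip(tokens, tokens[1:]):
        unigram_counts[prev] += 1
        bigram_counts[(prev, curr)] += 1

vocab_size = len({w for s in corpus for w in s.split()} | {"<s>", "</s>"})

def bigram_prob(prev, curr):
    # Add-one (Laplace) smoothing so unseen bigrams get a non-zero probability.
    return (bigram_counts[(prev, curr)] + 1) / (unigram_counts[prev] + vocab_size)

def sentence_prob(sentence):
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, curr)
    return prob

print(sentence_prob("the cat sat on the rug"))
```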

With that, let’s start from the beginning. :)

Note: Bear with me till the 2000s. It gets more interesting from there on.

1948-1980 — Birth of N-Grams and Rule Systems

Photo by Tim Bish on Unsplash

By and large, the majority of NLP systems in this period were based on rules, and the first few language models came in the form of N-Grams.

It’s unclear from my research who coined this term.

However, the first references to N-Grams appear in Claude Shannon’s paper “A Mathematical Theory of Communication”, published in 1948.

Shannon references N-Grams a total of 3 times in this paper.

This meant that the concept of N-Grams was probably formulated before 1948 by someone else.

1980-1990 — The Rise of Compute Power and the Birth of the RNN

A diagram for a one-unit recurrent neural network (RNN), 19 June 2017, by fdeloche (source)

During this decade, the majority of NLP research focused on statistical models capable of making probabilistic decisions.

In 1982, John Hopfield introduced the Recurrent Neural Network (RNN) to be used for operations on sequential data, e.g. text or voice.
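
As a rough illustration of what a recurrent unit does (a minimal sketch with made-up dimensions, not Hopfield’s original formulation), a vanilla RNN keeps a hidden state that gets updated at every time step:

```python
import numpy as np

# Minimal vanilla RNN cell: the hidden state h is updated at every time step
# from the previous hidden state and the current input. Dimensions are illustrative.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16

W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden (recurrent) weights
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_xh @ x_t + W_hh @ h_prev + b)
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Run the cell over a toy sequence of 5 time steps.
h = np.zeros(hidden_dim)
sequence = rng.normal(size=(5, input_dim))
for x_t in sequence:
    h = rnn_step(x_t, h)

print(h.shape)  # (16,) — a summary of the sequence seen so far
```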

By 1986, the first ideas of representing words as vectors had emerged. These studies were conducted by Geoffrey Hinton, one of the Godfathers of modern-day AI research (Hinton et al., 1986; Rumelhart et al., 1986).

1990-2000 — The Rise of NLP Research and the Birth of LSTM

A diagram for a one-unit Long Short-Term Memory (LSTM), 20 June 2017, by fdeloche (source)

In the 1990s, NLP analysis began to grow in popularity.

N-Grams became extremely useful in making sense of textual data.

By 1997, the idea of the Long Short-Term Memory network (LSTM) had been introduced by Hochreiter & Schmidhuber (1997).

However, there was still a lack of compute power in this period to exploit neural language models to their full potential.
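
For intuition, here’s a minimal sketch of a single LSTM step with forget, input and output gates. The dimensions and random weights are illustrative, and this is the modern formulation rather than the exact 1997 one (which, for example, did not yet have a forget gate):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # One step of a (modern) LSTM cell with forget (f), input (i) and output (o)
    # gates plus a candidate cell state (g). W, U, b hold the parameters of all
    # four transforms, stacked along the first axis.
    z = W @ x_t + U @ h_prev + b           # shape: (4 * hidden_dim,)
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    g = np.tanh(g)
    c_t = f * c_prev + i * g               # cell state carries long-term memory
    h_t = o * np.tanh(c_t)                 # hidden state is the per-step output
    return h_t, c_t

# Toy dimensions, purely illustrative.
input_dim, hidden_dim = 8, 16
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(4 * hidden_dim, input_dim))
U = rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim))
b = np.zeros(4 * hidden_dim)

h = c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape)  # (16,)
```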

2003 — The First Neural Language Model

In 2003, the very first feed-forward neural network language model was proposed by Bengio et al. (2003).

The Bengio et al. (2003) model consisted of a single-hidden-layer feed-forward network used to predict the next word of a sequence.

The first neural language model, by Bengio et al. 2003 (source)

Although feature vectors already existed by this time, Bengio et al. (2003) were the ones who brought the concept to the masses.

Today, we know them as Word Embeddings. :)
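
To give a flavour of that architecture, here’s a minimal PyTorch sketch of a Bengio-style feed-forward language model: each context word is mapped to a learned embedding, the embeddings are concatenated, passed through a single tanh hidden layer, and a softmax over the vocabulary scores the next word. The vocabulary size, dimensions and class name are my own illustrative choices:

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    """Bengio-style neural language model (illustrative sketch)."""
    def __init__(self, vocab_size=10_000, context_size=3, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)        # learned word feature vectors
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)          # scores over the vocabulary

    def forward(self, context_ids):
        # context_ids: (batch, context_size) integer word indices
        e = self.embed(context_ids)                  # (batch, context_size, embed_dim)
        e = e.flatten(start_dim=1)                   # concatenate the context embeddings
        h = torch.tanh(self.hidden(e))
        return self.output(h)                        # logits for the next word

# Toy usage: predict the next word from a 3-word context.
model = FeedForwardLM()
context = torch.randint(0, 10_000, (2, 3))           # batch of 2 contexts
logits = model(context)
print(logits.shape)                                   # torch.Size([2, 10000])
next_word_probs = torch.softmax(logits, dim=-1)
```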

Note: There was a ton of other research in this decade as well, such as multi-task learning with neural networks (Collobert & Weston, 2008).

2013 — Birth of Widespread Pretrained Word Embeddings (Word2Vec by Google)

In 2013, Google introduced Word2Vec (Mikolov et al., 2013).

The goal of Mikolov et al. (2013) was to introduce novel techniques for learning, from huge corpora, high-quality word embeddings that would be transferable across NLP applications.

These techniques were:

  • Continuous Bag-of-Words (CBOW)
  • Skip-Gram

Word2Vec models. The CBOW architecture predicts the current word based on the context, and the Skip-gram predicts surrounding words given the current word. By Mikolov et al. 2013 (source)
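
In practice, most people train these embeddings through a library rather than from scratch. Here’s a rough sketch assuming the gensim 4.x API (the toy corpus and parameter values are mine); both architectures are exposed through a single class via the sg flag:

```python
from gensim.models import Word2Vec

# Toy corpus: in practice Word2Vec is trained on billions of tokens.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["dogs", "and", "cats", "are", "animals"],
]

# sg=0 selects CBOW (predict the current word from its context),
# sg=1 selects Skip-Gram (predict the context from the current word).
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

# Each word now has a dense vector; similar words should end up close together.
print(skipgram.wv["cat"].shape)              # (50,)
print(skipgram.wv.most_similar("cat", topn=3))
```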

The results of Mikolov et al. (2013) pretrained word embeddings paved the way for a multitude of NLP applications for years to come.

To this day, people still use pretrained word embeddings for various NLP applications.

It was in this period that LSTMs, RNNs and Gated Recurrent Units (GRU) started to be widely adopted for many different NLP applications as well.

2014 — Stanford: Global Vectors (GloVe)

A year after Word2Vec was introduced, Pennington et al. (2014) from Stanford University presented GloVe.

GloVe was a set of pretrained word embeddings trained on a different set of corpora with a different technique.

Pennington et al. (2014) showed that word embeddings could be learned from co-occurrence matrices, and demonstrated that their method could outperform Word2Vec on word similarity tasks and Named Entity Recognition (NER).
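
As a quick illustration of using such pretrained vectors, here’s a sketch that assumes the gensim downloader and its “glove-wiki-gigaword-100” package; swap in whichever GloVe release you actually use:

```python
import gensim.downloader as api

# Downloads pretrained 100-dimensional GloVe vectors trained on Wikipedia + Gigaword
# (a sizable download on first call, then cached locally).
glove = api.load("glove-wiki-gigaword-100")

print(glove["king"].shape)                      # (100,)
print(glove.most_similar("king", topn=3))       # semantically related words
print(glove.similarity("cat", "dog"))           # cosine similarity between two words

# The classic analogy test: king - man + woman ≈ queen
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```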

Overall accuracy on the word analogy task: GloVe vs CBOW vs Skip-Gram, by Pennington et al. 2014 (source)

As an anecdote, I believe more applications use GloVe than Word2Vec.

2015 — The Comeback: SVD and LSA Word Embeddings & The Birth of Attention Models

Photo by Science in HD on Unsplash

At this point, the trendy new neural network models seemed to be outperforming traditional models on word similarity and analogy detection tasks.

It was here that researchers Levy et al. (2015) conducted a study on these trending methodologies to learn how they stacked up against the traditional statistical methods.

Levy et al. (2015) found that with proper tuning, classic matrix factorization methods like SVD and LSA attained results similar to Word2Vec or GloVe.

They concluded that there were insignificant performance differences between the old and new methods and that there was no evidence of an advantage to any single approach over the others.
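
For intuition, here’s a minimal sketch of that classic count-based route: build a word co-occurrence matrix, reweight it (I use PPMI here, one common choice, as an assumption rather than a detail from the study), and factorize it with truncated SVD to get dense word vectors:

```python
import numpy as np

# Toy corpus; real LSA/SVD embeddings are built from millions of documents.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "dogs and cats are animals".split(),
]

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# 1. Word-word co-occurrence counts within a symmetric window of 2.
window = 2
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[idx[w], idx[sent[j]]] += 1

# 2. Positive pointwise mutual information (PPMI) reweighting.
total = counts.sum()
row = counts.sum(axis=1, keepdims=True)
col = counts.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((counts * total) / (row * col))
ppmi = np.maximum(pmi, 0)
ppmi[~np.isfinite(ppmi)] = 0.0

# 3. Truncated SVD: keep the top-k singular vectors as word embeddings.
k = 5
U, S, Vt = np.linalg.svd(ppmi)
embeddings = U[:, :k] * S[:k]            # one k-dimensional vector per word

print(embeddings.shape)                   # (vocab_size, 5)
```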

I guess the lesson here is that new shiny toys aren’t always better than old (not so shiny) toys.

The Birth of the Attention Model

In previous studies, the problem with RNN-based Neural Machine Translation (NMT) was that the models tended to “forget” what they had seen if the sentences got too long.

This was noted as the problem of “long-term dependencies”.

As such, Bahdanau et al. (2015) proposed the attention mechanism to address this issue.

Rather than having a model remember an entire input sequence before translation, the attention mechanism replicates how humans would go about a translation task.

The mechanism allowed the model to focus only on the input words that were most useful for translating the current output word.
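
Here’s a minimal sketch of the idea, in the style of Bahdanau et al.’s additive attention, with made-up dimensions and random weights standing in for parameters a real NMT model would learn jointly with its encoder and decoder:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def additive_attention(decoder_state, encoder_states, W_dec, W_enc, v):
    # Score each encoder state against the current decoder state, normalize the
    # scores into weights, and return a weighted sum (the context vector).
    scores = np.array([
        v @ np.tanh(W_dec @ decoder_state + W_enc @ h) for h in encoder_states
    ])
    weights = softmax(scores)                 # how much to "attend" to each source word
    context = weights @ encoder_states        # weighted sum of encoder states
    return context, weights

# Toy dimensions, purely illustrative.
rng = np.random.default_rng(2)
enc_dim, dec_dim, attn_dim, src_len = 16, 16, 32, 6

encoder_states = rng.normal(size=(src_len, enc_dim))   # one vector per source word
decoder_state = rng.normal(size=dec_dim)               # current decoder hidden state
W_enc = rng.normal(scale=0.1, size=(attn_dim, enc_dim))
W_dec = rng.normal(scale=0.1, size=(attn_dim, dec_dim))
v = rng.normal(scale=0.1, size=attn_dim)

context, weights = additive_attention(decoder_state, encoder_states, W_dec, W_enc, v)
print(weights.round(3))   # attention weights over the 6 source words (they sum to 1)
print(context.shape)      # (16,) context vector fed to the decoder
```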

