Evolution of Language Models: N-Grams, Word Embeddings, Attention & Transformers


In this post, I thought it would be nice to collate some research on the advancements of Natural Language Processing (NLP) over the years.


You’d be surprised at how young this domain really is.

I know I was.

But first and foremost, let’s lay the foundations of what a Language Model is.

Language Models are simply models that assign probabilities to sequences of words.

They range from something as simple as N-Grams all the way to Neural Language Models.

Even pretrained word embeddings are derived from language modelling, e.g. Word2Vec, GloVe, SVD, LSA.

I tend to think of Language Models as the larger umbrella in which a whole bunch of things fall under.
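
To make that concrete, here’s a minimal sketch of a count-based bigram language model, the simplest kind of N-Gram model. The toy corpus and the add-one smoothing are purely my own illustrative choices; neural language models do the same job of scoring sequences, just with learned parameters instead of raw counts.

```python
from collections import defaultdict

# A minimal bigram language model: P(sentence) = product of P(word_i | word_{i-1}).
# The tiny corpus and add-one smoothing below are illustrative assumptions.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]

unigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, curr in zip(tokens, tokens[1:]):
        unigram_counts[prev] += 1
        bigram_counts[(prev, curr)] += 1

vocab_size = len({w for s in corpus for w in s.split()} | {"<s>", "</s>"})

def bigram_prob(prev, curr):
    # Add-one (Laplace) smoothing so unseen bigrams get a non-zero probability.
    return (bigram_counts[(prev, curr)] + 1) / (unigram_counts[prev] + vocab_size)

def sentence_prob(sentence):
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, curr)
    return prob

print(sentence_prob("the cat sat on the rug"))
```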

With that, let’s start from the beginning. :)

Note: Bear with me till the 2000s. It gets more interesting from there on.

1948-1980 — Birth of N-Grams and Rule Systems

Photo by Tim Bish on Unsplash

By and large, the majority of NLP systems in this period were based on rules, and the first few language models came in the form of N-Grams.

It’s unclear from my research who coined this term.

However, the first references to N-Grams appear in Claude Shannon’s paper “A Mathematical Theory of Communication”, published in 1948.

Shannon references N-Grams a total of 3 times in this paper.

This meant that the concept of N-Grams was probably formulated before 1948 by someone else.

1980-1990 — The Rise of Compute Power and the Birth of the RNN

A diagram for a one-unit recurrent neural network (RNN), 19 June 2017, by fdeloche (source)

During this decade, the majority of NLP research focused on statistical models capable of making probabilistic decisions.

In 1982, John Hopfield introduced the Recurrent Neural Network (RNN) to be used for operations on sequential data, e.g. text or voice.
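
As a rough illustration of what a recurrent unit does (a minimal sketch with made-up dimensions, not Hopfield’s original formulation), a vanilla RNN keeps a hidden state that gets updated at every time step:

```python
import numpy as np

# Minimal vanilla RNN cell: the hidden state h is updated at every time step
# from the previous hidden state and the current input. Dimensions are illustrative.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16

W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden (recurrent) weights
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_xh @ x_t + W_hh @ h_prev + b)
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Run the cell over a toy sequence of 5 time steps.
h = np.zeros(hidden_dim)
sequence = rng.normal(size=(5, input_dim))
for x_t in sequence:
    h = rnn_step(x_t, h)

print(h.shape)  # (16,) — a summary of the sequence seen so far
```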

By 1986, the first ideas of representing words as vectors had emerged. These studies were conducted by Geoffrey Hinton, one of the Godfathers of modern-day AI research (Hinton et al., 1986; Rumelhart et al., 1986).

1990-2000 — The Rise of NLP Research and the Birth of LSTM

A diagram for a one-unit Long Short-Term Memory (LSTM), 20 June 2017, by fdeloche (source)

In the 1990s, NLP analysis began to grow in popularity.

N-Grams became extremely useful in making sense of textual data.

By 1997, the idea of the Long Short-Term Memory network (LSTM) had been introduced by Hochreiter & Schmidhuber (1997).

However, there was still a lack of compute power in this period to exploit neural language models to their full potential.
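
For intuition, here’s a minimal sketch of a single LSTM step with forget, input and output gates. The dimensions and random weights are illustrative, and this is the modern formulation rather than the exact 1997 one (which, for example, did not yet have a forget gate):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # One step of a (modern) LSTM cell with forget (f), input (i) and output (o)
    # gates plus a candidate cell state (g). W, U, b hold the parameters of all
    # four transforms, stacked along the first axis.
    z = W @ x_t + U @ h_prev + b           # shape: (4 * hidden_dim,)
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    g = np.tanh(g)
    c_t = f * c_prev + i * g               # cell state carries long-term memory
    h_t = o * np.tanh(c_t)                 # hidden state is the per-step output
    return h_t, c_t

# Toy dimensions, purely illustrative.
input_dim, hidden_dim = 8, 16
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(4 * hidden_dim, input_dim))
U = rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim))
b = np.zeros(4 * hidden_dim)

h = c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape)  # (16,)
```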

2003 — The First Neural Language Model

In 2003, the very first feed-forward neural network language model was proposed by Bengio et al. (2003).

The Bengio et al. (2003) model consisted of a single-hidden-layer feed-forward network used to predict the next word of a sequence.

The first neural language model, by Bengio et al. 2003 (source)

Although feature vectors already existed by this time, Bengio et al. (2003) were the ones who brought the concept to the masses.

Today, we know them as Word Embeddings. :)
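
To give a flavour of that architecture, here’s a minimal PyTorch sketch of a Bengio-style feed-forward language model: each context word is mapped to a learned embedding, the embeddings are concatenated, passed through a single tanh hidden layer, and a softmax over the vocabulary scores the next word. The vocabulary size, dimensions and class name are my own illustrative choices:

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    """Bengio-style neural language model (illustrative sketch)."""
    def __init__(self, vocab_size=10_000, context_size=3, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)        # learned word feature vectors
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)          # scores over the vocabulary

    def forward(self, context_ids):
        # context_ids: (batch, context_size) integer word indices
        e = self.embed(context_ids)                  # (batch, context_size, embed_dim)
        e = e.flatten(start_dim=1)                   # concatenate the context embeddings
        h = torch.tanh(self.hidden(e))
        return self.output(h)                        # logits for the next word

# Toy usage: predict the next word from a 3-word context.
model = FeedForwardLM()
context = torch.randint(0, 10_000, (2, 3))           # batch of 2 contexts
logits = model(context)
print(logits.shape)                                   # torch.Size([2, 10000])
next_word_probs = torch.softmax(logits, dim=-1)
```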

Note: There was a ton of other research in this decade as well, such as multi-task learning with neural networks (Collobert & Weston, 2008).

2013 — Birth of Widespread Pretrained Word Embeddings (Word2Vec by Google)

In 2013, Google introduced Word2Vec (Mikolov et al., 2013).

The goal of Mikolov et al. (2013) was to introduce novel techniques for learning, from huge corpora, high-quality word embeddings that would be transferable across NLP applications.

These techniques were:

  • Continuous Bag-of-Words (CBOW)
  • Skip-Gram

Word2Vec models. The CBOW architecture predicts the current word based on the context, and the Skip-gram predicts surrounding words given the current word. By Mikolov et al. 2013 (source)
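
In practice, most people train these embeddings through a library rather than from scratch. Here’s a rough sketch assuming the gensim 4.x API (the toy corpus and parameter values are mine); both architectures are exposed through a single class via the sg flag:

```python
from gensim.models import Word2Vec

# Toy corpus: in practice Word2Vec is trained on billions of tokens.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["dogs", "and", "cats", "are", "animals"],
]

# sg=0 selects CBOW (predict the current word from its context),
# sg=1 selects Skip-Gram (predict the context from the current word).
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

# Each word now has a dense vector; similar words should end up close together.
print(skipgram.wv["cat"].shape)              # (50,)
print(skipgram.wv.most_similar("cat", topn=3))
```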

The results of Mikolov et al. (2013) pretrained word embeddings paved the way for a multitude of NLP applications for years to come.

To this day, people still use pretrained word embeddings for various NLP applications.

It was in this period that LSTMs, RNNs and Gated Recurrent Units (GRU) started to be widely adopted for many different NLP applications as well.

2014 — Stanford: Global Vectors (GloVe)

A year after Word2Vec was introduced, Pennington et al. (2014) from Stanford University presented GloVe.

GloVe was a set of pretrained word embeddings trained on a different set of corpora with a different technique.

Pennington et al. (2014) showed that word embeddings could be learned from co-occurrence matrices, and demonstrated that their method could outperform Word2Vec on word similarity tasks and Named Entity Recognition (NER).
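
As a quick illustration of using such pretrained vectors, here’s a sketch that assumes the gensim downloader and its “glove-wiki-gigaword-100” package; swap in whichever GloVe release you actually use:

```python
import gensim.downloader as api

# Downloads pretrained 100-dimensional GloVe vectors trained on Wikipedia + Gigaword
# (a sizable download on first call, then cached locally).
glove = api.load("glove-wiki-gigaword-100")

print(glove["king"].shape)                      # (100,)
print(glove.most_similar("king", topn=3))       # semantically related words
print(glove.similarity("cat", "dog"))           # cosine similarity between two words

# The classic analogy test: king - man + woman ≈ queen
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```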

Overall accuracy on the word analogy task: GloVe vs CBOW vs Skip-Gram, by Pennington et al. 2014 (source)

As an anecdote, I believe more applications use GloVe than Word2Vec.

2015 — The Comeback: SVD and LSA Word Embeddings & The Birth of Attention Models

Photo by Science in HD on Unsplash

At this point, the trendy new neural network models seemed to be outperforming traditional models on word similarity and analogy detection tasks.

It was here that researchers Levy et al. (2015) conducted a study on these trending methodologies to learn how they stacked up against the traditional statistical methods.

Levy et al. (2015) found that with proper tuning, classic matrix factorization methods like SVD and LSA attained results similar to Word2Vec or GloVe.

They concluded that there were insignificant performance differences between the old and new methods and that there was no evidence of an advantage to any single approach over the others.
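
For intuition, here’s a minimal sketch of that classic count-based route: build a word co-occurrence matrix, reweight it (I use PPMI here, one common choice, as an assumption rather than a detail from the study), and factorize it with truncated SVD to get dense word vectors:

```python
import numpy as np

# Toy corpus; real LSA/SVD embeddings are built from millions of documents.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "dogs and cats are animals".split(),
]

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# 1. Word-word co-occurrence counts within a symmetric window of 2.
window = 2
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[idx[w], idx[sent[j]]] += 1

# 2. Positive pointwise mutual information (PPMI) reweighting.
total = counts.sum()
row = counts.sum(axis=1, keepdims=True)
col = counts.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((counts * total) / (row * col))
ppmi = np.maximum(pmi, 0)
ppmi[~np.isfinite(ppmi)] = 0.0

# 3. Truncated SVD: keep the top-k singular vectors as word embeddings.
k = 5
U, S, Vt = np.linalg.svd(ppmi)
embeddings = U[:, :k] * S[:k]            # one k-dimensional vector per word

print(embeddings.shape)                   # (vocab_size, 5)
```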

I guess the lesson here is that new shiny toys aren’t always better than old (not so shiny) toys.

The Birth of the Attention Model

In previous studies, the problem with RNN-based Neural Machine Translation (NMT) was that the models tended to “forget” what they had seen if the sentences got too long.

This was noted as the problem of “long-term dependencies”.

As such, Bahdanau et al. (2015) proposed the attention mechanism to address this issue.

Rather than having a model remember an entire input sequence before translation, the attention mechanism replicates how humans would go about a translation task.

The mechanism allowed the model to focus only on the input words that were most useful for translating the current output word.
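
Here’s a minimal sketch of the idea, in the style of Bahdanau et al.’s additive attention, with made-up dimensions and random weights standing in for parameters a real NMT model would learn jointly with its encoder and decoder:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def additive_attention(decoder_state, encoder_states, W_dec, W_enc, v):
    # Score each encoder state against the current decoder state, normalize the
    # scores into weights, and return a weighted sum (the context vector).
    scores = np.array([
        v @ np.tanh(W_dec @ decoder_state + W_enc @ h) for h in encoder_states
    ])
    weights = softmax(scores)                 # how much to "attend" to each source word
    context = weights @ encoder_states        # weighted sum of encoder states
    return context, weights

# Toy dimensions, purely illustrative.
rng = np.random.default_rng(2)
enc_dim, dec_dim, attn_dim, src_len = 16, 16, 32, 6

encoder_states = rng.normal(size=(src_len, enc_dim))   # one vector per source word
decoder_state = rng.normal(size=dec_dim)               # current decoder hidden state
W_enc = rng.normal(scale=0.1, size=(attn_dim, enc_dim))
W_dec = rng.normal(scale=0.1, size=(attn_dim, dec_dim))
v = rng.normal(scale=0.1, size=attn_dim)

context, weights = additive_attention(decoder_state, encoder_states, W_dec, W_enc, v)
print(weights.round(3))   # attention weights over the 6 source words (they sum to 1)
print(context.shape)      # (16,) context vector fed to the decoder
```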

