In this post, I thought it would be nice to collate some research on the advancements of Natural Language Processing (NLP) over the years.
You’d be surprised at how young this domain really is.
I know I was.
But first and foremost, let’s lay the foundations on what a Language Model is.
Language Models are simply models that assign probabilities to sequences of words.
They range from something as simple as N-Grams all the way to Neural Language Models.
Even pretrained word embeddings are derived from language modelling, e.g. Word2Vec, GloVe, SVD, LSA.
I tend to think of Language Models as the larger umbrella under which a whole bunch of things fall.
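To make that concrete, here is a minimal sketch of the simplest kind of language model, a bigram (N=2) model estimated from a toy corpus. The corpus, the lack of smoothing, and the function names are purely illustrative, not a production recipe.

```python
from collections import Counter

# Toy corpus; real language models are trained on far larger text collections.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count unigrams and bigrams.
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w_prev, w):
    """P(w | w_prev) estimated by maximum likelihood from the counts."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

def sentence_prob(words):
    """Probability of a word sequence under the bigram model (no smoothing)."""
    p = 1.0
    for w_prev, w in zip(words, words[1:]):
        p *= bigram_prob(w_prev, w)
    return p

print(sentence_prob("the cat sat on the mat".split()))  # 0.0625 on this toy corpus
```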
With that, let’s start from the beginning. :)
Note: Bear with me till the 2000s. It gets more interesting from there on.
1948-1980 — Birth of N-Grams and Rule Systems
By and large, the majority of NLP systems in this period were based on rules, and the first few language models came in the form of N-Grams.
It’s unclear from my research who coined this term.
However, the first references to N-Grams came from Claude Shannon’s paper “A Mathematical Theory of Communication”, published in 1948.
Shannon references N-Grams a total of 3 times in this paper.
This suggests that the concept of N-Grams was probably formulated before 1948 by someone else.
1980-1990 — Rise of compute power and the Birth of RNN
During this decade, the majority of NLP research focused on statistical models capable of making probabilistic decisions.
In 1982, John Hopfield introduced the Hopfield network, a form of Recurrent Neural Network (RNN), which paved the way for applying neural networks to sequence data such as text or speech.
By 1986, the first ideas of representing words as vectors emerged. These studies were conducted by Geoffrey Hinton, one of the godfathers of modern-day AI research (Hinton et al., 1986; Rumelhart et al., 1986).
1990-2000 — The Rise of NLP Research and the Birth of LSTM
In the 1990s, NLP analysis began to grow in popularity.
N-Grams became extremely useful in making sense of textual data.
By 1997, the idea of Long Short-Term Memory (LSTM) networks was introduced by Hochreiter & Schmidhuber (1997).
However, there was still not enough compute power in this period to exploit neural language models to their fullest potential.
2003 — The First Neural Language Model
In 2003, the very first feed-forward neural network language model was proposed by Bengio et al. (2003).
The model of Bengio et al. (2003) consisted of a single-hidden-layer feed-forward network used to predict the next word of a sequence.
Although feature vectors already existed by this time, Bengio et al. (2003) were the ones who brought the concept to the masses.
Today, we know them as Word Embeddings. :)
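For a rough picture of the architecture, here is a minimal PyTorch sketch in the spirit of Bengio et al. (2003): embed the previous few words, pass the concatenated embeddings through a single tanh hidden layer, and score the whole vocabulary for the next word. The dimensions, vocabulary size, and PyTorch usage are my own illustrative choices, not the original implementation.

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    """Single-hidden-layer feed-forward language model, Bengio-et-al.-style sketch."""
    def __init__(self, vocab_size, context_size=3, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)       # learned word feature vectors
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)           # scores over the whole vocabulary

    def forward(self, context_ids):
        # context_ids: (batch, context_size) indices of the previous words
        e = self.embed(context_ids).flatten(start_dim=1)       # concatenate the context embeddings
        h = torch.tanh(self.hidden(e))
        return self.out(h)                                     # logits for the next word

model = FeedForwardLM(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (8, 3)))               # a batch of 8 contexts of 3 words
next_word_probs = torch.softmax(logits, dim=-1)
print(next_word_probs.shape)                                   # torch.Size([8, 10000])
```

The rows of the embedding layer play the role of the learned feature vectors described above, i.e. word embeddings.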
Note: There was plenty of other research in this decade as well, such as multi-task learning with neural networks (Collobert & Weston, 2008).
2013 — Birth of Widespread Pretrained Word Embeddings (Word2Vec by Google)
In 2013, Google introduced Word2Vec (Mikolov et al., 2013).
The goal of Mikolov et al. (2013) was to introduce novel techniques for learning high-quality word embeddings from huge corpora that were transferable across NLP applications.
These techniques were (see the sketch after this list):
- Continuous bag-of-words (CBOW) &
- Skip-Gram
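As a hedged illustration of the two objectives, here is a short sketch using the gensim library (assuming gensim 4.x, where the sg flag switches between CBOW and Skip-Gram); the toy sentences and hyperparameters are illustrative only.

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; in practice Word2Vec is trained on billions of words.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=0 trains with CBOW (predict the centre word from its context),
# sg=1 trains with Skip-Gram (predict the context from the centre word).
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(cbow.wv["cat"].shape)             # (50,) dense embedding for "cat"
print(skipgram.wv.most_similar("cat"))  # nearest neighbours in embedding space
```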
The pretrained word embeddings of Mikolov et al. (2013) paved the way for a multitude of NLP applications for years to come.
To this day, people still use pretrained word embeddings for various NLP applications.
It was in this period that LSTMs, RNNs and Gated Recurrent Units (GRU) started to be widely adopted for many different NLP applications as well.
2014 — Stanford: Global Vectors (GloVe)
A year after Word2Vec was introduced, Pennington et al. (2014) from Stanford University presented GloVe.
GloVe was a set of pretrained word embeddings trained on a different set of corpora with a different technique.
Pennington et al. (2014) found that word embeddings could be learned from word co-occurrence matrices and showed that their method could outperform Word2Vec on word similarity tasks and Named Entity Recognition (NER).
As an anecdote, I believe more applications use GloVe than Word2Vec.
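If you want to try them out, pretrained GloVe vectors can be loaded, for example, through gensim's downloader. The dataset name below is one of gensim's hosted conversions and is an assumption about your setup, not part of the original GloVe release.

```python
import gensim.downloader as api

# Downloads a pretrained GloVe model hosted by gensim (100-dimensional vectors).
glove = api.load("glove-wiki-gigaword-100")  # assumed dataset name from gensim's catalogue

print(glove["language"].shape)             # (100,) pretrained vector
print(glove.most_similar("king", topn=3))  # e.g. related royalty terms
```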
2015 — The Comeback: SVD and LSA Word Embeddings & The Birth of Attention Models
By this time, the trending neural network models were seemingly outperforming traditional models on word similarity and analogy detection tasks.
It was here that Levy et al. (2015) conducted a study on these trending methodologies to learn how they stacked up against the traditional statistical methods.
Levy et al. (2015) found that, with proper tuning, classic matrix factorization methods like SVD and LSA attained results similar to Word2Vec or GloVe.
They concluded that there were insignificant performance differences between the old and new methods and that there was no evidence of an advantage to any single approach over the others.
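As a rough sketch of the count-based route Levy et al. (2015) revisited, the following builds a word-word co-occurrence matrix, re-weights it with positive PMI, and factorizes it with SVD to get dense word vectors; the corpus, window size, and dimensionality are toy choices.

```python
import numpy as np

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Word-word co-occurrence counts within a symmetric window of 2.
window = 2
cooc = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            cooc[idx[w], idx[corpus[j]]] += 1

# Positive pointwise mutual information (PPMI) re-weighting.
total = cooc.sum()
row = cooc.sum(axis=1, keepdims=True)
col = cooc.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore"):
    pmi = np.log((cooc * total) / (row * col))
ppmi = np.maximum(pmi, 0)  # clip log(0) = -inf and negative PMI to zero

# Truncated SVD: keep the top-k singular vectors as dense word embeddings.
U, S, Vt = np.linalg.svd(ppmi)
k = 3
embeddings = U[:, :k] * S[:k]
print(dict(zip(vocab, np.round(embeddings, 2))))
```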
I guess the lesson here is that new shiny toys aren’t always better than old (not so shiny) toys.
The Birth of the Attention Model
In previous studies, the problem with Neural Machine Translation (NMT) using RNNs was that they tended to “forget” what was learnt if the sentences got too long.
This was noted as the problem of “long-term dependencies”.
As such, Bahdanau et al. (2015) proposed the attention mechanism to address this issue.
Rather than having a model remember an entire input sequence before translation, the attention mechanism replicates how humans would go about a translation task.
The mechanism allowed the model to focus only on the words that best helped it translate the current word correctly.
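Here is a hedged numpy sketch of the core computation in additive (Bahdanau-style) attention: compare the decoder state with every encoder state, softmax the scores into weights, and take the weighted sum of encoder states as the context vector. The matrices below are random placeholders standing in for learned parameters, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, enc_dim, dec_dim, attn_dim = 6, 8, 8, 16

encoder_states = rng.normal(size=(seq_len, enc_dim))  # one vector per source word
decoder_state = rng.normal(size=(dec_dim,))           # current decoder hidden state

# Learned parameters in a real model; random here for illustration.
W_enc = rng.normal(size=(enc_dim, attn_dim))
W_dec = rng.normal(size=(dec_dim, attn_dim))
v = rng.normal(size=(attn_dim,))

# Additive attention scores: v^T tanh(W_enc h_j + W_dec s) for each encoder state h_j.
scores = np.tanh(encoder_states @ W_enc + decoder_state @ W_dec) @ v

# Softmax turns scores into weights over the source words.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Context vector: the attention-weighted sum of encoder states.
context = weights @ encoder_states
print(np.round(weights, 3), context.shape)  # which source words the model "focuses" on
```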