Word Embeddings and Embedding Projector of TensorFlow

Theoretical explanation and a practical example.

Photo by Ross Joyner on Unsplash

Word embedding is a technique to represent words (i.e. tokens) in a vocabulary. It is considered one of the most useful and important concepts in natural language processing (NLP).

In this post, I will cover the idea of word embedding and how it is useful in NLP. Then, we will go over a practical example to comprehend the concept using the Embedding Projector of TensorFlow.

Word embedding means representing a word as a vector in an n-dimensional vector space. Consider a vocabulary that contains 10,000 words. With traditional integer encoding, words are represented with the numbers from 1 to 10,000. The downside of this approach is that we cannot capture any information about the meaning of the words, because the numbers are assigned without any consideration of meaning.

If we use word embedding with a dimension of 16, each word is represented with a 16-dimensional vector. The main advantage of word embedding is that words that share a similar context can be represented close to each other in the vector space. Thus, the vectors carry a sense of the semantics of a word. Let's assume we are trying to do sentiment analysis of customer reviews. If we use word embeddings to represent the words in the reviews, words associated with a positive meaning point in a particular direction. Similarly, words with a negative meaning are likely to point in a different direction.

A very famous analogy that captures the idea of word embeddings is the king-queen example. It is based on the vector representations of the words "king", "queen", "man", and "woman". If we subtract man from king and then add woman, we end up with a vector very close to queen: king - man + woman ≈ queen.
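Here is a minimal sketch of that arithmetic with NumPy. The vectors are hand-picked toy values chosen so the analogy works out exactly; with real Word2Vec or GloVe vectors, the result would only be approximately equal to queen:

```python
import numpy as np

# Toy, hand-picked vectors: the dimensions loosely stand for
# (royalty, maleness, femaleness, other). Real embeddings are learned.
king  = np.array([0.9, 0.8, 0.1, 0.3])
queen = np.array([0.9, 0.1, 0.8, 0.3])
man   = np.array([0.1, 0.8, 0.1, 0.2])
woman = np.array([0.1, 0.1, 0.8, 0.2])

result = king - man + woman
print(result)                      # [0.9 0.1 0.8 0.3]
print(np.allclose(result, queen))  # True for these toy vectors
```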

There are different methods to measure the similarity of vectors. One of the most common is cosine similarity, which is the cosine of the angle between two vectors. Unlike Euclidean distance, cosine similarity does not take the magnitude of the vectors into consideration when measuring similarity. Thus, cosine similarity focuses on the orientation of the vectors, not their length.

Consider the words “exciting”, “boring”, “thrilling”, and “dull”. In a 2-dimensional vector space, the vectors for these words might look like:

Word embedding in 2-dimensional space

As the angle between two vectors decreases, the cosine of the angle, and thus the cosine similarity, increases. If two vectors point in the same direction (the angle between them is 0°), the cosine similarity is 1. On the other hand, if two vectors point in opposite directions (the angle between them is 180°), the cosine similarity is -1.
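A quick sketch of this in NumPy, using made-up 2-dimensional vectors for the four words above (the exact numbers are hypothetical; only the directions matter):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up 2-dimensional vectors for the four example words
vectors = {
    "exciting":  np.array([0.9, 0.8]),
    "thrilling": np.array([0.8, 0.9]),
    "boring":    np.array([-0.8, -0.7]),
    "dull":      np.array([-0.7, -0.9]),
}

print(cosine_similarity(vectors["exciting"], vectors["thrilling"]))  # ~0.99
print(cosine_similarity(vectors["exciting"], vectors["boring"]))     # ~-1.0
```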

When we use word embeddings, the model learns that thrilling and exciting are more likely to share the same context than thrilling and boring. If we represented the words with integers, the model would have no idea about the context of these words.

There are different methods to create word embeddings, such as Word2Vec, GloVe, or an embedding layer of a neural network. Another advantage of word embeddings is that we can use pre-trained embeddings in our models. For instance, Word2Vec and GloVe embeddings are publicly available and can be used for natural language processing tasks. We can also choose to train our own embeddings using an embedding layer in a neural network. For example, we can add an Embedding layer to a sequential model in Keras, as sketched below. Please note that it requires lots of data to train an embedding layer with high performance.
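A minimal sketch of such a model in Keras. The pooling layer and the sentiment-classification head are illustrative choices tied to the review example above, not the only option:

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_size = 10000   # size of the vocabulary, as in the example above
embedding_dim = 16   # each word is represented with a 16-dimensional vector

model = tf.keras.Sequential([
    # Maps each integer-encoded word to a dense 16-dimensional vector;
    # the vectors are learned during training like any other weights.
    layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim),
    layers.GlobalAveragePooling1D(),        # average the vectors of a review
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # e.g. positive vs. negative review
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```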

The example with 4 words is a very simple case, but it captures the idea and motivation behind word embeddings. To visualize and inspect more complicated examples, we can use the Embedding Projector of TensorFlow.
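One common way to get trained embeddings into the Embedding Projector is to export them as two TSV files and load them at https://projector.tensorflow.org. A sketch, assuming the Keras model above has been trained and a hypothetical `vocab` list maps each integer index back to its word:

```python
import io

# `model` is the trained Keras model from the sketch above;
# `vocab` is a hypothetical list mapping integer indices to words.
weights = model.layers[0].get_weights()[0]  # shape: (vocab_size, embedding_dim)

with io.open("vectors.tsv", "w", encoding="utf-8") as out_v, \
     io.open("metadata.tsv", "w", encoding="utf-8") as out_m:
    for index, word in enumerate(vocab):
        out_v.write("\t".join(str(x) for x in weights[index]) + "\n")
        out_m.write(word + "\n")

# In the Embedding Projector, use "Load" to upload vectors.tsv and
# metadata.tsv, then explore the embeddings interactively.
```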

