Mining Order From Chaos: the Ingenious and Creative Fusion of NLP & Graph Theory

栏目: IT技术 · 发布时间: 5年前

A knowledge (semantic) graph is, daresay, one of the most fascinating concepts in data science. The applications, extensions, and potential of knowledge graphs to mine order from the chaos of unstructured text is truly mind-blowing.

The graph consists of nodes and edges, where a node represents an entity and an edge represents a relationship. No entity in a graph can be repeated twice, and when there are enough entities in a graph, the connections between each can reveal worlds of information.

Just with a few entities, interesting relationships begin to emerge. As a general rule, entities are nouns and relationships are verbs; for instance, “the USA is a member of NATO” would correspond to a graph relationship “[entity USA] to [entity NATO] with [relationship member of]”. Just using text from three to four sentences of information, one could construct a rudimentary knowledge graph:

Imagine the sheer amount of knowledge possessed in a complete Wikipedia article, or even an entire book! One could perform detailed analyses with this abundance of data; for example, identifying the most important entities or what the most common action or relationship an entity is on the receiving end of. Unfortunately, while building knowledge graphs is simple for humans, it is not scalable. We can build simple rule-based automated graph-builders.

To demonstrate the automation of knowledge-graph building, consider an expert of a biography of the great computer scientist and founder of artificial intelligence, Alan Turing. Since we’ve established that entities are nouns and verbs are relationships, let us first split the text into chunks, where each contains a relationship between two objects.

A simple method to do this is to separate by sentence, but a more rigorous method would be to separate by clause, since there may be many clauses and hence relationships in a single sentence (“she walked her dog to the park, then she bought food”).

Identifying the objects involved — entity extraction — is a more difficult task. Consider “Turing test”: this is an example of a nested entity, or an entity within the name of another entity. While POS (part of speech) tagging is sufficient for single-word nouns, one will need to use dependency parsing for multi-word nouns.

Dependency parsing is the task of recognizing a sentence and assigning a syntax-based structure to it. Because dependency trees are based on grammar and not word-by-word, it doesn’t care how many words an object consists of, as long as it is enclosed by other structures like verbs (‘proposed’) or transitioning phrases (‘as a…’). It is also used to find the verb that relates the two objects, systematically following what it believes is the syntax of the sentence and the rules of grammar. One can also use similar methods to link pronouns (‘he’, ‘she’, ‘they’) to the person it refers to (pronoun resolution).

It is worth mentioning that one may also benefit from building a knowledge graph by adding synonyms; tutorials will often show examples with the same word repeated many times for simplicity, but to humans using the same word repeatedly is so looked down-upon that writers actively find synonyms (words that mean the same thing as another word). One way to do this is with Hearst patterns, named after Marti Hearst, a computational linguistics researcher and professor at UC Berkeley. In her extensive research, she discovered a set of reoccurring patterns that can be reliably used to extract information.


以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

硅谷创投课

硅谷创投课

[美]加里·维纳查克 / 林怡 / 北京联合出版社 / 2017-6 / 52

☆通用电气前CEO杰克·韦尔奇力荐,影响500强企业CMO的美国互联网意见领袖全新力作! ☆《纽约时报》榜单全新畅销书,把握来自硅谷的互联网风口浪潮! ☆70后创投鬼才,影响美国00后一代商业观的网络红人、科技公司天使投资人面对面解答你创投、管理、运营中的 一切困惑! ☆来自无数实战的真实商业意见!年轻人为什么买你的账?投资人凭什么会把钱交给你?企业家更应该做的事到底是什么?告诉......一起来看看 《硅谷创投课》 这本书的介绍吧!

JSON 在线解析
JSON 在线解析

在线 JSON 格式化工具

随机密码生成器
随机密码生成器

多种字符组合密码

MD5 加密
MD5 加密

MD5 加密工具