Exploring the Graph Catalog feature of Neo4j Graph data science plugin on a Lord of the Rin...

栏目: IT技术 · 发布时间: 4年前

内容简介:To continue the introduction of theGraph algorithms run on a graph data model which is aThe graph catalog is a concept within the GDS library that allows managing multiple graph projections by name. Using that name, a created graph can be used many times i

To continue the introduction of the Neo4j Graph data science plugin , we will go over the basics and look at the graph management part of the library. This is the part of the library that is responsible for projecting the stored Neo4j graph into memory, which allows for faster execution of the graph algorithms. A significant portion of the graph management section is the Graph Catalog feature. Just what is it? Let’s look at the official documentation .

Graph algorithms run on a graph data model which is a projection of the Neo4j property graph data model. A graph projection can be seen as a view over the stored graph, containing only analytical relevant, potentially aggregated, topological and property information. Graph projections are stored entirely in-memory using compressed data structures optimized for topology and property lookup operations.

The graph catalog is a concept within the GDS library that allows managing multiple graph projections by name. Using that name, a created graph can be used many times in the analytical workflow.

There are two distinct options to project the stored graph into memory:

  • Native projection  provides the best performance by reading from the Neo4j store files.
  • Cypher projection , the more flexible, expressive approach with lesser focus on performance

In this blog post, we will take a deep dive into the native projection option of the Graph Catalog.

Graph model

I have used the GoT dataset more times than I can remember, so I decided to explore the internet and search for new exciting graphs. I stumbled upon this Lord of the Rings dataset made available by José Calvo that we will use in this blog post.

Exploring the Graph Catalog feature of Neo4j Graph data science plugin on a Lord of the Rin...

The dataset describes interactions between persons, places, groups, and things (The Ring). When choosing how to model this dataset, I decided to have “main” nodes with two labels, primary label “Node” and secondary label one of the following:

  • Person
  • Place
  • Group
  • Thing

In the picture above, I added only “Person” as a secondary label due to clarity. We store interactions from each book as a separate relationship, meaning that the interactions from the first book will be saved as the INTERACTS_1 relationships, second book interactions as INTERACTS_2 , and so on. Keep in mind that we will treat interaction relationships as undirected and weighted. These “main” nodes can also be a part of a class such as orcs, elves, men, and many more.

Import data

As mentioned, the dataset is available on GitHub , and we can easily fetch it with the LOAD CSV statement.

Import nodes

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/morethanbooks/projects/master/LotR/ontologies/ontology.csv" as row FIELDTERMINATOR "\t"
WITH row, CASE row.type WHEN 'per' THEN 'Person'
                        WHEN 'gro' THEN 'Group'
                        WHEN 'thin' THEN 'Thing'
                        WHEN 'pla' THEN 'Place' END as label
CALL apoc.create.nodes(['Node',label], [apoc.map.clean(row,['type','subtype'],[null,""])]) YIELD node
WITH node, row.subtype as class
MERGE (c:Class{id:class})
MERGE (node)-[:PART_OF]->(c)

The primary reason we added two labels to “main” nodes was to optimize the import of interaction relationships. Now, you might say the optimization is unnecessary due to the tiny dataset we have, and I would agree, but let’s pretend we might be dealing with millions of nodes. We start by defining the unique constraint on the label “Node”.

CREATE CONSTRAINT ON (n:Node) ASSERT n.id IS UNIQUE

Import relationships

Now that we have set up the unique constraint,  cypher query planner will use it to match our existing nodes much faster.

UNWIND ['1','2','3'] as book
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/morethanbooks/projects/master/LotR/tables/networks-id-volume" + book + ".csv" AS row
MATCH (source:Node{id:coalesce(row.IdSource,row.Source)})
MATCH (target:Node{id:coalesce(row.IdTarget,row.Target)})
CALL apoc.create.relationship(source, "INTERACTS_" + book,
         {weight:toInteger(row.Weight)}, target) YIELD rel
RETURN distinct true

GDS Graph Catalog

The syntax for creating named graphs in Graph Catalog is:

CALL gds.graph.create(graph name, node label, relationship type).

Describing nodes we want to project

In general, with the native projection variant, there are three options to describe the nodes we want to project into memory:

  • Project a single node label using a string:
    • ‘Label’ (‘*’ is a wildcard operator that projects all nodes)
  • Project multiple node labels using an array:
    • [‘Label1’, ‘Label2’, ‘Label3’]
  • Project multiple node labels with their properties using a configuration map:
    • {
        Label: {
          label: "Label",
          properties: [
            "property1",
            "property2"
          ]
        },
        Label2: {
          label: "Label2",
          properties: [
            "foo",
            "bar"
          ]
        }
      }

An important thing to note regarding projecting node labels:

In the in-memory graph, all projected node labels are merged into a single label. Unlike for relationship projections, it is currently not possible to specify a filter on projected labels. If the graph is used as input for an algorithm, all nodes will be considered.

While we can filter which node labels we want to project to the in-memory graph, additional filtering of nodes when executing graph algorithm is currently not supported.

Describing relationships we want to project

The syntax to describe the relationships we want to project is very similar to that of the nodes.

  • Project a single relationship type using a string:
    • ‘TYPE’ (‘*’ is a wildcard that projects all relationship-types)
  • Project multiple relationship types using an array:
    • [‘TYPE1′,’TYPE2’]
  • Project more relationship types with their properties using a configuration map:
    • {ALIAS_OF_TYPE: {type:'RELATIONSHIP_TYPE', 
                       orientation: 'NATURAL',
                       aggregation: 'DEFAULT'
                       properties:[property1,property2]}​

The orientation parameter in the configuration map defines the direction of the relationships we want to project. Possible values are:

  • ‘NATURAL’ -> each relationship is projected the same way as it is stored in Neo4j
  • ‘REVERSE’ -> each relationship is reversed during graph projection
  • ‘UNDIRECTED’ -> each relationship is projected in both natural and reverse orientation

An important thing to note is that the GDS library supports running graph algorithms on a multigraph . The aggregation parameter is handy when we want to convert a multigraph to a single graph(not a multigraph), but we’ll take a closer look at that in another blog post.

Let’s look at some practical examples now.

Whole graph

Let’s start by projecting the entire graph into memory using the wildcard operator for both the nodes and the relationships.

CALL gds.graph.create('whole_graph','*', '*')

Most of the time, we start the graph analysis by running the (weakly) connected components algorithm to get an idea of how (dis)connected our graph really is.

CALL gds.wcc.stream('whole_graph') YIELD nodeId, componentId
RETURN componentId, count(*) as size
ORDER BY size DESC LIMIT 10

Results

componentId size
0 86

The graph as a whole consists of a single component. Usually, what you’ll get with real-world data is a single super component (85+% of all nodes) and a few small disconnected components.

Drop the projected graph from the catalog

After each analysis, we will release the projected graph from memory.

CALL gds.graph.drop('whole_graph');

Interaction graph

In the next step, we want to ignore PART_OF relationships and only focus on INTERACTS_X relationships. We will use an array for describing relationship-types to take into account all three INTERACTS_X relationships.

CALL gds.graph.create('all_interacts','Node',
     ['INTERACTS_1', 'INTERACTS_2', 'INTERACTS_3'])

Let’s run the weakly connected components algorithm on our new projected graph.

CALL gds.wcc.stream('all_interacts') YIELD nodeId, componentId
RETURN componentId, count(*) as size, 
       collect(gds.util.asNode(nodeId).Label) as ids
ORDER BY size DESC LIMIT 10

Results

componentId size ids
0 73 [“Anduin”, “Aragorn”, “Arathorn”, and many more]
26 1 [“Mirkwood”]
25 1 [“Old Forest”]

Our new graph consists of three components. We have a single super component and two tiny components consisting of only a single node. We can deduce that locations “Mirkwood” and “Old Forest” have no INTERACTS_X relationships.

Let’s use the same projected graph and only look at interactions from the first book. We can filter which relationship-types should the graph algorithm consider with the relationshipTypes parameter.

CALL gds.wcc.stream('all_interacts', 
    {relationshipTypes:['INTERACTS_1']})
YIELD nodeId, componentId
RETURN componentId, count(*) as size, 
       collect(gds.util.asNode(nodeId).Label) as ids
ORDER BY size DESC LIMIT 10

Results

componentId size ids
0 62 [“Anduin”, “Aragorn”, “Arathorn”, “Arwen”, and many more]
21 1 [“Ents”]
26 1 [“Mirkwood”]
23 1 [“Eorl”]
24 1 [“Éowyn”]
25 1 [“Old Forest”]
22 1 [“Éomer”]
27 1 [“Faramir”]
38 1 [“Gorbag”]
6 1 [“Beregond”]

We get more disconnected components if we take into account only interactions from the first book. This makes sense as some of the characters/locations haven’t yet been introduced in the first book, so they have no INTERACTS_1 relationships.

Drop the projected graph from the catalog

CALL gds.graph.drop('all_interacts');

Undirected weighted interaction graph

In the last example, we will show how to project an undirected weighted graph. We will consider only nodes labeled Person and Thing, and for relationships, we will project all the INTERACTS_X relationships along with their weight property, which will be treated as undirected.

CALL gds.graph.create('undirected_weighted',['Person', 'Thing'],
    {INTERACTS_1:{type: 'INTERACTS_1',
                  orientation: 'UNDIRECTED',
                  properties:['weight']},
    INTERACTS_2:{type:'INTERACTS_2',
                 orientation: 'UNDIRECTED',
                 properties:['weight']},
    INTERACTS_3: {type:'INTERACTS_3',
                  orientation:'UNDIRECTED',
                  properties:['weight']}});

Unweighted pageRank

To run the unweighted pageRank on our projected graph, we don’t have to specify any additional configuration.

CALL gds.pageRank.stream('undirected_weighted')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).Label as name, score
ORDER BY score DESC LIMIT 5

Results

name score
“Aragorn” 2.258091664128006
“Gandalf” 2.2152436876669523
“Frodo” 2.11306254575029
“Sam” 1.8062801460735498
“Gimli” 1.6464465597644447

Weighted pageRank

To let know the algorithm that it should take relationship weights into account, we need to use relationshipWeightProperty parameter.

CALL gds.pageRank.stream('undirected_weighted', 
   {relationshipWeightProperty:'weight'})
YIELD nodeId, score 
RETURN gds.util.asNode(nodeId).Label as name, score 
ORDER BY score DESC 
LIMIT 5

Results

name score
“Frodo” 5.101856858469546
“Gandalf” 3.7572644826024777
“Sam” 3.4707343533635133
“Aragorn” 3.246119194291532
“Pippin” 2.355583811737597

As Frodo has more interactions (defined as weight) with other characters, he comes out on top with the weighted variant of the pageRank.

Graph analysis of the first book:

To finish this blog post, we will analyze the network of the first book. We start by running the weighted pageRank on the interaction relationships from the first book only.

CALL gds.pageRank.stream('undirected_weighted', 
    {relationshipWeightProperty:'weight', 
     relationshipTypes:['INTERACTS_1']})
YIELD nodeId, score 
RETURN gds.util.asNode(nodeId).Label as name, score 
ORDER BY score DESC 
LIMIT 5

Results

name score
“Frodo” 5.518011997081339
“Gandalf” 2.8533709928393365
“Aragorn” 2.801861433405428
“Sam” 2.6516043133102363
“Ring” 1.9770310912281268

Standard suspects come on top. Frodo is by far the most central character, followed by Gandalf, Aragorn, and Sam, who are in the same ballpark of importance judging by the pageRank centrality. The Ring also shows up in the top five ladder of essential characters.

Community detection with Louvain algorithm

We are also interested in the community structure of the network. We will use the Louvain modularity algorithm to determine the community structure.

CALL gds.louvain.stream('undirected_weighted',
   {relationshipWeightProperty:'weight', 
    relationshipTypes:['INTERACTS_1']})
YIELD nodeId, communityId
RETURN communityId,
       collect(gds.util.asNode(nodeId).Label) as members 
ORDER BY length(members) DESC LIMIT 5

Results

communityId members
36 [“Aragorn”, “Arathorn”, “Arwen”, “Boromir”, “Denethor”, “Elendil”, “Elrond”, “Glorfindel”, “Gollum”, “Isildur”, “Ring”, “Saruman”, “Sauron”, “Shadowfax”, “Treebeard”]
18 [“Balin”, “Celeborn”, “Durin”, “Galadriel”, “Gimli”, “Glóin”, “Haldir”, “Legolas”, “Thorin”, “Thráin”]
34 [“Bilbo”, “Bill”, “Frodo”, “Gandalf”, “Gildor”, “Merry”, “Pippin”, “Sam”]
42 [“Goldberry”, “Bombadil”]
13 [“Éomer”]

Drop the projected graph

CALL gds.graph.drop('undirected_weighted');

Visualization with Neo4j Bloom

As a picture is worth more than a thousand words, we will visualize the network of the first book. PageRank is used to determine the size of the node and community is used to determine the color of the node.

For the first time, I will use Neo4j Bloom to visualize the network, as the latest release (1.2) adds the support of the rule-based styling of the visualization.

Exploring the Graph Catalog feature of Neo4j Graph data science plugin on a Lord of the Rin...

We can observe that Aragorn is in the same community as Sauron and Saruman. This has nothing to do with the actual community detection algorithm, but it shows the weakness of extracting information from the text as co-occurrences graphs. I guess the next step would be obtaining knowledge from the book with the context of relationships such as friend, enemy, or others.

Conclusion

I hope you after reading this blog post, you got a better understanding of how to project graphs with the Graph Data Science library . Stay tuned as we still have to go through the Cypher projection part of the Graph management section.

As always, the code is available on GitHub .


以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

集体智慧编程

集体智慧编程

Toby Segaran / 莫映、王开福 / 电子工业出版社 / 2015-3 / 79.00元

《集体智慧编程》以机器学习与计算统计为主题背景,专门讲述如何挖掘和分析Web 上的数据和资源,如何分析用户体验、市场营销、个人品味等诸多信息,并得出有用的结论,通过复杂的算法来从Web 网站获取、收集并分析用户的数据和反馈信息,以便创造新的用户价值和商业价值。全书内容翔实,包括协作过滤技术(实现关联产品推荐功能)、集群数据分析(在大规模数据集中发掘相似的数据子集)、搜索引擎核心技术(爬虫、索引、查......一起来看看 《集体智慧编程》 这本书的介绍吧!

在线进制转换器
在线进制转换器

各进制数互转换器

MD5 加密
MD5 加密

MD5 加密工具

HSV CMYK 转换工具
HSV CMYK 转换工具

HSV CMYK互换工具