内容简介:To continue the introduction of theGraph algorithms run on a graph data model which is aThe graph catalog is a concept within the GDS library that allows managing multiple graph projections by name. Using that name, a created graph can be used many times i
To continue the introduction of the Neo4j Graph data science plugin , we will go over the basics and look at the graph management part of the library. This is the part of the library that is responsible for projecting the stored Neo4j graph into memory, which allows for faster execution of the graph algorithms. A significant portion of the graph management section is the Graph Catalog feature. Just what is it? Let’s look at the official documentation .
Graph algorithms run on a graph data model which is a projection of the Neo4j property graph data model. A graph projection can be seen as a view over the stored graph, containing only analytical relevant, potentially aggregated, topological and property information. Graph projections are stored entirely in-memory using compressed data structures optimized for topology and property lookup operations.
The graph catalog is a concept within the GDS library that allows managing multiple graph projections by name. Using that name, a created graph can be used many times in the analytical workflow.
There are two distinct options to project the stored graph into memory:
- Native projection provides the best performance by reading from the Neo4j store files.
- Cypher projection , the more flexible, expressive approach with lesser focus on performance
In this blog post, we will take a deep dive into the native projection option of the Graph Catalog.
Graph model
I have used the GoT dataset more times than I can remember, so I decided to explore the internet and search for new exciting graphs. I stumbled upon this Lord of the Rings dataset made available by José Calvo that we will use in this blog post.
The dataset describes interactions between persons, places, groups, and things (The Ring). When choosing how to model this dataset, I decided to have “main” nodes with two labels, primary label “Node” and secondary label one of the following:
- Person
- Place
- Group
- Thing
In the picture above, I added only “Person” as a secondary label due to clarity. We store interactions from each book as a separate relationship, meaning that the interactions from the first book will be saved as the INTERACTS_1 relationships, second book interactions as INTERACTS_2 , and so on. Keep in mind that we will treat interaction relationships as undirected and weighted. These “main” nodes can also be a part of a class such as orcs, elves, men, and many more.
Import data
As mentioned, the dataset is available on GitHub , and we can easily fetch it with the LOAD CSV
statement.
Import nodes
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/morethanbooks/projects/master/LotR/ontologies/ontology.csv" as row FIELDTERMINATOR "\t" WITH row, CASE row.type WHEN 'per' THEN 'Person' WHEN 'gro' THEN 'Group' WHEN 'thin' THEN 'Thing' WHEN 'pla' THEN 'Place' END as label CALL apoc.create.nodes(['Node',label], [apoc.map.clean(row,['type','subtype'],[null,""])]) YIELD node WITH node, row.subtype as class MERGE (c:Class{id:class}) MERGE (node)-[:PART_OF]->(c)
The primary reason we added two labels to “main” nodes was to optimize the import of interaction relationships. Now, you might say the optimization is unnecessary due to the tiny dataset we have, and I would agree, but let’s pretend we might be dealing with millions of nodes. We start by defining the unique constraint on the label “Node”.
CREATE CONSTRAINT ON (n:Node) ASSERT n.id IS UNIQUE
Import relationships
Now that we have set up the unique constraint, cypher query planner will use it to match our existing nodes much faster.
UNWIND ['1','2','3'] as book LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/morethanbooks/projects/master/LotR/tables/networks-id-volume" + book + ".csv" AS row MATCH (source:Node{id:coalesce(row.IdSource,row.Source)}) MATCH (target:Node{id:coalesce(row.IdTarget,row.Target)}) CALL apoc.create.relationship(source, "INTERACTS_" + book, {weight:toInteger(row.Weight)}, target) YIELD rel RETURN distinct true
GDS Graph Catalog
The syntax for creating named graphs in Graph Catalog is:
CALL gds.graph.create(graph name, node label, relationship type).
Describing nodes we want to project
In general, with the native projection variant, there are three options to describe the nodes we want to project into memory:
- Project a single node label using a string:
- ‘Label’ (‘*’ is a wildcard operator that projects all nodes)
- Project multiple node labels using an array:
- [‘Label1’, ‘Label2’, ‘Label3’]
- Project multiple node labels with their properties using a configuration map:
-
{ Label: { label: "Label", properties: [ "property1", "property2" ] }, Label2: { label: "Label2", properties: [ "foo", "bar" ] } }
-
An important thing to note regarding projecting node labels:
In the in-memory graph, all projected node labels are merged into a single label. Unlike for relationship projections, it is currently not possible to specify a filter on projected labels. If the graph is used as input for an algorithm, all nodes will be considered.
While we can filter which node labels we want to project to the in-memory graph, additional filtering of nodes when executing graph algorithm is currently not supported.
Describing relationships we want to project
The syntax to describe the relationships we want to project is very similar to that of the nodes.
- Project a single relationship type using a string:
- ‘TYPE’ (‘*’ is a wildcard that projects all relationship-types)
- Project multiple relationship types using an array:
- [‘TYPE1′,’TYPE2’]
- Project more relationship types with their properties using a configuration map:
-
{ALIAS_OF_TYPE: {type:'RELATIONSHIP_TYPE', orientation: 'NATURAL', aggregation: 'DEFAULT' properties:[property1,property2]}
-
The orientation parameter in the configuration map defines the direction of the relationships we want to project. Possible values are:
- ‘NATURAL’ -> each relationship is projected the same way as it is stored in Neo4j
- ‘REVERSE’ -> each relationship is reversed during graph projection
- ‘UNDIRECTED’ -> each relationship is projected in both natural and reverse orientation
An important thing to note is that the GDS library supports running graph algorithms on a multigraph . The aggregation parameter is handy when we want to convert a multigraph to a single graph(not a multigraph), but we’ll take a closer look at that in another blog post.
Let’s look at some practical examples now.
Whole graph
Let’s start by projecting the entire graph into memory using the wildcard operator for both the nodes and the relationships.
CALL gds.graph.create('whole_graph','*', '*')
Most of the time, we start the graph analysis by running the (weakly) connected components algorithm to get an idea of how (dis)connected our graph really is.
CALL gds.wcc.stream('whole_graph') YIELD nodeId, componentId RETURN componentId, count(*) as size ORDER BY size DESC LIMIT 10
Results
componentId | size |
---|---|
0 | 86 |
The graph as a whole consists of a single component. Usually, what you’ll get with real-world data is a single super component (85+% of all nodes) and a few small disconnected components.
Drop the projected graph from the catalog
After each analysis, we will release the projected graph from memory.
CALL gds.graph.drop('whole_graph');
Interaction graph
In the next step, we want to ignore PART_OF relationships and only focus on INTERACTS_X relationships. We will use an array for describing relationship-types to take into account all three INTERACTS_X relationships.
CALL gds.graph.create('all_interacts','Node', ['INTERACTS_1', 'INTERACTS_2', 'INTERACTS_3'])
Let’s run the weakly connected components algorithm on our new projected graph.
CALL gds.wcc.stream('all_interacts') YIELD nodeId, componentId RETURN componentId, count(*) as size, collect(gds.util.asNode(nodeId).Label) as ids ORDER BY size DESC LIMIT 10
Results
componentId | size | ids |
---|---|---|
0 | 73 | [“Anduin”, “Aragorn”, “Arathorn”, and many more] |
26 | 1 | [“Mirkwood”] |
25 | 1 | [“Old Forest”] |
Our new graph consists of three components. We have a single super component and two tiny components consisting of only a single node. We can deduce that locations “Mirkwood” and “Old Forest” have no INTERACTS_X relationships.
Let’s use the same projected graph and only look at interactions from the first book. We can filter which relationship-types should the graph algorithm consider with the relationshipTypes parameter.
CALL gds.wcc.stream('all_interacts', {relationshipTypes:['INTERACTS_1']}) YIELD nodeId, componentId RETURN componentId, count(*) as size, collect(gds.util.asNode(nodeId).Label) as ids ORDER BY size DESC LIMIT 10
Results
componentId | size | ids |
---|---|---|
0 | 62 | [“Anduin”, “Aragorn”, “Arathorn”, “Arwen”, and many more] |
21 | 1 | [“Ents”] |
26 | 1 | [“Mirkwood”] |
23 | 1 | [“Eorl”] |
24 | 1 | [“Éowyn”] |
25 | 1 | [“Old Forest”] |
22 | 1 | [“Éomer”] |
27 | 1 | [“Faramir”] |
38 | 1 | [“Gorbag”] |
6 | 1 | [“Beregond”] |
We get more disconnected components if we take into account only interactions from the first book. This makes sense as some of the characters/locations haven’t yet been introduced in the first book, so they have no INTERACTS_1 relationships.
Drop the projected graph from the catalog
CALL gds.graph.drop('all_interacts');
Undirected weighted interaction graph
In the last example, we will show how to project an undirected weighted graph. We will consider only nodes labeled Person and Thing, and for relationships, we will project all the INTERACTS_X relationships along with their weight property, which will be treated as undirected.
CALL gds.graph.create('undirected_weighted',['Person', 'Thing'], {INTERACTS_1:{type: 'INTERACTS_1', orientation: 'UNDIRECTED', properties:['weight']}, INTERACTS_2:{type:'INTERACTS_2', orientation: 'UNDIRECTED', properties:['weight']}, INTERACTS_3: {type:'INTERACTS_3', orientation:'UNDIRECTED', properties:['weight']}});
Unweighted pageRank
To run the unweighted pageRank on our projected graph, we don’t have to specify any additional configuration.
CALL gds.pageRank.stream('undirected_weighted') YIELD nodeId, score RETURN gds.util.asNode(nodeId).Label as name, score ORDER BY score DESC LIMIT 5
Results
name | score |
---|---|
“Aragorn” | 2.258091664128006 |
“Gandalf” | 2.2152436876669523 |
“Frodo” | 2.11306254575029 |
“Sam” | 1.8062801460735498 |
“Gimli” | 1.6464465597644447 |
Weighted pageRank
To let know the algorithm that it should take relationship weights into account, we need to use relationshipWeightProperty parameter.
CALL gds.pageRank.stream('undirected_weighted', {relationshipWeightProperty:'weight'}) YIELD nodeId, score RETURN gds.util.asNode(nodeId).Label as name, score ORDER BY score DESC LIMIT 5
Results
name | score |
---|---|
“Frodo” | 5.101856858469546 |
“Gandalf” | 3.7572644826024777 |
“Sam” | 3.4707343533635133 |
“Aragorn” | 3.246119194291532 |
“Pippin” | 2.355583811737597 |
As Frodo has more interactions (defined as weight) with other characters, he comes out on top with the weighted variant of the pageRank.
Graph analysis of the first book:
To finish this blog post, we will analyze the network of the first book. We start by running the weighted pageRank on the interaction relationships from the first book only.
CALL gds.pageRank.stream('undirected_weighted', {relationshipWeightProperty:'weight', relationshipTypes:['INTERACTS_1']}) YIELD nodeId, score RETURN gds.util.asNode(nodeId).Label as name, score ORDER BY score DESC LIMIT 5
Results
name | score |
---|---|
“Frodo” | 5.518011997081339 |
“Gandalf” | 2.8533709928393365 |
“Aragorn” | 2.801861433405428 |
“Sam” | 2.6516043133102363 |
“Ring” | 1.9770310912281268 |
Standard suspects come on top. Frodo is by far the most central character, followed by Gandalf, Aragorn, and Sam, who are in the same ballpark of importance judging by the pageRank centrality. The Ring also shows up in the top five ladder of essential characters.
Community detection with Louvain algorithm
We are also interested in the community structure of the network. We will use the Louvain modularity algorithm to determine the community structure.
CALL gds.louvain.stream('undirected_weighted', {relationshipWeightProperty:'weight', relationshipTypes:['INTERACTS_1']}) YIELD nodeId, communityId RETURN communityId, collect(gds.util.asNode(nodeId).Label) as members ORDER BY length(members) DESC LIMIT 5
Results
communityId | members |
---|---|
36 | [“Aragorn”, “Arathorn”, “Arwen”, “Boromir”, “Denethor”, “Elendil”, “Elrond”, “Glorfindel”, “Gollum”, “Isildur”, “Ring”, “Saruman”, “Sauron”, “Shadowfax”, “Treebeard”] |
18 | [“Balin”, “Celeborn”, “Durin”, “Galadriel”, “Gimli”, “Glóin”, “Haldir”, “Legolas”, “Thorin”, “Thráin”] |
34 | [“Bilbo”, “Bill”, “Frodo”, “Gandalf”, “Gildor”, “Merry”, “Pippin”, “Sam”] |
42 | [“Goldberry”, “Bombadil”] |
13 | [“Éomer”] |
Drop the projected graph
CALL gds.graph.drop('undirected_weighted');
Visualization with Neo4j Bloom
As a picture is worth more than a thousand words, we will visualize the network of the first book. PageRank is used to determine the size of the node and community is used to determine the color of the node.
For the first time, I will use Neo4j Bloom to visualize the network, as the latest release (1.2) adds the support of the rule-based styling of the visualization.
We can observe that Aragorn is in the same community as Sauron and Saruman. This has nothing to do with the actual community detection algorithm, but it shows the weakness of extracting information from the text as co-occurrences graphs. I guess the next step would be obtaining knowledge from the book with the context of relationships such as friend, enemy, or others.
Conclusion
I hope you after reading this blog post, you got a better understanding of how to project graphs with the Graph Data Science library . Stay tuned as we still have to go through the Cypher projection part of the Graph management section.
As always, the code is available on GitHub .
以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网
猜你喜欢:本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。