Presenting multiple node label support and graph mutability features of the Neo4j Graph Dat...

Category: IT Technology · Published: 4 years ago


Just recently, version 1.1.0 of the Neo4j Graph Data Science library was released. I decided to take a short break from watching The Joy of Painting by Bob Ross, which I highly recommend, and inspect some of the new features. For me, the most anticipated new feature is the ability to filter nodes by labels when executing graph algorithms. Just as the relationshipTypes parameter lets us choose which relationships a graph algorithm should consider, the new nodeLabels parameter lets us choose which nodes it should use. We will also take a look at the new projected graph mutability feature.

I like to look for new networks so that my blog posts do not get boring and repetitive. Here we will delve into the Star Wars world. The interactions between characters are available on GitHub, thanks to Evelina Gabasova. We will combine them with the dataset made available by Florent Georges, who scraped the Star Wars API and added descriptions from Wikipedia.

Graph schema and import

Our graph consists of characters that have INTERACTSX relationships with other characters. The value of X indicates the episode in which the interaction occurred, e.g. INTERACTS1 indicates an interaction in the first episode. Each character also belongs to a single species, which is represented as a relationship to the species node.

We will start by importing the interaction social network. The dataset is a bit tricky as characters don’t have a unique id available throughout the episodes. Instead, in each episode, a new id is determined as a zero-based index of the character in the “nodes” element of the JSON file. To mitigate this issue, we use the apoc.cypher.runMany procedure to run two transactions for each episode. In the first, we store the nodes and assign them the value based on the zero-based index. In the second transaction, we use that value to store links between characters. This procedure is repeated for each episode.

UNWIND range(1,7) as episode
CALL apoc.load.json('https://raw.githubusercontent.com/evelinag/StarWars-social-network/master/networks/starwars-episode-' + episode + '-interactions.json') YIELD value as data
CALL apoc.cypher.runMany("
    WITH $data.nodes as nodes 
    UNWIND range(0,size(nodes)-1) as value 
    MERGE (c:Character{id:nodes[value].name}) 
    SET c.value = value 
    RETURN distinct 'done'
    ;\n
    WITH $data as data, $episode as episode
    UNWIND data.links as link
    MERGE (source:Character{value:link.source})
    MERGE (target:Character{value:link.target})
    WITH source, target, link.value as weight, episode
    CALL apoc.create.relationship(source,'INTERACTS' + episode, 
         {weight:weight}, target) YIELD rel
    RETURN distinct 'done'",
    {data:data, episode:episode},{statistics:true,timeout:10}) 
YIELD result
RETURN result
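The index-remapping idea behind the import can also be sketched outside the database. The snippet below is only an illustration in plain Python: the two tiny "episodes" and the resolve_links helper are made up for this example and are not part of the dataset or of APOC.

```python
# Hypothetical miniature of two episode files; each has its own zero-based node list.
episodes = {
    1: {"nodes": [{"name": "ANAKIN"}, {"name": "PADME"}, {"name": "JAR JAR"}],
        "links": [{"source": 0, "target": 1, "value": 5},
                  {"source": 1, "target": 2, "value": 2}]},
    2: {"nodes": [{"name": "PADME"}, {"name": "ANAKIN"}],
        "links": [{"source": 0, "target": 1, "value": 9}]},
}


def resolve_links(episode, data):
    """Translate per-episode indices into stable character names."""
    # Step 1: the position in data["nodes"] is the id that the links refer to.
    index_to_name = {i: node["name"] for i, node in enumerate(data["nodes"])}
    # Step 2: rewrite every link using names and tag it with its episode.
    return [
        (index_to_name[link["source"]], "INTERACTS%d" % episode,
         index_to_name[link["target"]], link["value"])
        for link in data["links"]
    ]


all_links = [edge
             for episode, data in sorted(episodes.items())
             for edge in resolve_links(episode, data)]
for edge in all_links:
    print(edge)
```

Index 0 means ANAKIN in the first toy episode and PADME in the second, which is why the lookup table has to be rebuilt for every episode before the links are stored.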

We will continue by enriching our graph with information about the species. Unfortunately, the names of characters in the two datasets do not match exactly. There are several approaches we could use to find close matches for the characters' names. The APOC library has a whole section dedicated to text-similarity procedures. I decided to use a full-text search index to match the characters' names. We begin by defining the full-text search index on the id property of characters.

CALL db.index.fulltext.createNodeIndex(
    "names",["Character"],["id"])

Each full-text search query might return more than a single result, but we will look only at the top result. If the score of the top result is above an arbitrary threshold, 0.85 in our case, we will assume it is the same character. We can look at some examples:

CALL apoc.load.json('https://raw.githubusercontent.com/fgeorges/star-wars-dataset/master/data/enriched.json')
YIELD value
UNWIND value.people as person
CALL db.index.fulltext.queryNodes("names", 
    replace(person.name,'é','e')) 
YIELD node,score
RETURN person.name as species_dataset_name,
       collect(node)[0].id as interactions_dataset_name
LIMIT 5

Results

species_dataset_name	interactions_dataset_name
“Luke Skywalker”	“LUKE”
“C-3PO”	“C-3PO”
“Darth Vader”	“DARTH VADER”
“Leia Organa”	“LEIA”
“Owen Lars”	“OWEN”

It seems that the species dataset mostly contains full names, while the interactions dataset contains only the first names of the characters. As the full-text search matching looks fine, let us proceed and store the additional information in the graph.

CALL apoc.load.json('https://raw.githubusercontent.com/fgeorges/star-wars-dataset/master/data/enriched.json')
YIELD value
UNWIND value.people as person
// search for characters
CALL db.index.fulltext.queryNodes("names", 
    replace(person.name,'é','e'))
YIELD node,score
// collect the top hit
WITH person, collect([node,score])[0] as top_hit
WITH person, top_hit[0] as node, top_hit[1] as score
// threshold
WHERE score > 0.85
// enrich characters
SET node += apoc.map.clean(person,
    ['films','vehicles','starships','species'],['n/a'])
WITH node, person.species as person_species
UNWIND person_species as species
MERGE (s:Species{id:species})
MERGE (node)-[:BELONG_TO]->(s)
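The "take the top hit and keep it only above a threshold" rule is easy to state in isolation. Here is a minimal sketch in plain Python, assuming each search returns a list of (name, score) pairs; the best_match helper and the scores are invented for illustration and are not output of the full-text index.

```python
THRESHOLD = 0.85  # the same arbitrary cut-off used in the query above


def best_match(hits, threshold=THRESHOLD):
    """Return the top-scoring candidate name, or None if it is not convincing.

    `hits` is a list of (name, score) pairs, as a search index might return.
    """
    if not hits:
        return None
    name, score = max(hits, key=lambda hit: hit[1])
    # Strictly greater than the threshold, mirroring `WHERE score > 0.85`.
    return name if score > threshold else None


# Made-up scores, for illustration only.
print(best_match([("LUKE", 1.9), ("OTHER CANDIDATE", 0.4)]))
print(best_match([("WEAK CANDIDATE", 0.3)]))
```

A person whose best candidate scores at or below the threshold is simply skipped, which is why some characters still lack a species after this step.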

We will also import additional meta-data about the species.

CALL apoc.load.json('https://raw.githubusercontent.com/fgeorges/star-wars-dataset/master/data/enriched.json')
YIELD value
UNWIND value.species as species
MATCH (s:Species{id:species.url})
SET s += apoc.map.clean(species, 
    ['films','homeworld','people','url'],['n/a','unknown'])

Unfortunately, we haven’t assigned the species to all of our characters yet. Some characters show up in the first dataset that do not show up in the second. As the focus of this blog post is not to show how to scrape the internet efficiently, I prepared a CSV file that we can use to assign missing species values.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/blogs/master/Star_Wars/star_wars_species.csv" as row
MATCH (c:Character{id:row.name})
MERGE (s:Species{name:row.species})
MERGE (c)-[:BELONG_TO]->(s)

Some species have only one or two members, so I decided to group them under “Other” species. This way, we will make our further graph analysis more relevant.

MATCH (s:Species)
WHERE size((s)<-[:BELONG_TO]-()) <= 2
MERGE (unknown:Species{name:'Other'})
WITH s, unknown
MATCH (s)<-[:BELONG_TO]-(character)
MERGE (unknown)<-[:BELONG_TO]-(character)
DETACH DELETE s
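The grouping rule itself is independent of Cypher. As a rough sketch in plain Python, assuming a simple character-to-species mapping (the group_rare_species helper and the toy assignments are hypothetical and not taken from the dataset):

```python
from collections import Counter


def group_rare_species(character_species, min_members=3, fallback="Other"):
    """Relabel species with fewer than `min_members` characters as `fallback`."""
    counts = Counter(character_species.values())
    return {
        character: species if counts[species] >= min_members else fallback
        for character, species in character_species.items()
    }


# Toy assignment, for illustration only.
toy = {
    "LUKE": "Human", "LEIA": "Human", "HAN": "Human",
    "CHEWBACCA": "Wookiee",
    "R2-D2": "Droid", "C-3PO": "Droid",
}
print(group_rare_species(toy))
```

With min_members=3, species that have one or two members are folded into "Other", which corresponds to the `<= 2` condition in the query above.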

Multiple node label support

We can now look at the multiple node label projection and how it allows us to filter nodes when executing graph algorithms. As we will be using the native projection, we have to assign secondary labels to characters based on their species. We will use the apoc.create.addLabels procedure to assign the secondary labels.

MATCH (c:Character)-[:BELONG_TO]->(species)
CALL apoc.create.addLabels([id(c)], [species.name]) YIELD node
RETURN distinct 'done'

If you are new to the concept of the GDS graph catalog, you might want to look at one of my previous blog posts. We will use the array option to describe the node labels we want to project. Nodes belonging to all five species will be projected. All relationships in the interaction network will also be projected and treated as undirected and weighted. As there can be more than one relationship between a given pair of nodes, we are projecting a multigraph.

CALL gds.graph.create('starwars',
  ['Other','Human','Droid','Neimodian','Gungan'],
  {INTERACTS1:{type:'INTERACTS1', orientation:'UNDIRECTED', properties:['weight']},
   INTERACTS2:{type:'INTERACTS2', orientation:'UNDIRECTED', properties:['weight']},
   INTERACTS3:{type:'INTERACTS3', orientation:'UNDIRECTED', properties:['weight']},
   INTERACTS4:{type:'INTERACTS4', orientation:'UNDIRECTED', properties:['weight']},
   INTERACTS5:{type:'INTERACTS5', orientation:'UNDIRECTED', properties:['weight']},
   INTERACTS6:{type:'INTERACTS6', orientation:'UNDIRECTED', properties:['weight']}})

To start off, let’s compute the weighted PageRank on the whole projected graph.

CALL gds.pageRank.stream('starwars',
   {relationshipWeightProperty:'weight'})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name as name, score
ORDER BY score DESC LIMIT 5

Results

name	score
“Han Solo”	3.711507879383862
“Anakin Skywalker”	3.5772032469511035
“Padmé Amidala”	3.4175086730159823
“C-3PO”	3.3624297386966657
“Luke Skywalker”	3.316930921562016

The results are not surprising, except perhaps that Han Solo appears to be the most important character in the galaxy. This is because Anakin Skywalker and Darth Vader are treated as two separate entities in our network; otherwise, Anakin would probably be on top. Nonetheless, judging by the PageRank scores, the top five characters are quite similar in importance, and no one really stands out.

Let’s say we want to find the most important characters in the Gungan social network. We can consider only Gungans in the algorithm with the nodeLabels parameter.

CALL gds.pageRank.stream('starwars',{nodeLabels:['Gungan'], 
   relationshipWeightProperty:'weight'})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id as name, score
ORDER BY score DESC LIMIT 5

name	score
“JAR JAR”	1.3676453701686113
“BOSS NASS”	1.0597422499209648
“TARPALS”	1.0597422499209648
“GENERAL CEEL”	0.3810879142256453

As you might expect, the man, the legend Jar Jar, comes out on top.
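Conceptually, the nodeLabels filter restricts the algorithm to the subgraph induced by the selected labels. The sketch below shows that idea in plain Python with a textbook power-iteration PageRank. It is not the GDS implementation, so its scores are not expected to match the numbers above, and the toy edges and weights are made up.

```python
def filter_by_labels(labels, edges, wanted):
    """Keep nodes carrying a wanted label and edges with both ends kept."""
    keep = {node for node, node_labels in labels.items()
            if wanted & set(node_labels)}
    kept_edges = [(s, t, w) for s, t, w in edges if s in keep and t in keep]
    return keep, kept_edges


def weighted_pagerank(nodes, edges, damping=0.85, iterations=20):
    """Plain power iteration on an undirected, weighted graph."""
    # Undirected: every edge contributes in both directions.
    out = {n: {} for n in nodes}
    for s, t, w in edges:
        out[s][t] = out[s].get(t, 0.0) + w
        out[t][s] = out[t].get(s, 0.0) + w
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        nxt = {n: 1.0 - damping for n in nodes}
        for n, neighbours in out.items():
            total = sum(neighbours.values())
            for m, w in neighbours.items():
                # Each node passes its score on in proportion to edge weight.
                nxt[m] += damping * score[n] * w / total
        score = nxt
    return score


# Toy data, for illustration only.
labels = {"JAR JAR": ["Gungan"], "BOSS NASS": ["Gungan"],
          "TARPALS": ["Gungan"], "QUI-GON": ["Human"]}
edges = [("JAR JAR", "BOSS NASS", 3.0), ("JAR JAR", "TARPALS", 2.0),
         ("JAR JAR", "QUI-GON", 6.0)]

nodes, kept = filter_by_labels(labels, edges, {"Gungan"})
scores = weighted_pagerank(nodes, kept)
print(sorted(scores, key=scores.get, reverse=True))
```

Note that the heavy JAR JAR to QUI-GON edge is dropped before the scores are computed, because one of its endpoints does not carry the requested label.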

We can also combine both node and relationship filters. We will compute PageRank based on the network of humans and other species interactions in the fifth and sixth episodes.

CALL gds.pageRank.stream('starwars',{nodeLabels:['Other','Human'],
   relationshipTypes:['INTERACTS5','INTERACTS6'], 
   relationshipWeightProperty:'weight'})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name as name, score
ORDER BY score DESC LIMIT 5

name	score
“Han Solo”	3.928774843923747
“Leia Organa”	2.789427683595568
“Luke Skywalker”	2.5331994574517003
“Padmé Amidala”	2.0019969200715417
“Biggs Darklighter”	1.9624054052866993

It looks like the kids Luke and Leia have grown up. The addition of the node filtering feature allows us to be very precise when describing the subset of the projected graph we want to consider as an input of a graph algorithm. After we are done with the analysis, we can drop the projected graph from the catalog.

CALL gds.graph.drop('starwars')

Graph mutability

With the graph mutability feature, graph algorithms have gained a third mode, mutate, on top of the previously supported stream and write modes. We can look at the official documentation to get a better sense of what it offers:

The mutate mode is very similar to the write mode but instead of writing results to the Neo4j database, they are made available at the in-memory graph. Note that the mutateProperty must not exist in the in-memory graph beforehand. This enables running multiple algorithms on the same in-memory graph without writing results to Neo4j in-between algorithm executions.

Basically, the mutate mode is almost the same as the write mode, except that it doesn't write results directly to the stored Neo4j graph. Instead, it stores the results in the same in-memory projected graph that was used as the input of the graph algorithm. Later, if we want to, we can still store the mutated properties of the in-memory graph to Neo4j with the gds.graph.writeNodeProperties procedure.
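To make the difference between the two destinations concrete, here is a toy model in plain Python. The InMemoryGraph class and its method names are hypothetical stand-ins invented for this sketch; they only imitate the behaviour described in the documentation quote and are not GDS code.

```python
class InMemoryGraph:
    """Toy stand-in for a projected graph; not the GDS implementation."""

    def __init__(self, node_ids):
        self.node_ids = list(node_ids)
        self.properties = {}  # property name -> {node id: value}

    def mutate(self, prop, values):
        # Mirrors the documented rule: the mutate property must be new.
        if prop in self.properties:
            raise ValueError("property %r already exists in the graph" % prop)
        self.properties[prop] = dict(values)

    def write_node_properties(self, store, props):
        # Copy selected in-memory properties to the (toy) stored graph.
        for prop in props:
            for node_id, value in self.properties[prop].items():
                store.setdefault(node_id, {})[prop] = value


store = {}  # stands in for the database
graph = InMemoryGraph([0, 1, 2])

graph.mutate("community", {0: 7, 1: 7, 2: 9})
print(store)  # nothing has reached the store yet

graph.write_node_properties(store, ["community"])
print(store)
```

The store stays empty until the explicit write step, which is what allows several algorithms to run on the same in-memory graph without touching the database in between.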

We will start by projecting a new in-memory graph. This time we will use cypher projection. In the node query, we keep only the characters that appeared in the first episode. We also provide the labels column, which will allow us to filter nodes when executing algorithms. If you look closely, the labels column is in the form of an array, which means that a single node can have multiple labels. Also, with cypher projection, we can provide a virtual label, as shown in the query.

CALL gds.graph.create.cypher('starwars_cypher',
  'MATCH (species)<-[:BELONG_TO]-(c:Character)
   // filter characters from first episode
   WHERE (c)-[:INTERACTS1]-() 
   RETURN id(c) as id, [species.name] as labels',
  'MATCH (c1:Character)-[r]-(c2:Character) 
   RETURN id(c1) as source, id(c2) as target, 
          type(r) as type, r.weight as weight',
   {validateRelationships:false}
)

Another new feature of the 1.1 release is the validateRelationships parameter of the cypher projection. The default setting throws an error if we try to project relationships between nodes that are not described in the node query. As we project only characters from the first episode in the node query but try to project interactions between all characters in the relationship query, we would get an error. We could either add a filter to the relationship query or just set the validateRelationships parameter to false. This way, all relationships between nodes not described in the node query are silently ignored during the graph projection.
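The two behaviours can be summarized with a small sketch in plain Python. The project function below is a hypothetical illustration of the rule just described, not the library's code, and the node ids are made up.

```python
def project(node_ids, relationships, validate_relationships=True):
    """Keep relationships whose endpoints were both returned by the node query."""
    nodes = set(node_ids)
    kept = []
    for source, target in relationships:
        if source in nodes and target in nodes:
            kept.append((source, target))
        elif validate_relationships:
            # Default behaviour: refuse to project a dangling relationship.
            raise ValueError(
                "relationship (%s, %s) references an unprojected node"
                % (source, target))
        # With validation off, the dangling relationship is skipped silently.
    return kept


first_episode_nodes = [1, 2, 3]
all_relationships = [(1, 2), (2, 3), (3, 99)]  # node 99 was not projected

print(project(first_episode_nodes, all_relationships,
              validate_relationships=False))
```

Calling the same function with validate_relationships left at its default raises an error on the (3, 99) pair, which is the situation we would run into with the projection above.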

Finally, we can try out the mutate mode of the Louvain algorithm, which is used for community detection. We will investigate the community structure of the first episode within each species rather than on the whole network. To achieve this, we have to run the algorithm separately for each species, considering only characters from the given species with the nodeLabels parameter and only interactions from the first episode with the relationshipTypes parameter. To define the property used to store the results in the in-memory graph, we use the mutateProperty parameter.

MATCH (s:Species)
CALL gds.louvain.mutate('starwars_cypher', 
    {nodeLabels:[s.name], mutateProperty: s.name + '_1',
     relationshipTypes:['INTERACTS1']}) 
YIELD communityCount
RETURN s.name as species, communityCount

Results

species	communityCount
“Human”	4
“Droid”	2
“Neimodian”	1
“Gungan”	1
“Other”	5

We can now use the mutated community properties from the first episode as the seed property for calculating communities in the second episode. The Louvain algorithm is initialized by assigning each node a unique community id. With the seedProperty parameter, we can define the initial community id of each node ourselves. If you want to learn more about the seed property and its use cases, I wrote a blog post about it.

MATCH (s:Species)
CALL gds.louvain.mutate('starwars_cypher',
  {nodeLabels:[s.name], mutateProperty: s.name + '_2',
  relationshipTypes:['INTERACTS2'], seedProperty: s.name + '_1'})
YIELD communityCount
RETURN s.name as species, communityCount

Results

species	communityCount
“Other”	3
“Human”	2
“Droid”	2
“Neimodian”	1
“Gungan”	1

We can observe that the number of communities in the second episode has decreased slightly. To write the mutated properties back to the stored Neo4j graph, we can use the gds.graph.writeNodeProperties procedure.

CALL gds.graph.writeNodeProperties('starwars_cypher',
  ['Human_1','Droid_1','Other_1',
   'Human_2','Droid_2','Other_2'])

We can now examine the community structure of humans within the first episode.

MATCH (c:Human)
WHERE exists (c.Human_1)
RETURN c.Human_1 as community,
       count(*) as size, 
       collect(c.id) as members

Results

community	size	members
16	5	[“KITSTER”, “QUI-GON”, “PADME”, “JIRA”, “SHMI”]
4	3	[“VALORUM”, “BAIL ORGANA”, “EMPEROR”]
6	6	[“MACE WINDU”, “RABE”, “BRAVO TWO”, “BRAVO THREE”, “RIC OLIE”, “ANAKIN”]
11	3	[“OBI-WAN”, “CAPTAIN PANAKA”, “SIO BIBBLE”]

And inspect how those communities evolved in the second episode.

MATCH (c:Human)
WHERE exists (c.Human_2)
RETURN c.Human_2 as community,
       count(*) as size,
       collect(c.id) as members

Results

community	size	members
11	7	[“QUI-GON”, “OBI-WAN”, “EMPEROR”, “CAPTAIN PANAKA”, “JIRA”, “MACE WINDU”, “BAIL ORGANA”]
12	10	[“SIO BIBBLE”, “PADME”, “RIC OLIE”, “ANAKIN”, “SHMI”, “KITSTER”, “VALORUM”, “RABE”, “BRAVO TWO”, “BRAVO THREE”]

Unfortunately, I am not enough of a domain expert to offer useful insights into the community structure.

Conclusion

With multiple node label support, we are now able to project the whole graph in memory and be very specific in describing the subgraph we want to use as an input to the graph algorithm as we can now filter both nodes and relationships. Graph mutability allows us to chain various graph algorithms without the need to drop and recreate projected graphs with newly computed properties. In my opinion, this is a great addition to the Neo4j Graph Data Science library.

If you have any cool dataset in mind you would like me to analyze or have any feedback as to what you would like to see next, let me know. As always, the code is available on GitHub.
