Just recently, version 1.1.0 of the Neo4j Graph Data Science library was released. I decided to take a short break from watching The Joy of Painting by Bob Ross, which I highly recommend, and inspect some of the new features. For me, the most anticipated new feature is the ability to filter nodes by labels when executing graph algorithms. Just as the relationshipTypes parameter lets us choose which relationships a graph algorithm considers, the new nodeLabels parameter lets us choose which nodes it uses. We will also take a look at the new projected graph mutability feature.
I like to look for new networks so that my blog posts don't become boring and repetitive. Here we will delve into the Star Wars world. The interactions between characters are available on GitHub, thanks to Evelina Gabasova. We will combine them with the dataset made available by Florent Georges, who scraped the Star Wars API and added descriptions from Wikipedia.
Graph schema and import
Our graph consists of characters that have INTERACTSX relationships with other characters. The value of X indicates the episode in which the interaction occurred, e.g. INTERACTS1 indicates an interaction in the first episode. Each character also belongs to a single species, which is represented as a relationship to the species node.
We will start by importing the interaction social network. The dataset is a bit tricky as characters don’t have a unique id available throughout the episodes. Instead, in each episode, a new id is determined as a zero-based index of the character in the “nodes” element of the JSON file. To mitigate this issue, we use the apoc.cypher.runMany procedure to run two transactions for each episode. In the first, we store the nodes and assign them the value based on the zero-based index. In the second transaction, we use that value to store links between characters. This procedure is repeated for each episode.
UNWIND range(1,7) as episode
CALL apoc.load.json('https://raw.githubusercontent.com/evelinag/StarWars-social-network/master/networks/starwars-episode-' + episode + '-interactions.json')
YIELD value as data
CALL apoc.cypher.runMany("
  WITH $data.nodes as nodes
  UNWIND range(0,size(nodes)-1) as value
  MERGE (c:Character{id:nodes[value].name})
  SET c.value = value
  RETURN distinct 'done' ;\n
  WITH $data as data, $episode as episode
  UNWIND data.links as link
  MERGE (source:Character{value:link.source})
  MERGE (target:Character{value:link.target})
  WITH source, target, link.value as weight, episode
  CALL apoc.create.relationship(source,'INTERACTS' + episode, {weight:weight}, target)
  YIELD rel
  RETURN distinct 'done'",
  {data:data, episode:episode},
  {statistics:true,timeout:10})
YIELD result
RETURN result
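To make the index problem concrete, here is a minimal Python sketch of what the two statements achieve for a single episode. The sample data is made up for illustration and only mirrors the "nodes" and "links" fields used in the query above; the function name resolve_links is hypothetical.

```python
# Hypothetical miniature of one episode file. "source" and "target" are
# positions in the "nodes" list of this episode, not stable character ids.
episode = {
    "nodes": [{"name": "R2-D2"}, {"name": "CHEWBACCA"}, {"name": "LUKE"}],
    "links": [
        {"source": 0, "target": 2, "value": 17},
        {"source": 1, "target": 2, "value": 5},
    ],
}


def resolve_links(data):
    """Translate index-based links into (source name, target name, weight)."""
    # Step 1: map each zero-based index to a character name.
    # This mapping is only valid within a single episode.
    index_to_name = {i: node["name"] for i, node in enumerate(data["nodes"])}
    # Step 2: use the mapping to resolve both ends of every link.
    return [
        (index_to_name[link["source"]], index_to_name[link["target"]], link["value"])
        for link in data["links"]
    ]


resolved = resolve_links(episode)
print(resolved)
```

Because the mapping is rebuilt per episode, the same index can point to different characters in different files, which is why the Cypher import overwrites the value property before storing each episode's links.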
We will continue by enriching our graph with information about the species. Unfortunately, the names of characters in the two datasets do not match exactly. There are several approaches that we could use to find close matches of the characters’ names. The APOC library has a whole section dedicated to text-similarity procedures. I decided to use a full-text search index to match the characters’ names. We begin by defining the full-text search index on the id property of characters.
CALL db.index.fulltext.createNodeIndex( "names",["Character"],["id"])
Each full-text search query might return more than a single result, but we will look only at the top result. If the score of the top result is above an arbitrary threshold, 0.85 in our case, we will assume it is the same character. We can look at some examples:
CALL apoc.load.json('https://raw.githubusercontent.com/fgeorges/star-wars-dataset/master/data/enriched.json') YIELD value UNWIND value.people as person CALL db.index.fulltext.queryNodes("names", replace(person.name,'é','e')) YIELD node,score RETURN person.name as species_dataset_name, collect(node)[0].id as interactions_dataset_name LIMIT 5
Results
species_dataset_name | interactions_dataset_name |
---|---|
“Luke Skywalker” | “LUKE” |
“C-3PO” | “C-3PO” |
“Darth Vader” | “DARTH VADER” |
“Leia Organa” | “LEIA” |
“Owen Lars” | “OWEN” |
More or less, it seems that the species dataset contains full names, while the interactions dataset contains only the first names of the characters. As the full-text search matching seems fine, let us proceed and store the additional information to the graph.
CALL apoc.load.json('https://raw.githubusercontent.com/fgeorges/star-wars-dataset/master/data/enriched.json')
YIELD value
UNWIND value.people as person
// search for characters
CALL db.index.fulltext.queryNodes("names", replace(person.name,'é','e'))
YIELD node,score
// collect the top hit
WITH person, collect([node,score])[0] as top_hit
WITH person, top_hit[0] as node, top_hit[1] as score
// threshold
WHERE score > 0.85
// enrich characters
SET node += apoc.map.clean(person, ['films','vehicles','starships','species'],['n/a'])
WITH node, person.species as person_species
UNWIND person_species as species
MERGE (s:Species{id:species})
MERGE (node)-[:BELONG_TO]->(s)
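The matching rule in the query above (keep only the top hit, and only if its score clears the threshold) can be sketched in a few lines of Python. The scores below are invented for illustration and are not real full-text index output; best_match is a hypothetical helper name.

```python
def best_match(hits, threshold=0.85):
    """Return the name of the top-scoring hit, or None if it is too weak.

    `hits` is a list of (name, score) pairs, standing in for the rows
    returned by a full-text query.
    """
    if not hits:
        return None
    name, score = max(hits, key=lambda hit: hit[1])
    # Same strict comparison as `WHERE score > 0.85` in the Cypher query.
    return name if score > threshold else None


# Made-up scores, for illustration only.
print(best_match([("LUKE", 1.9), ("ANAKIN", 0.4)]))
print(best_match([("OWEN", 0.5)]))
```

A person without a sufficiently strong match is simply skipped, which is why some characters are still missing a species after this step.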
We will also import additional meta-data about the species.
CALL apoc.load.json('https://raw.githubusercontent.com/fgeorges/star-wars-dataset/master/data/enriched.json') YIELD value UNWIND value.species as species MATCH (s:Species{id:species.url}) SET s += apoc.map.clean(species, ['films','homeworld','people','url'],['n/a','unknown'])
Unfortunately, we haven’t assigned the species to all of our characters yet. Some characters show up in the first dataset that do not show up in the second. As the focus of this blog post is not to show how to scrape the internet efficiently, I prepared a CSV file that we can use to assign missing species values.
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/tomasonjo/blogs/master/Star_Wars/star_wars_species.csv" as row MATCH (c:Character{id:row.name}) MERGE (s:Species{name:row.species}) MERGE (c)-[:BELONG_TO]->(s)
Some species have only one or two members, so I decided to group them under “Other” species. This way, we will make our further graph analysis more relevant.
MATCH (s:Species) WHERE size((s)<-[:BELONG_TO]-()) <= 2 MERGE (unknown:Species{name:'Other'}) WITH s, unknown MATCH (s)<-[:BELONG_TO]-(character) MERGE (unknown)<-[:BELONG_TO]-(character) DETACH DELETE s
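As a rough illustration of the grouping rule, the following Python sketch reassigns every species with two or fewer members to "Other". The character-to-species mapping is a made-up sample and group_rare_species is a hypothetical name.

```python
from collections import Counter


def group_rare_species(species_of, min_members=3, other="Other"):
    """Map each character to its species, or to `other` if the species is rare.

    A species counts as rare when it has fewer than `min_members` members,
    which matches the `<= 2` condition in the Cypher query.
    """
    counts = Counter(species_of.values())
    return {
        character: (species if counts[species] >= min_members else other)
        for character, species in species_of.items()
    }


# Made-up sample, for illustration only.
sample = {
    "LUKE": "Human",
    "LEIA": "Human",
    "HAN": "Human",
    "CHEWBACCA": "Wookiee",
    "YODA": "Yoda's species",
}
grouped = group_rare_species(sample)
print(grouped)
```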
Multiple node label support
We can now look at the multiple node label projection and how it allows us to filter nodes when executing graph algorithms. As we will be using the native projection, we have to assign secondary labels to characters based on their species. We will use the apoc.create.addLabels procedure to assign secondary labels.
MATCH (c:Character)-[:BELONG_TO]->(species) CALL apoc.create.addLabels([id(c)], [species.name]) YIELD node RETURN distinct 'done'
If you are new to the concept of the GDS graph catalog, you might want to look at one of my previous blog posts. We will use the array option to describe the node labels we want to project. Nodes belonging to all five species will be projected. All relationships in the interaction network will also be projected and treated as undirected and weighted. As there could be more than a single relationship between a given pair of nodes, we are projecting a multigraph.
CALL gds.graph.create('starwars',
  ['Other','Human','Droid','Neimodian','Gungan'],
  {INTERACTS1:{type:'INTERACTS1', orientation:'UNDIRECTED', properties:['weight']},
   INTERACTS2:{type:'INTERACTS2', orientation:'UNDIRECTED', properties:['weight']},
   INTERACTS3:{type:'INTERACTS3', orientation:'UNDIRECTED', properties:['weight']},
   INTERACTS4:{type:'INTERACTS4', orientation:'UNDIRECTED', properties:['weight']},
   INTERACTS5:{type:'INTERACTS5', orientation:'UNDIRECTED', properties:['weight']},
   INTERACTS6:{type:'INTERACTS6', orientation:'UNDIRECTED', properties:['weight']}})
To start off, let’s compute the weighted PageRank on the whole projected graph.
CALL gds.pageRank.stream('starwars', {relationshipWeightProperty:'weight'}) YIELD nodeId, score RETURN gds.util.asNode(nodeId).name as name, score ORDER BY score DESC LIMIT 5
Results
name | score |
---|---|
“Han Solo” | 3.711507879383862 |
“Anakin Skywalker” | 3.5772032469511035 |
“Padmé Amidala” | 3.4175086730159823 |
“C-3PO” | 3.3624297386966657 |
“Luke Skywalker” | 3.316930921562016 |
The results are not surprising, except maybe that Han Solo seems to be the most important character in the galaxy. This is because Anakin Skywalker and Darth Vader are treated as two separate entities in our network; otherwise, Anakin would probably be on top. Nonetheless, judging by the PageRank score, all the top five characters are quite similar in their importance, and no one really stands out.
Let’s say we want to find the most important characters in the Gungan social network. We can consider only Gungans in the algorithm with the nodeLabels parameter.
CALL gds.pageRank.stream('starwars',{nodeLabels:['Gungan'], relationshipWeightProperty:'weight'}) YIELD nodeId, score RETURN gds.util.asNode(nodeId).id as name, score ORDER BY score DESC LIMIT 5
name | score |
---|---|
“JAR JAR” | 1.3676453701686113 |
“BOSS NASS” | 1.0597422499209648 |
“TARPALS” | 1.0597422499209648 |
“GENERAL CEEL” | 0.3810879142256453 |
As you might expect, the man, the legend Jar Jar, comes out on top.
We can also combine both node and relationship filters. We will compute PageRank based on the network of humans and other species interactions in the fifth and sixth episodes.
CALL gds.pageRank.stream('starwars',{nodeLabels:['Other','Human'], relationshipTypes:['INTERACTS5','INTERACTS6'], relationshipWeightProperty:'weight'}) YIELD nodeId, score RETURN gds.util.asNode(nodeId).name as name, score ORDER BY score DESC LIMIT 5
name | score |
---|---|
“Han Solo” | 3.928774843923747 |
“Leia Organa” | 2.789427683595568 |
“Luke Skywalker” | 2.5331994574517003 |
“Padmé Amidala” | 2.0019969200715417 |
“Biggs Darklighter” | 1.9624054052866993 |
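Conceptually, combining nodeLabels with relationshipTypes restricts the algorithm to a subgraph of the projection. The Python sketch below shows only that filtering step, not PageRank itself; the tiny graph and the filter_subgraph name are made up for illustration, and this is not how GDS implements it internally.

```python
def filter_subgraph(node_labels, relationships, labels, rel_types):
    """Keep nodes with an allowed label and relationships of an allowed type.

    A relationship survives only if both of its end nodes survive as well.
    """
    kept_nodes = {node for node, label in node_labels.items() if label in labels}
    kept_rels = [
        (source, target, rel_type, weight)
        for source, target, rel_type, weight in relationships
        if rel_type in rel_types and source in kept_nodes and target in kept_nodes
    ]
    return kept_nodes, kept_rels


# Made-up miniature graph, for illustration only.
node_labels = {"HAN": "Human", "LEIA": "Human", "C-3PO": "Droid", "YODA": "Other"}
relationships = [
    ("HAN", "LEIA", "INTERACTS5", 10),
    ("HAN", "C-3PO", "INTERACTS5", 4),   # dropped: C-3PO is a Droid
    ("LEIA", "YODA", "INTERACTS2", 1),   # dropped: wrong episode
    ("LEIA", "YODA", "INTERACTS6", 2),
]
nodes, rels = filter_subgraph(
    node_labels, relationships, {"Other", "Human"}, {"INTERACTS5", "INTERACTS6"}
)
print(sorted(nodes), rels)
```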
It looks like the kids Luke and Leia have grown up. The addition of the node filtering feature allows us to be very precise when describing the subset of the projected graph we want to consider as an input of a graph algorithm. After we are done with the analysis, we can drop the projected graph from the catalog.
CALL gds.graph.drop('starwars')
Graph mutability
With the graph mutability feature, graph algorithms have gained a third mode, mutate, on top of the previously supported stream and write. We can look at the official documentation to get a better sense of what it offers:
The mutate mode is very similar to the write mode but instead of writing results to the Neo4j database, they are made available at the in-memory graph. Note that the mutateProperty must not exist in the in-memory graph beforehand. This enables running multiple algorithms on the same in-memory graph without writing results to Neo4j in-between algorithm executions.
In short, mutate mode works almost like write mode, except that the results are not written to the stored Neo4j graph. Instead, they are stored in the same in-memory projected graph that was used as the input of the graph algorithm. Later, if we want to, we can still store the mutated properties of the in-memory graph to Neo4j with the gds.graph.writeNodeProperties procedure.
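The difference between mutating and writing can be pictured with a toy model in Python. This is only a mental model under simplifying assumptions: the ToyProjection class and its method names are hypothetical, and a plain dictionary stands in for the stored Neo4j graph.

```python
class ToyProjection:
    """Toy stand-in for an in-memory projected graph (not the GDS API)."""

    def __init__(self):
        # property name -> {node id: value}, held only in memory
        self.properties = {}

    def mutate(self, name, values):
        """Store algorithm results in memory, like the mutate mode."""
        # Mirrors the documented rule that the mutateProperty must not exist yet.
        if name in self.properties:
            raise ValueError(f"property '{name}' already exists in the projection")
        self.properties[name] = dict(values)

    def write_node_properties(self, database, names):
        """Copy selected in-memory properties to the 'stored' graph."""
        for name in names:
            for node, value in self.properties[name].items():
                database.setdefault(node, {})[name] = value


database = {}                        # stands in for the stored Neo4j graph
graph = ToyProjection()
graph.mutate("Human_1", {"ANAKIN": 6, "PADME": 16})
print(database)                      # still empty: mutate did not touch it
graph.write_node_properties(database, ["Human_1"])
print(database)
```

The point of the extra step is that several algorithms can read and add properties in memory, and only the results worth keeping are written back at the end.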
We will start by projecting a new in-memory graph. This time we will use cypher projection. In the node query, we will filter only characters that appeared in the first episode. We also provide the labels column, which will allow us to filter nodes when executing algorithms. If you look closely, the labels column is in the form of an array, which means that a single node can have multiple labels. Also, with cypher projection, we can provide a virtual label, as shown in the query.
CALL gds.graph.create.cypher('starwars_cypher',
  'MATCH (species)<-[:BELONG_TO]-(c:Character)
   // filter characters from first episode
   WHERE (c)-[:INTERACTS1]-()
   RETURN id(c) as id, [species.name] as labels',
  'MATCH (c1:Character)-[r]-(c2:Character)
   RETURN id(c1) as source, id(c2) as target, type(r) as type, r.weight as weight',
  {validateRelationships:false})
Another new feature of the 1.1 release is the validateRelationships parameter of the cypher projection. The default setting throws an error if we try to project relationships between nodes that are not described in the node query. As we project only characters from the first episode in the node query but try to project interactions between all characters in the relationship query, we would get an error. We could either add a filter in the relationship query or just set the validateRelationships parameter to false. This way, all relationships between nodes not described in the node query will be silently ignored during the graph projection.
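The two behaviours described above can be sketched in Python as follows. The project function and its validate_relationships argument are hypothetical names that only mimic the described semantics; they are not the GDS implementation.

```python
def project(node_ids, relationships, validate_relationships=True):
    """Keep relationships whose end nodes were both returned by the node query.

    With validation on, the first dangling relationship raises an error.
    With validation off, dangling relationships are silently skipped.
    """
    nodes = set(node_ids)
    kept = []
    for source, target in relationships:
        if source in nodes and target in nodes:
            kept.append((source, target))
        elif validate_relationships:
            raise ValueError(f"relationship ({source}, {target}) uses an unknown node")
    return kept


# Made-up ids: node 3 is not part of the node query result.
episode_one_nodes = [1, 2]
all_relationships = [(1, 2), (2, 3)]
print(project(episode_one_nodes, all_relationships, validate_relationships=False))
```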
Finally, we can try out the mutate mode of the Louvain algorithm, which is used for community detection. We will investigate the community structure of the first episode within each species rather than on the whole network. To achieve this, we have to run the algorithm separately for each species, considering only characters from the given species with the nodeLabels parameter and only interactions from the first episode with the relationshipTypes parameter. To define the property used to store the results in the in-memory graph, we use the mutateProperty parameter.
MATCH (s:Species) CALL gds.louvain.mutate('starwars_cypher', {nodeLabels:[s.name], mutateProperty: s.name + '_1', relationshipTypes:['INTERACTS1']}) YIELD communityCount RETURN s.name as species, communityCount
Results
species | communityCount |
---|---|
“Human” | 4 |
“Droid” | 2 |
“Neimodian” | 1 |
“Gungan” | 1 |
“Other” | 5 |
We can now use the mutated properties that hold the characters’ communities from the first episode as the seed property for calculating communities in the second episode. The Louvain algorithm is initialized by assigning each node a unique community id. With the seedProperty parameter, we can define the initial community id of each node ourselves. If you want to learn more about the seed property and its use-cases, I wrote a blog post about it.
MATCH (s:Species) CALL gds.louvain.mutate('starwars_cypher', {nodeLabels:[s.name], mutateProperty: s.name + '_2', relationshipTypes:['INTERACTS2'], seedProperty: s.name + '_1'}) YIELD communityCount RETURN s.name as species, communityCount
Results
species | communityCount |
---|---|
“Other” | 3 |
“Human” | 2 |
“Droid” | 2 |
“Neimodian” | 1 |
“Gungan” | 1 |
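To illustrate what seeding changes, here is a small Python sketch of the initialization step only (not the Louvain optimization itself). How unseeded nodes are handled is an assumption made for this sketch, and initial_communities is a hypothetical name.

```python
def initial_communities(node_ids, seeds):
    """Assign a starting community to every node.

    Nodes with a seed keep their seeded community id. Nodes without a seed
    get a fresh id that is not used by any seed (an assumption of this sketch).
    """
    next_id = max(seeds.values(), default=-1) + 1
    communities = {}
    for node in node_ids:
        if node in seeds:
            communities[node] = seeds[node]
        else:
            communities[node] = next_id
            next_id += 1
    return communities


# Made-up seeds taken from a previous run, for illustration only.
start = initial_communities(["ANAKIN", "PADME", "DOOKU"], {"ANAKIN": 6, "PADME": 16})
print(start)
```

Without seeds, every node starts in its own community; with seeds, characters that were grouped together in the first episode start the second run already grouped.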
We can observe that the number of communities in the second episode is slightly lower. To write the mutated properties back to the stored Neo4j graph, we can use the gds.graph.writeNodeProperties procedure.
CALL gds.graph.writeNodeProperties('starwars_cypher', ['Human_1','Droid_1','Other_1', 'Human_2','Droid_2','Other_2'])
We can now examine the community structure of humans within the first episode.
MATCH (c:Human) WHERE exists (c.Human_1) RETURN c.Human_1 as community, count(*) as size, collect(c.id) as members
Results
community | size | members |
---|---|---|
16 | 5 | [“KITSTER”, “QUI-GON”, “PADME”, “JIRA”, “SHMI”] |
4 | 3 | [“VALORUM”, “BAIL ORGANA”, “EMPEROR”] |
6 | 6 | [“MACE WINDU”, “RABE”, “BRAVO TWO”, “BRAVO THREE”, “RIC OLIE”, “ANAKIN”] |
11 | 3 | [“OBI-WAN”, “CAPTAIN PANAKA”, “SIO BIBBLE”] |
And inspect how those communities evolved in the second episode.
MATCH (c:Human) WHERE exists (c.Human_2) RETURN c.Human_2 as community, count(*) as size, collect(c.id) as members
Results
community | size | members |
---|---|---|
11 | 7 | [“QUI-GON”, “OBI-WAN”, “EMPEROR”, “CAPTAIN PANAKA”, “JIRA”, “MACE WINDU”, “BAIL ORGANA”] |
12 | 10 | [“SIO BIBBLE”, “PADME”, “RIC OLIE”, “ANAKIN”, “SHMI”, “KITSTER”, “VALORUM”, “RABE”, “BRAVO TWO”, “BRAVO THREE”] |
Unfortunately, I am not enough of a domain expert to offer useful insights about the community structure.
Conclusion
With multiple node label support, we are now able to project the whole graph in memory and be very specific in describing the subgraph we want to use as an input to the graph algorithm as we can now filter both nodes and relationships. Graph mutability allows us to chain various graph algorithms without the need to drop and recreate projected graphs with newly computed properties. In my opinion, this is a great addition to the Neo4j Graph Data Science library.
If you have any cool dataset in mind you would like me to analyze or have any feedback as to what you would like to see next, let me know. As always, the code is available on GitHub.