Presenting multiple node label support and graph mutability features of the Neo4j Graph Dat...

Category: IT Technology · Published: 4 years ago

Just recently, version 1.1.0 of the Neo4j Graph Data Science library was released. I decided to take a short break from watching The Joy of Painting by Bob Ross, which I highly recommend, and inspect some of the new features. For me, the most anticipated new feature is the ability to filter nodes by label when executing graph algorithms. Just as the relationshipTypes parameter lets us choose which relationships a graph algorithm should consider, the new nodeLabels parameter lets us choose which nodes it should use. We will also take a look at the new projected graph mutability feature.

I like to look for new networks so that my blog posts don’t get boring and repetitive. Here we will delve into the Star Wars world. The interactions between characters are available on GitHub, thanks to Evelina Gabasova. We will combine them with the dataset made available by Florent Georges, who scraped the Star Wars API and added descriptions from Wikipedia.

Graph schema and import

Our graph consists of characters that have INTERACTSX relationships with other characters. The value of X indicates the episode in which the interaction occurred, e.g. INTERACTS1 indicates an interaction in the first episode. Each character also belongs to a single species, which is represented as a BELONG_TO relationship to the species node.

We will start by importing the interaction social network. The dataset is a bit tricky, as characters don’t have a unique id that stays the same across episodes. Instead, in each episode, the id is the zero-based index of the character in the “nodes” element of the JSON file. To work around this, we use the apoc.cypher.runMany procedure to run two transactions for each episode. In the first, we store the nodes and assign each of them a value property based on the zero-based index. In the second transaction, we use that value to store links between characters. This procedure is repeated for each episode.

UNWIND range(1,7) as episode
CALL apoc.load.json('https://raw.githubusercontent.com/evelinag/StarWars-social-network/master/networks/starwars-episode-' + episode + '-interactions.json') YIELD value as data
CALL apoc.cypher.runMany("
    WITH $data.nodes as nodes 
    UNWIND range(0,size(nodes)-1) as value 
    MERGE (c:Character{id:nodes[value].name}) 
    SET c.value = value 
    RETURN distinct 'done'
    ;\n
    WITH $data as data, $episode as episode
    UNWIND data.links as link
    MERGE (source:Character{value:link.source})
    MERGE (target:Character{value:link.target})
    WITH source, target, link.value as weight, episode
    CALL apoc.create.relationship(source,'INTERACTS' + episode, 
         {weight:weight}, target) YIELD rel
    RETURN distinct 'done'",
    {data:data, episode:episode},{statistics:true,timeout:10}) 
YIELD result
RETURN result
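
To make the index-remapping step concrete, here is a minimal sketch in plain Python rather than Cypher. It assumes each episode file has a "nodes" list of objects with a "name", and a "links" list whose "source" and "target" are zero-based positions in that list, as described above. The helper name resolve_links and the sample data are made up for illustration and are not part of the original import.

```python
def resolve_links(episode_data):
    """Translate per-episode positional ids into character names."""
    # The position of a node in the "nodes" list is its id for this episode only.
    names = [node["name"] for node in episode_data["nodes"]]
    return [
        (names[link["source"]], names[link["target"]], link["value"])
        for link in episode_data["links"]
    ]


# Hypothetical sample in the shape described above (not real dataset rows).
sample = {
    "nodes": [{"name": "A"}, {"name": "B"}, {"name": "C"}],
    "links": [
        {"source": 0, "target": 1, "value": 3},
        {"source": 1, "target": 2, "value": 1},
    ],
}
resolved = resolve_links(sample)
print(resolved)
```

Because the positions are only meaningful within one file, the lookup has to be rebuilt for every episode, which is why the Cypher query overwrites the value property on each pass.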

We will continue by enriching our graph with information about the species. Unfortunately, the names of characters in the two datasets do not match exactly. There are several approaches we could use to find close matches between the characters’ names. The APOC library has a whole section dedicated to text-similarity procedures. I decided to use a full-text search index to match the characters’ names. We begin by defining the full-text search index on the id property of characters.

CALL db.index.fulltext.createNodeIndex(
    "names",["Character"],["id"])

Each full-text search query might return more than a single result, but we will look only at the top result. If the score of the top result is above an arbitrary threshold, 0.85 in our case, we will assume it is the same character. We can look at some examples:

CALL apoc.load.json('https://raw.githubusercontent.com/fgeorges/star-wars-dataset/master/data/enriched.json')
YIELD value
UNWIND value.people as person
CALL db.index.fulltext.queryNodes("names", 
    replace(person.name,'é','e')) 
YIELD node,score
RETURN person.name as species_dataset_name,
       collect(node)[0].id as interactions_dataset_name
LIMIT 5

Results

species_dataset_name interactions_dataset_name
“Luke Skywalker” “LUKE”
“C-3PO” “C-3PO”
“Darth Vader” “DARTH VADER”
“Leia Organa” “LEIA”
“Owen Lars” “OWEN”

More or less, it seems that the species dataset contains full names, while the interactions dataset contains only the first names of the characters. As the full-text search matching seems fine, let us proceed and store the additional information to the graph.

CALL apoc.load.json('https://raw.githubusercontent.com/fgeorges/star-wars-dataset/master/data/enriched.json')
YIELD value
UNWIND value.people as person
// search for characters
CALL db.index.fulltext.queryNodes("names", 
    replace(person.name,'é','e'))
YIELD node,score
// collect the top hit
WITH person, collect([node,score])[0] as top_hit
WITH person, top_hit[0] as node, top_hit[1] as score
// threshold
WHERE score > 0.85
// enrich characters
SET node += apoc.map.clean(person,
    ['films','vehicles','starships','species'],['n/a'])
WITH node, person.species as person_species
UNWIND person_species as species
MERGE (s:Species{id:species})
MERGE (node)-[:BELONG_TO]->(s)
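
The "take the top hit, then apply a threshold" logic can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the scores below are invented and do not come from the full-text index, and best_match is a hypothetical helper name. The Cypher query relies on the index returning hits ordered by score, whereas the sketch picks the maximum explicitly.

```python
def best_match(candidates, threshold=0.85):
    """Return the highest-scoring candidate name, or None if it fails the threshold."""
    if not candidates:
        return None
    name, score = max(candidates, key=lambda pair: pair[1])
    # Mirrors `WHERE score > 0.85`: the comparison is strict.
    return name if score > threshold else None


# Invented (name, score) pairs for illustration only.
strong_hits = [("LUKE", 1.9), ("LUKE'S FRIEND", 0.4)]
weak_hits = [("OWEN", 0.6)]

print(best_match(strong_hits))
print(best_match(weak_hits))
```

A person whose best hit does not clear the threshold is simply skipped, which is why some characters end up without a species after this step.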

We will also import additional meta-data about the species.

CALL apoc.load.json('https://raw.githubusercontent.com/fgeorges/star-wars-dataset/master/data/enriched.json')
YIELD value
UNWIND value.species as species
MATCH (s:Species{id:species.url})
SET s += apoc.map.clean(species, 
    ['films','homeworld','people','url'],['n/a','unknown'])

Unfortunately, we haven’t assigned the species to all of our characters yet. Some characters show up in the first dataset that do not show up in the second. As the focus of this blog post is not to show how to scrape the internet efficiently, I prepared a CSV file that we can use to assign missing species values.

LOAD CSV WITH HEADERS FROM 
"https://raw.githubusercontent.com/tomasonjo/blogs/master/Star_Wars/star_wars_species.csv" as row
MATCH (c:Character{id:row.name})
MERGE (s:Species{name:row.species})
MERGE (c)-[:BELONG_TO]->(s)

Some species have only one or two members, so I decided to group them under “Other” species. This way, we will make our further graph analysis more relevant.

MATCH (s:Species)
WHERE size((s)<-[:BELONG_TO]-()) <= 2
MERGE (unknown:Species{name:'Other'})
WITH s, unknown
MATCH (s)<-[:BELONG_TO]-(character)
MERGE (unknown)<-[:BELONG_TO]-(character)
DETACH DELETE s
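
For readers who prefer to see the grouping rule outside of Cypher, here is a minimal Python sketch of the same idea: any species with two or fewer members is folded into "Other". The function name, the species names and the sample data are hypothetical.

```python
from collections import Counter


def group_small_species(species_by_character, max_small_size=2):
    """Map every species with at most `max_small_size` members to 'Other'."""
    counts = Counter(species_by_character.values())
    return {
        character: ("Other" if counts[species] <= max_small_size else species)
        for character, species in species_by_character.items()
    }


# Hypothetical membership, not taken from the dataset.
membership = {
    "a": "Human", "b": "Human", "c": "Human",
    "d": "Wookiee",
    "e": "Ewok", "f": "Ewok",
}
grouped = group_small_species(membership)
print(grouped)
```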

Multiple node label support

We can now look at the multiple node label projection and how it allows us to filter nodes when executing graph algorithms. As we will be using the native projection, we have to assign secondary labels to characters based on their species. We will use the apoc.create.addLabels procedure to assign the secondary labels.

MATCH (c:Character)-[:BELONG_TO]->(species)
CALL apoc.create.addLabels([id(c)], [species.name]) YIELD node
RETURN distinct 'done'

If you are new to the concept of the GDS graph catalog, you might want to look at one of my previous blog posts. We will use the array option to describe the node labels we want to project. Nodes belonging to all five species will be projected. All relationships in the interaction network will also be projected and treated as undirected and weighted. As there could be more than a single relationship between a given pair of nodes, we are projecting a multigraph.

CALL gds.graph.create('starwars',
  ['Other','Human','Droid','Neimodian','Gungan'],
  {INTERACTS1:{type:'INTERACTS1', orientation:'UNDIRECTED', properties:['weight']},
   INTERACTS2:{type:'INTERACTS2', orientation:'UNDIRECTED', properties:['weight']},
   INTERACTS3:{type:'INTERACTS3', orientation:'UNDIRECTED', properties:['weight']},
   INTERACTS4:{type:'INTERACTS4', orientation:'UNDIRECTED', properties:['weight']},
   INTERACTS5:{type:'INTERACTS5', orientation:'UNDIRECTED', properties:['weight']},
   INTERACTS6:{type:'INTERACTS6', orientation:'UNDIRECTED', properties:['weight']}})

To start off, let’s compute the weighted PageRank on the whole projected graph.

CALL gds.pageRank.stream('starwars',
   {relationshipWeightProperty:'weight'})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name as name, score
ORDER BY score DESC LIMIT 5

Results

name score
“Han Solo” 3.711507879383862
“Anakin Skywalker” 3.5772032469511035
“Padmé Amidala” 3.4175086730159823
“C-3PO” 3.3624297386966657
“Luke Skywalker” 3.316930921562016

The results are not surprising, except maybe that Han Solo seems to be the most important character in the galaxy. This is because Anakin Skywalker and Darth Vader are treated as two separate entities in our network; otherwise, Anakin would probably be on top. Nonetheless, judging by the PageRank scores, the top five characters are quite similar in importance, and no one really stands out.

Let’s say we want to find the most important characters in the Gungan social network. We can consider only Gungans in the algorithm with the nodeLabels parameter.

CALL gds.pageRank.stream('starwars',{nodeLabels:['Gungan'], 
   relationshipWeightProperty:'weight'})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id as name, score
ORDER BY score DESC LIMIT 5

Results

name score
“JAR JAR” 1.3676453701686113
“BOSS NASS” 1.0597422499209648
“TARPALS” 1.0597422499209648
“GENERAL CEEL” 0.3810879142256453

As you might expect, the man, the legend Jar Jar, comes out on top.
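
Conceptually, I think of the node filter as restricting the algorithm to the subgraph induced by the selected labels. The plain-Python sketch below illustrates that reading. It is my assumption of the behaviour rather than the library’s actual implementation, and the function name and the toy relationships are made up.

```python
def filter_graph(labels_by_node, relationships, node_labels, relationship_types=None):
    """Keep nodes carrying any wanted label and relationships between kept nodes."""
    wanted = set(node_labels)
    kept_nodes = {
        node for node, labels in labels_by_node.items() if wanted & set(labels)
    }
    kept_rels = [
        (source, target, rel_type)
        for source, target, rel_type in relationships
        if source in kept_nodes
        and target in kept_nodes
        and (relationship_types is None or rel_type in relationship_types)
    ]
    return kept_nodes, kept_rels


# Toy data for illustration; the relationships are invented.
labels_by_node = {
    "JAR JAR": ["Gungan"],
    "BOSS NASS": ["Gungan"],
    "QUI-GON": ["Human"],
}
relationships = [
    ("JAR JAR", "BOSS NASS", "INTERACTS1"),
    ("JAR JAR", "QUI-GON", "INTERACTS1"),
    ("JAR JAR", "BOSS NASS", "INTERACTS2"),
]

gungan_nodes, gungan_rels = filter_graph(labels_by_node, relationships, ["Gungan"])
print(gungan_nodes, gungan_rels)
```

Passing relationship_types as well narrows the relationships further, which corresponds to using nodeLabels and relationshipTypes together.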

We can also combine both node and relationship filters. We will compute PageRank based on the network of humans and other species interactions in the fifth and sixth episodes.

CALL gds.pageRank.stream('starwars',{nodeLabels:['Other','Human'],
   relationshipTypes:['INTERACTS5','INTERACTS6'], 
   relationshipWeightProperty:'weight'})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name as name, score
ORDER BY score DESC LIMIT 5

Results

name score
“Han Solo” 3.928774843923747
“Leia Organa” 2.789427683595568
“Luke Skywalker” 2.5331994574517003
“Padmé Amidala” 2.0019969200715417
“Biggs Darklighter” 1.9624054052866993

It looks like the kids Luke and Leia have grown up. The addition of the node filtering feature allows us to be very precise when describing the subset of the projected graph we want to consider as an input of a graph algorithm. After we are done with the analysis, we can drop the projected graph from the catalog.

CALL gds.graph.drop('starwars')

Graph mutability

With the graph mutability feature, graph algorithms have gained a third mode, mutate, on top of the previously supported stream and write modes. We can look at the official documentation to get a better sense of what it offers:

The mutate mode is very similar to the write mode but instead of writing results to the Neo4j database, they are made available at the in-memory graph. Note that the mutateProperty must not exist in the in-memory graph beforehand. This enables running multiple algorithms on the same in-memory graph without writing results to Neo4j in-between algorithm executions.

Basically, mutate mode is almost the same as write mode, except that it doesn’t write results directly to the Neo4j stored graph. Instead, it stores the results in the same in-memory projected graph that was used as the input of the graph algorithm. Later, if we want to, we can still store the mutated properties of the in-memory graph to Neo4j with the gds.graph.writeNodeProperties procedure.
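
The difference between the two destinations can be pictured with a toy Python model. It is only a mental model under my own assumptions, not how the library is implemented: the class, its method names and the plain dict standing in for the stored graph are all hypothetical.

```python
class InMemoryGraph:
    """Toy stand-in for a projected graph that can hold computed properties."""

    def __init__(self, node_ids):
        self.node_ids = list(node_ids)
        self.properties = {}  # property name -> {node id: value}

    def mutate(self, property_name, values):
        # Mirrors the documented rule: the mutateProperty must not exist yet.
        if property_name in self.properties:
            raise ValueError(f"property {property_name!r} already exists")
        self.properties[property_name] = dict(values)

    def write_node_properties(self, database, property_names):
        # Only at this point do results reach the (simulated) stored graph.
        for name in property_names:
            for node_id, value in self.properties[name].items():
                database.setdefault(node_id, {})[name] = value


database = {}  # simulated stored graph
graph = InMemoryGraph([0, 1, 2])
graph.mutate("Human_1", {0: 4, 1: 4, 2: 6})
stored_before_write = dict(database)
graph.write_node_properties(database, ["Human_1"])
print(stored_before_write, database)
```

The point of the model is that a later algorithm can read Human_1 from the in-memory graph without anything having been written to the database in between.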

We will start by projecting a new in-memory graph. This time we will use Cypher projection. In the node query, we keep only characters that appeared in the first episode. We also provide the labels column, which will allow us to filter nodes when executing algorithms. If you look closely, the labels column is in the form of an array, which means that a single node can have multiple labels. Also, with Cypher projection, we can provide a virtual label, as shown in the query.

CALL gds.graph.create.cypher('starwars_cypher',
  'MATCH (species)<-[:BELONG_TO]-(c:Character)
   // filter characters from first episode
   WHERE (c)-[:INTERACTS1]-() 
   RETURN id(c) as id, [species.name] as labels',
  'MATCH (c1:Character)-[r]-(c2:Character) 
   RETURN id(c1) as source, id(c2) as target, 
          type(r) as type, r.weight as weight',
   {validateRelationships:false}
)

Another new feature of the 1.1 release is the validateRelationships parameter of the Cypher projection. The default setting throws an error if we try to project relationships between nodes that are not described in the node query. As we project only characters from the first episode in the node query but try to project interactions between all characters in the relationship query, we would get an error. We could either add a filter in the relationship query or just set the validateRelationships parameter to false. This way, all relationships between nodes not described in the node query will be silently ignored during the graph projection.
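
The two behaviours described above can be sketched in a few lines of Python. This is a simplified illustration of the semantics as I understand them from the text, not the projection code itself; the function name and the sample ids are hypothetical.

```python
def project_relationships(node_ids, relationships, validate_relationships=True):
    """Keep relationships whose endpoints were both returned by the node query."""
    projected = []
    for source, target in relationships:
        if source in node_ids and target in node_ids:
            projected.append((source, target))
        elif validate_relationships:
            # Default behaviour: an unknown endpoint is treated as an error.
            raise ValueError(f"relationship ({source}, {target}) has an unknown node")
        # With validation switched off, the relationship is silently skipped.
    return projected


episode_one_nodes = {1, 2, 3}            # hypothetical ids from the node query
all_interactions = [(1, 2), (2, 9), (3, 1)]

lenient = project_relationships(episode_one_nodes, all_interactions, False)
print(lenient)
```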

Finally, we can try out the mutate mode of the Louvain algorithm, which is used for community detection. We will investigate the community structure of the first episode within each species rather than on the whole network. To achieve this, we have to run the algorithm separately for each species, using the nodeLabels parameter to consider only characters of the given species and the relationshipTypes parameter to consider only interactions from the first episode. To define the property used to store results in the in-memory graph, we use the mutateProperty parameter.

MATCH (s:Species)
CALL gds.louvain.mutate('starwars_cypher', 
    {nodeLabels:[s.name], mutateProperty: s.name + '_1',
     relationshipTypes:['INTERACTS1']}) 
YIELD communityCount
RETURN s.name as species, communityCount

Results

species communityCount
“Human” 4
“Droid” 2
“Neimodian” 1
“Gungan” 1
“Other” 5

We can now use the mutated properties holding the characters’ communities from the first episode as the seed property for calculating communities in the second episode. The Louvain algorithm is initialized by assigning each node a unique id. With the seedProperty parameter, we can define the initial id of each node ourselves. If you want to learn more about the seed property and its use cases, I wrote a blog post about it.

MATCH (s:Species)
CALL gds.louvain.mutate('starwars_cypher',
  {nodeLabels:[s.name], mutateProperty: s.name + '_2',
  relationshipTypes:['INTERACTS2'], seedProperty: s.name + '_1'})
YIELD communityCount
RETURN s.name as species, communityCount

Results

species communityCount
“Other” 3
“Human” 2
“Droid” 2
“Neimodian” 1
“Gungan” 1

We can observe that the number of communities has decreased in the second episode. To write the mutated properties back to the Neo4j stored graph, we can use the gds.graph.writeNodeProperties procedure.

CALL gds.graph.writeNodeProperties('starwars_cypher',
  ['Human_1','Droid_1','Other_1',
   'Human_2','Droid_2','Other_2'])

We can now examine the community structure of humans within the first episode.

MATCH (c:Human)
WHERE exists (c.Human_1)
RETURN c.Human_1 as community,
       count(*) as size, 
       collect(c.id) as members

Results

community size members
16 5 [“KITSTER”, “QUI-GON”, “PADME”, “JIRA”, “SHMI”]
4 3 [“VALORUM”, “BAIL ORGANA”, “EMPEROR”]
6 6 [“MACE WINDU”, “RABE”, “BRAVO TWO”, “BRAVO THREE”, “RIC OLIE”, “ANAKIN”]
11 3 [“OBI-WAN”, “CAPTAIN PANAKA”, “SIO BIBBLE”]

We can also inspect how those communities evolved in the second episode.

MATCH (c:Human)
WHERE exists (c.Human_2)
RETURN c.Human_2 as community,
       count(*) as size,
       collect(c.id) as members

Results

community size members
11 7 [“QUI-GON”, “OBI-WAN”, “EMPEROR”, “CAPTAIN PANAKA”, “JIRA”, “MACE WINDU”, “BAIL ORGANA”]
12 10 [“SIO BIBBLE”, “PADME”, “RIC OLIE”, “ANAKIN”, “SHMI”, “KITSTER”, “VALORUM”, “RABE”, “BRAVO TWO”, “BRAVO THREE”]

Unfortunately, I am not enough of a domain expert to offer useful insights about the community structure.

Conclusion

With multiple node label support, we are now able to project the whole graph in memory and be very specific in describing the subgraph we want to use as an input to the graph algorithm as we can now filter both nodes and relationships. Graph mutability allows us to chain various graph algorithms without the need to drop and recreate projected graphs with newly computed properties. In my opinion, this is a great addition to the Neo4j Graph Data Science library.

If you have a cool dataset in mind that you would like me to analyze, or any feedback about what you would like to see next, let me know. As always, the code is available on GitHub.

