Graphgist version available

I've created a graphgist version of this blog post. It's the same text, but the examples work right in the browser: http://gist.neo4j.org/?6078256

The Graph Databases book and its examples

I downloaded the 'Graph Databases' book from http://graphdatabases.com/, and even got a printed version for free at a neo4j meetup on Tuesday. I like neo4j, and the book, and I am really grateful for both.

The book says, on page 27, it uses cypher in the 2.0 version. Great. I'm using neo4j-community-2.0.0-M03 anyhow, because I need to use the transactional http endpoint. That exists in 2.0 only, and only speaks cypher.

The problem: the examples (starting from page 44) don't work. You can use the create statement from page 44, but when you try to use the reading request from page 47:

START   theater=node:venue(name='Theatre Royal'),
        newcastle=node:city(name='Newcastle'),
        bard=node:author(lastname='Shakespeare')
MATCH   (newcastle)<-[:STREET|CITY*1..2]-(theater)
        <-[:VENUE]-()-[:PERFORMANCE_OF]->()-[:PRODUCTION_OF]->
        (play)<-[:WROTE_PLAY]-(bard)
RETURN  DISTINCT play.title AS play

you get the following result:

MissingIndexException: Index 'author' does not exist

Why?

Indexing using cypher

Let's look at the first line:

START   theater=node:venue(name='Theatre Royal'),

This line tries to look up a node in the venue index, one that has 'Theatre Royal' stored under the index key name. In other words, it's using a legacy index. This index needs setting up first. You can't do that from cypher, but that's not even the main problem. To use legacy indexes, you need to manually trigger the adding, updating and deleting of nodes and relationships in the index. You can't do that from cypher either, and that's a problem. So even though we can put the Shakespeare data into our graph, we can't get it into the indexes, and hence we can't search the indexes. Now we could use the command line interface, or the REST API, but we won't, because I need to use the transactional http endpoint (with separate rollback commands etc.) :-).

Rescue comes in the form of schema and labels. You can attach as many labels to a node as you like, and you can create auto-updating indexes on them, using cypher only. Those indexes not only update automatically, they are also used behind the scenes without being mentioned explicitly. Isn't this great? Thought so...
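As a minimal sketch of that idea (the :Scratch label and its name property are made up for illustration and are not part of the book's data), the following statements create an index, add a labeled node, and find it again without naming the index. They are meant to be run one at a time, since Neo4j 2.x does not allow schema changes and data changes in the same transaction. The node patterns are parenthesized, which later Neo4j releases require.

```cypher
// Hypothetical scratch label and property, used only for this illustration.
create index on :Scratch(name);

// The label is attached at creation time; the index picks the node up on its own.
create (n:Scratch { name: 'example' });

// No index is mentioned here; the planner is free to use the one on :Scratch(name).
match (n:Scratch)
where n.name = 'example'
return n;

// Remove the scratch node and the scratch index again.
match (n:Scratch) delete n;
drop index on :Scratch(name);
```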

I prepared some modified examples below (for chapter 3). They actually run, using cypher only. Before you use them, clean out your database of the example data above, if needed:

start n=node(*) match n-[r]->m delete r,n,m;

(This actually cleans out all of the example data, so be sure you know what you are doing.)
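Note that the pattern in that statement only matches nodes that have at least one relationship, so isolated nodes would survive it. A sketch of a variant that also removes isolated nodes, assuming a Neo4j release that supports optional match (2.0 final and later):

```cypher
// Match every node; r stays null for nodes that have no relationships.
match (n)
optional match (n)-[r]-()
// Deleting a null r is a no-op, so isolated nodes are removed as well.
delete r, n;
```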

Modified examples (chapter 3)

Besides updating the examples, I also added semicolons at the end of the statements, so that you don't stumble upon errors every time you copy and paste (like I do). I also changed the formatting a bit to my preferred style.

Creating the Shakespeare Graph

Page 44:

create
    (shakespeare:Author { firstname: 'William', lastname: 'Shakespeare' }),
    (juliusCaesar:Play { title: 'Julius Caesar' }),
    (shakespeare)-[:WROTE_PLAY { year: 1599 }]->(juliusCaesar),
    (theTempest:Play { title: 'The Tempest' }),
    (shakespeare)-[:WROTE_PLAY { year: 1610 }]->(theTempest),
    (rsc:Company { name: 'RSC' }),
    (production1:Production { name: 'Julius Caesar' }),
    (rsc)-[:PRODUCED]->(production1),
    (production1)-[:PRODUCTION_OF]->(juliusCaesar),
    (performance1:Performance { date: 20120729 }),
    (performance1)-[:PERFORMANCE_OF]->(production1),
    (production2:Production { name: 'The Tempest' }),
    (rsc)-[:PRODUCED]->(production2),
    (production2)-[:PRODUCTION_OF]->(theTempest),
    (performance2:Performance { date: 20061121 }),
    (performance2)-[:PERFORMANCE_OF]->(production2),
    (performance3:Performance { date: 20120730 }),
    (performance3)-[:PERFORMANCE_OF]->(production1),
    (billy:Person { name: 'Billy' }),
    (review:Review { rating: 5, review: 'This was awesome!' }),
    (billy)-[:WROTE_REVIEW]->(review),
    (review)-[:RATED]->(performance1),
    (theatreRoyal:Venue { name: 'Theatre Royal' }),
    (performance1)-[:VENUE]->(theatreRoyal),
    (performance2)-[:VENUE]->(theatreRoyal),
    (performance3)-[:VENUE]->(theatreRoyal),
    (greyStreet:Street { name: 'Grey Street' }),
    (theatreRoyal)-[:STREET]->(greyStreet),
    (newcastle:City { name: 'Newcastle' }),
    (greyStreet)-[:CITY]->(newcastle),
    (tyneAndWear:County { name: 'Tyne and Wear' }),
    (newcastle)-[:COUNTY]->(tyneAndWear),
    (england:Country { name: 'England' }),
    (tyneAndWear)-[:COUNTRY]->(england),
    (stratford:City { name: 'Stratford upon Avon' }),
    (stratford)-[:COUNTRY]->(england),
    (rsc)-[:BASED_IN]->(stratford),
    (shakespeare)-[:BORN_IN]->(stratford);

I have now assigned labels to all nodes. That wouldn't have been necessary, but it felt a bit clearer to me. The labels are :Author, :Play and so forth.

Let's also create some indexes on some of the labels:

create index on :Author(firstname);
create index on :Author(lastname);
create index on :City(name);
create index on :Venue(name);

Beginning a Query

As the text talks about the START clause, which isn't used in the same way with label indexes, this part is a bit hard to translate. But let's try.

Page 46:

match 
    theater:Venue,
    newcastle:City,
    bard:Author
where 
    theater.name='Theatre Royal' and
    newcastle.name='Newcastle' and
    bard.lastname='Shakespeare'

(Just like in the book, it doesn't do anything)
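Because the label-based version no longer names an index, it is not obvious where the index comes in. A sketch, assuming a Neo4j 2.0 release that supports index hints and the index on :Author(lastname) created earlier: a using index clause makes the choice explicit, which comes closest to what START did with the legacy index.

```cypher
// Ask the planner explicitly for the schema index on :Author(lastname).
match (bard:Author)
using index bard:Author(lastname)
where bard.lastname = 'Shakespeare'
return bard;
```

Without the hint the query has the same meaning; the hint only influences how the starting node is found.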

Declaring Information Patterns to Find

Page 46:

match
    (newcastle)<-[:STREET|CITY*1..2]-(theater)
    <-[:VENUE]-()-[:PERFORMANCE_OF]->()-[:PRODUCTION_OF]->
    (play)<-[:WROTE_PLAY]-(bard)

This is exactly the same.

Page 47:

match 
    theater:Venue,
    newcastle:City,
    bard:Author,
    (newcastle)<-[:STREET|CITY*1..2]-(theater)
    <-[:VENUE]-()-[:PERFORMANCE_OF]->()-[:PRODUCTION_OF]->
    (play)<-[:WROTE_PLAY]-(bard)
where
    theater.name='Theatre Royal' and
    newcastle.name='Newcastle' and
    bard.lastname='Shakespeare'                  
return 
    distinct play.title as play;

Constraining Matches

Page 48:

match 
    theater:Venue,
    newcastle:City,
    bard:Author,
    (newcastle)<-[:STREET|CITY*1..2]-(theater)
    <-[:VENUE]-()-[:PERFORMANCE_OF]->()-[:PRODUCTION_OF]->
    (play)<-[w:WROTE_PLAY]-(bard)
where 
    theater.name='Theatre Royal' and
    newcastle.name='Newcastle' and
    bard.lastname='Shakespeare' and
    w.year > 1608
return 
    distinct play.title as play;

Processing Results

Page 49:

match 
    theater:Venue,
    newcastle:City,
    bard:Author,
    (newcastle)<-[:STREET|CITY*1..2]-(theater)
    <-[:VENUE]-()-[p:PERFORMANCE_OF]->()-[:PRODUCTION_OF]->
    (play)<-[:WROTE_PLAY]-(bard)
where 
    theater.name='Theatre Royal' and
    newcastle.name='Newcastle' and
    bard.lastname='Shakespeare'
return 
    play.title as play, count(p) as performance_count
order by 
    performance_count desc;

Query Chaining

Page 50:

match 
    bard:Author,
    (bard)-[w:WROTE_PLAY]->(play)
where 
    bard.lastname='Shakespeare'
with  
    play
order by 
    w.year desc
return 
    collect(play.title) as plays;

A Sensible First Iteration?

Create another index:

create index on :User(username);

Page 51:

create  
    (alice:User {username: 'Alice'}),
    (bob:User {username: 'Bob'}),
    (charlie:User {username: 'Charlie'}),
    (davina:User {username: 'Davina'}),
    (edward:User {username: 'Edward'}),
    (alice)-[:ALIAS_OF]->(bob);

Page 51, 2nd:

match 
    bob:User,
    charlie:User,
    davina:User,
    edward:User
where 
    bob.username='Bob' and
    charlie.username='Charlie' and
    davina.username='Davina' and
    edward.username='Edward'
create 
    (bob)-[:EMAILED]->(charlie),
    (bob)-[:CC]->(davina),
    (bob)-[:BCC]->(edward);

Page 52:

match 
    bob:User,
    charlie:User,
    (bob)-[e:EMAILED]->(charlie)
where
    bob.username='Bob' and 
    charlie.username='Charlie'
return 
    e;

Second Time's the Charm

Page 53:

create 
    (email_1:Email {id: '1', content: 'Hi Charlie, ... Kind regards, Bob'}),
    (bob)-[:SENT]->(email_1),
    (email_1)-[:TO]->(charlie),
    (email_1)-[:CC]->(davina),
    (email_1)-[:CC]->(alice),
    (email_1)-[:BCC]->(edward);

Don't use this example yet, it's incomplete. Instead, create some indexes:

create index on :Email(id);
create index on :Email(content);

Page 54:

match 
    alice:User,
    bob:User,
    charlie:User,
    davina:User,
    edward:User
where 
    alice.username='Alice' and
    bob.username='Bob' and
    charlie.username='Charlie' and
    davina.username='Davina' and
    edward.username='Edward'
create 
    (email_1:Email {id: '1', content: 'email contents'}),
    (bob)-[:SENT]->(email_1),
    (email_1)-[:TO]->(charlie),
    (email_1)-[:CC]->(davina),
    (email_1)-[:CC]->(alice),
    (email_1)-[:BCC]->(edward),
    (email_2:Email {id: '2', content: 'email contents'}),
    (bob)-[:SENT]->(email_2),
    (email_2)-[:TO]->(davina),
    (email_2)-[:BCC]->(edward),
    (email_3:Email {id: '3', content: 'email contents'}),
    (davina)-[:SENT]->(email_3),
    (email_3)-[:TO]->(bob),
    (email_3)-[:CC]->(edward),
    (email_4:Email {id: '4', content: 'email contents'}),
    (charlie)-[:SENT]->(email_4),
    (email_4)-[:TO]->(bob),
    (email_4)-[:TO]->(davina),
    (email_4)-[:TO]->(edward),
    (email_5:Email {id: '5', content: 'email contents'}),
    (davina)-[:SENT]->(email_5),
    (email_5)-[:TO]->(alice),
    (email_5)-[:BCC]->(bob),
    (email_5)-[:BCC]->(edward);

I added the missing start (now match/where) at the top, and merged the create statements into one to shorten the code a bit.

Page 55:

match 
    bob:User,
    (bob)-[:SENT]->(email)-[:CC]->(alias),
    (alias)-[:ALIAS_OF]->(bob)
where 
    bob.username='Bob'
return 
    email;

Evolving the Domain

Another theoretical example, don't use it, on Page 57:

match email:Email
where email.id='1234'
create (alice)-[:REPLIED_TO]->(email);
create (davina)-[:FORWARDED]->(email)-[:TO]->(charlie);

Page 57, bottom:

match   
    alice:User,
    bob:User,
    charlie:User,
    davina:User,
    edward:User
where
    alice.username='Alice' and
    bob.username='Bob' and
    charlie.username='Charlie' and
    davina.username='Davina' and
    edward.username='Edward'
create
    (email_6:Email {id: '6', content: 'email'}),
    (bob)-[:SENT]->(email_6),
    (email_6)-[:TO]->(charlie),
    (email_6)-[:TO]->(davina),
    (reply_1:Email {id: '7', content: 'response'}),
    (reply_1)-[:REPLY_TO]->(email_6),
    (davina)-[:SENT]->(reply_1),
    (reply_1)-[:TO]->(bob),
    (reply_1)-[:TO]->(charlie),
    (reply_2:Email {id: '8', content: 'response'}),
    (reply_2)-[:REPLY_TO]->(email_6),
    (bob)-[:SENT]->(reply_2),
    (reply_2)-[:TO]->(davina),
    (reply_2)-[:TO]->(charlie),
    (reply_2)-[:CC]->(alice),
    (reply_3:Email {id: '9', content: 'response'}),
    (reply_3)-[:REPLY_TO]->(reply_1),
    (charlie)-[:SENT]->(reply_3),
    (reply_3)-[:TO]->(bob),
    (reply_3)-[:TO]->(davina),
    (reply_4:Email {id: '10', content: 'response'}),
    (reply_4)-[:REPLY_TO]->(reply_3),
    (bob)-[:SENT]->(reply_4),
    (reply_4)-[:TO]->(charlie),
    (reply_4)-[:TO]->(davina);

Page 58, bottom:

match 
    email:Email,
    p=(email)<-[:REPLY_TO*1..4]-()<-[:SENT]-(replier)
where
    email.id='6'
return 
    replier.username AS replier, length(p) - 1 AS depth
order by
    depth;

Page 60:

match   
    alice:User,
    bob:User,
    charlie:User,
    davina:User
where
    alice.username='Alice' and
    bob.username='Bob' and
    charlie.username='Charlie' and
    davina.username='Davina'
create
    (email_11:Email {id: '11', content: 'email'}),
    (alice)-[:SENT]->(email_11)-[:TO]->(bob),
    (email_12:Email {id: '12', content: 'email'}),
    (email_12)-[:FORWARD_OF]->(email_11),
    (bob)-[:SENT]->(email_12)-[:TO]->(charlie),
    (email_13:Email {id: '13', content: 'email'}),
    (email_13)-[:FORWARD_OF]->(email_12),
    (charlie)-[:SENT]->(email_13)-[:TO]->(davina);

Page 61:

match
    email:Email,
    (email)<-[f:FORWARD_OF*]-()
where
    email.id='11'
return
    count(f);

Other approaches

node_auto_index

One other possibility would be to use the node_auto_index instead (by uncommenting the related statements in the neo4j.properties file, and setting the appropriate properties to be indexed).

This would then turn the query:

START   theater=node:venue(name='Theatre Royal') return theater;

into:

START   theater=node:node_auto_index(name='Theatre Royal') return theater;

This would be doable, I guess. One could index not only name, but a property called label as well, to avoid namespace issues. But I guess this would contradict the efforts of labels in the 2.0 version, and lead to one gigantic index for all of the properties of all of the nodes.
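To illustrate the idea of indexing a label property too, here is a sketch only. It assumes node_auto_indexing is enabled in neo4j.properties, that both name and a hypothetical label property are listed in node_keys_indexable, and that every node carries such a label property. The string passed to the index is a Lucene query.

```cypher
// `label` here is an ordinary node property made up for this sketch,
// not a 2.0 schema label.
start theater=node:node_auto_index('label:Venue AND name:"Theatre Royal"')
return theater;
```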

So even though it works for the book, I don't see it as a good way forward.