After reading most of "Graph Database" and a tiny bit of the "Neo4j in Action" book, I got exited about neo4j. One reason is the very friendly cypher query language, the other the promised performance. So, the question is: how good is the performance of neo4j in a web development setting?

A suitable test imho would be the task that the books above use: find friend of friends, or friend of friends of friends. So, finding friends at a depth of up to 5 is the task.

Results

A depth of 3 means "find the number of friends of friends of a person". Here are the performance results. I used sql for mysql and cypher (over REST) for neo4j. Numbers represent the time in secs for a query call.

Claims from the 'Graph Databases' book

These are the numbers from tabe 2-1. It cites the 'Neo4j in Action' book for this. Chapter 1 of 'Neo4j in Action' is about 'finding friends of friends'. The text above the table 2-1 in 'Graph Databases' gives a context of 'find a path between two given friends', but this is cleary not table 2-1 or the cited comparison is about. So, without further ado, table 2-1 for 1 million people:

depth    neo4j             RDBMS         

1        ....              .....
2        0.01              0.016
3        0.168            30.267
4        1.359          1543.505
5        2.132        unfinished

My results: distinct friends (100k people)

depth    neo4j             mysql       python

1        0.010             0.000        0.000
2        0.018             0.001        0.000
3        0.538             0.072        0.009
4       22.544             3.600        0.330
5     1269.942           180.143        0.758

My results: distinct friends (1m people)

depth    neo4j             mysql       python

1        0.010             0.000        0.000
2        0.018             0.002        0.000
3        0.689             0.082        0.012
4       30.057             5.598        1.079
5     1441.397*          300.000        9.791

"*": single run only

First assassement

The numbers above don't look that great for neo4j. When comparing the numbers to the ones published in the 'Graph Databases' book in table 2.1 (and also with the number in "Neo4j in Action") I get the following impression:

the numbers for mysql are completely wrong/skewed in the books.
the numbers for neo4j are quite wrong in the book, or differ greatly for cypher.
neo4j is not significantly faster then mysql (for the test case).

Also, I am surprised how well the pure python approach compares to the databases.

My hope is that I did something terribly wrong on the neo4j side. I really want to be neo4j faster then this. I trust the authors of the books to tell the truth to their audiences, and that its just a question of tuning, memory settings, cache heating or other configuration magic.

I asked about the neo4j issues on stackoverflow: http://stackoverflow.com/questions/17822333, but have no real answers yet.

The answers so far would lead to a comparison of a mysql database with sql vs. a neo4j database without cypher (that is: without a query language). This obviously would compare apple to oranges, and I am sure nobody would want that.

Setup

I have a web development background, so I expect a database to behave in a certain way:

The database acts like a server - I want multiple clients talking to the server. And the server must be reachable using python.
It has transactions over multiple queries, so that I can do a rollback after the last query/create/update action
I'd prefer a query language for all create/update/delete operations.

With mysql this means using sql, and for neo4j this means cypher. This because the neo4j server will support a transactional http endpoint that fullfills requirement 2 in version 2 of neo4j. Using said endpoint one needs to use cypher.

For the speed performance I won't use transactional mechanisms though - neo4j recommended to use 1.9.2, because 2.0.0M3 is not speed optimised yet. Hence I can't only use the non-quite transactional REST api on neo4j. Fine with me, but then we have to non-transactional tables on mysql as well, for fairness.

My test system is a 64 bit ubuntu with 16GB running on a i5-3570K CPU @ 3.40GHz. I use mysql 5.5 and neo4j 1.9.2 community edition.

A note on the requirements as mentioned in the books

My requirements match the story of the 'Graph Databases' book, the slides that neo4j has on the interwebs and also a presentation they gave in Hamburg at xing some days ago. Cypher is praised and used everywhere in examples. Comparing neo4j to mysql without the use of cypher is comparing apples and oranges.

I am aware that the 'Neo4j in Action' book actually mentions the use of the traversal api (but the 'Graph Databases' doesn't). This would be the mentioned apple and oranges comparison. One team needs to parse a generic language, the other one can use an optimised api? I don't think so! Actually, if one wanted to use a specific traversal api I would then compare that to the performance the python script that works directly on the data structure. Also, in my use case of (python) web development, a java api isn't available.

Having said that: I'd love to see a program that measured performance on the sample random datasets, using whatever optimzied api that is available!

Generating the dataset

All scripts required for the setup can be found on https://github.com/jhb/neo4j-experiements/. To save time, one can download sample datasets from https://github.com/jhb/neo4j-testdata. To use the scripts it is required to have a python with requests and simpleson installed. For mysql you also need mysqldb-python.

First, the sample data is created with `friendsdata.py`:

python friendsdata.py 100000

100000 is the number of people

This creates a python pickle file that contains a python structure for the friend relationships. It has the form of:

dict(person1id=[friend1id,friend2id,...],
     ...)

This data then gets imported to neo4j and mysql using the respective importer scripts. The pickle file is used 'directly' for python processing.

Generating the dataset seperatly allows me to work on exactly the same data. If I had seperate datasets, they might have 'miracle' random differences in distrubution, that might affect the outcome of the performance comparison. So lets go for the same data.

Importing to neo4j

The data from the dataset is imported to neo4j using the import_friends_neo4j_rest.py script. It is called this way:

python import_friends_neo4j_rest.py friends.pickle

friends.pickle is optional, and is the name of the dataset file

Importing a dataset for 1 million person entries will take a rather long time!

Importing to mysql

I use the following table structure (on mysql 5.5):

DROP TABLE IF EXISTS t_user;
CREATE TABLE t_user (
    id bigint(20) NOT NULL,
    `name` varchar(255) NOT NULL,
    PRIMARY KEY (id),
    KEY `name` (`name`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

DROP TABLE IF EXISTS t_user_friend;
CREATE TABLE t_user_friend (
  id bigint(20) NOT NULL,
  user_1 bigint(20) NOT NULL,
  user_2 bigint(20) NOT NULL,
  PRIMARY KEY (id),
  KEY user_1 (user_1),
  KEY user_2 (user_2)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

I would strongly recommend to create the indexes only after importing the data. The data can be imported with the import_friends_mysql.py script:

python import_friends_mysql.py friends.pickle

friends.pickle is the name of the dataset file

Importing large datasets might takes a while :-)

Querying

There are some scripts for querying using mysql, neo4j and pure python. They all do more or less the same:

create a random starting point (person1234)
create a query to find friends, using the random starting point and searching at a certain depth
measure the time taken to execute this query on the server
repeat this a number of times
return the average time taken

Tuning neo4j

I am not to sure how to best tune neo4j. My big hope in the light of the results above is that somebody will point me to the correct way of tuning neo4j.

Anyhow, what I did is in adding the following to neo4j.properties:

neostore.nodestore.db.mapped_memory=250M
neostore.relationshipstore.db.mapped_memory=2048M

(or to be more exact: I try to match the sizes of the storage dbs, as recommended in http://docs.neo4j.org/chunked/stable/configuration-io-examples.html)

and to neo4j-wrapper.conf:

wrapper.java.initmemory=1024
wrapper.java.maxmemory=8192

(this is from http://docs.neo4j.org/chunked/stable/server-performance.html)

The server is then started. I use read the data files to have it cached by the os:

cd data/graph.db/
ls neostore* | xargs -i dd if={} of=/dev/null bs=100M

I the use neo4j-shell to further 'warm up' the caches:

start n=node(*) return count(distinct n.noscenda_name);
start r=relationship(*) return count(distinct type(r));

Querying neo4j

I use the script query_friends_neo4j_distinct.py. It is called like this:

python query_friends_neo4j_distinct.py 100000 3 10

100000 is the max number of person entries
3 is the depth
10 the number of repetitions

This script creates a query of the form:

start person=node:node_auto_index(noscenda_name={target}) match (person)-[:friend]->()-[:friend]->(friend) return count(distinct friend);

(for depth 3 in this example)

Tuning mysql

To give mysql a fair chance the following change was made to /etc/mysql/my.cnf:

key_buffer = 4G

I also read in the files from the database (as root):

sudo -i
cd /var/lib/mysql/friends
ls * | xargs -i dd if={} of=/dev/null bs=100M

The server is then restarted (and the indexes added if needed). The caches are further warmed up with:

select count(distinct name) from t_user;
select count(distinct t_user_friend.id) from t_user_friend;

Querying mysql

Mysql is queried using `query_friends_mysql_distinct.py`:

python query_friends_mysql.py 100000 3 10

100000 is the max number of person entries
3 is the depth
10 the number of repetitions

This creates a query of the form:

select 
    count(distinct uf4.user_2)
from
    t_user_friend as uf1, 
    t_user_friend as uf2, 
    t_user_friend as uf3, 
    t_user_friend as uf4
where
    uf1.user_1='46973' and
    uf1.user_2 = uf2.user_1 and
    uf2.user_2 = uf3.user_1 and
    uf3.user_2 = uf4.user_1;

This is as close as I could get to chapter one of "Neo4j in Action". I did not want to return all results, because obviously that would then just measure the scrolling capabilities of my terminal. Hence the 'count'.

Processing with python

Just out of curiosity I wrote a naive query script query_friends_python.py, which can be called as:

python query_friends_mysql.py firends.pickle 100000 3 10

friends.pickle is the name of the dataset file
100000 is the max number of person entries
3 is the depth
10 the number of repetitions

There is also query_friends_python_all.py which goes through depths 1-5, and repeats queries at each depth a defined number of times. This saves the overhead of having to read in the potentially big pickle file between each go:

python query_friend_python_all.py friends.pickle 100000 10

friends.pickle is the name of the dataset file
100000 is the max number of person entries
10 the number of repetitions

Just in case you are wondering: the results of the tests are at the top of the page

Improving results

One first idea I had comes in the form of a cypher statement:

start n=node({startid})
match n--> m
with distinct m as f1
match f1-->m
with distinct m as f2
match f2 -->m 
with distinct m as f3
match f3 --> m
with distinct m as f4
match f4-->m 
with distinct m as f5
return count(f5)

I guess what the script does is to make sure that on step we avoid traversing to duplicates. Tested on 100k/1m people (leaving out steps as necessary, one run only):

depth   100k           1m

1      0.010        0.010
2      0.028        0.017
3      0.376        0.484
4      7.278        18.95
5     18.225      462.466

Still bad.

Python alternative

I started to work on a python alternative, which seems to provide better performance and works for me (tm). https://baach.de/Members/jhb/neo4j-performance-compared-to-graphagus