
neo4j performance compared to mysql

by Jörg Baach last modified Jul 23, 2015 03:17 PM
I try to validate the performance claims and comparisons in the 'Graph Databases' book. Using the friend-of-a-friend example from the book I get very different results: neo4j performs worse than mysql in my observation.

After reading most of "Graph Databases" and a tiny bit of the "Neo4j in Action" book, I got excited about neo4j. One reason is the very friendly cypher query language, the other the promised performance. So, the question is: how good is the performance of neo4j in a web development setting?

A suitable test imho would be the task that the books above use: find friend of friends, or friend of friends of friends. So, finding friends at a depth of up to 5 is the task.

Results

A depth of 3 means "find the number of friends of friends of a person". Here are the performance results. I used sql for mysql and cypher (over REST) for neo4j. Numbers represent the time in seconds per query call.

Claims from the 'Graph Databases' book

These are the numbers from table 2-1, which cites the 'Neo4j in Action' book. Chapter 1 of 'Neo4j in Action' is about 'finding friends of friends'. The text above table 2-1 in 'Graph Databases' gives a context of 'find a path between two given friends', but this is clearly not what table 2-1 or the cited comparison is about. So, without further ado, table 2-1 for 1 million people:

depth    neo4j             RDBMS

1        ....              .....
2        0.01              0.016
3        0.168            30.267
4        1.359          1543.505
5        2.132        unfinished

My results: distinct friends (100k people)

depth    neo4j             mysql       python

1        0.010             0.000        0.000
2        0.018             0.001        0.000
3        0.538             0.072        0.009
4       22.544             3.600        0.330
5     1269.942           180.143        0.758

My results: distinct friends (1m people)

depth    neo4j             mysql       python

1        0.010             0.000        0.000
2        0.018             0.002        0.000
3        0.689             0.082        0.012
4       30.057             5.598        1.079
5     1441.397*          300.000        9.791

"*": single run only

First assessment

The numbers above don't look that great for neo4j. When comparing the numbers to the ones published in the 'Graph Databases' book in table 2-1 (and also with the numbers in "Neo4j in Action") I get the following impression:

  • the numbers for mysql are completely wrong/skewed in the books.
  • the numbers for neo4j are quite wrong in the book, or differ greatly for cypher.
  • neo4j is not significantly faster than mysql (for the test case).

Also, I am surprised how well the pure python approach compares to the databases.

My hope is that I did something terribly wrong on the neo4j side. I really want neo4j to be faster than this. I trust the authors of the books to tell the truth to their audiences, and that it's just a question of tuning, memory settings, cache heating or other configuration magic.

I asked about the neo4j issues on stackoverflow: http://stackoverflow.com/questions/17822333, but have no real answers yet.

The answers so far would lead to a comparison of a mysql database with sql vs. a neo4j database without cypher (that is: without a query language). This obviously would compare apples to oranges, and I am sure nobody would want that.

Setup

I have a web development background, so I expect a database to behave in a certain way:

  1. The database acts like a server - I want multiple clients talking to the server. And the server must be reachable using python.
  2. It has transactions over multiple queries, so that I can do a rollback after the last query/create/update action
  3. I'd prefer a query language for all create/update/delete operations.

With mysql this means using sql, and for neo4j this means cypher. This is because the neo4j server will support a transactional http endpoint that fulfills requirement 2 in version 2 of neo4j. To use said endpoint one needs to use cypher.

For the performance measurements I won't use transactional mechanisms though - neo4j recommended using 1.9.2, because 2.0.0M3 is not speed optimised yet. Hence I can only use the not-quite-transactional REST api on neo4j. Fine with me, but then we have to use non-transactional tables on mysql as well, for fairness.

My test system is a 64 bit ubuntu with 16GB RAM running on an i5-3570K CPU @ 3.40GHz. I use mysql 5.5 and neo4j 1.9.2 community edition.

A note on the requirements as mentioned in the books

My requirements match the story of the 'Graph Databases' book, the slides that neo4j has on the interwebs and also a presentation they gave in Hamburg at xing some days ago. Cypher is praised and used everywhere in examples. Comparing neo4j to mysql without the use of cypher is comparing apples and oranges.

I am aware that the 'Neo4j in Action' book actually mentions the use of the traversal api (but 'Graph Databases' doesn't). This would be the mentioned apples and oranges comparison. One team needs to parse a generic language, the other one can use an optimised api? I don't think so! Actually, if one wanted to use a specific traversal api I would then compare that to the performance of the python script that works directly on the data structure. Also, in my use case of (python) web development, a java api isn't available.

Having said that: I'd love to see a program that measures performance on the sample random datasets, using whatever optimized api is available!

Generating the dataset

All scripts required for the setup can be found on https://github.com/jhb/neo4j-experiements/. To save time, one can download sample datasets from https://github.com/jhb/neo4j-testdata. To use the scripts you need python with requests and simplejson installed. For mysql you also need mysqldb-python.

First, the sample data is created with friendsdata.py:

python friendsdata.py 100000
  • 100000 is the number of people

This creates a python pickle file that contains a python structure for the friend relationships. It has the form of:

dict(person1id=[friend1id,friend2id,...],
     ...)
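
For illustration, here is a minimal sketch of how such a structure could be generated. The friend-count distribution is an assumption of mine; the actual friendsdata.py in the repository is authoritative:

import pickle
import random
import sys

num_people = int(sys.argv[1])  # e.g. 100000
avg_friends = 50               # assumed average friend count, not the script's real value

friends = {}
for person in range(num_people):
    count = random.randint(1, 2 * avg_friends)
    # a set avoids duplicate friend ids for the same person
    friends[person] = list({random.randrange(num_people) for _ in range(count)})

with open('friends.pickle', 'wb') as f:
    pickle.dump(friends, f)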

This data then gets imported to neo4j and mysql using the respective importer scripts. The pickle file is used 'directly' for python processing.

Generating the dataset separately allows me to work on exactly the same data. If I had separate datasets, they might have 'miracle' random differences in distribution that might affect the outcome of the performance comparison. So let's go for the same data.

Importing to neo4j

The data from the dataset is imported to neo4j using the import_friends_neo4j_rest.py script. It is called this way:

python import_friends_neo4j_rest.py friends.pickle
  • friends.pickle is optional, and is the name of the dataset file

Importing a dataset for 1 million person entries will take a rather long time!
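
To give an idea why: the import talks to neo4j 1.9's REST api with one HTTP request per node and per relationship. A minimal sketch (assuming node auto-indexing on 'noscenda_name' is enabled in neo4j.properties; the real import_friends_neo4j_rest.py is in the repository):

import pickle
import sys
import requests

BASE = 'http://localhost:7474/db/data'

with open(sys.argv[1], 'rb') as f:
    friends = pickle.load(f)

# one node per person; the REST api returns the new node's url in 'self'
node_urls = {}
for person in friends:
    r = requests.post(BASE + '/node',
                      json={'noscenda_name': 'person%s' % person})
    node_urls[person] = r.json()['self']

# one :friend relationship per entry in the pickle
# (assumes every friend id also appears as a key in the pickle)
for person, their_friends in friends.items():
    for friend in their_friends:
        requests.post(node_urls[person] + '/relationships',
                      json={'to': node_urls[friend], 'type': 'friend'})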

Importing to mysql

I use the following table structure (on mysql 5.5):

DROP TABLE IF EXISTS t_user;
CREATE TABLE t_user (
    id bigint(20) NOT NULL,
    `name` varchar(255) NOT NULL,
    PRIMARY KEY (id),
    KEY `name` (`name`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

DROP TABLE IF EXISTS t_user_friend;
CREATE TABLE t_user_friend (
  id bigint(20) NOT NULL,
  user_1 bigint(20) NOT NULL,
  user_2 bigint(20) NOT NULL,
  PRIMARY KEY (id),
  KEY user_1 (user_1),
  KEY user_2 (user_2)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

I would strongly recommend creating the indexes only after importing the data. The data can be imported with the import_friends_mysql.py script:

python import_friends_mysql.py friends.pickle
  • friends.pickle is the name of the dataset file

Importing large datasets might take a while :-)
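
In sketch form, the import boils down to two executemany calls (connection parameters here are made up; the real import_friends_mysql.py is in the repository):

import pickle
import sys
import MySQLdb

with open(sys.argv[1], 'rb') as f:
    friends = pickle.load(f)

db = MySQLdb.connect(host='localhost', user='friends',
                     passwd='friends', db='friends')
cur = db.cursor()

cur.executemany('insert into t_user (id, name) values (%s, %s)',
                [(p, 'person%s' % p) for p in friends])

pairs = ((p, f) for p, fs in friends.items() for f in fs)
rows = [(i, p, f) for i, (p, f) in enumerate(pairs)]
cur.executemany('insert into t_user_friend (id, user_1, user_2) '
                'values (%s, %s, %s)', rows)
db.commit()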

Querying

There are some scripts for querying using mysql, neo4j and pure python. They all do more or less the same (a sketch of the timing loop follows the list):

  • create a random starting point (person1234)
  • create a query to find friends, using the random starting point and searching at a certain depth
  • measure the time taken to execute this query on the server
  • repeat this a number of times
  • return the average time taken
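
The timing loop itself, in sketch form (run_query stands for whichever backend-specific function fires the actual query):

import random
import time

def measure(run_query, max_person, depth, repetitions):
    total = 0.0
    for _ in range(repetitions):
        # random starting point, e.g. 'person1234'
        person = 'person%s' % random.randrange(max_person)
        start = time.time()
        run_query(person, depth)
        total += time.time() - start
    return total / repetitions  # average seconds per query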

Tuning neo4j

I am not too sure how to best tune neo4j. My big hope, in light of the results above, is that somebody will point me to the correct way of tuning neo4j.

Anyhow, what I did was add the following to neo4j.properties:

neostore.nodestore.db.mapped_memory=250M
neostore.relationshipstore.db.mapped_memory=2048M

(or to be more exact: I try to match the sizes of the storage dbs, as recommended in http://docs.neo4j.org/chunked/stable/configuration-io-examples.html)

and to neo4j-wrapper.conf:

wrapper.java.initmemory=1024
wrapper.java.maxmemory=8192

(this is from http://docs.neo4j.org/chunked/stable/server-performance.html)

The server is then started. I read the data files to have them cached by the os:

cd data/graph.db/
ls neostore* | xargs -i dd if={} of=/dev/null bs=100M

I then use neo4j-shell to further 'warm up' the caches:

start n=node(*) return count(distinct n.noscenda_name);
start r=relationship(*) return count(distinct type(r));

Querying neo4j

I use the script query_friends_neo4j_distinct.py. It is called like this:

python query_friends_neo4j_distinct.py 100000 3 10
  • 100000 is the max number of person entries
  • 3 is the depth
  • 10 the number of repetitions

This script creates a query of the form:

start person=node:node_auto_index(noscenda_name={target}) match (person)-[:friend]->()-[:friend]->(friend) return count(distinct friend);

(for depth 3 in this example)
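
In sketch form, such a statement can be built for arbitrary depths and sent to the (pre-2.0) cypher REST endpoint like this; the real query_friends_neo4j_distinct.py is in the repository:

import requests

def cypher_query(depth):
    # depth counting as above: depth d means d - 1 [:friend] hops,
    # so depth 3 yields the two-hop statement shown above (depth >= 2)
    hops = '-[:friend]->()' * (depth - 2) + '-[:friend]->(friend)'
    return ('start person=node:node_auto_index(noscenda_name={target}) '
            'match (person)%s return count(distinct friend);' % hops)

def run_cypher(person, depth):
    r = requests.post('http://localhost:7474/db/data/cypher',
                      json={'query': cypher_query(depth),
                            'params': {'target': person}})
    return r.json()['data'][0][0]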

Tuning mysql

To give mysql a fair chance the following change was made to /etc/mysql/my.cnf:

key_buffer = 4G

I also read in the files from the database (as root):

sudo -i
cd /var/lib/mysql/friends
ls * | xargs -i dd if={} of=/dev/null bs=100M

The server is then restarted (and the indexes added if needed). The caches are further warmed up with:

select count(distinct name) from t_user;
select count(distinct t_user_friend.id) from t_user_friend;

Querying mysql

Mysql is queried using query_friends_mysql_distinct.py:

python query_friends_mysql_distinct.py 100000 3 10
  • 100000 is the max number of person entries
  • 3 is the depth
  • 10 the number of repetitions

This creates a query of the form:

select
    count(distinct uf4.user_2)
from
    t_user_friend as uf1,
    t_user_friend as uf2,
    t_user_friend as uf3,
    t_user_friend as uf4
where
    uf1.user_1='46973' and
    uf1.user_2 = uf2.user_1 and
    uf2.user_2 = uf3.user_1 and
    uf3.user_2 = uf4.user_1;

This is as close as I could get to chapter one of "Neo4j in Action". I did not want to return all results, because obviously that would then just measure the scrolling capabilities of my terminal. Hence the 'count'.
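
For reference, a sketch of how such a statement can be generated for a given depth. Assuming the same counting as in the cypher section (depth d means d - 1 hops, i.e. d - 1 self-joins), the four-join statement above would correspond to depth 5; the scripts in the repository are authoritative:

def mysql_query(person_id, depth):
    joins = depth - 1  # assumption: depth d means d - 1 self-joins
    tables = ', '.join('t_user_friend as uf%d' % i
                       for i in range(1, joins + 1))
    conds = ["uf1.user_1 = '%s'" % person_id]
    conds += ['uf%d.user_2 = uf%d.user_1' % (i, i + 1)
              for i in range(1, joins)]
    return ('select count(distinct uf%d.user_2) from %s where %s;'
            % (joins, tables, ' and '.join(conds)))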

Processing with python

Just out of curiosity I wrote a naive query script query_friends_python.py, which can be called as:

python query_friends_python.py friends.pickle 100000 3 10
  • friends.pickle is the name of the dataset file
  • 100000 is the max number of person entries
  • 3 is the depth
  • 10 the number of repetitions

There is also query_friends_python_all.py which goes through depths 1-5, and repeats queries at each depth a defined number of times. This saves the overhead of having to read in the potentially big pickle file between each go:

python query_friends_python_all.py friends.pickle 100000 10
  • friends.pickle is the name of the dataset file
  • 100000 is the max number of person entries
  • 10 the number of repetitions
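
The traversal itself can be sketched as a level-by-level expansion over the pickled dict, counting the distinct endpoints of paths of the given length - mirroring the count(distinct ...) of the sql and cypher statements (the real scripts are in the repository):

def distinct_friends(friends, start, depth):
    # depth counting as above: depth d means d - 1 hops
    frontier = {start}
    for _ in range(depth - 1):
        frontier = {f for person in frontier
                    for f in friends.get(person, ())}
    return len(frontier)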

Just in case you are wondering: the results of the tests are at the top of the page.

Improving results

One first idea I had comes in the form of a cypher statement:

start n=node({startid})
match n-->m
with distinct m as f1
match f1-->m
with distinct m as f2
match f2-->m
with distinct m as f3
match f3-->m
with distinct m as f4
match f4-->m
with distinct m as f5
return count(f5)

I guess what the statement does is make sure that on each step we avoid traversing to duplicates. Tested on 100k/1m people (leaving out steps as necessary, one run only):

depth   100k           1m

1      0.010        0.010
2      0.028        0.017
3      0.376        0.484
4      7.278        18.95
5     18.225      462.466

Still bad.

Python alternative

I started to work on a python alternative, which seems to provide better performance and works for me (tm). https://baach.de/Members/jhb/neo4j-performance-compared-to-graphagus

Charles
Charles says:
Dec 12, 2013 01:10 AM

Puzzled by your summation 'Still bad'
100k records, 1 run, 18.225 neo4j
100k records, N runs, 180.143 mysql
Perhaps the summation ought to be 'significantly better'

1M records, 1 run, 462.466, neo4j
1M records, N runs, 300.00 mysql
Perhaps the summation for 1M records ought to be 'a bit worse ... but, since the same N runs were not made with neo4j (as were made with mysql) then this comparison is not legitimate'

Joerg Baach
Joerg Baach says:
Dec 31, 2013 12:58 PM

When comparing the improved results to my original measurements, it's an improvement. When comparing to the original claim for 1m people, it takes over 200 times longer than promised. That's what I call 'bad'.

Uma Priyadarsi
Uma Priyadarsi says:
Jun 07, 2014 07:21 AM

How many edges are there in total?

Uma Priyadarsi
Uma Priyadarsi says:
Jun 07, 2014 08:29 AM

Could you also please elaborate on the algorithm you used, to find the nodes at any depth from a given node?

I tried to replicate the results with 1 million nodes and 50 million edges, 50 edges per node.

When I try to find nodes at a depth of 5, it takes almost 50 seconds.

My program runs in c++, and I generate the database every time and store it in memory itself, i.e. in ram and not on the hard disk. Still it takes 50 seconds.

Generating datasets doesn't take much time though. To generate 4 million nodes with 200 million edges, it takes about 80 seconds. To generate 1 million nodes with 50 million edges it took 20 seconds.

I store the database in the form of an adjacency list. So suppose node j has an edge with node i, then at index i of the adjacency list, 'j' will be listed. To be precise: vector< vector<int> > adjacency_list.

After generating the database, the program takes input from the user the root node, and the depth (d). The program then finds all the nodes at a depth d from the root node the user gave as input.

I use Dijkstra's algorithm to implement the above query. To my knowledge that is the fastest.

So, could you please explain where I might be going wrong.

Uma Priyadarsi
Uma Priyadarsi says:
Jun 07, 2014 02:58 PM

It's cool, I found my mistake..

There was no need to do anything hi-fi like Dijkstra's algorithm.. Plain old bfs was all that was needed..

Now I'm getting much much better results. To be truthful, better than the results you tabulated here...

For 1 million nodes and 50 million edges (50 per node), to find nodes at depth 5 it takes 0.84 seconds

Joerg Baach
Joerg Baach says:
Jun 11, 2014 09:48 AM

Hi Uma, could you share your code? I would be really interested to learn how you got the performance increase using cypher. Or did you use an api directly (which would be much faster, but a completely different comparison to the one in the book and my blog article)?

Uma Priyadarsi
Uma Priyadarsi says:
Jul 09, 2014 05:07 PM

Hi Joerg.. I'm sorry for the delayed reply. I forgot that I commented on this article. I was checking again, to compare my results with the ones you tabulated above. I kind of use your results as a benchmark.

Since I last commented, my code has undergone a lot of changes. Anyhow, the reason it is faster is the following (I'm a bit of a novice, so do forgive me if I have made wrong comparisons between your results and what I have been doing):
1) I did not use neo4j or any api :p.. I used only c++ data types to store my database.
2) My whole database is stored in the RAM, and not on the hard disk.

So, generally c++ is faster than python (at least that's what I hear), and since none of my data is stored on the hard disk, I guess that's the reason my query was faster.

But that was about a month ago. I'm now targeting a database with a higher number of nodes and edges, so I'm having to store the database on the hard disk. Now my results are worse than before.. for example:
for 100,000 nodes each with 10 edges per node, to find all the nodes at depth 5, it took 22 seconds.
I'm now thinking about how to make my code take advantage of the speed I get when I store in ram, and the bigger size of graph databases I can store on my hard disk.
But still, if you want to take a look at my code on how it finds nodes at a certain depth d, here you go:

void nodesAtDepth(int node_i, int depth) {
    // finding nodes at depth d from node i - running bfs
    std::clock_t start = std::clock();

    // the maximum distance between any two nodes cannot be more than nodeCount
    vector<long long> depthRecord(nodeCount, nodeCount);
    depthRecord[node_i] = 0;

    queue<int> node_Queue;
    node_Queue.push(node_i);

    while (!node_Queue.empty()) {
        int id = node_Queue.front();
        node_Queue.pop();
        for (int j = 0; j < nodes[id].knowEdge.size(); j++) {
            int newNode = nodes[id].knowEdge[j];
            if (depthRecord[id] + 1 <= depth && depthRecord[id] + 1 < depthRecord[newNode]) {
                depthRecord[newNode] = depthRecord[id] + 1;
                node_Queue.push(newNode);
            }
        }
    }
    double duration = (std::clock() - start) / (double) CLOCKS_PER_SEC;

    int counter = 0;
    for (int j = 0; j < depthRecord.size(); j++) {
        if (depthRecord[j] == depth) {
            cout << "node(s) at depth " << depth << " from node " << node_i << " " << j << "\n";
            counter++;
        }
    }

    std::cout << "time taken " << duration << " second(s). Total number of nodes at depth "
              << depth << " from node " << node_i << " " << counter << "\n\n";
}

Yadvendra Singh
Yadvendra Singh says:
Nov 19, 2014 02:30 PM

Use neo4j from the console if you are using the others from the console, as REST will add network overhead.

Erik
Erik says:
Dec 01, 2014 08:54 AM

I haven't done such a thorough investigation of performance, so thank you very much for your work! I am using Neo4j mostly for its structural advantages, since I am working with graphs and didn't want to implement graph algorithms in SQL. However, I also observe performance issues with Cypher, also for newer versions of Neo4j. When I have such a case, I add a specialized server plugin endpoint to my own Neo4j server plugin. You could think of this like stored procedures in your SQL server. But still I completely agree: this is comparing apples with oranges. While it works well for me to replace Cypher queries with specialized java plugin code, quite a few queries that should be easily done via Cypher I had to put into my plugin with a comment explaining that it is just faster this way. I really hope Cypher performance will be maturing significantly in the future. Or perhaps we will just learn that this is the actual pitfall with graph databases: their arbitrary structure might just be too hard for automatically optimized queries.

Ramin
Ramin says:
Dec 16, 2014 11:48 PM

You sure about your SQL query? Shouldn't you join the "t_user_friend" table with itself 3 times to get to the depth of 3 in the graph? Or is this some kind of short-hand SQL that I'm not familiar with?

nils
nils says:
Feb 06, 2015 12:23 PM

@Ramin: the ',' is a JOIN without any conditions (any to any); he adds the 'ON' conditions in the WHERE clause.

Tobin
Tobin says:
May 20, 2015 07:46 PM

A **huge** problem with these tests is that they're totally biased: you're comparing retrieving data over HTTP (REST) - not to mention using a REST API (which adds slowness too) vs. bare metal requests to a db using its native language (SQL for MYSQL).

I suggest you do your requests to Neo4j using a JAVA API. Your performance, I bet, will be MUCH better.

Please give that a shot and report your performance!

Read this thread too (Python adds slowness too).

https://groups.google.com/forum/#!topic/neo4j/bFPHyNAWI-Y

Joerg Baach
Joerg Baach says:
May 22, 2015 10:06 AM

Hi Tobin, I described the setup in the 'setup' section of this page. The books brought examples in cypher, which is comparable to sql for a database.

The whole background is the dubious claims in the "graph databases" book, which seem to make use of cypher in the first place.

Programaths
Programaths says:
Jun 12, 2015 06:37 PM

When you use Cypher through the REST API you use the full web stack, including marshalling and unmarshalling of requests (parsing headers & body).
When you connect to MySQL, you make a single connection with a dedicated and optimized protocol.

To better understand, simply benchmark the time you would take to do 1 million requests to google. It will be a lot slower than your MySQL query!

The time you lose is the time spent to wire the query to Neo4j and wire the answer back to your Python script.

You can improve your benchmark by adding the same requirement to MySQL:
MySQL should be accessed in a restful way. That means that you should send the queries through HTTP and receive any format you can parse client side (e.g. JSON).

OR, you should compare it with the Java client. See http://stackoverflow.com/questions/25976190/performance-of-java-api-versus-python-with-cypher-for-neo4j for details on this.

citynorm
citynorm says:
Aug 10, 2015 07:57 PM

Hi Joerg,

thanks for running the benchmark! I'm currently debating whether I should use a mix of neo4j and MySQL or just stick with MySQL in the first place!

Can you please help me understand a part of your SQL query? I'm wondering how it addresses friends I added vs friends in my network. In neo4j it would be the difference between directed and undirected. In your SQL, I worry that you only look at people that user 46973 friended vs including his friends as well as people who friended him. See:
select
count(*)
from
t_user_friend as uf1
where
uf1.user_1='46973' or uf1.user_2='46973';
=> returns 98 vs 50.

If you add all those OR relations the 3 degree friend query becomes much slower.

And fwiw on my tuned machine the 3 degree friend query takes 25sec instead of 0.07sec.

citynorm
citynorm says:
Aug 10, 2015 08:08 PM

PS UNtuned machine

Stephane
Stephane says:
Sep 02, 2015 03:25 PM

@citynorm You have a configuration with both MySQL and Neo4J ? I'm interested :-) http://stackoverflow.com/questions/32296676/tests-fail-with-a-transactionrequiredexception-no-transaction-is-in-progress-ex

JohnR
JohnR says:
Oct 29, 2015 05:24 PM

Just a note (because I just completed a similar MySQL benchmark): you will get very skewed results if you simply ask for a count. This is because the database can "cheat" and use indexes only for counts. If you return the actual results (then pipe to tail or devnull), you'll get better accuracy. Regardless, for 1st and 2nd degree connections, I found MySQL very fast. For 3rd and beyond, it was very slow. Makes sense, because for MySQL this algorithm runs in roughly f^d time, where f = average number of friends per person and d = degrees of separation or depth.

James Fremen
James Fremen says:
Mar 19, 2016 03:09 AM

Regardless, BFS using a query-centric graph or SQL database isn't going to scale beyond toy problems. If you're doing a lot of this you should look at graph traversal algorithms.

Ciaran
Ciaran says:
Jul 01, 2016 09:26 AM

Your test is poorly designed (REST API vs. MySQL client; MySQL uses the index only for count(), no data retrieval; MyISAM storage is not as durable as Neo4j storage; and so on), yet it is the first result in Google for "neo4j performance".
The "TL;DR" camp of people, like many CIOs, will be disappointed by Neo4j after spending only a second on the subject.
It seems your article will be here until the end of time, not as an example of bad design but as an example of how Neo4j is a poor performer. It's not fair...

Hunter Sherman
Hunter Sherman says:
Jul 15, 2016 06:30 PM

As others have stated, you're comparing requests made to Neo4j over a RESTful service to native requests made to MySQL. It's an apples to oranges comparison.

The book does not state that they used the RESTful approach to make their comparison, and it appears in the book before the section where they introduce Cypher (at least in my version) so you can't assume that they were making their queries via the RESTful interface.

Angel G
Angel G says:
Aug 26, 2017 05:33 AM

To compare MySQL / MariaDB to a specialized graph database, you should use the specialized graph processing engine for MySQL / MariaDB - OQGRAPH.
