dbpedia and neo4j (first steps II)
Over the last few days I spent quite some time playing around with (a.k.a. researching) neo4j, trying to get dbpedia data into it.
One of the aspects I wanted to find out about was the exact situation of indexing. I put up a question on stackoverflow (http://stackoverflow.com/questions/17428276). The short version is:
- for my use cases I want to connect using http
- I need full transactional support (commit and rollback)
- this leaves 'only' the transactional endpoint of neo4j 2.0
- this endpoint only speaks cypher
- cypher can only create label indexes
- you can't change the configuration of label indexes to fulltext
- auto_indexing can be configured to support fulltext search
There are example datasets in the graph database book. These examples don't work out of the box, so I made some that do. The only problem is that the example data is not very interesting. Luckily there is dbpedia. They provide content from Wikipedia in different RDF serializations (triples, N3, Turtle): nodes and relationships of interesting data.
I decided to use the mapping-based properties dataset, or "Ontology Infobox Properties". Downloading the data is simple. Getting it into neo4j was not that easy at first, but I got something that works well in the end.
For linux users I put together some scripts in the neo4j-experiments repo on github. If you have java, python, virtualenv and bunzip2 installed, basically all you need to do is:
git clone https://github.com/jhb/neo4j-experiements.git
cd neo4j-experiements
./doit.sh
The doit.sh script will:
- setup a python virtualenv
- install neo4j 2.0 M3
- configure the auto_indexes (see neo4j.properties and the doit.sh file)
- download the german dbpedia data
- parse the dbpedia data
- import the dbpedia using a little neo4j connector
- setup some label indexes
Be aware, this is very ugly code. The kind you get at three in the morning :-)
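To give an idea of what the "parse the dbpedia data" step involves, here is a minimal sketch of reading N-Triples lines in Python 3. It is not the parser from the repo. The names TRIPLE_RE and parse_line are made up for this illustration, and it assumes the dump is in the line-based N-Triples format, with one "<subject> <predicate> object ." statement per line. Escape sequences inside literals are passed through undecoded.

```python
import re

# One N-Triples statement: <subject> <predicate> object .
# The object is either an <IRI> (which becomes a relationship) or a
# "literal" with an optional @lang or ^^<datatype> suffix (a property).
TRIPLE_RE = re.compile(
    r'^<(?P<s>[^>]*)>\s+<(?P<p>[^>]*)>\s+'
    r'(?:<(?P<o_iri>[^>]*)>'
    r'|"(?P<o_lit>(?:[^"\\]|\\.)*)"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
    r'\s*\.\s*$'
)


def parse_line(line):
    """Return (subject, predicate, object, is_relationship).

    Returns None for blank lines, comments and lines that do not match.
    """
    line = line.strip()
    if not line or line.startswith('#'):
        return None
    match = TRIPLE_RE.match(line)
    if match is None:
        return None
    if match.group('o_iri') is not None:
        # Object is another resource: a relationship between two nodes.
        return (match.group('s'), match.group('p'), match.group('o_iri'), True)
    # Object is a literal: a property on the subject node.
    return (match.group('s'), match.group('p'), match.group('o_lit'), False)
```

Triples whose object is an IRI would end up as relationships, everything else as node properties.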
Dead ends and lessons learned
There were some wrong routes I took while trying to get dbpedia into neo4j.
I tried string substitution for cypher queries before sending them through my connector. Stuff like:
"create (n %s)" % mydata
This is a bad idea (tm). First, it meant tons of unicode encoding problems; probably around 5% of the statements didn't work. Second, it's slow: importing the 580k nodes and 1300k relationships seemed to take 15 hours or so. That is not good for the impatient coder, and I accidentally stopped the importer script as well. I then started to look for a faster approach. I tried exporting the statements into a text file and importing that with neo4j-shell. Very slow as well. So I kept looking for alternatives.
There is the python embedded project, which only works with 1.9 (documentation: http://docs.neo4j.org/drivers/python-embedded/snapshot/#python-embedded ). It works, and I had played around with it before (it does not seem to be thread safe). Using its api I could import the data really, really fast. That looked good; the only problem is that it only supports version 1.9 of neo4j, which is useless for my real world use cases. And there is no easy 1.9 -> 2.0 conversion.
But the speed of the embedded python got me thinking (for once). The import as such can be fast, so why were my statement-based approaches so slow? And then it hit me: the neo4j documentation says somewhere that one should always use statements with parameters to speed things up. The idea seems to be that neo4j can parse the statement once, cache it, and then fill in the parameters as needed later on.
And that did the trick in the end. The import speed using the transactional http endpoint is around 4000 entries per second. Write operations. Transactions. Over http. Fantastic.
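As a rough illustration of the parameterized approach (not the connector from the repo), the sketch below builds one request body for the transactional endpoint, with the same statement text for every node and only the parameters changing. It assumes Python 3, a server on localhost:7474 with the default /db/data/transaction/commit path, and the {props} parameter syntax of neo4j 2.0. The names COMMIT_URL, CREATE_NODE, build_payload and post_batch are made up for this example.

```python
import json
import urllib.request

# Assumption: default address of the transactional endpoint. Posting to
# .../transaction/commit begins and commits a transaction in one request.
COMMIT_URL = "http://localhost:7474/db/data/transaction/commit"

# The statement text stays constant, so the server can parse it once
# and reuse it; only the parameters differ from node to node.
CREATE_NODE = "CREATE (n {props})"


def build_payload(rows):
    """Turn a list of property dicts into one JSON request body,
    with one parameterized statement per row."""
    statements = [
        {"statement": CREATE_NODE, "parameters": {"props": row}}
        for row in rows
    ]
    return json.dumps({"statements": statements})


def post_batch(rows, url=COMMIT_URL):
    """Send a whole batch in a single request. Needs a running server.

    The caller should inspect the "errors" list of the returned dict.
    """
    request = urllib.request.Request(
        url,
        data=build_payload(rows).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Because the values travel as JSON parameters instead of being pasted into the query string, quotes and non-ASCII characters in titles no longer break the statement. Calling post_batch with a few hundred or thousand rows at a time keeps the number of http round trips low.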
So, lessons learned:
- read and follow the documentation
- neo4j is fast if you follow the rules
The property 'noscenda_name' of nodes contains the title from dbpedia. There is a fulltext index for this attribute:
start n=node:node_auto_index('xmlns_name:matrix') return n.noscenda_name;
Yes, I know, there are three lines too many :-)
There is also the label 'node' set on all nodes, and some indexes on the noscenda_* attributes:
match n:node-[r]->m where n.noscenda_name="dbpedia:Matrix_(Film)" return n.noscenda_name,type(r),m.noscenda_name;
And something more demanding: searching for all nodes that connect, over up to 4 hops, to something that the film Matrix links to:
match r=n:node-->()<-[*..4]-m where n.noscenda_name='dbpedia:Matrix_(Film)' return length(r) as l,m.noscenda_name order by l desc limit 10;
This will return:
+--------------------------------------------+
| l | m.noscenda_name                        |
+--------------------------------------------+
| 5 | "dbpedia:GNU_Common_Lisp"              |
| 5 | "dbpedia:Perl_(Programmiersprache)"    |
| 5 | "dbpedia:Haskell_(Programmiersprache)" |
| 5 | "dbpedia:Oz_(Programmiersprache)"      |
| 5 | "dbpedia:Tcl"                          |
| 5 | "dbpedia:Ruby_(Programmiersprache)"    |
| 5 | "dbpedia:Python_(Programmiersprache)"  |
| 5 | "dbpedia:Dylan_(Programmiersprache)"   |
| 5 | "dbpedia:Scheme"                       |
| 5 | "dbpedia:Esthwaite_Water"              |
+--------------------------------------------+
Now, that's surprising (and interesting)....