dbpedia and neo4j (first steps II)

by Jörg Baach last modified Jul 23, 2015 03:17 PM
Importing dbpedia data into neo4j. It's fast (if you do it right), and indexing can be solved.

Over the last few days I spent quite some time playing around with (a.k.a. researching) neo4j, trying to get dbpedia data into it.

Indexing

One of the aspects I wanted to find out about was the exact situation of indexing. I put up a question on stackoverflow (http://stackoverflow.com/questions/17428276). The short version is:

  • for my use cases I want to connect using http
  • I need full transactional support (commit and rollback)
  • this leaves 'only' the transactional endpoint of version 2
  • this endpoint only speaks cypher
  • cypher can only do label indexes
  • you can't change the configuration of label indexes to fulltext
  • auto_indexing can be configured to support fulltext search
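
As a rough sketch of the last point: one commonly described way to get a fulltext auto index is to create the legacy index named node_auto_index with a fulltext configuration before any data is written, in addition to switching on auto indexing in neo4j.properties (settings such as node_auto_indexing and node_keys_indexable). Whether doit.sh does exactly this is an assumption; the URL below assumes a default local server, and the snippet only builds the request rather than sending it.

```python
import json

# Assumption: default REST root of a local Neo4j server.
INDEX_URL = "http://localhost:7474/db/data/index/node"


def fulltext_index_request(name="node_auto_index"):
    """Return (url, body) for creating a legacy Lucene index in fulltext mode."""
    body = {"name": name, "config": {"type": "fulltext", "provider": "lucene"}}
    return INDEX_URL, json.dumps(body)


url, body = fulltext_index_request()
print(url)
print(body)
```

The body would be sent as a POST with Content-Type application/json before the import starts, so that the auto index picks up the fulltext configuration.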

Example data

There are example datasets in the graph database book. These examples don't work out of the box, so I made some examples that do work. The only problem is that the example data is not very interesting. Luckily there is dbpedia. They provide content from wikipedia in different forms of rdf/triples/n3/turtle: nodes and relationships of interesting data.

I decided to use the mappingbased properties dataset, or "Ontology Infobox Properties". Downloading the data is simple. Getting it into neo4j was not that easy at first, but I got something that works well in the end:

Playground

For linux users I put together some scripts in the neo4j-experiements repo on github. If you have java, python, virtualenv and bunzip2 installed, basically all you need to do is:

git clone https://github.com/jhb/neo4j-experiements.git
cd neo4j-experiements
./doit.sh

This will:

  • set up a python virtualenv
  • install neo4j 2.0 M3
  • configure the auto_indexes (see neo4j.properties and the doit.sh file)
  • download the german dbpedia data
  • parse the dbpedia data
  • import the dbpedia data using a little neo4j connector
  • set up some label indexes
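
To give an idea of what the parsing step involves, here is a minimal sketch that splits one N-Triples line into subject, predicate and object and shortens known URI prefixes. It is not the parser from the repo; the regular expression, the prefix table and the sample lines are made up for illustration.

```python
import re

# Matches "<subject> <predicate> <object-or-literal> ." lines.
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.*?)\s*\.\s*$')

# Hypothetical prefix table; adjust it to the dataset you downloaded.
PREFIXES = {
    "http://de.dbpedia.org/resource/": "dbpedia:",
    "http://dbpedia.org/ontology/": "ontology:",
}


def shorten(uri):
    """Replace a known URI prefix with its short form."""
    for full, short in PREFIXES.items():
        if uri.startswith(full):
            return short + uri[len(full):]
    return uri


def parse_line(line):
    """Return (subject, predicate, object, is_relation), or None if the line is no triple."""
    match = TRIPLE.match(line.strip())
    if match is None:
        return None
    subj, pred, obj = match.groups()
    if obj.startswith("<") and obj.endswith(">"):
        # Object is a URI: this becomes a relationship between two nodes.
        return shorten(subj), shorten(pred), shorten(obj[1:-1]), True
    # Object is a literal: this becomes a property on the subject node.
    return shorten(subj), shorten(pred), obj, False


# Made-up sample lines in the shape of the dataset.
relation = parse_line(
    "<http://de.dbpedia.org/resource/Example_A> "
    "<http://dbpedia.org/ontology/link> "
    "<http://de.dbpedia.org/resource/Example_B> ."
)
literal = parse_line(
    '<http://de.dbpedia.org/resource/Example_A> '
    '<http://dbpedia.org/ontology/title> "Example"@de .'
)
print(relation)
print(literal)
```

Triples whose object is a URI map to relationships, the rest to node properties.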

Be aware, this is very ugly code. The kind you get at three in the morning :-)

Deadends and lessons learned

There were some wrong routes I took while trying to get dbpedia into neo4j.

I tried to do string substitution for cypher queries before using them in my connector. Stuff like:

"create (n %s)" % mydata

This is a bad idea (tm). First, it meant tons of unicode encoding problems; probably around 5% of the statements didn't work. Second, it's slow: importing the 580k nodes and 1300k relations looked like it would take 15 hours or so. This is no good for the impatient coder, and I accidentally stopped the importer script on top of that. So I started to look for a faster approach. I tried exporting the statements into a text file and importing that with neo4j-shell. That was very slow as well, so I kept looking for alternatives.
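
A small illustration of the problem, using made-up rows: with string substitution every row produces a different statement text, and the Python representation of the data (quoted keys, Python-style escaping) ends up inside the query, where it is not guaranteed to be valid cypher. With a parameter placeholder the statement text stays constant.

```python
# Made-up sample rows.
rows = [{"name": "Matrix"}, {"name": "O'Brien"}]

# String substitution: the dict's Python repr is pasted into the statement.
substituted = ["create (n %s)" % (row,) for row in rows]

# Parameterized: one constant statement text, the data travels separately.
parameterized = [("create (n {props})", {"props": row}) for row in rows]

for statement in substituted:
    print(statement)
print(len(set(substituted)), "distinct statement texts with substitution")
print(len({text for text, _ in parameterized}), "distinct statement text with parameters")
```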

There is the python embedded project, which only works with neo4j 1.9. Documentation can be found at: http://docs.neo4j.org/drivers/python-embedded/snapshot/#python-embedded It works, and I had played around with it before (it seems not to be thread safe). Using its api I could import the data really, really fast. That looked good; the only problem is that it is tied to version 1.9 of neo4j, which is useless for my real world use cases. And there is no easy 1.9 -> 2.0 conversion.

But the speed of the embedded python got me thinking (for once). The import as such can be fast, so why were my statement-based approaches so slow? And then it hit me: the neo4j documentation says somewhere that one should always use statements with parameters to speed things up. The idea seems to be that neo4j can parse the statement once, cache it, and then fill in the variable parameters as needed later on.

And that did the trick in the end. The import speed using the transactional http endpoint is around 4000 entries per second. Write operations. Transactions. Over http. Fantastic.
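
For illustration, a minimal sketch of how a batch of parameterized statements can be packed into one request for the transactional endpoint. This is not the connector from the repo; the URL assumes a default local server, and the property values are just sample data.

```python
import json
import urllib.request

# Assumption: transactional endpoint of a default local Neo4j 2.0 server.
COMMIT_URL = "http://localhost:7474/db/data/transaction/commit"

# The statement text never changes; only the parameters do.
STATEMENT = "create (n {props})"


def build_payload(rows):
    """Wrap many rows into one request body of parameterized statements."""
    statements = [
        {"statement": STATEMENT, "parameters": {"props": row}} for row in rows
    ]
    return json.dumps({"statements": statements})


def post(payload, url=COMMIT_URL):
    """Send the payload and return the decoded JSON response (needs a running server)."""
    request = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_payload(
    [{"noscenda_name": "dbpedia:Tcl"}, {"noscenda_name": "dbpedia:Scheme"}]
)
print(payload)
```

Sending a few hundred or thousand statements per request keeps the number of http round trips low, which is likely part of why this approach is fast.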

So lesson learned:

  • read and follow the documentation
  • neo4j is fast if you follow the rules.

Playing around

The property 'noscenda_name' of nodes contains the title from dbpedia. There is a fulltext index for this attribute:

start n=node:node_auto_index('xmlns_name:matrix') return n.noscenda_name;

Yes, I know, there are three lines too many :-)

There is also the label 'node' set on all nodes, and some indexes on the noscenda_* attributes:

match n:node-[r]->m where n.noscenda_name="dbpedia:Matrix_(Film)" return n.noscenda_name,type(r),m.noscenda_name;

And something more demanding: searching for all nodes that connect over up to 4 hops to something that is linked from the film Matrix:

match r=n:node-->()<-[*..4]-m where n.noscenda_name='dbpedia:Matrix_(Film)' return length(r) as l,m.noscenda_name order by l desc limit 10;

This will return:

+--------------------------------------------+
| l | m.noscenda_name                        |
+--------------------------------------------+
| 5 | "dbpedia:GNU_Common_Lisp"              |
| 5 | "dbpedia:Perl_(Programmiersprache)"    |
| 5 | "dbpedia:Haskell_(Programmiersprache)" |
| 5 | "dbpedia:Oz_(Programmiersprache)"      |
| 5 | "dbpedia:Tcl"                          |
| 5 | "dbpedia:Ruby_(Programmiersprache)"    |
| 5 | "dbpedia:Python_(Programmiersprache)"  |
| 5 | "dbpedia:Dylan_(Programmiersprache)"   |
| 5 | "dbpedia:Scheme"                       |
| 5 | "dbpedia:Esthwaite_Water"              |
+--------------------------------------------+

Now, that's surprising (and interesting)....

Petra says:
May 26, 2014 01:28 PM

Hi. Thank you for sharing your experience and code.
I have a problem with importing the relations to neo4j.

The error:
Traceback (most recent call last):
  File "D:/_Work/Progs/Lod2Neo4j/neo4j-experiements/importer.py", line 91, in <module>
    result = g.query(statements)
  File "D:\_Work\Progs\Lod2Neo4j\neo4j-experiements\neo4jconnector.py", line 108, in query
    return self.call(payload)
  File "D:\_Work\Progs\Lod2Neo4j\neo4j-experiements\neo4jconnector.py", line 94, in call
    raise e
Exception: Expected a propertycontainer or number here, but got: row

The statements look like this:
['start n=node({origin}),m=node({target}) create unique n-[:`w3_22-rdf-syntax-ns` {kw}]->m', {'origin': u'row', 'kw': {'noscenda_origin': 'dbp_m', 'noscenda_uid': 'a8dd17d283824e6aba3d1afe01abb7e5'}, 'target': u'row'}]
['start n=node({origin}),m=node({target}) create unique n-[:`dbpedia_state` {kw}]->m', {'origin': u'row', 'kw': {'noscenda_origin': 'dbp_m', 'noscenda_uid': '759107f97d794e71989f33e68e4f5290'}, 'target': u'row'}]
['start n=node({origin}),m=node({target}) create unique n-[:`w3_22-rdf-syntax-ns` {kw}]->m', {'origin': u'row', 'kw': {'noscenda_origin': 'dbp_m', 'noscenda_uid': '79a5dd3932c4433aa8b8aba169834019'}, 'target': u'row'}]
['start n=node({origin}),m=node({target}) create unique n-[:`w3_22-rdf-syntax-ns` {kw}]->m', {'origin': u'row', 'kw': {'noscenda_origin': 'dbp_m', 'noscenda_uid': 'a687f23e4b5b41f587775cadafedca97'}, 'target': u'row'}]
['start n=node({origin}),m=node({target}) create unique n-[:`dbpedia_country` {kw}]->m', {'origin': u'row', 'kw': {'noscenda_origin': 'dbp_m', 'noscenda_uid': 'a3386c0e95404f22804fc844cae15ecb'}, 'target': u'row'}]

Do you have any hints how to solve the problem?
Thank you in advance,
Petra

Joerg Baach says:
May 26, 2014 02:18 PM

Hi Petra,

I had a short look at it, but couldn't find the error. It would probably need more time to fix. Currently my focus has drifted a bit away from neo4j (see https://baach.de/Members/jhb/neo4j-performance-compared-to-mysql).

If you however find a solution, please drop me a line :-)

Sorry for not being able to help on this.

Petra says:
May 27, 2014 11:14 AM

Hi.
I solved the problem.
Line 121 of neo4jconnector.py should be
out.append(dict(zip(cols,r["row"])))
instead of
out.append(dict(zip(cols,r)))

With this correction it works perfectly.
Best, Petra
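
For readers following this fix, a short sketch of why it matters, assuming the response shape of the transactional endpoint (a list of results, each with "columns" and a "data" list of {"row": [...]} entries). Zipping the columns with the entry itself pairs them with the dict key "row", which matches the u'row' values in the statements above. The function name and sample data are made up for illustration.

```python
def rows_to_dicts(result):
    """Turn one result of the transactional endpoint into a list of dicts."""
    cols = result["columns"]
    return [dict(zip(cols, entry["row"])) for entry in result["data"]]


# Made-up sample in the assumed response shape.
sample = {
    "columns": ["l", "m.noscenda_name"],
    "data": [{"row": [5, "dbpedia:Tcl"]}, {"row": [5, "dbpedia:Scheme"]}],
}

# Without ["row"], iterating the entry yields its key, the string "row".
broken = [dict(zip(sample["columns"], entry)) for entry in sample["data"]]

print(rows_to_dicts(sample))
print(broken)
```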

Joerg Baach says:
Jun 11, 2014 09:42 AM

That's great. Thanks a lot, I changed it online (without having tested it, though).

Eliza says:
Apr 08, 2016 07:43 AM

Hello! Thank you for your code.

I have a problem in your code with transactionurl (transactionurl is not defined). I don't know where it is used and for what purpose the program needs this value. Can you help me?

Thank you in advance,
Liza
