
SOLVED: neo4j full text indexing - limited to 32k


Research into full text indexing for neo4j

Update: This whole issue is fixed in current versions (I am using 4.4.5 right now), where I can actually create a long string property and have it full-text indexed as well. Hurray!

CREATE FULLTEXT INDEX my_fulltext FOR (n:MyText) ON EACH [n.text]
CREATE (n:MyText) SET n.text = REDUCE(text='', s IN [x IN range(0,100000) | 'x' ] | text+" "+s) RETURN id(n)
CALL db.index.fulltext.queryNodes("my_fulltext", "x x x") YIELD node, score

See: https://neo4j.com/docs/cypher-manual/current/indexes-for-full-text-search
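For completeness, a minimal sketch of running that query from python with the 4.x driver (connection details as in the test scripts below, index and label names from the snippet above):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "graph"))

with driver.session() as session:
    # query the fulltext index created above; lucene relevance comes back as 'score'
    result = session.run('CALL db.index.fulltext.queryNodes("my_fulltext", "x x x") '
                         'YIELD node, score RETURN id(node) AS id, score')
    for record in result:
        print(record["id"], record["score"])

driver.close()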

For my project I need two things: a graph and a full text index. After researching quite a few options, neo4j is appealing once more, because of the elegance of Cypher. If only there was proper full text indexing, by which I ideally mean the query power of lucene.

So, what options do we have?

Preparation - the test data

Let's have ourselves a little python test script:

from neo4j.v1 import GraphDatabase
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "graph"))

with driver.session() as session:

    # short text on the non-indexed property 'foo'
    session.run('create (n:Article {id:{id},foo:{text}})', id=1, text='neo ' * 1000)
    print 'Does it work with short text on non-indexed property?'
    session.sync()

    # text > 32k, still on the non-indexed property 'foo'
    session.run('create (n:Article {id:{id},foo:{text}})', id=2, text='neo ' * 10000)
    print 'Does it work with a text > 32k on non-indexed property?'
    session.sync()

    # short text on the to-be-indexed property 'text'
    session.run('create (n:Article {id:{id},text:{text}})', id=3, text='neo ' * 1000)
    print 'Does it work with a short text on indexed property?'
    session.sync()

    # text > 32k on the to-be-indexed property 'text'
    session.run('create (n:Article {id:{id},text:{text}})', id=4, text='neo ' * 10000)
    print 'Does it work with a text > 32k on indexed property?'
    print session.sync()

This creates four nodes. Notice that the test text is sometimes short, sometimes long, and that it is stored in 'foo' for nodes 1 and 2, but in 'text' for nodes 3 and 4.

Out of the box - Schema Indexes

With a current neo4j (3.1.2) we get schema indexing. I quite like the concept. When querying, the fanciest thing you get is CONTAINS: you can check whether some characters occur in the text, but that's it. Also, to get back the relevance of a result you need to use the REST API.

Let's have a quick check whether it works. First we insert data using our test script, and then:

neo4j-sh (?)$ CREATE INDEX ON :Article(text);
+-------------------+
| No data returned. |
+-------------------+
Indexes added: 1
19 ms
neo4j-sh (?)$ match (n) where n.text contains ('eo n') return n.id;
+------+
| n.id |
+------+
| 3    |
| 4    |
+------+
2 rows

Great, it works (see below).

This could nearly suffice, but having at least AND / OR options would be good.
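To be fair, simple boolean combinations can be emulated by chaining CONTAINS predicates - a quick sketch along the lines of the test script:

from neo4j.v1 import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "graph"))

with driver.session() as session:
    # AND: both fragments must occur; replace 'and' with 'or' for OR semantics
    result = session.run('match (n:Article) '
                         'where n.text contains {a} and n.text contains {b} '
                         'return n.id',
                         a='eo n', b='neo')
    for r in result:
        print r['n.id']

Workable, but still no wildcards and no ranking.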

Let's try the alternatives.

Legacy indexing - node_auto_index

An old mechanism which provides a special (lucene) index that automatically indexes all nodes as they are created. There is a nice blog post describing how to set up these automatic indexes for neo4j.

Short version, for future reference:

In neo4j.conf

dbms.auto_index.nodes.enabled=true
dbms.auto_index.nodes.keys=text 

(you can use a comma separated list of fields, e.g. text,title,searchableText)

Next, in the neo4j-shell:

CREATE (n:Article {text:"foobar"});
index --set-config node_auto_index type fulltext
MATCH (n:Article) SET n.text = n.text;
START n=node:node_auto_index("text:*ob*") RETURN n;

From what I can tell, there first needs to be an object with the indexed property for the index to be created. Once the index exists, it can be switched to fulltext, but it needs repopulation (the self-assignment above triggers that). After that it can be queried.
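The nice part is that the query string is plain lucene syntax, so AND / OR and wildcards come for free. A sketch of what that looks like from python (same driver setup as in the test script):

from neo4j.v1 import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "graph"))

with driver.session() as session:
    # lucene query syntax on the indexed field: boolean operators plus wildcards
    result = session.run('start n=node:node_auto_index("text:foo* AND text:bar*") '
                         'return n.text')
    for r in result:
        print r['n.text']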

Does it work in real life? Running our test script gives the following output:

Does it work with short text on non-indexed property?
Does it work with a text > 32k on non-indexed property?
Does it work with a short text on indexed property?
Does it work with a text > 32k on indexed property?
Traceback (most recent call last):
 File "/home/joerg/projects/enterprisesearch/quicktest.py", line 16, in <module>
 print 'Does it work with a text > 32k on indexed property?'
 File "/home/joerg/projects/enterprisesearch/env/local/lib/python2.7/site-packages/neo4j/v1/bolt.py", line 115, in __exit__
 self.close()
 File "/home/joerg/projects/enterprisesearch/env/local/lib/python2.7/site-packages/neo4j/v1/bolt.py", line 121, in close
 self.sync()
 File "/home/joerg/projects/enterprisesearch/env/local/lib/python2.7/site-packages/neo4j/v1/bolt.py", line 164, in sync
 self.connection.sync()
 File "/home/joerg/projects/enterprisesearch/env/local/lib/python2.7/site-packages/neo4j/bolt/connection.py", line 421, in sync
 count += self.fetch()
 File "/home/joerg/projects/enterprisesearch/env/local/lib/python2.7/site-packages/neo4j/bolt/connection.py", line 407, in fetch
 response.on_failure(metadata or {})
 File "/home/joerg/projects/enterprisesearch/env/local/lib/python2.7/site-packages/neo4j/v1/bolt.py", line 222, in on_failure
 raise self.error_class(metadata)
neo4j.v1.api.CypherError: Could not apply the transaction to the store after written to log

Not only that, but neo4j itself gets stuck, e.g. you can't reconnect with the neo4j-shell. A quick look into debug.log shows:

2017-03-20 11:46:27.818+0000 ERROR [o.n.b.v.r.ErrorReporter] Client triggered an unexpected error [TransactionStartFailed]: Database has encountered some problem, please perform necessary action (tx recovery/restart), reference 91fe5c49-4716-40c7-b762-2a1354910cfe. Database has encountered some problem, please perform necessary action (tx recovery/restart)
org.neo4j.graphdb.TransactionFailureException: Database has encountered some problem, please perform necessary action (tx recovery/restart)
 at org.neo4j.kernel.impl.factory.ClassicCoreSPI.beginTransaction(ClassicCoreSPI.java:181)
 at org.neo4j.kernel.impl.factory.GraphDatabaseFacade.beginTransactionInternal(GraphDatabaseFacade.java:578)
 at org.neo4j.kernel.impl.factory.GraphDatabaseFacade.beginTransaction(GraphDatabaseFacade.java:383)
 at org.neo4j.bolt.v1.runtime.TransactionStateMachineSPI.beginTransaction(TransactionStateMachineSPI.java:95)
 at org.neo4j.bolt.v1.runtime.TransactionStateMachine$State$1.run(TransactionStateMachine.java:184)
 at org.neo4j.bolt.v1.runtime.TransactionStateMachine.run(TransactionStateMachine.java:77)
 at org.neo4j.bolt.v1.runtime.BoltStateMachine$State$2.run(BoltStateMachine.java:396)
 at org.neo4j.bolt.v1.runtime.BoltStateMachine.run(BoltStateMachine.java:196)
 at org.neo4j.bolt.v1.messaging.BoltMessageRouter.lambda$onRun$3(BoltMessageRouter.java:80)
 at org.neo4j.bolt.v1.runtime.concurrent.RunnableBoltWorker.execute(RunnableBoltWorker.java:135)
 at org.neo4j.bolt.v1.runtime.concurrent.RunnableBoltWorker.run(RunnableBoltWorker.java:89)
 at java.lang.Thread.run(Thread.java:745)
Caused by: org.neo4j.kernel.api.exceptions.TransactionFailureException: Database has encountered some problem, please perform necessary action (tx recovery/restart)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
 at org.neo4j.kernel.internal.DatabaseHealth.assertHealthy(DatabaseHealth.java:62)
 at org.neo4j.kernel.impl.api.Kernel.newTransaction(Kernel.java:99)
 at org.neo4j.kernel.impl.factory.ClassicCoreSPI.beginTransaction(ClassicCoreSPI.java:173)
 ... 11 more
Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="text_e" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[110, 101, 111, 32, 110, 101, 111, 32, 110, 101, 111, 32, 110, 101, 111, 32, 110, 101, 111, 32, 110, 101, 111, 32, 110, 101, 111, 32, 110, 101]...', original message: bytes can be at most 32766 in length; got 40000
 at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:692)
 at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:365)
 at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:321)
 at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
 at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
 at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1477)
 at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1256)
 at org.neo4j.index.impl.lucene.legacy.CommitContext.applyDocuments(CommitContext.java:104)
 at org.neo4j.index.impl.lucene.legacy.CommitContext.close(CommitContext.java:112)
 at org.neo4j.index.impl.lucene.legacy.LuceneCommandApplier.close(LuceneCommandApplier.java:141)
 at org.neo4j.kernel.impl.api.LegacyBatchIndexApplier.close(LegacyBatchIndexApplier.java:106)
 at org.neo4j.kernel.impl.api.BatchTransactionApplierFacade.close(BatchTransactionApplierFacade.java:70)
 at org.neo4j.kernel.impl.storageengine.impl.recordstorage.RecordStorageEngine.apply(RecordStorageEngine.java:354)
 at org.neo4j.kernel.impl.api.TransactionRepresentationCommitProcess.applyToStore(TransactionRepresentationCommitProcess.java:78)
 at org.neo4j.kernel.impl.api.TransactionRepresentationCommitProcess.commit(TransactionRepresentationCommitProcess.java:51)
 at org.neo4j.kernel.impl.api.KernelTransactionImplementation.commit(KernelTransactionImplementation.java:608)
 at org.neo4j.kernel.impl.api.KernelTransactionImplementation.closeTransaction(KernelTransactionImplementation.java:484)
 at org.neo4j.kernel.api.KernelTransaction.close(KernelTransaction.java:135)
 at org.neo4j.bolt.v1.runtime.TransactionStateMachine$State.closeTransaction(TransactionStateMachine.java:325)
 at org.neo4j.bolt.v1.runtime.TransactionStateMachine$State$1.streamResult(TransactionStateMachine.java:213)
 at org.neo4j.bolt.v1.runtime.TransactionStateMachine.streamResult(TransactionStateMachine.java:93)
 at org.neo4j.bolt.v1.runtime.BoltStateMachine$State$3.pullAll(BoltStateMachine.java:449)
 at org.neo4j.bolt.v1.runtime.BoltStateMachine.pullAll(BoltStateMachine.java:232)
 at org.neo4j.bolt.v1.messaging.BoltMessageRouter.lambda$onPullAll$6(BoltMessageRouter.java:98)
 at org.neo4j.bolt.v1.runtime.concurrent.RunnableBoltWorker.execute(RunnableBoltWorker.java:135)
 at org.neo4j.bolt.v1.runtime.concurrent.RunnableBoltWorker.executeBatch(RunnableBoltWorker.java:122)
 at org.neo4j.bolt.v1.runtime.concurrent.RunnableBoltWorker.run(RunnableBoltWorker.java:94)
 ... 1 more
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 40000
 at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
 at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:150)
 at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:682)
 ... 27 more

From what I gather, the node_auto_index is somehow limited to 32k in text size, which is a bit small.

Btw, one way to get neo4j back to life is to kill (-9) neo4j, delete data/databases/graph.db/neostore.transaction.db.0, and start neo4j again. Well, actually no, that doesn't help either - you can connect again, but neo4j still hangs when you try to shut it down...

APOC to the rescue?

There is a collection of Awesome Procedures On Cypher for Neo4j 3.x - codenamed "apoc". The apoc manual tells us about manual indexes. Let's try this!

First, we need the proper APOC version for our neo4j, or vice versa. The latest version of APOC is 3.1.0.4, so we can only use it with neo4j 3.1.x (by placing the .jar file into the plugins directory).

After starting up neo4j, we run the test script again, creating 4 nodes - no problem so far, because the nodes are just created, not indexed. Now, let's do the indexing:

neo4j-sh (?)$ match (n) call apoc.index.addNode(n,['text']) return count(*);

This gives us the following result:

108 ms

WARNING: Failed to invoke procedure `apoc.index.addNode`: Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="text_e" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '110, 101, 111, 32, 110, 101, 111, 32, 110, 101, 111, 32, 110, 101, 111, 32, 110, 101, 111, 32, 110, 101, 111, 32, 110, 101, 111, 32, 110, 101...', original message: bytes can be at most 32766 in length; got 40000

Just to make sure, let's do it step by step:

neo4j-sh (?)$ match (n {id:1}) call apoc.index.addNode(n,['foo']) return count(*);
+----------+
| count(*) |
+----------+
| 1        |
+----------+
1 row
63 ms
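The short text went in fine. If I read the apoc manual correctly, the manual index (named after the label) can now be queried with full lucene syntax via apoc.index.nodes, which even yields a relevance weight:

neo4j-sh (?)$ call apoc.index.nodes('Article','foo:neo*') yield node, weight return node.id, weight;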

Now on to the longer text:

neo4j-sh (?)$ match (n {id:2}) call apoc.index.addNode(n,['foo']) return count(*);
+----------+
| count(*) |
+----------+
| 1        |
+----------+
1 row
21 ms
TransactionFailureException: Transaction was marked as successful, but unable to commit transaction so rolled back.

If you try this one again, the database is again in an undesired state, i.e. pretty much unusable.

Summary?

The first approach works, the other two don't. A clear winner? Well, the first approach leaves a lot to be desired, like ranking, complex queries etc. If, and only if, the third approach worked, I'd say APOC would win. Let's hope that it's just a bug or that I am making a mistake here...

Let's see what Stack Overflow says: http://stackoverflow.com/questions/42909304/neo4j-indexing-properties-that-are-longer-then-32k-in-lucene

Update 1: long text breaks neo4j on the REST API

For a short time I was hoping that the reason for the broken lucene indexes was in Bolt or the Cypher parser. It turns out you can reproduce the errors on the REST API as well:

curl -X POST -H Accept:application/json -u neo4j:graph -v http://localhost:7474/db/data/node -H "Content-Type: application/json" --data-binary "@testdatashort.json"

works fine,

curl -X POST -H Accept:application/json -u neo4j:graph -v http://localhost:7474/db/data/node -H "Content-Type: application/json" --data-binary "@testdata.json"

breaks. The two files contain just some random words (n1 ... n1000 and n1 ... n6999 respectively).
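They are easy to regenerate; the old REST endpoint for creating a node simply takes the node's properties as a JSON object (a sketch, filenames as used above):

import json

# POST /db/data/node expects the node properties as a plain JSON object
with open('testdatashort.json', 'w') as f:
    json.dump({'text': ' '.join('n%d' % i for i in range(1, 1001))}, f)

with open('testdata.json', 'w') as f:
    json.dump({'text': ' '.join('n%d' % i for i in range(1, 7000))}, f)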

Update 2 - confirmation of the 32k limit

I've got answers from the neo4j team to my questions on http://stackoverflow.com/questions/42909304/neo4j-indexing-properties-that-are-longer-then-32k-in-lucene (thanks neo4j team). There is a 32k limit, and at least for 3.2 it is going to stay - neo4j does make sure, though, that the database gives a proper error message when you try to cross that limit: https://github.com/neo4j/neo4j/pull/8404.

So, where to go from here? The way I understand the development of neo4j, they try to focus more and more on the schema indexes, and work on improving them, and hopefully the full text capabilities as well. They have introduced 'CONTAINS', and maybe there is more to come. The legacy indexes could perhaps be improved by switching the field type used in the lucene indexes over to something like text_general, but given the pull request I doubt that they will ever do that.

As for my project: I guess I will keep neo4j for prototyping, and do more research into alternatives, e.g. ArangoDB, graphagus and NewtDB.

Update 3 - schema indexes are limited to 32k

Or: one byte to kill them all

From what I see, schema indexes fail silently beyond the 32k limit.

from neo4j.v1 import GraphDatabase
import time

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "graph"))

# 5000 space-separated words of 8 characters each: 44999 characters total
basetext = ' '.join(['word%04i' % i for i in range(1,5001)])
print len(basetext)

with driver.session() as session:

    session.run('create index on :Article(text)')
    session.sync()

    # create ten articles whose text lengths straddle the 32766 byte limit
    for i in range(0,10):
        text = basetext[:32760+i]
        textlength = len(text)

        session.run('create (n:Article {id: {id},textlength:{textlength},text:{text}})',
                    id=i,
                    textlength=textlength,
                    text=text)
        print 'Article:', i, 'textlength:', textlength

    # give the index a moment to populate
    time.sleep(5)
    print 'Querying'

    # every article contains word0001, so all ten should come back
    result = session.run('match (n:Article) where n.text contains {word} return n.id,n.textlength',
                         word='word0001')

    for r in result:
        print 'Article: ', r['n.id'], 'textlength: ', r['n.textlength']

The script creates a sample text and then ten Articles whose .text property contains the first 32760 to 32769 characters of it. It waits a few seconds for the index to update, then searches for all articles containing the very first word - a search that uses the index.

This gives (tested on 3.1.2 and 3.1.3):

44999
Article: 0 textlength: 32760
Article: 1 textlength: 32761
Article: 2 textlength: 32762
Article: 3 textlength: 32763
Article: 4 textlength: 32764
Article: 5 textlength: 32765
Article: 6 textlength: 32766
Article: 7 textlength: 32767
Article: 8 textlength: 32768
Article: 9 textlength: 32769
Querying
Article:  0 textlength:  32760
Article:  1 textlength:  32761
Article:  2 textlength:  32762
Article:  3 textlength:  32763
Article:  4 textlength:  32764
Article:  5 textlength:  32765
Article:  6 textlength:  32766

Which means that no Article with a text longer than 32766 bytes is returned, even though it happily lives in the database.
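Note that the limit (see the lucene error above) is on the UTF-8 byte length of the indexed term (32766 bytes), not on the character count - the two only coincide here because the test data is pure ASCII. With non-ASCII text the silent cutoff would hit even earlier:

# 20000 characters, but 40000 bytes in UTF-8 - already over the 32766 limit
s = u'\xe4' * 20000
print len(s), len(s.encode('utf-8'))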

In 3.2.0 Alpha 7 the situation changes (even though there is some déjà vu):

44999
Article: 0 textlength: 32760
Article: 1 textlength: 32761
Article: 2 textlength: 32762
Article: 3 textlength: 32763
Article: 4 textlength: 32764
Article: 5 textlength: 32765
Article: 6 textlength: 32766
Article: 7 textlength: 32767
Article: 8 textlength: 32768
Article: 9 textlength: 32769
Querying
Traceback (most recent call last):
 [...]
neo4j.v1.api.CypherError: Could not apply the transaction to the store after written to log

and if you try again:

neo4j.v1.api.CypherError: Database has encountered some problem, please perform necessary action (tx recovery/restart)

So back to square one: the database ceased working because of one byte too much. The blue byte of death?

What I wonder: which part of the documentation/advertisement have I missed that describes this limit? Where is this documented?

Update 4: neo4j bug

Somebody filed a bug for this a few days ago: https://github.com/neo4j/neo4j/issues/9331. I only found it after creating my little reproducer script, which contains one 'x' too many and will kill neo4j as described above (works on 3.2).