I want to build a metadata server that does full-text search and stores links between files, ideally across users. How can this be done? In particular: how can the server be notified of all the important changes in the filesystem in real time?
Update June 2022
The actual answer is auditd. This easy-to-install daemon can be configured to log filesystem actions. The log format is a bit complicated, but I wrote a small tool in Python to work with it: auditd_tools
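As a rough illustration of what working with the auditd log involves (this is not the auditd_tools code), here is a minimal Python sketch. It assumes the usual "type=... msg=audit(timestamp:serial): key=value ..." line layout. The sample lines and the function names parse_record and paths_by_event are made up for the example. Records that share a serial number belong to one event, so a relative name from a PATH record can be joined with the directory from the CWD record. Hex-encoded names, which auditd may use for names with special characters, are not handled.

```python
import os
import re
import shlex

# Assumed line layout: type=<TYPE> msg=audit(<seconds>.<millis>:<serial>): key=value ...
HEADER = re.compile(
    r"^type=(?P<type>\S+) msg=audit\((?P<ts>\d+\.\d+):(?P<serial>\d+)\):\s*(?P<body>.*)$"
)


def parse_record(line):
    """Split one audit log line into type, timestamp, serial and key=value fields."""
    match = HEADER.match(line.strip())
    if match is None:
        return None
    fields = {}
    # shlex.split strips the double quotes around values such as name="foobar"
    for token in shlex.split(match.group("body")):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return {
        "type": match.group("type"),
        "timestamp": float(match.group("ts")),
        "serial": int(match.group("serial")),
        "fields": fields,
    }


def paths_by_event(lines):
    """Group records by serial and resolve PATH names against the CWD record."""
    events = {}
    for line in lines:
        record = parse_record(line)
        if record is None:
            continue
        event = events.setdefault(record["serial"], {"cwd": None, "names": []})
        if record["type"] == "CWD":
            event["cwd"] = record["fields"].get("cwd")
        elif record["type"] == "PATH" and "name" in record["fields"]:
            event["names"].append(record["fields"]["name"])
    resolved = {}
    for serial, event in events.items():
        cwd = event["cwd"]
        # Without a CWD record the name is kept as logged
        resolved[serial] = [
            os.path.normpath(os.path.join(cwd, name)) if cwd else name
            for name in event["names"]
        ]
    return resolved


# Made-up sample lines in the assumed layout, for illustration only
SAMPLE = [
    'type=SYSCALL msg=audit(1655000000.123:4711): arch=c000003e syscall=257 '
    'success=yes exit=3 pid=4242 uid=1000 comm="touch" exe="/usr/bin/touch" key="docs"',
    'type=CWD msg=audit(1655000000.123:4711): cwd="/home/joerg/tmp"',
    'type=PATH msg=audit(1655000000.123:4711): item=0 name="foobar" inode=1234 nametype=CREATE',
]

if __name__ == "__main__":
    print(paths_by_event(SAMPLE))
```

Joining the CWD and PATH records is also what answers the "which foobar?" question that comes up with relative paths.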
Results
Across the different technical approaches, one of the most important factors turned out to be the startup time. E.g. inotify is a really nice way to get notified, but every time I reboot the machine (or my laptop), it takes minutes until the system is up and all watches are set again.
In the end it seems that right now different solutions are suitable for different scenarios:
- fanotify: sounds like the best approach, but we have to wait until it's there.
- inotify: if the machine is very rarely restarted, and neither startup time nor memory consumption matters
- samba audit vfs: when startup time is relevant, and at the same time it can be ensured that all file access goes through samba only
- python-fuse: When startup time is crucial, and only one user accesses the directories (maybe this can be fixed?)
Update December 6th, 2012
Having spent some more time on this issue, I found the following:
- On Ubuntu 12.04 fanotify works.
- There is at least one python binding that seems to work: https://bitbucket.org/mjs0/pyfanotify
- A forum entry confirms it: fanotify does not monitor deletes. I will never know when files are removed from the filesystem, so there is no way to remove them from e.g. my fulltext search engine. This bug entry seems to confirm this.
- Which would lead back to inotify, which still takes a long time to set up for recursive directory watching
- Or back to using FUSE to write a virtual filesystem (layer), which would notify me of all changes. A nice Python binding is fusepy. This works, but is somewhat slow (at least in Python)
So the choice is either a long setup time but no big performance impact with inotify, or a very short setup time but a performance hit. Great.
Research
inotify
This seems to be the standard in modern kernels. One needs to add watches for all files or directories of interest, and then gets notified of changes. The problem seems to be the number of watches: it's at least one watch per directory. If the system restarts, the watches need to be set again, and a stat is done on each of the directories, which takes a rather long time.
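To get a feeling for the "one watch per directory" cost before committing to inotify, the number of directories under a tree can simply be counted. The following is a minimal sketch. The helper names count_watch_targets and watch_limit are made up, and it assumes the per-user limit is exposed at /proc/sys/fs/inotify/max_user_watches, as it is on common Linux systems. It does not set any watches itself.

```python
import os


def count_watch_targets(root):
    """Count the directories under root (including root itself).

    A recursive inotify setup needs at least one watch for each of them.
    """
    count = 0
    for _dirpath, _dirnames, _filenames in os.walk(root, followlinks=False):
        count += 1
    return count


def watch_limit(path="/proc/sys/fs/inotify/max_user_watches"):
    """Return the per-user inotify watch limit, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    root = os.path.expanduser("~")
    needed = count_watch_targets(root)
    limit = watch_limit()
    print(f"{root}: at least {needed} watches needed, limit is {limit}")
```

The walk itself touches every directory once, which is roughly the work a recursive watch setup has to repeat after each reboot.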
dnotify
The predecessor of inotify. It seemed to have the issue of blocking filesystems.
fam
http://oss.sgi.com/projects/fam/
http://oss.sgi.com/projects/fam/news.html
File alteration monitor - doesn't seem to be in use any more
Fanotify
http://lwn.net/Articles/339253/
"fanotify, built on top of fsnotify, is supposed to replace inotify which replaced dnotify".
"fanotify has two basic 'modes' directed and global. fanotify directed works much like inotify in that userspace marks inodes it is interested in and gets events from those inodes. fanotify global instead indicates that it wants everything on the system and then individually marks inodes that it doesn't care about."
This is pretty much exactly what I would want to use, only it's not there yet.
tripwire
Used for security audits to see if files have changed. This is more a part of intrusion detection than a filesystem change monitor. It needs to be run regularly to scan the filesystem.
samba vfs audit
http://www.samba.org/samba/docs/man/Samba-HOWTO-Collection/VFS.html#id2650781
A module for the samba server. Can be configured (it seems) to monitor use and change of files. Requires access to files through samba, of course.
Systemtap
http://sourceware.org/systemtap/
Could be a nice approach, but I haven't managed to write a script that handles the case where files are changed without full path, e.g. 'touch foobar' instead of '/home/joerg/tmp/foobar'. So far I only got notified of a 'foobar' being accessed, but which one?
Python-fuse
https://sourceforge.net/apps/mediawiki/fuse/index.php?title=Main_Page
The idea is to put a small layer on top of the real filesystem. Something along the lines of: http://esteve.tizos.net/archives/searchable-filesystem-with-fuse-python/. I modified his script a bit, so that it does not do any indexing but logs to a file: my proof of concept
llfuse (python)
(update April 8th, 2011)
http://code.google.com/p/python-llfuse/
This seems to be a better FUSE binding, which actually supports proper release calls. That means we could act once a file has been completely written.
There is an Ubuntu .deb at:
http://ppa.launchpad.net/nikratio/s3ql/ubuntu/pool/main/p/python-llfuse/
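Why release calls matter can be shown with a small, library-independent model. The sketch below is not the llfuse API. The class WriteTracker and its methods are hypothetical and only mimic the open / write / release sequence a FUSE layer sees. The idea is to remember which open handles were written to and to report the file as changed only when the handle is released, i.e. when writing is finished.

```python
class WriteTracker:
    """Hypothetical model of a FUSE layer's bookkeeping, not tied to any binding."""

    def __init__(self, on_changed):
        self._on_changed = on_changed  # callback, e.g. "queue path for re-indexing"
        self._paths = {}               # file handle -> path
        self._dirty = set()            # handles that received at least one write
        self._next_handle = 1

    def open(self, path):
        """Register a newly opened file and return its handle."""
        handle = self._next_handle
        self._next_handle += 1
        self._paths[handle] = path
        return handle

    def write(self, handle, data):
        """Mark the handle as modified; a real layer would also pass the data on."""
        self._dirty.add(handle)
        return len(data)

    def release(self, handle):
        """Called when the last reference to the handle is closed."""
        path = self._paths.pop(handle)
        if handle in self._dirty:
            self._dirty.discard(handle)
            self._on_changed(path)


if __name__ == "__main__":
    tracker = WriteTracker(lambda path: print("changed:", path))
    handle = tracker.open("/home/joerg/tmp/foobar")
    tracker.write(handle, b"hello")
    tracker.release(handle)
```

Files that were only opened for reading never trigger the callback, so the indexer is not bothered by plain reads.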
Research links
- http://www.little-idiot.de/linuxsolutionguide/notify.htm
  - An older German page, points to changedfiles
- http://www.bangstate.com/changedfiles/
  - Exactly what I would need, but needs a 2.4 kernel
- http://www.linux.com/archive/feature/150200
- http://projects.l3ib.org/trac/fsniper
  - Fsniper allows watching directories / files
- http://www.pubbs.net/kernel/200905/109416/
  - Links to fsnotify/fanotify. From what I see that would be exactly what's needed, but it does not seem to be there (yet).
- http://esteve.tizos.net/archives/searchable-filesystem-with-fuse-python/
  - python-fuse driven filesystem with a hook for indexing. Maybe a good starting point?