Home

Understanding CMFEditions

Updated:
Created:

It seems I have way too much stuff in the portal history for objects that don't longer exits. What can I do? And how does it all work?

The Problem

It seems I have way too much stuff in the portal history for objects that don't longer exits. What can I do? And how does it all work?

After deleting a lot of objects from a site I was surprised to see that even after packing, the Data.fs was still quite bloated: around 4GB for some hundred folders. That can't possibly be.

Tracing

I found the following page http://plone.org/documentation/kb/debug-zodb-bloat

That was all quite interesting, and I followed the steps through. I got the impression that I had lots of OFS.Image.Pdata and also a bit too much P.A.Field.Image for my taste:

Class Name                                     Count    TBytes      Pct   AvgSize
---------------------------------------------- ------- ---------    ----- -------
...
<class 'OFS.Image.Pdata'>                        47133  3840900148  86.8% 81490.68
...
<class 'Products.Archetypes.Field.Image'>        25326  265376666   6.0%  10478.43

So now I folloed 'Step 2' and ran

./bin/zopepy parts/zope2/lib/python/ZODB/scripts/dump.py var/filestorage/Data.fs > dump.txt

and looked at it. Especially I searched for the Pdata. One interesting (of a lot of them) had the oid "0x00000000002263b8".

To help my further research I created a little script that would allow me to do the 'Step 3' in the debug-zodb-bloat a bit better, by building the refmap only once, and storing it for later use:

import sys
sys.path.append('.')

from inspectZodbUtils import buildRefmap, doSearch
from ZODB.FileStorage import FileStorage
import cPickle

if len(sys.argv)>1:
    filename = sys.argv[1]
    storage = FileStorage(filename)
    refmap = buildRefmap(storage)
    cPickle.dump(refmap,open('refmap.pickle','w'))
else:
    refmap = cPickle.load(open('refmap.pickle'))
import ipdb; ipdb.set_trace()    
print 'fini'

So the first I would run it as

../bin/instance run q2.py ../var/filestorage/Data.fs

and later on only as

../bin/instance run q2.py

Maybe this is the point in time to praise a) cPickle and b) ipdb. Actually I started to read the pickle code as well, to kind of get an understanding of what was going on.

Anyhow, at the ipdb prompt I could now look at the oid mentioned above:

from ZODB.utils import p64
i = p64(0x00000000002263b8)
[app._p_jar[y] for y in path]

which gave the following output:

[<Image at image_mini>,
 <BTrees._OOBTree.OOBTree object at 0x1172a89c>,
 <ATImage at debug_configurations.PNG>,
 <Products.CMFEditions.ArchivistTool.ObjectData object at 0x11722dac>,
 <Products.CMFEditions.ZVCStorageTool.ZVCAwareWrapper object at 0x1172296c>,
 <ZopeVersion at 7>,
 <BTrees._OOBTree.OOBTree object at 0x1172a8e4>,
 <ZopeVersionHistory at 695836488>,
 BTrees._OOBTree.OOBucket([('6851117129', <ZopeVersionHistory at 6851117129>), ('6851967824', <ZopeVersionHistory at 6851967824>), ('6865017397', <ZopeVersionHistory at 6865017397>), ('6865947487', <ZopeVersionHistory at 6865947487>), ('6869947712', <ZopeVersionHistory at 6869947712>), ('6876654125', <ZopeVersionHistory at 6876654125>), ('6885760954', <ZopeVersionHistory at 6885760954>), ('6911165914', <ZopeVersionHistory at 6911165914>), ('6921026365', <ZopeVersionHistory at 6921026365>), ('6924684527', <ZopeVersionHistory at 6924684527>), ('6928625820', <ZopeVersionHistory at 6928625820>), ('6930059518', <ZopeVersionHistory at 6930059518>), ('6931991625', <ZopeVersionHistory at 6931991625>), ('6937499171', <ZopeVersionHistory at 6937499171>), ('6942075215', <ZopeVersionHistory at 6942075215>), ('694421509', <ZopeVersionHistory at 694421509>), ('6958252627', <ZopeVersionHistory at 6958252627>), ('695836488', <ZopeVersionHistory at 695836488>), ('6962374688', <ZopeVersionHistory at 6962374688>), ('6969874691', <ZopeVersionHistory at 6969874691>), ('6970090560', <ZopeVersionHistory at 6970090560>), ('697019247', <ZopeVersionHistory at 697019247>), ('6972404889', <ZopeVersionHistory at 6972404889>), ('6972858120', <ZopeVersionHistory at 6972858120>), ('6973604267', <ZopeVersionHistory at 6973604267>)]),
 BTrees._OOBTree.OOBucket([('6773749252', <ZopeVersionHistory at 6773749252>), ('6776428556', <ZopeVersionHistory at 6776428556>), ('6777239827', <ZopeVersionHistory at 6777239827>), ('6777885573', <ZopeVersionHistory at 6777885573>), ('6784418225', <ZopeVersionHistory at 6784418225>), ('6785227879', <ZopeVersionHistory at 6785227879>), ('6788995801', <ZopeVersionHistory at 6788995801>), ('6791790198', <ZopeVersionHistory at 6791790198>), ('679295379', <ZopeVersionHistory at 679295379>), ('6795881181', <ZopeVersionHistory at 6795881181>), ('6801955479', <ZopeVersionHistory at 6801955479>), ('6802303487', <ZopeVersionHistory at 6802303487>), ('681163508', <ZopeVersionHistory at 681163508>), ('6816736841', <ZopeVersionHistory at 6816736841>), ('6821976622', <ZopeVersionHistory at 6821976622>), ('6822305396', <ZopeVersionHistory at 6822305396>), ('6826208798', <ZopeVersionHistory at 6826208798>), ('6829700253', <ZopeVersionHistory at 6829700253>), ('683348517', <ZopeVersionHistory at 683348517>), ('6834359534', <ZopeVersionHistory at 6834359534>), ('68404997', <ZopeVersionHistory at 68404997>), ('6841189737', <ZopeVersionHistory at 6841189737>), ('6847502691', <ZopeVersionHistory at 6847502691>)]),
 BTrees._OOBTree.OOBucket([('6666052774', <ZopeVersionHistory at 6666052774>), ('6672478127', <ZopeVersionHistory at 6672478127>), ('6675922832', <ZopeVersionHistory at 6675922832>), ('6676319233', <ZopeVersionHistory at 6676319233>), ('6681315686', <ZopeVersionHistory at 6681315686>), ('6682030763', <ZopeVersionHistory at 6682030763>), ('6684334569', <ZopeVersionHistory at 6684334569>), ('668754745', <ZopeVersionHistory at 668754745>), ('6694231287', <ZopeVersionHistory at 6694231287>), ('6698458229', <ZopeVersionHistory at 6698458229>), ('6699928945', <ZopeVersionHistory at 6699928945>), ('6699959128', <ZopeVersionHistory at 6699959128>), ('6717912996', <ZopeVersionHistory at 6717912996>), ('6725309273', <ZopeVersionHistory at 6725309273>), ('6732365671', <ZopeVersionHistory at 6732365671>), ('6732441266', <ZopeVersionHistory at 6732441266>), ('6745243275', <ZopeVersionHistory at 6745243275>), ('6755179016', <ZopeVersionHistory at 6755179016>), ('6755687148', <ZopeVersionHistory at 6755687148>), ('6757413776', <ZopeVersionHistory at 6757413776>), ('6761760255', <ZopeVersionHistory at 6761760255>), ('6762277917', <ZopeVersionHistory at 6762277917>), ('6767266586', <ZopeVersionHistory at 6767266586>), ('6768276066', <ZopeVersionHistory at 6768276066>)]),
 <BTrees._OOBTree.OOBTree object at 0x1172aa04>,
 <ZopeRepository at repo>,
 <ZVCStorageTool at portal_historiesstorage>,
 <PloneSite at Plone>,
 <Application at >,
 {'Application': <Application at >, 'ZGlobals': <BTrees._OOBTree.OOBTree object at 0xc2ee074>}]

What does that tell me:

I poked a bit around in portal_historiesstorage, and then a bit more. This is really confusing land. What is this all about?

A bit of a hint is in Products/CMFEditions/doc, and there also in presentations. There are several tools involved, policies and whatever. I also had a short look at interfaces, and also at the tests, but they didn't make too much sense to me.

So, after poking around some more, and just not getting it, I decided to stick the beloved 'import ipdb; ipdb.set_trace' into line 285, that in the 'save' method of the CopyModifyMergeRepositoryTool. This was to see what would happen if I created a new object. This is a version of my notes, following through the code. Its not 100% complete. I tried to note down all the calls that seemed to be somewhat interesting, and tried to keep the nesting intact. Sometimes I also wrote down comments, variable values etc.

/home/joerg/projects/myproject/development/eggs/Products.CMFEditions-1.2.4-py2.4.egg/Products/CMFEditions/CopyModifyMergeRepositoryTool.py(285)

/home/joerg/projects/myproject/development/eggs/Products.CMFEditions-1.2.4-py2.4.egg/Products/CMFEditions/CopyModifyMergeRepositoryTool.py(421)_prepareSysMetadata()
    not too interesting

/home/joerg/projects/myproject/development/eggs/Products.CMFEditions-1.2.4-py2.4.egg/Products/CMFEditions/CopyModifyMergeRepositoryTool.py(433)_recursiveSave()

    /home/joerg/projects/myproject/development/eggs/Products.CMFEditions-1.2.4-py2.4.egg/Products/CMFEditions/ArchivistTool.py(210)prepare()

        /home/joerg/projects/myproject/development/eggs/Products.CMFEditions-1.2.4-py2.4.egg/Products/CMFEditions/utilities.py(43)dereference()

            /home/joerg/projects/myproject/development/eggs/Products.CMFUid-2.1.2-py2.4.egg/Products/CMFUid/UniqueIdHandlerTool.py(129)queryUid()
                returns the default, which is None.

            returns obj, history_id, here (obj,None). history_id here referes to cmf_uid

            /home/joerg/projects/myproject/development/eggs/Products.CMFEditions-1.2.4-py2.4.egg/Products/CMFEditions/ZVCStorageTool.py(187)isRegistered()

                /home/joerg/projects/myproject/development/eggs/Products.CMFEditions-1.2.4-py2.4.egg/Products/CMFEditions/ZVCStorageTool.py(408)_getShadowStorage()  
                    return self._shadowStorage

            wants to return shadow_storage.isRegistered(history_id), but goes to

            /home/joerg/projects/myproject/development/parts/zope2/lib/python/ZODB/Connection.py(749)setstate()
                This is just loading its ghost, np

            /home/joerg/projects/myproject/development/eggs/Products.CMFEditions-1.2.4-py2.4.egg/Products/CMFEditions/ZVCStorageTool.py(760)isRegistered()
                return history_id in self._storage

            return False

        /home/joerg/projects/myproject/development/eggs/Products.CMFUid-2.1.2-py2.4.egg/Products/CMFUid/UniqueIdHandlerTool.py(103)register()

            /home/joerg/projects/myproject/development/eggs/Products.CMFUid-2.1.2-py2.4.egg/Products/CMFUid/UniqueIdHandlerTool.py(129)queryUid()
                returns the default, which is None. Same as above

            /home/joerg/projects/myproject/development/eggs/Products.CMFUid-2.1.2-py2.4.egg/Products/CMFUid/UniqueIdGeneratorTool.py(61)__call__()
                self._uid_counter += 1
                return self._uid_counter   (8552)

            /home/joerg/projects/myproject/development/eggs/Products.CMFUid-2.1.2-py2.4.egg/Products/CMFUid/UniqueIdHandlerTool.py(91)_setUid()
                annotation = anno_tool(obj, self.UID_ATTRIBUTE_NAME)
                annotation.setUid(uid)

                /home/joerg/projects/myproject/development/eggs/Products.CMFUid-2.1.2-py2.4.egg/Products/CMFUid/UniqueIdAnnotationTool.py(65)setUid()
                    self._uid = uid

               obj.cmf_uid == annotation

            return uid (8552)

        history_id == 8552

        /home/joerg/projects/myproject/development/eggs/Products.CMFEditions-1.2.4-py2.4.egg/Products/CMFEditions/ModifierRegistryTool.py(139)getReferencedAttributes()

            return {} //keine Referenzen

        /home/joerg/projects/myproject/development/eggs/Products.CMFEditions-1.2.4-py2.4.egg/Products/CMFEditions/ArchivistTool.py(182)_cloneByPickle()

            //pickles, gets size of pickle
            //creates clone by unpickling

            return approxSize, u.load(), inside_orefs, outside_orefs // return approxSize, u.load(), (), ()

        /home/joerg/projects/myproject/development/eggs/Products.CMFEditions-1.2.4-py2.4.egg/Products/CMFEditions/ModifierRegistryTool.py(220)beforeSaveModifier()

            for ignored_id, mod in self._collectModifiers(obj, ISaveRetrieveModifier):
                mdata, icrefs, ocrefs = mod.beforeSaveModifier(obj, obj_clone)

                Products.CMFEditions.StandardModifiers.RetainUIDs -> return {}, [], []
                Products.CMFEditions.StandardModifiers.NotRetainATRefs -> return {}, [], []
                Products.CMFEditions.StandardModifiers.RetainWorkflowStateAndHistory -> return {"review_state": review_state}, [], []
                Products.CMFEditions.StandardModifiers.RetainPermissionsSettings -> return {}, [], []

            return metadata, inside_crefs, outside_crefs // ({'review_state': 'published'}, [], [])

    sys_metadata.update(metadata) // now: {'comment': '', 'originator': None, 'review_state': 'published', 'timestamp': 1341068928.3322461}

    clone_info = ObjectData(clone, inside_crefs, outside_crefs)
    obj_info = ObjectData(obj, inside_orefs, outside_orefs)

    // o = ObjectData(foo,bar,baz) -> o.foo, o.bar, ob.baz

    return PreparedObject(history_id, obj_info, clone_info, 
                          referenced_data, app_metadata,
                          sys_metadata, is_registered, approxSize)

    /home/joerg/projects/myproject/development/eggs/Products.CMFEditions-1.2.4-py2.4.egg/Products/CMFEditions/ArchivistTool.py(397)__init__()

        //Basically another wrapper object.

        modifies sys_metadata for lots of values. Interesting:

        sys_metadata['parent'] = {
            'history_id': portal_uidhandler.register(parent), ....

        and:

        metadata = {
            'sys_metadata': sys_metadata,
            'app_metadata': app_metadata,
        }
        ...
        self.metadata = metadata
    ...
    sys_metadata['originator'] = "%s.%s.%s" % (prep.history_id,
                                               clone.version_id,
                                               clone.location_id, )     #'8552.0.0'
    //saving references. None here

    /home/joerg/projects/myproject/development/eggs/Products.CMFEditions-1.2.4-py2.4.egg/Products/CMFEditions/ArchivistTool.py(275)save()

        /home/joerg/projects/myproject/development/eggs/Products.CMFEditions-1.2.4-py2.4.egg/Products/CMFEditions/ArchivistTool.py(263)register()

            #def register(self, prepared_obj)
            return storage.register(prepared_obj.history_id, 
                                    prepared_obj.clone,
                                    prepared_obj.referenced_data,
                                    prepared_obj.metadata)

            /home/joerg/projects/myproject/development/eggs/Products.CMFEditions-1.2.4-py2.4.egg/Products/CMFEditions/ZVCStorageTool.py(198)register()

                zvc_method = self._getZVCRepo().applyVersionControl # /Plone/portal_historiesstorage/repo

                /home/joerg/projects/myproject/development/eggs/Products.CMFEditions-1.2.4-py2.4.egg/Products/CMFEditions/ZVCStorageTool.py(367)_applyOrCheckin()
                    #def _applyOrCheckin(self, zvc_method, history_id, shadowInfo, 
                                         object, referenced_data, metadata):

                    #zvc_method == ZopeRepository.applyVersionControl
                    #history_id == 8552
                    #shadowInfo == {'vc_info': None}
                    #object == prepared_obj.clone -- see above
                    #referenced_odata == {}
                    #metadata -- see above :516

                    checking size -> pickling if in doubt

                    vc_info = self._getVcInfo(object, shadowInfo)
                    /home/joerg/projects/myproject/development/eggs/Products.CMFEditions-1.2.4-py2.4.egg/Products/CMFEditions/ZVCStorageTool.py(467)_getVcInfo()
                        return None (in this case)

                    zvc_obj = ZVCAwareWrapper(object, referenced_data, metadata, 
                                              vc_info)
                    /home/joerg/projects/myproject/development/eggs/Products.CMFEditions-1.2.4-py2.4.egg/Products/CMFEditions/ZVCStorageTool.py(941)__init__()

                        #a wrapper object. Interesting in 945
                        #self._physicalPath = metadata['sys_metadata'].get('physicalPath', ())[:] # copy
                        # here: ()

                    /home/joerg/projects/myproject/development/eggs/Products.CMFEditions-1.2.4-py2.4.egg/Products/CMFEditions/ZVCStorageTool.py(512)_encodeMetadata()

                        basically comment+pickle of metadata

                    /home/joerg/projects/myproject/development/eggs/Products.ZopeVersionControl-1.0a1-py2.4.egg/Products/ZopeVersionControl/Repository.py(146)applyVersionControl()
                        #this is zvc_method                                 
                        #now for the real stuff
                        ...
                        branch = 'mainline'
                        ...
                        history = self.createVersionHistory(object)
                        /home/joerg/projects/myproject/development/eggs/Products.ZopeVersionControl-1.0a1-py2.4.egg/Products/ZopeVersionControl/Repository.py(51)createVersionHistory()

                            history_id = str(randint(1, 9999999999))

                            /home/joerg/projects/myproject/development/eggs/Products.ZopeVersionControl-1.0a1-py2.4.egg/Products/ZopeVersionControl/VersionHistory.py(32)__init__()

                                #history_id now e.g. '7417491791'

                                mainline = self.createBranch('mainline', None)
                                /home/joerg/projects/myproject/development/eggs/Products.ZopeVersionControl-1.0a1-py2.4.egg/Products/ZopeVersionControl/VersionHistory.py(84)createBranch()

                                    branch = BranchInfo(branch_id, version_id)
                                    /home/joerg/projects/myproject/development/eggs/Products.ZopeVersionControl-1.0a1-py2.4.egg/Products/ZopeVersionControl/VersionHistory.py(230)__init__()
                                    #just a mapping object.

                                    self._branches[branch_id] = branch

                                self.id = history_id

                            self._histories[history_id] = history
                            return history.__of__(self)

                        version = history.createVersion(object, branch)
                        /home/joerg/projects/myproject/development/eggs/Products.ZopeVersionControl-1.0a1-py2.4.egg/Products/ZopeVersionControl/VersionHistory.py(96)createVersion()
                            branch = self._branches.get(branch_id) #branch_id == 'mainline'
                            ...
                            version_id = '%d' % (len(branch) + 1) #'1'
                            version = ZopeVersion(version_id, object)
                            /home/joerg/projects/myproject/development/eggs/Products.ZopeVersionControl-1.0a1-py2.4.egg/Products/ZopeVersionControl/Version.py(72)__init__()
                                #Another mapper object
                                #does _not_ contain the object (which is version, a ZVCAwareWrapper)

                            ...                                        
                            branch.append(version)    
                            /home/joerg/projects/myproject/development/eggs/Products.ZopeVersionControl-1.0a1-py2.4.egg/Products/ZopeVersionControl/VersionHistory.py(242)append()

                                #just metadata storage, not the actual object
                                #the ids in self.m_order count down, and only point to version id [(2147483647, '1')]

                            self._versions[version_id] = version #still no object

                            version.saveState(object)
                            /home/joerg/projects/myproject/development/eggs/Products.ZopeVersionControl-1.0a1-py2.4.egg/Products/ZopeVersionControl/Version.py(93)saveState()

                                self._data = self.stateCopy(obj, self) #becomes effectively the clonebypickled clone of the clone?
                                /home/joerg/projects/myproject/development/eggs/Products.ZopeVersionControl-1.0a1-py2.4.egg/Products/ZopeVersionControl/Version.py(105)stateCopy()

                                    ignore = listNonVersionedObjects(obj)
                                    /home/joerg/projects/myproject/development/eggs/Products.ZopeVersionControl-1.0a1-py2.4.egg/Products/ZopeVersionControl/nonversioned.py(64)listNonVersionedObjects(

                                        return getNonVersionedDataAdapter(obj).listNonVersionedObjects()
                                        /home/joerg/projects/myproject/development/eggs/Products.ZopeVersionControl-1.0a1-py2.4.egg/Products/ZopeVersionControl/nonversioned.py(40)getNonVersionedDataAdapter()

                                            return StandardNonVersionedDataAdapter(obj)

                                        /home/joerg/projects/myproject/development/eggs/Products.ZopeVersionControl-1.0a1-py2.4.egg/Products/ZopeVersionControl/nonversioned.py(92)listNonVersionedObjects()
                                            return ()

                                    res = cloneByPickle(aq_base(obj), ignore)

                                    removeNonVersionedData(res)

                                        getNonVersionedDataAdapter(obj).removeNonVersionedData()
                                        /home/joerg/projects/myproject/development/eggs/Products.ZopeVersionControl-1.0a1-py2.4.egg/Products/ZopeVersionControl/nonversioned.py(40)getNonVersionedDataAdapter()
                                            return StandardNonVersionedDataAdapter(obj)

                                        /home/joerg/projects/myproject/development/eggs/Products.ZopeVersionControl-1.0a1-py2.4.egg/Products/ZopeVersionControl/nonversioned.py(94)removeNonVersionedData()

                                            #did not do anything

                                    return res

                            return version.__of__(self)

                        history_id = history.getId() # '7417491791'
                        version_id = version.getId() # '1'

                        info = VersionInfo(history_id, version_id, VersionInfo.CHECKED_IN)
                        /home/joerg/projects/myproject/development/eggs/Products.ZopeVersionControl-1.0a1-py2.4.egg/Products/ZopeVersionControl/Utility.py(51)__init__()
                            #a wrapper object around the given data, add logged in user

                        object.__vc_info__ = info

                        history.addLogEntry(version_id,LogEntry.ACTION_CHECKIN,_findPath(object),message is None and 'Initial checkin.' or message)
                        /home/joerg/projects/myproject/development/eggs/Products.ZopeVersionControl-1.0a1-py2.4.egg/Products/ZopeVersionControl/VersionHistory.py(51)addLogEntry()
                            #creates a wrapper object, adds it under a count down key in self._eventLog._data[key]

                        return object

                    shadowInfo = {"vc_info": zvc_obj.__vc_info__,
                                  "metadata": metadata,}

                    history = self._getShadowHistory(history_id, autoAdd=True)
                    #boils down to: self._shadowStorage._storage[history_id]=ShadowHistory()

                    return history.save(shadowInfo) #just the metadata, returns the version_id in the _shadowStorage_

        #    
    #prep.original.object.version_id == 0

And now, what have I learned? I was most interested (and confused) by the use of history_id - what are they, and where do they come from?

The numbers here are from another run, with a new object at a higher cmf_uid of 8553:

#obj is the freshly created object
cmfuid = obj.cmf_uid() # 8553
ph = getToolByName(context.portal_url.getPortalObject(),'portal_historiesstorage')
hid = ph._shadowStorage._storage[cmfuid]._full[0]['vc_info'].history_id #'7496765414'
zvh = ph.zvc_repo._histories[hid] 
vid = zvh._branches['mainline'].m_order[2147483647] #'1'
zv = zvh._versions[vid]
zv._data._object.object.title == obj.title

So, I got the feeling that there is a zvc_repo that contains the actual version information, and the _shadowStorage, which has some meta information. zvc_repo seems to work with the 'funny' numbers, which are randomly chosen longer ones, while _shadowStorage seems to use rather numbered short numbers.

This reminded me also of another post that by now seemed to be quite related to my question of understanding cmfeditions, the bloat and the repo:

http://stackoverflow.com/questions/9683466/purging-all-old-cmfeditions-versions

I got the feeling that some information in there is maybe a bit wrong, especially in regards to just deleting the _shadowStorage.

How could I find out? Also by now, I wanted to see what was really going on. Looking at the data would be nice. But pprint doesn't work here (it doesn't show you the contents of classes), the collective.zodbbrowser only works on plone4, and did not seem to be quite suited.

Structure

So I hacked a really ugly tool together, that can show me the full tree below a certain object. The tool is called rprint (not published yet because its real ugly code). It walks the tree of the objects, shows me the content of its subelements (only stored data). If it encounters an object like a Btree which has an items method, it either only prints and follows the first item, or items that match a list of ids that I have specified.

Using the cmf_uid and history_id from above, I got the following:

ipdb> reload(rprint); rprint.rprint(portal.portal_historiesstorage,[8553,'7496765414'])

<Products.CMFEditions.ZVCStorageTool.portal_historiesstorage>

     _shadowStorage:  <Products.CMFEditions.ZVCStorageTool.ShadowStorage>
         _storage:  <BTrees._OOBTree.OOBTree>
             "8553":  <Products.CMFEditions.ZVCStorageTool.ShadowHistory>
                 _full:  <BTrees._IOBTree.IOBTree>
                     "0":  {
                         "metadata":  {
                             "sys_metadata":  {
                                 "comment":  ''
                                 "originator":  '8553.0.0'
                                 "parent":  {
                                     "history_id":  2
                                     "location_id":  None
                                     "version_id":  None
                                     }
                                 "timestamp":  1341238710.606384
                                 "review_state":  'published'
                                 "approxSize":  3041
                                 "principal":  'myadmin'
                                 }
                             "app_metadata":  { }
                             }
                         "vc_info":  <Products.ZopeVersionControl.Utility.VersionInfo>
                             status:  1
                             timestamp:  1341238712.795213
                             history_id:  '7496765414'
                             user_id:  'myadmin'
                             version_id:  '1'
                         }
                     ...
                 _available:  [ 0]
                 nextVersionId:  1
                 _sizeInaccurate:  False
                 _approxSize:  3041
             ...
     zvc_repo:  <Products.ZopeVersionControl.ZopeRepository.ZopeRepository>
         _labels:  <BTrees._OIBTree.OIBTree>
             {}
         _branches:  <BTrees._OIBTree.OIBTree>
             "mainline":  1
             ...
         title:  'ZVC Storage'
         _created:  1292350937.0929461
         id:  'repo'
         _histories:  <BTrees._OOBTree.OOBTree>
             "7496765414":  <Products.ZopeVersionControl.ZopeVersionHistory.ZopeVersionHistory>
                 _labels:  <BTrees._OOBTree.OOBTree>
                     {}
                 _branches:  <BTrees._OOBTree.OOBTree>
                     "mainline":  <Products.ZopeVersionControl.VersionHistory.BranchInfo>
                         date_created:  1341238712.7773039
                         m_date:  <BTrees._IIBTree.IIBTree>
                             "22353978":  2147483647
                             ...
                         root:  None
                         name:  'mainline'
                         m_order:  <BTrees._IOBTree.IOBTree>
                             "2147483647":  '1'
                             ...
                     ...
                 _eventLog:  <Products.ZopeVersionControl.EventLog.EventLog>
                     _data:  <BTrees._IOBTree.IOBTree>
                         "2147483647":  <Products.ZopeVersionControl.EventLog.LogEntry>
                             user_id:  'myadmin'
                             path:  ''
                             action:  1
                             timestamp:  1341238712.7953091
                             message:  "S''\n.\x00\n\x80\x02}q\x01(U\x0csys_metadataq\x02}q\x03(U\x07commentq\x04U\x00U\noriginatorq\x05U\x088553.0.0q\x06U\x06parentq\x07}q\x08(U\nhistory_idq\tK\x02U\x0blocation_idq\nNU\nversion_idq\x0bNuU\ttimestampq\x0cGA\xd3\xfckm\xa6\xce\xffU\x0creview_stateq\rU\tpublishedq\x0eU\napproxSizeq\x0fM\xe1\x0bU\tprincipalq\x10U\x07myadminq\x11uU\x0capp_metadataq\x12}q\x13u."
                             version_id:  '1'
                         ...
                 _versions:  <BTrees._OOBTree.OOBTree>
                     "1":  <Products.ZopeVersionControl.ZopeVersion.ZopeVersion>
                         date_created:  1341238712.777617
                         _data:  <Products.CMFEditions.ZVCStorageTool.ZVCAwareWrapper>
                             _physicalPath:  ( )
                             _object:  <Products.CMFEditions.ArchivistTool.ObjectData>
                                 outside_refs:  [ ]
                                 object:  <Products.ATContentTypes.content.document.ATDocument>
                                     _Sharing_page__Delegate_Contributor_role_Permission:  ( )
                                     _signature:  "FL\x06\xbf\x95\xedQ/\x96Xl\x89\x1a'b\xea"
                                     creation_date:  DateTime('2012/07/02 16:18:03.004 GMT+2')
                                     marshall_hook:  None
                                     _Access_contents_information_Permission:  ( 'Editor', 'Manager', 'Owner', 'Reader')
                                     location_id:  0
                                     id:  'test3-title'
                                     tableContents:  False
                                     _at_creation_flag:  False
                                     cmf_uid:  <Products.CMFUid.UniqueIdAnnotationTool.UniqueIdAnnotation>
                                         _uid:  8553
                                         id:  'cmf_uid'
                                     modification_date:  DateTime('2012/07/02 16:18:03.097 GMT+2')
                                     title:  u'test3 title'
                                     demarshall_hook:  None
                                     workflow_history:  <Persistence.mapping.PersistentMapping>
                                         data:  {
                                             "baworkflow":  (
                                                 {
                                                     "action":  None
                                                     "review_state":  'published'
                                                     "actor":  'myadmin'
                                                     "comments":  ''
                                                     "time":  DateTime('2012/07/02 16:18:03.009 GMT+2')
                                                     }
                                                 )
                                             }
                                     __provides__:  <zope.interface.Provides object at 0xff17dcc>
                                     _Change_portal_events_Permission:  ( 'Editor', 'Manager', 'Owner')
                                     _EtagSupport__etag:  'ts41238683.1'
                                     presentation:  False
                                     __annotations__:  <BTrees._OOBTree.OOBTree>
                                         "Archetypes.storage.AnnotationStorage-text":  <Products.Archetypes.BaseUnit.BaseUnit>
                                             mimetype:  'text/html'
                                             binary:  0
                                             filename:  ''
                                             raw:  u'<p>test3 maintext</p>'
                                             __name__:  'text'
                                             id:  'text'
                                             original_encoding:  'utf-8'
                                             size:  21
                                         ...
                                     portal_type:  'Document'
                                     version_id:  0
                                     _at_uid:  '3bdb26d9de1c4a1f3765d89dc8b1611a'
                                     _Sharing_page__Delegate_Reader_role_Permission:  ( 'Editor', 'Manager', 'Owner')
                                     at_references:  <OFS.Folder.Folder>
                                         id:  'at_references'
                                     _Modify_portal_content_Permission:  ( 'Editor', 'Manager', 'Owner')
                                     _View_Permission:  ( 'Editor', 'Manager', 'Owner', 'Reader')
                                     _Sharing_page__Delegate_Editor_role_Permission:  ( 'Editor', 'Manager', 'Owner')
                                     _List_folder_contents_Permission:  ( 'Editor', 'Manager', 'Owner', 'Reader')
                                     _owner:  (
                                         [ 'acl_users']
                                         'myadmin',
                                         )
                                     _md:  <Persistence.mapping.PersistentMapping>
                                         data:  {
                                             "creators":  ( u'myadmin',)
                                             "description":  <Products.Archetypes.BaseUnit.BaseUnit>
                                                 mimetype:  'text/plain'
                                                 binary:  0
                                                 filename:  ''
                                                 raw:  u'test3 description'
                                                 __name__:  'description'
                                                 id:  'description'
                                                 original_encoding:  'utf-8'
                                                 size:  17
                                             "effectiveDate":  None
                                             "rights":  <Products.Archetypes.BaseUnit.BaseUnit>
                                                 mimetype:  'text/html'
                                                 binary:  0
                                                 filename:  ''
                                                 raw:  u''
                                                 __name__:  'rights'
                                                 id:  'rights'
                                                 original_encoding:  'ascii'
                                                 size:  0
                                             "subject":  ( )
                                             "language":  u''
                                             "location":  u''
                                             "contributors":  ( )
                                             "allowDiscussion":  None
                                             "excludeFromNav":  False
                                             "expirationDate":  None
                                             }
                                     _Sharing_page__Delegate_Reviewer_role_Permission:  ( )
                                     __ac_local_roles__:  {
                                         "myadmin":  [ 'Owner']
                                         }
                                 inside_refs:  [ ]
                             _referenced_data:  { }
                         id:  '1'
                     ...
                 id:  '7496765414'
             ...
     __ac_local_roles__:  {
         "myadmin":  [ 'Owner']
         }

(the script can actually supress following down certain classes. I have surpressed DateTime, but left in the Products.ATContentTypes.content. Makes this output a bit longer to read, but this way I can see the cmf_uid annotation of the cloned object).

And this now really helped me to get the picture of whats going on.

With this in mind, was ready again to look at the 'documentation' of the highest level object in cmfedition land, the portal_repository (CopyModifyMergeRepositoryTool). The interface documentation is in CMFEditions/interfaces/IRepository. It seems that there history_id is always the cmf_uid. Selector can be one of two things, either a counter of versions, or the key in _versions in the zvc_repo. Also very interesting is the test06_retrieveWithNoMoreExistentObject in tests/test_CopyModifyMergeRepository. Here its interesting how a temporary object is used as a carrier for the cmf_uid we are really interested in.

Results

So, it seems that reason for the bloat in the Data.fs is lots of old revision of images, where the plone object no longer exists. By creating temporary Images with cmf_uids of the 'lost' Images, and then calling

portal_repository.purge(temp_obj, selector_aka_version_id_or_counter)

I should be able to purge really all old version, or run a modified version of the script that zupo posted in

http://stackoverflow.com/questions/9683466/purging-all-old-cmfeditions-versions

What I realy learned though are two things: documenting research steps rocks. Without notes I would have been completely lost in this maze. Being able to see the data structure that code operates on seems to be a really important tool to understand plone land. Before python, I used sql or phpmyadmin a lot to understand the data structures. This gave me a mental picture of the data the code in the app worked on. Data structures told me sooo much more then the code about the application. In zope land it used to be the zmi, where one could poke around the data structures. It helped a lot. plone land however the zmi gets more and more abandoned, and its always a huge pile of hidden layers of adapters and annotations. Even though there might be interfaces, tests and source - the overall picture for me is in the data structures, and that is (or was) missing.

Update 17.7.2012

Here is a script to delete old revisions from the histories. After this, the Data.fs packs down to 270 MB

import os
from os import environ
from StringIO import StringIO
import logging

from AccessControl.SecurityManagement import newSecurityManager
from AccessControl.SecurityManager import setSecurityPolicy
from Testing.makerequest import makerequest
from Products.CMFCore.tests.base.security import PermissiveSecurityPolicy, OmnipotentUser

# Force application logging level to DEBUG and log output to stdout for all loggers
import sys, logging

root_logger = logging.getLogger()
root_logger.setLevel(logging.DEBUG)

handler = logging.StreamHandler(sys.stdout)
formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
handler.setFormatter(formatter)
root_logger.addHandler(handler)

def spoofRequest(app):
 """
 Make REQUEST variable to be available on the Zope application server.

 This allows acquisition to work properly
 """
 _policy=PermissiveSecurityPolicy()
 _oldpolicy=setSecurityPolicy(_policy)
 newSecurityManager(None, OmnipotentUser().__of__(app.acl_users))
 return makerequest(app)

# Enable Faux HTTP request object
app = spoofRequest(app)
portal = app.Plone

hs = portal.portal_historiesstorage
ppp = portal.portal_purgepolicy
shadowstorage =  hs._shadowStorage._storage

keys = sorted(shadowstorage.keys())


result = portal.portal_catalog()
cataloged = set()   
for r in result:
 cataloged.add(r.cmf_uid)

extra = [k for k in keys if k not in cataloged]

#lets clean the historiesstorage
for key in extra:
 print 'purging %s (%s)' % (key,len(hs.getHistory(key,countPurged=False)))
 while len(hs.getHistory(key,countPurged=False)):
 hs.purge(key,0,metadata=dict(sys_metadata=dict(comment='clear by jhb')),countPurged=False)

import transaction; transaction.commit()
print 'fini'