Fun with unicode, aka the limbo of coding

How do I use unicode with zope, page templates and the ZMI? (You can also jump to the final conclusion at the bottom.)

Ok, there are two issues - displaying unicode, and getting unicode in from forms. First, I need to remember that unicode strings are abstract objects that can be encoded with a particular encoding, e.g. utf-8, which represents them as bytes. To display unicode, all we need to do is tell the browser (and zope in the same go) that we want and have utf-8 encoding (see http://wiki.zope.org/zope2/HowToInternationaliseWithPTS#encoding), e.g. with setHeader('Content-Type','text/html;; charset=utf-8'). (The doubled semicolon is presumably there because the call sits in a page template, where TAL requires a literal semicolon to be escaped as ";;".)
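
To make the "abstract object vs. encoded bytes" distinction concrete, here is a minimal sketch in plain Python 3. It is not Zope code: in the Python 2 that Zope 2 runs on, the text type would be unicode (u'...') rather than str, and the sample text and variable names are made up for illustration.

```python
# A unicode string is an abstract sequence of code points ...
text = "Gr\u00fc\u00dfe"  # the German word "Gruesse" with u-umlaut and sharp s

# ... and an encoding turns it into one concrete byte representation.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# Same text, different bytes (and byte counts) depending on the encoding.
print(len(text), len(utf8_bytes), len(latin1_bytes))

# Decoding with the matching encoding gives back the same text.
roundtrip = utf8_bytes.decode("utf-8")

# The header that tells the browser how the body bytes are encoded.
# (Single semicolon here; the doubled one only matters inside TAL.)
content_type = ("Content-Type", "text/html; charset=utf-8")
```

The point is only that the browser cannot guess which of the two byte strings it got, so the charset has to be stated in the header.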

Now, the page will be encoded in utf-8, and zope knows that this is its job, so it tries its best.

Ok, the way back needs two participants: the browser encoding the form input in the proper way, and zope knowing what's coming. The browser knows what encoding to use because we have set the encoding to utf-8 in the first place - the logic being that if the whole page is utf-8, so is the form, and so is the content of the form when submitted. Now, the browser keeps that encoding as its little secret, and does not tell the server it's posting the form to, e.g.

    Host: localhost:8094
    User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv: Gecko/20070321 Firefox/ (Swiftfox)
    Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
    Accept-Language: en-us,en;q=0.7,de-de;q=0.3
    Accept-Encoding: gzip,deflate
    Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
    Keep-Alive: 300
    Connection: keep-alive

(Accept-Charset only says what the browser is willing to get back, not how the posted form data is encoded.)

So, how does the server (aka zope) know about it? Well, here it's getting clumsy. We need to tell it, for each and every itsy tiny form field, that unicode is coming, and how it is encoded, e.g. name="text:ustring:utf8" or name="text:ustring:utf-8" (utf8 and utf-8 both seem to work). Now zope knows that it's unicode, properly utf-8 encoded, and we get unicode(!) objects, like u'foo', in our scripts. Great!

But... how to store them? If you store them in a normal string property, it seems to work as well... kind of. Try to change the content in the ZMI, and you get an error. The point being: the ZMI needs to know that the property contains unicode as well. In the ZMI we have the choice of string/ustring for the properties. Not hard to guess that ustring is the right one. If we store our beloved unicode string object in a nice and cosy ustring property, it sleeps really nicely in its little place in the ZODB, and the ZMI is cool and happy, because it knows about it.

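
The field-name marking can be pictured roughly like this. It is a hedged sketch in plain Python 3, not Zope's actual converter code: convert_field is a hypothetical helper, and the fallback to utf-8 when no charset is given is my assumption. It only shows the idea that the suffixes after the colon tell the server to decode the posted bytes into a unicode object.

```python
def convert_field(field_name, raw_value):
    """Hypothetical helper: interpret a name like 'text:ustring:utf8'."""
    parts = field_name.split(":")
    name, hints = parts[0], parts[1:]
    if "ustring" in hints:
        # Whatever hint is left over is treated as the charset.
        # Falling back to utf-8 is an assumption made for this sketch.
        charsets = [h for h in hints if h != "ustring"]
        charset = charsets[0] if charsets else "utf-8"
        return name, raw_value.decode(charset)
    # Unmarked fields are passed on as the raw bytes the browser sent.
    return name, raw_value


posted = "caf\u00e9".encode("utf-8")  # what the browser sends for a utf-8 page

marked = convert_field("text:ustring:utf8", posted)
also_marked = convert_field("text:ustring:utf-8", posted)
unmarked = convert_field("text", posted)

print(type(marked[1]).__name__, type(unmarked[1]).__name__)
```

Python's own codec lookup treats "utf8" and "utf-8" as aliases, which may be why both spellings work in the field name.
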
So, the results:

* Set the content-type for output to utf-8: setHeader('Content-Type','text/html;; charset=utf-8')
* Mark all the form fields as unicode and utf8: name="text:ustring:utf8"
* Store them in proper unicode fields: ustring, ulines.

The alternative would be, it seems, to not mark the fields, get utf-8 byte strings instead of unicode into the scripts, and then either decode them or store the utf-8 strings as they are... but this surely leads to one hell of a mess, I would say - and how would you change data in the ZMI?

(... lots of testing in the meantime ...)

Ok, there is one problem - you can't turn the title of objects into a ustring. So what now? Surprisingly simple, it seems:

* Set the content-type for output to utf-8: setHeader('Content-Type','text/html;; charset=utf-8')
* Don't mark the fields specially
* Set a property called management_page_charset to utf-8 on the app's top folder or the root folder

The result: we send out utf-8, get utf-8 back, and store it as utf-8. No encoding or decoding done at all. The ZMI knows about it, and displays all strings as utf-8. We now only need to make sure that the indexes in our catalogs know about it (do we actually?)

Sources:

* http://wiki.zope.org/zope2/Internationalization
* http://www.zope.org/Members/htrd/howto/unicode
* http://article.gmane.org/gmane.comp.web.zope.plone.internationalization/1076
* tav, of course
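
A footnote to the final approach (utf-8 in, utf-8 stored, utf-8 out): the sketch below, again plain Python 3 with made-up sample data and not Zope code, shows why the management pages have to be served as utf-8 too. The stored bytes never change; only the charset the page claims decides whether they display correctly. latin-1 is used here merely as an example of a mismatching charset.

```python
# Bytes as posted by the browser and stored untouched in a plain string property.
stored = "M\u00fcnchen".encode("utf-8")

# Page declared as utf-8: the browser decodes the bytes back to the original text.
shown_as_utf8 = stored.decode("utf-8")

# Page declared with another charset: every non-ASCII character turns into two
# wrong ones, although the stored data itself is perfectly fine.
shown_as_latin1 = stored.decode("latin-1")

# ascii() keeps the output readable on any terminal.
print(ascii(shown_as_utf8))
print(ascii(shown_as_latin1))
```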