The Good News About PHP and Unicode

Last October Joel Spolsky wrote an article in which, among other things, he lambasted PHP and its (lack of) Unicode support. The article was widely read and widely linked to. Scott Reynen wrote a reply demonstrating how one could write one’s own string handling methods that operated on the underlying integer representation of a Unicode string. This, more than Joel’s original article, convinced me (and a number of other people I’ve talked to) that PHP’s string handling was badly broken, and I wasn’t going near it. Scott’s article is academically interesting, but I’m sorry, you’ll have to shoot me before I go back to working with strings as arrays of integers.

Pleasantly Surprised

It’s on my to-do list at work to start helping put together our internationalization strategy, so with a certain amount of trepidation I started experimenting. Hmmm, contrary to the impression left by Joel’s article, strings of multi-byte characters seem to be working: I can store them, print them, append them. I can create multi-byte variable names. So far, better than expected. Oops, strlen() is returning the byte count rather than the character count, and we’ll take that to be indicative of a wider range of problems. Now what?
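A minimal sketch of the kind of test that shows the problem. The variable names are made up for illustration, and the results in the comments assume a stock PHP with no multi-byte overloading switched on:

```php
<?php
// "An ḃfuil" spelled with explicit UTF-8 bytes, so the result does not
// depend on how this file is saved. ḃ (U+1E03) is the three bytes E1 B8 83.
$phrase = "An \xE1\xB8\x83fuil";

// Eight characters to a reader, but strlen() counts bytes.
$byteCount = strlen($phrase);        // 10

// substr() also works on bytes, so it will happily cut a character in half.
$clipped    = substr($phrase, 0, 4); // "An " plus the first byte of ḃ
$clippedHex = bin2hex($clipped);     // "416e20e1"

echo $byteCount, "\n", $clippedHex, "\n";
```

The second result is the nastier one: the clipped string is no longer valid UTF-8, so anything that displays or stores it downstream gets garbage.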

Out on the corners of my awareness I’ve known about the multi-byte string extension to PHP. It was something for handling Japanese characters, the documentation doesn’t read as quite fluent (which doesn’t make it all that different from a lot of documentation written by people I presume to be native English speakers), and besides, it’s icky to have to prepend mb_ to all your function calls.

Turns out that mbstring, besides adding native support for a number of CJKV encodings, also supports several of the Unicode encodings, including the only one almost anyone really cares about, UTF-8. And you don’t have to use the mb_ functions, as you can tell mbstring to overload the native string, mail and regex handling functions, transparently masking the originals with multi-byte aware ones.
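For reference, the overloading is switched on in php.ini rather than in your scripts. The fragment below is a sketch of the settings involved, assuming a PHP 4 or PHP 5 build with mbstring compiled in; check the manual for your version before relying on it, since mbstring.func_overload was deprecated in PHP 7.2 and removed in PHP 8.0.

```ini
; Bit flags: 1 = mail(), 2 = string functions, 4 = regex (ereg) functions.
; 7 turns on all three groups.
mbstring.func_overload = 7

; Encoding mbstring assumes for strings handled inside scripts.
mbstring.internal_encoding = UTF-8
```

With the string group overloaded, a plain strlen() call is routed to mb_strlen() and counts characters rather than bytes. Note that the regex flag covers the ereg family only; the PCRE functions are not overloaded.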

Okay, that all seems to be working. I dug out one of my comment scripts and punched in the phrase “An ḃfuil do ċroí ag bualaḋ ó ḟaitíos an ġrá a ṁeall lena ṗóg éada ó ṡlí do leasa ṫú?” (escaped here, because this page is iso-8859-1, but trust me the original was utf-8). Oops, my MySQL isn’t properly supporting utf-8. I could probably tweak my tables to work with my current install, but MySQL 4.1.x looks like a serious upgrade in internationalization over the 4.0.x series, so that is my next project. I’ll probably upgrade my PHP5 to the newly released RC1 while I’m at it (as I’ve been testing all of this with my PHP4 build).

Tempest? Teapot? Am I Missing Something?

Being a mono-lingual American with limited experience with internationalization, I’m not sure where to go next to try to break this stuff (probably the PCRE extension!). What exactly is it that PHP is lacking that makes it so impossible to build an i18n app that it deserved to be sentenced to six months peeling onions on a submarine? (Whether this is a punishment of course depends on your temperament.) Insights? Anyone?

Now, just being able to display languages other than English is not the only thing one needs for successful internationalization; there is a slew of issues relating to the formatting of dates, numbers, etc. But it’s a nice place to start.