
Linux.com

Feature: PHP

Upcoming PHP release will offer Unicode support

By Bruce Byfield on February 28, 2007 (8:00:00 AM)


Andrei Zmievski is one of the leading developers of the PHP programming language. Since March 2005, he has been working with about 20 other developers to add Unicode support to version 6.0 of PHP. Now their efforts are nearing an alpha release.

Unicode is an effort to map the characters of all human languages for use with computers. Version 5.0 of Unicode, released in the fall of 2006, contains nearly 100,000 characters and has the capacity for about a million. Support for Unicode in software is well underway, usually via one of the Unicode Transformation Formats: UTF-8, UTF-16, or UTF-32.
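The practical difference between the three transformation formats is how many bytes each one spends per character. The following is a minimal sketch in Python (not PHP, since the PHP 6 interfaces described in this article were not final); the helper name `encoded_sizes` and the sample characters are illustrative assumptions, not anything from the PHP project.

```python
def encoded_sizes(ch):
    """Return the byte length of one character in UTF-8, UTF-16, and UTF-32."""
    # The little-endian variants are used so that no byte-order mark
    # is prepended and counted.
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


# An ASCII letter, an accented Latin letter, a CJK ideograph, and an emoji.
for ch in ["A", "\u00e9", "\u65e5", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

UTF-8 grows from one to four bytes as code points get larger, UTF-16 uses two bytes for most characters and four for those outside the Basic Multilingual Plane, and UTF-32 always uses four.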

An important part of the implementation of Unicode in software is support for the Common Locale Data Repository (CLDR). As Zmievski explains, this concept of locales goes far beyond the traditional concept of locales in POSIX-like systems such as GNU/Linux. It includes not just character sets, but also linguistic and cultural preferences for such things as date formats, currencies, and -- of particular interest to programmers -- how data is collated and sorted. In German, for example, characters with umlauts appear immediately after regular characters, while in Swedish they are added at the end of the alphabet.
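To make the German versus Swedish example concrete, here is a rough sketch in Python of two hand-written sort keys. It is only an illustration under simplifying assumptions: real collation uses the much richer rules in the CLDR, and the tables and function names below (`german_key`, `swedish_key`) are invented for this example.

```python
# German dictionary order (simplified): an umlaut sorts like its base letter.
GERMAN_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})

# Swedish order (simplified): å, ä, ö are separate letters after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def german_key(word):
    """Sort key that treats umlauted vowels as their base vowels."""
    return word.lower().translate(GERMAN_FOLD)


def swedish_key(word):
    """Sort key that ranks each letter by its position in the Swedish alphabet."""
    # Characters not in the table sort after everything else.
    return [SWEDISH_RANK.get(ch, len(SWEDISH_ALPHABET)) for ch in word.lower()]


words = ["Zahn", "Ärger", "Apfel", "Öl", "Ofen"]
print(sorted(words, key=german_key))   # ['Apfel', 'Ärger', 'Ofen', 'Öl', 'Zahn']
print(sorted(words, key=swedish_key))  # ['Apfel', 'Ofen', 'Zahn', 'Ärger', 'Öl']
```

The same five words come out in two different orders, which is why a sort routine needs to know the locale and cannot rely on code point order alone.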

Such information is not static, Zmievski emphasizes, but can evolve over time. For example, the euro currency was recently added to many European locales. Similarly, modern Spanish has different rules for collation than traditional Spanish. The CLDR currently lists 360 locales for 121 different languages.

For PHP, Zmievski says, the problem is that the "core language knows little to nothing about encoding and processing multilingual data. In current versions of PHP, extensions such as iconv and mbstring rely entirely on POSIX locales." Unicode support is possible in current versions, Zmievski says, but "there's a lot of hoops to jump through."

As a result, those programmers fortunate enough to be comfortable in English, which tends to be the dominant language of computing, often see little reason to care about Unicode. "One of the things I notice," Zmievski says, "is that when I give my talk in countries outside the US, I get a full room. When I talk in the US? I get maybe a dozen."

However, with the increasing internationalization of the Internet -- as evidenced by the increased demand for international domain names in non-Latin character sets -- Zmievski insists that PHP, like every other language, "has got to be able to confront the evolving world. There is a demand, even if people don't realize it for themselves yet."

Changes in PHP

According to Zmievski, Unicode support in PHP 6.0 will draw on a broad selection of the International Components for Unicode (ICU). These components will provide for such actions as converting between one locale or character set and another, collation, transliteration, Unicode text processing, and Unicode regular expressions. Such functionality will be available when the unicode.semantics switch is enabled.

To accommodate this change, PHP 6.0 will switch from having a single, generic string type to having two: a Unicode string type for text data, implemented through UTF-16, and a binary type, which will include actual binary data and text data for legacy locales. Perhaps the most obvious difference in the string types is that each character in a binary string will be one byte long, while in a Unicode string, a character may use more than a single byte, depending on the language and how it is encoded. In addition, within Unicode strings, characters may be referenced by either name or code point.
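The split between a text type and a binary type can be illustrated with a short Python sketch. This is an analogy only: Python's `str` and `bytes` are not the PHP 6 types, and Python does not expose UTF-16 storage the way the PHP design describes. The variable names are assumptions made for the example.

```python
import unicodedata

text = "na\u00efve"          # a Unicode (text) string: five characters
data = text.encode("utf-8")  # the same content as a binary string

# The text string is measured in characters, the binary string in bytes.
print(len(text), len(data))  # 5 6

# A character can be written by name or by code point.
by_name = "\N{GREEK SMALL LETTER ALPHA}"
by_code_point = "\u03b1"
print(by_name == by_code_point)   # True
print(unicodedata.name(by_name))  # GREEK SMALL LETTER ALPHA
```

The accented letter accounts for the extra byte: it is one character in the text string but two bytes once encoded as UTF-8.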

When a PHP program runs, the runtime encoding setting will specify which encoding to use. The encoding for a script will be specified either as an INI setting or with a declare() statement in the first line, in much the same way as in an XML file. The encoding may be changed later in the script with a pragma. The encoding for standard output and for file and directory names may also be specified, as well as how conversions between the two string types are handled. Since legacy character sets cannot support all Unicode characters, programmers will also be able to set how conversion errors are handled and the format in which PHP reports them.
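The idea of configurable conversion-error handling can be sketched in Python, whose codecs accept an error-handling mode when text is converted to a legacy character set. This is an illustration of the concept rather than the PHP 6 mechanism; the helper `to_legacy` and the sample text are assumptions made for the example.

```python
text = "price: 5 \u20ac"  # the euro sign does not exist in ISO-8859-1


def to_legacy(s, encoding="iso-8859-1", errors="strict"):
    """Convert text to a legacy character set using the given error mode."""
    return s.encode(encoding, errors=errors)


# "strict" refuses to convert and raises an error instead.
try:
    to_legacy(text)
except UnicodeEncodeError as exc:
    print("strict mode failed:", exc.reason)

# Other modes substitute, escape, or silently drop the offending character.
print(to_legacy(text, errors="replace"))            # b'price: 5 ?'
print(to_legacy(text, errors="xmlcharrefreplace"))  # b'price: 5 &#8364;'
print(to_legacy(text, errors="ignore"))             # b'price: 5 '
```

Which mode is appropriate depends on the application: stopping on an error is safest for stored data, while substitution or escaping may be acceptable for display.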

With Unicode support, not only will identifiers within the code be able to use Unicode characters, but a whole range of new functionality will become available. Programmers will be able to specify how information is collated by choosing a locale, and by specifying criteria, such as how accented or upper case characters are treated. Even more usefully, text can be converted from one locale to another, so that, for example, English speakers can read Greek names in Latin characters, or a Japanese reader can convert full-width characters to half-width ones on the fly.
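Both conversions mentioned here can be approximated in a few lines of Python. The sketch below is deliberately crude: it stands in for ICU's transliterators with Unicode compatibility normalization and a tiny, hypothetical Greek-to-Latin table (`GREEK_TO_LATIN`) that covers only the letters needed for the example.

```python
import unicodedata

# Partial, illustrative mapping only; a real transliterator covers the
# whole alphabet and its context-dependent rules.
GREEK_TO_LATIN = {"α": "a", "θ": "th", "η": "i", "ν": "n", "σ": "s", "ς": "s"}


def fullwidth_to_halfwidth(s):
    """Fold full-width letters and digits to their ordinary forms."""
    # NFKC also applies other compatibility foldings, so this is broader
    # than a pure width conversion.
    return unicodedata.normalize("NFKC", s)


def transliterate_greek(s):
    """Render Greek text in Latin letters using the small table above."""
    out = []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            continue  # drop accent marks separated out by NFD
        latin = GREEK_TO_LATIN.get(ch.lower(), ch)
        out.append(latin.capitalize() if ch.isupper() else latin)
    return "".join(out)


print(fullwidth_to_halfwidth("\uff30\uff28\uff30\uff16"))  # PHP6
print(transliterate_greek("\u0391\u03b8\u03ae\u03bd\u03b1"))  # Athina
```

The first call turns the full-width characters for "PHP6" into plain ASCII, and the second turns the Greek spelling of Athens into "Athina".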

Adding Unicode support, Zmievski warns, will cause a certain amount of obsolescence in PHP. The setlocale() string function, for example, will be deprecated. Zmievski also anticipates that "a couple of .ini options, and a couple of functions" will join it, but insists that "everything will work transparently" in the end.

The current state of Unicode development

According to Zmievski, most of the basic Unicode functionality is complete. The PHP Unicode team is currently analyzing functions to check which ones will require upgrading. As of February 12, he estimates that 61%, or 1,844 of 3,047 extension functions, were ready for Unicode support. He is hoping for an alpha release by the end of the first quarter of 2007, and the final version of PHP 6 by the end of the year.

In addition to upgrading functions, the Unicode team also faces other problems. "We need to start working on documentation," Zmievski says, "documenting not just the behavior of functions but the features that have changed, and then an introduction to generic Unicode -- what it does, what it needs, and how to work with it."

However, he adds, "The largest problem is figuring how we build this thing so that you can run your PHP 5 scripts on PHP 6 without a few of them blowing up."

Perhaps equally importantly, Zmievski sees a lot of educational work that is needed. Many in the PHP community, he suggests, are only vaguely aware of the growing necessity of Unicode, and are holding back from using the developer builds of PHP 6. He compares this attitude to the reaction of most people to the danger of earthquakes. "I live in San Francisco," he says. "Everyone knows that you should have an earthquake-preparedness kit. But how many people do that, and how many actually keep them up to date?" In much the same way, while the PHP community knows that the changes are coming, Zmievski worries that it may not be readying for them.

Zmievski talks regularly at conferences about this work in progress, explaining the need for it and encouraging other programmers to start experimenting with the work his team has already done. "That's why I give my talk," he says. "So that people will know that, yeah, they can basically start using it."

Bruce Byfield is a computer journalist who writes regularly for NewsForge, Linux.com, and IT Manager's Journal.



Comments

on Upcoming PHP release will offer Unicode support


Here we go again... :(

Posted by: Anonymous Coward on February 28, 2007 11:44 PM
Adding Unicode support, Zmievski warns, will cause a certain amount of obsolescence in PHP. The setlocale() string function, for example, will be deprecated. Zmievski also anticipates that "a couple of .ini options, and a couple of functions" will join it, but insists that "everything will work transparently" in the end.


Is that like the 'everything will work transparently' promise that the PHP developers made with the PHP4 to PHP5 changes to the object model? (For those of you who don't do lots of PHP, they changed the meaning of the = operator from assign by copy to assign by reference for objects. Everything more complex than helloworld.php got broken by that.)



Having said that, if PHP were to really get decent Unicode support that works across platforms then I'll be happy. At present I have to carry around a gettext emulation layer (amongst other detritus) because I cannot guarantee that PHP actually has gettext installed, that the system has the locales installed and that I can work out the precise name of the locale on the installed machine (is it de_DE.UTF8 or de_DE.UTF8@euro or ge or ...?).


Hopefully this is good news.... now if we can get the hosting companies to upgrade to PHP5 by the end of 2008 then perhaps we'll get PHP6 on those machines by .... ummm .... 2012?

#

Some people never learn...

Posted by: Anonymous Coward on March 01, 2007 01:31 AM
So Unicode support is going to become a configurable option, like magic_quotes, safe_mode, open_basedir and register_globals before it? Does Mr. Zmievski still not realise how frictioned and flaky PHP interpreters are across different web servers and hosting companies?
Semantics configurability might help newbies and quick and dirty workarounds, but doesn't actually strengthen the programming language. Let's just hope the php.net interpreter won't remain the dominant implementation.

#

About these people who feel they don't care

Posted by: Anonymous Coward on March 01, 2007 04:25 AM
>
> As a result, those programmers fortunate enough to
> be comfortable in English, which tends to be the
> dominant language of computing, often see little
> reason to care about Unicode. "One of the things I
> notice," Zmievski says, "is that when I give my
> talk in countries outside the US, I get a full
> room. When I talk in the US? I get maybe a dozen."
>

Indeed, and this is something which is really frustrating :/ It works basically the same when you are talking about accessibility and usability. It mostly works for them, so they don't care at all, even when they are building and maintaining products aimed at others... (and even when it would benefit them).

For Unicode, the worst is, for example, people, who are on a forum related to Japanese animation, with nicknames in Japanese, Japanese in signatures, and even sometimes talking Japanese in posts... and when there are talks about converting the forum from ISO-8859-1, to UTF-8, the very same people say they don't see the need, as "they are seeing the Japanese characters without problems"... (the browsers try to guess what could be these characters, or assume UTF-8... and people don't know about it, and they reject the improvements, without understanding they are already using it, and/or that it would simplify things tremendously... -granted, technologies are not made simple, and very few people will tell them, or are able to tell them, how things really work).

Well, this is really why more thinking should be put, from the very beginning, on the design, of all new technology... When the technology is out, it's too late, and every later change will take years and years to be somewhat supported by the majority, even when it would clearly be a huge benefit...

I understand that in the days of US-ASCII, it would have been difficult, both intellectually and technically, to think about Unicode, but it's been quite a number of years since we acquired a clear picture of what this was all about, and, shamefully, not much seems to have changed, as far as I see it... people get used to bad designs, and it's very hard to get out, and create really better things.

For example, with PHP, named arguments are still refused by PHP devs... they simply ignore the arguments, and keep thinking it is more simple to rely on basic, compatibility-, evolution-, clarity-, and compactness-problematic (yeah, when you have to set the last argument of a method with 6 or 7 arguments you don't care about, named arguments are more compact) linear, anonymous arguments...

There is also the problem with the really limited implementation of non-scalar data, including objects... you can't use them as default values for arguments, you can't initialize class properties and static variables with them... you can't define them as constants...

Resolving these issues would tremendously simplify PHP, just like full Unicode support would (the guys who are refusing Unicode sure never handled any other language than pure English, and surely, never mixed any language in a program or a database...), but a lot of people don't seem to get it, and this is really frustrating :/

#

Re:About these people who feel they don't care

Posted by: Anonymous Coward on March 01, 2007 04:39 AM
Oh, and by the way, this is really similar to some people ranting about how webmasters criticize Microsoft for their poor support for Web standards (at the very least, in IE 5-6), when these people never tried to code a website (at least, never without some WYSIWYG editor doing, somehow, and in limited ways, the dirty job for them, and never checking in "alternative" browsers, in mobile browsers, with assistive technologies, etc. -well, they seldom know about them, anyway), and when they are the first to complain, when something does not work on the website they are browsing...

Really frustrating ^_^; (and most infuriating, when you wasted the past few hours, trying to find why on earth the code wouldn't work in IE, so these users could use the website you are building, and benefit from it... -you know, when they are happily using some website, and ranting in the comments of some article about IE or Firefox, about the webmasters, while the webmaster of this very website, wasted days and days on compatibility for these users... ^_^;).

#

Amazing...

Posted by: Anonymous Coward on March 01, 2007 07:02 AM
Python has had native Unicode handling since... what, 2001?

Having to deal with more than my share of PHP locale issues (and a million of other issues), it's more and more a relief to use a properly maintained alternative scripting language. PHP has become the VB of this decade: a poor language which gives you enough rope to hang yourself, used by too many ignorant "programmer"-hacks.

Then again, I'm currently working on a project to rewrite a messy unreliable unmaintainable PHP-tool into Python. Those hacks are partly responsible that I'm being paid, so I shouldn't whine...

#

Re:Amazing...

Posted by: Anonymous Coward on March 01, 2007 09:51 AM
In PHP's defence, it is possible to write good, high quality, maintainable software in PHP. It's even possible to write software that's all those things, in addition to being secure, fast, and scalable.

However, doing that requires exactly the same knowledge that it would in Python, Ruby, Java, or whatever.

The barrier to entry on PHP is so low that it's basically non-existent, which is why so much PHP code is garbage. But code written by people who know what they're doing can be as good as anything else.

Granted, I came into this around PHP5, and haven't had to lug a huge legacy codebase around with me, or worry about compatibility with web hosts running badly configured copies of PHP4. So my experience is probably far more positive than most people's.

#

Re:Amazing...

Posted by: Anonymous Coward on March 02, 2007 04:05 AM
Amen brother, good programming has only 1% to do with the language and 99% to do with the developer.

A good developer knows how to avoid the traps in the language, besides knowing how to code.

#

Re:Amazing...

Posted by: Administrator on March 02, 2007 11:04 AM
I agree with this, you just have to be comfortable with and used to the language itself.

#

Re:Amazing...

Posted by: Anonymous Coward on March 15, 2007 04:25 PM
>Python has had native Unicode handling since... what, 2001?

If I read the article correctly, PHP6 will go beyond just Unicode handling but will also provide 'improved localisation' with the Common Locale Data Repository (CLDR) which looks interesting..

#

I18n with PHP5: Pitfalls

Posted by: Anonymous Coward on March 15, 2007 02:30 PM
This is a good read: I18n with PHP5: Pitfalls (http://www.onphp5.com/article/22)

#

Upcoming PHP release will offer Unicode support

Posted by: Anonymous [ip: 212.220.84.105] on December 08, 2007 08:41 PM
I use Unicode in my mp3 legal shop (http://mp3-lite.com)

#

Upcoming PHP release will offer Unicode support

Posted by: Anonymous [ip: 84.204.100.57] on December 16, 2007 01:47 AM
Very useful material! Thank you!

#




 