Re: UTF isn't all that it's cracked up to be.

Posted by: Anonymous Coward on November 02, 2004 02:06 AM
I do know how UTF-* works, and that UTF-8 isn't the size of the total number of glyphs encoded. Sheesh.

UTF-8 is great for English, I agree, but useless when used as the base of an OS, a la all the free unices.
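To put rough numbers on the size side of this, here is a quick Python sketch. The helper name `encoded_sizes` and the sample strings are made up for illustration; it only counts bytes and says nothing about how any OS actually stores text.

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

# Illustrative samples only: English, Japanese, Thai.
samples = ["hello", "日本語", "ภาษาไทย"]
for s in samples:
    print(s, encoded_sizes(s))
```

English text comes out at one byte per character in UTF-8, while the Japanese and Thai samples take three bytes per character in UTF-8 against two in UTF-16.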

Hmm, have you used the Unicode spec at all regarding Asian languages? Japanese, Chinese and Korean are all lumped into the same space, but they are three distinctly different languages. How many glyphs are in Han Chinese? Now how many are in the Unicode codespace? Hmm, that's right, quite a deficiency, isn't there?

I look at it like dropping the z off English. So people don't use it much; does that mean we can drop it? No, it's a silly thing to say, so why should other languages be treated like second fiddle?

We did some database work for a Japanese multinational. We were quite shocked to realise you can't store some surnames, as they still use old/traditional Japanese kana+kanji for which there is no codepoint in Unicode. How insulted would you be if someone came along and said you spell your surname wrong and here is the new whizzbang modernisation?

No, I didn't mean UCS-4, which is just UTF-32 more or less. Same as UCS-1 + UCS-2 are just UTF-8 + UTF-16.

I think the Unicode standard is WAY overly complex. (Ever tried to sort Thai glyphs?! The Thai Unicode set is all back to front and mixed up. It's ridiculous.)
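For anyone who hasn't hit the Thai sorting problem: the five leading vowels (U+0E40 to U+0E44) are stored before the consonant they are read after, so a plain code point sort doesn't give dictionary order. Below is a very rough Python sketch. The name `thai_sort_key` and the two sample words are my own, and the swap trick is a simplification, not real Thai collation (it ignores tone marks and everything else a proper collation library would handle).

```python
# Thai leading vowels (U+0E40..U+0E44) are stored before the consonant
# they are read after.
LEADING_VOWELS = {chr(cp) for cp in range(0x0E40, 0x0E45)}

def thai_sort_key(word):
    """Very rough key: swap each leading vowel with the character after it."""
    chars = list(word)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in LEADING_VOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair we just swapped
        else:
            i += 1
    return "".join(chars)

# Illustrative words: U+0E40 U+0E01 and U+0E02 U+0E32.
words = ["เก", "ขา"]
by_code_point = sorted(words)                  # vowel-first word sorts last
by_rough_key = sorted(words, key=thai_sort_key)  # consonant decides the order
print(by_code_point, by_rough_key)
```

The two orderings come out reversed for this pair, which is the "back to front" effect in miniature.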

The hardest thing of all is that the internet is more or less English 8-bit, and so too is the software...

I have been waiting for ages for one of the free unices to step up and go at least UTF-16 internally, but none have come forward. The world's software programmers on the whole need a kick up the butt in regards to proper i18n.

Ranting on and wotnot, I can't really point and blame when things were designed in the 70s, mostly by Americans (a la language design C, hardware, etc.). Legacy issues abound, eh!

I also think it's quite stupid that they are being lobbied (since I don't think it's official just yet) for JRRT Elvish/Dwarven languages to be included when they can't even finish the Chinese encoding.

And now I think about it, yes, UCS-4, UTF-32. I was under the impression UCS-4 didn't have the 0x0010FFFF encoding limit that mars UTF-32 (i.e. the backward compatibility with UTF-16 etc.).
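For reference, the 0x10FFFF ceiling is easy to poke at from Python, whose `chr()` enforces the same limit. A small sketch; the helper names `utf16_units` and `in_codespace` are made up for this example.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point UTF-16 surrogate pairs can reach

def utf16_units(cp):
    """Return the UTF-16 code units (as ints) for one code point."""
    data = chr(cp).encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

def in_codespace(cp):
    """True if Python accepts cp as a code point, False otherwise."""
    try:
        chr(cp)
        return True
    except ValueError:
        return False

# The top code point needs the last possible surrogate pair.
print([hex(u) for u in utf16_units(MAX_CODE_POINT)])
# One past the top is rejected outright.
print(in_codespace(MAX_CODE_POINT), in_codespace(MAX_CODE_POINT + 1))
```

The last code point maps to the last high surrogate plus the last low surrogate, so there is simply nowhere left for UTF-16 to go above it.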

