
Linux.com

Feature

Introduction to Unicode

By Michał Kosmulski on November 01, 2004 (8:00:00 AM)


Unicode, or the Universal Character Set (UCS), was developed to end once and for all the problems associated with the abundance of character sets used for writing text in different languages. It is a single character set whose goal is to be a superset of all others used before, and to contain every character used in writing any language (including many dead languages) as well as other symbols used in mathematics and engineering. Any charset can be losslessly converted to Unicode, as we'll see.

ASCII, a character set based on 7-bit integers, has long been, and still is, popular. While its provision for 128 characters was sufficient at the time of its birth in the 1960s, the growing popularity of personal computing all over the world made ASCII inadequate for people speaking and writing many different languages with different alphabets.

Newer 8-bit character sets, such as the ISO-8859 family, could represent 256 characters (actually fewer, as not all code values could be used for printable characters). This solution was good enough for many practical uses, but while each character set contained the characters necessary for writing several languages, there was no way to put into a single document characters from two languages that belonged to two distinct character sets. In the case of plain text files, another problem was how to make software recognize the encoding automatically; in most cases human intervention was required to tell which character set was used for each file. A whole new class of problems was associated with using Asian languages in computing: non-Latin alphabets posed new challenges due to some languages' need for more than 256 characters, right-to-left text, and other features not taken into account by existing standards.

Unicode aims to resolve all of those issues.

Two organizations maintain the Unicode standard -- the Unicode Consortium and the International Organization for Standardization (ISO). The names Unicode and ISO/IEC 10646 are equivalent when referring to the character set (however, the Unicode Consortium's definition of Unicode provides more than just the character set standard -- it also includes a standard for writing bidirectional text and covers other related issues).

Unicode encodings

Unicode defines a (rather large) number of characters and assigns each of them a unique number, the Unicode code, by which it can be referenced. How these codes are stored on disk or in a computer's memory is a matter of encoding. The most common Unicode encodings are called UTF-n, where UTF stands for Unicode Transformation Format and n is a number specifying the number of bits in a basic unit used by the encoding.
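To make the distinction between a code and its encoded form concrete, here is a short Python sketch (an illustration added for this discussion, not part of the original article):

```python
# A character's code point is an abstract number; an encoding decides
# how that number is laid out in bytes.
ch = "€"  # EURO SIGN, code point U+20AC
print(hex(ord(ch)))            # 0x20ac -- the Unicode code itself
print(ch.encode("utf-8"))      # b'\xe2\x82\xac' -- three bytes in UTF-8
print(ch.encode("utf-16-le"))  # two bytes: a single 16-bit unit in UTF-16
```

The same code point, 0x20AC, produces different byte sequences depending on the chosen UTF.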

Note that Unicode breaks an assumption that had held for many years, namely that one byte always represents one character. As you'll see, a single Unicode character is often represented by more than one byte of data, since the number of Unicode characters exceeds 256, the number of distinct values that can be encoded in a single byte. Thus, a distinction must be made between the number of characters and the number of bytes in a piece of text.
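The difference is easy to observe; this Python sketch (an added illustration, not from the original article) counts characters and bytes for a short Polish word:

```python
# Character count and byte count differ once text leaves pure ASCII.
text = "żółw"  # Polish for "turtle": four characters
encoded = text.encode("utf-8")
print(len(text))     # 4 characters
print(len(encoded))  # 7 bytes -- ż, ó, and ł each take two bytes in UTF-8
```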

Two very common encodings are UTF-16 and UTF-8. In UTF-16, which is used by modern Microsoft Windows systems, each character is represented as one or two 16-bit (two-byte) words. Unix-like operating systems, including Linux, use another encoding scheme, called UTF-8, where each Unicode character is represented as one or more bytes (up to four; an older version of the standard allowed up to six).
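A quick check of those sizes in Python (an illustrative sketch added here, not part of the original article):

```python
# Expected encoded lengths, in bytes, for a few sample characters.
# UTF-8 uses 1-4 bytes; UTF-16 uses 2 or 4 bytes (1 or 2 16-bit units).
samples = {
    "A": (1, 2),   # ASCII letter
    "é": (2, 2),   # Latin-1 range
    "€": (3, 2),   # elsewhere in the Basic Multilingual Plane
    "𝄞": (4, 4),   # U+1D11E, outside the BMP: a UTF-16 surrogate pair
}
for ch, (utf8_len, utf16_len) in samples.items():
    assert len(ch.encode("utf-8")) == utf8_len
    assert len(ch.encode("utf-16-le")) == utf16_len
print("all lengths as expected")
```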

UTF-8 has several interesting properties which make it suitable for this task. First, ASCII characters are encoded in exactly the same way in ASCII and in UTF-8. This means that any ASCII text file is also a correct UTF-8 encoded Unicode text file, representing the same text. In addition, when encoding characters that take up more than one byte in UTF-8, characters from the ASCII character set are never used. This ensures, among other things, that if a piece of software interprets such a file as plain ASCII, non-ASCII characters are ignored or in the worst case treated as random junk, but they can't be read in as ASCII characters (which could accidentally form some correct but possibly malicious configuration option in a config file or lead to other unpredictable results). Given the importance of text files in Unix, these properties are significant. Thanks to the way UTF-8 was designed, old configuration files, shell scripts, and even lots of age-old software can function properly with Unicode text, even though Unicode was invented years after they came to be.
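Both properties can be demonstrated in a few lines of Python (a sketch added for illustration, not part of the original article):

```python
# Property 1: ASCII text encodes to the very same bytes in UTF-8.
ascii_text = "plain old config line"
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Property 2: multibyte UTF-8 sequences never contain ASCII bytes --
# every byte of such a sequence is 0x80 or greater.
for ch in "żółw€日本語":
    seq = ch.encode("utf-8")
    if len(seq) > 1:
        assert all(b >= 0x80 for b in seq)
print("both properties hold")
```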

How Linux handles Unicode

When we say that a Linux system "can handle Unicode," we usually mean that it meets several conditions:

  • Unicode characters can be used in filenames.
  • Basic system software is capable of dealing with Unicode file names, Unicode strings as command-line parameters, etc.
  • End-user software such as text editors can display and edit Unicode files.

Thanks to the properties of UTF-8 encoding, the Linux kernel, the innermost and lowest-level part of the operating system, can handle Unicode filenames without even having the user tell it that UTF-8 is to be used. All character strings, including filenames, are treated by the kernel in such a way that they appear to it only as strings of bytes. Thus, it doesn't care and does not need to know whether a pair of consecutive bytes should logically be treated as two characters or a single one. The only risk of the kernel being fooled would be, for example, for a filename to contain a multibyte Unicode character encoded in such a way that one of the bytes used to represent it was a slash or some other character that has a special meaning in file names. Fortunately, as we noted, UTF-8 never uses ASCII characters for encoding multibyte characters, so neither the slash nor any other special character can appear as part of one and therefore there is no risk associated with using Unicode in filenames.
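That guarantee can be verified mechanically; the following Python sketch (added for illustration, not part of the original article) scans a sample of code points:

```python
# Because multibyte UTF-8 sequences contain no ASCII bytes, a '/' (0x2F)
# or NUL (0x00) byte can never appear inside one -- so the kernel can
# safely treat a UTF-8 filename as an opaque string of bytes.
for cp in range(0x80, 0x3000):  # a sample of multibyte code points
    seq = chr(cp).encode("utf-8")
    assert b"/" not in seq and b"\x00" not in seq
print("no '/' or NUL inside any multibyte sequence")
```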

Filesystem types not originally intended for use with Unices, such as those used by Windows, are handled slightly differently, as we'll see later on.

User-space programs use so-called locale information to correctly convert bytes to characters, and for other tasks such as determining the language of application messages and date and time formats. The locale is defined by the values of special environment variables. Correctly written applications should be capable of using UTF-8 strings in place of ASCII strings right away if the locale indicates so.
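The usual lookup order for those variables can be sketched as follows; the helper function and the example locale values here are invented for illustration, not taken from the article:

```python
# Sketch of the precedence the C library applies when picking a locale.
def effective_locale(env):
    # LC_ALL overrides everything; LC_CTYPE governs character
    # classification; LANG is the general fallback.
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        if env.get(var):
            return env[var]
    return "C"  # the POSIX default when nothing is set

print(effective_locale({"LANG": "en_US.UTF-8"}))  # en_US.UTF-8
print(effective_locale({}))                       # C
```

A locale name ending in ".UTF-8" is what tells applications to interpret byte strings as UTF-8.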

Most end-user applications can handle Unicode characters, including applications written for the GNOME and KDE desktop environments, OpenOffice.org, the Mozilla family of products, and others. However, Unicode is more than just a character set -- it introduces rules for character composition, bidirectional writing, and other advanced features that are not always supported by common software.

Some command-line utilities have problems with multibyte characters. For example, tr always assumes that one character is represented as one byte, regardless of the locale. Also, common shells such as Bash (and other utilities using the readline library, it seems) tend to get confused when multibyte characters are inserted at the command line and then removed using the Backspace or Delete key.
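The tr failure mode can be mimicked in Python; this sketch (a hypothetical example added here, not from the article) applies a byte-wise translation table the way a one-byte-per-character tool would:

```python
# A byte-oriented tool sees the two UTF-8 bytes of 'é' (0xC3 0xA9) as
# two separate "characters", so asking it to turn 'é' into 'e'
# produces two e's for every é.
data = "été".encode("utf-8")                 # b'\xc3\xa9t\xc3\xa9'
table = bytes.maketrans(b"\xc3\xa9", b"ee")  # map each byte to 'e'
print(data.translate(table).decode("ascii")) # eetee
```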

If using Unicode sounds appealing, come back tomorrow to learn how to deploy Unicode in Linux.

A continually updated version of this article can be found at the author's Web site.

Michał Kosmulski is a student at Warsaw University and Warsaw University of Technology.


Comments on Introduction to Unicode

Note: Comments are owned by the poster. We are not responsible for their content.

utf isnt all that its cracked up to be.

Posted by: Anonymous Coward on November 01, 2004 05:30 PM
unicode is bollocks. ucs32 is vastly better as it encodes everything in a single bitspace. utf 3.x was stuffed up with being backward compat with utf8+utf16. utf8 is useless for anything but english. utf16 isn't much better.

anyone who has had to deal with the unicode 3.x/4.x spec will tell you how bad it is. utf32 still isn't capable of representing some asian languages completely!

why represent a language if you're not going to represent all of it?

mm ucs32, aka utf32 without all the garbage encoding rules, gives a nice uncomplicated 32bit space...


Re:utf isnt all that its cracked up to be.

Posted by: David Breakey on November 02, 2004 01:20 AM

Sure, if you're completely unconcerned with backwards compatibility. Unfortunately, the real world doesn't work out that nicely. For instance, compose a mail message in a 32-bit encoding scheme and watch it almost invariably get mangled by all the routers and mail processing hubs between you and the recipient; now encode the same message in UTF8…


Incidentally, do you even know how UTF8 works? The number doesn't indicate the potential encoding range at all; UTF8 is every bit as capable of representing the full Unicode space as any of the others. It does this by being a variable encoding, using from one to four bytes to encode a single character.


Each scheme is designed to address different requirements. UTF8 is intended for when English is a dominant language, in which case it is more space efficient, or when full compatibility with the ASCII7 standard is a must.


Incidentally, can you provide some specific examples of how UTF32 can't represent Asian languages completely? I haven't come across anything yet that isn't a result of the various standards groups arguing over the best way to encode them…the technical implementation is perfectly capable, even using UTF8.


Incidentally, don't you mean UCS-4, which is also a Unicode standard?


Re:utf isnt all that its cracked up to be.

Posted by: Anonymous Coward on November 02, 2004 02:06 AM
i do know how utf* works and that utf8 isn't the size of the total number of glyphs encoded. sheesh.

utf8 is great for english, i agree. but useless when used as the base of an OS, a la all the free unices.

hmm have you used the unicode spec at all regarding asian languages? japanese, chinese and korean are all lumped into the same space but are 3 distinctly different languages. how many glyphs are in han chinese? now how many are in the unicode codespace?? hmm that's right, quite a deficiency isn't there?

i look at it like dropping the z off english. so people don't use it much, does it mean we can drop it? no, it's a silly thing to say, so why should other languages be treated like second fiddle?

we did some database work for a japanese multinational. we were quite shocked to realise you can't store some surnames as they still use old/traditional japanese kana+kanji... for which there is no codepoint in unicode... how insulted would you be if someone came along and said you spell your surname wrong and here is the new whizzbang modernisation.

no i didn't mean ucs-4, which is just utf32 more or less. same as ucs1+ucs2 are just utf8 + utf16.

i think the unicode standard is WAY overly complex. (ever tried to sort thai glyphs??!?!?! the thai unicode set is all back to front and mixed up. it's ridiculous).

the hardest thing of all is the internet is more or less english 8bit and so too is the software...

I have been waiting for ages for one of the free unices to step up and go at least UTF16 internally but none have come forward. The world's software programmers on the whole need a kick up the butt in regards to proper i18n.

ranting on and whatnot, I can't really point and blame when things were designed in the 70s mostly by americans (a la language design C, hardware, etc). legacy issues abound eh!

I also think it's quite stupid that they are being lobbied (since i don't think it's official just yet) for JRRT elvish/dwarven languages to be included when they can't even finish the chinese encoding.

and now I think about it, yes ucs-4, utf32. I was under the impression ucs32 didn't have the encoding blah 0x0010FFFF that mars utf32 (ie the backward compatibleness with utf16 etc).


Re:utf isnt all that its cracked up to be.

Posted by: Anonymous Coward on November 02, 2004 08:37 PM
You're confused.

UTF-8, when expanded to "Unicode Transformation Format, 8-bit", can handle all Unicode characters up to 0x10FFFF (IIRC).

UTF-8, when expanded to "UCS Transformation Format, 8-bit" (UCS = ISO-10646-1), can handle all ISO characters up to 0xFFFFFFFF, *although* ISO and Unicode have agreed to never use these.

So there is no difference between UCS-4 and UTF-8, except that the latter is variable-length, ASCII compatible, not prone to endianness bugs and _can_ imply larger files for some Asian scripts. //mirabile - http://mirbsd.de/


Re:utf isnt all that its cracked up to be.

Posted by: Anonymous Coward on November 03, 2004 05:50 AM
UTF-32 represents the exact same set of characters as UTF-8 or UTF-16. What you are saying doesn't make much sense.

Also, you said it didn't include all the Chinese characters. I think you are mistaken, and you mention the lack of traditional kanji as an example. What happens is that the simplified and traditional versions correspond to the same codepoint and it's up to the font to display the correct glyph.


Excellent Introduction

Posted by: Anonymous Coward on November 01, 2004 05:45 PM
Can't wait for some notes on using Unicode.

I was actually reading about Unicode and how to use it in (X)HTML last night (okay, 2am this morning...).

I found some excellent articles regarding the usage of Unicode in (X)HTML:
Curling Quotes in HTML, SGML, and XML (http://www.dwheeler.com/essays/quotes-in-html.html) - Many notes on the correct (and incorrect) usage of quotes and how they ought to be noted in HTML
The Trouble With EM ’n EN (and Other Shady Characters) (http://www.alistapart.com/articles/emen/) - All about punctuation (but does include HTML escape sequences for characters)


Re:Excellent Introduction

Posted by: Anonymous Coward on November 01, 2004 09:46 PM
Here are my unicode notes: http://www.pixelbeat.org/docs/utf8.html


Introducing ASCII, a space-saving encoding scheme

Posted by: Anonymous Coward on November 03, 2004 09:06 AM
"if using Unicode sounds appealing, come back tomorrow..."

What's so appealing about using up more space and waiting longer for documents to be transmitted, rendered and stored? (I'm talking about documents that you know are going to be in English, not about documents in other languages). As long as a document is going to be in English anyway, what's wrong with ASCII? Does everything need to be bloated just to get people to buy new hardware?


Re:Introducing ASCII, a space-saving encoding scheme

Posted by: Anonymous Coward on November 05, 2004 05:32 PM
What's wrong with YOU?

Let me repeat it for you:

As long as there are only ASCII (ISO_646.irv:1991) characters in a document, UTF-8 == ASCII. //mirabile - http://mirbsd.de/

