Using Unicode in Linux

By Michał Kosmulski on November 02, 2004 (8:00:00 AM)

Last time we talked about Unicode and its benefits. If you've decided you want to reap those benefits yourself, here's how to convert a Linux system from another encoding system to Unicode.

First of all, check whether you're already using a Unicode locale. The locale command prints the values of the environment variables that influence locale settings. A complete description of their meanings is available in the locale man pages. Usually, locale names consist of a lowercase language code followed by an underscore and an uppercase country code (e.g. en_US for U.S. English). Names of Unicode locales that use UTF-8 encoding additionally end with ".UTF-8". If such names are present in the output of locale, you are already using a Unicode locale.
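
For example, on a system that already uses a UTF-8 locale, the output of locale might look something like this (the exact variables and values will vary):

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=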

If you do need to make the conversion, back up all your important data first, as you'll be converting your disk's filesystems. Note that backups made before the conversion are not fully compatible with the converted system. As we noted last time, the operating system and many utilities do not know what characters the bytes in file names represent. Among the utilities with this problem is tar, a popular backup tool. If you are using an en_US locale now and some filename contains the character "ä" (German a umlaut), it is represented as a single byte: hex 0xE4. After you move to UTF-8, it will be represented as two bytes: 0xC3 0xA4. However, neither the filesystem nor tar knows that these two different byte sequences can represent the same character. If you restore your old file from backup after moving to a UTF-8 locale, the old one-byte sequence will be used in the restored file's name, making it different from the new version's name. Under a UTF-8 locale, this single byte won't be considered "ä" but rather an invalid UTF-8 sequence, and will be displayed as a placeholder or as an octal representation of the erroneous byte. So if you restore data from older backups or archives after you move to UTF-8, you may need to run a filename conversion similar to the one described below on the extracted files in order to get the filenames right.
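
If you would like to see this byte-level difference yourself, the following commands illustrate it (a quick sketch, assuming your terminal already uses UTF-8 and that iconv and hexdump are installed):

printf 'ä' | hexdump -C                                   # UTF-8: c3 a4
printf 'ä' | iconv -f UTF-8 -t ISO-8859-1 | hexdump -C    # ISO-8859-1: e4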

To use UTF-8 locales without too much extra work, you will need glibc (the GNU C library) version 2.2 or newer; any reasonably modern distribution ships with it. You can check your version by running /lib/libc.so.6.

The following paragraphs give a step-by-step description of how to perform the conversion. Most operations described below must be performed as root.

Setting the locale

Certain environment variables tell applications which locale is to be used. Commonly used variables are:

  • LC_ALL -- When set, the value of this variable overrides the values of all other LC_* variables.
  • LC_* -- These variables control different aspects of the locale. For example, LC_CTYPE controls character classification, case conversion, and the character encoding in use, while LC_TIME controls the date and time format. LC_MESSAGES defines the language for application messages. Details can be found in the man page for locale(7).
  • LANG -- If LC_ALL is not set, then locale parameters whose corresponding LC_* variables are not set default to the value of LANG.

Before modifying your locale, remember or save to a file the output of locale, which shows your current locale. Also, note down the output of locale -k LC_CTYPE | fgrep charmap (your current character encoding), as you will need this information later on.
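
For instance, you could save both pieces of information to a file like this (the file name is just an example; on an ISO-8859-1 system the second command typically prints charmap="ISO-8859-1"):

locale > ~/old-locale.txt
locale -k LC_CTYPE | fgrep charmap >> ~/old-locale.txt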

In order to tell applications to use UTF-8 encoding, and assuming U.S. English is your preferred language, you could use the following command:

export LC_ALL=en_US.UTF-8

Applications started afterwards from the same terminal window should be aware of UTF-8. To check whether that's the case, you could, for example, use the wc command: wc -c tells you the number of bytes, and wc -m the number of characters, in a file or in data read from standard input (end typing with Enter and Ctrl-D). In a UTF-8 locale, if text contains non-ASCII characters, the number of bytes will be greater than the number of characters. For example:

user@host:~$ wc -c
Bär
5
user@host:~$ wc -m
Bär
4

This three-character word is encoded using four bytes in UTF-8; the extra character and byte in the counts above come from the end-of-line marker.

If the test failed (i.e. wc returns the same number in both cases), your system probably came without UTF-8 locale definitions, and you will have to use localedef to generate them. For example, if en_US.UTF-8 is missing, you can generate it from en_US using:

localedef -i en_US -f UTF-8 en_US.UTF-8
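
To verify that the locale was generated, or just to see which UTF-8 locales are already installed, you can list all available locales and filter for UTF-8:

locale -a | grep -i utf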

Since values of environment variables last only as long as your session, you have to put your export commands in /etc/profile so that they are run for each user at the next login. If you work from inside KDE, you will have to log out and back in so that the environment variables are re-read and the changes take effect. GNOME seems to always use UTF-8 internally, even if the locale is not UTF-8-based. No matter which desktop environment you are using, it may be necessary to log out and, if you are using a login manager (e.g. KDM or GDM), restart the X Window System by pressing Ctrl-Alt-Backspace so that /etc/profile is re-read and all applications learn about the new locale.
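
The addition to /etc/profile could look like this (a minimal example assuming U.S. English; adjust the locale name to your preferred language and country):

# Make all users default to a UTF-8 locale
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8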

Converting filesystems

The next step is to convert your filesystems. This is the only risky part of the transition, so do make a backup of all important data from your disks if you haven't done so yet.

As noted above, the Linux kernel doesn't care about character encodings. For common Linux filesystems (ext2, ext3, ReiserFS, and other filesystems typical for Unices), the encoding used for filenames is not stored as part of the filesystem. Only the locale-controlling environment variables tell software which characters particular bytes represent. Filesystems found on Microsoft Windows machines (NTFS and FAT) are different in that they store filenames on disk in one particular encoding. The kernel must translate that encoding to the system encoding, which will be UTF-8 in our case.

If you have Windows partitions on your system, you will have to make sure they are mounted with the correct options. For FAT and ISO9660 (used by CD-ROMs) filesystems, the utf8 option makes the system translate the filesystem's character encoding to UTF-8. For NTFS, nls=utf8 is the recommended option (utf8 should also work). Add these options to the corresponding entries in your /etc/fstab. A fragment of /etc/fstab might then look like this (other options may vary depending upon your setup):

/dev/hda2        /mnt/c           ntfs        defaults,ro,nls=utf8                        1 0
/dev/hda3        /mnt/d           vfat        defaults,quiet,utf8                         1 0
/dev/cdrom       /mnt/cdrom       iso9660     defaults,noauto,users,ro,utf8               0 0
# If using supermount, add "utf8" to the options _after_ two dashes, e.g.
#none             /mnt/cdrom       supermount  fs=iso9660,dev=/dev/cdrom,--,auto,ro,utf8   0 0
/dev/fd0         /mnt/floppy      auto        defaults,noauto,users,rw,quiet,utf8         0 0

After you modify /etc/fstab, you should remount the filesystems in question by issuing a mount -o remount /mnt/mount-point command for each of them. Non-ASCII characters in filenames on those filesystems should now be displayed correctly again. Note that this requires the kernel to be capable of converting between character sets, so support for UTF-8 must be compiled in or available as a module. The option is found under "File systems" -> "Native Language Support" -> "NLS UTF-8" in the kernel configuration program. Depending upon which encoding your Windows partitions use, you may also need to compile in support for that encoding. Check this page for a list of codepages used by various language versions of FAT. NTFS always uses Unicode internally and doesn't need any kernel NLS options except for UTF-8 support.
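
For example, to remount the NTFS partition from the fstab fragment above and check that the new option took effect, you could run something like this (the mount point is taken from the example above; your output will differ):

mount -o remount /mnt/c
mount | grep /mnt/c    # the listed options should now include nls=utf8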

Native Linux filesystems do not store information about the character encoding used, so you must physically change the names of all files to the new encoding, as opposed to the simple remounting of FAT and NTFS volumes. In theory, all you need to do is execute a command like:

mv original-filename filename-in-UTF-8-encoding

for each file. In practice, things tend to be a little more complicated. First of all, you may already have UTF-8-encoded filenames on your disk without knowing it. For example, some GNOME applications tend to create UTF-8 filenames regardless of the locale used, and kbd (a set of utilities for handling console fonts) comes with a sample file called ♪♬ (two music notes) in the documentation. During conversion, these files need to be identified and their names left unchanged.

Another issue to look out for is directories. Since both the directory name and names of files contained within may need changing to their UTF-8 equivalents, you can't simply create a list of all files and directories and then perform mv old-name new-name for each of them. If you did and first renamed a directory, then the path referring to the files in that directory would no longer be valid. So, order is important. In the directory tree, the leaves (i.e. files) should be renamed first, then the lowest-level directories, then their parent directories, and so on.
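
GNU find can produce exactly this bottom-up ordering with its -depth option, which lists a directory's contents before the directory itself. A small illustration, with /mnt/data standing in for an arbitrary path:

find /mnt/data           # parents listed before their contents -- wrong order for renaming
find /mnt/data -depth    # contents listed before their parent directories -- safe order

The script below achieves a similar effect by reverse-sorting the complete list of paths instead.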

Below is a script that tries to perform the necessary conversion automatically. Note that using it may be dangerous -- back up important data first! While it should work in most cases, this script isn't bulletproof. In order to keep things simple, it doesn't handle some special cases, such as spaces in mount paths (which are really rare) and read-only filesystems (it is not obvious what should be done with those; if you do intend to convert a read-only mounted hard disk partition, remount it manually as read-write with mount -o remount,rw /some/mount/path before running the script). Depending on filesystem size and the number of files that need conversion, this script may take a long time to complete, especially since for simplicity's sake it is far from optimal (all this could probably be done in Perl in a much more compact way).

Remember to change orgcharset in the script below to the name of your old character set, which you found in one of the previous steps with locale -k LC_CTYPE.

#!/bin/sh

fstab=/etc/fstab
orgcharset=INVALID_CHARSET_NAME

export LC_ALL=POSIX

# Find filesystems suitable for conversion
filesystems=`awk '!/vfat|ntfs|iso9660|udf|auto|autofs|swap|subfs|sysfs|proc|devpts|nfs|smbfs|^#/{print $2}' "$fstab"`
# Locate files whose names need to be converted and sort the list
find $filesystems -xdev | {
	while read -r; do
		# Check if the filename needs conversion (i.e. is not a correct UTF-8 string)
		if ! echo `basename "$REPLY"` | iconv -f UTF-8 -t UTF-8 &>/dev/null; then
			echo "$REPLY"
		fi
	done
} | sort -r | {
	# Rename files
	while read -r; do
		dirname=`dirname "$REPLY"`
		orgfname=`basename "$REPLY"`
		newfname=`echo "$orgfname" | iconv -f "$orgcharset" -t UTF-8`
		if [ $? -ne 0 ]; then
			echo "Error: iconv failed for $REPLY. Skipping." >&2
			continue
		fi
		mv "$REPLY" "$dirname"/"$newfname"
	done
}

Converting text files

It is convenient for user text files to use the system's default encoding, so after moving to UTF-8 you may want to convert your text files too. Converting configuration files isn't necessary, as programs that can handle non-ASCII data in their config almost always use UTF-8 for storing it already. You can convert a single text file with iconv:

iconv -f old-encoding -t UTF-8 filename > temp.tmp && mv temp.tmp filename

Again, make sure this actually works before playing with important data.
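
If you have many text files to convert, you can wrap iconv in a small loop. Here is a rough sketch, assuming that everything under ~/documents with a .txt extension is plain text in ISO-8859-1 (adjust the path, pattern, and source encoding to your situation); writing the result back with cat rather than mv keeps the original file's permissions and ownership intact:

find ~/documents -type f -name '*.txt' | while IFS= read -r f; do
	# convert to a temporary file first, then copy back only on success
	if iconv -f ISO-8859-1 -t UTF-8 "$f" > "$f.tmp"; then
		cat "$f.tmp" > "$f"
	fi
	rm -f "$f.tmp"
done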

Getting fonts with Unicode support

Unicode fonts for the text console are usually shipped with major Linux distributions. To enable UTF-8 on the console, run unicode_start (and unicode_stop to return to the previous one-byte mode).

In order to actually see Unicode characters displayed by X applications, you need to download and install Unicode fonts. Bitstream Vera is a TrueType font available under an open source license that now comes with many Linux distributions; unfortunately, it covers relatively few characters. An extended version, with support for most Latin accented characters, is called Hunky Font. A family of Unicode fonts called FreeFont is available under the GPL. There are also a number of free-as-in-beer fonts on the Web, including Microsoft Core Fonts (a package containing, among others, the popular typefaces Arial and Times New Roman), Bitstream Cyberbit (only the regular style is available, but it has very good Unicode coverage), Gentium, and many others. Of course, there are also lots of commercial fonts that can be used with X.
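
Installing a downloaded TrueType font is usually just a matter of copying it somewhere fontconfig can see it and refreshing the font cache. A minimal sketch, assuming a fontconfig-based setup and a hypothetical file SomeUnicodeFont.ttf:

mkdir -p ~/.fonts
cp SomeUnicodeFont.ttf ~/.fonts/
fc-cache -f ~/.fonts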

Summary

Using UTF-8 has many advantages over using a single-byte locale. A minor one is the ability to use any character in file names and on the command line. The main advantage of Unicode, however, is that it allows easier data exchange and better interoperability than any other character set. UTF-8 is meant to replace ASCII in the future, so at some point "text file" is going to mean "UTF-8 file" just as it means "ASCII file" now.

Links

Unicode Consortium
Unicode page in Wikipedia
UTF-8 page in Wikipedia
man page for UTF-8 (or run man 7 utf8)
man page for Unicode (or run man 7 unicode)
UTF-8 and Unicode FAQ
Unicode HOWTO (somewhat outdated)

A continually updated version of this article can be found at the author's Web site.

Michał Kosmulski is a student at Warsaw University and Warsaw University of Technology.

Comments

all my systems are now unicoded

Posted by: housetier on November 02, 2004 05:57 PM
Thanks for the tips in this article. I went over my configuration and did find places I had to tweak; now all my Linux computers are using Unicode.

URW-fonts

Posted by: Anonymous Coward on November 02, 2004 11:31 PM
The author forgot to mention URW-fonts -- a very good, LGPL'd set of Unicode fonts. Any decent Linux distro has it.

One can get it from Freshmeat.net (http://freshmeat.net/projects/urw-fonts-cyrillic/).

A bit of a warning... about UniCode! Security?

Posted by: Anonymous Coward on November 03, 2004 03:49 AM
You may want to run Unicode ideas through the washer before use.

Microsoft has had security issues with their Unicode layer! SecurityFocus.com had some articles a few years back about it, and comments related to the articles in the months that followed indicated a weakness here, for obvious reasons.

Is the above approved by the groups that are building the Security Enhanced Linux stuff?

UTF-8 suxx

Posted by: Anonymous Coward on November 03, 2004 07:14 PM
Anyone whose native language is not English/European agrees that UTF-8 does suck, 'cos it's an ugly hack to cope with the insufficient length of the data word and to be a modern re-incarnation of 7-bit ASCII.
UTF-16 has many more pluses, and I think Bill the Gate did the Right Thing on this topic.

Re:UTF-8 suxx

Posted by: Anonymous Coward on November 14, 2004 08:22 AM
UTF-8 is far from an ugly hack. It is a way to represent Unicode characters as a sequence of bytes. It was beautifully designed:

o It is ASCII transparent: any sequence of ASCII bytes is a correct sequence of UTF-8 bytes that corresponds to the equivalent Unicode characters. Reciprocally, any non-ASCII Unicode character is guaranteed to be encoded using only non-ASCII bytes in the UTF-8 stream. All of this guarantees that systems that only care about bytes and some special ASCII characters like slash "/" (typically the case of the Unix VFS) can immediately "speak" Unicode.

o Sorting: sorting a UTF-8 string gives the same result as sorting a Unicode string. Again, not a single line of code to add.

And for your information, UTF-16 is obsolete, since it can only represent the Unicode v3.0 character space. That is exactly why glibc 2.1 (and up) uses UTF-32. When the Windows people realized that in Windows 2K, they did the same "ugly hack": they started to use what are called "surrogates", i.e. encode one Unicode character with a sequence of two UTF-16 code units...

File conversion script destroys file permissions

Posted by: Anonymous Coward on November 04, 2004 07:25 AM
The file conversion script in the article has a serious bug: it destroys the file's permissions.

The given script:

iconv -f old-encoding -t UTF-8 filename > temp.tmp && mv temp.tmp filename

Should be written:

iconv -f old-encoding -t UTF-8 filename > temp.$$ && cat temp.$$ > filename && rm temp.$$

These changes also allow concurrent jobs to run in parallel.

Karl O. Pinc kop@meme.com

Re:File conversion script destroys file permission

Posted by: Anonymous Coward on November 09, 2004 08:22 PM
"These changes also allow concurrent jobs to run in parallel"

No, it does not.

Dany

How to use Unicode without en_US

Posted by: Anonymous Coward on November 16, 2004 03:31 AM
I really like using the "C" locale instead of "en_US". Particularly the sorting order.

So I tried this:

export LC_ALL=C.UTF-8

that doesn't work.

How do I get UTF-8 with the sorting behavior I want?

Re:How to use Unicode without en_US

Posted by: Anonymous Coward on December 20, 2004 05:37 PM
The two locales C and POSIX were designed for 8-bit characters and there are no UTF-8 versions of them shipped with glibc by default. It might be possible to generate a Unicode version of either using localedef (this is described in the article), but I haven't tried that.
