
Linux.com

Feature

CLI Magic: Convert file names to a different encoding with convmv

By Manolis Tzanidakis on December 11, 2006 (8:00:00 AM)


User Level: Beginner to intermediate

Recent versions of most Linux distributions support non-English languages out of the box by using the Unicode standard. I was pleasantly surprised when I found out that I was able to read and write in Greek -- my native language -- on a fresh Ubuntu Edgy Eft installation without any manual intervention. Unfortunately, my happiness lasted only until I tried to open files with Greek file names. Instead of Greek characters I saw garbage. I've been using the 8-bit ISO 8859-7 encoding for Greek file names, and since it worked well I was too lazy to convert my systems to Unicode. Manually renaming hundreds of files in order to convert them to Unicode was not an option; I needed some kind of automation. Convmv is the right tool for that job.

Convmv is a Perl program that converts file and directory names between different character encodings. It converts only the names, not the content of the files, and can also convert a whole filesystem, including symlinks. Most Linux distributions offer packages for convmv, and you can also find it in the FreeBSD ports and NetBSD pkgsrc. Manual installation is fairly easy since the program depends only on Perl, which is installed by default on virtually all Linux distributions and BSD variants; running make install will install the program in /usr/local/bin and its man page in /usr/local/share/man/man1.

Running convmv without any arguments prints the list of all available options. All options are explained in detail in the program's man page.

Let's start by running convmv --list to display the supported encodings. To convert the names of all Ogg/Vorbis files in the current directory from ISO 8859-7 to UTF-8 (Unicode), run convmv -f iso-8859-7 -t utf8 *.ogg. This command will not actually rename the files -- it just prints what it would do. To rename the files, add the --notest option. If you want the program to ask for confirmation before any action, add the -i option to enable interactive mode.
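To see what such a conversion means at the byte level, here is a minimal Python sketch. It is only an illustration of the idea, not convmv's actual Perl code, and the helper name convert_name is made up for this example.

```python
def convert_name(raw_name: bytes, from_enc: str, to_enc: str) -> bytes:
    """Re-encode a file name given as raw bytes (hypothetical helper)."""
    return raw_name.decode(from_enc).encode(to_enc)


# "αβγ.ogg" as stored by an ISO 8859-7 system: one byte per Greek letter.
old = b"\xe1\xe2\xe3.ogg"
new = convert_name(old, "iso-8859-7", "utf-8")

# In UTF-8 each Greek letter takes two bytes, so the name grows.
print(old, "->", new)
```

The file content is never touched; only the bytes that make up the name change, which is why the same name looks like garbage when it is read with the wrong encoding.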

By default the program checks whether the file names you want to rename are already using the specified encoding and skips them accordingly. Although you can speed up the whole process by disabling this feature with the --nosmart switch, it's better not to, since it could lead to "double-encoded" file names with incorrect characters. If that does happen, the man page has a section on how to repair double-encoded files. The program will also stop if the converted name of a file already exists in the same directory. You can, however, use the --replace switch to have that file overwritten if its content is the same as that of the original file.
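The following Python sketch shows how double encoding comes about and why a check like the one described above prevents it. It is a simplification under my own assumptions: the check here simply tests whether the name is already valid UTF-8, and the helper names are invented for this example rather than taken from convmv.

```python
def looks_like_utf8(raw_name: bytes) -> bool:
    """Return True if the bytes already form valid UTF-8 (hypothetical check)."""
    try:
        raw_name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_if_needed(raw_name: bytes, from_enc: str) -> bytes:
    """Convert to UTF-8 unless the name already looks like UTF-8."""
    if looks_like_utf8(raw_name):
        return raw_name  # skip: converting again would double-encode
    return raw_name.decode(from_enc).encode("utf-8")


legacy = b"\xe1\xe2\xe3.ogg"                       # "αβγ.ogg" in ISO 8859-7
once = convert_if_needed(legacy, "iso-8859-7")     # proper UTF-8
twice = convert_if_needed(once, "iso-8859-7")      # unchanged, check kicked in

# Converting blindly a second time treats every UTF-8 byte as a Greek letter.
blind = once.decode("iso-8859-7").encode("utf-8")
print(blind.decode("utf-8"))  # Ξ±Ξ²Ξ³.ogg
```

A pure ASCII name passes the check too, which is harmless here because ASCII bytes are identical in ISO 8859-7 and UTF-8.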

After making sure that your options work correctly, it's time to convert the whole filesystem to UTF-8 with a single command. We will also add the -r switch, which enables recursive mode. For example, issue convmv -f iso-8859-7 -t utf8 -r --notest --replace ~/data to convert the names of all the files and directories inside the data directory in your home directory from ISO 8859-7 to UTF-8. You can also use convmv to convert file names to all upper or lower case with the --upper and --lower options respectively. If the file name is not ASCII-encoded you must also supply its encoding with the -f switch.
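A recursive conversion has to rename the contents of a directory before the directory itself, otherwise the paths it is still working on would change underneath it. The Python sketch below shows one way to do that, with a dry run as the default. The function names and the notest parameter are my own, chosen to echo the convmv options; the sketch skips existing targets instead of implementing --replace.

```python
import os


def plan_renames(root: bytes, from_enc: str, to_enc: str):
    """Yield (old_path, new_path) pairs, deepest entries first."""
    # topdown=False walks children before their parent directory.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            try:
                new_name = name.decode(from_enc).encode(to_enc)
            except (UnicodeDecodeError, UnicodeEncodeError):
                continue  # name does not fit one of the encodings; leave it
            if new_name != name:
                yield os.path.join(dirpath, name), os.path.join(dirpath, new_name)


def convert_tree(root: bytes, from_enc: str, to_enc: str, notest: bool = False):
    """Print the planned renames; perform them only if notest is True."""
    for old, new in plan_renames(root, from_enc, to_enc):
        if os.path.lexists(new):
            print("skipping, target exists:", new)
            continue
        print(old, "->", new)
        if notest:
            os.rename(old, new)


# Example dry run (assumes a ~/data directory with ISO 8859-7 names):
# convert_tree(os.fsencode(os.path.expanduser("~/data")), "iso-8859-7", "utf-8")
```

Paths are handled as bytes throughout, because names in a legacy encoding are not valid text in a UTF-8 locale.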

Besides the conversion to Unicode, convmv can be useful when you need to exchange files with users of obsolete operating systems that have no support for Unicode, such as Windows 98 or older versions of Linux. Speaking of cross-platform interoperability, Mac OS X has a strange way of handling Unicode-encoded file names. Linux and most other Unix-like operating systems use normalization form C (NFC) for UTF-8 file names, while OS X uses NFD. Convmv can convert file names between these two normalization forms with the --nfc and --nfd switches. You might face similar issues with the JFS and NFS v4 file systems; check the convmv man page for more information.
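The difference between the two normalization forms is easy to see with Python's standard unicodedata module. This is just a small illustration of NFC versus NFD, using an example file name of my own choosing.

```python
import unicodedata

name = "\u03ac.ogg"  # "ά.ogg", alpha with tonos as one precomposed code point

nfc = unicodedata.normalize("NFC", name)  # composed form
nfd = unicodedata.normalize("NFD", name)  # decomposed: alpha + combining accent

print(nfc.encode("utf-8"))  # b'\xce\xac.ogg'
print(nfd.encode("utf-8"))  # b'\xce\xb1\xcc\x81.ogg'
print(nfc == nfd)           # False
```

Both names look identical on screen, but they are different byte sequences, so a file created under one form may not be found when it is looked up under the other.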

Convmv made my transition to Unicode as painless as possible. It converted all my files while I was making a cup of coffee, giving me plenty of time to play with the new version of Ubuntu.


Comments

on CLI Magic: Convert file names to a different encoding with convmv

Note: Comments are owned by the poster. We are not responsible for their content.

Hmm

Posted by: Anonymous Coward on December 11, 2006 05:35 PM
Hasn't Linux supported Unicode filenames for a long time now?

I know a certain other operating system where I have been able to use Swedish characters such as 'åäö' in them for a very long time.

UTF-8 was invented by Ken Thompson and Rob Pike. It is the native encoding in Plan 9 and the whole system was converted to use it everywhere in 1992.

#

Re:Hmm

Posted by: Anonymous Coward on December 13, 2006 01:54 AM
You're right. I, for example, already dealt with these problems a couple of years ago, and I also used convmv back then. It's been well worth the hassle, because since then I have never had to care about such things again. Linux has great support for UTF-8 ;-)

#

Symlinks

Posted by: Anonymous Coward on December 12, 2006 01:08 PM
Does it handle symlinks correctly (i.e. change the link to point to the new name) ?

#

Re:Symlinks

Posted by: Anonymous Coward on December 13, 2006 01:57 AM
convmv has a feature to detect double encoding; I don't know of any content converter that does the same thing (however, I also haven't bothered about such things for years now). If you know Perl, you might try to adapt convmv to recode content instead of just filenames (use with care), or just reuse the portion of convmv that does the recoding for your particular application.

#

converting text

Posted by: Anonymous Coward on December 12, 2006 02:43 PM
how about converting file contents that include both iso8859-* and utf-8 characters? =)

converting such a file iso8859->utf8 results in utf8 characters getting fucked.

#

convmv is useful for fixing

Posted by: Anonymous Coward on December 12, 2006 06:59 PM
Linux supports filenames according to your locale settings and filesystem.

The problem arises when you copy files from one encoding to another (different filesystems or locales). convmv helps you convert the encoding.

I had an ISO-8859-1 reiserfs filesystem. When I copied files to a UTF-8 reiserfs filesystem, file names with characters such as "çãó" were not handled because they still used a single byte, and convmv fixed this problem easily.

For file contents you use "recode" or "iconv".

Rui Vilela

#

Some more background on Unicode in Linux

Posted by: Administrator on December 14, 2006 02:46 AM
My article "Using Unicode in Linux" (http://hektor.umcs.lublin.pl/~mikosmul/computing/articles/linux-unicode.html), originally also published on newsforge.com, gives some more detail on Unicode in Linux and, in particular, explains how to convert your filesystem manually, i.e. without using the Perl script described here. Using a script is probably more convenient, though.

#

CLI Magic: Convert file names to a different encoding with convmv

Posted by: Anonymous [ip: 24.91.10.17] on November 24, 2007 04:40 PM
This looks like a great solution. I'm rsyncing data from NTFS to UFS (Windows to FreeNAS) and it doesn't like the funky characters. I guess maybe I can try to compile this under Windows and see if it helps me.

#




 