
Linux.com

Feature

Viewing Word files at the command line

By Scott Nesbitt on March 01, 2006 (8:00:00 AM)


As a Linux user, there are times when you have to play nicely with users of Windows or Mac OS -- such as when they send you Microsoft Word files. When you receive a Word file, you can either follow Richard Stallman's advice and refuse it, or bite the bullet and work with it. Modern Linux word processors -- such as OpenOffice.org Writer, AbiWord, KWord, and TextMaker -- can deal with most Word files. But if you don't want to fire up a word processor in order to read or print the document, you can turn to the command line. A handful of small but powerful Linux command line utilities make viewing, printing, and even converting Word files to another format a breeze.

Antiword

Antiword is a nifty application that can convert Word documents to plain text, PostScript, and PDF. It can also convert them to DocBook XML, although according to the developer that conversion is still experimental and doesn't always work well.

Antiword is very flexible. It can read and convert files created with Word versions 2.0 to 2003, and you can run it on multiple operating systems, including Linux, Mac OS X, RISC OS, FreeBSD, and OpenVMS. On top of that, you can set the paper size for documents converted to PostScript or PDF, include any text that was removed from the file (but which Word notoriously keeps a record of), and display any hidden text.

For the most part, you'll just want to view a Word document. To do that, you just have to type the following command:

antiword file.doc

The Word document will be converted to text and printed to the screen. If you're running Antiword in a terminal window, you'll have to scroll up to view the full text of the document. To get around this, you can pipe the output from Antiword to the less utility, which will allow you to scroll through the document page by page from the top:

antiword file.doc | less
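The paper size feature mentioned earlier is set through Antiword's command line options. Here is a minimal sketch, assuming Antiword's -a option (PDF output for a given paper size) works as its man page describes. The doc2pdf name and the a4 default are illustrative choices, not part of Antiword.

```shell
# doc2pdf: convert a Word file to PDF with Antiword (hypothetical helper).
# Usage: doc2pdf file.doc [papersize]   (papersize defaults to a4)
doc2pdf() {
    if [ "$#" -lt 1 ] || [ ! -r "$1" ]; then
        echo "usage: doc2pdf file.doc [papersize]" >&2
        return 2
    fi
    # -a selects PDF output for the given paper size;
    # -p takes the same argument and selects PostScript instead.
    antiword -a "${2:-a4}" "$1" > "${1%.doc}.pdf"
}
```

For example, doc2pdf whitepaper.doc letter should write whitepaper.pdf next to the original. According to the man page, -s includes hidden text and -r includes text removed by revisioning.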

Catdoc

Slightly less flexible than Antiword, but still useful, is Catdoc, whose developer explains that "it does same work for .doc files as the Unix cat command for plain ASCII files."

While Antiword tries to retain some of the formatting of a Word file, Catdoc is a quick and dirty tool. It outputs either LaTeX or plain text, and little else. The LaTeX output leaves a lot to be desired -- it does nothing beyond adding the LaTeX formatting for tables or special characters. You'll have to add the LaTeX preamble and any other formatting code yourself.

Catdoc has some rudimentary support for tables. If it's converting a simple table, the output will be passable. If the table is more complex, say with nested elements, it won't be pretty.

To run Catdoc, type the following command:

catdoc <output_format> filename.doc

You can specify the output format using the -a (text) or -t (LaTeX) option. So, to convert the Word file whitepaper.doc to text, type:

catdoc -a whitepaper.doc

As with Antiword, you can pipe the output from Catdoc to the less utility.

wvWare

wvWare is part of wv, a library that enables developers to write software that can read and write Word files. In fact, both AbiWord and KWord use wv for importing Word documents. wvWare can handle documents created with Word from version 6 to 2000. It converts Word 2.0 documents to text only.

Used by itself without any command line options, wvWare will convert a Word document to HTML and display the code on the screen. If you want to write the HTML to a file, use the following command:

wvWare file.doc > file.html

But you're not stuck with HTML. wvWare comes with a set of scripts that can convert Word files to a number of other formats, including plain text, HTML, LaTeX, PDF, PostScript, LaTeX DVI, and WML. These scripts are usually installed in the folder /usr/bin. You can get a list of them by typing ls /usr/bin/wv* at the command line.

If you want to convert a Word document to text, use the following command:

wvText file.doc file.txt

I've never been able to pipe the output to the less utility or a text editor. I've always had to open a file converted with wvWare in an editor or browser.

You can view Word files using wvWare and the w3m text-mode Web browser, as detailed in the book Linux Desktop Hacks. I've tried this hack with the text-based browsers Lynx and Links as well, but w3m does the best job out of the three.

To use this hack, type the following command:

wvWare -x /usr/lib/wv/wvHtml.xml file.doc | w3m -T text/html

You can also encapsulate the above command in a script if you decide to use this hack regularly. wvWare converts the Word file to HTML using the configuration file named wvHtml.xml, then pipes the output to the w3m browser.
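One way to wrap the command is a small shell function. This is only a sketch: the viewdoc name is made up, and it assumes wvHtml.xml lives in /usr/lib/wv as in the command above, which may differ on your distribution.

```shell
# viewdoc: show a Word file in w3m by way of wvWare (hypothetical helper).
# Usage: viewdoc file.doc
viewdoc() {
    if [ "$#" -ne 1 ] || [ ! -r "$1" ]; then
        echo "usage: viewdoc file.doc" >&2
        return 2
    fi
    # Convert to HTML with the wvHtml.xml configuration, then hand the
    # HTML to w3m on standard input.
    wvWare -x /usr/lib/wv/wvHtml.xml "$1" | w3m -T text/html
}
```

Put the function in your shell startup file, or save the body as a script somewhere in your PATH, and type viewdoc file.doc.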

A gotcha or two

While Antiword, Catdoc, and wvWare do a good job handling most Word files, you might run into documents that don't want to cooperate with you. I've found that these utilities sometimes can't process documents that are saved with Word's Fast Save feature, which quickly saves a file by tacking any changes to the end of the file. For example, Antiword might display the cryptic message The Small Block Depot is damaged when it encounters a Fast Saved file. This doesn't happen with all Fast Saved files, however.
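Since a file that trips up one utility may still open in another, you could try them in turn. The following is a rough sketch rather than a tested recipe: doc_to_text and the DOC_TOOLS variable are hypothetical names, and it assumes each tool prints plain text to standard output when given just a file name.

```shell
# doc_to_text: try each converter in turn until one produces output.
# DOC_TOOLS (hypothetical) lists the commands to try, in order.
doc_to_text() {
    doc=$1
    for tool in ${DOC_TOOLS:-antiword catdoc}; do
        # Skip tools that are not installed.
        command -v "$tool" >/dev/null 2>&1 || continue
        # Accept the first tool that exits cleanly with non-empty output.
        if out=$("$tool" "$doc" 2>/dev/null) && [ -n "$out" ]; then
            printf '%s\n' "$out"
            return 0
        fi
    done
    return 1
}
```

Running doc_to_text file.doc | less then shows whatever the first working converter produced. A non-zero exit status means none of them coped with the file.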

As well, out of the box these programs might display garbage characters when converting Word files that use non-Latin character sets or that contain graphics. Check the documentation for the program that you're using for information on how to deal with character sets and graphics.

You don't need a word processor to view Microsoft Word documents on Linux. With the right command line apps, you can view or print those files in a flash with just a few keystrokes.

Scott Nesbitt is a Toronto-based technical writer and journalist who is a big fan of useful little command-line utilities.

Comments

on Viewing Word files at the command line

Note: Comments are owned by the poster. We are not responsible for their content.

False bill of goods

Posted by: Anonymous Coward on March 01, 2006 05:22 PM
viewing, printing, and even converting Word files to another format a breeze

Converting Microsoft Word files to another format is anything but a breeze. In fact, no conversion process I have seen works 100% of the time.


That will always be true, because if a perfect conversion utility existed, Microsoft would just "extend" the format to break it in the next version of Microsoft Word. They are perfectly free to do that; they own the format.


However little you like Stallman, he is absolutely right in pointing out that there is only one way to deal with proprietary formats. When people send me a ".doc" file, I tell them I can't read it and ask them politely to send a .pdf file.


Of course, they need to buy software to generate pdf files, since Microsoft Word (by design) does not support any standard format out-of-the-box. But that's their problem, which they created themselves when they decided to use Microsoft Word instead of OpenOffice.

#

Re:False bill of goods

Posted by: Anonymous Coward on March 02, 2006 05:45 PM
Not to sound like a smart guy, but I'm sure the people who send you doc files aren't the ones who choose which applications go on their company computers. If you had MS Office, the many people who might send you a doc would have no problem, instead of all of them getting stuck asking someone else what a pdf is.

Not that I'm saying everything should be done in MS Office; it sucks that everyone needs it just to read a file. But in the end, pushing an ideal solution where it's just not worth the effort becomes a matter of philosophy.

#

Re:False bill of goods

Posted by: Anonymous Coward on March 03, 2006 05:44 AM
Let me guess: either you're the most unpopular person in your workplace, you work with very forgiving people, or you're the boss. Many people work on collaborative writing projects with higher-ups who can't be bothered to figure out what latex, xml, and odt are. We receive dozens of .doc announcements every day from colleagues we hardly know. If trying to dictate to people what format to use doesn't get you fired, at least it will make people think you're annoying while still failing to win many people over.

I do most of my work in xml, but when I have to send a paper to someone else for revisions, I often convert it into .doc because most of the people I work with wouldn't have any idea what to do with any other format. Not to say speaking up about open formats isn't worthwhile, but you do have to pick your battles and be strategic about how you do it.

I'd rather use antiword to browse a few Word documents than send dozens of emails every day to people I hardly know telling them to use another format.

#

Re:False bill of goods

Posted by: Anonymous Coward on December 31, 2006 03:45 PM
On Windows you can use PDFCreator, no need to buy PDF software to create .pdf files.

#

Re:False bill of goods

Posted by: Anonymous Coward on April 25, 2007 02:25 AM
There are Readers available for .doc just the same as there are Readers for .pdf.

They both sound like proprietary formats to me.

#

CUPS filter

Posted by: Anonymous Coward on March 01, 2006 07:41 PM
Another neat thing about command line tools is that you could use them to make a CUPS filter. Then you can print these files using a simple lp or lpr command.

#

Re:Seriously

Posted by: Anonymous Coward on March 01, 2006 08:02 PM
"Why would you not want to open a word processor to read a word document ?"

1) Because you have a website and you want to display the text of a Word document instead of making the user download it.

2) Because you want to search the text without the formatting.

3) Because you want to extract the text without formatting.

4) Because you want to put the content into another application.

5)...

Don't be so shortsighted.

#

Re:Seriously

Posted by: Anonymous Coward on March 02, 2006 03:02 AM
5) Because you have 300 .doc files that you want to convert to a usable format. The command line should allow you to do a batch operation instead of doing one at a time with a word processor.
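A batch run like that might be sketched as below. The convert_docs name is made up, and the converter is assumed to take a file name and print text to standard output, as Antiword does.

```shell
# convert_docs: convert every .doc file in a directory to a .txt file
# alongside it (hypothetical helper).
# Usage: convert_docs directory [converter]   (converter defaults to antiword)
convert_docs() {
    dir=$1
    converter=${2:-antiword}
    for doc in "$dir"/*.doc; do
        # If nothing matched, the pattern is left as-is; skip it.
        [ -e "$doc" ] || continue
        # Report failures but carry on with the remaining files.
        "$converter" "$doc" > "${doc%.doc}.txt" || echo "failed: $doc" >&2
    done
}
```

Something like convert_docs ~/reports would then leave a .txt file next to each .doc, and convert_docs ~/reports catdoc would use Catdoc instead.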

#

Re:Seriously

Posted by: Anonymous Coward on March 02, 2006 05:49 PM
What I don't like about it is, if you have 300 docs, then some will come out half or quite broken when you convert them, and getting those documents back to the way they were takes quite a lot of time.

But that's the nature of proprietary versus hobby programmers' programs.

#

Re:Seriously

Posted by: Anonymous Coward on March 03, 2006 01:58 AM
Not if they're plain text with no complex formatting. They will look a little funny, but they'll work.

#

another ugly hack for doc to txt

Posted by: Anonymous Coward on March 01, 2006 08:47 PM
wvText file.doc file.txt && less file.txt && rm -f file.txt

that ain't pretty, nor very smart, but it will do the trick,

#

Re:another ugly hack for doc to txt

Posted by: Anonymous Coward on March 01, 2006 09:00 PM
actually, the rather closely associated command:

rm file.doc

is my favourite way of reading word docs.

#

Re:another ugly hack for doc to txt

Posted by: Anonymous Coward on March 02, 2006 05:50 PM
If only that worked in a real business place.

#

Re:Seriously

Posted by: Anonymous Coward on March 01, 2006 09:54 PM
Quite naive. I know a company that made thousands using these tools for CUPS filtering and print routing.

A lack of creativity on your behalf is not the fault of the author.

These are excellent tools.

#

strings

Posted by: Anonymous Coward on March 01, 2006 10:52 PM
Check out the command "strings". If you just need to see the text in a file, something like this does it:

strings foobar.doc | less

#

Re:strings

Posted by: Anonymous Coward on March 02, 2006 02:17 AM
Except that Word uses Unicode. You have to strip out the NULLs first, like so:

tr -d \\0 < foobar.doc | strings | less

#

Re:strings

Posted by: Anonymous Coward on March 02, 2006 09:23 PM
I use a Perl script:
#!/usr/contrib/bin/perl -w
binmode STDIN;
my $data;
# Read standard input in 80-byte chunks and print each chunk on its own line.
while ( read(STDIN, $data, 80)) {
      print "$data \n";
}

Which does the job.
Just change the '80' if you want longer lines.
And yes, embedded pictures just cause funny lines, but it is a simple starter.

#

odt

Posted by: Anonymous Coward on March 02, 2006 12:24 AM
are there similar tools for watching openoffice files?

#

Re:odt

Posted by: Anonymous Coward on March 02, 2006 05:52 PM
At least you can obtain the document viewer (oo.org itself) for free... and convert it to other format as well.

#

Re:odt

Posted by: Anonymous Coward on March 02, 2006 10:02 PM
Certainly: http://siag.nu/o3read/

Ulric

#

Re:odt

Posted by: Anonymous Coward on March 03, 2006 05:10 AM
That would be nice--I hope there is. The point is not just to avoid proprietary formats but to avoid opening up a heavy WYSIWYG viewer for the simple task of looking at text. o3read, from what I can tell, is for .sxw, not .odt. And the "strings" command returns garbage.

#

Re:odt

Posted by: Administrator on March 04, 2006 08:49 AM
odt is zipped xml, so unzip my.odt and then find the xml file with the text, and parse out all the tags, shouldn't be too difficult with a simple bash script, but don't have time to write one right now.

Sam
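For what it's worth, a rough sketch of that idea follows. It relies on the text of an .odt file living in content.xml inside the zip archive; the strip_tags and odt_text names are made up, and the tag stripping is deliberately crude (XML entities such as &amp; are left as they are).

```shell
# strip_tags: replace every XML tag on standard input with a space.
strip_tags() {
    sed -e 's/<[^>]*>/ /g'
}

# odt_text: print the text of an OpenDocument file (hypothetical helper).
# unzip -p writes content.xml to standard output without unpacking to disk.
odt_text() {
    unzip -p "$1" content.xml | strip_tags
}
```

Then odt_text my.odt | less gives a quick look at the text without starting a word processor.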

#

You could use strings

Posted by: Anonymous Coward on March 02, 2006 02:36 AM
If you want a quick reader for Word documents, pass the document through strings and you'll be able to read most of the content:

strings word.doc |less

After all, a word document is a text document with bloat.

#

Unicode

Posted by: Anonymous Coward on March 03, 2006 01:56 AM
It's Unicode. You have to strip out the NULLs first, or strings won't detect anything:

tr -d \\0 word.doc | strings | less

(Note the '' - tr always operates on standard input, not a file.)

#

Re:Unicode

Posted by: Anonymous Coward on March 03, 2006 01:57 AM
D'oh. That should be:

tr -d \\0 < word.doc | strings | less

(Note the '<' - tr always operates on standard input, not a file.)

Dammit, Slashcode, when I say Plain Text I mean Plain Text!

#

mutt

Posted by: Anonymous Coward on March 02, 2006 03:19 AM
Mutt could also be used: http://72.14.207.104/search?q=cache:zJ06bhsxNvIJ:www.oreilly.com/catalog/linuxdeskhks/chapter/hack54.pdf+linux+hack+view+word+file+terminal&hl=en&gl=us&ct=clnk&cd=2&client=opera

#

Re:useful tools

Posted by: Anonymous Coward on March 02, 2006 03:39 PM
Try telling your colleagues to use LyX. It's a graphical interface to LaTeX, much like a word processor, so they might feel more at ease with it. And it runs on Windows too.

#

Re:useful tools

Posted by: Administrator on March 03, 2006 10:53 PM
People born and bred with MS Word won't change easily to LaTeX or LyX. Even getting them to use OpenOffice has proved impossible. Actually I have tried LyX and quite like it, but usually end up moving back to pure LaTeX for writing articles and suchlike. But that's another discussion :-)

#

Just to view it

Posted by: Anonymous Coward on March 02, 2006 05:54 PM
Just to read it, though half off topic, use Word Viewer from MS. No need to have MS Office.

http://www.microsoft.com/downloads/details.aspx?FamilyID=95e24c87-8732-48d5-8689-ab826e7b8fdf&DisplayLang=en

Similar exists for Excel and PowerPoint.

#

Didn't read the article, did you?

Posted by: Anonymous Coward on March 02, 2006 09:19 PM

If you'd bothered to read the article - even just the beginning of the article - in fact, even just the first four words, before replying to the title, you'd have found that the article was aimed at Linux users.


The Microsoft product you mention is not, of course, available on Linux.

#

Re:Seriously

Posted by: Anonymous Coward on March 03, 2006 12:35 AM
Actually, you can watch a movie from the command line using mplayer :-). With mplayer you can start up a movie in the frame buffer without starting X. This can even make a movie watchable if you have a fair amount of processing power, but not a decent DRI capable driver for your video card. In my experience though, audio synchronization can be a bit of an issue (but it is adjustable).

#

Re:Seriously

Posted by: Anonymous Coward on March 03, 2006 05:18 PM
Or just use the "aa" video driver. Doesn't even need a frame buffer. Any text terminal will do, works even with remote terminals.

Let's see your Windows Media Player convert the video into ascii characters and play the video in a terminal window ;-).

#

Catdoc / antiword for search engine

Posted by: Anonymous Coward on August 09, 2006 10:39 PM
I've just been testing antiword and catdoc for Windows. They are used by phpdig (http://www.phpdig.net/), a search engine (*).

The last version of antiword outputs [pic] instead of garbage when finding an image in a word file. The Win32 port of catdoc is keeping the garbage.

When dealing with enormous files it improves performance a lot.

I've not tried the linux binary.

(*) ht://dig (http://www.htdig.org/) also uses catdoc when digging Winword files.

#

Re:Seriously

Posted by: Administrator on March 05, 2006 12:57 AM
Actually, you can not only play a movie in the command line, but you can do it with ASCII graphics!!

I forgot what program did that though.

Anyway, some people are really retro and like the whole command line thing. Go figure.

There's also the occasional situation where someone is using SSH.

#

useful tools

Posted by: Administrator on March 01, 2006 06:52 PM
Good article - it is useful to know these tools.

I hate using Word, but 90% of my colleagues in other Institutes use it. I am currently involved in 3 draft papers, with deadline March 31st, written by Word-users. I have told them all that Linux+LaTeX would be much better, but for some reason they are not converted. So, Word it is (usually OOo, with its imperfect Word compatibility), but if I can read word with simple tools, great.

And why the command line? Well, I often want to save Word files as .txt, maybe for import into LaTeX, maybe just to grab the abstract so I can store it into my bibtex collection. Why would I want to open the file in OOo and save as .txt when I can do the same job with the command line?

#

Seriously

Posted by: Administrator on March 01, 2006 06:34 PM
It's always going to be rubbish to do it this way. Why would you not want to open a word processor to read a word document ? Most PDAs can do that. My 5 years old Zaurus can anyway. You mean you have a computer that can't ?

Why not watch a movie in the command line ? It'd be nice to have less pointless subjects to be honest.

#

Viewing Word files at the command line with OpenOffice...

Posted by: Anonymous [ip: 24.68.129.82] on September 03, 2007 09:39 PM
You can also convert with openoffice on the command line using macros and the -invisible option:

Check out http://www.xml.com/pub/a/2006/01/11/from-microsoft-to-openoffice.html -- you can convert to text if you use the Filtername "Text" ala:

MakePropertyValue( "FilterName", "Text" )

#

Viewing Word files at the command line

Posted by: Anonymous [ip: 10.147.2.72] on December 19, 2007 03:13 PM
I have met the problems which are mentioned in your article.
"I've found that these utilities sometimes can't process documents that are saved with Word's Fast Save feature, which quickly saves a file by tacking any changes to the end of the file"
I used Catdoc on Linux to convert Word documents to text and then save them in a database. But after processing more than 13000 documents, 87 of them produced the error "This was fast-saved N times. Some information is lost". That is an error rate of over 0.6%, which is a little high.
I know the reason is that these documents were saved with the "fast save" option, and Catdoc cannot handle them well in this situation.
Is there anyone with the same problem and how do you solve it ?
Or anyone knows some famous forum can help me find out a solution?
It is very urgent, thanks.

#

This story has been archived. Comments can no longer be posted.



 