Make etexts pretty with GutenMark

By Dmitri Popov on August 28, 2008 (7:00:00 PM)

Project Gutenberg, the online library of more than 25,000 free books, is a treasure trove for bookworms and casual readers alike, but turning electronic text files into a readable form is not as easy as it may seem. In theory, since etexts are just plain text files, you should be able to open and read them on any platform without any tweaking. In practice, however, this approach rarely works. Hard line breaks, for example, may ruin the text flow, making it virtually impossible to read the book on a mobile device. Another problem is that most books are stored as single files, so locating a particular chapter or section in a lengthy book can be a serious nuisance. Then there are minor but annoying formatting quirks, such as inconsistent handling of italicized text, use of straight quotes instead of smart ones, and so on.
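To make the hard-line-break problem concrete, here is a minimal shell sketch of the kind of reflowing involved. It assumes that paragraphs are separated by blank lines, and the function name unwrap_paragraphs is invented for this illustration. It is not how GutenMark works internally, and it would wrongly join verse, tables, and other layouts that should keep their line breaks.

```shell
#!/bin/sh
# Hypothetical helper: join the hard-wrapped lines of each paragraph
# into a single line. Assumes blank lines separate paragraphs.
unwrap_paragraphs() {
    awk 'BEGIN { RS = ""; ORS = "\n\n" }   # empty RS switches awk to paragraph mode
         { gsub(/\n/, " "); print }'       # replace line breaks inside a paragraph
}
```

With a downloaded etext saved as book.txt (a placeholder name), unwrap_paragraphs < book.txt > book-unwrapped.txt would write the reflowed copy.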

Fixing all these and other issues manually to make an etext readable -- or even printable -- is a daunting proposition. Thankfully, the GutenMark tool can take most of the burden off your shoulders. The utility converts Project Gutenberg etexts into neatly formatted HTML or LaTeX files.

The goal of the GutenMark project is to create a tool that produces files that don't require any additional cleanup or tweaking. While it still has some way to go before it achieves this goal, GutenMark does a remarkable job of turning etexts into readable and printable files.

Initially, GutenMark was a command-line tool, but the latest version of the application comes with the GUItenMark graphical interface and the GutenSplit tool, which can split a single file into multiple chapters. These tools come from a single installer, but before you download and run it on your system, you have to make sure that the system has all the required packages: glitz, libpng, and libtiff. On Ubuntu, you can install them using the sudo apt-get install libglitz1 libpng libtiff command. You also need to create a couple of symbolic links, as follows:

sudo ln --symbolic /usr/lib/libtiff.so.4 /usr/lib/libtiff.so.3
sudo ln --symbolic /usr/lib/libexpat.so.1 /usr/lib/libexpat.so.0
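If you would rather not create a link blindly, the same step can be wrapped in a small check. The sketch below is only an illustration: the function name link_if_missing is invented here, and it assumes you run it with enough privileges to write to /usr/lib.

```shell
#!/bin/sh
# Hypothetical helper: create LINK pointing at TARGET only when TARGET
# exists and nothing is already present at LINK.
link_if_missing() {
    target=$1
    link=$2
    if [ ! -e "$target" ]; then
        echo "skipping: $target not found" >&2
        return 1
    fi
    if [ -e "$link" ] || [ -L "$link" ]; then
        echo "skipping: $link already exists" >&2
        return 0
    fi
    ln -s "$target" "$link"
}
```

From a root shell, link_if_missing /usr/lib/libtiff.so.4 /usr/lib/libtiff.so.3 would then create the first link only if the library is actually installed and the older name is still free.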

Now you can download the GutenMark installer and make it executable. The installation instructions on GutenMark's Web site recommend that instead of using the chmod command, you make the installer executable by right-clicking on it and ticking the Execute check box. Run the installer and GutenMark is ready to go.

Using the GUI version of GutenMark to convert etexts is straightforward. Use the Input Files pane on the left to add one or several etexts, then configure the available conversion options by ticking the desired check boxes. Most of the options are self-explanatory, and you can experiment with different settings to achieve the best results. GutenMark allows you to save different settings as profiles. You can, for example, create two separate profiles for converting etexts to HTML and LaTeX, or you can set up different profiles for different languages.

When converting etexts to HTML, you have an option to split the source file into multiple chapters. To enable this feature, tick the Split at headings check box, and specify the splitting points. Usually, ticking the H1 (Heading 1) check box works just fine, but you can chop the etexts into smaller pieces by enabling other heading options. If you choose to split the etext, make sure you enable the Table of contents option, which creates a separate HTML file with links to the created chapters.
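For readers curious about what splitting at headings amounts to, the following is a rough shell sketch of the idea, not GutenSplit's actual implementation. The function name split_at_h1 and the output naming scheme are assumptions made for this example, and it assumes every opening <h1> tag starts on a line of its own. The pieces it writes are raw fragments, without the header, footer, or table of contents that a real splitter would add.

```shell
#!/bin/sh
# Hypothetical helper: write INPUT to PREFIX-00.html, PREFIX-01.html, ...
# starting a new file at every line that contains an opening <h1> tag.
# Anything before the first heading lands in PREFIX-00.html.
split_at_h1() {
    input=$1
    prefix=$2
    awk -v prefix="$prefix" '
        BEGIN { out = sprintf("%s-%02d.html", prefix, 0) }
        /<[hH]1[ >]/ {
            close(out)                                  # finish the previous piece
            n++
            out = sprintf("%s-%02d.html", prefix, n)    # name of the next piece
        }
        { print > out }
    ' "$input"
}
```

Calling split_at_h1 book.html chapter (both names are placeholders) would produce chapter-00.html, chapter-01.html, and so on, one file per top-level heading.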

To convert the selected etext, press the Arrow button, and the converted files appear in the Output Files pane. You can then open the converted files directly from within GutenMark by double-clicking on them.

If you prefer to use GutenMark from the command line, the Usage page provides a detailed description of the available command-line options. Even if you stick to the GUI, the page can help you to figure out what each option does.

Although GutenMark does an impressive job of converting etexts to HTML format, which is readable on virtually any device, the converted files might still need some manual tweaking. It's a good idea to go through a converted file and correct any remaining issues before you load it onto your device. This is, however, a minor nuisance compared to converting an entire etext by hand.

Dmitri Popov is a freelance writer whose articles have appeared in Russian, British, US, German, and Danish computer magazines.

Comments

Make etexts pretty with GutenMark

Posted by: Anonymous [ip: 76.74.207.42] on August 28, 2008 08:35 PM
Hmm... they could use some Gnome HIG love.

#

Re: Inheriting a legacy

Posted by: Anonymous [ip: 75.93.4.233] on August 29, 2008 03:38 AM
huh? here of all places. i like politics, but please, let's keep comments on topic. and as for my political persuasion just in case you think i'm knocking you: i voted for ron paul and yes i'm upset with both parties.

more on topic though, this sounds like a great tool. i think XHTML+CSS or some other format would have been a better solution to begin with. i personally think that a common markup standard should be put into place for stuff like ebooks, that would alleviate some of the mistakes that programs make when parsing the files. XHTML, HTML 5, and XML make great candidates because they are all well known and can be extended in a manner that would make importing other sources (such as images and linking) a lot easier.

#

Make etexts pretty with GutenMark

Posted by: Anonymous [ip: 84.110.109.5] on August 29, 2008 11:58 AM
Personally I hope they use ODF since it seems to be the most suitable format that allows for everything a book needs (page numbering isn't supported in (X)HTML as far as I know) and it can be easily exported to various other formats (PDF, HTML, DOC).

LaTeX would be another great option as far as I can tell, although I doubt most people would be able to use it.

#

Make etexts pretty with GutenMark

Posted by: Anonymous [ip: 80.52.205.82] on August 29, 2008 01:55 PM
I've always wondered why Project Gutenberg hasn't used a markup scheme from the beginning. It actually seems pretty dumb to use plaintext just to later write tools to decorate the plaintext with markup. Not to mention all the lost processing capabilities.

#

Re: Make etexts pretty with GutenMark

Posted by: Anonymous [ip: 190.25.164.67] on August 29, 2008 04:17 PM
Because markup is fluff and creates an obesity problem. It is almost trivial in this day and age, but it wasn't in the '70s when Project Gutenberg was born.

In those days, when you weren't around obviously, you would have to pay an arm and a leg to AT&T for the privilege of making a long distance phone call with a 300 baud modem that would take 10 minutes to download a 25 KB file. In fact, that's why the zip format came into being: it was a way to cut your phone bill to as little as one-tenth of those hard-earned greenbacks. It was also one of the best examples of why you may need to reverse engineer a format in order to escape a technological monopoly. PK-Zip Inc., btw, survived without buying the country's legislative bodies to issue protectionist laws that protect their revenue.

#

Re(1): Make etexts pretty with GutenMark

Posted by: Anonymous [ip: 80.52.205.82] on August 29, 2008 05:16 PM
Heh, thanks for the info. I really didn't suspect Project Gutenberg was that old, much older than SGML and most of the generic markup research. Anyway, perhaps it is time to start Project Mark Gutenberg and gradually add structure to the plaintext documents? The advantages would be many.

#

Make etexts pretty with GutenMark

Posted by: Anonymous [ip: 170.135.241.45] on August 29, 2008 06:54 PM
Project Gutenberg has steadfastly taken the position that their sole goal is to preserve the text of documents (primarily books) that are in the public domain. That's why they still limit their etexts to just ASCII.

The good news is that PG links to a page that includes a lot of other repositories that do add more:

http://onlinebooks.library.upenn.edu/archives.html

#

Re: Make etexts pretty with GutenMark

Posted by: V. L. Simpson on September 02, 2008 10:22 PM
PG accepts other formats. They just want an ASCII file along with them as the lowest common denominator.

#
