Libferris allows you to index and perform full text search on a number of file formats, including PDF, manual pages, and office documents. The recent availability of packages of libferris and its dependencies for Fedora, Ubuntu, and openSUSE makes it simpler to use the library to provide a file server search interface for the Web. Libferris was initially created to provide a virtual filesystem interface, similar to GnomeVFS and KDE's KIO. Over time libferris has gained sophisticated support for indexing and searching filesystems.
The technique described here makes use of a new user called libferrissearch on the file server to run the search interface. Using a dedicated user allows you to explicitly grant libferrissearch access to only files that you want the Web interface to find, and allows the search interface to return results which might be accessible to the user via NFS but which are not accessible to the Web server. This makes the software more useful to people who wish to take advantage of libferris for file server search, but it does introduce a bit of extra complexity in setting up the search system.
The most robust index plugin for libferris is for the PostgreSQL database, which should be running on the file server in order for you to use it with libferris. If you wish to have PostgreSQL running on another machine, you can pass host=pghostname to the fcreate commands that are contained in the script below.
The commands below start out being executed as the root user, and take advantage of two scripts which are shipped with libferris. The scripts have quite long names, and are used only during the initial setup. The first command creates some template databases in the PostgreSQL server which are tailored for libferris use. Once you have these template databases, a regular user can create new libferris indexes that support full text search. The script next creates a new user, makes PostgreSQL aware of that user, and allows the user to create new databases. We then change to that user to set up libferris and its indexing. First we execute ferris-first-time-user to set up ~/.ferris and its various files for this new user, then create a default home database for the user. Finally, we execute the second setup script from the libferris distribution to create a new PostgreSQL database and tell libferris that it should use that index for full text and metadata searches. Each user can have a default full text and metadata index for performing searches with libferris.
Once libferris is set up you can use the findexadd and feaindexadd commands to populate the index. The first command updates only the full text information in the index, while the second updates only the file metadata. Running the commands below as the libferrissearch user populates the libferris indexes with all the files under /docs. If a file has not been modified since it was last indexed, libferris quickly skips over it, so the commands can be added to a cron job to keep the index up to date at little cost.
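The skip-if-unmodified behaviour is what makes a frequent cron run cheap. The following Python sketch illustrates the general idea only; it is not how libferris implements it. The function name `files_needing_reindex` and the use of a single "last indexed" timestamp for the whole tree are assumptions made for the example.

```python
import os
import tempfile
import time


def files_needing_reindex(root, last_indexed):
    """Return the paths under root modified after last_indexed.

    Illustration only: libferris keeps its own bookkeeping, and the
    single timestamp used here is an assumption of this sketch.
    """
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Files older than the last index run can be skipped cheaply.
            if os.path.getmtime(path) > last_indexed:
                stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    # Build a throwaway tree with one old and one freshly modified file.
    with tempfile.TemporaryDirectory() as docs:
        old = os.path.join(docs, "old.txt")
        new = os.path.join(docs, "new.txt")
        for path in (old, new):
            with open(path, "w") as handle:
                handle.write("sample\n")
        now = time.time()
        os.utime(old, (now - 3600, now - 3600))  # untouched for an hour
        os.utime(new, (now, now))                # just modified
        print(files_needing_reindex(docs, now - 60))
```

Only the files returned by such a check would need to be handed to the indexer again, which is why re-running the index commands over an unchanged tree finishes quickly.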
For this article I've populated /docs with some text files from Project Gutenberg, as well as the PDF file valgrind_manual.pdf from the Valgrind distribution. The following commands verify that the index can be used to find the documents. The final command shows that the Valgrind manual can be retrieved by its content, just like the text files.
We want our PHP code to be executed as the libferrissearch user. I use the mod_suphp Apache module to force this to happen. On a Fedora 8 machine you can install this module from the default repositories using yum. As some PHP code does not expect to be run as a different user, I tend to enable this module explicitly only for the directories where I want to use it. The commands below set up mod_suphp to operate in the http://localhost/libferrissearch directory, which I will use for the libferris search interface.
To turn off suPHP by default add the following to the end of the main HTML directory directive in /etc/httpd/conf/httpd.conf:
Once suPHP is off by default you can enable it by editing /etc/httpd/conf.d/mod_suphp.conf and uncommenting the following line:
You should then restart the Apache server. At this stage we have an Apache Web server that can use mod_suphp on directories which we have explicitly nominated. Now we can move on to setting up the libferrissearch directory and the PHP scripts. Inside the /var/www/html/libferrissearch directory we need to create three files: a PHP script to actually perform the search and return the result, an XSL stylesheet, and a main form page to let the user enter the query and see the results.
The first script is runquery-simple.php, which performs the heavy lifting. Some parameters the user can change are defined at the top of the script. I'll cover the stylesheet in a moment. The restriction can be one of filter, filter-10, or filter-100, with the latter two returning a maximum of 10 or 100 results respectively. The showea definition lists which metadata we want to see for each result. For information on the metadata that libferris makes available, see the libferris eadescriptions page. Having the parent-url in the results allows us to group files by the directory that contains them.
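To make the roles of the restriction and of parent-url concrete, here is a small Python sketch rather than the article's PHP. It assumes that plain filter applies no cap, and that each result is available as a dictionary keyed by metadata name. The names `RESTRICTION_LIMITS`, `result_limit`, and `group_by_parent`, and the `name` key, are hypothetical.

```python
from collections import defaultdict

# Result caps implied by each restriction name; None means "no cap".
# Treating plain "filter" as uncapped is an assumption of this sketch.
RESTRICTION_LIMITS = {"filter": None, "filter-10": 10, "filter-100": 100}


def result_limit(restriction):
    """Return the maximum number of results for a restriction name."""
    try:
        return RESTRICTION_LIMITS[restriction]
    except KeyError:
        raise ValueError(f"unknown restriction: {restriction!r}") from None


def group_by_parent(results):
    """Group result names by the directory (parent-url) that holds them."""
    groups = defaultdict(list)
    for result in results:
        groups[result["parent-url"]].append(result["name"])
    return dict(groups)


if __name__ == "__main__":
    sample = [
        {"name": "alice.txt", "parent-url": "/docs/gutenberg"},
        {"name": "valgrind_manual.pdf", "parent-url": "/docs"},
        {"name": "moby.txt", "parent-url": "/docs/gutenberg"},
    ]
    print(result_limit("filter-10"))
    print(group_by_parent(sample))
```

Rejecting unknown restriction names up front is a deliberate choice here: the value ends up in a query, so it is safer to accept only the three documented spellings.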
Next, the query itself is taken from a CGI parameter, and a query is formed using ferrisls and its --xml mode to obtain the result set as an XML file. In order to include a link to a custom stylesheet we pass the --hide-xml-declaration option to ferrisls so that the <?xml... declaration is left out of its output. This way the PHP code can emit the XML declaration itself and explicitly link to the stylesheet used to render the XML result set.
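The reason for suppressing the declaration is that an XML declaration must be the first thing in a document, so the stylesheet link has to be written after a declaration the script controls and before the result set. A minimal Python sketch of that wrapping step follows; the function name `with_stylesheet` and the UTF-8 encoding are assumptions, and the article's own script does this in PHP.

```python
from xml.sax.saxutils import quoteattr


def with_stylesheet(xml_body, stylesheet_href):
    """Prepend an XML declaration and an xml-stylesheet instruction.

    xml_body must not already begin with an XML declaration, which is
    why the search command is asked to leave it out of its output.
    """
    if xml_body.lstrip().startswith("<?xml "):
        raise ValueError("body already has an XML declaration")
    header = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'  # encoding is assumed
        '<?xml-stylesheet type="text/xsl" '
        f"href={quoteattr(stylesheet_href)}?>\n"
    )
    return header + xml_body


if __name__ == "__main__":
    # "<results/>" stands in for the real result set.
    print(with_stylesheet("<results/>", "xml-results-to-xhtml.xsl"))
```

A browser that receives the combined document fetches the stylesheet named in the processing instruction and applies it to the result set on the client side.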
The XSL file, xml-results-to-xhtml.xsl, which the above PHP links to, is shown below. The transform takes the XML output from ferrisls and creates an HTML document, complete with color-coding on alternate rows in the result set. The first template matches the top-level XML element and creates the bulk of the HTML document. The second template outputs a single result as a color-coded table row.
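The alternate-row logic is easy to see outside XSL. The Python sketch below performs the same kind of transformation on a made-up result set: the `results` and `result` element names, the `name` attribute, and the colors are all hypothetical and do not reflect the actual ferrisls output format.

```python
import html
import xml.etree.ElementTree as ET

ROW_COLORS = ("#ffffff", "#e8e8f8")  # arbitrary colors for even/odd rows


def rows_to_html(xml_text, row_tag="result"):
    """Render each row_tag element as a table row with alternating colors."""
    root = ET.fromstring(xml_text)
    rows = []
    for position, node in enumerate(root.iter(row_tag)):
        color = ROW_COLORS[position % 2]  # same role as position() mod 2
        name = html.escape(node.get("name", ""))
        rows.append(f'<tr bgcolor="{color}"><td>{name}</td></tr>')
    return "<table>\n" + "\n".join(rows) + "\n</table>"


if __name__ == "__main__":
    sample = (
        '<results>'
        '<result name="alice.txt"/>'
        '<result name="valgrind_manual.pdf"/>'
        '<result name="moby.txt"/>'
        '</results>'
    )
    print(rows_to_html(sample))
```

The outer function corresponds to the template for the top-level element, and the loop body to the per-result template that emits one colored row.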
These three scripts should go into /var/www/html/libferrissearch.
In the screen shot at right I have performed a full text search for "mad" on the file server.
There are many more possible uses for a PHP Web interface to libferris. Since libferris has the ability to compute cryptographic checksums such as MD5 and SHA1 you can include checksums in the index and later compare them against the current cryptographic checksum for files to detect file modifications or possible media errors. If you have geotagged files, such as JPEG images with GPS coordinates in them, you can create a "network link" endpoint for use in Google Earth.
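As an illustration of the checksum idea, the following Python sketch compares checksums recorded at index time against the files as they are now. It uses Python's standard hashlib rather than libferris, and the function names and the path-to-checksum dictionary are assumptions made for the example.

```python
import hashlib


def sha1_of_file(path, chunk_size=65536):
    """Return the SHA1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(indexed_checksums):
    """Return paths whose current SHA1 differs from the recorded one.

    indexed_checksums maps a path to the hex digest stored when the
    file was indexed (a hypothetical structure for this sketch).
    """
    return sorted(
        path
        for path, recorded in indexed_checksums.items()
        if sha1_of_file(path) != recorded
    )
```

A mismatch on a file whose modification time has not changed is the interesting case, since it points at silent corruption rather than an ordinary edit.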
Different libferris indexes can also be federated to form a single index. This is useful for allowing different storage and update policies for different parts of an index. For example, you could create a single index for manual pages that is updated only when new software is installed on the system.
Ben Martin has been working on filesystems for more than 10 years. He completed his Ph.D. and now offers consulting services focused on libferris, filesystems, and search solutions.