In the ongoing saga of "distributed is better"...

I want the documentation for whatever tool I'm using at the moment available on my local system, without having to be online. Most of the time, such documentation is published as HTML pages on some website. Making a mirror of such a site to check it offline is not really that hard (I often make such mirrors). But one of the nice features of those websites is that they also provide a really handy search, which of course my local mirror doesn't, because it depends on the server over there on the Internet.

The goal here is to have a local (as in local host) mechanism that provides a web search interface for a local mirror of a website. I should be able to type some text in a search form, get a listing of the most relevant matches, and see the corresponding pages, all without having to be online.

Fortunately, creating a local index for the mirror, and the corresponding web search interface for it, is not really that hard (for morlocks!). You have to do an initial installation and configuration, only once, and then a couple of things every time you want to create a new index for a new set of documents/pages.

Note: I'm assuming you use Linux, and more specifically Debian. If not, well, you should :P

Initial configuration

Install Apache, and Xapian Omega:

apt-get install apache2 xapian-omega

Now, create a few directories: one to hold the files we want to index, another for their indexes and another for Omega's config files:

DOC_DIR=/multimedia/documentation
CFG_DIR=/multimedia/omega
IDX_DIR=/multimedia/omega/indexes
mkdir -p "$DOC_DIR" "$CFG_DIR" "$IDX_DIR"

Configure Omega. Create the $CFG_DIR/omega_config file, with the following content:

# Directory containing Xapian databases: this is the value of $IDX_DIR
database_dir /multimedia/omega/indexes

# This value is valid for Debian installations
template_dir /usr/share/xapian-omega/templates

# Directory to write Omega logs to. Make sure that the user used to run
# Apache has write permissions there
log_dir /var/log/xapian-omega

# This value is valid for Debian installations
cdb_dir /var/lib/xapian-omega/cdb
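
If you'd rather not keep the paths in that file and the shell variables in sync by hand, the file can be generated from the variables. This is only a sketch: write_omega_config is a helper name I made up (it is not part of Omega), and the template, log and cdb paths are simply the ones listed above.

```shell
#!/bin/sh
# Sketch: write the Omega config file from the directories chosen earlier.
# write_omega_config is a hypothetical helper, not something Omega ships.
write_omega_config() {
    cfg_dir=${1%/}   # e.g. /multimedia/omega (a trailing slash is dropped)
    idx_dir=${2%/}   # e.g. /multimedia/omega/indexes
    mkdir -p "$cfg_dir" "$idx_dir" || return 1
    # The last three paths are copied from the config shown above.
    cat > "$cfg_dir/omega_config" <<EOF
database_dir $idx_dir
template_dir /usr/share/xapian-omega/templates
log_dir /var/log/xapian-omega
cdb_dir /var/lib/xapian-omega/cdb
EOF
}
```

Calling write_omega_config "$CFG_DIR" "$IDX_DIR" should leave you with the same four settings, minus the comments.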

Now configure Apache. This consists mostly of telling Apache where to find the CGI script that performs the search, telling that script where its configuration is via an environment variable, and telling Apache where the documents it should serve (our local mirrors) are. Create a file named /etc/apache2/sites-available/omega, with the following content:

<VirtualHost *:80>
    # Change this to a proper value
    ServerAdmin admin@localhost

    ServerName s.home.org
    DefaultType text/html

    # this is $CFG_DIR/omega_config; adjust appropriately
    SetEnv OMEGA_CONFIG_FILE /multimedia/omega/omega_config

    # this is $DOC_DIR; adjust appropriately
    DocumentRoot /multimedia/documentation
    <Directory />
        Options FollowSymLinks
        AllowOverride None
    </Directory>

    # this is $DOC_DIR; adjust appropriately
    <Directory /multimedia/documentation>
        Options Indexes FollowSymLinks MultiViews
        AllowOverride None
        Order allow,deny
        Allow from all
    </Directory>

    ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/
    <Directory "/usr/lib/cgi-bin">
        AllowOverride None
        Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch
        Order allow,deny
        Allow from all
    </Directory>

    ErrorLog ${APACHE_LOG_DIR}/error.log
    LogLevel warn
    CustomLog ${APACHE_LOG_DIR}/access.log combined

</VirtualHost>

Now you must make s.home.org point to the local machine. Edit the /etc/hosts file and add the following line:

127.0.1.1   s.home.org  s

Now enable the new Apache site:

a2ensite omega
/etc/init.d/apache2 restart

Finally, you need a page with a proper search form. Create the $DOC_DIR/index.html file with the following text:

<html>
    <head>
        <meta http-equiv="content-type" content="text/html; charset=UTF-8">
    </head>
    <body>
        <div>
            <form action="/cgi-bin/omega/omega" target="_top" method="GET">
                <input type="search" name="P" value="" size="15">
                <input type="hidden" name="DEFAULTOP" value="and">
                <input type="hidden" name="xFILTERS" value="--O">
                <input type="SUBMIT" value="Search">
            </form>
        </div>
    </body>
</html>

That's the initial configuration.

Adding a website

I'll illustrate the process of adding a website using the Django documentation.

First of all, you should fetch the documentation to your computer. Usually wget would be the tool to use (something like wget -r -np -nc -p -k <your-site>), but in this case, we can just download a zip file with all the documentation: https://www.djangoproject.com/m/docs/django-docs-1.3-en.zip. Fetch that file, and extract it to $DOC_DIR/django:

mkdir $DOC_DIR/django
cd $DOC_DIR/django
wget https://www.djangoproject.com/m/docs/django-docs-1.3-en.zip
unzip django-docs-1.3-en.zip
rm django-docs-1.3-en.zip

Now, create the index for those files. Execute the following command:

omindex --mime-type=:text/html --db $IDX_DIR/django --url /django/ $DOC_DIR/django

That command does the following:

  • Indexes all the files in $DOC_DIR/django.
  • Saves all the files required for the index in a new Xapian "database": $IDX_DIR/django.
  • Configures the search results to always prepend /django/ to the URL of each result, so that it matches the DocumentRoot we defined in the Apache configuration and the directory we created there ($DOC_DIR/django).
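
Since every mirror follows the same pattern (a directory under $DOC_DIR, a database with the same name under $IDX_DIR), the command can be wrapped in a loop. The following is a rough sketch, not a polished tool: index_all and the OMINDEX variable are names I invented for illustration, and the omindex flags are exactly the ones used above.

```shell
#!/bin/sh
# Sketch: build (or refresh) one index per mirror found directly under the
# documentation root. index_all and OMINDEX are hypothetical names; setting
# OMINDEX=echo prints the commands instead of running them.
index_all() {
    doc_dir=${1%/}
    idx_dir=${2%/}
    for site in "$doc_dir"/*/; do
        [ -d "$site" ] || continue      # no subdirectories: skip the literal glob
        name=$(basename "$site")
        # Same flags as the single-site command, with the site name filled in.
        ${OMINDEX:-omindex} --mime-type=:text/html \
            --db "$idx_dir/$name" --url "/$name/" "$doc_dir/$name" || return 1
    done
}
```

You would run it as index_all "$DOC_DIR" "$IDX_DIR". Plain files in $DOC_DIR, such as index.html, are skipped because the loop only looks at directories.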

That's all you have to do: fetch the docs, put them somewhere, create an index for them. You can now visit http://s.home.org and search those documents.

However, there is still some room for improvement: the form on that page doesn't say which database to search, so you can't limit a query to one of the websites you have indexed. To pick one, pass a DB GET parameter to the CGI script. The value for this parameter can be any of the directory names available at $IDX_DIR.

With this in mind, you could add a select field to the form at http://s.home.org and add a new entry every time you add a new website, or turn that page into a script that reads all the directories in $IDX_DIR on each request.
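
A middle ground is to regenerate the form whenever you add an index. The sketch below prints the same form as before plus a select named DB, with one option per directory in the index directory. make_search_form is a made-up helper name, and I'm assuming the directory names are plain words that need no HTML escaping.

```shell
#!/bin/sh
# Sketch: print the search form with one <option> per database directory.
# make_search_form is a hypothetical helper; directory names are assumed to
# be simple (no characters that would need HTML escaping).
make_search_form() {
    idx_dir=${1%/}
    echo '<form action="/cgi-bin/omega/omega" target="_top" method="GET">'
    echo '    <input type="search" name="P" value="" size="15">'
    echo '    <select name="DB">'
    for db in "$idx_dir"/*/; do
        [ -d "$db" ] || continue        # no databases yet: skip the literal glob
        name=$(basename "$db")
        printf '        <option value="%s">%s</option>\n' "$name" "$name"
    done
    echo '    </select>'
    echo '    <input type="hidden" name="DEFAULTOP" value="and">'
    echo '    <input type="hidden" name="xFILTERS" value="--O">'
    echo '    <input type="submit" value="Search">'
    echo '</form>'
}
```

Run make_search_form "$IDX_DIR" and paste the output over the form in $DOC_DIR/index.html.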

Or, if you are lazy like me, you can just add a custom search to your browser, one for each database you want to use. With Chromium, it goes like this: Preferences -> Manage Search Engines -> Other search engines, then add a search engine for django, with the keyword dj and the URL http://s.home.org/cgi-bin/omega/omega?P=%s&DB=django&DEFAULTOP=and&xFILTERS=--O. Note both the DB=django and the P=%s GET arguments. With this in place, you can type dj <whatever you want to search for> in Chromium's address bar, and jump straight to the results.
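
To avoid typing those URLs by hand, something like the following could print one search URL per database, ready to paste into the browser's search engine settings. It is just a sketch: search_engine_urls is a name I made up, and the host and parameters are the ones from the example above.

```shell
#!/bin/sh
# Sketch: print one browser search URL per database in the index directory.
# search_engine_urls is a hypothetical helper; s.home.org and the query
# parameters come from the setup described above.
search_engine_urls() {
    idx_dir=${1%/}
    for db in "$idx_dir"/*/; do
        [ -d "$db" ] || continue        # no databases yet: skip the literal glob
        name=$(basename "$db")
        # %%s emits a literal %s, the placeholder the browser replaces.
        printf 'http://s.home.org/cgi-bin/omega/omega?P=%%s&DB=%s&DEFAULTOP=and&xFILTERS=--O\n' "$name"
    done
}
```

Run search_engine_urls "$IDX_DIR" and pick a keyword for each line yourself.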

There you go: lightning-fast, offline access and search in your documentation.

Addenda

There are tools that do precisely this, like dwww, or doc-central, tailored to index all the documentation available in your system. I tried some of them, and found them inadequate: I either didn't like the search interface (which forced me to use words of more than 3 characters (!)), or the way of adding new documents. But those might be useful to you, so I'm mentioning them.