In the ongoing saga of "distributed is better"...
I want the documentation for whatever tool I'm using at the moment available on my local system, without having to be online. Most of the time, such documentation is available as HTML pages on some website. Mirroring such a site to read it offline is not really that hard (I often do it). But one of the nice features of such websites is that they also provide a really handy search feature, which of course my local mirror doesn't provide, because it depends on the server over there on the Internet.
The goal here is to have a local (as in localhost) mechanism that provides a web search interface for a local mirror of a website. I should be able to type some text in a search form, get a listing of the most relevant matches, and see the corresponding pages. Without having to be online.
But creating a local index for the mirror, and the corresponding web search interface for it, is not really that hard either (for morlocks!). You have to perform an initial installation and configuration only once, and then a couple of things every time you want to create a new index for a new set of documents/pages.
Note: I'm assuming you use Linux, and more specifically Debian. If not, well, you should :P
Initial configuration
Install Apache, and Xapian Omega:
apt-get install apache2 xapian-omega
Now, create a few directories: one to hold the files we want to index, another for their indexes and another for Omega's config files:
DOC_DIR=/multimedia/documentation
CFG_DIR=/multimedia/omega/
IDX_DIR=/multimedia/omega/indexes
mkdir -p $DOC_DIR
mkdir -p $CFG_DIR
mkdir -p $IDX_DIR
Configure Omega. Create the $CFG_DIR/omega_config file with the following content:
# Directory containing Xapian databases: this is the value of $IDX_DIR
database_dir /multimedia/omega/indexes
# This value is valid for Debian installations
template_dir /usr/share/xapian-omega/templates
# Directory to write Omega logs to. Make sure that the user used to run
# Apache has write permissions there
log_dir /var/log/xapian-omega
# This value is valid for Debian installations
cdb_dir /var/lib/xapian-omega/cdb
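If you'd rather not type the index path by hand, the same file can be generated from the variables defined earlier. This is only a sketch: write_omega_config is a name I made up for illustration, and the template, log and cdb paths are simply the Debian ones listed above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Omega): write an Omega config file
# pointing at the given index directory.
# Usage: write_omega_config <index-dir> <output-file>
write_omega_config() {
    idx_dir=$1
    out_file=$2
    # Only $idx_dir is expanded inside the here-document.
    cat > "$out_file" <<EOF
# Directory containing Xapian databases
database_dir $idx_dir
# The values below are valid for Debian installations
template_dir /usr/share/xapian-omega/templates
log_dir /var/log/xapian-omega
cdb_dir /var/lib/xapian-omega/cdb
EOF
}
```

With the variables from above, the call would be write_omega_config "$IDX_DIR" "$CFG_DIR/omega_config".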
Now configure Apache. This mostly consists of telling Apache where to find the CGI script that performs the search, passing that script the location of its configuration file via an environment variable, and telling Apache where the documents it should serve (our local mirrors) are. Create a file named /etc/apache2/sites-available/omega with the following content:
<VirtualHost *:80>
# Change this to a proper value
ServerAdmin admin@localhost
ServerName s.home.org
DefaultType text/html
# this is $CFG_DIR/omega_config; adjust appropriately
SetEnv OMEGA_CONFIG_FILE /multimedia/omega/omega_config
# this is $DOC_DIR; adjust appropriately
DocumentRoot /multimedia/documentation
<Directory />
Options FollowSymLinks
AllowOverride None
</Directory>
# this is $DOC_DIR; adjust appropriately
<Directory /multimedia/documentation>
Options Indexes FollowSymLinks MultiViews
AllowOverride None
Order allow,deny
allow from all
</Directory>
ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/
<Directory "/usr/lib/cgi-bin">
AllowOverride None
Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch
Order allow,deny
Allow from all
</Directory>
ErrorLog ${APACHE_LOG_DIR}/error.log
LogLevel warn
CustomLog ${APACHE_LOG_DIR}/access.log combined
</VirtualHost>
Now you must make s.home.org point to the local machine. Edit the /etc/hosts file and add the following line:
127.0.1.1 s.home.org s
Now enable the new Apache site:
a2ensite omega
/etc/init.d/apache2 restart
Finally, you need a page with a proper search form. Create the $DOC_DIR/index.html file with the following content:
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
<div>
<form action="/cgi-bin/omega/omega" target="_top" method="GET">
<input type="search" name="P" value="" size="15">
<input type="hidden" name="DEFAULTOP" value="and">
<input type="hidden" name="xFILTERS" value="--O">
<input type="SUBMIT" value="Search">
</form>
</div>
</body>
</html>
That's the initial configuration.
Adding a website
I'll illustrate the process of adding a website using the Django documentation.
First of all, you should fetch the documentation to your computer. Usually wget would be the tool to use (something like wget -r -np -nc -p -k <your-site>), but in this case we can just download a zip file with all the documentation: https://www.djangoproject.com/m/docs/django-docs-1.3-en.zip. Fetch that file and extract it to $DOC_DIR/django:
mkdir $DOC_DIR/django
cd $DOC_DIR/django
wget https://www.djangoproject.com/m/docs/django-docs-1.3-en.zip
unzip django-docs-1.3-en.zip
rm django-docs-1.3-en.zip
Now, create the index for those files. Execute the following command:
omindex --mime-type=:text/html --db $IDX_DIR/django --url /django/ $DOC_DIR/django
That command does the following:
- Indexes all the files in $DOC_DIR/django.
- Saves all the files required for the index in a new Xapian "database": $IDX_DIR/django.
- Configures the search results to always prepend /django/ to the URL of each result, so it matches with the DocumentRoot we defined for the Apache configuration, and the directory we created there ($DOC_DIR/django).
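Since every mirror follows the same pattern (files in $DOC_DIR/<name>, index in $IDX_DIR/<name>, URL prefix /<name>/), the command above can be repeated for all your mirrors in a loop. A rough sketch, assuming each subdirectory of $DOC_DIR is one mirror; index_all_mirrors and the OMINDEX override are names invented here, not something omindex provides:

```shell
#!/bin/sh
# Hypothetical helper: run omindex once per subdirectory of the
# documentation directory, naming each database after the subdirectory.
# Usage: index_all_mirrors <doc-dir> <index-dir>
index_all_mirrors() {
    doc_dir=${1%/}      # drop a trailing slash, if any
    idx_dir=${2%/}
    for site in "$doc_dir"/*/; do
        # With no subdirectories the glob stays literal; skip it.
        [ -d "$site" ] || continue
        name=$(basename "$site")
        # OMINDEX is an assumption of this sketch: set it to "echo"
        # to see the arguments instead of running omindex.
        "${OMINDEX:-omindex}" --mime-type=:text/html \
            --db "$idx_dir/$name" --url "/$name/" "$doc_dir/$name"
    done
}
```

Calling index_all_mirrors "$DOC_DIR" "$IDX_DIR" would then refresh every index in one go. Setting OMINDEX=echo first prints the arguments each omindex call would receive, which is handy for a dry run.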
That's all you have to do: fetch the docs, put them somewhere, and create an index for them. You can now visit http://s.home.org and search those documents.
However, there is still some room for improvement: the search on that page checks all the databases, which is to say, all the websites you might have indexed. You can narrow the scope of the search to a single database by passing a DB GET parameter to the CGI script. The value of this parameter can be any of the directory names available in $IDX_DIR.
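For instance, a query restricted to the django database is just a URL with DB=django in it. Here is a small sketch that builds such a URL from the shell. search_url is a hypothetical helper, the other parameters are copied from the search form above, and it only encodes spaces (as +), so anything fancier in the query is on you:

```shell
#!/bin/sh
# Hypothetical helper: print the Omega search URL for one database.
# Usage: search_url <database> <search terms...>
# Only spaces are encoded (as '+'); other special characters are
# passed through untouched.
search_url() {
    db=$1
    shift
    query=$(printf '%s' "$*" | tr ' ' '+')
    printf 'http://s.home.org/cgi-bin/omega/omega?P=%s&DB=%s&DEFAULTOP=and&xFILTERS=--O\n' \
        "$query" "$db"
}
```

Something like xdg-open "$(search_url django model forms)" should then open the results for "model forms" in your browser.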
With this in mind, you could add a select field to the form at http://s.home.org and add a new entry every time you add a new website, or turn that page into a script that reads all the directories in $IDX_DIR on each request.
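The select field can be semi-automated. Something along these lines prints the element with one option per directory in $IDX_DIR. db_select is a made-up name, and the sketch assumes the directory names need no HTML escaping:

```shell
#!/bin/sh
# Hypothetical helper: print a <select name="DB"> element with one
# <option> per database directory under the given index directory.
# Usage: db_select <index-dir>
db_select() {
    idx_dir=${1%/}      # drop a trailing slash, if any
    echo '<select name="DB">'
    for db in "$idx_dir"/*/; do
        # With no subdirectories the glob stays literal; skip it.
        [ -d "$db" ] || continue
        name=$(basename "$db")
        printf '  <option value="%s">%s</option>\n' "$name" "$name"
    done
    echo '</select>'
}
```

Run db_select "$IDX_DIR" after adding a new index and paste the output inside the form in $DOC_DIR/index.html.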
Or, if you are lazy like me, you can just add a custom search to your browser, one for each database you want to use. With Chromium, it goes like this: Preferences -> Manage Search Engines -> Other search engines, and add a search engine for django, with the keyword dj and the URL http://s.home.org/cgi-bin/omega/omega?P=%s&DB=django&DEFAULTOP=and&xFILTERS=--O
Note both the DB=django and the P=%s GET arguments. With this in place, you can type dj <whatever you want to search for> in Chromium's address bar and jump straight to the results.
There you go: lightning-fast, offline access and search in your documentation.
Addenda
There are tools that do precisely this, like dwww or doc-central, tailored to index all the documentation available on your system. I tried some of them and found them inadequate: I either didn't like the search interface (which forced me to use words of more than 3 characters (!)) or the way of adding new documents. But those might be useful to you, so I'm mentioning them.