Why a project switched from Google Search Appliance to Zend_Lucene

I repost some of my blog posts made @ liip. Please see here for the original post and comments: http://blog.liip.ch/archive/2011/01/13/why-a-project-switched-from-googl...

Google technology does a good job when searching the wild and treacherous realms of the public internet. However, the commercial Google Search Appliance (GSA) sold for searching intranet websites did not convince me at all. For a client, we first had to integrate the GSA; later we reimplemented the search with Zend_Lucene. Here are some thoughts comparing the two search solutions.

This post became rather lengthy. If you just want the summary of my pros and cons for GSA versus Lucene, scroll right to the end :-)

In a project we got to take over, the customer had already bought a GSA (the "cheap" one - only about $20'000). The client had a list of wishes for how to optimally integrate the appliance into his web sites:

  • Limit access to authorized users
  • Index all content in all languages
  • Filter content by target group (information present as META in the HTML headers)
  • Show a box with results from their employee directory

GSA Software

The GSA had problems with most of those requests.

When you activate access protection, the GSA makes a HEAD request for each of the first 20 or so search results on every single search request, to check whether that user has the right to see that document. As our site has no individual visibility requirements, we did not need that. But there is no way to deactivate this check, which results in unnecessary load on the web server. We ended up catching the GSA HEAD request quite early and just sending a Not Modified response without looking further into the request.
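
A minimal sketch of such an early short-circuit could look like the following. The function name gsaShortCircuitStatus and the 'gsa-crawler' User-Agent marker are assumptions for illustration, not taken from our actual code; check what your appliance really sends before relying on it.

```php
<?php
// Hypothetical helper: decide whether a request can be answered right away.
// ASSUMPTION: the appliance identifies itself with a User-Agent containing
// 'gsa-crawler'. Verify this against your own access logs.
function gsaShortCircuitStatus($method, $userAgent, $marker = 'gsa-crawler')
{
    if (strtoupper($method) !== 'HEAD') {
        return null; // not a HEAD probe, handle the request normally
    }
    if (stripos($userAgent, $marker) === false) {
        return null; // HEAD request from some other client
    }
    return 304; // answer with Not Modified, skip all further processing
}

// Usage as early as possible in the front controller
// (skipped on the command line, where there is no HTTP request).
if (PHP_SAPI !== 'cli' && isset($_SERVER['REQUEST_METHOD'])) {
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if (gsaShortCircuitStatus($_SERVER['REQUEST_METHOD'], $ua) === 304) {
        header('HTTP/1.1 304 Not Modified');
        exit;
    }
}
```

Keeping the decision in a small function without side effects makes it easy to test separately from the framework bootstrap.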

The GSA completely ignores the language declaration (whether in META or in the attribute or inside the HTML head) and uses its own heuristics. This might be fine for the public Internet, where you can assume that many sites declare their content to be in the server installation language even if it is not, but in a controlled environment we can make sure those headers are correct. We talked to Google support about this, but they could only confirm that it is not possible. This was annoying, as the heuristics were sometimes wrong, for example when some part of a page's content was in another language.

The spider component stumbled over some bugs in the web site we needed to index. We found that the same parameter got repeated over and over in a URL. Those cycles led to the same page being indexed many times and to the limit of 500'000 indexed pages filling up. This is of course a bug in the web server, but we found no way to help the GSA not to stumble over it.

Filtering by meta information would work. But we have binary documents like PDF, Word and so on, and there was no way to set the meta information for those documents. requiredfields=gsahintview:group1|-gsahintview should trigger a filter that says: either we have the meta information with a specific value, or no meta at all. However, Google confirmed that this combination of filter expressions is not possible. They updated their documentation to at least explain the restrictions.

The only thing that really worked without hassle was the search box. You can configure the GSA to request data from the web server and return an XML fragment that is integrated into the search result page.

Support by Google was a very positive aspect. They answered fast and without fuss, and were motivated to help. They seemed competent - so I guess when they did not propose alternatives but simply said there is no such feature, there really was no alternative for our feature requests.

GSA Hardware

The Google hardware however was a real nuisance. You get the appliance as a standard sized server to put into the rack. Having the hardware locally makes sense. It won't use external bandwidth for indexing and you can be more secure about your confidential data. But during the 2 years we used the GSA, there were 3 hardware failures. As part of the setup test, our hoster checks if the system works properly by unplugging the whole system. While this is not good for data of course, the hardware should survive that. The GSA did not and had to be sent for repair. There were two more hardware issues - one was simply a RAM module signaling an error. But as the hoster is not allowed to open the box, even such a simple repair took quite a while. Our client did not want to buy more than one appliance for his system, as they are rather expensive. So you usually do not have a replacement ready. With any other server, the hoster can fix the system rather fast or in the worst case just re-install the system from backups. With the GSA there is no such redundancy.

The GSA is not only closed at the hardware level. You also do not have shell access to the system, so all configuration has to be done in the web interface. Versioning of that configuration can only be done by exporting and potentially re-importing the complete configuration. I like to have all relevant stuff in version control for better tracking.

Zend Lucene

The GSA license is for two years. After that period, another 20-something thousand dollars has to be paid if you want to keep support. At that point, we discussed the situation with our client and decided to invest a bit more than the license cost and move to an environment where we have more control and redundancy. The new search uses the Zend_Lucene component to build indexes. As everything is PHP here, the indexer uses the website framework itself to render the pages and build the indexes.

  • We run separate instances of the process for each web site and each language, each building one index. In the beginning we had one script to build all indexes, but a PHP script running for over 24 hours was not very reliable - and we wanted to use the power of the multicore machine, as each PHP instance is single threaded. Analyzing text with Lucene is rather CPU intensive.
  • We did not want to touch existing code that changes content. We did not want to risk breaking normal operations in case something is wrong with Lucene. Every hour, a cronjob looks for new or changed documents to update the index. Every weekend, all indexes are rebuilt and - after a sanity check - replace the old indexes. Deletion of content does not trigger Lucene either. Until the index is rebuilt, the result page generation will just ignore result items that no longer exist in the database.
  • For documents, we use Linux programs to convert the file into plain text that is analyzed by Lucene (see code below). The exception is docx and friends (the new XML formats of Microsoft Office 2007), which are natively supported.
    • .msg, .txt, .html: cat
    • .doc, .dot: antiword (worked better than catdoc)
    • .rtf: catdoc
    • .ppt: catppt (part of catdoc package)
    • .pdf: pdftotext (part of xpdf)
    • We ignore zip files, although PHP would allow us to open them.
  • All kinds of meta information can be specified during indexing. This solves the language specification issue. As the database knows about the document languages, even binary documents are indexed in the correct language.
  • The indexes are copied to each server (opening them over the shared NFS file server is not possible as Zend_Lucene wants to lock the files and NFS does not support that). This provides redundancy in case a server crashes. And the integration test server can run its own copy and index the test database.
  • We were able to fine-tune ranking relevance based on type and age of content.
  • To improve finding similar words, we used stemming filters. We chose php-stemmer and are quite happy with it.
  • If we run into performance problems, we could switch to the Java Lucene for handling search requests, as the binary index format is compatible between Zend_Lucene and Java Lucene.
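
The point above about ignoring result items that no longer exist in the database can be sketched roughly as follows. The function name dropStaleHits and the array shape of the hits are assumptions for illustration (real Zend_Lucene hits are objects), and the titles are made-up sample data.

```php
<?php
// Hypothetical helper: keep only the hits whose document still exists.
// ASSUMPTION: each hit is an array with an 'id' key, and $existingIds is
// the list of ids the database currently knows about.
function dropStaleHits(array $hits, array $existingIds)
{
    $known = array_flip($existingIds); // id => position, for fast lookups
    $fresh = array();
    foreach ($hits as $hit) {
        if (isset($known[$hit['id']])) {
            $fresh[] = $hit; // ranking order of the search result is preserved
        }
    }
    return $fresh;
}

// Sample data: document 40 was deleted after the last index rebuild.
$hits = array(
    array('id' => 12, 'title' => 'Holiday policy'),
    array('id' => 40, 'title' => 'Deleted page'),
    array('id' => 7,  'title' => 'Expense form'),
);
$visible = dropStaleHits($hits, array(7, 12, 99));
```

Filtering at display time keeps the content-editing code free of any Lucene calls, at the cost of a result page that may occasionally show fewer items than the hit count suggests.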

Indexing about 50'000 documents takes about a full day, running parallel scripts and keeping the CPU cores pretty busy. But our webservers are bored over the weekend anyway. If this were an issue, we could buy a separate server for searching, as you have in the case of the GSA. The hardware of that server would probably be more reliable and could be fixed by our hoster.
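
As a rough illustration of running one indexer process per web site and language, the commands could be planned like this. The script name build_index.php, its --site and --lang options and the site names are all hypothetical; our real scripts are wired into the website framework.

```php
<?php
// Hypothetical planner: one indexer command per (site, language) pair.
// ASSUMPTION: a script 'build_index.php' exists that accepts --site and
// --lang and builds exactly one index per invocation.
function planIndexJobs(array $sites, array $languages, $script = 'build_index.php')
{
    $jobs = array();
    foreach ($sites as $site) {
        foreach ($languages as $lang) {
            $jobs[] = 'php ' . escapeshellarg($script)
                . ' --site=' . escapeshellarg($site)
                . ' --lang=' . escapeshellarg($lang);
        }
    }
    return $jobs;
}

// Sample site and language names, two of each give four independent jobs.
$jobs = planIndexJobs(array('intranet', 'extranet'), array('de', 'fr'));

// Each command can then be started as its own process, for example from
// separate cron entries, so that several CPU cores are used at once.
```

Because each job builds exactly one index, a crash in one process only affects that index, and the others still finish.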

The resulting indexes are only a couple of megabytes. So even though Zend_Lucene has to load the index file for each search request, it is quite fast. Loading the index takes about 50ms of the request. I assume the file system cache keeps the file in memory.

Zend_Lucene worked out quite well for us, although today I would probably use Apache Solr to save some work, especially for reading documents and for stemming.

Code fragment for reading binary files as plain text:

$map = array('ppt' => 'catppt %filename% 2>/dev/null',
             'pdf' => 'pdftotext -enc UTF-8 %filename% - 2>/dev/null', //the "-" tells to output to stdout
             'txt' => 'cat %filename% 2>/dev/null',
             // ... (further file types omitted here)
             );

if (! file_exists($filename))
    throw new Exception("File does not exist: '$filename'");

$type = pathinfo($filename, PATHINFO_EXTENSION);
if (! isset($map[$type]))
    throw new Exception("Unsupported document type: '$type'");

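// escape the path before it goes into the shell command
$filename = escapeshellarg($filename);
$cmd = str_replace('%filename%', $filename, $map[$type]);
$output = array(); $status = 0;
// $output receives one array entry per line of the converter's stdout
exec($cmd, $output, $status);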
if ($status != 0)
    throw new Exception("Converting $filename: exit status $status");

return implode("\n", $output);

Conclusions

Google Search Appliance

Pro:

+ Reputation with client and acceptance by users as it's a known brand

+ Good ranking algorithms for text, and it handles stemming

+ Responsive and helpful support

Con:

- Closed "black box" system

- You are not allowed to fix the hardware yourself

- No redundancy unless you buy several boxes

- Missing options to tailor to our use case (use HTML language information, request pages, filter flexibility)

- Significant price tag for the license, plus still quite some work to customize the GSA and adapt your own systems

Zend_Lucene

Pro:

+ Very flexible to build exactly what we needed

+ The problematic framework caused fewer problems, as we can iterate over content lists instead of parsing URLs to spider the site

+ Well documented and there is a community (not much experience as we did not have questions)

+ No arbitrary limitations on number of pages in an index.

+ Proved reliable for two years now

+ If performance ever becomes an issue, we can switch to Java Lucene while keeping the php indexer

Con:

- In-depth programming needed

- Thus more risk for bugs

- More to build by hand than with the GSA - but for us the effort was less than the license costs plus the customization of the web system to play well with the GSA.