Latest: Genstatic, my first sip of coffee

Content with Style

Web Technique

Fulltext searches with Xapian and PHP

by Pascal Opitz on November 3 2009, 16:35

Sometimes MySQL just isn't quick enough. Especially when it comes to fulltext searches. Everything needs to be indexed correctly, and if we're using different fields with different weights for a relevance percentage, things get very complicated quickly. Xapian to the rescue.

What is Xapian?

Xapian is a Search Engine Library, similar to Lucene and Sphinx. It's compiled from C++ code and therefore pretty low level. There are PHP, Perl and Python bindings available for it, which are straight forward to use. Packages are available for Ubuntu and Red Hat, it compiles on OSX and you can run it on Windows via CygWin.

Demo Scripts

Rather than explaining why and how, I decided to create some demo scripts instead. The XapianWrapper PHP class I've created is quite large, so feel free to download the example zip file as well.

db.sql


CREATE DATABASE `demo`;

CREATE TABLE `demo`.`demo` (
`id` INT( 10 ) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`unique_key` VARCHAR( 255 ) NOT NULL ,
`name` VARCHAR( 255 ) NULL DEFAULT NULL ,
`summary` TEXT NULL DEFAULT NULL ,
`date` DATETIME NULL DEFAULT NULL ,
UNIQUE (`unique_key`));

INSERT INTO `demo`.`demo` 
(`id`, `unique_key`, `name`, `summary`, `date`)
VALUES (NULL, 'foo', 'foo', 'foo bar test', '2008-11-05 00:00:00'), 
(NULL , 'bar', 'bar', 'test foo bar', '2009-11-05 00:00:00');        

XapianWrapper.php


<?php
// includes
require_once 'xapian.php';

// main class
class XapianWrapper {
  const XAPIAN_FIELD_URL = 0;
  const XAPIAN_FIELD_NAME = 1;
  const XAPIAN_FIELD_DATE = 2;
  const XAPIAN_FIELD_UID = 3;
  const XAPIAN_FIELD_SUMMARY = 4;
  const XAPIAN_PREFIX_UID = "UID:";
  
  const SETTINGS_XAPIAN_DB = './xapian_db';

  const SETTINGS_MYSQL_HOST = 'localhost';
  const SETTINGS_MYSQL_USER = 'root';
  const SETTINGS_MYSQL_PASS = 'root';
  const SETTINGS_MYSQL_DB = 'demo';
  const SETTINGS_MYSQL_TABLE = 'demo';

  const DEFAULT_COUNT = 10;

  private $mysql_link;
  private $category_cache;
  
  private $xapian_read_db;
  private $xapian_write_db;
  private $xapian_stemmer;
  private $xapian_indexer;
  private $xapian_enquire;

  private function xapian_init_readonly() {
    try{
      $this->xapian_read_db = new XapianDatabase(self::SETTINGS_XAPIAN_DB);
      $this->xapian_stemmer = new XapianStem("english");
      $this->xapian_enquire = new XapianEnquire($this->xapian_read_db);
    } catch(Exception $e) {
      throw new Exception('Could initialize Xapian: ' . $e->getMessage());
    } 
  }
  
  private function xapian_init_writable() {
    try{
      $this->xapian_write_db = new XapianWritableDatabase(self::SETTINGS_XAPIAN_DB, Xapian::DB_CREATE_OR_OPEN);  
      $this->xapian_indexer = new XapianTermGenerator();
      $this->xapian_stemmer = new XapianStem("english");
      $this->xapian_indexer->set_stemmer($this->xapian_stemmer);
    } catch(Exception $e) {
      throw new Exception('Could initialize Xapian: ' . $e->getMessage());
    } 
  }
  
  private function mysql_init() {
    $this->mysql_link = mysql_connect(self::SETTINGS_MYSQL_HOST, self::SETTINGS_MYSQL_USER, self::SETTINGS_MYSQL_PASS);
    if (!$this->mysql_link) {
      throw new Exception('Could not connect: ' . mysql_error());
    }

    $db_selected = mysql_select_db(self::SETTINGS_MYSQL_DB, $this->mysql_link);
    if (!$db_selected) {
      throw new Exception('Can\'t use db : ' . mysql_error());
    }
  }
  
  /**
   * Index method
   *
   */
  public function index($params) {
    $this->xapian_init_writable();
    $this->mysql_init();
    
    $start = microtime(true);

    $response = new stdClass();
    $response->indexed = array();

    $offset = (isset($params['offset'])) ? intval($params['offset']) : 0;
    $count = (isset($params['count'])) ? intval($params['count']) : self::DEFAULT_COUNT;
    $sql = 'SELECT * FROM '.self::SETTINGS_MYSQL_TABLE.' LIMIT ' . $offset . ', ' . $count . ';';

    $result = mysql_query($sql);

    if (!$result) {
      throw new Exception('Invalid query: ' . mysql_error());
    }
    
    $this->xapian_write_db->begin_transaction();

    while ($row = mysql_fetch_array($result, MYSQL_ASSOC)) {
       $response->indexed[] = $this->index_row($row);      
    }

    $this->xapian_write_db->commit_transaction();
    mysql_free_result($result);
    mysql_close($this->mysql_link);

    return $response;
  }
    

  private function index_row($row) {
    $doc = new XapianDocument();

    $this->xapian_indexer->set_document($doc);
    $this->xapian_indexer->index_text($row['name'],50);
    $this->xapian_indexer->index_text($row['summary'], 1);

    $GUID = self::XAPIAN_PREFIX_UID . $row['unique_key'];
    $doc->add_term($GUID);

    $doc->add_value(self::XAPIAN_FIELD_URL, $row['url']);
    $doc->add_value(self::XAPIAN_FIELD_DATE, date('Ymd', strtotime($row['date'])));
    $doc->add_value(self::XAPIAN_FIELD_UID, $row['unique_key']);
    $doc->add_value(self::XAPIAN_FIELD_NAME, $row['name']);
    $doc->add_value(self::XAPIAN_FIELD_SUMMARY, $row['summary']);
    
    $this->xapian_write_db->replace_document(strval($GUID), $doc);

    $row_response = array();
    $row_response['name'] = $row['name'];
    $row_response['guid'] = $row['unique_key'];
    $row_response['url'] = $row['url'];
    return $row_response; 
  }
  
  /**
   * Delete method
   *
   */
  public function delete($params) {
    $this->xapian_init_writable();

    $this->xapian_write_db->begin_transaction();

    $response = array();

    foreach($params['items'] as $param_guid) {      
      $GUID = self::XAPIAN_PREFIX_UID . $param_guid;
      $this->xapian_write_db->delete_document(strval($GUID));
      $response[] = $param_guid;
    }
    
    $this->xapian_write_db->commit_transaction();
    return $response;
  }

  /**
   * Search method
   *
   */
  public function search($params) {
    $this->xapian_init_readonly();

    $start = microtime(true);

    // queries array to later construct full query
    $arr_queries = array();

    // from date
    if(!empty($params['date_from'])) {
      $arr_queries[] = new XapianQuery(XapianQuery::OP_VALUE_GE, 6, date('Ymd', strtotime($params['date_from'])));
    }

    // to date
    if(!empty($params['date_to'])) {
      $arr_queries[] = new XapianQuery(XapianQuery::OP_VALUE_LE, 6, date('Ymd', strtotime($params['date_to'])));
    }

    // unique key
    if(!empty($params['unique_key'])) {
      $arr_queries[] = new XapianQuery(self::XAPIAN_PREFIX_UID . $params['unique_key']);
    }

    // normal search query parsed
    if(!empty($params['search'])) {
      $qp = new XapianQueryParser();
      $qp->set_stemmer($this->xapian_stemmer);
      $qp->set_database($this->xapian_read_db);
      $qp->set_stemming_strategy(XapianQueryParser::STEM_SOME);
      $arr_queries[] = $qp->parse_query($params['search']);
    }

    // Find the results for the query.
        // construct final query
    $query = array_pop($arr_queries);

    foreach($arr_queries as $sq) {
      $query = new XapianQuery(XapianQuery::OP_AND, $query, $sq);
    }    
    $this->xapian_enquire->set_query($query);
  
    // set the count to the specified params
    $offset = (isset($params['offset'])) ? intval($params['offset']) : 0;
    $count = (isset($params['count'])) ? intval($params['count']) : self::DEFAULT_COUNT;
    $matches = $this->xapian_enquire->get_mset($offset, $count);

    $response = new stdClass();
    $response->result_count = $matches->get_matches_estimated();
    $results = array();

    $i = $matches->begin();
    while (!$i->equals($matches->end())) {
      $m = array();

      $n = $i->get_rank() + 1;
      $doc = $i->get_document();

      $m['position'] = $n;
      $m['url'] = $doc->get_value(self::XAPIAN_FIELD_URL);
      $m['name'] = $doc->get_value(self::XAPIAN_FIELD_NAME);
      $m['summary'] = $doc->get_value(self::XAPIAN_FIELD_SUMMARY);
      $m['date'] = $doc->get_value(self::XAPIAN_FIELD_DATE);
      $m['unique_key'] = $doc->get_value(self::XAPIAN_FIELD_UID);
      $m['percent'] = $i->get_percent();

      $results[count($results)] = $m;
      $i->next();
    }

    $response->results = $results;
    $end = microtime(true);
    
    // runtime info
    $response->execute = new stdClass();
    $response->execute->call = 'search';
    $response->execute->offset = $offset;
    $response->execute->count = $count;
    $response->execute->start = $start;
    $response->execute->end = $end;
    $response->execute->time = $end - $start;

    // debug stuff
    $response->execute->debug = $query->get_description();

    return $response;
  }
}

index.php


<?php

require_once 'XapianWrapper.php';

$x = new XapianWrapper();
$res = $x->index(array());
print_r($res);               

search.php


<?php

require_once 'XapianWrapper.php';

$x = new XapianWrapper();
$params = array('search' => 'foo');
$res = $x->search($params);
print_r($res);           

delete.php


<?php

require_once 'XapianWrapper.php';
$x = new XapianWrapper();
$params = array(
	'items' => array('foo'),
);
$res = $x->delete($params);
print_r($res);

Using the example

Once you've unzipped the expamples, you should now be able to create the DB from the db.sql file, and run the php examples via command line.


bash$ php index.php 
stdClass Object
(
    [indexed] => Array
        (
            [0] => Array
                (
                    [name] => foo
                    [guid] => foo
                    [url] => 
                )

            [1] => Array
                (
                    [name] => bar
                    [guid] => bar
                    [url] => 
                )

        )

)
bash$ php search.php 
stdClass Object
(
    [result_count] => 2
    [results] => Array
        (
            [0] => Array
                (
                    [position] => 1
                    [url] => 
                    [name] => foo
                    [summary] => foo bar test
                    [date] => 20081105
                    [unique_key] => foo
                    [percent] => 100
                )

            [1] => Array
                (
                    [position] => 2
                    [url] => 
                    [name] => bar
                    [summary] => test foo bar
                    [date] => 20091105
                    [unique_key] => bar
                    [percent] => 50
                )

        )

    [execute] => stdClass Object
        (
            [call] => search
            [offset] => 0
            [count] => 10
            [start] => 1256674866.79
            [end] => 1256674866.79
            [time] => 0.000944852828979
            [debug] => Xapian::Query(Zfoo:(pos=1))
        )

)
bash$ php delete.php 
Array
(
    [0] => foo
)
bash$ php search.php 
stdClass Object
(
    [result_count] => 1
    [results] => Array
        (
            [0] => Array
                (
                    [position] => 1
                    [url] => 
                    [name] => bar
                    [summary] => test foo bar
                    [date] => 20091105
                    [unique_key] => bar
                    [percent] => 100
                )

        )

    [execute] => stdClass Object
        (
            [call] => search
            [offset] => 0
            [count] => 10
            [start] => 1256674876.02
            [end] => 1256674876.02
            [time] => 0.000872850418091
            [debug] => Xapian::Query(Zfoo:(pos=1))
        )

)

Right, I leave it up to you to amend the examples to suit your individual needs. As always any feedback or improvements are welcome. Happy indexing and searching everyone.

Comments

  • Hi Pascal,

    Just to say there are also bindings to C#, Ruby, Tcl and even Guile, and you don't need Cygwin for Windows - you can compile it with Visual C++ into a Windows DLL. We've produced some pre-compiled packages for Windows at http://www.flax.co.uk/xapian_windows.shtml

    Cheers

    Charlie

    by Charlie Hull on November 4 2009, 15:37 - #

  • Did anyone benchmark any speed results comparing spinx for example?

    by Roky on November 4 2009, 21:03 - #

  • No real benchmarks from my side, I'm afraid. All I can say is that I found xapian very quick when searching, but much slower when building up the index itself.

    From what I gather sphinx is more about indexing an existing mysql table and indexes that very quick. The business requirements for my project however need date comparison and the indexing of aggregated data, rather than an existing table. Therefore I chose xapian, as it was more transparent and tweakable to me.

    I didn't find any good benchmarks either while doing my quick bit of research, apart from this Rails based benchmark.

    Hope this helps.

    by Pascal Opitz on November 4 2009, 21:34 - #

  • Made a mistake btw.

    throw('message' . var_dump($e));

    Not a good idea, unless you want to write straight to the php output! Fixed in the copy and zipped example code.

    by Pascal Opitz on November 26 2009, 10:10 - #

  • just curious, where or what is in xapian.php which is "require_once"d by XapianWrapper.php class?

    by p0ts on November 29 2009, 16:06 - #

  • It is the php bindings for xapian, which usually install in your php include path somewhere.

    sudo aptitude install xapian-core xapian-bindings

    On Ubuntu this is all you need to do, and the bindings will go into /usr/share/php5 I believe.

    by Pascal Opitz on November 30 2009, 06:11 - #

  • I have problem with xapian with php 5.3. I have an error: Fatal error: Call to undefined function new_WritableDatabase() in C:\serwer\www\xapian1\xapian.php on line 1134 Any ideas??

    by Tomek on December 4 2009, 11:37 - #

  • Hi I have installed xapian and it is working fine. But I wanted know whether is there anything like displaying the related results using this full text search. Help is appreciated.

    by Venkat on March 8 2010, 13:24 - #


Comments for this article are closed.