Some information retrieval tools

Michel Beigbeder -- 2006/18/09

Please let me know if you find any errors in the following information.

Evaluation

trec_eval
trec_eval.8.1.tar.gz trec_eval.8.0.tar.gz trec_eval.7.3.tar.gz trec_eval.7.0beta trec_eval.v3beta trec2_eval trec_eval_hp trec1_eval
Software for evaluating IR systems.
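
As an illustration of the kind of measure such evaluation software reports, here is a minimal Python sketch of precision at k, average precision, and mean average precision. It is not trec_eval's code: the function names and the in-memory data layout (a ranked list of document ids per query, a set of relevant ids per query) are assumptions made for this example.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the first k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    return hits / k


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank holding a relevant document.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs, qrels):
    """Average the per-query average precision over all judged queries.

    runs:  {query_id: [doc_id, ...]} ranked lists (hypothetical layout)
    qrels: {query_id: {doc_id, ...}} relevant documents (hypothetical layout)
    """
    scores = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

A query with no retrieved documents simply scores zero, so it lowers the mean rather than being skipped.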


3 IR tools tested or in use within the RIM team

Zettair
team site
tool site
Previously known as lucy.
Zettair is a small set of programs written in C for text indexing and retrieval.

Comment: The index format is very easy to understand. It is also easy to add your own weighting scheme. The straightforward programming style makes it easy to add other features (in indexing, for instance).

mg
tool site
book site
MG is an open-source compressing, indexing and retrieval system for text, images, and textual images. It is written in C.

Development discontinued since August 1999

Comment: The book does not help much in understanding the software, but it is a very good book on both compression and information retrieval.

The software is more difficult to extend than Zettair because it makes heavy use of (complex) macros to handle the compression features. We nevertheless succeeded in making some extensions by inserting our own code at a few key points, in both the indexing and the querying phases. However, it is very difficult to write new code that accesses the index directly (again because of the complex compression mechanisms in use).

smart
tool site
Smart implements the basic vector model of information retrieval. It is possible to experiment with different weighting schemes. It is written in C.
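
To make the idea of the vector model with interchangeable weighting schemes concrete, here is a minimal Python sketch. It is not Smart's code or configuration: the two weighting functions, the function names, and the use of already tokenized documents are assumptions chosen for illustration. Documents and the query become sparse term-weight vectors, and documents are ranked by cosine similarity to the query.

```python
import math
from collections import Counter


def raw_tf_idf(tf, df, n_docs):
    """Term frequency times log inverse document frequency."""
    return tf * math.log(n_docs / df)


def log_tf_idf(tf, df, n_docs):
    """Same idea, with a dampened (logarithmic) term frequency."""
    return (1.0 + math.log(tf)) * math.log(n_docs / df)


def weigh(tokens, df, n_docs, scheme):
    """Build a sparse {term: weight} vector; terms unseen in the collection are dropped."""
    tf = Counter(tokens)
    return {t: scheme(c, df[t], n_docs) for t, c in tf.items() if t in df}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs, scheme=raw_tf_idf):
    """Return (similarity, doc_index) pairs, best match first."""
    n_docs = len(docs)
    df = Counter(t for d in docs for t in set(d))
    q = weigh(query, df, n_docs, scheme)
    scored = [(cosine(q, weigh(d, df, n_docs, scheme)), i) for i, d in enumerate(docs)]
    return sorted(scored, key=lambda s: (-s[0], s[1]))
```

Passing a different `scheme` function to `rank` is all it takes to compare weighting schemes on the same collection.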

Development discontinued since 1992.

Comment: Not easy to install. The configuration mechanism is difficult to understand, and the configuration process is error-prone. Some (badly) documented features do not actually work. Extensions that fit well within the vector model are not too difficult, but it is nearly impossible to add any others.

Links: Because this software is difficult to use and its internal documentation is not good, here are some links on how to use it.


8 software packages not tested

Cheshire
tool site
(Mainly C) (Most recently modified file: 2005-01-13 in V2.41)
A Next-Generation Online Catalog and Full-Text Information Retrieval System.
DataparkSearch Engine
tool site
(C) (Most recently modified file: 2005-12-01 in V4.35)
DataparkSearch Engine is a full-featured open source web-based search engine released under the GNU General Public License and designed to organize search within a website, group of websites, intranet or local system.
Lemur
tool site
(C++) The Lemur Toolkit for Language Modeling and Information Retrieval.
Lucene
tool site
(Java) (Most recently modified file: 2004-11-29 in V1.4.3)
Lucene is a high-performance, full-featured text search engine library written entirely in Java.
Senga
tool site
(Mainly C++)
Senga is a development group focused on information retrieval software.
  • Catalog is a Perl program for creating, maintaining and displaying Yahoo!-like directories. (Last version 1.03 2001-07-11)
  • GNU mifluz is a C++ library for building and querying a full-text inverted index. (Last release 0.23.0 2001-07-23)
  • unac is a C library and command that removes accents from a string. (Last version 1.7.0 2002-09-02)
  • uri is a library that analyses URIs and transforms them. (Last version 2.13 2001-07-16)
  • webbase is a crawler for the Internet. It has two main functions: crawling the Web to fetch documents and building a full-text database from these documents. (Last version 5.17.0 2001-09-10)
Development discontinued since 2002.
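
Two of the ideas named in the Senga list, accent removal and a full-text inverted index, can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: it uses Python's standard `unicodedata` module rather than unac, an in-memory dictionary rather than mifluz, and all function names are hypothetical.

```python
import unicodedata
from collections import defaultdict


def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Strip accents, lower-case, and split on non-alphanumeric characters."""
    cleaned = strip_accents(text).lower()
    return "".join(ch if ch.isalnum() else " " for ch in cleaned).split()


def build_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all(index, query):
    """Return the ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)
```

Because the same `tokenize` function is applied to documents and queries, an accented query term matches its unaccented form in the index and vice versa.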
Terrier
tool site
(Java) (Last version 1.0.2) (Most recently modified file: 2005-03-17 in V1.0.2)
Terrier is software for the rapid development of Web, intranet and desktop search engines. More generally, it is a modular platform for the rapid development of large-scale Information Retrieval applications, providing indexing and retrieval functionalities.
Wumpus
tool site
(C++) (Most recently modified file: 2005-11-30 in V2005-11-30)
Wumpus is an information retrieval system. Its main purpose is to study issues that arise in the context of indexing dynamic text collections in multi-user environments.
Xapian
tool site
(C++)
(Most recently modified file: 2005-07-15 in V 0.9.2)

Xapian is an Open Source Probabilistic Information Retrieval library, released under the GPL. It's written in C++.

Features: Ranked probabilistic search, Relevance feedback, Phrase and proximity searching, Structured boolean search operators, Stemming (Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish).
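
For readers unfamiliar with ranked probabilistic search, here is a minimal Python sketch of BM25, a widely used probabilistic ranking function. It is chosen here only as an example: the feature list above does not say which formula Xapian uses, this is not Xapian's API, and the function name and the default parameters `k1` and `b` are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms with BM25.

    query: list of terms; docs: list of token lists.
    Returns one score per document, in the same order as docs.
    """
    if not docs:
        return []
    n_docs = len(docs)
    # Fall back to 1.0 when every document is empty, to avoid dividing by zero.
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        # Longer-than-average documents are penalised through this term.
        length_norm = k1 * (1.0 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + length_norm)
        scores.append(score)
    return scores
```

Sorting documents by decreasing score gives the ranked result list; documents sharing no term with the query score zero.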


1 library not tested

Bow
library site
Bow (or libbow) is a library of C code useful for writing statistical text analysis, language modeling and information retrieval programs. The current distribution includes the library, as well as front-ends for document classification (rainbow), document retrieval (arrow) and document clustering (crossbow).