Implementation of a File-Based Indexing Framework for the TopX Search Engine
TopX is an efficient and effective search engine for text and
semistructured data. The current, Java-based implementation of TopX
relies on index structures that are stored in a relational database,
which causes problems with usability, index size, and access
efficiency. The goal of this thesis is therefore to develop a new,
file-based indexing framework that replaces the existing TopX index and
no longer uses a relational database, not even for temporary
storage.
Important goals of this thesis are:
- Implementation of a file-based indexer that can handle huge
amounts of data (e.g., terabytes of text or the complete collection
of Wikipedia documents in XML)
- Integration of different scoring functions and index
layouts (including support for pair-based proximity scores)
- Lossless index compression techniques
- Lossy index compression techniques with quality guarantees,
including horizontal and vertical index pruning
- Efficient index updates (incremental and/or batch updates)
- Highly distributed indexing for high scalability
The indexing framework should be implemented in C++; excellent
programming skills in C++ are therefore mandatory.
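To give a flavor of the lossless index compression goal listed above, the following is a minimal C++ sketch of one standard technique: variable-byte coding of the gaps between sorted document IDs in a posting list. It is an illustration only and not the TopX index format; the function names encodePostings and decodePostings and the byte layout (7 payload bits per byte, low-order bits first, high bit marking the last byte of a value) are assumptions made for this example.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of TopX.
// Encodes a sorted (non-decreasing) list of document IDs as
// variable-byte coded gaps. Each byte carries 7 payload bits,
// low-order bits first; the high bit marks the last byte of a value.
std::vector<std::uint8_t> encodePostings(const std::vector<std::uint32_t>& docIds) {
    std::vector<std::uint8_t> out;
    std::uint32_t previous = 0;
    for (std::uint32_t id : docIds) {
        std::uint32_t gap = id - previous;  // the first gap is the ID itself
        previous = id;
        while (gap >= 128) {
            out.push_back(static_cast<std::uint8_t>(gap & 0x7F));
            gap >>= 7;
        }
        out.push_back(static_cast<std::uint8_t>(gap | 0x80));  // terminating byte
    }
    return out;
}

// Hypothetical helper, not part of TopX.
// Reverses encodePostings: rebuilds each gap from its bytes and
// accumulates the gaps back into absolute document IDs.
std::vector<std::uint32_t> decodePostings(const std::vector<std::uint8_t>& bytes) {
    std::vector<std::uint32_t> docIds;
    std::uint32_t value = 0;
    std::uint32_t previous = 0;
    int shift = 0;
    for (std::uint8_t b : bytes) {
        value |= static_cast<std::uint32_t>(b & 0x7F) << shift;
        if (b & 0x80) {          // last byte of this gap
            previous += value;
            docIds.push_back(previous);
            value = 0;
            shift = 0;
        } else {
            shift += 7;
        }
    }
    return docIds;
}
```

Small gaps, which are typical for frequent terms, fit into a single byte, so a dense posting list can shrink considerably compared with storing each ID as a fixed 32-bit integer. A real indexer would likely also need to store scores and block boundaries alongside the IDs, which this sketch leaves out.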
Advisors: Ralf Schenkel, Andreas Broschart
Student: Levan Kasradze
Level: Master
Status: running
Start: 2007
Prerequisites: Excellent programming skills in C++, some SQL and XML experience
Additional Information and Literature
- M. Theobald, R. Schenkel, G. Weikum: An Efficient and Versatile
Query Engine for TopX Search, VLDB 2005
- R. Schenkel et al.: Efficient Text Proximity Search, SPIRE 2007,
to appear (preprint available on request)
- D. Carmel et al.: Static Index Pruning for Information Retrieval
Systems, SIGIR 2001
- N. Fuhr and N. Gövert: Index Compression vs. Retrieval Time of
Inverted Files for XML Documents, Technical Report, University of
Dortmund, 2002
- I.H. Witten, A. Moffat, T.C. Bell: Managing Gigabytes: Compressing
and Indexing Documents and Images, Morgan Kaufmann, 1999
Last change: Ralf Schenkel, January 8, 2008.