Creation and Evaluation of Heterogeneous XML Document Collections based on the Internet Movie Database
Abstract
Research on information retrieval for XML data currently suffers from a lack of large, heterogeneous collections of XML documents. While some large collections are available (such as DBLP or the INEX document set), they have a homogeneous structure and are therefore inadequate for similarity search. The aim of this thesis is to build a heterogeneous collection of XML documents based on data from the Internet Movie Database (IMDB).
We have already implemented a basic importer that loads raw data from IMDB into an Oracle database, and a basic exporter that writes that data in a straightforward, homogeneous XML format. The following extensions to the existing software should be made in this project:
As there is no formal specification for IMDB's raw data format and the data there is often in bad shape, the importer should be extended to capture more of IMDB's data. This should not take more than four weeks, and it is a good way of learning the database schema used to store the extracted data.
The exporter should be extended to create more heterogeneous XML documents. As an example, the fact that actor A played in movie M can be expressed, among others, in one of the following three XML fragments:
- <movie title="M"><actor name="A"/></movie>
- <film><titel>M</titel><schauspieler>A</schauspieler></film>
- <review><movie title="M"/><actor name="A"/>...</review>
and there are probably many more. The generated XML documents should contain links, represent all information from IMDB, and may contain redundant information. This task definitely calls for creativity: the more heterogeneous the generated data is, the better.
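As an illustration only, the three fragments above could be produced from the same fact by interchangeable formatting functions. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the function names, the VARIANTS list, and the export_fact helper are hypothetical and are not part of the existing exporter.

```python
import xml.etree.ElementTree as ET


def as_movie(title, actor):
    # <movie title="M"><actor name="A"/></movie>
    movie = ET.Element("movie", {"title": title})
    ET.SubElement(movie, "actor", {"name": actor})
    return movie


def as_film(title, actor):
    # <film><titel>M</titel><schauspieler>A</schauspieler></film>
    film = ET.Element("film")
    ET.SubElement(film, "titel").text = title
    ET.SubElement(film, "schauspieler").text = actor
    return film


def as_review(title, actor):
    # <review><movie title="M"/><actor name="A"/></review>
    # (the further review content indicated by "..." is left out here)
    review = ET.Element("review")
    ET.SubElement(review, "movie", {"title": title})
    ET.SubElement(review, "actor", {"name": actor})
    return review


# Hypothetical registry of output formats; more variants can be appended.
VARIANTS = [as_movie, as_film, as_review]


def export_fact(title, actor, variant_index):
    """Serialize the fact 'actor played in title' using one of the variants."""
    build = VARIANTS[variant_index % len(VARIANTS)]
    return ET.tostring(build(title, actor), encoding="unicode")


if __name__ == "__main__":
    for i in range(len(VARIANTS)):
        print(export_fact("M", "A", i))
```

Choosing the variant per document (for example at random, or by document type) is one possible way to obtain a heterogeneous collection from a single database.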
Finally, the retrieval effectiveness of our XXL search engine (and maybe also of the COMPASS search engine) should be evaluated using the heterogeneous collection with a set of queries. For each query, the original, homogeneous collection is used to find all relevant answers, and the task is then to compare these results to the results on the heterogeneous collection.
Based on the results of this thesis, we plan to consider advanced notions of connectivity of XML elements and find out how well they perform on the heterogeneous data set. This includes the implementation of a distance measure on directed graphs that allows following edges in the inverse direction. As time will probably not allow all of this to be done within this thesis, the topic can be continued in a Master's thesis or Diplomarbeit.
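One common way to make such a comparison concrete is to compute precision and recall per query, treating the answers from the homogeneous collection as the set of relevant answers. The text above does not prescribe these measures, so the following is only a sketch under that assumption; the function name and the answer identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Compare answers retrieved on the heterogeneous collection
    with the relevant answers found on the homogeneous collection."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set
    # Guard against empty sets to avoid division by zero.
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical answer identifiers for a single query.
    relevant = ["m1", "m2", "m3"]
    retrieved = ["m1", "m2", "m4", "m5"]
    print(precision_recall(retrieved, relevant))
```

This assumes that answers in both collections can be mapped to a common identifier, for example the key of the underlying database record.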
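To illustrate what a distance measure with inverse edge traversal might look like, the sketch below runs Dijkstra's algorithm on a directed graph in which every edge can also be followed backwards at a higher cost. The concrete costs (1.0 forward, 2.0 inverse), the function name, and the example node names are assumptions made for this example; the actual measure is left open by the text above.

```python
import heapq


def distance(edges, source, target, forward_cost=1.0, inverse_cost=2.0):
    """Cheapest path cost from source to target in a directed graph,
    where each edge (u, v) may also be traversed from v to u at inverse_cost."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append((v, forward_cost))
        adjacency.setdefault(v, []).append((u, inverse_cost))

    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper path was already found
        for neighbor, cost in adjacency.get(node, []):
            candidate = dist + cost
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return float("inf")  # target is not reachable even with inverse edges


if __name__ == "__main__":
    # Hypothetical element graph: two documents pointing to the same actor.
    edges = [("movie", "actor"), ("review", "actor")]
    print(distance(edges, "movie", "review"))
```

In this example the movie and the review are not connected along edge direction alone, but become connected through the shared actor once the second edge may be followed in the inverse direction.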
Organization
Guidance: Ralf Schenkel
Student: Ivelina Stavreva
Level: FoPra
Start: June 2004
Status: finished
Prerequisites: Basic knowledge of XML, experience with Java
Additional Information and Literature
last change: Ralf Schenkel, June 8, 2006.