Advanced
Divide-and-Conquer Algorithms for Computing Two-Hop Covers for Large
Collections of XML Documents
Abstract
While the established path index structures for XML data perform quite well at evaluating most XPath axes, they lack efficient support for the descendants and ancestors axes. In the HOPI project we are currently developing a new path index that applies the concept of a two-hop cover to efficiently evaluate these axes. The results that we got even with an unoptimized implementation strongly indicate that this approach can outperform all existing path index structures. To support very large XML data collections (with the Web as the ultimate target), we improved the theroretical concept of a 2-hop cover with a divide-and-conquer algorithm for building the cover that partitions the document graph, builds the covers for the partitions, and finally joins the partition covers into the cover for the complete document set.
However, the divide-and-conquer algorithm is still quite basic, as it simply builds partitions with a fixed number of nodes and completely ignores the connectivity within these nodes. The main topic of this thesis is to reconsider this algorithm, especially to look into more elaborated approaches for partitioning the initial graph. A first step in this direction would be to add connectivity information to the partitioning algorithm, but there are many more possible approaches. The thesis should examine several alternative partitioning methods, implement them within the existing HOPI implementation, and compare their results both qualitatively as well as quantitatively on large sets of XML data.
Organization
Guidance: Ralf Schenkel
Student: Andreas Broschart
Level: Master's Thesis
Start: December 2003
Status: Finished
Prerequisites: Basic Knowledge about XML, experience with Java
•
Additional Information and Literature
last change: Ralf Schenkel, June 8, 2006.