Advanced Divide-and-Conquer Algorithms for Computing Two-Hop Covers for Large Collections of XML Documents

 

 

   Abstract

While the established path index structures for XML data perform quite well at evaluating most XPath axes, they lack efficient support for the descendants and ancestors axes. In the HOPI project we are currently developing a new path index that applies the concept of a two-hop cover to efficiently evaluate these axes. The results that we got even with an unoptimized implementation strongly indicate that this approach can outperform all existing path index structures. To support very large XML data collections (with the Web as the ultimate target), we improved the theroretical concept of a 2-hop cover with a divide-and-conquer algorithm for building the cover that partitions the document graph, builds the covers for the partitions, and finally joins the partition covers into the cover for the complete document set.

However, the divide-and-conquer algorithm is still quite basic, as it simply builds partitions with a fixed number of nodes and completely ignores the connectivity within these nodes. The main topic of this thesis is to reconsider this algorithm, especially to look into more elaborated approaches for partitioning the initial graph. A first step in this direction would be to add connectivity information to the partitioning algorithm, but there are many more possible approaches. The thesis should examine several alternative partitioning methods, implement them within the existing HOPI implementation, and compare their results both qualitatively as well as quantitatively on large sets of XML data.

   Organization

Guidance:        Ralf Schenkel
Student:         
 Andreas Broschart
Level:              Master's Thesis
Start:              
December 2003
Status:             Finished

Prerequisites:  Basic Knowledge about XML, experience with Java

        

   Additional Information and Literature

  • Edith Cohen, Eran Halperin, Haim Kaplan, and Uri Zwick: Reachability and Distance Queries via 2-Hop Labels. In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 6-8, 2002, San Francisco, pages 937-946, 2002.
  • Ralf Schenkel, Anja Theobald, and Gerhard Weikum: HOPI: An Efficient Connection Index for Complex XML Document Collections. 9th International Conference on Extending Database Technology (EDBT 2004), Heraklion - Crete, Greece, March 14-18, 2004.

Back to the list of topics.

last change: Ralf Schenkel, June 8, 2006.