TADA
home
Assignment 2: Patterns & Compression

For this second assignment you will have to choose one topic from the list below, read the articles, and hand in a report that critically discusses this material and answers the assignment questions. Reports should summarise the key aspects, but more importantly, should include original and critical thought that show you have acquired a meta level understanding of the topic – plain summaries will not suffice. All sources you use should be appropriately referenced, any text you quote should be clearly identified as such. The expected length of a report is 5 pages, but there is no limit. The deadline is June 5th at 14:00 Saarbrücken standard-time. You are free to hand in earlier.

For the topic of your assignment, choose one of the following:

  1. Sounds... familiar...

    Clustering is one of the key tasks in data mining. The task is easy: group things that are similar. Formally defining a meaningful measure for similarity is not so simple, as we typically have no idea what we are looking for, and even less of an idea of how things may be similar in the data we have. As a result there exist godzillian many similarity measures and clustering algorithms — each of which is claimed to somehow being much much 'better' than anything that already existed.

    Cilibrasi and Vitanyi define the ultimate clustering using Kolmogorov complexity [1] and it can be approximated using off-the-shelf compression algorithms: you only need to use a data compression algorithm that suits your data. Is the framework they propose as powerful as they claim? Does this make sense at all? Why and when (not)? Are there any hidden assumptions? Can you think of examples where this framework would not be so easy to apply, or not work at all?

    Read [2]. What is the novelty of this paper w.r.t [1]? Are there key differences in the scores? What are the implications? How would you rate the novelty of this paper, what would you say is the key contribution?

    (Bonus) In a follow-up paper [3] Cilibrasi and Vitanyi propose to use Google as a `compression algorithm'. Does this make sense? Argue why (not).

  2. What's loss got to do with it?

    The Minimum Description Length (MDL) principle [4]. states that for meaningful inference from data your encoding should be lossless and efficient. That is, you should be able to completely reconstruct your data, and the encoding you use should not waste any bits. Why would loss be bad? What are the pitfalls (if any) for the 'generalized' MDL proposed by Lakshmanan et al. [5]?

    Many authors claiming to use MDL do in fact not use a lossless encoding nor an efficient encoding. Read Lucchese et al. [6] Closely investigate the encoding they use. Reason about the effect of the choices made in the encoding, what this means for the results and whether these correspond with the overall goals. If you find deficiencies, try to propose improvements.

    (Bonus) Read Chakrabarti et al. [7]. Is their encoding lossless? Why (not)? Propose improvements.

    Although [4] gives sufficient details and insight for this assignment, the full tutorial is also available to those who want to know more about MDL [8].

  3. Look, Mom, no hands! — Hard

    One of the advantages of using Minimum Description Length (MDL) [4] is that it allows to define parameter-free methods [9]. Some people go as far as claiming this is the future of data mining [10] as it allows for true exploratory data mining. Is this so? Where did the parameters go, and what about any assumptions? Is MDL a magic wand? What if we do not like what MDL says is the best? Discuss critically.

  4. So Significant, So Interesting

    Initially, frequency was thought to be a good measure for the interestingness of a pattern – you want to know what has been sold often, after all. After realising that the resulting set of patterns was more often than not enormous in size, as well as extremely redundant, research attention shifted to find better measure of interestingness. A natural approach then seemed to only report those patterns for which the frequency is significant with regard to some background model. Unsurprisingly, this turned out to be much harder than expected.

    Read both Brin et al. [11] and Webb [12]. Both give examples of how to identify patterns that are somehow deviating from an expectation, both consider a lot more information than simply the expected frequency under the marginals, yet both approaches are otherwise quite different. Analyse what the core ideas are of both approaches, give a succinct summary, and give a detailed discussion on how they differ, what in your view the strong and weak points of both approaches are. Is either of this measure the ultimate measure of interestingness for itemsets, or are there further improvements possible?

    (Yes, Brin is Sergey Brin from Google.) (Although in [12] Webb does not give an algorithm for mining self sufficient itemsets, in a recent paper [13] we showed how we can mine these efficiently.)

Return the assignment by email to tada@mpi-inf.mpg.de by 5 June, 1400 hours. The subject of the email must start with [TADA]. The assignment must be returned as a PDF and it must contain your name, matriculation number, and e-mail address together with the exact topic of the assignment.

Topics 1 and 2 include Bonus questions, whereas Topics 3 is Hard. Grading will take this into account.


References

You will need a username and password to access the papers outside the MPI network. Contact the lecturer if you don't know the username or password.

[1] Cilibrasi, R. & Vitányi, P. Clustering by Compression. IEEE Transactions on Information Technology, 51(4):1523-1545, 2005.
[2] Campana, B.J. & Keogh, E. A Compression Based Distance Measure for Texture. In Proceedings of the 10th SIAM International Conference on Data Mining (SDM), Columbus, OH, 2010.
[3] Cilibrasi, R. & Vitányi, P. The Google Similarity Distance. IEEE Transactions on Knowledge and Data Engineering, 19(3):370-383
[4] Grünwald, P. Minimum Description Length Tutorial (shortened version). In Advances in Minimum Description Length, MIT Press, 2005.
[5] Lakshmanan, L.V., Ng, R.T., Wang, C.X., Zhou, X. & Johnson, T.J. A Generalized MDL Approach for Summarization. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), Hong Kong, China, 2002.
[6] Lucchese, C., Orlando, S. & Perego, R. Mining Top-K Patterns from Binary Datasets in presence of Noise. In Proceedings of the 10th SIAM International Conference on Data Mining (SDM), Columbus, OH, pages 165-176, 2010.
[7] Chakrabarti, D., Papadimitriou, S., Modha, D.S. & Faloutsos, C. Fully automatic cross-associations. In Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Seattle, WA, pages 79-88, 2004.
[8] Grünwald, P. Minimum Description Length Tutorial (complete version). In Advances in Minimum Description Length, MIT Press, 2005.
[9] Mampaey, M. & Vreeken, J. Summarising Data by Clustering Items. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), Barcelona, Spain, pages 321-336, Springer, 2010.
[10] Keogh, E., Lonardi, S. & Ratanamahatana, C.A. Towards parameter-free data mining. In Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Seattle, WA, pages 206-215, 2004.
[11] Brin, S., Motwani, R. & Silverstein, C. Beyond Market Baskets: Generalizing Association Rules to Correlations. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), Tucson, AZ, pages 265-276, ACM, 1997.
[12] Webb, G.I. Self-sufficient itemsets: An approach to screening potentially interesting associations between items. ACM Transactions on Knowledge Discovery from Data, 4(1):1-20, 2010.
[13] Webb, G. & Vreeken, J. Efficient Discovery of the Most Interesting Associations. ACM Transactions on Knowledge Discovery from Data, 8(3):1-31, ACM, 2014.