TADA
Assignment 1: The Warm-Up

For this first assignment, choose one topic from the list below, read the articles, and hand in a report that critically discusses the material and answers the assignment questions. Reports should summarise the key aspects but, more importantly, should include original and critical thought that shows you have acquired a meta-level understanding of the topic; plain summaries will not suffice. All sources you use should be appropriately referenced, and any text you quote should be clearly identified as such. The expected length of a report is 5 pages, but there is no limit. The deadline is May 8th at 14:00 Saarbrücken standard-time. You are free to hand in earlier.

For the topic of your assignment, choose one of the following:

  1. Did Tukey invent Data Mining?

    Read [1] and discuss how exploratory data analysis relates to data mining.

  2. (Don't) Believe the Hype

    Read [2]. The authors introduce a method for detecting correlation in data. They present their approach very confidently. How does it relate to data mining? How strong are their claims? Is the method earth shattering or not? Read [3]. Are there any other practical or theoretical weak points you can find?

  3. Big Data: The Best Thing since Sliced Bread or just Another Bottle of Snake Oil?

    Read [4,5,6,7,8]. Is Big Data worth all the hype? What are the prospects? What are the (potential) problems? Are these problems insurmountable? What are your opinions about Big Data?

  4. Where did the candidates go? — Hard

    The standard approach to mine frequent itemsets is to

    1. generate a set of candidate itemsets,
    2. test which are frequent, and
    3. use those to generate new candidates,
    and iterate until done. Eclat [9], proposed in 1997, is an example of a simple yet very efficient algorithm for mining frequent itemsets that follows this principle in a depth-first search.

    The authors of [10] claim that their method can mine frequent itemsets without candidate generation. This raises the question: where did the candidates go? Discuss whether this claim is valid or not, and why.

    (Bonus) TreeProjection [11] was proposed before [10]. The authors of [10] argue, almost aggressively, that FPGrowth is really different from TreeProjection. Are the two methods really different? Why (not)? Discuss, and if possible, give an example where they are (not) different. Hint: consider how they explore the search space.
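
For orientation on topic 4, the generate-and-test loop described there can be sketched in a few lines of Python. This is a minimal illustration of a depth-first, Eclat-style search over transaction-id sets, not the implementation from [9]; the function name `eclat`, its parameters, and the toy data in the usage note are assumptions made for this example.

```python
from typing import Dict, Iterable, List, Set, Tuple

Itemset = Tuple[str, ...]


def eclat(transactions: Iterable[Set[str]], min_support: int) -> Dict[Itemset, int]:
    """Return every itemset contained in at least min_support transactions.

    Illustrative sketch only: names and interface are hypothetical.
    """
    # Vertical layout: map each item to the ids of the transactions containing it.
    tidsets: Dict[str, Set[int]] = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)

    frequent: Dict[Itemset, int] = {}

    def extend(prefix: Itemset, candidates: List[Tuple[str, Set[int]]]) -> None:
        # Every entry in `candidates` is already known to be frequent
        # when appended to `prefix`.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)

            # Step 1: generate candidates by pairing with the later items.
            # Step 2: test each one by intersecting transaction-id sets.
            next_candidates = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids
                if len(shared) >= min_support:
                    next_candidates.append((other, shared))

            # Step 3: recurse depth-first on the surviving candidates.
            if next_candidates:
                extend(itemset, next_candidates)

    singletons = sorted(
        ((item, tids) for item, tids in tidsets.items() if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    extend((), singletons)
    return frequent
```

On the toy database `[{"a","b","c"}, {"a","b"}, {"a","c"}, {"b","c"}, {"a","b","c"}]` with `min_support=3`, the sketch reports the three single items and the three pairs as frequent; the candidate `("a", "b", "c")` is generated and tested, but rejected because only two transactions contain it.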

Return the assignment by email to tada@mpi-inf.mpg.de by May 8th at 14:00. The subject of the email must start with [TADA]. The assignment must be returned as a PDF, and it must contain your name, matriculation number, and e-mail address together with the exact topic of the assignment.

Topic 4 is hard and contains an optional extra question. Grading of this topic takes this difficulty into account.

References

You will need a username and password to access the papers outside the MPI network. Contact the lecturer if you don't know the username or password.

[1] Tukey, J. We Need Both Exploratory and Confirmatory. American Statistician, 34(1):23-25, 1980.
[2] Reshef, D.N., Reshef, Y.A., Finucane, H.K., Grossman, S.R., McVean, G., Turnbaugh, P.J., Lander, E.S., Mitzenmacher, M. & Sabeti, P.C. Detecting Novel Associations in Large Data Sets. Science, 334(6062):1518-1524, 2011.
[3] Simon, N. & Tibshirani, R. Comment on "Detecting Novel Associations in Large Data Sets" by Reshef et al., Science Dec 16, 2011. arXiv:1401.7645, 2014.
[4] Harford, T. Big Data: are we making a big mistake? Financial Times, 28 March 2014.
[5] Lazer, D., Kennedy, R., King, G. & Vespignani, A. The Parable of Google Flu: Traps in Big Data Analysis. Science, 343, 2014.
[6] White, M. How Big Data is Changing Science (and Society). Pacific Standard, 8 November 2013.
[7] Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C. & Byers, A.H. Big data: The next frontier for innovation, competition, and productivity (executive summary). McKinsey Global Institute, May 2011
[8] Boyd, D. & Crawford, K. Critical Questions for Big Data. Inform. Comm. Soc., 15(5):662-679, 2012.
[9] Zaki, M.J., Parthasarathy, S., Ogihara, M. & Li, W. New algorithms for fast discovery of association rules. In Proceedings of the 3rd ACM International Conference on Knowledge Discovery and Data Mining (KDD), Newport Beach, CA, 1997.
[10] Han, J., Pei, J. & Yin, Y. Mining frequent patterns without candidate generation. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), Dallas, TX, pages 1-12, ACM, 2000.
[11] Agarwal, R.C., Aggarwal, C.C. & Prasad, V. A tree projection algorithm for generation of frequent item sets. J. Parallel Distr. Com., 61(3):350-371, 2001.