home
Assignment 3: Surprise!

The deadline is July 3rd at 23:59 Saarbrücken standard-time. You are free to hand in earlier. For the third assignment you will have to choose one topic from the list below, read the articles, and hand in a report that critically discusses this material and answers the assignment questions. Reports should summarise the key aspects, but more importantly, should include original and critical thought that show you have acquired a meta level understanding of the topic – plain summaries will not suffice. All sources you use should be appropriately referenced, any text you quote should be clearly identified as such. The expected length of a report is 5 pages, but there is no limit.

For the topic of your assignment, choose one of the following:

1. I Want That One

Data Mining aims at finding interesting novel knowledge from data. Yet, what is interesting? What is interestingness? De Bie [1,2] argues interestingness is inherently subjective, and that we should try to model the state of mind of the analyst, and proposes to do so using information theoretic principles. What are the desired properties of the functions $$InformationContent$$ and $$DescriptionalComplexity$$? Is it not a play of words to hide subjective interestingness into two objective scores? How can we evaluate whether these functions behave well? That is, whether they correspond to what users find complex, resp. simple?

Next, read [3]. The goal of these authors is clearly related to subjective interestingness, yet they take a rather different approach. Discuss what the key differences are, how the two frameworks/tasks relate, and whether this task can be solved/approached by the framework of De Bie. If you think it can, discuss what the distribution $$P$$ would look like, what the kind of background knowledge it would have to take, and what the information contents and descriptional complexity terms would have to measure. If you think it cannot, explain in detail where it breaks and how.

2. Rows and Columns

Say the only things you know about your data are that a) its binary, b) its dimensions, and c) its row and column margins. Although this may not seem like much, you can now express an expectation for results mined on the actual data. First read [4], and then [5]. The latter show some impressive plots! Investigate closely what the differences between the different proposed ways to sample random data are. Explain these in your own words, and discuss in particular their assumptions, their strengths, and weaknesses.

But wait, entry-wise PDFs... Haven't we seen this before? Didn't De Bie [6] write a paper on this a few years back? Didn't he use very nice theoretical backing, namely the MaxEnt principle? Discuss how [5] and [6] relate. The model De Bie proposes reduces to a Rasch model, which is a model Golshan et al. compare to. Investigate their results once again, and discuss what it means that they find that the MaxEnt (Rasch) model based on row and column margins obtains higher error than their own methods.

3. Elementary, my dear Watson — (Hard)

In data mining the goal is to find interesting structure of your data, things that somehow stand out. There are many ways to define 'standing out'. Significance testing based on background knowledge is one of them, but can, again, be done in different ways. There are two main schools. Gionis et al. [4] propose to measure significance using (swap) randomization, whereas De Bie argues to use Maximum Entropy (MaxEnt) modelling [6]. (Variants exist for all sorts of data types.) What are the key differences between the two approaches? What are the differences in background knowledge that can be incorporated? What are the key differences in the assumptions about the background knowledge? What will the effect be in practice? Which do you think is more practical? Why?

Within the MaxEnt school, there exist two sub-schools. In addition to [6], read [7]. What are the key differences between the models? What are the differences in type of knowledge they can incorporate? Can we use both models to test significance of the same types of structures/patterns? Are the two approaches unitable? Does it make any sense to have the type of background information of [6] incorporated into [7], and how about vice versa? If it does, sketch how you think this would work.

4. Caveat Lector — (Hard)

Determining causal relations from data is perhaps the holiest of grails in data mining. Not surprisingly, it is not an easy task. We recently proposed a new approach to inferring whether $$X$$ causes $$Y$$, or vice versa. The main principle is simple; if describing $$X$$ is easier when given $$Y$$, than vice versa, we say that $$Y$$ is more likely to be caused $$X$$ than vice versa. To use this principle to use, we define a practical score based on cumulative entropy. Experiments show it works rather well. At the same time, it's a first attempt. And first attempts always mean there are some points of improvement.

In this assignment your task is to carefully read [8] and critically discuss it. That is, identify the most important choices the author makes, and evaluate whether you agree with him, or whether alternate solutions (also) make (more) sense. In other words, find as many points of improvement of the principle, as well as of the practical instantiation, as possible. That is, you should consider every aspect critically. Does considering $$\Delta_{X' \rightarrow Y}$$ instead of $$\Delta_{X \rightarrow Y}$$ really make more sense? Further, instead of calculating it directly, we estimate conditional and multivariate cumulative entropy. What are the effects of the way we estimate it on the causality score? Would it be possible to improve in theory? Would it be possible to improve in practice? How?

Return the assignment by email to tada@mpi-inf.mpg.de by 3 July, 2359 hours. The subject of the email must start with [TADA]. The assignment must be returned as a PDF and it must contain your name, matriculation number, and e-mail address together with the exact topic of the assignment.

Grading will take into account both Hardness of questions, as well as whether you answer the Bonus questions.

## References

You will need a username and password to access the papers outside the MPI network. Contact the lecturer if you don't know the username or password.

 [1] De Bie, T. An information theoretic framework for data mining. In Proceedings of the 17th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Diego, CA, pages 564-572, ACM, 2011. [2] De Bie, T. Subjective Interestingness in Exploratory Data Mining. Springer, 2013. [3] Dzyuba, V., van Leeuwen, M., Nijssen, S. & De Raedt, L. Interactive Learning of Pattern Rankings. International Journal on Artificial Intelligence Tools, 23(6), 2014. [4] Gionis, A., Mannila, H., Mielikäinen, T. & Tsaparas, P. Assessing Data Mining Results Via Swap Randomization. In Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Philadelphia, PA, pages 167-176, ACM, 2006. [5] Golshan, B., Byers, J.W. & Terzi, E. What do row and column marginals reveal about your dataset?. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, NV, 2013. [6] De Bie, T. Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Mining and Knowledge Discovery, 23(3):407-446, Springer, 2011. [7] Mampaey, M., Tatti, N. & Vreeken, J. Tell Me What I Need To Know: Succinctly Summarizing Data with Itemsets. In Proceedings of the 17th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Diego, CA, pages 573-581, ACM, 2011. [8] Vreeken, J. Causal Inference by Direction of Information. In Proceedings of the SIAM International Conference on Data Mining (SDM'15), SIAM, 2015.