TADA
home
Assignment 4: Graphs, graphs everywhere

The deadline is July 31st at 23:59 Saarbrücken standard-time. You are free to hand in earlier. For the third assignment you will have to choose one topic from the list below, read the articles, and hand in a report that critically discusses this material and answers the assignment questions. Reports should summarise the key aspects, but more importantly, should include original and critical thought that show you have acquired a meta level understanding of the topic – plain summaries will not suffice. All sources you use should be appropriately referenced, any text you quote should be clearly identified as such. The expected length of a report is 5 pages, but there is no limit.

For the topic of your assignment, choose one of the following:

  1. Storylines

    Read both [1] and [2]. These papers consider the problem of making sense of large collections of text documents. Analyse both, and discuss. Are the goals clearly defined? Do the authors make principled, or ad-hoc, choices to define the solution? Are the results convincing? Are the algorithms sufficiently evaluated? What extra experiments would you have liked to see? Can you identify possible improvements for the algorithms, how would you approach the problem?

  2. Growing a Graph

    Generating synthetic graphs that exhibit realistic properties is a difficult problem with many applications. In the last few years the stochastic Kronecker approach [3,4] got very popular, receiving a lot of praise left and right. Recently, however, there have been a few papers that are more critical about the model [5].

    Read [3] and [5] and form your own opinion. Optionally, read (the relevant parts of) [4] for a more in-depth discussion.

  3. Featuring Recursion and Aggregation

    In the last twenty years a small group of European researchers has been working on multi-relational data mining: mining data over multiple relations. One key problem is how to generate informative local 'multi relational features'. One of the main solutions is called 'propositionalization'. Read Knobbe et al. [6] as an example.

    In recent years mining graph-structured data has become very popular, in particular in the USA. Initially only simple undirected unweighted and unlabelled graphs were considered. These days more and more work appears on more complex graphs, considering more interesting problem settings. Read Henderson et al. [7] as an example.

    Discuss the relation between graph mining and multi-relational data mining. Are they identical, similar, or ultimately different? How similar are the problems and the solutions these two papers propose? Is graph mining reinventing propositionalization? Why (not)? Is graph mining reinventing multi-relational data mining? Why (not)?

  4. The Root of All Evil — (Hard)

    Discovering sources of viral infections is a hot area of research. Recently, within three months three groups of authors published results on how to discover culprits from incomplete data. Read these three papers [8,9,10] and discuss how (dis)similar the explored ideas and proposed solutions are. Be critical, try to see through the hype, not everything that is said to be radically different has to be such.

    (Bonus) Do (any of) the proposed methods have any weaknesses? Do you see more general, more realistic problem settings can be solved? (For example, how would you extend [8] to the SIR model?) Discuss. Warning: this part is open ended, it considers a research question that we are currently looking into ourselves. If you do well, you are invited to join our efforts and share in the eternal glory that awaits us.

Return the assignment by email to tada@mpi-inf.mpg.de by 31 July, 2359 hours. The subject of the email must start with [TADA]. The assignment must be returned as a PDF and it must contain your name, matriculation number, and e-mail address together with the exact topic of the assignment.

Grading will take into account both Hardness of questions, as well as whether you answer the Bonus questions.


References

You will need a username and password to access the papers outside the MPI network. Contact the lecturer if you don't know the username or password.

[1] Shahaf, D. & Guestrin, C. Connecting the dots between news articles. In Proceedings of the 16th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Washington, DC, pages 623-632, 2010.
[2] Shahaf, D., Guestrin, C. & Horvitz, E. Metro maps of science. In Proceedings of the 18th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Beijing, China, pages 1122-1130, 2012.
[3] Leskovec, J. & Faloutsos, C. Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication. In Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Porto, Portugal, 2005.
[4] Leskovec, J., Chakrabarti, D., Kleinberg, J., Faloutsos, C. & Ghahramani, Z. Kronecker Graphs: An Approach to Modeling Networks. Journal of Machine Learning Research, 11(Feb):985-1042, 2010.
[5] Seshadhri, C., Pinar, A. & Kolda, T. An in-depth analysis of stochastic Kronecker graphs. , 60(2):13-32, 2013.
[6] Knobbe, A., de Haas, M. & Siebes, A. Propositionalisation and Aggregates. In Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discoverys (PKDD), Freiburg, Germany, 2001.
[7] Henderson, K., Gallagher, B., Li, L., Akoglu, L., Eliassi-Rad, T., Tong, H. & Faloutsos, C. It's who you know: graph mining using recursive structural features. In Proceedings of the 17th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Diego, CA, 2011.
[8] Sundareisan, S., Vreeken, J. & Prakash, B.A. Hidden Hazards: Finding Missing Nodes in Large Graph Epidemics. In Proceedings of the SIAM International Conference on Data Mining (SDM'15), SIAM, 2015.
[9] Sefer, E. & Kingsford, C. Diffusion Archaeology for Diffusion Progression History Reconstruction. In Proceedings of the 14th IEEE International Conference on Data Mining (ICDM), 2014.
[10] Farajtabar, M., Gomez-Rodriguez, M., Du, N., Zamani, M., Zha, H. & Song, L. Back to the Past: Source Identification in Diffusion Networks from Partially Observed Cascades. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.