- October 19th — TADA won the Busy Beaver award for best Advanced Lecture of the 2015 summer term!
- July 31st — slides of 12th lecture available.
- July 24th — slides of 11th lecture available.
- July 23rd — slides of 10th lecture available.
- July 10th — slides of 9th lecture available.
- July 3rd — 4th assignment handed out.
- June 26th — slides of 8th lecture available.
- June 19th — slides of 7th lecture available.
- June 12th — slides of 6th lecture available.
- June 9th — 3rd assignment handed out.
- June 5th — slides of 5th lecture available.
- May 29th — slides of 4th lecture available.
- May 22nd — slides of 3rd lecture available.
- May 15th — slides of 2nd lecture available.
- May 8th — 2nd assignment handed out.
- April 24th — 1st assignment handed out and slides of 1st lecture available. Exam dates announced.
- April 21st — lecture room changed to E1.7 0.01.
Course Information
Type | Advanced Lecture (5 ECTS) |
Lecturer | Dr. Jilles Vreeken (tada (at) mpi-inf.mpg.de) |
Lectures | Fridays, 14:00–16:00, Room E1.7 0.01 |
Summary | In this advanced course we'll be investigating hot topics in data mining that the lecturer thinks are cool. This course is for those of you who are interested in Big Data Analytics, Data Science, Data Mining, Machine Learning – or, as the lecturer prefers to call it – Exploratory Data Analysis. We'll be looking into how to discover interesting and useful patterns from data, efficiently measure non-linear correlations and determine causal directions, as well as how to analyse large graphs. |
Schedule
Month | Day | Topic | Slides | Assignment | Req. Reading | Opt. Reading |
---|---|---|---|---|---|---|
April | 24 | Introduction, Practicalities | | 1st assignment out | | |
May | 1 | yay, holiday! – no class | | | | |
 | 8 | planned jetlag – no class | | deadline 1st, 2nd out | | |
 | 15 | Interesting Patterns | | | [1] 4–4.2 | [7,8,9,10,11,12] |
 | 22 | Mining Useful Patterns | | | [2] | [13,14,15] |
 | 29 | Entropy and Information | | | [3] 2–2.4 | [16,17] |
June | 5 | Discovering Correlation | | deadline 2nd, 3rd out | [4] | [18,19,20] |
 | 12 | Discovering Causation | | | | |
 | 19 | Determining Significance | | | [1] 4.4 | [21,22,23] |
 | 26 | Subjective Interestingness | | | [1] 5 | [24,25,26] |
July | 3 | planned absence – no class | | deadline 3rd, 4th out | | |
 | 10 | Graph Summarisation | | | [5] | [27,28,29] |
 | 17 | Mining Data that Changes (guest lecturer: Dr. Pauli Miettinen) | | | | [30,31,32] |
 | 24 | Rumours in Graphs | | | [6] | [33,34,35] |
 | 31 | Wrap-Up with Ask Me Anything | | deadline 4th assignment | | |
Structure and Content
In general terms, the course will consist of

- lectures, and
- assignments that include critically reading scientific articles.

The course covers three main topics:

- Mining Interesting Patterns
- Mining Complex Correlations
- Mining Large Graphs
Assignments
Students will individually complete one assignment per topic – four in total. For every assignment, you will have to read one or more research papers and hand in a report that critically discusses the material and answers the assignment questions. Reports should summarise the key aspects but, more importantly, should include original and critical thought that shows you have acquired a meta-level understanding of the topic – plain summaries will not suffice. All sources you draw on should be referenced. The expected length of a report is 5 pages, but there is no hard limit.
The deadlines for the reports are at 14:00 Saarbrücken standard time. You are free to hand in earlier.
Materials
All required and optional reading will be made available. You will need a username and password to access the papers outside the MPI network. Contact the lecturer if you don't know the username or password.
Required Reading
[1] | Interesting Patterns. In Frequent Pattern Mining, Aggarwal, C. & Han, J., pages 105-134, Springer, 2014. |
[2] | Mining and Using Sets of Patterns through Compression. In Frequent Pattern Mining, Aggarwal, C. & Han, J., pages 165-198, Springer, 2014. |
[3] | Entropy, Relative Entropy, and Mutual Information. In Elements of Information Theory, Wiley-Interscience New York, 2006. |
[4] | New Evidence for the Theory of the Stork. Paediatric and Perinatal Epidemiology, 18(1):88-92, 2004. |
[5] | VoG: Summarizing Graphs using Rich Vocabularies. In Proceedings of the SIAM International Conference on Data Mining (SDM), Philadelphia, PA, pages 91-99, SIAM, 2014. |
[6] | Spotting Culprits in Epidemics: How many and Which ones? In Proceedings of the 12th IEEE International Conference on Data Mining (ICDM), Brussels, Belgium, IEEE, 2012. |
Optional Reading
[7] | Fast Discovery of Association Rules. In Advances in Knowledge Discovery and Data Mining, pages 307-328, AAAI/MIT Press, 1996. |
[8] | Efficiently mining long patterns from databases. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), Seattle, WA, pages 85-93, 1998. |
[9] | Discovering Frequent Closed Itemsets for Association Rules. In Proceedings of the 7th International Conference on Database Theory (ICDT), Jerusalem, Israel, pages 398-416, ACM, 1999. |
[10] | Self-sufficient itemsets: An approach to screening potentially interesting associations between items. ACM Transactions on Knowledge Discovery from Data, 4(1):1-20, 2010. |
[11] | Tiling Databases. In Proceedings of Discovery Science, pages 278-289, 2004. |
[12] | Comparing Apples and Oranges: Measuring Differences between Data Mining Results. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), Athens, Greece, pages 398-413, Springer, 2011. |
[13] | Krimp: Mining Itemsets that Compress. Data Mining and Knowledge Discovery, 23(1):169-214, Springer, 2011. |
[14] | Identifying the Components. Data Mining and Knowledge Discovery, 19(2):173-292, 2009. |
[15] | The Odd One Out: Identifying and Characterising Anomalies. In Proceedings of the 11th SIAM International Conference on Data Mining (SDM), Mesa, AZ, pages 804-815, Society for Industrial and Applied Mathematics (SIAM), 2011. |
[16] | A Mathematical Theory of Communication. The Bell System Technical Journal, 27:379-423, 623-656, 1948. |
[17] | Finding low-entropy sets and trees from binary data. In Proceedings of the 13th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Jose, CA, pages 350-359, 2007. |
[18] | Detecting Novel Associations in Large Data Sets. Science, 334(6062):1518-1524, 2011. |
[19] | On cumulative entropies. Journal of Statistical Planning and Inference, 139(2009):4072-4087, 2009. |
[20] | Multivariate Maximal Correlation Analysis. In Proceedings of the 31st International Conference on Machine Learning (ICML), Beijing, China, pages 775-783, JMLR, 2014. |
[21] | Assessing data mining results via swap randomization. ACM Transactions on Knowledge Discovery from Data, 1(3), ACM, 2007. |
[22] | Tell me something I don't know: randomization strategies for iterative data mining. In Proceedings of the 15th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Paris, France, pages 379-388, ACM, 2009. |
[23] | Randomization methods for assessing data analysis results on real-valued matrices. Statistical Analysis and Data Mining, 2(4):209-230, 2009. |
[24] | On the rationale of maximum-entropy methods. Proceedings of the IEEE, 70(9):939-952, IEEE, 1982. |
[25] | Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Mining and Knowledge Discovery, 23(3):407-446, Springer, 2011. |
[26] | Maximum Entropy Models for Iteratively Identifying Subjectively Interesting Structure in Real-Valued Data. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), Prague, Czech Republic, pages 256-271, Springer, 2013. |
[27] | Fully automatic cross-associations. In Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Seattle, WA, pages 79-88, 2004. |
[28] | Beyond Caveman Communities: Hubs and Spokes for Graph Compression and Mining. In Proceedings of the 11th IEEE International Conference on Data Mining (ICDM), Vancouver, Canada, pages 300-309, IEEE, 2011. |
[29] | Doulion: Counting Triangles in Massive Graphs with a Coin. In Proceedings of the 15th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Paris, France, ACM, 2009. |
[30] | Mining of Massive Datasets. Cambridge University Press, 2013. |
[31] | Clustering Data Streams. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science (FOCS), pages 359-366, IEEE, 2000. |
[32] | Beyond streams and graphs: dynamic tensor analysis. In Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Philadelphia, PA, pages 374-383, 2006. |
[33] | Finding effectors in social networks. In Proceedings of the 16th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Washington, DC, pages 1059-1068, ACM, 2010. |
[34] | Rumors in a Network: Who's the Culprit? IEEE Transactions on Information Theory, 57(8):5163-5181, 2011. |
[35] | Hidden Hazards: Finding Missing Nodes in Large Graph Epidemics. In Proceedings of the SIAM International Conference on Data Mining (SDM'15), SIAM, 2015. |
Course format
The course has two hours of lectures per week. There are no weekly tutorial group meetings. Instead, students write essays based on the material covered in the lectures and on scientific articles assigned to them by the lecturer.
Grading and Exam
The assignments will be graded on a scale of Fail, Pass, Very Good, and Excellent. You can fail at most one assignment; two failures mean you fail the course. Any assignment not handed in by the deadline automatically counts as failed.
You can earn up to three bonus points by obtaining Excellent or Very Good grades for the assignments: an Excellent grade gives you one bonus point, as does every pair of Very Good grades, up to a maximum of three bonus points. Each bonus point improves your final grade by one step (1/3), assuming you pass the final exam. For example, if you have two bonus points and receive a 2.0 in the final exam, your final grade will be 1.3. You fail the course if you fail the final exam, irrespective of any bonus points. Failed assignments do not reduce your final grade, provided you are eligible to sit the final exam.
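The bonus-point rule above can be sketched as a short calculation. This is only an illustration, not an official grading tool: the grade steps below assume the standard German grading scale, and the function names are made up for this example.

```python
# Illustrative sketch of the bonus-point rule; the grade ladder is the
# standard German scale, where "one third" means one step towards 1.0.
GRADE_STEPS = [1.0, 1.3, 1.7, 2.0, 2.3, 2.7, 3.0, 3.3, 3.7, 4.0]

def bonus_points(n_excellent, n_very_good):
    """One point per Excellent, one per two Very Goods, capped at three."""
    return min(3, n_excellent + n_very_good // 2)

def final_grade(exam_grade, points):
    """Each bonus point moves a passing exam grade one step towards 1.0."""
    if exam_grade not in GRADE_STEPS:
        raise ValueError("not a passing exam grade")
    i = GRADE_STEPS.index(exam_grade)
    return GRADE_STEPS[max(0, i - points)]

# The example from the text: two bonus points and a 2.0 in the exam.
print(final_grade(2.0, bonus_points(n_excellent=1, n_very_good=2)))  # prints 1.3
```

Note that `bonus_points` caps at three, and `final_grade` never improves past 1.0.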
The final exam will be oral. It will cover all the material discussed in the lectures, plus one topic of the student's choice from each assignment. The main exam will be on August 3rd and, if needed, also on August 4th. The re-exam will be on September 28th. The exact day and time for each student will be announced later. Inform the lecturer of any potential clashes as soon as you know them.
Prerequisites
Students should have a basic working knowledge of data analysis and statistics, e.g. by having successfully completed courses related to data mining, machine learning, and/or statistics, such as Machine Learning, Probabilistic Graphical Models, Statistical Learning, Information Retrieval and Data Mining, etc.