Projects
Pattern Sets & Summarization

with Danai Koutra, U Kang, and Christos Faloutsos
VoG summarizes large graphs in terms of a small set of easily understood structures, such as cliques, stars, chains, and bipartite cores, and uses the Minimum Description Length principle to select the set of structures that together describe the graph best.
Go to the VoG page.
Stijl mines descriptions of ordered binary data. We model data hierarchically with noisy tiles: rectangles with significantly different density than their parent tile. To identify good trees, we employ the Minimum Description Length principle, and give an algorithm for mining optimal sub-tiles in just O(nm min(n,m)) time.
Go to the Stijl page.
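As a toy illustration (not the Stijl algorithm itself), the sketch below shows the core notion of a noisy tile: a rectangle whose density of 1s differs markedly from that of its parent tile. The example matrix and tile choice are hypothetical.

```python
# A noisy tile is a rectangle in the data whose density of 1s differs
# significantly from that of its parent tile. Here we compare the
# top-left 2x2 sub-tile against the full matrix as parent.

data = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]

def density(rows, cols):
    """Fraction of 1s inside the rectangle spanned by rows x cols."""
    cells = [data[r][c] for r in rows for c in cols]
    return sum(cells) / len(cells)

parent = density(range(4), range(4))  # whole matrix
tile = density(range(2), range(2))    # top-left 2x2 sub-tile
print(f"parent density {parent:.2f}, tile density {tile:.2f}")
# → parent density 0.38, tile density 1.00
```

Stijl searches for such sub-tiles recursively, using MDL to decide which tiles are worth adding to the tree.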
with Nikolaj Tatti
We consider mining informative serial episodes — subsequences allowing for gaps — from event sequence data. We formalize the problem by the Minimum Description Length principle, and give algorithms for selecting good pattern sets from candidate collections as well as for parameter free mining of such models directly from data.
Go to the Sqs page.
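To make the pattern type concrete (this is not the Sqs algorithm, just an illustration of what a serial episode is), the following sketch tests whether an episode occurs in a sequence, i.e., whether its events appear in order while allowing gaps. Events are single characters here for brevity.

```python
# A serial episode occurs in an event sequence if its events appear
# in the given order, with arbitrary gaps in between. Testing `in` on
# an iterator consumes it, which gives exactly that semantics.

def occurs(episode, sequence):
    it = iter(sequence)
    return all(event in it for event in episode)

seq = list("abxcaybc")
print(occurs("abc", seq))  # → True  (a..b.x.c in order, with gaps)
print(occurs("cba", seq))  # → False (no c before a b before an a)
```

Sqs goes far beyond this: it selects, by MDL, the set of episodes that together best describe the whole sequence.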
Slim mines high-quality Krimp code tables directly from data, as opposed to filtering a candidate collection. By doing so, Slim obtains smaller code tables that provide better compression ratios, while also improving on classification accuracy and runtime, and reducing memory requirements by orders of magnitude.
Go to the Slim page.
with Nikolaj Tatti
Measuring the difference between data mining results is an important open problem in exploratory data mining. We discuss an information theoretic approach for measuring how much information is shared between results, and give a proof of concept for binary data.
Go to the Apples & Oranges page.
Winner of the ACM SIGKDD 2011 Best Student Paper Award — with Michael Mampaey & Nikolaj Tatti
mtv is a well-founded approach for summarizing data with itemsets; using a probabilistic maximum entropy model, we iteratively find the itemset that provides the most new information, and update our model accordingly. We can either mine the top-k patterns, or identify the best summarization by MDL or BIC.
Go to the mtv page.
Boolean Matrix Factorization has many desirable properties, such as high interpretability and natural sparsity. However, no method for selecting the correct model order has been available. We propose to use the Minimum Description Length principle, and show that besides solving the problem, this well-founded approach has numerous benefits, e.g., it is automatic, does not require a likelihood function, and, as experiments show, is highly accurate.
Go to the mdl4bmf page.
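As a rough illustration of MDL-based model-order selection (a deliberately crude stand-in, not the actual mdl4bmf encoding), the sketch below scores candidate Boolean factorizations by the bits needed for the factor matrices plus the bits needed to point out the residual errors; the candidate (rank, error) pairs are hypothetical.

```python
import math

# Crude two-part description length for a Boolean Matrix Factorization
# of an n x m matrix at rank k with n_errors mismatching cells:
# one bit per factor-matrix cell, plus a naive index code per error.
# The real mdl4bmf encoding is considerably more refined.

def descr_length(n, m, k, n_errors):
    l_model = k * (n + m)                      # factors A (n x k), B (k x m)
    l_error = n_errors * math.log2(n * m + 1)  # naive per-error index code
    return l_model + l_error

n, m = 100, 80
# Hypothetical candidates: extra rank reduces errors,
# but with diminishing returns beyond the true model order.
candidates = [(1, 900), (2, 120), (3, 100), (4, 95), (8, 90)]
scores = {k: descr_length(n, m, k, e) for k, e in candidates}
best = min(scores, key=scores.get)
print("best model order:", best)  # → best model order: 3
```

The point of the sketch: once added structure no longer pays for itself in error reduction, the total description length goes up again, so the minimum identifies the model order.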
A good first impression of a dataset is paramount to how we proceed with our analysis. We discuss mining high-quality high-level descriptive summaries for binary and categorical data. Our approach builds summaries by clustering attributes that strongly correlate, and uses the Minimum Description Length principle to identify the best clustering.
Go to the Attribute Clustering page.
We aim to find itemsets that characterise the data well. To this end, we construct decision trees by which we can pack the data succinctly, and from which we can subsequently identify the most important itemsets. The Pack algorithm can either filter a candidate collection, or mine its models directly from data.
Go to the Pack page.
with Matthijs van Leeuwen & Arno Siebes
The Krimp algorithm mines sets of itemsets by the MDL principle, defining the best set of patterns as the set that compresses the data best. The resulting code tables are orders of magnitude smaller than the number of (closed) frequent itemsets. They are highly characteristic for the data, and obtain high accuracy on many data mining tasks.
Go to the Krimp page.
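As a toy illustration of the compression idea behind Krimp (a minimal sketch, not the actual algorithm: the greedy cover order and the omitted model cost are simplifications), the following computes the encoded size of a tiny transaction database under a given code table, where frequently used patterns receive short codes.

```python
import math

# Krimp scores a pattern set by the two-part MDL total
# L(CT, D) = L(CT) + L(D | CT); here we only sketch L(D | CT),
# assigning each pattern its optimal Shannon code length.

def code_lengths(usages):
    """Code length per pattern: -log2 of its relative usage."""
    total = sum(usages.values())
    return {p: -math.log2(u / total) for p, u in usages.items() if u > 0}

def cover(transaction, patterns):
    """Greedily cover a transaction with patterns (longest first)."""
    remaining, used = set(transaction), []
    for p in sorted(patterns, key=len, reverse=True):
        if set(p) <= remaining:
            used.append(p)
            remaining -= set(p)
    return used

data = [("a", "b", "c"), ("a", "b"), ("a", "b", "c"), ("c",)]
patterns = [("a", "b"), ("c",)]  # a hypothetical candidate code table

# Count how often each pattern is used when covering the data.
usages = {p: 0 for p in patterns}
for t in data:
    for p in cover(t, patterns):
        usages[p] += 1

lengths = code_lengths(usages)
l_data = sum(lengths[p] for t in data for p in cover(t, patterns))
print(f"L(D | CT) = {l_data:.2f} bits")  # → L(D | CT) = 6.00 bits
```

Krimp searches over candidate pattern sets for the code table that minimizes this total compressed size.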

Graph Mining
with Leman Akoglu, Hanghang Tong, Polo Chau, Nikolaj Tatti & Christos Faloutsos
Suppose we are given a large graph in which, by some external process, a handful of nodes are marked. What can we say about these nodes? Are they close together in the graph? Or, if segregated, how many groups do they form? We approach this problem by trying to find simple connection pathways between sets of marked nodes — using MDL to identify the optimal result. We propose the efficient dot2dot algorithm for approximating this goal.
Go to the dot2dot page.
Given a snapshot of a large graph, in which an infection has been spreading for some time, can we identify those nodes from which the infection started to spread? In other words, can we reliably tell who the culprits are? With NetSleuth, we answer this question affirmatively for the Susceptible-Infected virus propagation model.
Go to the NetSleuth page.
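NetSleuth itself is not sketched here; the snippet below only illustrates the Susceptible-Infected propagation model it assumes: at every step, each infected node infects each susceptible neighbour with probability beta. The graph, seed node, and parameters are hypothetical.

```python
import random

# Forward simulation of the Susceptible-Infected (SI) model: once
# infected, a node stays infected, and at each step it infects each
# susceptible neighbour independently with probability beta.

def si_snapshot(adj, seeds, beta, steps, rng):
    infected = set(seeds)
    for _ in range(steps):
        newly = set()
        for u in infected:
            for v in adj[u]:
                if v not in infected and rng.random() < beta:
                    newly.add(v)
        infected |= newly
    return infected

# A small path graph 0 - 1 - 2 - 3 - 4, infection seeded at node 2.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
snapshot = si_snapshot(adj, seeds={2}, beta=0.5, steps=10,
                       rng=random.Random(0))
print("infected snapshot:", sorted(snapshot))
```

NetSleuth addresses the inverse problem: given only such a snapshot, infer how many seeds there were and which nodes they most likely are.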

Causality and Correlation Analysis
We consider non-parametric causal inference. That is, given two variables that we know to be correlated, Ergo can efficiently and reliably infer their causal direction, without having to assume a distribution.
Go to the Ergo page.
with Hoang-Vu Nguyen, Emmanuel Müller, and Klemens Böhm
Correlation analysis is one of the key elements of data mining, but many methods are limited to either linear or pairwise correlations. We propose mac, a non-parametric, multivariate and non-linear correlation measure based on maximal correlation analysis.
Go to the mac page.
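mac itself is not sketched here; the toy contrast below only shows the limitation it addresses: a linear pairwise measure (Pearson) scores a perfect but non-linear dependency as zero, while even a crude non-linear measure (binned mutual information, used here as a stand-in) does not.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Standard linear (Pearson) correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy)

def binned_mi(xs, ys, bins=4):
    """Mutual information after equal-width binning, in bits."""
    def binify(vs):
        lo, hi = min(vs), max(vs)
        return [min(int((v - lo) / (hi - lo + 1e-12) * bins), bins - 1)
                for v in vs]
    bx, by = binify(xs), binify(ys)
    n = len(xs)
    pxy, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum(c / n * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

xs = [i / 10 for i in range(-10, 11)]  # symmetric around zero
ys = [x * x for x in xs]               # perfect, but non-linear, dependency
print(f"Pearson:   {pearson(xs, ys):.3f}")
print(f"binned MI: {binned_mi(xs, ys):.3f} bits")
```

Pearson comes out at (numerically) zero here, while the mutual information is clearly positive; mac generalizes this idea to multivariate, non-linear correlations via maximal correlation analysis.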
with Hoang-Vu Nguyen, Emmanuel Müller, and Klemens Böhm
Discretization is the transformation of continuous data into discrete bins. The key question is how to discretize data such that multivariate associations and patterns are preserved. That is exactly what ipd does.
Go to the ipd page.
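ipd itself is not sketched here; as a baseline contrast, this is the naive equal-width discretization that treats every dimension in isolation, and hence can destroy exactly the multivariate associations that ipd is designed to preserve.

```python
# Naive univariate equal-width binning: each dimension is cut into
# bins of equal width, ignoring any interaction between dimensions.

def equal_width(values, bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against constant data
    return [min(int((v - lo) / width), bins - 1) for v in values]

print(equal_width([0.1, 0.4, 0.5, 0.9, 1.0], bins=2))
# → [0, 0, 0, 1, 1]
```

ipd instead chooses cut points so that the discretized data retains the multivariate patterns present in the continuous data.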

Other Projects & Implementations
phpBibLib is a PHP library for easily parsing and displaying entries from bibtex files, including the possibility of using citations in a webpage and displaying the corresponding references.
Go to the phpBibLib page.
by Jilles Vreeken, based on the paper 'Tiling Databases' by Geerts, Goethals & Mielikäinen.
Go to the Tiling page.