Publications

My current main interest is in machine learning for structured and multi-relational, and network data. I have recently worked on applying reinforcement learning to sports data, which have a very complex structure (players, teams, matches, actions, times, locations). My work on learning has followed two main principles. 1) Analyze the complexity of learning problems to establish which methods achieve the best possible performance for a given learning problem. 2) Get the most out of the information you have: Develop new learning methods to work with complex data , rather than simplify the data to work with existing methods. Examples of complex data that require new methods include complex networks, event logs, natural language, and images.

My past publications have addressed a number of topics in different areas, such as learning graphical models, statistical-relational learning, machine learning for particle physics, computational learning theory, belief revision, computation theory, inductive inference, philosophy of science, and game theory (logic, foundations, evolutionary analysis). Here is a graph of connections among research topics I have worked on. The notes on the papers describe these connections.

Generative Models for Graph Data
Sports Analytics and Reinforcement Learning
Learning Bayesian Networks and Causal Models
Automated Scientific Discovery
Learning Theory
Formal Epistemology
Belief Revision
Game Theory

Learning Generative Models for Graph Data

A special interest of mine is learning Bayesian network structures for multi-relational data (aka heterogeneous information networks). I have also worked on neural generative models and how to combine them with symbolic Bayesian networks.

Deep Generative Models for Subgraph Prediction. Erfaneh Mahmoudzadeh*, Parmis Naddaf*, Kiarash Zahirnia*, Oliver Schulte* (2024). European Conference on Artificial Intelligence (50th Anniversary!)
This paper introduces subgraph queries as a new task for deep graph learning. Unlike traditional graph prediction tasks that focus on individual components like link prediction or node classification, subgraph queries jointly predict the components of a target subgraph based on evidence that is represented by an observed subgraph. Simply put, the input is a subgraph, and the output is a probability for a target subgraph. Subgraph queries define a much larger space of graph prediction tasks compared to single-component tasks (link prediction/link classification). Our approach is inference from a single generative model . Since this is a new task for neural graph generative models, it's hard to find a baseline. In the paper we consider limited queries whose target is either a single component, or multiple components predicted independently by baseline methods. We find that inference from a single model is competitive with baselines customized for single-component tasks (e.g., node classification), especially in the inductive setting where to goal is to generalized from observed training nodes to new target nodes.
- Connections.
  - My goal here is to bring neural generative models up to the query answering power of statistical-relational models, which can also handle arbitrary subgraph queries. Example systems include Alchemy and Problog.
  - This paper builds on our 2023 CIKM paper on joint link prediction. The difference is that now we support queries concerning joint node classification (e.g., is this entire set of nodes anomalous?). This extension requires extending deep generative models so that they generate node features/labels in addition to links.
  - I see a general trend in machine learning to build multi-task systems that are evaluated on their ability to handle a larger set of queries. As opposed to a classifier that operates on a fixed query type with a constant-size target.
Rule-Enhanced Graph Learning. Ali Khazraee, Abdolreza Mirzaei, Majjid Farhadi, Parmis Nadaff, Kiarash Zahirnia, Mohammad Salameh, Kevin Cannons, Richard Mar, Mingyi Wu, Oliver Schulte (2024). Workshop on Structured Probabilistic Inference & Generative Modeling. SPIGM@ICML 2023.
Logical Rules have been used for decades in the field of Knowledge Representation to capture domain knowledge. They can be non-deterministic to allow for uncertainty and exceptions. For example, the rule "If a person works in a city, they live in that city too" might be assigned a confidence value of 70% to show that the conclusion holds 70% of the time, given the rule condition. This paper is about the question of how a given set of if-then rules can be used to constrain and enhance deep graph generation. Our basic idea is a rule moment matching approach where deep graph generation is to constrain learning such that the expected rule instantiation counts, given the generative model, match the observed rule instantiation counts. This the domain knowledge is incorporated through the training objective. The computational problem in implementing this idea is computing the expected rule instantiation counts. We show that for a variational graph auto-encoder/matrix factorization model, where links are conditionally independent given node embeddings, this can be done through matrix multiplication. The matrix multiplication approach to instance counting provides a differentiable training objective for deep graph learning. Our empirical results suggest that the quality of the generated graphs increases by orders of magnitude. Performance on downstream tasks such as link prediction and node classification improves too though not by as much.
- Connections.
  - Our method for computing expected instance counts under a model utilizes the result that in a VGAE/matrix factorization model, expected motif counts = motif counts in the expected graph. See my arxiv paper on "Computing Expected Motif Counts".
  - This paper is my latest attempt to address a difficult fundamental topic that I've been thinking about for years: how to derive predictions for single target instances from general relational frequencies. See the discussions in my previous papers such as "Fast Learning of Relational Dependency Networks" and "Modelling Relational Statistics With Bayes Nets". See also my 2015 talk on the subject. Rule moment matching was proposed by Kuzelka et al. to as a constraint on generative models. They show that the maximum entropy distribution that satisfies rule moment matching is given by a log-linear model (Markov Logic Network) with maximum likelihood parameters. The difference is that such a log-linear model only achieves moment matching. For example, if the observed graph exhibits strong community structure and/or nodes with strong centrality measures, the log-linear model will not capture such patterns unless they are somehow expressed by the given rules. Our approach aims to get the best of both worlds: the graph neural networks can capture complex relational patterns like community structure, but through moment matching they also incorporate the general frequency knowledge expressed in the logical rules.
  - The Fenstad representation theorem also incorporates a moment matching condition. I believe that it provides a strong foundation to the effect that any query answering system that assigns probabilities to first-order formulas can be represented as satisfying moment matching. I hope to work and write more on this later. Our 2024 ECAI paper on subgraph queries aims to show how deep graph generative model can be used to answer such quries.
- Workshop Poster
From Graph Diffusion to Graph Classification. Jia Jun Cheng Xian*, Sadegh Mahdavi*, Renjie Liao, Oliver Schulte (2024). Workshop on Structured Probabilistic Inference & Generative Modeling. SPIGM@ICML 2023.
A generative model P(X,Y) entails a classification model P(Y|X) since joint probabilities entail conditional probabilities. Thus in principle, one derive a classifier from a generative model. A famous example is that logistic regression can be derived from a generative model where P(X|Y) is a class-conditional Gaussian model. Whether it is more effective to derive a classifier from a generative model than to train one directly depends on the data type and use cases. For non-relational i.i.d data, it is well-known that classifiers derived from generative models (generative classifiers) can be more sample-efficient than classifiers trained directly. In this paper we investigate deriving a graph classifier from a graph diffusion model. Our main finding is that you can use a diffusion model as a zero-shot classifier without retraining, but to get competitive classification accuracy, you need to use a discriminative training objective for the diffusion model.
When I started thinking about generative classifiers derived from deep models (see arxiv paper below), it seemed like an unusual approach to graph classification. Since then several papers have appeared on generative classifiers for image and text data. Deep generative classifiers seem to have become a hot topic. I believe we are the first to develop a deep generative classifier for graph data.
- Workshop Poster
- From Graph Generation to Graph Classification. Oliver Schulte (2023). Arxiv Preprint. Outlines the basic idea and theory.
A Simple Latent Variable Model for Graph Learning and Inference. Manfred Jaeger, Antonio Longa, Steve Azzolin, Oliver Schulte, Andrea Passerini (2023). Proceedings Learning on Graphs (LoG).
Our 2020 IJCAI paper showed that in principle, exchangeable/permutation-invariant models (node ids don't matter) that are projective (graph size doesn't matter) can be represented with 1-dimensional node embeddings and a suitable decoder function. Here we learn the decoder function in the style of a block model, by jointly assigning nodes to clusters and learning a tabular decoder for cluster pairings. Manfred came up with a clever importance sampling scheme to jointly learn the clusters and decoder function. One advantage of this model is that it's flexible for a number of different graph prediction queries. For example, marginalizing over unspecified links is tractable because it can be done by discrete summation.
Neural Graph Generation from Graph Statistics. Kiarash Zahirnia, Oliver Schulte, Mark Coates, Yaochen Hu (2023). Proceedings Neural Information Processing (Neurips). In our Neurips paper last year, we showed how graph generation can be improved (and controlled) by extracting graph statistics and matching them during training. By "graph statistic" I mean things like degree histograms, graph diameter etc. In this paper we take the idea a step further: suppose we do not look at individual adjacencies at all, but only match graph statistics . This is actually what traditional parametric graph generation methods do (e.g. you can generate an Erdos-Renyi graph from the density parameter). Neural graph generation from statistics only works quite well on real-world graphs: the graphs are much more realistic than what is generated by parametric methods, and competitive with what neural methods generate from the full adjacency matrix. Other than scientific interest, why would we want to do this? The motivation here is privacy . We often need to protect individual relationships (e.g., is Taylor Swift dating Ethan Slater), so we cannot assume they are available for learning. So we only ask for aggregate statistics (e.g., how many people has Taylor Swift dated?). In fact, we can protect even the aggregate statistics with a differentially private perturbation mechanism. Learning from perturbed node-level statistics works much better than learning from perturbed adjacencies, because the node-level statistics are more coarse-grained and hence more stable.
Computing Expected Motif Counts for Exchangeable Graph Generative Models. Oliver Schulte (2023). Arxiv Preprint.
A computational problem that arises quite frequently in network analysis is to find the expected number of motif instantiations given a parametrized model. In our 2022 Neurips paper for example, we trained a generative model by matching the expected graph statistics to observed graph statistics (e.g., expected number of triangles to observed number of triangles). In that paper, we approximated the expected graph statistics by applying the graph statistics to the expected graph. The expected graph is a probabilistic graph where each link is assigned a probability between 0 and 1. The expected graph is easy to compute for encoder-decoder graph models, so the approximation is a big help computationally. In this note I prove that the expected motif count is exactly the motif count in the expected graph.
Joint Link Prediction Via Inference from a Model. Parmis Naddaf, Erfaneh Nejad, Manfred Jaeger, Oliver Schulte (2023). Proceedings Computational Intelligence and Knowledge Management (CIKM). Github.
Most deep graph learning papers look at the predictive tasks of node/graph classification or link prediction. Statistical-relational learning systems actually allow you to pose a much richer set of queries, for example about a whole group of links. For example, you can ask whether Jennifer Anniston and Courtney Cox will both join the cast of a new movie. This paper considers how to extend the range of queries you can answer with a trained generative model to groups of links, which we call joint link prediction. The basic idea is to compute the likelihood of a target set of links, conditional on a set of observed evidence links (and node attributes). If the set of evidence links forms a subgraph, our approach reduces to conditional graph generation. But we cannot assume this in general, so we have to fill in unspecified evidence links among the observed nodes. Our approach is importance sampling, where we use a distribution that sets all unspecified links to 1, weighted with a prior distribution that sets them all to 0. This works well for joint link prediction, and better than an independent prediction baseline. But I feel there is definitely more to be done to extend query answering with graph generative models.
Related material.
- Workshop Version, presented at Deep Graph Learning Workshop@AAAI 2022. Deep Variational Inference for Inductive Link Prediction. Oliver Schulte, Parmis Naddaf, Xia Hu, Manfred Jaeger. Workshop Video.
- Connections: This paper is in the spirit of approximate Bayesian inference where the goal is to answer a wide range of queries by conditioning on a generative model.
Deep Learning of Latent Edge Types from Relational Data. Kiarash Zahirnia, Oliver Schulte, Ke Li, Ankita Sakhuja and Parmis Nadaf (2022).Proceedings of the 35th Canadian Conference on Artificial Intelligence 2022. Award for Best Student Paper. Conference Video., Github.
The starting point of this paper is the observation that many relational structures are generated by latent edge types not given in the data. For example, IMDB lists all cast members of Mission Impossible, but does not indicate that Tom Cruise is its star. We design a system that infers latent edge types, in a way that is easy to build on top of existing graph learning methods. It is essentially a multi-head architecture (like the Transformer) where we learn k parallel node embeddings representing k latent edge types. The observed edges are then reconstructed as linear combinations of the latent edge types. This approach shows promising results in recovering ground-truth edge types and improves link prediction. Thank you to Canadian AI for the best student paper award!
Related material.
- Workshop Version, presented at Deep Graph Learning Workshop@AAAI 2022. Deep Learning of Latent Edge Types from Relational Data. Kiarash Zahirnia, Oliver Schulte, Ke Li, Ankita Sakhuja and Parmis Naddaf. Workshop Slides , Workshop Video.
Micro and Macro Level Graph Modeling for Graph Variational Auto-Encoders. Kiarash Zahirnia, Oliver Schulte, Parmis Nadaf and Ke Li (2022) Proceedings NeurIPS 2022. Github and Poster, Slides, Video, OpenReview.
Before deep graph learning, graph models were often based on general graph patterns or graph statistics. For example exponential random graph models or in Markov logic networks are defined by a set of patterns aka sufficient statistics aka motifs (e.g. number of triangles, cliques). The weights of these patterns are learned so that expected pattern instance counts match observed pattern instance counts. So I wondered how we can combine the well-established approach of generating graphs based on statistics with deep generative models? Our answer is a joint training objective where a moment-matching objective regularizes a deep learning objective so that expected graph statistics for the neural generative model match expected graph statistics. We develop a computationally efficient method that scales to medium-sized graphs. Statistics matching works well to improve the realism of graphs generated by encoder-decoder models, which are the fastest model at generation time.
Related material.
- Previous Arxiv preprint. Generating the Graph Gestalt: Kernel-Regularized Graph Representation Learning. Kiarash Zahirnia, Ankita Sakhuja, Oliver Schulte, Parmis Nadaf, Ke Li and Xia Hu (2021).
Pre and Post Counting for Scalable Statistical-Relational Model Discovery. Richard Mar and Oliver Schulte (2021). Tenth International Workshop on Statistical Relational AI@ICLR, Athens. Poster.

This paper builds on work on learning first-order Bayesian networks for relational data. My previous Learn-and-Join algorithm is the most scalable graphical model discovery method for multi-relational data. But it does not scale well in the number of link types because we precompute a set of counts for all possible combinations of link types. The number of possible combinations grows exponentially in the number of link types and far exceeds the number of combinations actually observed in the data. We developed a more scalable approach where counts for multiple link types are computed only on-demand during the Bayes net search, as they are needed to score potential parents in the Bayesian network.
Model-based exception mining for object-relational data. F. Riahi and Oliver Schulte (2020). Data Mining and Knowledge Discovery 34(3), pp. 681--722.

This paper develops an application of a first-order Bayesian networks for anomaly detection or exception mining. The starting point is the observation that a first-order Bayesian network defines a number of motifs or patterns. For example, an edge like Gender(User) -> Rating(Movie, User) defines a set of 1-link motifs involving a movie and a user. Now for a given node, we look at its ego network and compute the distribution of the Bayes net motifs in its ego network (e.g., the number of 5-star ratings that Jane Doe has given) with the distribution over all nodes (e.g., the number of 5-star ratings given by a random woman user). Now anomaly detection can be performed by computing a distance between two distributions. We actually developed a new one for this purpose, which is approximately total variation distance. There is an extensive evaluation on athlethes, movies, actors, other fun types of data.
Leveraging Approximate Constraints for Localized Data Error Detection. Mohan Zhang, Oliver Schulte and Yudong Luo (2021). Proceedings of aiDM '21: Fourth Workshop in Exploiting AI Techniques for Data Management, pp. 36--44.

This paper builds on our SigMOD paper. There we developed the idea that if you are given statistical (in)dependence constraints, which could come from a user or a machine learning system like a Bayesian network, you can check your data for violations of the statistical constraints. This paper develops a tree-based localization technique for finding where in the data the statistical constraint is violated to a high degree. That is the area you should take a closer look at for data cleaning. We demonstrate this technique in hockey data and in synthetic data.
SCODED: Statistical Constraint Oriented Data Error Detection. Jing Nathan Yan, Oliver Schulte, MoHan Zhang, Jiannan Wang, and Reynold Cheng (2020). Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD’20), pp. 845–860.

I agree with Pedro Domingos that data quality management is a potential killer app for statistical modelling and especially statistical-relational modelling. In this paper we develop the idea that if you are given statistical (in)dependence constraints, which could come from a user or a machine learning system like a Bayesian network, you can check your data for violations of the statistical constraints. We show that independence constraint generalize standard constraints like multi-valued dependencies. On the algorithmic side, we present an efficient method for repairing violations of statistical constraints.
Deep Generative Probabilistic Graph Neural Networks for Scene Graph Generation. Mahmoud Khademi and Oliver Schulte (2020). Proceedings AAAI Conference on Artificial Intelligence (AAAI 2020), Vol. 34(7). We also contributed two related papers to the Neurips 2020 workshop on graph representation learning.

This (and the paper below) are early papers on what has by now (2023) become a very hot topic: combining text and images through embeddings to analyze both. We were especially interested in scene graph generation and visual question answering. We anticipated auto-regressive graph generation in this paper (building up a graph incrementally) and also the idea of using reinforcement learning to sequentially build up a structured object. I don't think we were the first to apply RL for structured generation, but it is the main idea behind GFlownets which are gaining popularity.
Dynamic Gated Graph Neural Networks for Scene Graph Generation. Mahmoud Khademi and Oliver Schulte (2019). Asian Conference on Computer Vision, LNCS 11366, pp. 1–17, 2019. The link is to our own camera-ready version. Conference Slides. The official publisher's version is here.
One of the most challenging and important applications of relational learning is to extract relational information from unstructured data. Hoon and Domingos have referred to this as a "killer app" for statistical-relational methods. In Natural Language Processing, building a network of relational facts that summarize what is in a text known as Information Extraction or Knowledge Graph Discovery. In computer vision, building a network of relational facts that summarize what is in an image is known as scene graph generation. The current paper presents an approach to building up a scene graph from an image. It is based on deep learning to extract objects and features and reinforcement learning to decide what network component to add next.
Which is harder: NLP or Vision? I have studied both information extraction in NLP and scene graph generation in vision. I found it interesting to compare the two.
- An advantage of NLP is that a text contains discrete labels for entities (nodes). Thus in "Harry loves Ginny" we have "Harry" and "Ginny" to refer to entities, and "loves" to refer to their relationship. In vision, if we had a picture of Harry next to Ginny, an algorithm has to analyze the image to figure out that there are two people depicted, and ideally find their shapes. This is called "semantic segmentation" in vision and it's a hard problem. In this paper we use bounding boxes, so we get an idea of what objects there are and where they are located, but not their exact shape.
- Similarly a text contains a discrete label for relationships (e.g. "loves"), and you can more or less tell from English grammar that that is a relationship between the two entities. In contrast, if Harry is for instance hugging Ginny in an image, you would have to infer that relationship. This is close to a problem called "activity recognition", also hard. Recognizing something more complex like "Harry is looking at Ginny lovingly" may be beyond the reach of computers forever.
- An advantage of vision is that an object occurs just once in an image. So if you can find it, you don't have to link multiple occurrences. In NLP, you might have "Harry loves Ginny. She agreed to marry him." and you have to recognize that "Harry" and "him" refer to the same entity. This kind of entity linking is easy for humans but very difficult for computers. A counterpart in vision would be object tracking in video. But even there the occurrence of an object in different frames is subject to strong continuity constraints that you don't have in language.
- An advantage of vision for relationship detection is that the image comes with easily recognized spatial relationships ("on top of", "to the left of"). For example if you can manage to localize a human and a bike, you can compute easily that the human is on top of the bike, and then infer (with an appropriate model) that the human is riding the bike.
A Complete Characterization of Projectivity for Statistical Relational Models. Manfred Jaeger and Oliver Schulte (2020). Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI 2020), pp.4283-4290.
This paper settles the question that we started to investigate in our workshop paper: when are graph generative models independent of population size (see discussion below)? Basically this paper shows that if you add exchangeability/permutation invariance (node ids don't matter), then for projectivity it is necessary and sufficient to use node embeddings such that links are conditionally independent given node embeddings. Which is the common assumption in encoder-decoder models. So our theorem gives a kind of foundation for encoder-decoder models in graph generative learning.
Inference, Learning, and Population Size: Projectivity for SRL Models. Manfred Jaeger and Oliver Schulte (2018). StarAI Workshop @IJCAI. Best Paper Award.
A desirable property of a relational or network model is that adding new entities or nodes to a model about which we know nothing, does not change the predictions of the model for previously known entities or nodes. For example, if you read that Facebook has added another million users, this should not change your probabilistic inferences about your current friends. Technically, this means that given a large set of n nodes, the marginal predictions over a smaller set of m nodes should be the same as if you made predictions based on only the m nodes to begin with. In statistical process theory, this property is called projectivity. This paper asks whether common statistical-relational models are projective in this sense. Unfortunately the answer is generally no, unless the model has a restricted form, where links are independent given the attributes of the involved entities. This conditional independence property is weaker than assuming that entities or nodes are independent (as in i.i.d. data), but enough to ensure projectivity.

This work as inspired by the great paper of Shalizi and Rinaldo on projectivity in exponential random graph models. They consider homogeneous graphs with no attributes only, we consider models for heterogeneous graphs as well.

Personal view: I'm starting to think that the only way to get projective network models is build a conditional independence type model. These are in fact the currently most popular ones, well-known variants include matrix factorization, exchangeabilty, blocks models, embedding models in deep learning. I don't have a theorem to that effect, but I've been trying hard to get projectivity, beyond what is in this paper. Also in our paper on relational dependency networks we showed that what to my mind is a natural log-linear definition of conditional probabilities in a relational model, cannot be made consistent with a single distribution over possible worlds (networks).
Tutorial for learning Bayesian networks for complex relational data. Chapters 3 and 4 discuss parameter and structure learning. Presented at AAAI, ECML, Canadian AI.
Locally Consistent Bayesian Network Scores for Multi-Relational Data. Oliver Schulte and Sajjad Gholami (2017). Proceedings IJCAI 2017, pp.2693-2700.

Sums up our theoretical work on scoring Bayesian networks for relational data. To my mind the pinnacle of model selection theory for Bayesian networks is the work by Chickering and Meek on identifying an optimal Bayes net structure. In this paper we show how to generalize any decomposable model selection score for Bayesian network so that it applies to relational data, preserving Chickering and Meek's local consistency properties: as the number of observations increases, 1) edges necessary for representing the data generating distribution will be added, and 2) unnecessary edges will be deleted. Part of the reason why it took me a long time to get this result is that it is not possible to achieve it in the traditional model selection setting, where one defines a score for a single model. Instead for multi-relational data, you need to use a gain function that measures how much one model B1 improves over another B2 . Our multi-relational gain functions cannot be represented as the difference of two single-model scores (score(B1)-score(B2)).
Connections. This extends my paper on a tractable pseudo-likelihood function for Bayesian networks. That paper showed how to generalize the standard i.ie. likelihood score for multi-relational data. The new paper shows how to add a model complexity penalty term to the likelihood score.
Related material
- Model Selection Scores for Multi-Relational Bayesian Networks, Sajjad Gholami, Oliver Schulte, Vidhi Jain (2017). Invited Paper at the DeLBP workshop. A short version focusing on likelihood-based scores (e.g. AIC, BIC).
- Consistent Bayesian Network Scores for Multi-Relational Data. Oliver Schulte and Sajjad Gholami (2016). Presented at the StarAI Workshop, IJCAI 2016. Has more ablation studies (e.g. evaluating scores without any scaling).
Learning Bayes Nets for Relational Data with Link Uncertainty. Zhensong Qian and Oliver Schulte (2013). Proceedings IJCAI-GKR Workshop on Graph Structures for Knowledge Representation and Reasoning. Springer Lecture Notes in Artificial Intelligence, Vol. 8323, pp. 123-137.
A shorter version appeared in the IJCAI-AIBD Workshop on AI and Big Data.

We extend Bayes net learning to find correlations between different types of links or relationships (e.g., users who watch a video about Barbie dolls also search for Barbie dolls on the web). The computational problem is that to find such correlations, you need to count both the number of existing relationships and the number of nonexisting relationships. That is, the number of users who searched for Barbie dolls and the number who did not. The fast Moebius transform can be used to efficiently find these counts. This algorithm allows us to look for link correlations in database tables that contain hundreds of thousands of rows.
Learning Graphical Models for Relational Data via Lattice Search. Oliver Schulte and Hassan Khosravi (2012). Machine Learning, 88:3, pp.331-368.
Final Version before typesetting
Our main Learn-and-Join algorithm for on structure learning for Bayes nets with relational data. One of the key problems in relational learning is that correlations among entities and their attributes depend on the kind of links, or chains of links, that exist between the attributes (what Jiawei Han has called a "metapath"). But there are many potentially relevant link chains, so somehow we need to efficiently search them and combine the results in a single model. Our proposal is to use the fact that chains have a lattice structure (subchains are like subsets). We do a level-wise search through this lattice, starting with short chains and increasing the chain length one step at a time (in the spirit of the Apriori algorithm). Correlations or adjacencies found on shorter chains are propagated to Bayes net learning on superchains. The general intuition is that a dependency should be evaluated in the most specific possible context. There are many details to take care of, so the paper is pretty long. But half of it presents simulation results, so if you just want to know how the algorithm works, there is not so much to read. For parameter learning, we convert the Bayes net to a Markov net, then apply Markov Logic Network parameter learning and inference. This turns out to be an effective way to do inference with Bayes nets in relational data, and an effective way to learn Markov models. A 2-for-1 deal: two types of graphical models with one algorithm.

Connections. The structure search algorithm of Friedmann et al. uses a similar hierarchical strategy. The main difference is that they do not propagate edges as constraints, but reconsider each edge using the results of learning for shorter chains as a starting point.
Related material
- Structure Learning for Markov Logic Networks with Many Descriptive Attributes. Hassan Khosravi, Oliver Schulte, Tong Man, Xiaoyuan Xu, Bahareh Bina (2010). Proceedings of the Twenty-Fourth Conference on Artificial Intelligence (AAAI), pp.487-493. The conference version of the Learn-and-Join algorithm.
- Conference Presentation (Powerpoint).
FACTORBASE: multi-relational structure learning with SQL all the way. Zhensong Qian and Oliver Schulte (2018). International Journal of Data Science and Analytics, Springer, 2018, 7(4), 1-21. . The link is to the final revision typeset by us. The official publication link is here.
My first systems/software engineering paper. I have been working on machine learning for multi-relational data for years now. Programming machine learning for relational data is hard! After years of working hard to develop code, I think I have found a framework that makes implementing multi-relational learning feasible. In fact, with our new approach it's not only feasible, it's fun! Well, if you're inclined to think that programming is fun.

The key idea is to use a design philosophy that has come out of the database community: Use a database system to store data and models. This is thinking outside the box because most people think of a database system as something that manages data, not something that manages models. One obvious advantage is that the DB system handles big models as easily as it handles big data: no more worries about how to store 10M parameter values in main memory, for instance. But even more important for statistical-relational learning is the fact that relational algebra provides a high-level language for managing structured objects. This is because structured objects make programming for relational learning difficult and error-prone:
- The random variables have internal structure, because they are defined in first-order logic.
- Graphical models are common, so we have structured models with structured components.
- Tables of parameter values and tables of sufficient statistics need to be cross-referenced with metadata about structured random variables.
- Many structure learning algorithms, like our learn-and-join algorithm and gradient boosting, need to manage sets/ensembles of models.
If you store all these objects as tables in the database, you can use relational algebra in the form of SQL scripts to construct, manage, and cross-reference complex big structured statistical objects. For example, we have written a five-line SQL query that computes the predictions of a log-linear model on a million test cases in a very efficient and reliable way. Another great thing about SQL is that it is standardized. So our system works out of the box as long as your data are stored in relational database system. And I'm hoping that different research groups will share and reuse SQL scripts. Kimmig, Mihalkova and Getoor have written a great survey on statistical-relational learning. Our FactorBase system implements pretty much all the operations they describe using MySQL scripts.
The first paper that I know of about implementing a graphical model in a relational database (for i.i.d. data) is by Wong et al. The general approach is often associated with the BayesStore system due to Wang et al.
More material on this topic.
- FactorBase: Multi-Relational Model Learning with SQL All The Way. Zhensong Qian and Oliver Schulte (2015). Proceedings Data Science and Advanced Analytics (DSAA 2015), pp. 1--10.
- Conference presentation at DSAA 2015. 15 minute lecture. PDF format.
- 4-slide spotlight presentation at the UAI-StarAI 2015 Workshop.
- UAI-StarAI 2015 Workshop Poster. Perhaps the best way to get a quick idea of the details.
- 6-page version at the 2015 Learning Systems NIPS Workshop. Assumes less background in statistical-relational learning than the StarAI paper, but more background in machine learning than the DSAA paper. The Poster.
- Preprint. This is an expanded version. It adds information about how we query the system catalog to define a set of random variables. For single-table data, you basically can assume that each column defines a random variable. For relational data, this is more complicated. (Walker et al. call this the "setup problem".) Most systems fall back on requiring the user to define random variables in a system-specific language. We show that you can provide a good default setup by querying the DB system catalog, push-button with no user input. Singh and Graepel take a similar approach for latent variable models.
Propositionalization for Unsupervised Outlier Detection in Multi-Relational Data. Fatemeh Riahi and Oliver Schulte 2016). Proceedings of FLAIRS 2016. Conference Presentation.

This paper presents a novel application of Markov Logic Networks to outlier detection. The key idea is that the formulas in a Markov Logic Network represent features that can be used as input to an outlier detection method. For example in soccer data, we may have a formula like DribbleEfficiency(Team,Match) = high, PassEfficiency(Team,Match) = hi. This represents an interaction between the dribble efficiency and the pass efficiency of teams. Counting how many times this formula is instantiated for a particular team amounts means to counting how many times the team achieves high dribble efficiency and high pass efficiency. In other words, the count represents an interaction between dribble and pass efficiency.
We can illustrate how this works for Premier League data. We apply our Markov Learning Network method to learn 331 conjunctive features. We make a data matrix with 331 columns and one row for each team. A cell contains the number of times the feature is instantiated for each team. (We also tried the proportion of instantiations and TF-IDF features.) Then this data matrix is treated as a standard matrix for i.i.d. data (pseudo-iid data view), and given to a single-table outlier detection method (cite implementation).

What I like about this paper is that outlier detection is a novel application of Markov Logic Networks. It is also a novel application of the propositionalization strategy. Propositionalization refers to preprocessing a relational database to “flatten” it into a single table. I also like the term “relation elimination” because the preprocessing eliminates all relationships and leaves only features of individual entities. Relation elimination has been extensively researched for classification, but I believe that we are the first to use this strategy for outlier detection.
Fast Learning of Relational Dependency Networks. Oliver Schulte, Zhensong Qian, Arthur E. Kirkpatrick, Xiaoqian Yin and Yan Sun (2016). Machine Learning, pp.1-30.

This paper pulls together threads on a difficult topic that I've been thinking about for years: how to derive predictions for single target instances from general relational frequencies. The core of the paper is a log-linear formula to use a first-order Bayesian network for classification. For instance, suppose you wanted to predict the genre of a movie based on (i) its country of origin, (ii) the genders of its actors, (iii) the ratings of its users and their genders. Our log-linear formula combines all this different information in a way that properly generalizes the log-linear Bayesian network classification formula for the case of i.i.d. data. The key property that we wanted is that the formula should work well if you use the empirical observed frequencies as Bayesian network parameter estimates (e.g. the percentage of movie-actor pairs where the movie is a drama and the actor is a woman). In order for this to be the case, you need to use proportions, not counts as feature functions in the log-linear model. Ravkic, Ramon and Davis also use frequencies not counts.
Once you have a classifier formula that defines a conditional probability for each target node given values for all the other nodes, you can put the different conditional probabilities together in a dependency network to perform joint inference to predict the values of multiple nodes. As explained by Neville and Jensen, the big advantage of dependency networks over Bayesian networks is that they can represent the cyclic dependencies that are so typical of relational data (if you and I are friends, my affection for Harry Potter predicts yours, and vice versa). In previous work, we developed the learn-and-join algorithm that efficiently learns a Bayesian network that represents relational frequencies. (See our paper " Modelling Relational Statistics With Bayes Nets" below). For inference about single target instances---like the genre of "Fargo"--we convert the Bayes Net template model to a relational dependency network to perform prediction at the instance level. Since we use observed frequencies as parameter estimates, that can be computed directly from the data, parameter estimation is very fast. And since the learn-and-join algorithm (see below) provides fast structure learning, the whole package finds a relational dependency network in a scalable way. On the order of 5 min for most databases. The largest is an instance of the IMDb movie database with over 1M ratings that takes about 40 min for structure learning. We're working to increase scalability even more---stay tuned!

What I like best about this approach: That we can work with large datasets. And that the parameter values can be interpreted, since they represent observed frequencies of events in the data. This means that the reasoning behind the system's conclusions can be explained in terms of intuitively meaningful weights for intuitively relevant factors.
The problems addressed in this paper concern the most important differences between learning from relational data and learning from independent i.i.d. data points. It's taken me a long time to get clear on these differences. In case you want to know more about the general issues, here's an explanation of what I've learned.
More material on this topic.
1. Fast Learning of Relational Dependency Networks. Oliver Schulte, Zhensong Qian, Arthur Kirkpatrick (2014). Conference on Inductive Logic Programming (ILP 2014). The conference predecessor. Shorter version. Compared to the journal paper, this is missing the characterization of when the resulting dependency networks are consistent.
2. Presentation for the ILP Conference.
3. Random Regression for Bayes Nets Applied to Relational Data. Oliver Schulte, Hassan Khosravi, Tianxiang Gao and Yuke Zhu (2012). UAI-StarAI Workshop on Statistical-Relational AI. Short Workshop paper. The main new material here are empirical results on using counts instead of proportions in the log-linear conditional probabilities models for the dependency network parameters. This is the official recommendation of the Alchemy system for how to convert first-order/parametrized Bayesian networks to Markov Logic networks. But counts as feature functions do much worse than proportions.
  The paper also emphasizes the random regression interpretation of the proportion model: it is equivalent to the expected value of a log-conditional probability given a randomly selected instance of the families in the Markov blanket.
4. Poster Presentation for the SRL Workshop
Model-based Outlier Detection for Object-Relational Data. Sarah Riahi and Oliver Schulte (2015). IEEE Symposium Series on Computational Intelligence, CI and Data Management Track (IEEE SSCI 2015), pp. 1590-1598.
Best Student Paper Award. Thank you to SSCI!

Unsupervised outlier detection is a major application area for generative models. I also think it is an important problem from an AI point of view, because it relates to the tension between wanting rules that are as general as possible and as specific as necessary. (Remember Tweety the bird who is also a penguin?). To apply model learning to outlier detection, the idea is to build a model of normal behaviour. Then given a potential outlier datapoint, you apply the model to compute how likely the features associated with the data point are. If the likelihood is low according to the normal population model, you have an outlier.

The difficulty is that for relational data, the low-likelihood approach does not work just as is, because for a potential outlier object, there is not a single feature vector. Instead there is a relational substructure that we refer to as the object database . For instance, if we want to know whether a soccer player is unusual, we want to look at a set of feature vectors, one for each match. One approach is to score the potential outlier object's database using the random selection likelihood function I have developed in previous work. This works quite well and is a nice application of the random selection likelihood I think. But looking closely at synthetic and real-world datasets (Movies, Sports), Sarah Riahi and I found we could improve on this baseline method.

The basic idea is to look at the likelihood ratio (KL divergence) between two parameter settings: parameters learned from the entire population reflecting general normal behaviour, and parameters learned for the individual object. Given a Bayesian network representation of relational frequency, the likelihood ratio amounts to comparing correlations that hold in the general population to correlations that hold for a specific object. This idea of looking for unusual correlations, often expressed as association rules, is quite common in rule-based approaches and in subgroup discovery. It falls out nicely from the Bayesian network formalism. Finally, we realized we could improve even more by applying a mutual-information decomposition to the normal and individual parameters before computing the likelihood ratio. This means that we compare two different quantities to score outliers: the difference in individual feature distributions, considered in isolation, and the difference in the strength of correlations among features ("the lift"). This may sound complicated but the results of the decomposition are in fact easy to interpret. We show that by going through rankings of soccer players and movies as unusual.
Connections. We can think of a relational outlier as being like a subgroup of size 1. This means we can transfer ideas from subgroup discovery. What we arrive at is basically a relational version of Arno Knobbe's Exceptional Model Mining framework. (I develop this perspective in my tutorial.) Also, Peter Flach's group in Bristol have been researching a related issue, how to measure the difference in class distributions in a subgroup. It looks like they've independently arrived at a similar formula, using expected log-distances to avoid the cancelling out. I was happy to see independent confirmation of this idea.
More material on this topic.
- SSCI Conference Presentation (CIDM track).
- A Proposal for Statistical Outlier Detection in Relational Structures. Fatemeh Riahi, Oliver Schulte, Qing Li (2014), AAAI-StarAI Workshop.
  A predecessor paper at a workshop. The difference is that instead of comparing two parameter settings (general and individual) for a fixed Bayesian network structure, here we compare the likelihood ratios of two different Bayesian structures: one learned for the general population and one for the individual. I still think this is a good idea, but it is also more complicated than fixing the structure and comparing only parameter values. So for the conference paper, we went with the simpler version.
  An interesting result in this paper is that there is a strong correlation between how anomalous an individual is, according to this, measure, and how successful the object is. For example, the correlation between the standing of a UK Premier League team and its anomaly measure is 0.61. Considering how many factors go into the final standing, it's remarkable to me that any single metric should be so strongly related to this outcome. We show similar results for players and movie ratings. We definitely plan to follow up on this.
The CTU Prague Relational Learning Repository. Jan Motl and Oliver Schulte (2015). E-print archive

I am a co-creator and co-administrator of the Prague Relational Learning Repository. I hope you find it useful!
Aggregating Predictions vs. Aggregating Features for Relational Classification. Oliver Schulte and Kurt Routley (2014). IEEE Symposium Series on Computational Intelligence, CI and Data Management Track (IEEE SSCI 2014).

Another attempt to understand discriminative learning for relational databases. We're getting closer to a solution, see our paper on learning Relational Dependency Networks above. This paper investigates traditional methods, with fairly extensive simulations. The big picture is that relational learning needs to combine information from different related entities. For example, research has shown that you can predict the gender of a Twitter user quite well from his or her tweets. Unigram features work well for a variety of languages. This means that basically you can just look at how even a user tweets certain words. You can build a quite accurate classifier that looks at just a single tweet to predict gender. Suppose you have not just a single tweet for a user, but 100. Presumably you can predict his or her gender even better, but how? There are two basic strategies: 1) you can aggregate the features of different tweets into a single feature. For instance, if the word "foo" is used more often by women, you could compute the average frequency of "foo" over all 100 documents. In this way you end up with a single feature vector summarizing statistics for all 100 documents. 2) You could make a prediction for each tweet separately, then aggregate the predictions. For instance, you could estimate a probability that the user is a woman based on each tweet, then average the resulting probabilities. We call that aggregating scores (usually called a "combining rule" in previous work). Both approaches have been developed extensively in previous work on relational learning. What we haven't seen previously is a comparison of which works better. So in this paper we try out a pretty large set of aggregates for both features and scores and compare. The details depends on how you aggregate and what base classifier you use (e.g., logistic regression, SVM). Still, the overall message is that using a standard classifier with a set of feature aggregation functions works best. The main reason is that in that case, standard classifiers will find good aggregate features for you. In contrast, it's not easy to learn good score aggregation rules, and there isn't a universally best one for every dataset.
For me a striking thing about this work is that my preconceptions were overturned by simulations. In these situations computer science feels more like an empirical than a mathematical discipline. Jensen and Neville have done great work on statistical issues (type I/type II error) that arise when features are aggregated for relational classification. An intuition about this is that aggregating increases your feature space while at the same time decreasing the number of effective data points. Statistically, this sounds like a nonstarter. For instance, if you aggregate a user's tweets into a single number using, say, average, median, max, min, you have four new features but only a single data point rather than n data points for all of her or his tweets. Jiawei Han has published a paper on Naive Bayes Classifiers for relational data where he points to this loss of information. Nonetheless, score aggregation has its problems too. We provide a fairly extensive conceptual discussion of these statistical issues in the paper if you are interested.
Computing Multi-Relational Sufficient Statistics for Large Databases. Zhensong Qian, Oliver Schulte, Yan Su (2014). Computational Intelligence and Knowledge Management (CIKM).

This is an algorithmic paper that shows how to compute sufficient statistics---how to count how often events occur in a database that specify a certain query condition. The new challenge is that we want to count sufficient statistics for events that involve both positive and negative relationships. E.g., do users who watch a video about Barbie dolls also search for Barbie dolls on the web? Or how often does it happen that of three users, two are friends and the other two are not? One of the new contributions of this paper is that we take standard data analysis methods and show that they produce something better when you provide such statistics. These methods include feature selection, association rule mining, and Bayesian network learning. In our paper "Modelling Relational Statistics With Bayes Nets", we showed that the counting problem can in principle be solved by using the Fast Moebius Transform. But to make this work for large databases requires dealing with a lot of complexity in the data format, and making use of as much information in the relational database schema as possible. The dynamic Moebius Join program that we describe essentially builds up sufficient statistics for larger relationship sets from smaller ones. It utilizes the same lattice of relationship chains that we use to structure Bayesian network learning. (See "Learning Graphical Models for Relational Data via Lattice Search".)
One aspect of this paper that I like is that almost everything is done within the relational database system using SQL (relational algebra). We store contingency tables containing sufficient statistics in the database. We use SQL queries to build initial contingency tables from the data tables, and to build bigger contingency tables from smaller ones. The moral is that relational databases systems are good not only for managing big data, but also for managing big models!
Virtual Joins With Nonexistent Links. Khosravi H., Bina, B., Schulte, O. (2009). Inductive Logic Programming (ILP-09).
A precursor to our CIKM paper about computing multi-relational sufficient statistics. A new dynamic programming algorithm for computing the number of instantiations that satisfy a first-order clause. We need this for parameter estimation in our relational Bayes nets. Yes, the problem is #P-complete, but read on!
Modelling Relational Statistics With Bayes Nets. Oliver Schulte, Hassan Khosravi, Arthur Kirkpatrick, Tianxiang Gao, Yuke Zhu (2014). Machine Learning (94:1), pp. 105-125. The link provides the final version typeset by us; for the typeset journal version please see springer.com

This paper concerns the problem of modelling relational statistics, to answer queries like "what is the percentage of intelligent students in upper-level classes" or "what percentage of women give high ratings to action movies"? We refer to such queries as class-level queries, since they are about classes of individuals/links, rather than specific individuals. Class-level queries show generic correlations in a database, and can be applied in strategic planning and query optimization. This paper shows how Poole's Parametrized Bayes nets can be used for class-level modelling.
The conceptual contribution is to specify a new semantics for a Bayes net so that it is clear in which sense a Bayes net represent class-level frequencies. As Lise Getoor pointed out in her dissertation, the standard grounding semantics is not sufficient because it translates a 1st-order model into a statistical model of individuals, their attributes and relationships---the classes disappear. We propose using Halpern's random selection semantics instead. This works for any model based on 1st-order logic, for example also for Markov Logic Networks.
The computational contribution is an algorithm for computing sufficient statistics in a database when negated relations are involved. A social network example would be "find the number of users who are both from Germany and who are not friends". It quickly becomes infeasible to do this by enumerating all user pairs who are not friends. It turns out that we can compute sufficient statistics that involve negated relations, in terms of counts that involve only positive relations. We show in this paper that the classic fast Moebius transform algorithm can be applied to transform positive event counts into negated event counts in a highly efficient manner.
Modelling Relational Statistics With Bayes Nets. Oliver Schulte, Hassan Khosravi, Arthur Kirkpatrick, Tianxiang Gao, Yuke Zhu (2012). Conference on Inductive Logic Programming.
Shorter Version from ICML-SRL Workshop on Statistical-Relational Learning. Workshop Presentation.
Note: The ILP organizers ask for the final camera-ready version after the feedback at the conference has been given, so while the paper has been accepted for inclusion in the proceedings, what is posted not is not the final form. The SRL workshop paper is in the final stable form. These are shorter versions of our Machine Learning journal paper above.
All the reviews from the SRL workshop were good, but on this paper we received an usually knowledgeable review. This reviewer not only knew of the Moebius transform, but also knew that the version we are using is sometimes called the zeta transform. We appreciated this review very much.
Identifying Important Nodes in Relational Data Fatemeh Riahi, Oliver Schulte, Qing Li (2013). AAAI Late Breaking Paper Track.
IJCAI-HINA Workshop version (one page longer). Also presented at AAAI-SRL Workshop.

The idea of ranking nodes in a network is very familiar from web search. In this paper we pursue a different approach where the importance of a node is measured by how much being linked to the node explains the behaviour of other nodes linked to it. In terms of model selection, a node is important if introducing the node into a model leads to a better explanation of the relational patterns in a network. Introducing new terms into a model increases the model's capacity to explain observations, but also increases the complexity of the model. This trade-off can be quantified in terms of a statistical model selection score like BIC: A node is important to the extent that introducing it into the model improves the explanatory power of the model more than it increases its complexity.
A special case of statistical importance are team-player problems, where the goal is to measure the importance of a player's contributions. We apply the statistical importance measure to English Premier League soccer data.
Transaction-Based Link Strength Prediction in a Social Network. Hassan Khosravi, Ali Bozorgkhan, Oliver Schulte (2013). IEEE SSCI Symposium.
We present a new model of friendship strength in social networks. The input data are transaction logs between friends, and friendship ratings, indicating how close a friendship is. We had access to an anonymized version of the Cloob social network that provides friendship ratings. Our model assigns two latent factors to each user, one that describes their tendency to rate others, and another that describes their tendency to be rated. Following work on social recommendation by Ma et al., as well as Jamali and Ester, we assume a recursive dependency between latent factors: a user's latent factors are linear combinations of her friend's latent factors, where the number of transactions between them is the weight given to a friend's latent factors. In this way the transaction matrix can be used to help factorize the ratings matrix, without being factorized itself. It's hard to know whether there really is a recursive dependency between unobserved variables. Empirically, previous work showed that the recursive dependency model performs well for social recommendation and trust propagation. Our results in this paper show good predictive performance for friendship strength prediction, on Cloob and a synthetic dataset.
A Hierarchy of Independence Assumptions for Multi-relational Bayes Net Classifiers. Oliver Schulte, Bahareh Bina, Branden Crawford, Derek Bingham, Yi Xiong (2013). Proceedings IEEE SSCI Symposium, pp. 150-159.
We showed in the paper on decision forests below that an independence assumption can lead to fast accurate multi-relational classification. In this paper we advance our theoretical understanding of multi-relational classification by constructing a hierarchy of different independence assumptions, from weaker to stronger assumptions. We mathematically derive multi-relational classification formulas for each level in the hierarchy; all these formulas are log-linear models. The weakest assumption is that different links are independent of each other given the descriptive attributes of the linked entities. This is a new principle. The other hierarchy levels correspond to previous classification formulas proposed by other researchers; this means that the conditional link independence assumption is implicit in previous formulas. While the link independence assumption leads to the best classification performance, the advantage is small. A bigger improvement is achieved by using logistic regression to learn weights for different data tables, as in our paper on multi-relational decision forests below.
Simple Decision Forests for Multi-Relational Classification. Bahareh Bina, Oliver Schulte, Branden Crawford, Zhensong Qian, Yi Xiong (2013). Decision Support Systems 54(3): 1269-1279. The link provides the final version typeset by us; for the typeset journal version please see here.

This is my first paper on discriminative learning (classification) for relational data, as opposed to learning generative models. The special approach taken here is to begin with an independence assumption, roughly that information from different rows is independent given the class label. This can be thought of as a kind of naive Bayes assumption for relational data tables. (Manjunath, Murty and Sitaram from HP Research independently developed a multi-relational classifier based on a similar assumption.) Having defined this assumption formally, we can derive a multi-relational classification model from it. Basically what you get is this: apply a standard single-table probabilistic classifier separately to individual data tables, then use a log-linear formula to combine the different probabilistic assumptions. A big selling point for this approach is scalability: the independence assumption allows us to learn a model for different tables independently, which is very fast.

Classifiers built on the independence assumption work quite well, but there is a problem in that information from a data table may be redundant given information from others, which hurts classification accuracy. My student Bahareh Bina came up with an elegant approach to pruning/weighting data tables: take the aggregate predictions from different tables as regressors in a logistic regression model. Then standard logistic regression learns weights for different tables. Redundant tables receive weight 0, and the nonzero weights provide a ranking for which relational pathways carry the most information.

Erratum: The typesetters messed up the reference numbering at the very last stage. If you're trying to track down a reference, hopefully you can find it using the title or author mentioned in the text. Otherwise send me an email.
Challenge Paper: Marginal Probabilities for Instances and Classes Oliver Schulte (2012). ICML-SRL Workshop on Statistical Relational Learning. Workshop Presentation.
The workshop organizers solicited papers that pose a challenge for the SRL community. I thought this was a great idea. I had a challenge ready: A problem that I have been thinking about for several years and not been able to solve. The problem is this: In his classic paper on logic and probability, Joe Halpern proposed a basic principle that in the absence of any further information, our probabilistic beliefs about an individual should track the frequencies in the population that the individual comes from. Thus if the only thing we know about Tweety is that Tweety is a bird, then the probability that Tweety flies should be the proportion of fliers in the bird population. Halpern proves that this is equivalent to the relevant instance of Miller's principle. James Cussen pointed out to me that it can also be seen as an instance of David Lewis' famous Principal Principle. In the paper I give four different arguments supporting the principle.
The challenge is this: As reasonable as it seems, it's not easy to get an SRL system to satisfy the principle. What you need is for marginal inferences about ground atoms to track population frequencies. In the paper I give some example of how this can be worked out for various type of Bayes net structures in special circumstances. What I'd really like is a general result of the form "as long as the database is representative of population frequencies, learning a stochastic logic program will produce an model that satisfies Halpern's principle". Or similar results for Bayes nets, stochastic logic programs, etc. The only model class for which I can give a general argument are latent variable models, like those obtained by matrix factorization.

Update: I since realized that Markov logic network parameter learning satisfies Halpern's principle, with maximum likelihood as the objective function. This is because in a log-linear model, the maximum likelihood estimates have the property that the expected feature instantiation counts are the observed instantiation counts.

Update: Chris Re gave a great invited talk at the StarAI 2015 workshop. He said that he makes everybody who wants to work on DeepDive take an oath that they will validate their system by calibrating marginals against the data. His reason is that marginal probabilities are relatively easy for users to check and give them confidence in the system. That's yet another argument for why SRL models should give marginal inferences that track frequencies in the data.
Learning Directed Relational Models With Recursive Dependencies. Oliver Schulte, and Hassan Khosravi (2012). Machine Learning, 89:299-316.
Extended Abstract appears in Proceedings of the Conference on Inductive Logic Programming (ILP 2011), LNAI 7207, pp.39-44.
One of the biggest problems in using Bayes nets with relational data is the fact that the data feature cyclic dependencies (see papers below). One observation we prove in this paper is that this happens exactly when you have self-relationships where an entity type is related to other members of the same type (e.g., users are friends with each other in a social network). I proposed a pseudo-likelihood measure that allows us to learn Bayes nets efficiently even in this case (see SIAM SDM paper 2011). This paper presents a set of experiments that focus on learning with the pseudo-likelihood on databases with cyclic dependencies.

There is a nice theoretical contribution as well. Relational Bayes nets can represent recursive dependencies elegantly using "copies" of variables. Thus we can have a rule like "Smokes(X), Friend(X,Y) -> Smokes(Y)" to say that the smoking of a person X is associated with the smoking of their friends. The problem for structure learning now is that we don't want to treat all the copies of Smokes equally, because that would lead to a duplication of the same statistical patterns. Our proposal is to designate one of the copies as the main copy (e.g., "Smokes(Y)"). The rule is then that only the main copy can have parents. So a copy like "Smokes(X)" is just an auxilliary random variable for modelling the dependence of smoking on itself. We prove a theorem to the effect that restricting Bayes nets to this format involves no loss of generality. In a way this is a detail but it's an important one, and without the theorem it's not easy to see how to address it.
Learning Compact Markov Logic Networks With Decision Trees. Hassan Khosravi, Oliver Schulte, Jianfeng Hu and Tianxing Gao (2012). Machine Learning , 89:257-277.
Extended abstract appears in the Proceedings of the Conference on Inductive Logic Programming (ILP 2011), LNAI 7207, pp.21-26.
One of the ideas we have been working out is to combine learning for directed relational models with inference in undirected relational models. This addresses the problem of how to do inference when there are cyclic dependencies in the relational data (see papers below). One problem we noticed is that if you just convert a Bayes net into a Markov Logic Network, you get many logical clauses (rules). The reason is that tables of conditional probabilities contain more rows than necessary because the tabular representation does not exploit local independencies for a given child-parent configuration. In this paper we show that if you use decision trees instead of conditional probability tables in the Bayes net, you get a much more compact set of rules. This makes the learned set of rules more readable, improves weight learning for rules, and ultimately increases predictive accuracy.
A tractable pseudo-likelihood function for Bayes Nets applied to relational data. Oliver Schulte (2011). Proceedings of the SIAM SDM Conference on Data Mining, pp.462-473.
The most practically useful theoretical paper I have written. It considers the question of how we can quantify how well a Bayes net model fits the information in a relational database. In fact, the proposal works for any statistical-relational model that uses 1st-order logical variable. This is a problem because in relational data, units are not independent of each other, so we cannot apply the model to each unit separately and multiply the results. My proposal is to define the model fit by considering how well the model explains, on average, the properties of a random instantiation of the 1st-order variables. For instance in a social network, we can imagine randomly sampling pairs of actors, and applying the Bayes net to compute a probability for that pair. The average log-probability over all pairs is my proposed pseudo-likelihood measure.

The paper establishes various desirable theoretical properties of the pseudo-likelihood measure, gives an efficient closed form for computing it, and describes efficient algorithms for Bayes net parameter and structure learning that maximize the measure. Warning: the reviewers said the paper was "a little dense", so you may want to look at my powerpoint presentation first.

Connections. Xiang and Neville independently introduced the same normalization idea for Markov Logic Networks.

Erratum In Table 1, the column labelled C(X) should be omitted.
The Minimum Consistent Subset Cover Problem and its Applications in Data Mining . B. Gao, J. Cai, M.Ester, O.Schulte and H. Xiong (2007). Proceedings of KDD 2007 (Knowledge Discovery and Data Mining).
Consider the following algorithmic problem: Given a set of points X, and a consistency constraint on subsets of X, find a minimum collection of consistent subsets that covers all points in X. We show that a number of data summarization problems in data mining can be viewed as instances of this problem (e.g., finding a minimum rule set for classifying, frequent pattern search, converse k-clustering). Many of these problems satisfy a monotonicity constraint: all subsets of a consistent set are also consistent. For subset cover problems with the monotonicity constraint, the paper describes new heuristic algorithms that take advantage of the constraint.
Learning Class-Level Bayes Nets for Relational Data. O. Schulte, H. Khosravi, F. Moser, M. Ester (2009). CS-Learning Preprint Archive.
This is our first description of our research results on learning Bayes nets for relational data. The main idea is to learn Bayes nets for the class-level distribution over attributes and relationships, as defined in Halpern and Bacchus' classic AI work from the 90s. Shorter versions are available as follows.
- 3 pages: Bayes Nets for combining logical and probabilistic structure . Schulte, O., Khosravi H., Bina, B. (2009). STRUCK workshop at IJCAI-09.
- 6 pages: Join Bayes Nets: A New Type of Bayes net for Relational Data. Schulte, O., Khosravi H., Bina, B., F.Moser (2009). GKR workshop at IJCAI-09. More on structure learning for statistical-relational models.
Defining Association Rules in the Relational Calculus, O. Schulte, F. Moser, M. Ester, Z.Lu (2007). SFU School of Computing Science Technical Report 2007-23.
I'm starting to work on various issues that arise when we combine probabilistic concepts with a logic-based formalism, like relational databases. This paper defines the concept of support and confidence for a large class of association rules, based on safe queries in the domain relational calculus (a logical query language equivalent to SQL). The key problem is this: given a logical formula with free variables and a database instance (finite model), what is the percentage of free variables that satisfy the formula? For example, consider the formula Customer(X) AND Student(X). It seems this should be neither the percentage of students who are customers, nor the percentage of customers who are students. Our proposal is to define the frequency as the ratio of (#people who are both students and customers)/(#people who are students or customers). We extend this basic idea to general safe queries and prove various metatheorems about it, such as the a priori property.

The key differences with other relational rule approaches (e.g., WARMR) are this: 1) we do not assume that the user specifies a single target or base table; instead, we consider combinations of base tables (Student union Customer). 2) The base tables are defined dynamically and depend on the particular query formula, so there is no fixed table with respect to which the frequency of all formulas is defined. 3) Our rule language is much richer than conjunctions of atoms; it allows nested quantification and negation.

Sports Analytics and Reinforcement Learning
Sports data are a rich application area for many of the topics I'm interested in: machine learning for structured data: especially spatio-temporal processes, heterogeneous networks: lots of different agents with different types of relationships, and game theory: decision-making in multi-agent interactions. My general vision is to apply reinforcement learning to sports data. This is a rich well-developed area of machine learning that fits sports very well. In many papers I use reinforcement learning to learn a value function for a sport (hockey so far) and then apply the value function to answer questions about the sport, like evaluating the performance of players.
Risk, Reward, and Reinforcement Learning in Ice Hockey Analytics. Sheng Xu, Oliver Schulte, Yudong Luo, Pascal Poupart, Guiliang Liu (2024). Chapter in Artificial Intelligence and Machine Learning in Sports Science, Springer 2025. I'm posting a previous draft, for the final version see the official book publication. The editors translated our article into German.

Follows up on our Neurips 2023 paper on risk-sensitive player evaluation. My motivating intuition for this work is the observation that strong players take many risks. Think of Messi dribbling around 3 defenders before passing the ball, or Sidney Crosby carrying the puck towards the goal rather than dump behind it. Our Neurips paper built tools for estimating the distribution of a team's success chances using distributional reinforcement learning . In this paper we use the tools to assess the risk-taking behavior of NHL teams. For each action by a team, we compute the standard deviation of success chances due to this action. Riskier actions that carry a higher risk of failure but also a higher chance of scoring will be assigned a higher standard deviation. If stronger teams take more risks, we expect risk-taking to correlate with team standing (total team points over a season). We found a 0.92 correlation between a team's risk-taking and their season standing . Here risk-taking is measured by the sum of outcome standard deviations associated with actions by the team. So the connection between risk-taking and team success is even stronger than I expected. We also investigate other measures of risk such as Gini deviation and value-at-risk familiar from portfolio theory.
Match Predictions in the National Hockey League using Box Scores. Michael Davis, Tim Swartz, Oliver Schulte, Juan Higuera, Mehrsan Javan. (2023). Chapter in Statistics Meets Sports: What We Can Learn from Sports Data. I'm posting a previous draft, for the final version see the official book publication.

Box scores give you match-level statistics (e.g., this team won x face-offs, scored y goals). This paper shows how you can generalize rating systems like Elo to take into account box scores. The basic idea is to compute a fractional-type Elo rating that predicts, when team A plays team B, what share of events each team will have. E.g. the faceoff Elo ratings of team A and team B could predict that team A will win 70% of the face-offs. The event-specific elo ratings can be fed into a match outcome predictor and do seem to have predictive value, measured in terms of accurate forecasting and betting gains.
Learning Tree Interpretation from Object Representation for Deep Reinforcement Learning. Guiliang Liu, Xiangyu Sun, Oliver Schulte and Pascal Poupart (2021) Proceedings of the Conference on Neural Information Processing Systems (NeurIPS) 2021, pp. 19622--19636.

The general topic here is interpreting learned latent representations, one of the challenges of deep learning. The setting here is deep reinforcement learning with objects, e.g. based on image data. We assume an encoder-decoder architecture for the value function. The goal is to help interpret the image/object representations. Our idea was to discretize the latent space using a kind of regression tree such that discretized regions make meaningful differences to the learned value function.
Uncertainty-Aware Reinforcement Learning for Risk-Sensitive Player Evaluation in Sports Game. Guiliang Liu, Yudong Luo, Oliver Schulte and Pascal Poupart (2022). Proceedings Conference on Neural Information Processing Systems (NeurIPS) 2022.

I was telling viewers of our poster that "this is the only application paper at the Neurips conference". One of the big under-explored issues in sports analytics is the risk-reward trade-off. For example in basketball, you can take a long shot and try to score 3 points, or go for a less risky short shot and be satisfied with 2 points only. Many sports analytics researchers argue that coaches are too conservative and should take more risks to maximize their overall chances of winning. But how to quantify risk taking? In this paper we show that risk can be quantified by applying distribution reinforcement learning. In the sports context, distributional RL learns to assign to each match context, not only a probability of team success, but a distribution over team success (e.g. over the future scores at the end of a possession). Thus it captures not only the mean but also the variance of the value distribution. We then rank players according to risk-seeking behavior (high variance outcomes, basically) vs. risk-averse behavior (favouring low-variance outcomes). I think this approach to quantifying the uncertainty in dynamic sports predictions is powerful.
Distributional Reinforcement Learning with Monotonic Splines. Yudong Luo, Guiliang Liu, Haonan Duan, Oliver Schulte and Pascal Poupart (2022). Proceedings of The Tenth International Conference on Learning Representations (ICLR) 2022.

Develops a new approach to distributional RL, where the distribution of the value function is learned through quintile regression, and the quintiles are represented splines. We apply this technique in our paper on risk-sensitive player evaluation.
Valuing Actions and Ranking Hockey Players With Machine Learning. Oliver Schulte (2022). Proceedings of the Linkoping Hockey Analytics Conference (LINHAC) 2022., pp 2--9.

They invited me to give a keynote at the Linkoping conference and asked me to contribute an extended abstract. I took the opportunity to write a short, and I hope accessible, introduction to how machine learning can be used for valuing actions and ranking players.
Deep soccer analytics: learning an action-value function for evaluating soccer players. See here for version typeset by us. Liu G, Luo Y, Schulte O, Kharrat T. Data Mining and Knowledge Discovery (2020). Vol 34(5):1531-59.
This paper applies the reinforcement learning approach for ranking hockey players to soccer. Soccer is even harder to model than hockey because goals are so sparse (0 to 1 per game vs. 3-4 goals in hockey). So our neural network architecture for predicting team success is a bit more complex (two towers). Also we used a fine-tuning approach where we pooled data from different leagues and then fine-tuned for a specific league. If I had known the term then, I might have called it a foundation model for soccer! As with ice hockey, the system gives sensible and informative action values and player rankings.
More material on this topic.
- ECML Conference Presentation Slides.
- Conference Video.
Cracking the Black Box: Distilling Deep Sports Analytics Xiangyu Sun Jack Davis Oliver Schulte Guiliang Liu. Proceedings ACM SIGKDD 2020, pp.3154–3162.
This is one of my favourite papers on interpretability. We show that you can distill a black-box neural net model for an expected value function into an interpretable white-box model using linear-model trees. Our datasets are large (7M data points), which broke existing packages for learning linear model tree. We discovered you can use segmented regression to find the split points in a tree. Segmented regression is an iterative procedure and scales very well, basically sublinear. The tree really helps you understand the model. In fact, it helped us realize that the model made no sense in certain circumstances, given domain knowledge. Tracing back the nonsense to the data helped us realize a mistake someone had made in data preprocessing. So mimic learning can be a data cleaning tool as well!
More material on this topic.
- Conference Presentation Slides.
- Conference Video.
Inverse Reinforcement Learning for Team Sports: Valuing Actions and Players Yudong-Luo, Oliver Schulte, Pascal Poupart. Proceedings IJCAI 2020, pp.3356-3363. Code.
I have been working on what I call "precision sports analytics", trying to understand not only the general patterns in a sport, but also the specific strengths and weaknesses of a team. Inverse RL offers one approach to this customized understanding. The general idea in inverse RL is to infer the rewards an agent is seeking from their behavior---what are they after? In this way we can learn for example that one team may value possession, whereas another team tries to move quickly into their offensive zone. We developed a Markov game approach for inverse Rl. Another innovation compared to standard inverse RL is that we regularize the inferred rewards with the actual rewards (goals). The learned rewards are plausible and dense, so get a idea of progress for a team in every match situation, not just goals once in a while.
At the same conference I saw a paper that applied inverse RL to infer the goals of tumors in cancer cells. Very cool! I think they could also benefit from the idea of regularizing learned rewards by actual rewards (e.g., multiplying is a reward for the tumor).

More material on this topic.
- Conference Presentation Slides and Conference Presentation Poster .
- Conference Video.
- Workshop Slides.
  Great Team Sports Workshop at AAAI! A special place in my heart because that's the last conference I attended in person since the Covid pandemic.
Learning Agent Representations for Ice Hockey Guiliang Liu, Oliver Schulte, Pascal Poupart, Mike Rudd, and Mehrsan Javan. Proceedings Neurips 2020.
This is our main paper on what I call "precision sports analytics", trying to understand not only the general patterns in a sport, but also the specific strengths and weaknesses of a team. We learn an embedding for each of over 2K professional athletes in the NHL. Or approach is to train a recurrent variational auto-encoder to predict which player is acting in a given match context (the player identification problem). We can then use to encoder part to produce a dynamic player embedding for each match context. The embeddings make sense in visualizations and help with downstream tasks.
More material on this topic.
- Neurips Presentation Slides.
- Neurips Video.
- Workshop Version. Workshop Slides.
  Great Team Sports Workshop at AAAI! A special place in my heart because that's the last conference I've attended in person since the Covid pandemic.
Toward Interpretable Deep Reinforcement Learning with Linear Model U-Trees. Guiliang Liu, Oliver Schulte, Wang Zhu and Qingcan Li. Proceedings ECML 2018. Conference Presentation Slides.

Value functions learned by neural nets can represent powerful complex sequential patterns. But it is opaque just what patterns they have captured. We developed a mimic learning approach to interpreting value functions associated with Markov models, using linear model trees. A linear model tree is like a regression tree but with linear models in the leaves. You can think of it as an interpretable ensemble of linear regression equations. Compared to regression trees, linear model trees are much more compact.
Interpreting Deep Sports Analytics: Valuing Actions and Players in the NHL. Guiliang Liu and Wang Zhu and Oliver Schulte (2018). Proceedings of the 5th Workshop on Machine Learning and Data Mining for Sports Analytics @ECML PKDD', LNAI 1330, Springer, pp. 93--105. Workshop Presentation Slides.

We have developed a mimic learning approach to interpreting value functions associated with Markov models, using linear model trees (see ECML 2018 paper above). This paper applies this approach to interpreting the value function we obtained by applying deep learning in ice hockey (see IJCAI 2018 paper). This allows us to assess feature importance (e.g. how important is manpower for scoring goals?). We also use the model trees to quantify how exceptional a player is.
Deep Reinforcement Learning in Ice Hockey for Context-Aware Player Evaluation. Guiliang Liu and Oliver Schulte (2018). Proceedings IJCAI 2018, pp.3442-3448.
IJCAI Conference Presentation Slides.
Bringing deep RL to sports analytics! The set-up is the same as in our Sloan paper, but with the neural net we can work with spatio-temporal coordinates without having to discretize them. And with the recurrent neural net, we can learn from sequences directly. Nice visualizations of scoring chance dynamics through time and space.
This is a complex domain with millions of data points, so the neural net is hard to train. We could not fit all the details into the paper. One small thing I'm proud is a novel initialization, where we train the neural net first to predict the average Q-value for the average input vector. This helped a lot. It's a simple idea but I have not seen this elsewhere.
Model Trees for Identifying Exceptional Players in the NHL and NBA Drafts. Yejia Liu and Oliver Schulte and Chao Li (2018). Proceedings of the 5th Workshop on Machine Learning and Data Mining for Sports Analytics @ECML PKDD', LNAI 1330, Springer, pp. 69-81. Workshop Presentation Slides. National Hockey League Draft Dataset.

This is my first paper based on draft data (as opposed to play-by-play data). The basic problem is this: given junior league box scores (e.g. points scored, goals scored, etc.), predict future performance in a professional league. We used a logistic regression model tree. This type of model provides several features that draft analysts have looked for: 1) The tree splits set threshold that define groups of players (e.g. all players with scouting rank above 10 and plus-minus above 1). 2) The linear regression equations in the tree leaves provide feature importance weight. 3) The predictive accuracy is state-of-the-art.
One of my favourite aspects is a novel solution to the excess-zeros problem: having a large number of data records where your target feature for prediction is 0. For example, only about half of the players with an NHL contract ever actually play an NHL game. Our target is to predict the total number of NHL games for each player, so we have many 0s. Our suggestion is to adopt a classification approach instead: predict whether a player will play any games, then use the confidence that they will play >0 games to rank them.
Apples-to-Apples: Clustering and Ranking NHL Players Using Location Information and Scoring Impact. Oliver Schulte, Zeyu Zhao, Mehrsan Javan, Philippe Desaulniers. (2017). MIT Sloan Sports Analytics Conference.

Probably the most succinct and accessible write-up the Markov game approach to evaluating first actions, then players. we've done. What is new compared to the journal paper below, is that we cluster hockey players into types, then rank them within each cluster. Generally ranking players makes most sense for comparing those with similar roles.
Reality Check. I like tracking how players we ranked highly do in the future. Our ranking identifies Erik Karlsson as the top player. And he's been exceptional in the 2017 playoffs! We mention Jack Eichel as a rising star with relatively low salary, based on the 2015-2016 NHL season. He's having a killer 2016-2017 season.

Connections. That ranking and clustering should go together is the main insight behind the RankClus system from Jiawei Han's group for ranking individuals in relational data.
Erratum. When we made the player activity heatmaps, we inadvertently used the wrong sign for the y-coordinates, which means that we switched the left and right side of the rink (from the perspective of the attacking team). Here is a corrected version. Thank you to everyone on Twitter for pointing this out---peer review has helped us!
More material on this topic.
- Sloan Poster.
- SFU News published an article about the Karlsson connection.
- Breakdown of Clusters: Head counts and Heatmaps for each learned cluster.
  - Defensemen
  - Forwards, including a breakdown by position.
A Markov Game Model for Valuing Actions, Locations, and Team Performance in Ice Hockey. Oliver Schulte, Mahmoud Khademi, Sajjad Gholami, Zeyu Zhao, Mehrsan Javan, Philippe Desaulniers. (2017). Data Mining and Knowledge Discovery, pp.1-23. Final version available on link.springer.com (paywall)

More Markov game modelling for the NHL. The difference compared to our UAI paper is that the model presented includes spatial information about puck location. We discretize the spatial locations by clustering them, with a separate cluster for each action type. Another difference is that in addition to ranking players, we also rank teams. What surprised me here is that simply adding up the values of actions by a team does not work. This is because our action values are defined in how they change the current goal scoring chances, and in the sum of these deltas over a game almost all terms cancel out. What works well though is Zeyu Zhao's suggestion of adding the values of a team's states throughout the game. Intuitively, this measures how long a team manages to keep how much of an advantage. This metric correlates highly with team match results.
Erratum . The numbers for the states and the actions given in the introduction are inconsistent with the numbers in the later detail sections. That's because our earlier version had the state include the last action. The referees suggested changing the definitions to simplify them, but then we did not update the numbers in the introduction, only later in the text. Thank you to Niel Rustenburg for pointing this out.
Biased Penalty Calls in the National Hockey League. Beaudoin, David and Schulte, Oliver and Swartz, Tim (2016). Statistical Analysis and Data Mining: The ASA Data Science Journal. Volume 9, Issue 5, October 2016, Pages 365-372.

This paper is an instance of the problem of identifying unusual individuals (outlier detection, exception mining) in structured data. In this case we are talking about referees in the National Hockey League. The twist is that, we are interested not so much in whether a particular referee is different from a typical referee, but whether a particular referee is different from the ideal referee. The ideal referee for us is one who on average gives as many penalties to the home team as to the away team. The tricky part is modelling how referee behavior may depend on context factors such as game time and game score.
Connections. To evaluate a referee, say Tim Peel, the paper considers (i) the population of all referees, (ii) the population of all referees minus Tim Peel, and evaluates whether the referee set shows more bias with Tim Peel than without. In the Exceptional Model Mining framework, one compares the individual (subgroup) against the whole population without the individual. In our exception mining/outlier detection work (below with Riahi), we compare the individual against the whole population. These three options seem to be pretty much cover it.
A Markov Game Model for Valuing Player Actions in Ice Hockey. Kurt Routley and Oliver Schulte (2015). Proceedings Uncertainty in Artificial Intelligence (UAI 2015), pp. 782-791.
The high-level idea in this paper is to treat sports analytics as a branch of reinforcement learning. (In the sense of AI, not psychology.) We build a big Markov game model for ice hockey as played in the National Hockey League. In this type of model, the game moves to state to state with a certain probability, which depends on the actions taken by both teams. Our model contains over 1.3 million states and is learned from over 8 million events, from 10 NHL seasons. As in my previous papers on learning big models from big data, we use a relational database to store both the data and the model.

We apply the Markov game model to assign a value to actions . Evaluating the actions of players is a common task for sports analytics. Reinforcement learning has developed the concept of an action-value function, often denoted as the Q-function, that addresses this problem using AI techniques. There are efficient dynamic programming algorithms for computing the Q-function even for large state spaces. For sports analytics, the Q-function has two key advantages.
- Context-dependence. The impact of an action depends on the game context in which it's executed. For example, a goal is worth more if the game is tied than if one time is leading by four. A second shot on goal shortly after a first one is more likely to succeed. Etc.
- Lookahead. Hockey actions have medium term effects, not only immediate ones. For example, gaining a powerplay does not necessarily lead to a goal immediately or within the next minute.
We use the action values to rank players: each time a player performs an action, we assign him "points" depending on the action impact as measured by the action value function. The ranks are given by the sum of all points over a season. This is like the well-known +/- score but instead of adding/subtracting one point only when a goal is scored, we assign a continuous range of points for all of a player's actions.

If you are into ice hockey, it's fun and revealing to look at the resulting rankings. For instance, all three members of St. Louis' famed STL line (Schwartz, Tarasenko, Lehtera) are among the top 20 in our list. In fact, Jori Lehtera tops our goal impact list, although his linemate Tarasenko outscored him by far. Our analysis suggests that Lehtera's actions create the opportunities that Tarasenko exploits, leading to a high goal impact ranking for Lehtera and a high goal score for Tarasenko. This pattern fits the traditional division of labor between a center and a wing player. Tarasenko is also the most undervalued player in our list. Starting with the 2015-2016 season, St. Louis has signed him for an annual average more than 7 times higher, which our analysis strongly supports.
Another interesting case is Jason Spezza, who was on top of our goal impact list in the 2013-2014 season. However, his plus-minus score was an awful -26. This is because his team the Ottawa senators was even worse overall (-29 goal differential). The goal impact score rewards players for good decisions even if their teammates do not convert them into goals. After the 2013-2014 season, Spezza requested a trade, which according to our analysis was definitely the right thing for him to do. With the Dallas Stars, his plus-minus score came more in line with his goal impact score, as we would expect (-7 overall, still negative).
As I mentioned, we also look at players' impact on penalties. Sadly, two Vancouver Canucks players take spots 2 and 4 when it comes to actions that lead to penalties (Dorsett and Bieksa). Even more sadly, none of the Canucks player make it into the top 20 when it comes to actions that lead to goals.

More material on this topic.
- Supplementary Material.
- Poster.
- 2-slide spotlight presentation. Perhaps the best way to get a quick idea of the details.
- MLSA presentation at ECML 2015. 15 minute lecture. PDF format.
- MLSA Workshop paper at ECML 2015. Shorter workshop version. Does not show the range of action values, looks only at goals, not penalties. Shows that goal impact rankings are consistent in the sense that they correlate highly from one season to the next. Compared to the UAI paper, this is a bit more geared towards readers interested in hockey analytics and less in the computer science aspects. The Q-value ticker pretty much tells the whole story in one picture (Figure 1).
- Overview Talk. This presentation adds more context about sports analytics in general, with references. I've given versions of this at the Universities of Alberta, BC, and Fraser Valley.
Learning Bayesian Networks and Causal Models
When Should Reinforcement Learning Use Causal Reasoning, Oliver Schulte, and Pascal Poupart. 2025. Transactions on Machine Learning Research, 31 pages. Video and Slides.
Conditional probabilities measure correlations, not causation, and can give wrong estimates as to the effects of actions. For example, the probability of rain is high given that you observe open umbrellas, but an agent cannot hope to make it rain by opening their umbrella. Since reinforcement learning is about learning to act to maximize a reward, it would appear that causal reasoning should be key to reinforcement learning. Nonetheless, most reinforcement learning is based on conditional probabilities, and standard textbooks happily spend 500+ pages without once mentioning key causal concepts like interventions and counterfactuals. The question we address in this paper is when reinforcement learning does and does not need causal reasoning. Our answer is that it depends on the RL setting: in a setting where the RL agent is learning from actions whose causes the agent can observe, conditional and causal probabilities coincide, and RL algorithms can remain blissfully unaware of the difference between correlation and causation. For example, if you are driving around town, you know why you made each decision, and you can learn from the results of your actions with traditional conditional probabilities. We formalize and prove this claim using causal theory and the concept of a partially observable Markov decision process.
In contrast, if the causes of the exploratory actions that generate the data are only partially observable for the learning agent, conditional probabilities can badly misestimate causal probabilities, and therefore badly mislead agents into poor decisions, as we show with examples. We refer to this setting as the POOOP setting: partially observable off-policy offline RL. For example, if you are watching other drivers and you are missing information that they are using to make decisions (e.g., traffic signs), you are in the POOOP setting and you need to estimate causal probabilities, not simply use conditional probabilities.
Cause-Effect Inference in Location-Scale Noise Models: Maximum Likelihood vs. Independence Testing. Xiangyu Sun and Oliver Schulte (2023). Proceedings Neural Information Processing (Neurips). Arxiv Preprint.

In a standard statistics textbook view, causal inference is a simple story: the way to do is to run a randomized controlled trial; without an RCT, you cannot rule out confounders so you can't infer anything. Causal modelling researchers such as SGS have looked deeply at the case where we can guarantee from domain knowledge that no confounders are present (causal sufficiency). There is a rich theory of what you can and cannot learn and how. For example, you can distinguish a structure X -> Y <- Z from a causal structure X -> Y -> Z. The second chain structure implies a statistical correlation between X and Z (assuming faithfulness), whereas the first implies that X is independent of Z. However, it looked like telling the causal direction between two causal variables seemed hopeless. Statistically rain and umbrellas go together, and without background knowledge or temporal information, how can you statistically distinguish the directions X -> Y and Y -> X?

Shinizu's Lingam model introduced a brilliant approach to the 2-variable setting. Suppose that you assume a linear model with noise Y = aX + Z where Z is an unobserved noise variable accounting for unknown causal influences. Then given an observation of X and Y, you can solve for Z. Let us write Z' = Y - aX where Z' is called the residual since it measures the difference between the observed value Y and the model's prediction aX. Now Shinizu's ideas that since you can infer Z' from the data and the model, you can treat it as if you had observed it in a causal model X -> Y <- Z' . So we have essentially the 3-variable case, which can be addressed using statistics as we have seen. For example, we can test whether Z' is independent of X (i.e., whether the residuals are independent of the independent variable X). Great stuff!

Now this basic strategy can be applied to any latent variable where we can infer the latent variable as a residual. For example, it can be applied to an additive noise model Y = f(X) + Z where f can be an arbitrary nonlinear function, or a location-scale model Y = f(X) + g(X) x Z . The location-scale model is the most general one considered so far in causal modelling and has recently received a lot of attention. The question we asked in this paper is whether one should decide on the causal direction X -> Y or Y -> X for a location-scale model using (1) maximum likelihood or (2) Shinizu's strategy of computing the residual Z' and testing for the independence between X and Z'. Our answer is that you should do independence testing. The problem with the causal model likelihood is that computing it requires specifying a prior p(Z) over the noise term, and we don't really know what the right prior is. That would be okay if it didn't matter, but we show theoretically and through extensive experiments that misspecifying the prior p(Z) leads to identifying the wrong causal direction. Independence testing is robust to prior misspecification, because it is based on the noise posterior Z' given X and Y, not the noise prior.
Generative Causal Representation Learning for Out-of-Distribution Motion Forecasting. Shayan Shirahmad Gale Bagi, Zahra Gharaee, Oliver Schulte, Mark Crowley (2023). Proceedings International Conference on Machine Learning (ICML). Poster, Slides.

Pearl has developed an elegant general theory of "transportability", transferring causal knowledge between different domains. The basic idea is to keep as much as possible the causal mechanisms as domain-invariant and handle domain differences through a selection variable. For example, flicking the switch -> light goes on is a domain-invariant mechanism, but the times at which the switch is flicked may vary between different domains. We formalized a continuous variational auto-encoder version of this idea, where domain selection variables are latent, and the system is trained to use domain-invariant mechanisms (decoders) as much as possible. Another innovation is to use a Gaussian mixture model as a prior for the latent domain/environment variable. Through test-time adaptation, we can then determine whether we are dealing with an entirely new environment (a new Gaussian cluster) or whether we can subsume the test environment to one we have seen during training time. Empirically the system works well for predicting pedestrian motions in different cities. I saw a paper at Neurips 2023 that applied the same model to temporal graph data. An interesting variant for sure.
NTS-NOTEARS: Learning Nonparametric DBNs With Prior Knowledge. Xiangyu Sun, Oliver Schulte, Guiliang Liu and Pascal Poupart (2023). AISTATS, published in Proceedings of Machine Learning Research (PMLR) 2023, pp.1942--1964. Previous arxiv version. Github. Poster.
I have been working on learning Bayesian networks and causal models for a long time. I was pleased to see a recent breakthrough in the Notears system that makes it possible to use continuous optimization for learning a directed acyclic graph structure. The first step is to set up the problem so that you learn weights for each parent X to predict the value of a child Y. Basically, build a local model for predicting a child value from parent values. You can then use these weights to decide whether there should be a edge X -> Y. That basic idea has not been new, for example it is essentially Markov blanket discovery and also goes by the name of dependency network and Markov blanket network. The problem with this approach is that it typically leads to cycles in the resulting graph. E.g. you can have X-> Y -> Z -> X if X predicts Y, Y predicts Z, and Z predicts X. So we know how to learn local prediction models, but not how to learn them with a global acylicicty constraint. In a breakthrough analysis, Zhang et al. defined a continuous function h of the weight matrix W such that W minimizes h if and only if W represent an acyclic graphical model. Great stuff! We applied this idea to learning a DAG model for temporal data, where the local prediction functions are 1D convolutions. 1D convolutions are an elegant nonlinear way to combine different prediction lags. Worked great in our experiments! A Master's thesis from Lulea University conducted an independent evaluation of time series models and NTS-NOTEARS came out on top.

I worked on new algorithms for learning Bayesian networks that combine model selection with independence tests. This was inspired by my work on the learning theory of Bayesian networks.
The Imap Hybrid Method for Learning Gaussian Bayes Nets. O. Schulte, G.Frigo, R. Greiner and H. Khosravi (2010). Proceedings of the 23rd Canadian Conference on Artificial Intelligence (CANAI), pp.123--134, Springer LNCS 6085. Click here for a full version with proofs and more experimental results. The original publication is available at www.springerlink.com. Best Paper Award.
We continue the project of using the learning-theoretic analysis of learning Bayes nets from statistical tests ("Mind-change optimal learning of Bayes net structure from dependency and independency data") to develop new algorithms for this problem. We follow the hybrid approach of the previous I-map learning paper that combines model selection scores with statistical tests. The previous paper considered Bayes nets with discrete variables; this one considers continuous variables with linear Gaussian distributions. This is a subclass of structural equation models, a very widely used model class in economics, psychology, biology and other disciplines.

We give simulation evidence to show that in these models, standard model selection scores (like Bayes Information Criterion) tend to select overly complex models with too many edges. We propose using statistical tests to cross-check the model selection score: before we add an edge to the graph, we check to see if this would help to cover a statistically significant correlation. If not, we don't add the edge, even if it would increase the score. We prove that this procedure converges to a correct graphical model in the sample size limit. In simulations it gives a better match to the true data-generating graph, on average, than the unconstrained model selection score. The procedure is especially good at finding models with the correct number of edges, which is the measure of model complexity that comes out of applying the mind change analysis from learning theory to Bayes net learning.
A New Hybrid Method for Bayesian Network Learning With Dependency Constraints. Schulte, O.; Frigo, G.; Greiner, R.; Luo, W. & Khosravi, H. (2009). Proceedings IEEE SSCI Symposium, pp.53-60.
The learning-theoretic model of Bayes net structure learning presented in the paper "Mind Change Optimal Learning of Bayes Net Structure" below suggests using dependencies or significant correlations as a constraint in learning Bayes nets. We present a new hybrid criterion to make this idea mathematically precise (maximize a model selection score subject to the constraint of fitting the observed correlations), give an algorithm for optimizing the hybrid criterion, and report experimental results from simulations and real-world data. Our general conclusion is that the dependency constraints are helpful for discrete-variable Bayes nets that are not too large (10 nodes or less), especially on small to medium sample sizes (1,000-3,000). One of the main things I've learned is that multiple hypothesis testing is hard! Statisticians have recognized this for a long time and proposed various ingenious schemes. We tried a number of them with more or less success, but in our experiments it was hard to find one that solves the basic dilemma for all or even most problems: if the testing is too conservative (e.g., with low significance levels), you get little information from the test and diminishing returns on the computational overhead. If the testing is too aggressive, there are too many false positives leading to overly complex models. Hence our caveats on the range of problems for applying the hybrid method.

Automated Scientific Discovery
One of the questions that continues to fascinate me is why scientific discovery produces so much more powerful models than standard statistics or machine learning. Inspired by my work on learning theory, I developed several algorithms that computationally model this learning power, mainly for particle accelerator data.
Learning Conservation Laws Via Matrix Search. Oliver Schulte and Mark S. Drew (2010). Proceedings of the 13th Conference on Discovery Science, pp.236-250, Springer LNAI 6332. The original publication is available at www.springerlink.com.
Continues the research on learning conservation laws in physics and other areas of science. In my IJCAI 2009 paper I showed that choosing a maximally strict set of conservation laws---that rules out as many unobserved data points as possible---formalizes an important part of the methodology used by scientists. It's actually also a classic principle from Machine Learning known as selecting the least generalization of the data. In itself the principle is not enough to determine a unique set of conservation laws, so in this paper we added the requirement that the conservation laws should be maximally simple. "Simplicity" here means assigning small conserved quantities (the L1-norm of the conservation matrix). The combined criterion we call maximally simple maximally strict laws, or MSMS laws. A beautiful and surprising connection with particle ontology is that if there is a set of maximally strict laws that corresponds to a disjoint partition of particles into families (like a clustering), then that set is also the MSMS-optimum. This means that the MSMS criterion can be used to discover not only conservation laws but also particle families - it's like doing classification and clustering at the same time.

Finding an MSMS theory isn't easy because it requires minimizing the L1-norm with a non-linear constraint. My collaborator Mark Drew devised a clever way to do the minimization which is blindingly fast on realistic data sets. We show that this method re-discovers the laws and families in the Standard Model of particle physics, which is a very important scientific theory. It also works in chemistry, re-discovering the molecular structure of substances like ammonia and water.
Simultaneous Discovery of Conservation Laws and Hidden Particles With Smith Matrix Decomposition . Schulte, O. (2009). Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence (IJCAI-09), pp. 1481-1487.
One of the referees was kind enough to nominate this paper for the best paper award. It's another installment in my series on conservation laws in particle physics. Previously seen on this channel ("Inferring Conservation Laws in Particle Physics", 2000): A learning-theoretic analysis shows that minimizing mind changes requires a particle theorist to adopt the following principle: try to find a set of conservation laws that is consistent with all observed reactions, and rules out as many unobserved reactions as possible. I also showed that optimally fulfilling the principle may require hidden particles: Though it may seem paradoxical, explaining the data about observable entities may require positing unobserved entities. The 2000 paper gave a mathematical analysis that characterizes exactly when this phenomenon occurs. The technical answer is that it occurs when there is an unobserved reaction that is a linear combination of observed reactions, but one that involves fractions as coefficients (e.g., 1/2 reaction1 + 1/2 reaction2).

What's new is an efficient algorithm that allows me to compute when the fractionality condition for introducing hidden particles is met. Applying it to the actual particle data leads to the rediscovery of the electron antineutrino. This may not sound exciting for readers outside of particle physics, but in the particle physics literature they say that the existence of this particle is one of the two main questions the field is investigating. From a computer science point of view, the cool thing is that the algorithmic solution is to compute the Smith Normal Form of the reaction data: A hidden particle is needed if and only if the Smith Normal Form contains a diagonal entry greater than 1. I didn't know about the Smith Normal Form before I looked at this research. It's a classic tool for integer matrices from the 19th century, tells you a lot about the matrix you are dealing with. Kind of like singular value decomposition for real-valued matrices. So from learning theory we got to hidden particles, which led to fractions of reactions, which led to the Smith Normal Form.

How I found about the Smith Normal Form is a nice story about scientific collaboration, one of my best experiences in that regard. Once I had the question about hidden particles down to a linear algebra question about fractional linear combinations, I thought I could consult one of our theory experts, so I asked Arvind Gupta. Arvind said that if it was about linear algebra, I should ask Wayne Eberley from the University of Calgary. So I sent Wayne e-mail and he said it sounded like a job for the Smith Normal Form, and if that was the case, I should ask Mark Giesbrecht from the University of Waterloo. So I sent the problem to Mark, and he sent back the solution, complete with correctness proof and implementation advice. Thank you Mark!
Algorithmic Derivation of Additive Selection Rules and Particle Families from Reaction Data . O. Schulte and M.S. Drew (2007). Posted at Preprint Archive, http://arxiv.org/abs/hep-ph/0602011.
A succinct summary (6 pages) of the result of applying the learning-theoretically optimal algorithm to finding conservation laws and particle families. (The algorithm was described in "Inferring Conservation Principles in Particle Physics", see below.) Written for physicists. Shows how particle families (e.g., baryons, lepton generations) are uniquely determined by reaction data. States the result that for any class of reactions (e.g., strong interactions), the number of conserved quantities in that class is no greater than the number of stable particles without a decay mode in the reaction class. This assumes the principles of Special Relativity - my only piece of theoretical physics so far! (Probably ever.) This paper supersedes our previous technical report posted at http://www.cs.sfu.ca/research/publications/techreports/#2006.

Learning Theory
My dissertation was mainly about learning theory. My main theorem says that a certain combination of performance criteria for learning determines an optimal learning method for each problem where they are attainable. (The criteria are reliable, steady, and fast convergence to a correct hypothesis). The papers below describe the theory that led to this result. I have applied the analysis to determine and implement the optimal theory in several domains (particle physics, learning Bayesian networks).
Causal Learning With Occam's Razor. Oliver Schulte (September 2018). Studia Logica 107(5), pp. 991--1023. The link takes you to a typeset version for viewing but not for downloading. If you want a version for download, here is the version typeset by us before proof editing by the journal. The official link is on the springer.com website.

Continues my work on a learning-theoretic analysis of learning Bayesian networks (see paper below). The extension is to consider mind-change optimal structure and parameter learning rather than just structure learning. Conditional probability parameters are represented in decision trees, aka probability estimation trees. The main result is that mind-change optimal learning of probability estimation trees minimizes the number of parameters consistent with a set of observed dependencies. The overall mind-change optimal BN learning method therefore is to minimize the number of edges for the structure and the number of parameters given the structure. This is similar to scores like AIC and BIC that penalize the number of parameters. But the epistemological justification in terms of efficient convergence is very different.
Mind-change optimal learning of Bayes net structure from dependency and independency data. Schulte, O., W. Luo, and R. Greiner (2010). Information and Computation, 208:63-82.
Previous papers of mine developed an approach to inductive inference based on success criteria like fast convergence to a correct hypothesis, and minimizing theory changes. One application area was the discovery of conservation laws and particle families in particle physics. This work opens a new application domain: learning Bayesian networks . Our learning model is based on increasing information about what statistically significant correlations obtain between domain variables (conditional on others). For example, the learner may be told that "father's eye colour is correlated with mother's eye colour given the child's eye colour".
We apply previous our previous results from general learning theory to determine that the optimal method by our success criteria is to find a Bayes net that entails the observed significant correlations with a minimum number of edges , or directed dependencies between variables. So the learning theory leads to a new notion of simplicity for Bayes nets: the fewer edges, the simpler the net.
If the data are observed independencies, the optimal method is to conjecture the Bayes net with the largest number of edges that entails these independencies. The famous PC method for Bayes net learning seems to be a heuristic method for the independency case. The paper shows that it is NP-hard to determine whether there is a unique minimum-edge Bayes net for a set of observed probabilistic dependencies (or unique maximum-edge for independencies) As far as I know, this is the first NP-hardness result in the literature concerning the existence of a uniquely optimal Bayes net structure. Personally, it's the first NP-hardness proof I've done for a published paper (as opposed to exercise). It was fun but hard because of the uniqueness constraint (full proof runs over 7 pages). Since implementing the mind-change optimal learner is NP-hard, applying the learning theory requires heuristic solutions. I worked on this with Russ Greiner and my student Gustavo Frigo, see the "hybrid method for Bayesian network learning" papers.

More material on this topic.
- Mind Change Optimal Learning of Bayes Net Structure O.Schulte, W. Luo and R. Greiner (2007). in Proceedings 20th Annual Conference on Learning Theory (COLT), LNCS 4539, pp. 187-202, San Diego, CA, June 12-15. Shorter conference version. Analyzes dependency data only, where no independencies are observed.
- Conference Presentation Slides.
Mind Change Efficient Learning. Wei Luo and Oliver Schulte (2006). Logic and Computation, 204:989-1011.
Our final paper on the relationship between topology and mind changes in the Gold learning model. It extends the previous conference paper with the same name (sorry!) in two ways: (1) we introduce a stronger, and mathematically simpler, notion of what it is for a learner to minimize mind changes at a given stage of learning, (2) we relate mind change complexity as measured in point-set topology to other notions of learning complexity from the literature (e.g.,"reducibility" of one learning problem to another).

There is a great follow-up to this paper by Matthew De Brecht and Akihiro Yamamoto from Kyoto University. They apply even more concepts from point-set topology than we did to gain insight into learnability.
Mind Change Efficient Learning. W. Luo and O. Schulte (2005). 18th Annual Conference On Learning Theory. Bertinoro, Italy, June 27-30. Springer LNAI 3559, pp.398-412, eds. P. Auer and R. Meir.
This paper continues my work on using the criterion of minimizing belief changes to determine optimal solutions to problems of inductive inference. We look at mind changes in the Gold learning paradigm, using a generalization of the idea of a mind change bound beyond finite mind change bounds to transfinite (ordinal) ones. First we characterize the learning problems that can be solved with a given ordinal mind change bound. It turns out that this corresponds exactly to Cantor's classic concept of accumulation order in a topological space. Second we characterize the mind change efficient inductive methods that achieve the best possible mind change bound given any data sequence. Basically, the mind change efficient learners are those that produces conjectures that maximize accumulation order. We show that mind change efficiency places strong constraints on inductive inference, and illustrate these constraints with examples such as identifying a linear subspace and identifying a one-variable pattern language (Angluin).
Review of Harman, G. and Kulkarni, S. Reliable Reasoning: Induction and Statistical Learning Theory. The MIT Press, London, England, 2007. O. Schulte (2008). Published in Biometrics, Sep 2008, Vol. 64(3), pp.992-993.
A review for a statistics journal by a computer scientist of a book written by a philosopher and an engineer - I couldn't resist. The book is basically an introduction to VC learning theory (known in CS as PAC theory, for "probably approximately correct"), with as little technicality as possible. Unfortunately, the word limit for Biometrics was very tight, so pretty much all I could do was a brief summary of the VC dimension and an outline of what Harmann and Kulkarni do with it to discuss problems of inductive inference.

Formal Epistemology and Philosophy of Science
Algorithmic Learning Theory is a fruitful approach to many long-standing problems in epistemology, and the philosophy of science and induction.
The Co-Discovery of Conservation Laws and Particle Families. O. Schulte (2008). Studies in History and Philosophy of Modern Physics, Vol 39/2, pp.288-314, doi:10.1016/j.shpsb.2007.11.003. The version typeset by the journal is here.
This is the expanded journal version of the previous paper on underdetermination of conservation laws in particle physics. The title alludes to a quote by David Lewis, that "laws and natural properties get discovered together". The main new result added is that particle reaction data alone are sufficient to determine the matter/antimatter division of the particle world (without a general model of particle ontology like the Standard Model). I also give more detailed comparisons between the algorithmic or learning-theoretic method for discovering conservation laws and how physicists have approached this problem in practice.
How Particle Physics Cut Nature At Its Joints. O. Schulte (2009). Proceedings of the 13th International Congress on Logic, Methodology and the Philosophy of Science (LMPS), Beijing, China, pages 267-286. Eds. Glymour, Wei, Westerstahl. College Publications. Invited Lecture.
The Congress takes place every 4 years; it began 1960. It is organized by the International Scientific Union for History and Philosophy of Science (IUHPS), which is an organization affiliated with UNESCO. This is a major interdisciplinary conference with a proud tradition; for instance, Nobel Laureate Harsanyi and computation theorist Buechi presented major results there. I felt honoured to receive an invitation to give a lecture.

The paper considers the challenge of discovering conservation laws in particle physics from a philosophical point of view rather than an algorithmic one. The main new issue is to single out a unique set of conservation laws from the infinitely many sets that make the same predictions. In previous work I showed that the criteria based on mind change efficient learning developed in previous papers select exactly the conservation laws that make the ssame predictions as those found in the Standard Model of Particle Physics (see "Algorithmic Derivation of Additive Selection Rules and Particle Families from Reaction Data" and "Inferring Conservation Principles in Particle Physics: A Case Study in the Problem of Induction" below). But there are many different sets of conservation laws that achieve this (as many as there are bases for a vector space). So the question arises: why pick one of the law sets over another? In philosophy of science, this is sometimes called "global underdetermination" and sometimes the problem of choosing between empirically equivalent theories, meaning theories that agree on all their predictions. (The standard example is Copernicus' theory of the solar system vs. Ptolemy - with suitable parameters, both can be made to give the same predictions.) The answer in this paper is that particle ontology selects a unique theory. I prove that if you want a set of conservation laws that clusters particles into disjoint families (e.g., Baryons, Muon, Tau, Electron Types), then there is at most one way of doing that. Which---you guessed it---is the one physicists have chosen. It was very surprising to me that the clustering criterion picks out a unique set of conservation laws; mathematically it is a pretty result in linear algebra. So the main thing I learned in this research is the importance of ontology or taxonomy in scientific research. Based on this insight, I make a philosophical proposal, namely to answer the question: What is a natural kind? with: A natural kind is a category that appears in the ontologically simplest empirically adequate set of generalizations.
Logically Reliable Inductive Inference. O. Schulte (2007), in Induction, Algorithmic Learning Theory, and Philosophy , pp. 157-178, eds. Michele Friend, Norma B. Goethe, Valentina S. Harizanov. Series: Logic, Epistemology, and the Unity of Science, vol. 9. Dordrecht: Springer, 2007.
Another introductory piece on formal learning theory. Provides a fair amount of comparison with other approaches to inductive inference like confirmation theory in philosophy and language learning theory in computer science.
Scientific Method. W.L. Harper and O. Schulte (2006). McMillan Encyclopaedia of Philosophy
Covers conceptions of scientific method from Galileo and Newton to contemporary statistics and epistemology. Explains how modern work on inferring Bayes Nets or graphical models of causation fits into traditional philosophical methodology going back at least to the Middle Ages.
Formal Learning Theory. O. Schulte (2002). The Stanford On-line Encyclopaedia of Philosophy, ed. E. Zalta.
The online encyclopaedia is a great idea! My contribution is an introductory entry on formal learning theory.
Inferring Conservation Principles in Particle Physics: A Case Study in the Problem of Induction. Oliver Schulte (2001). The British Journal for the Philosophy of Science, 51: 771-806.
Presents a learning-theoretic analysis of a problem that arises in particle physics: to infer conservation principles governing elementary particles from data about reactions among such particles. The criteria of reliable, fast and steady convergence to a correct theory (see "Means-Ends Epistemology" below) give strong constraints on what these inferences should be. In some cases they dictate that a particle theorist has to posit the existence of hidden particles.
What to Believe and What to Take Seriously: A reply to David Chart concerning the Riddle of Induction. Oliver Schulte (2000). The British Journal for the Philosophy of Science, 51:151-153.
In response to my article "Means-Ends Epistemology", David Chart constructed an example in which means-ends analysis sanctions the inference "all emeralds are grue" from a sample of green emeralds. I take this to show two points: (1) means-ends analysis does not depend on the predicates that an inquirer might have chosen to include in her language, but (2) means-ends analysis does depend on what hypotheses the inquirer takes seriously at the outset of inquiry. What hypotheses to take seriously is a different (though related) issue from what inferences to draw from given evidence.
Review of Martin and Osherson's "Elements of Scientific Inquiry". Oliver Schulte (2000). The British Journal for the Philosophy of Science , 51:347-352.
A review of Martin and Osherson's treatment of many topics in scientific inquiry and inductive inference from a learning-theoretic point of view.
The Logic of Reliable and Efficient Inquiry. Oliver Schulte (1999). The Journal of Philosophical Logic, 28:399-438 (compressed version only)
This paper pursues a thorough-going instrumentalist, or means-ends, approach to the theory of inductive inference. I consider three epistemic aims: convergence to a correct theory, fast convergence to a correct theory and steady convergence to a correct theory (avoiding retractions). For each of these, two questions arise: (1) What is the structure of inductive problems in which these aims are feasible? (2) When feasible, what are the inference methods that attain them? Formal learning theory provides the tools for a complete set of answers to these questions. As an illustration of the results, I apply means-ends analysis to various versions of Goodman's Riddle of Induction.
Means-Ends Epistemology. Oliver Schulte (1999). The British Journal for the Philosophy of Science, 50:1-31.
Develops the same ideas as "The Logic of Reliable and Efficient Inquiry", but in different directions, and at a much lower level of technicality. Features a means-ends vindications of (aversion of) Occam's Razor as well as the natural generalizations in a Goodmanian Riddle of Induction. I establish a hierarchy of means-ends notions of empirical success , and discuss a number of issues, results and applications of means-ends epistemology.
Learning Theory and the Philosophy of Science. Kevin Kelly, Oliver Schulte and Cory Juhl (1997). Philosophy of Science 64: 245-267.
Compares learning theory with other approaches to inductive inference, especially Bayesian confirmation theory. Has a section of examples that illustrate learning-theoretic analysis, from scientific revolutions to cognitive science. No symbols!
Reliable Belief Revision. Kevin Kelly, Oliver Schulte and Vincent Hendricks. Proceedings of the XII Joint International Congress for Logic, Methodology and the Philosophy of Science. Florence 1995. compressed version
When somebody proposes general norms for empirical inquiry, means-ends epistemology asks whether these norms help or hinder the goals of inquiry. In this paper we investigate whether axioms for "minimal-change belief revision" prevent agents from reliably finding the truth. In a nutshell, the answer is that these axioms don't help but they don't have to hurt either, provided we design our "belief revision operators" in the right way.
Topology as Epistemology. Oliver Schulte and Cory Juhl (1996). The Monist vol. 79:1, 141-147.
One interpretation of point-set topology is as the theory of inquiry for logically omniscient agents with no limitations on memory capacity (cf. Vickers' book "Topology via Logic"). This view turns topology into a powerful tool for epistemology. The paper is a friendly introduction to the connection between topology and epistemology. Features topological interpretations of Popper's falsifiability criterion.
Church's Thesis and Hume's Problem. Kevin Kelly and Oliver Schulte. Proceedings of the XII Joint International Congress for Logic, Methodology and the Philosophy of Science. Florence 1995.
Tradition has it that deductive logic and empirical inductive reasoning are two disparate forms of inquiry. However, modern theories of computation and of empirical inquiry reveal deep and fundamental analogies between the two. We describe some of the most important similarities. These analogies are yet another reason to study the empirical capacities of computationally bounded agents---hence we review the results from the paper below.
The Computable Testability of Uncomputable Theories. Kevin Kelly and Oliver Schulte (1995). Erkenntnis 43:29-66. This paper was reviewed in: The Journal of Symbolic Logic, 61: #3, p.1049.
If a computer cannot derive the predictions of a theory, is it possible for a computer to test the theory against empirical data? The surprising answer is: sometimes yes, even with infinitely uncomputable (hyperarithmetical) theories. The paper gives a complete map of the relationship between the inductive complexity of a theory---how hard it is to test the theory---and its deductive complexity---how hard it is to derive the predictions of the theory.

Belief Revision
My approach to learning is that optimal methods should be determined by clearly specified performance criteria. I wrote several papers trying to understand in what sense exactly belief revision principles lead to minimal belief changes.
How Do the Harper and Levi Identities Constrain Belief Change? O. Schulte (2006). Truth and Probability, Essays in Honour of Hugh LeBlanc, eds. Francois LePage and Bryson Brown, pp.123--137. College Publications, London.
One of the oldest ideas in belief revision theory is due to Isaac Levi: that changing your beliefs given some new information should proceed in two steps: first, give up old beliefs contradicted by the new information, second add the new information. Bill Harper proposed the reverse for belief contraction or weakening: to give up a belief p, consider what you would think if you learned that p was false. I show exactly what these two ideas entail, and under what conditions they can be used to make belief revision and belief contraction inverses of each other. - I saw Harper and Levi once at a conference at the same time, so I got to tell them about these results at the same time. Fun for me at least.
Minimal belief change, Pareto-optimality and logical consequence. Oliver Schulte (2002). Economic Theory 19 (1): 105-144.
A survey of some basic results about minimal belief revision. Covers topics such as Pareto-minimal theory change, the Levi and Harper Identities, base revision, conditionals and the Ramsey test, and the Grove representation theorem for the AGM postulates.
Minimal Belief Change and the Pareto Principle. Oliver Schulte (1999). Synthese,118:324-361.

What do changing beliefs, revising legal codes and updating knowledge bases have in common? The answer, according to many philosophers, logicians and computer scientists, is that they all ought to obey the principle of minimal belief change in the light of new information, change your beliefs or legal code or knowledgebase just enough to incorporate the new information but not more. Our paper "Reliable Belief Revision" below critically examines the effects of this principle on an agent's ability to reliably arrive at true beliefs.

In this paper, I ask what exactly a minimal belief change is. I apply the fundamental decision-theoretic principle of Pareto-Optimality to define a notion of minimal belief change, and then prove that a theory change is Pareto-minimal iff it satisfies certain axioms. It turns out that these axioms correspond exactly to three axioms for counterfactuals that are familiar from Lewis and Stalnaker's work. I also work out the consequences of Pareto-Optimality for minimal revision of axiomatic bases representing beliefs (rather than deductively closed theories). I compare Pareto-minimal belief change with the well-known AGM belief revision axioms.
Minimal Belief Change and Pareto-Optimality. Oliver Schulte (1999). Advanced Topics in Artificial Intelligence, ed. Norman Foo. Springer Lecture Notes in Computer Science, 144-155.

Game Theory
Game theory provides a mathematical theory of social interactions among multiple agents, which is an amazing thing. My own contributions cover a variety of topics: applications, decision-theoretic foundations, logical representation.
Evolutionary Equilibrium in Bayesian Routing Games: Specialization and Niche Formation. with P.Berenbrink (2010). Theoretical Computer Science, 411:1054-1074.
An extended journal version of our ESA paper below with the same title (sorry!). Analyses the selfish routing game, a popular model of user behavior on a computer network. This is basically a congestion game. Our paper combines evolutionary analysis with Bayesian games of incomplete information. A description of results appears with the conference version below. The journal version has examples and nicer pictures :-)

Erratum: in Figure 4 and on page 1060, replace the symbol p with \sigma.
Evolutionary Equilibrium in Bayesian Routing Games: Specialization and Niche Formation, with Petra Berenbrink (2007). 15th European Symposium on Algorithms (ALGO/ESA).
Most of my papers on game theory are about using logic to address foundational questions or solve computational problems. This paper is more in the standard mould of an economics game theory paper: take a game of interest and figure out its equilibria. The game of interest here is the parallel links network routing game popularized by Papadimitriou, which is a simple kind of congestion game. Our novel contribution is to consider evolutionarily stable equilibria in this game. In my view it makes a lot of sense to model a network with a large population as users, and in a population model evolutionary stability is a very natural concept to apply. The main result mathematically is that there is at most one evolutionarily stable Nash equilibrium. It was surprising to me that in this game evolutionary equilibrium has so much predictive power. We also give necessary and sufficient conditions concerning the distribution of tasks to resources that hold in the evolutionary equilibrium, which leads to some pretty pictures.
Representing von Neumann-Morgenstern Games in the Situation Calculus. O. Schulte and J. Delgrande (2004). Annals of Mathematics and Artificial Intelligence 42 (1-3): 73-101. (Special Issue on Multi-agent Systems and Computational Logic). A shorter version of this paper appeared in the 2002 AAAI Workshop on Decision and Game Theory.
Game Trees are a very general, highly-developed mathematical formalism for representing what an agent knows about its environment to plan goal-directed actions. The Situation Calculus is a logical, computational formalism for representing environmental dynamics. In this paper we show a strong relationship between the two: just about every game-theoretic model of an environment can be described in the situation calculus with a categorical set of axioms. This implies that all and only true features of the game-theoretic model are expressed by the situation calculus axioms. Some provisos are that the game tree must provide each agent with at most countably many actions, and that the agents' utility functions (goal specifications) must be continuous in the Baire topology. We give situation calculus definitions for game-theoretic concepts such as Nash equilibrium and backward induction.
Iterated Backward Inference: An Algorithm for Proper Rationalizability. O. Schulte (2003). Proceedings of TARK IX (Theoretical Aspects of Reasoning About Knowledge), Bloomington, Indiana, pp. 15--28. ACM, New York. Expanded version with full proofs.
An important approach to game theory is to examine the consequences of beliefs that rational agents may have about each other. This paper considers respect for public preferences. Asheim's epistemic characterization of respect for preferences interprets this condition as follows, roughly speaking: An agent explains unexpected behaviour by other agents as mistakes, and considers more serious mistakes less likely than less serious ones. "Properly rationalizable" outcomes of a multi-agent interaction are those that are consistent with common belief in respect for preferences. It is difficult to determine in a given game model what follows from common belief in respect for preferences. This paper proposes an algorithm for deriving consequences of that assumption called Iterated Backward Inference. Iterated Backward Inference is a procedure that generalizes standard backward induction reasoning for games of both perfect and imperfect information. Our main result shows that if respect for public preferences and perfect recall obtains, then players choose in accordance with Iterated Backward Inference.
The algorithm in this paper does not necessarily find all strategies consistent with proper rationalizability. This problem was recently solved by Andres Perea.
Knowledge and Planning in an Action-Based Multi-Agent Framework: A Case Study. B. Bart, J. Delgrande and O. Schulte (2001). Advances in Artificial Intelligence , eds. E. Stroulia and S. Matwin, Springer Lecture Notes in AI 2056, 121-130.
Together with Jim Delgrande, I showed that just about every game tree has a sound and complete (categorical) axiomatization in the situation calculus (see above). This shows that the situation calculus is sufficiently expressive to represent multi-agent interactions. We apply this insight to give a natural axiomatization of a variant of the "Clue" game in the situation calculus. The paper discusses a number of the issues that arise, such as the representation of what is common knowledge between various players and successor state axioms for the rules of the game.
Common Reasoning about Admissibility. Cristina Bicchieri and Oliver Schulte (1997). Erkenntnis 45:299-325. Also in: Probability, Dynamics and Causality, eds. Constantini, D. and Galavotti, M.C. Kluwer (1997).
I regularly use the decision-theoretic criterion of admissibility (avoiding weak dominance) to evaluate the performance of inductive methods. This paper applies admissibility to make recommendations not just for games of inquiry but for any game, including those modelling social phenomena of great interest. We define a natural procedure for eliminating weakly dominated strategies in both game trees and matrix games. We show that this procedure satisfies a commonly accepted wishlist of properties that we would like in our tools for analyzing games---one of the few solution concepts to do so. Technically speaking, iterated admissibility generalizes backward and forward induction reasoning and yields the same result in game trees as it does in the matrix games that represent them (proviso: the game tree must have perfect recall).

Publications

Learning Generative Models for Graph Data

Related material.

Related material.

Related material.

Related material

Related material

More material on this topic.

More material on this topic.

More material on this topic.

Sports Analytics and Reinforcement Learning

More material on this topic.

More material on this topic.

More material on this topic.

More material on this topic.

More material on this topic.

More material on this topic.

Learning Bayesian Networks and Causal Models

Automated Scientific Discovery

Learning Theory

More material on this topic.

Formal Epistemology and Philosophy of Science

Belief Revision

Game Theory