PublicationsMy current main interest is in machine learning for structured and multi-relational, and network data. I have recently worked on applying reinforcement learning to sports data, which have a very complex structure (players, teams, matches, actions, times, locations). My work on learning has followed two main principles. 1) Analyze the complexity of learning problems to establish which methods achieve the best possible performance for a given learning problem. 2) Get the most out of the information you have: Develop new learning methods to work with complex data , rather than simplify the data to work with existing methods. Examples of complex data that require new methods include complex networks, event logs, natural language, and images.
My past publications have addressed a number of topics in different areas, such as learning graphical models, statistical-relational learning, machine learning for particle physics, computational learning theory, belief revision, computation theory, inductive inference, philosophy of science, and game theory (logic, foundations, evolutionary analysis). Here is a graph of connections among research topics I have worked on. The notes on the papers describe these connections.
- Statistical-Relational Learning: Graphical Models for Multi-Relational Data
- Sports Analytics and Reinforcement Learning
- Learning Bayesian Networks
- Automated Scientific Discovery
- Learning Theory
- Formal Epistemology
- Belief Revision
- Game Theory
Learning Bayesian Networks for Multi-Relational DataLearning Bayesian network structure for multi-relational data (aka heterogeneous information networks) is a more challenging topic than I initially expected--both conceptually and computationally. Very important for many applications of machine learning and AI.
- Dynamic Gated Graph Neural Networks for Scene Graph Generation. Mahmoud Khademi and Oliver Schulte (2019). Asian Conference on Computer Vision, LNCS 11366, pp. 1–17, 2019. The link is to our own camera-ready version. Conference Slides. The official publisher's version is here.
One of the most challenging and important applications of relational learning is to extract relational information from unstructured data. Hoon and Domingos
have referred to this as a "killer app" for statistical-relational methods. In Natural Language Processing, building a network of relational facts that summarize what is in a text known as Information Extraction or Knowledge Graph Discovery. In computer vision, building a network of relational facts that summarize what is in an image is known as scene graph generation. The current paper presents an approach to building up a scene graph from an image. It is based on deep learning to extract objects and features and reinforcement learning to decide what network component to add next.
- An advantage of NLP is that a text contains discrete labels for entities (nodes). Thus in "Harry loves Ginny" we have "Harry" and "Ginny" to refer to entities, and "loves" to refer to their relationship. In vision, if we had a picture of Harry next to Ginny, an algorithm has to analyze the image to figure out that there are two people depicted, and ideally find their shapes. This is called "semantic segmentation" in vision and it's a hard problem. In this paper we use bounding boxes, so we get an idea of what objects there are and where they are located, but not their exact shape.
- Similarly a text contains a discrete label for relationships (e.g. "loves"), and you can more or less tell from English grammar that that is a relationship between the two entities. In contrast, if Harry is for instance hugging Ginny in an image, you would have to infer that relationship. This is close to a problem called "activity recognition", also hard. Recognizing something more complex like "Harry is looking at Ginny lovingly" may be beyond the reach of computers forever.
- An advantage of vision is that an object occurs just once in an image. So if you can find it, you don't have to link multiple occurrences. In NLP, you might have "Harry loves Ginny. She agreed to marry him." and you have to recognize that "Harry" and "him" refer to the same entity. This kind of entity linking is easy for humans but very difficult for computers. A counterpart in vision would be object tracking in video. But even there the occurrence of an object in different frames is subject to strong continuity constraints that you don't have in language.
- An advantage of vision for relationship detection is that the image comes with easily recognized spatial relationships ("on top of", "to the left of"). For example if you can manage to localize a human and a bike, you can compute easily that the human is on top of the bike, and then infer (with an appropriate model) that the human is riding the bike.
- Inference, Learning, and Population Size: Projectivity for SRL Models. Manfred Jaeger and Oliver Schulte (2018). StarAI Workshop @IJCAI. Best Paper Award.
A desirable property of a relational or network model is that adding new entities or nodes to a model about which we know nothing, does not change the predictions of the model for previously known entities or nodes. For example, if you read that Facebook has added another million users, this should not change your probabilistic inferences about your current friends. Technically, this means that given a large set of n nodes, the marginal predictions over a smaller set of m nodes should be the same as if you made predictions based on only the m nodes to begin with. In statistical process theory, this property is called projectivity. This paper asks whether common statistical-relational models are projective in this sense. Unfortunately the answer is generally no, unless the model has a restricted form, where links are independent given the attributes of the involved entities. This conditional independence property is weaker than assuming that entities or nodes are independent (as in i.i.d. data), but enough to ensure projectivity.
This work as inspired by the great paper of Shalizi and Rinaldo on projectivity in exponential random graph models. They consider homogeneous graphs with no attributes only, we consider models for heterogeneous graphs as well.
Personal view: I'm starting to think that the only way to get projective network models is build a conditional independence type model. These are in fact the currently most popular ones, well-known variants include matrix factorization, exchangeabilty, blocks models, embedding models in deep learning. I don't have a theorem to that effect, but I've been trying hard to get projectivity, beyond what is in this paper. Also in our paper on relational dependency networks we showed that what to my mind is a natural log-linear definition of conditional probabilities in a relational model, cannot be made consistent with a single distribution over possible worlds (networks).
- Tutorial for learning Bayesian networks for complex relational data. Chapters 3 and 4 discuss parameter and structure learning. Presented at AAAI, ECML, Canadian AI.
Locally Consistent Bayesian Network Scores for Multi-Relational Data. Oliver Schulte and Sajjad Gholami (2017). Proceedings IJCAI 2017, pp.2693-2700.
Sums up our theoretical work on scoring Bayesian networks for relational data. To my mind the pinnacle of model selection theory for Bayesian networks is the work by Chickering and Meek on identifying an optimal Bayes net structure. In this paper we show how to generalize any decomposable model selection score for Bayesian network so that it applies to relational data, preserving Chickering and Meek's local consistency properties: as the number of observations increases, 1) edges necessary for representing the data generating distribution will be added, and 2) unnecessary edges will be deleted. Part of the reason why it took me a long time to get this result is that it is not possible to achieve it in the traditional model selection setting, where one defines a score for a single model. Instead for multi-relational data, you need to use a gain function that measures how much one model B1 improves over another B2 . Our multi-relational gain functions cannot be represented as the difference of two single-model scores (score(B1)-score(B2)).
Connections. This extends my paper on a tractable pseudo-likelihood function for Bayesian networks. That paper showed how to generalize the standard i.ie. likelihood score for multi-relational data. The new paper shows how to add a model complexity penalty term to the likelihood score.
- Model Selection Scores for Multi-Relational Bayesian Networks, Sajjad Gholami, Oliver Schulte, Vidhi Jain (2017). Invited Paper at the DeLBP workshop. A short version focusing on likelihood-based scores (e.g. AIC, BIC).
Consistent Bayesian Network Scores for Multi-Relational Data. Oliver Schulte and Sajjad Gholami (2016). Presented at the StarAI Workshop, IJCAI 2016. Has more ablation studies (e.g. evaluating scores without any scaling).
Learning Bayes Nets for Relational Data with Link Uncertainty. Zhensong Qian and Oliver Schulte (2013). Proceedings IJCAI-GKR Workshop on Graph Structures for Knowledge Representation and Reasoning. Springer Lecture Notes in Artificial Intelligence, Vol. 8323, pp. 123-137.
A shorter version appeared in the IJCAI-AIBD Workshop on AI and Big Data.
Learning Graphical Models for Relational Data via Lattice Search. Oliver Schulte and Hassan Khosravi (2012). Machine Learning, 88:3, pp.331-368.
Final Version before typesetting
Our main Learn-and-Join algorithm for on structure learning for Bayes nets with relational data. One of the key problems in relational learning is that correlations among entities and their attributes depend on the kind of links, or chains of links, that exist between the attributes (what Jiawei Han has called a "metapath"). But there are many potentially relevant link chains, so somehow we need to efficiently search them and combine the results in a single model. Our proposal is to use the fact that chains have a lattice structure (subchains are like subsets). We do a level-wise search through this lattice, starting with short chains and increasing the chain length one step at a time (in the spirit of the Apriori algorithm). Correlations or adjacencies found on shorter chains are propagated to Bayes net learning on superchains. The general intuition is that a dependency should be evaluated in the most specific possible context. There are many details to take care of, so the paper is pretty long. But half of it presents simulation results, so if you just want to know how the algorithm works, there is not so much to read. For parameter learning, we convert the Bayes net to a Markov net, then apply Markov Logic Network parameter learning and inference. This turns out to be an effective way to do inference with Bayes nets in relational data, and an effective way to learn Markov models. A 2-for-1 deal: two types of graphical models with one algorithm.Connections. The structure search algorithm of Friedmann et al. uses a similar hierarchical strategy. The main difference is that they do not propagate edges as constraints, but reconsider each edge using the results of learning for shorter chains as a starting point.
- Structure Learning for Markov Logic Networks with Many Descriptive Attributes. Hassan Khosravi, Oliver Schulte, Tong Man, Xiaoyuan Xu, Bahareh Bina (2010). Proceedings of the Twenty-Fourth Conference on Artificial Intelligence (AAAI), pp.487-493. The conference version of the Learn-and-Join algorithm.
- Conference Presentation (Powerpoint).
FACTORBASE: multi-relational structure learning with SQL all the way. Zhensong Qian and Oliver Schulte (2018). International Journal of Data Science and Analytics, Springer, 2018, 7(4), 1-21. . The link is to the final revision typeset by us. The official publication link is here.
My first systems/software engineering paper. I have been working on machine learning for multi-relational data for years now. Programming machine learning for relational data is hard! After years of working hard to develop code, I think I have found a framework that makes implementing multi-relational learning feasible. In fact, with our new approach it's not only feasible, it's fun! Well, if you're inclined to think that programming is fun.
The key idea is to use a design philosophy that has come out of the database community: Use a database system to store data and models. This is thinking outside the box because most people think of a database system as something that manages data, not something that manages models. One obvious advantage is that the DB system handles big models as easily as it handles big data: no more worries about how to store 10M parameter values in main memory, for instance. But even more important for statistical-relational learning is the fact that relational algebra provides a high-level language for managing structured objects. This is because structured objects make programming for relational learning difficult and error-prone:
- The random variables have internal structure, because they are defined in first-order logic.
- Graphical models are common, so we have structured models with structured components.
- Tables of parameter values and tables of sufficient statistics need to be cross-referenced with metadata about structured random variables.
- Many structure learning algorithms, like our learn-and-join algorithm and gradient boosting, need to manage sets/ensembles of models.
If you store all these objects as tables in the database, you can use relational algebra in the form of SQL scripts to construct, manage, and cross-reference complex big structured statistical objects. For example, we have written a five-line SQL query that computes the predictions of a log-linear model on a million test cases in a very efficient and reliable way. Another great thing about SQL is that it is standardized. So our system works out of the box as long as your data are stored in relational database system. And I'm hoping that different research groups will share and reuse SQL scripts. Kimmig, Mihalkova and Getoor have written a great survey on statistical-relational learning. Our FactorBase system implements pretty much all the operations they describe using MySQL scripts.The first paper that I know of about implementing a graphical model in a relational database (for i.i.d. data) is by Wong et al. The general approach is often associated with the BayesStore system due to Wang et al.
More material on this topic.
FactorBase: Multi-Relational Model Learning with SQL All The Way. Zhensong Qian and Oliver Schulte (2015). Proceedings Data Science and Advanced Analytics (DSAA 2015), pp. 1--10.
- Conference presentation at DSAA 2015. 15 minute lecture. PDF format.
- 4-slide spotlight presentation at the UAI-StarAI 2015 Workshop.
- UAI-StarAI 2015 Workshop Poster. Perhaps the best way to get a quick idea of the details.
- 6-page version at the 2015 Learning Systems NIPS Workshop. Assumes less background in statistical-relational learning than the StarAI paper, but more background in machine learning than the DSAA paper. The Poster.
- Preprint. This is an expanded version. It adds information about how we query the system catalog to define a set of random variables. For single-table data, you basically can assume that each column defines a random variable. For relational data, this is more complicated. (Walker et al. call this the "setup problem".) Most systems fall back on requiring the user to define random variables in a system-specific language. We show that you can provide a good default setup by querying the DB system catalog, push-button with no user input. Singh and Graepel take a similar approach for latent variable models.
This paper presents a novel application of Markov Logic Networks to outlier detection. The key idea is that the formulas in a Markov Logic Network represent features that can be used as input to an outlier detection method. For example in soccer data, we may have a formula like DribbleEfficiency(Team,Match) = high, PassEfficiency(Team,Match) = hi. This represents an interaction between the dribble efficiency and the pass efficiency of teams. Counting how many times this formula is instantiated for a particular team amounts means to counting how many times the team achieves high dribble efficiency and high pass efficiency. In other words, the count represents an interaction between dribble and pass efficiency.
We can illustrate how this works for Premier League data. We apply our Markov Learning Network method to learn 331 conjunctive features. We make a data matrix with 331 columns and one row for each team. A cell contains the number of times the feature is instantiated for each team. (We also tried the proportion of instantiations and TF-IDF features.) Then this data matrix is treated as a standard matrix for i.i.d. data (pseudo-iid data view), and given to a single-table outlier detection method (cite implementation).
What I like about this paper is that outlier detection is a novel application of Markov Logic Networks. It is also a novel application of the propositionalization strategy. Propositionalization refers to preprocessing a relational database to “flatten” it into a single table. I also like the term “relation elimination” because the preprocessing eliminates all relationships and leaves only features of individual entities. Relation elimination has been extensively researched for classification, but I believe that we are the first to use this strategy for outlier detection.
Fast Learning of Relational Dependency Networks. Oliver Schulte, Zhensong Qian, Arthur E. Kirkpatrick, Xiaoqian Yin and Yan Sun (2016). Machine Learning, pp.1-30.
This paper pulls together threads on a difficult topic that I've been thinking about for years: how to derive predictions for single target instances from general relational frequencies. The core of the paper is a log-linear formula to use a first-order Bayesian network for classification. For instance, suppose you wanted to predict the genre of a movie based on (i) its country of origin, (ii) the genders of its actors, (iii) the ratings of its users and their genders. Our log-linear formula combines all this different information in a way that properly generalizes the log-linear Bayesian network classification formula for the case of i.i.d. data. The key property that we wanted is that the formula should work well if you use the empirical observed frequencies as Bayesian network parameter estimates (e.g. the percentage of movie-actor pairs where the movie is a drama and the actor is a woman). In order for this to be the case, you need to use proportions, not counts as feature functions in the log-linear model. Ravkic, Ramon and Davis also use frequencies not counts.
Once you have a classifier formula that defines a conditional probability for each target node given values for all the other nodes, you can put the different conditional probabilities together in a dependency network to perform joint inference to predict the values of multiple nodes. As explained by Neville and Jensen, the big advantage of dependency networks over Bayesian networks is that they can represent the cyclic dependencies that are so typical of relational data (if you and I are friends, my affection for Harry Potter predicts yours, and vice versa). In previous work, we developed the learn-and-join algorithm that efficiently learns a Bayesian network that represents relational frequencies. (See our paper " Modelling Relational Statistics With Bayes Nets" below). For inference about single target instances---like the genre of "Fargo"--we convert the Bayes Net template model to a relational dependency network to perform prediction at the instance level. Since we use observed frequencies as parameter estimates, that can be computed directly from the data, parameter estimation is very fast. And since the learn-and-join algorithm (see below) provides fast structure learning, the whole package finds a relational dependency network in a scalable way. On the order of 5 min for most databases. The largest is an instance of the IMDb movie database with over 1M ratings that takes about 40 min for structure learning. We're working to increase scalability even more---stay tuned!
What I like best about this approach: That we can work with large datasets. And that the parameter values can be interpreted, since they represent observed frequencies of events in the data. This means that the reasoning behind the system's conclusions can be explained in terms of intuitively meaningful weights for intuitively relevant factors.
The problems addressed in this paper concern the most important differences between learning from relational data and learning from independent i.i.d. data points. It's taken me a long time to get clear on these differences. In case you want to know more about the general issues, here's an explanation of what I've learned.
More material on this topic.
- Fast Learning of Relational Dependency Networks. Oliver Schulte, Zhensong Qian, Arthur Kirkpatrick (2014). Conference on Inductive Logic Programming (ILP 2014). The conference predecessor. Shorter version. Compared to the journal paper, this is missing the characterization of when the resulting dependency networks are consistent.
- Presentation for the ILP Conference.
- Random Regression for Bayes Nets Applied to Relational Data. Oliver Schulte, Hassan Khosravi, Tianxiang Gao and Yuke Zhu (2012). UAI-StarAI Workshop on Statistical-Relational AI. Short Workshop paper. The main new material here are empirical results on using counts instead of proportions in the log-linear conditional probabilities models for the dependency network parameters. This is the official recommendation of the Alchemy system
for how to convert first-order/parametrized Bayesian networks to Markov Logic networks. But counts as feature functions do much worse than proportions.
The paper also emphasizes the random regression interpretation of the proportion model: it is equivalent to the expected value of a log-conditional probability given a randomly selected instance of the families in the Markov blanket.
- Poster Presentation for the SRL Workshop
Model-based Outlier Detection for Object-Relational Data. Sarah Riahi and Oliver Schulte (2015). IEEE Symposium Series on Computational Intelligence, CI and Data Management Track (IEEE SSCI 2015), pp. 1590-1598.
Best Student Paper Award. Thank you to SSCI!
Unsupervised outlier detection is a major application area for generative models. I also think it is an important problem from an AI point of view, because it relates to the tension between wanting rules that are as general as possible and as specific as necessary. (Remember Tweety the bird who is also a penguin?). To apply model learning to outlier detection, the idea is to build a model of normal behaviour. Then given a potential outlier datapoint, you apply the model to compute how likely the features associated with the data point are. If the likelihood is low according to the normal population model, you have an outlier.
The difficulty is that for relational data, the low-likelihood approach does not work just as is, because for a potential outlier object, there is not a single feature vector. Instead there is a relational substructure that we refer to as the object database . For instance, if we want to know whether a soccer player is unusual, we want to look at a set of feature vectors, one for each match. One approach is to score the potential outlier object's database using the random selection likelihood function I have developed in previous work. This works quite well and is a nice application of the random selection likelihood I think. But looking closely at synthetic and real-world datasets (Movies, Sports), Sarah Riahi and I found we could improve on this baseline method.
The basic idea is to look at the likelihood ratio (KL divergence) between two parameter settings: parameters learned from the entire population reflecting general normal behaviour, and parameters learned for the individual object. Given a Bayesian network representation of relational frequency, the likelihood ratio amounts to comparing correlations that hold in the general population to correlations that hold for a specific object. This idea of looking for unusual correlations, often expressed as association rules, is quite common in rule-based approaches and in subgroup discovery. It falls out nicely from the Bayesian network formalism. Finally, we realized we could improve even more by applying a mutual-information decomposition to the normal and individual parameters before computing the likelihood ratio. This means that we compare two different quantities to score outliers: the difference in individual feature distributions, considered in isolation, and the difference in the strength of correlations among features ("the lift"). This may sound complicated but the results of the decomposition are in fact easy to interpret. We show that by going through rankings of soccer players and movies as unusual.Connections. We can think of a relational outlier as being like a subgroup of size 1. This means we can transfer ideas from subgroup discovery. What we arrive at is basically a relational version of Arno Knobbe's Exceptional Model Mining framework. (I develop this perspective in my tutorial.) Also, Peter Flach's group in Bristol have been researching a related issue, how to measure the difference in class distributions in a subgroup. It looks like they've independently arrived at a similar formula, using expected log-distances to avoid the cancelling out. I was happy to see independent confirmation of this idea.
More material on this topic.
- SSCI Conference Presentation (CIDM track).
A Proposal for Statistical Outlier Detection in Relational Structures. Fatemeh Riahi, Oliver Schulte, Qing Li (2014), AAAI-StarAI Workshop.
A predecessor paper at a workshop. The difference is that instead of comparing two parameter settings (general and individual) for a fixed Bayesian network structure, here we compare the likelihood ratios of two different Bayesian structures: one learned for the general population and one for the individual. I still think this is a good idea, but it is also more complicated than fixing the structure and comparing only parameter values. So for the conference paper, we went with the simpler version.
An interesting result in this paper is that there is a strong correlation between how anomalous an individual is, according to this, measure, and how successful the object is. For example, the correlation between the standing of a UK Premier League team and its anomaly measure is 0.61. Considering how many factors go into the final standing, it's remarkable to me that any single metric should be so strongly related to this outcome. We show similar results for players and movie ratings. We definitely plan to follow up on this.
Another attempt to understand discriminative learning for relational databases. We're getting closer to a solution, see our paper on learning Relational Dependency Networks above. This paper investigates traditional methods, with fairly extensive simulations. The big picture is that relational learning needs to combine information from different related entities. For example, research has shown that you can predict the gender of a Twitter user quite well from his or her tweets. Unigram features work well for a variety of languages. This means that basically you can just look at how even a user tweets certain words. You can build a quite accurate classifier that looks at just a single tweet to predict gender. Suppose you have not just a single tweet for a user, but 100. Presumably you can predict his or her gender even better, but how? There are two basic strategies: 1) you can aggregate the features of different tweets into a single feature. For instance, if the word "foo" is used more often by women, you could compute the average frequency of "foo" over all 100 documents. In this way you end up with a single feature vector summarizing statistics for all 100 documents. 2) You could make a prediction for each tweet separately, then aggregate the predictions. For instance, you could estimate a probability that the user is a woman based on each tweet, then average the resulting probabilities. We call that aggregating scores (usually called a "combining rule" in previous work). Both approaches have been developed extensively in previous work on relational learning. What we haven't seen previously is a comparison of which works better. So in this paper we try out a pretty large set of aggregates for both features and scores and compare. The details depends on how you aggregate and what base classifier you use (e.g., logistic regression, SVM). Still, the overall message is that using a standard classifier with a set of feature aggregation functions works best. The main reason is that in that case, standard classifiers will find good aggregate features for you. In contrast, it's not easy to learn good score aggregation rules, and there isn't a universally best one for every dataset.
For me a striking thing about this work is that my preconceptions were overturned by simulations. In these situations computer science feels more like an empirical than a mathematical discipline. Jensen and Neville have done great work on statistical issues (type I/type II error) that arise when features are aggregated for relational classification. An intuition about this is that aggregating increases your feature space while at the same time decreasing the number of effective data points. Statistically, this sounds like a nonstarter. For instance, if you aggregate a user's tweets into a single number using, say, average, median, max, min, you have four new features but only a single data point rather than n data points for all of her or his tweets. Jiawei Han has published a paper on Naive Bayes Classifiers for relational data where he points to this loss of information. Nonetheless, score aggregation has its problems too. We provide a fairly extensive conceptual discussion of these statistical issues in the paper if you are interested.
One aspect of this paper that I like is that almost everything is done within the relational database system using SQL (relational algebra). We store contingency tables containing sufficient statistics in the database. We use SQL queries to build initial contingency tables from the data tables, and to build bigger contingency tables from smaller ones. The moral is that relational databases systems are good not only for managing big data, but also for managing big models!
A precursor to our CIKM paper about computing multi-relational sufficient statistics. A new dynamic programming algorithm for computing the number of instantiations that satisfy a first-order clause. We need this for parameter estimation in our relational Bayes nets. Yes, the problem is #P-complete, but read on!
This paper concerns the problem of modelling relational statistics, to answer queries like "what is the percentage of intelligent students in upper-level classes" or "what percentage of women give high ratings to action movies"? We refer to such queries as class-level queries, since they are about classes of individuals/links, rather than specific individuals. Class-level queries show generic correlations in a database, and can be applied in strategic planning and query optimization. This paper shows how Poole's Parametrized Bayes nets can be used for class-level modelling.The conceptual contribution is to specify a new semantics for a Bayes net so that it is clear in which sense a Bayes net represent class-level frequencies. As Lise Getoor pointed out in her dissertation, the standard grounding semantics is not sufficient because it translates a 1st-order model into a statistical model of individuals, their attributes and relationships---the classes disappear. We propose using Halpern's random selection semantics instead. This works for any model based on 1st-order logic, for example also for Markov Logic Networks.
The computational contribution is an algorithm for computing sufficient statistics in a database when negated relations are involved. A social network example would be "find the number of users who are both from Germany and who are not friends". It quickly becomes infeasible to do this by enumerating all user pairs who are not friends. It turns out that we can compute sufficient statistics that involve negated relations, in terms of counts that involve only positive relations. We show in this paper that the classic fast Moebius transform algorithm can be applied to transform positive event counts into negated event counts in a highly efficient manner.
Shorter Version from ICML-SRL Workshop on Statistical-Relational Learning. Workshop Presentation.
Note: The ILP organizers ask for the final camera-ready version after the feedback at the conference has been given, so while the paper has been accepted for inclusion in the proceedings, what is posted not is not the final form. The SRL workshop paper is in the final stable form. These are shorter versions of our Machine Learning journal paper above.All the reviews from the SRL workshop were good, but on this paper we received an usually knowledgeable review. This reviewer not only knew of the Moebius transform, but also knew that the version we are using is sometimes called the zeta transform. We appreciated this review very much.
IJCAI-HINA Workshop version (one page longer). Also presented at AAAI-SRL Workshop.
A special case of statistical importance are team-player problems, where the goal is to measure the importance of a player's contributions. We apply the statistical importance measure to English Premier League soccer data.
We present a new model of friendship strength in social networks. The input data are transaction logs between friends, and friendship ratings, indicating how close a friendship is. We had access to an anonymized version of the Cloob social network that provides friendship ratings. Our model assigns two latent factors to each user, one that describes their tendency to rate others, and another that describes their tendency to be rated. Following work on social recommendation by Ma et al., as well as Jamali and Ester, we assume a recursive dependency between latent factors: a user's latent factors are linear combinations of her friend's latent factors, where the number of transactions between them is the weight given to a friend's latent factors. In this way the transaction matrix can be used to help factorize the ratings matrix, without being factorized itself. It's hard to know whether there really is a recursive dependency between unobserved variables. Empirically, previous work showed that the recursive dependency model performs well for social recommendation and trust propagation. Our results in this paper show good predictive performance for friendship strength prediction, on Cloob and a synthetic dataset.
We showed in the paper on decision forests below that an independence assumption can lead to fast accurate multi-relational classification. In this paper we advance our theoretical understanding of multi-relational classification by constructing a hierarchy of different independence assumptions, from weaker to stronger assumptions. We mathematically derive multi-relational classification formulas for each level in the hierarchy; all these formulas are log-linear models. The weakest assumption is that different links are independent of each other given the descriptive attributes of the linked entities. This is a new principle. The other hierarchy levels correspond to previous classification formulas proposed by other researchers; this means that the conditional link independence assumption is implicit in previous formulas. While the link independence assumption leads to the best classification performance, the advantage is small. A bigger improvement is achieved by using logistic regression to learn weights for different data tables, as in our paper on multi-relational decision forests below.
This is my first paper on discriminative learning (classification) for relational data, as opposed to learning generative models. The special approach taken here is to begin with an independence assumption, roughly that information from different rows is independent given the class label. This can be thought of as a kind of naive Bayes assumption for relational data tables. (Manjunath, Murty and Sitaram from HP Research independently developed a multi-relational classifier based on a similar assumption.) Having defined this assumption formally, we can derive a multi-relational classification model from it. Basically what you get is this: apply a standard single-table probabilistic classifier separately to individual data tables, then use a log-linear formula to combine the different probabilistic assumptions. A big selling point for this approach is scalability: the independence assumption allows us to learn a model for different tables independently, which is very fast.
Classifiers built on the independence assumption work quite well, but there is a problem in that information from a data table may be redundant given information from others, which hurts classification accuracy. My student Bahareh Bina came up with an elegant approach to pruning/weighting data tables: take the aggregate predictions from different tables as regressors in a logistic regression model. Then standard logistic regression learns weights for different tables. Redundant tables receive weight 0, and the nonzero weights provide a ranking for which relational pathways carry the most information.
Erratum: The typesetters messed up the reference numbering at the very last stage. If you're trying to track down a reference, hopefully you can find it using the title or author mentioned in the text. Otherwise send me an email.
The workshop organizers solicited papers that pose a challenge for the SRL community. I thought this was a great idea. I had a challenge ready: A problem that I have been thinking about for several years and not been able to solve. The problem is this: In his classic paper on logic and probability, Joe Halpern proposed a basic principle that in the absence of any further information, our probabilistic beliefs about an individual should track the frequencies in the population that the individual comes from. Thus if the only thing we know about Tweety is that Tweety is a bird, then the probability that Tweety flies should be the proportion of fliers in the bird population. Halpern proves that this is equivalent to the relevant instance of Miller's principle. James Cussen pointed out to me that it can also be seen as an instance of David Lewis' famous Principal Principle. In the paper I give four different arguments supporting the principle.
The challenge is this: As reasonable as it seems, it's not easy to get an SRL system to satisfy the principle. What you need is for marginal inferences about ground atoms to track population frequencies. In the paper I give some example of how this can be worked out for various type of Bayes net structures in special circumstances. What I'd really like is a general result of the form "as long as the database is representative of population frequencies, learning a stochastic logic program will produce an model that satisfies Halpern's principle". Or similar results for Bayes nets, stochastic logic programs, etc. The only model class for which I can give a general argument are latent variable models, like those obtained by matrix factorization.
Update: I since realized that Markov logic network parameter learning satisfies Halpern's principle, with maximum likelihood as the objective function. This is because in a log-linear model, the maximum likelihood estimates have the property that the expected feature instantiation counts are the observed instantiation counts.
Update: Chris Re gave a great invited talk at the StarAI 2015 workshop. He said that he makes everybody who wants to work on DeepDive take an oath that they will validate their system by calibrating marginals against the data. His reason is that marginal probabilities are relatively easy for users to check and give them confidence in the system. That's yet another argument for why SRL models should give marginal inferences that track frequencies in the data.
Extended Abstract appears in Proceedings of the Conference on Inductive Logic Programming (ILP 2011), LNAI 7207, pp.39-44.
One of the biggest problems in using Bayes nets with relational data is the fact that the data feature cyclic dependencies (see papers below). One observation we prove in this paper is that this happens exactly when you have self-relationships where an entity type is related to other members of the same type (e.g., users are friends with each other in a social network). I proposed a pseudo-likelihood measure that allows us to learn Bayes nets efficiently even in this case (see SIAM SDM paper 2011). This paper presents a set of experiments that focus on learning with the pseudo-likelihood on databases with cyclic dependencies.
There is a nice theoretical contribution as well. Relational Bayes nets can represent recursive dependencies elegantly using "copies" of variables. Thus we can have a rule like "Smokes(X), Friend(X,Y) -> Smokes(Y)" to say that the smoking of a person X is associated with the smoking of their friends. The problem for structure learning now is that we don't want to treat all the copies of Smokes equally, because that would lead to a duplication of the same statistical patterns. Our proposal is to designate one of the copies as the main copy (e.g., "Smokes(Y)"). The rule is then that only the main copy can have parents. So a copy like "Smokes(X)" is just an auxilliary random variable for modelling the dependence of smoking on itself. We prove a theorem to the effect that restricting Bayes nets to this format involves no loss of generality. In a way this is a detail but it's an important one, and without the theorem it's not easy to see how to address it.
Extended abstract appears in the Proceedings of the Conference on Inductive Logic Programming (ILP 2011), LNAI 7207, pp.21-26.
One of the ideas we have been working out is to combine learning for directed relational models with inference in undirected relational models. This addresses the problem of how to do inference when there are cyclic dependencies in the relational data (see papers below). One problem we noticed is that if you just convert a Bayes net into a Markov Logic Network, you get many logical clauses (rules). The reason is that tables of conditional probabilities contain more rows than necessary because the tabular representation does not exploit local independencies for a given child-parent configuration. In this paper we show that if you use decision trees instead of conditional probability tables in the Bayes net, you get a much more compact set of rules. This makes the learned set of rules more readable, improves weight learning for rules, and ultimately increases predictive accuracy.
The most practically useful theoretical paper I have written. It considers the question of how we can quantify how well a Bayes net model fits the information in a relational database. In fact, the proposal works for any statistical-relational model that uses 1st-order logical variable. This is a problem because in relational data, units are not independent of each other, so we cannot apply the model to each unit separately and multiply the results. My proposal is to define the model fit by considering how well the model explains, on average, the properties of a random instantiation of the 1st-order variables. For instance in a social network, we can imagine randomly sampling pairs of actors, and applying the Bayes net to compute a probability for that pair. The average log-probability over all pairs is my proposed pseudo-likelihood measure.
The paper establishes various desirable theoretical properties of the pseudo-likelihood measure, gives an efficient closed form for computing it, and describes efficient algorithms for Bayes net parameter and structure learning that maximize the measure. Warning: the reviewers said the paper was "a little dense", so you may want to look at my powerpoint presentation first.
Connections. Xiang and Neville independently introduced the same normalization idea for Markov Logic Networks.
Erratum In Table 1, the column labelled C(X) should be omitted.
Consider the following algorithmic problem: Given a set of points X, and a consistency constraint on subsets of X, find a minimum collection of consistent subsets that covers all points in X. We show that a number of data summarization problems in data mining can be viewed as instances of this problem (e.g., finding a minimum rule set for classifying, frequent pattern search, converse k-clustering). Many of these problems satisfy a monotonicity constraint: all subsets of a consistent set are also consistent. For subset cover problems with the monotonicity constraint, the paper describes new heuristic algorithms that take advantage of the constraint.
This is our first description of our research results on learning Bayes nets for relational data. The main idea is to learn Bayes nets for the class-level distribution over attributes and relationships, as defined in Halpern and Bacchus' classic AI work from the 90s. Shorter versions are available as follows.
- 3 pages: Bayes Nets for combining logical and probabilistic structure . Schulte, O., Khosravi H., Bina, B. (2009). STRUCK workshop at IJCAI-09.
- 6 pages: Join Bayes Nets: A New Type of Bayes net for Relational Data. Schulte, O., Khosravi H., Bina, B., F.Moser (2009). GKR workshop at IJCAI-09. More on structure learning for statistical-relational models.
- A technical report, SFU School of Computing Science Technical Report 2008-17. Pretty much superseded by the preprint.
I'm starting to work on various issues that arise when we combine probabilistic concepts with a logic-based formalism, like relational databases. This paper defines the concept of support and confidence for a large class of association rules, based on safe queries in the domain relational calculus (a logical query language equivalent to SQL). The key problem is this: given a logical formula with free variables and a database instance (finite model), what is the percentage of free variables that satisfy the formula? For example, consider the formula Customer(X) AND Student(X). It seems this should be neither the percentage of students who are customers, nor the percentage of customers who are students. Our proposal is to define the frequency as the ratio of (#people who are both students and customers)/(#people who are students or customers). We extend this basic idea to general safe queries and prove various metatheorems about it, such as the a priori property.
The key differences with other relational rule approaches (e.g., WARMR) are this: 1) we do not assume that the user specifies a single target or base table; instead, we consider combinations of base tables (Student union Customer). 2) The base tables are defined dynamically and depend on the particular query formula, so there is no fixed table with respect to which the frequency of all formulas is defined. 3) Our rule language is much richer than conjunctions of atoms; it allows nested quantification and negation.
Sports Analytics and Reinforcement LearningSports data are a rich application area for many of the topics I'm interested in: machine learning for structured data: especially spatio-temporal processes, heterogeneous networks: lots of different agents with different types of relationships, and game theory: decision-making in multi-agent interactions. My general vision is to apply reinforcement learning to sports data. This is a rich well-developed area of machine learning that fits sports very well. In many papers I use reinforcement learning to learn a value function for a sport (hockey so far) and then apply the value function to answer questions about the sport, like evaluating the performance of players.
This is a complex domain with millions of data points, so the neural net is hard to train. We could not fit all the details into the paper. One small thing I'm proud is a novel initialization, where we train the neural net first to predict the average Q-value for the average input vector. This helped a lot. It's a simple idea but I have not seen this elsewhere.
One of my favourite aspects is a novel solution to the excess-zeros problem: having a large number of data records where your target feature for prediction is 0. For example, only about half of the players with an NHL contract ever actually play an NHL game. Our target is to predict the total number of NHL games for each player, so we have many 0s. Our suggestion is to adopt a classification approach instead: predict whether a player will play any games, then use the confidence that they will play >0 games to rank them.
Probably the most succinct and accessible write-up the Markov game approach to evaluating first actions, then players. we've done. What is new compared to the journal paper below, is that we cluster hockey players into types, then rank them within each cluster. Generally ranking players makes most sense for comparing those with similar roles.
Reality Check. I like tracking how players we ranked highly do in the future. Our ranking identifies Erik Karlsson as the top player. And he's been exceptional in the 2017 playoffs! We mention Jack Eichel as a rising star with relatively low salary, based on the 2015-2016 NHL season. He's having a killer 2016-2017 season.Connections. That ranking and clustering should go together is the main insight behind the RankClus system from Jiawei Han's group for ranking individuals in relational data. Erratum. When we made the player activity heatmaps, we inadvertently used the wrong sign for the y-coordinates, which means that we switched the left and right side of the rink (from the perspective of the attacking team). Here is a corrected version. Thank you to everyone on Twitter for pointing this out---peer review has helped us!
More material on this topic.
- Sloan Poster.
- SFU News published an article about the Karlsson connection.
- Breakdown of Clusters: Head counts and Heatmaps for each learned cluster.
Erratum . The numbers for the states and the actions given in the introduction are inconsistent with the numbers in the later detail sections. That's because our earlier version had the state include the last action. The referees suggested changing the definitions to simplify them, but then we did not update the numbers in the introduction, only later in the text. Thank you to Niel Rustenburg for pointing this out.
This paper is an instance of the problem of identifying unusual individuals (outlier detection, exception mining) in structured data. In this case we are talking about referees in the National Hockey League. The twist is that, we are interested not so much in whether a particular referee is different from a typical referee, but whether a particular referee is different from the ideal referee. The ideal referee for us is one who on average gives as many penalties to the home team as to the away team. The tricky part is modelling how referee behavior may depend on context factors such as game time and game score.
Connections. To evaluate a referee, say Tim Peel, the paper considers (i) the population of all referees, (ii) the population of all referees minus Tim Peel, and evaluates whether the referee set shows more bias with Tim Peel than without. In the Exceptional Model Mining framework, one compares the individual (subgroup) against the whole population without the individual. In our exception mining/outlier detection work (below with Riahi), we compare the individual against the whole population. These three options seem to be pretty much cover it.
The high-level idea in this paper is to treat sports analytics as a branch of reinforcement learning. (In the sense of AI, not psychology.) We build a big Markov game model for ice hockey as played in the National Hockey League. In this type of model, the game moves to state to state with a certain probability, which depends on the actions taken by both teams. Our model contains over 1.3 million states and is learned from over 8 million events, from 10 NHL seasons. As in my previous papers on learning big models from big data, we use a relational database to store both the data and the model.
We apply the Markov game model to assign a value to actions . Evaluating the actions of players is a common task for sports analytics. Reinforcement learning has developed the concept of an action-value function, often denoted as the Q-function, that addresses this problem using AI techniques. There are efficient dynamic programming algorithms for computing the Q-function even for large state spaces. For sports analytics, the Q-function has two key advantages.
- Context-dependence. The impact of an action depends on the game context in which it's executed. For example, a goal is worth more if the game is tied than if one time is leading by four. A second shot on goal shortly after a first one is more likely to succeed. Etc.
- Lookahead. Hockey actions have medium term effects, not only immediate ones. For example, gaining a powerplay does not necessarily lead to a goal immediately or within the next minute.
If you are into ice hockey, it's fun and revealing to look at the resulting rankings. For instance, all three members of St. Louis' famed STL line (Schwartz, Tarasenko, Lehtera) are among the top 20 in our list. In fact, Jori Lehtera tops our goal impact list, although his linemate Tarasenko outscored him by far. Our analysis suggests that Lehtera's actions create the opportunities that Tarasenko exploits, leading to a high goal impact ranking for Lehtera and a high goal score for Tarasenko. This pattern fits the traditional division of labor between a center and a wing player. Tarasenko is also the most undervalued player in our list. Starting with the 2015-2016 season, St. Louis has signed him for an annual average more than 7 times higher, which our analysis strongly supports.
Another interesting case is Jason Spezza, who was on top of our goal impact list in the 2013-2014 season. However, his plus-minus score was an awful -26. This is because his team the Ottawa senators was even worse overall (-29 goal differential). The goal impact score rewards players for good decisions even if their teammates do not convert them into goals. After the 2013-2014 season, Spezza requested a trade, which according to our analysis was definitely the right thing for him to do. With the Dallas Stars, his plus-minus score came more in line with his goal impact score, as we would expect (-7 overall, still negative).
As I mentioned, we also look at players' impact on penalties. Sadly, two Vancouver Canucks players take spots 2 and 4 when it comes to actions that lead to penalties (Dorsett and Bieksa). Even more sadly, none of the Canucks player make it into the top 20 when it comes to actions that lead to goals.
More material on this topic.
- Supplementary Material.
- 2-slide spotlight presentation. Perhaps the best way to get a quick idea of the details.
- MLSA presentation at ECML 2015. 15 minute lecture. PDF format.
- MLSA Workshop paper at ECML 2015. Shorter workshop version. Does not show the range of action values, looks only at goals, not penalties. Shows that goal impact rankings are consistent in the sense that they correlate highly from one season to the next. Compared to the UAI paper, this is a bit more geared towards readers interested in hockey analytics and less in the computer science aspects. The Q-value ticker pretty much tells the whole story in one picture (Figure 1).
- Overview Talk. This presentation adds more context about sports analytics in general, with references. I've given versions of this at the Universities of Alberta, BC, and Fraser Valley.
Learning Bayesian NetworksI worked on a new algorithms for learning Bayesian networks that combine model selection with independence tests. This was inspired by my work on the learning theory of Bayesian networks.
We continue the project of using the learning-theoretic analysis of learning Bayes nets from statistical tests ("Mind-change optimal learning of Bayes net structure from dependency and independency data") to develop new algorithms for this problem. We follow the hybrid approach of the previous I-map learning paper that combines model selection scores with statistical tests. The previous paper considered Bayes nets with discrete variables; this one considers continuous variables with linear Gaussian distributions. This is a subclass of structural equation models, a very widely used model class in economics, psychology, biology and other disciplines.
We give simulation evidence to show that in these models, standard model selection scores (like Bayes Information Criterion) tend to select overly complex models with too many edges. We propose using statistical tests to cross-check the model selection score: before we add an edge to the graph, we check to see if this would help to cover a statistically significant correlation. If not, we don't add the edge, even if it would increase the score. We prove that this procedure converges to a correct graphical model in the sample size limit. In simulations it gives a better match to the true data-generating graph, on average, than the unconstrained model selection score. The procedure is especially good at finding models with the correct number of edges, which is the measure of model complexity that comes out of applying the mind change analysis from learning theory to Bayes net learning.
The learning-theoretic model of Bayes net structure learning presented in the paper "Mind Change Optimal Learning of Bayes Net Structure" below suggests using dependencies or significant correlations as a constraint in learning Bayes nets. We present a new hybrid criterion to make this idea mathematically precise (maximize a model selection score subject to the constraint of fitting the observed correlations), give an algorithm for optimizing the hybrid criterion, and report experimental results from simulations and real-world data. Our general conclusion is that the dependency constraints are helpful for discrete-variable Bayes nets that are not too large (10 nodes or less), especially on small to medium sample sizes (1,000-3,000). One of the main things I've learned is that multiple hypothesis testing is hard! Statisticians have recognized this for a long time and proposed various ingenious schemes. We tried a number of them with more or less success, but in our experiments it was hard to find one that solves the basic dilemma for all or even most problems: if the testing is too conservative (e.g., with low significance levels), you get little information from the test and diminishing returns on the computational overhead. If the testing is too aggressive, there are too many false positives leading to overly complex models. Hence our caveats on the range of problems for applying the hybrid method.
Automated Scientific DiscoveryOne of the questions that continues to fascinate me is why scientific discovery produces so much more powerful models than standard statistics or machine learning. Inspired by my work on learning theory, I developed several algorithms that computationally model this learning power, mainly for particle accelerator data.
Continues the research on learning conservation laws in physics and other areas of science. In my IJCAI 2009 paper I showed that choosing a maximally strict set of conservation laws---that rules out as many unobserved data points as possible---formalizes an important part of the methodology used by scientists. It's actually also a classic principle from Machine Learning known as selecting the least generalization of the data. In itself the principle is not enough to determine a unique set of conservation laws, so in this paper we added the requirement that the conservation laws should be maximally simple. "Simplicity" here means assigning small conserved quantities (the L1-norm of the conservation matrix). The combined criterion we call maximally simple maximally strict laws, or MSMS laws. A beautiful and surprising connection with particle ontology is that if there is a set of maximally strict laws that corresponds to a disjoint partition of particles into families (like a clustering), then that set is also the MSMS-optimum. This means that the MSMS criterion can be used to discover not only conservation laws but also particle families - it's like doing classification and clustering at the same time.
Finding an MSMS theory isn't easy because it requires minimizing the L1-norm with a non-linear constraint. My collaborator Mark Drew devised a clever way to do the minimization which is blindingly fast on realistic data sets. We show that this method re-discovers the laws and families in the Standard Model of particle physics, which is a very important scientific theory. It also works in chemistry, re-discovering the molecular structure of substances like ammonia and water.
One of the referees was kind enough to nominate this paper for the best paper award. It's another installment in my series on conservation laws in particle physics. Previously seen on this channel ("Inferring Conservation Laws in Particle Physics", 2000): A learning-theoretic analysis shows that minimizing mind changes requires a particle theorist to adopt the following principle: try to find a set of conservation laws that is consistent with all observed reactions, and rules out as many unobserved reactions as possible. I also showed that optimally fulfilling the principle may require hidden particles: Though it may seem paradoxical, explaining the data about observable entities may require positing unobserved entities. The 2000 paper gave a mathematical analysis that characterizes exactly when this phenomenon occurs. The technical answer is that it occurs when there is an unobserved reaction that is a linear combination of observed reactions, but one that involves fractions as coefficients (e.g., 1/2 reaction1 + 1/2 reaction2).
What's new is an efficient algorithm that allows me to compute when the fractionality condition for introducing hidden particles is met. Applying it to the actual particle data leads to the rediscovery of the electron antineutrino. This may not sound exciting for readers outside of particle physics, but in the particle physics literature they say that the existence of this particle is one of the two main questions the field is investigating. From a computer science point of view, the cool thing is that the algorithmic solution is to compute the Smith Normal Form of the reaction data: A hidden particle is needed if and only if the Smith Normal Form contains a diagonal entry greater than 1. I didn't know about the Smith Normal Form before I looked at this research. It's a classic tool for integer matrices from the 19th century, tells you a lot about the matrix you are dealing with. Kind of like singular value decomposition for real-valued matrices. So from learning theory we got to hidden particles, which led to fractions of reactions, which led to the Smith Normal Form.
How I found about the Smith Normal Form is a nice story about scientific collaboration, one of my best experiences in that regard. Once I had the question about hidden particles down to a linear algebra question about fractional linear combinations, I thought I could consult one of our theory experts, so I asked Arvind Gupta. Arvind said that if it was about linear algebra, I should ask Wayne Eberley from the University of Calgary. So I sent Wayne e-mail and he said it sounded like a job for the Smith Normal Form, and if that was the case, I should ask Mark Giesbrecht from the University of Waterloo. So I sent the problem to Mark, and he sent back the solution, complete with correctness proof and implementation advice. Thank you Mark!
A succinct summary (6 pages) of the result of applying the learning-theoretically optimal algorithm to finding conservation laws and particle families. (The algorithm was described in "Inferring Conservation Principles in Particle Physics", see below.) Written for physicists. Shows how particle families (e.g., baryons, lepton generations) are uniquely determined by reaction data. States the result that for any class of reactions (e.g., strong interactions), the number of conserved quantities in that class is no greater than the number of stable particles without a decay mode in the reaction class. This assumes the principles of Special Relativity - my only piece of theoretical physics so far! (Probably ever.) This paper supersedes our previous technical report posted at http://www.cs.sfu.ca/research/publications/techreports/#2006.
Learning TheoryMy dissertation was mainly about learning theory. My main theorem says that a certain combination of performance criteria for learning determines an optimal learning method for each problem where they are attainable. (The criteria are reliable, steady, and fast convergence to a correct hypothesis). The papers below describe the theory that led to this result. I have applied the analysis to determine and implement the optimal theory in several domains (particle physics, learning Bayesian networks).
Previous papers of mine developed an approach to inductive inference
based on success criteria like fast convergence to a correct hypothesis,
and minimizing theory changes. One application area was the discovery
of conservation laws and particle families in particle physics. This
work opens a new application domain: learning Bayesian networks . Our learning
model is based on increasing information about what statistically
significant correlations obtain between domain variables (conditional on
others). For example, the learner may be told that "father's eye colour
is correlated with mother's eye colour given the child's eye colour".
We apply previous our previous results from general learning theory to determine that the optimal method by our success criteria is to find a Bayes net that entails the observed significant correlations with a minimum number of edges , or directed dependencies between variables. So the learning theory leads to a new notion of simplicity for Bayes nets: the fewer edges, the simpler the net.
If the data are observed independencies, the optimal method is to conjecture the Bayes net with the largest number of edges that entails these independencies. The famous PC method for Bayes net learning seems to be a heuristic method for the independency case. The paper shows that it is NP-hard to determine whether there is a unique minimum-edge Bayes net for a set of observed probabilistic dependencies (or unique maximum-edge for independencies) As far as I know, this is the first NP-hardness result in the literature concerning the existence of a uniquely optimal Bayes net structure. Personally, it's the first NP-hardness proof I've done for a published paper (as opposed to exercise). It was fun but hard because of the uniqueness constraint (full proof runs over 7 pages). Since implementing the mind-change optimal learner is NP-hard, applying the learning theory requires heuristic solutions. I worked on this with Russ Greiner and my student Gustavo Frigo, see the "hybrid method for Bayesian network learning" papers.
More material on this topic.
- Mind Change Optimal Learning of Bayes Net Structure. O.Schulte, W. Luo and R. Greiner (2007). in Proceedings 20th Annual Conference on Learning Theory (COLT), LNCS 4539, pp. 187-202, San Diego, CA, June 12-15. Shorter conference version. Analyzes dependency data only, where no independencies are observed.
- Conference Presentation Slides.
Our final paper on the relationship between topology and mind changes in the Gold learning model. It extends the previous conference paper with the same name (sorry!) in two ways: (1) we introduce a stronger, and mathematically simpler, notion of what it is for a learner to minimize mind changes at a given stage of learning, (2) we relate mind change complexity as measured in point-set topology to other notions of learning complexity from the literature (e.g.,"reducibility" of one learning problem to another).
There is a great follow-up
to this paper by Matthew De Brecht and Akihiro Yamamoto from Kyoto University. They apply even more concepts from point-set topology than we did to gain insight into learnability.
This paper continues my work on using the criterion of minimizing belief changes to determine optimal solutions to problems of inductive inference. We look at mind changes in the Gold learning paradigm, using a generalization of the idea of a mind change bound beyond finite mind change bounds to transfinite (ordinal) ones. First we characterize the learning problems that can be solved with a given ordinal mind change bound. It turns out that this corresponds exactly to Cantor's classic concept of accumulation order in a topological space. Second we characterize the mind change efficient inductive methods that achieve the best possible mind change bound given any data sequence. Basically, the mind change efficient learners are those that produces conjectures that maximize accumulation order. We show that mind change efficiency places strong constraints on inductive inference, and illustrate these constraints with examples such as identifying a linear subspace and identifying a one-variable pattern language (Angluin).
A review for a statistics journal by a computer scientist of a book written by a philosopher and an engineer - I couldn't resist. The book is basically an introduction to VC learning theory (known in CS as PAC theory, for "probably approximately correct"), with as little technicality as possible. Unfortunately, the word limit for Biometrics was very tight, so pretty much all I could do was a brief summary of the VC dimension and an outline of what Harmann and Kulkarni do with it to discuss problems of inductive inference.
Formal Epistemology and Philosophy of ScienceAlgorithmic Learning Theory is a fruitful approach to many long-standing problems in epistemology, and the philosophy of science and induction.
This is the expanded journal version of the previous paper on underdetermination of conservation laws in particle physics. The title alludes to a quote by David Lewis, that "laws and natural properties get discovered together". The main new result added is that particle reaction data alone are sufficient to determine the matter/antimatter division of the particle world (without a general model of particle ontology like the Standard Model). I also give more detailed comparisons between the algorithmic or learning-theoretic method for discovering conservation laws and how physicists have approached this problem in practice.
The Congress takes place every 4 years; it began 1960. It is organized by the International Scientific Union for History and Philosophy of Science (IUHPS), which is an organization affiliated with UNESCO. This is a major interdisciplinary conference with a proud tradition; for instance, Nobel Laureate Harsanyi and computation theorist Buechi presented major results there. I felt honoured to receive an invitation to give a lecture.
The paper considers the challenge of discovering conservation laws in particle physics from a philosophical point of view rather than an algorithmic one. The main new issue is to single out a unique set of conservation laws from the infinitely many sets that make the same predictions. In previous work I showed that the criteria based on mind change efficient learning developed in previous papers select exactly the conservation laws that make the ssame predictions as those found in the Standard Model of Particle Physics (see "Algorithmic Derivation of Additive Selection Rules and Particle Families from Reaction Data" and "Inferring Conservation Principles in Particle Physics: A Case Study in the Problem of Induction" below). But there are many different sets of conservation laws that achieve this (as many as there are bases for a vector space). So the question arises: why pick one of the law sets over another? In philosophy of science, this is sometimes called "global underdetermination" and sometimes the problem of choosing between empirically equivalent theories, meaning theories that agree on all their predictions. (The standard example is Copernicus' theory of the solar system vs. Ptolemy - with suitable parameters, both can be made to give the same predictions.) The answer in this paper is that particle ontology selects a unique theory. I prove that if you want a set of conservation laws that clusters particles into disjoint families (e.g., Baryons, Muon, Tau, Electron Types), then there is at most one way of doing that. Which---you guessed it---is the one physicists have chosen. It was very surprising to me that the clustering criterion picks out a unique set of conservation laws; mathematically it is a pretty result in linear algebra. So the main thing I learned in this research is the importance of ontology or taxonomy in scientific research. Based on this insight, I make a philosophical proposal, namely to answer the question: What is a natural kind? with: A natural kind is a category that appears in the ontologically simplest empirically adequate set of generalizations.
Another introductory piece on formal learning theory. Provides a fair amount of comparison with other approaches to inductive inference like confirmation theory in philosophy and language learning theory in computer science.
Covers conceptions of scientific method from Galileo and Newton to contemporary statistics and epistemology. Explains how modern work on inferring Bayes Nets or graphical models of causation fits into traditional philosophical methodology going back at least to the Middle Ages.
The online encyclopaedia is a great idea! My contribution is an introductory entry on formal learning theory.
Presents a learning-theoretic analysis of a problem that arises in particle physics: to infer conservation principles governing elementary particles from data about reactions among such particles. The criteria of reliable, fast and steady convergence to a correct theory (see "Means-Ends Epistemology" below) give strong constraints on what these inferences should be. In some cases they dictate that a particle theorist has to posit the existence of hidden particles.
In response to my article "Means-Ends Epistemology", David Chart constructed an example in which means-ends analysis sanctions the inference "all emeralds are grue" from a sample of green emeralds. I take this to show two points: (1) means-ends analysis does not depend on the predicates that an inquirer might have chosen to include in her language, but (2) means-ends analysis does depend on what hypotheses the inquirer takes seriously at the outset of inquiry. What hypotheses to take seriously is a different (though related) issue from what inferences to draw from given evidence.
A review of Martin and Osherson's treatment of many topics in scientific inquiry and inductive inference from a learning-theoretic point of view.
This paper pursues a thorough-going instrumentalist, or means-ends, approach to the theory of inductive inference. I consider three epistemic aims: convergence to a correct theory, fast convergence to a correct theory and steady convergence to a correct theory (avoiding retractions). For each of these, two questions arise: (1) What is the structure of inductive problems in which these aims are feasible? (2) When feasible, what are the inference methods that attain them? Formal learning theory provides the tools for a complete set of answers to these questions. As an illustration of the results, I apply means-ends analysis to various versions of Goodman's Riddle of Induction.
Develops the same ideas as "The Logic of Reliable and Efficient Inquiry", but in different directions, and at a much lower level of technicality. Features a means-ends vindications of (aversion of) Occam's Razor as well as the natural generalizations in a Goodmanian Riddle of Induction. I establish a hierarchy of means-ends notions of empirical success , and discuss a number of issues, results and applications of means-ends epistemology.
Compares learning theory with other approaches to inductive inference, especially Bayesian confirmation theory. Has a section of examples that illustrate learning-theoretic analysis, from scientific revolutions to cognitive science. No symbols!
When somebody proposes general norms for empirical inquiry, means-ends epistemology asks whether these norms help or hinder the goals of inquiry. In this paper we investigate whether axioms for "minimal-change belief revision" prevent agents from reliably finding the truth. In a nutshell, the answer is that these axioms don't help but they don't have to hurt either, provided we design our "belief revision operators" in the right way.
One interpretation of point-set topology is as the theory of inquiry for logically omniscient agents with no limitations on memory capacity (cf. Vickers' book "Topology via Logic"). This view turns topology into a powerful tool for epistemology. The paper is a friendly introduction to the connection between topology and epistemology. Features topological interpretations of Popper's falsifiability criterion.
Tradition has it that deductive logic and empirical inductive reasoning are two disparate forms of inquiry. However, modern theories of computation and of empirical inquiry reveal deep and fundamental analogies between the two. We describe some of the most important similarities. These analogies are yet another reason to study the empirical capacities of computationally bounded agents---hence we review the results from the paper below.
If a computer cannot derive the predictions of a theory, is it possible for a computer to test the theory against empirical data? The surprising answer is: sometimes yes, even with infinitely uncomputable (hyperarithmetical) theories. The paper gives a complete map of the relationship between the inductive complexity of a theory---how hard it is to test the theory---and its deductive complexity---how hard it is to derive the predictions of the theory.
Belief RevisionMy approach to learning is that optimal methods should be determined by clearly specified performance criteria. I wrote several papers trying to understand in what sense exactly belief revision principles lead to minimal belief changes.
One of the oldest ideas in belief revision theory is due to Isaac Levi: that changing your beliefs given some new information should proceed in two steps: first, give up old beliefs contradicted by the new information, second add the new information. Bill Harper proposed the reverse for belief contraction or weakening: to give up a belief p, consider what you would think if you learned that p was false. I show exactly what these two ideas entail, and under what conditions they can be used to make belief revision and belief contraction inverses of each other. - I saw Harper and Levi once at a conference at the same time, so I got to tell them about these results at the same time. Fun for me at least.
A survey of some basic results about minimal belief revision. Covers topics such as Pareto-minimal theory change, the Levi and Harper Identities, base revision, conditionals and the Ramsey test, and the Grove representation theorem for the AGM postulates.
What do changing beliefs, revising legal codes and updating knowledge bases have in common? The answer, according to many philosophers, logicians and computer scientists, is that they all ought to obey the principle of minimal belief change in the light of new information, change your beliefs or legal code or knowledgebase just enough to incorporate the new information but not more. Our paper "Reliable Belief Revision" below critically examines the effects of this principle on an agent's ability to reliably arrive at true beliefs.
In this paper, I ask what exactly a minimal belief change is. I apply the fundamental decision-theoretic principle of Pareto-Optimality to define a notion of minimal belief change, and then prove that a theory change is Pareto-minimal iff it satisfies certain axioms. It turns out that these axioms correspond exactly to three axioms for counterfactuals that are familiar from Lewis and Stalnaker's work. I also work out the consequences of Pareto-Optimality for minimal revision of axiomatic bases representing beliefs (rather than deductively closed theories). I compare Pareto-minimal belief change with the well-known AGM belief revision axioms.
Game TheoryGame theory provides a mathematical theory of social interactions among multiple agents, which is an amazing thing. My own contributions cover a variety of topics: applications, decision-theoretic foundations, logical representation.
An extended journal version of our ESA paper below with the same title (sorry!). Analyses the selfish routing game, a popular model of user behavior on a computer network. This is basically a congestion game. Our paper combines evolutionary analysis with Bayesian games of incomplete information. A description of results appears with the conference version below. The journal version has examples and nicer pictures :-)
Erratum: in Figure 4 and on page 1060, replace the symbol p with \sigma.
Most of my papers on game theory are about using logic to address foundational questions or solve computational problems. This paper is more in the standard mould of an economics game theory paper: take a game of interest and figure out its equilibria. The game of interest here is the parallel links network routing game popularized by Papadimitriou, which is a simple kind of congestion game. Our novel contribution is to consider evolutionarily stable equilibria in this game. In my view it makes a lot of sense to model a network with a large population as users, and in a population model evolutionary stability is a very natural concept to apply. The main result mathematically is that there is at most one evolutionarily stable Nash equilibrium. It was surprising to me that in this game evolutionary equilibrium has so much predictive power. We also give necessary and sufficient conditions concerning the distribution of tasks to resources that hold in the evolutionary equilibrium, which leads to some pretty pictures.
Game Trees are a very general, highly-developed mathematical formalism for representing what an agent knows about its environment to plan goal-directed actions. The Situation Calculus is a logical, computational formalism for representing environmental dynamics. In this paper we show a strong relationship between the two: just about every game-theoretic model of an environment can be described in the situation calculus with a categorical set of axioms. This implies that all and only true features of the game-theoretic model are expressed by the situation calculus axioms. Some provisos are that the game tree must provide each agent with at most countably many actions, and that the agents' utility functions (goal specifications) must be continuous in the Baire topology. We give situation calculus definitions for game-theoretic concepts such as Nash equilibrium and backward induction.
important approach to game theory is to examine the consequences of
that rational agents may have about each other. This paper considers
for public preferences. Asheim's epistemic characterization of respect
preferences interprets this condition as follows, roughly speaking: An
explains unexpected behaviour by other agents as mistakes, and considers
serious mistakes less likely than less serious ones. "Properly
rationalizable" outcomes of a multi-agent interaction are those that are
consistent with common belief in respect for preferences. It is
determine in a given game model what follows from common belief in
preferences. This paper proposes an algorithm for deriving consequences
assumption called Iterated Backward Inference. Iterated Backward
Inference is a
procedure that generalizes standard backward induction reasoning for
both perfect and imperfect information. Our main result shows that if
for public preferences and perfect recall obtains, then players choose
accordance with Iterated Backward Inference.
The algorithm in this paper does not necessarily find all strategies consistent with proper rationalizability. This problem was recently solved by Andres Perea.
Together with Jim Delgrande, I showed that just about every game tree has a sound and complete (categorical) axiomatization in the situation calculus (see above). This shows that the situation calculus is sufficiently expressive to represent multi-agent interactions. We apply this insight to give a natural axiomatization of a variant of the "Clue" game in the situation calculus. The paper discusses a number of the issues that arise, such as the representation of what is common knowledge between various players and successor state axioms for the rules of the game.
I regularly use the decision-theoretic criterion of admissibility (avoiding weak dominance) to evaluate the performance of inductive methods. This paper applies admissibility to make recommendations not just for games of inquiry but for any game, including those modelling social phenomena of great interest. We define a natural procedure for eliminating weakly dominated strategies in both game trees and matrix games. We show that this procedure satisfies a commonly accepted wishlist of properties that we would like in our tools for analyzing games---one of the few solution concepts to do so. Technically speaking, iterated admissibility generalizes backward and forward induction reasoning and yields the same result in game trees as it does in the matrix games that represent them (proviso: the game tree must have perfect recall).