Much of my work has addressed questions about learning, inference, and intelligent actions by proving theorems that aim at providing mathematical insight into these questions. Below I group several of these results by topic, with informal description of what the theorems say. Precise definitions and more details can be found in the abstracts and papers in my [publications page.]

Learning Theory

Starting with my dissertation work, I have investigated different performance criteria for learning. Once you define a learning goal, two questions arise: 1) what are necessary and sufficient conditions on a learning problem so that the goal is achievable? 2) Whenever the goal is achievable, what are necessary and sufficient conditions for learning methods to attain it? I have proved theorems that answer these questions and applied the results to design learning algorithms for challenging real-world problems.

Mind Change Optimality

Much of my work has been about the following performance criteria. Given a space of hypotheses and evidence for deciding among them, a learning method should achieve the following.
  1. Reliable convergence: as evidence accumulates, the method should eventually converge to the correct hypothesis. (This is like statistical consistency.)
  2. Steady convergence: before convergence, the number of hypothesis changes should be minimal.
  3. Fast convergence: the number of evidence items required to converge to a correct hypothesis should be minimal.
The second goal can be formalized as bounding the maximum number of mind changes, the number of times a learning method changes its mind on a single evidence stream. Mind change bounds are closely related to [mistake bounds].
  1. A fundamental theorem states that mind-change complexity = topological complexity , in the following sense. A hypothesis space can be viewed as a topological space. Each hypothesis is a point. The neighborhoods correspond to which hypotheses are consistent with which finite evidence sequences. For instance, with nested hypotheses, the nesting relation defines a topological set of neighborhoods (the order topology). For a given topological space, Cantor defined a measure of how complex a point in the space is, called the Cantor-Bendixon rank. The complexity of the whole space is the supremum of the complexity of its points. My result states that the Cantor-Bendixon rank of a hypothesis space is exactly the optimal bound on the number of mind changes for solving learning a correct hypothesis. Cantor's work dates back to the 19th century. I was pleased that such a classical fundamental mathematical structure is characteristic of mind change complexity.
  2. A second result concerns the design of methods that achieve the three criteria listed above (reliability, steadiness, speed). I proved that for any given learning problem, there is a unique optimal method. Moreover, this method always chooses the simplest hypothesis consistent with the data, where simplicity is measured by the Cantor-Bendixon rank. This assumes of course that the hypothesis space has a bounded Cantor-Bendixon rank. In other words, if the problem is solvable, there is a unique method that solves it. I'm especially proud of this result because it is unique in characterizing the methods that achieve optimal performance, rather than the structure of problems.

Applications: Designing Mind Change Optimal Algorithms

The result above says there is a unique mind change optimal learner for any problem, but it still takes work to find the learner for a given problem. I've applied learning theory to find optimal learners for problems in particle physics and in causal modelling.
  1. In particle physics , the problem is to simultaneously find conservation laws and hidden particles that explain particle accelerator reaction data. I showed the following results.
    • Given a matrix representation of the reaction data, the mind change optimal learner outputs a basis for the nullspace of the data matrix.
    • A hidden particle is required for mind change optimality if and only if the Smith decomposition of the data matrix contains a diagonal entry other than 0, 1, -1. This is really neat linear algebra. The Smith decomposition is like the singular value decomposition, but decomposes an integer matrix into integer matrices.
    • Particle families (disjoint clusterings) are uniquely determined by the data, and can be recovered by minimizing the L1-norm of the conservation law matrix.
    • Comparing with the models discovered by particle physicists, the mind change optimal algorithm rediscovers the conservation laws in the fundamental Standard Model of particle physics. It finds a new critical experiment for testing the existence of the electron antineutrino, one of the key issues in particle physics.
    • This algorithm also works for discovering molecular structure from chemical reaction data. For instance, discover that the substance known as "water" has the molecular structure H2O.
  2. Graphical Model Learning. Suppose the problem is to learn a causal graph, or Bayesian network, from observed conditional dependencies (constraint-based learning). The mind change optimal learner conjectures the unique graph that has the minimum number of edges and explains the observed correlations, if there is one. (Unique means up to d-separation equivalence.)
    I've recently extended this result to the case where one also wants to find the structure of the conditional probabilities in the Bayesian network, represented as decision tree. In that case mind change minimality requires minimizing the number of parameters in the model.

Statistical Learning for Relational Databases

Much of my work on learning for relational data is algorithmic and empirical. I have proved some results that provide a foundation for our algorithms.
  1. Closed-form for the random selection likelihood. A key step in selecting a statistical model is defining a quantitative metric of how well a model fits a dataset. This is difficult for relational data because related entities are not independent, so it does not seem that we can independently score each data point and then combine (multiply) the scores. I proposed using a random selection likelihood. The idea is that evaluating the model on a single instance is easy. So the overall score is the expected score from a randomly chosen instance. For example in a social network, we can assess how well the model fits the distribution of attributes in a single friendship pair. The overall score is the expected score from a randomly chosen friendship pair. My theorem gives a closed form evaluation for the random selection likelihood. The closed form is equivalent and requires only the sufficient statistics for the relational database. The closed form for Bayesian networks is very similar to the standard likelihood function for i.i.d. data: just replace counts by frequencies.
  2. Normal form for autocorrelations. One of the special features of relational data compared to independent datapoints is that the values of an attribute for an individual can depend on the values for the same attribute of related individual. For instance, the age of a person is predicted by the age of their friends. Jensen and Neville have coined the nice term "relational autocorrelation" for this. Temporal autocorrelation refers to the fact that, say, the humidity at time t predicts the humidity at time t+1. Relational autocorrelation is a real and frequent phenomonenon, that causes all kinds of problems, both for inference and learning. For graphical model learning one problem is that having the same variable appear twice. This increases the search space and leads to revisiting the same correlations. I proposed a normal form where a Bayesian network may contain two instances of a quantity (e.g., age), but only one of them can have edges pointing into it. In the temporal model, this is like saying that Humidity(t+1) can have parents but Humidity(t) cannot. The theorem says that this restriction incurs no loss of modelling power.
  3. Hierarchy of independence assumptions. One way to approach discriminative learning for relational data is to make independence assumptions. For example, we can assume that links are probabilistically independent of each other given the class label. We proposed a hierarchy of such assumption and proved that several relational prediction models in use can be rigorously derived from relational independence assumptions.

Game Theory

My work on game theory has been about describing and reasoning games, and also about analyzing specific game-theoretic models.
  1. Representing game trees in the Situation Calculus. I follow David Poole's insight that logic provides a compact formalism for representing problems in decision theory and game theory. I showed that von Neumann-Morgenstern game trees have a categorical axiomatization in the situation calculus. This means there is a set of axioms that completely and correctly describes the game tree. The result holds for games of imperfect information and infinite ones (the utility functions have to be continuous in the Baire topology). This general approach for imperfect information games has been adopted in the Game Description Logic for the General Game Playing Competition (but with a specific compact form for writing the logical axioms). Another way to look at this result is that it shows a very close relationship between the situation calculus and game trees. Since there are thousands of game tree models for strategic interactions in economics, political science, biology etc., this is evidence that the Situation Calculus is a very expressive formalism indeed.
  2. Evolutionary Game Theory for Network Games. I developed an evolutionary version of Papadimitrious's selfish routing game. To get evolutionary theory to work, I had to switch to a Bayesian version as well (there is uncertainty about the message sizes to be sent). To my knowledge this is the first Bayesian+evolutionary analysis in a computer science motivated game. Our main result is that there is a unique evolutionarily stable Nash equilibrium in such a network game. This means that you would expect the network traffic to settle into a unique pattern quite quickly. I'd love to test this prediction empirically.

Belief Revision Theory

Belief revision methods change beliefs in the light of new evidence. The main aim of belief revision methods is that this change should be minimal. I considered what happens if we take minimizing belief change as the aim of learning and then develop a similar characterization as with other learning - do we get something like the recommendations of belief revision theory? The answer is yes. I defined a concept of Pareto-minimal belief change in terms of the set of beliefs added or retracted by a belief revision operator. This applies the Pareto optimality concept from decision theory and seems like a minimal core of what it means to perform a minimal belief change. Pareto-minimal belief changes are characterized by a subset of the standard AGM axioms (notably dropping the preservation principle K*3). They correspond a set of logical axioms for conditional sentences that is a subset of Lewis' VC logic.

Computation Theory

I have been interested in connections between computation and learning. My Master's Thesis proved a set of results about how uncomputable the predictions of a hypothesis can be if it's possible for a computational learner to decide the hypothesis from data. The most surprising result is that there is a hypothesis with infinitely uncomputable predictions such that a computable learner can reliably indicate that it's false given enough data, if the hypothesis is false. This is a sharpening of a classic result of Hilbert and Bernays on the implicit definability of sentences true in arithmetic. This is an amazing construction but for Hilbert it was all in a day's work I think. I was pleased that this paper received a review in the Journal of Symbolic Logic.