Fast Learning of Relational Dependency Networks: The Long Story
The problems addressed in "Fast Learning of Relational Dependency Networks" concern the most important differences between learning from relational data and learning from i.i.d. data points. It's taken me a long time to get clear on this, so I will try to explain briefly what I've learned.
Class-level Frequencies and Instances.
The problem of how to move from relational class frequencies to single-instance probabilities is a deep one. Joe Halpern has thought intensely about frequencies vs. single cases in relational structures, starting in 1990 and going to at least 2006. Here's the issue. As Kahneman observes in his book Thinking, Fast and Slow, "statistical thinking derives conclusions about individual cases from properties of categories and ensembles". Classical statistics takes a pipeline approach: First, estimate the frequency of a property in a relevant reference class. Second, use your estimate to derive a probability that an individual from the reference class will have that property. For example, suppose you know that the proportion of 12-year-olds who have read Harry Potter is 50%, and you ask yourself "what is the probability that Joe has read Harry Potter?" where Joe is a 12-year-old you just met. If you don't know anything else about Joe, you would assign a probability of 50%. This is known as the direct inference principle, sometimes called the Principal Principle.

The problem is that in relational data the direct inference principle is not sufficient. I would say that this is the single most important difference between i.i.d. data and relational data. For instance, suppose that the probability that a 12-year-old has read Harry Potter increases to 60% if a randomly selected friend has also read Harry Potter. Even so, direct inference does not determine what the probability of Joe reading Harry Potter should be given that Joe is friends with Marina and Maite, who have both read Harry Potter: you need to combine the information from different related objects. So in i.i.d. data, the direct inference principle is both necessary and sufficient to move from frequencies to single cases. In relational data, it is still necessary, but it is no longer sufficient.
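To make the gap concrete, here is a toy sketch. It is my own illustration, not the paper's method; only the 50% and 60% figures come from the example above, and the two combining rules are hypothetical choices.

```python
# i.i.d. case: direct inference assigns the class frequency to the individual.
p_12yo_read_hp = 0.50
p_joe = p_12yo_read_hp  # 0.50, and nothing more is needed

# Relational case: the class-level statistic
#   P(child read HP | one randomly selected friend read HP) = 0.60
# does not by itself fix P(Joe read HP | Marina AND Maite both read HP).
# Some combining rule is needed, and different plausible rules disagree:
p_given_one_friend = 0.60
mean_rule = (p_given_one_friend + p_given_one_friend) / 2  # 0.60
noisy_or_rule = 1 - (1 - p_given_one_friend) ** 2          # 0.84
print(mean_rule, noisy_or_rule)
```

Both rules respect direct inference for a single friend, yet they give different answers for two friends; that is the sense in which direct inference alone is no longer sufficient.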
The log-linear model in this paper is a solution to Halpern's problem of translating relational statistics into single-instance probabilities: given features and parameters/weights that model class-level statistics (Kahneman's "categories"), the equation specifies probabilities for single-instance predictions.
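Schematically, and in notation I am introducing here rather than quoting from the paper, the equation has roughly this shape:

$$
P(Y_a = y \mid \text{relational neighborhood of } a) \;\propto\; \exp\Big(\sum_{F} p_F(a, y)\,\log \theta_F\Big),
$$

where $F$ ranges over the features (parent-child configurations of the Bayesian network), $\theta_F$ is the class-level conditional probability associated with $F$ (its logarithm is the weight), and $p_F(a, y)$ is the proportion of instantiations of $F$ that hold in $a$'s relational neighborhood when the target value is set to $y$. The next section is about why those $p_F$ are proportions rather than counts.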
Counts vs. Proportions.
A key aspect of the equation is to use proportions instead of the counts that you usually find in log-linear models. Mathematically, switching from counts to proportions is a trivial difference. But conceptually and in terms of predictive accuracy, the impact of this move is far-reaching. I will try briefly to explain why. First, an example of what we are talking about. Go back to the problem of predicting the genre of a movie from the genders of its actors, say the genre of "Fargo". Suppose that 100 actors appear in the movie, 70 are women and 30 are men. Using counts, a standard log-linear score would use something like 70 x weight(female actor) + 30 x weight(male actor). Instead we use 70% x weight(female actor) + 30% x weight(male actor). In a movie with 50 actors, the denominator would be 50. In i.i.d. data, counts are converted to frequencies by dividing by a fixed sample size N. In relational data, there is no fixed global sample size. This is IMO the second most important difference between i.i.d. data and relational data. In our example, the relevant sample size depends on the target movie (100 vs. 50 actors). It also depends on the feature under consideration: for example, if we predicted the genre of a movie from the ratings of its users, the relevant sample size would be the number of ratings of the movie. The fact that local sample sizes depend on features was termed "ill-conditioning" by Lowd and Domingos in the context of gradient descent.
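Here is a minimal sketch of the difference, with made-up weights and function names of my own; the numbers are purely illustrative.

```python
import math

# Hypothetical log-probability weights for the actor-gender feature; the point
# is only that the weights are on a similar scale, as discussed below.
W_FEMALE, W_MALE = math.log(0.6), math.log(0.4)

def count_score(n_female, n_male):
    return n_female * W_FEMALE + n_male * W_MALE

def proportion_score(n_female, n_male):
    n = n_female + n_male
    return (n_female / n) * W_FEMALE + (n_male / n) * W_MALE

# Same 70/30 gender ratio, different cast sizes (100 vs. 50 actors):
print(count_score(70, 30), count_score(35, 15))            # differ by a factor of 2
print(proportion_score(70, 30), proportion_score(35, 15))  # identical
```

The count-based score scales with the cast size even though the gender ratio is the same; the proportion-based score does not.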
How does this matter for predictive accuracy? In our log-linear equation, the weights are log-conditional probabilities computed from the parameter estimates in the Bayesian network. This means that the weights are roughly on the same scale for all features. But feature counts can differ by orders of magnitude, so the (weight x count) terms with high counts override those with lower counts. With proportions, all the terms are on the same scale and are balanced against each other properly. Going back to the example of predicting the genre of "Fargo", the feature "country of origin = U.S." is instantiated only once, so it contributes just a term (log-conditional probability x 1). If the number of female actors in the movie is 70, then the feature "actor-female" contributes a term (log-conditional probability x 70).
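With some hypothetical numbers (only the counts 70 and 1 come from the example; the log-probabilities are invented for illustration), say the actor-gender term has weight $-0.7$ and the country term has weight $-0.9$. Then:

$$
\text{counts: } 70 \times (-0.7) + 1 \times (-0.9) = -49.9, \qquad
\text{proportions: } 0.7 \times (-0.7) + 1 \times (-0.9) = -1.39 .
$$

Under counts, the country feature is essentially irrelevant to the score; under proportions, both features contribute on a comparable scale.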
Btw, this is still a problem in the more general log-linear models of Markov Logic Networks, where weights can be of any magnitude. We can assign smaller weights to features with more instances---and a weight learning method will do that---but a single weight cannot properly balance all the different cast sizes for different movies. More generally, a single weight cannot balance all the different local sizes defined by the relational neighborhoods of different entities.
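In schematic terms (my own notation), the actor-gender feature contributes to the score of movie $m$ either

$$
n_m \cdot w \;\;\text{(counts)} \qquad\text{vs.}\qquad \frac{n_m}{N_m}\cdot w \;\;\text{(proportions)},
$$

where $n_m$ is the number of female actors and $N_m$ the cast size. Keeping the count term on a fixed scale would require $w \approx c / n_m$, which varies from movie to movie, so no single learned weight fits all movies at once; the proportion term is bounded for any fixed $w$.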
Consequences.
Okay, so using proportions instead of counts improves predictions. Ravkic, Ramon and Davis arrived at the same conclusion. Does it make a difference for how we think about relational learning? In fact, the consequences are surprisingly profound in my opinion.
- Counts correspond to the complete instantiation semantics, frequencies to the random instantiation semantics. The most common semantics for a graphical first-order model is based on "unrolling" or "grounding" the graphical model. One forms an "inference graph" with all possible instances of edges in the first-order template, obtained by replacing first-order variables with constants. After that, one applies standard product formulas for probabilistic reasoning that multiply together the factors in the graph (conditional probabilities for Bayesian networks, clique potentials for Markov networks). The problem is that these formulas multiply together all factors. This means that the grounding semantics entails a log-linear model based on feature counts. The random instantiation semantics, in contrast, entails a log-linear model based on proportions (details in the paper; a rough sketch of the correspondence follows this list).
- Counts define compatible conditional distributions, frequencies do not. Statisticians say that a set of conditional distributions is compatible if there is a single joint distribution p such that the conditional distributions derived from p agree with the given conditional distributions. Heckerman et al. refer to a dependency network with compatible distributions as consistent. The main theorem in our paper gives necessary and sufficient conditions for our dependency networks to be consistent; outside of restrictive special cases, they are not. In contrast, the conditional distributions that result from the log-linear equation used with counts agree with a Markov random field and hence are compatible.
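Roughly, the correspondence promised above can be seen as follows (a back-of-the-envelope sketch in notation I am introducing here; the paper has the precise statements). Grounding multiplies one factor per grounding, so a factor that is instantiated many times enters the product many times:

$$
\prod_{\text{groundings } g} \theta_{F(g)} \;=\; \exp\Big(\sum_{F} n_F \, \log \theta_F\Big),
$$

where $n_F$ is the number of groundings instantiating feature $F$, which is exactly a count-based log-linear score. Taking instead the geometric mean over groundings, i.e. the expected log-factor under a uniformly random instantiation, gives

$$
\exp\Big(\tfrac{1}{n}\sum_{g}\log\theta_{F(g)}\Big) \;=\; \exp\Big(\sum_{F}\tfrac{n_F}{n}\,\log\theta_F\Big),
$$

which is a proportion-based log-linear score.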
The way I see it now, there is a dilemma for statistical-relational learning: continue to use complete instantiation/count models, losing predictive accuracy, or move to random instantiation/proportion models, losing compatibility. Research on inconsistent dependency networks has shown that it is quite possible to reason with incompatible conditional distributions. Another alternative is to adopt collective factorization models. That would raise the question of how to combine them with graphical first-order models that do not contain latent variables; for discussion, see Nickel et al. Or maybe I'm missing something---this is complicated stuff!