How does this matter for predictive accuracy? In our log-linear equation, the weights are log-conditional probabilities computed from the parameter estimates in the Bayesian network, so the weights are roughly on the same scale for all features. Since feature counts can differ exponentially, the (weight × count) terms with high counts override those with lower counts. With proportions, in contrast, all the terms are on the same scale and are balanced against each other properly. Going back to the example of predicting the genre of "Fargo": the feature "country of origin = U.S." is instantiated only once, so it contributes just one term (log-conditional probability × 1). If the number of female actors in the movie is 70, then the feature "actor-female" contributes a term (log-conditional probability × 70).
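To make the imbalance concrete, here is a small Python sketch. All numbers are made up (the weights, the count of 70, and the assumed cast size of 100 are purely illustrative), but the arithmetic shows why a high-count term swamps the score while proportions keep both features on the same scale:

```python
import math

# Hypothetical log-conditional-probability weights for genre = "crime".
# Being log-probabilities, they are all on a similar (negative) scale.
weights = {
    "country_US": math.log(0.6),    # log P(country = US | crime)
    "actor_female": math.log(0.4),  # log P(actor is female | crime)
}

# Feature instance counts for a movie like "Fargo" (invented numbers)
counts = {"country_US": 1, "actor_female": 70}

# Count-based score: the high-count feature dominates the sum.
count_score = sum(weights[f] * counts[f] for f in weights)

# Proportion-based score: each count is divided by the number of
# possible instances of that feature (1 country slot, 100 cast slots).
totals = {"country_US": 1, "actor_female": 100}
prop_score = sum(weights[f] * counts[f] / totals[f] for f in weights)

print(count_score)  # dominated by the 70 actor-female instances
print(prop_score)   # both features contribute on the same scale
```

With counts, the single country term is invisible next to the 70 actor terms; with proportions, both terms land in the same narrow range.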

Btw, this is still a problem in the more general log-linear model (Markov Logic Networks), where weights can be of any magnitude. We can assign smaller weights to features with more instances---and a weight-learning method will do just that---but a single weight cannot properly balance all the different cast sizes across different movies. More generally, a single weight cannot balance all the different local sizes defined by the relational neighborhoods of different entities.
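A tiny numeric sketch of this point, again with invented weights and cast sizes: a count term scales with the size of the cast, so no single weight fits both a small and a large movie, whereas a proportion term stays on a fixed scale regardless of cast size:

```python
# One learned MLN-style weight for the "actor-female" feature
# (value is made up for illustration).
w = -0.02

# (number of female actors, total cast size) for two hypothetical movies
casts = {"indie_film": (4, 8), "blockbuster": (120, 200)}

results = {}
for movie, (female, total) in casts.items():
    count_term = w * female          # grows with the cast size
    prop_term = w * female / total   # always between w*0 and w*1
    results[movie] = (count_term, prop_term)
    print(movie, count_term, prop_term)
```

The count terms for the two movies differ by more than an order of magnitude, while the proportion terms are nearly identical: shrinking `w` to tame the blockbuster would also wipe out the feature's influence on the indie film.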

*Counts correspond to the complete instantiation semantics, frequencies to the random instantiation semantics.* The most common semantics for a graphical first-order model is based on "unrolling" or "grounding" the graphical model. One forms an "inference graph" with all possible instances of the edges in the first-order template, obtained by replacing first-order variables with constants. After that, one applies the standard product formulas for probabilistic reasoning, which multiply together the factors in the graph (conditional probabilities for Bayesian networks, clique potentials for Markov networks). The problem is that these formulas multiply together *all* factors. This means that the grounding semantics entails a log-linear model based on feature counts. The random instantiation semantics, in contrast, entails a log-linear model based on proportions. (Details in the paper.)

*Counts define compatible conditional distributions, frequencies do not.* Statisticians say that a set of conditional distributions is **compatible** if there is a single joint distribution *p* such that the conditional distributions derived from *p* agree with the given conditional distributions. Heckerman et al. refer to a dependency network with compatible distributions as consistent. The main theorem in our paper gives necessary and sufficient conditions for our dependency networks to be consistent; except under mild restrictions, they are not. In contrast, the conditional distributions that result from the log-linear equation used with counts agree with a Markov random field and hence are compatible.
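To illustrate the definition of compatibility, here is a sketch for the simplest case of two binary variables, with invented numbers. It uses the standard characterization that strictly positive 2x2 conditional tables P(X|Y) and P(Y|X) are compatible exactly when their odds ratios agree; this is only an illustration of the notion, not the construction from the paper:

```python
def odds_ratio(c):
    # c[a][b] = P(first variable = a | second variable = b), a, b in {0, 1}
    return (c[1][1] * c[0][0]) / (c[0][1] * c[1][0])

# A compatible pair: both conditionals derived from one joint p[x][y].
p = [[0.3, 0.2], [0.1, 0.4]]                  # invented joint distribution
py = [p[0][0] + p[1][0], p[0][1] + p[1][1]]   # marginal of Y
px = [p[0][0] + p[0][1], p[1][0] + p[1][1]]   # marginal of X
x_given_y = [[p[x][y] / py[y] for y in (0, 1)] for x in (0, 1)]
y_given_x = [[p[x][y] / px[x] for x in (0, 1)] for y in (0, 1)]
print(odds_ratio(x_given_y), odds_ratio(y_given_x))  # equal: compatible

# An incompatible pair: replace P(X|Y) by a table with a different
# odds ratio; then no single joint yields both conditionals.
tweaked = [[0.5, 0.5], [0.5, 0.5]]
print(odds_ratio(tweaked) == odds_ratio(y_given_x))  # not compatible
```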

The way I see it now, there is a dilemma for statistical-relational learning: Continue to use complete instantiation/count models, losing predictive accuracy, or move to random instantiation/proportion models, losing compatibility.

Research on inconsistent dependency networks has shown that it is quite possible to reason with incompatible conditional distributions. Another alternative is to adopt collective factorization models. That raises the question of how to combine them with graphical first-order models that do not contain latent variables; for discussion see Nickel et al. Or maybe I'm missing something---this is complicated stuff!
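The standard way to reason with such a network is ordered (pseudo-)Gibbs sampling in the style of Heckerman et al.: resample each variable from its own conditional given the current values of the others, and read off long-run visit frequencies as an approximate joint. A minimal sketch with invented conditionals that need not come from any single joint distribution:

```python
import random

random.seed(0)

# Hypothetical conditionals for two binary variables; nothing
# guarantees they are compatible with a common joint.
p_x_given_y = {0: 0.7, 1: 0.2}   # P(X = 1 | Y = y)
p_y_given_x = {0: 0.4, 1: 0.9}   # P(Y = 1 | X = x)

x, y = 0, 0
visits = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
n_steps = 10000
for _ in range(n_steps):
    # Resample each variable in a fixed order from its own conditional.
    x = 1 if random.random() < p_x_given_y[y] else 0
    y = 1 if random.random() < p_y_given_x[x] else 0
    visits[(x, y)] += 1

approx_joint = {s: n / n_steps for s, n in visits.items()}
print(approx_joint)
```

The sampler is well defined whether or not the conditionals are compatible; when they are incompatible, the stationary distribution simply depends on the update order, which is the price paid for giving up consistency.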