Birth, Death, and Barcodes: Discretization of Continuous Attributes in Bayesian Network Structure Recovery
Bayesian networks are graphical models that represent complex relationships among a large number of random variables. A Bayesian network consists, in part, of a directed acyclic graph in which nodes represent the variables and edges represent dependencies between them. Given data consisting of multiple realizations of random variables, presumed to be generated by a Bayesian network, the goal is usually to recover the best-fitting network or, in the case of simulated data, the actual generating network. Many methods exist for this task, but most assume that the data are either discrete or Gaussian. When the data are continuous and non-Gaussian, it is typical to discretize the data and then proceed with a discrete network recovery approach. Unfortunately, if this is not done thoughtfully, the very conditional relationships that one is out to discover can be destroyed. Formal approaches for scoring a discretization as “good”, “better”, or “best” have been proposed, but it is difficult to find high-scoring discretizations. Proposing a discretization policy involves finding thresholds in the data between which all observed values will be mapped to a single point. For a random variable with, for example, 1,000 distinct observed values, there are 2^999 possible policies to consider, since a threshold may or may not be placed in each of the 999 gaps between consecutive sorted values. In this talk, we will introduce a novel birth-and-death process approach on the space of thresholds for discretizations that appears to be very successful in maintaining maximal structure information. We will illustrate our approach on both real and simulated data sets.
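To make the size of the search space concrete, the following is a minimal sketch, not the method presented in the talk. It assumes a policy is simply a set of thresholds placed in the gaps between consecutive distinct observed values; the function names `count_policies` and `discretize` are hypothetical and introduced only for illustration.

```python
import bisect


def count_policies(values):
    """Count threshold policies for one variable (illustrative assumption:
    each gap between consecutive distinct sorted values either holds a
    threshold or does not)."""
    distinct = sorted(set(values))
    gaps = max(len(distinct) - 1, 0)
    return 2 ** gaps


def discretize(values, thresholds):
    """Map each value to the index of the interval it falls in.

    Values between two consecutive thresholds share one bin index,
    i.e. they are all mapped to a single point.
    """
    cuts = sorted(thresholds)
    return [bisect.bisect_right(cuts, v) for v in values]


if __name__ == "__main__":
    # 1,000 distinct values give 999 gaps, hence 2**999 candidate policies.
    print(count_policies(range(1000)) == 2 ** 999)
    # Two thresholds split the real line into three bins.
    print(discretize([0.1, 0.5, 0.9, 1.5], [0.4, 1.0]))
```

Exhaustive enumeration is clearly infeasible at this scale, which is the motivation for searching the space of thresholds stochastically.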