1 Introduction
Bidirectional information flow in belief propagation networks has become a very popular framework in many signal processing applications [1][2] because inference and learning can be easily managed with a small set of rules. Generally, Bayesian models aim at capturing the hidden structure that may underlie observed data through the assumption of a network of random variables that are only partially, or occasionally, visible [3].
Independent Component Analysis (ICA) is a popular signal processing framework in which observed data are mapped to, or generated from, independent hidden source variables [4]. The variables are typically continuous and the transformation between sources and visible variables is linear. ICA has been used in many applications for signal separation and for analyzing signals and images [4]. ICA filters, trained on real images, seem to converge to patterns that resemble the receptive fields found in the visual cortex [5].
In this paper we explore the possibility of using the generative model of ICA on discrete variables (Discrete ICA, or DICA). The Bayesian model is constrained to a finite number of discrete hidden sources (a factorial code) that feed the visible variables, which are also discrete. Even though computational difficulties naturally emerge in dealing with the product space of discrete alphabets, we find that, even limiting our attention to tractable small sizes, the DICA framework clearly shows potential in applications, perhaps as a building block of more complex architectures. Discrete Component Analysis (DCA) has also been discussed by Buntine and Jakulin [7] with reference to different models.
We reduce the DICA architecture to a Bayesian factor graph in the so-called reduced normal form (see [9] and references therein), which includes only simple interconnected blocks. We experiment with belief propagation on this architecture using images extracted from the MNIST dataset [12]. We show that, after learning, the DICA network converges to a generative model that reproduces the image set accurately.
In Section 2 the Bayesian model is presented and in Section 3 its discrete version is transformed into a factor graph for belief propagation. The various modes of inference are discussed in Section 4 and learning in Section 5. The simulations for unsupervised mapping of the MNIST images are reported in Section 6, with the addition of the label variable in Section 7. The conclusions are in Section 8.
2 The Bayesian Model
In this paper we focus on the generative model depicted as the bipartite graph of Figure 1, with independent (hidden) source variables S_1, ..., S_M. The main (visible) variables X_1, ..., X_N are connected to the source variables via the factorization
p(x_1, ..., x_N | s_1, ..., s_M) = p(x_1 | s_1, ..., s_M) ··· p(x_N | s_1, ..., s_M).   (1)
Note that, to be conditionally independent, each visible variable must be conditioned on the whole set of sources, even if the sources' marginal distribution factorizes: p(s_1, ..., s_M) = p(s_1) ··· p(s_M). This appears to be the most general model for independent hidden sources that underlie a set of dependent variables X_1, ..., X_N. When M = 1, the system degenerates into a single-variable latent model [2].
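As a concrete illustration, ancestral sampling from this factorization is straightforward. The following sketch uses hypothetical sizes and random conditional tables (in the paper these are learned from data) and draws one sample from the model:

```python
import numpy as np

rng = np.random.default_rng(0)

M, N = 3, 4                                  # hypothetical: 3 hidden sources, 4 visible variables
priors = [np.array([0.7, 0.3])] * M          # independent binary source priors p(s_i)
# one conditional table p(x_j | s_1, ..., s_M) per visible variable,
# indexed by the joint source configuration (2**M rows)
cond = [rng.dirichlet(np.ones(2), size=2 ** M) for _ in range(N)]

def sample():
    s = [rng.choice(2, p=p) for p in priors]               # sources drawn independently
    idx = int("".join(map(str, s)), 2)                     # index into the product space
    x = [rng.choice(2, p=cond[j][idx]) for j in range(N)]  # visibles, conditionally independent
    return s, x

s, x = sample()
```

Each visible variable is drawn from its own conditional table, but all the tables are indexed by the same joint source configuration, which is exactly the conditioning structure of Eq. (1).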
One way of solving for the probability functions involved in the Bayesian model is to group (marry) the source variables (parents) [8], as in Figure 2. Note that the Bayesian graph does not show that the source variables are marginally independent. This is made more explicit in the factor graph representation that follows.

2.1 Generative model for classical ICA
Classical Independent Component Analysis is obtained when all the variables are continuous and the conditional probability density functions p(x_j | s_1, ..., s_M) are constrained to depend on linear combinations of the sources. More specifically, the typical assumption is that the linear combinations contribute to the means of the visible variables and that the dispersion around the mean is spherical and follows a Gaussian distribution
p(x_j | s_1, ..., s_M) = (2πσ²)^(-1/2) exp(-(x_j - a_j^T s)² / (2σ²)),   (2)
where the vector s = (s_1, ..., s_M)^T contains all the source values and a_j is the j-th column of the coefficient matrix A [5]. More compactly, x = A^T s + n, where x = (x_1, ..., x_N)^T and n is zero-mean spherical Gaussian noise. The sources' pdfs can follow various distributions, ranging from uniform to Laplacian [5]. Typically, for the model to be identifiable, the sources cannot be Gaussian (except perhaps for one of them).

Unfortunately, when ICA is used as a generative model it is hard to produce realistic images, even when experimental densities are used as source densities [5]. Structured patches are easy to obtain, but they do not resemble the complex structures found in natural images. The reason is that independent continuous sources do not carry the structure necessary to assemble the ICA components into the complex patterns found in natural images. We report a simulation in the following that seems to confirm these results. Attempts have been made to use ICA in two-layer architectures [5]. However, it is not clear how to properly include nonlinearities (without nonlinearities the whole system would still be linear) and investigations in this direction are still in progress.
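For reference, the classical generative recipe described above can be sketched as follows. The mixing matrix and sizes here are hypothetical placeholders; in practice the coefficient matrix would be learned from data:

```python
import numpy as np

rng = np.random.default_rng(0)

M, N = 8, 64                          # hypothetical: 8 sources, 64-pixel patches
A = rng.standard_normal((M, N))       # coefficient matrix with columns a_j (random stand-in)
sigma = 0.1                           # spherical noise dispersion

def generate(n_samples):
    # Laplacian (super-Gaussian) sources keep the model identifiable
    s = rng.laplace(size=(M, n_samples))
    x = A.T @ s                       # means: x_j = a_j^T s
    return x + sigma * rng.standard_normal((N, n_samples))

X = generate(10)                      # ten synthetic patches, one per column
```

With unconstrained independent samples at the sources, patches generated this way show only average structure, which is the limitation discussed above.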
2.2 Discrete ICA
In this work we experiment with the unconstrained ICA model with discrete variables. More specifically, we assume that both the sources and the visible variables take values in finite discrete alphabets.
The difficulties in dealing with such a model are clearly related to the computational complexity of manipulating the product space of the source alphabets, whose size is the product of the individual alphabet sizes (Figure 2). However, we find that even limiting our attention to small dimensionalities, i.e., to a few source variables and to small alphabets, the framework applied to natural images reveals quite interesting results. Furthermore, the basic architecture can be used as a building block for more complicated multilayer Bayesian architectures (not discussed in this paper).
3 DICA in Reduced Normal Form
Probability propagation and learning for the graph of Figure 1 can be handled in a very flexible way if we transform the model into a factor graph, as in Figure 3. The graph is in the so-called reduced normal form (see [9] and references therein), which is composed only of one-to-one blocks, source blocks and diverters (equality constraint blocks that act like buses for belief propagation). One-to-one blocks are characterized by a conditional probability matrix and sources by a probability vector. We have often advocated the use of such a representation because it can be handled as a block diagram and is amenable to distributed implementations. We have also designed a Simulink library for rapid prototyping [10].
More specifically, for the DICA model the source variables, which have prior distributions p(s_1), ..., p(s_M), are mapped to the product space via the fixed row-stochastic matrices (shaded blocks)
P_i = (1 / Π_{k≠i} d_k) (1_{d_1}^T ⊗ ··· ⊗ 1_{d_{i-1}}^T ⊗ I_{d_i} ⊗ 1_{d_{i+1}}^T ⊗ ··· ⊗ 1_{d_M}^T),   (3)
where ⊗ denotes the Kronecker product, 1_d is a d-dimensional column vector with all ones, I_{d_i} is the identity matrix and d_k is the size of the k-th source alphabet. The conditional probability matrix is such that each variable contributes to the product space with its own value and is uniform on the components that belong to the other source variables. The blocks at the bottom of Figure 3 represent the conditional probability matrices p(x_j | s_1, ..., s_M), j = 1, ..., N, which, together with the source prior distributions, are typically learned from data. Information flows in the network bidirectionally: for each branch variable there is a forward (f) and a backward (b) message, which are (or are proportional to) discrete probability vectors. Messages are usually kept normalized for numerical stability. The variables connected to a diverter represent replicated versions of the same variable, but they all carry different forward and backward messages that are combined with the product rule [11]. Propagation through each one-to-one block with matrix P follows the sum rule, which in the variable direction is the matrix multiplication f_out = P^T f_in (already normalized) and in the opposite direction is b_in = P b_out, followed by normalization. After propagation for a number of steps equal to the graph diameter (if there are no loops), the posterior probability
for a variable branch can be computed with the normalized product p(x) ∝ f(x) ⊙ b(x), where ⊙ denotes the element-by-element product of two vectors. For the reader not familiar with this framework, it should be emphasized that these simple rules are rigorous translations of marginalization and Bayes' theorem [11].
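A minimal numerical sketch of these ingredients, assuming three binary sources (names and sizes are our illustrative choices, not the paper's):

```python
import numpy as np

def embed_matrix(i, sizes):
    """Row-stochastic matrix in the spirit of Eq. (3): identity on the
    i-th source component, uniform on all the other components."""
    blocks = [np.eye(d) if k == i else np.ones((1, d)) / d
              for k, d in enumerate(sizes)]
    P = blocks[0]
    for b in blocks[1:]:
        P = np.kron(P, b)          # Kronecker product builds the product-space map
    return P

sizes = [2, 2, 2]                  # three hypothetical binary sources
P0 = embed_matrix(0, sizes)        # maps the first source into the size-8 product space

f = np.array([0.9, 0.1]) @ P0      # forward message through the block (sum rule)
b = np.ones(8) / 8                 # an uninformative backward message
post = f * b / (f * b).sum()       # posterior: normalized element-by-element product
```

Each row of `P0` is a valid conditional distribution over the product space: a delta on the source's own value, uniform on the others.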
4 Inference in the DICA graph
The flexibility of this framework allows the use of the factor graph of Figure 3
in various inference modes. Information flow is bidirectional and, assuming that all the parameters have been learned and that the unspecified messages are initialized to uniform distributions, we can use the DICA graph in the following modes:
(1) Generation: Source values are picked and injected as delta distributions in the forward messages at the source branches. After three steps of message propagation, the forward distributions are collected at the terminal (visible) variables. They are the (soft) decoded version of the source values. Note that these are distributions, typically displayed via their means or their argmaxes (see simulation results in the following).
(2) Encoding: Observed values for the visible variables are injected as delta backward distributions at the bottom. After three steps of message propagation, the backward distributions are multiplied with the forward distributions at the sources. The normalized result is a (soft) factorial code of the input. The set of argmaxes of these distributions is the MAP decoding of the input.
(3) Pattern completion: Only a subset of the visible values is available (there are erasures). The available values are injected at the bottom as delta backward distributions; for the missing values, uniform distributions are injected. After three steps of message propagation, forward distributions are collected at the bottom variables. For the observed variables the forward-backward products return just the deltas on the observations and provide no new information. At the unknown variables, the forward distribution is our best (soft) knowledge of that variable. Here too the means or the argmaxes can be used as a final result. The inference on the erasures is the synthesis of the information coming from the observations and the priors.
(4) Error correction: The available visible values may contain errors. They are presented as backward delta distributions at the bottom variables. After three steps of message propagation, forward distributions (or their means or argmaxes) are collected and used as corrections. No product with the backward messages is applied here because we do not know which components are reliable. In a similar scheme, the visible values may be known only softly via distributions that are injected at the bottom as backward messages.
Note that in both (3) and (4), coded versions of the observations are also available at the source branches.
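To make the encoding mode concrete, here is a small sketch of the message schedule on a toy DICA graph. The conditional tables and sizes are our random placeholders; in the paper these blocks are learned from data:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 4                                     # hypothetical sizes
D = 2 ** M                                      # product-space size

def embed(i):                                   # Eq.-(3)-style embedding for binary sources
    blocks = [np.eye(2) if k == i else np.ones((1, 2)) / 2 for k in range(M)]
    P = blocks[0]
    for b in blocks[1:]:
        P = np.kron(P, b)
    return P

P = [embed(i) for i in range(M)]                # source -> product-space matrices
A = [rng.dirichlet(np.ones(2), size=D) for _ in range(N)]  # p(x_j | s_1..s_M), random stand-ins
priors = [np.array([0.6, 0.4])] * M

def encode(x):
    """Encoding mode: clamp the visible values, return the source posteriors."""
    b_up = [A[j][:, x[j]] for j in range(N)]            # backward from each observation
    f_up = [priors[i] @ P[i] for i in range(M)]         # forward from each source prior
    posteriors = []
    for i in range(M):
        # diverter: product of all incoming messages except source i's own forward
        msg = np.prod(b_up, axis=0) * np.prod([f_up[k] for k in range(M) if k != i], axis=0)
        post = priors[i] * (P[i] @ msg)                 # backward through the embedding block
        posteriors.append(post / post.sum())
    return posteriors

codes = encode([1, 0, 1, 1])                            # soft factorial code of the input
```

Because this toy graph has a single diverter and no loops, the schedule above is exact belief propagation; the argmax of each posterior gives the MAP factorial code.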
5 Learning in the DICA graph
To train the DICA system, we assume that a set of T examples is available for the visible variables (the training set). Learning the conditional probability matrices for the bottom blocks and the prior vectors for the sources is performed using an EM search. Various algorithms can be used, all inspired by a localized maximum likelihood cost function. The iterations are confined to each block and use only locally available forward and backward messages. Details on the learning algorithms for the factor graph in reduced normal form have been reported elsewhere and are omitted here for space reasons (see [6][9] and references therein).
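The exact updates are in [9]; as a rough, generic illustration of the idea, a localized EM-style re-estimation of one row-stochastic block matrix from the messages at its ports might look like this (our sketch under that assumption, not the published algorithm):

```python
import numpy as np

def local_em_update(A, f_in_list, b_out_list, eps=1e-12):
    """Re-estimate a row-stochastic conditional matrix A from per-example
    input forward messages and output backward messages (generic EM-style sketch)."""
    acc = np.zeros_like(A)
    for f_in, b_out in zip(f_in_list, b_out_list):
        joint = np.outer(f_in, b_out) * A      # unnormalized posterior over (input, output)
        acc += joint / (joint.sum() + eps)     # accumulate expected co-occurrence counts
    acc += eps                                 # avoid empty rows
    return acc / acc.sum(axis=1, keepdims=True)
```

The update uses only messages locally available at the block, which is the defining feature of the localized learning rules described in the text.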
6 DICA Simulations
We report here a full set of simulations on the MNIST dataset [12]. We have reduced the images to binary pixels and extracted 500 images as our training set. In a first set of experiments we train the architecture of Figure 3 with all binary variables, for various numbers of sources M. During learning, the 500 images of the training set are presented once as backward delta distributions at the visible variables, with 5 cycles inside each block (the maximum likelihood algorithm inside each block is iterative [9]). Therefore, for each order M we obtain the conditional probability matrices of the bottom blocks and the source prior distributions.

Generation: Figure 4 shows, for increasing M, the means of the forward distributions at the visible variables when we inject all the binary source configurations as forward messages. The learned priors are also reported in the picture. We note that a larger number of sources, i.e., a larger product space (sizes 2, 4, 8, 16, 256), corresponds to increasingly accurate pattern memorization. For some characters that differ in shape, the system builds separate representations. The source variables, independent by definition (factorial code), learn marginal distributions that become progressively less uniform as the number of sources increases (recall that the vector representing the product-space distribution is the Kronecker product of the individual binary distributions, and that even small non-uniformities in the priors cause it to be highly non-uniform).
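The compounding effect mentioned in the parenthesis is easy to check numerically; with a mildly non-uniform binary prior (0.6/0.4 is our example value), the ratio between the largest and smallest product-space probabilities already exceeds 25 for eight sources:

```python
import numpy as np
from functools import reduce

prior = np.array([0.6, 0.4])            # a mildly non-uniform binary prior
M = 8
joint = reduce(np.kron, [prior] * M)    # Kronecker product: distribution over 2**8 configs
ratio = joint.max() / joint.min()       # (0.6 / 0.4)**8, about 25.6
```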
Encoding: Figure 5 shows the typical results of presenting to the DICA graph of Figure 3 images from the test set (i.e., not included in the 500 images used for training) as backward delta distributions at the visible variables. The third column shows the posterior distributions at the sources (only the probability of one of the two symbols is depicted). Here the DICA graph acts as an encoder: the (soft) binary configurations are the factorial code of the presented images. Note that not all the codes are sharp. The second column shows the means of the forward distributions at the visible variables.
Decoding: In Figure 6 the same DICA graph is used as a soft decoder when smooth and sharp distributions are injected at the sources.
Pattern completion: Figure 7 shows the results of the same network when we present, as backward delta distributions at the visible variables, images from the test set with 50% of the pixels removed. For the erased pixels a uniform backward distribution is presented. The third and fourth columns report the means of the forward and the posterior distributions, respectively. The network fills in the missing parts rather well.
6.1 Continuous ICA on the same dataset
The natural question at this point is whether similar results could be obtained with continuous ICA. The model is clearly very different, but we have attempted a comparison on the same dataset. On the 500 MNIST images of the training set we have computed the ICA using the FastICA algorithm available for Matlab [13]. We have retained only the first 8 components (largest variance) and estimated the output densities using average histograms. Random samples from these densities are used to generate images through the inverse ICA [14]. Figure 8 shows the 8 masks and some generated images. The results confirm that, even if the ICA components nicely represent bases for the data, with unconstrained independent samples at the sources only average structures are generated. We have also tried larger numbers of components and the obtained images look very similar. These results seem to be consistent with other experiments presented in the literature [14] for patches of natural images, where only average textures are obtained. Linear ICA with independent unconstrained sources does not seem to be a generative model that preserves the structured composition of the training set.

7 DICA for classification
The great flexibility of the factor graph framework allows us to easily extend the DICA graph to the architecture shown in Figure 9, in which a label variable is also included. The label variable belongs to a finite alphabet (here the ten digit classes) and is attached directly, through a conditional probability matrix, to the product-space diverter. Diverters in the reduced normal form act like probability pipelines [9].
Simulations have been performed on the same MNIST training set of 500 binarized images, in the same mode as in the unsupervised experiments, with the addition, during training, of the label information as a backward delta distribution. All the blocks, including now the label probability matrix, are trained. On the learned network, a typical recognition task on two images from the test set is shown in Figure 10. The bar graphs represent simultaneously classification and encoding. Note how in the first row the network is naturally confused between two similar digits.

A generative experiment is also performed on this architecture, with backward delta distributions injected at the label variable. The results are shown in Figure 11. The images are the means of the forward distributions at the visible variables and could be considered the prototypes for the ten labels. The bar graphs are the corresponding simultaneous encodings at the sources.
8 Conclusions
The simulations on the MNIST dataset with binary sources show that belief propagation in the DICA architecture, also with the addition of the label variable, provides a unified framework in which image data can be coded, generated and corrected in a very flexible way. We have also experimented on quantized patches of natural images, obtaining very similar results, even when the sources have alphabet sizes greater than two. These results will be reported elsewhere. We are currently pursuing the use of this framework for building multilayer architectures.
References
 [1] M. I. Jordan, E. B. Sudderth, M. Wainwright, and A. S. Willsky, “Major advances and emerging developments of graphical models (and the whole special issue),” IEEE Signal Processing Magazine, vol. 17, November 2010, Special Issue.

[2] D. Barber, Bayesian Reasoning and Machine Learning, Cambridge University Press, 2012.

[3] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.
[4] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, Wiley, New York, 2001.

[5] A. Hyvärinen, J. Hurri, and P. O. Hoyer, Natural Image Statistics: A Probabilistic Approach to Early Computational Vision, Springer Publishing Company, 1st edition, 2009.

[6] F. A. N. Palmieri, “Learning non-linear functions with factor graphs,” IEEE Transactions on Signal Processing, vol. 61, no. 7, pp. 4360–4371, 2013.

[7] W. Buntine and A. Jakulin, “Discrete component analysis,” in Subspace, Latent Structure and Feature Selection, C. Saunders, M. Grobelnik, S. Gunn, and J. Shawe-Taylor, Eds., vol. 3940 of Lecture Notes in Computer Science, pp. 1–33, Springer Berlin Heidelberg, 2006.

[8] S. L. Lauritzen, Graphical Models, Oxford University Press, 1996.
 [9] F. A. N. Palmieri, “A comparison of algorithms for learning hidden variables in normal graphs,” 2013.

[10] A. Buonanno and F. A. N. Palmieri, “Simulink implementation of belief propagation in normal factor graphs,” in Proceedings of the 2014 Workshop on Neural Networks, Vietri s.m., 2014.

[11] H. A. Loeliger, “An introduction to factor graphs,” IEEE Signal Processing Magazine, vol. 21, no. 1, pp. 28–41, Jan. 2004.
[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998, http://yann.lecun.com/exdb/mnist/.
[13] H. Gävert, J. Hurri, J. Särelä, and A. Hyvärinen, “The FastICA package for Matlab,” available from: http://research.ics.aalto.fi/ica/fastica/.
 [14] Aapo Hyvärinen, “Statistical models of natural images and cortical visual representation,” Topics in Cognitive Science, vol. 2, no. 2, pp. 251–264, 2010.