When using multi-expert architectures for modeling behavior or data, the motivation is the separation of the stimulus or data space into disjoint regimes on which separate models (experts) are applied [Jacobs 1999; Jacobs, Jordan, & Barto 1990]. The idea is that experts responsible for only a limited regime can be smaller and more efficient, and that knowledge from one regime should not be extrapolated onto another, i.e., optimization on one regime should not interfere with optimization on another. Several arguments indicate that this kind of adaptability cannot be realized by a single conventional neural network [Toussaint 2002]. Roughly speaking, for conventional neural networks the optimization of a response in one regime always interferes with responses in other regimes because they depend on the same parameters (weights), which are not separated into disjoint experts.
To realize a separation of the stimulus space one could rely on the conventional way of implementing multi-experts, i.e., use neural networks to implement the expert modules and external, often more abstract types of gating networks to organize the interaction between these modules. Much research has been done in this direction [Bengio & Frasconi 1994; Cacciatore & Nowlan 1994; Jordan & Jacobs 1994; Rahman & Fairhurst 1999; Ronco, Gollee, & Gawthrop 1997]. The alternative we propose here is a neural model capable of representing systems that are functionally equivalent to multi-expert systems within a single integrative network. This network does not explicitly distinguish between expert and gating modules and generalizes conventional neural networks by introducing a counterpart for gating interactions. What is our motivation for such a new representation of multi-expert systems?
First, our representation allows far greater and qualitatively new architectural freedom. E.g., gating neurons may interact with expert neurons; gating neurons can be part of experts. There is no restriction with respect to serial, parallel, or hierarchical architectures—in a much more general sense than proposed in [Jordan & Jacobs 1994].
Second, our representation allows various learning techniques to be combined in an intuitive way. These include gradient descent, unsupervised learning methods like Hebb learning or the Oja rule, and an EM-algorithm that can be transferred from classical gating-learning theories [Jordan & Jacobs 1994]. Further, the interpretation of a specific gating as an action opens up the realm of reinforcement learning, in particular Q-learning and (though not discussed here) its TD and TD($\lambda$) variants [Sutton & Barto 1998].
Third, our representation makes a simple genetic encoding of such architectures possible. There already exist various techniques for evolutionary structure optimization of networks (see [Yao 1999] for a review). Applied to our representation, they become techniques for the evolution of multi-expert architectures.
After the rather straightforward generalization of neural interactions necessary to realize gatings (section 2), we discuss in detail different learning methods in section 3. The empirical study in section 4 compares the different interactions and learning mechanisms on a test problem similar to the one discussed by Jacobs et al. [1990].
2 Model definition
Conventional multi-expert systems.
Assume the system has to realize a mapping from an input space $X$ to an output space $Y$. Typically, an $n$-expert architecture consists of a gating function $g: X \to \mathbb{R}^n$ and expert functions $f_1,\dots,f_n: X \to Y$ which are combined by the softmax linear combination:

$$y = \sum_{i=1}^n \frac{\exp(\beta\, g_i(x))}{\sum_{j=1}^n \exp(\beta\, g_j(x))}\; f_i(x),$$

where $x$ and $y$ are input and output, and $\beta$ describes the “softness” of this winner-takes-all type of competition between the experts; see Figure 1. The crucial question becomes how to train the gating. We will discuss different methods in the next section.
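The softmax combination above can be sketched in a few lines of Python; the function and variable names here are illustrative, not from the paper:

```python
import numpy as np

def mixture_output(x, experts, gating, beta=1.0):
    """Combine n expert outputs via a softmax over gating values.

    experts: list of callables, each mapping input x to an output vector.
    gating:  callable mapping x to one scalar score per expert.
    beta:    "softness" of the winner-takes-all competition.
    All names here are illustrative assumptions, not the paper's code.
    """
    scores = np.asarray(gating(x), dtype=float)
    weights = np.exp(beta * (scores - scores.max()))  # stabilized softmax
    weights /= weights.sum()                          # gating weights sum to 1
    outputs = np.stack([f(x) for f in experts])       # shape (n, output_dim)
    return weights @ outputs                          # convex combination
```

With large $\beta$ this approaches a hard winner-takes-all selection; with small $\beta$ the experts blend smoothly.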
Neural implementation of multi-experts.
We present a single neural system that has at least the capabilities of a multi-expert architecture composed of several neural networks. Basically, we provide additional competitive and gating interactions; for an illustration compare Figure 1 and Figure 2-B. More formally, we introduce the model as follows:
The architecture is given by a directed, labeled graph of neurons $i = 1,\dots,N$ and links $(ji)$ from neuron $j$ to neuron $i$. Labels of links declare whether they are ordinary, competitive, or gating connections. Labels of neurons declare their type of activation function. With every neuron $i$, an activation state (output value) $x_i$ is associated. A neuron collects two terms of excitation, $e_i$ and $\tilde e_i$, given by

$$e_i = b_i + \sum_{(ji)\ \text{ordinary}} w_{ij}\, x_j, \qquad \tilde e_i = \sum_{(ji)\ g\text{-labeled}} x_j,$$

where $w_{ij}$ and $b_i$ are the weights and bias associated with the links $(ji)$ and the neuron $i$, respectively. The second excitatory term $\tilde e_i$ has the meaning of a gating term and is induced by the $g$-labeled links $(ji)$.

In case there are no $g$-labeled links connected to a neuron $i$, its state is given by

$$x_i = \sigma(e_i),$$

where $\sigma$ is a sigmoid function. This means that if a neuron has no gating links connected to it, the sigmoid describes its activation. Otherwise, the gating term multiplies onto it, $x_i = \tilde e_i\, \sigma(e_i)$.
Neurons that are connected by (bi-directed) $c$-labeled links form a competitive group in which only one of the neurons (the winner) acquires state $1$ while the others’ states are zero. Let $C(i)$ denote the competitive group of neurons to which neuron $i$ belongs. On such a group, we introduce a normalized distribution $\hat x_i$, given by

$$\hat x_i = \frac{\phi(e_i)}{\sum_{j \in C(i)} \phi(e_j)}.$$

Here, $\phi$ is some function (e.g., the exponential $\phi(e) = \exp(\beta e)$). The neurons’ states $x_i$ depend on this distribution by one of the following competitive rules of winner selection: we will consider selection with probability proportional to $\hat x_i$ (softmax), deterministic selection of the maximum $\hat x_i$, and $\epsilon$-greedy selection (where with probability $\epsilon$ a random winner is selected instead of the maximum).
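The three winner-selection rules can be sketched as follows; this is a minimal illustration with assumed names, not the paper's implementation:

```python
import numpy as np

def select_winner(excitations, rule="softmax", beta=1.0, eps=0.1, rng=None):
    """Pick one winner in a competitive group; the winner's state is 1, all others 0.

    Implements the three selection rules named in the text: sampling
    proportional to the normalized distribution (softmax), deterministic
    maximum, and eps-greedy. Names are illustrative.
    """
    rng = np.random.default_rng() if rng is None else rng
    e = np.asarray(excitations, dtype=float)
    p = np.exp(beta * (e - e.max()))
    p /= p.sum()                                  # normalized distribution over the group
    if rule == "softmax":
        winner = rng.choice(len(e), p=p)          # sample proportional to p
    elif rule == "max":
        winner = int(np.argmax(p))                # deterministic maximum
    elif rule == "eps-greedy":
        winner = rng.integers(len(e)) if rng.random() < eps else int(np.argmax(p))
    else:
        raise ValueError(f"unknown rule: {rule}")
    states = np.zeros(len(e))
    states[winner] = 1.0
    return states
```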
Please see Figure 2 to get an impression of the architectural possibilities this representation provides. Example A realizes an ordinary feed-forward neural network, where the three output neurons form a competitive group. Thus, only one of the output neurons will return a value of $1$; the others will return $0$. Example B realizes exactly the same multi-expert system as depicted in Figure 1. The two outputs of the central module form a competitive group and gate the output neurons of the left and right modules respectively—the central module calculates the gating whereas the left and right modules are the experts. Example C is an alternative way of designing multi-expert systems. Each expert module contains an additional output node which gates the rest of its outputs and competes with the gating nodes of the other experts. Thus, each expert estimates itself how well it can handle the current stimulus (see the Q-learning method described below). Finally, example D is a true hierarchical architecture. The two experts on the left compete to give an output, which is further processed and, again, has to compete with the larger expert to the right. In contrast, Jordan & Jacobs [1994] describe an architecture where the calculation of one single gating (corresponding to only one competitive level) is organized in a hierarchical manner. Here, several gatings on different levels can be combined in any successive, hierarchical way.
3 Learning methods

In this section we introduce four different learning methods, each of which is applicable independently of the specific architecture. We generally assume that the goal is to approximate training data given as pairs of stimulus and target output value.
The gradient method.
To calculate the gradient, we assume that selection in competitive groups is performed with probability proportional to the distribution $\hat x_i$. We calculate an approximate gradient of the conditional probability that this system represents by replacing the actual state in Eq. (2) by its expectation value for neurons in competitive groups (see also [Neal 1990]). For simplicity of notation, we identify the state with this expectation value. Then, for a neuron in a competitive group obeying Eq. (5), we get the partial derivatives of the neuron’s output with respect to its excitations:
where the bracket term equals one iff the neuron is a member of the respective competitive group. Let $E$ be an error functional. We write the delta-rule for back-propagation using separate notations for the gradients at a neuron’s output and at its excitations, respectively, and for the local error of a single (output) neuron. From Eqs. (2, 3, 6, 7) we obtain the back-propagated gradients, where the local error is given in Eq. (8), and from these the final gradients with respect to the weights and biases.
The choice of the error functional $E$ is free. E.g., it can be chosen as the square error or as the log-likelihood, where in the latter case the targets are the desired states themselves.
The basis for further learning rules.
For the following learning methods we concentrate on the question: what target values should we assume for the states of neurons in a competitive group? In the case of gradient descent, Eq. (8) gives the answer. It actually describes a linear projection of the desired output variance down to all system states—including those in competitions. In fact, all the following learning methods will adopt the above gradient descent rules except for a redefinition of the local error (or alternatively the target state) in the case of neurons in competitive groups. This means that neurons “below” competitive groups are adapted by ordinary gradient descent while the local error at competitive neurons is given by rules other than gradient descent. This is in fact the usual way of adapting systems in which neural networks are used as internal modules and trained by back-propagation (e.g., see [Anderson & Hong 1994]).
We briefly review the basic ideas of applying an EM-algorithm to the problem of learning gatings in multi-experts [Jordan & Jacobs 1994]. The algorithm is based on an additional, very interesting assumption: let the outcome of a competition in a competitive group be described by the states of the neurons that join this group. Now, we assume that there exists a correct outcome. Formally, this means assuming that the complete training data are triplets of stimuli, competition states, and output values.¹ However, the competition training data is unobservable or hidden and must be inferred by statistical means. Bayes’ rule tells us how to infer an expectation of the hidden training data and lays the ground for an EM-algorithm. The consequence of this assumption is that the states of competitive neurons are now supposed to approximate this expectation of the training data instead of being free. For simplification, let us concentrate on a network containing a single competitive group; the generalization is straightforward.

¹More precisely, the assumption is that there exists a teacher system with the same architecture as our system. Our system adapts free parameters in order to approximate this teacher system. The teacher system produces training data and, since it has the same architecture as ours, also uses competitive groups to generate these data. The training data would be complete if it included the outcomes of these competitions.
Our system represents the conditional probability of the output states and the competition states, depending on the stimulus and the parameters:
(E-step) We use Bayes’ rule to infer the expected competition training data hidden in a training tuple, i.e., the probability of the competition state when the stimulus and the target output are given. Since these probabilities refer to the training (or teacher) system, we can only approximate them. We do this with our current approximation, i.e., our current system:
(M-step) We can now adapt our system. In the classical EM-algorithm, this amounts to maximizing the expectation of the log-likelihood (cp. Eq. (15)), where the expectation is taken with respect to the distribution of competition states (i.e., depending on our inference of the hidden states), and the maximization is with respect to the parameters. This equation can be simplified further—but, very similar to the “least-square” algorithm developed by Jordan & Jacobs [1994], we are satisfied to have inferred an explicit desired probability for the competition states, which we use to define a mean-square error and perform ordinary gradient descent.
Based on this background we define the learning rule as follows, with some subtle differences from the one presented in [Jordan & Jacobs 1994]. Equation (3) defines the desired probability of the competition states. Since we assume a selection rule proportional to the distribution $\hat x_i$, these values are actually target values for the distribution $\hat x_i$. The first modification we propose is to replace all likelihood measures involved in Eq. (3) by general error measures: we define a quality measure of each competing neuron on the current stimulus such that, in the case of the likelihood error, we retrieve the original expressions. By these definitions we may rewrite Eq. (3) in terms of these quality measures.
However, this equation needs some discussion with respect to its explicit calculation in our context—leading to the second modification. Calculating the quality measure for every competing neuron amounts to evaluating the system for every possible competition outcome. One major difference from the algorithm presented in [Jordan & Jacobs 1994] is that we do not allow for such a separate evaluation of all experts in a single time step. In fact, this would be very expensive in the case of hierarchically interacting competitions and experts because the network would have to be evaluated for each possible combinatorial state of competition outcomes. Thus we propose an approximation: we replace the quality measure by its average over the recent history of cases in which the respective neuron won the competition,
where $\lambda$ is a trace constant (as a simplification of the time-dependent notation, we use the algorithmic notation for a replacement that is executed if and only if the neuron wins). Hence, our adaptation rule finally reads
and zero if the neuron does not win, which amounts to a gradient descent on the squared error between the approximated desired probabilities and the distribution $\hat x_i$.
The reader has probably noticed that we chose the notations in the previous section in the style of reinforcement learning: if one interprets the winning of a neuron as a decision on an action, then the action-value function $Q$ describes the (estimated) quality of taking this decision for a stimulus, whereas the state-value function $V$ describes the estimated quality for that stimulus without having decided yet; see [Sutton & Barto 1998]. In this context, Eq. (21) is very interesting: it proposes to adapt the probability according to the ratio of action value to state value—the EM-algorithm acquires a very intuitive interpretation. To realize this equation without the approximation described above, one has to provide an estimation of the state value, e.g., a neuron trained on this target value (a critic). We leave this for future research and instead directly address the Q-learning paradigm.
For Q-learning, an explicit estimation of the action values is modeled. In our case, we realize this by considering the action value as the target value of the excitations of the competing neurons, i.e., we train the excitations toward the action values,
This approach seems very promising—in particular, it opens the door to temporal difference and TD($\lambda$) methods and other fundamental concepts of reinforcement learning theory.
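A minimal sketch of such a Q-style update, training only the winning neuron's excitation toward the observed action value (all names are assumptions, not the paper's notation):

```python
import numpy as np

def q_update(excitations, winner, q_observed, alpha=0.1):
    """Move the winning neuron's excitation toward the observed quality
    (action value) of its decision, via a delta rule on the winner only.

    alpha is an assumed learning rate; in the full system this target
    would be back-propagated through the excitation as in the text.
    """
    e = np.array(excitations, dtype=float)
    e[winner] += alpha * (q_observed - e[winner])  # delta rule toward the Q-target
    return e
```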
Besides statistical and reinforcement learning theories, the branch of unsupervised learning also gives some inspiration for our problem. The idea of hierarchically, serially coupled competitive groups raises a conceptual problem: can competitions in areas close to the input be trained without functioning higher-level areas (closer to the output), and vice versa? Usually, back-propagation is the standard technique to address this problem. But this does not apply to either the EM-learning or the reinforcement learning approaches because they generate a direct feedback to competing neurons in any layer. Unsupervised learning in lower areas seems to point a way out of this dilemma. As a first approach we propose a mixture of unsupervised learning in the fashion of the normalized Hebb rule and Q-learning. The normalized Hebb rule (of which the Oja rule is a linearized version) can be realized by setting the local error of a neuron in a competitive group equal to its state. The gradient descent with respect to adjacent input links then gives the ordinary Hebb rule. Thereafter, the input weights (including the bias) of each neuron are normalized. We modify this rule in two respects. First, we introduce a factor that accounts for the success of the winning neuron; this factor is an average trace of the feedback received when the respective neuron is the winner. Second, in the case of failure, we also adapt the non-winners in order to increase their response to the stimulus next time. Thus, our rule reads
Similar modifications are often proposed in reinforcement learning models [Barto & Anandan 1985; Barto & Jordan 1987]. The rule investigated here is only a first proposal; all rules presented in the excellent survey of Diamantaras & Kung [1996] are equally applicable and of equal interest, but have not yet been implemented by the author.
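The success-modulated, normalized Hebb rule described above might be sketched as follows; this is a first approximation with illustrative names and constants, not the paper's exact rule:

```python
import numpy as np

def hebb_q_step(W, x, winner, success, eta=0.1):
    """One success-modulated, normalized Hebb step on a competitive group.

    W:       weight matrix, one row of input weights per competing neuron.
    x:       current stimulus vector.
    winner:  index of the winning neuron.
    success: scalar feedback trace; positive means the winner succeeded.
    On success the winner is drawn toward the stimulus; on failure the
    non-winners are drawn toward it, so they respond more strongly next
    time. All rows are renormalized afterwards. Names are assumptions.
    """
    W = np.array(W, dtype=float)
    if success > 0:
        W[winner] += eta * success * x            # reinforce the successful winner
    else:
        losers = [i for i in range(len(W)) if i != winner]
        W[losers] += eta * x                      # let the others respond next time
    W /= np.linalg.norm(W, axis=1, keepdims=True)  # normalization of input weights
    return W
```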
4 Empirical study
We test the functionality of our model and the learning rules on a variant of the test presented in [Jacobs, Jordan, & Barto 1990]. A single bit of an 8-bit input decides on the subtask that the system has to solve on the current input. The subtasks themselves are rather simple and in our case (unlike in [Jacobs, Jordan, & Barto 1990]) consist of mapping the 8-bit input either identically or inverted onto the 8-bit output. The task has to be learned online. We investigate the learning dynamics of a conventional feed-forward neural network (FFNN) and of our model with the 4 different learning methods. We use a fixed architecture similar to an 8-10-8-layered network with 10 hidden neurons, but additionally install 2 competitive neurons that receive the input and each gate half of the hidden neurons; see Figure 3. In the case of the conventional FFNN we used the same architecture but replaced all gating and competitive connections by conventional links.
Figure 4 displays the learning curves averaged over 20 runs with different weight initializations. For implementation details see Table 1. First of all, we find that all of the 4 learning methods perform well on this task compared to the conventional FFNN. The curves can best be interpreted by investigating whether a task separation has been learned. Figure 5 displays the frequencies of winning of the two competitive neurons for the different subtasks. The task separation would be perfect if these two neurons reliably distinguished the two subtasks. The first thing to notice is that all 4 learning methods learn the task separation. In the case of Q-learning the task separation is found rather late and remains noisy because of the $\epsilon$-greedy selection used. This explains its slower learning curve in Figure 4. EM and Oja-Q realize strict task separations (maximum selection); for the gradient method it is still a little noisy (softmax selection). It is clear that, once the task separation has been found and fixed, all four learning methods proceed equivalently. So it is no surprise that the learning curves in Figure 4 are very similar except for a temporal offset corresponding to the time until the task separation has been found, and a non-zero asymptotic error corresponding to the noise of the task separation. (Note that Figure 5 represents only a single, typical trial.)
Generally, our experience was that the learning curves may look very different depending on the weight initialization. It also happened that the task separation was not found when weights and biases (especially of the competing neurons) were initialized with very large values. One of the competitive neurons then dominates from the very beginning and prevents the “other expert” from adapting in any way. A special, perhaps equal, initialization of the competitive neurons could certainly be profitable.
Finally, the conventional FFNN also solves the task completely only sometimes—more often when weights are initialized relatively high. This explains the rather high error offset of its learning curve.
5 Conclusion

We generalized conventional neural networks to allow for multi-expert-like interactions. We introduced 4 different learning methods for this model and gave empirical support for their functionality. What makes the model particularly interesting is:
The generality of our representation of system architecture allows new approaches to the structure optimization of multi-expert systems, including arbitrary serial, parallel, and hierarchical architectures. In particular, evolutionary techniques of structure optimization become applicable.
The model allows the combination of various learning methods within a single framework. Especially the idea of integrating unsupervised learning methods into a system that adapts in a supervised manner opens new perspectives. Many more techniques from elaborated learning theories can be transferred to our model. In principle, the uniformity of the architecture representation would allow one to freely specify which parts of the system learn by which principles.
The model overcomes the limitation of conventional neural networks in performing task decomposition, i.e., in adapting in a decorrelated way to decorrelated data [Toussaint 2002].
I acknowledge support by the German Research Foundation DFG under grant SoleSys.
Figure 4: Learning curves for EM-learning, Q-learning, Oja-Q-learning, and gradient learning.
Figure 5: Winning frequencies of the two competitive neurons for the 1st and 2nd subtask.
- Anderson, C. W. & Hong, Z. (1994). Reinforcement learning with modular neural networks for control. In IEEE International Workshop on Neural Networks Application to Control and Image Processing.
- Barto, A. G. & Anandan, P. (1985). Pattern recognizing stochastic learning automata. IEEE Transactions on Systems, Man, and Cybernetics 15, 360–375.
- Barto, A. G. & Jordan, M. I. (1987). Gradient following without back-propagation in layered networks. In Proceedings of the IEEE First Annual Conference on Neural Networks, pp. II629–II636.
- Bengio, Y. & Frasconi, P. (1994). An EM approach to learning sequential behavior. Technical Report DSI-11/94, Dipartimento di Sistemi e Informatica, Università di Firenze.
- Cacciatore, T. W. & Nowlan, S. J. (1994). Mixtures of controllers for jump linear and non-linear plants. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in Neural Information Processing Systems, Volume 6, pp. 719–726. Morgan Kaufmann.
- Diamantaras, K. I. & Kung, S. Y. (1996). Principal Component Neural Networks: Theory and Applications. Wiley & Sons, New York.
- Jacobs, R. A. (1999). Computational studies of the development of functionally specialized neural modules. Trends in Cognitive Sciences 3, 31–38.
- Jacobs, R. A., Jordan, M. I., & Barto, A. G. (1990). Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks. Technical Report COINS-90-27, Department of Computer and Information Science, University of Massachusetts Amherst.
- Jordan, M. I. & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation 6, 181–214.
- Neal, R. M. (1990). Learning stochastic feedforward networks. Technical Report CRG-TR-90-7, Department of Computer Science, University of Toronto.
- Rahman, A. F. R. & Fairhurst, M. C. (1999). Serial combination of multiple experts: A unified evaluation. Pattern Analysis & Applications 2, 292–311.
- Ronco, E., Gollee, H., & Gawthrop, P. (1997). Modular neural networks and self-decomposition. Technical Report CSC-96012, Center for System and Control, University of Glasgow.
- Sutton, R. S. & Barto, A. G. (1998). Reinforcement Learning. MIT Press, Cambridge.
- Toussaint, M. (2002). On model selection and the disability of neural networks to decompose tasks. In Proceedings of the International Joint Conference on Neural Networks (IJCNN 2002).
- Yao, X. (1999). Evolving artificial neural networks. Proceedings of the IEEE 87, 1423–1447.