Representing the information from the recent past as transient activity distributed over a network has been actively researched in biophysical as well as purely computational domains Maass et al. (2002); Jaeger (2001). It is understood that recurrent connections in the network can keep information from the distant past alive so that it can be recovered from the current state. The memory capacity of these networks is generally measured in terms of the accuracy of recovery of the past information Jaeger (2001); White et al. (2004); Hermans and Schrauwen (2010). Although the memory capacity strongly depends on the network’s topology and sparsity Ganguli et al. (2008); Strauss et al. (2012); Legenstein and Maass (2007); Wallace et al. (2013), it can be significantly increased by exploiting any prior knowledge of the underlying structure of the encoded signal Ganguli and Sompolinsky (2012); Charles et al. (2014).
Our approach to encoding memory stems from a focus on its utility for future prediction, rather than on the accuracy of recovering the past. In particular, we are interested in encoding time varying signals from the natural world into memory so as to optimize future prediction. It is well known that most natural signals exhibit scale free long range correlations Mandelbrot (1982); Voss and Clarke (1975); West and Shlesinger (1990). By exploiting this intrinsic structure underlying natural signals, prior work has shown that the predictive information contained in a finite sized memory system can be maximized if the past is encoded in a scale-invariantly coarse grained fashion Shankar and Howard (2013).
Each node in such a memory system would represent a coarse grained average around a specific past moment, and the time window of coarse graining scales linearly with the past timescale. Clearly, the accuracy of information recovery in such a memory system degrades more for the more distant past. In effect, the memory system sacrifices accuracy in order to represent information from the very distant past, with the represented timescales scaling exponentially with the network size Shankar and Howard (2013). The predictive advantage of such a memory system comes from washing out non-predictive fluctuations from the distant past, whose accurate representation would have served very little in predicting the future. Arguably, in a natural world filled with scale-free time varying signals, animals would have evolved to adopt such a memory system conducive to future predictions. This is indeed evident from animal and human behavioral studies showing that our memory for time involves scale invariant errors which scale linearly with the target timescale Gibbon (1977); Rakitin et al. (1998).
Our focus here is not to further emphasize the predictive advantage offered by a scale invariantly coarse grained memory system; rather, we simply assume the utility of such a memory system and focus on the generic mechanism to construct it. One way to mechanistically construct such a memory system is to gradually encode information over real time as a Laplace transform of the past and approximately invert it Shankar and Howard (2012). The central result in this paper is that any mechanistic construction of such a memory system is simply equivalent to encoding linear combinations of Laplace transformed past and their approximate inverses. This result should place strong constraints on the connectivity structure of memory networks exhibiting the scale invariance property.
We start with the basic requirement that different nodes in the memory system represent coarse grained averages about different past moments. Irrespective of the connectivity, the nodes can be linearly arranged to reflect their monotonic relationship to the past time. Rather than considering a network with a finite set of nodes, for analytical convenience we consider a continuum limit where the information from the past time is smoothly projected on a spatial axis. The construction can later be discretized and implemented in a network with finite nodes to represent past information from timescales that scale exponentially with the network size.
II Scale Invariant Coarse Graining
Consider a real valued function observed over time. The aim is to encode this time-varying function into a spatially distributed representation in one dimension parametrized by, such that at any moment the entire past from to is represented in a coarse grained fashion as
This is now a convolution memory model. The kernel is the coarse graining window function with normalized area for all , . Different points on the spatial axis uniquely and monotonically represent coarse grained averages about different instants in the past, as illustrated in figure 1.
We require that coarse graining about any past instant linearly scales with the past timescale. So, for any pair of points and , there exists a scaling constant such that . For the window function to satisfy this scale-invariance property, there should exist a monotonic mapping from a scaling variable to the spatial axis so that
Without loss of generality we shall pick because it can be retransformed to any other monotonic mapping after the analysis. Hence with ,
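For readers following the argument without the displayed equations, the relations described in this section can be written out as below. This is only a sketch in assumed notation: $f$ for the input, $\tau$ for time, $s$ for the spatial coordinate, $T$ for the memory, $W$ for the window function and $\alpha$ for the scaling variable, with the monotonic mapping taken to be the identity. The symbols used in the original may differ.

```latex
\begin{align*}
T(\tau,s) &= \int_{-\infty}^{\tau} f(\tau')\, W(\tau-\tau', s)\, d\tau',
\qquad
\int_{-\infty}^{\tau} W(\tau-\tau', s)\, d\tau' = 1, \\
W(\tau-\tau', s_1) &= \alpha_{12}\, W\big(\alpha_{12}(\tau-\tau'),\, s_2\big), \\
W(\tau-\tau', s) &= s\, W\big(s(\tau-\tau'),\, 1\big), \qquad s > 0 .
\end{align*}
```

Under these assumptions the last line follows from the second by setting $s_2 = 1$ and $\alpha_{12} = s_1$. Any window of the form $W(t,s) = s\,g(st)$ with $\int_0^\infty g(x)\,dx = 1$ satisfies all three relations, with $\alpha_{12} = s_1/s_2$.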
III Space-Time Local mechanism
Equation 1 expresses the encoded memory as an integral over the entire past. However, the encoding mechanism can only have access to the instantaneous functional value of and its derivatives. The spatial pattern should self sufficiently evolve in real time to encode eq. 1. This is a basic requirement to mechanistically construct in real time using any network architecture. Since the spatial axis is organized monotonically to correspond to different past moments, only the local neighborhood of any point would affect its time evolution. So we postulate that the most general encoding mechanism that can yield eq. 1 is a space-time local mechanism given by some differential equation for . To analyze this, let us first express the general space-time derivative of by repeatedly differentiating eq. 1 w.r.t and .
Here and are positive integers. For brevity, we denote the order of time derivative within a square bracket in the superscript and the order of space derivative within a parenthesis in the subscript.
Since is an arbitrary input, should satisfy a time-independent differential equation which can depend on instantaneous time derivatives of . The first term in the r.h.s of eq. 4 is time-local, while the second term involves an integral over the entire past. In order for the second term to be time-local, it must be expressible in terms of lower derivatives of . Since the equation must hold for any , should satisfy a linear equation.
The aim here is not to derive the time-local differential equation satisfied by , but just to impose its existence, which is achieved by imposing eq. 5 for some set of functions . To impose this condition, let us first evaluate by exploiting the functional form of the window function given by eq. 3. Defining and the function , eq. 3 can be repeatedly differentiated to obtain
where max[0, ] and the superscript on represents the order of the derivative w.r.t . Now eq. 5 takes the form
The above equation is not necessarily solvable for an arbitrary choice of . However, when it is solvable, the separability of the variables and implies that the above equation will be separable into a set of linear differential equations for with coefficients given by integer powers of . The general solution for is then given by
where and are non negative integers. The coefficients and , and the functions cannot be independently chosen as they are constrained through eq. 6. Once a set of is chosen consistently with the coefficients and , the differential equation satisfied by can be obtained by iteratively substituting (in the second term of the r.h.s of eq. 4) in terms of its lower derivatives and replacing the integral in terms of derivatives of .
Here we shall neither focus on the choice of nor on the differential equation for it yields. We shall only focus on the set of possible window functions that can be constructed by a space-time local mechanism. Hence it suffices to note from the above derivation that the general form of such a window function is given by eq. 7. Since by definition the window function at each coarse grains the input about some past moment, we expect it to be non-oscillatory and hence restrict our focus to real values of . Further, the requirement of the window function to have normalized area at all restricts to be positive.
IV Two step process
Let us consider the simplest window function, where only one of the coefficients in the set of and in eq. 7 is non-zero, namely and . We shall denote the corresponding window function as to highlight its dependence on specific and . The most general window function is then simply a linear combination of various for different values of and . From eq. 7, takes the form
It turns out that the differential equation satisfied by that generates this window function is simply first order in both space and time given by
with a boundary condition . This equation can hence be evolved in real time by only relying on the instantaneous input at each moment .
For more complex window functions that are linear combinations of for various and , the orders of the space and time derivatives of involved in the differential equation are not necessarily bounded when the parameters and involved in the linear combinations of are bounded. So, it is not straightforward to derive the mechanistic construction as a differential equation for . Hence the question now is, what is the mechanism to construct a memory system with any window function?
Interestingly, there exists an alternative derivation of eq. 9 where the time derivative and space derivative can be sequentially employed in a two step process Shankar and Howard (2012). The first step is equivalent to encoding the Laplace transform of the input as . The second step is equivalent to approximately inverting the Laplace transformed input to construct .
Taking to be a function of bounded variation and , eq. 10 can be integrated to see that . Thus is the Laplace transform of the past input computed over real time. Eq. 11 is an approximation to the inverse Laplace transform operation Post (1930). So essentially attempts to reconstruct the past input, such that at any , . This reconstruction grows more accurate as , and the input from each past moment is reliably represented at a specific spatial location. For finite however, the reconstruction is fuzzy and each spatial location represents a coarse grained average of inputs from past moments, as characterized by the window function . For further details, refer to Shankar and Howard (2012).
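A rough numerical sketch of the two step process may help make it concrete. It assumes the simplest case: a single window term, with each leaky integrator decaying at a rate equal to its own spatial coordinate `s`, and an inversion of the form (-1)^k / k! * s^(k+1) * d^k F / ds^k. The function names (`laplace_encode`, `approximate_inverse`, `gamma_window`), the test signal, and the dense grid with finite differences standing in for the continuous spatial derivative are all illustrative choices, not taken from the paper. Under those assumptions, the two steps should agree with a direct convolution of the input against a window of the form s^(k+1) t^k exp(-s t) / k!.

```python
import math
import numpy as np

def laplace_encode(f, dt, s):
    """Step 1: one leaky integrator per s value, dF/dt = -s*F + f.

    The update is exact when the input is held constant over each step.
    """
    decay = np.exp(-s * dt)
    gain = (1.0 - decay) / s
    F = np.zeros_like(s)
    for value in f:
        F = decay * F + gain * value
    return F

def approximate_inverse(F, s, k):
    """Step 2 (assumed form): T = (-1)^k / k! * s^(k+1) * d^k F / ds^k.

    The k-th derivative along s is taken by repeated finite differences.
    """
    dF = F.copy()
    for _ in range(k):
        dF = np.gradient(dF, s)
    return ((-1) ** k / math.factorial(k)) * s ** (k + 1) * dF

def gamma_window(t, s, k):
    """Assumed window s^(k+1)/k! * t^k * exp(-s*t); it peaks at t = k/s."""
    return s ** (k + 1) / math.factorial(k) * t ** k * np.exp(-s * t)

dt, n_steps, k = 0.01, 2000, 4
tau = np.arange(n_steps) * dt
f = 1.0 + 0.5 * np.sin(0.7 * tau)      # hypothetical input signal
s = np.linspace(0.2, 4.0, 1500)        # dense grid standing in for the continuum

T_two_step = approximate_inverse(laplace_encode(f, dt, s), s, k)

# Direct convolution of the same input with the window (midpoint rule).
elapsed = (n_steps - np.arange(n_steps) - 0.5) * dt
T_direct = np.array([np.sum(f * gamma_window(elapsed, si, k)) * dt for si in s])

interior = slice(20, -20)              # skip the edge points of the s grid
print(np.max(np.abs(T_two_step[interior] - T_direct[interior])))
```

The first step never looks at the past explicitly: each integrator only sees the current input value, which is the sense in which the construction is local in time. The residual printed at the end comes from the finite-difference derivative and shrinks as the `s` grid is refined.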
Since any window function is a linear combination of various for different values of and , its construction is essentially equivalent to linear combinations of the two step process given by equations 10 and 11.
The choice of the combinations of has strong implications on the shape of the resulting window function. At any given , is a unimodal function with a peak at (see eq. 8). Arbitrary combinations of could result in a spatial location representing the coarse grained average about disjoint past moments, leading to undesirable shapes of the window function. Hence the values of and should be carefully tuned. Figure 2 shows the window functions constructed from four combinations of and . The combinations are chosen such that at the point , the window function coarse grains around a past time of . The scale invariance property guarantees that its shape remains identical at any other value of with a linear shift in the coarse graining timescale. Comparing combinations 1 and 3, note that the window function is narrower for larger (=100) than for a smaller (=8). Combination 2 has been chosen to illustrate a plateau shaped window function whose sides can be made arbitrarily vertical by fine tuning the combinations. Combination 4 (dotted curve in fig. 2) illustrates that combining different values of for the same will generally lead to a multimodal window function which would be an undesirable feature.
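The effect of combining window functions can be explored numerically. The sketch below assumes a single-term window of the Gamma-like form (a s)^(k+1) t^k exp(-a s t) / k!, which is normalized and peaks at t = k/(a s); this form, the function names `window` and `count_peaks`, and the particular parameter values are assumptions made for illustration. It compares one such window with an equal-weight mixture of two windows that share `k` but differ in `a`.

```python
import math
import numpy as np

def window(t, s, k, a):
    """Assumed single-term window, evaluated in log space to avoid overflow."""
    log_w = ((k + 1) * math.log(a * s) + k * np.log(t)
             - a * s * t - math.lgamma(k + 1))
    return np.exp(log_w)

def count_peaks(w):
    """Number of strict interior local maxima of a sampled curve."""
    inner = w[1:-1]
    return int(np.sum((inner > w[:-2]) & (inner > w[2:])))

s = 1.0
t = np.linspace(1e-3, 300.0, 30001)    # elapsed time into the past
dt = t[1] - t[0]

single = window(t, s, k=100, a=1.0)
# Same k, different a: the two components peak at different past times.
mixed = 0.5 * window(t, s, k=100, a=1.0) + 0.5 * window(t, s, k=100, a=5.0)

print("peaks in single window:", count_peaks(single))
print("peaks in mixture:", count_peaks(mixed))
print("single window peaks near t =", t[np.argmax(single)])
```

Because each component is centered on k/(a s), mixing components with different ratios k/a places weight on disjoint past moments, which is the multimodal behaviour described above for combination 4.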
V Discretized spatial axis
A memory system represented on a continuous spatial axis is not practical, so the spatial axis should be discretized to finite points (nodes). The two step process given by equations 10 and 11 is optimal for discretization particularly when the nodes are picked from a geometric progression in the values of Shankar and Howard (2013). Eq. 10 implies that the activity of each node evolves independently of the others to construct with real time input . This is achieved with each node recurrently connected onto itself with an appropriate decay constant of . Eq. 11 involves taking the spatial derivative of order which can be approximated by the discretized derivative requiring linear combinations of activities from neighbors on either side of any node. For further details on implementation of the two step process on discretized spatial axis, refer to Shankar and Howard (2013).
By choosing the nodes along the -axis from a geometric progression, the error from the discretized spatial derivative will be uniformly spread over all timescales, hence such a discretization is ideal to preserve scale-invariance. Let us choose the -values of successive nodes to have a ratio , where . Figure 3 shows the window function with and constructed from the discretized axis with . The window functions at two spatial points and are plotted to illustrate that scale invariance is preserved after discretization. As a comparison, the dotted curves are plotted to show the corresponding window function constructed in the continuous -axis (limit ). The window function computed on the discretized axis is artificially scaled up so that the solid and dotted curves in figure 3 are visually discernible. Note that the discretized window function peaks farther in the past time and is wider than the window function on the continuous spatial axis. As , the discretized window function converges to the window function constructed on the continuous axis, while for larger values of the discrepancy grows larger. Nevertheless, for any value of , the discretized window function always stays scale-invariant, as can be seen by visually comparing the shapes of the window functions at and in figure 3. Now, it is straightforward to construct scale-invariant window functions of different shapes by taking linear combinations of discretized , analogous to the construction in figure 2.
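The claim that a geometric grid preserves scale invariance can be checked with a small sketch. It makes several assumptions: successive node values have ratio `1 + c` (with `c` a hypothetical name for the spacing parameter), the Laplace stage responds to a unit impulse as exp(-s t), and the spatial derivative is replaced by repeated nearest-neighbour central differences. The stencil used in the cited work may well differ, so this should be read as one possible discretization rather than the reference implementation.

```python
import math
import numpy as np

def discretized_window(t, s_nodes, k):
    """Window seen by each interior node; rows are elapsed times, columns are nodes.

    exp(-s*t) is the response of the Laplace stage to a unit impulse t ago.
    The k-th derivative along s is replaced by k passes of a central
    difference that uses only the two neighbouring nodes.
    """
    g = np.exp(-np.outer(t, s_nodes))
    s = s_nodes
    for _ in range(k):
        g = (g[:, 2:] - g[:, :-2]) / (s[2:] - s[:-2])
        s = s[1:-1]                    # each pass drops one node at either end
    return ((-1) ** k / math.factorial(k)) * s ** (k + 1) * g, s

c, k, n_nodes = 0.1, 4, 60             # illustrative values
s_nodes = 0.01 * (1.0 + c) ** np.arange(n_nodes)

t = np.linspace(0.0, 400.0, 2001)
i, j = 10, 30                          # two interior nodes (indices after trimming)

W_i, s_in = discretized_window(t, s_nodes, k)
ratio = s_in[i] / s_in[j]
W_j, _ = discretized_window(t * ratio, s_nodes, k)

# Scale invariance: W(t, s_i) should equal (s_i / s_j) * W(t * s_i / s_j, s_j).
print(np.max(np.abs(W_i[:, i] - ratio * W_j[:, j])))
```

The check passes for any `c` because every grid spacing is proportional to the local node value, so the finite-difference error is the same function of the rescaled time at every node. Repeating the comparison on a linearly spaced grid would break the equality.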
Implementing this construction on a discretized spatial axis as a neural network has a tremendous resource conserving advantage. Since at each, the window function coarse grains the input around a past time of , the maximum past timescale represented by the memory system is inversely related to the minimum value of . The geometric distribution of the values on the discretized axis implies that if there are nodes spanning the spatial axis for , it can represent the coarse grained past from timescales proportional to . Hence an exponentially distant past can be represented in a coarse grained fashion with linearly increasing resources.
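The resource argument can be illustrated with a short calculation. The sketch assumes a geometric grid whose successive values have ratio `1 + c` and counts how many nodes are needed for the largest and smallest represented timescales to differ by a given factor. The function names and the comparison with a shift register that uses one node per smallest time step are illustrative assumptions.

```python
import math

def nodes_for_range(timescale_ratio, c):
    """Nodes on a geometric grid with ratio (1 + c) needed to span the given range."""
    return math.ceil(math.log(timescale_ratio) / math.log(1.0 + c)) + 1

def shift_register_nodes(timescale_ratio):
    """A shift register with one node per smallest time step needs this many nodes."""
    return math.ceil(timescale_ratio)

for ratio in (1e2, 1e4, 1e6):
    print(ratio, nodes_for_range(ratio, c=0.1), shift_register_nodes(ratio))
```

The node count of the geometric grid grows with the logarithm of the timescale range, whereas the shift register grows linearly with it.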
VI Discussion and Conclusion
The formulation presented here starts from a convolution memory model (eq. 1) and derives the form of the scale-invariant window functions (or the kernels) that can be constructed from a space-time local mechanism. Interestingly, by simply postulating a kernel of the form of eq. 7, Tank and Hopfield have demonstrated the utility of such a memory system in temporal pattern classification Tank and Hopfield (1987). In general, a convolution memory model can adopt an arbitrary kernel, but it cannot be mechanistically constructed from a space-time local differential equation, which means a neural network implementation need not exist. However, the Gamma-memory model Vries and Principe (1992) shows that linear combinations of Gamma kernels, functionally similar to eq. 7, can indeed be mechanistically constructed from a set of differential equations.
The construction presented here takes a complementary approach to the Gamma-memory model by requiring scale invariance of the window function at the outset and then imposing a space-time local differential equation to derive it. This sheds light on the connectivity between neighboring spatial units of the network that is required to generate a scale invariant window function, as described by the second part of the two step process (eq. 11). Moreover, the linearity of the two step process and its equivalence to the Laplace and inverse Laplace transforms make the memory representation analytically tractable.
Theoretically, the utility of a scale invariantly coarse grained memory hinges on the existence of scale free temporal fluctuations in the signals being encoded Shankar and Howard (2013). Although detailed empirical analysis of natural signals is needed to confirm this utility, preliminary analysis of time series from sunspots and global temperature shows that such a memory system indeed has a higher predictive power than a conventional shift register Shankar and Howard (2013). The predictive advantage of this memory system can be understood as arising from its intrinsic ability to wash out non-predictive stochastic fluctuations in the input signal from the distant past and just represent the predictively relevant information in a condensed form. Finally, the most noteworthy feature is that a memory system with nodes can represent information from exponentially distant past times proportional to . In comparison to a shift register with nodes which can accurately represent a maximum past time scale proportional to , this memory system is exponentially resource conserving.
The work was partly funded by NSF BCS-1058937 and AFOSR FA9550-12-1-0369.
- Maass et al. (2002) W. Maass, T. Natschläger, and H. Markram, Neural Computation 14, 2531 (2002).
- Jaeger (2001) H. Jaeger, The echo state approach to analyzing and training recurrent networks, GMD-Report 148 (GMD - German National Research Institute for Information Technology, 2001).
- White et al. (2004) O. L. White, D. D. Lee, and H. Sompolinsky, Physical Review Letters 92, 148102 (2004).
- Hermans and Schrauwen (2010) M. Hermans and B. Schrauwen, Neural Networks 23, 341 (2010).
- Ganguli et al. (2008) S. Ganguli, D. Huh, and H. Sompolinsky, Proceedings of the National Academy of Sciences of the United States of America 105, 18970 (2008).
- Strauss et al. (2012) T. Strauss, W. Wustlich, and R. Labahn, Neural Computation 24, 3246 (2012).
- Legenstein and Maass (2007) R. Legenstein and W. Maass, Neural Networks 20, 323 (2007).
- Wallace et al. (2013) E. Wallace, H. R. Maei, and P. E. Latham, Neural Computation 25, 1408 (2013).
- Ganguli and Sompolinsky (2012) S. Ganguli and H. Sompolinsky, Annual Review of Neuroscience 35, 485 (2012).
- Charles et al. (2014) A. S. Charles, H. L. Yap, and C. J. Rozell, Neural Computation 26, 1198 (2014).
- Mandelbrot (1982) B. Mandelbrot, The Fractal Geometry of Nature (W. H. Freeman, San Francisco, CA, 1982).
- Voss and Clarke (1975) R. F. Voss and J. Clarke, Nature 258, 317 (1975).
- West and Shlesinger (1990) B. J. West and M. F. Shlesinger, American Scientist 78, 40 (1990).
- Shankar and Howard (2013) K. H. Shankar and M. W. Howard, Journal of Machine Learning Research 14, 3785 (2013).
- Gibbon (1977) J. Gibbon, Psychological Review 84, 279 (1977).
- Rakitin et al. (1998) B. C. Rakitin, J. Gibbon, T. B. Penny, C. Malapani, S. C. Hinton, and W. H. Meck, Journal of Experimental Psychology: Animal Behavior Processes 24, 15 (1998).
- Shankar and Howard (2012) K. H. Shankar and M. W. Howard, Neural Computation 24, 134 (2012).
- Post (1930) E. Post, Transactions of the American Mathematical Society 32, 723 (1930).
- Tank and Hopfield (1987) D. Tank and J. Hopfield, Proceedings of the National Academy of Sciences 84, 1896 (1987).
- Vries and Principe (1992) B. d. Vries and J. C. Principe, Neural Networks 5, 565 (1992).