When sensory data are processed by automatic methods in areas of signal processing such as computer vision or audio processing, or in computational modelling of biological perception, the notion of a receptive field constitutes an essential concept (Hubel and Wiesel HubWie59-Phys ; HubWie05-book ; Aertsen and Johannesma AerJoh81-BICY ; DeAngelis et al. DeAngOhzFre95-TINS ; deAngAnz04-VisNeuroSci ; Miller et al. MilEscReaSch01-JNeuroPhys ).
For sensory data as obtained from vision or hearing, or their counterparts in artificial perception, the measurement from a single light sensor in a video camera or on the retina, or the instantaneous sound pressure registered by a microphone is hardly meaningful at all, since any such measurement is strongly dependent on external factors such as the illumination of a visual scene regarding vision or the distance between the sound source and the microphone regarding hearing. Instead, the essential information is carried by the relative relations between local measurements at different points and temporal moments regarding vision or local measurements over different frequencies and temporal moments regarding hearing. Following this paradigm, sensory measurements should be performed over local neighbourhoods over space-time regarding vision and over local neighbourhoods in the time-frequency domain regarding hearing, leading to the notions of spatio-temporal and spectro-temporal receptive fields.
Specifically, spatio-temporal receptive fields constitute a main class of primitives for expressing methods for video analysis (Zelnik-Manor and Irani ZelIra01-CVPR , Laptev and Lindeberg LapLin04-ECCVWS ; LapCapSchLin07-CVIU ; Jhuang et al. JhuSerWolPog07-ICCV ; Kläser et al. KlaMarSch08-BMVC ; Niebles et al. NieWanFei08-IJCV ; Wang et al. WanUllKlaLapSch09-BMVC ; Poppe et al. Pop09-IVC ; Shao and Mattivi ShaMatt10-CIVR ; Weinland et al. WeiRonBoy11-CVIU ; Wang et al. WanQiaTan15-CVPR ), whereas spectro-temporal receptive fields constitute a main class of primitives for expressing methods for machine hearing (Patterson et al. PatRobHolMcKeoZhaAll92-AudPhysPerc ; PatAllGig95-JASA ; Kleinschmidt Kle02-ActAcust ; Ezzat et al. EzzBouPog07-InterSpeech ; Meyer and Kollmeier MeyKol08-InterSpeech ; Schlute et al. SchBezWagNey07-ICASSP ; Heckmann et al. HecDomJouGoe11-SpeechComm ; Wu et al. WuZhaShi11-ASLP ; Alias et al. AliSocJoaSev16-ApplSci ).
A general problem when applying the notion of receptive fields in practice, however, is that the types of responses that are obtained in a specific situation can be strongly dependent on the scale levels at which they are computed. A spatio-temporal receptive field is determined by at least a spatial scale parameter and a temporal scale parameter, whereas a spectro-temporal receptive field is determined by at least a spectral and a temporal scale parameter. Beyond ensuring that local sensory measurements at different spatial, temporal and spectral scales are treated in a consistent manner, which by itself provides strong constraints on the shapes of the receptive fields (Lindeberg Lin13-BICY ; Lin16-JMIV ; Lindeberg and Friberg LinFri15-PONE ; LinFri15-SSVM ), it is necessary for computer vision or machine hearing algorithms to decide which responses within the families of receptive fields over different spatial, temporal and spectral scales they should base their analysis on.
Over the spatial domain, theoretically well-founded methods have been developed for choosing spatial scale levels among receptive field responses over multiple spatial scales (Lindeberg Lin97-IJCV ; Lin98-IJCV ; Lin99-CVHB ; Lin12-JMIV ; Lin14-EncCompVis ) leading to e.g. robust methods for image-based matching and recognition (Lowe Low04-IJCV ; Mikolajczyk and Schmid MikSch04-IJCV ; Tuytelaars and van Gool TuyGoo04-IJCV ; Bay et al. BayEssTuyGoo08-CVIU ; Tuytelaars and Mikolajczyk TuyMik08-Book ; van de Sande et al. SanGevSno10-PAMI ; Larsen et al. LarDarDahPed12-ECCV ) that are able to handle large variations of the size of the objects in the image domain and with numerous applications regarding object recognition, object categorization, multi-view geometry, construction of 3-D models from visual input, human-computer interaction, biometrics and robotics.
Much less research has, however, been performed on the topic of choosing locally appropriate scales in temporal data. While some methods for temporal scale selection have been developed (Lindeberg Lin97-AFPAC ; Laptev and Lindeberg LapLin03-ICCV ; Willems et al. WilTuyGoo08-ECCV ), these methods suffer from either theoretical or practical limitations.
A main subject of this paper is to present a theory for how to compare filter responses in terms of temporal derivatives that have been computed at different temporal scales, specifically with a detailed theoretical analysis of the possibilities of having the temporal scale estimates obtained from a temporal scale selection mechanism reflect the temporal duration of the underlying temporal structures that gave rise to the feature responses. Another main subject of this paper is to present a theoretical framework for temporal scale selection that leads to temporal scale invariance and enables the computation of scale covariant temporal scale estimates. While these topics can for a non-causal temporal domain be addressed by the non-causal Gaussian scale-space concept (Iijima Iij62 ; Witkin Wit83 ; Koenderink Koe84-BC ; Koenderink and van Doorn KoeDoo92-PAMI ; Lindeberg Lin93-Dis ; Lin94-SI ; Lin10-JMIV ; Florack Flo97-book ; ter Haar Romeny Haa04-book ), the development of such a theory has been missing regarding a time-causal temporal domain.
1.1 Temporal scale selection
When processing time-dependent signals in video or audio, or more generally any temporal signal, special attention has to be paid to the facts that:
the physical phenomena that generate the temporal signals may occur at different speeds, faster or slower, and
the temporal signals may contain qualitatively different types of temporal structures at different temporal scales.
In certain controlled situations, where the physical system that generates the temporal signals to be processed is sufficiently well known and the variability of the temporal scales over time is sufficiently constrained, suitable temporal scales for processing the signals may be chosen manually and then verified experimentally. If, on the other hand, the sources that generate the temporal signals are sufficiently complex and/or the temporal structures in the signals vary substantially in temporal duration because the underlying physical processes occur significantly faster or slower, it is natural to (i) include a mechanism for processing the temporal data at multiple temporal scales and (ii) try to detect in a bottom-up manner at what temporal scales the interesting temporal phenomena are likely to occur.
The subject of this article is to develop a theory for temporal scale selection in a time-causal temporal scale space as an extension of a previously developed theory for spatial scale selection in a spatial scale space (Lindeberg Lin97-IJCV ; Lin98-IJCV ; Lin99-CVHB ; Lin12-JMIV ; Lin14-EncCompVis ), to generate bottom-up hypotheses about characteristic temporal scales in time-dependent signals, intended to serve as estimates of the temporal duration of local temporal structures in time-dependent signals. Special focus will be on developing mechanisms analogous to scale selection in non-causal Gaussian scale-space, based on local extrema over scales of scale-normalized derivatives, while expressed within the framework of a time-causal and time-recursive temporal scale space in which the future cannot be accessed and the signal processing operations are thereby only allowed to make use of information from the present moment and a compact buffer of what has occurred in the past.
When designing and developing such scale selection mechanisms, it is essential that the computed scale estimates reflect the temporal duration of the corresponding temporal structures that gave rise to the feature responses. To understand the prerequisites for developing such temporal scale selection methods, we will in this paper perform an in-depth theoretical analysis of the scale selection properties that such temporal scale selection mechanisms give rise to for different temporal scale-space concepts and for different ways of defining scale-normalized temporal derivatives.
Specifically, after an examination of the theoretical properties of different types of temporal scale-space concepts, we will focus on a class of recently extended time-causal temporal scale-space concepts obtained by convolution with truncated exponential kernels coupled in cascade (Lindeberg Lin90-PAMI ; Lin15-SSVM ; Lin16-JMIV ; Lindeberg and Fagerström LF96-ECCV ). For two natural ways of distributing the discrete temporal scale levels in such a representation, in terms of either a uniform distribution over the scale parameter corresponding to the variance of the composed scale-space kernel or a logarithmic distribution, we will study the scale selection properties that result from detecting local temporal scale levels from local extrema over scale of scale-normalized temporal derivatives. The motivation for studying a logarithmic distribution of the temporal scale levels is that it corresponds to a uniform distribution in units of effective scale $\tau_{eff} = A + B \log \tau$ for some constants $A$ and $B$, which has been shown to constitute the natural metric for measuring the scale levels in a spatial scale space (Koenderink Koe84-BC ; Lindeberg Lin92-PAMI ).
As we shall see from the detailed theoretical analysis that will follow, this will imply certain differences in scale selection properties of a temporally asymmetric time-causal scale space compared to scale selection in a spatially mirror symmetric Gaussian scale space. These differences in theoretical properties are in turn essential to take into explicit account when formulating algorithms for temporal scale selection in e.g. video analysis or audio analysis applications.
For the temporal scale-space concept based on a uniform distribution of the temporal scale levels in units of the variance of the composed scale-space kernel, it will be shown that temporal scale selection from local extrema over temporal scales will make it possible to estimate the temporal duration of local temporal structures modelled as local temporal peaks and local temporal ramps. For a dense temporal structure modelled as a temporal sine wave, the lack of true scale invariance for this concept will, however, imply that the temporal scale estimates will not be directly proportional to the wavelength of the temporal sine wave. Instead, the scale estimates are affected by a bias, which is not a desirable property.
For the temporal scale-space concept based on a logarithmic distribution of the temporal scale levels, taken to the limit of the scale-invariant time-causal limit kernel (Lindeberg Lin16-JMIV ), which corresponds to an infinite number of temporal scale levels that cluster infinitely close near the temporal scale level zero, it will on the other hand be shown that the temporal scale estimates of a dense temporal sine wave will be truly proportional to the wavelength of the signal. By a general proof, it will be shown that this scale-invariance property of the temporal scale estimates can also be extended to any sufficiently regular signal, which constitutes a general foundation for expressing scale-invariant temporal scale selection mechanisms for time-dependent video and audio and more generally also other classes of time-dependent measurement signals.
As a complement to this proposed overall framework for temporal scale selection, we will also present a set of general theoretical results regarding time-causal scale-space representations: (i) showing that previous application of the assumption of a semi-group property for time-causal scale-space concepts leads to undesirable temporal dynamics, which can however be remedied by replacing the assumption of a semi-group structure by a weaker assumption of a cascade property, in turn based on a transitivity property, (ii) formulations of scale-normalized temporal derivatives for Koenderink’s time-causal scale-time model Koe88-BC and (iii) ways of translating the temporal scale estimates from local extrema over temporal scales in the temporal scale-space representation based on the scale-invariant time-causal limit kernel into quantitative measures of the temporal duration of the corresponding underlying temporal structures, in turn based on a scale-time approximation of the limit kernel.
In these ways, this paper is intended to provide a theoretical foundation for expressing theoretically well-founded temporal scale selection methods for selecting local temporal scales over time-causal temporal domains, such as video and audio with specific focus on real-time image or sound streams. Applications of this scale selection methodology for detecting both sparse and dense spatio-temporal features in video are presented in a companion paper Lin16-spattempscsel .
1.2 Structure of this article
As a conceptual background to the theoretical developments that will be performed, we will start in Section 2 with an overview of different approaches to handling temporal data within the scale-space framework including a comparison of relative advantages and disadvantages of different types of temporal scale-space concepts.
As a theoretical baseline for the later developments of methods for temporal scale selection in a time-causal scale space, we shall then in Section 3 give an overall description of basic temporal scale selection properties that will hold if the non-causal Gaussian scale-space concept with its corresponding selection methodology for a spatial image domain is applied to a one-dimensional non-causal temporal domain, e.g. for the purpose of handling the temporal domain when analysing pre-recorded video or audio in an offline setting.
In Sections 4–5 we will then continue with a theoretical analysis of the consequences of performing temporal scale selection in the time-causal scale space obtained by convolution with truncated exponential kernels coupled in cascade (Lindeberg Lin90-PAMI ; Lin15-SSVM ; Lin16-JMIV ; Lindeberg and Fagerström LF96-ECCV ). By selecting local temporal scales from the scales at which scale-normalized temporal derivatives assume local extrema over temporal scales, we will analyze the resulting temporal scale selection properties for two ways of defining scale-normalized temporal derivatives, by either variance-based normalization as determined by a scale normalization parameter $\gamma$ or $L_p$-normalization for different values of the scale normalization power $p$.
With the temporal scale levels required to be discrete because of the very nature of this temporal scale-space concept, we will specifically study two ways of distributing the temporal scale levels over scale, using either a uniform distribution relative to the temporal scale parameter corresponding to the variance of the composed temporal scale-space kernel in Section 4 or a logarithmic distribution of the temporal scale levels in Section 5.
Because of the analytically simpler form of the time-causal scale-space kernels corresponding to a uniform distribution of the temporal scale levels, some theoretical scale-space properties will turn out to be easier to study in closed form for this temporal scale-space concept. We will specifically show that for a temporal peak modelled as the impulse response to a set of truncated exponential kernels coupled in cascade, the selected temporal scale level will serve as a good approximation of the temporal duration of the peak or be proportional to this measure, depending on the value of the scale normalization parameter $\gamma$ used for scale-normalized temporal derivatives based on variance-based normalization or the scale normalization power $p$ for scale-normalized temporal derivatives based on $L_p$-normalization. For a temporal onset ramp, the selected temporal scale level will on the other hand be either a good approximation of the time constant of the onset ramp or proportional to this measure of the temporal duration of the ramp. For a temporal sine wave, the selected temporal scale level will, however, not be directly proportional to the wavelength of the signal, but instead affected by a systematic bias. Furthermore, the corresponding scale-normalized magnitude measures will not be independent of the wavelength of the sine wave but instead show systematic wavelength-dependent deviations. A main reason for this is that this temporal scale-space concept does not guarantee temporal scale invariance if the temporal scale levels are distributed uniformly in terms of the temporal scale parameter corresponding to the temporal variance of the temporal scale-space kernel.
With a logarithmic distribution of the temporal scale levels, we will on the other hand show that for the temporal scale-space concept defined by convolution with the time-causal limit kernel (Lindeberg Lin16-JMIV ), corresponding to an infinitely dense distribution of the temporal scale levels towards zero temporal scale, the temporal scale estimates will be perfectly proportional to the wavelength of a sine wave. It will also be shown that this temporal scale-space concept leads to perfect scale invariance in the sense that (i) local extrema over temporal scales are preserved under temporal scaling factors corresponding to integer powers of the distribution parameter $c$ of the time-causal limit kernel underlying this temporal scale-space concept and are transformed in a scale-covariant way for any temporal input signal and (ii) if the scale normalization parameter $\gamma = 1$ or equivalently if the scale normalization power $p = 1$, the magnitude values at the local extrema over scale will be equal under corresponding temporal scaling transformations. For this temporal scale-space concept we can therefore fulfil basic requirements to achieve temporal scale invariance also over a time-causal and time-recursive temporal domain.
To simplify the theoretical analysis we will in some cases temporarily extend the definitions of temporal scale-space representations over discrete temporal scale levels to a continuous scale variable, to make it possible to compute local extrema over temporal scales from differentiation with respect to the temporal scale parameter. Section 6 discusses the influence that this approximation has on the overall theoretical analysis.
Section 7 then illustrates how the proposed theory for temporal scale selection can be used for computing local scale estimates from 1-D signals with substantial variabilities in the characteristic temporal duration of the underlying structures in the temporal signal.
In Section 8, we analyse how the derived scale selection properties carry over to a set of spatio-temporal feature detectors defined over both multiple spatial scales and multiple temporal scales in a time-causal spatio-temporal scale-space representation for video analysis. Section 9 then outlines how corresponding selection of local temporal and logspectral scales can be expressed for audio analysis operations over a time-causal spectro-temporal domain. Finally, Section 10 concludes with a summary and discussion.
To simplify the presentation, we have put some derivations and theoretical analysis in the appendix. Appendix A presents a general theoretical argument of why a requirement about a semi-group property over temporal scales will lead to undesirable temporal dynamics for a time-causal scale space and argues that the essential structure of non-creation of new image structures from any finer to any coarser temporal scale can nevertheless be achieved with the less restrictive assumption about a cascade smoothing property over temporal scales, which then allows for better temporal dynamics in terms of e.g. shorter temporal delays.
In relation to Koenderink’s scale-time model Koe88-BC , Appendix B shows how corresponding notions of scale-normalized temporal derivatives based on either variance-based normalization or $L_p$-normalization can be defined also for this time-causal temporal scale-space concept.
Appendix C shows how the temporal duration of the time-causal limit kernel proposed in (Lindeberg Lin16-JMIV ) can be estimated by a scale-time approximation of the limit kernel via Koenderink’s scale-time model, leading to estimates of how a selected temporal scale level from local extrema over temporal scale can be translated into estimates of the temporal duration of temporal structures in the temporal scale-space representation obtained by convolution with the time-causal limit kernel. Specifically, explicit expressions are given for such temporal duration estimates based on first- and second-order temporal derivatives.
2 Theoretical background and related work
2.1 Temporal scale-space concepts
For processing temporal signals at multiple temporal scales, different types of temporal scale-space concepts have been developed in the computer vision literature (see Figure 1):
For off-line processing of pre-recorded signals, a non-causal Gaussian temporal scale-space concept may in many situations be sufficient. A Gaussian temporal scale-space concept is constructed over the 1-D temporal domain in a similar manner as a Gaussian spatial scale-space concept is constructed over a $D$-dimensional spatial domain (Iijima Iij62 ; Witkin Wit83 ; Koenderink Koe84-BC ; Koenderink and van Doorn KoeDoo92-PAMI ; Lindeberg Lin93-Dis ; Lin94-SI ; Lin10-JMIV ; Florack Flo97-book ; ter Haar Romeny Haa04-book ), with the possible difference that a model for temporal delays may additionally be included (Lindeberg Lin10-JMIV ).
When processing temporal signals in real time, or when modelling sensory processes in biological perception computationally, it is on the other hand necessary to base the temporal analysis on time-causal operations.
The first time-causal temporal scale-space concept was developed by Koenderink Koe88-BC , who proposed to apply Gaussian smoothing on a logarithmically transformed time axis with the present moment mapped to the unreachable infinity. This temporal scale-space concept does, however, not have any known time-recursive formulation. Formally, it requires an infinite memory of the past and has therefore not been extensively applied in computational applications.
Lindeberg Lin90-PAMI ; Lin15-SSVM ; Lin16-JMIV and Lindeberg and Fagerström LF96-ECCV proposed a time-causal temporal scale-space concept based on truncated exponential kernels or equivalently first-order integrators coupled in cascade, based on theoretical results by Schoenberg Sch50 (see also Schoenberg Sch88-book and Karlin Kar68 ) implying that such kernels are the only variation-diminishing kernels over a 1-D temporal domain that guarantee non-creation of new local extrema or equivalently zero-crossings with increasing temporal scale. This temporal scale-space concept is additionally time-recursive and can be implemented in terms of computationally highly efficient first-order integrators or recursive filters over time. This theory has been recently extended into a scale-invariant time-causal limit kernel (Lindeberg Lin16-JMIV ), which allows for scale invariance over the temporal scaling transformations that correspond to exact mappings between the temporal scale levels in the temporal scale-space representation based on a discrete set of logarithmically distributed temporal scale levels.
Based on semi-groups that guarantee either self-similarity over temporal scales or non-enhancement of local extrema with increasing temporal scales, Fagerström Fag05-IJCV and Lindeberg Lin10-JMIV have derived time-causal semi-groups that allow for a continuous temporal scale parameter and studied theoretical properties of these kernels.
Concerning temporal processing over discrete time, Fleet and Langley FleLan95-PAMI performed temporal filtering for optic flow computations based on recursive filters over time. Lindeberg Lin90-PAMI ; Lin15-SSVM ; Lin16-JMIV and Lindeberg and Fagerström LF96-ECCV showed that first-order recursive filters coupled in cascade constitute a natural time-causal scale-space concept over discrete time, based on the requirement that the temporal filtering over a 1-D temporal signal must not increase the number of local extrema or equivalently the number of zero-crossings in the signal. In the specific case when all the time constants in this model are equal and tend to zero while simultaneously increasing the number of temporal smoothing steps in such a way that the composed temporal variance is held constant, these kernels can be shown to approach the temporal Poisson kernel LF96-ECCV . If on the other hand the time constants of the first-order integrators are chosen so that the temporal scale levels become logarithmically distributed, these temporal smoothing kernels approach a discrete approximation of the time-causal limit kernel Lin16-JMIV .
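As a minimal sketch of such a discrete-time cascade, the following code couples first-order recursive filters in series and checks that the temporal variances of the composed kernels match prescribed temporal scale levels. It assumes the normalized recursive update f_out(t) = f_out(t-1) + (f_in(t) - f_out(t-1)) / (1 + mu), whose impulse response is a geometric distribution with mean mu and variance mu^2 + mu. The function names and the chosen scale levels are illustrative assumptions and are not taken from the cited implementations.

```python
import math

def discrete_time_constants(levels):
    """Time constants mu_k such that the variances mu_k**2 + mu_k of the
    individual recursive filters add up to the requested scale levels."""
    previous = [0.0] + list(levels[:-1])
    return [(math.sqrt(1.0 + 4.0 * (tau - prev)) - 1.0) / 2.0
            for tau, prev in zip(levels, previous)]

def recursive_cascade(signal, levels):
    """Smooth `signal` to each temporal scale level, using only the present
    sample and one stored state value per level (time-causal, time-recursive)."""
    outputs = []
    current = list(signal)
    for mu in discrete_time_constants(levels):
        state = 0.0
        smoothed = []
        for sample in current:
            state += (sample - state) / (1.0 + mu)
            smoothed.append(state)
        outputs.append(smoothed)
        current = smoothed
    return outputs

def mean_and_variance(kernel):
    """Temporal mean (delay) and variance of a non-negative kernel."""
    mass = sum(kernel)
    mean = sum(n * h for n, h in enumerate(kernel)) / mass
    variance = sum((n - mean) ** 2 * h for n, h in enumerate(kernel)) / mass
    return mean, variance

# Impulse responses at logarithmically distributed scale levels (illustrative values).
impulse = [1.0] + [0.0] * 399
levels = [1.0, 4.0, 16.0]
for tau, kernel in zip(levels, recursive_cascade(impulse, levels)):
    print(tau, mean_and_variance(kernel))
```

Each level needs only its own previous output as memory, which is what makes the representation time-recursive. The temporal mean printed for each level is the temporal delay of the composed kernel and equals the sum of the time constants used up to that level.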
Applications of using these linear temporal scale-space concepts for modelling the temporal smoothing step in visual and auditory receptive fields have been presented by Lindeberg Lin97-ICSSTCV ; CVAP257 ; Lin10-JMIV ; Lin13-BICY ; Lin13-PONE ; Lin15-SSVM ; Lin16-JMIV , ter Haar Romeny et al. RomFloNie01-SCSP , Lindeberg and Friberg LinFri15-PONE ; LinFri15-SSVM and Mahmoudi Mah16-JMIV . Non-linear spatio-temporal scale-space concepts have been proposed by Guichard Gui98-TIP . Applications of the non-causal Gaussian temporal scale-space concept for computing spatio-temporal features have been presented by Laptev and Lindeberg LapLin03-ICCV ; LapLin04-ECCVWS ; LapCapSchLin07-CVIU , Kläser et al. KlaMarSch08-BMVC , Willems et al. WilTuyGoo08-ECCV , Wang et al. WanUllKlaLapSch09-BMVC , Shao and Mattivi ShaMatt10-CIVR and others, see specifically Poppe Pop09-IVC for a survey of early approaches to vision-based human action recognition, Jhuang et al. JhuSerWolPog07-ICCV and Niebles et al. NieWanFei08-IJCV for conceptually related non-causal Gabor approaches, Adelson and Bergen AdeBer85-JOSA and Derpanis and Wildes DerWil12-PAMI for closely related spatio-temporal orientation models and Han et al. HanXuZhu15-JMIV for a related mid-level temporal representation termed the video primal sketch.
Applications of the temporal scale-space model based on truncated exponential kernels with equal time constants coupled in cascade, corresponding to Laguerre functions (Laguerre polynomials multiplied by a truncated exponential kernel), have been presented for computing spatio-temporal features by Rivero-Moreno and Bres RivBre04-ImAnalRec , Shabani et al. ShaClaZel12-BMVC and Berg et al. BerReyRid14-SensMEMSElOptSyst , for handling time scales in video surveillance (Jacob and Pless JacPle08-CircSystVidTech ) and for performing edge-preserving smoothing in video streams (Paris Par08-ECCV ). This model is also closely related to Tikhonov regularization as used for image restoration by e.g. Surya et al. SurVorPelJosSeePal15-JMIV . A general framework for spatio-temporal feature detection based on truncated exponential kernels coupled in cascade, with emphasis on the theoretical and practical advantages of a logarithmic distribution of the intermediate temporal scale levels in terms of temporal scale invariance and better temporal dynamics (shorter temporal delays), has been presented in Lindeberg Lin16-JMIV .
2.2 Relative advantages of different temporal scale spaces
When developing a temporal scale selection mechanism over a time-causal temporal domain, a first problem concerns what time-causal scale-space concept to base the multi-scale temporal analysis upon. The above reviewed temporal scale-space concepts have different relative advantages from a theoretical and computational viewpoint. In this section, we will perform an in-depth examination of the different temporal scale-space concepts that have been developed in the literature, which will lead us to a class of time-causal scale-space concepts that we argue is particularly suitable with respect to the set of desirable properties we aim at.
The non-causal Gaussian temporal scale space is in many cases the conceptually easiest temporal scale-space concept to handle and to study analytically (Lindeberg Lin10-JMIV ). The corresponding temporal kernels are scale invariant, have compact closed-form expressions over both the temporal and frequency domains and obey a semi-group property over temporal scales. When applied to pre-recorded signals, temporal delays can if desirable be disregarded, which eliminates any need for temporal delay compensation. This scale-space concept is, however, not time-causal and not time-recursive, which implies fundamental limitations with regard to real-time applications and realistic modelling of biological perception.
Koenderink’s scale-time kernels Koe88-BC are truly time-causal, allow for a continuous temporal scale parameter, have good temporal dynamics and have a compact explicit expression over the temporal domain. These kernels are, however, not time-recursive, which implies that they in principle require an infinite memory of the past (or at least extended temporal buffers corresponding to the temporal extent to which the infinite support temporal kernels are truncated at the tail). Thereby, the application of Koenderink’s scale-time model to video analysis implies that substantial temporal buffers are needed when implementing this non-recursive temporal scale-space in practice. Similar problems with substantial need for extended temporal buffers arise when applying the non-causal Gaussian temporal scale-space concept to offline analysis of extended video sequences. The algebraic expressions for the temporal kernels in the scale-time model are furthermore not always straightforward to handle, and there is no known simple expression for the Fourier transform of these kernels, nor any known simple explicit cascade smoothing property over temporal scales with respect to the regular (untransformed) temporal domain. Thereby, certain algebraic calculations with the scale-time kernels may become quite complicated.
The temporal scale-space kernels obtained by coupling truncated exponential kernels or equivalently first-order integrators in cascade are both truly time-causal and truly time-recursive (Lindeberg Lin90-PAMI ; Lin15-SSVM ; Lin16-JMIV ; Lindeberg and Fagerström LF96-ECCV ). The temporal scale levels are on the other hand required to be discrete. If the goal is to construct a real-time signal processing system that analyses continuous streams of signal data in real time, one can however argue that a restriction of the theory to a discrete set of temporal scale levels is less of a constraint, since the signal processing system anyway has to be based on a finite number of sensors and hardware/wetware for sampling and processing the continuous stream of signal data.
In the special case when all the time constants are equal, the corresponding temporal kernels in the temporal scale-space model based on truncated exponential kernels coupled in cascade have compact explicit expressions that are easy to handle both in the temporal domain and in the frequency domain, which simplifies theoretical analysis. These kernels obey a semi-group property over temporal scales, but are not scale invariant and lead to slower temporal dynamics when a larger number of primitive temporal filters are coupled in cascade (Lindeberg Lin15-SSVM ; Lin16-JMIV ).
In the special case when the temporal scale levels in this scale-space model are logarithmically distributed, these kernels have a manageable explicit expression over the Fourier domain that enables some closed-form theoretical calculations. Deriving an explicit expression over the temporal domain is, however, harder, since the explicit expression then corresponds to a linear combination of truncated exponential filters for all the time constants, with the coefficients determined from a partial fraction expansion of the Fourier transform, which may lead to rather complex closed-form expressions. Thereby certain analytical calculations may become harder to handle. As shown in Lin16-JMIV and Appendix C, some such calculations can on the other hand be well approximated via a scale-time approximation of the time-causal temporal scale-space kernels. When using a logarithmic distribution of the temporal scales, the composed temporal kernels do however have very good temporal dynamics and much better temporal dynamics compared to corresponding kernels obtained by using truncated exponential kernels with equal time constants coupled in cascade. Moreover, these kernels lead to a computationally very efficient numerical implementation. Specifically, these kernels allow for the formulation of a time-causal limit kernel that obeys scale invariance under temporal scaling transformations, which cannot be achieved if using a uniform distribution of the temporal scale levels (Lindeberg Lin15-SSVM ; Lin16-JMIV ).
The temporal scale-space representations obtained from the self-similar time-causal semi-groups have a continuous scale parameter and obey temporal scale invariance (Fagerström Fag05-IJCV ; Lindeberg Lin10-JMIV ). These kernels do, however, have less desirable temporal dynamics (see Appendix A for a general theoretical argument about undesirable consequences of imposing a temporal semi-group property on temporal kernels with temporal delays) and/or lead to pseudodifferential equations that are harder to handle both theoretically and in terms of computational implementation. For these reasons, we shall not consider those time-causal semi-groups further in this treatment.
2.3 Previous work on methods for scale selection
A general framework for performing scale selection for local differential operations was proposed in Lindeberg Lin93-SCIA ; Lin93-Dis based on the detection of local extrema over scale of scale-normalized derivative expressions and then refined in Lindeberg Lin97-IJCV ; Lin98-IJCV — see Lindeberg Lin99-CVHB ; Lin14-EncCompVis for tutorial overviews.
This scale selection approach has been applied to a large number of feature detection tasks over spatial image domains including detection of scale-invariant interest points (Lindeberg Lin97-IJCV ; Lin12-JMIV , Mikolajczyk and Schmid MikSch04-IJCV ; Tuytelaars and Mikolajczyk TuyMik08-Book ), performing feature tracking (Bretzner and Lindeberg BL97-CVIU ), computing shape from texture and disparity gradients (Lindeberg and Gårding LG93-ICCV ; Gårding and Lindeberg GL94-IJCV ), detecting 2-D and 3-D ridges (Lindeberg Lin98-IJCV ; Sato et al. SatNakShiAtsYouKolGerKik98-MIA ; Frangi et al. FraNieHooWalVie00-MED ; Krissian et al. KriMalAyaValTro00-CVIU ), computing receptive field responses for object recognition (Chomat et al. ChoVerHalCro00-ECCV ; Hall et al. HalVerCro00-ECCV ), performing hand tracking and hand gesture recognition (Bretzner et al. BreLapLin02-FG ) and computing time-to-collision (Negre et al. NegBraCroLau08-ExpRob ).
Specifically, very successful applications have been achieved in the area of image-based matching and recognition (Lowe Low04-IJCV ; Bay et al. BayEssTuyGoo08-CVIU ; Lindeberg Lin12-Scholarpedia ; Lin15-JMIV ). The combination of local scale selection from local extrema of scale-normalized derivatives over scales (Lindeberg Lin93-Dis ; Lin97-IJCV ) with affine shape adaptation (Lindeberg and Gårding LG96-IVC ) has made it possible to perform multi-view image matching over large variations in viewing distances and viewing directions (Mikolajczyk and Schmid MikSch04-IJCV ; Tuytelaars and van Gool TuyGoo04-IJCV ; Lazebnik et al. LazSchPon05-PAMI ; Mikolajczyk et al. MikTuySchZisMatSchKadGoo05-IJCV ; Rothganger et al. RotLazSchPon06-IJCV ). The combination of interest point detection from scale-space extrema of scale-normalized differential invariants (Lindeberg Lin93-Dis ; Lin97-IJCV ) with local image descriptors (Lowe Low04-IJCV ; Bay et al. BayEssTuyGoo08-CVIU ) has made it possible to design robust methods for recognizing natural objects in natural environments, with numerous applications to object recognition (Lowe Low04-IJCV ; Bay et al. BayEssTuyGoo08-CVIU ), object category classification (Bosch et al. BosZisMun07-ICCV ; Mutch and Lowe MutLow08-IJCV ), multi-view geometry (Hartley and Zisserman HarZis04-Book ), panorama stitching (Brown and Lowe BroLow07-IJCV ), automated construction of 3-D object and scene models from visual input (Brown and Lowe BroLow05-3DIM ; Agarwal et al. AgaSnaSimSeiSze09-ICCV ), synthesis of novel views from previous views of the same object (Liu et al. LiuYueTor11-PAMI ), visual search in image databases (Lew et al. LewSebDjeJai06-ACM-Multi ; Datta et al. DatJosLiWan08-CompSurv ), human computer interaction based on visual input (Porta Por02-HumCompStud ; Jaimes and Sebe JaiSeb07-CVIU ), biometrics (Bicego et al. BicLagGroTis06-CVPRW ; Li Li09-EncBiometr ) and robotics (Se et al. SeLowLit05-TROB ; Siciliano and Khatib SicKha08-HandBookRob ).
Alternative approaches for performing scale selection over spatial image domains have also been proposed in terms of (i) detecting peaks of weighted entropy measures (Kadir and Brady KadBra01-IJCV ) or Lyapunov functionals (Sporring et al. SpoCoilTra00-ICIP ) over scales, (ii) minimising normalized error measures over scale (Lindeberg Lin97-IVC ), (iii) determining minimum reliable scales for edge detection based on a noise suppression model (Elder and Zucker EldZuc98-PAMI ), (iv) determining at what scale levels to stop in non-linear diffusion-based image restoration methods based on similarity measurements relative to the original image data (Mrázek and Navara MraNav03-IJCV ), (v) comparing reliability measures from statistical classifiers for texture analysis at multiple scales (Kang et al. KanMorNag05-ScSp ), (vi) computing image segmentations from the scales at which a supervised classifier delivers class labels with the highest reliability measure (Loog et al. LooLiTax09-LNCS ; Li et al. LiTaxLoo11-ScSp ), (vii) selecting scales for edge detection by estimating the saliency of elongated edge segments (Liu et al. LiuWanYaoZha12-CVPR ) or (viii) considering subspaces generated by local image descriptors computed over multiple scales (Hassner et al. HasMayZel12-CVPR ).
More generally, spatial scale selection can be seen as a specific instance of computing invariant receptive field responses under natural image transformations, to (i) handle objects in the world of different physical size and to account for scaling transformations caused by the perspective mapping, and with extensions to (ii) affine image deformations to account for variations in the viewing direction and (iii) Galilean transformations to account for relative motions between objects in the world and the observer as well as to (iv) illumination variations (Lindeberg Lin13-PONE ).
Early theoretical work on temporal scale selection in a time-causal scale space was presented in Lindeberg Lin97-AFPAC with primary focus on the temporal Poisson scale-space, which possesses a temporal semi-group structure over a discrete time-causal temporal domain while leading to long temporal delays (see Appendix A for a general theoretical argument). Temporal scale selection in non-causal Gaussian spatio-temporal scale space has been used by Laptev and Lindeberg LapLin03-ICCV and Willems et al. WilTuyGoo08-ECCV for computing spatio-temporal interest points, although with certain theoretical limitations that are explained in a companion paper Lin16-spattempscsel . (Footnote 1: The spatio-temporal scale selection method in (Laptev and Lindeberg LapLin03-ICCV ) is based on a spatio-temporal Laplacian operator that is not scale covariant under independent relative scaling transformations of the spatial vs. the temporal domains Lin16-spattempscsel , which implies that the spatial and temporal scale estimates will not be robust under independent variabilities of the spatial and temporal scales in video data. The spatio-temporal scale selection method applied to the determinant of the spatio-temporal Hessian in (Willems et al. WilTuyGoo08-ECCV ) does not make use of the full flexibility of the notion of $\gamma$-normalized derivative operators Lin16-spattempscsel and has not previously been developed over a time-causal spatio-temporal domain.) The purpose of this article is to present a much further developed and more general theory for temporal scale selection in time-causal scale spaces over continuous temporal domains and to analyse the theoretical scale selection properties for different types of model signals.
3 Scale selection properties for the non-causal Gaussian temporal scale space concept
In this section, we will present an overview of theoretical properties that will hold if the Gaussian temporal scale-space concept is applied to a non-causal temporal domain, if additionally the scale selection mechanism that has been developed for a non-causal spatial domain is directly transferred to a non-causal temporal domain. The set of temporal scale-space properties that we will arrive at will then be used as a theoretical base-line for developing temporal scale-space properties over a time-causal temporal domain.
3.1 Non-causal Gaussian temporal scale-space
Over a one-dimensional temporal domain, axiomatic derivations of a temporal scale-space representation based on the assumptions of (i) linearity, (ii) temporal shift invariance, (iii) semi-group property over temporal scale, (iv) sufficient regularity properties over time and temporal scale and (v) non-enhancement of local extrema imply that the temporal scale-space representation
$$L(\cdot;\; \tau, \delta) = g(\cdot;\; \tau, \delta) * f(\cdot) \tag{1}$$
should be generated by convolution with possibly time-delayed temporal kernels of the form (Lindeberg Lin10-JMIV )
$$g(t;\; \tau, \delta) = \frac{1}{\sqrt{2\pi\tau}} \, e^{-(t-\delta)^2/2\tau} \tag{2}$$
where $\tau$ is a temporal scale parameter corresponding to the variance of the Gaussian kernel and $\delta$ is a temporal delay. Differentiating the kernel with respect to time gives
$$g_t(t;\; \tau, \delta) = -\frac{(t-\delta)}{\tau} \, g(t;\; \tau, \delta) \tag{3}$$
$$g_{tt}(t;\; \tau, \delta) = \frac{(t-\delta)^2 - \tau}{\tau^2} \, g(t;\; \tau, \delta) \tag{4}$$
see the top row in Figure 1 for graphs. When analyzing pre-recorded temporal signals, it can be preferable to set the temporal delay to zero, leading to temporal scale-space kernels having a similar form as spatial Gaussian kernels:
$$g(t;\; \tau) = \frac{1}{\sqrt{2\pi\tau}} \, e^{-t^2/2\tau} \tag{5}$$
3.2 Temporal scale selection from scale-normalized derivatives
As a conceptual background to the treatments that we shall later develop regarding temporal scale selection in time-causal temporal scale spaces, we will in this section describe the theoretical structure that arises by transferring the theory for scale selection in a Gaussian scale space over a spatial domain to the non-causal Gaussian temporal scale space:
Given the temporal scale-space representation $L(t;\; \tau)$ of a temporal signal $f(t)$ obtained by convolution with the Gaussian kernel $g(t;\; \tau)$ according to (1), temporal scale selection can be performed by detecting local extrema over temporal scales of differential expressions expressed in terms of scale-normalized temporal derivatives at any scale $\tau$ according to (Lindeberg Lin97-IJCV ; Lin98-IJCV ; Lin99-CVHB ; Lin14-EncCompVis )
$$\partial_{\zeta^n} = \tau^{n\gamma/2} \, \partial_{t^n} \tag{6}$$
where $\zeta = t/\tau^{\gamma/2}$ is the scale-normalized temporal variable, $n$ is the order of temporal differentiation and $\gamma$ is a free parameter. It can be shown (Lin97-IJCV, Section 9.1) that this notion of $\gamma$-normalized derivatives corresponds to normalizing the $n$th order Gaussian derivatives $g_{\zeta^n}(t;\; \tau)$ over a one-dimensional domain to constant $L_p$-norms over scale
$$\| g_{\zeta^n}(\cdot;\; \tau) \|_p = \left( \int_{t \in \mathbb{R}} |g_{\zeta^n}(t;\; \tau)|^p \, dt \right)^{1/p} = G_{n,\gamma} \tag{7}$$
with
$$p = \frac{1}{1 + n(1-\gamma)} \tag{8}$$
where the perfectly scale invariant case $\gamma = 1$ corresponds to $L_1$-normalization for all orders $n$ of temporal differentiation.
Temporal scale invariance.
A general and very useful scale invariant property that results from this construction of the notion of scale-normalized temporal derivatives can be stated as follows: Consider two signals $f$ and $f'$ that are related by a temporal scaling transformation
$$f'(t') = f(t) \quad \mbox{with} \quad t' = S \, t \tag{9}$$
and assume that there is a local extremum over scales at $(t_0;\; \tau_0)$ in a differential expression defined as a homogeneous polynomial of Gaussian derivatives computed from the scale-space representation $L$ of the original signal $f$. Then, there will be a corresponding local extremum over scales at $(t_0';\; \tau_0') = (S \, t_0;\; S^2 \tau_0)$ in the corresponding differential expression computed from the scale-space representation $L'$ of the rescaled signal $f'$ (Lin97-IJCV, Section 4.1).
This scaling result holds for all homogeneous polynomial differential expressions and implies that local extrema over scales of $\gamma$-normalized derivatives are preserved under scaling transformations. Specifically, this scale invariant property implies that if a local temporal scale level $\sigma = \sqrt{\tau}$ in dimension of time is selected to be proportional to the temporal scale estimate $\hat{\sigma} = \sqrt{\hat{\tau}}$ such that $\sigma = C \, \hat{\sigma}$, then if the temporal signal $f$ is transformed by a temporal scale factor $S$, the temporal scale estimate and therefore also the selected temporal scale level will be transformed by a similar temporal factor, $\hat{\sigma}' = S \, \hat{\sigma}$, implying that the selected temporal scale levels will automatically adapt to variations in the characteristic temporal scale of the signal. Thereby, such local extrema over temporal scale provide a theoretically well-founded way to automatically adapt the scale levels to local scale variations.
Specifically, scale-normalized scale-space derivatives of order $n$ at corresponding temporal moments will be related according to
$$L'_{\zeta'^n}(t';\; \tau') = S^{n(\gamma-1)} \, L_{\zeta^n}(t;\; \tau) \tag{10}$$
which means that $\gamma = 1$ implies perfect scale-invariance in the sense that the $\gamma$-normalized derivatives at corresponding points will be equal. If $\gamma \neq 1$, the difference in magnitude can on the other hand be easily compensated for using the scale values of the corresponding scale-adaptive image features (see below).
3.3 Temporal peak
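For a temporal peak modelled as a Gaussian function with variance $\tau_0$
$$f(t) = g(t;\; \tau_0) = \frac{1}{\sqrt{2\pi\tau_0}} \, e^{-t^2/2\tau_0} \tag{11}$$
it can be shown that scale selection from local extrema over scale of second-order scale-normalized temporal derivatives
$$L_{\zeta\zeta}(t;\; \tau) = \tau^{\gamma} L_{tt}(t;\; \tau) = \tau^{\gamma} g_{tt}(t;\; \tau_0 + \tau) \tag{12}$$
implies that the scale estimate at the position $t = 0$ of the peak will be given by
$$\hat{\tau} = \frac{2\gamma}{3 - 2\gamma} \, \tau_0 \tag{13}$$
If we require the scale estimate to reflect the temporal duration of the peak such that
$$\hat{\tau} = q^2 \tau_0 \tag{14}$$
then this implies
$$\gamma = \frac{3 q^2}{2(q^2 + 1)} \tag{15}$$
which in the specific case of $q = 1$ corresponds to (Lin98-IJCV, Section 5.6.1)
$$\gamma = \frac{3}{4} \tag{16}$$
and in turn corresponding to $L_p$-normalization for $p = 2/3$ according to (8).
If we additionally renormalize the original Gaussian peak to having maximum value equal to one
$$f(t) = \sqrt{2\pi\tau_0} \, g(t;\; \tau_0) = e^{-t^2/2\tau_0} \tag{17}$$
then if using the same value of $\gamma$ for computing the magnitude response as for selecting the temporal scale, the maximum magnitude value over scales will be given by
$$L_{\zeta\zeta,maxmagn} = \frac{(2\gamma)^{\gamma} (3 - 2\gamma)^{3/2-\gamma}}{3^{3/2}} \, \tau_0^{\gamma-1} \tag{18}$$
and will not be independent of the temporal scale $\tau_0$ of the original peak unless $\gamma = 1$. If on the other hand using $\gamma = 3/4$ as motivated by requirements of scale calibration (14) for $q = 1$, the scale dependency will for a Gaussian peak be of the form
$$L_{\zeta\zeta,maxmagn}\big|_{\gamma=3/4} = \frac{1}{2\sqrt{2} \, \tau_0^{1/4}} \tag{19}$$
To get a scale-invariant magnitude measure for comparing the responses of second-order temporal derivative responses at different temporal scales for the purpose of scale calibration, we should therefore consider a scale-invariant magnitude measure for peak detection of the form
$$L_{\zeta\zeta,maxmagn,postnorm}\big|_{\gamma=1} = \tau^{1/4} \, L_{\zeta\zeta,maxmagn}\big|_{\gamma=3/4} \tag{20}$$
which for a Gaussian temporal peak will assume the value
$$L_{\zeta\zeta,maxmagn,postnorm}\big|_{\gamma=1} = \frac{1}{2\sqrt{2}} \tag{21}$$
Specifically, this form of post-normalization corresponds to computing the scale-normalized derivatives for $\gamma = 1$ at the selected scale (14) of the temporal peak, which according to (8) corresponds to $L_1$-normalization of the second-order temporal derivative kernels.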
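The scale selection result for the Gaussian peak can be checked numerically. The following Python sketch is an illustration only and not part of the original treatment. It assumes the unit-height Gaussian peak model with variance $\tau_0$, uses the fact that Gaussian smoothing adds variances, and finds the maximum of $\tau^{\gamma} |L_{tt}(0;\; \tau)|$ by a plain grid search. The function names `peak_response` and `select_peak_scale` are hypothetical.

```python
import math

def peak_response(tau, tau0, gamma):
    """Magnitude of tau^gamma * L_tt at t = 0 for a unit-height Gaussian peak.

    The peak exp(-t^2 / (2 tau0)) smoothed by a Gaussian of variance tau is
    sqrt(tau0 / (tau0 + tau)) * exp(-t^2 / (2 (tau0 + tau))), whose second
    derivative at t = 0 has magnitude sqrt(tau0) / (tau0 + tau)^(3/2).
    """
    return tau ** gamma * math.sqrt(tau0) / (tau0 + tau) ** 1.5

def select_peak_scale(tau0, gamma, n_steps=20000, span=10.0):
    """Grid search over tau in (0, span * tau0] for the strongest response."""
    best_tau, best_val = 0.0, -1.0
    for i in range(1, n_steps + 1):
        tau = span * tau0 * i / n_steps
        val = peak_response(tau, tau0, gamma)
        if val > best_val:
            best_tau, best_val = tau, val
    return best_tau, best_val

if __name__ == "__main__":
    gamma = 0.75
    for tau0 in (1.0, 4.0, 16.0):
        tau_hat, magnitude = select_peak_scale(tau0, gamma)
        closed_form = 2.0 * gamma / (3.0 - 2.0 * gamma) * tau0
        # Post-normalization by tau_hat^(1/4) removes the dependency on tau0.
        post_normalized = tau_hat ** 0.25 * magnitude
        print(tau0, tau_hat, closed_form, post_normalized)
```

Under these assumptions, the selected scale for $\gamma = 3/4$ should coincide with $\tau_0$ up to the grid resolution, and the post-normalized magnitude should stay close to $1/(2\sqrt{2})$ for all three values of $\tau_0$.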
3.4 Temporal onset ramp
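If we model a temporal onset ramp with temporal duration $\tau_0$ as the primitive function of the Gaussian kernel with variance $\tau_0$
$$f(t) = \int_{u=-\infty}^{t} g(u;\; \tau_0) \, du \tag{22}$$
it can be shown that scale selection from local extrema over scale of first-order scale-normalized temporal derivatives
$$L_{\zeta}(t;\; \tau) = \tau^{\gamma/2} L_t(t;\; \tau) = \tau^{\gamma/2} g(t;\; \tau_0 + \tau) \tag{23}$$
implies that the scale estimate at the central position $t = 0$ will be given by (Lin98-IJCV, Equation (23))
$$\hat{\tau} = \frac{\gamma}{1 - \gamma} \, \tau_0 \tag{24}$$
If we require this scale estimate to reflect the temporal duration of the ramp such that
$$\hat{\tau} = q^2 \tau_0 \tag{25}$$
then this implies
$$\gamma = \frac{q^2}{q^2 + 1} \tag{26}$$
which in the specific case of $q = 1$ corresponds to (Lin98-IJCV, Section 4.5.1)
$$\gamma = \frac{1}{2} \tag{27}$$
and in turn corresponding to $L_p$-normalization for $p = 2/3$ according to (8).
If using the same value of $\gamma$ for computing the magnitude response as for selecting the temporal scale, the maximum magnitude value over scales will be given by
$$L_{\zeta,maxmagn} = \frac{\gamma^{\gamma/2} (1 - \gamma)^{(1-\gamma)/2}}{\sqrt{2\pi}} \, \tau_0^{(\gamma-1)/2} \tag{28}$$
which is not independent of the temporal scale $\tau_0$ of the original onset ramp unless $\gamma = 1$. If using $\gamma = 1$ for temporal scale selection, the selected temporal scale according to (24) would, however, become infinite. If on the other hand using $\gamma = 1/2$ as motivated by requirements of scale calibration (25) for $q = 1$, the scale dependency will for a Gaussian onset ramp be of the form
$$L_{\zeta,maxmagn}\big|_{\gamma=1/2} = \frac{1}{2\sqrt{\pi} \, \tau_0^{1/4}} \tag{29}$$
To get a scale-invariant magnitude measure for comparing the responses of first-order temporal derivative responses at different temporal scales, we should therefore consider a scale-invariant magnitude measure for ramp detection of the form
$$L_{\zeta,maxmagn,postnorm}\big|_{\gamma=1} = \tau^{1/4} \, L_{\zeta,maxmagn}\big|_{\gamma=1/2} \tag{30}$$
which for a Gaussian onset ramp will assume the value
$$L_{\zeta,maxmagn,postnorm}\big|_{\gamma=1} = \frac{1}{2\sqrt{\pi}} \tag{31}$$
Specifically, this form of post-normalization corresponds to computing the scale-normalized derivatives for $\gamma = 1$ at the selected scale (25) of the onset ramp and thus also to $L_p$-normalization of the first-order temporal derivative kernels for $p = 1$.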
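The corresponding result for the onset ramp can be illustrated in the same spirit. The sketch below is an assumption-laden illustration rather than the method of the article: it uses that the first-order derivative of the smoothed ramp is a Gaussian of variance $\tau_0 + \tau$, so that the scale-normalized response at $t = 0$ is $\tau^{\gamma/2} / \sqrt{2\pi(\tau_0 + \tau)}$, and it locates the maximum by a ternary search over $\log \tau$, which is valid because the logarithm of the response is concave in $\log \tau$ for $0 < \gamma < 1$. The names `ramp_response` and `select_ramp_scale` are hypothetical.

```python
import math

def ramp_response(tau, tau0, gamma):
    """tau^(gamma/2) * L_t(0; tau) for a Gaussian onset ramp of diffuseness tau0."""
    return tau ** (gamma / 2.0) / math.sqrt(2.0 * math.pi * (tau0 + tau))

def select_ramp_scale(tau0, gamma, lo=1e-6, hi=1e6, iters=200):
    """Ternary search over log(tau) for the maximum of the unimodal response."""
    a, b = math.log(lo), math.log(hi)
    for _ in range(iters):
        m1 = a + (b - a) / 3.0
        m2 = b - (b - a) / 3.0
        if ramp_response(math.exp(m1), tau0, gamma) < ramp_response(math.exp(m2), tau0, gamma):
            a = m1  # the maximum lies to the right of m1
        else:
            b = m2  # the maximum lies to the left of m2
    return math.exp(0.5 * (a + b))

if __name__ == "__main__":
    for tau0 in (1.0, 4.0, 16.0):
        tau_hat = select_ramp_scale(tau0, gamma=0.5)
        # Post-normalized magnitude: tau_hat^(1/4) times the gamma = 1/2 response.
        post_normalized = tau_hat ** 0.25 * ramp_response(tau_hat, tau0, 0.5)
        print(tau0, tau_hat, post_normalized)
```

With $\gamma = 1/2$ the search should return $\hat{\tau} \approx \tau_0$, and the post-normalized magnitude should stay close to $1/(2\sqrt{\pi})$ irrespective of $\tau_0$. The search bounds `lo` and `hi` are arbitrary choices and must bracket the expected scale estimate.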
3.5 Temporal sine wave
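For a signal defined as a temporal sine wave
$$f(t) = \sin \omega_0 t \tag{32}$$
it can be shown that there will be a peak over temporal scales in the magnitude of the $n$th order scale-normalized temporal derivative $L_{\zeta^n} = \tau^{n\gamma/2} L_{t^n}$ at temporal scale (Lin97-IJCV, Section 3)
$$\tau_{max} = \frac{n \gamma}{\omega_0^2} \tag{33}$$
If we define a temporal scale parameter $\sigma$ of dimension $[\mbox{time}]$ according to $\sigma = \sqrt{\tau}$, then this implies that the scale estimate is proportional to the wavelength $\lambda_0 = 2\pi/\omega_0$ of the sine wave according to (Lin97-IJCV, Equation (9))
$$\sigma_{max} = \frac{\sqrt{\gamma n}}{2\pi} \, \lambda_0 \tag{34}$$
and does in this respect reflect a characteristic time constant over which the temporal phenomena occur. Specifically, the maximum magnitude measure over scale (Lin97-IJCV, Equation (10))
$$L_{\zeta^n,max} = \frac{(\gamma n)^{\gamma n/2}}{e^{\gamma n/2}} \, \omega_0^{(1-\gamma) n} \tag{35}$$
is for $\gamma = 1$ independent of the angular frequency $\omega_0$ of the sine wave and thereby scale invariant.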
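As a small worked check of the sine wave case, the following Python sketch evaluates the amplitude of the scale-normalized $n$th order derivative of a Gaussian-smoothed sine wave. It relies on the assumption that Gaussian smoothing with variance $\tau$ attenuates a sine wave of angular frequency $\omega_0$ by the factor $e^{-\omega_0^2 \tau/2}$, so that the amplitude is $\tau^{n\gamma/2} \omega_0^n e^{-\omega_0^2 \tau/2}$. The function names are hypothetical.

```python
import math

def sine_amplitude(tau, omega0, n, gamma):
    """Amplitude of the scale-normalized n-th derivative of a smoothed sine wave."""
    return tau ** (n * gamma / 2.0) * omega0 ** n * math.exp(-omega0 ** 2 * tau / 2.0)

def sine_scale_estimate(omega0, n, gamma):
    """Scale at which the amplitude is maximal over tau: n * gamma / omega0^2."""
    return n * gamma / omega0 ** 2

if __name__ == "__main__":
    n, gamma = 2, 1.0
    for omega0 in (0.5, 1.0, 2.0):
        tau_max = sine_scale_estimate(omega0, n, gamma)
        sigma_max = math.sqrt(tau_max)
        wavelength = 2.0 * math.pi / omega0
        # sigma_max / wavelength is the same for every omega0, and for gamma = 1
        # so is the maximum amplitude.
        print(omega0, sigma_max / wavelength, sine_amplitude(tau_max, omega0, n, gamma))
```

For $\gamma = 1$ and $n = 2$ the printed maximum amplitude equals $2/e$ for every frequency, which illustrates the scale invariance of the magnitude measure, while the ratio between the selected $\sigma$ and the wavelength stays at $\sqrt{\gamma n}/(2\pi)$.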
In the following, we shall investigate how these scale selection properties can be transferred to two types of time-causal temporal scale-space concepts.
4 Scale selection properties for the time-causal temporal scale space concept based on first-order integrators with equal time constants
In this section, we will present a theoretical analysis of the scale selection properties that are obtained in the time-causal scale-space based on truncated exponential kernels coupled in cascade, for the specific case of a uniform distribution of the temporal scale levels in units of the composed variance of the composed temporal scale-space kernels, and corresponding to the time-constants of all the primitive truncated exponential kernels being equal.
We will study three types of idealized model signals for which closed-form theoretical analysis is possible: (i) a temporal peak modelled as a set of truncated exponential kernels with equal time constants coupled in cascade, (ii) a temporal onset ramp modelled as the primitive function of the temporal peak model and (iii) a temporal sine wave. Specifically, we will analyse how the selected scale levels obtained from local extrema of temporal derivatives over scale relate to the temporal duration of a temporal peak or a temporal onset ramp, or alternatively how the selected scale levels depend on the wavelength of a sine wave.
We will also study how good an approximation the scale-normalized magnitude measure at the maximum over temporal scales is compared to the corresponding fully scale-invariant magnitude measures that are obtained from the non-causal temporal scale concept as listed in Section 3.
4.1 Time-causal scale space based on truncated exponential kernels with equal time constants coupled in cascade
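Given the requirements that the temporal smoothing operation in a temporal scale-space representation should obey (i) linearity, (ii) temporal shift invariance, (iii) temporal causality and (iv) non-creation of new local extrema or equivalently new zero-crossings with increasing temporal scale for any one-dimensional temporal signal, it can be shown (Lindeberg Lin90-PAMI ; Lin15-SSVM ; Lin16-JMIV ; Lindeberg and Fagerström LF96-ECCV ) that the temporal scale-space kernels should be constructed as a cascade of truncated exponential kernels of the form
$$h_{exp}(t;\; \mu_k) = \begin{cases} \frac{1}{\mu_k} \, e^{-t/\mu_k} & t \geq 0 \\ 0 & t < 0 \end{cases} \tag{36}$$
If we additionally require the time constants $\mu_k$ of all such primitive kernels that are coupled in cascade to be equal, $\mu_k = \mu$, then this leads to a composed temporal scale-space kernel of the form
$$U(t;\; \mu, K) = \frac{t^{K-1} \, e^{-t/\mu}}{\mu^K \, \Gamma(K)} \quad (t \geq 0) \tag{37}$$
corresponding to Laguerre functions (Laguerre polynomials multiplied by a truncated exponential kernel) and also equal to the probability density function of the Gamma distribution having a Laplace transform of the form
$$\bar{U}(q;\; \mu, K) = \int_{t=0}^{\infty} U(t;\; \mu, K) \, e^{-qt} \, dt = \frac{1}{(1 + \mu q)^K} \tag{38}$$
Differentiating the temporal scale-space kernel with respect to time gives
$$U_t(t;\; \mu, K) = -\frac{t - (K-1)\mu}{\mu \, t} \, U(t;\; \mu, K) \tag{39}$$
$$U_{tt}(t;\; \mu, K) = \frac{(K^2 - 3K + 2)\mu^2 - 2(K-1)\mu \, t + t^2}{\mu^2 \, t^2} \, U(t;\; \mu, K) \tag{40}$$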
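see the second row in Figure 1 for graphs. The $L_1$-norms of these kernels are for $K \geq 2$ given by
$$\| U_t(\cdot;\; \mu, K) \|_1 = \frac{2 \, e^{-(K-1)} (K-1)^{K-1}}{\mu \, \Gamma(K)} \tag{41}$$
$$\| U_{tt}(\cdot;\; \mu, K) \|_1 = \frac{2 \sqrt{K-1} \, e^{-(K-1)}}{\mu^2 \, \Gamma(K)} \left( (K-1-\sqrt{K-1})^{K-2} \, e^{\sqrt{K-1}} + (K-1+\sqrt{K-1})^{K-2} \, e^{-\sqrt{K-1}} \right)$$
The temporal scale level at level $K$ corresponds to temporal variance
$$\tau_K = K \mu^2$$
and temporal standard deviation $\sigma_K = \sqrt{\tau_K} = \mu \sqrt{K}$.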
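The basic properties of the composed kernel can be verified numerically. The following Python sketch is an illustration under stated assumptions: it takes the composed kernel to be the Gamma density $t^{K-1} e^{-t/\mu} / (\mu^K \Gamma(K))$ for $t \geq 0$, and estimates its integral, temporal mean and temporal variance by the trapezoidal rule on a truncated interval. The function names and the truncation rule for the integration interval are hypothetical choices.

```python
import math

def cascade_kernel(t, mu, K):
    """Kernel of K first-order integrators with equal time constant mu (zero for t < 0)."""
    if t < 0.0:
        return 0.0
    return t ** (K - 1) * math.exp(-t / mu) / (mu ** K * math.gamma(K))

def kernel_moments(mu, K, n_steps=100000):
    """Trapezoidal estimates of the integral, mean and variance of the kernel."""
    t_end = mu * (K + 10.0 * math.sqrt(K) + 30.0)  # the tail beyond t_end is negligible
    h = t_end / n_steps
    m0 = m1 = m2 = 0.0
    for i in range(n_steps + 1):
        t = i * h
        w = 0.5 if i in (0, n_steps) else 1.0  # trapezoidal end-point weights
        u = w * h * cascade_kernel(t, mu, K)
        m0 += u
        m1 += t * u
        m2 += t * t * u
    mean = m1 / m0
    return m0, mean, m2 / m0 - mean * mean

if __name__ == "__main__":
    mu, K = 0.5, 4
    integral, mean, variance = kernel_moments(mu, K)
    print(integral, mean, variance)  # compare with 1, K * mu and K * mu**2
```

The estimated variance should agree with $K \mu^2$, which is the sense in which a uniform distribution of the temporal scale levels is obtained when $K$ is increased in unit steps. The temporal mean $K \mu$ illustrates that the temporal delay grows linearly with the number of integrators.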
4.2 Temporal peak
Consider an input signal defined as a time-causal temporal peak corresponding to filtering a delta function with $K_0$ first-order integrators with time constants $\mu$ coupled in cascade:
$$f(t) = U(t;\; \mu, K_0) = \frac{t^{K_0-1} \, e^{-t/\mu}}{\mu^{K_0} \, \Gamma(K_0)} \tag{43}$$
With regard to the application area of vision, this signal can be seen as an idealized model of an object with temporal duration $\tau_0 = K_0 \mu^2$ that first appears and then disappears from the field of view, modelled in a form that is algebraically compatible with the algebra of the temporal receptive fields. With respect to the application area of hearing, this signal can be seen as an idealized model of a beat sound over some frequency range of the spectrogram, also modelled in a form that is compatible with the algebra of the temporal receptive fields.
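[Table 2. Top: scale estimate and maximum magnitude from temporal peak (uniform distribution), with columns for variance-based normalization (var) and $L_p$-normalization. Bottom: scale estimate and maximum magnitude from temporal ramp (uniform distribution), with columns for variance-based normalization (var) and $L_p$-normalization. The numerical entries are not reproduced here.]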
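Define the temporal scale-space representation by convolving this signal with the temporal scale-space kernel (43) corresponding to $K$ first-order integrators having the same time constants $\mu$
$$L(t;\; \mu, K) = (U(\cdot;\; \mu, K) * f(\cdot))(t) = U(t;\; \mu, K_0 + K) \tag{44}$$
where we have applied the semi-group property that follows immediately from the corresponding Laplace transforms
$$\bar{L}(q;\; \mu, K) = \frac{1}{(1 + \mu q)^{K}} \, \frac{1}{(1 + \mu q)^{K_0}} = \frac{1}{(1 + \mu q)^{K_0 + K}} \tag{45}$$
By differentiating the temporal scale-space representation (44) with respect to time we obtain
$$L_t(t;\; \mu, K) = -\frac{t - (K_0 + K - 1)\mu}{\mu \, t} \, U(t;\; \mu, K_0 + K) \tag{46}$$
$$L_{tt}(t;\; \mu, K) = \frac{(K_0 + K - 1)(K_0 + K - 2)\mu^2 - 2(K_0 + K - 1)\mu \, t + t^2}{\mu^2 \, t^2} \, U(t;\; \mu, K_0 + K) \tag{47}$$
implying that the maximum point is assumed at
$$t_{max} = \mu \, (K_0 + K - 1) \tag{48}$$
and the inflection points at
$$t_{infl} = \mu \left( K_0 + K - 1 \pm \sqrt{K_0 + K - 1} \right)$$
This form of the expression for the time of the temporal maximum implies that the temporal delay $\delta_0 = \mu (K_0 - 1)$ of the underlying peak and the temporal delay $\delta_K = \mu (K - 1)$ of the temporal scale-space kernel are not fully additive, but instead composed according to
$$t_{max} = \delta_0 + \delta_K + \mu$$
If we define the temporal duration $d$ of the peak as the distance between the inflection points, it furthermore follows that this temporal duration is related to the temporal duration $d_0 = 2\mu\sqrt{K_0 - 1}$ of the original peak and the temporal duration $d_K = 2\mu\sqrt{K - 1}$ of the temporal scale-space kernel according to
$$d = 2\mu\sqrt{K_0 + K - 1} = \sqrt{d_0^2 + d_K^2 + 4\mu^2}$$
Notably these expressions are not scale invariant, but instead strongly dependent on a preferred temporal scale as defined by the time constant $\mu$ of the primitive first-order integrators that define the uniform distribution of the temporal scales.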
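The semi-group property over the number of integrators can be illustrated numerically. The sketch below is a hedged illustration, assuming the Gamma-density form $t^{K-1} e^{-t/\mu} / (\mu^K \Gamma(K))$ of the composed kernel: it evaluates the convolution of the kernel for $K$ integrators with the peak model for $K_0$ integrators at a few time moments by the trapezoidal rule, and compares the result with the kernel for $K_0 + K$ integrators. The function names are hypothetical.

```python
import math

def cascade_kernel(t, mu, K):
    """Kernel of K first-order integrators with equal time constant mu (zero for t < 0)."""
    if t < 0.0:
        return 0.0
    return t ** (K - 1) * math.exp(-t / mu) / (mu ** K * math.gamma(K))

def convolve_at(t, mu, K0, K, n_steps=4000):
    """Trapezoidal estimate of (U(.; mu, K) * U(.; mu, K0))(t) for t > 0."""
    h = t / n_steps
    total = 0.0
    for i in range(n_steps + 1):
        s = i * h
        w = 0.5 if i in (0, n_steps) else 1.0  # trapezoidal end-point weights
        total += w * cascade_kernel(s, mu, K) * cascade_kernel(t - s, mu, K0)
    return total * h

if __name__ == "__main__":
    mu, K0, K = 0.5, 3, 5
    for t in (1.0, 3.5, 6.0):
        print(t, convolve_at(t, mu, K0, K), cascade_kernel(t, mu, K0 + K))
    # Position of the temporal maximum of the smoothed peak.
    print(mu * (K0 + K - 1))
```

The two printed columns should agree to high accuracy, and the smoothed peak should attain its maximum at $\mu (K_0 + K - 1)$, that is one time constant $\mu$ later than the sum of the delays $\mu (K_0 - 1)$ and $\mu (K - 1)$ of the peak and the kernel.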
Scale-normalized temporal derivatives.
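When using temporal scale normalization by variance-based normalization, the first- and second-order scale-normalized derivatives are given by
$$L_{\zeta}(t;\; \mu, K) = \tau^{\gamma/2} \, L_t(t;\; \mu, K) = (K \mu^2)^{\gamma/2} \, L_t(t;\; \mu, K)$$
$$L_{\zeta\zeta}(t;\; \mu, K) = \tau^{\gamma} \, L_{tt}(t;\; \mu, K) = (K \mu^2)^{\gamma} \, L_{tt}(t;\; \mu, K)$$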
When using temporal scale normalization by $L_p$-normalization, the first- and second-order scale-normalized derivatives are on the other hand given by (Lindeberg (Lin16-JMIV, Equation (75)))
with the scale-normalization factors determined such that the $L_p$-norm of the scale-normalized temporal derivative computation kernel
equals the $L_p$-norm of some other reference kernel, where we here take the $L_p$-norm of the corresponding Gaussian derivative kernels (Lindeberg (Lin16-JMIV, Equation (76)))
for , thus implying
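$$L_{\zeta}(t;\; \mu, K) = \frac{G_{1,\gamma}}{\| U_t(\cdot;\; \mu, K) \|_p} \, L_t(t;\; \mu, K)$$
$$L_{\zeta\zeta}(t;\; \mu, K) = \frac{G_{2,\gamma}}{\| U_{tt}(\cdot;\; \mu, K) \|_p} \, L_{tt}(t;\; \mu, K)$$
where $G_{1,\gamma}$ and $G_{2,\gamma}$ denote the $L_p$-norms (7) of corresponding Gaussian derivative kernels for the value of $\gamma$ at which they become constant over scales by $L_p$-normalization, and the $L_p$-norms $\| U_t(\cdot;\; \mu, K) \|_p$ and $\| U_{tt}(\cdot;\; \mu, K) \|_p$ of the temporal scale-space kernels $U_t$ and $U_{tt}$ for the specific case of $p = 1$ are given by (41) and the expression following it.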
Temporal scale selection.
Let us assume that we want to register that a new object has appeared by a scale-space extremum of the scale-normalized second-order derivative response.
To determine the temporal moment at which the temporal event occurs, we should formally determine the time where $\partial_t (L_{\zeta\zeta}(t;\; \mu, K)) = 0$, which by our model (54) would correspond to solving a third-order algebraic equation. To simplify the problem, let us instead approximate the temporal position of the peak in the second-order derivative by the temporal position $t_{max}$ of the peak according to (48) in the signal and study the evolution properties over scale of
$$L_{\zeta\zeta,max} = L_{\zeta\zeta}(t_{max};\; \mu, K) = L_{\zeta\zeta}(\mu (K_0 + K - 1);\; \mu, K)$$
In the case of variance-based normalization for a general value of $\gamma$, we have
$$L_{\zeta\zeta,max} = -\frac{(K \mu^2)^{\gamma} \, (K_0 + K - 1)^{K_0 + K - 2} \, e^{-(K_0 + K - 1)}}{\mu^3 \, \Gamma(K_0 + K)}$$
and in the case of $L_p$-normalization for $p = 1$
$$L_{\zeta\zeta,max} = -\frac{G_{2,1}}{\| U_{tt}(\cdot;\; \mu, K) \|_1} \, \frac{(K_0 + K - 1)^{K_0 + K - 2} \, e^{-(K_0 + K - 1)}}{\mu^3 \, \Gamma(K_0 + K)}$$
To determine the scale at which the local maximum is assumed, let us temporarily extend this definition to continuous values of $K$ and differentiate the corresponding expressions with respect to $K$. Solving the equation
$$\partial_K (L_{\zeta\zeta,max}(\mu, K)) = 0$$
numerically for different values of $K_0$ then gives the dependency on the scale estimate $\hat{K}$ as function of $K_0$ shown in Table 2 for variance-based normalization with either $\gamma = 3/4$ or $\gamma = 1$ and $L_p$-normalization for $p = 1$.
As can be seen from the results in Table 2, when using variance-based scale normalization for $\gamma = 3/4$, the scale estimate $\hat{K}$ closely follows the scale $K_0$ of the temporal peak and does therefore imply a good approximate transfer of the scale selection property (14) to this temporal scale-space concept. If one would instead use variance-based normalization for $\gamma = 1$ or $L_p$-normalization for $p = 1$, then that would, however, lead to substantial overestimates of the temporal duration of the peak.
Furthermore, if we additionally normalize the input signal to having unit contrast, then the corresponding time-causal counterpart of the post-normalized magnitude measure (20)
$$L_{\zeta\zeta,maxmagn,postnorm}\big|_{\gamma=1} = \tau^{1/4} \, L_{\zeta\zeta,maxmagn}\big|_{\gamma=3/4}$$
is for scale estimates proportional to the temporal duration of the underlying temporal peak very close to constant under variations of the temporal duration of the underlying temporal peak as determined by the parameter $K_0$, thus implying a good approximate transfer of the scale selection property (21).
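The qualitative behaviour of the scale estimates for the time-causal peak can be explored with a short numerical experiment. The following Python sketch is an illustration under explicit assumptions and is not claimed to reproduce the tabulated values: it uses variance-based normalization only, approximates the temporal position of the extremum by $t_{max} = \mu (K_0 + K - 1)$ as in the text, uses the resulting magnitude $(K \mu^2)^{\gamma} (K_0 + K - 1)^{K_0 + K - 2} e^{-(K_0 + K - 1)} / (\mu^3 \Gamma(K_0 + K))$, extends $K$ to continuous values and replaces the solution of the extremum equation by a grid search. It further assumes $K_0 \geq 2$. The function names `log_peak_response` and `select_K` are hypothetical.

```python
import math

def log_peak_response(K, K0, gamma, mu=1.0):
    """log |(K mu^2)^gamma * L_tt(t_max; mu, K)| for the time-causal peak model."""
    N = K0 + K  # number of integrators in the peak model plus the smoothing kernel
    log_abs_Ltt = ((N - 2.0) * math.log(N - 1.0) - (N - 1.0)
                   - math.lgamma(N) - 3.0 * math.log(mu))
    return gamma * math.log(K * mu * mu) + log_abs_Ltt

def select_K(K0, gamma, step=0.01, span=10.0):
    """Grid search over continuous K in (0, span * K0] for the strongest response."""
    best_K, best_val = step, -math.inf
    n = int(span * K0 / step)
    for i in range(1, n + 1):
        K = i * step
        val = log_peak_response(K, K0, gamma)
        if val > best_val:
            best_K, best_val = K, val
    return best_K

if __name__ == "__main__":
    for K0 in (4, 16, 64):
        print(K0, select_K(K0, gamma=0.75), select_K(K0, gamma=1.0))
```

Under these assumptions, the estimate for $\gamma = 3/4$ stays close to $K_0$, whereas the estimate for $\gamma = 1$ is roughly twice as large, in qualitative agreement with the statement that $\gamma = 1$ leads to substantial overestimates of the temporal duration of the peak. The time constant $\mu$ only contributes a factor that is independent of $K$ and therefore does not affect the selected value.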
4.3 Temporal onset ramp
Consider an input signal defined as a time-causal onset ramp corresponding to the primitive function of $K_0$ first-order integrators with time constants $\mu$ coupled in cascade:
$$f(t) = \int_{u=0}^{t} U(u;\; \mu, K_0) \, du$$
With respect to the application area of vision, this signal can be seen as an idealized model of a new object with temporal diffuseness $\tau_0 = K_0 \mu^2$ that appears in the field of view, modelled in a form that is algebraically compatible with the algebra of the temporal receptive fields. With respect to the application area of hearing, this signal can be seen as an idealized model of the onset of a new sound in some frequency band of the spectrogram, also modelled in a form that is compatible with the algebra of the temporal receptive fields.
Define the temporal scale-space representation of the signal by convolution with the temporal scale-space kernel (43) corresponding to $K$ first-order integrators having the same time constants $\mu$
$$L(t;\; \mu, K) = (U(\cdot;\; \mu, K) * f(\cdot))(t) = \int_{u=0}^{t} U(u;\; \mu, K_0 + K) \, du$$
Then, the first-order temporal derivative is given by
$$L_t(t;\; \mu, K) = U(t;\; \mu, K_0 + K)$$
which assumes its temporal maximum at $t_{max} = \mu (K_0 + K - 1)$.
Temporal scale selection.
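Let us assume that we are going to detect a newly appearing object from a local maximum in the first-order derivative over both time and temporal scales. When using variance-based normalization for a general value of $\gamma$, the scale-normalized response at the temporal maximum in the first-order derivative is given by
$$L_{\zeta,max} = (K \mu^2)^{\gamma/2} \, U(t_{max};\; \mu, K_0 + K) = \frac{(K \mu^2)^{\gamma/2} \, (K_0 + K - 1)^{K_0 + K - 1} \, e^{-(K_0 + K - 1)}}{\mu \, \Gamma(K_0 + K)}$$
When using $L_p$-normalization for a general value of $p$, the corresponding scale-normalized response is
$$L_{\zeta,max} = \frac{G_{1,\gamma}}{\| U_t(\cdot;\; \mu, K) \|_p} \, \frac{(K_0 + K - 1)^{K_0 + K - 1} \, e^{-(K_0 + K - 1)}}{\mu \, \Gamma(K_0 + K)}$$
where the $L_p$-norm of the first-order scale-space derivative kernel can be expressed in terms of exponential functions, the Gamma function and hypergeometric functions, but is too complex to be written out here. Extending the definition of these expressions to continuous values of $K$ and solving the equation
$$\partial_K (L_{\zeta,max}(\mu, K)) = 0$$
numerically for different values of $K_0$ then gives the dependency on the scale estimate $\hat{K}$ as function of $K_0$ shown in Table 2 for variance-based normalization with $\gamma = 1/2$ or $L_p$-normalization for $p = 2/3$.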
As can be seen from the numerical results, for both variance-based normalization and $L_p$-normalization with corresponding values of $\gamma = 1/2$ and $p = 2/3$, the numerical scale estimates in terms of $\hat{K}$ closely follow the diffuseness scale of the temporal ramp as parameterized by $K_0$. Thus, for both of these scale normalization models, the numerical results indicate an approximate transfer of the scale selection property (14) to this temporal scale-space model. Additionally, the maximum magnitude values according to (