1 Introduction
Particle tracking detectors allow studying of elementary particle interactions and precise measurement of their properties by observing their tracks. Robust tracking algorithms is nowadays a fundamental component of all tracking detectors. Tracking techniques in particle physics were evolving along with the technology development, from implementations on hardware logical elements, computer data processing, GPUaccelerated algorithms, to modern DeepLearning based approaches
[NiwaK1974, Alexandrov2017, Acciarri2017]. Advanced implementation of tracking algorithms can be seen for example in emulsion detector data reconstruction.Nuclear photoemulsion (further emulsion) detector is a tracking detector that allows detecting charged particles with high spatial (50 nm) and angular (<1 mrad) resolution. It does not require power supply during the experimental run. Such properties of this detector enable fundamental physical experiments searching for short lived particles [Kodama2001, Agafonova2015, Ahdida2019, Aghion2018] and large scale experiments in remote regions [Nishiyama2017]. The emulsion gel consists of small silver bromide crystals dispersed in a gelatin frame. When a charged particle passes through the emulsion gel, the crystals along its trajectory create latent image centers, which become visible under optical microscopes, after a chemical development (Figure 1, a).
a  c 
b 
Conventional track data reconstruction is performed in three steps. First, 3D tomographic images of the emulsion detector are acquired using automated scanning microscopes. Next, the position of silver grains (“hits”) is located in the 3D image volumes, and finally tracks in the detector volume are reconstructed as sequence of hits [Alexandrov2017, Ariga2018].
Several tracking algorithms were developed over the course of evolution of the scanning systems, allowing for efficient track reconstruction in realtime during the acquisition [Armenise2005, Ariga2014, Yoshimoto2017]. While satisfying need of many experiments they have several drawbacks. Adaptation to different experimental condition, e.g. high track density or high background level requires tedious calibration ranging from extensive parameter tuning to performing dedicated test runs using e.g. accelerator beams. In addition, when the procedure of extracting the hits is separated from track reconstruction, the tracking algorithm cannot fully exploit the information available in the raw image data, compromising performance especially in the high background/track density cases.
Incorporating tracking based on the classical Deep Learning, where the track parameters are predicted from the raw image data would naturally address the latter issue. Yet, to train such model in a supervised manner either one would need to provide massive amount of labeled 3D raw image data for each experimental case, or training would need to be performed largely on the simulated datasets. Suitable for some similar cases [Acciarri2017] this approach requires perfect knowledge of optical microscope parameters, grains’ sizes distribution, detector noise, etc., which are not always available directly.
Similarly to the recent works where for example in the images of faces the underlying factors of variation, such as eyes or hair color, glasses, or head tilt are disentangled [Kumar2017], it is possible to identify geometrical factors of variation of tracks. Training such models in an unsupervised manner, i.e. where no track parameters (labels) are to be provided during the training can address the mentioned issues simultaneously, by both leveraging raw image data for efficient track reconstruction and would allow simple adaptation to new configurations, requiring the raw image dataset only.
Here we introduce a tracking approach based on the Deep Convolutional Autoencoder[Lecun1998, Li2017] model that learns to disentangle the factors of variation in a geometrically meaningful way in a fully unsupervised manner by imposing equivariance of the space transformation. While the reconstruction constraint alone fails to disentangle the factors of variation in a meaningful way, we show that adding a simple constraint on translational invariance along the track line does not lead to an improvement. We demonstrate that incorporating more sophisticated transformations in the latent representation is demanded to avoid the reference ambiguity.
The remaining of the paper is structured as follows. In Section 2. the details of the proposed equivariance constraints, latent representation interpretation and implementation details will be given. In Section 3. we will show how different constraints affect the performance and carry out an indepth study of encoder and decoder performance separately to better understand the learned representation. Also, performance on real emulsion data will be shown. Finally, in Section 4 the applicability of the approach and future prospects will be discussed.
2 Methods
2.1 Equivariance constraint
In this work we will simplify the problem to the 2D case. We use synthetic image data of the emulsion detector tracks to perform study of the proposed approach. Also, we demonstrate the performance of the trained model on the real emulsion detector data.
We use a Deep Convolutional Autoencoder consisting of an encoder and a decoder as shown in Fig. 1b, that are trained in an endtoend manner. The encoder acts on 3232 pixels image crops (obtained from the full images using the cropping operation : as basis of our model producing the latent representation . In our setup
is demanded to estimate the geometrical parameters, namely position and angle, of tracks present in the image crop.
We then define a set of geometrical transformations acting both in the image representation space and in the latent representation space correspondingly parametrized by the same parameter set : in the image space and in the latent space of track parameters .
Given the encoder and decoder functions
we then demand equivariance of both encoder and decoder under these transformations, i.e. commutation of the encoder and decoder functions and with the transformations in corresponding domain:
From which, assuming , follows that
where cropping operations are omitted for brevity. This allows us to formulate the optimization problem in an endtoend manner, primarily as minimization of the loss between the cropped transformed image and the decoder output (Figure 1, b):
We will show, that with a sufficient set of transformations the model is able to learn a geometrically meaningful latent representation .
2.1.1 Interpretable latent representation
We limit the number of tracks potentially detected on each crop to , slightly above the maximum expected number of tracks per crop in our data set (five). We will parametrize a track with
parameters. Thus, the encoder is designed to output a vector
of length , that are explicitly partitioned into “track feature containers” of length , corresponding to each of the tracks.We attribute an a priori meaning to each of the elements in track features . Here we have explored three parametrization options:

, – track’s geometrical parameters, where , are the coordinates of a point on the th track line and , are proportional to the cosine and sine of the track inclination angle . Coordinates on the image crop are in the lower left corner and in the upper right, , so that (Figure 1, c). Such parametrization is chosen because it is continuous and confined, unlike e.g. itself or .

, , where the first 4 elements are the track’s geometrical parameters as in 1), and the flag is showing confidence of the encoder in the track presence. Value of is defined as a disabled track, and as an enabled. This flag allows for effectively disabling some tracks, and prevents the degenerated outputs, with multiple outputs corresponding to a single track.

, – track’s geometrical parameters in the rhotheta parametrization[Duda1972]. , – are proportional to the cosine and sine of the angle , and is the distance from the origin to the track (Fig. 1c).
While 3. is the most common parametrization for 2D tracking (e.g. Hough transformation), we aimed at testing also the overparametrized yet more naturally occurring in the image representation parametrization 1), which enables explicit implementation of the translation invariance
Since by implementation the values are , the parameter is scaled linearly into the range.
2.2 Representation transformations
As space transformations we employ affine transformations as combination of rotation, scaling, skew, and translation. Under these affine transformations straight lines are transformed to straight lines. We implemented these transformations coherently in the image and latent representation spaces. The details on the transformation’s implementation are given in the Appendix
A.The main property of a line is the translation invariance along it. In one of the models we tried to see if this translation transformation can be sufficient for disentangling the latent parameters into geometrical parameters of a line. Also, we have studied the effect of such transformation when incorporated in addition to the affine transformations.
In this work we included the five model configurations of constraints based on the equivariance between the image space and the latent representation of the track line parameters (Table 1).
Name  Description  Parametrization  

1  AT+A  Affine Transformations with track activation parameter  5  
2  AT+TI+A  Affine Transformations + Translation Invariance with track activation parameter  5  
3  RT+TI+A  Translation Invariance + Rotation transformation only with track activation parameter  5  
4  AT+TI  Affine Transformations + Translation Invariance  4  
5  AT, rcs  Affine Transformations in the (r,c,s) parametrization  3 
In the models employing the track activation parameter , when a track is marked as not active (), we reset the geometrical parameters of a track to random values, forcing the decoder to learn to ignore the disabled track.
2.3 Loss function
The model training is performed by minimizing the loss function. In the models without the track activation parameter
, the loss function describes only the similarity between the decoder output and the transformed image with measure:Here the average is calculated over all image pixels. The term scales the loss in the image regions with high signal intensity in the beginning of training (): . The coefficient prevents the loss from dropping to zero at low intensity image regions. As the training progresses exponentially decreases, and the loss is relaxed to pure : .
In the models with the track activation parameter , the loss function contains 3 terms:
The first term describes the pixel value measure as described above. The next two terms address the information flow problem (also referred to as shortcut problem in [Szabo2018]), in which the geometrical parameters of multiple tracks describe the same track or some of them are ignored. This is achieved in two steps. First, we demand that each track is found (and marked as active by the flag ) on average at the same rate as the others:
where the mean activation of all tracks and the mean activation of a particular track are calculated over the minibatch. The final term forces the activation to cluster at values or , enforcing the encoder decision on whether a track is enabled or disabled.
2.4 Training data
For model training and evaluation, we generate synthetic images composed of 2 types of objects: “particle tracks”, and background “fog”. Tracks are chains of bright gaussian spots with the spot density per unit length sampled from Poisson distribution with mean
located randomly along the straight lines with deviation. Fog is represented by gaussian spots uniformly distributed in the area with density
. The track density as well as the , , parameters approximately correspond to usual experimental conditions[Ariga2018] and remain fixed throughout this study. While the generated dataset resolution matches the usual imaging resolution, we downscale the images by a factor of 4 to facilitate this study (Figure 2).2.5 Model implementation and training procedure
In both encoder and decoder, we incorporated the CoordConv approach [Liu2018] in the first layer, which conceptually fits in our study. In this approach two additional channels, containing and coordinates of pixels in the range of correspondingly, are concatenated with input data channels before the first convolutional layer. In practice, we observe that indeed CoordConv improves the performance.
Since we deal with several objects of the same nature, we found it reasonable to apply the decoder to the track feature containers of each th track separately and then merge the outputs. To this end we first process each of the containers with the same decoder network that outputs single channel map corresponding to the track . Then these images are merged into final output as
, where the sigmoid activation function
is assuring the output pixel values in the range . This not only reduces number of parameters in the decoder, but also simplifies the study of encoder performance, as each has the same structure. Alternatively, shuffling the containers within each sample could be employed. Details on the encoder and decoder architectures are given in Table 2.Encoder  c_conv(3x3, 1, 16), conv(3x3, 2, 16), conv(3x3, 2, 64), conv(3x3, 2, 128), AP(2x2), conv(1x1, 1, 128), conv(3x3, 2, 512), AP(2x2), conv(1x1, 1, 256), FC(1024), FC(512), FC(128), FC(, ) 

Decoder 
Single block:
FC(16), c_tconv(1x1, 32), conv(1x1, 1, 32), conv(1x1, 1, 32), conv(1x1, 1, 16), conv(3x3, 1, 16), conv(3x3, 1, no activation) Blocks merging: Max(blocks), sigmoid activation 
coordinates and convolution; AP  average pooling; FC  fully connected layer; c_tconv  transposed CoordConv: tiling input up to the target size, concatenation with 2 coordinate channels, and convolution. All convolutions and FC layers are followed by batch normalization
[Ioffe2015]and ReLU
[Nair:2010:RLU:3104322.3104425] activation, unless otherwise stated.The coefficients in the loss function were chosen empirically to balance its components: , ,, , and . Training starts with , and after iterations is exponentially decreased every 5k iterations by a factor of , so that is relaxed to pure after about 200k iterations:
We perform the training on 32x32 random crops from 40000 images of 128x128 pixels until convergence for 500000 iterations with minibatch of 128 images. All experiments were performed using TensorFlow 1.12. Adam optimizer with initial learning rate of
to allow for stabilization, rising to after 2k iteration was used. Afterwards the rate is decreased by a factor of 0.88 each 90000 iterations. The loss function over the course of training for e.g. AT+TI+A model is shown in Figure 3. Training took about 50 hours on a single GeForce GTX 1080 GPU.3 Results
3.1 Autoencoder performance
First, we evaluate whether our models have learned to properly capture the content of presented images in both latent and image spaces. In Figure 4 the comparison of the outputs of the five models and the lines drawn based on the latent representation z predicted for the image, interpreted as described above, are shown. It is clear, that both AT+A and AT+TI+A models were able to build the geometrically meaningful latent representation in most cases. Yet without translation invariance implied, the output of AT+A contains more fakes both in the image output and in the drawn track lines, and images are significantly less sharp in the beginning of training. RT+TI+A on the other hand did not manage to separate the factors of variation in the desired way, and it took much longer to converge even just to mimic the desired image output. One can see inconsistency between the image output of the autoencoder and the track lines obtained by the latent representation, meaning that overall it did not grasp the concept of geometrical space in the desired manner. None of the model learned properly the ability to “switch off” the tracks using the confidence flag . While this parameter is not completely ignored, in most cases the track parameter containers , which are not used to encode present in the image lines, simply contain geometrical parameters corresponding to lines outside of the image crop range. Performance of the AT+TI was comparable or sometimes even better than that of the AT+TI+A. On the downside, without the flag acting also as a regularizer the model tends to attribute close parameters to several lines. The overparametrized models used the position to encode confidence in the track presence by placing them closer or further from the image along the track line (Figure 4, d). Performance of the AT, rcs model is slightly worse than of AT+TI. The performance of the models clearly degrades when number of tracks in the image crop is 4. We assume that the main reason for this is that these cases were rather underrepresented in the training set.
3.2 Disentanglement of the geometrical variational factors
To better understand the learned representation, we have performed a careful dissection into both decoder and encoder in this and the following sections, that was possible since the latent representation was designed fully interpretable. We start with the visual analysis of the learned representation by verifying the output of the decoder for given values of . To this end, we have run the decoder on all range of meaningful values of . In addition, for this study we performed the prediction on the image coordinate area , i.e. 9 times bigger than the range of original image crop . This way we can see how well the decoder generalizes in wider coordinate range, and it is possible thanks to the CoordConv nature of the decoder: by changing the values in the coordinate channels we can perform the prediction at any position. In Figure 5 and Figure S3 we show comparison of decoder’s predictions for a set of representative values of between all models.
It is clear, that all models employing the activation parameter have learned to suppress the output when the values of flag are small. The output of RT+TI+A model does not correlate with the expectation at all, and while the output resembles lines, the learned representation is clearly not the desired one. It would be interesting to investigate which representation was found but this lays outside of the scope of this paper. AT+TI+A outputs more elongated lines, that fade out slower compared to the AT+A. This is a consequence employing the Translation Invariance. AT+TI shows even more pronounced and fine lines. All overparametrized models (i.e. except AT, rcs) also suppress tracks with position laying further from the image range (Fig. 5, top row).
3.3 Performance of the track parameters’ measurement
To study the performance of the model for tracking, we evaluate the distribution and resolution of the encoder outputs . In Figure 6 the distributions of predicted position and the angle obtained from the parameters for each of the 8 track feature containers is shown for AT+A, AT+TI+A, AT+TI, and AT, rcs models are shown. Since the latter has different representation, for consistency we obtain the values of as follows:
We skip further studies of the RT+TI+A model, since the latent variables don’t have the desired meaning, driving the geometrical analysis meaningless. We show these distributions separately for “enabled” and “disabled” tracks, according to the latent activation parameter where applicable. One can clearly see again that the models didn’t learn to use it, and e.g. the AT+A assigned almost all tracks the “enabled” flag. Instead, the in fact nonpresent tracks are assigned with rather localized positions and angle (see the peaks in the angular distribution, present for each of 8 track containers). The positions for existing tracks lay within or close to the coordinate region of the image and cover rather uniformly a wide band in the space.
In the angular space all directions are covered leaving no blind spots. Each of the 8 parameter containers covers a subspace with some overlap for models AT+A and AT+TI+A. This means, that only a fraction of track containers is sensitive to any chosen direction. E.g. AT+TI+A model would fail to detect e.g. >3 parallel lines with inclination. Angular overlap in the AT+A model is very poor leading to poor detection of several parallel tracks in an image crop. In fact, three out of eight output containers have geometrical parameters corresponding to lines outside of the image (Fig. S2). In AT+TI on the other hand each container covers almost in angular space, and the overall distribution is rather uniform (See Figure S2). This would allow to detect several parallel lines in a view (e.g. in the last row in Fig.4 several tracks have similar angles).
To evaluate quantitatively the performance of the models, we have processed the generated test dataset consisting of 19600 images with available information on ground truth (GT) track positions and angles. First, we assign the reconstructed tracks (and active, i.e. for models with activation parameter ) to GT ones or mark them as fake. We evaluate the distance from image center between predicted track and a GT track, and difference in angle. Then we build the as , where position resolution is defined by pixel size , and angular resolution by pixel size and track length within the image crop , ( for ), and , are the distance and angular difference between prediction and GT track correspondingly.
The assignment is then performed sequentially, by selecting available predictionGT pair according to the minimum value of , if ( statistical significance, ndf=2). Remaining prediction tracks are split into two categories, fakes and duplicates. A track is considered as a duplicate if it’s to any of the used GT tracks or assigned prediction tracks is (i.e. within ), and as a fake otherwise. For the assigned tracks, we then evaluate offsets , as a function of number of tracks on the original image, as well as the fraction of assigned tracks (efficiency), number of fake tracks, and number of duplicate tracks (Figure 7). distribution for these models for different number of tracks per image crop is shown in Figure S4.
One can see, that the resolution of the models is stable as number of tracks grows. The AT+TI model has consistently higher resolution, as well as higher efficiency. Efficiency significantly decreases in all models with the number of tracks per image. We believe it is caused by the fact that images with high track number were underrepresented in the training set (Figure 2, d), and efficiency would improve if training set with high track multiplicity images is used. While number of fake tracks is not significantly higher in the AT+TI model, since it lacks the regularization based on the flag showing if a track is enabled, it tends to use of all the available tracks, thus creating large number of duplicates. Another model without the flag – the AT, rcs model on the other hand does not produce many duplicates, most likely since each track is sensitive only to a narrow angular range, as shown above.
Finally, we processed a real emulsion dataset to observe AT+TI model performance in a reallife application. We used a single image out of 3D tomographic image stack of size pixels corresponding to of emulsion detector area, irradiated with 400 GeV protons at different angles at the SPS accelerator at CERN [Aoki2019]. We preprocess the image (Figure 8, a) by downscaling it by a factor of 4, inverting the image, and normalizing the color scale to match the training data properties (Figure 8, b). We then divide the image into 54 nonoverlapping 3232 pixel crops and process them independently. Resulting tracks are then assembled into full image size and shown as 32 pixels long segments with highlighted position (Figure 8, b, overlay). One can appreciate good agreement between real tracks and predicted ones.
4 Discussion and outlook
Disentangling factors of variation remains a hot topic for several years. Developing models that are capable of abstracting highlevel concepts from row data is an important part of the representation learning research and has plenty of direct practical applications. Many works approached this problem employing variational autoencoders with regularizations in the latent representation encouraging disentanglement [Kumar2017] or autoencoders combined with adversarial training [Szabo2018]. In most cases after disentangling the factors of variation few labeled samples can be used to associate the factors with interpretable measures in quantitative way. Locatello2019 have shown in their work how few labeled samples can improve disentanglement itself. In Kim2018 it was shown that equivariance constraint, i.e. that changing one factor of variation corresponding to change of one dimension of the disentangled representation in a predictable manner leads to variables disentanglement. Nevertheless, to the best of our knowledge no previous works tried to extract meaningful quantitative information in a fully unsupervised manner.
In this work we have demonstrated that imposing equivariance constraint on the autoencoder under geometrical transformations in the image and latent representation domains enables the model to “discover” the existence of multiple lines in the presented images in a fully unsupervised manner. Incorporating simple affine transformation such as translation, rotation, scaling and skew as equivariances between the image and latent spaces, allows the models to successfully disentangle the factors of variation in the image data into geometrically meaningful parameters (coordinates and angles of lines). Adding the possibility to “switch off” a predicted track with an additional activation parameter does not drastically change the results (models AT+A and AT+TI+A). While it helps to prevent the shortcut problem and reduces the number of track duplicates, these models did not learn to exploit it.
Incorporating the translation along the track line in addition to the whole set of affine transformations enforces the line detection. Yet, employing only the translation invariance even together with rotation transformation (RT+TI+A model) leads to reference ambiguity, so that latent parameters do not correspond to the desired geometrical variables. We believe though, that it is possible to find mapping from these latent parameters to the desired geometrical variables with few calibration measurements, yet this lays out of the scope of this work. As we have shown, bigger set of transformations allows the model to immediately learn the latent representation in an unambiguous geometrically meaningful way. The minimal subset of affine transformations, sufficient for disentangling the factors of variation without reference ambiguity will be explored in future works.
In addition to the coordinatesangle parametrization, which gives the models more freedom in sample exploration we have studied the classical rhotheta parametrization. Under the set of affine transformations this model is also able to learn meaningful parametrization in an unsupervised manner. Yet the model performance is slightly worse than, e.g. of the AT+TI model, most likely because the incorporated CoordConv approach prefers to have natural coordinate representation.
The main weak point of current implementation is that neither background nor the grain distribution along the lines are not in any way represented by the current models, which may impair line detection in the case of high background levels. In addition, currently the transformations in the image domain affects the image parameter distributions, such as brightness (corresponding to the energy loss in emulsion detector) and sharpness. Models with additional global and pertrack latent parameters without an a priori assigned meaning would naturally overcome these hurdles. Training them would require dealing with the shortcut problem in these parameters, and thus would benefit from employing adversarial framework [Szabo2018, Goodfellow2014]. While aim of this work was to carefully study the proposed approach in general, we leave this aspect to further studies.
We expect this approach to have a big potential in analysis and extraction of geometrical properties from image data. In further work we plan to adapt this technique to the location of tracks in 3D tomographic microscopy full resolution data, or data of the Liquid Argon Time Projection Chamber detectors [MicroBooNEcollaboration2018, Brailsford2018], which would be a direct extension of this approach. Also adding more samples with higher track number in the training dataset is expected to improve efficiency and resolution in cases with high track density.
While designed to detect simple line structures, this technique has a potential to be used for locating and parametrizing other objects, such as splines. This would pave the way to novel automated image vectorization techniques. Being fully unsupervised, this approach can leverage all available raw dataset with no extra work required.
Acknowledgement
The authors would like to thank Paolo Favaro and Attila Szabó for fruitful discussions.
Appendix A Affine transformations in latent representations
We compose the transformation functions in the parameter and image spaces as combination of rotation, scaling, skew, and shift. First, we define corresponding transformation operations of the track line parameters in the following way (index is omitted for brevity). Rotation:
where is the twoargument arctangent function. Scaling:
Skew:
Translation:
Rotation is performed around the coordinate origin; scaling and scaling leaves correspondingly and axes intact; skew and skew leaves correspondingly and axes intact.
The combined space transformation is then composed by consecutively applying these transformations:
In some models, we add an additional transformation of track parameters corresponding to translation invariance (TI) along the line. This is implemented as random shift of the parameters along the line:
We apply these transformations only to the enabled tracks according to the value of . For the disabled tracks the parameters are set to random values in the range enforcing decoder to learn to ignore disabled tracks:
where
and the sigmoid function
is applied to the activation parameter for the implementation reasons.For the model employing the representation, for implementation reasons we first transform the parameters to the representation as: , , ; ; ; , and then apply the shown above transformations. Afterwards the inverse transformation to the representation is applied.
The parameter set is drawn from uniform random distribution for each training sample on each iteration and is fed into the network along with the images. The input images of size 9696 pixels are cropped to 3232 as shown on Figure 1 and fed as the network input . The same input images are then elastically transformed with transformation function using same parameter set . The origin of transformations corresponds to pixel (48;48) on the input image, i.e. low bottom corner of the crop. After being cropped to 3232 pixels the images are used as network output target . We use larger 9696 input images to ensure that the final crop after transformation does not contain regions outside of the input image. Examples of the space transformations are shown in the Figure S1.
Comments
There are no comments yet.