I Introduction
Deep neural networks (DNNs) have been successfully used in a wide variety of tasks, such as regression, classification (e.g., in image or speech recognition Maas2017 ; Ciregan2012 ), and time-series analysis. They are known for being able to construct useful higher-level features from lower-level features in many applications. However, these feature representations frequently remain incomprehensible to humans. This property is one of the reasons why DNNs are not more widely used in physics, in which the approach to data exploration is usually drastically different.
Most systems studied in physics are well described by physical models, generally referred to as equations of motion. The experimental data are analysed with respect to a particular model. To do so, the equations of motion are solved analytically or numerically, yielding a theoretical description of the data-generating process. The resulting model generally includes a set of mathematical variables that can be adjusted to span the data. The true values of these variables are generally unknown and must be recovered. For that reason, we refer to them as latent parameters. The true latent parameters are approximated by comparing the data to the model, typically by fitting the model to the data. With this in mind, the ability of DNNs to find abstract representations of the data features rather than a quantitative generating process is generally seen by physicists as a limitation rather than an advantage. For that reason, DNNs are still often viewed as black boxes in physics and have only started to be used in the field in recent years Guest2018 .
We find this to be a missed opportunity for the physics community. With physical models at hand, one can generate arbitrarily large volumes of synthetic data to train the DNNs, and later process real-world signals tremblay2018training . This circumvents many challenges of supervised learning, during which DNNs are trained with data for which the true latent parameters need to be known (labeled data). Making full use of this possibility, DNNs were recently trained on synthetic nuclear magnetic resonance (NMR) spectroscopic data, simulated by accurate physical models Worswick2018 . The large amount of labeled data generated this way enables convergence of the DNN, which is then used to process real NMR data with great accuracy. A similar approach has become popular in robotics and autonomous driving.
Moreover, extensive work has been done to disentangle and make sense of DNN representations. A notable example is that of the variational autoencoder architecture Higgins2017 . Correlation loss penalties can also be used during DNN training, without prior knowledge of the data-generating process Steeg2017 ; Gao2020 . These methods consist of penalizing the DNN if its feature representation becomes entangled during training. In this way, the DNN is encouraged to produce an efficient or disentangled feature representation. Although disentangled, the representations achieved through these methods are not readily interpretable and usually require further analysis.
Nonetheless, DNNs are being increasingly used in physics data processing, in particular for signal classification, during which unusual datasets are flagged for further analysis. It was shown that autoencoders can effectively be trained on Large Hadron Collider particle-jet data to detect events or anomalies Farina2020 . In this instance, the DNN is able to increase the events’ signal-to-noise ratio by a factor . Other searches in high-energy physics, including Kuusela2012 ; DAgnolo2019, have recently been performed, also with the aim of detecting data displacement from a null hypothesis (no anomalies). All these searches seek to perform data analyses in a model-independent setting, that is, with minimal prior information or bias. More recently, DNNs have been applied to time-series processing in nano-NMR Aharon2019. In nano-NMR settings the noise model is complex and noise overpowers the weak signals, rendering standard data analyses inefficient. The DNN was tasked to classify signals (i.e., to discriminate between two frequencies) and outperformed full-Bayesian methods.
While often achieving great success, to our knowledge most applications of DNNs in physics are geared toward classification problems. In addition, DNNs are still rarely employed for time-series analyses, although time series are the most common form of data acquired during physics experiments. In this article, we propose to use a DNN to disentangle components of monochromatic, amplitude-modulated and frequency-modulated sine waves (the latter two referred to as AM- and FM-sine waves, respectively), arguably the most prevalent forms of time-domain signals in physics. The method yields performance similar to that of more standard analyses such as least-squares curve fitting (LS-fits), during which the data-generating process is assumed to be known and a least-squares regression is performed to predict the signal’s latent parameters.
LS-fits, however, require the user to input initial guesses for the latent parameters prior to regression. These initial guesses are the prior estimation of the true latent parameters and provide a starting point for the LS-fit gradient descent. The trained DNN, however, needs no initial guesses, thus requiring less prior information about the data-generating process. Indeed, we show that, precisely because DNNs find abstract data representations, they can be used in settings where prior knowledge exists but is not complete, as is particularly the case in “new-physics” searches Safronova2018 , thus leaving space for data exploration and discoveries.
The first part of this article describes the synthetic data that we generate and use throughout this work, i.e., monochromatic, AM- and FM-sine wave time series, and their relevance to real-world physics experiments. We then describe our DNN architecture, which incorporates two tasks. A regressor DNN performs a regression of the signal’s latent parameters that are known to be present in the data-generating process. (Throughout the paper, the term regressor is used in contrast to a classifier, not in the sense of an independent variable in a regression; it refers to the DNN predicting the signal’s latent parameters.) In addition, an autoencoder Hinton2006 denoises the signals by learning an approximation of the unknown latent parameters. As a benchmarking method, we evaluate the DNN by comparing its performance to an LS-fit with true initial guesses.
We later employ the DNN in realistic settings, when prior knowledge about the data-generating process is incomplete. LS-fit fidelity is typically highly sensitive to initial guesses, thus requiring the user to perform preprocessing work or to possess prior information in order to perform optimally. As a first application, we show that the DNN can be used to predict initial guesses for the model fit evaluation. While consistently converging to optimal solutions, the technique circumvents the usual difficulties arising from fitting signals, such as the need to explore initial guesses.
Next, we show that the DNN can be used when the user does not know whether the time series are monochromatic, AM- or FM-sine waves, but still wishes to recover their main frequency component. In such settings, the user is generally required to repeat the analysis by exploring the space of data-generating processes and initial guesses. Using our architecture enables the user to input only the known information when performing the analysis. That is, the regressor is tasked to recover the user-expected latent parameters while ignoring the existence of others. Because the autoencoder needs no prior information, it is still able to capture unknown information.
II Experimental methods
II.1 Data description and generation procedure
The time series studied throughout the article are exponentially decaying monochromatic, FM- and AM-sine waves. Gaussian noise is linearly added to the pure signals. An example of an FM-signal is shown in Fig. 1 (top) alongside its subcomponents (decaying carrier, frequency-modulation signal, and noise).
Decaying monochromatic sine waves are prevalent in all fields of physics. They arise from solving the equations of motion of the two-level quantum system, or of the classical harmonic oscillator, to which a multitude of other physical systems can be mathematically reduced. Well-known examples include the spin-1/2 particle in a DC magnetic field, the orbital motion of planets, or RLC circuits. In information theory, the two-level quantum system also provides a complete description of the qubit. Frequency and amplitude modulation generally arise from external factors such as oscillating magnetic or electric fields applied by the experimenters. Amplitude and frequency modulation of a carrier frequency are also the most common schemes in information communication links. Some form of Gaussian noise, while not necessarily always dominant, is in general present in every real-world signal. The statistical Gaussian noise formalism provides an accurate description of electronic thermal noise, quantum shot noise, black-body radiation, and of white noise in general.
All time series used throughout the article are s long, sampled once per second. The latent parameters used to generate the monochromatic sine waves are the carrier frequency and phase, in addition to the coherence time. The AM- and FM-sine waves are generated by adding a modulation function to the carrier. The modulation function’s latent parameters are the modulation frequency and the modulation amplitude. Noise is linearly added to the pure signals by sampling a zero-mean Gaussian distribution whose standard deviation sets the noise level. The carrier amplitude is normalized such that the signal-to-noise ratio is solely given by the noise standard deviation. The mathematical descriptions of the monochromatic, AM- and FM-sine waves are given in the Supplementary Materials.
Before each sample generation, the latent parameters are randomly and uniformly sampled within their respective allowed range, also given in the Supplementary Materials. The frequency range ensures the carrier frequency remains well within the Fourier and Nyquist limits. The modulation amplitude range ensures the majority of the signal’s power remains in its first sidebands and carrier.
Despite requiring only latent parameters to generate the samples, these ranges enable a wide scope of functions to be realized. AM/FM-signals with minimum modulation amplitude reduce to decaying monochromatic sine waves and reach modulation with maximum modulation amplitude. The coherence time range is wide enough to span underdamped signals up to virtually non-decaying signals. These latent parameter ranges are wide enough to encompass many foreseeable real-world signals. A random selection of FM-signals with and without noise is shown in Fig. 1 (bottom), illustrating the richness of the data in a more qualitative manner.
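The generation procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the exact signal equations and parameter ranges are given in the Supplementary Materials and are not reproduced here, so the functional forms, the function name `make_signal`, the default length of 512 samples, and all parameter names are hypothetical.

```python
import numpy as np


def make_signal(kind, f_c, phi, tau, f_m=0.0, a_m=0.0, sigma=0.0,
                n_samples=512, rng=None):
    """Return (noisy, clean) decaying sine waves sampled once per second.

    Assumed forms (illustrative, not taken from the article):
      mono: exp(-t/tau) * sin(2*pi*f_c*t + phi)
      AM:   exp(-t/tau) * (1 + a_m*sin(2*pi*f_m*t)) * sin(2*pi*f_c*t + phi)
      FM:   exp(-t/tau) * sin(2*pi*f_c*t + phi + a_m*sin(2*pi*f_m*t))
    """
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(n_samples, dtype=float)          # one sample per second
    envelope = np.exp(-t / tau)                    # exponential decay
    carrier_phase = 2 * np.pi * f_c * t + phi
    modulation = a_m * np.sin(2 * np.pi * f_m * t)

    if kind == "mono":
        clean = envelope * np.sin(carrier_phase)
    elif kind == "AM":
        clean = envelope * (1.0 + modulation) * np.sin(carrier_phase)
    elif kind == "FM":
        clean = envelope * np.sin(carrier_phase + modulation)
    else:
        raise ValueError(f"unknown signal kind: {kind!r}")

    # Gaussian noise is added linearly; the carrier amplitude is 1,
    # so sigma alone sets the signal-to-noise ratio.
    noisy = clean + rng.normal(0.0, sigma, size=n_samples)
    return noisy, clean
```

With these assumed forms, setting the modulation amplitude `a_m` to zero makes the AM and FM signals coincide with the decaying monochromatic sine wave, which mirrors the limiting behaviour described in the text.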
The choice of studying monochromatic, AM- and FM-sine waves is not only motivated by their richness and prevalence in real-world physics experiments. Indeed, despite originating from different physical models and having different mathematical descriptions, the time series share similar visual features. As a result, within some range of parameters, even expert users could confuse the three generating processes. This is especially the case for weak modulations in the presence of noise, for which visual discrimination in the time or frequency domain (inspecting the spectrum) may be impossible. For all the reasons cited above, monochromatic, AM- and FM-sine waves appear to be good representative signals on which to perform our study. Nevertheless, the methods presented in this article can be applied to other types of signals as well.
Most DNN implementations require input and target data to be normalized so as to avoid exploding and vanishing gradients during training Hochreiter1998 ; NEURIPS2018_13f9896d . All signals and latent parameters are normalized to lie within the to range prior to the application of the DNN. The phase is mapped to two separate parameters so as to account for phase periodicity during loss computation, while keeping both targets properly normalized. All other latent parameters are normalized using their respective range.
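A minimal sketch of this normalization step is given below. It rests on two assumptions that are not confirmed by the text: that the target range is [0, 1], and that the two phase parameters are rescaled sine and cosine components. The function names are hypothetical.

```python
import numpy as np


def minmax_normalize(x, lo, hi):
    """Rescale values from the allowed range [lo, hi] to [0, 1] (assumed range)."""
    return (np.asarray(x, dtype=float) - lo) / (hi - lo)


def encode_phase(phi):
    """Map a phase to two targets in [0, 1] that are continuous across 2*pi.

    Assumption: sine/cosine components, shifted and scaled to [0, 1].
    """
    phi = np.asarray(phi, dtype=float)
    return 0.5 * (1.0 + np.sin(phi)), 0.5 * (1.0 + np.cos(phi))


def decode_phase(s, c):
    """Invert encode_phase, returning a phase in [0, 2*pi)."""
    angle = np.arctan2(2.0 * np.asarray(s) - 1.0, 2.0 * np.asarray(c) - 1.0)
    return angle % (2.0 * np.pi)
```

The point of such an encoding is that phases just below 2π and just above 0 produce nearly identical targets, so a mean-squared-error loss does not penalize the wrap-around.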
II.2 Deep neural network architecture
The latent-parameter regression and signal denoising are performed by two separate architectures described (in Python code) in the Supplementary Material.
Denoising is performed by an autoencoder architecture Hinton2006 composed of an encoder followed by a decoder. Noisy signals are first passed through the encoder. The encoder output layer has neurons and thus produces a compressed representation of the input signal. Following this step, the encoder output is passed through the decoder, which decompresses the signal to its original size. This type of encoder-decoder architecture, called an autoencoder, is widely used, inter alia, for data denoising Gondara2016 . As the encoder output dimension is smaller than the dimension of the input data, the encoder’s output layer acts as an information bottleneck, or more specifically as a dimensionality reduction, thus encouraging the network to capture relevant latent features while discarding noise or redundant information Hinton2006 .
Latent-parameter regression is also performed by passing the data through the encoder. The encoder output is then passed through a third DNN referred to as the regressor. The output dimension of the regressor is adjusted to the number of latent parameters that the regressor is tasked to detect.
The autoencoder and regressor are trained on identical sets of samples. The regressor’s target data consists of the latent parameters, and the autoencoder’s target data are the noiseless signals. For both, the loss function is the mean squared error (MSE). The optimized architectures, shown in the Supplementary Materials, achieve sufficient performance, while keeping the number of trainable parameters under million, so that training can be performed on a modern laptop GPU in under hours for a typical training session of training sets of samples, over epochs. Due to the number and characteristics of the instances, asymptotic loss is reached within a small number of epochs. In general, increasing the number of instances in the training set was more beneficial than increasing the number of epochs.
After refining the base encoder, decoder and regressor, we unify the three architectures into a single DNN such that the autoencoder and regressor share the same encoder. We find that unification is best achieved by merging them into a single DNN as depicted in Fig. 2. The encoder output is passed through the regressor, which predicts the signal’s latent parameters. The decoder input consists of a concatenation of the encoder and regressor outputs. The latent-parameter regression and signal-denoising losses are computed simultaneously. The loss used during backpropagation is computed as a weighted sum of the two losses as follows:
(1) 
where the hyperparameter is the bias adjustment between the two tasks.
This architecture has the advantage of enabling bias control via a single hyperparameter. Moreover, both networks are naturally trained at the same time rather than alternately, thus accelerating training approximately twofold and enabling high-momentum gradient optimizers.
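The data flow of the unified network and its weighted loss can be sketched at the level of array shapes. This is not the authors' implementation (which is given in the Supplementary Material): each sub-network is reduced to a single randomly initialised dense layer, the layer sizes are hypothetical, and the loss is assumed to be a convex combination `beta * L_reg + (1 - beta) * L_den`, where the symbol `beta` and this exact form are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_IN, N_CODE, N_LATENT = 512, 16, 4     # hypothetical layer sizes


def dense(n_in, n_out):
    """Random weights and zero biases standing in for a trained sub-network."""
    return rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_in, n_out)), np.zeros(n_out)


params = {
    "encoder": dense(N_IN, N_CODE),
    "regressor": dense(N_CODE, N_LATENT),
    "decoder": dense(N_CODE + N_LATENT, N_IN),   # takes encoder + regressor outputs
}


def forward(x, params):
    w, b = params["encoder"]
    code = np.tanh(x @ w + b)                    # bottleneck representation
    w, b = params["regressor"]
    latents = code @ w + b                       # predicted latent parameters
    w, b = params["decoder"]
    denoised = np.concatenate([code, latents], axis=1) @ w + b
    return latents, denoised


def weighted_loss(latents, denoised, true_latents, clean, beta):
    """Assumed form: beta * L_reg + (1 - beta) * L_den, both losses being MSEs."""
    l_reg = float(np.mean((latents - true_latents) ** 2))
    l_den = float(np.mean((denoised - clean) ** 2))
    return beta * l_reg + (1.0 - beta) * l_den, l_reg, l_den


# Toy batch of 8 signals with random stand-in targets.
x = rng.normal(size=(8, N_IN))
true_latents = rng.uniform(size=(8, N_LATENT))
clean = rng.normal(size=(8, N_IN))
latents, denoised = forward(x, params)
total, l_reg, l_den = weighted_loss(latents, denoised, true_latents, clean, beta=0.5)
```

Because a single scalar loss is backpropagated through the shared encoder, both tasks are updated in the same optimizer step, which is the property the text highlights.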
II.3 Training procedure
To illustrate the effect of the bias parameter, we train the unified DNN on identical FM-sine wave datasets with varying values of the bias hyperparameter. For this experiment, training is performed using training sets of randomly generated samples for epochs. Because the number of synthetic samples is large and the latent parameters are continuous random variables, overfitting (monitored using a validation set, unseen during training) was never an issue.
The performance of the trained DNN is evaluated using a test set of randomly generated FM-samples, which were unseen during training. Figure 3 shows the test-sample losses for the denoising (top) and regression (bottom) tasks after training. Setting the bias hyperparameter to one extreme fully biases training towards the denoising task, which then achieves its best performance while the parameter regression yields its worst results; the opposite extreme has the reverse effect. This behaviour is also observed in Fig. S2 in the Supplementary Materials, which shows the validation losses during training. The training curves show that extreme values of the bias hyperparameter prevent validation-loss improvement of the negatively biased task. Mid-range values enable both tasks to be learned simultaneously.
We find that the best values of the bias hyperparameter are those for which the initial weighted regression and denoising losses are within the same order of magnitude. As a result, determining a good value is a trivial task: a single forward pass is performed to obtain the initial values of the two losses. We then compute the bias hyperparameter such that the two initial weighted losses are of comparable magnitude. Regardless of the type of data (monochromatic, AM- and FM-samples), DNNs trained with this value achieve good overall performance (lowest weighted total loss) and little bias towards either task. This value is employed throughout the entire article. For all that follows, training is always performed using training sets of randomly generated samples for epochs. This training is always enough to reach asymptotic loss, while exhibiting no noticeable overfitting. Training can be performed on decaying monochromatic, AM- or FM-sine waves, or on a combination of all three processes.
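One way to turn this rule into a formula is sketched below. It assumes, without confirmation from the text, that the total loss is a convex combination `beta * L_reg + (1 - beta) * L_den` and that "balanced" means the two weighted terms are exactly equal after the first forward pass; the function name and the symbol `beta` are hypothetical.

```python
def balanced_beta(l_reg0, l_den0):
    """Return beta such that beta * l_reg0 == (1 - beta) * l_den0.

    l_reg0 and l_den0 are the regression and denoising losses measured on a
    single forward pass of the untrained network.
    """
    total = l_reg0 + l_den0
    if total <= 0.0:
        raise ValueError("initial losses must be positive")
    # Solving beta * l_reg0 = (1 - beta) * l_den0 for beta.
    return l_den0 / total
```

For example, initial losses of 0.3 (regression) and 0.1 (denoising) would give a weight of 0.25 on the regression term, so that both weighted terms start at 0.075.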
To illustrate the architecture’s output, we train the DNN on AM-sine waves and show an example of a prediction in Fig. 4 alongside the noisy input signal. The DNN outputs a denoised prediction of the noisy AM-sine wave and a prediction of the latent parameters used to generate the signal.
III Experimental results
III.1 Performance evaluation
As a first evaluation method, we train the DNN on a random selection of decaying monochromatic sine waves (no modulation). The training, validation, and test samples are generated using random frequency, phase, coherence time, and noise levels.
After training, we evaluate the DNN performance by comparing its prediction error to that of an LS-fit performed using the Python SciPy library. When performing the LS-fit, the input data are the noisy signals and the objective function is the noiseless data-generating process. The LS-fit requires initial guesses for the latent parameters to start the optimization. The initial guesses used here are the true latent parameters (i.e., true frequency, phase, and coherence time). After optimization, the LS-fit outputs an estimation of the latent parameters, from which we generate a prediction of the noiseless signals. This is done by inputting the LS-fit latent-parameter predictions into the data-generating process. The LS-fit and DNN performance are then compared in two ways: (i) the latent-parameter regression loss is the MSE from the true latent parameters for both the LS-fit and DNN, and (ii) the denoising error is the MSE from the true noiseless signals for both the LS-fit and DNN. Note that this comparison drastically favors the LS-fit, which therefore constitutes a good benchmark method. Indeed, in any practical application the true values of the latent parameters are hidden from the user, and LS-fits are employed precisely to approximate them.
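The benchmark described above can be sketched with `scipy.optimize.curve_fit`. The article does not state which SciPy routine was used, so this choice is an assumption, as are the model form `decaying_sine`, the helper name `ls_fit_benchmark`, and the toy parameter values. Note also that the parameter error here is computed on raw rather than normalized parameters.

```python
import numpy as np
from scipy.optimize import curve_fit


def decaying_sine(t, f_c, phi, tau):
    """Assumed noiseless model: unit-amplitude, exponentially decaying sine."""
    return np.exp(-t / tau) * np.sin(2 * np.pi * f_c * t + phi)


def ls_fit_benchmark(t, noisy, clean, true_params):
    """LS-fit started from the true latent parameters (best-case benchmark)."""
    popt, _ = curve_fit(decaying_sine, t, noisy, p0=true_params)
    prediction = decaying_sine(t, *popt)          # denoised signal from the fit
    mse_params = float(np.mean((popt - np.asarray(true_params)) ** 2))
    mse_signal = float(np.mean((prediction - clean) ** 2))
    return popt, mse_params, mse_signal


# Toy example: frequency, phase, coherence time, and noise level are hypothetical.
rng = np.random.default_rng(1)
t = np.arange(512, dtype=float)
true_params = (0.05, 0.7, 300.0)
clean = decaying_sine(t, *true_params)
noisy = clean + rng.normal(0.0, 0.2, size=t.size)
popt, mse_params, mse_signal = ls_fit_benchmark(t, noisy, clean, true_params)
```

The same two error measures would be computed for the DNN outputs so that both methods are compared on identical test signals.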
A random selection of noisy signals from the test set is processed using this method. Figure 5 shows the regression and denoising errors for both the DNN and LS-fit sorted by noise level (examples of signals with extreme noise levels, alongside LS-fit and DNN predictions, are shown in Fig. S1 in the Supplementary Materials).
A similar evaluation is performed using AM-samples. In this experiment, the DNN is specifically trained on AM-samples. The test samples are all generated using identical latent parameters, while the noise level is increased. Examples of such samples are given in Fig. S1 in the Supplementary Materials. Figure 6 shows the prediction errors of all samples, for both the DNN and LS-fit, sorted by noise level.
For both monochromatic and AM-signals, the DNN generally performs worse than the LS-fit for low-noise signals. However, the DNN reaches the LS-fit performance level once the noise reaches the top half of the allowed range (corresponding to a noise level before normalization), while requiring no initial guesses. The latent-parameter regression follows a similar trend. We note that, in general, DNN outputs are less sensitive to noise, and the performance is more consistent throughout both datasets.
These results show that our architecture is a good alternative to LS-fits for time-series analysis, as it reaches acceptable performance when benchmarked against standard LS-fits with true guesses, while needing no initial guesses.
III.2 DNN-assisted LS-fit
We now wish to apply our DNN in more realistic settings. Fitting oscillating time series using LS-fits is notoriously difficult because the MSE is in general a non-convex function of the latent parameters and possesses numerous local minima. Consequently, the quality of the LS-fit is highly dependent on the initial guesses in addition to the noise. In the previous experiments, LS-fits were only performed as a benchmark method, and the initial guesses were the true latent parameters. In any real-world setting, the user must perform some preprocessing work or use prior information to find initial guesses leading to the global minimum. In this section, we propose to employ the DNN as a preprocessing tool to assist the LS-fit in the situation where the user possesses no prior information about the initial guesses and wishes to recover the signal’s latent parameters. The sine wave samples from the previous experiment are fitted using the DNN latent predictions as initial guesses. Results of this experiment are shown in Fig. 7 alongside the results of LS-fits with true initial guesses.
Because the DNN predictions are always within the vicinity of the true parameters, almost all DNN-assisted LS-fits converge to optimal solutions. In settings where the initial guesses are unknown or samples are numerous, the user can initially train the DNN on synthetic data and use it for DNN-assisted fits. As the latter perform optimally regardless of the noise level, this enables fast and accurate analysis of large datasets by removing the need to explore initial guesses.
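The DNN-assisted fit amounts to replacing the user-supplied initial guesses with the network's latent predictions. The sketch below illustrates that hand-off only: the trained DNN is replaced by a hypothetical stand-in (`stand_in_dnn`) that returns values near the truth, and the model form, `scipy.optimize.curve_fit`, and all numbers are assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit


def decaying_sine(t, f_c, phi, tau):
    """Assumed noiseless model: unit-amplitude, exponentially decaying sine."""
    return np.exp(-t / tau) * np.sin(2 * np.pi * f_c * t + phi)


def assisted_ls_fit(t, noisy, predict_latents):
    """Run an LS-fit whose initial guesses come from a latent-parameter predictor."""
    p0 = predict_latents(noisy)                   # e.g. the regressor output
    popt, _ = curve_fit(decaying_sine, t, noisy, p0=p0, maxfev=10000)
    return popt


def stand_in_dnn(_signal):
    """Hypothetical placeholder for a trained DNN: guesses close to the truth."""
    return (0.0502, 0.6, 250.0)


rng = np.random.default_rng(2)
t = np.arange(512, dtype=float)
clean = decaying_sine(t, 0.05, 0.7, 300.0)
noisy = clean + rng.normal(0.0, 0.2, size=t.size)
popt = assisted_ls_fit(t, noisy, stand_in_dnn)
fit_mse = float(np.mean((decaying_sine(t, *popt) - clean) ** 2))
```

In practice `predict_latents` would wrap the trained network together with the inverse of whatever normalization was applied to the latent parameters.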
III.3 Partial information regression and denoising
In the experiments presented above, the data-generating process was assumed to be fully known by the user. The DNN or DNN-assisted LS-fits were employed to recover the signal latent parameters and denoise the signal. We now wish to explore the possibility of employing the DNN in a situation where the data-generating processes to be explored are manifold and guesses must be made. This is typically the case in “new-physics search” experiments Safronova2018 , during which hypothetical and undiscovered particles may cause signals deviating from the null hypothesis (i.e., no new particles). As the hypothetical particles are numerous, they may have many potential effects on the signals. We consider the situation in which a potential external source could modulate a carrier signal produced by the experiment, as is sometimes the case for bosonic dark matter Garcon2019 .
Specifically, we study the case in which the end user is aware of the existence of an oscillation in the signal provided by the experimental setup. The user does not know whether the signal is monochromatic, amplitude-modulated or frequency-modulated. Nonetheless, the user wishes to recover the frequency, phase, and coherence time of the expected oscillation.
In this situation, the typical approach is to test all allowed processes by varying the LS-fit objective function and exploring the space of initial guesses for each process. This approach presents a new set of challenges, as this exploration is time consuming and sometimes unrealistic if the dataset is too large or if too many processes are to be tested. Moreover, in some situations, all guesses can be wrong.
We show that it is possible to perform the regression and denoising with partial prior information about the physical process producing the data. That is, the DNN is tasked to perform the regression only on the narrow set of latent parameters that exist across all models: frequency, phase, coherence time, and noise level. The regression, however, disregards any form of modulation. This is done by decreasing the number of neurons in the regressor’s output layer. The DNN is then trained on signals from every explored model (monochromatic, AM and FM). We now refer to this DNN as the partial DNN (one that ignores the existence of any particular modulation type).
After training, we compare the performance of the partial DNN to a specialized DNN, trained specifically on AM-signals, which performs a regression of all latent parameters. Figure 8 shows the regression and denoising errors averaged over the AM-sine wave test set ( samples) for both the AM-specialized DNN and the partial DNN. The denoising task reaches the same level of precision as that of the specialized DNN. Moreover, the estimation of the carrier frequency, phase, and coherence time reaches performance similar to that of the specialized DNN.
Using this method, the user’s prior information is encoded into the architecture and training data. The regressor then captures the expected latent parameters, thus removing the need to iteratively explore models. The encoder and decoder remain unchanged and are still able to capture unknown latent parameters by reproducing noiseless signals. This method enables partial prior information to be employed, while leaving space for signal exploration and unexpected discoveries.
IV Conclusion & Outlook
We have presented an efficient DNN that combines the denoising of time series and regression of their latent parameters. The DNN was trained and evaluated on synthetic monochromatic, frequency-modulated and amplitude-modulated decaying sine waves with Gaussian noise, which are some of the most prevalent forms of signals acquired in physics.
For high-noise signals, the DNN reaches the same level of precision as an LS-fit with true initial guesses, in spite of the DNN needing no guesses at all. In addition, the architecture requires no hyperparameter fine-tuning to perform consistently. Moreover, because large volumes of synthetic training data can be generated, the DNN is quickly adaptable to a broad range of physical signals. This makes our architecture a good alternative to LS-fits for analysing large volumes of data, when fitting individual signals requires too much computation or user time.
The DNN architecture is flexible and can accommodate various levels of user prior information. First, the DNN was used to assist LS-fits by predicting initial guesses unknown to the user. In this situation, DNN-assisted LS-fits consistently converge to the optimal solutions. Moreover, the regression task can be adapted to accommodate partial prior information about the data-generating process. The known latent parameters are encoded in the regressor and training data, while the decoder helps the encoder to still capture unknown signal features, thus leaving space for data exploration and discoveries.
Because training is done on arbitrarily large volumes of synthetic data, raw performance could be improved by increasing the number of trainable parameters, for instance by adding more layers or neurons, without too much concern for overfitting. The architecture itself could be augmented by adding an upstream classifier DNN module, which could identify the type of signals being analyzed. Classified signals could then be processed via specialized versions of our architecture, trained on the corresponding type of signals.
Time-domain oscillations generally appear as peaks or peak multiplets in frequency-domain spectra. Frequency, amplitude, and phase information is then localized to narrow regions of the spectral data. For that reason, we believe further improvements could be attained by making use of frequency-domain information. We suggest using Fourier transforms or power spectra as DNN inputs, in addition to the raw time series.
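As an illustration of this suggestion, the sketch below appends a normalized power spectrum to the raw time series to form a single input vector. It is only one possible realization: the function name `with_power_spectrum`, the peak normalization, and the toy signal are assumptions and were not evaluated in this work.

```python
import numpy as np


def with_power_spectrum(signal):
    """Concatenate a time series with its peak-normalized one-sided power spectrum."""
    signal = np.asarray(signal, dtype=float)
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    peak = spectrum.max()
    if peak > 0.0:
        spectrum = spectrum / peak                # keep the spectrum within [0, 1]
    return np.concatenate([signal, spectrum])


# Toy decaying sine wave sampled once per second (hypothetical parameters).
t = np.arange(512, dtype=float)
x = np.exp(-t / 300.0) * np.sin(2 * np.pi * 0.125 * t)
features = with_power_spectrum(x)

# The spectral part of the feature vector peaks at the carrier frequency.
freqs = np.fft.rfftfreq(t.size, d=1.0)
peak_frequency = freqs[np.argmax(features[t.size:])]
```

For a real-valued series of 512 samples, `np.fft.rfft` returns 257 frequency bins, so the combined input has 769 entries and the encoder input layer would need to be resized accordingly.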
The proposed DNN architecture can be used to detect and approximate hidden features in time-series data. The regressor outputs a prediction of the parameters known a priori, but real signals could still contain unknown latent variables. These hidden latent variables can be detected and approximated by our DNN, as it also incorporates an autoencoder-like structure. As such, the bottleneck layer contains a feature representation of the time series, used by the decoder to recreate the original signal. This bottleneck layer will be further investigated in order to detect and specify hidden latent parameters.
We remain aware that in physics data analysis, an estimation of latent parameters alone often provides insufficient information. Standard analysis usually requires a quantitative estimation of the prediction uncertainty, often represented as error bars or confidence intervals. In LS-fits, this uncertainty is naturally obtained by maximizing the fit likelihood under the assumption of Gaussian-distributed latent variables Bishop2011chap1p29 . Despite extensive efforts, DNNs still lack the capacity for reliable uncertainty evaluation Kasiviswanathan2017 ; Kabir2018 ; Ding2020 and more work needs to be performed in this area to further generalize DNN usage in physics signal processing.
Nonetheless, we believe this architecture is readily applicable to existing physics experiments, in particular bosonic dark-matter searches, in which large quantities of data are to be analyzed with partial prior information.
References
 (1) A. L. Maas, P. Qi, Z. Xie, A. Y. Hannun, C. T. Lengerich, D. Jurafsky, and A. Y. Ng, “Building DNN acoustic models for large vocabulary speech recognition,” Computer Speech and Language, vol. 41, pp. 195–213, 2017.

(2)
D. Ciregan, U. Meier, and J. Schmidhuber, “Multicolumn deep neural networks
for image classification,”
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
, no. February, pp. 3642–3649, 2012. 
(3)
D. Guest, K. Cranmer, and D. Whiteson, “Deep learning and its application to LHC physics,”
Annual Review of Nuclear and Particle Science, vol. 68, pp. 161–181, 2018.  (4) J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. Birchfield, “Training deep networks with synthetic data: Bridging the reality gap by domain randomization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 969–977, 2018.
 (5) S. G. Worswick, J. A. Spencer, G. Jeschke, and I. Kuprov, “Deep neural network processing of DEER data,” Science Advances, vol. 4, no. 8, pp. 1–18, 2018.
 (6) “ΒVAE: Learning basic visual concepts with a constrained variational framework,” 5th International Conference on Learning Representations, ICLR 2017  Conference Track Proceedings, pp. 1–22, 2017.

(7)
G. V. Steeg, “Unsupervised Learning via Total Correlation Explanation,”
arXiv, pp. 5151–5155, 2017. 
(8)
S. Gao, R. Brekelmans, G. Ver Steeg, and A. Galstyan, “Autoencoding total
correlation explanation,”
AISTATS 2019  22nd International Conference on Artificial Intelligence and Statistics
, 2020.  (9) M. Farina, Y. Nakai, and D. Shih, “Searching for new physics with deep autoencoders,” Physical Review D, vol. 101, no. 7, p. 75021, 2020.

(10)
M. Kuusela, T. Vatanen, E. Malmi, T. Raiko, T. Aaltonen, and Y. Nagai, “Semisupervised anomaly detection  Towards modelindependent searches of new physics,”
Journal of Physics: Conference Series, vol. 368, no. 1, 2012.  (11) R. T. D’Agnolo and A. Wulzer, “Learning new physics from a machine,” Physical Review D, vol. 99, no. 1, pp. 1–34, 2019.
 (12) N. Aharon, A. Rotem, L. P. McGuinness, F. Jelezko, A. Retzker, and Z. Ringel, “NV center based nanoNMR enhanced by deep learning,” Scientific Reports, vol. 9, no. 1, pp. 1–11, 2019.
 (13) M. S. Safronova, D. Budker, D. Demille, D. F. Kimball, A. Derevianko, and C. W. Clark, “Search for new physics with atoms and molecules,” Reviews of Modern Physics, vol. 90, no. 2, 2018.
 (14) G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.

(15)
S. Hochreiter, “The vanishing gradient problem during learning recurrent neural nets and problem solutions,”
International Journal of Uncertainty, Fuzziness and KnowlegeBased Systems, vol. 6, no. 2, pp. 107–116, 1998.  (16) B. Hanin, “Which neural net architectures give rise to exploding and vanishing gradients?,” in Advances in Neural Information Processing Systems (S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, eds.), vol. 31, pp. 582–591, Curran Associates, Inc., 2018.

(17) L. Gondara, “Medical Image Denoising Using Convolutional Denoising Autoencoders,” IEEE International Conference on Data Mining Workshops, ICDMW, vol. 0, pp. 241–246, 2016.
(18) A. Garcon, J. W. Blanchard, G. P. Centers, N. L. Figueroa, P. W. Graham, D. F. Jackson Kimball, S. Rajendran, A. O. Sushkov, Y. V. Stadnik, A. Wickenbrock, T. Wu, and D. Budker, “Constraints on bosonic dark matter from ultralow-field nuclear magnetic resonance,” Science Advances, vol. 5, no. 10, 2019.

(19) C. M. Bishop, Pattern Recognition and Machine Learning - Chapter 1. Springer, 2011.
(20) K. S. Kasiviswanathan and K. P. Sudheer, “Methods used for quantifying the prediction uncertainty of artificial neural network based hydrologic models,” Stochastic Environmental Research and Risk Assessment, vol. 31, no. 7, pp. 1659–1670, 2017.
(21) H. M. Kabir, A. Khosravi, M. A. Hosen, and S. Nahavandi, “Neural Network-Based Uncertainty Quantification: A Survey of Methodologies and Applications,” IEEE Access, vol. 6, no. c, pp. 36218–36234, 2018.
(22) Y. Ding, J. Liu, J. Xiong, and Y. Shi, “Revisiting the evaluation of uncertainty estimation and its application to explore model complexity-uncertainty trade-off,” IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, vol. 2020-June, pp. 22–31, 2020.
Acknowledgement
Acknowledgments: The authors wish to thank Lizon Tijus for figure rendering. Funding: This work was supported in part by the Cluster of Excellence PRISMA+ funded by the German Research Foundation (DFG) within the German Excellence Strategy (Project ID 39083149), by the European Research Council (ERC) under the European Union Horizon 2020 research and innovation program (project DarkOST, grant agreement No 695405), and by the DFG Reinhart Koselleck project. A.G. acknowledges funding from the Emergent AI Center funded by the Carl-Zeiss-Stiftung. Competing interests: All authors have read and contributed to the final form of the manuscript and declare that they have no competing interests. Data and materials availability: All data needed to evaluate the conclusions of this article are present in the article and/or the Supplementary Materials. Additional data and source code related to this article may be requested from the authors.
Appendix A Supplementary information
A.1 Data generation procedure
The time series used throughout the article are generated by propagating the time, , from to (with length ) in s increments, and using the following formula:
(1)  
(2)  
(3) 
where and are the sine wave carrier frequency and phase, respectively. and are the modulation frequency and amplitude. The noise is sampled from the Gaussian distribution with zero mean and standard deviation . and are the first and second Bessel functions of the first kind, respectively. Before each sample generation, the latent parameters are randomly and uniformly sampled within the following ranges:
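As a rough illustration of the procedure described above, the following Python sketch generates one noisy frequency-modulated time series using only the standard library. It is a sketch under stated assumptions, not the authors' code: the sideband form of the signal, the use of the modulation amplitude as the argument of the Bessel functions, the default values of `n_points`, `dt` and `noise_std`, and all sampling ranges are hypothetical placeholders, as are the helper names `bessel_j` and `generate_sample`.

```python
import math
import random


def bessel_j(order, x, terms=30):
    """Bessel function of the first kind, J_order(x), from its power series."""
    return sum(
        (-1) ** k
        / (math.factorial(k) * math.factorial(k + order))
        * (x / 2.0) ** (2 * k + order)
        for k in range(terms)
    )


def generate_sample(n_points=512, dt=1.0, noise_std=0.1, rng=random):
    """Return (signal, latent) for one synthetic time series.

    All ranges below are hypothetical; they stand in for the ranges
    used in the article, which are not reproduced here.
    """
    latent = {
        "f_c": rng.uniform(0.05, 0.2),          # carrier frequency (assumed range)
        "phase": rng.uniform(0.0, 2 * math.pi),  # carrier phase (assumed range)
        "f_m": rng.uniform(0.001, 0.01),         # modulation frequency (assumed range)
        "a_m": rng.uniform(0.0, 1.0),            # modulation amplitude (assumed range)
    }
    # Assumption: the modulation amplitude is used directly as the
    # argument of the Bessel functions (i.e. as a modulation index).
    j0 = bessel_j(0, latent["a_m"])
    j1 = bessel_j(1, latent["a_m"])

    signal = []
    for i in range(n_points):
        t = i * dt
        carrier = math.sin(2 * math.pi * latent["f_c"] * t + latent["phase"])
        upper = math.sin(2 * math.pi * (latent["f_c"] + latent["f_m"]) * t + latent["phase"])
        lower = math.sin(2 * math.pi * (latent["f_c"] - latent["f_m"]) * t + latent["phase"])
        noise = rng.gauss(0.0, noise_std)  # zero-mean Gaussian noise
        # Carrier plus first-order sidebands, weighted by J0 and J1.
        signal.append(j0 * carrier + j1 * (upper - lower) + noise)
    return signal, latent
```

Calling `generate_sample(rng=random.Random(0))` returns the time series together with the latent parameters that produced it, which is the labeled pair needed for supervised training on synthetic data.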
A.2 DNN architecture implementation
The composed of − layers, followed by layers. The output layer has neurons. The is composed of followed by a and layers. The output dimension of the is adjusted to the number of latent parameters that the is tasked to detect. The is composed of [] layers, followed by a single layer. The consists of a concatenation of the and outputs.
All activation functions are rectified linear units, with the exception of the and outputs, which are linear and sigmoid functions, respectively. Pseudocode architectures of the , , , and final DNN, as implemented in Python (Keras/TensorFlow), are given below in addition to the custom weighted-loss function.
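Independently of the Keras implementation, the data flow described in this section can be illustrated in plain Python. The sketch below assumes a shared rectified-linear feature stage feeding two heads, one with a linear output (one value per latent parameter) and one with a sigmoid output (a reconstructed signal), whose outputs are concatenated. The names `features`, `latent_head` and `signal_head`, the choice to feed both heads from the same features, the single-layer depth of each stage, the layer sizes and the random weights are all hypothetical; the actual layer types and counts are those of the article's implementation.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def linear(z):
    return z


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_dense(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer (assumed init)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(inputs, layer, activation):
    """Apply activation(W @ inputs + b) using plain Python lists."""
    weights, biases = layer
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def build_model(n_signal=16, n_features=8, n_latent=4, seed=0):
    """Create the three hypothetical stages with illustrative sizes."""
    rng = random.Random(seed)
    return {
        "features": init_dense(n_signal, n_features, rng),
        "latent_head": init_dense(n_features, n_latent, rng),
        "signal_head": init_dense(n_features, n_signal, rng),
    }


def forward(model, signal):
    """Return the concatenation [latent estimates, reconstructed signal]."""
    hidden = dense(signal, model["features"], relu)                 # ReLU hidden stage
    latent = dense(hidden, model["latent_head"], linear)            # linear output
    reconstruction = dense(hidden, model["signal_head"], sigmoid)   # sigmoid output
    return latent + reconstruction
```

With the default sizes, `forward(build_model(), [0.1] * 16)` returns a list of 20 values: the first 4 are unbounded latent estimates and the remaining 16 lie strictly between 0 and 1, which is why a sigmoid output presupposes signals scaled to that interval.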