Practical Electronic Warfare (EW) systems are often operated in environments where multiple signal sources exist simultaneously. Signal patterns from different emitters are received via the same channel of the receiver in time order, hence obtaining an interleaved signal sequence. For robust classification of radar signals, obtaining the individual sequences from emitters (i.e., deinterleaving, emitter separation or signal sorting) is essential [1, 2].
The first degree attributes of radar pulse streams are time of arrival (ToA), direction of arrival (DoA), pulse amplitude (PA), pulse width (PW) and radio frequency (RF), which are stored in pulse description words (PDWs) [1, 2]. DoA, PA, PW and RF may be missing or follow unpredictable irregular patterns [3, 4] though clustering the pulses with respect those attributes seem to be appealing. On the other hand, radar signals usually have characteristic pulse repetition interval (PRI) which is the first difference of the ToA sequence. Following the existing efforts [3, 4, 5, 6], we exploit only ToA measures to deinterleave radar signals .
Prior conventional methods construct histograms from ToA differences to capture the periodicity of the pulses [3, 4]. While the cumulative difference histogram (CDIF)  accumulates histogram values for each difference level of the ToA sequence in a single iteration, the sequential difference histogram (SDIF) 
computes the histogram values simultaneously for each difference order. Following that, they estimate PRI values of radar signals from calculated histogram by using heuristic threshold functions and sequentially search and cluster pulses, which follows estimated PRI regimes to form radar signals. Since those approaches are sensitive for sub-harmonics, PRI transform (PRIT) proposes to use a complex-valued autocorrelation-like integral instead of calculating histograms with the assumption that the ToA sequence is quite large. Although those conventional algorithms work well for simple PRI modulation types, they fail to succeed in scenarios where largely missing pulses, highly jittered PRI values and staggered PRI modulation modes exist.
are proposed to deinterleave radar signals which have complex PRI modulation types. However, their limitation to specific scenarios prevents generalizing to problems of different nature. Furthermore, machine learning techniques have been applied to solve the deinterleaving task. For instance, fuzzy adaptive resonance theory (ART) is utilized in to cluster pulse streams by exploiting RF, PW and DoA features before using one of the histogram based deinterleaving methods. Moreover, 
utilizes recurrent neural networks (RNNs) to capture the current context of the pulse stream and predict attributes of the next PDWs. Yet, RNNs classify PDWs from predefined radar list, hence, it identifies pulses rather than solving the deinterleaving problem. Finally, denoising pulse streams with autoencoders is visited to improve the deinterleaving solution.
We address alleviating the limitations of the aforementioned approaches through a novel end-to-end trainable clustering framework. Starting from linking deinterleaving to min-cost flow problem, we develop a bi-level optimization problem and relate the solver of the lower problem to the residual networks, yielding a novel unified self-attention based framework enabling to learn clustering arbitrary number of patterns in signals.
2.1 Min-cost Flow Formulation of Signal Clustering
We consider deinterleaving of PRI patterns using relative time of arrival (RToA) measures though our formulation is general for deinterleaving in time series. RToA is computed from ToA differences: for with .
We have RToAs, , of mixed signals. Our aim is to divide into disjoint sets, , containing the RToAs of the distinct sources. We encode such sets by the ground truth assignment matrix, , where tells is preceding in a distinct pattern, i.e., and as .
We formulate deinterleaving problem as a min-cost flow problem on a directed graph. The vertices of the graph are RToAs which we denote as and the edges are the possible connections among RToAs. We augment the graph with a start and a terminal vertex, , to help formulation. We associate edges with costs, , and mean to find min-cost flows from some starting vertices to the terminal vertices. Such flows will be the deinterleaved signals. Formally, we consider the problem:
to find an assignment matrix, , from which we can recover the flows. The constraints avoid one-to-many or many-to-one connections (1&2) and preserve flow continuity (3). Total unimodularity of the constraints ensures integer solutions for , meaning that we have a binary assignment matrix, , which encodes the linking of the flows. We note that the number of sources, , can be arbitrary; since, no explicit dependency on exists in the formulation.
2.2 Learning Association Costs and Soft Min-cost Flow
The optimality of the computed flows are based on the edge costs, . Ideally, given as the ground truth assignments, would yield the flows as the desired deinterleaved signals. In practice, we have no information of .
We are to learn association costs from some training data, , with known assignments, . To formulate our learning framework, we express the cost matrix as a parametric function of with parameters to be learned as where we drop the dependency on for simplicity. Then the problem becomes a bi-level optimization problem:
where and are the flow constraints in (P1
). The Jacobian of the linear program (LP)must be computed to perform gradient updates for . This can be achieved by transforming LP into an unconstraint problem with unique solution and using implicit differentiation . Equality constraints can be eliminated by expressing where is any solution to and is the basis for the null space of the constraints . The inequality constraints can be relaxed by introducing strictly convex penalty terms as in . Hence, the relaxed unconstrained problem, , has a unique solution to be obtained by gradient updates over as:
where is the iteration number. For a solution obtained at the iteration, , computing the partial derivatives for
can be seen as applying chain rule recursively ons: , where we deliberately omit dependency on
for simplicity. To this end, with the abuse of the gradient expression, we consider each update as a non-linear transform ofadded to itself:
where is realized by a stack of bidirectional transformer encoder  layers. We omit start node, , and related constraints from the formulation and embed flow continuity in (C3.3). We explicitly introduce flow constraints to the learning problem in terms of . Thus, the constraints are no longer linear nor convex. However, such a formulation allows us to use supervision to learn solving min-cost flow problem with proper costs inherently. Hence, no explicit LP solver to compute assignment matrix is required in practice.
We solve (P3) by relaxing the constraints with proper penalty terms to obtain an unconstrained problem. As we will show in Sec. 2.3, (C3.1) is already satisfied by design. We use hinge loss for (C3.2),
and squared L2 norm penalty for (C3.3),
For the binary constraint (C3.4), we exploit the fact always and equality holds when
is a one-hot vector.
where is the i row of . Hence, total loss becomes:
s are the penalty coefficient hyperparameters. Once the parameters,are learned by , distinct patterns in a sequence, , can be clustered using computed flows.
2.3 Soft Min-cost Flow Architecture: p(x;)
We exploit stack of bidirectional transformer encoder (BTE)  layers to model owing to the success of the architectures [19, 18] equipped with such layers in encoding the high order relations among sequential data. A BTE layer is a stack of a self-attention and a feed-forward layer each of which contains a non-linear function and a residual connection. Thus, such layers meet our intuition for expressing a solver in terms of stacked residual units as in Eqn. (2.2).
Self-attention layers expresses the representation of each sample in the sequence as the convex combination of the representation of the all samples. The mixing weights are according to the relative similarities of the representations. Thus, stack of such layers hierarchically encodes higher order relations among the samples of the sequential data.
In our architecture which is depicted in Fig. 1, we first represent each sample, , in the sequence with a -dimensional vector. We normalize the input sequence, , and then assign each sample, , to some token with embedding according to quantization of the interval, . One can also use a neural network to embed s to ; however, we empirically found that the former works well in practice and simpler. We then perform relative position encoding as in  to emphasize ordering in time. These encoded representations are then fed to the stack of BTE layers to yield transformed representations, , which are enhanced with the higher order relations. We apply affine transformation followed by soft-max to these representations: where we consider columns of as the decision representatives which tell to which the corresponding sample is to be linked . Thus, is a proxy of the similarity between and the decision representative. Large means
is probably linked to. Finally, soft-max of reads the assignment matrix and note that the constraint (C3.1) in (P3) is readily satisfied with this formulation: .
Though using fixed decision representatives brings the limitation of processing fixed-length sequence, this limitation can be alleviated during inference by processing the sequences in overlapping windows. We empirically found that interaction of the samples too far from each other can be neglected. Thus, fixed-length training of appropriate size achieves good generalization performance.
In general, we have soft assignments in our estimated assignment matrix, . We compute a binary assignment matrix from by solving (P1) with as the cost matrix. We also experimented greedily performing assignments starting from the highest . We provide the comparisons in Sec. 3.3.
For the sequences whose lengths exceed the model capacity, we take overlapping segments of the sequence and perform assignments. We take the most recent assignment for a sample in case of multiple assignments due to overlapping.
3 Experimental Work
|Dataset - / /||CDIF||SDIF||PRIT||Baseline + Greedy||Baseline + LP||SMCF + Greedy||SMCF + LP|
|Validation - Case 1||0.841 / 0.605 / 0.000||0.850 / 0.614 / 0.000||0.982 / 0.631 / 0.000||0.975 / 0.940 / 0.013||0.963 / 0.966 / 0.000||0.991 / 0.998 / 0.001||0.991 / 0.998 / 0.000|
|Validation - Case 2||0.519 / 0.312 / 0.000||0.518 / 0.309 / 0.000||0.631 / 0.538 / 0.000||0.894 / 0.758 / 0.201||0.885 / 0.764 / 0.000||0.945 / 0.996 / 0.004||0.943 / 0.994 / 0.000|
|Validation - Case 3||0.352 / 0.154 / 0.000||0.362 / 0.157 / 0.000||0.579 / 0.437 / 0.000||0.858 / 0.623 / 7.477||0.856 / 0.627 / 0.000||0.912 / 0.950 / 0.025||0.915 / 0.954 / 0.000|
|Validation - Case 4||0.679 / 0.480 / 0.000||0.683 / 0.481 / 0.000||0.768 / 0.545 / 0.000||0.956 / 0.977 / 0.011||0.955 / 0.977 / 0.000||0.991 / 0.999 / 0.002||0.990 / 0.998 / 0.000|
|Validation - Case 5||0.548 / 0.370 / 0.000||0.550 / 0.364 / 0.000||0.642 / 0.496 / 0.000||0.901 / 0.836 / 2.736||0.900 / 0.841 / 0.000||0.929 / 0.975 / 0.015||0.929 / 0.975 / 0.000|
|Test||0.304 / 0.147 / 0.000||0.311 / 0.148 / 0.000||0.564 / 0.436 / 0.000||0.842 / 0.539 / 11.943||0.823 / 0.555 / 0.000||0.903 / 0.883 / 0.098||0.898 / 0.886 / 0.000|
. To produce realistic data, we select the maximum standard deviation of constant, jitter, constant stagger, random stagger and switch&dwell for ToA are +/-1%, +/-15%, +/-40%, +/-40% and +/-40%, respectively. Since normalization occurs in PRI as pre-processing of the algorithm, we choose PRI values from pre-defined range which is between 1 and 1000s. Stagger level can vary from 2 to 9. We sample switch&dwell repetition parameter (the number of dwells) between 4 and 10 where the range of the number of switches is kept as same as the stagger level range. The least and the most number of sample belong to same radar is 5 and 100, respectively. We set the maximum number and minimum of radars in a sequence is 10 and 1, respectively.
We generate 200000 interleaved radar sequences for training and 5000 for validation data. Sequences in each dataset for the following cases are sampled with equal probability: Case 1: Only constant and jitter PRI type. Case 2: No missing occurs. While ending times of radar sequences are same, beginning times of radar sequences are close. Case 3: Missing occurs with maximum probability, 0.20. Consecutive missing is favored up to 10 pulses. While ending times of radar sequences are same, beginning times of radar sequences are close. Case 4: No missing occurs. While ending times of radar sequences are different, beginning times of radar sequences might be far. Case 5: Missing occurs with maximum probability, 0.20. Ending and beginning times of radar sequences might be close or far.
In test dataset, we use 15 deinterleaved signals provided by ASELSAN A.Ş. We interleaved random batches of distinct signals while the number of the interleaved signals varies between 2 and 15. The test dataset contains 1000 interleaved sequences and is of different modality and disjoint from training and validation.
3.2 Experimental Setup and Evaluation Metrics
We use Tensorflow deep learning library throughout the experiments. We set penalty coefficient hyperparameters, and , after fine-tuning the network. Furthermore, we select the maximum of sequence length, the dimension of representation and the number of quantization level as 256, 512 and 5001, respectively. For the optimization procedure, we choose the base learning rate as and utilize ADAM 
optimizer for mini-batch gradient descent with a mini-batch size is 128 and default moment parameters. The remained BTE parameters are kept same with.
We use three performance metrics for better analysis: the average probability of the true estimated next PDW position (), the average probability of the true estimated number of radars () and the average number of one to many assignments () which violates (C3.2). The desired values of those metrics are , and . Each predicted cluster which has larger than 3 is assigned as a radar signal.
3.3 Quantitative Results
We examine the effectiveness of the proposed soft min-cost flow (SMCF) framework through evaluation on the validation dataset and disjoint test data. We compare SMCF with CDIF , SDIF  and PRIT , which are conventional non-learning based deinterleaving methods, to show their limitations in realistic scenarios. Additionally, we show the significance of relaxed constraints in Eqn. (2.6) by training without corresponding penalty terms, i.e., , which we denote Baseline. We also compare the two inference methods (Greedy, LP) proposed in Sec. 2.4.
Table 1 shows the quantitative results for the deinterleaving tasks. SMCF algorithm consistently outperforms the associated baseline methods by up to 6.1% and 32.8% points on and metrics, respectively. Conventional methods perform poorly in challenging scenarios where largely missing pulses and complex PRI modulation types exist. Additionally, the obtained clusters are missing related PDWs although PRI values are predicted correctly.
Comparison of one-to-many violations of Baseline Greedy with of SMCF Greedy as well as the overall performance variation between SMCF Greedy and LP is a good indicator to show the proposed penalty terms improve the solution by enforcing constraints. Similar performances of Greedy&LP in SMCF and superior performance of Greedy SMCF to Baseline imply constraints are better satisfied in the output of SMCF. We note that all the methods except for Greedy have zero one-to-many violations owing to hard assignment.
We link deinterleaving to min-cost flow problem and formulate training of a self-attention based neural network to solve min-cost flow problem. The mechanism behind the formulation is supported by extensive evaluations on the large dataset. Experimental results on data of different modalities show that our framework is able to learn deinterleaving the signals.
-  Richard G Wiley, ELINT: The interception and analysis of radar signals, Artech House, 2006.
-  David K Barton, Radar system analysis and modeling, Artech House, 2004.
-  HK Mardia, “New techniques for the deinterleaving of repetitive sequences,” in IEE Proceedings F (Radar and Signal Processing). IET, 1989, vol. 136, pp. 149–154.
-  DJ Milojević and BM Popović, “Improved algorithm for the deinterleaving of radar pulses,” in IEE Proceedings F (Radar and Signal Processing). IET, 1992, vol. 139, pp. 98–104.
-  Ken’ichi Nishiguchi and Masaaki Kobayashi, “Improved algorithm for estimating pulse repetition intervals,” IEEE transactions on Aerospace and Electronic Systems, vol. 36, no. 2, pp. 407–421, 2000.
X. Li, Z. Liu, and Z. Huang,
“Deinterleaving of pulse streams with denoising autoencoders,”IEEE Transactions on Aerospace and Electronic Systems, pp. 1–1, 2020.
-  Zhipeng Ge, Xian Sun, Wenjuan Ren, Wenbin Chen, and Guangluan Xu, “Improved algorithm of radar pulse repetition interval deinterleaving based on pulse correlation,” IEEE Access, vol. 7, pp. 30126–30134, 2019.
-  Yanchao Liu and Qunying Zhang, “Improved method for deinterleaving radar signals and estimating pri values,” IET Radar, Sonar & Navigation, vol. 12, no. 5, pp. 506–514, 2018.
-  Ala Mahdavi and Amir Mansour Pezeshk, “A fast enhanced algorithm of pri transform,” in 2011 Sixth International Symposium on Parallel Computing in Electrical Engineering. IEEE, 2011, pp. 179–184.
-  Yan Mao, Jun Han, Guohua Guo, and Xu Qing, “An improved algorithm of pri transform,” in 2009 WRI Global Congress on Intelligent Systems. IEEE, 2009, vol. 3, pp. 145–149.
-  AW Ata’a and SN Abdullah, “Deinterleaving of radar signals and prf identification algorithms,” IET radar, sonar & navigation, vol. 1, no. 5, pp. 340–347, 2007.
-  Zhang-Meng Liu and S Yu Philip, “Classification, denoising, and deinterleaving of pulse streams with recurrent neural networks,” IEEE Transactions on Aerospace and Electronic Systems, vol. 55, no. 4, pp. 1624–1639, 2018.
-  Xueqiong Li, Zhangmeng Liu, and Zhitao Huang, “Deinterleaving of pulse streams with denoising autoencoders,” IEEE Transactions on Aerospace and Electronic Systems, 2020.
-  Alexander Schrijver, Theory of linear and integer programming, John Wiley & Sons, 1998.
Peter Ochs, René Ranftl, Thomas Brox, and Thomas Pock,
“Bilevel optimization with nonsmooth lower level problems,”
International Conference on Scale Space and Variational Methods in Computer Vision. Springer, 2015, pp. 654–665.
Samuel Schulter, Paul Vernaza, Wongun Choi, and Manmohan Chandraker,
“Deep network flow for multi-object tracking,”
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6951–6960.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
-  Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani, “Self-attention with relative position representations,” arXiv preprint arXiv:1803.02155, 2018.
-  Moein Ahmadi and K Mohamedpour, “Pri modulation type recognition using level clustering and autocorrelation,” American Journal of Signal Processing, vol. 2, no. 5, pp. 83–91, 2012.
-  Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX symposium on operating systems design and implementation (OSDI 16), 2016, pp. 265–283.
-  Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.