As a fundamental problem in mathematics, linear assignment Burkard and Cela (1999) aims to assign jobs to agents under one-to-one matching constraints such that the total cost is minimized or the total profit is maximized. It is closely related to many computer vision problems, including point matching Lian and Zhang (2020), handwritten mathematical expression recognition Hirata and Julca-Aguilar (2015), and multi-object tracking (MOT) Sun et al. (2021); Xu et al. (2020). A well-known method for linear assignment is the Hungarian algorithm Kuhn (2010); Munkres (1957), which obtains the optimal solution without an exhaustive search. However, its computational cost grows rapidly with the size of the problem. Consequently, using more elaborate heuristic or greedy strategies, several approximate algorithms, including deep greedy switching Naiem and El-Beltagy (2009); Naiem, Amgad and El-Beltagy, Mohammed (2013), interior point algorithms Karmarkar and Ramakrishnan (1991); Ramakrishnan et al., and the auction algorithm Bertsekas (1988), have been proposed to seek acceptable suboptimal solutions under time constraints. However, it is difficult to formulate the gradients of heuristic linear assignment solvers, which prevents them from being directly embedded into learning frameworks.
SinkHorn-like algorithms Sinkhorn (1964) are widely used as a differentiable linear assignment layer that obtains approximate solutions by alternately normalizing the rows and columns of the cost matrix. With such a simple row- and column-normalization process, they often suffer from low assignment precision as the problem size increases. Formulating the linear assignment problem as an integer linear programming task, Zeng et al. (2019) proposed a differentiable matching layer that updates the assignment solution along the negative gradient direction and projects it onto the one-to-one matching constraint set using Dykstra’s algorithm Boyle and Dykstra (1986); Dykstra (1983). However, its computational cost is extremely sensitive to the problem size. In addition, these non-learnable approaches Sinkhorn (1964); Zeng et al. (2019) cannot adapt to the data distributions of different tasks.
With the growing interest in using deep neural networks (DNNs) for computer vision tasks, solving linear assignment problems using DNNs has attracted much research attention. Researchers Lee et al. (2018); Minh-Tuan and Kim (2019); Xu et al. (2020) model the assignment process with deep learning frameworks based on Recurrent Neural Networks (RNNs) Jordan (1997) or Convolutional Neural Networks (CNNs) LeCun et al. (1995). However, these works mainly focus on representing the elements of the cost matrix with local or sequential features, and less investigation has been devoted to utilizing the structured relationships between each pair of elements. Specifically, the features captured by CNNs only collect information from local neighborhoods. Although the receptive field can be expanded by stacking several CNN layers, it fails to cover the whole cost matrix once the problem size exceeds it. The RNN-based methods treat the cost matrix as time-sequential data, which allows these algorithms to deal with linear assignment problems of varying sizes. However, during inference, the contribution of earlier elements to the current state becomes insignificant as the sequence length grows, which may affect the performance of RNN-based models.
To address the above-mentioned issues, we propose a novel graph-based linear assignment network (GLAN) that aims to improve both accuracy and computational efficiency. Our framework falls into the group of deep learning methods and is designed on top of graph neural networks. Given a cost matrix, we first transform it into a bipartite graph in which the two sets of nodes correspond to the agents and jobs respectively, and the edges between the two node sets are assigned the corresponding cost values. In this way, the linear assignment problem is converted to the problem of selecting reliable edges from the bipartite graph. Subsequently, a graph neural network performs information aggregation and updating on the constructed graph and generates the final edge representations through several convolution operations. Finally, according to the updated graph state, our network predicts a label for each edge, which indicates whether the assignment relationship is reliable or not. Different from previous learning-based algorithms, the message passing process in the proposed framework is independent of the size and the structure of the constructed graph. Furthermore, the receptive field of the proposed network can cover the whole graph after only two convolution layers. As a result, the proposed method is able to make assignment decisions efficiently, from a global scope, for problems of varying sizes.
To validate the advantage of the proposed method over previous approaches, we construct a synthetic dataset for the linear assignment problem (LAP) in which each sample contains a random cost matrix and its optimal assignment solution. The training of the network is supervised by the assignment difference between the predictions and the optimal solutions, as well as by the one-to-one matching constraints. We compare the proposed method with several state-of-the-art baseline methods in terms of assignment precision and computational complexity. Experimental results reveal that our method achieves the highest average assignment precision with low time consumption. In addition, we embed each tested differentiable assignment solver into an MOT pipeline, where each learning-based solver is taken as a trainable component to fit the data distribution. The tracking results evaluated under the standard MOT metrics illustrate that the proposed linear assignment solver captures the inductive information in real scenarios well and enhances the tracking performance by the largest margin on most metrics. We will make the code publicly available upon the acceptance of the paper.
In summary, with the proposed learning framework for linear assignment problems, this paper makes three contributions:
(1) we convert the problem of linear assignment to learning of edge selection from a bipartite graph constructed based on the cost matrix;
(2) we propose a differentiable framework based on graph neural network to update the graph states, and the proposed model achieves excellent performance in terms of assignment precision and computation efficiency; and
(3) we embed the proposed LAP solver into a popular MOT pipeline and validate its effectiveness in boosting tracking performance.
2 Related works
2.1 Heuristic Linear Assignment
In the 1950s, combining graph theory and the duality of linear programming, Kuhn proposed the well-known iterative Hungarian algorithm Kuhn (2010), in which a maximal alternating forest is constructed in each iteration to augment the primal solution and the optimal solution can be obtained in polynomial time. Because many problems are dynamic and growing in size, so that a solution needs to be found under tight time constraints, heuristic algorithms whose solutions are close to the optimum have been investigated for decades. Simulating the process of a real auction, Bertsekas designed the auction algorithm Bertsekas (1988), which is composed of two phases, the bidding phase and the assignment phase. The performance of the auction algorithm is improved by a scaling strategy, where a parameter $\epsilon$ is involved in updating the elements of the cost (or price) matrix. Its results for linear assignment problems are close to the optimal solution, but it still suffers from high computational complexity. A variation of the Hungarian algorithm has been proposed in Jonker and Volgenant (1987), which starts with an initialization phase based on a naive auction algorithm and uses the shortest alternating paths to augment primal solutions. In addition, several algorithms solve linear assignment problems by general linear programming approaches such as the interior point method Ramakrishnan et al.; Karmarkar and Ramakrishnan (1991) and the dual forest method Akgül and Ekin (1991). For example, Ramakrishnan et al. modify the Karmarkar interior-point algorithm Karmarkar (1984) and develop an approximate dual projective algorithm for assignment problems. Akgül and Ekin (1991) use forest construction in a dual forest algorithm to solve linear assignment problems.
To speed up the computation, researchers investigate several pseudo flow algorithms including Goldberg and Kennedy (1995) which solves the corresponding minimum cost flow problem. Besides, many heuristic approaches based on greedy strategies are devoted to seeking approximate solutions in a fast manner. In Trick (1992), Trick et al. propose a greedy heuristic approach for the generalized assignment problems and prove that some randomization to the greedy approach can facilitate finding better solutions. Another similar approach is the greedy randomized adaptive search procedure (GRASP) Feo and Resende (1995) where each search iteration is made up of a construction stage, where a randomized greedy solution is constructed, and a local search stage which applies local iterative improvement on the greedy solution. Inspired by the above greedy heuristics, Naiem et al. propose the Deep Greedy Switching (DGS) algorithm Naiem and El-Beltagy (2009); Naiem, Amgad and El-Beltagy, Mohammed (2013) which starts with a random initial solution and tries to find a better solution through searching within a well-defined neighborhood. These methods are significantly faster than previous heuristic methods, due to the partial greedy strategy. However, they often terminate when the objective function value reaches a local optimum.
2.2 Differentiable Linear Assignment Solvers
Despite the progress from the previous heuristic approaches, gradients of these algorithms are difficult to formulate using standard derivation paradigms, which limits their extension to machine learning frameworks.
By performing row-wise and column-wise normalization on the cost matrix, the SinkHorn algorithm Sinkhorn (1964) has been widely used in learning-based frameworks as an assignment layer due to its high computational efficiency. However, its solution heavily depends on the distribution of values in the cost matrix and deteriorates as the size of the problem increases. Based on SinkHorn Sinkhorn (1964), several works Cuturi (2013); Altschuler et al. (2017) solve the linear assignment problem under the optimal transport framework with entropic penalization Cuturi (2013) and a greedy updating strategy Altschuler et al. (2017). These methods derived from SinkHorn Sinkhorn (1964) achieve faster convergence in inference and better assignment performance, but still cannot obtain satisfactory results on large problems. Aiming to improve the assignment quality, i.e., the assignment precision, Zeng et al. (2019) propose a differentiable mask matching layer, named Relax Matching (RM), by unrolling a projected gradient descent algorithm in which Dykstra’s algorithm Boyle and Dykstra (1986); Dykstra (1983) performs the projection that maps the solution onto the constraint set. Based on a mathematical reasoning process, RM Zeng et al. (2019) improves assignment precision over the SinkHorn algorithm Sinkhorn (1964) by a large margin. But its computational complexity increases greatly as the problem size expands. Despite the achievements of the approaches mentioned above, they cannot adapt to the data distributions of different tasks because they are not learnable.
With the popularity of deep learning frameworks, several data-driven algorithms for linear assignment problems have been proposed in recent years. In Liu and Zhao (2015), the linear assignment problem is converted to an equivalent continuous linear programming problem, which is solved by a recurrent neural network without any designed parameters. However, the optimization process includes many matrix multiplication operations, making the method inefficient for large problems. Decomposing the $n$-to-$n$ assignment problem into multi-classification tasks, DFC Lee et al. (2018) and BDL Minh-Tuan and Kim (2019) address these sub-tasks with stacked fully-connected (FC) layers and a bidirectional long short-term memory neural network (BDLSTM) respectively. But the relationships between each pair of sub-tasks have not been taken into consideration. Regarding the cost matrix as an image, DCNN Lee et al. (2018) employs a CNN-based model to obtain local representations for making assignment decisions. However, the CNN layers in this model have fixed receptive fields, which limits its generalization to problems of varying sizes. To receive information from the global scope, two bidirectional RNNs are sequentially applied to the cost matrix in row-wise and column-wise directions in Xu et al. (2020). However, capturing global inductive information in sequential order is not well suited to time-independent problems.
Our work falls into the group of deep learning algorithms for linear assignment problems. Different from previous learning-based works, our method does not work directly on the cost matrix, but converts the problem to learning of edge selection on a constructed bipartite graph. A graph neural network is developed to perform computation on the graph for edge prediction, and its excellent performance is evaluated in both synthetic datasets and real-world MOT tasks.
3 Problem Formulation
3.1 Linear Assignment Problem
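Given a set of $n$ agents and a set of $n$ jobs, assigning job $j$ to agent $i$ results in a cost value of $c_{ij}$. The objective of the linear assignment problem is to find the assignment with the minimum total cost in which each job is assigned to exactly one agent and each agent to exactly one job. The problem can be expressed as a binary integer programming problem where the decision variable $x_{ij}$ is set to 1 if job $j$ is assigned to agent $i$, and is set to 0 otherwise. Therefore, the linear assignment problem can be formulated as:

$$\min_{x} \; \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij} x_{ij} \tag{1}$$

$$\text{s.t.} \quad \sum_{j=1}^{n} x_{ij} = 1, \quad i = 1, \dots, n \tag{2}$$

$$\sum_{i=1}^{n} x_{ij} = 1, \quad j = 1, \dots, n \tag{3}$$

$$x_{ij} \in \{0, 1\}, \quad i, j = 1, \dots, n$$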
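To make the formulation concrete, the following minimal sketch solves a tiny instance by exhaustive search over all permutations. It is intended only as an illustration of the objective and the one-to-one constraints; the function name `solve_lap_bruteforce` and the example cost matrix are hypothetical, and exhaustive search is only feasible for very small problems.

```python
from itertools import permutations

def solve_lap_bruteforce(cost):
    """Return (assignment matrix, permutation, total cost) of the cheapest
    one-to-one assignment, where perm[i] is the job given to agent i."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    # Binary decision variables: x[i][j] = 1 iff job j is assigned to agent i.
    x = [[1 if best_perm[i] == j else 0 for j in range(n)] for i in range(n)]
    return x, best_perm, best_cost

cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]
x, perm, total = solve_lap_bruteforce(cost)
print(perm, total)  # (1, 0, 2) 5
```

Every row and every column of `x` sums to 1, which is exactly the pair of one-to-one matching constraints.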
3.2 Assigning with Edge Labeling
Despite the progress of previous learning-based approaches Lee et al. (2018); Xu et al. (2020); Minh-Tuan and Kim (2019) that regard the cost matrix as an image or sequential data flow, the inherent structure information of the cost matrix has not been fully explored in their models.
Assignments can be modeled and visualized in different ways Burkard and Cela (1999). In this paper, linear assignment is converted to an edge labeling problem by constructing a bipartite graph. Specifically, given a cost matrix, we consider the agents and jobs as two sets of nodes in the constructed graph. Then we build an edge between each pair of agent $i$ and job $j$ and assign the corresponding cost value to it, i.e., the initial value on edge $(i, j)$ is the cost of assigning job $j$ to agent $i$. Finally, the value on edge $(i, j)$ predicted by our framework indicates whether the assignment between agent $i$ and job $j$ is reliable or not.
Fig. 1 illustrates an example of the process that transforms a cost matrix into its corresponding bipartite graph. Specifically, given a cost matrix, the indices colored in green and orange represent the IDs of agents and jobs respectively. All the agents and jobs are then considered as nodes in the bipartite graph, and the edges are assigned cost values according to the cost matrix. Finding the optimal assignment is thus converted to selecting reliable edges in the constructed graph. As shown in Fig. 1, selecting a set of elements in the cost matrix is equivalent to selecting the corresponding edges from the bipartite graph.
To learn how to select reliable edges from the constructed graph, we build a fully trainable graph network which takes the graph as input and performs information aggregation and updating over it to predict the edge labels. The predicted label on edge $(i, j)$ is equivalent to the value of the corresponding entry in the permutation matrix, which indicates whether agent $i$ and job $j$ are assigned to each other. The pipeline of the proposed framework is described in Sec. 4.
4 Proposed Framework
Recently, there has been growing interest in graph neural networks (GNNs) Wu et al. (2021); Battaglia et al. (2018), which capture relational structured representations of data well and have been widely applied to computer vision tasks such as action recognition Xu et al. (2021); Zhang et al. (2020) and hierarchical representation learning Bianchi et al. (2019). Inspired by these studies, we design our learning framework by extending the graph network block Battaglia et al. (2018); Wang et al. (2020), in which some operations over graph data have been defined. As illustrated in Fig. 2, the pipeline for training the proposed framework consists of four modules: bipartite graph generation, encoder/decoder, convolution module, and loss function. Given a cost matrix, the proposed framework first constructs a bipartite graph that is taken as input to an encoder and transformed into a latent representation. Then the edge and node convolution layers in the convolution module iteratively aggregate features from neighbors using attention and weighting strategies and update the attributes of each edge and each node. Subsequently, the edge labels are predicted by a decoder according to the updated graph states. Finally, the loss between the optimal solutions and the predicted labels guides the training process. Besides, the one-to-one matching constraints are also taken into account as a part of the supervision signals. In the following, we describe each module in detail.
4.1 Bipartite Graph Construction
As described in Sec. 3.2, a cost matrix can be transformed into a bipartite graph in which the nodes are grouped into two classes representing agents and jobs respectively. In the constructed graph, edges exist only between nodes of different classes. The raw feature of each edge is the cost value between its source node and receive node according to the cost matrix. Therefore, the initial attribute of the edge connecting agent $i$ and job $j$ can be expressed as:

$$e_{ij}^{0} = c_{ij}$$

where $c_{ij}$ is the corresponding entry of the cost matrix. Note that all the raw attributes of the nodes in the constructed graph are initialized with zero-valued vectors. To avoid the influence of noisy edges on training, for each agent node we keep the $k$ adjacent edges with the lowest costs and remove the remaining edges. The constructed graph is represented as

$$G = (V, E, \mathcal{V}, \mathcal{E})$$

where $V$ and $E$ denote the node set and edge set, and $\mathcal{V}$ and $\mathcal{E}$ are the node attribute set and edge attribute set, respectively.
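The construction above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `build_bipartite_graph`, the dictionary-based graph representation, and the node attribute dimension `node_dim` are hypothetical choices, not the authors' implementation.

```python
def build_bipartite_graph(cost, k, node_dim=4):
    """Build a pruned bipartite graph from a cost matrix.

    Nodes are ("agent", i) and ("job", j). Node attributes start as zero
    vectors. For each agent only the k cheapest edges are kept, and the
    initial attribute of edge (i, j) is the cost value cost[i][j].
    """
    n_agents, n_jobs = len(cost), len(cost[0])
    nodes = [("agent", i) for i in range(n_agents)]
    nodes += [("job", j) for j in range(n_jobs)]
    node_attrs = {node: [0.0] * node_dim for node in nodes}
    edge_attrs = {}
    for i, row in enumerate(cost):
        # Indices of the k jobs with the lowest cost for agent i.
        cheapest = sorted(range(n_jobs), key=lambda j: row[j])[:k]
        for j in cheapest:
            edge_attrs[(i, j)] = row[j]
    return nodes, node_attrs, edge_attrs

cost = [[0.1, 0.9, 0.8],
        [0.7, 0.2, 0.9],
        [0.8, 0.6, 0.3]]
nodes, node_attrs, edge_attrs = build_bipartite_graph(cost, k=2)
```

With `k=2` on this 3 by 3 example, six of the nine possible edges survive; the most expensive edge of each agent is dropped.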
4.2 Encoder and Decoder
The encoder module transforms the attributes of the constructed graph into latent representations by applying a Multi-Layer Perceptron (MLP) to each edge to form the embedding features. The transformed graph is then passed to the convolution module as input to update its state.
The decoder, coupled with the encoder, reads out the edge attributes from the output graph and predicts each edge label through an update function. Similarly, the update function is designed as an MLP and is applied to each edge to form the edge labels through a sigmoid activation.
4.3 The Convolution Module
The convolution module consists of a node convolution layer and an edge convolution layer. For each node in the graph, the node convolution layer collects information from its adjacent edges and its first-order neighborhood using adaptive aggregation weights and updates the node attributes. For each edge, the edge convolution layer aggregates the attributes of its source node and receive node through a channel-attention strategy and updates the edge attributes. Although the receptive field of a single convolution module is the first-order neighborhood, the messages on each node can reach all other nodes after two iterations of convolution. Therefore, the receptive field of the $t$-iteration convolution module can cover the whole graph when $t \geq 2$.
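The two-iteration claim can be checked with a small reachability sketch. It assumes a complete (unpruned) bipartite graph, in which every agent is connected to every job; the helper names `complete_bipartite` and `reachable_within` are hypothetical. If edges are pruned, more hops may be needed.

```python
def complete_bipartite(n):
    """Adjacency sets of a complete bipartite graph with n agents and n jobs."""
    agents = [f"a{i}" for i in range(n)]
    jobs = [f"b{j}" for j in range(n)]
    adjacency = {a: set(jobs) for a in agents}
    adjacency.update({b: set(agents) for b in jobs})
    return adjacency

def reachable_within(adjacency, start, hops):
    """Nodes whose information can reach `start` in at most `hops` steps."""
    seen, frontier = {start}, {start}
    for _ in range(hops):
        frontier = {nb for node in frontier for nb in adjacency[node]} - seen
        seen |= frontier
    return seen

adjacency = complete_bipartite(5)
all_nodes = set(adjacency)
one_hop = reachable_within(adjacency, "a0", 1)   # a0 and all jobs
two_hops = reachable_within(adjacency, "a0", 2)  # every node
```

One hop from an agent covers only the jobs; the second hop brings in all remaining agents, independent of the number of nodes.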
The edge convolution layer: This layer consists of an aggregation function and an update function. For each edge, the aggregation function collects information from its source node and receive node. It takes the attributes of the edge connecting the two nodes together with the attributes of both nodes, weights them with channel attention vectors through element-wise multiplication, and concatenates the resulting vectors channel-wise. The node channel attention vector has the same dimension as the node attributes and is computed from the real-valued vectors of maximum, minimum, and average values of its input attribute set along each channel. Similarly, the edge channel attention vector has the same dimension as the edge attributes. The operators that produce the two attention vectors are both parameterized as MLPs with different configurations to map their inputs into the desired space.
Taking the concatenated attributes as input, an update function designed as an MLP is applied to output the updated edge feature.
The node convolution layer: This layer is designed to collect information from the adjacent edges and the first-order neighborhood of each node. In this layer, not only the channel attention but also the aggregation weights are taken into consideration. Specifically, for each node in the graph, this layer first aggregates the information from its associated edges and its first-order adjacent nodes using adaptive weights. An embedding function transforms each input into an embedding feature; the inputs are the attribute set of all edges associated with the node and the attribute set of its first-order adjacent nodes. For each adjacent node, an adaptive weight measures its contribution during feature aggregation.
Then the collected embedding features are concatenated with the current attributes of the node and passed to another transformation function that outputs the updated attributes of the node.
These functions are all specified as MLP modules, while the structures and parameters of these MLPs differ from each other.
It is worth mentioning that the operations in the above two layers are independent of the size and structure of the constructed graph, which allows the proposed learning framework to be trained using cost matrices of different sizes.
4.4 Loss Function
Similar to Xu et al. (2020), we consider the linear assignment problem as a binary classification task and divide the elements of the ground-truth assignment matrix into positive labels and negative ones. Since each node has at most one positive edge among its adjacent edges and the rest are negative, we utilize the Balanced Cross Entropy as the loss function to avoid the negative labels dominating the training. It is computed from the predicted label of each edge connecting an agent and a job, the corresponding ground-truth label indicating whether the edge is positive or negative, and a weight which balances the contributions of positive and negative labels.
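A common form of the balanced cross entropy is sketched below for illustration. The exact expression used in this paper is not reproduced here, so the formula, the averaging over edges, the clipping constant `eps`, and the function name are assumptions; the default `pos_weight=0.9` follows the positive-edge weight stated in the experimental settings.

```python
import math

def balanced_cross_entropy(pred, target, pos_weight=0.9, eps=1e-7):
    """Weighted binary cross entropy averaged over edges (assumed form).

    pred:   predicted edge labels in (0, 1)
    target: ground-truth edge labels, 1 for positive and 0 for negative
    """
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= pos_weight * y * math.log(p)
        total -= (1.0 - pos_weight) * (1 - y) * math.log(1.0 - p)
    return total / len(pred)

missed_positive = balanced_cross_entropy([0.1], [1])
false_positive = balanced_cross_entropy([0.9], [0])
```

With this weighting, missing the single positive edge of a node is penalized more heavily than a false alarm on one of its many negative edges.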
The one-to-one matching constraint embodies the nature of the solution of linear assignment problems. However, it has not been fully explored in previous learning-based approaches. To impose this constraint in our learning framework, we first construct a predicted assignment matrix whose size is the same as that of the problem, using a bijection that maps each edge to an integer index. Then we design the soft constraint loss from three ingredients: a one-valued vector, an operator that sums the values of the predicted assignment matrix along each row, and an operator that returns a vector in which each element is the 2-norm of the corresponding row vector. In training, Eq. 16 drives the prediction to satisfy the constraints in Eq. 2 and 3. As a part of the supervision signals in training, Eq. 17 guides the proposed network to output sparse predictions.
We finally combine the binary classification loss and the constraint loss to guide the training process, where a weighting coefficient controls the degree to which the soft one-to-one matching constraints are imposed.
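One way such soft constraints could be written is sketched below. This is an assumption-laden illustration, not the paper's Eq. 16 to 19: the squared penalties, the column-sum term, the function names, and the way the terms are combined with the weight `lam` are all hypothetical.

```python
import math

def constraint_losses(x):
    """Return (one_to_one, sparsity) penalties for a predicted matrix x.

    one_to_one: squared deviation of every row sum and column sum from 1.
    sparsity:   squared deviation of every row 2-norm from 1. A row that
                sums to 1 and has 2-norm 1 must be one-hot.
    """
    n_rows, n_cols = len(x), len(x[0])
    row_sums = [sum(row) for row in x]
    col_sums = [sum(x[i][j] for i in range(n_rows)) for j in range(n_cols)]
    one_to_one = sum((s - 1.0) ** 2 for s in row_sums + col_sums)
    row_norms = [math.sqrt(sum(v * v for v in row)) for row in x]
    sparsity = sum((r - 1.0) ** 2 for r in row_norms)
    return one_to_one, sparsity

def total_loss(classification_loss, x, lam):
    """Classification loss plus lam-weighted soft constraints (assumed form)."""
    one_to_one, sparsity = constraint_losses(x)
    return classification_loss + lam * (one_to_one + sparsity)

permutation = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
uniform = [[1 / 3] * 3 for _ in range(3)]
```

A permutation matrix incurs no penalty, whereas the uniform matrix satisfies the row and column sums but is penalized by the sparsity term, which is the behavior the two constraint terms are meant to separate.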
Table 1: Assignment precision (%) of the baseline methods for problem sizes from 10 to 150. The last column is the average.
|Method|10|20|30|40|50|60|70|80|90|100|110|120|130|140|150|Avg.|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|SH Sinkhorn (1964)|62.4|51.2|47.1|43.1|39.7|38.5|36.6|35.0|34.2|33.7|32.5|31.8|31.0|30.6|29.8|38.5|
|SD Cuturi (2013)|69.6|59.9|56.3|53.1|50.6|48.8|46.8|45.5|44.9|44.1|43.0|42.5|41.5|40.9|40.4|44.6|
|RM Zeng et al. (2019)|77.9|67.6|61.0|57.0|53.7|51.4|50.7|50.6|50.0|49.7|50.9|50.2|50.4|49.5|49.8|54.7|
|DFC Lee et al. (2018)|59.4|49.9|47.7|46.8|46.3|45.0|43.8|41.8|40.8|38.9|37.5|36.0|34.5|33.3|32.0|42.2|
|DCNN Lee et al. (2018)|62.8|59.1|57.7|57.8|57.7|57.3|57.4|57.3|57.5|57.0|57.2|56.9|57.3|57.1|56.5|57.8|
|BDL Minh-Tuan and Kim (2019)|63.4|59.8|60.0|59.1|59.2|59.0|58.5|58.9|58.9|58.4|58.5|58.2|58.5|58.3|57.9|59.1|
|DHN Xu et al. (2020)|72.3|67.9|67.0|65.9|64.9|64.8|64.4|63.8|63.2|62.9|62.8|62.0|61.8|60.8|59.7|64.3|
Table 2: Computational time of the baseline methods for problem sizes from 10 to 150.
|Method|10|20|30|40|50|60|70|80|90|100|110|120|130|140|150|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|SH Sinkhorn (1964)|0.02|0.06|0.05|0.02|0.02|0.03|0.03|0.04|0.05|0.04|0.05|0.06|0.05|0.07|0.07|
|SD Cuturi (2013)|0.01|0.01|0.01|0.01|0.01|0.01|0.01|0.01|0.01|0.01|0.01|0.01|0.01|0.01|0.01|
|RM Zeng et al. (2019)|1.6k|1.7k|1.7k|1.8k|1.8k|2.0k|2.2k|2.4k|2.6k|2.7k|3.1k|3.6k|4.3k|4.8k|5.6k|
|DFC Lee et al. (2018)|0.54|0.55|0.58|0.57|0.56|0.59|0.60|0.60|0.60|0.57|0.56|0.57|0.56|0.55|0.62|
|DCNN Lee et al. (2018)|0.36|0.40|0.48|0.46|0.42|0.39|0.43|0.35|0.49|0.49|0.47|0.45|0.56|0.53|0.57|
|BDL Minh-Tuan and Kim (2019)|3.21|4.41|5.85|7.19|9.78|11.6|12.8|14.2|15.6|13.6|14.8|16.0|17.3|18.4|19.7|
|DHN Xu et al. (2020)|196|308|520|842|1.4k|1.9k|2.5k|3.1k|4.0k|5.1k|6.8k|7.3k|9.4k|10.8k|11.3k|
In our experiments, we set the maximum number of iterations in the convolution module and the dimension of the edge features to 5 and 16, respectively. The number of adjacent edges kept for each node is set to 8. In addition, the weight for positive edges is fixed to 0.9.
For training the proposed framework, we generate a synthetic dataset. Each sample is composed of a cost matrix, whose elements are drawn from a uniform distribution, and the corresponding optimal assignment solution, which is obtained by the Hungarian algorithm Kuhn (2010). In experiments, the cost matrices in the synthetic dataset have sizes ranging from 10 to 150 with an interval of 10. For the sake of clarity, with the default lower bound of the cost data being 0, the name of the synthetic dataset carries the superscript 1, which denotes the upper bound of the cost value, and the subscript 10_150, which represents the minimum size (10) and maximum size (150) of the cost matrices. Besides, we randomly sample 30% of this dataset for evaluation and keep the other 70% for training. The training takes 20 epochs in total, where the learning rate is initially set to 0.003 and decays by 5% after every 5 epochs. In addition, the weight of the constraint loss in Eq. 19 is 0 at the beginning and increases by 0.01 after each epoch.
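The training schedule described above can be written out as two small functions. This is a sketch under assumptions: epochs are taken to be zero-indexed, the 5% decay is interpreted as a multiplication by 0.95, and the function names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.003, decay=0.95, step=5):
    """Learning rate that starts at base_lr and drops by 5% every `step` epochs."""
    return base_lr * decay ** (epoch // step)

def constraint_weight(epoch, increment=0.01):
    """Weight of the constraint loss: 0 at the start, +0.01 after each epoch."""
    return increment * epoch

schedule = [(e, learning_rate(e), constraint_weight(e)) for e in range(20)]
```

Under these assumptions the learning rate is 0.003 for epochs 0 to 4 and 0.00285 for epochs 5 to 9, while the constraint weight grows linearly to 0.19 in the last of the 20 epochs, so the one-to-one constraints are phased in gradually.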
5.3 Evaluation on Synthetic Datasets
5.3.1 Comparison with Baseline Methods
In this section, we report the experimental results of the proposed framework, named GLAN (Graph-based Linear Assignment Network), compared with six state-of-the-art baselines: SH Sinkhorn (1964), SD Cuturi (2013), RM Zeng et al. (2019), DLASP Lee et al. (2018), BDL Minh-Tuan and Kim (2019), and DHN Xu et al. (2020). The first three are traditional learning-free methods and the last three are deep learning approaches. In Tables 1 and 2, two versions of DLASP Lee et al. (2018) are included in the comparison, named DFC and DCNN, which are built upon fully-connected (FC) layers and CNNs LeCun et al. (1995) respectively.
To evaluate the performance of these methods, we apply greedy discretization to the solution matrices predicted by these approaches and define the assignment precision criterion through a trace operation, which returns the trace of a given matrix. In addition, in order to fairly compare our method with the baselines in terms of computational efficiency, all the methods are run on a PC with an Intel Core i5-4590 CPU (3.30 GHz), 16 GB of RAM, and a GTX 1060Ti GPU (6G).
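A possible reading of this evaluation step is sketched below. It assumes that greedy discretization repeatedly picks the highest remaining score whose row and column are still free, and that precision is the trace of the product of the transposed discretized prediction with the ground-truth matrix, divided by the problem size. Both assumptions and the function names are illustrative, since the exact definition is not reproduced here.

```python
def greedy_discretize(scores):
    """Turn a soft score matrix into a 0/1 one-to-one assignment matrix."""
    n_rows, n_cols = len(scores), len(scores[0])
    entries = sorted(
        ((scores[i][j], i, j) for i in range(n_rows) for j in range(n_cols)),
        reverse=True,
    )
    used_rows, used_cols = set(), set()
    x = [[0] * n_cols for _ in range(n_rows)]
    for _, i, j in entries:
        if i not in used_rows and j not in used_cols:
            x[i][j] = 1
            used_rows.add(i)
            used_cols.add(j)
    return x

def assignment_precision(pred, truth):
    """trace(pred^T truth) / n, i.e. the fraction of agents matched correctly."""
    n = len(truth)
    matches = sum(pred[i][j] * truth[i][j]
                  for i in range(n) for j in range(len(truth[0])))
    return matches / n

scores = [[0.9, 0.8, 0.1],
          [0.7, 0.6, 0.2],
          [0.1, 0.3, 0.5]]
truth = [[0, 1, 0],
         [1, 0, 0],
         [0, 0, 1]]
pred = greedy_discretize(scores)
precision = assignment_precision(pred, truth)
```

In this example the greedy step commits to the largest score first and ends up with the identity assignment, so only one of the three agents matches the ground truth. This also shows how a greedy rounding can lock in a locally attractive but globally wrong choice.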
The average precision of each tested algorithm is reported in Table 1, and Table 2 reports the comparison in terms of computational time. Among the non-learnable approaches, SinkHorn Sinkhorn (1964), which adopts simple normalization along rows and columns, cannot obtain satisfactory assignment precision despite its relatively fast inference. As a variant of SinkHorn Sinkhorn (1964) with entropic penalization, SD Cuturi (2013) runs with the lowest time cost and significantly outperforms SinkHorn Sinkhorn (1964) in assignment precision. However, the assignment precision of both methods deteriorates as the problem size increases. The RM Zeng et al. (2019) algorithm employs an elaborate reasoning process to update the solution iteratively and achieves the best assignment precision among the learning-free baselines. However, its computation time grows greatly with the problem size. Among the learning-based methods, DFC Lee et al. (2018) passes the sub-classification tasks generated from the cost matrix to several FC layers to make decisions for linear assignment problems. Its computational complexity is the lowest among the learning-based methods, but it gets the worst assignment precision. Similar to DFC Lee et al. (2018), the RNN-based model BDL Minh-Tuan and Kim (2019) decomposes the linear assignment problem into a sequence of smaller sub-assignment tasks, and it achieves better assignment precision due to its adaptation to problems of varying sizes. However, the relationships between sub-assignment tasks are not taken into account in either framework. Employing several stacked CNN layers, DCNN Lee et al. (2018) also performs better than the FC-based version of DLASP Lee et al. (2018), i.e., DFC. However, due to the limited receptive field of its convolution kernels, it cannot make assignment decisions based on global information when the size of the problem is greater than its receptive field. To receive information from the global scope, another RNN-based model, DHN Xu et al. (2020), transforms the cost matrix into two sequential patterns, column-wise and row-wise, to make assignment decisions. DHN Xu et al. (2020) outperforms all the other baselines on assignment precision. But like RM Zeng et al. (2019), the computational time of DHN Xu et al. (2020) is sensitive to the problem size.
Devoted to learning the inductive representations and relational structures that help make decisions for linear assignment problems, our graph network achieves the best overall assignment precision, as shown in Table 1. Specifically, the average assignment precision of the proposed method exceeds that of DHN Xu et al. (2020). Through multiple iterations of the convolution module, the receptive field of our model can cover the whole graph no matter how large the graph is. As shown in Table 1, our method achieves consistently high assignment accuracy as the problem size increases. Furthermore, since the network architecture is independent of the structure and size of the graph, the running time of the proposed method increases only slightly as the problem size expands.
5.3.2 Ablation Study
To verify the effectiveness of each component in our framework, we perform several ablation experiments for the number of convolution layers, the dimension of edge feature, the channel attention and aggregation weights, and the constraint losses.
Influence of the number of convolution layers. The number of convolution layers determines the number of message passing iterations in the assignment graph. As mentioned in Sec. 1, the information of one node can be passed to any other node after two convolution operations. To explore the influence of the number of convolution layers, we set the dimension of the edge features to 16 and vary the number of convolution iterations from 2 to 10. As illustrated in Fig. 3(a), with only 2 iterations our model already outperforms the best baseline method, DHN Xu et al. (2020). As the number of convolution layers increases from 2 to 5, the performance of our model improves significantly. However, there is no significant improvement when the number of iterations is greater than 5, which means that 5 convolution layers are sufficient to capture the global inductive information for making assignment decisions.
Influence of edge feature dimension. The edge feature here refers to the latent representation of edges in the convolution layers. We fix the number of convolution iterations to 5 and vary the edge feature dimension to illustrate its influence on performance. As illustrated in Fig. 3(b), the performance of our model improves significantly when the dimension is expanded from 2 to 4, and is further enhanced to a certain extent when the dimension is set to 16. However, for dimensions larger than 16, such as 32, 64, and 128, the assignment precision remains close to the result obtained with dimension 16. Our model is therefore a lightweight framework: only 5 convolution layers and 16-dimensional latent representations are sufficient to achieve nearly the best assignment precision.
Effectiveness of channel attention and aggregation weights. During the convolution process, the channel attention adaptively guides the proposed framework to focus on the more important features channel-wise. For a node, the aggregation weights measure the contribution of its adjacent nodes in node feature aggregation. To validate their effectiveness in learning structured representations for assignment decisions, we design three downgraded versions of our model, named GLAN-C, GLAN-W and GLAN-C-W, where C and W denote removing the channel attention and the aggregation weights in the convolution layers, respectively. As illustrated in Fig. 3(c), adding either channel attention or aggregation weights improves the assignment precision over GLAN-C-W in all cases with different data sizes, which demonstrates that both have positive effects on making assignment decisions. Furthermore, by combining channel attention and aggregation weights, the full framework GLAN surpasses all downgraded models and achieves the best performance in all cases. This demonstrates that the channel attention and aggregation weights do not conflict with each other, and that their combination further improves the assignment precision.
Effectiveness of constraint losses. The loss function Eq. 16 drives the assignment matrix output by our framework to satisfy the constraints in Eq. 2 and Eq. 3, and the loss function Eq. 17 guides our framework to make sparse assignment decisions. To quantify the contribution of the proposed soft constraint losses formulated in Eq. 16 and Eq. 17, we also design three downgraded versions of our model, named GLAN-L1-L2, GLAN-L1 and GLAN-L2, where L1 and L2 denote removing the constraint losses Eq. 16 and Eq. 17 in training, respectively. As illustrated in Fig. 3(d), without any constraint loss as a supervision signal during training, GLAN-L1-L2 performs worst among all versions. With either constraint loss Eq. 16 or Eq. 17 as part of the training supervision, the proposed framework achieves higher assignment precision than GLAN-L1-L2 in all cases. Furthermore, the two constraint losses do not conflict with each other; combined in the full model GLAN, they yield the best performance in all cases.
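To make the role of the two soft constraints more concrete, the sketch below shows one plausible way to penalize a predicted score matrix. It is an illustration under assumptions only: the exact forms of Eq. 16 and Eq. 17 are not reproduced in this section, so `row_col_penalty` (squared deviation of row and column sums from 1) and `sparsity_penalty` (mean of p(1 - p)) are hypothetical stand-ins rather than the losses used in the paper.

```python
def row_col_penalty(p):
    """Assumed stand-in for a one-to-one constraint loss:
    mean squared deviation of every row sum and column sum from 1."""
    n_rows, n_cols = len(p), len(p[0])
    row_sums = [sum(row) for row in p]
    col_sums = [sum(p[i][j] for i in range(n_rows)) for j in range(n_cols)]
    deviations = [(s - 1.0) ** 2 for s in row_sums + col_sums]
    return sum(deviations) / len(deviations)


def sparsity_penalty(p):
    """Assumed stand-in for a sparsity loss:
    mean of p * (1 - p), which is zero only when every score is 0 or 1."""
    values = [x * (1.0 - x) for row in p for x in row]
    return sum(values) / len(values)


hard = [[1.0, 0.0], [0.0, 1.0]]     # valid one-to-one assignment
soft = [[0.5, 0.5], [0.5, 0.5]]     # rows and columns sum to 1, but not sparse
invalid = [[1.0, 1.0], [0.0, 0.0]]  # binary, but violates the sum constraints

for name, matrix in (("hard", hard), ("soft", soft), ("invalid", invalid)):
    print(name, row_col_penalty(matrix), sparsity_penalty(matrix))
```

The three toy matrices show why the two terms are complementary: the doubly stochastic but non-sparse matrix is only penalized by the sparsity term, while the binary but infeasible matrix is only penalized by the row and column term.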
|Method||200||400||600||800||1000||1200||1400||1600||1800||2000||2200||2400||2600||2800||3000||Avg.|
|SH Sinkhorn (1964)||27.3||23.3||20.7||19.2||18.0||17.2||16.6||15.8||15.3||14.9||14.5||14.0||13.9||13.6||13.2||15.3|
|SD Cuturi (2013)||37.3||32.2||29.0||26.9||25.6||24.3||23.3||22.4||21.8||21.0||20.5||19.8||19.6||19.2||18.6||21.5|
|RM Zeng et al. (2019)||51.1||48.0||45.8||46.1||45.9||45.4||47.1||46.2||43.8||42.1||42.2||41.3||40.6||39.3||38.7||42.5|
|DCNN Lee et al. (2018)||56.8||55.2||53.9||52.7||51.6||50.5||49.6||48.7||47.8||47.0||46.0||45.0||44.1||42.7||41.5||48.9|
|BDL Minh-Tuan and Kim (2019)||57.3||54.7||53.1||52.0||50.9||49.9||48.9||47.8||46.7||46.0||44.8||43.7||42.5||41.4||39.9||48.0|
|DHN Xu et al. (2020)||52.5||39.6||29.7||21.0||19.2||-||-||-||-||-||-||-||-||-||-||-|
|SH Sinkhorn (1964)||62.4||51.2||47.1||43.1||39.7||38.5||36.6||35.0||34.2||33.7||32.5||31.8||31.0||30.6||29.8||38.5|
|SD Cuturi (2013)||76.1||68.0||63.7||58.9||56.0||54.5||53.2||51.0||50.3||48.9||47.9||47.1||46.3||45.7||44.9||49.8|
|RM Zeng et al. (2019)||87.3||86.4||75.5||68.4||68.0||65.1||64.1||60.9||59.2||59.8||59.4||58.9||58.3||57.2||52.1||65.4|
|DFC Lee et al. (2018)||52.7||42.6||41.7||36.6||23.8||26.0||23.2||20.5||18.3||17.4||16.6||14.1||12.2||12.4||11.1||24.6|
|DCNN Lee et al. (2018)||49.9||51.9||52.2||54.5||55.1||56.2||56.1||55.5||55.5||55.4||56.4||55.0||56.3||55.9||55.8||54.8|
|BDL Minh-Tuan and Kim (2019)||49.5||54.2||54.7||57.0||56.1||57.4||57.0||56.7||57.3||57.2||57.5||57.2||57.6||57.1||57.5||56.3|
|DHN Xu et al. (2020)||54.6||54.6||53.6||52.1||49.4||49.3||47.3||46.5||45.5||44.1||43.1||42.4||42.7||40.0||39.7||47.0|
|Method||200||400||600||800||1000||1200||1400||1600||1800||2000||2200||2400||2600||2800||3000||Avg.|
|SH Sinkhorn (1964)||27.3||23.3||20.7||19.2||18.0||17.2||16.6||15.8||15.3||14.9||14.5||14.0||13.9||13.6||13.2||15.3|
|SD Cuturi (2013)||42.0||35.8||32.0||29.7||28.1||26.5||25.9||24.5||23.9||23.1||22.5||21.9||21.4||20.7||20.3||23.6|
|RM Zeng et al. (2019)||57.5||53.3||47.0||47.1||46.8||46.4||50.8||49.0||45.6||43.5||43.8||42.7||41.8||41.4||40.4||44.3|
|DCNN Lee et al. (2018)||54.9||57.3||54.9||57.0||56.9||53.7||54.4||53.6||52.2||51.6||51.0||49.8||47.3||48.4||46.1||52.6|
|BDL Minh-Tuan and Kim (2019)||56.0||56.1||54.7||55.6||55.0||50.3||50.5||44.0||41.5||43.7||40.8||41.0||42.4||36.5||37.3||47.0|
|DHN Xu et al. (2020)||35.0||26.2||16.7||18.5||11.9||-||-||-||-||-||-||-||-||-||-||-|
5.4 Generalization Study
Generalization ability is an important quality of linear assignment solvers, especially for learning-based methods, which are usually sensitive to differences between the distributions of the training and test data. To verify the generalization ability of our framework and the baselines, we train the learning-based models on the synthetic dataset and test them on three types of datasets that have different data distributions from .
Specifically, in order to evaluate the assignment precision on a dataset with larger cost matrices, we generate a synthetic dataset named where the problem size ranges from 200 to 3000 with an interval of 200 and the cost values lie in (0,1). Table 3 reports the assignment precision of our framework compared with the state-of-the-art baselines. Among the traditional solvers, the performance of SH Sinkhorn (1964) and SD Cuturi (2013) degrades noticeably as the problem size grows. The assignment precision of the other traditional method, RM Zeng et al. (2019), is not sensitive to the problem size; however, its performance is still poor. Among the learning-based methods, the assignment precision of DHN drops severely as the problem scale expands. When the problem size is greater than 1000, DHN cannot solve the problem within 5 minutes, which makes it difficult to collect complete statistics for large problems. In addition, DCNN Lee et al. (2018) and BDL Minh-Tuan and Kim (2019) have similar average assignment precision, and their performance is still sensitive to the problem size. In contrast to these baselines, our proposed framework GLAN achieves consistent assignment precision and surpasses all baselines in all cases, which demonstrates that our model trained on small-sized data is capable of handling more difficult problems with large cost matrices.
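As a small-scale illustration of this kind of evaluation, the sketch below generates a random cost matrix, computes a reference assignment, and scores an approximate solver against it. Several points are assumptions made for the example: assignment precision is taken to be the fraction of rows matched to the same column as in the optimal assignment, the optimum is found by exhaustive search (only feasible for tiny problems, standing in for the Hungarian algorithm), and a simple greedy solver stands in for the method under test. All function names are hypothetical.

```python
import itertools
import random


def random_cost_matrix(n, rng):
    """n x n costs drawn uniformly from [0, 1)."""
    return [[rng.random() for _ in range(n)] for _ in range(n)]


def optimal_assignment(cost):
    """Exhaustive search over all permutations; only feasible for very small n."""
    n = len(cost)
    return min(itertools.permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))


def greedy_assignment(cost):
    """Repeatedly pick the cheapest remaining (row, column) pair."""
    n = len(cost)
    pairs = sorted((cost[i][j], i, j) for i in range(n) for j in range(n))
    assigned, used_cols = {}, set()
    for _, i, j in pairs:
        if i not in assigned and j not in used_cols:
            assigned[i] = j
            used_cols.add(j)
    return tuple(assigned[i] for i in range(n))


def assignment_precision(predicted, reference):
    """Assumed metric: fraction of rows matched to the same column as the reference."""
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


rng = random.Random(0)
cost = random_cost_matrix(6, rng)
best = optimal_assignment(cost)
guess = greedy_assignment(cost)
print("precision of greedy vs. optimum:", assignment_precision(guess, best))
```

For the problem sizes reported here (200 to 3000), the exhaustive search would have to be replaced by an exact polynomial-time solver; the scoring function stays the same.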
In addition, we test all the methods on the dataset which is the combination of the datasets and , except that the cost matrices are multiplied by a real value randomly sampled from 1 to 10, so that the cost values lie in . For the sake of clarity, this dataset is divided into two parts named and , and the corresponding experimental results are reported in Table 4 and Table 5. Concretely, Table 4 validates the models’ generalization on test data whose cost value interval differs from that of the training data, and Table 5 reports the results on a more difficult dataset in which both the problem size and the cost value interval differ from the training data. Among the traditional solvers, SH Sinkhorn (1964) simply makes assignment decisions by performing column- and row-normalization iteratively; it is thus insensitive to the cost value scale and achieves assignment precision similar to its results on and . As a variant of SH Sinkhorn (1964), SD Cuturi (2013) achieves better performance on the dataset with interval of than that in . Interestingly, RM Zeng et al. (2019) achieves the best assignment precision on data with sizes from 10 to 30 in Table 4, because multiplying the cost matrix by a value greater than 1 is equivalent to enlarging the gradient during optimization, which speeds up its convergence. Moreover, for the same data size, RM Zeng et al. (2019) achieves better results on data in than on data in . However, the assignment precision of RM Zeng et al. (2019) is lower than that of our proposed framework GLAN when the problem size is greater than 30.
Among the learning-based baselines, both DCNN Lee et al. (2018) and BDL Minh-Tuan and Kim (2019) obtain relatively low assignment precision on small problems, such as sizes 10 and 20, and their precision gradually increases with the problem size until it stabilizes around a certain value. The results of DCNN Lee et al. (2018) and BDL Minh-Tuan and Kim (2019) on demonstrate that both of them overfit the small-sized training data and are susceptible to changes in data distribution. Besides, the assignment precision of DFC Lee et al. (2018) and DHN Xu et al. (2020) on is obviously worse than the results on . In addition, the performance of DHN Xu et al. (2020) on still deteriorates as the problem size increases and is worse than that on , which demonstrates that DHN Xu et al. (2020) fails to handle linear assignment problems whose data size and value scale differ from its training data. Owing to the structured inductive representations, our framework GLAN obtains similar performance on as on the dataset , and achieves the best average assignment precision. Furthermore, it also obtains the best and nearly consistent assignment precision on , where both the problem size and cost value scale differ from the training data.
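The scale insensitivity of plain row and column normalization can be verified numerically. The following is a minimal sketch, assuming the normalization is applied directly to a strictly positive score matrix; how the SH baseline maps costs to such scores is not specified in this section, and the function name `sinkhorn` and the toy matrix are illustrative only.

```python
def sinkhorn(matrix, iterations=50):
    """Alternately normalize the rows and columns of a strictly positive matrix."""
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(iterations):
        # Row normalization: every row sums to 1.
        for i in range(n_rows):
            s = sum(m[i])
            m[i] = [x / s for x in m[i]]
        # Column normalization: every column sums to 1.
        for j in range(n_cols):
            s = sum(m[i][j] for i in range(n_rows))
            for i in range(n_rows):
                m[i][j] /= s
    return m


scores = [[0.9, 0.2, 0.4],
          [0.3, 0.8, 0.1],
          [0.5, 0.6, 0.7]]
scaled = [[7.5 * x for x in row] for row in scores]  # same matrix, different scale

a = sinkhorn(scores)
b = sinkhorn(scaled)
max_gap = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
print("largest difference between the two results:", max_gap)
```

A global scalar cancels in the very first row normalization, so both inputs lead to the same normalized matrix up to floating-point error. A variant that first passes the costs through an exponential kernel would not share this invariance, since scaling the costs changes the kernel's shape rather than its overall magnitude.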
The experimental results mentioned above demonstrate that our framework is insensitive to the data size and value scale, and can achieve consistent assignment precision.
5.5 Evaluation in MOT
|Method||MOTA||MOTP||IDF1||MT||ML||FP||FN||IDS|
|FRT Bergmann et al. (2019)||53.5||78.0||52.3||19.5||36.6||12201||248047||2072|
|FRT+DHN Xu et al. (2020)||53.7||77.2||53.8||19.4||36.6||11731||247447||1947|
|FRT+BDL Minh-Tuan and Kim (2019)||54.9||75.5||53.8||20.2||35.8||11276||240981||2157|
|FRT+DCNN Lee et al. (2018)||56.3||76.8||55.0||21.2||35.2||9099||235161||2099|
|FRT+SD Cuturi (2013)||55.4||77.1||54.6||20.8||35.8||10144||239470||2050|
|FRT+RM Zeng et al. (2019)||56.4||78.9||55.3||21.3||35.3||8771||235400||1992|
|FRT+SH Sinkhorn (1964)||56.3||78.8||55.4||21.2||35.3||8745||235612||1996|
To validate the effectiveness of our framework in real scenarios, we take multi-object tracking (MOT) as the task and embed our GLAN into an MOT architecture as an assignment module. In the following, we introduce the experimental settings and report the experimental results in detail.
MOT can be tackled by exploiting the regression head of a detector to perform temporal alignment of object bounding boxes Bergmann et al. (2019). Based on the tracking pipeline in Bergmann et al. (2019), which adopts Faster-RCNN Ren et al. (2017) as the detector, Xu et al. (2020) embed the differentiable DHN into the MOT architecture to train the detector in an end-to-end manner. Similarly, we combine the differentiable assignment solvers with the tracker in Bergmann et al. (2019) to verify their effectiveness in MOT tasks. Here, we exclude DFC Lee et al. (2018) because it cannot cope with cost matrices of varying sizes. For simplicity, we refer to the tracker in Bergmann et al. (2019) as FRT (Faster RCNN Tracker).
In training, the matching between predicted bounding boxes and ground truth is viewed as an assignment problem, and the cost matrix is constructed according to center distance and Intersection-over-Union (IOU) between tracks and bounding boxes. Furthermore, we employ the soft MOTA and MOTP (please refer to Xu et al. (2020) for the details of the construction of them) to guide the learning process. In the case of MOT, there can be different numbers of tracks and ground-truth objects, and the objects in the previous frame can be missed in the current frame or a new object can exist in the current frame. Therefore, in training, we remove the constraint loss to prevent the un-assigned track or object from having a high assignment score. In evaluation, following the tracking pipeline in Bergmann et al. (2019), only the trained detector is used to perform tracking and the assignment solvers are not employed to facilitate the tracking process.
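The construction of such a cost matrix can be sketched as follows. This is an illustration under assumptions: boxes are given as (x1, y1, x2, y2), the center distance is normalized by the image diagonal, and the two terms are combined by a plain average of the normalized distance and 1 - IoU. The text only states that both quantities are used, so the combination rule and the names `iou`, `center_distance` and `cost_matrix` are hypothetical.

```python
import math


def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def center_distance(box_a, box_b, image_diagonal):
    """Euclidean distance between box centers, normalized by the image diagonal."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    return math.hypot(ax - bx, ay - by) / image_diagonal


def cost_matrix(tracks, boxes, image_size):
    """Assumed combination: average of normalized center distance and (1 - IoU)."""
    diag = math.hypot(*image_size)
    return [[0.5 * (center_distance(t, b, diag) + (1.0 - iou(t, b)))
             for b in boxes] for t in tracks]


# Two tracks and three candidate boxes; the matrix is rectangular (2 x 3).
tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
boxes = [(1, 1, 11, 11), (19, 21, 29, 31), (50, 50, 60, 60)]
costs = cost_matrix(tracks, boxes, image_size=(100, 100))
for row in costs:
    print([round(c, 3) for c in row])
```

The rectangular result reflects the situation described above, where the numbers of tracks and objects differ and some entries should remain unassigned.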
The performance of each tested method on the MOT benchmark MOT17 Milan et al. (2016) is evaluated by standard MOT metrics and reported in Table 6. We observe that, when trained along with differentiable assignment solvers, the tracker improves in most metrics. Specifically, the learning-free differentiable methods, RM Zeng et al. (2019), SD Cuturi (2013) and SH Sinkhorn (1964), solve the assignment problem directly on the cost matrix, which allows the relationships between each pair of tracks and bounding boxes to be taken into consideration. Therefore, the trackers trained with these solvers improve on almost all metrics. Despite the improvements achieved by the learning-based baselines, they still have several limitations in enhancing the tracker. In detail, BDL Minh-Tuan and Kim (2019) splits the assignment problem into several classification sub-tasks, but the relationships between these sub-tasks are not taken into account when solving the original assignment problem. DHN Xu et al. (2020) transforms the cost matrix into a sequential pattern and passes it to an RNN-based model to extract discriminative representations from the global scope. However, an RNN-based model is ill-suited to a time-independent problem. Besides, DCNN Lee et al. (2018) regards the cost matrix as an image and employs stacked CNN layers to obtain the assignment prediction, but the predicted assignments get stuck in local optima when the problem size exceeds the receptive field.
By virtue of the structure of the bipartite graph constructed from the cost matrix, the proposed framework GLAN captures the global inductive information well through message passing, and the tracker trained with GLAN achieves the best results in terms of most metrics. Specifically, on MOT17, both our model and RM Zeng et al. (2019) achieve the best improvement in terms of MOTA (), MOTP () and MT () over FRT Bergmann et al. (2019). Besides, the tracker combined with our GLAN also obtains the best results in terms of ML and FN, which demonstrates the effectiveness and adaptability of our method for the MOT task.
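For readers less familiar with the tracking metrics, MOTA in the standard CLEAR MOT formulation aggregates false positives, false negatives and identity switches relative to the total number of ground-truth boxes. The helper below is a minimal sketch of that definition; the function name `mota` and the toy counts are illustrative and are not taken from the benchmark.

```python
def mota(false_positives, false_negatives, id_switches, num_ground_truth):
    """CLEAR MOT accuracy: 1 minus the total error count per ground-truth box."""
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_positives + false_negatives + id_switches
    return 1.0 - errors / num_ground_truth


# Toy counts for illustration only (not benchmark results).
print(mota(false_positives=5, false_negatives=20, id_switches=0,
           num_ground_truth=100))
```

Because the three error counts are simply summed, a reduction in false negatives translates directly into a higher MOTA, and the score can become negative when the errors outnumber the ground-truth boxes.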
In this paper, we propose a novel learning framework to improve a differentiable solver for the linear assignment problem. We first convert the problem of making a linear assignment from a cost matrix into the problem of selecting edges from a constructed bipartite graph. To solve the edge selection task, we propose a graph network that forms structured representations for each edge by message passing and predicts their labels. Experimental results on synthetic datasets reveal that, with low computational complexity, the proposed method outperforms other state-of-the-art baseline solvers in assignment precision. Moreover, as a trainable layer in an MOT architecture, the proposed assignment solver boosts the tracking performance more than other differentiable approaches.
This work is supported by the National Natural Science Foundation of China (Nos. 62076021, 62072027 and 61872032) and the Beijing Municipal Natural Science Foundation (Nos. 4202060, 4202057 and 4212041).
- On the optimality and speed of the deep greedy switching algorithm for linear assignment problems. In IEEE Int. Symp. Parallel Distrib. Process. Worksh. Phd Forum, pp. 1828–1837.
- A dual feasible forest algorithm for the linear assignment problem. RAIRO-Operations Research 25 (4), pp. 403–411.
- Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Adv. Neural Inform. Process. Syst., pp. 1964–1974.
- Relational inductive biases, deep learning, and graph networks. CoRR abs/1806.01261.
- Tracking without bells and whistles. In IEEE Int. Conf. Comput. Vis., pp. 941–951.
- The auction algorithm: a distributed relaxation method for the assignment problem. Ann. Oper. Res. 14 (1), pp. 105–123.
- Hierarchical representation learning in graph neural networks with node decimation pooling. CoRR abs/1910.11436.
- A method for finding projections onto the intersection of convex sets in Hilbert spaces. In Adv. Order Restr. Stat. Inference, pp. 28–47.
- Linear assignment problems and extensions. In Handb. Comb. Optim., pp. 75–149.
- Sinkhorn distances: lightspeed computation of optimal transport. In Adv. Neural Inform. Process. Syst., pp. 2292–2300.
- An algorithm for restricted least squares regression. J. Am. Stat. Assoc. 78 (384), pp. 837–842.
- Greedy randomized adaptive search procedures. J. Glob. Optim. 6 (2), pp. 109–133.
- An efficient cost scaling algorithm for the assignment problem. Math. Program. 71, pp. 153–177.
- Matching based ground-truth annotation for online handwritten mathematical expressions. Pattern Recognit. 48 (3), pp. 837–848.
- A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing 38 (4), pp. 325–340.
- Serial order: a parallel distributed processing approach. In Adv. Psychol., Vol. 121, pp. 471–495.
- Computational results of an interior point algorithm for large scale linear programming. Math. Program. 52, pp. 555–586.
- A new polynomial-time algorithm for linear programming. Comb. 4 (4), pp. 373–396.
- The Hungarian method for the assignment problem. In 50 Years of Integer Programming, pp. 29–47.
- Convolutional networks for images, speech, and time series. Handb. Brain Theory Neural Netw. 3361 (10), pp. 1995.
- Deep neural networks for linear sum assignment problems. IEEE Wirel. Commun. Lett. 7 (6), pp. 962–965.
- A concave optimization algorithm for matching partially overlapping point sets. Pattern Recognit. 103, pp. 107322.
- A one-layer projection neural network for linear assignment problem. In Chin. Control Conf., pp. 3548–3552.
- MOT16: A benchmark for multi-object tracking. CoRR abs/1603.00831.
- Bidirectional long short-term memory neural networks for linear sum assignment problems. Appl. Sci. 9 (17), pp. 3470.
- Algorithms for the assignment and transportation problems. J. Soc. Ind. Appl. Math. 5 (1), pp. 32–38.
- Deep greedy switching: a fast and simple approach for linear assignment problems. In Int. Conf. Numer. Anal. Appl. Numer., pp. 9999.
- An approximate dual projective algorithm for solving assignment problems. In Network Flows And Matching, pp. 431–452.
- Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39 (6), pp. 1137–1149.
- A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann. Math. Stat. 35 (2), pp. 876–879.
- Deep affinity network for multiple object tracking. IEEE Trans. Pattern Anal. Mach. Intell. 43 (1), pp. 104–119.
- A linear relaxation heuristic for the generalized assignment problem. Nav. Res. Logist. 39 (2), pp. 137–151.
- Learning combinatorial solver for graph matching. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 7565–7574.
- A comprehensive survey on graph neural networks. IEEE Trans. Neural Networks Learn. Syst. 32 (1), pp. 4–24.
- Transductive zero-shot action recognition via visually connected graph convolutional networks. IEEE Trans. Neural Networks Learn. Syst. 32 (8), pp. 3761–3769.
- How to train your deep multi-object tracker. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 6786–6795.
- DMM-net: differentiable mask-matching network for video object segmentation. In IEEE Int. Conf. Comput. Vis., pp. 3928–3937.
- Graph edge convolutional neural networks for skeleton-based action recognition. IEEE Trans. Neural Networks Learn. Syst. 31 (8), pp. 3047–3060.