Deep neural network (DNN) models are widely used across many domains and often outperform traditional methods. They are adopted in place of traditional methods either because they are more accurate or because they reduce computation time. For example, many studies address complex optimization problems using DNN models [1, 2, 3], such as convolutional neural networks (CNNs) [4, 5, 6] or recurrent neural networks (RNNs) [7, 8, 9] and their variants. Using such models to solve complex problems has become common practice and has yielded better performance than traditional methods.
However, when the above-mentioned models are trained under supervised or weakly supervised learning, they usually learn to imitate the ground-truth data through backpropagation of the error between the network output and the ground-truth observation. In this process, the neural network, as a “black box”, is often criticized because learning amounts to imitating the ground-truth values rather than understanding the basic rules that those values obey [10]. In many implementations, a regularization term is added to the loss function of a DNN model so that the trained model satisfies some simple constraint conditions. For example, L1 regularization is widely used to encourage sparsity in trained models. But what if the ground-truth data are observed to satisfy other simple constraint conditions that the conventional regularizers used in loss functions cannot express?
Here we consider application scenarios such as the one shown in Figure 1. In this scenario, the ground-truth observations lie in a constrained region formed by some linear constraints. Since every inference output of the DNN model should also belong to this region, we can treat the constraints as a kind of prior knowledge. For a conventional DNN model, training directly minimizes the error between the ground-truth observation and the network output. An additional process, indicated by the red dashed line in Figure 1, should move the network output toward the constrained region. By combining these two processes, the model is expected to produce outputs closer to the constrained region, which in turn brings them closer to the ground-truth observation.
This paper proposes a more unified problem formulation and a general framework for obtaining DNN outputs that satisfy given constraints. We want the output of a neural network trained with supervised or weakly supervised learning to satisfy these constraints rather than merely imitate the ground-truth values. We also believe that when the output of a DNN model satisfies such constraint rules, the interpretability and transferability of the network can be improved. Here, we describe this problem in as much detail as possible and show the feasibility of a general method for solving it.
2 Related Works
Neural network models incorporating constraints have also been used to solve optimization problems [11, 12]. A novel neural network model has also been proposed for solving variational inequalities with linear and nonlinear constraints [13]. In addition, many studies adjust the internal structure of the neural network so that it meets certain constraints [14, 15]. In recent years, constrained DNN models have appeared in computer vision, such as a constrained convolutional layer that addresses the inability of state-of-the-art approaches to detect resampling in re-compressed images that were initially compressed with a high quality factor [16]. Besides, the constrained convolutional neural network (CCNN) uses a novel loss function to optimize for any set of linear constraints on the output space [17, 18]. Most previous studies on constrained neural network models are dedicated to specific problems in one particular field, and a more general description of the problem is lacking. Moreover, directly combining DNN models with constrained optimization algorithms, as in the CCNN mentioned above, cannot handle more general constraints and is not time-efficient, since an optimization problem must be solved for the network output of every batch.
3 Deep Neural Networks with constraints
3.1 Problem definition
Given a dataset that consists of input points and their corresponding ground-truth output points , we assume that the ground-truth outputs satisfy some equality and inequality constraint conditions. Inequality constraints are usually given by physical or expert knowledge and define a feasible set of possible inferred outputs. In this research, we focus only on linear constraints, so the aforementioned constraint conditions can be rewritten as , where . Given a deep learning model, denoted by , parameterized by , we use to denote the output of this model.
For a conventional DNN, the model is usually optimized by , where , and is a regularizer of the weights in the DNN model. Although this widely used loss function includes a regularization term for the weights, that term is designed mainly to keep the weights as simple as possible in order to prevent over-fitting and thus make the model more robust when approximating a complex mapping from inputs to outputs. This objective function cannot ensure that the outputs of the model satisfy the inequality constraints defined in Eq. 1. If the error between the inferred output of the deep learning model and the ground truth is small enough, however, the inferred output should eventually satisfy the constraint conditions. Therefore, we consider the minimization of the error as the ultimate goal and the constraint conditions as a kind of prior knowledge that can help solve the minimization problem if used properly.
Our objective is therefore to solve the following problem:
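As a hedged illustration, one plausible way to write such a constrained training objective is given below. All symbols are assumptions introduced only for this sketch: $f_\theta$ is the network with parameters $\theta$, $(x_i, y_i)$ are the $N$ training pairs, $\ell$ is a per-sample error such as the squared Euclidean distance, and $A$, $b$ collect the linear constraints on the output.

```latex
\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_\theta(x_i),\, y_i\big)
\quad \text{subject to} \quad
A\, f_\theta(x_i) \le b, \qquad i = 1, \dots, N .
```

In this form the error term is the quantity being minimized, while the constraints act as prior knowledge that restricts which outputs are admissible.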
To train a deep learning model, the most widely used optimization method is probably gradient descent (GD), a standard way to solve unconstrained optimization problems. A natural approach to the above objective is therefore projected gradient descent (PGD), a simple method for constrained optimization problems. Unfortunately, PGD is not efficient for optimization problems with complex constraint conditions. We therefore aim to propose a more general and efficient optimization method for the above objective.
3.2.1 Projected Gradient Descent (PGD)
Let us start with a simple unconstrained optimization problem like: . The optimal solution can be found using GD by: , where is the step size, and is the gradient of . However, if is restricted to a constrained region, denoted as , then the problem becomes . GD fails for this problem because is not guaranteed to lie inside the constrained region. An intuitive remedy is to use a projection to find the point in the constrained region that is closest to the updated , by .
The above projection is defined as an implicit function, and the projection itself defines a constrained optimization problem. In this paper, we consider constrained problems that contain only linear constraint conditions, so we can define the constrained region as and rewrite the projection function as . To solve this problem, we can use the Karush–Kuhn–Tucker (KKT) conditions:
where is a constant. Eq. 2 above gives an analytic solution of the constrained optimization problem.
Although Eq. 2 is usually not solved directly, many optimization algorithms can be interpreted as methods for numerically solving these KKT conditions [19]. Therefore, PGD for a constrained optimization problem can be written as a two-step update process: 1) ; 2) solving the KKT conditions of the constrained optimization problem . Both steps can be solved numerically with the aforementioned methods.
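As a minimal sketch of the two-step update described above, the following Python code runs PGD on a toy problem. The objective, the single half-space constraint, the step size, and all function names are assumptions chosen for illustration and are not taken from the experiments in this paper. A single linear inequality is used because its projection has a closed form, so step 2 needs no numerical solver.

```python
def project_halfspace(x, a, b):
    """Euclidean projection of x onto the half-space {z : a.z <= b}."""
    violation = sum(ai * xi for ai, xi in zip(a, x)) - b
    if violation <= 0.0:
        return list(x)  # already feasible: the projection is the identity
    scale = violation / sum(ai * ai for ai in a)
    return [xi - scale * ai for xi, ai in zip(x, a)]


def pgd(grad, x0, a, b, step=0.1, iters=200):
    """Projected gradient descent with a fixed step size (illustrative)."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]  # step 1: gradient step
        x = project_halfspace(x, a, b)                # step 2: projection
    return x


# Toy problem (assumed): minimise ||x - c||^2 with c = (2, 2)
# subject to x1 + x2 <= 2. The constrained minimiser is (1, 1).
c = [2.0, 2.0]
grad = lambda x: [2.0 * (xi - ci) for xi, ci in zip(x, c)]
x_star = pgd(grad, [0.0, 0.0], a=[1.0, 1.0], b=2.0)
print(x_star)
```

With more than one constraint, step 2 no longer has this simple closed form and generally requires an inner constrained solver, which is the cost discussed in the text.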
3.2.2 PGD for optimizing DNNs
The optimization process of a conventional DNN model can be written as the following equation:
where and are learning rates, is the iteration, is a small scalar, and is a mini-batch containing data samples. This simple update scales well and achieves good performance even on very large problems.
Here we only consider the case in which the output of the deep learning model should satisfy linear constraint conditions, and we define an implicit projection function denoted as . From the definition of the point , it is clear that every should be located in the feasible region of the linear constraint conditions. At the same time, since an optimal should be the closest to , we know that should lie on the boundary of the feasible region, and the line connecting and is orthogonal to the hyperplane on which is located, since . When the output already satisfies the constraint conditions, no constraint is active and the projected point is exactly the same as the output. However, if the output of the DNN model does not satisfy the constraint conditions, we should add this projection function to the optimization process of the DNN model. To revise the update of the DNN model weights , we replace with in Eq. 3. Although we could solve the KKT conditions to obtain the projected output, in practice it can usually be obtained only through numerical constrained optimization methods. So, to optimize the weights of the DNN model, we first need to optimize the projected output at each iteration.
Although we do not know the explicit function , we can obtain an identical point by solving Eq. 2, which yields the point in the constrained region closest to the original network output. However, directly solving this constrained optimization problem is not efficient, since it must be computed at every iteration when the DNN produces a mini-batch of outputs. We regard this training process as a two-step update method similar to PGD. For example, the constrained convolutional neural network [17] was proposed for image segmentation with some simple constraint conditions, and it requires solving a constrained optimization problem at every training iteration. We doubt that this strategy can be used in more general scenarios. We also argue that finding the closest point to the original DNN output in the constrained region is not necessary, because what we want is for the output of the DNN to be close to the constrained region and, in the best case, to satisfy the constraints. Directly solving the KKT conditions to guide the training of the DNN model is therefore a waste of computational resources, and we aim to find another approach to solve Eq. 1.
3.3 Differentiable Projection DNN
3.3.1 Projection for linear equality constraint conditions
Let us consider a special scenario in which there are only linear equality constraint conditions on the output of the deep learning model. We can write the constrained region in matrix form as , where . It is easy to see that the condition for to have feasible solutions is that . This means that a feasible region exists when the number of effective constraints is less than the dimension of the variable . More specifically, if the rank of satisfies , then, given any , the closest point subject to is .
The proof of the above equation is given in the Appendix. Based on this equation, given any network output , we can find the closest point. First, let us simplify the notation using . Note that if is defined as the Euclidean distance, because
where , and . We can therefore conclude that the projected output is a better inference, because the distance between the projected output and the ground truth is no larger than the distance between the original output and the ground truth.
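The property above can be checked numerically. The sketch below assumes the standard closed form for projecting a point y onto the affine set {y : Ay = b} when A has full row rank, namely y - Aᵀ(AAᵀ)⁻¹(Ay - b). The matrix, the vectors, and the helper names are illustrative assumptions, and a small Gaussian elimination routine is included only to keep the example free of external libraries.

```python
import math


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def solve(M, rhs):
    """Solve M x = rhs by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [list(row) + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def project_equality(y, A, b):
    """Closest point to y on {z : A z = b}, assuming A has full row rank."""
    r = [ri - bi for ri, bi in zip(matvec(A, y), b)]                   # A y - b
    gram = [[sum(p * q for p, q in zip(ri, rj)) for rj in A] for ri in A]  # A A^T
    lam = solve(gram, r)
    return [yi - sum(A[k][i] * lam[k] for k in range(len(A)))
            for i, yi in enumerate(y)]


def dist(u, v):
    return math.sqrt(sum((a - c) ** 2 for a, c in zip(u, v)))


# Assumed toy data: two equality constraints on a 3-dimensional output.
A = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]]
b = [1.0, 0.0]
y_true = [0.5, 0.5, 0.0]   # feasible ground truth
y_hat = [0.9, 0.2, 0.3]    # infeasible network output
y_proj = project_equality(y_hat, A, b)
residual = [ri - bi for ri, bi in zip(matvec(A, y_proj), b)]
print(y_proj, dist(y_hat, y_true), dist(y_proj, y_true))
```

Because the ground truth is itself feasible, the projected point can only be at least as close to it as the original output, which is what the printed distances illustrate.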
There are two strategies for using the projection property for linear equality constraint conditions in Eq. 4. 1) We first train a conventional DNN model and then add a linear projection layer after the pre-trained model to obtain the projected outputs. 2) We add a linear projection layer after the conventional DNN model and then train the model using the loss computed from the projected inference and the ground truth.
For the first strategy, the above equations show that the DNN model can produce a better inference, closer to the ground truth, even without fine-tuning the parameters of the pre-trained model. This indicates that if we have some very basic prior knowledge about the ground-truth data, we can build linear equality constraint conditions and the corresponding projection layer. With this simple projection layer, we can improve the performance of pre-trained models without fine-tuning their parameters, which is time-efficient and requires fewer computational resources.
For the second strategy, we first analyze how the training process changes. Following the chain rule used in the backpropagation (BP) algorithm, we begin with the final layer. In the above context, we use to denote the final output of the model, but since here we only consider the final layer, we use to denote the output of the final layer , and then
if we take the sigmoid activation function as an example. For a conventional DNN model, the local error of the final layer can be calculated by . Then, we can calculate the local error of the previous layers, which is used to compute the local gradients of the parameters in each layer with the BP algorithm. However, if we add a projection layer after the conventional DNN model, the local gradients of the parameters are affected even though the structure of the model is unchanged, because the local error of the final layer changes to . This does not increase the computational burden, and the projection can be regarded as a special activation function in deep learning terms.
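To make the effect on the local error concrete, the sketch below considers only the projection layer, not the sigmoid or the earlier layers. It assumes a single equality constraint a·y = b, for which the projection is the affine map z ↦ Pz + q with P = I - aaᵀ/‖a‖², and a squared-error loss on the projected output. Since P is symmetric, the error passed back to the layer below is P applied to the error of the projected output. The names and numbers are illustrative assumptions, and the analytic gradient is compared with central finite differences.

```python
def project(z, a, b):
    """Projection of z onto the hyperplane {y : a.y = b}."""
    s = (sum(ai * zi for ai, zi in zip(a, z)) - b) / sum(ai * ai for ai in a)
    return [zi - s * ai for zi, ai in zip(z, a)]


def loss(z, y, a, b):
    """Half squared error between the projected output and the target y."""
    p = project(z, a, b)
    return 0.5 * sum((pi - yi) ** 2 for pi, yi in zip(p, y))


def grad_analytic(z, y, a, b):
    """Backward pass through the projection: P @ (project(z) - y)."""
    p = project(z, a, b)
    e = [pi - yi for pi, yi in zip(p, y)]   # error of the projected output
    s = sum(ai * ei for ai, ei in zip(a, e)) / sum(ai * ai for ai in a)
    return [ei - s * ai for ei, ai in zip(e, a)]


def grad_numeric(z, y, a, b, h=1e-5):
    """Central finite differences of the loss with respect to z."""
    g = []
    for i in range(len(z)):
        zp = list(z); zp[i] += h
        zm = list(z); zm[i] -= h
        g.append((loss(zp, y, a, b) - loss(zm, y, a, b)) / (2.0 * h))
    return g


a, b = [1.0, 1.0, 1.0], 1.0          # assumed constraint: outputs sum to 1
y = [0.2, 0.3, 0.5]                  # feasible target
z = [0.9, 0.1, 0.6]                  # raw output of the layer below
g_a = grad_analytic(z, y, a, b)
g_n = grad_numeric(z, y, a, b)
max_diff = max(abs(u - v) for u, v in zip(g_a, g_n))
print(g_a, max_diff)
```

The backward step costs one extra matrix-vector product with a fixed matrix, which is consistent with the remark that the projection behaves like an inexpensive extra activation.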
3.3.2 Projection for linear inequality constraint conditions
In a more general scenario, the constraints are not always independent of each other, and many applications also involve inequality constraint conditions. In this section, we give a projection method for linear inequality constraint problems. Without any assumption about the linear independence of the constraints in Eq. 1, we have the following definitions. First, we define an index set of violated constraint conditions, denoted by . Then, we can define an iterative projection as:
where , and is the weights of the selected constraint conditions. The values of the weights can be chosen as long as , where is a small positive quantity, because we need to make sure that every violated constraint condition is considered. For simplicity, we can use as normalized weights of the violated constraint conditions. With the above definition, we can use Eq. 5 to find a feasible solution that satisfies the constraint conditions of Eq. 1. Given any output of the DNN model, we know that , where is the feasible region defined by the constraint conditions in Eq. 1. The detailed proof of the convergence of the above algorithm is given in the Appendix.
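The following Python sketch shows one way such an iterative projection can be realized. It is an assumption-laden illustration rather than the exact update of this paper: the violated constraints are combined with uniform normalized weights into a single surrogate inequality, the current point is projected onto that surrogate half-space, and the process repeats until no constraint is violated. The relaxation factor, the tolerance, the toy constraints, and all names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * c for a, c in zip(u, v))


def surrogate_step(y, A, b, relax=1.0, tol=1e-9):
    """One projection onto the weighted combination of violated constraints."""
    idx = [i for i in range(len(A)) if dot(A[i], y) - b[i] > tol]
    if not idx:
        return list(y), False                  # feasible: nothing to do
    w = 1.0 / len(idx)                         # uniform normalized weights
    g = [w * sum(A[i][j] for i in idx) for j in range(len(y))]  # combined normal
    r = w * sum(dot(A[i], y) - b[i] for i in idx)               # combined residual > 0
    # g is non-zero whenever the feasible region {y : A y <= b} is non-empty.
    s = relax * r / dot(g, g)
    return [yj - s * gj for yj, gj in zip(y, g)], True


def iterative_projection(y, A, b, max_iter=1000):
    """Repeat surrogate steps; returns the final point and the trajectory."""
    path = [list(y)]
    for _ in range(max_iter):
        y, moved = surrogate_step(y, A, b)
        if not moved:
            break
        path.append(list(y))
    return y, path


# Assumed toy constraints: y1 <= 1, y2 <= 1, y1 + y2 <= 1.5.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0, 1.0, 1.5]
y_feasible = [0.5, 0.5]                        # any feasible reference point
y_final, path = iterative_projection([3.0, 2.0], A, b)
dists = [math.sqrt(sum((p - q) ** 2 for p, q in zip(pt, y_feasible)))
         for pt in path]
print(y_final, dists)
```

In this example the distance from each iterate to the feasible reference point shrinks at every step, and the loop stops after a few iterations once all three constraints hold.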
So far, we have proposed a projection method, Eq. 5, that finds a feasible solution in the constrained region for any output of the DNN model after sufficiently many iterations. How many iterations are needed to find such a feasible solution remains an open issue. We are also concerned that many projection iterations will increase the computation time and make the projection method unsuitable for practical implementation. However, in the previous section we noted a useful property of the linear equality case: the projected output is closer to the ground truth, which lies in the feasible region, than the original output. We therefore ask whether this property also holds in the linear inequality case.
Let us define an error vector between the output after steps of projection and the ground truth , denoted as . Then, based on the previous equation, we have the following equation:
Based on Eq. 7 above, we can write the following equation:
From the above equation, we know that , because the second and third terms in Eq. 8 are non-positive. We can eventually find a projected output in the feasible region for a given deep learning model output , but this process is potentially time-consuming, since many projection steps may be necessary. Fortunately, the above derivation reveals a useful property of this projection method: the projected points at the current step are never farther from the ground truth, which lies in the constrained region, than the projected points at the previous step.
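For readers who want to see why the error cannot grow, the following derivation reproduces the argument under assumed notation, which may differ from the symbols of Eqs. 6 to 8. Here $y^k$ is the output after $k$ projection steps, $y$ is a feasible ground truth with $Ay \le b$, $e^k = y^k - y$, $\pi^k \ge 0$ is a row vector of weights supported on the violated constraints, and $\lambda \in (0, 2)$ is a relaxation factor.

```latex
\begin{aligned}
d^k &= \frac{\pi^k (A y^k - b)}{\lVert (\pi^k A)^\top \rVert^2}\, (\pi^k A)^\top,
\qquad y^{k+1} = y^k - \lambda\, d^k, \\[4pt]
\lVert e^{k+1} \rVert^2
&= \lVert e^k \rVert^2 - 2 \lambda\, (e^k)^\top d^k + \lambda^2 \lVert d^k \rVert^2, \\[4pt]
\pi^k A e^k &= \pi^k A y^k - \pi^k A y \;\ge\; \pi^k (A y^k - b) \;>\; 0
\;\;\Longrightarrow\;\; (e^k)^\top d^k \ge \lVert d^k \rVert^2, \\[4pt]
\lVert e^{k+1} \rVert^2
&\le \lVert e^k \rVert^2 - \lambda (2 - \lambda) \lVert d^k \rVert^2
\;\le\; \lVert e^k \rVert^2 .
\end{aligned}
```

The third line uses only the feasibility of $y$ and the non-negativity of the weights, so the bound holds regardless of whether the constraints are linearly independent.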
Therefore, only a limited number of projection steps is needed, and the projected outputs are always at least as good as the original outputs of the deep learning model. Here we make a small revision to the definition of the weights of the selected violated constraint conditions as , where . We can simplify Eq. 6 as:
where is a very small scalar, because we notice that when . When the set of violated constraint conditions is not empty, the very small scalar has little effect on the positive value of . However, we need to prevent the denominator of the equation from being zero when . Note that there is no assumption about the linear independence of the constraint conditions, which means that is not necessarily of full row rank, so we summarize the above equations in a more uniform, differentiable way.
Each time the DNN model outputs some original points , the projection method defined in Algorithm 1 yields converged feasible points located in the constrained region. We have to clarify that the projected feasible points are not always the KKT points of the original output, which means that they are not always the closest feasible points to the original output. The projection method finds a feasible solution, but not necessarily the best one.
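A small two-dimensional example makes this distinction visible. The sketch below uses an illustrative iterative projection with uniform weights over the violated constraints (an assumption, not necessarily the weighting of Algorithm 1) on the region y1 ≤ 0, y2 ≤ 0, and compares its result with the exact Euclidean projection, which for this region is simply a component-wise clip at zero.

```python
import math


def dot(u, v):
    return sum(a * c for a, c in zip(u, v))


def iterative_projection(y, A, b, max_iter=100, tol=1e-9):
    """Project repeatedly onto the averaged violated constraints."""
    y = list(y)
    for _ in range(max_iter):
        idx = [i for i in range(len(A)) if dot(A[i], y) - b[i] > tol]
        if not idx:
            break
        w = 1.0 / len(idx)
        g = [w * sum(A[i][j] for i in idx) for j in range(len(y))]
        r = w * sum(dot(A[i], y) - b[i] for i in idx)
        y = [yj - (r / dot(g, g)) * gj for yj, gj in zip(y, g)]
    return y


def dist(u, v):
    return math.sqrt(sum((a - c) ** 2 for a, c in zip(u, v)))


A = [[1.0, 0.0], [0.0, 1.0]]     # y1 <= 0 and y2 <= 0
b = [0.0, 0.0]
y0 = [1.0, 2.0]

y_iter = iterative_projection(y0, A, b)
y_closest = [min(v, 0.0) for v in y0]   # exact projection onto this region

print(y_iter, dist(y0, y_iter))         # feasible, but farther from y0
print(y_closest, dist(y0, y_closest))   # closest feasible point
```

Both points are feasible, yet the iterative result lies at a larger distance from the original output than the exact projection, which is the trade-off accepted in exchange for avoiding a constrained solver.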
4.1 Numerical Experiments
In this section, we present numerical experiments to demonstrate the effectiveness of the constrained neural network. The data used in these experiments are randomly generated synthetic data. We use the same parameter settings and training process for all methods to ensure a fair comparison. The input data are sampled from a 64-dimensional unit Gaussian distribution. We then build a nonlinear function with random parameters that maps the 64-dimensional inputs to 8-dimensional vectors. The linear equality and inequality constraints are all generated randomly, and we keep only the data pairs that satisfy these random constraints. The objective is to evaluate the performance of the projection DNN using the mean squared error (MSE) as the metric.
4.1.1 Effectiveness of projection layer
In this experiment, we compare the projection DNN (PDNN) with a conventional DNN. We also evaluate projection applied as post-processing of the DNN, denoted as DNN+projection. The PDNN gives the best results, while DNN+projection also outperforms the conventional DNN. This supports our assumption that, by incorporating linear constraints as prior information, even a simple model can give better results.
We also evaluate the impact of the number of projection layers and of the hyperparameter in the loss function. For a single iteration, we know that with more projection layers in a PDNN model, the projected outputs will be closer to the ground-truth observations. This property suggests that we should use more projection layers in a PDNN to get better performance. However, the experimental results shown in Figure 2 (left) contradict this intuition. At least in this experiment, the best performance is obtained when the PDNN consists of 3 stacked projection layers. The number of projection layers that achieves the best performance could differ in other application scenarios. Still, these results suggest that a limited number of stacked projection layers can be sufficient, and that stacking as many projections as possible only increases the computational burden.
Since the loss function is originally defined as , we also evaluate the impact of the hyperparameter . The best performance is obtained when , as shown in Fig. 2 (right). This result suggests that we should use only the projected outputs when calculating the loss that guides the training of the PDNN. However, we think the appropriate value of the hyperparameter depends on the scenario.
4.1.3 Impact of constraints
Experiments are also conducted to compare the different methods under different constraint conditions. The loss function of “DNN+resid” in Table II is . When the network output does not satisfy the constraints, is a positive value. If the network output satisfies the constraints, the residual is 0. The loss function of “DNN+fix resid” is similar: when the network output does not satisfy the constraints, a fixed penalty is added to the loss function.
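The two baseline losses can be sketched as follows. This is only one plausible reading of the description above: the residual is taken to be the summed positive part of Ay - b, and the penalty weight and fixed penalty value are hypothetical parameters, since the exact formulas and settings are not reproduced here.

```python
def dot(u, v):
    return sum(a * c for a, c in zip(u, v))


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def residual(pred, A, b):
    """Total constraint violation: sum of max(0, a_i . pred - b_i)."""
    return sum(max(0.0, dot(row, pred) - bi) for row, bi in zip(A, b))


def loss_resid(pred, target, A, b, weight=1.0):
    """'DNN+resid' style loss: error plus a weighted violation residual."""
    return mse(pred, target) + weight * residual(pred, A, b)


def loss_fix_resid(pred, target, A, b, penalty=1.0):
    """'DNN+fix resid' style loss: error plus a constant if any constraint fails."""
    violated = residual(pred, A, b) > 0.0
    return mse(pred, target) + (penalty if violated else 0.0)


# Assumed toy setting: one constraint y1 + y2 <= 1.5.
A, b = [[1.0, 1.0]], [1.5]
target = [1.0, 0.5]
bad = [1.5, 0.5]      # violates the constraint by 0.5
good = [1.0, 0.5]     # feasible and equal to the target

print(loss_resid(bad, target, A, b), loss_fix_resid(bad, target, A, b))
print(loss_resid(good, target, A, b), loss_fix_resid(good, target, A, b))
```

In this sketch the fixed penalty is piecewise constant in the network output, so it changes the loss value but contributes no gradient, whereas the residual term grows with the size of the violation and therefore depends on the scale of the constraints.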
From Table II, we notice that in most cases the PDNN outperforms the conventional DNN and DNN+proj. We must also point out that the baseline methods “DNN+resid” and “DNN+fix resid” sometimes give significantly better results than the other three methods, but they are not robust. When the constraint conditions change, the performance of these two methods becomes worse than that of the other three. A possible reason is that when the complexity of the constraint conditions increases, the residual of the outputs increases dramatically and the model cannot give accurate predictions. This might be solved by tuning the hyperparameters of the model, but that requires more effort than for the PDNN or the conventional DNN. Another advantage of the PDNN is therefore that it gives consistently good results without additional hyperparameter tuning.
4.2 Image segmentation using weak labels
We analyze and compare the performance of our proposed differentiable projection layer with image-level tags and additional supervision such as object size information. The objective is to learn models that predict dense multi-class semantic segmentation for a new image. We first train a VGG network initialized with pre-trained parameters, and then add differentiable projection layers after the model to demonstrate the utility of the proposed projection layer and how it helps as the level of supervision increases.
We evaluate the proposed differentiable projection DNN (DPDNN) on the task of semantic image segmentation on the PASCAL VOC dataset [21]. The dataset contains pixel-level labels for 20 object classes and a separate background class. For a fair comparison with prior work, we use a similar setup to train all models. Training is performed only on the VOC 2012 train set. The VGG network architecture used in our algorithm was pre-trained on the ILSVRC dataset for the 1K-class classification task. Results are reported using the standard intersection over union (IoU) metric, defined per class as the percentage of pixels predicted correctly out of the total pixels labeled or classified as that class.
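The per-class IoU defined above can be computed as in the following sketch, which works on flat lists of integer class labels. The function names and the tiny label vectors are illustrative assumptions, and classes that appear in neither the prediction nor the ground truth are skipped when averaging.

```python
def per_class_iou(pred, truth, num_classes):
    """IoU per class: |pred == c and truth == c| / |pred == c or truth == c|."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        ious.append(inter / union if union else None)  # None: class absent
    return ious


def mean_iou(pred, truth, num_classes):
    """Average of the per-class IoU values over classes that occur."""
    vals = [v for v in per_class_iou(pred, truth, num_classes) if v is not None]
    return sum(vals) / len(vals)


# Six pixels, three classes (0 plays the role of the background).
pred = [0, 0, 1, 1, 2, 2]
truth = [0, 1, 1, 1, 2, 0]
ious = per_class_iou(pred, truth, 3)
print(ious, mean_iou(pred, truth, 3))
```

For this example the per-class values are 1/3, 2/3 and 1/2, so the mean IoU is 0.5.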
4.2.2 Implementation details
In this section, we discuss the overall pipeline of our algorithm as applied to semantic image segmentation. We consider the weakly supervised setting, i.e., only image-level labels are available during training. At test time, the task is to predict the semantic segmentation mask for a given image. The CNN architecture used in our experiments is derived from the VGG 16-layer network [22]. It was pre-trained on the ImageNet 1K-class dataset and achieved winning performance in ILSVRC14. We cast the fully connected layers into convolutions in a similar way to that suggested in [23], and the last fc8 layer with 1K outputs is replaced by one with 21 outputs, corresponding to the 20 object classes in PASCAL VOC and the background class. The overall network stride of this fully convolutional network is 32s. Unlike [24, 25], we do not learn any weights of the last layer from ImageNet. Apart from the initial pre-training, all parameters are fine-tuned only on the PASCAL VOC dataset.
The FCN takes in arbitrarily sized images and produces coarse heatmaps corresponding to each class in the dataset. Although applying our proposed DP layer to finer-grained heatmaps would not be time-consuming, we apply it to these coarse heatmaps for a fair comparison with [17]. The network is trained using SGD with momentum. Following previous works, we train our models with a batch size of 1, a momentum of 0.99, and an initial learning rate of . Unlike previous works, we do not apply a fully connected conditional random field model [26].
4.2.3 Constraint conditions
We use constraint conditions similar to those used in [17]. For each training image , we are given a set of image-level labels .
Suppression constraint: No pixel should be labeled with a class that is not listed in the image-level labels.
Foreground constraint: The number of pixels in the image labeled with a class in the image-level labels should be greater than . This constraint is used to make sure each label can be detected efficiently.
Background constraint: Here we set as the background label. The upper bound and lower bound of the number of pixels labeled as background are , respectively.
In practice, all the constraint conditions listed above are written in matrix form and can be handled efficiently by a differentiable projection layer.
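As an illustration of how the three constraint types can be stacked into a single system Ay ≤ b, the sketch below flattens the per-pixel class scores into one vector and builds one row per constraint. The layout of the flattened vector, the threshold values, and every function name are assumptions made for this example; the thresholds used in the experiments are not reproduced here.

```python
def class_row(n_pixels, n_classes, label, sign):
    """Row that sums (with the given sign) the scores of one class over all pixels."""
    row = [0.0] * (n_pixels * n_classes)
    for i in range(n_pixels):
        row[i * n_classes + label] = sign
    return row


def build_constraints(n_pixels, n_classes, image_labels, fg_min, bg_min, bg_max):
    """Stack suppression, foreground and background constraints as A y <= b."""
    A, b = [], []
    for label in range(1, n_classes):                 # class 0 is the background
        if label in image_labels:
            # foreground: sum_i y[i, label] >= fg_min
            A.append(class_row(n_pixels, n_classes, label, -1.0)); b.append(-fg_min)
        else:
            # suppression: sum_i y[i, label] <= 0
            A.append(class_row(n_pixels, n_classes, label, 1.0)); b.append(0.0)
    # background: bg_min <= sum_i y[i, 0] <= bg_max
    A.append(class_row(n_pixels, n_classes, 0, 1.0)); b.append(bg_max)
    A.append(class_row(n_pixels, n_classes, 0, -1.0)); b.append(-bg_min)
    return A, b


def one_hot(labels, n_classes):
    """Flatten a hard segmentation into the assumed score layout."""
    y = []
    for l in labels:
        y.extend(1.0 if c == l else 0.0 for c in range(n_classes))
    return y


def satisfied(y, A, b, tol=1e-9):
    return all(sum(a * v for a, v in zip(row, y)) <= bi + tol
               for row, bi in zip(A, b))


# Four pixels, three classes, image-level label set {1} (hypothetical numbers).
A, b = build_constraints(4, 3, {1}, fg_min=1.0, bg_min=1.0, bg_max=3.0)
ok = satisfied(one_hot([0, 0, 1, 1], 3), A, b)       # respects all constraints
bad = satisfied(one_hot([0, 2, 1, 1], 3), A, b)      # class 2 should be suppressed
print(len(A), ok, bad)
```

Once the constraints are in this form, the same projection routine can be applied to any image, since only the rows of A and the entries of b change with the image-level labels.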
We summarize the experimental results in Table III. We first train a VGG model and then append different numbers of differentiable projection layers after it, without additional parameter tuning. From the results in the table, we notice that VGG obtains the best mean IoU with five differentiable projection layers. However, looking at the per-class performance, VGG with 20 differentiable projection layers gives the best performance in most classes. Again, these results show that more projection layers do not guarantee better overall results in practical applications.
The qualitative results are visualized in Figure 3. We notice that the proposed differentiable projection layer can suppress pixels of classes not listed in the image-level labels without any parameter tuning. For example, in the third image, the segmentation predicted by the VGG has some pixels labeled as a class (green pixels) that does not appear in the ground truth, and the output adjusted by the differentiable projection layer suppresses this wrong label. In the fourth image, although the wrong label is not entirely suppressed, its area is smaller than in the prediction of the VGG. Therefore, even without parameter tuning, the results are better than those of the conventional VGG.
| supervised (20 proj) | 47.73 | 35.05 | 6.74 | 42.14 | 37.05 | 28.44 | 69.37 | 44.40 | 51.15 | 11.23 | 60.97 | 46.59 | 49.97 | 40.83 | 50.49 | 37.71 | 12.38 | 44.44 | 44.32 | 47.35 | 39.06 |
| supervised (10 proj) | 48.02 | 34.43 | 6.40 | 43.61 | 36.65 | 27.61 | 70.90 | 44.44 | 55.23 | 11.87 | 56.93 | 45.18 | 50.00 | 40.10 | 52.05 | 39.26 | 11.88 | 45.52 | 44.55 | 46.33 | 39.37 |
| supervised (3 proj) | 47.87 | 34.59 | 6.20 | 44.49 | 36.86 | 27.01 | 70.23 | 44.17 | 57.07 | 11.63 | 54.69 | 43.95 | 49.61 | 39.40 | 52.54 | 39.61 | 11.56 | 45.74 | 44.31 | 45.31 | 39.52 |
| proj tuned (3 1e-6) | 47.84 | 34.21 | 6.06 | 44.61 | 36.71 | 26.94 | 70.17 | 44.22 | 57.26 | 11.47 | 54.63 | 43.78 | 49.68 | 39.37 | 52.56 | 39.71 | 11.37 | 45.71 | 44.28 | 45.09 | 39.31 |
| proj tuned (3 1e-5) | 46.90 | 29.76 | 3.92 | 45.28 | 34.31 | 25.94 | 69.33 | 43.73 | 59.11 | 10.22 | 52.90 | 42.55 | 50.15 | 38.52 | 51.28 | 40.16 | 10.14 | 44.99 | 43.03 | 41.71 | 37.29 |
5 Conclusion
This paper aims to solve problems with linear constraints, treated as a kind of prior knowledge, in a more general form. We discuss whether it is necessary to solve this kind of constrained problem through the KKT conditions, as has been widely done before. We also show that, in the linearly constrained case, a projection method can be used to build a differentiable projection layer for the DNN efficiently. We use the dot product of the errors of the actual output of the DNN and of the projected output to train the DNN. In addition, we show that the conventional DNN is a particular case of this projection DNN when the actual output is in the feasible region. We then conduct numerical experiments on a randomly generated synthetic dataset to evaluate the performance of the PDNN, including a comparison with a conventional DNN and some simple modifications of it. The experimental results show the effectiveness and robustness of the PDNN. We also conduct image segmentation experiments to show the utility of the proposed DPDNN model. Currently, the proposed projection layer is differentiable but has only fixed parameters. In the future, we will investigate a projection layer structure with learnable parameters. We will also evaluate our methods on real-world applications.
References
J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary,
M. Prabhat, and R. Adams, “Scalable bayesian optimization using deep neural
International conference on machine learning. PMLR, 2015, pp. 2171–2180.
-  M. Fischetti and J. Jo, “Deep neural networks and mixed integer linear optimization,” Constraints, vol. 23, no. 3, pp. 296–309, 2018.
-  H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, “Learning to optimize: Training deep neural networks for interference management,” IEEE Transactions on Signal Processing, vol. 66, no. 20, pp. 5438–5453, 2018.
-  T. N. Sainath, A.-r. Mohamed, B. Kingsbury, and B. Ramabhadran, “Deep convolutional neural networks for lvcsr,” in 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, 2013, pp. 8614–8618.
-  L. Xu, J. S. Ren, C. Liu, and J. Jia, “Deep convolutional neural network for image deconvolution,” Advances in neural information processing systems, vol. 27, pp. 1790–1798, 2014.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, pp. 1097–1105, 2012.
-  L. R. Medsker and L. Jain, “Recurrent neural networks,” Design and Applications, vol. 5, 2001.
-  T. Mikolov, S. Kombrink, L. Burget, J. Černockỳ, and S. Khudanpur, “Extensions of recurrent neural network language model,” in 2011 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2011, pp. 5528–5531.
M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,”IEEE transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
-  D. Castelvecchi, “Can we open the black box of ai?” Nature News, vol. 538, no. 7623, p. 20, 2016.
-  Y. Xia, “An extended projection neural network for constrained optimization,” Neural Computation, vol. 16, no. 4, pp. 863–883, 2004.
-  Y. Xia, H. Leung, and J. Wang, “A projection neural network and its application to constrained optimization problems,” IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 49, no. 4, pp. 447–458, 2002.
-  X.-B. Gao, L.-Z. Liao, and L. Qi, “A novel neural network for variational inequalities with linear and nonlinear constraints,” IEEE transactions on neural networks, vol. 16, no. 6, pp. 1305–1317, 2005.
-  F. Han, Q.-H. Ling, and D.-S. Huang, “Modified constrained learning algorithms incorporating additional functional constraints into neural networks,” Information Sciences, vol. 178, no. 3, pp. 907–919, 2008.
-  J. Chen and L. Deng, “A primal-dual method for training recurrent neural networks constrained by the echo-state property,” arXiv preprint arXiv:1311.6091, 2013.
-  B. Bayar and M. C. Stamm, “On the robustness of constrained convolutional neural networks to JPEG post-compression for image resampling detection,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 2152–2156.
-  D. Pathak, P. Krahenbuhl, and T. Darrell, “Constrained convolutional neural networks for weakly supervised segmentation,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1796–1804.
-  B. Bayar and M. C. Stamm, “Constrained convolutional neural networks: A new approach towards general purpose image manipulation detection,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 11, pp. 2691–2706, 2018.
-  S. Boyd and L. Vandenberghe, Convex optimization. Cambridge University Press, 2004.
-  K. Yang and K. G. Murty, “New iterative methods for linear inequalities,” Journal of Optimization Theory and Applications, vol. 72, no. 1, pp. 163–185, 1992.
-  M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes challenge: A retrospective,” International journal of computer vision, vol. 111, no. 1, pp. 98–136, 2015.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
-  D. Pathak, E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional multi-class multiple instance learning,” arXiv preprint arXiv:1412.7144, 2014.
-  P. O. Pinheiro and R. Collobert, “Weakly supervised semantic segmentation with convolutional networks,” in CVPR, vol. 2, no. 5. Citeseer, 2015, p. 6.
-  P. Krähenbühl and V. Koltun, “Efficient inference in fully connected CRFs with Gaussian edge potentials,” Advances in neural information processing systems, vol. 24, pp. 109–117, 2011.
-  A. R. De Pierro and A. N. Iusem, “A simultaneous projections method for linear inequalities,” Linear Algebra and its applications, vol. 64, pp. 243–253, 1985.
-  M. Nashed, “Continuous and semicontinuous analogues of iterative methods of Cimmino and Kaczmarz with applications to the inverse Radon transform,” in Mathematical Aspects of Computerized Tomography. Springer, 1981, pp. 160–178.
-  J. Dutta and C. Lalitha, “Optimality conditions in convex optimization revisited,” Optimization Letters, vol. 7, no. 2, pp. 221–229, 2013.
-  Z. Fan, X. Song, R. Shibasaki, and R. Adachi, “CityMomentum: an online approach for crowd behavior prediction at a citywide level,” in Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2015, pp. 559–569.
-  R. Dudek, “Iterative method for solving the linear feasibility problem,” Journal of optimization theory and applications, vol. 132, no. 3, pp. 401–410, 2007.
-  K. C. Kiwiel, “Block-iterative surrogate projection methods for convex feasibility problems,” Linear algebra and its applications, vol. 215, pp. 225–259, 1995.
-  H. H. Bauschke and J. M. Borwein, “On projection algorithms for solving convex feasibility problems,” SIAM review, vol. 38, no. 3, pp. 367–426, 1996.
-  Y. Censor, T. Elfving, G. T. Herman, and T. Nikazad, “On diagonally relaxed orthogonal projection methods,” SIAM Journal on Scientific Computing, vol. 30, no. 1, pp. 473–504, 2008.
-  A. Gibali, K.-H. Küfer, D. Reem, and P. Süss, “A generalized projection-based scheme for solving convex constrained optimization problems,” Computational Optimization and Applications, vol. 70, no. 3, pp. 737–762, 2018.
-  S. K. Roy, Z. Mhammedi, and M. Harandi, “Geometry aware constrained optimization techniques for deep learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4460–4469.
-  L. Zhang, M. Edraki, and G.-J. Qi, “CapProNet: Deep feature learning via orthogonal projections onto capsule subspaces,” arXiv preprint arXiv:1805.07621, 2018.
-  Y. Li, X. Liu, Y. Shao, Q. Wang, and Y. Geng, “Structured directional pruning via perturbation orthogonal projection,” arXiv preprint arXiv:2107.05328, 2021.
-  T. Frerix, M. Nießner, and D. Cremers, “Homogeneous linear inequality constraints for neural network activations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 748–749.
-  K. Ranasinghe, M. Naseer, M. Hayat, S. Khan, and F. S. Khan, “Orthogonal projection loss,” arXiv preprint arXiv:2103.14021, 2021.
-  H. Pan and H. Jiang, “Learning convolutional neural networks using hybrid orthogonal projection and estimation,” in Asian Conference on Machine Learning. PMLR, 2017, pp. 1–16.
-  H. Zhang, Y. Long, and L. Shao, “Zero-shot hashing with orthogonal projection for image retrieval,” Pattern Recognition Letters, vol. 117, pp. 201–209, 2019.