1 Introduction
Conflicting objectives are common in machine learning problems: designing a model requires balancing complexity against generalizability, training a model trades off bias and variance errors from datasets, and evaluating a model typically involves multiple metrics that, more often than not, compete with each other. Such tradeoffs among objectives often rule out a single solution that is optimal for all objectives. Instead, they give rise to a set of solutions, known as the Pareto set, with varying preferences over the objectives.
In this paper, we are interested in recovering Pareto sets in deep multitask learning (MTL) problems. Although MTL is inherently a multiobjective problem and tradeoffs are frequently observed in theory and practice, most prior work has focused on obtaining one optimal solution that is universally used for all tasks. To this end, prior approaches proposed new model architectures (Misra et al., 2016) or developed new optimization algorithms (Kendall et al., 2018; Sener and Koltun, 2018). Work on exploring a diverse set of solutions with tradeoffs is surprisingly rare and limited to finite, discrete solutions (Lin et al., 2019). In this work, we address this challenging problem by proposing an efficient method that reconstructs a first-order accurate, continuous approximation to Pareto sets in MTL problems.
The significant leap from finding a discrete Pareto set to discovering a continuous one requires a fundamentally new algorithm. Typically, generating one solution in a Pareto set is a time-consuming process that requires expensive optimization (e.g., training a neural network). To obtain an efficient algorithm for computing a continuous Pareto set, it is necessary to exploit local information. Our technical approach is inspired by second-order methods in multiobjective optimization (MOO) (Hillermeier, 2001; Martín and Schütze, 2018; Schulz et al., 2018), which connect the local tangent plane, the gradient information, and the Hessian matrices at a Pareto optimal solution in one concise linear equation. This result allows us to construct a continuous, first-order approximation of the local Pareto set. However, naively applying it to deep MTL scales poorly with the number of parameters (e.g., the number of weights in a neural network) because it requires full Hessian matrices. Motivated by other second-order methods in deep learning
(Martens, 2010; Vinyals and Povey, 2012), we propose to resolve the scalability issue with Krylov subspace iteration methods, a family of matrix-free, iterative linear solvers, and present a complete algorithm for generating families of continuous Pareto sets in deep MTL. We empirically evaluate our method on five datasets of various sizes and model complexities, ranging from MultiMNIST (Sabour et al., 2017), which consists of 60k images and requires a network classifier with only 20k parameters, to UTKFace (Zhang et al., 2017), an image dataset with 3 objectives and a modern network structure with millions of parameters. The code and data are available online at https://github.com/mit-gfx/ContinuousParetoMTL. Experimental results demonstrate that our method generates much denser Pareto sets and Pareto fronts than previous work, with a small computational overhead compared to the whole MTL training process. We also show in the experiments that the continuous Pareto sets can be reparametrized into a low-dimensional parameter space, allowing for intuitive manipulation and traversal in the Pareto set. We believe that our efficient and scalable algorithm can open up new possibilities in MTL and foster a deeper understanding of tradeoffs between tasks.
2 Related work
Multitask learning (MTL) is a learning paradigm that jointly optimizes a set of tasks with shared parameters. It is generally assumed that information across different tasks can reinforce the training of the shared parameters and improve the overall performance on all tasks. However, because tasks share parameters, their performances also compete with each other, so tradeoffs between performances on different tasks are prevalent in MTL. A standard strategy for dealing with these tradeoffs is to formulate a single-objective optimization problem that assigns a weight to each task (Kokkinos, 2017). Choosing the weights is typically empirical, problem-specific, and tedious. To simplify the selection process, prior work suggests heuristics for adaptive weights (Chen et al., 2018; Kendall et al., 2018). However, this family of methods aims to find one optimal solution for all tasks and is not designed for exploring tradeoffs.
Instead of solving a weighted sum of tasks as a single objective, some recent papers directly cast MTL as a multiobjective optimization (MOO) problem and bring multiple-gradient descent algorithms (MGDA) (Fliege and Svaiter, 2000; Désidéri, 2012; Fliege and Vaz, 2016) to MTL. Sener and Koltun (2018) formally formulate MTL as an MOO problem and propose to use MGDA to train a single optimal solution for all objectives. Another recent approach (Lin et al., 2019), the most relevant to our setting, pushes the frontier further by pointing out the necessity of exploring Pareto fronts in MTL and presents an MGDA-based method that generates a discrete set of solutions evenly distributed on the Pareto front. Each solution in their method requires full training from an initial network, which limits its ability to generate a dense set of Pareto optimal solutions.
All the methods discussed so far are based on first-order algorithms in MOO and generate either one solution or a finite set of sparse solutions with tradeoffs. A clear distinction between our paper and previous work is that we propose replacing discrete solutions with continuous solution families, allowing for a much denser set of solutions and continuous analysis on them. The advance from discrete to continuous solutions requires a second-order analysis tool in MOO (Hillermeier, 2001; Martín and Schütze, 2018; Schulz et al., 2018), which embeds tangent planes, gradients, and Hessians in one concise linear system. Our work is also related to Hessian-free methods in machine learning (Martens, 2010; Vinyals and Povey, 2012), which rely heavily on Hessian-vector products in neural networks (Pearlmutter, 1994) to solve Hessian systems efficiently.
3 Preliminaries
In this work, we consider the unconstrained multiobjective optimization problem
$\min_{x \in \mathbb{R}^n} F(x) = (f_1(x), f_2(x), \ldots, f_m(x)),$
where each $f_i: \mathbb{R}^n \to \mathbb{R}$ represents the objective function of the $i$-th task to be minimized. For any $x, y \in \mathbb{R}^n$, $x$ dominates $y$ if and only if $f_i(x) \le f_i(y)$ for all $i$ and $F(x) \ne F(y)$. A point $x^*$ is said to be Pareto optimal if $x^*$ is not dominated by any point in $\mathbb{R}^n$. Similarly, $x^*$ is locally Pareto optimal if $x^*$ is not dominated by any point in a neighborhood of $x^*$. The Pareto set of this problem consists of all Pareto optimal points, and the Pareto front is the image of the Pareto set under $F$. In the context of deep MTL, $x$ represents the parameters of a neural network instance and each $f_i$ represents one learning objective, e.g., a certain classification loss.
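As an illustration, Pareto dominance over a finite set of candidate solutions can be checked directly from the definition above. The following sketch (helper names our own; NumPy assumed) filters the non-dominated points from a batch of objective vectors:

```python
import numpy as np

def dominates(fa, fb):
    """fa dominates fb if fa is no worse in every objective and strictly
    better in at least one (both are vectors of objective values)."""
    fa, fb = np.asarray(fa), np.asarray(fb)
    return bool(np.all(fa <= fb) and np.any(fa < fb))

def pareto_optimal_mask(F):
    """Boolean mask of the non-dominated rows in an (N, m) array of objectives."""
    F = np.asarray(F)
    n = F.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and dominates(F[j], F[i]):
                mask[i] = False
                break
    return mask
```

This quadratic-time filter is only practical for small candidate sets, but it makes the dominance relation used throughout the paper concrete.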
Similar to single-objective optimization, solving for local Pareto optimality is better established than global Pareto optimality. A standard approach is to run gradient-based methods to find locally Pareto optimal candidates and then prune the results. Hillermeier (2001) describes the following necessary condition:
Definition 3.1 (Hillermeier 2001).
Assuming each $f_i$ is continuously differentiable, a point $x \in \mathbb{R}^n$ is called Pareto stationary if there exists $\alpha \in \mathbb{R}^m$ such that $\alpha_i \ge 0$, $\sum_{i=1}^m \alpha_i = 1$, and $\sum_{i=1}^m \alpha_i \nabla f_i(x) = 0$.
Proposition 3.1 (Hillermeier 2001).
All Pareto optimal points are Pareto stationary.
Once a Pareto optimal solution is found, previous papers (Hillermeier, 2001; Martín and Schütze, 2018; Schulz et al., 2018) have proven a strong result revealing the first-order structure of the local, continuous Pareto set:
Proposition 3.2 (Hillermeier 2001).
Assume that $F$ is smooth and $x^*$ is Pareto optimal. Consider any smooth curve $x(t)$ in the Pareto set passing through $x^*$, i.e., $x(0) = x^*$. Then there exists $\beta \in \mathbb{R}^m$ such that

$H(x^*)\, x'(0) = \sum_{i=1}^m \beta_i \nabla f_i(x^*),$  (1)

where $H(x^*)$ is defined as

$H(x^*) = \sum_{i=1}^m \alpha_i \nabla^2 f_i(x^*)$  (2)

and $\alpha$ is given by Definition 3.1.
In other words, for any smooth curve in the Pareto set passing through $x^*$, $H(x^*)$ transforms its tangent at $x^*$ into a vector in the space spanned by $\{\nabla f_i(x^*)\}$. By gradually changing the curve, its tangent sweeps the tangent plane of the Pareto set at $x^*$. Essentially, the theorem states that $H(x^*)$ connects the whole tangent plane to the span of the gradients. Note, however, that this theorem is not directly applicable to MTL because it requires full Hessians.
4 Efficient Pareto Set Exploration
Given an initial point $x_0$, our algorithm is executed in two phases: phase 1 uses gradient-based methods to generate a Pareto stationary solution $x^*$ from $x_0$. It then computes a few exploration directions to spawn new starting points. We execute phase 1 recursively by feeding it each newly generated point. Phase 2 constructs continuous Pareto sets: we first build a local linear subspace at each Pareto stationary solution by linearly combining its exploration directions. We then check whether two local Pareto fronts collide and stitch them together to form a larger continuous set. The major challenge brought by deep MTL is that $\mathbb{R}^n$ is the space of neural network parameters, so it is computationally prohibitive to explicitly calculate Hessian matrices. We describe phase 1 below; phase 2 is explained in Section 5.
4.1 GradientBased Optimization
Our algorithm is compatible with any gradient-based local optimization method as long as it can return a Pareto stationary solution from any initial point $x_0$. A standard method in MTL is to minimize a weighted sum of objectives with stochastic gradient descent (SGD) (Kokkinos, 2017; Chen et al., 2018; Kendall et al., 2018). Recent papers (Sener and Koltun, 2018; Lin et al., 2019) also proposed to determine a gradient direction online by solving a small convex problem. Essentially, these methods minimize a loss that combines gradients with fixed or adaptive weights.
4.2 First-Order Expansion
Once a Pareto stationary point $x^*$ is found, we explore its local Pareto set by spawning new points $\{x_k\}$. This is decomposed into two steps: computing $\alpha$ in Definition 3.1 at $x^*$, and estimating $\{v_k\}$, the basis directions of the tangent plane, from Proposition 3.2. The new points are then computed as $x_k = x^* + s v_k$, where $s$ is an empirical step size whose choice will be discussed in our experiments.
We acquire $\alpha$ at $x^*$ by solving the following convex problem (Désidéri, 2012), as suggested by Sener and Koltun (2018):

$\min_{\alpha \in \mathbb{R}^m} \Big\| \sum_{i=1}^m \alpha_i \nabla f_i(x^*) \Big\|^2 \quad \text{s.t.} \quad \alpha_i \ge 0, \; \sum_{i=1}^m \alpha_i = 1.$  (3)

Note that the objective can be written as a quadratic form of dimension $m$. Since $m$ is typically very small, solving it takes little time even for large neural networks.
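For the common two-task case ($m = 2$), this quadratic in $\alpha$ is one-dimensional and admits a closed-form solution, as used by Sener and Koltun (2018). A minimal sketch (function name our own):

```python
import numpy as np

def min_norm_alpha(g1, g2):
    """Closed-form minimizer of ||a*g1 + (1-a)*g2||^2 over a in [0, 1],
    the m = 2 special case of Problem (3)."""
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom == 0.0:          # identical gradients: any weight is optimal
        return 0.5
    a = float((g2 - g1) @ g2) / denom
    return float(np.clip(a, 0.0, 1.0))
```

For orthogonal unit gradients this returns 0.5, i.e., the minimum-norm point is the midpoint of the two gradients; when one gradient already dominates the other, the clip pushes the weight to 0 or 1.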
Given $\alpha$, finding a direction $v$ on the tangent plane at $x^*$ can be transformed into finding a solution of Equation (1):

$H(x^*)\, v = \sum_{i=1}^m \beta_i \nabla f_i(x^*).$  (4)
When $n$ is small, we can apply classic methods like the Gram-Schmidt process or QR decomposition. However, directly applying them in deep MTL is difficult for two reasons: first, $x^*$ is rarely a true Pareto stationary solution because training is terminated early to avoid overfitting. Second, and more importantly, the large parameter space makes any such dense method prohibitive.
To address the first issue, we propose a variant of Problem (3) that finds $\alpha$ as well as a correction vector $c$:

$\min_{\alpha, c} \; \|c\|^2 \quad \text{s.t.} \quad \sum_{i=1}^m \alpha_i (\nabla f_i(x^*) + c) = 0, \; \alpha_i \ge 0, \; \sum_{i=1}^m \alpha_i = 1.$  (5)

In other words, we seek the minimal modification to the gradients such that, if we used $\nabla f_i(x^*) + c$ as if they were the true gradients, $x^*$ would be Pareto stationary. It is easy to show that solving this new optimization problem brings little overhead to the original problem (see the supplemental material for the proof).
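To make the correction concrete, here is a sketch of one feasible choice: since $\alpha$ sums to one, setting $c = -\sum_i \alpha_i \nabla f_i(x^*)$ makes the weighted corrected gradients vanish exactly. Whether this choice is the minimizer is the subject of the paper's supplemental proof; the code below (helper name our own) only verifies feasibility.

```python
import numpy as np

def corrected_gradients(grads, alpha):
    """Given per-task gradients g_i (rows of `grads`) and simplex weights
    `alpha` from Problem (3), apply the correction c = -sum_i alpha_i g_i.
    Because alpha sums to one, sum_i alpha_i (g_i + c) = 0 exactly."""
    grads = np.asarray(grads, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    c = -(alpha[:, None] * grads).sum(axis=0)
    return grads + c, c
```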
To address scalability, we consider the following linear system with unknowns $v$:

$H(x^*)\, v = \sum_{i=1}^m \beta_i (\nabla f_i(x^*) + c),$  (6)

where $\beta$ is an $m$-dimensional column vector that is randomly sampled. In other words, we solve a linear system with the right-hand side sampled from the space spanned by $\{\nabla f_i(x^*) + c\}$. Solving such a large linear system in MTL requires an efficient matrix solver. We propose to use Krylov subspace iteration methods because they are matrix-free, iterative solvers, allowing us to solve the system without complete Hessians and to terminate with intermediate results. In our experiments, we use the minimal residual method (MINRES), a classic Krylov subspace method designed for symmetric indefinite matrices (Choi et al., 2011).
We now discuss MINRES in more detail to better explain why it is the right tool for this problem. The time complexity of MINRES depends on the time spent in each iteration and the number of iterations. The cost of each iteration is dominated by calculating $H(x^*)u$ for an arbitrary vector $u$, which is in general $O(n^2)$. However, it is well known that Hessian-vector products can be computed in $O(n)$ time on computational graphs (Pearlmutter, 1994), giving us the first strong reason to use MINRES. Analyzing the number of iterations is hard because it heavily depends on the rarely available eigenvalue distribution. In practice, MINRES is known to converge very fast for systems whose eigenvalues decay quickly (Fong and Saunders, 2012). In our experiments, we specify a maximum number of iterations $K$. We observed that a small $K$ was usually sufficient to generate good exploration directions even for networks with millions of parameters. Note that early termination in MINRES still returns meaningful results because the residual error is guaranteed to decrease monotonically with iterations.
To summarize, the efficiency of our exploration algorithm comes from two sources: exploration on the tangent plane and early termination of a matrix-free, iterative solver. The time cost of computing one tangent direction is $O(Kn)$, which scales linearly with the network size.
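The matrix-free pattern above can be sketched in a few lines with SciPy's MINRES and a Hessian-vector product oracle. For a self-contained example we approximate the product by central finite differences of the gradient; on a real computational graph one would use Pearlmutter's trick (e.g., double backpropagation) instead. Function names and the toy quadratic are our own.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, minres

def solve_tangent(grad_fn, x, rhs, max_iter=50, eps=1e-4):
    """Solve H(x) v = rhs without ever forming H, where H is the Hessian of
    the scalar function whose gradient is grad_fn (in the paper, the
    alpha-weighted sum of task losses). The Hessian-vector product is
    approximated by central finite differences of the gradient."""
    n = x.size
    def hvp(v):
        return (grad_fn(x + eps * v) - grad_fn(x - eps * v)) / (2.0 * eps)
    H = LinearOperator((n, n), matvec=hvp, dtype=float)
    v, info = minres(H, rhs, maxiter=max_iter)  # early termination allowed
    return v

# Toy check: f(x) = 0.5 x^T A x with A = diag(1, 2, 4), so H = A everywhere.
A = np.diag([1.0, 2.0, 4.0])
grad_fn = lambda x: A @ x
v = solve_tangent(grad_fn, np.zeros(3), np.array([1.0, 1.0, 1.0]))
```

The solver only ever touches `hvp`, so the full Hessian is never materialized; capping `maxiter` implements the early termination discussed above.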
4.3 The Full Algorithm
We now state the complete algorithm for Pareto set exploration in Algorithm 1. It takes as input a seed network and spawns Pareto stationary networks in a breadth-first style. Any network put in the queue has been returned by ParetoOptimize (Section 4.1) and is therefore Pareto stationary by design. When such a network is popped from the queue, ParetoExpand generates exploration directions (Section 4.2) and spawns child networks. The algorithm then calls ParetoOptimize to refine these networks before appending them to the queue, and terminates once the desired number of Pareto stationary networks has been collected.
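The control flow is a short breadth-first loop around the two subroutines; a schematic sketch under our own names and signatures (the real ParetoOptimize and ParetoExpand operate on network weight vectors):

```python
from collections import deque

def pareto_explore(x0, optimize, expand, n_target):
    """Breadth-first Pareto set exploration, a sketch of Algorithm 1.
    `optimize` maps any starting point to a Pareto stationary solution
    (Section 4.1); `expand` returns exploration steps, i.e., tangent
    directions already scaled by the step size (Section 4.2)."""
    root = optimize(x0)
    queue = deque([root])
    collected = [root]
    while queue and len(collected) < n_target:
        x = queue.popleft()
        for step in expand(x):
            child = optimize(x + step)   # refine before storing
            collected.append(child)
            queue.append(child)
            if len(collected) >= n_target:
                break
    return collected
```

Plugging in a trivial identity optimizer and constant steps already exercises the queue logic; in practice `optimize` is a few epochs of MGDA-style training and `expand` is the MINRES-based tangent estimation.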
For each output network, we also return the objectives, the gradients, and a reference to its parent. This information is mostly used to construct a continuous linear subspace approximating the local Pareto set, which we describe in Section 5. Another use is to remove the sign ambiguity in a tangent direction $v$: by definition, both $v$ and $-v$ are on the tangent plane, and an arbitrary choice can lead to a retraction instead of the desired expansion of the Pareto set. In this case, one can use the gradients to predict the changes in the objectives and rule out the undesired direction.
When Algorithm 1 is applied to MTL, it is worth noting that ParetoOptimize and ParetoExpand rarely return precise solutions because of stochasticity, early termination, and local minima. As a result, good choices of hyperparameters play an important role. We discuss in more detail two crucial hyperparameters (the MINRES iteration cap $K$ and the step size $s$) and report an ablation study in Section 6.
5 Continuous Parametrization
In this section, we describe a post-processing step that builds a continuous approximation to the local Pareto set based on the discrete points returned by Algorithm 1. For each Pareto stationary solution $x_i$, we collect its children $\{x_{i,j}\}$ and assign a continuous variable $t$ to a vector of convex weights. The local Pareto set at $x_i$ is then constructed as

$S_i = \Big\{\, t_0 x_i + \sum_j t_j x_{i,j} \;\Big|\; t_0 + \sum_j t_j = 1, \; t \ge 0 \,\Big\}.$  (7)

In other words, $S_i$ is the convex hull of $x_i$ and its children $\{x_{i,j}\}$. This construction is justified by the fact that a linear combination of tangent vectors is still on the tangent plane. As a special case, when there are only 2 objectives and each solution spawns one child, the solutions form a chain, and the union of the $S_i$ becomes a piecewise linear set in $\mathbb{R}^n$.
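In the two-objective chain case, the parametrization in Equation (7) collapses to a piecewise linear curve driven by a single scalar. A sketch (helper name our own, treating each anchor as a flattened weight vector):

```python
import numpy as np

def chain_point(anchors, t):
    """Map a scalar t in [0, len(anchors) - 1] to a point on the piecewise
    linear curve through the Pareto stationary anchors (the m = 2 chain
    special case of Equation (7))."""
    anchors = [np.asarray(a, dtype=float) for a in anchors]
    t = float(np.clip(t, 0.0, len(anchors) - 1))
    i = min(int(t), len(anchors) - 2)   # segment index
    frac = t - i                        # position within the segment
    return (1.0 - frac) * anchors[i] + frac * anchors[i + 1]
```

Integer values of `t` return the anchors themselves; fractional values interpolate between consecutive Pareto stationary solutions, which is exactly the single-variable traversal demonstrated in Section 6.4.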
It is possible that two continuous families can collide in the objective space, creating a larger continuous Pareto front. In this case, we create a stitching point in both families and crop solutions dominated by the other family. By repeatedly applying this idea, a single continuous Pareto front covering all families can possibly be created, providing the ultimate solution to continuous traversal in the whole Pareto front. We illustrate this idea on MultiMNIST with our experimental results in Section 6.4.
Since the continuous approximation interpolates different tangent directions, having more directions can enrich the coverage of the continuous set and offer more options to users. It is therefore natural to ask whether the set of tangent directions discovered in the last section could be augmented further by adding more directions without degrading the quality of the Pareto front. For the special case of two objectives ($m = 2$), it turns out that we can augment the set of known tangent directions with null vectors of the Hessian matrix, as stated in the following proposition:
Proposition 5.1.
Assume $F$ is sufficiently smooth. Let $x^*$ be a Pareto optimal point and, for a direction $d$, define the curve $c_d(t) = F(x^* + t d)$. If $x(t)$ is any smooth curve in Proposition 3.2 that satisfies $x'(0) = v$, then:
1) $c_v$ and $F \circ x$ have the same value and tangent direction at $t = 0$;
2) furthermore, if $u$ is a null vector of $H(x^*)$, i.e., $H(x^*) u = 0$, then $u$ is not parallel to $v$, and $c_{v+u}$ and $c_v$ have the same curvature at $t = 0$.
In this proposition, $c_d$ is a parametrized 2D curve: it takes a straight-line trajectory in $\mathbb{R}^n$ that passes through $x^*$ in the direction $d$ and uses $F$ to map this trajectory to the objective space, generating a 2D curve. The proposition states that if a tangent direction $v$ is known and we also have a null vector $u$, then the two curves $c_v$ and $c_{v+u}$ are very similar at $t = 0$ in the sense that they share the same value, gradients, and curvature. This means that for each tangent direction $v$ found in the previous section, $v + u$ can also be used as a backbone direction together with $v$ for continuous parametrization without degrading the quality of the reconstructed Pareto front.
While this proposition is generally not applicable to real problems due to its need for null vectors, it still has interesting theoretical implications: the fact that $c_v$ and $c_{v+u}$ share the same gradients should not be surprising, as $v + u$ also satisfies Equation (4), but it is less obvious that they actually share the same curvature at $t = 0$, which we illustrate in Section 6.4 and prove in our supplemental material. In practice, we observed that neural networks typically have a Hessian matrix with a null space of very high dimension. This means a very large set of basis directions, while not often accessible in real problems, can in theory be used to greatly enrich the Pareto set.
6 Experimental Results
6.1 Datasets, Metrics, and Baselines
We applied our method to five datasets in three categories: 1) MultiMNIST (Sabour et al., 2017) and its two variants FashionMNIST (Xiao et al., 2017) and MultiFashionMNIST, which are medium-sized datasets with two classification tasks; 2) UCI Census-Income (Kohavi, 1996), a medium-sized demographic dataset with three binary prediction tasks; 3) UTKFace (Zhang et al., 2017), a large dataset of face images. We used LeNet-5 (LeCun et al., 1998) (22,350 parameters) for MultiMNIST and its variants, a two-layer multilayer perceptron (158,598 parameters) for UCI Census-Income, and ResNet-18 (He et al., 2016) (tens of millions of parameters) for UTKFace. Please refer to our supplemental material for more information about the network architectures, task descriptions, and implementation details for each dataset.
We measure the performance of a method by two metrics: time cost and hypervolume (Zitzler and Thiele, 1999). We measure time cost by counting the evaluations of objectives, gradients, and Hessian-vector products. The hypervolume metric, explained in Figure 1, is a classic MOO metric for measuring the quality of exploration. More concretely, this metric takes as input a set of explored solutions in the objective space and returns a score; a larger hypervolume indicates a better Pareto front. Using these two metrics, we say a method is more efficient if, within the same time budget, it generates a Pareto front with a larger hypervolume, or equivalently, if it generates a Pareto front with a similar hypervolume in less time. For all figures in this section, we use the same random seed whenever possible and report results from more random seeds in the supplemental material.
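For two objectives, the hypervolume is simply the area dominated by the solution set with respect to a reference point, computable with a sort-and-sweep. A minimal sketch (both objectives minimized; helper name our own):

```python
import numpy as np

def hypervolume_2d(points, ref):
    """Area dominated by a set of 2D objective vectors with respect to a
    reference point `ref`, assuming both objectives are minimized."""
    pts = np.asarray(points, dtype=float)
    pts = pts[np.all(pts <= ref, axis=1)]   # keep points inside the box
    if len(pts) == 0:
        return 0.0
    pts = pts[np.argsort(pts[:, 0])]        # sweep by the first objective
    area, best_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 < best_f2:                    # non-dominated so far in the sweep
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area
```

Dominated points contribute nothing, so a denser Pareto front strictly increases the score toward the area under the true front, which is how the comparisons in Table 1 should be read.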
Our method is not directly comparable to any baselines because no prior work aims to recover a continuous Pareto front in MTL. Instead, we devised two experiments, which we call the sufficiency and necessity tests, to show its effectiveness (Section 6.3). In the sufficiency test, we consider four previous methods: GradNorm (Chen et al., 2018), Uncertainty (Kendall et al., 2018), MGDA (Sener and Koltun, 2018), and ParetoMTL (Lin et al., 2019). These methods aim at pushing an initial guess to one or a few discrete Pareto optimal solutions. For them, we show that our Pareto expansion procedure is a fast yet powerful complement by comparing the time and hypervolume before and after running it as a postprocessing step. We call this experiment the sufficiency test as it demonstrates our method is able to quickly explore Pareto sets and Pareto fronts.
Our necessity test, which focuses on the value of the tangent directions in exploring Pareto fronts, deserves some discussion of its baselines. There is a trivial baseline for Pareto expansion: rerunning an SGD-based method from scratch to optimize a perturbed weight combination of objectives. Since each new run requires full training, our method clearly dominates this baseline (it is many times faster on MultiMNIST). Another trivial baseline is to use a random direction instead of the tangent direction for Pareto expansion. We tested this idea but do not include it in our experiments because its performance is significantly worse than any other method, which is understandable given the high dimensionality of neural network parameters: as the dimensionality increases, the chance that a random guess stays on the tangent plane decays exponentially. The baseline we consider in this experiment is WeightedSum, which runs SGD from the last Pareto optimal solution but with weights on the objectives different from the weights used in training. Specifically, we choose the weights from one-hot vectors for each task as well as a vector assigning equal weights to every task. We call this experiment the necessity test because we use it to establish that the choice of expansion strategy is not arbitrary, and that tangent directions are indeed the source of efficiency in our method.
6.2 Synthetic Examples
6.2.1 ZDT2-variant
Our first example, ZDT2-variant, originates from ZDT2 (Zitzler et al., 2000), a classic benchmark problem in multiobjective optimization with two objectives. Both the Pareto set and the Pareto front of this example can be computed analytically, which makes ZDT2-variant an ideal example for visualizing Proposition 3.2 and Algorithm 1. Figure 2 compares the gradients to our tangent directions when used to explore the Pareto front. We used MINRES to solve for 5 tangent directions. Our directions are much closer to the Pareto set and tracked the true Pareto front much better than the gradients. We further compare their performance in Algorithm 1 with MGDA (Désidéri, 2012; Sener and Koltun, 2018) as the optimizer in Figure 3. The figure shows that the gradients expanded the neighborhood not along the Pareto set but into the dominated interior, resulting in a much more expensive correction step. In contrast, expanding with our predicted tangents steadily grew the solution set along the Pareto front.
6.2.2 MultiMNIST Subset
To understand the behavior of our algorithm when neural networks are involved, we picked a subset of images from MultiMNIST and trained a simplified LeNet (LeCun et al., 1998) with 1,500 parameters to minimize two classification errors. We generated an empirical Pareto front by optimizing the weighted sum of the two objectives with varying weights. We then picked a Pareto optimal solution and visualized the trajectories generated by traversing along gradients and along the approximated tangents after increasing numbers of MINRES iterations (Figure 4 left). Just as in ZDT2-variant, our approximated tangents tracked the Pareto front much more closely. We then compared the approximated tangents after 50 iterations of MINRES (MINRES-50) to the WeightedSum baseline (Section 6.1) after 50 iterations of SGD. The two methods had roughly the same time budget, and MINRES-50 outperformed the WeightedSum baseline by exploring a much wider Pareto front (Figure 4 middle). Specifically, its advantage comes from the much larger step size enabled by the approximated tangents (Figure 4 right).
6.3 Pareto Expansion
We first conducted the sufficiency test described in Section 6.1 to analyze Pareto expansion, the core of our algorithm. We ran ParetoMTL, the state of the art, on all datasets to generate discrete seeds for Pareto expansion. Moreover, for smaller datasets (MultiMNIST and its variants), we also ran the other baselines for a more thorough analysis. Compared to the time cost of generating discrete solutions (Table 1 column 2), our Pareto expansion only used a small fraction of the training time (Table 1 column 4) but generated much denser Pareto fronts (Figure 5 and Table 1 column 5). This experiment, as a natural extension to the synthetic experiments, confirms the efficacy of Pareto expansion on large neural networks and datasets.
The sufficiency test established that our expansion method has a positive effect on discovering more solutions. However, one could still argue that a simpler expansion strategy might be as good as ours. It remains to show that the benefit indeed comes from the approximated tangent directions. We verified this with the necessity test described in Section 6.1, which directly compares our Pareto expansion to the WeightedSum expansion strategy. Starting from the same seed solution, we gave both methods the same time budget, so the area of their expansions directly reflects their performance. We display the results on MultiMNIST, UCI Census-Income, and UTKFace in Figure 6. New solutions were generated after each run of MINRES in our method and after each epoch in WeightedSum. We provide more results in the supplemental material. These experiments show that our method discovered solutions that clearly dominated what WeightedSum returned on 4 of the 5 datasets, the exception being UCI Census-Income. From this experiment, we conclude that the tangent directions in Pareto expansion are indeed the core reason for the good performance of our algorithm.
The effectiveness of our Pareto expansion method can also be understood by noticing that it uses higher-order derivatives than previous work to determine the expansion directions. Consider three possible methods for expanding the local Pareto set from a known Pareto optimal solution $x^*$: retraining the neural network from scratch with a different initial guess reuses nothing from $x^*$; rerunning SGD from $x^*$ leverages the first-order gradient information at $x^*$; our method exploits both the first-order and the second-order information at $x^*$ and is therefore the most effective of the three.
It is worth mentioning that our Pareto expansion strategy is still a local optimization method, meaning that it inevitably suffers from being trapped in local minima. As a result, there is no theoretical guarantee on the resulting Pareto fronts being globally Pareto optimal. We alleviate this issue by exploring from multiple Pareto optimal solutions returned by previous methods and stitching them together, which we will explain shortly in the next section.
Table 1: time cost of training (train) and Pareto expansion (expand), measured in numbers of evaluations, and hypervolume before (hv) and after (new hv) expansion.

MultiMNIST | train | hv | expand | new hv
GradNorm | 21150 | 7.463 | 4520 | 7.628
Uncertainty | 21150 | 7.615 | 4520 | 7.756
MGDA | 21150 | 7.831 | 4520 | 7.896
WeightedSum | 70500 | 8.019 | 22600 | 8.034
ParetoMTL | 106281 | 8.025 | 22600 | 8.046

UCI Census-Income | train | hv | expand | new hv
WeightedSum | 467400 | 5.685 | 165600 | 5.725
ParetoMTL | 934888 | 5.642 | 165600 | 5.675

UTKFace | train | hv | expand | new hv
ParetoMTL | 35568 | 2.257 | 9920 | 5.030
6.4 Continuous Parametrization
From the discrete solutions returned by Algorithm 1, our continuous parametrization creates low-dimensional, locally smooth Pareto sets. Moreover, we stitch them together when their Pareto fronts collide, forming a larger continuous approximation. We illustrate this idea in Figure 7: we ran Algorithm 1 on MultiMNIST, generating two chains of solutions favoring small $f_1$ and small $f_2$ respectively. As described in Section 5, we then constructed a piecewise linear curve parametrized by a single variable $t$. By continuously varying $t$, we explore a diverse set of solutions ranging from those favoring small $f_1$ to those favoring small $f_2$. We highlight this mapping from a single control variable to a wider-range Pareto front because it demonstrates the real advantage of a continuous reconstruction over discrete solutions. As a straightforward application, one can analyze this mapping by running single-variable gradient descent to pick an optimal solution, which would be impossible if only discrete solutions were provided. We give more results in the supplemental material.
We conclude our discussion of continuous parametrization by demonstrating Proposition 5.1 on the MultiMNIST subset in Figure 7. We precomputed the full null space of the Hessian and revealed a large number of null-space bases. We then expanded the Pareto set at a Pareto optimal solution in three directions: a tangent direction $v$, $v$ plus a null vector $u$, and $v$ plus a random direction. As expected, expanding with the first two directions led to trajectories sharing the same gradient and curvature at the starting point, showing that we can enrich the Pareto set by adding null-space bases without degrading its quality.
6.5 Ablation Study
Finally, we conducted ablation tests on two crucial hyperparameters in our algorithm: the maximum number of MINRES iterations $K$ and the step size $s$ that controls the expansion speed. We started with a random Pareto stationary point returned by ParetoMTL, followed by running Algorithm 1 with fixed $K$ and $s$ on MultiMNIST and its two variants. The results are summarized in Figure 8, Table 2, and the supplemental material.
To see the influence of $K$, we fixed $s$ and ran experiments with $K \in \{20, 30, 50, 100, 500\}$, whose trajectories are shown in Figure 8. As $K$ increased from its smallest values, the trajectories were pushed towards the lower left, indicating a better approximated Pareto front. This is expected, since more iterations of MINRES were consumed. The trend plateaued for intermediate $K$, and the tail of the trajectory drifted away for the largest $K$. We hypothesize that the tangent after many iterations explored a new region in which the constant step size was not appropriate. Based on these observations, we used $K = 50$ in all experiments.
To understand how $s$ affects expansion, we reran the same experiments with a fixed $K$ and chose $s$ from $\{0.05, 0.10, 0.25, 0.50\}$. For each $s$, we set the number of points to be generated so that the product of the step size and the number of steps was constant. From Figure 8 right, we noticed that a conservative $s$ was likely to follow the Pareto front more closely, while an aggressive step size quickly led the search into the dominated interior. This is consistent with the fact that our tangents are a first-order approximation to the true Pareto set.
7 Conclusions
We presented a novel, efficient method for constructing continuous Pareto sets and fronts in MTL. Our method originates from second-order analytical results in MOO, and we combine it with matrix-free iterative linear solvers to make it a practical tool for large-scale problems in MTL. We analyzed the source of its efficiency thoroughly with demonstrations on synthetic examples. Moreover, experiments showed that our method scales to modern machine learning datasets and networks with millions of parameters.
Table 2: hypervolume (hv) under different MINRES iteration caps K (top) and step sizes s (bottom).

K | 20 | 30 | 50 | 100 | 500
hv | 7.731 | 7.739 | 7.734 | 7.727 | 7.669

s | 0.05 | 0.10 | 0.25 | 0.50
hv | 7.741 | 7.733 | 7.728 | 7.712
While the majority of work in MTL aims to find one near-optimal solution, we believe conflicting objectives in MTL are common and the full answer should be a wide range of candidates with varying tradeoffs. Although we are not the first to explore Pareto fronts in MTL or to apply second-order techniques to neural networks, we are, to the best of our knowledge, the first to introduce second-order analysis to Pareto exploration in MTL and the first to propose a continuous reconstruction. We believe our work enables many opportunities that would otherwise be impossible if only finite, sparse, and discrete solutions were given, for example, revealing the dimensionality and underlying structure of local Pareto sets, developing interpretable analysis tools for deep MTL networks, and encoding dense Pareto sets and fronts with limited storage.
Acknowledgments
We thank Tae-Hyun Oh for his insightful suggestions and constructive feedback on Krylov subspace methods. We also thank all reviewers for their comments. We thank Buttercup Foshey (and Michael Foshey) for her emotional support during this work. This work is supported by the Intelligence Advanced Research Projects Activity under grant 2019-19020100001, the Defense Advanced Research Projects Agency under grant N66001-15-C-4030, and the National Science Foundation under grant CMMI-1644558.
References
Chen, Z., Badrinarayanan, V., Lee, C.-Y., and Rabinovich, A. (2018). GradNorm: gradient normalization for adaptive loss balancing in deep multitask networks. In International Conference on Machine Learning, pp. 794–803.
Choi, S.-C. T., Paige, C. C., and Saunders, M. A. (2011). MINRES-QLP: a Krylov subspace method for indefinite or singular symmetric systems. SIAM Journal on Scientific Computing 33(4), pp. 1810–1836.
Désidéri, J.-A. (2012). Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. Comptes Rendus Mathematique 350(5–6), pp. 313–318.
Fliege, J. and Svaiter, B. F. (2000). Steepest descent methods for multicriteria optimization. Mathematical Methods of Operations Research 51(3), pp. 479–494.
Fliege, J. and Vaz, A. I. F. (2016). A method for constrained multiobjective optimization based on SQP techniques. SIAM Journal on Optimization 26(4), pp. 2091–2119.
Fong, D. C.-L. and Saunders, M. (2012). CG versus MINRES: an empirical comparison. Sultan Qaboos University Journal for Science [SQUJS] 17(1), pp. 44–62.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Hillermeier, C. (2001). Nonlinear Multiobjective Optimization: A Generalized Homotopy Approach. Vol. 135, Springer Science & Business Media.
Hillermeier, C. (2001). Generalized homotopy approach to multiobjective optimization. Journal of Optimization Theory and Applications 110(3), pp. 557–583.
Kendall, A., Gal, Y., and Cipolla, R. (2018). Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7482–7491.
Kohavi, R. (1996). Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In KDD, Vol. 96, pp. 202–207.
Kokkinos, I. (2017). UberNet: training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6129–6138.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), pp. 2278–2324.
Lin, X., Zhen, H.-L., Li, Z., Zhang, Q., and Kwong, S. (2019). Pareto multi-task learning. In Advances in Neural Information Processing Systems, pp. 12037–12047.
Martens, J. (2010). Deep learning via Hessian-free optimization. In ICML, Vol. 27, pp. 735–742.
Martín, A. and Schütze, O. (2018). Pareto Tracer: a predictor–corrector method for multiobjective optimization problems. Engineering Optimization 50(3), pp. 516–536.
Misra, I., Shrivastava, A., Gupta, A., and Hebert, M. (2016). Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3994–4003.
Pearlmutter, B. A. (1994). Fast exact multiplication by the Hessian. Neural Computation 6(1), pp. 147–160.
Sabour, S., Frosst, N., and Hinton, G. E. (2017). Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pp. 3856–3866.
Schulz, A., Wang, H., Grinspun, E., Solomon, J., and Matusik, W. (2018). Interactive exploration of design trade-offs. ACM Transactions on Graphics (TOG) 37(4), pp. 1–14.
Sener, O. and Koltun, V. (2018). Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems, pp. 527–538.
Vinyals, O. and Povey, D. (2012). Krylov subspace descent for deep learning. In Artificial Intelligence and Statistics, pp. 1261–1268.
Xiao, H., Rasul, K., and Vollgraf, R. (2017). Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
Zhang, Z., Song, Y., and Qi, H. (2017). Age progression/regression by conditional adversarial autoencoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5810–5818.
Zitzler, E., Deb, K., and Thiele, L. (2000). Comparison of multiobjective evolutionary algorithms: empirical results. Evolutionary Computation 8(2), pp. 173–195.
Zitzler, E. and Thiele, L. (1999). Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach. IEEE Transactions on Evolutionary Computation 3(4), pp. 257–271.