Combined Optimization Method (COM)
Our algorithm is a nontrivial extension of gradient descent. Let f(x) be the multivariable cost function to be minimized, where x is a d-dimensional vector. In standard gradient descent, we try to decrease f by moving in the direction of the negative gradient of f (or force), i.e.,

(1)  x ← x − η ∇f(x),

where the parameter η is called the learning rate. In the case of joint optimization of N cost functions, f_1, …, f_N, a linear combination of the cost functions,

(2)  f(x) = Σ_{k=1}^{N} w_k f_k(x),

is often introduced, where w_k (> 0) denotes the weight (or the importance) of the k-th cost function, f_k. Gradient descent finds one of the local minima of the cost function; the result strongly depends on the initial condition.
By appropriately combining two or more cost functions that share the same global minimum but have different distributions of local minima, we expect that the cost of the local minima is raised relative to the global minimum, which may enhance the success rate of the optimization. It should be noted, however, that the local minima cannot in general be removed completely, even if the weights are chosen carefully (see Fig. 1). This means that gradient descent still gets trapped by one of the remaining local minima, and thus one has to introduce some metaheuristic, such as simulated annealing Pannetier et al. (1990); Doll et al. (2008); Tsujimoto et al. (2018).
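As a baseline for what follows, the joint optimization of equation 2 by plain gradient descent can be sketched in a few lines. This is a minimal illustration with two hypothetical toy cost functions that share their global minimum at the origin; the function names, gradients, and weights here are ours, not from the paper:

```python
import numpy as np

def joint_gradient_descent(grads, weights, x0, eta=0.05, steps=2000):
    """Plain gradient descent, equation (1), applied to the
    linear combination of cost functions, equation (2)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        # gradient of sum_k w_k f_k
        g = sum(w * g_k(x) for w, g_k in zip(weights, grads))
        x = x - eta * g
    return x

# Toy gradients of f_1(x) = |x|^2 and f_2(x) = |x|^2 + 0.3 sum_i (1 - cos 4x_i);
# both share the global minimum at x = 0, but f_2 has extra ripples,
# i.e., a different distribution of local minima.
g1 = lambda x: 2 * x
g2 = lambda x: 2 * x + 1.2 * np.sin(4 * x)

x_opt = joint_gradient_descent([g1, g2], [0.5, 0.5], x0=[0.4, -0.3])
print(x_opt)  # close to [0, 0]
```

With a benign starting point this converges to the shared minimum; started inside a basin of a surviving local minimum of the combination, it would stay trapped there, which is the failure mode discussed above.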
(3)  F_i(x) = (1/Z) [ Σ_{k=1}^{N} (∂f_k(x)/∂x_i)² ]^{1/2},  with Z chosen so that |F| = 1,

(4)  s_i ← −sgn(∂f_k(x)/∂x_i)  if sgn(∂f_k/∂x_i) is common to all k; otherwise s_i is kept unchanged,

(5)  x_i ← x_i + η s_i F_i(x),

(6)  η ← η/2  if s_i changes its sign for all i simultaneously.
In order to avoid being trapped by local minima, we introduce two essential modifications to gradient descent. Our algorithm, named the "Combined Optimization Method (COM)," is summarized in Algorithm 1. First, we introduce the generalized force F defined in equation 3. By definition, each component of the generalized force, F_i, is nonnegative and becomes zero if and only if all the derivatives ∂f_k/∂x_i (k = 1, …, N) vanish. This guarantees that the algorithm stops only at the common minimum shared by all the cost functions. Second, we introduce memory into the update direction. In equation 4, the sign factors s_i, which determine the direction of the update, are chosen according to a unanimity rule for each component. When the votes are split, the sign factor of the previous step is inherited for that component. This last rule introduces a memory effect, or momentum, into the optimization process, and helps the configuration escape from local minima of the individual cost functions.
It should be noted that the generalized force F is always normalized to unity, i.e., |F| = 1. The learning rate η in the present algorithm therefore has the same physical dimension as x, and thus one can choose the initial value of η in a natural way by considering the physical properties of the target problem, as we will discuss later. At step 6 in Algorithm 1, the learning rate is halved when all the sign factors change their signs at the same time, which gradually decreases η near the optimal point.
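A single iteration of Algorithm 1 can be sketched as follows. This is a minimal reading of the description above, not the authors' implementation: in particular we assume the generalized force of equation 3 to be the component-wise root sum of squares of the individual gradients (any form that is nonnegative and vanishes iff all derivatives vanish would serve), and the toy gradients at the bottom are ours:

```python
import numpy as np

def com_step(x, grads, s_prev, eta):
    """One COM iteration (a sketch of Algorithm 1; the exact form of
    the generalized force is an assumption of this sketch)."""
    G = np.array([g(x) for g in grads])        # shape (N, d)
    F = np.sqrt((G ** 2).sum(axis=0))          # F_i = 0 iff all df_k/dx_i vanish
    norm = np.linalg.norm(F)
    if norm == 0.0:                            # common minimum reached: stop
        return x, s_prev, eta
    F = F / norm                               # normalize so that |F| = 1
    votes = -np.sign(G)                        # each cost function's preferred sign
    unanimous = np.all(votes == votes[0], axis=0) & (votes[0] != 0)
    s = np.where(unanimous, votes[0], s_prev)  # unanimity rule with memory
    if np.all(s == -s_prev):                   # all components flipped at once:
        eta = eta / 2                          # halve the learning rate (step 6)
    return x + eta * s * F, s, eta

# Toy run: two gradients sharing their global minimum at the origin.
g1 = lambda x: 2 * x
g2 = lambda x: 2 * x + 1.2 * np.sin(4 * x)

x, s, eta = np.array([0.4, -0.3]), np.ones(2), 0.05
for _ in range(2000):
    x, s, eta = com_step(x, [g1, g2], s, eta)
print(x)  # near [0, 0]
```

Note that at a point where only one gradient vanishes (a local minimum of a single cost function), F stays nonzero in this sketch, so the iteration keeps moving, unlike descent on the linear combination.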
The proposed method reduces to conventional gradient descent, apart from the learning-rate schedule, if we consider only one cost function (N = 1). Although the COM is an extension of gradient descent, its convergence behavior is qualitatively different, due to the construction of the generalized force and the introduction of the memory effect. In particular, the direction of optimization, i.e., the set of sign factors s_i, is determined neither by a particular cost function nor by the average of the cost functions, but depends on the detailed landscape of all the cost functions through the unanimity rule (equation 4). Near the common minimum, all the sign factors change their signs at every step, and as a result a damped oscillation occurs around the optimal point. On the other hand, such an oscillation does not occur at a local minimum of an individual cost function, since the sign factors do not change at such a point (see Fig. 1).
The above argument can be generalized to higher-dimensional cases (d > 1). Let us first consider gradient descent for a single cost function (N = 1). With a sufficiently small learning rate, the d-dimensional search space can be decomposed into a set of domains, each of which defines the "basin of attraction" of a global or local minimum of the cost function. We define such a set of domains for each cost function also in the case of joint optimization (N > 1). Then, we consider one of the intersections of the N domains, in which, by construction, n (0 ≤ n ≤ N) of the cost functions have a minimum. If n = N, the COM cannot escape from the intersection. Indeed, the intersection containing the global minimum, which is shared by all the cost functions, has exactly N minima; in other words, the COM gets trapped in the vicinity of the global minimum, as we have expected. If n < N, on the other hand, the COM can escape from the intersection, since at least one sign factor does not change its sign within the intersection. This remarkable property of the COM enables us to reach the common minimum.
Indeed, the average number of intersections with exactly N minima can be estimated as follows. Let m_k be the number of local minima of the k-th cost function (k = 1, …, N) and M the total number of intersections. By construction, M ≥ m_k for all k. Note that the equality holds only when all the cost functions are identical to each other, and we can expect that M is much larger than the m_k's when we prepare the set of cost functions so that they have different distributions of local minima. Then, since the probability that a randomly chosen intersection contains a local minimum of the k-th cost function is given by m_k/M, the average number of intersections with exactly N minima is

(7)  M ∏_{k=1}^{N} (m_k/M) = (∏_{k=1}^{N} m_k) / M^{N−1},

that is, on average the number of such intersections is much smaller than the number of local minima of each cost function. Thus, we expect that the COM converges to the global minimum with a high success rate, in comparison with the conventional joint optimization using a linear combination of cost functions.
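The counting estimate above can be checked numerically. The sketch below is a toy model of our own construction, assuming local minima scattered independently and uniformly over the intersections; it compares the prediction ∏ m_k / M^(N−1) with a direct Monte Carlo count:

```python
import random

def common_minima_count(M, ms, trials=2000, seed=1):
    """Average number of intersections containing a minimum of every
    cost function, when m_k minima are placed uniformly at random
    among M intersections, independently for each cost function."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        sets = [set(rng.sample(range(M), m)) for m in ms]
        total += len(set.intersection(*sets))
    return total / trials

M, ms = 1000, [50, 40]
prediction = 1.0
for m in ms:
    prediction *= m / M
prediction *= M                      # M * prod_k (m_k / M) = prod m_k / M^(N-1)
print(prediction)                    # 2.0
print(common_minima_count(M, ms))    # close to 2.0, far below min(m_k) = 40
```

Even with only two cost functions, the expected number of common minima (2) is more than an order of magnitude below the number of local minima of either cost function alone (50 and 40).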
Application to Crystal Structure Prediction
In this section, we demonstrate the effectiveness of the COM by applying it to an optimization problem encountered in materials science. Here, we address crystal structure prediction by simultaneously optimizing the theoretical potential energy and the degree of coincidence with experimental data, the same problem discussed by Tsujimoto et al. Tsujimoto et al. (2018). As target materials, we consider the Si diamond structure and two well-known polymorphs of SiO₂, low quartz and low cristobalite. Theoretical structure prediction for SiO₂ is extremely difficult, since there exists an enormous number of metastable structures.
We introduce two cost functions that have qualitatively different properties from one another. The first cost function, f₁, is the theoretical potential energy of the crystal, which is a function of the atomic positions x and is, in general, sensitive to the local structure between adjacent atoms. For calculating the potential energy from the atomic positions, we adopt classical force fields, the Tersoff potential Tersoff (1989) for Si and the Tsuneyuki potential Tsuneyuki et al. (1988) for SiO₂, using the LAMMPS package Plimpton (1995) through the ASE Larsen et al. (2017).
The second cost function, f₂, is defined using the powder X-ray diffraction pattern, which reflects the long-range periodicity of the crystal. We calculate the diffraction intensity I(2θ; x), where 2θ is the diffraction angle, from the atomic positions x, and measure the degree of coincidence with the experimental diffraction intensity I_exp(2θ). For the calculation of I(2θ; x), we use the atomic scattering factors listed in the International Tables for Crystallography Prince (2006). In conventional crystal structure determination using experimental X-ray diffraction data, the so-called R value,
(8)  R = Σ_{2θ} |I_exp(2θ) − I(2θ; x)| / Σ_{2θ} I_exp(2θ),
has been widely used Harris and Tremayne (1996); Putz et al. (1999). This R value is, however, sensitive to the noise and/or the background of the experimental data, and thus it does not necessarily attain its global minimum at the correct crystal structure. In the present work, we instead adopt a cost function that does not use the intensity information of the X-ray diffraction peaks Tsujimoto et al. (2018), which is defined by
(9)  f₂(x) = 1 − C(x),

where C(x) is called the "crystallinity" and is calculated using only the positions of the experimental X-ray diffraction peaks, i.e.,

(10)  C(x) = [ Σ_p ∫_{2θ_p − Δ}^{2θ_p + Δ} I(2θ; x) d(2θ) ] / [ ∫ I(2θ; x) d(2θ) ].
Here, I(2θ; x) denotes the intensity of the diffraction pattern calculated for the atomic positions x, 2θ_p denotes the peak positions in the reference (e.g., experimentally observed) diffraction pattern, and Δ is the diffraction angle resolution. In our calculations, the diffraction angle resolution is set to 0.1°, and the X-ray wavelength is set to 0.1541 nm, which corresponds to Cu Kα radiation. The range of integration in the numerator is restricted to the vicinity of the peaks observed in the experiment, while that in the denominator covers the whole measured angle range. By definition, 0 < C(x) ≤ 1; for the correct crystal structure, where all the calculated peak positions coincide with those observed in the experiment, we have C = 1, and otherwise 1 − C becomes nonzero.
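In code, the crystallinity and the corresponding cost can be sketched as follows. This is a minimal version on a uniform 2θ grid; the Gaussian toy pattern, peak positions, and function names are ours, for illustration only:

```python
import numpy as np

def crystallinity(two_theta, intensity, peaks, delta=0.1):
    """Equation (10) on a uniform 2-theta grid (degrees): the fraction of
    the calculated intensity lying within +/- delta of the reference peaks."""
    near = np.zeros_like(two_theta, dtype=bool)
    for p in peaks:
        near |= np.abs(two_theta - p) <= delta
    return intensity[near].sum() / intensity.sum()

def xrd_cost(two_theta, intensity, peaks, delta=0.1):
    """Equation (9): vanishes when every calculated peak coincides
    with an observed peak position."""
    return 1.0 - crystallinity(two_theta, intensity, peaks, delta)

# Toy calculated pattern: two narrow Gaussian peaks at 28.4 and 47.3 deg.
tt = np.arange(20.0, 60.0, 0.01)
I = np.exp(-((tt - 28.4) / 0.03) ** 2) + np.exp(-((tt - 47.3) / 0.03) ** 2)

print(xrd_cost(tt, I, [28.4, 47.3]))  # ~0: calculated peaks match the reference
print(xrd_cost(tt, I, [30.0, 50.0]))  # ~1: calculated peaks miss the reference
```

Note that the cost depends only on where the calculated intensity falls relative to the reference peak positions, not on how tall the reference peaks are, which is exactly the robustness (and the degeneracy) discussed below.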
The global minimum of the crystallinity cost is more robust against noise in the experimental data. It should be noted, however, that other nontrivial global minima with C = 1 may appear, since the crystallinity does not take the intensities into account. Although this fact makes it more difficult to determine the correct structure using the crystallinity alone than using the R value, it has been shown that crystal structure prediction becomes much more reliable when the crystallinity is combined with the theoretical potential calculation Tsujimoto et al. (2018). In what follows, we demonstrate that the COM is able to accelerate the structure determination further, by exploiting the property that the theoretical potential energy and the crystallinity cost share the identical target structure as their global optimum.
Si diamond structure
We use a cubic supercell of 64 Si atoms. The linear size of the supercell is fixed and is used as the unit of the coordinates x and the learning rate η. Periodic boundary conditions are imposed. The initial positions of the atoms are sampled uniformly at random in the supercell. For the definition of the crystallinity, we use only the three peaks located between 2θ = 20° and 60°.
First, we show in Fig. 2 a typical change in the two cost functions during a COM optimization run for the Si diamond structure. In the early stage of the optimization, we observe that one of the cost functions decreases almost monotonically. As for the other, although it intermittently exhibits sharp spike structures, both the spike and valley values decrease gradually. This early-stage behavior is similar to that of standard steepest descent. In the middle stage, both cost functions show oscillating behavior. More precisely, the two cost functions oscillate in antiphase with each other. This indicates that the configuration is repeatedly trapped at some local minimum of one of the two cost functions and escapes from it, one after another. This behavior cannot occur in standard steepest descent. The conventional joint optimization using a linear combination of cost functions is not able to escape from local minima without the help of metaheuristics, while the COM can get out of such structures, because the sign factors do not change when the gradients of the cost functions conflict with each other. Finally, in the late stage, both cost functions start to decrease again and successfully reach their minimum values, which correspond to the correct Si diamond structure.
Next, we discuss how the performance of the COM depends on the initial learning rate η₀. In Fig. 3, we show the change in the potential energy for η₀ = 1, 0.1, and 0.01, starting from the same initial configuration. It is observed that when the initial learning rate is sufficiently small, e.g., η₀ = 0.01 or 0.1, the potential energy exhibits a quick relaxation at the very early stage of the optimization and successfully reaches the correct structure. If the learning rate is too large, i.e., η₀ = 1, on the other hand, the potential energy fluctuates strongly and does not converge even after 1000 iterations. In the inset of Fig. 3, we show the same data with the horizontal axis multiplied by η₀. We observe similar behavior in the initial relaxation for η₀ = 0.01 and 0.1. This means that the optimization trajectories in the d-dimensional search space are almost identical in these two cases, though the former requires 10 times more iterations to travel the same distance. With η₀ = 1, the initial relaxation follows a different trajectory from the other cases. This implies that η₀ = 1 is too large in comparison with the typical length scale of the potential energy landscape, and thus a smaller η₀ (0.1 or 0.01) seems more appropriate.
Let us now examine the η₀ dependence of the success rate of the optimization. We prepare 100 different initial configurations, where the positions of the atoms are chosen uniformly at random in the supercell. We stop the COM process when the crystallinity cost 1 − C becomes smaller than 0.005, or when the number of iterations exceeds a threshold (5000 for η₀ = 1 and 0.1, and 50000 for η₀ = 0.01). After the optimization by the COM, we additionally perform gradient descent on the potential energy for 200 steps. The success rate of the optimization strongly depends on the initial learning rate. For η₀ = 1, no samples converge to the correct Si diamond structure. For η₀ = 0.1 and 0.01, about 90 samples among the 100 trials reach the correct structure. We point out that although η₀ = 0.1 and 0.01 exhibit similar success rates, the former converges in a smaller number of iterations than the latter. Thus, we conclude that η₀ = 0.1 is the optimal initial learning rate.
In the present Si diamond case with η₀ = 0.1, since the generalized force (equation 3) is normalized to unity and there are 3 × 64 = 192 coordinate variables for the 64 Si atoms, the average displacement of each component of x per iteration step is approximately η₀/√192 ≈ 0.007 in units of the supercell size. This average displacement is sufficiently small compared with the Si–Si distance in the diamond structure, 2.35 Å. With η₀ = 1, on the other hand, the average displacement (ten times larger) is of the same order as the Si–Si bond length, and thus too large to capture the structure of the potential energy landscape. This rough estimate of the appropriate value of η₀ is consistent with the results of the numerical tests presented above.
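The estimate above amounts to the following back-of-the-envelope arithmetic (displacements expressed in units of the supercell size):

```python
import math

d = 3 * 64                      # coordinate variables for 64 Si atoms
typ_F = 1.0 / math.sqrt(d)      # typical component of F when |F| = 1

for eta0 in (0.1, 1.0):
    step = eta0 * typ_F         # average displacement per component
    print(eta0, round(step, 4)) # per iteration, in supercell units
```

With η₀ = 0.1 the per-component step is below one percent of the supercell size, well under the Si–Si bond length of 2.35 Å; with η₀ = 1 it is an order of magnitude larger.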
SiO₂ polymorphs
Next, we show the results for low quartz and low cristobalite. We consider supercells containing 36 and 48 atoms for low quartz and low cristobalite, respectively. The lattice parameters are fixed to the same values as those used in Ref. Tsujimoto et al. (2018). For the definition of the crystallinity, we use the seven and six peaks located between 2θ = 20° and 45° for low quartz and low cristobalite, respectively. In Fig. 4, we show the histograms of the potential energy after the optimization. The initial learning rate is set to the same value in both cases. The COM process is terminated when the crystallinity cost 1 − C becomes smaller than 0.005, or when the number of iterations exceeds 1000.
For comparison, in Fig. 4, we also present the previous results obtained by molecular dynamics with simulated annealing on the linear combination of the two cost functions, where the weight of the crystallinity term is set to its optimal value, expressed in temperature units through the number of atoms in the supercell and the Boltzmann constant k_B Tsujimoto et al. (2018). The integration time step of the molecular dynamics simulation is 1 fs, and the temperature is kept at 500 K for the first 4500 steps and then gradually quenched to 0 K over 700 steps.
As shown in Fig. 4, using the COM, 130 and 163 samples among 500 trials converge to the correct structure for low quartz and low cristobalite, respectively. These success rates are significantly higher than the previous simulated-annealing results presented in the upper half of each graph, especially for low cristobalite. It should be pointed out that the total number of iterations for the simulated annealing Tsujimoto et al. (2018) was 5200, more than five times that of the present simulation. For the low quartz case, there exists another large peak at a higher energy per SiO₂ unit. This structure is also a common local minimum, but it is an artifact of the fixed lattice parameters. Indeed, once we further optimize the cell size as a free parameter starting from this local-minimum structure, it converges rapidly to the correct low quartz structure, and the total success rate roughly doubles.
We have also applied the COM to coesite, another polymorph of SiO₂, but found a lower success rate than with simulated annealing. This failure might come from the fact that, in the case of coesite, there are a number of locally optimal structures common to the two cost functions. Although these structures have significantly higher energies than the correct coesite structure, the COM is easily trapped in these local minima, since they are common to both cost functions. In such a case, the COM would not be very useful for searching for the globally optimal structure, since the COM is essentially an extension of gradient descent.
Conclusion and Discussion
In this paper, we presented a new optimization method, the Combined Optimization Method (COM), for the joint optimization of multiple cost functions that share a common global minimum. In the COM, we introduced two essential ingredients, the generalized force and the memory (Algorithm 1); they enable the configuration to escape from local minima of each cost function and explore a wider search space, and as a result guarantee that the optimization stops only at the common minimum shared by all the cost functions.
We demonstrated the effectiveness of our method by applying it to crystal structure prediction in materials science. As cost functions, we used the theoretical potential energy of the crystal and the crystallinity, which characterizes the agreement between the theoretical and experimental X-ray diffraction patterns. The former is sensitive to the local structure between neighboring atoms, while the latter reflects the long-range periodicity of the crystal. As a result, they are expected to have qualitatively different distributions of local minima apart from the global minimum, which corresponds to the correct crystal structure. We applied our method to the Si diamond structure, low quartz, and low cristobalite, and observed that the correct target structures can be found with a significantly higher success rate than in the previous work.
In the simulations performed in the present paper, we fixed the cell parameters so that the potential energy has a global minimum exactly at each target crystal structure, and, in addition, we used the theoretically exact peak positions of each target structure in the definition of the crystallinity. Therefore, in the present examples, the global optima of the two cost functions are guaranteed to coincide strictly with each other. Even though this situation sounds too idealized, and in reality the optimal positions do not necessarily coincide strictly due to experimental and/or numerical errors or approximations, the COM should still find the target structure with a high success rate, with the help of some traditional local optimization method at the final stage, as long as the two minima lie in the same intersection of basins of attraction.
The COM has one tuning parameter, the initial learning rate η₀. If the initial value of η is too large, it does not decrease and strong fluctuations persist asymptotically. If η₀ is too small, on the other hand, more iterations are required to explore the whole search space. In the present implementation of the COM, the learning rate is halved when all the sign factors change their signs at the same time (step 6 in Algorithm 1). From our experience in the application to crystal structure prediction, this rule seems to be too strict: the reduction of η rarely happens during the optimization process in higher-dimensional spaces. This causes strong asymptotic fluctuations when η₀ is too large, as well as the need for (short) gradient-descent iterations at the final stage of the optimization. Although η has the same physical dimension as x, so that one can choose its initial value in a natural way by considering the physical properties of the target problem, a better choice of the initial value of η, as well as a more robust criterion for the reduction of the learning rate, is a subject for future studies.
Acknowledgements
This work was supported by the Elements Strategy Initiative to Form Core Research Center in Japan, Japan Society for the Promotion of Science KAKENHI Grant No. 17H02930, the "Materials Research by Information Integration" Initiative (MI2I) Project of the Support Program for Starting Up Innovation Hub from the Japan Science and Technology Agency (JST), and the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan as a social and scientific priority issue (Creation of new functional devices and high-performance materials to support next-generation industries; CDMSI) to be tackled by using the post-K computer. D.A. and N.T. are supported by the Japan Society for the Promotion of Science through the Program for Leading Graduate Schools (MERIT).
References
 Fletcher (1987) R. Fletcher, Practical Methods of Optimization, 2nd ed. (John Wiley & Sons, New York, 1987).
 Bianchi et al. (2009) L. Bianchi, M. Dorigo, L. M. Gambardella, and W. J. Gutjahr, Natural Computing 8, 239 (2009).
 Rastrigin (1963) L. Rastrigin, Automation and Remote Control 24, 1337 (1963).
 Kirkpatrick et al. (1983) S. Kirkpatrick, C. Gelatt Jr., and M. Vecchi, Science 220, 671 (1983).
 Holland (1975) J. Holland, Adaptation in Natural and Artificial Systems (Univ. of Michigan Press, 1975).

 Kennedy and Eberhart (1995) J. Kennedy and R. Eberhart, Proceedings of IEEE International Conference on Neural Networks IV, 1942 (1995).
 Shi and Eberhart (1998) Y. Shi and R. Eberhart, Proceedings of IEEE International Conference on Evolutionary Computation, 69 (1998).
 Hukushima and Nemoto (1996) K. Hukushima and K. Nemoto, J. Phys. Soc. Jpn. 65, 1604 (1996).
 Mockus (1974) J. Mockus, in Proceedings of the IFIP Technical Conference (Springer-Verlag, London, UK, 1974) pp. 400–404.
 Pickard and Needs (2011) C. J. Pickard and R. J. Needs, Journal of Physics: Condensed Matter 23, 053201 (2011).
 Chen et al. (2016) J. Chen, G. Schusteritsch, C. J. Pickard, C. G. Salzmann, and A. Michaelides, Phys. Rev. Lett. 116, 025501 (2016).
 Pannetier et al. (1990) J. Pannetier, J. Bassas-Alsina, J. Rodriguez-Carvajal, and V. Caignaert, Nature 346, 343 (1990).
 Doll et al. (2008) K. Doll, J. C. Schön, and M. Jansen, Phys. Rev. B 78, 144110 (2008).
 Abraham and Probert (2006) N. L. Abraham and M. I. J. Probert, Phys. Rev. B 73, 224104 (2006).
 Wu et al. (2014) S. Q. Wu, M. Ji, C. Z. Wang, M. C. Nguyen, X. Zhao, K. Umemoto, R. M. Wentzcovitch, and K. M. Ho, Journal of Physics: Condensed Matter 26, 035402 (2014).
 Supady et al. (2015) A. Supady, V. Blum, and C. Baldauf, Journal of Chemical Information and Modeling 55, 2338 (2015).
 Wang et al. (2010) Y. Wang, J. Lv, L. Zhu, and Y. Ma, Phys. Rev. B 82, 094116 (2010).
 Wang et al. (2015) Y. Wang, J. Lv, L. Zhu, S. Lu, K. Yin, Q. Li, H. Wang, L. Zhang, and Y. Ma, Journal of Physics: Condensed Matter 27, 203203 (2015).
 Muggleton et al. (1992) S. Muggleton, R. D. King, and M. J. Stenberg, Protein Engineering, Design and Selection 5, 647 (1992).
 Raccuglia et al. (2016) P. Raccuglia, K. Elbert, P. D. F. Adler, C. Falk, M. Wenny, A. Mollo, M. Zeller, S. A. Friedler, J. Schrier, and A. Norquist, Nature 533, 73 (2016).
 Lee et al. (2016) J. Lee, A. Seko, K. Shitara, K. Nakayama, and I. Tanaka, Phys. Rev. B 93, 115104 (2016).
 Glass et al. (2006) C. W. Glass, A. R. Oganov, and N. Hansen, Computer Physics Communications 175, 713 (2006).
 Dinnebier et al. (2009) R. E. Dinnebier, B. Hinrichsen, A. Lennie, and M. Jansen, Acta Crystallographica Section B 65, 1 (2009).
 Napolitano et al. (2014) E. Napolitano, F. Dolci, R. Campesi, C. Pistidda, M. Hoelzel, P. Moretto, and S. Enzo, International Journal of Hydrogen Energy 39, 868 (2014).
 Harris and Tremayne (1996) K. D. M. Harris and M. Tremayne, Chemistry of Materials 8, 2554 (1996).
 Keen and McGreevy (1990) D. A. Keen and R. L. McGreevy, Nature 344, 423 (1990).
 Tremayne et al. (1992) M. Tremayne, P. Lightfoot, M. Mehta, P. Bruce, K. Harris, K. Shankland, C. Gilmore, and G. Bricogne, Journal of Solid State Chemistry 100, 191 (1992).
 Tsujimoto et al. (2018) N. Tsujimoto, D. Adachi, R. Akashi, S. Todo, and S. Tsuneyuki, Phys. Rev. Materials 2, 053801 (2018).
 Pareto (1986) V. Pareto, (1986).
 Coello and Becerra (2003) C. A. C. Coello and R. L. Becerra, Swarm Intelligence Symposium, 2003. SIS ’03. Proceedings of the 2003 IEEE , 6 (2003).
 Tersoff (1989) J. Tersoff, Phys. Rev. B 39, 5566 (1989).
 Tsuneyuki et al. (1988) S. Tsuneyuki, M. Tsukada, H. Aoki, and Y. Matsui, Phys. Rev. Lett. 61, 869 (1988).
 Plimpton (1995) S. Plimpton, Journal of Computational Physics 117, 1 (1995).
 Larsen et al. (2017) A. H. Larsen, J. J. Mortensen, J. Blomqvist, I. E. Castelli, R. Christensen, M. Dułak, J. Friis, M. N. Groves, B. Hammer, C. Hargus, E. D. Hermes, P. C. Jennings, P. B. Jensen, J. Kermode, J. R. Kitchin, E. L. Kolsbjerg, J. Kubal, K. Kaasbjerg, S. Lysgaard, J. B. Maronsson, T. Maxson, T. Olsen, L. Pastewka, A. Peterson, C. Rostgaard, J. Schiøtz, O. Schütt, M. Strange, K. S. Thygesen, T. Vegge, L. Vilhelmsen, M. Walter, Z. Zeng, and K. W. Jacobsen, Journal of Physics: Condensed Matter 29, 273002 (2017).
 Prince (2006) E. Prince, ed., International Tables for Crystallography, 1st ed., Vol. C, Mathematical, Physical and Chemical Tables (2006).
 Putz et al. (1999) H. Putz, J. C. Schön, and M. Jansen, Journal of Applied Crystallography 32, 864 (1999).