Modern penetration of a target's air defense system relies heavily on coordinated attacks by multiple missiles, since the rapid development of detection technologies and close-in weapon systems (CIWS) has reduced the chance of a successful impact by a single conventional missile [jeon2010homing]. In addition to increasing the difficulty of interception, the cooperative guidance strategy of multiple missiles is also crucial to the lethal effect of the final impact. Usually, cooperative guidance of multiple missiles takes place in the terminal guidance phase, where accurate target information can be obtained with active radar systems or other detection devices. Existing cooperative guidance laws can be roughly divided into two categories. One is the analytical approach, which seeks closed-form solutions and is mainly based on sliding mode control, optimal control, and multi-agent consensus theory. The other is the intelligent approach, which generally adopts heuristic optimization algorithms and reinforcement learning (RL) theory.
Analytical cooperative guidance methods have proven robust and efficient in practical applications [ma2013guidance, xiong2018hyperbolic, liang2020range, he2021computational, huang2019deep, ratnoo2008impact]. Based on fundamental proportional navigation (PN), Jeon et al. developed cooperative proportional navigation (CPN), in which the on-board time-to-go of the missile is used as the navigation gain [jeon2010homing]. It is a simple but effective approach for achieving time consensus. Ma developed a composite guidance law that can be decomposed into a component along the line of sight (LOS) and a component perpendicular to the LOS [ma2013guidance], corresponding to time cooperation and space cooperation, respectively. Furthermore, time-cooperative control is achieved under the guidance of a virtual leader in [chen2016impact], where an undirected topology is adopted to establish the communication relationships. Based on the optimal control approach, a variant of the hyperbolic tangent function is proposed in [xiong2018hyperbolic] to force early control of velocity and impact angle.
However, with the increasing demand for high-precision weapon systems, intelligent cooperative guidance is increasingly regarded as a necessary auxiliary option. In recent years, reinforcement learning has attracted much attention because of its ability to learn online from environmental feedback [gaudet2020reinforcement, liang2019learning, hu2021application, kong2020maneuver]. According to their training structures, existing reinforcement learning algorithms for multi-agent systems can be roughly divided into four types: fully decentralized training with decentralized execution; fully centralized training with decentralized execution; centralized training with centralized execution; and value decomposition methods. Some of these algorithms have achieved satisfactory results on problems with low complexity and accuracy requirements. In [liang2020range], [he2021computational] and [Liangchen2021Metalearning], state-of-the-art reinforcement learning frameworks have demonstrated their effectiveness in the guidance task. Zhang et al. proposed a gradient-descent-based reinforcement learning method in the actor-critic framework and achieved consensus control for multi-agent systems by following a tracking leader [zhang2016data]. However, the two challenges of non-stationarity and partial observability [nguyen2020deep] can lead to saturated outputs or loss of coordination in multi-agent systems, which greatly reduces the accuracy of the value function. In addition, value-function-based reinforcement learning is poorly suited to continuous control tasks with large search spaces. These limitations impede the development of reinforcement learning in cooperative guidance.
A promising way to address these problems is to remove the value function of reinforcement learning and optimize directly in the solution space with an evolutionary strategy (ES), which is more robust and invariant to real-time rewards because it optimizes the objective function directly [brockhoff2010mirrored]. Moreover, as described in [salimans2017evolution], ES is tolerant of long horizons and implicit solutions, which matches the needs of cooperative guidance. The natural evolutionary strategy (NES) is a recent branch of ES and performs well on high-dimensional, continuous, multimodal optimization problems by using natural gradient information estimated from the expected fitness of the population [brockhoff2010mirrored, wierstra2014natural, salimans2017evolution]. Similar algorithms, known as co-evolutionary algorithms, have been discussed in [xu2017environment] and [qu2013improved]; they focus on solving multi-objective optimization problems by dividing the overall objective into sub-objectives that are optimized and evaluated together. Another idea is to evolve multiple populations toward the same goal and manually regulate the constraints of each population for faster convergence or fuller exploration [qu2013improved, yamanLimitedEvaluationCooperative2018]. In [qu2013improved], the concept of co-evolution refers to multiple threads of the training process. Note that these methods do not use natural gradient information as NES does, and they do not consider the non-stationarity issue discussed above.
When optimizing in a continuous parameter (solution) space, adaptive techniques are very important. Nomura presented a learning rate adaptation method based on the quality of gradients, which is often not easy to estimate [nomura2022towards]. Instead, Fukushima leveraged the shifting distance of parameters to adapt the learning rate [fukushima2011proposal]. In [wang2022instance], the population size was adjusted according to a novelty metric and a quantity metric, which reflect the complexity of the dynamic environment. The estimation of distribution algorithm (EDA) was applied to continuous control by searching for the optimal parameter distribution [larranaga2001estimation, karshenas2013regularized]. A variety of evolutionary methods with random walk strategies were investigated to solve the optimal missile guidance handover problem [wang2019gaussian]. Maheswaranathan proposed a surrogate gradient to reduce evaluation costs [maheswaranathan2019guided]. These works reveal the enormous potential of searching in parameter space rather than directly in action space.
Therefore, an NES-based co-evolutionary algorithm, named the natural co-evolutionary strategy (NCES), is developed in this paper to alleviate the dilemma faced by RL in the cooperative guidance task. Considering the advantages of searching in parameter space, the co-evolutionary algorithm is improved in this work by rescaling the gradient information to reduce the estimation bias introduced by neighboring populations. As described in [del2019bio], most of today's bio-inspired algorithm innovations are based on experimental observation rather than meticulous theoretical support. In this work, by contrast, we examine the underlying optimization problem and provide as sound a justification as possible through graphs and derivation. By integrating the NCES algorithm, a hybrid co-evolutionary cooperative guidance law (HCCGL) is further developed to solve the challenging missile guidance problem. Extensive empirical results on various engagement scenarios verify the effectiveness of the proposed guidance law. The main contributions of this work are summarized as follows:
To address the issues of non-stationarity and continuous control faced by cooperative guidance, an NCES algorithm is formulated and incorporated into a novel guidance law as an alternative to RL in the cooperative guidance task.
The rigorous constraints of time and space consensus in cooperative guidance are integrated into the fitness function of each missile. An MLP-based policy network is constructed and trained to optimize the fitness function.
The proposed HCCGL achieves high precision in cooperative guidance tasks, even with a dynamic target and random initial conditions.
The rest of the paper is organized as follows. The problem formulation is elaborated in Section II, and the proposed cooperative guidance law is discussed in Section III. In Section V, experiments under various configurations are presented. Finally, conclusions are drawn in Section VI.
II Problem Formulation
II-A Engagement geometry
The two-dimensional engagement geometry between multiple missiles and one target is shown in fig:engagement1, where the inertial coordinate frame represents the horizontal plane. There are missiles in total. The index denotes the missile, and represents the target. , and represent the velocity, line of sight (LOS) angle, flight-path angle, and heading angle of the missile, respectively. and represent the lateral acceleration and the thrust acceleration to be designed for the missile, which are perpendicular to and align with the direction of , respectively. , and are the velocity, LOS angle, flight-path angle, and heading angle of the target, respectively. The lateral acceleration of target is denoted by .
The dynamic equations of the missile and the target are as follows:
where, represents the relative range between the missile and the target. The time-to-go of the missile refers to the time left from current time until the interception:
II-B Communication Topology
The communication relationship of the multiple missiles is depicted by a topology in which a set of nodes represents the missiles. The communication links are represented by a set of edges with an adjacency matrix , where if missile is able to communicate directly with missile , otherwise . is the set of neighboring missiles of the missile. In practical engineering, the communication topology is determined by jointly considering the communication cost and the actual demand. In this work, the undirected topology shown in fig:topologies is adopted, where neighboring missiles can share information with each other.
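As a small illustration of the adjacency-matrix description above, the following Python sketch builds a symmetric 0/1 matrix for an undirected topology and reads off each missile's neighbor set. The four-missile chain used here is a hypothetical example, not the topology of fig:topologies, and the function names are illustrative only.

```python
def undirected_adjacency(n, edges):
    """Return an n x n symmetric 0/1 adjacency matrix for an undirected graph."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        if i == j:
            raise ValueError("self-loops are not used in this sketch")
        a[i][j] = 1
        a[j][i] = 1  # undirected: communication goes both ways
    return a


def neighbors(a, i):
    """Indices of the missiles that missile i can communicate with directly."""
    return [j for j, linked in enumerate(a[i]) if linked]


# Hypothetical chain topology for four missiles: 0-1, 1-2, 2-3.
A = undirected_adjacency(4, [(0, 1), (1, 2), (2, 3)])
for i in range(4):
    print(i, neighbors(A, i))
```

Because the matrix is symmetric, a missile that can receive information from a neighbor can also send to it, which is the property the undirected topology relies on.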
For the multi-missile system, the complete observation information of the entire system is not available to each agent. Thus, the cooperative guidance problem is a partially observable Markov decision process (POMDP) described by
where, and represent the observation and action of the missile. is the observation of the missile at next time step.
The full state information of each missile consists of three components: personal features, target features, and error features, as shown in Table I. and represent the positions of the missile and the target in two-dimensional coordinates. The target features are estimated or detected through onboard equipment, and the estimation error is assumed to be negligible compared with the required guidance precision. Acquiring accurate location information would require the support of powerful global positioning systems, whereas here only relative error information is needed. is the consensus error of time of the missile :
The consensus error of LOS angle of the missile is defined as:
where, is the LOS angle error of the missile , and is the desired impact angle of missile :
where, is the desired relative impact angle between two missiles, and is the nominal desired impact angle of the first missile, which is determined online. To increase the flexibility and autonomy of the intelligent missile system, the desired can be adjusted adaptively instead of being fixed.
II-D Fitness evaluation
The reward of each missile at one evaluation step consists of terminal reward and flight reward. The objective of the cooperative guidance task is to minimize the error , , and . Then, the terminal reward is defined as:
where, , , , are constant coefficients. is the step function defined as
Thus, the terminal reward only reflects the results at the terminal step, and if and only if and . The flight reward is defined as:
where, , , and are positive constant coefficients. It can be inferred that is always true. if and only if and . Then, the fitness function of missile for the cooperative guidance task can be defined as
Thus, the objective of the cooperative guidance task can be achieved by maximizing the fitness function for each missile.
II-E Design of the cooperative guidance law
Based on the requirements of the cooperative guidance task, the guidance law proposed in this paper includes two parts: a tracking control part and a consensus control part. The tracking control part is obtained by proportional navigation guidance (PNG):
where, is the navigation constant. Note that the tracking control part only designs the lateral acceleration.
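For readers who want a concrete reference point for the tracking part, the sketch below uses the textbook pure-PN form, in which the lateral acceleration equals the navigation constant times the missile speed times the LOS rate. This is an assumption: the exact PNG expression used in this paper is not reproduced in the text above, and the function and argument names are hypothetical.

```python
import math


def los_rate(rel_pos, rel_vel):
    """LOS angular rate (rad/s) from the target-minus-missile relative
    position (m) and relative velocity (m/s) in the horizontal plane."""
    x, y = rel_pos
    vx, vy = rel_vel
    r2 = x * x + y * y
    if r2 == 0.0:
        raise ValueError("LOS rate is undefined at zero range")
    # Time derivative of atan2(y, x).
    return (x * vy - y * vx) / r2


def png_lateral_acceleration(nav_const, speed, lam_dot):
    """Textbook pure PN: a = N * V * d(lambda)/dt (assumed form)."""
    return nav_const * speed * lam_dot


# Hypothetical geometry: target 1000 m ahead, drifting sideways at 10 m/s.
lam_dot = los_rate((1000.0, 0.0), (0.0, 10.0))
a_lat = png_lateral_acceleration(3.0, 300.0, lam_dot)
print(lam_dot, a_lat)
```

When the relative velocity points along the LOS, the LOS rate is zero and this form commands no lateral acceleration, which is consistent with the tracking part only steering the missile and leaving the thrust channel untouched.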
The consensus control part is modeled by a neural network expressed as
where , , and denote the weight matrices of the output layer. and are the outputs of the first and second hidden layers. and are the number of neurons in each layer. is the bounded activation function with , and is the common activation function . The input state is chosen as:
Thus, the guidance law of the missile is presented as:
where, is the guidance gain trading off the tracking control part and the consensus control part.
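The structure described above (a two-hidden-layer network with a bounded output, added to the PNG command through a guidance gain) can be sketched as follows. Several details are assumptions made only for illustration: ReLU is used as the hidden activation, a scaled tanh as the bounded output activation, no bias terms are used, and the commands are combined as tracking plus gain times consensus. The paper's own equations may differ on each of these points.

```python
import math


def dense(weights, x):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(v):
    return [max(0.0, a) for a in v]


def consensus_controller(obs, w1, w2, w_out, a_max):
    """Two hidden layers, then a bounded output in [-a_max, a_max].
    Returns [lateral, thrust] consensus accelerations (assumed ordering)."""
    h1 = relu(dense(w1, obs))
    h2 = relu(dense(w2, h1))
    return [a_max * math.tanh(z) for z in dense(w_out, h2)]


def composed_command(a_png, consensus, gain):
    """Assumed combination: PNG acts on the lateral channel only, and the
    consensus output is weighted by the guidance gain on both channels."""
    lateral = a_png + gain * consensus[0]
    thrust = gain * consensus[1]
    return lateral, thrust


# Hypothetical 2-input, 2-neuron network with hand-picked weights.
w1 = [[0.5, 0.0], [0.0, 0.5]]
w2 = [[1.0, -1.0], [0.5, 0.5]]
w_out = [[0.2, 0.1], [-0.1, 0.3]]
u = consensus_controller([1.0, 2.0], w1, w2, w_out, a_max=50.0)
print(composed_command(9.0, u, gain=0.5))
```

In a co-evolutionary setting, the entries of `w1`, `w2`, and `w_out` would be the parameters perturbed and updated for each missile.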
III Natural Co-evolutionary Strategy
III-A Natural evolutionary Strategy in multi-agent POMDP
In the evolutionary strategy, each individual agent (or its policy) is represented by a population, and the group of populations together with the environment constitutes the ecosystem. The objective is to develop the optimal strategy for the group of populations so as to maximize the fitness of the ecosystem. For cooperative tasks, the optimal strategy of the ecosystem is exactly the optimal policy for each population.
where, which is defined in eqn:composedcommand, represents the policy of the th population and is the joint matrix of individual optimal policies. is its corresponding fitness function and is the joint policy fitness function; more details can be found in [sonQTRANLearningFactorize2019]. However, the converse is not true:
This is because the optimal fitness obtained by one population may rest on the suboptimal fitness obtained by other populations. When the other populations evolve, the previous optimum is easily broken. To overcome this non-stationarity issue, it is best for all populations to evolve simultaneously, that is, to co-evolve. Each generation updates its parameters at the same time instead of sequentially, which leads to slight variance in fitness values.
III-B Optimization in co-evolutionary parameter space
The gradient information is obtained by measuring the contribution of each sample. The parameters of the population are defined as , and represents that of the next generation. is the distribution function of under , where is the intrinsic parameter. Then the expected fitness of the next generation is expressed as:
The derivative of Eq. (17) with respect to is
If we represent as , then we have the similar equation
In an ecosystem with multiple populations, populations will interact and affect the evolutionary process. Thus, the fitness function of the th population is represented by , where represents the parameter set of the th population and its neighboring populations. The expected joint fitness of the next generation is expressed as:
is the joint probability distribution of the next generation over. Assuming that and are sampled independently, we have .
The gradient of the joint fitness with respect to is expressed as
Note that it has the same form as the single-population version, so it may seem fine to keep the original equation: the influence of is counteracted through the calculation of its expectation. However, the expectation over the joint distribution is approximated by sampling with a limited size. Although individuals are sampled without bias (unbiased estimation), there is an intrinsic bias due to inadequate sampling, and this bias grows linearly with the dimensionality of the distribution. This can be a serious issue when the expectation is taken over all neighboring parameters while the sample size stays relatively small.
In fact, it is not necessary to take account of all parameters, since only the expectation of is actually needed. To alleviate the incremental bias, we propose to approximate only the expectation of the parameter of the current population and ignore its neighbor parameters, which is
Though is available for independent distribution, it is infeasible to obtain , since all agents are sampled and evaluated together. However the expectation of individual fitness can be approximated by the multiplication between the original fitness and its confidence. The rectified expectation is expressed as
where, is the confidence, and represent the samples that appear along with . In this way, the bias of estimating the expectation of the neighboring distributions is addressed. The gradient after modification is
The core idea is that although the individual fitness does not exist, the expectation of the individual fitness does, and is invariant to the parameter distributions of its neighboring agents, so the expectation of the individual agent’s fitness should be calculated instead of including the expectation of neighboring agents. Let’s denote the expectation of the objective function over , which is , by and the expectation of the objective function over by , such that
To visualize the sampling estimation process, we use a variant of eggholder as the objective function for demonstration, which is defined in eqn:eggholder, since the real objective is too expensive to obtain.
Assume there exists one neighboring population for with a size of 400. The sampled individuals, which follow a bivariate normal distribution, are shown in fig:estgrad3d, and the parameter spaces are confined to. Since and are sampled independently, the individuals can be considered to be sampled from only, which is represented by the sample points in fig:estgrad2d. In this objective graph with a single-dimensional parameter space, the real objective curve, drawn as a solid line, is obtained by eqn:estoversingleparameter. To standardize the scope, all the sampled data, including the real objective values, are uniformly scaled to the range [0,1]; this standardization does not affect the direction of the estimated gradient.
The original objective value for each sample varies as the corresponding changes, which introduces additional estimation bias. As shown by the blue dots in fig:estgrad2d, the distribution of the objective values before rescaling is significantly different from the distribution of the true objective values. From fig:estgrad3d, it can be seen that as sample points deviate from the distribution center, their probability of being sampled decreases, which means that the accuracy or confidence of the fitness of each sample decreases with the decrease of . If the original objective is rescaled by its confidence , which is the probability of the appearance of the given the existence of , the reconstructed objective values, represented by the green square dots in fig:estgrad2d, are closer to the real , which reduces the estimation bias.
The above analysis indicates that, with a limited population size and a large number of neighboring populations, applying the rescaled gradient keeps the approximation bias at the level of a single population, resulting in a more accurate estimate of the gradient information. Empirical results also support this conclusion. However, when the population size is large enough (e.g., thousands), this approach may not yield additional accuracy improvements.
The modified expression is also well suited to parallel computing, since only the perturbations of the neighboring populations are needed. These can easily be obtained through communication among processes, and the probabilities can be calculated in a distributed manner.
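A minimal Python sketch of the rescaled estimate is given below: each sample's fitness is multiplied by the probability of the neighbor perturbations that accompanied it, and the result is averaged with the factor 1/(mσ²). Two points are assumptions rather than statements from the paper: the confidence p is taken to be the isotropic Gaussian density of each neighbor's perturbation, and the perturbations are raw parameter offsets with standard deviation σ. All names are hypothetical.

```python
import math


def gaussian_density(eps, sigma):
    """Isotropic Gaussian density of a perturbation vector (assumed form of p)."""
    d = len(eps)
    sq = sum(e * e for e in eps)
    norm = (2.0 * math.pi * sigma * sigma) ** (-0.5 * d)
    return norm * math.exp(-0.5 * sq / (sigma * sigma))


def rescaled_gradient(fitness, own_eps, neighbor_eps, sigma):
    """Rescaled gradient estimate for one population.

    fitness[k]      : fitness of this population in sample k
    own_eps[k]      : this population's perturbation vector in sample k
    neighbor_eps[k] : list of the neighbors' perturbation vectors in sample k
    """
    m = len(fitness)
    dim = len(own_eps[0])
    grad = [0.0] * dim
    for f, eps, nbrs in zip(fitness, own_eps, neighbor_eps):
        confidence = 1.0
        for e_c in nbrs:
            confidence *= gaussian_density(e_c, sigma)
        for d in range(dim):
            grad[d] += f * confidence * eps[d]
    scale = 1.0 / (m * sigma * sigma)
    return [scale * g for g in grad]


# Two mirrored samples, one neighbor; sigma = 0.2 as in the hyper-parameter table.
g = rescaled_gradient(
    fitness=[1.0, -1.0],
    own_eps=[[0.2], [-0.2]],
    neighbor_eps=[[[0.0]], [[0.3]]],
    sigma=0.2,
)
print(g)
```

With an empty neighbor list the confidence is 1 and the function reduces to the ordinary single-population estimate, which matches the statement that the rescaled form keeps the bias at the single-population level.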
III-C Elitist adaptation Techniques
The performance of NES is sensitive to hyper-parameters, and the learning rate is usually the most critical one. Thus, an elitist adaptation method for the learning rate is applied in this paper. First, a list of learning rates is linearly selected in the neighborhood of the original learning rate as:
where, . The and are the minimum and maximum value of . is the size of perturbations which is clipped by . To evaluate the quality of the candidate learning rates, the evaluation function is defined:
where is the th sampled learning rate of the candidate list. The gradient is kept after evaluation. Therefore, by comparing the candidate learning rates with the original one, the next update can be better than the previous one. Considering peer pressure, each missile is assigned the same learning rate. The learning rate of the next generation is obtained by
A similar approach is employed to obtain the optimal during the training process.
where, is uniformly sampled from the region . is the fitness function of sampled LOS angle that is defined as
where is the joint initial individual parameters. In this way, the desired impact angles are established automatically.
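The elitist learning-rate selection described earlier in this subsection can be sketched as follows. This is a simplified, single-parameter-vector illustration: candidates are spaced linearly around the current learning rate, clipped to its allowed range, scored by a caller-supplied evaluation function with the gradient held fixed, and the current rate is kept unless a candidate scores strictly higher. The exact evaluation function of the paper is not reproduced above, so `evaluate` and the other names here are hypothetical.

```python
def candidate_learning_rates(lr, delta, lr_min, lr_max, count):
    """Linearly spaced candidates in [lr - delta, lr + delta], clipped to bounds."""
    lo = max(lr_min, lr - delta)
    hi = min(lr_max, lr + delta)
    if count < 2:
        return [min(hi, max(lo, lr))]
    step = (hi - lo) / (count - 1)
    return [min(hi, lo + k * step) for k in range(count)]


def elitist_learning_rate(theta, grad, lr, evaluate, delta, lr_min, lr_max, count):
    """Keep the current learning rate unless a candidate gives higher fitness.
    The gradient is reused for every candidate, as in the description above."""
    best_lr = lr
    best_fit = evaluate([t + lr * g for t, g in zip(theta, grad)])
    for cand in candidate_learning_rates(lr, delta, lr_min, lr_max, count):
        fit = evaluate([t + cand * g for t, g in zip(theta, grad)])
        if fit > best_fit:
            best_lr, best_fit = cand, fit
    return best_lr


# Toy fitness with its maximum at theta = 1 (hypothetical, for illustration).
def toy_fitness(theta):
    return -(theta[0] - 1.0) ** 2


print(elitist_learning_rate([0.0], [1.0], 0.5, toy_fitness, 0.5, 0.0, 2.0, 11))
```

Each candidate costs one extra fitness evaluation, so the size of the candidate list trades adaptation quality against simulation time.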
A rank-based fitness shaping method in the same spirit as the one proposed in [wierstra2014natural] is employed to shape the raw fitness. For simplicity of notation, we still let denote the fitness function after shaping. Another technique, called mirrored sampling [brockhoff2010mirrored], is also applied when sampling parameter perturbations.
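The two techniques named above can be sketched in a few lines of Python. The utility formula below is the rank-based shaping from [wierstra2014natural]; since the paper only states that its shaping is "in the same spirit", using that exact formula here is an assumption. Mirrored sampling is shown in its basic form, where every perturbation is paired with its negation. Function names are hypothetical.

```python
import math
import random


def mirrored_perturbations(half_size, dim, sigma, rng):
    """Return 2 * half_size perturbations arranged as (eps, -eps) pairs."""
    eps = []
    for _ in range(half_size):
        e = [rng.gauss(0.0, sigma) for _ in range(dim)]
        eps.append(e)
        eps.append([-x for x in e])
    return eps


def rank_utilities(fitness):
    """Replace raw fitness values by rank-based utilities that sum to zero.
    The best sample receives the largest utility."""
    n = len(fitness)
    order = sorted(range(n), key=lambda k: fitness[k], reverse=True)
    raw = [max(0.0, math.log(n / 2.0 + 1.0) - math.log(rank))
           for rank in range(1, n + 1)]
    total = sum(raw)
    util = [0.0] * n
    for position, k in enumerate(order):
        util[k] = raw[position] / total - 1.0 / n
    return util


rng = random.Random(0)
perturbations = mirrored_perturbations(half_size=3, dim=2, sigma=0.2, rng=rng)
utilities = rank_utilities([3.0, 1.0, 2.0, 0.0])
print(len(perturbations), utilities)
```

Shaping makes the update invariant to the scale of the raw fitness, and mirrored pairs cancel the even-order terms of the fitness in the gradient estimate, which tends to reduce its variance.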
IV Hybrid co-evolutionary cooperative guidance algorithm
To achieve coordinated attack, the natural co-evolutionary strategy is applied to optimize the parameter matrices of the neural network controller.
$$= \frac{1}{m\sigma^{2}} \sum_{i=1}^{m} F_i(\varsigma_i')\,\epsilon_i \prod_{c \in N_i} p(\epsilon_c).$$
The complete implementation of the proposed guidance law is shown in Algorithm 1. The conceptual diagram in fig:HCCGL illustrates the parallel simulation process. A master-slave (or fully distributed) model [gong2015distributed][mendiburu2005parallel] is used for the large-scale parallel computation. In this case, each population is evaluated in a separate process, and the results of the ecosystem are aggregated to calculate the rescaled gradient eqn:thisgradient, which is then sent back to produce the guided generations.
V Simulations and analysis
To verify the validity of the proposed method, a variety of simulations based on the cooperative guidance framework are designed. Cases with both stationary and maneuvering targets are simulated. Further, comparison experiments are performed to demonstrate the advantages of the proposed guidance method.
V-A Parameter setup
The acceleration constraint and velocity constraint of the missiles are listed in tab:ExpSetup. The hyper-parameters of the algorithm are listed in tab:HyperParameters.
Frame skipping has been extensively employed in continuous control problems [salimans2017evolution]. In this work, the frame-skip parameter is set to 12 for case 1 and case 2, and to 40 for case 3. Appropriate adjustment of this parameter facilitates the training process without affecting the final results.
| Parameter | Value |
|---|---|
| Maximum lateral overload (g) | 50 |
| Maximum thrust overload (g) | 5 |
| Upper bound of velocity (m/s) | 900 |
| Lower bound of velocity (m/s) | 350 |
| Simulation step (ms) | 5 |
| Initial learning rate | 0.015 |
| Standard deviation for sampling population | 0.2 |
| Size of learning rate adaptation | 20 |
| Size of population, m | 140 |
V-B Case 1: Comparison Experiments
In this section, the proposed guidance law is compared with the time and space cooperative guidance law (TASCGL) proposed in [lyu2019multiple], which considers space and time cooperative guidance under a distributed communication topology. However, unlike the method proposed in this work, the compared method is sensitive to the initial conditions. The initial conditions of the compared work are adopted here, as shown in tab:InitialConditionofCase1. Four missiles are engaged in the cooperative scenario with different desired relative impact angles as , and , for each respectively. The target is located at (9500, 9000) m.
fig:trajectoryCase1 shows the trajectories under the two guidance laws. As depicted in the figure, the trajectories of TASCGL are twisted in the initial stage, as the missiles try to reach consensus on their LOS angles and velocities. In comparison, the trajectories of the proposed HCCGL show better damping, without sharp turns or oscillation.
It can be seen from tab:ResultForCase1 that both guidance laws achieve competitive final accuracy in terms of the zero-effort miss (ZEM) and the consensus angle error. However, the consensus time error of TASCGL reaches up to 5 seconds, compared with less than 0.1 seconds for the proposed method. Further analysis of the velocity curves shows that, under TASCGL, the velocities are prevented from reaching their ideal values by the velocity bounds, which are not considered in its design, leading to desynchronized impact times. The profiles of the two methods are shown in fig:consensusAngleError and fig:timeToGo; it can be observed that the flight times of all missiles under HCCGL tend to be identical. For HCCGL, the decomposition of the acceleration commands is shown in fig:decoCase1. The left figure shows the decomposition of the lateral accelerations, in which the solid lines represent the commands from the tracking controller and the dashed lines represent the commands from the consensus controller before weighting. Since the tracking part is derived from proportional navigation, the vertical acceleration shown in the right figure is derived entirely from the consensus controller. The two parts of the acceleration have similar trends but do not coincide, demonstrating the effectiveness of the consensus controller, which is trained with the improved co-evolutionary strategy.
The results reveal that the proposed guidance law outperforms the compared method, with higher consensus precision and smoother trajectories. Moreover, whereas the traditional guidance law is usually constrained by boundary conditions and relies on superb missile maneuverability, the proposed guidance law is more resilient under limited conditions and more aware of the time-varying states of the cooperating missiles.
V-C Case 2: Non-stationary target
In this part, an engagement scenario with a non-stationary target is designed and simulated to verify the effectiveness of the proposed method against an unknown dynamic target. The target maneuvers with lateral acceleration with its velocity fixed at , and its initial flight-path angle . Other initial conditions are the same as in case 1. The simulated trajectories and results are shown in fig:trajCase2 and tab:ResultForCase2.
From tab:ResultForCase2 we can see that the consensus angle error is within one degree, which satisfies the accuracy requirement, and that a salvo attack is achieved with negligible consensus time error. The result demonstrates the effectiveness of the proposed guidance method in intercepting a dynamic target. To the best of the authors' knowledge, this is the first time cooperative guidance against a non-stationary target has been achieved with intelligent control, which demonstrates its robustness against disturbances from non-stationary targets.
V-D Case 3: Monte-Carlo simulation
Monte-Carlo simulation has been extensively employed to examine the robustness of an algorithm under varying initial conditions, and it is therefore applied in this section. In the existing literature, the target is usually regarded as stationary, since intercepting a stationary target excludes most unpredictable disturbances. In this case, five missiles are engaged, and each missile's position is randomly sampled from a uniform distribution, which is denoted by. Specifically, for the missile, the x-coordinate of its position is and the y-coordinate is , which arranges the missiles in an orderly manner. The initial flight-path angles of all missiles are set to , with identical velocities of 600 m/s and the same desired relative impact angles of . Additionally, the target's position is (10000 m, 9000 m).
Simulations with randomly sampled conditions are conducted for 200 episodes. The diverse trajectories are depicted in fig:montecarloCase3, and the statistical results after taking absolute values are shown in tab:ResultForCase3. From the results, we can see that the mean errors of impact angles are within , and the consensus error of impact time holds within most of the time. The results indicate that, for initial states within the sampled range, the proposed scheme consistently finds a near-optimal solution.
V-E Optimization process analysis
fig:meanfitsAll shows the learning curves in the three cases. In case 1, the mean fitness values keep rising and merge in the final phase. In the curves of case 2, two of the missiles get ahead by about 1000 points but eventually fall back to meet the other missiles. A similar phenomenon appears in case 3. It can be inferred that the policies automatically evolve toward an equilibrium state; one reason is that the rescaled gradient prevents an ever-increasing gap between individual groups, which is crucial for mutual improvement. If one group gets too far ahead, the other groups may never catch up because of their interdependence; that is, the improvement of the poorer-performing group is blocked when it would cause a more significant drop in the better-performing ones. fig:lrAll presents the adaptation profiles of the learning rates under the aforementioned technique. For case 1 and case 2, the learning rates start from high values and gradually converge to the minimum value, which corresponds to the quality of the estimated gradients. However, due to the random initial conditions in case 3, the learning rates do not settle easily. Extensive empirical results show that, without learning rate adaptation, the fitness profiles jitter at the end instead of converging to satisfactory ranges (regardless of the type of optimizer). This behavior is common when training neural networks and, according to related research, may be caused by overfitting. Employing the simple adaptation technique helps alleviate this deficiency.
VI Conclusion
In this paper, an improved co-evolutionary strategy, NCES, has been developed to address the non-stationarity issue in multi-agent dynamic environments. The hybrid co-evolutionary cooperative guidance law (HCCGL) has been proposed by integrating the improved strategy, with a neural network used to construct the consensus controller. To demonstrate its effectiveness in synchronizing impact times and angles, three experiments under different conditions have been carried out. The experiment on a maneuvering target shows that the method is effective with satisfactory precision. The proposed method is shown to be robust and scales well to cooperative guidance problems for multi-agent systems. To the best of the authors' knowledge, this is the first time an intelligent cooperative guidance law has been applied to intercept a non-stationary target with time and angle constraints.
The proposed algorithm combines traditional control theory with intelligent algorithms, revealing the enormous potential of this field. It is always meaningful to explore the limits of modern control tasks. Despite the satisfactory results obtained, this work still leaves room for improvement. Future work may include exploring the effectiveness of an incremental guidance gain, or control strategies that handle actuation failure and system uncertainty.