I Introduction
Autonomous driving has become a research hotspot because it can enhance road safety, ease road congestion, and free human drivers. Decision-making is the core component of high-level autonomous driving. Although rule-based methods have been widely used to realize decision-making, manually encoding rules is not always feasible due to the highly dynamic and stochastic nature of driving scenarios [katrakazas2015rulebased, montemerlo2008junior]. The learning-based method is a promising technology for high-level autonomous driving: it directly learns a parameterized policy that maps state representations to actions from data, using supervised learning (SL) or reinforcement learning (RL) [sutton2018reinforcement]. Recent learning-based decision-making research tends to use multi-layer neural networks (NNs) to represent the policy due to their remarkable fitting and generalization capabilities [lecun2015deep]. According to the state representation method, learning-based methods can be divided into two categories: (1) end-to-end (E2E) decision-making, which directly maps raw sensor outputs to driving decisions, and (2) tensor-to-end (T2E) decision-making, which describes states using real-valued representations, such as velocity and position.
The E2E decision-making method has been widely investigated during the last two decades because it reduces the need for perception algorithms. In the late 1980s, Pomerleau built the first end-to-end autonomous driving system, called ALVINN, which took images consisting of binary values and a matrix from a laser range finder as inputs and output steering angles [pomerleau1989alvinn]. After training on 1200 labeled samples, the NAVLAB vehicle equipped with ALVINN could drive along a 400 m road without obstacles at a speed of 1 m/s. NVIDIA trained a convolutional driving policy network for autonomous highway driving, which describes states using images from a single front-facing camera paired with the steering angles [bojarski2016NVIDIA_e2e_1, bojarski2017NVIDIA_e2e_2]. In addition to SL methods, Lillicrap et al. (2016) employed an RL algorithm, called DDPG, to train a policy network for lane-keeping, which took simulated images as input and output acceleration and steering wheel angles on the TORCS simulation platform [lillicrap2015DDPG]. Many other related works on E2E decision-making for autonomous driving can be found in [jaritz2018end, lecun2004dave, chen2015deepdriving, wymann2000torcs, kendall2019DDPGdriving, perot2017end, wolf2017learning, liang2018cirl, chen2020interpretable]. Since there is a great difference between the sensor outputs of a simulated environment and those of an actual vehicle, a policy learned from simulated perception is difficult to apply to real vehicles, or is only applicable to simple driving tasks such as lane-keeping [pomerleau1989alvinn, kendall2019DDPGdriving]. In addition, the sensor outputs are sensitive to the configuration of vehicle sensors, which limits the generalization of E2E decision-making methods across different vehicles.
Compared with E2E decision-making, which takes raw sensor information as states, preliminary studies showed that real-valued representations perform better, because the reduced state space is easier to learn and the real values make it easier for the system to generalize [isele2018navigating]. Therefore, T2E decision-making has achieved great success in autonomous driving [mirchevska2018high, wang2017formulation, wang2018reinforcement, wang2019continuous]. Duan et al. (2020) represented driving states using a 26-dimensional vector, consisting of indicators of the ego vehicle, the road, and the nearest four vehicles, realizing smooth and safe decision-making on a simulated 2-lane highway via RL [duan2020hierarchical]. Guan et al. (2020) included a total of 16 variables from the ego vehicle and seven surrounding vehicles (position, speed, etc.) in the state representation to handle cooperative longitudinal decision-making in a virtual intersection [guan2020centralized]. The information of the different vehicles is sorted according to a predefined order to form the final state vector.
In summary, the T2E method needs to concatenate perception information of the ego vehicle, surrounding vehicles, and roads into a state vector and then perform policy learning based on the vectorized state space. Although T2E has shown its advantages in terms of policy performance and generalization ability to vehicles with different sensor systems, it suffers from two challenges: (1) the dimension sensitivity problem and (2) the permutation sensitivity problem. The former means that T2E can only consider a fixed number of surrounding vehicles since the input dimension of the parameterized policy must be a predetermined value [isele2018navigating, duan2020hierarchical, guan2020centralized]. The latter indicates that the information of surrounding vehicles needs to be permuted according to manually designed sorting rules because different permutations lead to different state representations and policy outputs [mirchevska2018high, wang2017formulation, wang2018reinforcement, wang2019continuous]. It is usually difficult to design a proper sorting order for complex driving scenarios such as intersections. These two challenges not only limit the generality of T2E across different driving scenarios, but also hurt the performance of the learned policy.
In this paper, we propose a new state representation method, called encoding sum and concatenation (ESC), for learning-based decision-making. The main contributions and advantages of this paper are as follows:

(1) The proposed ESC method introduces a representation NN to encode each surrounding vehicle into an encoding vector, and then adds these vectors to obtain the representation vector of the set of surrounding vehicles. By concatenating the set representation with other variables, such as indicators of the ego vehicle and road, we realize a fixed-dimensional and permutation-invariant state representation. Compared with the state descriptions used in existing T2E studies [mirchevska2018high, wang2017formulation, wang2018reinforcement, wang2019continuous, duan2020hierarchical, guan2020centralized], ESC is applicable to a variable number of surrounding vehicles and eliminates the need for manually predesigned sorting rules, leading to higher representation ability and generality.

(2) We further prove that the proposed ESC method can realize an injective representation if the output dimension of the representation NN is greater than the number of variables of all surrounding vehicles. This means that by taking the ESC representation as policy inputs, we can find the nearly optimal representation NN and policy NN by simultaneously optimizing them using gradient-based updating.

(3) Experiments on six SL-based policy learning benchmarks demonstrate that, compared with the fixed-permutation representation method, the proposed method improves the representation ability of the surrounding vehicles, and the corresponding approximation error is reduced by 62.2%.
In Section II, we describe the state representation problem, and analyze the effect of the dimension sensitive problem and the permutation sensitive problem on policy learning. Section III proposes the ESC state representation method. In Section IV, we present experimental results that show the efficacy of ESC. Section V concludes this paper.
II Problem Description
In this section, we first describe the state representation problem. Then, we analyze the effect of the dimension sensitivity problem and the permutation sensitivity problem on the performance, generality, and learning difficulty of the policy.
II-A Observation and State
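In this paper, we denote the observation set of driving scenarios as $\mathcal{O}$, which consists of: (a) the information set of surrounding vehicles $\mathcal{X} = \{x_1, x_2, \dots, x_M\}$, where $x_i \in \mathbb{R}^{d_1}$ is the indicator vector of the $i$th vehicle, and (b) the vector $x_{\rm else} \in \mathbb{R}^{d_2}$ containing other information related to the driving task, such as indicators of the ego vehicle and road geometry. Thus $\mathcal{O} = \{\mathcal{X}, x_{\rm else}\}$. The set size $M$ of $\mathcal{X}$, i.e., the number of surrounding vehicles within the perception range of the ego car, is constantly changing due to the dynamic nature of the traffic. Assuming that the range of the number of surrounding vehicles is $[1, N]$, the space of $\mathcal{X}$ can be denoted as $\overline{\mathcal{X}}$, i.e., $\overline{\mathcal{X}} = \{\mathcal{X} \mid \mathcal{X} = \{x_1, \dots, x_M\},\ M \in [1, N] \cap \mathbb{N}\}$. Note that the subscript $i$ of $x_i$ in $\mathcal{X}$ represents the ID of a certain vehicle. For example, $\{x_M, x_{M-1}, \dots, x_1\}$ indicates that the surrounding vehicles are sorted inversely according to the ID of each surrounding vehicle.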
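We denote the mapping from the observation set $\mathcal{O}$ to the state representation $s$ as $U(\mathcal{O})$, i.e.,
(1)  $s = U(\mathcal{O})$
Current T2E research usually concatenates the variables in $\mathcal{O}$ to obtain the state representation vector $s$. According to the permutation of surrounding vehicles in $\mathcal{X}$, there are two commonly used approaches: (1) all-permutation (AP) representation and (2) fixed-permutation (FP) representation. The AP method considers all possible permutations of surrounding vehicles in $\mathcal{X}$,
(2)  $s = U_{\rm AP}(\mathcal{O}) = [x_{\varsigma(1)}^\top, \dots, x_{\varsigma(M)}^\top, x_{\rm else}^\top]^\top$
where $U_{\rm AP}$ denotes the AP mapping and $\varsigma$ represents any permutation. Unlike the AP method, the FP method only considers one permutation, which permutes the objects in $\mathcal{X}$ according to a predefined sorting rule $o$,
(3)  $s = U_{\rm FP}(\mathcal{O}) = [x_{o(1)}^\top, \dots, x_{o(M)}^\top, x_{\rm else}^\top]^\top$
where $U_{\rm FP}$ denotes the FP mapping.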
According to (2) and (3), a change in either the number or the permutation of surrounding vehicles leads to a different state vector, bringing two challenges: (1) dimension sensitivity and (2) permutation sensitivity. To find a better state representation method, it is necessary to first analyze the impact of these two problems on policy learning.
II-B Dimension Sensitivity
The state dimension of the AP and FP methods grows linearly with the number of surrounding vehicles. Since this number is constantly changing during driving, the state dimension is not a fixed value. However, the input dimension of the parameterized policy must be a predetermined fixed value due to the structure of the approximate functions, such as neural networks (NNs) and polynomial functions. This means that T2E methods based on AP or FP representations are only valid when the number of surrounding vehicles is fixed [mirchevska2018high, wang2017formulation, wang2018reinforcement, wang2019continuous]. Assume that a T2E method only considers a fixed number of surrounding vehicles, as shown in Figure 1. When the actual number of surrounding vehicles exceeds this fixed number, we need to select a subset of vehicles based on predefined rules. When the actual number is smaller, we need to add virtual vehicles far away from the ego vehicle to meet the input requirement of the policy function without affecting decision-making. The former leads to information loss, while the latter introduces information redundancy. Therefore, it is usually necessary to select an appropriate fixed number according to the requirements of different driving tasks, which also limits the generality of the AP and FP methods.
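The selection and padding steps described above can be sketched as follows. This is only an illustration under assumed conventions: the function name `fit_to_fixed_count`, the nearest-first selection rule, and the layout in which the first two entries of each vehicle vector are its position relative to the ego vehicle are hypothetical and are not taken from the cited works.

```python
from typing import List, Sequence


def fit_to_fixed_count(
    vehicles: Sequence[Sequence[float]],
    k: int,
    far_away: Sequence[float],
) -> List[List[float]]:
    """Return exactly k vehicle vectors for a fixed-dimensional policy input.

    Assumption: entries 0 and 1 of each vector are the longitudinal and
    lateral position relative to the ego vehicle.
    """
    def squared_distance(v: Sequence[float]) -> float:
        return v[0] ** 2 + v[1] ** 2

    # More than k vehicles: keep only the k nearest ones (information loss).
    kept = sorted(vehicles, key=squared_distance)[:k]
    # Fewer than k vehicles: pad with virtual far-away vehicles (redundancy).
    padding = [list(far_away) for _ in range(k - len(kept))]
    return [list(v) for v in kept] + padding


crowded = [[5.0, 0.0, 1.0], [1.0, 0.0, 2.0], [9.0, 0.0, 3.0]]
sparse = [[5.0, 0.0, 1.0]]
virtual = [1000.0, 0.0, 0.0]
print(fit_to_fixed_count(crowded, 2, virtual))
print(fit_to_fixed_count(sparse, 3, virtual))
```

In the crowded scene the farthest vehicle is dropped, and in the sparse scene two virtual vehicles are appended, which mirrors the information loss and redundancy discussed above.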
II-C Permutation Sensitivity
As illustrated in Figure 2, assuming the number of surrounding vehicles is fixed, different permutations of the surrounding vehicles correspond to different state vectors, thereby leading to different policy outputs. In other words, the state vector and the policy outputs are sensitive to the order of surrounding vehicles. However, a reasonable driving decision should be permutation invariant (PI) to the order of objects in the set of surrounding vehicles because all possible permutations correspond to the same driving scenario. To analyze the effect of the permutation sensitivity problem, we first define the PI function as follows.
Definition 1.
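(Permutation Invariant Function). Function $f$ is permutation invariant to the order of objects in the set $\{x_1, \dots, x_M\}$ if $f(x_1, \dots, x_M) = f(x_{\varsigma(1)}, \dots, x_{\varsigma(M)})$ for any permutation $\varsigma$.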
For example, is a PI function w.r.t. . Similarly, we define the permutation sensitive function as
Definition 2.
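(Permutation Sensitive Function). Function $f$ is permutation sensitive to the order of objects in the set $\{x_1, \dots, x_M\}$ if there exists a permutation $\varsigma$ such that $f(x_1, \dots, x_M) \neq f(x_{\varsigma(1)}, \dots, x_{\varsigma(M)})$.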
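The two definitions can be checked numerically on small inputs. The sketch below is a brute-force test over all permutations; the helper `is_permutation_invariant` and the two example functions are hypothetical illustrations, not functions used in this paper.

```python
from itertools import permutations
from typing import Callable, Sequence


def is_permutation_invariant(
    f: Callable[[Sequence[float]], float],
    items: Sequence[float],
    tol: float = 1e-9,
) -> bool:
    """Brute-force check of Definition 1 on one concrete input."""
    reference = f(list(items))
    return all(abs(f(list(p)) - reference) <= tol for p in permutations(items))


def sum_of_squares(xs: Sequence[float]) -> float:
    # The result does not depend on the order of xs.
    return sum(x * x for x in xs)


def weighted_by_position(xs: Sequence[float]) -> float:
    # Each element is weighted by its position, so the order matters.
    return sum((i + 1) * x for i, x in enumerate(xs))


sample = [1.0, 2.0, 3.0]
print(is_permutation_invariant(sum_of_squares, sample))        # True
print(is_permutation_invariant(weighted_by_position, sample))  # False
```

A passing check on one input does not prove invariance in general, whereas a single failing permutation is enough to show permutation sensitivity in the sense of Definition 2.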
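We denote the optimal driving policy as $\pi^*(\mathcal{O})$, which is a PI function w.r.t. $\mathcal{X}$. The objective of T2E decision-making methods is to learn a parameterized policy $\pi(s;\theta)$, which takes $s = U(\mathcal{O})$ as input, such that
(4)  $\pi(U(\mathcal{O});\theta^*) = \pi^*(\mathcal{O}), \quad \forall \mathcal{O}$
where $\theta$ denotes the policy parameters and the superscript $*$ indicates that the parameters are optimal. An effective mapping $U$ will significantly reduce the difficulty of policy learning.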
For the AP representation method in (2), the policy is optimized by minimizing the following loss
(5) 
The problem is that there are $M!$ permutations for a particular set containing $M$ surrounding vehicles. This indicates that one driving scenario corresponds to $M!$ different state representations, which greatly increases the sample complexity.
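The factorial growth can be made concrete by enumerating every AP state vector of one small scene. This is a minimal sketch; the helper `ap_states` and the numeric values are made up for illustration.

```python
from itertools import permutations
from math import factorial
from typing import List, Sequence


def ap_states(
    vehicles: Sequence[Sequence[float]], x_else: Sequence[float]
) -> List[List[float]]:
    """Build one concatenated state vector per ordering of the vehicles."""
    states = []
    for ordering in permutations(vehicles):
        flat = [value for vehicle in ordering for value in vehicle]
        states.append(flat + list(x_else))
    return states


vehicles = [[1.0, 0.0], [2.0, 0.5], [3.0, -0.5]]
states = ap_states(vehicles, x_else=[20.0])
print(len(states), factorial(len(vehicles)))
```

All of these state vectors describe the same driving scenario, so a policy trained on AP states has to learn the same output for each of them separately.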
For the FP representation method in (3), the policy can be found by minimizing
(6) 
The predefined order of FP guarantees the permutation invariance of the policy w.r.t. the set of surrounding vehicles, reducing the sample complexity compared with the AP method. However, it may break the continuity of the policy function w.r.t. the elements of this set, i.e.,
(7)  
Since the position of each surrounding vehicle is dynamically changing during the driving, the position of information of the th vehicle may change at a certain time, resulting in a sudden change in the state and policy output . For example, the rear vehicle becomes the preceding vehicle by overtaking the ego vehicle. In particular, we will give a special case below for further explanation. Let , and , where is a variable. The rule sorts according to the first element of from small to large. It follows that when , and ; when , and . It can be seen that the permutation of objects in has changed around , which may cause a sudden change in policy outputs, i.e.,
(8)  
The policy discontinuity introduced by FP representations brings difficulties to policy learning, since the optimal driving policy should be continuous w.r.t. each element of the set of surrounding vehicles. Besides, it is usually difficult to design a proper sorting rule for complex driving scenarios such as intersections.
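The jump caused by a sorting rule can be reproduced with two vehicles whose first elements cross each other. The numbers below are made up for illustration and are not the special case from the text; `sum_state` is a plain element-wise sum used only as an order-free reference, not the ESC encoder itself.

```python
from typing import List, Sequence


def fp_state(vehicles: Sequence[Sequence[float]]) -> List[float]:
    """Sort vehicles by their first element, then concatenate them."""
    ordered = sorted(vehicles, key=lambda v: v[0])
    return [value for vehicle in ordered for value in vehicle]


def sum_state(vehicles: Sequence[Sequence[float]]) -> List[float]:
    """Order-free reference: element-wise sum over all vehicles."""
    return [sum(column) for column in zip(*vehicles)]


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    return max(abs(x - y) for x, y in zip(a, b))


eps = 1e-6
before = [[0.0, 1.0], [-eps, 5.0]]  # second vehicle slightly behind the first
after = [[0.0, 1.0], [+eps, 5.0]]   # second vehicle slightly ahead of the first

fp_jump = max_abs_diff(fp_state(before), fp_state(after))
sum_jump = max_abs_diff(sum_state(before), sum_state(after))
print(fp_jump, sum_jump)
```

A change of 2e-6 in one input moves the sorted state vector by about 4, because the two vehicles swap slots, while the order-free sum moves by only 2e-6.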
To conclude, due to the permutation sensitivity, AP and FP methods suffer from high sample complexity and policy discontinuity respectively, which may adversely affect the performance of policy learning.
III Encoding Sum and Concatenation State Representation
Both dimension sensitivity and permutation sensitivity damage the performance of the learned policy and limit the applicability of T2E decision-making in different driving scenarios. In the past five years, PI approximation methods for discrete data sets have been extensively studied [zaheer2017deepset, maron2020Deepset, sannai2019universalDeepset], but the theory is usually only applicable to (a) discrete and finite cases such as images or (b) continuous sets of fixed size; it is barely applicable to continuous sets of variable size. In this section, the existing PI approximation theory is extended to state representation in autonomous driving, and an encoding sum and concatenation (ESC) method is proposed to realize a fixed-dimensional and PI state representation of the observation set.
III-A State Representation
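As shown in Figure 3, the mathematical description of the proposed ESC state representation is
(9)  $U_{\rm ESC}(\mathcal{O};\phi) = \big[\sum_{x\in\mathcal{X}} h(x;\phi)^\top,\ x_{\rm else}^\top\big]^\top$
where $h(x;\phi): \mathbb{R}^{d_1} \to \mathbb{R}^{d_3}$ is the representation NN with parameters $\phi$ and $d_3$ is the output dimension. Different from $U_{\rm AP}$ and $U_{\rm FP}$, the ESC mapping $U_{\rm ESC}$ is a parameterized function. ESC first encodes each $x$ in the set $\mathcal{X}$ into the corresponding encoding vector $x_{\rm encode}$, i.e.,
(10)  $x_{\rm encode} = h(x;\phi)$
Then, we obtain the representation vector $x_{\rm set}$ of the surrounding vehicles set by summing the encoding vectors of all surrounding vehicles
(11)  $x_{\rm set} = \sum_{x\in\mathcal{X}} h(x;\phi)$
From (11), it is clear that $x_{\rm set} \in \mathbb{R}^{d_3}$ for any set size $M$. In other words, $x_{\rm set}$ is a fixed-dimensional representation. Furthermore, the summation operator in (11) is PI w.r.t. $\mathcal{X}$. Thus, $U_{\rm ESC}(\mathcal{O};\phi)$ is a fixed-dimensional and PI state representation of the observation $\mathcal{O}$.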
By taking $U_{\rm ESC}(\mathcal{O};\phi)$ as the input of $\pi$, the policy function can be expressed as
(12)  $\pi(U_{\rm ESC}(\mathcal{O};\phi);\theta)$
where $\pi$ is PI w.r.t. the set $\mathcal{X}$. As shown in Figure 4, the policy consists of two layers: (1) an ESC representation layer and (2) an approximation layer. In the following, we refer to the policy function based on the ESC representation as the ESC policy.
III-B Injection and Optimality Analysis
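The encode, sum, and concatenate steps can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions: the single tanh layer with random weights in `make_encoder` is a stand-in for the trained representation NN, and the names `make_encoder`, `esc_state`, and all dimensions are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence

Encoder = Callable[[Sequence[float]], List[float]]


def make_encoder(d_in: int, d_out: int, seed: int = 0) -> Encoder:
    """Stand-in for the representation NN: one random tanh layer."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
    biases = [rng.uniform(-1, 1) for _ in range(d_out)]

    def h(x: Sequence[float]) -> List[float]:
        return [
            math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)
        ]

    return h


def esc_state(
    vehicles: Sequence[Sequence[float]],
    x_else: Sequence[float],
    h: Encoder,
    d_out: int,
) -> List[float]:
    """Encode each vehicle, sum the encodings, then concatenate x_else."""
    x_set = [0.0] * d_out
    for x in vehicles:
        for j, value in enumerate(h(x)):
            x_set[j] += value
    return x_set + list(x_else)


h = make_encoder(d_in=3, d_out=4, seed=1)
two = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
three = two + [[-0.3, 0.0, 0.9]]
print(len(esc_state(two, [1.0, 2.0], h, 4)), len(esc_state(three, [1.0, 2.0], h, 4)))
```

Both scenes yield a state of length 6 regardless of the number of vehicles, and reordering the vehicles changes the result only by floating-point rounding in the sum.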
In addition to the fixed dimension and permutation invariance attributes, to ensure the existence of $\phi^*$ and $\theta^*$ such that
(13)  $\pi(U_{\rm ESC}(\mathcal{O};\phi^*);\theta^*) = \pi^*(\mathcal{O}), \quad \forall \mathcal{O}$
the ESC state representation also needs to be injective w.r.t. the set of surrounding vehicles. If the ESC mapping is injective, any two different sets of surrounding vehicles are mapped to different state representations. In contrast, if it is not injective, there exist two different sets of surrounding vehicles that are mapped to the same state representation. This indicates that two different driving scenarios correspond to an identical state representation, which leads to the same policy outputs, thus impairing driving safety. Therefore, it is crucial to make sure that there exist representation parameters such that the ESC mapping is injective.
Before proving the injectivity of the proposed ESC method, the following two lemmas are needed.
Lemma 1.
(Universal Approximation Theorem [Hornik1990Universal]). For any continuous function $f$ on a compact set $\Omega$, there exists an over-parameterized NN (i.e., the number of hidden neurons is sufficiently large) that uniformly approximates $f$ and its gradient to within arbitrarily small error on $\Omega$.
Lemma 2.
(Sum-of-power mapping [zaheer2017deepset]). Let and define a sum-of-power mapping as
(14) 
The mapping is an injection (i.e. ) if .
Then, the main theorem is given as follows.
Theorem 1.
(Injectivity of the ESC State Representation). Let , where and , in which and are the lower and upper bounds of each element in , respectively. Denote the space of as , where . Note that the size of the set is variable. If the representation NN is over-parameterized and its output dimension is greater than the number of variables of all surrounding vehicles, there always exist representation parameters such that the mapping in (9) is injective.
Proof.
Let . We concatenate the th element of each for into the vector . By normalizing using the minmax scaling method, for , we will get
(15) 
According to Lemma 2, when , the sumofpower mapping expressed as
(16) 
is injective when is a fixed value.
Let . From (16), if , the mapping defined as
(17) 
is also injective. In particular, the item makes this mapping suitable for the case where the set size is variable.
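As a small illustration of the sum-of-power idea for scalar elements in [0, 1], the sketch below compares two sets of different sizes. The helper `power_sums` and its `include_count` flag are hypothetical, and the zeroth-power term used here to record the set size is an assumption about how variable-size sets can be handled, not a statement of the exact form of (17).

```python
from typing import List, Sequence


def power_sums(
    values: Sequence[float], max_power: int, include_count: bool = True
) -> List[float]:
    """Return [sum(v**k) for k = start..max_power]; k = 0 gives the set size."""
    start = 0 if include_count else 1
    return [sum(v ** k for v in values) for k in range(start, max_power + 1)]


small = [0.2, 0.5]
large = [0.0, 0.2, 0.5]  # same as `small` plus one element equal to zero

# Without the zeroth power, the extra zero element leaves no trace.
print(power_sums(small, 3, include_count=False) == power_sums(large, 3, include_count=False))
# With the zeroth power, the first entry records the set size (2 vs 3).
print(power_sums(small, 3)[0], power_sums(large, 3)[0])
```

The two sets collide when only powers one and higher are summed, and become distinguishable once the count term is included, which is why a term of this kind matters for sets of variable size.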
Next, we will analyze the optimality of the ESC representation and the ESC policy.
Lemma 3.
(Global Minima of Over-Parameterized Neural Networks [allen2018convergence, du2019overconverge]). Consider the following optimization problem
where is the training input, is the associated label, is the dataset, is the parameter to be optimized, and is an NN. If the NN is over-parameterized, simple algorithms such as gradient descent (GD) or stochastic GD (SGD) can find global minima of the training objective in polynomial time, as long as the dataset is non-degenerate. The dataset is non-degenerate if the same inputs have the same labels.
Theorem 2.
Given any continuous function operating on a the set , i.e., which is permutation invariant to the elements in . If the representation NN and policy NN are both overparameterized, and , we can find and which make (13) hold by directly minimizing using optimization methods such as GD and SGD, where
(20)  
Proof.
From Theorem 1, there exist representation parameters such that the mapping in (9) is injective. Furthermore, from Lemma 1, one has
(21) 
In other words, there exists a pair of representation and policy parameters that makes the ESC policy approximate the target function arbitrarily closely. Although the nearly optimal parameters may not be unique, according to Lemma 3, we can find such a pair that makes (13) hold by directly minimizing the loss in (20) using optimization methods such as GD and SGD. ∎
Remark 1.
The representation NN is only related to , but is independent of function . This indicates that for any different continuous PI functions and operating on set , for the same injective mapping , there exist and assuring
and
for and , respectively.
IV Experimental Verification
Table I: PI target policy functions (benchmarks No. 1 to 6).
This section validates the effectiveness of the proposed methods in the policy learning task based on supervised learning. We take AP representation and FP representation methods as baselines.
IV-A Experiment Design
We set the dimension of to , and each element of is bounded by and , i.e., . Similarly, we set . We assume that the maximum size of set is , i.e., . Based on these settings, we construct six PI target policy functions in Table I as benchmarks. Noted that , , in Table I represent taking the mean value, maximum and minimum of elements in , respectively, and denotes the norm of .
We learn a policy to approximate each benchmark using different state representation methods. The performance of the ESC method is then evaluated by comparing the approximation errors of the different representations. As shown in Table II, according to the size of the set of surrounding vehicles, the experiment for each benchmark is divided into five cases: four with fixed set sizes (cases 1 to 4) and one with a variable set size (case 5). Only ESC is applicable to the variable-size set, that is, case 5.

Table II

Case  Set size  Representation methods
1     fixed     1) ESC; 2) FP; 3) AP
2     fixed     1) ESC; 2) FP; 3) AP
3     fixed     1) ESC; 2) FP; 3) AP
4     fixed     1) ESC; 2) FP; 3) AP
5     variable  ESC
For each case of each benchmark, we randomly generated a training set containing one million samples and a test set containing 2048 samples. The th sample in or is denoted as , where and are sampled uniformly within their space, and . Given , the policy NN (and the representation NN for ESC) based on the AP, FP, and ESC are optimized by directly minimizing (5), (6) and (20), respectively. For the FP method, the predefined order sorts the elements of according to the first element of from small to large. If the first element is equal, we will compare the second element, and so on.
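The uniform sampling and the lexicographic FP sorting rule can be sketched as follows. The function names, the set size, the vehicle dimension, and the bounds used in the example call are placeholders chosen for illustration, not the settings of the experiments.

```python
import random
from typing import List


def sample_vehicle_set(
    m: int, d: int, low: float, high: float, rng: random.Random
) -> List[List[float]]:
    """Draw m vehicle vectors of dimension d uniformly from [low, high]."""
    return [[rng.uniform(low, high) for _ in range(d)] for _ in range(m)]


def fp_order(vehicles: List[List[float]]) -> List[List[float]]:
    """Sort by the first element, breaking ties with the second, and so on.

    Python compares lists element by element, which is exactly this
    lexicographic rule.
    """
    return sorted(vehicles)


rng = random.Random(0)
scene = sample_vehicle_set(m=4, d=3, low=-1.0, high=1.0, rng=rng)
fp_input = [value for vehicle in fp_order(scene) for value in vehicle]
print(len(fp_input))

print(fp_order([[1.0, 2.0], [0.0, 5.0], [1.0, 0.0]]))
```

In the second call the two vehicles that share the first element 1.0 are ordered by their second element, so the result is `[[0.0, 5.0], [1.0, 0.0], [1.0, 2.0]]`.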
IV-B Training Details
For the ESC method, we use a fully connected network with five hidden layers of 256 units each, with Gaussian Error Linear Unit (GELU) activations in each hidden layer [hendrycks2016gelu], for both the representation NN and the policy NN (see Figure 5(a)). The output layer of each NN is linear. According to Theorem 1, the output dimension of the representation NN should be greater than the number of variables of all surrounding vehicles, so we set it to 101.
Unlike the ESC method, which contains two NNs, AP and FP only need to learn a policy NN. To avoid the influence of different NN architectures on learning accuracy, the policy NN for these two methods is designed as shown in Figure 5(b). This architecture comprises 11 hidden layers, each containing 256 units with GELU activations, except for the middle layer (i.e., the 6th layer). The middle layer is a linear layer containing 101 units, which equals the output dimension of the representation NN. The input dimension depends on the size of the set of surrounding vehicles. Therefore, the approximation structures in Figure 5(a) and (b) have the same number of hidden layers and neurons. In particular, when , these two architectures are identical. This design greatly reduces the impact of network structure differences on learning accuracy. By guaranteeing the similarity of the approximation architectures, we can focus on comparing the policy learning accuracy of the different state representation methods.
For all representation methods, we adopt Adam [Diederik2015Adam] to update the NNs, where the decay rates of the first- and second-order moments are and , respectively. The batch size is 512 and the learning rate is .
IV-C Results Analysis
We train 5 different runs of each representation method with different random seeds, and evaluate the learning accuracy by calculating the Root Mean Square Error (RMSE) on the test set. The training curves of benchmark 1 are shown in Figure 6. In addition to the cases with fixed-size sets (cases 1-4 in Table II), we also train an ESC policy on samples from the variable-size set (case 5). The ESC policy based on case 5 is evaluated at each of the fixed set sizes of cases 1-4, shown as the blue solid lines in Figure 6.
Figure 6: Training curves of benchmark 1. The solid lines correspond to the mean RMSE and the shaded regions correspond to the 95% confidence interval over 5 runs.
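The evaluation metric and its aggregation over runs can be sketched as below. The helper names and the per-run numbers are made up for illustration, and using the sample standard deviation (`statistics.stdev`) rather than the population version is an assumption, since the text does not say which one is reported.

```python
import math
import statistics
from typing import Sequence, Tuple


def rmse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Root mean square error between predictions and targets."""
    squared = [(p - t) ** 2 for p, t in zip(pred, target)]
    return math.sqrt(sum(squared) / len(squared))


def summarize_runs(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of a metric over several runs."""
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical final test RMSE of five runs with different random seeds.
run_rmse = [3.7, 3.9, 3.8, 3.6, 3.8]
mean, std = summarize_runs(run_rmse)
print(f"{mean:.2f} +/- {std:.2f}")
```

Each run contributes one RMSE value computed on the test set, and the spread across the five seeds is what the shaded regions and the deviations in the results table summarize.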
Table III: Final RMSE under each experimental setting. Rows within each benchmark follow the fixed set sizes of cases 1-4 in Table II.

Benchmark  Case  ESC (variable set size)  ESC (fixed set size)  FP            AP
1          1     3.78±0.1                 3.77±0.15             7.42±0.13     8.5±0.07
1          2     3.6±0.06                 4.29±0.08             7.68±0.07     9.35±0.06
1          3     3.51±0.02                4.6±0.08              8.42±0.17     10.36±0.41
1          4     4.19±0.05                5.02±0.06             9.04±0.06     10.93±0.06
2          1     36.87±0.29               30.69±0.36            53.63±0.01    55.83±0.02
2          2     31.83±0.57               27.76±0.15            56.42±0.05    60.14±0.36
2          3     30.15±0.67               29.97±1.05            54.18±0.41    56.25±0.12
2          4     32.6±0.94                33.9±1.08             51.58±0.19    53.56±0.46
3          1     12.46±0.73               10.98±0.19            42.31±1.11    57.56±1.21
3          2     5.56±0.19                7.62±0.11             33.09±1.05    43.82±0.92
3          3     3.82±0.17                6.13±0.16             29.59±0.23    44.0±0.34
3          4     6.77±0.62                5.81±0.34             31.6±0.87     44.94±1.43
4          1     5.96±0.17                4.42±0.36             10.82±0.14    12.28±0.07
4          2     4.33±0.14                4.7±0.15              9.4±0.21      10.19±0.05
4          3     3.8±0.08                 4.46±0.26             8.3±0.37      9.19±0.1
4          4     3.89±0.11                4.72±0.07             8.07±0.2      8.53±0.08
5          1     5.57±0.08                3.95±0.17             18.59±0.31    24.79±0.47
5          2     2.88±0.08                2.39±0.07             16.93±0.24    25.96±0.35
5          3     2.2±0.06                 1.89±0.1              14.82±0.3     24.16±0.46
5          4     2.1±0.07                 1.47±0.03             13.72±0.21    23.26±0.39
6          1     40.88±1.33               43.02±3.11            66.01±1.78    64.97±4.91
6          2     35.52±0.28               42.9±2.17             219.57±16.9   355.9±6.21
6          3     43.59±1.28               56.56±0.87            349.63±3.82   679.42±25.02
6          4     59.64±1.56               62.02±7.02            508.31±17.91  832.74±5.49

± corresponds to a single standard deviation over 5 runs.
Figure 7 and Table III display the final RMSE under each experimental setting. The results show that the proposed ESC method outperforms or matches the two baselines in all benchmarks and cases. Across all cases, the RMSE of the FP method is 20% lower than that of the AP method on average, because the predefined order helps to reduce the sample complexity. Compared with the FP and AP methods, ESC with fixed set size achieves an average error reduction of 62.2% and 67.5%, respectively. This indicates that ESC is more suitable for representing the set of surrounding vehicles due to its permutation invariance and continuity. In addition, ESC eliminates the requirement of manually designed sorting rules. The learning accuracy of ESC with variable set size is comparable to that with fixed set size. This suggests that the ESC method is capable of representing variable-size sets, thereby eliminating the burden of training different approximation NNs for scenarios with different numbers of surrounding vehicles. To conclude, the experimental results indicate that the proposed ESC method improves the representation ability of driving observations.
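The average error reductions can be recomputed from the mean RMSE values in Table III. The sketch below assumes that the reported figure is the mean of the per-setting relative reductions over all 24 fixed-size settings, and that each table entry is read as a mean followed by a standard deviation; both points are assumptions, and the helper name `mean_relative_reduction` is hypothetical.

```python
from typing import Sequence

# Mean RMSE per setting (benchmarks 1-6, four fixed set sizes each).
esc_fixed = [3.77, 4.29, 4.6, 5.02, 30.69, 27.76, 29.97, 33.9,
             10.98, 7.62, 6.13, 5.81, 4.42, 4.7, 4.46, 4.72,
             3.95, 2.39, 1.89, 1.47, 43.02, 42.9, 56.56, 62.02]
fp = [7.42, 7.68, 8.42, 9.04, 53.63, 56.42, 54.18, 51.58,
      42.31, 33.09, 29.59, 31.6, 10.82, 9.4, 8.3, 8.07,
      18.59, 16.93, 14.82, 13.72, 66.01, 219.57, 349.63, 508.31]
ap = [8.5, 9.35, 10.36, 10.93, 55.83, 60.14, 56.25, 53.56,
      57.56, 43.82, 44.0, 44.94, 12.28, 10.19, 9.19, 8.53,
      24.79, 25.96, 24.16, 23.26, 64.97, 355.9, 679.42, 832.74]


def mean_relative_reduction(ours: Sequence[float], baseline: Sequence[float]) -> float:
    """Average of (1 - ours / baseline) over all settings."""
    return sum(1.0 - o / b for o, b in zip(ours, baseline)) / len(ours)


reduction_vs_fp = 100.0 * mean_relative_reduction(esc_fixed, fp)
reduction_vs_ap = 100.0 * mean_relative_reduction(esc_fixed, ap)
print(f"vs FP: {reduction_vs_fp:.1f}%  vs AP: {reduction_vs_ap:.1f}%")
```

Under these assumptions the reduction comes out at about 62.2% relative to FP and about 67.5% relative to AP, which is consistent with FP being the stronger of the two baselines.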
V Conclusions
In this paper, we propose a new state representation method, called encoding sum and concatenation (ESC), for decision-making in autonomous driving. Unlike existing state representation methods, ESC is applicable to a variable number of surrounding vehicles and eliminates the need for manually predesigned sorting rules, leading to higher representation ability and generality. The proposed ESC method introduces a representation neural network (NN) to encode each surrounding vehicle into an encoding vector, and then adds these vectors to obtain the representation vector of the set of surrounding vehicles. By concatenating the set representation with other variables, such as indicators of the ego vehicle and road, we realize a fixed-dimensional and permutation-invariant state representation. We further prove that the proposed ESC method can realize an injective representation if the output dimension of the representation NN is greater than the number of variables of all surrounding vehicles. This means that by taking the ESC representation as policy inputs, we can find the nearly optimal representation NN and policy NN by simultaneously optimizing them using gradient-based updating. Experiments demonstrate that, compared with the fixed-permutation representation method, the proposed method improves the representation ability of the surrounding vehicles, and the corresponding approximation error is reduced by 62.2%.