The compensation of mechanical and structural vibration has significant applications in manufacturing, infrastructure engineering and other domains. In automotive or aerospace applications, vibration reduces component lifetime, and the associated acoustic noise can cause discomfort. In machine tools, residual vibrations degrade the positioning accuracy and cause material fatigue.
The compensation of such mechanical vibrations is a large and important field of research. Various methods have been applied to provide solutions for this challenging problem.
In this paper we briefly review the state of the art in vibration compensation of dynamic feed drive systems (the main sources of motion in machine tools) and describe the drawbacks of existing solutions. We propose a novel approach based on deep reinforcement learning to compensate vibrations in dynamic drive systems with a priori unknown system parameters. The proposed method is experimentally validated using a linear direct drive and control hardware and software customary in the machine tool industry.
II State of the Art
Research in the field of vibration compensation can roughly be divided into three categories: hardware design, command shaping, and feedback control. These research areas are outlined in the following.
II-A Hardware Design
In hardware design approaches, vibration compensation is achieved by using additional mechanical systems. The damping of the system is increased by mass dampers and vibration absorbers. The advantage of these systems lies in their simple construction and cost-effective implementation. The major drawback of hardware design approaches is their low flexibility [21], since the hardware (e.g. passive dampers [20]) is designed to overcome an application-specific problem.
II-B Command Shaping
Command shaping methods alter the reference motion trajectory in order to filter out specific frequencies, so that vibrations are suppressed preemptively. The major downside of command shaping methods is that a dynamic model of the system, or at least its natural frequencies and damping behavior, has to be known beforehand with sufficient accuracy. The dynamic model has to be re-evaluated whenever the system parameters vary. One of the earliest publications on command shaping is [18]. Smith proposed a method, known as posicast control, that processes a baseline command and delays a part of the command before transferring it to the system. The delayed portion of the command cancels out the vibration induced by the undelayed part of the motion command. A key advancement in command shaping was the concept of robustness: commands can be designed to work well even when large modeling errors exist. Singer and Seering presented an input-shaping method [17] that increases the robustness of the input-shaping process. They used an additional constraint that forces the derivative of the residual vibration with respect to the frequency to equal zero:

$$\frac{\partial}{\partial \omega} V(\omega, \zeta) = 0 \qquad (1)$$

where $V(\omega, \zeta)$ denotes the residual vibration, $\omega$ is the natural frequency, and $\zeta$ is the damping ratio. When (1) is satisfied, the result is a Zero Vibration and Derivative (ZVD) shaper containing three impulses.
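As an illustration of the three-impulse ZVD shaper, the following minimal sketch computes the impulse amplitudes and times from the textbook closed-form solution and evaluates the residual vibration of the resulting sequence. The function names `zvd_shaper` and `residual_vibration` and the example values (10 Hz, damping ratio 0.05) are assumptions made for this illustration and are not taken from the cited works.

```python
import math


def zvd_shaper(omega_n, zeta):
    """Return (amplitudes, times) of a three-impulse ZVD shaper.

    omega_n: undamped natural frequency in rad/s (assumed known)
    zeta:    damping ratio, 0 <= zeta < 1 (assumed known)
    """
    if not 0.0 <= zeta < 1.0:
        raise ValueError("zeta must satisfy 0 <= zeta < 1")
    root = math.sqrt(1.0 - zeta ** 2)
    omega_d = omega_n * root                 # damped natural frequency
    half_period = math.pi / omega_d          # half of the damped period
    k = math.exp(-zeta * math.pi / root)     # decay over half a period
    norm = (1.0 + k) ** 2                    # makes the amplitudes sum to 1
    amplitudes = [1.0 / norm, 2.0 * k / norm, k ** 2 / norm]
    times = [0.0, half_period, 2.0 * half_period]
    return amplitudes, times


def residual_vibration(amplitudes, times, omega, zeta):
    """Residual vibration of an impulse sequence, relative to a unit impulse,
    for a second-order system with natural frequency omega and damping zeta."""
    omega_d = omega * math.sqrt(1.0 - zeta ** 2)
    c = sum(a * math.exp(zeta * omega * t) * math.cos(omega_d * t)
            for a, t in zip(amplitudes, times))
    s = sum(a * math.exp(zeta * omega * t) * math.sin(omega_d * t)
            for a, t in zip(amplitudes, times))
    return math.exp(-zeta * omega * times[-1]) * math.hypot(c, s)


if __name__ == "__main__":
    omega_design = 2.0 * math.pi * 10.0      # hypothetical 10 Hz mode
    amps, ts = zvd_shaper(omega_design, 0.05)
    print("amplitudes:", amps)
    print("times [s]:", ts)
    # Residual vibration at the design frequency and with a 5 % frequency error
    print(residual_vibration(amps, ts, omega_design, 0.05))
    print(residual_vibration(amps, ts, 1.05 * omega_design, 0.05))
```

Because both the residual vibration and its frequency derivative vanish at the design point, the residual grows only quadratically with the frequency error, which is the robustness gain over a two-impulse shaper. The sketch still needs the natural frequency and damping ratio as inputs, which is exactly the modeling requirement named above as the downside of command shaping.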
II-C Feedback Control
Vibration compensation using feedback control, also known as active vibration control, incorporates sensors to measure the mechanical disturbance and a controller that computes an appropriate counter-vibration and drives an actuator accordingly. Destructive interference from the additional movements generated by the controller reduces or neutralizes the effects of the disturbance on the structure. The scheme of feedback control is depicted in Figure (1). The output $y$ of the system is compared with the reference input $r$. The resulting error signal $e = r - y$ is passed into a compensator $H(s)$ and applied to the system $G(s)$. The controller is designed with the aim of determining an appropriate transfer function $H(s)$ of the compensator that induces the sought-after performance while maintaining system stability.
The objective of active damping is to reduce the resonant peaks of the closed-loop control circuit, described by the transfer function

$$F(s) = \frac{y(s)}{r(s)} = \frac{G(s)H(s)}{1 + G(s)H(s)} \qquad (2)$$
Active damping can be achieved without modelling the system dynamics, but it is only effective near resonance peaks. Moreover, stability can only be guaranteed when the sensors and actuators are collocated. Model-based methods attenuate all disturbances within the control bandwidth, but require an accurate model of the system. Such methods generally have limited bandwidth and have to cope with control and observation spillover [11].
Coordinate Coupling Control (CCC) is an energy-based method to eliminate the transient vibration of an oscillatory system [7]. The technique was later extended to compensate steady-state vibrations [1].
Robust control approaches focus on the trade-off between performance and stability in the presence of system model uncertainties. The $H_\infty$ controller is designed to address these uncertainties systematically.
$H_\infty$ methods formulate the control problem as a mathematical optimization problem and solve it. The resulting $H_\infty$ controller is then optimal with respect to the prescribed cost function. Applications of the method to the vibration control of flexible structures can be found in [2, 5].
Optimal control theory applied in vibration control aims to reduce the vibration of the mechanical system to the greatest possible extent. The method seeks to compute the feedback gain by minimizing a cost function or a performance index, which is correlated to the required measure of the system response. Popular approaches are [22, 4].
There are attempts in the state of the art to reduce vibrations using machine learning. However, in contrast to our proposed approach, these either do not use reinforcement learning, like the neuro-fuzzy approach in [6], or apply reinforcement learning only in simulation, to an active automotive suspension [3], and not to machine tools on industrial hardware.
The main shortcomings of the state-of-the-art methods lead to the following problem statements:
Complex modelling of the underlying system dynamics is required.
The methods do not learn from past performance to improve future actions.
The vibration compensation behavior does not adapt automatically to changes in the structure/system.
III-A Reinforcement Learning and Policy Optimization
In the following, we define the Reinforcement Learning (RL) problem and introduce the notation used throughout the paper. We consider a finite-horizon, discounted Markov Decision Process (MDP). At each time step $t$, the RL agent observes the current state $s_t$, performs an action $a_t$, and then receives a reward $r_t$. It then observes the resulting state $s_{t+1}$, which is determined by the unknown dynamics of the environment $p(s_{t+1} \mid s_t, a_t)$. An episode has a pre-defined length of $T$ time steps. The goal of the agent is to find the parameters $\theta$ of a policy $\pi_\theta(a_t \mid s_t)$ that maximize the expected cumulated reward over a trajectory $\tau$:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right] \qquad (3)$$

where $\gamma \in [0, 1]$ is the discount factor.
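To make the discounted cumulated reward concrete, the following minimal sketch computes it for the reward sequence of a single episode and averages it over several episodes as a simple Monte Carlo estimate of the expectation. The function names and the example reward values are hypothetical and serve only as illustration.

```python
def discounted_return(rewards, gamma):
    """Discounted cumulated reward sum_t gamma**t * rewards[t] of one episode."""
    ret = 0.0
    # Accumulate backwards so that each step costs a single multiplication.
    for reward in reversed(rewards):
        ret = reward + gamma * ret
    return ret


def estimate_objective(episodes, gamma):
    """Monte Carlo estimate of the expected discounted return over episodes."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)


if __name__ == "__main__":
    # Hypothetical reward sequences of two short episodes
    episodes = [[-1.0, -1.0, 0.0], [-1.0, 0.0, 0.0]]
    print(estimate_objective(episodes, gamma=0.99))
```

A discount factor close to 1 weights late rewards almost as strongly as early ones, which matters when the reward signal only changes once the goal has been reached.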
RL methods solve an MDP by interacting with the system and accumulating the obtained reward. We consider several model-free policy gradient algorithms with open-source implementations that appear frequently in the literature, e.g. Soft Actor-Critic approaches, Deep Deterministic Policy Gradient (DDPG) [16], and Proximal Policy Optimization (PPO) [15]. The major advantage of the PPO algorithm is that it allows a Long Short-Term Memory (LSTM) to be incorporated effortlessly. An LSTM is a specific recurrent neural network (RNN) architecture that was designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. The use of an LSTM significantly improves how well the system dynamics are modelled (e.g. determining the actual vibrations from subsequent deflection observations). Therefore, we use the PPO algorithm [15] for the training of the agent.
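Generally, PPO maximizes (3) using a robust version of the policy gradient theorem

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right] \qquad (4)$$

and performing gradient ascent steps

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta) \qquad (5)$$

where $J(\theta)$ denotes the objective (3) of the policy $\pi_\theta$, $\hat{A}_t$ is an estimate of the advantage of taking action $a_t$ in state $s_t$, and $\alpha$ is the step size.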
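One common way to obtain such a robust policy gradient update is the clipped surrogate objective from the PPO paper. The sketch below evaluates it for a single sample and shows a plain gradient ascent step on a parameter list. The function names, the clipping range of 0.2 and the example values are assumptions for illustration and do not necessarily match the configuration used in the experiments.

```python
import math


def clipped_surrogate(log_prob_new, log_prob_old, advantage, clip_range=0.2):
    """PPO clipped surrogate objective for one (state, action) sample.

    log_prob_new / log_prob_old: log-probability of the action under the
    current and the data-collecting policy; advantage: advantage estimate.
    """
    ratio = math.exp(log_prob_new - log_prob_old)
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # The minimum removes the incentive to move the policy far from the old one.
    return min(ratio * advantage, clipped_ratio * advantage)


def gradient_ascent_step(theta, gradient, step_size):
    """Return the parameters after one ascent step theta + step_size * gradient."""
    return [p + step_size * g for p, g in zip(theta, gradient)]


if __name__ == "__main__":
    # Probability ratio of 1.5 with a positive advantage is clipped to 1.2
    print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))
    print(gradient_ascent_step([1.0, 2.0], [0.5, -1.0], step_size=0.1))
```

The clipping bounds how much a single batch of collected trajectories can change the policy, which is useful when every trajectory has to be recorded on physical hardware.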
IV Vibration Compensation using RL
IV-A Problem Formulation
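We consider vibration-compensated movement commands, which can be described as moving to a target position while compensating vibrations along the way. Let $x_t$ and $\delta_t$ denote the actual position and the actual vibration, respectively, at time $t$, and let $u_t$ denote the control command (velocity) applied to the system at that time. A movement command $m = (x^{*}, \delta^{*})$ is described by a target position $x^{*}$ and a desired vibration $\delta^{*}$. To compensate occurring vibrations, $\delta^{*}$ is set to 0. Given the movement command $m$, an initial system state $(x_0, \delta_0)$, and a time horizon $T$, our vibration compensation problem is formulated as

$$\min_{u_0, \dots, u_{T-1}} \; \sum_{t=1}^{T} \ell(x_t, \delta_t, m) \qquad (6)$$

$$\text{s.t.} \quad (x_{t+1}, \delta_{t+1}) = f(x_t, \delta_t, u_t) \qquad (7)$$

where $f$ describes the unknown dynamics of the system and $\ell$ is the loss function defined by the summed squared distance

$$\ell(x_t, \delta_t, m) = (x_t - x^{*})^2 + (\delta_t - \delta^{*})^2 \qquad (8)$$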
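The summed squared distance between a recorded trajectory and the movement command can be sketched as follows. The function name, the argument names and the equal, unweighted treatment of position and vibration error are assumptions for illustration, since the exact form of the loss may differ.

```python
def trajectory_loss(positions, vibrations, target_position, target_vibration=0.0):
    """Sum of squared position and vibration errors over a trajectory.

    positions / vibrations: measured values per time step (same length)
    target_position:        commanded target position
    target_vibration:       desired vibration, 0 for full compensation
    """
    if len(positions) != len(vibrations):
        raise ValueError("positions and vibrations must have the same length")
    return sum((x - target_position) ** 2 + (d - target_vibration) ** 2
               for x, d in zip(positions, vibrations))


if __name__ == "__main__":
    # Hypothetical three-step trajectory approaching the target position 2.0
    print(trajectory_loss([0.0, 1.0, 2.0], [1.0, 0.5, 0.0], target_position=2.0))
```

The loss is zero only if the axis sits at the target position without residual vibration at every evaluated time step, so shorter settling times yield lower cost.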
IV-B MDP Architecture
The flow of information of our method is shown in Figure 2. With respect to the introduced notation for the MDP, we define observations, actions, and rewards as follows:
Observation. A state is described by the actual position and velocity of the feed drive translator and the current vibration. Note that an observation of the current vibration solely incorporates the subsequent measurements of the deflection, not the frequency. The agent has to determine the actual frequency based on five preceding measurements of the deflection.
Action. The agent's action is defined by a continuous velocity command.
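A minimal sketch of how such an observation could be assembled is shown below: position, velocity and the five most recent deflection measurements are stacked into one vector. The class name `ObservationBuffer`, the zero-filled initial history and the ordering of the entries are assumptions for illustration.

```python
from collections import deque


class ObservationBuffer:
    """Keeps the most recent deflection samples and builds observations."""

    def __init__(self, history=5):
        self.history = history
        self.reset()

    def reset(self):
        """Clear the history at the start of an episode (zero-filled)."""
        self.deflections = deque([0.0] * self.history, maxlen=self.history)

    def observe(self, position, velocity, deflection):
        """Append the newest deflection and return the observation vector."""
        self.deflections.append(deflection)   # oldest sample drops out
        return [position, velocity, *self.deflections]


if __name__ == "__main__":
    buffer = ObservationBuffer()
    for step, deflection in enumerate([0.3, 0.1, -0.2, -0.3, -0.1, 0.2]):
        obs = buffer.observe(position=0.01 * step, velocity=0.01,
                             deflection=deflection)
    print(obs)  # position, velocity, then the five latest deflections
```

Since only raw deflection samples are passed on, frequency and phase of the vibration have to be inferred by the policy itself, for example through its recurrent layer.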
Reward. The reward signal is described as follows
Given the reward function in formula (9), a negative reward is received as long as the target position and/or the desired vibration is not reached. Otherwise the agent receives a reward of zero.
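A sparse reward of this kind can be sketched as follows. The penalty of -1, the reward of 0 at the goal and both tolerances are assumptions chosen for illustration. The zero reward at the goal is merely consistent with an episode reward that converges towards zero from below.

```python
def sparse_reward(position, vibration, target_position, target_vibration=0.0,
                  position_tol=1e-3, vibration_tol=1e-3, penalty=-1.0):
    """Return 0.0 if both goals are met within tolerance, else the penalty.

    All tolerances and the penalty value are hypothetical placeholders.
    """
    position_reached = abs(position - target_position) <= position_tol
    vibration_reached = abs(vibration - target_vibration) <= vibration_tol
    return 0.0 if (position_reached and vibration_reached) else penalty


if __name__ == "__main__":
    print(sparse_reward(0.1000, 0.0002, target_position=0.1))  # both goals met
    print(sparse_reward(0.1000, 0.0500, target_position=0.1))  # still vibrating
    print(sparse_reward(0.0800, 0.0000, target_position=0.1))  # off target
```

Because both conditions are combined with a logical AND, the agent cannot trade positioning accuracy against vibration suppression, so both goals enter the reward with equal weight.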
For the RL agent, a vibration-compensated motion involves conflicting goals: dynamically moving a machine axis induces vibrations, while compensating vibrations affects the desired motion. This insight is used in the modelling of the reward function. We want the agent to fulfill both mutually influencing goals (reaching the target position and the desired vibration). Therefore, we design the reward function as a sparse reward that treats positioning accuracy and vibration suppression equally. To learn from sparse rewards, effective exploration is crucial for finding a set of successful trajectories. To guarantee a sufficient amount of exploration, we use the entropy coefficient as a regularizer. In a policy optimization setting, a policy has maximum entropy when all actions are equally likely and minimum entropy when the probability of one action dominates. The entropy coefficient is multiplied by the maximum possible entropy and added to the loss. This prevents premature convergence to a policy in which one action probability dominates and exploration stops. Further, to ensure a high generalization performance of the agent, the target position is randomized during the training process.
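The role of the entropy term can be illustrated with a small sketch for a discrete action distribution. A categorical policy is used here only because its entropy is easy to inspect, whereas the velocity command of the agent is continuous. The function names, the coefficient value of 0.01 and the sign convention (subtracting the scaled entropy from a loss that is minimized) follow common PPO implementations and are assumptions for this illustration.

```python
import math


def categorical_entropy(probabilities):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


def loss_with_entropy_bonus(policy_loss, entropy, ent_coef=0.01):
    """Loss to be minimized: higher policy entropy lowers the loss."""
    return policy_loss - ent_coef * entropy


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]   # all actions equally likely
    peaked = [0.97, 0.01, 0.01, 0.01]    # one action dominates
    print(categorical_entropy(uniform))  # maximum entropy, ln(4)
    print(categorical_entropy(peaked))   # close to the minimum of 0
    print(loss_with_entropy_bonus(1.0, categorical_entropy(uniform)))
    print(loss_with_entropy_bonus(1.0, categorical_entropy(peaked)))
```

With an identical policy loss, the uniform distribution yields the lower total loss, so the optimizer is discouraged from collapsing onto a single action before successful trajectories have been found.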
The experiments were carried out using the linear direct drive depicted in Figure 4, coupled to a TwinCAT control unit. The RL agent is deployed on an Ubuntu Xenial computer with an ADS (Automation Device Specification) connection to the control unit. Further, we use the Stable Baselines [9] implementations of RL algorithms. For hyperparameter tuning we apply a Bayesian optimization approach provided by the framework Optuna. To measure the mechanical vibrations we utilize a vision system based on OpenCV.
Our experiment evaluates the cost function proposed in Section IV-A. In this experiment, we want the linear feed drive to reach various randomly sampled target positions and suppress vibrations along the way. Consequently, the goal state combines the sampled target position with a desired vibration of zero, and each episode runs for a fixed number of time steps. Figure (3) illustrates the results. The agent solves the vibration compensation problem after 850,000 time steps, corresponding to 12 hours of training on the real machine tool axis (cf. Figure (4)). Figure (3, a) shows the episode reward converging asymptotically towards zero after 850,000 time steps. Consequently, the occurring vibrations (Figure 3, b) also converge towards zero. Figure (3, d) illustrates the entropy loss, which stabilizes as the learning rate decays and attenuates once the agent's reward converges.
In this work, a reinforcement learning based approach to compensate mechanical vibrations, applied to an industrial machine tool axis, is presented. We propose a problem formulation that describes vibration compensation as a vibration cost optimization problem. We evaluate different state-of-the-art Reinforcement Learning algorithms to solve the vibration compensation problem. We train the agent directly on a real machine tool axis, without the use of a simulation environment. To validate our method, we perform experiments on a real machine tool axis. The experiments show that the proposed approach is capable of generating vibration-compensated movements using a feed drive system with a priori unknown system dynamics.
Further research could be conducted on the following topics: deploying the agent with a discrete action space (move left; move right); investigating the generalization across varying machine tool hardware; and utilizing a better vibration measurement system (more accurate and more frequent observations).
[1] (2002) Adaptive control of flexible structures using a nonlinear vibration absorber. Nonlinear Dynamics 28 (3-4), pp. 309–322.
[2] (1998) Identification, uncertainty characterization and robust control synthesis applied to large flexible structures control. International Journal of Robust and Nonlinear Control: IFAC-Affiliated Journal 8 (2), pp. 97–112.
[3] (2012) Vibration control of a nonlinear quarter-car active suspension system by reinforcement learning. International Journal of Systems Science 43 (6), pp. 1177–1190.
[4] (2002) Optimal control method with time delay in control. Journal of Sound and Vibration 251 (3), pp. 383–394.
[5] (2002) Design of robust vibration controller for a smart panel using finite element model. Journal of Vibration and Acoustics 124 (2), pp. 265–276.
[6] (2009) Intelligent active vibration control. In 2009 International Conference on Industrial Mechatronics and Automation, pp. 76–80.
[7] (1991) Regulation of flexible structures via nonlinear coupling. Dynamics and Control 1 (4), pp. 405–428.
[8] (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
[9] (2018) Stable baselines. GitHub. https://github.com/hill-a/stable-baselines
[10] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
[11] (1987) Some problems associated with the control of distributed structures. Journal of Optimization Theory and Applications 54 (1), pp. 1–21.
[12] (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937.
[13] (2017) Optuna: a hyperparameter optimization framework. GitHub. https://github.com/pfnet/optuna
[14] (2014) Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth Annual Conference of the International Speech Communication Association.
[15] (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[16] (2014) Deterministic policy gradient algorithms. In ICML.
[17] (1990) Experimental verification of command shaping methods for controlling residual vibration in flexible robots. In 1990 American Control Conference, pp. 1738–1744.
[18] (1957) Posicast control of damped oscillatory systems. Proceedings of the IRE 45 (9), pp. 1249–1255.
[19] (1958) Feedback control systems.
[20] (1997) Passive energy dissipation systems in structural engineering.
[21] (2006) Werkzeugmaschinen 2: Konstruktion und Berechnung. Springer-Verlag.
[22] (2002) Optimal control for mechanical vibration systems based on second-order matrix equations. Mechanical Systems and Signal Processing 16 (1), pp. 61–67.