
Robust Model-based Reinforcement Learning for Autonomous Greenhouse Control

Owing to their high efficiency and reduced weather dependency, autonomous greenhouses are an ideal solution to the increasing demand for fresh food. However, managers face challenges in finding appropriate control strategies for crop growth, since the decision space of the greenhouse control problem is astronomically large. An intelligent closed-loop control framework that generates a control policy automatically is therefore highly desirable. As a powerful tool for optimal control, reinforcement learning (RL) algorithms can surpass human decision-making and can be seamlessly integrated into a closed-loop control framework. However, in complex real-world scenarios such as agricultural automation control, where interaction with the environment is time-consuming and expensive, the application of RL algorithms encounters two main challenges: sample efficiency and safety. Although model-based RL methods can greatly mitigate the efficiency problem of greenhouse control, the safety problem has received little attention. In this paper, we present a model-based robust RL framework for autonomous greenhouse control that addresses both the sample efficiency and the safety challenges. Specifically, our framework introduces an ensemble of environment models that works as a simulator and assists in policy optimization, thereby addressing the low sample efficiency problem. To address the safety concern, we propose a sample dropout module that focuses more on worst-case samples, which helps improve the adaptability of the greenhouse planting policy in extreme cases. Experimental results demonstrate that our approach learns a more effective greenhouse planting policy with better robustness than existing methods.



1 Introduction

The traditional agricultural production mode is highly dependent on the weather and has been unable to meet the increasing demand for fresh and healthy food driven by global population growth. Around the world, some countries have gradually come to regard the greenhouse industry as the main force in agricultural production. Modern high-tech greenhouses are equipped with standard sensors and actuators (such as heating, lighting, dosing, irrigation, etc.) to empower precision agriculture. To improve crop yield and quality, managers maintain a suitable environment for crop growth by overseeing the greenhouse climate and the crop growth state. In addition to increasing the expected harvest, the corresponding energy consumption is another key consideration, since natural resources are facing the challenge of exhaustion george2018management. Although the automatic greenhouse is an ideal solution to the food crisis, skilled and experienced managers capable of autonomous greenhouse control are scarce greenhousegrower2018. Furthermore, even a seasoned manager cannot monitor and manage many greenhouses simultaneously.

To provide a favorable climate for crop growth in a modern high-tech greenhouse, growers need to manually determine control strategies, such as lighting and irrigation, according to their planting experience. The strategy is then fed into the process computer to take effect in the greenhouse. Sensors continuously measure the climate and crop growth state and feed the measured data back to the grower for analysis and decision-making. The grower needs to balance production and resource consumption over a 3-5 month period hemming2020cherry, which implies a tremendous decision-making space. Because of this complexity, growers only give coarse-grained control strategies, which do not make full use of the rich greenhouse state information.

With breakthroughs in AI, new technologies have been introduced into agriculture to help improve production, such as automatic pruning of fruit trees you2020efficient and intelligent spraying systems kim2020intelligent. RL has shown the potential to outperform humans in decision making by enabling both deep decision search at the macro level and fine-grained control at the micro level, making it well suited for automated control scenarios. By formulating autonomous greenhouse control as a Markov decision process (MDP), several studies have applied RL algorithms to agricultural settings parameswaran2016arduino; wang2020deep.

However, RL still faces two major challenges in practical application. The first is sample efficiency. Although RL has achieved excellent results in some fields, it often relies on a huge number of training samples, which is impractical in the real world. Since there is no trial-and-error cost in a virtual simulation environment, algorithms are free to perform exploratory learning and thus learn strategies that circumvent wrong decisions. In the real world, however, extensive trial and error implies huge costs (e.g., machine damage, crop death), so learning decisions that circumvent errors from a limited number of training samples becomes a major challenge. The second challenge is robustness. If the environment is perturbed during the training phase, the algorithm's performance may degrade significantly. Furthermore, in the deployment phase, inconsistency between the deployment environment and the training environment can also affect the performance of the trained strategy.

In this paper, we investigate how RL can be better applied to autonomous greenhouse control. To address the sample efficiency challenge, we introduce a model ensemble approach. In this approach, samples from the real environment are not handed directly to the RL algorithm for policy optimization but are used to model the environment. The model is then used to simulate the environment, producing additional simulated samples that accelerate learning. To further enhance the safety of the automated planting policy, we add a sample dropout module to the RL algorithm. The module enables our algorithm to selectively discard a portion of the samples with excessive reward so that it focuses more on worst-case samples, which improves the adaptability of the planting policy in extreme cases and addresses the safety challenge of RL in real-world deployment.

2 Related Work

2.1 AI for Agriculture

Agriculture is an area of extreme importance, which faces several challenges from sowing to harvest bannerjee2018artificial. With the rapid development of AI in recent years huang2018artificial, more and more AI technologies are being applied in agricultural automation jha2019comprehensive.

During crop growing, crop disease is a matter of grave concern to a farmer. Significant expertise and experience are required to detect an ailing plant and take the necessary steps for recovery. Ghaffari et al. developed an approach combining an electronic nose with intelligent systems to detect diseases in tomato crops ghaffari2010early. Soil and irrigation management are also vital in agriculture: improper irrigation and soil management lead to crop loss and degraded quality. Manek and Singh compared several neural network architectures for predicting rainfall from four atmospheric inputs manek2016comparative. Dahikar and Rode proposed a neural model to predict the yields of different crops using atmospheric inputs and fertilizer consumption dahikar2014agricultural.

Most agricultural automation works like the ones mentioned above are challenging to integrate into a holistic system. A macro-level centralized planning system for controlling entire farms remains under study. RL algorithms are often used to learn control policies in complex environments, with the potential to outperform humans schulman2015high, as long as they have comprehensive observation information about the entire environment. An automated irrigation system based on RL techniques has been developed to control a greenhouse remotely and minimize human involvement parameswaran2016arduino. RL algorithms have also been used to optimize autonomous greenhouse climate control with respect to resource consumption wang2020deep.

Some works also combine agriculture with robotics, where mobile robots are powered by AI algorithms to carry out their tasks. In our opinion, this direction of research is promising: autonomous robots could significantly improve the management of large-scale crop farms as human labor becomes increasingly expensive and conventional machines are not intelligent enough. Examples include 2002Online and zhou2014vision, which discuss online learning for robots and visual navigation of mobile robots.

2.2 Challenges of RL in Agricultural Applications

In recent years, RL has achieved remarkable results in a wide range of areas, including simulated control problems levine2016end and surpassing human performance in Go and video games mnih2015human; silver2016mastering. However, applying reinforcement learning to agricultural applications requires addressing the two major challenges of sample efficiency and robustness.

Model-based methods have shown excellent abilities to reduce sample complexity deisenroth2013survey. Previous works levine2016end; Chua2018DeepModels; janner2019trust have empirically shown significant sample efficiency improvements. However, model accuracy is a major barrier for policy quality, and it is challenging to build an accurate model in high-dimensional tasks abbeel2006using. A policy learned on inaccurate models therefore typically suffers performance degradation due to cumulative model error sutton1996model; asadi2019combating. In other words, while sample efficiency improves, the control policy is affected by the discrepancy between the simulator and the real environment. To address this problem, previous works (e.g., PETS Chua2018DeepModels, ME-TRPO kurutach2018model, SLBO luo2018algorithmic, MB-MPO clavera2018model) use ensembles of bootstrapped probabilistic transition models to incorporate two kinds of uncertainty into the transition model. Concretely, individual probabilistic models capture aleatoric uncertainty, i.e., the noise due to inherent stochasticity. The bootstrapping ensemble procedure captures epistemic uncertainty, i.e., uncertainty in the model parameters arising from insufficient training data. Empirical works levine2016end; Chua2018DeepModels; janner2019trust have demonstrated that an ensemble of probabilistic models is an efficient way to handle both kinds of uncertainty, allowing for a competitive model-based learning algorithm.

Figure 1: A modular framework for greenhouse automation. It mainly consists of a greenhouse environment, a crop growth simulator, and an RL algorithm with sample dropout. The core part is the central policy module, which receives observations and decides the next action. The RL algorithm uses samples to optimize policies continuously. These three parts alternate in a cycle that eventually leads to an automated planting policy.

Moreover, the robustness issue of RL is one of the challenges hindering its application to complex tasks. Although current state-of-the-art model-based methods have achieved outstanding performance, the derived policies are often adequate only for the environment in which they are trained; when deployed to perturbed real environments, policy performance tends to degrade dramatically, and the policy may even behave dangerously. Robust control zhou1998essentials is a branch of control theory focused on finding the optimal policy under worst-case situations, and policies learned through robust control generalize better. Robust adversarial RL methods, such as RARL pinto2017robust and NR-MDP tessler2019action, learn a robust decision policy by adversarially adding minimax objectives, while the EPOpt method introduces the conditional value at risk (CVaR) Tamar2015OptimizingSampling; rajeswaran2016epopt and obtains a robust policy with better performance by optimizing the CVaR objective. We follow these ideas and design a sample dropout module that improves safety in tomato greenhouse automation.

3 Notations and Preliminaries

We consider a Markov decision process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, r, \gamma, P)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $r(s, a)$ is the reward function, $\gamma$ is the discount factor, and $P(s' \mid s, a)$ is the conditional probability distribution of the next state $s'$ given the current state $s$ and action $a$. When the environment is deterministic, we write the state transition function as $s' = P(s, a)$.

Let $V^{\pi, P}(s)$ denote the expected return, i.e., the expected sum of discounted rewards, starting from initial state $s$ and following policy $\pi$ and state transition function $P$:

$$V^{\pi, P}(s) = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t)}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, s_0 = s\right]$$

For notational simplicity, let $\eta(\pi, P)$ denote the expected return over random initial states:

$$\eta(\pi, P) = \mathbb{E}_{s_0}\left[V^{\pi, P}(s_0)\right]$$

The goal of RL is to maximize the expected return by finding the optimal decision policy

$$\pi^{*} = \arg\max_{\pi} \eta(\pi, P)$$

In model-based RL, an approximate transition model is learned by interacting with the environment; the policy is then optimized with a model-free method, using samples from the environment together with data generated by the model. We use the parametric notation $\hat{P}_{\phi}$, $\phi \in \Phi$, to denote the model trained by a neural network, where $\Phi$ is the parameter space of models.
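As a small illustration of the return defined in this section, the sketch below computes the discounted sum of rewards for one finite trajectory and averages it over several trajectories as a Monte Carlo stand-in for the expectation over initial states. The function names and the truncation to a finite horizon are assumptions made for illustration only.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma**t * rewards[t] for one finite trajectory."""
    total = 0.0
    # Accumulate backwards so each step needs one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def expected_return(trajectories: Sequence[Sequence[float]], gamma: float) -> float:
    """Monte Carlo estimate: average discounted return over sampled rollouts."""
    returns = [discounted_return(rewards, gamma) for rewards in trajectories]
    return sum(returns) / len(returns)
```

With `gamma = 0.5`, the reward sequence `[1, 1, 1]` gives 1 + 0.5 + 0.25 = 1.75.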

4 Proposed Approach

4.1 System overview

We propose the framework shown in Fig. 1, which mainly consists of three components that can be executed asynchronously:

  1. The collection of real samples through a real agricultural greenhouse environment and a large number of IoT hardware devices.

  2. The generation of simulated samples through rapid simulation in a tomato growth simulator.

  3. Based on the observation samples, the RL algorithm is used to continually train and optimize the policy, which makes decisions based on current information and performs the next action to obtain new observations, leading to the next cycle.

Figure 2: Monitoring sensors and control equipment in our greenhouse. Panels: temperature sensor, CO2 sensor, PAR sensor, ventilation controller, CO2 producer, fertigation controller.

Specifically, we use multiple sensors to obtain observations, such as temperature, humidity, and CO2 concentration. The main sensors work as follows:

  • Temperature, humidity and CO2 sensor: multiple measuring boxes are hung in the greenhouse and their readings are averaged (sometimes with different weighting factors for different locations) to obtain the temperature, humidity, and CO2 values that serve as inputs for the control of heating, humidity, ventilation, and CO2 supply.

  • PAR sensor: the greenhouse tracks the light supply by measuring Photosynthetically Active Radiation (PAR). In the simulated greenhouse, a PAR sensor is modeled to describe the PAR level just above the crop.
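The averaging of measuring-box readings described above can be sketched as follows. This is a minimal illustration, assuming one reading per box and optional per-location weights; the function name and the weight values in the test are hypothetical.

```python
from typing import Optional, Sequence


def aggregate_readings(
    readings: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Combine readings from several measuring boxes into one value.

    Without weights this is a plain mean; with weights each location
    contributes in proportion to its weighting factor.
    """
    if weights is None:
        weights = [1.0] * len(readings)
    if len(weights) != len(readings):
        raise ValueError("need exactly one weight per reading")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * r for w, r in zip(weights, readings)) / total_weight
```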

The observation samples are then transmitted to the decision-making policy module, which can generate corresponding actions. These actions are executed in the greenhouse through multiple actuators. Here are the details of how the main actuators work:

  • Ventilation controller: when the greenhouse temperature exceeds the ventilation setpoint, the vents will be opened to a certain extent. The vents’ opening angle is expressed as a percentage, and it is determined by the deviation between the current greenhouse temperature and the temperature-setpoint.

  • CO2 producer: when the CO2 concentration drops below the setpoint, the CO2 producer will supply CO2 through a piping network.

  • Fertigation controller: The crops will be automatically irrigated by drip irrigation according to the related setpoints, including irrigation time and watering quantity.
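The setpoint logic of the ventilation controller and the CO2 producer can be sketched as below. The text only states that the vent opening depends on the deviation from the setpoint; the proportional law and the gain `gain_pct_per_c` used here are assumptions for illustration, as are the function names.

```python
def vent_opening_percent(
    air_temp_c: float,
    vent_setpoint_c: float,
    gain_pct_per_c: float = 20.0,  # hypothetical gain: percent opening per degree
) -> float:
    """Vent opening in percent, proportional to the excess over the setpoint."""
    deviation = air_temp_c - vent_setpoint_c
    if deviation <= 0:
        return 0.0  # at or below the setpoint the vents stay closed
    return min(100.0, gain_pct_per_c * deviation)


def co2_dosing_on(co2_ppm: float, co2_setpoint_ppm: float) -> bool:
    """Supply CO2 only while the concentration is below the setpoint."""
    return co2_ppm < co2_setpoint_ppm
```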

All the observation and action samples are stored in the data cache to optimize both the simulator and policy.

4.2 Algorithm

Although RL is a powerful tool for policy learning, it faces a significant challenge in training efficiency, often relying on a large number of interactions with the environment. Based on the data collected, we developed a tomato growth simulator close to the real environment. The simulator allows us to rapidly perform action interactions and generate simulation samples that simulate a tomato crop’s entire growth cycle within seconds. These samples are then provided to RL algorithms for policy optimization.

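Function Dropout($\mathcal{B}$, $\alpha$):
       Calculate $r_\alpha$: the $\alpha$-percentile of the rewards in batch $\mathcal{B}$
       for each sample $x \in \mathcal{B}$ do
             if $r(x) \le r_\alpha$ then
                   Fill $x$ into $\mathcal{B}_\alpha$
             end if
       end for
       return $\mathcal{B}_\alpha$
Function Main:
       Initialize hyperparameters, policy $\pi_\theta$, environment replay buffer $\mathcal{D}_{env}$, simulator replay buffer $\mathcal{D}_{sim}$
       for $N$ iterations do
             Take an action in the greenhouse environment using policy $\pi_\theta$; add samples to $\mathcal{D}_{env}$
             Mask $\mathcal{D}_{env}$ into subsets $\mathcal{D}_1, \dots, \mathcal{D}_K$
             for $M$ iterations do
                   Load the pre-trained model and train on $\mathcal{D}_1, \dots, \mathcal{D}_K$
                   Get the trained model ensemble collection $\{\hat{P}_{\phi_1}, \dots, \hat{P}_{\phi_K}\}$ with clip restriction
                   for $t = 1, \dots, T$ do
                         Select a model $\hat{P}_{\phi_k}$ from the collection with probability $p(k)$
                         Perform rollouts on model $\hat{P}_{\phi_k}$ with policy $\pi_\theta$ and get samples
                         Fill these samples into batch $\mathcal{B}$
                   end for
                   $\mathcal{B}_\alpha$ = Dropout($\mathcal{B}$, $\alpha$); Fill the data of $\mathcal{B}_\alpha$ into $\mathcal{D}_{sim}$
             end for
             $\mathcal{D}_{env}$ = Dropout($\mathcal{D}_{env}$, $\alpha$)
             Optimize policy $\pi_\theta$ on $\mathcal{D}_{env}$ and $\mathcal{D}_{sim}$: $\theta \leftarrow \theta + \lambda \nabla_\theta \eta_\alpha(\pi_\theta)$
       end for
Algorithm 1 Robust Model-based RL for Autonomous Greenhouse Control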
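The Dropout function of Algorithm 1 can be sketched in Python as below. This is a minimal illustration, assuming the dropout parameter is the fraction of the reward distribution that is kept (so a value of 1.0 keeps every sample) and that percentiles are computed with NumPy's default linear interpolation; the function name is hypothetical.

```python
import numpy as np


def sample_dropout(rewards: np.ndarray, alpha: float) -> np.ndarray:
    """Return indices of samples whose reward is at or below the alpha-quantile.

    Samples with rewards above the threshold are discarded, so the retained
    batch is biased towards pessimistic (worst-case) outcomes.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    threshold = np.quantile(rewards, alpha)
    return np.flatnonzero(rewards <= threshold)
```

For rewards `[1, 5, 2, 9, 3]` and `alpha = 0.5`, the threshold is the median 3, so the samples with rewards 1, 2, and 3 are kept and those with rewards 5 and 9 are dropped.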

Specifically, we first pre-train a model on the cumulative greenhouse crop data to express the relationship between crop growth and the observed greenhouse parameters. The model parameters are initialized based on expert knowledge, which effectively reduces sample consumption in the early stage of model learning. In addition, we apply clip restrictions to the simulated states, based on prior agricultural knowledge, to avoid unreasonable anomalies (e.g., excessive temperature or CO2 concentration) caused by model generalization. This lets the simulator produce more realistic growth dynamics and improves the robustness of the learned planting policy, and thus its adaptability to a variety of complex scenarios. Following the idea of Dyna-style RL, we continually transmit the collected real samples to the simulator to further optimize the tomato growth model during the policy optimization cycle, which enables the simulator to provide simulation samples more efficiently and accurately. We also use a model ensemble in order to capture the uncertainty in the real greenhouse environment.
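The clip restriction on simulated states can be sketched as below. The bounds are copied from the Min and Max columns of Table 1; treating those ranges as the clip limits, as well as the variable names, is an assumption made for illustration.

```python
import numpy as np

# (min, max) ranges from Table 1; using them as clip limits is an assumption.
STATE_BOUNDS = {
    "air_temperature": (-30.0, 100.0),
    "air_humidity": (0.0, 100.0),
    "air_co2": (400.0, 1000.0),
}


def clip_state(state: dict) -> dict:
    """Clamp each simulated variable to its admissible range."""
    clipped = {}
    for name, value in state.items():
        low, high = STATE_BOUNDS[name]
        # np.clip works for scalars and for arrays of hourly values alike.
        clipped[name] = np.clip(value, low, high)
    return clipped
```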

Figure 3: We pretrain a tomato growth model using real data, and the samples collected during the training process will be used for further optimization of the model.

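As shown in Fig. 3, we first build the binary masks $m_i^k$, $k = 1, \dots, K$, drawn from the Bernoulli distribution with parameter $q$, and we apply the bootstrap mask to each sample $d_i \in \mathcal{D}_{env}$. We can then generate $K$ subsets, i.e.,

$$\mathcal{D}_k = \{\, d_i \in \mathcal{D}_{env} \mid m_i^k = 1 \,\}, \quad k = 1, \dots, K,$$

where $m_i^k = 1$ indicates that $d_i$ is retained in the set $\mathcal{D}_k$, and $m_i^k = 0$ indicates that it is not. Then, starting from the pre-trained simulator, we use the new data from the subsets to fine-tune the model. We learn a collection of fine-tuned simulator models $\{\hat{P}_{\phi_1}, \dots, \hat{P}_{\phi_K}\}$. We use the parametric notation $\hat{P}_{\phi}$, $\phi \in \Phi$, to denote the model trained by a neural network, where $\Phi$ is the parameter space of models. Each member of the collection is a probabilistic neural network whose outputs parametrize a Gaussian distribution:

$$\hat{P}_{\phi_k}(s_{t+1} \mid s_t, a_t) = \mathcal{N}\big(\mu_{\phi_k}(s_t, a_t),\, \Sigma_{\phi_k}(s_t, a_t)\big)$$

The corresponding loss of the simulator models is

$$\mathcal{L}(\phi_k) = \sum_{(s_t, a_t, s_{t+1}) \in \mathcal{D}_k} \big[\mu_{\phi_k}(s_t, a_t) - s_{t+1}\big]^{\top} \Sigma_{\phi_k}^{-1}(s_t, a_t) \big[\mu_{\phi_k}(s_t, a_t) - s_{t+1}\big] + \log\det \Sigma_{\phi_k}(s_t, a_t)$$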
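The bootstrap masking and the Gaussian training loss described above can be sketched with NumPy as follows. This is a minimal illustration, assuming a diagonal covariance and the usual negative log-likelihood form up to constants; the function names and the keep probability used in the test are hypothetical.

```python
import numpy as np


def bootstrap_masks(
    num_samples: int,
    num_models: int,
    keep_prob: float,
    rng: np.random.Generator,
) -> np.ndarray:
    """Boolean array of shape (num_models, num_samples).

    Entry [k, i] is True when sample i is retained in the subset used to
    fine-tune ensemble member k (independent Bernoulli(keep_prob) draws).
    """
    return rng.random((num_models, num_samples)) < keep_prob


def gaussian_nll(mean: np.ndarray, var: np.ndarray, target: np.ndarray) -> float:
    """Diagonal-Gaussian loss: variance-scaled squared error plus log-variance."""
    scaled_sq_err = (mean - target) ** 2 / var
    return float(np.sum(scaled_sq_err + np.log(var)))
```

Each row of the mask selects the training subset of one ensemble member, for example `data[masks[k]]` for member `k`.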


In our method, the simulator $\hat{P}_{\phi}$ is learned to replace the real environment $P$. The policy is then optimized using samples from both the environment and the simulator. Optimizing the expected return in the usual way, as standard RL methods do, yields a policy that performs best in expectation over the simulator. However, the best expectation does not mean that the resulting policy performs well on each individual model, since performance can vary widely across models. This instability typically leads to risky decisions when the policy faces poorly informed states at deployment.

Inspired by previous works Tamar2015OptimizingSampling; rajeswaran2016epopt, which optimize the conditional value at risk (CVaR) to explicitly seek a robust policy, we add a sample dropout module to the RL algorithm. The module selectively discards a portion of the samples with excessive reward so that training focuses more on worst-case samples. This improves the adaptability of the planting policy in extreme situations and further enhances the safety of the automated planting policy in real-world deployment.

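More specifically, to generate a prediction from our model collection, we first select a model $\hat{P}_{\phi_k}$ with probability

$$p(k), \quad k \in \{1, \dots, K\},$$

at each time step $t$, where $p$ is a probability distribution over the ensemble members (by default, models are drawn at random). We then let the selected model interact with the policy to perform a simulated rollout, i.e., $a_t \sim \pi_\theta(\cdot \mid s_t)$ and $s_{t+1} \sim \hat{P}_{\phi_k}(\cdot \mid s_t, a_t)$. In this way, we can quickly generate a large number of crop growth simulation samples, and our simulator can be improved as we receive real samples from the greenhouse during the RL cycle. We then fill these rollout samples into a batch and retain an $\alpha$-percentile dropout subset with more pessimistic rewards. We use $\mathcal{B}_\alpha$ to denote the $\alpha$-percentile dropout rollout batch:

$$\mathcal{B}_\alpha = \{\, x \mid x \in \mathcal{B},\; r(x) \le r_\alpha(\mathcal{B}) \,\},$$

where $\mathcal{B}$ is the sample batch collected by performing policy $\pi_\theta$ on the model ensemble and $r_\alpha(\mathcal{B})$ is the $\alpha$-percentile of the reward values in batch $\mathcal{B}$. The expected return of the dropout batch rollouts is denoted by $\eta_\alpha(\pi_\theta)$:

$$\eta_\alpha(\pi_\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \,\middle|\, (s_t, a_t, s_{t+1}) \in \mathcal{B}_\alpha\right]$$

Then, we can perform the policy gradient update by

$$\theta \leftarrow \theta + \lambda \nabla_\theta \eta_\alpha(\pi_\theta),$$

where $\lambda$ is the learning rate. The overall pseudocode is shown in Algorithm 1.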
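The per-step model selection during simulated rollouts can be sketched as follows. This is a toy illustration with scalar states, assuming the default selection distribution is uniform over the ensemble; the function name, the callable signatures, and the sample layout are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, float, float, float]  # (state, action, reward, next_state)


def ensemble_rollout(
    models: Sequence[Callable[[float, float], float]],
    policy: Callable[[float], float],
    reward_fn: Callable[[float, float], float],
    start_state: float,
    horizon: int,
    rng: random.Random,
) -> List[Sample]:
    """Roll out the policy, drawing a fresh ensemble member at every step."""
    samples: List[Sample] = []
    state = start_state
    for _ in range(horizon):
        action = policy(state)
        model = rng.choice(models)  # uniform pick; assumed default distribution
        next_state = model(state, action)
        samples.append((state, action, reward_fn(state, action), next_state))
        state = next_state
    return samples
```

The returned batch is what a percentile-based dropout step would then filter before the policy update.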

5 Experiments

With the sensors and actuators deployed in our tomato greenhouse, we collect 22 kinds of observation variables, constituting a 275-dimensional observation space $\mathcal{S}$, and 6 kinds of control variables, constituting a 52-dimensional action space $\mathcal{A}$. We use the net profit (USD/m²) as the target reward for training. The net profit is the gains minus the costs, where the gains are obtained from yield and price, and the costs include resource consumption (electricity, heat, CO2, and water) and crop maintenance costs. The details of the observation space and action space are shown in Table 1.
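The reward computation can be sketched as below, assuming the reward is income from yield and price minus resource and maintenance costs, all expressed per square meter. The function name, argument names, and the numbers in the test are hypothetical and are not taken from the experiments.

```python
from typing import Mapping


def net_profit(
    yield_kg_per_m2: float,
    price_usd_per_kg: float,
    resource_costs_usd_per_m2: Mapping[str, float],
    maintenance_usd_per_m2: float,
) -> float:
    """Net profit in USD per square meter: gains minus all costs."""
    gains = yield_kg_per_m2 * price_usd_per_kg
    costs = sum(resource_costs_usd_per_m2.values()) + maintenance_usd_per_m2
    return gains - costs
```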

Observation space
Name Min Max Dim
temperature setpoint 13 32 24
CO2 setpoint 400 1000 24
light-on time 0 24 1
light-off time 0 24 1
irrigation start time 0 24 1
irrigation stop time 0 24 1
outside solar radiation 0 2000 24
outside temperature -30 50 24
outside humidity 0 100 24
wind speed 0 25 24
virtual sky temperature -20 20 24
greenhouse air temperature -30 100 24
greenhouse air humidity 0 100 24
greenhouse air CO2 concentration 400 1000 24
light intensity just above crop 0 2000 24
cumulative amount of irrigation per day 0 10 1
cumulative amount of drain per day 0 10 1
leaf area index 0 10 1
current number of growing fruits 0 1000 1
cumulative harvest in terms of fruit fresh weight 0 100 1
cumulative harvest in terms of fruit dry weight 0 100 1
planting days 0 365 1

Action space
Name Min Max Dim
temperature setpoint 13 32 24
CO2 setpoint 400 1000 24
light-on time 0 24 1
light-off time 0 24 1
irrigation start time 0 24 1
irrigation stop time 0 24 1
Table 1: Observation space (above) and action space (below)
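As a quick consistency check, the per-variable dimensions in Table 1 can be summed to recover the stated sizes of the two spaces. The dictionary keys below are shortened versions of the table names and are otherwise arbitrary; the observation space is taken to contain the six setpoint variables plus the sixteen measured variables, as listed in the table.

```python
# Dimensions copied from the Dim column of Table 1.
ACTION_DIMS = {
    "temperature_setpoint": 24,
    "co2_setpoint": 24,
    "light_on_time": 1,
    "light_off_time": 1,
    "irrigation_start_time": 1,
    "irrigation_stop_time": 1,
}

MEASURED_DIMS = {
    "outside_solar_radiation": 24,
    "outside_temperature": 24,
    "outside_humidity": 24,
    "wind_speed": 24,
    "virtual_sky_temperature": 24,
    "air_temperature": 24,
    "air_humidity": 24,
    "air_co2": 24,
    "light_intensity_above_crop": 24,
    "daily_irrigation": 1,
    "daily_drain": 1,
    "leaf_area_index": 1,
    "growing_fruits": 1,
    "harvest_fresh_weight": 1,
    "harvest_dry_weight": 1,
    "planting_days": 1,
}

# Observations include the current setpoints as well as the measurements.
OBSERVATION_DIMS = {**ACTION_DIMS, **MEASURED_DIMS}


def total_dim(spec: dict) -> int:
    """Total vector length of a space described by per-variable dimensions."""
    return sum(spec.values())
```

The totals come out to 52 action dimensions from 6 variables and 275 observation dimensions from 22 variables, matching the figures given in the text.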

5.1 Analysis of Performance

Figure 4: (1) The left plot shows the training curves for our method with dropout, our method without dropout, and the SAC baseline, each trained with 5 seeds. Solid curves indicate the mean over all trials with different seeds, and shaded regions correspond to the standard deviation among trials. (2) The right plot shows the evaluation curves of the policies trained by all the algorithms. The purple dashed line is the reference profit for skilled labor.

We train two versions of our method on the greenhouse simulator, one with sample dropout (the choice of the dropout parameter is analyzed in Section 5.3) and one without sample dropout. Additionally, we adopt the soft actor-critic (SAC) algorithm, a widely used model-free RL algorithm, as a baseline for comparison. These algorithms are trained with the net profit over a 120-day growing period as the optimization target.

As shown on the left of Fig. 4, the algorithm with dropout converges better and has less variance than the one without dropout. This is mainly because the sample dropout module discards a portion of the samples with excessive feedback values, which helps avoid local optima. Meanwhile, the agent pays more attention to the worst-case states, so its variance is smaller. The SAC algorithm performs worse than our algorithm because of the low sample efficiency of the model-free method, which makes it difficult to learn enough information from limited samples.

Further, we evaluate the planting policies learned by the different algorithms on the tomato simulator, as shown on the right of Fig. 4. We observe that: (1) the policies perform similarly in the early stage, before there is any harvest; (2) once harvesting begins, our algorithm outperforms both the SAC algorithm and skilled labor.

5.2 Analysis of Robustness and Safety

Figure 5: Robustness depicted as heat maps over various environment settings. The algorithms without and with dropout are evaluated separately. Each pixel in the heat map represents the average reward in one specific experiment. The closer the color is to red (hotter), the higher the reward and the better the algorithm performs in that environment, and vice versa. Each experiment stops after 500,000 steps.

To verify the robustness improvement brought by the sample dropout module, we design a set of anti-disturbance experiments in which the temperature (°C) and the CO2 concentration (ppm) are each perturbed over an interval around the standard setting. We test the algorithm with and without dropout separately (shown in Fig. 5). In the heat map, each pixel represents the expected reward of the algorithm after training for the same number of steps in one perturbed environment, and a color closer to red (hotter) means a higher reward. The algorithm without dropout achieves its normal expected reward only in a small area close to the standard setting. In contrast, the algorithm with dropout maintains a higher expected reward in more strongly disturbed areas, demonstrating that the sample dropout module improves the robustness of the algorithm.

To further analyze the benefits of dropout, we treat outside solar radiation (Iglob), greenhouse air temperature (AirT), and greenhouse air humidity (AirRH), which are critical factors for crop growth, as anomalous parameters, and we test our algorithms in the simulator under these anomalous settings. We use the fresh weight and the retention rate of the crops as indicators of algorithm performance. The settings of the anomalous parameters and the experimental results are shown in Table 2.

Parameter Fresh weight (without dropout, with dropout) Retention rate (without dropout, with dropout)
38.51 45.24 80.23% 85.35%
30.31 38.49 63.14% 72.62%
32.69 39.30 68.10% 74.16%
36.76 43.71 76.59% 82.47%
Table 2: Exception Test

Based on the results in Table 2, we find that the algorithm with dropout achieves a higher fresh weight and retention rate under anomalous conditions. It is therefore safer and more promising for real-world applications.

5.3 Analysis of Hyperparameter

In this section, we investigate the sensitivity of our algorithm to the dropout hyperparameter $\alpha$ (details in Equation 9). We vary the parameter from 1.0 to 0.6; since a value of 1.0 corresponds to no dropout, the fraction of discarded samples ranges from 0 to 0.4. Values below 0.6 are not investigated further, because discarding that many samples leaves the algorithm too little information to learn from. This conclusion is confirmed in the following experiments.

First, we test the algorithm with different values of $\alpha$ (with 1.0 meaning no dropout) in the standard environment over multiple sets of experiments. The results are shown on the left of Fig. 6. We observe that the net profits are close for the larger values of $\alpha$, implying that the algorithm performs similarly under these parameters, whereas performance decreases significantly once $\alpha$ becomes too small.

Next, we test the algorithm under different values of $\alpha$ in several disturbed environments and take the mean of the results as the final result. A larger mean value indicates better robustness. Specifically, we set up four disturbed environments by perturbing the temperature and the CO2 concentration. The results are shown on the right of Fig. 6. We observe that the optimal parameter value is 0.8. Moreover, robustness shows a similarly significant decrease at smaller values of $\alpha$.

Figure 6: The effect of adjusting the parameter $\alpha$. The left box plot shows the net profits that the algorithm achieves with different parameter values in the standard environment. The right box plot shows the average net profits that the algorithm achieves with different parameter values in the four disturbed environments (perturbed temperature and CO2 concentration). Each experiment stops after 500,000 steps. The x-axis represents the dropout parameter $\alpha$, and the y-axis represents the corresponding net profits.

6 Conclusions and Future Work

In this work, we propose a robust model-based RL framework to alleviate the sample inefficiency and safety concerns in greenhouse automation. Specifically, our framework uses sensors and actuators deployed in the greenhouse to collect observation and action samples, learns an ensemble of environment models from them, and optimizes the policy in a Dyna-style manner. Experimental results demonstrate the effectiveness and superiority of our framework in terms of robustness and efficiency, contributing to better crop growth with greater safety.

Our future work will incorporate more prior knowledge of agriculture to improve the simulator. We will also deploy the planting policies trained by our algorithm in real greenhouses to further evaluate the framework through long-term real-world experiments. In addition, we plan to use offline RL methods to improve sample utilization and reduce training costs, and meta-RL methods to transfer what is learned across different crop species and improve the algorithm's generalization performance.