1 Introduction
Nearly 800 million people – oneninth of the world’s population – go to bed hungry every night and one person in three suffers from some form of malnutrition (wfp). In the coming decades, this problem is likely to be exacerbated by growing populations, changing climate, and environmental stressors (foodsecurity). Eradicating hunger and malnutrition is thus one of 17 Global Goals for Sustainable Development adopted by the United Nations in 2015 (UNgoals).
Means for improving food security include expanding arable land, increasing cropping intensity, and improving crop yield. This paper focuses on the latter. Specifically, we address the problem of optimally identifying seed varieties to increase crop yield within a risksensitive decision making framework. We performed this research as part of the Syngenta Crop Challenge, which focuses on developing and applying innovations in analytics to address the problem of worsening worldwide hunger (ideaconnection). The goal of the 2016 Syngenta Crop Challenge was to use data on soil properties, seed varieties, and weather patterns to develop a model to determine the soybean varieties that farmers should plant in the next year to reliably reduce risk and increase yields (syngentanews). Approximately 325 million metric tons of soybeans are grown worldwide each year, with significant amounts of production in North America, South America, and Asia (Figure 1).
The goals of our analysis are twofold: 1) to provide a general method for seed variety selection for seeds of a single crop (e.g., corn, soybeans) – and thus the development of planting plans – that takes into account soil and regional information as well as uncertain factors such as weather; 2) to develop a model to support farmers in local decision making about which soybean varieties to plant in the next year (the Syngenta Crop Challenge).
To understand the underlying mechanism of seed variety yield, we construct a twolayer hierarchical model that separates the task of variety yield prediction into two parts: the prediction of check yield (which we define as the average yield of the crop in a site) and the prediction of variety ratio (which we define as variety yield divided by check yield).
We use machine learning techniques to build and adapt our models. We predict the yield of each seed variety based on available data. To minimize the potential influence of data imbalance (where some seed varieties have many samples of yield but others have few samples), we develop a data augmentation method. The machine learning method combined with the data augmentation method allows us to identify seed varieties that will maximize expected yield, based on knowledge of soil and region attributes.
We then use an empirical Monte Carlo sampling method to predict yield under weather uncertainty. We also incorporate uncertainty due to other contributing factors, such as seed quality, farming skills, pests, and diseases, which we capture using the residual errors of our variety prediction model.
Finally, we develop three decision making models for selecting varieties. The models aim to optimize a combination of expected yield and risk. Risk is defined as variation of actual crop yield from expected yield (which is influenced by weather and other sources of uncertainty such as seed quality, farming skills, pests, and diseases). We consider: 1) a utility function model with a risk aversion parameter chosen by the decision maker; 2) a model that maximizes yield subject to a controlled level of risk determined by the decision maker; and 3) a robust optimization model in which the goal is to maximize the
quantile of yield.
2 Methodology
Our model consists of two parts: 1) a variety yield prediction scheme, built using machine learning techniques and 2) a decision making under uncertainty framework for selecting an optimal mix of varieties based on solving an optimization problem that relies on the yield prediction model.
2.1 Variety Yield Prediction Model
We first build a twolayer hierarchical model for variety yield prediction as shown in Figure 2. The bottom layer contains four attributes: weather, soil, region and variety. Notably, soil, region, and variety are fixed attributes, while weather is a random attribute. In the middle layer, to reflect the production level of a given site, we introduce the check yield as the average yield of all varieties of the crop in that site. The variety ratio , which is variety yield divided by check yield , reflects the relative expected yield of each variety in that site. However, variety yield is also influenced by other factors that include effects from all the unknown factors that might contribute to the variety yield but are not measured in the dataset, such as farmer’s expertise, quality of seeds, and pests and diseases.
In this hierarchical model, the variety yield can be represented as:
where
Here is the index for seed variety.
During the training phase, we train a single learner to predict the check yield and varietyspecific learners
for prediction of variety ratios. At test time, predictions are done in a bottomup fashion, obtaining a final prediction for variety yield. Weather attributes are regarded as unknown (and modeled as a random variable) when planning for the following year. Additional effects
not explicitly included in the dataset are also modeled as random variables.2.2 Decision Models for Variety Selection
In the yield prediction model, the random variables represent the variety yield of the th variety . Here is the total number of seed varieties. We define a decision variable which denotes the fraction of each seed variety that will be planted. We have the constraint that . To choose a variety mix that balances expected yield with risk of low yield, we formulate the decision making problem using three different models. We denote the expectation of the yield as , the covariance matrix of as , and the sets of constraints on as

Utility function model:
(1) We formulate a utility function with respect to the choice of
, based on the expected yield and the variance of yield. The goal is to maximize expected crop yield minus a weighted function of the yield variance. The parameter
() reflects risk aversion and is chosen by the decision maker. When , the decision maker is risk indifferent and expected yield is maximized. As increases above zero, risk aversion increases. 
Controlledrisk yield maximization model:
(2) Under this formulation, we maximize the expected crop yield subject to a maximum allowable risk level ().

Robust optimization model:
(3)
This is a robust optimization model in which we treat the yield as a random variable, , and maximize its quantile (). In other words, we look for an allocation such that the farmer can obtain yield
with probability
As approaches 1, expected yield is maximized. As decreases, risk sensitivity increases. When is near zero, the model maximizes the minimum yield that can be obtained.It is straightforward to convert the first two decision models into integer programs and solve them using a solver such as CPLEX (CPLEX)
. The third model is more complicated to solve. For this model we develop an efficient heuristic algorithm, given in the Appendix. The algorithm builds lists of variety combinations in a greedy fashion: First the variety is chosen that maximizes
subject to the quantile constraint. Then an additional variety is added to maximize the incremental improvement in subject to the constraint. The process continues until no more varieties can be added.3 Application to Syngenta Crop Challenge
In this section, we apply our models to the Syngenta dataset to identify the soybean varieties that should be planted in the next year. The Syngenta dataset that we used contains information on soybean varieties in sites over years (from 2008 to 2014). The dataset provides information on attributes, listed in Appendix Table B1. The dataset contains approximately entries.
3.1 Fixed and Random Attributes for the Yield Prediction Model
In Figure 2, the bottom layer includes both fixed attributes (soil, region, variety) and random attributes (weather). Information on each of the fixed attributes is provided in the Syngenta dataset:

Soil (16 attributes): sand, silt, clay, pH value, etc.

Region (6 attributes): longitude, latitude, probability of growing soybeans, etc.

Variety (1 attribute): the variety type index.
There are two classes of random attributes: weather and other factors (). Weather attributes are very important in the yield prediction process. We use an empirical Monte Carlo resampling procedure which utilizes historical weather data and can generate random weather attributes samples based on location, climate, and weather types. All of the other unknown factors that might affect the yield are incorporated into a single attribute (). This attribute could be interpreted as the noise of variety yield or could be interpreted as explaining the residual errors in our prediction models. We define this attribute as varietyspecific based on the observations that yield can vary even in the same site with the same variety and some varieties inherently have a larger yield variance than others.
Thus, the random attributes are:

Weather (6 attributes): precipitation, radiation and temperature in the year of experiments (random) and their historical medians (fixed).

Other factors (1 attribute): varietyspecific random variable.
We now discuss soybean variety categorization and yield prediction. We use of the above attributes – those relating to soil, region and weather (see Figure 2) – to categorize the soybean varieties and to predict check yield (for these tasks, we set variety as fixed, and we do not include the varietyspecific random variable that captures the nonweather random factors).
3.2 Data Selection, Soybean Variety Categorization, and Data Augmentation
The Syngenta dataset is heavily imbalanced: some varieties have more than 1000 experimental trials while others have far less. Since the prediction model uses 28 attributes, the varietyspecific yield ratio learner would be unreliable if trained with fewer than samples. Therefore we only consider the most frequent varieties in the dataset. The least commonly used variety in this subset has 30 samples in the training set (according to our crossvalidation split).
A large proportion of the 80 candidate varieties have very few samples compared to others. To alleviate this issue, we introduce a data augmentation scheme in order to make the learner more stable and to reduce the test error. The idea is to utilize “similar” varieties to assist the training process.
Figure 3 shows the correlation matrix of attributes and soybean variety yields. If a group of attributes and varieties are clustered horizontally or vertically, there exists a strong similarity among them that can be informative and helpful in practice. Large clusters are present in Figure 3, indicating strong correlation between attributes and variety yields and suggesting potential categorization of soybean varieties.
A close look at the rows (variety yields) reveals clusters of related attributes. A hierarchical representation of the clusters is provided along the top of Figure 3. For instance, precipitation (PREC), top soil pH (CONUS_PH), soil sand content (ISRIC_SAND) and soil type 2 (soil2) display a similar pattern of correlations with the yield of all varieties. This information is useful because it suggests alternative ways of improving variety yield: for example, one could change attributes such as top soil pH instead of precipitation, a factor that is beyond the farmer’s control.
A hierarchical representation of varieties is shown on the left side of Figure 3. For instance, varieties 191, 188, 127 and 9 represent a potential cluster, as they have similar correlation signs over almost all the attributes. From the color plot, we observe a number of ”blocks” in the rightmost and middle columns. This suggests similarities among certain varieties, and can be used to validate our method for grouping soybean varieties.
To obtain a grouping of the varieties, we utilize kmeans clustering. We determine the number of categories by plotting the within group sum of squares, as shown in Figure 4. We observe that for
, the within group sum of squares continues to decrease as increases; for , the decrease is relatively small. Based on this we select 12 clusters. The set of soybean varieties in each cluster is shown in Table 1.Group  Variety 

1  V56, V181 
2  V9, V32, V96, V107, V119, V127, V140, V188, V191 
3  V31, V46, V90, V95, V98, V99, V100, V102, V103, 
V105, V106, V108, V110, V112, V114, V115, V117,  
V118, V131, V133, V136, V179, V183, V189, V194  
4  V88, V116, V126, V128, V192 
5  V54 
6  V8, V36, V38, V41, V44, V49, V97, 
V111, V180, V186, V190, V196  
7  V169 
8  V139 
9  V135, V137 
10  V47, V51, V109, V121, V122, V130, V193 
11  V39, V42, V43, V45, V52, V124, V182, V185, V187 
12  V48, V87, V92, V101, V104, V195 
To account for the fact that some of the 80 considered soybean varieties have far fewer samples than others, we use the following data augmentation procedure: during training, if a variety has fewer than 200 samples, we augment the data by randomly sampling from varieties in the same category. Additionally, for the attribute CONUS_PH (top soil pH), we directly remove those samples for which its value missing. For the attribute RM_BAND (relative maturity band, a measure of when in the growing season the variety matures), we train a random forest model to predict its value from other attributes and impute the missing values by doing prediction.
We use a 3:1:1 split of the Syngenta data set to create the training, validation and test datasets. All the performance statistics below are reported on the test set.
3.3 Yield Prediction
3.3.1 Check Yield Prediction
Check yield indicates the current production level of all varieties in a given site, which can be treated as the reference point for a specific siteyear combination. In our model (Figure 2), predicted check yield serves as a building block for the prediction of varietyspecific yield in the same setting.
For the check yield prediction process we considered several different machine learning models, including linear regression, decision tree models, and neural network models. Based on characteristics of the Syngenta dataset, we choose a random forest model for the check yield prediction. The mean squared errors (MSE) on the test set are reported in Table 2. The baseline model simply uses the mean check yield for all samples as a prediction. With
of the error explained, the random forest model provides reasonably accurate predictions.Baseline  Random Forest Model  
MSE  107.08  23.91 
To better evaluate performance, we randomly permute the values of each attribute to see how much the check yield prediction error will increase. As shown in Figure 5, weather attributes, namely Temperature (Temp), Precipitation (Prec) and Radiation (Rad), increase the mean squared errors (MSE) by , and , respectively. This result suggests that prediction of Temperature, Precipitation and Radiation for the next growing year is highly significant for predicting the check yield. For this reason we deal with the weather attributes as random variables rather than using a deterministic prediction. Aside from weather attributes, the next two most important attributes are Longitude () and Latitude () of the sites. The sites with low latitude (towards the equator) and low longitude (towards the West) will tend to have higher check yield.
3.3.2 Variety Ratio Prediction
Variety ratio is the ratio of the expected yield of a given variety in a given site to the average yield of all varieties in the site. Ideally this quantitity is calculated as (i.e., ). The baseline for comparison is to predict all the ratios as , which means simply using the check yield prediction as the variety yield. For each variety, we train a random forest model to estimate its variety ratio. Figure 7 shows the performance of this method on the test set. This is a relatively difficult prediction task: on average, we reduced mean squared error by compared to using a variety yield ratio of for all varieties.
3.3.3 Variety Yield Prediction
By multiplying the prediction of check yield and variety ratio, we obtain a predictor of variety yield (i.e., = ). Table 3 shows the performance of several variety yield prediction methods. The Baseline model simply predicts the variety yield as the mean yield of all samples. The Check model predicts variety yield as the check yield. The 1Layer model directly predicts the variety yield with the bottom layer in Figure 2 (by training varietyspecific learners), while the 2Layer model is the full model shown in Figure 2. The 2Layer DA model additionally employs the data augmentation scheme described above.
Baseline  Check  OneLayer  TwoLayer  TwoLayer DA  
MSE  107.08  49.74  45.34  41.75  38.26 
From Table 3, we can see that the introduction of the middle layer improves the prediction accuracy. Prediction accuracy is further improved by use of our data augmentation scheme, demonstrating its success in dealing with the data imbalance problem. In terms of absolute errors, our model (the Twolayer DA model) obtains a median error of 3.74 bushels per acre and a mean error of 4.70 bushels per acre. In other words, our variety yield prediction on average will lie in a range of bushels per acre of the true yield.
As illustrated in Figure 6, the prediction mean squared errors vary significantly by variety types. For some varieties the predictor performed well but for other varieties the residual error is quite large. This phenomenon might be caused by the nature of soybean varieties: some have stable yields while others are more likely to be affected by uncertain factors, including not only weather, but also factors such as farming skills, seed quality, pests and diseases (captured by our random attribute ). In our decision models, we model the random attribute by sampling Gaussian noise for the variety yield that is proportional to these MSEs.
3.4 Variety Selection
We now use our variety yield prediction model along with information about the uncertain attributes to select soybean varieties to plant in the next year in a given site. This site, in the US midwest, was specified in the Syngenta Crop Challenge. We consider each of the three decision models. The goal is to choose a vector
such that up to entries of can be nonzero () with allowable increments of for the fraction of each variety selected.Since the weather conditions for next year are unknown, we randomly sample weather attributes from historical data of sites around the target site. We created random samples. For each weather attribute and variety type , the yield is calculated by multiplying the check yield prediction and variety ratio prediction. We also add a Gaussian noise proportional to the prediction errors of each variety (as in Figure 7) to characterize the fact that some varieties are more predictable than others.
Since an autoregressive integrated moving average (ARIMA) model did not yield a parametric fit for the weather time series data (presumably due to weather unpredictability), we design an empirical weather attribute sampling scheme. The idea is to generate samples from historical data on sites that are either near the target site or that share the same climate type as the target site. We first use a Euclidean distance search to identify the 20 nearest neighbors of the target site, and we identify sites that share the same climate type. We accumulate all the historical data from these “similar” sites and draw random samples from them as weather attribute samples for the next year.
With the empirical random samples , we can estimate the expected return and as the covariance matrix. For the robust optimization model (3), we treat the empirical data as the true distribution and maximize the empirical quantiles based on .
0.03  0.06  0.1  
Selected Varieties  V41, V44  V124, V41,V44  V124, V41, V43, V44 
Combination  
Expected Yield (Bushels per Acre)  59.29  58.75  58.52 
100  80  60  

Selected Varieties  V41  V124, V41  V124, V41,V44 
Combination  
Expected Yield (Bushels per Acre)  59.40  59.19  58.75 
0.2  0.5  0.8  

Selected Varieties  V41  V44, V124  V124 
Combination  (1.0)  (0.5, 0.5)  1.0 
Quantile Yield (Bushels per Acre)  57.12  58.73  61.17 
Expected Yield (Bushels per Acre)  59.40  57.78  57.31 
Results from each of our three decision models are presented in Tables 6, 6, and 6, respectively. For a low level of risk aversion, all three models favor variety V41 (the utility function model also includes
of variety V44). As the risk aversion level increases (moving toward the rightmost column in each table), the optimal choice for all three models becomes more conservative: a combination of varieties becomes optimal, rather than just the variety with the largest expected yield. With a higher level of risk aversion, all three models include variety V124. The utility function and controlledrisk yield maximization models also include varieties V41 and V44, and the utility function model additionally includes V43. As risk aversion increases, expected yield decreases slightly. We note that, from our kmeans cluster analysis (Table
1), varieties V41 and V44 were categorized in the same cluster, and varieties V43 and V124 were categorized together in a different cluster.The yield risk is hedged by combining several varieties. The three decision models selected various combinations of the varieties V41, V43, V44 and V124. Figure 8 plots the mean and variance of yield for all 80 soybean varieties that we considered. We see that the decision models selected varieties from the right bottom corner where points have high yield and relatively low variance. For a low level of risk aversion, the models select variety V41, which has the highest expected yield of all varieties. As risk aversion increases, the models select variety V44, which has almost the same yield as variety V41 but higher variance; then variety V124, which has lower yield and higher variance; and finally variety V43 (selected by the utility function model) which has lower yield but also lower variance.
After comparing the results from the three decision models, we recommend V124, V41 and V44 with a proportion of as the planting plan for this site for next year.
4 Discussion
In this paper, we address the problem of decision making for seed variety selection. Our hierarchical approach combining machine learning with a weather forecasting procedure allows us to accurately predict the yield of different seed varieties in a given site. We develop three decision models for selecting varieties that balance yield maximization and risk given the farmer’s risk aversion level. Our prediction and decision models support farmers in decision making about which seed varieties to plant in the next year.
For the Syngenta dataset, our prediction model generated estimates of yield that were within bushels per acre (or approximately ) of the true yield. The three decision models identified similar varieties for similar levels of risk aversion, thus providing a clear direction for variety selection.
We have focused on the problem of seed variety selection. However, our modeling approach provides additional information that can support farmers with other decisions. For example, our prediction method identifies how soil attributes affect crop yield. This can be used as a basis for decisions regarding soil amendment. Our prediction method also identifies how weather, soil and region attributes affect yield. This information can be used to assist in selecting promising sites for future farmland. Our clustering method provides information about seed varieties that have similar yield characteristics in a given site. This information can be used heuristically to select alternate seed varieties for a given site (for example, if a particular variety is not available).
Data analytics, simulation, and optimization have been successfully applied in a variety of business applications. Here we apply these techniques to a critical problem in agriculture. Our work provides a simple and effective way to reliably increase crop yield, which in turn can help to reduce hunger and malnutrition around the world.
References
5 Algorithm for Solving Robust Optimization Model
We use the following heuristic algorithm to efficiently solve the robust optimization model:
6 Feature Code Key
Table B1 provides a detailed description of the features in the Syngenta dataset.
Feature(s)  Description 
YEAR  Year 
SITE  Site code 
BREEDING_G  Breeding group 
EXP  Experiment number 
LAT  Latitude 
LONG  Longitude 
VARIETY  Variety code 
VARIETY_YIELD  Variety yield 
CHECK_YIELD  Check yield for the trial 
YIELD DIFFERENCE  Yield difference between variety and check in a trial 
SITE YIELD  Average site yield (check yields across experiments) 
RM_BAND  Relative maturity band 
CLIMATE  Climate type (based on temperature, precipitation and solar radiation) 
SEASON  Season type 
AREA  Probability of growing soybeans 
RM_25  Probability of growing soybeans of relative maturity band 2.5 to 3 
TOT_IRR_DE  Probability of irrigation 
SOIL_CUBE  Soil type (based on texture, available water holding capacity, soil drainage) 
TEMP_08, …, TEMP_14  Sum of the temperatures for each season, 2008, …, 2014 
TEMP_MED  Median sum of temperatures for seasons from 2001 to 2014 
PREC_08, … PREC_14  Sum of the precipitation for each season, 2008, …, 2014 
PREC_MED  Median sum of precipitation for seasons from 2001 to 2014 
RAD_08, …, RAD_14  Sum of the solar radiation for each season, 2008, …, 2014 
RAD_MED  Median sum of solar radiation for seasons from 2001 to 2014 
CONUS_PH  Topsoil pH (10 to 20 cm depth) 
CONUS_AWC  Topsoil available water capacity in 150 cm soil profile (10 to 20 cm depth) 
CONUS_CLAY  Topsoil clay content (10 to 20 cm depth) 
CONUS_SILT  Topsoil silt content (10 to 20 cm depth) 
CONUS_SAND  Topsoil sand content (10 to 20 cm depth) 
ISRIC_SAND  Soil sand content from another soil source 
ISRIC_SILT  Soil silt content from another soil source 
ISRIC_CLAY  Soil clay content from another soil source 
ISRIC_PH  Soil pH from another soil source 
ISRIC_CEC  Soil cation exchange from another soil source 
Table B1: Description of Features 