1 Methodology and theory
1.1 Problem Statement
Consider a set of soybean varieties and a region divided into subregions, each with its own soil and weather conditions. Our goal is to find which soybean seed variety, or mix of up to five varieties in appropriate proportions, will best meet the demands of farmers in each subregion and in the region as a whole. To find such a mix, we predict the top five varieties for each subregion in terms of maximum yield and then find the proportions of these five varieties that maximize yield and minimize its variance.
Our approach to predicting the mix of soybean varieties, in appropriate proportions, for each subregion and for the whole region consists of three steps: 1) prediction of weather and soil attributes, 2) yield prediction given the soybean variety and the weather and soil conditions, and 3) yield optimization. We also present a visual analytics tool for understanding and analysing the solutions generated by our approach. The tool supports the decision-making process and provides insightful visualizations.
1.2 Weather and Soil Prediction
For each subregion, we consider its weather and soil condition attributes at each time step. Given these attributes from the initial time up to the current time, we use a deep-learning approach, the LSTM [6], to predict their values at the next time step. Each attribute is handled separately: we prepare the sequence of its values up to the time step before the current one and use it as input to train an LSTM whose target is the value at the current time; we then use the trained LSTM to predict the value at the next time step. LSTM-based neural networks are competitive with traditional methods and are considered a good alternative for forecasting general weather conditions [9]. Figure 1 shows the architecture used for the prediction of weather and soil attributes.
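The sequence preparation described above can be sketched as follows. This is a minimal, library-free illustration rather than our actual pipeline: the function names `make_training_pair` and `make_forecast_input` and the toy values are hypothetical, and the choice to keep the forecasting window the same length as the training window by dropping the oldest value is an assumption.

```python
from typing import List, Tuple


def make_training_pair(series: List[float]) -> Tuple[List[float], float]:
    """Split one attribute's series into (input sequence, target).

    The input is every value except the last; the target is the last value.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    return series[:-1], series[-1]


def make_forecast_input(series: List[float]) -> List[float]:
    """Input window for predicting the next, unobserved time step.

    Assumption: the window has the same length as in training, obtained by
    dropping the oldest value so that the most recent value is included.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    return series[1:]


# Hypothetical yearly precipitation values for one subregion.
precipitation = [610.0, 595.0, 640.0, 620.0, 605.0]
x_train, y_train = make_training_pair(precipitation)
x_next = make_forecast_input(precipitation)
```

Each `(x_train, y_train)` pair would be one training example for the per-attribute LSTM, and `x_next` the input used after training to forecast the following time step.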
1.3 Yield Prediction
Once we have predicted the weather and soil attributes for the next time step, we use them as the feature set to predict the yield of each soybean variety in every subregion. We divide the range of yield values, bounded by the maximum and minimum in the historical data, into equally sized bins and treat prediction as a classification problem whose goal is to predict the yield bin. More formally, for each subregion we compute one probability distribution of yield per soybean variety using a Random Forest Classifier (RFC) [2]. An RFC is an ensemble learning method that constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes predicted by the individual trees. As output, we obtain for each class the number of trees that voted for it. We then convert these counts into probabilities by dividing each count by the sum of all counts.
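The binning and the conversion of vote counts into probabilities can be illustrated with a small sketch. The helper names `yield_to_bin` and `votes_to_probabilities`, the number of bins, and the yield range are hypothetical values chosen only for the example.

```python
from collections import Counter
from typing import List


def yield_to_bin(value: float, y_min: float, y_max: float, n_bins: int) -> int:
    """Map a yield value to an equal-width bin index in [0, n_bins - 1]."""
    if y_max <= y_min:
        raise ValueError("y_max must exceed y_min")
    width = (y_max - y_min) / n_bins
    index = int((value - y_min) / width)
    # Clamp so that the maximum yield falls into the last bin.
    return min(max(index, 0), n_bins - 1)


def votes_to_probabilities(tree_votes: List[int], n_bins: int) -> List[float]:
    """Turn the per-tree predicted bins into a probability per bin.

    Each count is divided by the sum of all counts (the number of trees).
    """
    counts = Counter(tree_votes)
    total = len(tree_votes)
    return [counts.get(b, 0) / total for b in range(n_bins)]


# Hypothetical yield range of 20 to 80 split into 6 bins of width 10.
bins = [yield_to_bin(v, 20.0, 80.0, 6) for v in (20.0, 35.0, 80.0)]

# Hypothetical bins predicted by a forest of ten trees.
probs = votes_to_probabilities([2, 2, 3, 2, 4, 3, 2, 3, 2, 2], n_bins=6)
```

Note that scikit-learn's `RandomForestClassifier.predict_proba` averages the class probabilities estimated by the individual trees rather than counting hard votes, so its output can differ slightly from the vote-count scheme sketched here.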
1.4 Yield Optimization
Given the probability distributions of yield for a subregion, we use an optimization approach to obtain a weighted combination of varieties that maximizes yield and minimizes its standard deviation, or variability. The steps to obtain the combination of varieties in appropriate proportions are as follows:

1) Given the n probability distributions of yield, one for every soybean variety, we calculate the expected value and variance of each distribution.

2) We choose the top distributions, ranked by maximum score, where the score is calculated as:
(1)
where the two terms are the normalized values of the expected value and the variance, respectively.

3) We use an optimization technique whose objective is to maximize yield using a combination of up to five varieties, out of the chosen top varieties, in appropriate proportions. The objective function and constraints of the optimization are as follows:
(2)
(3)
The constrained quantity is called the variability. We ran the above optimization for ten evenly spaced variability thresholds and obtained an optimal solution for each. Therefore, for each subregion we obtain 10 solutions, out of which we choose the one with maximum yield and minimum variability; this is called the default solution for the subregion.
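The steps above can be sketched under explicit assumptions. Here the variability of a mix is assumed to be the standard deviation of the weighted yield with independent varieties, i.e. the square root of the sum of squared weights times variances; the exact objective and constraints are those of Eqs. (2) and (3) and may differ. A coarse grid search stands in for a convex solver such as cvxpy, and the names `moments` and `best_mix` and all numbers are hypothetical.

```python
import math
from itertools import product
from typing import List, Optional, Sequence, Tuple


def moments(probs: Sequence[float], bin_centers: Sequence[float]) -> Tuple[float, float]:
    """Expected value and variance of a discrete yield distribution."""
    mean = sum(p * c for p, c in zip(probs, bin_centers))
    var = sum(p * (c - mean) ** 2 for p, c in zip(probs, bin_centers))
    return mean, var


def best_mix(
    means: Sequence[float],
    variances: Sequence[float],
    max_variability: float,
    steps: int = 10,
) -> Optional[Tuple[float, float, List[float]]]:
    """Grid-search proportions (multiples of 1/steps that sum to 1).

    Returns (expected yield, variability, weights) of the mix with the highest
    expected yield whose assumed variability stays within the threshold, or
    None if no mix is feasible.
    """
    best = None
    for combo in product(range(steps + 1), repeat=len(means)):
        if sum(combo) != steps:
            continue
        weights = [c / steps for c in combo]
        # Assumed variability: std. dev. of the mix for independent varieties.
        variability = math.sqrt(sum(w * w * v for w, v in zip(weights, variances)))
        if variability > max_variability:
            continue
        expected = sum(w * m for w, m in zip(weights, means))
        if best is None or expected > best[0]:
            best = (expected, variability, weights)
    return best


# Hypothetical yield distributions of three varieties over three bins.
bin_centers = [40.0, 50.0, 60.0]
distributions = [
    [0.1, 0.2, 0.7],  # high yield, moderate variance
    [0.0, 1.0, 0.0],  # medium yield, no variance
    [0.5, 0.5, 0.0],  # low yield
]
stats = [moments(p, bin_centers) for p in distributions]
means = [m for m, _ in stats]
variances = [v for _, v in stats]

# One solution per variability threshold, as in the threshold sweep.
solutions = {t: best_mix(means, variances, t) for t in (2.0, 4.0, 7.0)}
```

In this toy example, loosening the threshold shifts weight from the zero-variance variety to the high-yield one, which mirrors the trade-off explored by the threshold sweep.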
Spatial Cohesion (SC) Score: For every solution obtained from the optimization approach, we calculate an SC score as follows. Suppose that, for a given subregion, the optimized solution at a given variability threshold contains five varieties in some proportions. Consider the set of neighboring subregions that lie within a maximum distance, in miles, of its centroid. For every variety in the solution of the subregion, we calculate a variety score as:
(4) 
Here the variety score depends on the proportion with which the variety occurs in the solution of each neighboring subregion at the same threshold. The SC score of the subregion is then calculated as the average of the variety scores of all varieties in its solution:
(5) 
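As a rough illustration of the SC score, the sketch below assumes that the variety score of Eq. (4) is the mean proportion of the variety across the neighboring solutions; the exact form of Eq. (4) may differ. The averaging over varieties follows the description of Eq. (5). The names `variety_score` and `spatial_cohesion` and the variety labels are hypothetical.

```python
from typing import Dict, List

# A solution maps a variety name to its proportion in a subregion's mix.
Solution = Dict[str, float]


def variety_score(variety: str, neighbor_solutions: List[Solution]) -> float:
    """Assumed form: mean proportion of `variety` across neighboring solutions."""
    if not neighbor_solutions:
        return 0.0
    total = sum(s.get(variety, 0.0) for s in neighbor_solutions)
    return total / len(neighbor_solutions)


def spatial_cohesion(solution: Solution, neighbor_solutions: List[Solution]) -> float:
    """Average of the variety scores over all varieties in `solution`."""
    if not solution:
        return 0.0
    scores = [variety_score(v, neighbor_solutions) for v in solution]
    return sum(scores) / len(scores)


# Hypothetical solutions of one subregion and its two neighbors.
own = {"V1": 0.5, "V2": 0.5}
neighbors = [{"V1": 1.0}, {"V1": 0.5, "V3": 0.5}]
sc = spatial_cohesion(own, neighbors)
```

Under this assumed form, a subregion whose varieties are also widely planted by its neighbors receives a score close to 1, and one whose varieties appear nowhere nearby receives 0.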
2 Visual analytics using ViSeed
We now describe ViSeed, our visual analytics tool for understanding the given data and analysing the solutions generated by our analytics methodology. This tool differentiates our work from related work [1] [8], as it lets the retailer explore varieties that perform well in local as well as global areas. As agricultural yields vary widely around the world due to climate and the mix of crops grown [4], ViSeed lets the farmer intuitively explore soil quality and climatic variations region by region and make planting decisions accordingly.
The main screen of ViSeed is divided into two parts. The first part, A in Figure 2 (i), displays a map of the United States over which various subregion attributes can be visualized. The second part, the right-hand panel B, contains a tabbed control panel for switching between various visualizations of the solution data.
Getting Started A data attribute such as precipitation or solar radiation may be visualized by selecting it from the attribute menu and a year from the timeline as shown in Figure 2 (i). Further, as described in the previous section, we first compute the top varieties for each subregion, based on high expected yield and low variance and from them, an optimal solution with up to five varieties. As a variety may occur in the optimal solution of multiple subregions, we compute a distribution of weights for each variety across subregions and the expected value of this distribution. The varieties are ranked in decreasing order of expected value which is an indicator of prevalence of the variety across subregions and displayed in the right panel when the visualization tool starts up, Figure 2 (i), C. Thus, the user is provided a starting point to begin exploring the possible solutions.
The histogram of weights for each variety across subregions, using a colour map, is shown alongside its expected value. Selection of a range on the histogram of a variety highlights those subregions for which the variety has weight or proportion in the selected range. Clicking on a variety name highlights those subregions on the map for which the variety occurs in the optimal solution. Clicking multiple varieties highlights the regions for all, thus allowing the user to visualize cumulative prevalence of varieties.
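The ranking of varieties and the range selection on a histogram can be sketched as follows. This is an illustrative approximation, not the ViSeed implementation: the expected value is assumed to be the mean weight over all subregions, counting a weight of zero where the variety is absent, and the names `rank_varieties` and `subregions_in_range` and the toy data are hypothetical.

```python
from typing import Dict, List, Tuple

# Maps a subregion name to its solution (variety name -> weight).
Solutions = Dict[str, Dict[str, float]]


def rank_varieties(solutions: Solutions) -> List[Tuple[str, float]]:
    """Rank varieties by mean weight across all subregions, highest first."""
    n = len(solutions)
    totals: Dict[str, float] = {}
    for weights in solutions.values():
        for variety, w in weights.items():
            totals[variety] = totals.get(variety, 0.0) + w
    ranked = [(variety, total / n) for variety, total in totals.items()]
    # Sort by decreasing expected weight; break ties by name for stability.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


def subregions_in_range(
    solutions: Solutions, variety: str, low: float, high: float
) -> List[str]:
    """Subregions whose solution contains `variety` with weight in [low, high]."""
    return sorted(
        region
        for region, weights in solutions.items()
        if variety in weights and low <= weights[variety] <= high
    )


# Hypothetical optimal solutions of four subregions.
solutions = {
    "R1": {"A": 0.5, "B": 0.5},
    "R2": {"A": 1.0},
    "R3": {"B": 0.25, "C": 0.75},
    "R4": {"A": 0.5, "C": 0.5},
}
ranking = rank_varieties(solutions)
highlighted = subregions_in_range(solutions, "A", 0.4, 0.6)
```

Here `ranking` corresponds to the ordered list shown in the right panel, and `highlighted` to the subregions that would be highlighted when a weight range is selected on the histogram of variety A.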
Common Solution A user may explore various solutions, for the entire region, by selecting up to five varieties from this list, 1 in Figure 2 (ii). On pressing the query button, 2, proportions of each of the selected varieties are computed, 3, and the total yield for the region is predicted, 4.
Differentiated Solution Each subregion has a precomputed default solution from among its top varieties. Clicking on a subregion in the map, Figure 4, 1, brings up the top varieties for that subregion, 2, along with their weights in the optimal solution, the count of subregions in which each variety is among the top varieties, and the predicted yield distribution of each variety. The subregions for which a variety is in the top list can be seen by clicking on the red count bar.
For every solution, we show the average yield, the average standard deviation, and the average offset, in %, over the entire region. The standard deviation and offset of each subregion are calculated as:
(6) 
(7) 
where the two symbols denote the variance of a variety and its proportion in the solution, respectively.
Changing Variability Our tool also allows the user to analyse solutions under different variability thresholds. This is done by first clicking on the Variance tab and then setting a variability threshold by moving the slider, Figure 4, 1. On pressing the query button, the list of varieties with variability below the chosen threshold is displayed, along with their histograms of weights across subregions and their expected values, as before. A new optimal solution is computed for each subregion, and its similarity with the solutions of neighboring subregions is calculated as a spatial cohesion score. This score is visualized on the map. The user may also interact with this list of varieties and compute a global solution for the entire region, as described earlier.
3 Quantitative results
In this section, we present the results of the LSTM and RFC models used for weather and yield prediction. We used Keras, scikit-learn, and cvxpy to implement the LSTM, the RFC, and the optimization, respectively.
Available Data: We were provided with the following two datasets:

Experiment Dataset: It consists of 82000 experiments, in 583 subregions, between 2009 and 2015 using 174 varieties. It has three weather and three soil condition attributes for every subregion.

Region Dataset: It consists of 6490 subregions with the given three soil and three weather condition attributes from the year 2000 to 2015.
We predict the three weather attributes, temperature, precipitation, and solar radiation, for every subregion of the Region dataset. For a given attribute in a given subregion, we prepare the sequence of 14 values from year 2000 to 2014 as input to train the LSTM, with the target of predicting the value for year 2015. We validate and test the LSTM model by dividing the Region dataset into training, validation, and test data in the ratio 70:15:15. Table 1 shows the normalized root mean square error (NRMSE) for all three weather attributes on the validation and test sets. Here, the NRMSE for a given attribute is defined as
(8) 
(9) 
where and are the actual and predicted values, and are maximum and minimum values, of attribute for subregion . The value of is in our case.
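For illustration, a common range-normalized form of the NRMSE can be computed as in the sketch below. It is not necessarily identical to Eqs. (8) and (9): here the range is taken over the supplied actual values, which is an assumption, and the function name `nrmse` and the sample values are hypothetical.

```python
import math
from typing import Sequence


def nrmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error divided by the range of the actual values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    value_range = max(actual) - min(actual)
    if value_range == 0:
        raise ValueError("actual values must not all be equal")
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse) / value_range


# Hypothetical actual and predicted values of one attribute.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
score = nrmse(actual, predicted)
```

Multiplying `score` by 100 expresses the error as a percentage of the observed range, which is how the values in Table 1 are reported.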
The NRMSE values in Table 1 indicate that the LSTM predicts the weather attributes with less than 1% error for temperature and precipitation and less than 3% error for solar radiation. We use this trained LSTM model to predict the weather attributes for year 2016. Note that we did not predict the soil condition parameters, as they did not change over time in the Experiment dataset.
We use the Experiment dataset to train the RFC model, dividing it into training, validation, and test sets in the ratio 70:15:15. The soil and weather condition attributes of each experiment are used as the feature set of the RFC, and the discretized yield as the target variable. The NRMSE of the yield predicted by the RFC is 6.01% on the validation set and 6.25% on the test set. We use the trained RFC model to predict the yield for the 6490 subregions in the Region dataset, taking as input the weather and soil attributes for year 2016, with the weather attributes predicted using the LSTM as explained above.
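A 70:15:15 split of this kind can be sketched as follows. Whether the split was random or stratified is not stated, so the random shuffle, the fixed seed, and the name `split_70_15_15` are assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_70_15_15(
    items: Sequence[T], seed: int = 0
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle `items` and split them into train/validation/test (70:15:15)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 70 // 100
    n_valid = len(shuffled) * 15 // 100
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # remainder, so no item is dropped
    return train, valid, test


# Hypothetical experiment identifiers.
experiment_ids = list(range(200))
train_ids, valid_ids, test_ids = split_70_15_15(experiment_ids)
```

Fixing the seed keeps the three parts reproducible across runs, so validation and test errors remain comparable between experiments.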
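Table 1: NRMSE of the LSTM predictions of the weather attributes on the validation and test sets.
Attribute          Validation Set   Test Set
Temperature        0.69%            0.78%
Precipitation      0.73%            0.83%
Solar radiation    2.6%             2.8%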
3.1 Insights from ViSeed
Common Solution To compute the common solution, we used ViSeed to check multiple combinations of the top ten varieties, ranked by expected value, and arrived at the reported solution. The areas in which these varieties are part of the optimal differentiated solution are shown in Figure 5 (i), and the cumulative area for all five is shown in Figure 5 (ii).
Differentiated Solution Analysing the predicted yield for the differentiated solution, we find two areas with high yield, as highlighted in Figure 6 (i). The spatial cohesion score is also visualised in the same figure and is observed to be high for the areas with high yield. We validate this in Figure 6 (ii) by visualizing the subregions in which a variety from a subregion with high yield and high spatial cohesion score is grown. We may conclude that high-yield varieties are localized to certain regions and therefore do not occur in the optimal common solution.
Experiments with Variability In Figure 7, we show the changes in spatial cohesion and yield for different variability thresholds. We find that with a low variability threshold of 0.3, there is no optimal solution for most of the subregions (dark grey areas in the map). Increasing the threshold results in solutions for most of the subregions, along with an increase in the average yield per subregion and in the spatial cohesion score. However, beyond a certain threshold, the gain in yield and spatial cohesion is very small.
Our submission to the Syngenta challenge comprised two parts: (a) machine learning and stochastic optimisation to compute solutions, and (b) a visual analytics tool, ViSeed, to analyse the results. In particular, we predict the weather and soil attributes of each subregion through time-series regression using LSTMs. For each subregion and seed variety, a Random Forest classifier trained on the experiment data is used to predict yield distributions. Next, we compute the weights of the varieties for each subregion through stochastic optimization, maximizing expected yield and minimizing variance, and then use visual analytics to choose an optimal global solution based on the spread of each variety across the locally optimal combinations. Our combination of soybean varieties is as follows: (i) V156774: 48.1%, (ii) V156806: 15.2%, (iii) V152312: 13.5%, (iv) V114565: 11.8%, and (v) V152322: 11.4%. We also provide a second solution to the challenge, consisting of individual optimal solutions for each subregion, which performs marginally better than the first. Our geospatial visual analytics tool, ViSeed, is designed to explore the raw data as well as to aid in optimization. Our entry was not among the top 5 final entries selected from 600 registered teams, and as the details of the winning entries have not been made public, we cannot compare our approach with theirs.
4 Conclusion
In this paper, we propose an approach to crop planning based on machine learning models (RFC and LSTM), stochastic optimization, and a visual analytics platform. We give two different solution sets: i) a common solution for the entire region and ii) differentiated solutions at the subregion level. As criteria for selecting seed varieties, we use the expected yield, obtained from a model driven by subregion-wise predictions of weather and soil conditions, and the standard deviation of that yield. We give a spatial cohesion score for each solution, which helps to find similar solutions in the neighbouring subregions. We also present a geospatial visual analytics tool that can explore the raw data and helps the retailer in decision making by allowing exploration of solutions at the subregion as well as the global level.
References
 [1] O. Adekanmbi and O. O. Olugbara. Multiobjective optimization of crop-mix planning using generalized differential evolution algorithm. 2015.
 [2] L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
 [3] S. Drummond, K. Sudduth, A. Joshi, S. Birrell, and N. Kitchen. Statistical and neural methods for site–specific yield prediction.
 [4] J. A. Foley, N. Ramankutty, K. A. Brauman, E. S. Cassidy, J. S. Gerber, M. Johnston, N. D. Mueller, C. O’Connell, D. K. Ray, P. C. West, et al. Solutions for a cultivated planet. Nature, 478(7369):337–342, 2011.
 [5] H. C. J. Godfray, J. R. Beddington, I. R. Crute, L. Haddad, D. Lawrence, J. F. Muir, J. Pretty, S. Robinson, S. M. Thomas, and C. Toulmin. Food security: the challenge of feeding 9 billion people. science, 327(5967):812–818, 2010.
 [6] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 [7] T. Iizumi and N. Ramankutty. How do weather and climate influence cropping area and intensity? Global Food Security, 4:46–50, 2015.
 [8] O. Marko, S. Brdar, M. Panic, P. Lugonja, and V. Crnojevic. Soybean varieties portfolio optimisation based on yield prediction. Computers and Electronics in Agriculture, 127:467–474, 2016.

 [9] M. A. Zaytar and C. El. Sequence to sequence weather forecasting with long short-term memory recurrent neural networks. 143:7–11, 06 2016.