Predicting Soil pH by Using Nearest Fields

12/03/2019 ∙ by Quoc Hung Ngo, et al. ∙ 0

In precision agriculture (PA), soil sampling and testing are carried out before planting any new crop. This is an expensive operation, since there are many soil characteristics to take into account. This paper gives an overview of soil characteristics and their relationships with crop yield and soil profiling. We propose an approach for predicting soil pH based on nearest neighbour fields. It combines spatial radius queries with various regression techniques from data mining. We use a soil dataset containing about 4,000 field profiles to evaluate these techniques and analyse their robustness. A comparative study indicates that the LR, SVR, and GBRT techniques achieved high accuracy, with R² values of about 0.718 and MAE values of 0.29. The experimental results show that the proposed approach is very promising and can contribute significantly to PA.




1 Introduction

Precision agriculture can be described as an autonomous process that collects data and passes it to analysis systems for mining. Applying data mining to agricultural data is becoming highly important, as it can search huge collections of data for new knowledge and thus improve current practices. In this context, a soil profile is one of the preconditions for making good agronomic decisions. This practical information can be obtained by soil sampling; however, sampling is costly and very time consuming. In addition, it is often unnecessary to conduct soil tests for every field when field conditions are similar to those of neighbouring fields.

In general, the use of data mining techniques allows us to study a large number of soil profiles [11] and monitor soil characteristics and other factors that affect crop yield [7], [12]. These data mining techniques have been successfully used to classify soil data [6] and to predict soil maps [2] and soil salinity [15]. However, to the best of our knowledge, there is no study on predicting soil characteristics for new fields from only their locations and a few other features. Predicting soil features from the nearest fields not only helps to fill in missing values in soil profiles but also reduces the cost of soil sampling.

In this paper, we propose a solution to generate features for new fields without sampling. This is of great help, mainly when some data values are missing from the collected data. We also propose a data mining approach to predict soil pH values based on the values of neighbouring fields. Finally, we test and evaluate our approach experimentally on real data collected from about 4,000 soil profiles. The next section gives an overview of soil properties and reviews several soil studies that are related to precision agriculture.

2 Related Work

The most important soil characteristics can be divided into three categories: composition, physical and chemical characteristics [9]. In addition, there are several features that relate to soil fertility and biological properties, such as CEC (cation exchange capacity), SOC (soil organic carbon), and EC (soil electrical conductivity). In practice, soil profiles mainly include physical and chemical characteristics (such as pH, N, P, K, etc.), SOC, or EC [4], [11], [12]. These soil characteristics were already represented using the AgriOnto ontology [8]. Many studies can be found in the literature on building soil profiles or datasets and on monitoring soil characteristics that affect crop yield. Many of those studies apply data mining techniques to soil characteristics to predict crop yield and other measures or objectives.

Wei Shangguan et al. [11] built a China soil dataset of soil profiles for land surface modelling. The data set includes attributes for vertical soil layers, collected from counties, national farms and forest farms. In another soil study, P.K. Singh et al. [12] monitored pH, EC, CEC, and chemical characteristics of soil samples during and after crop harvesting to evaluate the effect of waste water on soil properties, crop yield and the environment.

In P. Han et al. [6], soil colour characteristics are used to classify soil types. They used soil layers sampled below the surface. Their classifier is based on RGB signals and principal component analysis (PCA). The experimental findings were obtained and evaluated on a data set with the same number of soil samples for each soil type.

For the prediction of soil characteristics, the authors of [4] compared prediction methods for mapping CEC. Their study was carried out on a field in Australia over multiple years of sorghum and wheat cropping. Wang et al. [15] predicted soil salinity in three geographically distinct areas in China. They compared five regression algorithms on data sets of soil samples. In their experiments, random forest (RF) and stochastic gradient treeboost (SGT) achieved the highest accuracy. However, the scores of the RF and SGT predictions are not stable across time and locations. The widest range of soil chemical and physical characteristics was predicted by M.J. Aitkenhead et al. [1]. They used artificial neural networks (ANN) and soil colour (RGB values) to predict soil parameters, including chemical and physical characteristics and soil texture. They also demonstrated that several soil parameters can be predicted accurately.

In summary, the literature contains several studies that use data mining techniques to predict some soil characteristics of a soil profile from others, or to predict other factors related to soil characteristics. To the best of our knowledge, there is no study on predicting new soil profiles when the main soil characteristics are unavailable. Nevertheless, previous studies on soil profile prediction based on soil characteristics constitute a solid foundation for this work.

3 Predicting Soil pH

3.1 Soil Dataset

The soil dataset includes soil samples from 3,809 fields, which are extracted from a large raw agriculture dataset of the CONSUS project. The soil data were collected from a widely distributed agricultural area of the UK. These fields grow many different plants, but the collected data mainly focus on crops, fruits, vegetables, and grass. Each record in the dataset corresponds to one field and includes field information, location information (longitude, latitude), chemical features (pH, P, K, Mg), and soil texture (sand, clay, and silt percentage). According to [10], soil pH is the most important attribute. The values of this attribute fall mainly within a limited range for the cultivated fields in our dataset.

3.2 Features based on Nearest Fields

Most fields have neighbours within a radius of 2,000 m (3,760 of 3,809 fields, as shown in Table 1), and the number of fields with neighbours grows with the radius. Several fields, however, only have neighbours at larger, kilometre-scale radii, and 49 fields have no neighbours at all within 2,000 m.

Radius (m)   Fields with neighbours   Number of neighbours   Distance (m)   Average of max-min(pH)
100          25                       1.12                   78.2           0.03
200          756                      1.28                   147.42         0.09
300          2,102                    1.67                   185.22         0.19
400          2,945                    2.27                   210.35         0.31
500          3,295                    3.01                   232.57         0.44
750          3,594                    5.07                   296.29         0.67
1,000        3,672                    7.11                   367.19         0.83
1,500        3,733                    10.65                  505.93         1.04
2,000        3,760                    13.66                  635.17         1.16
Table 1: Validation of soil features by nearest fields

Our approach is based on the field's location. For each field in the dataset, we retrieve the nearest fields that lie within a given radius of that field (using spatial queries). The radius (in metres) is the maximum allowed distance between the given field and the returned nearby fields. In our experiments, the radius ranges from 100 m to 2,000 m (Table 1).
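The paper does not say which spatial index or query engine was used. As a minimal sketch of such a radius query, the following assumes field centres are given as (latitude, longitude) pairs in degrees and uses scikit-learn's BallTree with the haversine metric. The function name neighbours_within, the Earth-radius constant, and the toy coordinates are illustrative assumptions, not part of the original work.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (assumed), converts metres <-> radians


def neighbours_within(lat_lon_deg, radius_m):
    """For every field, return (indices, distances in m) of the other fields within radius_m."""
    # The haversine metric expects [latitude, longitude] in radians.
    coords = np.radians(np.asarray(lat_lon_deg, dtype=float))
    tree = BallTree(coords, metric="haversine")
    idx, dist = tree.query_radius(
        coords, r=radius_m / EARTH_RADIUS_M, return_distance=True
    )
    result = []
    for i, (ids, d) in enumerate(zip(idx, dist)):
        keep = ids != i  # drop the query field itself
        result.append((ids[keep], d[keep] * EARTH_RADIUS_M))
    return result


# Toy coordinates (hypothetical): two fields about 111 m apart, one about 11 km away.
fields = [(52.0, 0.0), (52.001, 0.0), (52.1, 0.0)]
nearby = neighbours_within(fields, radius_m=500)
```

With these toy coordinates, the first two fields are each other's only neighbour at 500 m, and the third field has none, which mirrors the fields in Table 1 that only gain neighbours at larger radii.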

To predict the pH attribute of a data object (a field), we compute the average, maximum, and minimum pH values over the returned list of nearby fields, together with the distance between the centre of that list and the location of the field. Here k is the number of neighbours within the given radius, j = 1..k indexes the neighbouring fields in this region, and the centre of the neighbours is computed separately for each radius (in m).
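One possible formalisation of these features is given below. The notation is assumed here for illustration: f_i is the target field, f_1..f_k are its k neighbours within radius r, pH(f) and loc(f) are a field's pH value and location, c_r is the centre of the neighbours, and d is a geographic distance. Taking the centre as the mean of the neighbour locations is likewise an assumption.

```latex
\begin{align}
\mathrm{Avg}_r(f_i)  &= \frac{1}{k}\sum_{j=1}^{k} \mathrm{pH}(f_j), \\
\mathrm{Max}_r(f_i)  &= \max_{1 \le j \le k} \mathrm{pH}(f_j), \\
\mathrm{Min}_r(f_i)  &= \min_{1 \le j \le k} \mathrm{pH}(f_j), \\
c_r                  &= \frac{1}{k}\sum_{j=1}^{k} \mathrm{loc}(f_j), \\
\mathrm{Dist}_r(f_i) &= d\bigl(\mathrm{loc}(f_i),\, c_r\bigr).
\end{align}
```

Under this reading, the feature names used in the result tables (Nb, Dist, Min, Max, Avg followed by a radius) correspond to k, Dist_r, Min_r, Max_r and Avg_r for that radius.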

3.3 Data Mining Techniques for Prediction

There are many data mining techniques used for soil classification and prediction. In our study, we propose to use common regression techniques to predict soil pH. These techniques include Linear Regression (LR), Support Vector Regression (SVR) [3], Decision Tree Regression (DTR) [14], Least Absolute Shrinkage and Selection Operator (LASSO) [13], Random Forests (RF) [5], and Gradient Boosting Regression Tree (GBRT) [5]. In our experiments, we use the Scikit-learn toolkit to deploy and evaluate these techniques.
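The six techniques map directly onto scikit-learn estimators. The sketch below instantiates them with default hyperparameters and scores them by cross-validated R². The hyperparameters, the 5-fold split, the helper names make_models and evaluate, and the synthetic demo data are all assumptions, since the paper does not report its settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor


def make_models(seed=0):
    """Return the six regressors named in the text, with default settings (assumed)."""
    return {
        "LR": LinearRegression(),
        "SVR": SVR(),
        "LASSO": Lasso(),
        "DTR": DecisionTreeRegressor(random_state=seed),
        "RF": RandomForestRegressor(random_state=seed),
        "GBRT": GradientBoostingRegressor(random_state=seed),
    }


def evaluate(X, y, cv=5):
    """Mean cross-validated R² per technique."""
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
        for name, model in make_models().items()
    }


# Synthetic stand-in for the soil data: the target is linear in the first feature.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = 6.5 + 0.5 * X_demo[:, 0]
scores = evaluate(X_demo, y_demo)
```

On real data, X would hold the location, crop and neighbour-based features described above, and y the measured pH.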

4 Experimental Results

In our experiments, the comparative evaluation of the prediction models is based on the coefficient of determination (R²) and the mean absolute error (MAE). The best possible R² score is 1.0, and the score can be negative because a model can be arbitrarily worse than a constant prediction. A constant model that always predicts the expected value of y, disregarding the input features, would get a score of 0.0.
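A small worked example may make the two metrics concrete. It uses scikit-learn's r2_score and mean_absolute_error on made-up pH values (not taken from the dataset), with R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot the sum of squared deviations from the mean.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

y_true = np.array([5.5, 6.0, 6.5, 7.0])         # made-up pH measurements
y_const = np.full_like(y_true, y_true.mean())   # constant model: always predicts the mean (6.25)
y_bad = y_true[::-1].copy()                     # a model worse than the constant one

r2_perfect = r2_score(y_true, y_true)           # 1.0, the best possible score
r2_const = r2_score(y_true, y_const)            # 0.0, because SS_res equals SS_tot
r2_bad = r2_score(y_true, y_bad)                # 1 - 5.0 / 1.25 = -3.0
mae_const = mean_absolute_error(y_true, y_const)  # (0.75 + 0.25 + 0.25 + 0.75) / 4 = 0.5
```

This is why the LASSO column of Table 2, with R² close to zero, behaves like a constant predictor.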

In the first experiment, we apply six regression techniques (LR, SVR, LASSO, DTR, RF, and GBRT) to part or all of the dataset, depending on the evaluated features. For example, when evaluating the group of features for the 200 m radius, only 756 fields have neighbouring fields (Table 1), so only those 756 records are used to evaluate the CropName+Min/Max/Avg200 features (Table 2). The results for soil pH prediction were very poor when only the field's own features were used (1st row of Table 2). They improved significantly when the average pH features were added. We achieved the best results with the CropName+Min/Max/Avg400 features.

Features            LR             SVR            LASSO           DTR            RF             GBRT
                    R²     MAE     R²     MAE     R²      MAE     R²     MAE     R²     MAE     R²     MAE
Long/Lat/CropName   0.084  0.56    0.163  0.52    -0.004  0.61    0.46   0.36    0.162  0.52    0.536  0.35
r200                0.681  0.33    0.688  0.33    -0.001  0.62    0.47   0.41    0.666  0.35    0.66   0.33
r300                0.695  0.29    0.698  0.29    -0.004  0.6     0.427  0.39    0.66   0.31    0.683  0.29
r400                0.718  0.29    0.713  0.28    -0.002  0.63    0.503  0.39    0.68   0.31    0.703  0.29
r500                0.671  0.31    0.666  0.3     -0.001  0.61    0.411  0.4     0.654  0.32    0.651  0.31
r750                0.633  0.28    0.632  0.27    -0.0    0.62    0.386  0.4     0.645  0.3     0.628  0.28
r1000               0.66   0.31    0.663  0.31    -0.001  0.61    0.452  0.4     0.617  0.32    0.669  0.3
r1500               0.653  0.3     0.647  0.29    -0.001  0.62    0.454  0.36    0.62   0.32    0.656  0.29
r2000               0.623  0.33    0.608  0.32    -0.002  0.62    0.452  0.38    0.596  0.35    0.658  0.32
Table 2: Result of soil pH regression based on radius-based features
First row: Long/Lat/CropName; r200: CropName+Min/Max/Avg200

In another experiment, we evaluated the contribution of individual features to the prediction. Only three regression techniques returned high scores: LR, SVR and GBRT. We also evaluated the CropType feature, which maps each crop name to one of four crop types (Crops, Vegetables, Fruits, and Grass) by using lists of concepts and instances from the AgriOnto ontology [8]. Although the CropName feature contains many different crop names, it is mapped to a CropType feature with only four values. The results are approximately the same for both features (3rd and 4th rows of Table 3).
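The mapping from crop names to crop types can be pictured as a simple lookup. The sketch below is purely illustrative: the crop names and the integer encoding are hypothetical, and the actual lists come from the AgriOnto ontology [8] and are not reproduced here.

```python
# Hypothetical excerpt of a crop-name -> crop-type lookup (not the AgriOnto lists).
CROP_TYPE_BY_NAME = {
    "winter wheat": "Crops",
    "spring barley": "Crops",
    "carrot": "Vegetables",
    "apple": "Fruits",
    "ryegrass": "Grass",
}

# The four crop types named in the text, in the order given there.
CROP_TYPES = ["Crops", "Vegetables", "Fruits", "Grass"]


def encode_crop_type(crop_name):
    """Map a crop name to the integer code of its crop type (-1 if the name is unknown)."""
    crop_type = CROP_TYPE_BY_NAME.get(crop_name.strip().lower())
    return CROP_TYPES.index(crop_type) if crop_type is not None else -1
```

For example, encode_crop_type("Carrot") yields the code for Vegetables, so many distinct crop names collapse into four values in a single feature column.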

Feature                      Size         LR      SVR     GBRT
Long/Lat/CropName            (2,945, 3)   0.086   0.17    0.548
Long/Lat/CropName+Avg400     (2,945, 4)   0.717   0.715   0.716
Nb/Dist/Avg400+CropName      (2,945, 4)   0.717   0.714   0.7
Nb/Dist/Avg400+CropType      (2,945, 4)   0.718   0.714   0.696
Nb/Dist/Max/Min/Avg400       (2,945, 5)   0.718   0.709   0.697
  + CropName                 (2,945, 6)   0.718   0.708   0.696
  + CropName, CropType       (2,945, 7)   0.718   0.709   0.696
Table 3: R² score of soil pH regression based on individual features

In the next experiment, we extended the feature set to include more radius values. As shown in Table 4, the neighbour-based features are calculated for radii ranging from 200 m to 2,000 m, and the features for each radius are added cumulatively. In Table 4, LR reaches its highest score once the 400 m features are included (0.715), SVR at 300 m (0.676), and GBRT at 1,500 m (0.703).

Feature                Size        LR      SVR     GBRT
Long/Lat/CropName      (756, 3)    0.122   0.232   0.545
+ Nb/Dist/Avg200       (756, 6)    0.686   0.667   0.656
+ Nb/Dist/Avg300       (756, 9)    0.699   0.676   0.674
+ Nb/Dist/Avg400       (756, 12)   0.715   0.67    0.684
+ Nb/Dist/Avg500       (756, 15)   0.711   0.638   0.692
+ Nb/Dist/Avg750       (756, 18)   0.709   0.597   0.702
+ Nb/Dist/Avg1000      (756, 21)   0.707   0.58    0.696
+ Nb/Dist/Avg1500      (756, 24)   0.704   0.604   0.703
+ Nb/Dist/Avg2000      (756, 27)   0.702   0.493   0.697
Table 4: R² score of soil pH regression based on combined features
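The Size column of Table 4 grows by three columns per row, which suggests that each radius contributes one Nb/Dist/Avg block on top of the previous feature set. A minimal sketch of that stacking follows, assuming the per-radius blocks have already been computed. The function name, the dictionary layout, and the placeholder arrays are illustrative assumptions.

```python
import numpy as np

RADII_M = [200, 300, 400, 500, 750, 1000, 1500, 2000]  # radii listed in Table 4


def cumulative_feature_sets(base, per_radius):
    """Yield (label, matrix) pairs, appending one radius block of columns at a time."""
    current = np.asarray(base, dtype=float)
    yield "base", current
    for r in RADII_M:
        block = np.asarray(per_radius[r], dtype=float)  # e.g. Nb, Dist, Avg for radius r
        current = np.hstack([current, block])
        yield f"+ r{r}", current


# Placeholder data: 5 fields, 3 base columns, 3 columns per radius.
base_demo = np.zeros((5, 3))
blocks_demo = {r: np.ones((5, 3)) for r in RADII_M}
shapes = [m.shape for _, m in cumulative_feature_sets(base_demo, blocks_demo)]
```

Each yielded matrix can then be passed to the regressors in turn, giving one score per row as in Table 4.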

5 Conclusion and Future Work

We presented a short study on soil properties and how to construct soil profiles, which can be used in crop yield management. We proposed an approach to predict soil pH based on the average pH values of the nearest neighbour fields. The same approach can be applied to predict other characteristics of the soil profile when they are missing. With a large soil dataset, our approach, which is based only on neighbouring fields, has great potential not only for pH prediction but also for predicting other soil features. We therefore plan to extend our model and perform more experiments to predict other soil characteristics. Moreover, weather data and crop yield would also be valuable additions to the prediction models.


This work is part of CONSUS and is supported by the SFI Strategic Partnerships Programme (16/SPP/3296) and is co-funded by Origin Enterprises Plc.


  • [1] Aitkenhead, M.J., et al. Prediction of soil characteristics and colour using data from the National Soils Inventory of Scotland. Geoderma 200, 99-107 (2013).
  • [2] da Silva Chagas et al. Data mining methods applied to map soil units on tropical hillslopes in Rio de Janeiro, Brazil. Geoderma Regional 9, 47-55 (2017).
  • [3] Basak, D., Pal, S., Patranabis, D.C. Support vector regression. Neural Information Processing-Letters and Reviews 11(10), 203-224 (2007).
  • [4] Bishop, T.F.A., McBratney, A.B. A comparison of prediction methods for the creation of field-extent soil property maps. Geoderma 103(1-2), 149-160 (2001).
  • [5] Breiman, L. Random forests. Machine Learning 45(1), 5-32 (2001).
  • [6] Han, P., Dong, D., et al. A smartphone-based soil color sensor: For soil type classification. Computers and Electronics in Agriculture 123, 232-241 (2016).
  • [7] He, J., Li, H., et al. Soil properties and crop yields after 11 years of no tillage farming in wheat-maize cropping system in North China Plain. Soil and Tillage Research 113(1), 48-54 (2011).
  • [8] Ngo, Q.H., Le-Khac, N.A., Kechadi, T. Ontology based approach for precision agriculture. In: LNCS, Vol. 11248, 175-186. Springer (2018).
  • [9] Osman, K.T. Soils: principles, properties and management. Springer Science & Business Media (2012).
  • [10] Pietri, J.A., Brookes, P. Relationships between soil pH and microbial properties in a UK arable soil. Soil Biology and Biochemistry 40(7), 1856-1861 (2008).
  • [11] Shangguan, W., et al. A China data set of soil properties for land surface modeling. Journal of Advances in Modeling Earth Systems 5(2), 212-224 (2013).
  • [12] Singh, P., et al. Effects of sewage wastewater irrigation on soil properties, crop yield and environment. Agricultural Water Management 103, 100-104 (2012).
  • [13] Tibshirani, R. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society: Series B (Methodological) 58(1), 267-288 (1996).
  • [14] Waheed, T., et al. Measuring performance in precision agriculture: CART, a decision tree approach. Agricultural Water Management 84(1-2), 173-185 (2006).
  • [15] Wang, F., et al. Comparison of machine learning algorithms for soil salinity predictions in three dry land oases located in Xinjiang Uyghur Autonomous Region (XJUAR) of China. European Journal of Remote Sensing 52(1), 256-276 (2019).