A data-driven workflow for predicting horizontal well production using vertical well logs

by Jorge Guevara, et al.

In recent work, a data-driven sweet spotting technique for shale plays previously explored with vertical wells has been proposed. Here, we extend this technique to multiple formations and formalize a general data-driven workflow to facilitate feature extraction from vertical well logs and predictive modeling of horizontal well production. We also develop an experimental framework that facilitates model selection and validation in a realistic drilling scenario. We present experimental results using this methodology in a field with 90 vertical wells and 98 horizontal wells, showing that it can achieve better predictive ability than kriging of known production values.




1 Introduction

In recent years, interest in unconventional resource exploration has grown substantially, particularly in North America, due to horizontal drilling and hydraulic fracturing techniques. However, these new techniques come at a cost and pose many operational challenges. For example, with drilling costs at an all-time high, choosing the right locations for new wells is a crucial issue. In this scenario, identifying so-called “sweet spots” with high potential for oil and gas is of great importance. Currently, the industry is in a state of “trial-and-error”, with only immature research results available regarding the physical characteristics of sweet spots. This opens up a great opportunity to explore data-driven approaches that learn to characterize sweet spots effectively. To this end, the industry has collected a huge amount of data over several decades.

In recent work, a data-driven sweet spotting technique for shale plays previously explored with vertical wells has been proposed [Kor15]. The technique involves three steps: 1) automatically extract features from vertical well log curves, within a single shale formation, using functional Principal Component Analysis (fPCA), 2) interpolate the extracted features from vertical well locations to horizontal well locations, and 3) build predictive models that relate interpolated features with horizontal well production. The method was tested using well log data from 2020 vertical wells and production data from 702 horizontal wells in a single field.

Here, we extend this previous work to multiple formations and formalize a general data-driven workflow that involves a series of steps to generate normalized data frames that facilitate feature extraction from vertical well logs and predictive modeling of horizontal well production. We also develop an experimental framework that facilitates model selection and validation in a realistic drilling scenario. This method is applicable in both large and small sample size settings. We finally show some experimental results using this methodology in another field with 90 vertical wells and 98 horizontal wells.

2 Methodology

Our workflow is divided into several phases. First, we perform data pre-processing, which is depicted in Figure 1 and involves normalizing vertical and horizontal data sources to standardized data frames that may be used seamlessly in all downstream analyses. The data pre-processing is followed by an analysis workflow, which is depicted in Figure 2 and consists of three main steps: 1) feature extraction from the (standardized) vertical well log data frames, 2) interpolation of the extracted features onto horizontal well locations and incorporation of the features into the (standardized) production data frame, and 3) predictive model building for horizontal well production with the aim of finding sweet spots.

2.1 Data Pre-processing

Figure 1: Pre-processing.

The data pre-processing system is illustrated in Figure 1. It receives as input a set of files that store the horizontal and vertical data sources. The horizontal well data include daily production for all the horizontal wells under study and a metadata database that contains information about the target formations. The target formation is defined as the principal formation where the horizontal section of a given well lands during drilling.

The procedure “summarizeCumProd” takes as input the daily oil and gas production values and produces a cumulative production value for a chosen period of time (e.g. 6 months, 12 months, or 18 months). Cumulative production is calculated by summing the daily production values from the beginning of well production until the desired number of months later. The cumulative production is the target variable for our predictive models. Wells that have produced for less time than the desired number of months receive a cumulative production value of “NA” (missing). The number of months should therefore be reasonably large, but not so large that many wells are eliminated from the analysis.
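The cumulative production step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name mirrors “summarizeCumProd”, but the signature, the 30-day approximation of a month, the well identifiers, and the input layout (a list of daily values starting at first production) are hypothetical and not taken from the authors' implementation.

```python
from typing import Optional, Sequence

DAYS_PER_MONTH = 30  # assumption: a month is approximated as 30 days


def summarize_cum_prod(daily: Sequence[float], months: int) -> Optional[float]:
    """Sum daily production over the first `months` months of production.

    Returns None (the "NA" of the text) when the well has produced for
    less time than the requested period.
    """
    horizon = months * DAYS_PER_MONTH
    if len(daily) < horizon:
        return None
    return float(sum(daily[:horizon]))


# Hypothetical wells: identifier -> daily oil production since first production
wells = {
    "well_A": [10.0] * 400,  # long enough for 6 and 12 months
    "well_B": [5.0] * 200,   # long enough for 6 months only
}
cum_prod = {
    api: {f"Cum_{m}month_oil_Prod": summarize_cum_prod(daily, m) for m in (6, 12)}
    for api, daily in wells.items()
}
```

Here `well_B` keeps its 6-month value but receives a missing 12-month value, which is the trade-off described above when choosing the period length.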

The procedure “findTargetFormation” simply searches the metadata database for the target formations of the wells under study. Once the two pre-processing functions have been applied to the horizontal data, the result is stored in a data frame that contains the columns “API”, “TARGET FORMATION”, “Cum_6month_oil_Prod”, “Cum_6month_gas_Prod”, “Cum_12month_oil_Prod”, etc. Each row of the data frame corresponds to a unique horizontal producing well in the area/polygon of study. This data frame is the data structure accessed in all downstream analyses, and we refer to it as the “cumulative production data frame”. Whenever new features are generated for the horizontal wells, they can be stored as new columns appended to this data frame, e.g. the surface_X and surface_Y coordinates of the horizontal wells (from the metadata database) or functional principal components (extracted from vertical well logs), see section 2.2.

The vertical well data include well log files (.las) and files containing the formation tops of the vertical wells in the area/polygon of study. Formation tops of a given vertical well are stored as pairs of values (Formation_Name, Depth) that determine the name of the given formation and the depth at which it starts. If for a given vertical well we have the top of each target formation along with the top of the next formation below, then we can determine the relevant vertical well log sections within all target formations. The function “readLAS” essentially reads the well logs contained in the .las files corresponding to vertical wells inside the area/polygon of interest. After reading the .las file of a given vertical well we store in memory the quantitative log values in a data frame whose columns are the depth values and the underlying log properties.
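The use of formation tops to delimit log sections can be illustrated as follows. This is a minimal sketch, assuming tops are given as (Formation_Name, Depth) pairs and a log as (depth, value) pairs; the helper names `formation_intervals` and `extract_rows` and the example values are hypothetical.

```python
from typing import Dict, List, Tuple


def formation_intervals(
    tops: List[Tuple[str, float]], targets: List[str]
) -> Dict[str, Tuple[float, float]]:
    """Map each target formation to its (top, bottom) depth interval.

    The bottom of a formation is taken to be the top of the next formation
    below it; a target with no formation below it is skipped.
    """
    ordered = sorted(tops, key=lambda pair: pair[1])  # shallow to deep
    intervals = {}
    for (name, top), (_, next_top) in zip(ordered, ordered[1:]):
        if name in targets:
            intervals[name] = (top, next_top)
    return intervals


def extract_rows(log: List[Tuple[float, float]], top: float, bottom: float):
    """Keep the (depth, value) samples that fall inside [top, bottom]."""
    return [(d, v) for d, v in log if top <= d <= bottom]


# Hypothetical tops and gamma ray samples for one vertical well
tops = [("A", 1000.0), ("B", 1050.0), ("C", 1120.0)]
intervals = formation_intervals(tops, targets=["B", "C"])
gr_log = [(1040.0, 80.0), (1060.0, 95.0), (1100.0, 110.0), (1130.0, 70.0)]
section_b = extract_rows(gr_log, *intervals["B"])
```

Formation “C” has no top below it in this example, so no interval can be determined for it from this well alone.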

In order for column names to be consistent across vertical wells (in .las files) we apply a function called “applyDictionary”. This function will scan through well log (column) names and substitute each log name with a unique alias that represents the unique identifier of the corresponding well log property. For example, if three .las files have logs named “GammaRay”, “Gamma”, and “GR”, respectively, then they may be substituted by the unique alias name “GR”. The underlying dictionary is maintained by geophysics experts and may be automated in part by accessing online alias sources, such as Crain’s Petrophysical Handbook.
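A small sketch of the aliasing idea, assuming a plain lookup table keyed by lower-cased log names. The dictionary contents (including the density entries) and the function signature are illustrative assumptions; the real dictionary is the one maintained by the domain experts.

```python
from typing import Dict, List

# Hypothetical alias dictionary: raw log name (lower-cased) -> unique alias
ALIASES: Dict[str, str] = {
    "gammaray": "GR",
    "gamma": "GR",
    "gr": "GR",
    "rhob": "RHOB",
    "density": "RHOB",
}


def apply_dictionary(columns: List[str], aliases: Dict[str, str] = ALIASES) -> List[str]:
    """Replace each known log name by its alias; leave unknown names unchanged."""
    return [aliases.get(name.strip().lower(), name) for name in columns]


# Column names as they might appear in three different .las files
raw_columns = (["DEPT", "GammaRay"], ["DEPT", "Gamma"], ["DEPT", "GR"])
renamed = [apply_dictionary(cols) for cols in raw_columns]
```

After this step all three wells expose the same column name for the gamma ray log, so their data frames can be combined by column.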

The “createFormation3dMap” procedure reads in the formation tops (and bottoms) of all target formations and infers the depths of target formation tops and bottoms at all vertical wells. In case of incomplete data (e.g. missing formation depth at a given well) we apply spatial interpolation techniques to infer approximate depths.

The procedure “extractLogSection” is applied to each vertical well log data frame (as obtained from “readLAS”) and uses the inferred formation 3D map (as obtained from “createFormation3dMap”) to extract the relevant well log sections within all the target formations of interest. The “extractLogSection” procedure also re-samples (in depth) each well log section across all vertical wells so that a single data frame may be formed for each well log property of interest. More specifically, assume that for a given formation, each well contains a well-specific number of depth values between the formation top and bottom. Since formation thicknesses vary across the different vertical wells (and thus the number of depth values differs), we cannot store the raw well log sections (across all the vertical wells) in a single data frame. However, if we choose a constant number of depth samples and re-sample/interpolate the well log section of each well at that many equally spaced depth values between top and bottom, then we may store the formation’s well log sections (across all the vertical wells) in a single data frame.

Each formation can thus be stored as a sub-matrix whose columns correspond to the vertical wells and whose rows correspond to comparable depth values (across wells). The first and last rows of the sub-matrix correspond to the formation top and bottom, respectively. The intermediate rows correspond to a sequence of equally spaced depth values between top and bottom. This amounts to a uniform depth normalization of each formation and results in a data frame that may be easily accessed and manipulated in all downstream analyses, see e.g. section 2.2. We refer to the data frame obtained by stacking these sub-matrices across all target formations as the “standardized well log data frame”, and we note that we create a separate data frame for each well log property of interest.
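The uniform depth normalization can be sketched as below. The sketch assumes piecewise-linear interpolation, which the text does not specify, and the parameter name `n_samples`, its value, and the example wells are hypothetical.

```python
from typing import List, Sequence


def linear_interp(x: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Piecewise-linear interpolation; xs must be strictly increasing."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            w = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + w * (ys[i] - ys[i - 1])
    return ys[-1]


def resample_section(depths, values, top, bottom, n_samples) -> List[float]:
    """Re-sample a log section at n_samples equally spaced depths in [top, bottom]."""
    step = (bottom - top) / (n_samples - 1)
    grid = [top + i * step for i in range(n_samples)]
    return [linear_interp(d, depths, values) for d in grid]


# Two hypothetical wells with different thickness and sampling in one formation
well_1 = resample_section([100, 110, 120], [1.0, 3.0, 5.0], top=100, bottom=120, n_samples=5)
well_2 = resample_section([200, 250, 300], [2.0, 4.0, 2.0], top=200, bottom=300, n_samples=5)

# Sub-matrix for the formation: rows = normalized depth levels, columns = wells
sub_matrix = [list(row) for row in zip(well_1, well_2)]
```

Although the two wells differ in formation thickness (20 versus 100 depth units), both end up with the same number of rows, so row i refers to the same relative position between top and bottom in every well.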

Figure 2: Analysis.

2.2 Feature Extraction per Formation

From the standardized (vertical) well log data frames we may now extract features (within each formation) that we wish to include in our predictive modeling efforts (see step 1 of the analysis workflow in Figure 2). These features may include simple summary statistics of the well logs, such as mean, variance, and maximum/minimum peak per formation. We extend the approach of [Kor15] and calculate functional principal components within each formation independently. Note that the standardized well log data frames greatly facilitate the calculation of the principal components. For each formation we simply select the sub-matrix corresponding to that formation and provide it as direct input to the appropriate functions of the R package “fda” (see [Ram09]), which we used to calculate the functional principal components. We calculated functional principal components of the primary curves from a petrophysical perspective, which included Density, Gamma Ray, Limestone, Neutron Porosity, Deep and Shallow Resistivity, Photoelectric Factor, Medium Resistivity, and Compressional and Shear Slowness.
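As an illustration of the simpler summary-statistic features only, the sketch below computes them per well from one formation's sub-matrix. It does not reproduce the functional principal component calculation, which the text delegates to the R package “fda”. The function name, the use of population variance, and the example values are assumptions.

```python
from statistics import mean, pvariance
from typing import Dict, List


def summary_features(
    sub_matrix: List[List[float]], well_ids: List[str]
) -> Dict[str, Dict[str, float]]:
    """Per-well mean, variance, min and max of one log within one formation.

    sub_matrix rows are normalized depth levels; columns are vertical wells.
    """
    features = {}
    for j, well in enumerate(well_ids):
        curve = [row[j] for row in sub_matrix]
        features[well] = {
            "mean": mean(curve),
            "variance": pvariance(curve),  # assumption: population variance
            "min": min(curve),
            "max": max(curve),
        }
    return features


# Hypothetical gamma ray sub-matrix of one formation for two vertical wells
gr_formation = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 3.0], [5.0, 2.0]]
feats = summary_features(gr_formation, ["V1", "V2"])
```

Because every well has the same number of rows after depth normalization, each feature is a plain column-wise operation on the sub-matrix.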

2.3 Interpolation

Once the features have been extracted from the vertical well log sections within each formation, we interpolate those features onto the coordinates of the horizontal production wells. However, since each horizontal well has exactly one target formation, we only consider features extracted from vertical well log sections inside that target formation. More specifically, for each horizontal well we gather the features of the vertical well log sections of the corresponding target formation and then interpolate them onto the (surface) coordinates of the horizontal well. Once the features have been interpolated for all horizontal wells, they may be appended as new columns to the cumulative production data frame, which can then be used directly for building predictive models.
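The text does not name the interpolation method, so the sketch below swaps in inverse distance weighting purely as an example of interpolating a per-formation feature onto horizontal well coordinates. The function `idw`, the `power` parameter, and all coordinates and feature values are hypothetical.

```python
import math
from typing import List, Tuple


def idw(
    known: List[Tuple[float, float, float]], x: float, y: float, power: float = 2.0
) -> float:
    """Inverse-distance-weighted estimate at (x, y) from (x_i, y_i, value_i) points."""
    num = den = 0.0
    for xi, yi, vi in known:
        dist = math.hypot(x - xi, y - yi)
        if dist == 0.0:
            return vi  # the query point coincides with a vertical well
        weight = 1.0 / dist ** power
        num += weight * vi
        den += weight
    return num / den


# Hypothetical feature values at vertical wells, grouped by formation
features_by_formation = {
    "B": [(0.0, 0.0, 1.0), (2.0, 0.0, 3.0)],
    "C": [(0.0, 0.0, -1.0), (2.0, 0.0, -1.0)],
}
# Hypothetical horizontal wells: id -> (surface_X, surface_Y, target formation)
horizontal = {"H1": (1.0, 0.0, "B"), "H2": (0.0, 0.0, "C")}

# Each horizontal well only uses features from its own target formation
interpolated = {
    api: idw(features_by_formation[formation], x, y)
    for api, (x, y, formation) in horizontal.items()
}
```

The resulting values would be appended to the cumulative production data frame as one new column per feature.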

2.4 Predictive Modeling

This phase explores the relationship between the interpolated fPCA features and the cumulative production of oil or gas at horizontal wells by means of predictive modeling. To this end, we have established an experimental framework for ranking, selecting and validating machine learning models, see section 3. Our framework uses a data set defined by the fPCA values (features) and cumulative production of gas or oil (predicted values) for all the horizontal wells from the cumulative production data frame. Once a training set is defined, we select the most predictive features (fPCA values) using feature selection techniques. After that, we implement and rank several machine learning models in order to determine the best models. Once the best models have been established, we perform an external validation to measure the predictive power of the models in a realistic scenario and to estimate the degree of association between the selected fPCA features and the cumulative production.

3 Experimental Framework

In order to select and validate predictive models, we propose an experimental framework that implements several machine learning models, ranks them in terms of predictive power, and then externally validates the performance of the best models in a realistic drilling scenario.

3.1 Model Selection

In order to identify the best predictive models for a given training data set, we perform a benchmark experiment as suggested in [Hot05]. The goal of this experiment is to rank a set of candidate models in terms of the root mean squared error (RMSE). Some models have built-in feature selection capabilities, and for those models we enable feature selection using their own internal algorithms. For models that lack such capabilities we perform feature selection using the Elastic Net method [Zou05]. In order to avoid over-fitting, i.e., selecting features that are correlated with idiosyncrasies of the data, we select the subset of features with the smallest cross validation error. After that, for each model, we estimate the distribution of the RMSE using the re-sampling technique described in [Hot05]. In our setting we use repeated k-fold cross validation, which yields one RMSE value per re-sample, i.e. as many values as the number of folds times the number of repeats. The number of folds and the number of repeats may be chosen as a function of the underlying experimental data set. If a model has hyper-parameters to be optimized, then the distribution of the RMSE is obtained from the re-samples corresponding to the hyper-parameters with the smallest cross validation error. Finally, we select the best models for the given training data in terms of the smallest median of the RMSE distribution.
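The ranking step can be sketched with a toy benchmark. This is a simplified illustration, not the authors' setup: the two candidate models (`fit_mean`, `fit_line`), the synthetic data, and the values `k=5` and `repeats=3` are assumptions, and feature selection and hyper-parameter tuning are omitted.

```python
import random
from statistics import mean, median
from typing import Callable, Dict, List, Sequence

Model = Callable[[Sequence[float], Sequence[float]], Callable[[float], float]]


def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Candidate: ordinary least squares on a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def rmse(pred, obs):
    return (sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)) ** 0.5


def repeated_kfold_rmse(fit: Model, xs, ys, k: int, repeats: int, seed: int = 0) -> List[float]:
    """Return one RMSE per re-sample (k * repeats values in total)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        idx = list(range(len(xs)))
        rng.shuffle(idx)
        for fold in (idx[i::k] for i in range(k)):
            held_out = set(fold)
            train = [i for i in idx if i not in held_out]
            predict = fit([xs[i] for i in train], [ys[i] for i in train])
            scores.append(rmse([predict(xs[i]) for i in fold], [ys[i] for i in fold]))
    return scores


# Synthetic data: a linear trend with a small alternating perturbation
xs = [float(i) for i in range(20)]
ys = [2.0 * x + (1.0 if i % 2 else -1.0) for i, x in enumerate(xs)]

candidates: Dict[str, Model] = {"mean": fit_mean, "line": fit_line}
scores = {name: repeated_kfold_rmse(fit, xs, ys, k=5, repeats=3) for name, fit in candidates.items()}
ranking = sorted(scores, key=lambda name: median(scores[name]))  # best model first
```

Ranking by the median of the re-sampled RMSE values, rather than by a single estimate, makes the comparison less sensitive to one unlucky split.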

3.2 Nested Leave-One-Out Validation

We perform an external validation in order to assess the performance of the best models selected in the benchmark experiment. (We do not use the RMSE values of the benchmark experiment because they could be optimistic estimates of the error [Caw10].) Through this external validation we further “simulate” the realistic scenario of predicting the cumulative production of a new (unseen) well given only information from previously existing wells. To this end, the feature and model selection steps of the previous sub-section are done within an external nested Leave-One-Out (LOO) loop (i.e. the validation procedure is external to the feature and model selection steps). We proceed as follows: we divide our complete data set of n wells into n leave-one-well-out subsets of size n-1. For each subset we perform feature and model selection using the methods described in the previous sub-section. Finally, the cumulative production is predicted for each of the omitted wells in the external LOO loop.
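A compact sketch of the nested structure follows. For brevity it uses an inner leave-one-out loop for model selection instead of the repeated cross validation of the previous sub-section; that substitution, the two toy candidate models, and the synthetic data are assumptions for illustration.

```python
from statistics import mean
from typing import List


def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Candidate: ordinary least squares on a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def loo_rmse(fit, xs: List[float], ys: List[float]) -> float:
    """Inner loop: leave-one-out RMSE of one candidate on the training wells."""
    errs = []
    for i in range(len(xs)):
        predict = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errs.append((predict(xs[i]) - ys[i]) ** 2)
    return mean(errs) ** 0.5


def nested_loo(candidates, xs: List[float], ys: List[float]) -> List[float]:
    """Outer loop: predict each well with a model selected without that well."""
    predictions = []
    for i in range(len(xs)):
        train_x, train_y = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        # Model selection only ever sees the n-1 remaining wells
        best = min(candidates, key=lambda fit: loo_rmse(fit, train_x, train_y))
        predictions.append(best(train_x, train_y)(xs[i]))
    return predictions


# Synthetic, exactly linear data for 12 hypothetical wells
xs = [float(i) for i in range(12)]
ys = [3.0 * x + 1.0 for x in xs]
preds = nested_loo([fit_mean, fit_line], xs, ys)
outer_rmse = mean((p - y) ** 2 for p, y in zip(preds, ys)) ** 0.5
```

The key property is that the omitted well never influences which model is chosen, so `outer_rmse` is not biased by the selection step.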

3.3 Experimental Results

We applied the experimental framework to two datasets, both containing the same predictors but different predicted variables depending on whether the goal was to predict cumulative production of oil or of gas. The first dataset consisted of 88 non-missing observations (corresponding to 88 wells) that were chosen from 98 horizontal wells in a specific area of interest after selecting only those wells with twelve months of cumulative production. The same procedure was applied to the second dataset, which consisted of 86 non-missing observations. The complete candidate predictor set for both datasets was defined by the first 10 (interpolated) functional principal components of each of the primary curves specified in section 2.2.

We implemented several models and feature selection methods from the Caret library [Kuh15] (which consists of 64 models). After that, we chose the top three models in terms of the root mean squared error (RMSE) using the techniques of sub-section 3.1. In our setting we used repeated k-fold cross validation, which resulted in an RMSE distribution across the re-samples. Once the top models were selected, we conducted the nested leave-one-out cross validation (from sub-section 3.2) on those models in order to approximate the true generalization error, as suggested in [Caw10, Amb02].

Figure 3 shows the observed vs. the predicted values obtained from the top three machine learning models. The figure also shows the observed vs. predicted values using kriging on the raw horizontal well production. Table 1 shows the LOO RMSE, the feature selection method used, and the Pearson correlation coefficient for the top regression models and kriging, for oil and gas. We note that “svmRadialSigma” and “rqlasso” gave the best results in terms of RMSE for oil and gas, respectively, and both outperformed traditional kriging.

Oil

| Method | Feature Selection | RMSE | Pearson Correlation |
|---|---|---|---|
| svmRadialSigma | Elastic Net | | |
| lars | built-in | | |
| krlsRadial | Elastic Net | | |
| kriging horizontal production | | | |

Gas

| Method | Feature Selection | RMSE | Pearson Correlation |
|---|---|---|---|
| rqlasso | built-in | | |
| blasso | built-in | | |
| penalized | built-in | | |
| kriging horizontal production | | | |

Table 1: Results for the top 3 machine learning methods for oil and gas compared to kriging

Figure 3: Predicted vs. observed values for oil production (above) and gas production (below)

4 Conclusions and Future Work

The workflow we presented in this paper is a systematic methodology for using and comparing machine learning methods to predict production in unconventional plays using well logs and production data from previous explorations, taking into account multiple formations. Our experimental results show that this workflow can outperform more conventional techniques, such as kriging, in particular for the case of gas production prediction.

As future work, we would like to add to our workflow the capability of extracting and integrating features from horizontal well logs, in addition to features from vertical well logs. We also plan to add well completion parameters to the models, which we expect to be very important for improving the accuracy of the production predictions.


  • [Amb02] Christophe Ambroise and Geoffrey J McLachlan. Selection bias in gene extraction on the basis of microarray gene-expression data. Proceedings of the National Academy of Sciences 99, 10 (2002), 6562–6566.
  • [Caw10] Gavin C Cawley and Nicola LC Talbot. On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research 11, Jul (2010), 2079–2107.
  • [Hot05] Torsten Hothorn, Friedrich Leisch, Achim Zeileis, and Kurt Hornik. The design and analysis of benchmark experiments. Journal of Computational and Graphical Statistics 14, 3 (2005), 675–699.
  • [Kor15] M. Kormaksson, M. Vieira, and B. Zadrozny. A data driven method for sweet spot identification in shale plays using well log data. In SPE Digital Energy Conference and Exhibition (2015), Society of Petroleum Engineers.
  • [Kuh15] Max Kuhn. Caret: classification and regression training. Astrophysics Source Code Library 1 (2015), 05003.
  • [Ram09] J. O. Ramsay, G. Hooker, and S. Graves. Functional Data Analysis with R and Matlab, (2009), Springer.
  • [Zou05] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 2 (2005), 301–320.