5. Completion Optimization - ML Model

Train an ML model to predict well performance from completion, well spacing, and geologic data.

Overview

This article describes how to build a Machine Learning (ML) model to predict well performance from completion, well spacing, and geologic/geophysical data specifically for the Completion Optimization workflows.

For a more comprehensive article on ML, see Overview of ML.

We recommend the ML OLS (Ordinary Least Squares) or Tree algorithms be considered for the purpose of Completion Optimization.

A challenge with Tree algorithms is assessing if the model is overfitted. 

A challenge of using the OLS algorithms is that they are subject to errors when certain underlying assumptions are violated. Two important assumptions:

  • Input variables (features) should not be correlated to each other (collinearity)
  • Each input and response variable should have a normal (Gaussian) distribution

Both OLS assumptions are routinely violated in typical completion optimization datasets. For instance, fluid intensity and proppant intensity are often highly correlated. As a result, there is a pre-training step in the process below to understand if these assumptions are being violated, and if so, options (best practices) are provided so that the user may take specific actions before training begins.
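
As a quick illustration of how these two assumptions can be checked, the minimal sketch below computes pairwise correlations and a normality test with pandas and scipy. It is a generic example run outside the application; the file and column names are placeholders, not fields from your project.

    # Minimal sketch: checking the two OLS assumptions on an exported well dataset.
    # File and column names are placeholders; adapt to your own data.
    import pandas as pd
    from scipy import stats

    df = pd.read_csv("wells.csv")  # hypothetical export of the training entity set

    # 1. Collinearity: pairwise Pearson correlation between candidate features.
    features = ["fluid_intensity", "proppant_intensity", "lateral_length"]
    print(df[features].corr())  # values near +/-1 flag collinear pairs

    # 2. Normality: Shapiro-Wilk test on each input and the response variable.
    for col in features + ["b90_oil_equivalent"]:
        stat, p = stats.shapiro(df[col].dropna())
        print(f"{col}: W={stat:.3f}, p={p:.4f}")  # small p suggests non-normality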

The process of creating and training an ML model for the purpose of Completion Optimization has several steps:

  • Create a New ML Model (if needed)
  • Edit the ML Model:
    • Choose pre-training actions
    • Choose training settings
  • Train the ML Model
  • Evaluate the Model

The user should have a training entity set created before creating a model.

Resource on ML:   Interpretable Machine Learning

Create A New ML Model

To create or edit an ML model:

  • 1. Select the ML menu item

To create a new model, there are two choices:

  • 2. Choose an existing ML model to copy into a new model
  • 3. Add a new ML Model

Tip: If the new model will be similar to an existing model, it is usually easier to copy an existing model to a new name and edit that new model than to Add a new model.

Save Existing Model to a New Name to Create a New Model

To save an existing model to a new name:

  • Select an existing model by clicking the model's name (item 2 above).
  • Select the 'edit' button in the upper right corner of the screen.

A pop-up window will appear.

  • A. Select the Save dropdown list.
  • B. Select the Save As option.

  • C. Enter a name for the new model.
  • D. Select Save

All the model settings from the copied model will be retained in the newly created model. To change the model entity set, input or predicted parameters, or any settings, see the 'Edit an ML Model' section below.

Add a New Model

To create a new ML Model specifying all inputs and settings:

  • Select the "Add ML Model" button

A pop-up window will appear with several items:

  • 1. Enter a name
  • 2. Accept/choose 'Regression' as the Model Type
  • 3. Choose the signals that will be inputs for the predictive model (completion, well spacing, and geo parameters). The signals must already exist and ideally should be populated. Signals can be added or deleted later.
  • 4. Choose the parameter to be predicted, usually B90 Oil Equivalent or Gas Equivalent.
  • 5. Choose the entity set (must already exist) that contains the wells to be used to train the model.
  • 6. Choose 'AtDate' as the Scope.
  • 7. Select 'Add ML Model'.

Once the model has been created, it can be edited and trained as described in the sections below.

Edit the ML Model

To edit an ML model:

  • 1. Select the ML menu item.
  • 2. Select an existing ML model.

There are two editing choices for ML Models:
  • 1. Pre-training parameters (input and predicted parameters, entity set, transformations, etc.).
  • 2. Training settings (algorithms, training time, cross folds, best model measure, etc.)

Pre-Training Edits

Selecting the 'Edit' button shown above will cause the pre-training editing window to appear. There are four sections in the editing window:

  • 1. Information - Model
  • 2. Data - Inputs and predicted variable
  • 3. Training Options - Transformations, outlier detection and removal, validation and test options.
  • 4. Context - Entity Set

Information

Normally the user will not change any items:

A. Name: The name is not editable.

B. Model Type: The model type is not editable.

C. Data:  Leave 'Import From Script' unchanged.

Data

Use the dropdown lists to select the variables to include in the model, both the inputs and the response variable. If editing an existing model and/or editing a copied model, the user may only need to comment individual variables on/off in the code section.

  • 1. Select the 'Data' menu.
  • 2. Check the code area for signals (input variables) already added to the model.
  • 3. Use the comment symbol (//) to comment out unwanted signals. A signal occupies three consecutive rows in the code.
  • To add a new signal:
    • 4. Choose a signal from the dropdown list.
    • 5. Enter a new (simplified) name for the signal if desired. This name is what appears in the ML Model.
    • 6. Choose units for the signal in the ML model.
    • 7. Select the 'Add Column' button and the signal is added to the code section (item 2).
  • Repeat steps 4 - 7 until all input signals (called features in ML) have been added to the model.
  • 8. Choose the response variable (the signal the model will predict).
  • 9. Select Save to save edits.

Pro Tip: When adding variables in the ML Data window, include all variables that may be considered and then "turn off" [// to make the line a comment] variables not being used in a single model. This allows the user to quickly build and test many different models by simply turning on and off specific variables.  

Example: Include Fracture Fluid volume and Fracture Fluid per Stage in the code, and comment out one or the other depending on model design. Generally both of those variables would not be in the same model.


Signals common to Completion Optimization models:

  • Well/Parent/Child:
    • Well Spacing
    • TVD
    • Date of First Production (DOFP)
  • Geologic:
    • Formation/Zone
    • Landing Zone as a % Into The Formation/Zone
    • Porosity
    • Pay
    • Water Saturation
    • OOIP or OGIP per acre
  • Completion:
    • Completed Lateral Length
    • Stages:
      • Total Stages
      • Stage Spacing
    • Perf Cluster or Sleeve:
      • Perf Clusters or Sleeves
      • Perf Clusters or Sleeves per Stage
      • Perf Cluster or Sleeve Spacing
    • Frac Fluid:
      • Total Frac Fluid
      • Frac Fluid per Stage
      • Frac Fluid per Cluster
      • Frac Fluid Intensity
    • Proppant:
      • Total Proppant
      • Proppant per Stage
      • Proppant per Cluster
      • Proppant Intensity
    • Misc:
      • Proppant concentration
      • Injection Rate
      • Injection Pressure
      • Net Pressure
      • Classification Variables:
        • Reservoir Fluid Type
        • Completion Technology
        • Fracture Fluid Type
        • Proppant Type
        • Artificial lift type
  • Response Variables: Only 1 is selected under the "Predict" dropdown list. 
    • B90 Oil Equivalent
    • B90 Oil Equivalent per Stage
    • B90 Oil Equivalent per Cluster
    • B90 Oil Equivalent Intensity
    • B90 Gas Equivalent 
    • B90 Gas Equivalent per Stage
    • B90 Gas Equivalent per Cluster
    • B90 Gas Equivalent Intensity

Training Options

There are three Training Option Categories:

  • Pre-Training - Outlier removal and pre-processors.
  • Validation - How data is chosen to be withheld from each training fold but used during every fold to test for the best model.
  • Test - How data is chosen to be withheld from training and validation for use in a post-training final assessment of model quality. Also called a blind data set.

To access the Training Options:

  • 1. Select the 'Training Options' menu item.
  • Pre-Training:
    • 2. Choose an Outlier Removal option if desired:
      • Cook's Distance (a multi-dimensional outlier identifier): data points are ranked by their leverage/influence on the model.
      • IQR (Inter-Quartile Range): removes the largest and smallest values of each variable.
      • Best practice is to use outlier removal only if the user is uncertain whether poor-quality data may be in the data set.
    • 3. Choose Data Pre-Processors if desired:
      • These are data transformations to scale, shift, and change the distribution shape of individual variables. This is important when using the OLS (or any least squares) algorithm due to the assumption that each parameter has a Normal (Gaussian) distribution.
      • We recommend Box Cox and Box Tidwell when using OLS or any Least Squares algorithm (a sketch after the Pro Tip below illustrates Cook's Distance and Box-Cox).
  • 4. Accept the default Validation setting.
  • 5. Accept the default Test setting.
  • 6. Select Save if any settings were changed.

Pro Tip:  Run the model multiple times with different training options to see how these features affect the model.
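
For users who want to see what the Cook's Distance and Box-Cox options do mechanically, the sketch below reproduces both with statsmodels and scipy. It is a generic illustration, not the application's internal implementation; the file and column names are placeholders.

    # Sketch of Cook's Distance outlier flagging and a Box-Cox transform.
    # Not the application's implementation; names are placeholders.
    import pandas as pd
    import statsmodels.api as sm
    from scipy import stats

    df = pd.read_csv("wells.csv").dropna()
    X = sm.add_constant(df[["proppant_intensity", "lateral_length"]])
    y = df["b90_gas_equivalent"]

    # Cook's Distance: points with high leverage/influence on the fitted model.
    model = sm.OLS(y, X).fit()
    cooks_d = model.get_influence().cooks_distance[0]
    outliers = cooks_d > 4 / len(df)          # a common rule-of-thumb threshold
    print(df.loc[outliers].index.tolist())    # wells flagged for removal

    # Box-Cox: power transform that pushes a skewed variable toward normality.
    transformed, lam = stats.boxcox(df["b90_gas_equivalent"])
    print(f"Box-Cox lambda = {lam:.2f}")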

Context

Context selects the entity set used to train the model.  Here is an article on selecting an appropriate entity set for Completion Optimization.

To specify the Entity Set:

  • 1. Select the 'Context' menu item.
  • 2. Select the 'Load' dropdown list from 'Load Entity Set from Library'.
  • 3. Select the entity set of choice.
  • 4. By selecting the entity set, note that the name of the entity set is placed in the code area.
  • 5. Confirm the well count in the Entities area is as expected.
  • 6. Select Save.

Pro Tip: The entity set name can be typed directly in the code area (4).

Additional Pre-Training Actions

After creating the ML Model and setting the Pre-Training items above, there is still a pre-training review required to decide if any inputs need to be excluded, either because there are too many inputs given a small dataset or because of strong correlations between input variables.

The exclusion of correlated inputs is needed if any of the least squares algorithms are used, such as OLS and Poisson Regression. If the models being considered are non-linear or do not use a least squares objective function, then strongly correlated variables are not an issue.

Scatter Matrix

The Scatter Matrix is a useful tool to view 2-D correlations, distribution shapes, and outlier removal prior to training. If there is a need to reduce the number of inputs given a small dataset, the scatter matrix allows the user to select variables for omission based on near-zero correlation with the response variable (predicted variable).

To view the scatter matrix:

Select an existing ML model. The opening screen once selected should be the scatter matrix. It may take a short time to open if the entity set or variable set is large. If the opening screen does not present the scatter matrix:

  • 1. Select the 'Pre-Training' menu item.
  • 2. Select the 'Scatter Matrix' menu item

The scatter matrix has a standard layout:

  • 3. All variables are listed at the top and along the side of the matrix.
  • 4. Where a variable's column intersects its own row (the diagonal), the distribution shape of that variable is presented.
  • 5. X/Y scatter plots are shown at the intersection of two variables as identified by the column and row headings.
  • 6. A correlation coefficient is shown at the intersection of two variables as identified by column and row headings.
    • r, not r-squared
    • Ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation). 0 = no correlation.
    • The font size increases as the correlation coefficient approaches 1 or -1. The smallest fonts occur when the correlation coefficient approaches zero.
    • More stars indicate a higher degree of correlation.

In the example below:

  • A. Note the response variable "B90 Gas" occupies the top row of the matrix. In this row the correlations to individual inputs are presented. In this case Lateral Length has the highest individual correlation, but by a small margin over Stages, Proppant Intensity, and Number of Perf Clusters.
  • B. Note the high correlation between Proppant Intensity and Frac Fluid Intensity. At correlation coefficients generally above 0.8 (or below -0.8) there is a good chance one of these variables will need to be removed from the model due to collinearity.
  • C. Note the 2-D scatter plot for Proppant Intensity and Fluid Intensity. There may be one or two outliers, but probably not enough to cause the correlation coefficient to be misleading.
  • D. Note the data are presented without applying the preprocessing items, in this case Cook's Distance outlier removal and Box-Cox & Box-Tidwell transformations.

Pro Tip: The scatter matrix correlation coefficients provide a quick identification of those variables least likely to help the predictive model. The coefficients closest to zero in the top row above (response variable row, A) are candidates for removal.

Below is the same scatter plot with the pre-processing items applied (Cook's Distance outliers and Box-Cox & Box-Tidwell transformations).

  • Note the correlation coefficients have changed slightly along with the shape of the scatter plots.
  • Note that some of the distribution shapes have become more symmetrical, especially the response variable (B90 Gas).
  • The red dots in the scatter plots are points that will be removed from training as per the Cook's Distance formulas.

Pro Tip: Try training models with and without outlier removal and transformations and see for yourself if they make a difference in predictions. Remember, use the pre-training edit window, 'Training Options' menu item to turn those items on/off.
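
If a comparable view is needed outside the application, a similar matrix can be generated with pandas. The sketch below is a generic illustration with placeholder file and column names, not a reproduction of the screens above.

    # Sketch: a pandas scatter matrix comparable to the layout described above.
    # Distributions appear on the diagonal; X/Y scatter plots at the intersections.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("wells.csv")  # hypothetical export of the training data
    cols = ["b90_gas_equivalent", "lateral_length", "total_stages",
            "proppant_intensity", "frac_fluid_intensity"]

    pd.plotting.scatter_matrix(df[cols].dropna(), diagonal="hist", figsize=(10, 10))
    print(df[cols].corr(method="pearson").round(2))  # the r values shown in the matrix
    plt.show()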

Data Table

The Data option provides three visualizations:

  • Table
  • Stats
  • Charts

For the purpose of identifying variables to be excluded from the model, the 'Stats' table is most useful.

  • 1. Select the 'Data' menu item
  • 2. Select the 'Stats' menu item
  • 3. The number of entries allows confirmation that each variable has enough data points to provide a useful data set (# entities with data for all model variables).

Pro Tip: The training data set will only include wells with values for all variables. Therefore, if one variable is missing data for some wells, the training data set will shrink to only wells with "complete" data.

  • 4. and 5. The Min/Max helps QC the data set and provides likely Min/Max settings for the Evaluation Matrix.
  • 6. The Variance Inflation Factor (VIF) provides a quantitative ranking of which input variables should be removed from the model due to collinearity (correlation with each other). Typically, variables would be removed for VIF > 10. Notice in this example, Frac Fluid Intensity has already been removed, solving the VIF (collinearity) problem.

Pro Tip: For OLS (or any linear model using least squares as the objective function), correlated input variables (features) should be removed one at a time, removing the variable with the largest VIF. Saving the model after removing the variable will recompute the VIFs of the remaining variables. This may take several iterations until all variables have acceptable VIFs. A sketch of this iterative procedure follows below.
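
The iterative removal described in the Pro Tip above can also be reproduced outside the application. The sketch below uses the variance_inflation_factor function from statsmodels as an illustration of the general procedure; it is not the application's code, and the file and column names are placeholders.

    # Sketch: iteratively drop the feature with the largest VIF until all VIFs <= 10.
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    df = pd.read_csv("wells.csv").dropna()   # complete cases only, as noted above
    features = ["lateral_length", "total_stages", "proppant_intensity",
                "frac_fluid_intensity", "porosity"]

    while True:
        X = sm.add_constant(df[features])
        # VIF for each feature (skip the constant in column 0)
        vifs = {f: variance_inflation_factor(X.values, i + 1)
                for i, f in enumerate(features)}
        worst, worst_vif = max(vifs.items(), key=lambda kv: kv[1])
        if worst_vif <= 10:
            break
        print(f"dropping {worst} (VIF = {worst_vif:.1f})")
        features.remove(worst)

    print("retained features:", features)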

Train the ML Model

To access the training settings:

  • Select the 'Training' button in the upper right corner of the window.

The window below will pop up. Choose / enter the following settings:

  • 1. Leave 'Create Model per Entity' and 'Include Incomplete Cases' in the Off position.

Note: This example illustrates a special case of ML Model training, replicating a least squares regression process typically used in statistics. In this case, there is only one cross validation fold, no validation set and no test set. Thus, all data is used in a single model training pass.

Once you are familiar with ML model training options you can pursue other training techniques and algorithms by choosing different settings in the training pop-up window.

  • 2. Leave 'Auto ML' Off until you are ready to test other models. When in the Off position, the rest of the settings are default values to create a least squares linear model using all complete data (no data withheld for validation or blind testing).
  • 3. Accept default values.

Pro Tip: 4. and 5. To conform with a traditional least squares training process, L2 regularization defaults to zero. Higher values of L2 penalize the magnitude of the regression coefficients, reducing the sensitivity of the response variable to individual parameters and to the model overall. For Completion Optimization, our experience indicates an L2 of zero creates the best models. A sketch after step 6 below illustrates the effect of the L2 penalty.

  • 6. Select 'Train' to train the model.
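
To make the effect of the L2 setting (items 4 and 5 above) concrete, the sketch below fits the same placeholder data with several L2 penalty values using scikit-learn ridge regression, where alpha = 0 reproduces ordinary least squares. This is a generic illustration, not the application's training code.

    # Sketch: effect of an L2 penalty on regression coefficients (ridge regression).
    # alpha=0 reproduces ordinary least squares; larger alpha shrinks coefficients.
    import pandas as pd
    from sklearn.linear_model import Ridge

    df = pd.read_csv("wells.csv").dropna()
    X = df[["lateral_length", "total_stages", "proppant_intensity"]]
    y = df["b90_gas_equivalent"]

    for alpha in (0.0, 1.0, 10.0):          # alpha plays the role of the L2 setting
        coefs = Ridge(alpha=alpha).fit(X, y).coef_
        print(alpha, coefs.round(3))        # coefficients shrink as alpha grows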

Users advancing beyond the traditional OLS model training described above will want to try the various models and settings available in PetroVisor.

To access the other models:

  • 1. Set 'Auto ML' to On, which presents the window below.
  • 2. The 'Algorithms' dropdown list allows one or more models to be chosen and trained. All model training results will be presented at the end of training, with PetroVisor recommending the best model; the user is able to override that recommendation.
  • 3. Note, two additional items appear in this window due to 'Auto ML' being in the On position.

Pro Tip: See this article for an overview on additional ML models and features.

Training Results

With 'Auto ML' turned off and using the default settings, there will be only one trained model after selecting the 'Train' button. The one model is pre-selected.

Select the 'Save Selected Model' button to save the model.

Evaluate the Model

After saving the model, the Post Training and Playground tabs become available. There are several useful graphics to choose to understand model quality and the impact of individual inputs on the response variable.

Post-Training Tab

To access the post-training results:

  • 1. Select the 'Post-Training' menu item.
  • 2. Select the dropdown list just below the 'Post-Training' menu item.

There are six graphics in the dropdown list just below the 'Post-Training' menu item:

  • Summary - Feature importance bar chart and summary stats
  • Feature Importance - Table of various measures, including Feature Importance and P-value
  • Observed vs Predicted - Graph
  • Single Variable Effect - Graph
  • ICE Plots - Graph
  • PDP Plots - Graph

For completion optimization we will focus on the first four of these six graphics.

Post-Training - Summary

The Summary shows the most impactful features as the tallest bars. In the example below, Total Stages has the most influence on the model.

 

Post-Training - Feature Importance

The Feature Importance screen gives statistical information about each variable in the model.  Columns of note:

  • 1. Importance: What is used to scale the bar height in the Summary graphic.
  • 2. P-Value: The probability that the coefficient should be zero (i.e., that the parameter has no influence on the model).  P-value ranges from zero to 1.0. Generally, P-Values > 0.05 are used to exclude variables from the model. Use as a guide only, not a hard and fast rule.
  • 3. Power Transformation: The power transformation that was used to transform the variable into a more Gaussian (Normal) distribution shape. The variable is raised to this power.  
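
For readers who want to see where statistics like these come from, the sketch below fits an ordinary least squares model with statsmodels, prints each coefficient's p-value, and applies a simple power transformation to one variable. It is a generic illustration with placeholder names, not the application's calculation.

    # Sketch: coefficient p-values from an OLS fit, plus a simple power transformation.
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("wells.csv").dropna()
    X = sm.add_constant(df[["lateral_length", "total_stages", "proppant_intensity"]])
    y = df["b90_gas_equivalent"]

    result = sm.OLS(y, X).fit()
    print(result.pvalues.round(4))   # p > 0.05 is the usual guide for exclusion

    # A power transformation: the variable raised to an exponent (e.g. 0.5).
    df["total_stages_t"] = df["total_stages"] ** 0.5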


Observed vs. Predicted

Ideally, the best fit line would be at a 45° angle, the thin blue line on the graph below. The dashed line is a best fit line through the observed vs predicted response variable, in this case B90.  This is a somewhat poor predictive model which would likely improve significantly from the inclusion of geologic data.
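
A comparable cross-plot can be built for any exported set of observed and predicted values; the sketch below is a generic matplotlib example using placeholder files, not a reproduction of the graph in the application.

    # Sketch: observed vs. predicted cross-plot with a 45-degree reference line.
    import numpy as np
    import matplotlib.pyplot as plt

    observed = np.loadtxt("observed_b90.txt")    # placeholder files
    predicted = np.loadtxt("predicted_b90.txt")

    plt.scatter(observed, predicted, s=12)
    lims = [min(observed.min(), predicted.min()), max(observed.max(), predicted.max())]
    plt.plot(lims, lims, linewidth=1)            # the ideal 45-degree line
    slope, intercept = np.polyfit(observed, predicted, 1)
    plt.plot(lims, [slope * l + intercept for l in lims], linestyle="--")  # best fit
    plt.xlabel("Observed B90"); plt.ylabel("Predicted B90")
    plt.show()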

 

Single Variable Effect

The Single Variable Effect plot illustrates the impact of changes in a single variable on the response variable within a multivariable model. This graph is intended to provide the user with a visual understanding of which variables have the greatest impact on the response variable. Thus, before the evaluation matrix is constructed and economics are run, the user will already have a general idea of which variables will likely have the most impact on B90 and EUR and if those impacts are positive or negative. This is also a useful plot to compare various trained models to determine if those models result in similar or decidedly different completion optimization conclusions.

The graph is constructed using the trained model, setting all input variables to their mean value, and computing the response variable while varying only one input variable at a time across a range of values. The range of values chosen for each input variable corresponds to the range of values in the observed dataset, represented as % of the mean value (multiple of the mean). This creates a spider graph where all the curves intersect at an x-axis value of 1.0 (the mean of all values). The y-axis is the value of the response variable.

  • If a line has curvature, there was a power transformation. The greater the curvature, the larger the absolute value of the power transformation coefficient.
  • Intuitively, the steeper the slope of a line, the greater the % impact on the response variable for a given % change in the input parameter.
  • Placing the cursor over a line presents a tooltip window identifying the line and showing the x and y values.
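
The construction described above can be approximated for any fitted regression model: hold every input at its mean, sweep one input across a range of multiples of its mean, and record the predicted response. The sketch below is a generic illustration of that procedure for any model object with a predict method; it is not the application's implementation, and the column names are placeholders.

    # Sketch: single-variable-effect ("spider") curves around the mean of each input.
    import numpy as np
    import pandas as pd

    def single_variable_effect(model, df, features, sweep=np.linspace(0.5, 1.5, 21)):
        """For each feature, vary it as a multiple of its mean while holding the
        other features at their means, and return the predicted response."""
        means = df[features].mean()
        curves = {}
        for f in features:
            grid = pd.DataFrame([means] * len(sweep))
            grid[f] = means[f] * sweep            # x-axis: multiple of the mean
            curves[f] = model.predict(grid[features])
        return pd.DataFrame(curves, index=sweep)  # all curves cross at x = 1.0

    # Usage (assuming `model` is any fitted regressor with a .predict method):
    # curves = single_variable_effect(model, df, ["lateral_length", "total_stages"])
    # curves.plot()   # steeper lines = larger impact; curvature reflects power terms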