Create multi-variable regression models from PetroVisor static numeric signals
Link to MVR Summary Dashboard Article.
Usage
Create a multi-variable regression (MVR) model by updating and running the MVR Generate Model workflow. When the workflow finishes building the model, you are emailed a set of files that you can use to evaluate the model.
The predicted values for the model can be obtained by running the MVR Predict Values P# script that is included with the package.
To update the workflow, navigate to the Workflows page in PetroVisor and select the MVR Generate Model workflow. In the designer, you can change the input signals of the activity already added to the workflow. To generate additional models, add more instances of the MVR Generate Model activity to the workflow and configure them (see Details).
Overview
The Multi-Variable Regression package uses R to generate an MVR linear model from a set of input predictor signals and a response signal. You define the predictor variables (the X's) from a set of static numeric signals, and a single response variable (the Y), which must also be a static numeric signal. The package then automatically calculates the best linear model from a combination of the X's you define. The best model is the one with the highest adjusted R-squared.
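As an illustration of the selection metric, adjusted R-squared penalizes R-squared for the number of predictors, so adding a predictor only helps if it improves the fit enough. The following is a minimal sketch of the standard formula, not the package's R implementation; the function names and the structure of `candidates` are assumptions made for this example.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Standard adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def best_model(candidates, n_obs):
    """Return the candidate with the highest adjusted R-squared.

    candidates: list of (predictor_names, r2) tuples (hypothetical structure).
    """
    return max(
        candidates,
        key=lambda c: adjusted_r_squared(c[1], n_obs, len(c[0])),
    )


# Two hypothetical fits on 20 observations: the third predictor raises
# R-squared slightly, but not enough to offset the penalty.
candidates = [
    (("x1", "x2"), 0.800),
    (("x1", "x2", "x3"), 0.805),
]
winner = best_model(candidates, n_obs=20)
print(winner[0])
```

In this example the two-predictor model is selected, because its adjusted R-squared (about 0.776) is higher than that of the three-predictor model (about 0.768).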
The package incorporates a number of features to facilitate and improve model construction, including:
- automatic outlier detection and removal
- normalization
- automatic power transformation of the response variable (Box-Cox method) and of the predictors (Box-Tidwell method)
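The normalization and power-transformation features listed above can be sketched in a few lines. This is an illustrative sketch using textbook definitions, not the package's R code: the function names, the size of the positive shift, the use of the sample standard deviation, and the choice of the Box-Cox exponent `lam` are all assumptions (in practice the exponent is estimated from the data).

```python
import math
import statistics


def feature_scale(values, shift=0.0):
    """Min-max scale values to the range 0.0-1.0, then add an optional shift."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant column")
    return [(v - lo) / (hi - lo) + shift for v in values]


def z_transform(values):
    """Scale values to mean 0.0 and (sample) standard deviation 1.0."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def box_cox(values, lam):
    """Box-Cox power transform; defined only for strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


raw = [2.0, 4.0, 6.0]
scaled = feature_scale(raw)                  # smallest value maps to 0.0
shifted = feature_scale(raw, shift=0.01)     # assumed shift keeps all values > 0.0
transformed = box_cox(shifted, lam=0.5)      # works only on the shifted data
```

The sketch also shows why a positive shift matters: plain min-max scaling maps the smallest value to 0.0, which a Box-Cox transform cannot accept.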
Details
The R activity can accept up to ten signals as inputs, plus one signal as the response variable.
R Activity Input Arguments
- x1: any static numeric or static string signal. An MVR model must have at least an x1 input signal defined
- x2...x10: any static numeric or static string signal. Optional. If not used, assign the null static signal
- predictor_names: a comma-separated list of character strings that represent the names of the x1...x10 input signals. You must always specify 10 unique predictor names, whether they are used in the model or not. Note: must be alpha-numeric characters and underscores only, no spaces or special characters
- response_name: a character string representing the name of the response variable
- precheck_only: a boolean (use WorkspaceValue TrueString or FalseString) that defines whether you only want a model pre-check or a full MVR analysis. A model pre-check does a number of quality control checks on the data, but does not generate a model
- complete_cases_only: a boolean (use WorkspaceValue TrueString or FalseString) that defines whether you only want complete cases in your model. A complete case is where every predictor has a value. If you only want complete cases, set complete_cases_only = TRUE, and any incomplete rows are removed from the dataset before model construction
- normalize: defines whether you want to normalize the data prior to model construction. normalize has seven options:
  - Option 1: null – no normalization is done on the input dataset
  - Option 2: feature – use feature normalization to normalize the range of the data between 0.0 and 1.0. With normalize = feature, input parameters use_box_cox and use_box_tidwell are ignored, and assumed to be set to False
  - Option 3: feature_shift – (recommended) use feature normalization to normalize the range of the data between 0.0 and 1.0, and add a small positive shift to the data to ensure that all values are > 0.0. normalize = feature_shift is the best option for most situations
  - Option 4: standard – use standard z-transformation normalization to normalize the range of the data to a mean of 0.0 and a standard deviation of 1.0. With normalize = standard, input parameters use_box_cox and use_box_tidwell are ignored and assumed to be set to False
  - Option 5: standard_shift – use standard z-transformation normalization to normalize the range of the data to a mean of 0.0 and a standard deviation of 1.0, and add a small positive shift to the data to ensure that all values are > 0.0
  - Option 6: standard_ratio – use standard z-transformation normalization to normalize the range of the data to a mean of 0.0 and a standard deviation of 1.0, and then shift the data to preserve the max/min ratio of the raw data to ensure that all values are > 0.0
  - Option 7: all – special option to test all combinations of null, feature_shift, standard_shift, and standard_ratio to seek out the best model based on adjusted R². Can be used in conjunction with remove_outliers = all to examine all models
- remove_colinear_preds: defines whether you want to identify and remove collinear predictors. remove_colinear_preds has two options:
  - Option 1: null - no collinearity identification and removal is performed
  - Option 2: vif - (recommended) uses the Variance Inflation Factor (VIF) method to identify highly correlated predictors. Of those predictors whose VIF is greater than 5.0, all but the lowest VIF-valued predictor are removed from the model
- remove_outliers: defines whether you want outliers to be identified and removed from the input dataset before model construction. remove_outliers has four options:
  - Option 1: null - no outlier identification and removal is performed
  - Option 2: cooksd - identifies and removes outliers whose Cook's distance is greater than 4 times the mean Cook's distance of the entire dataset
  - Option 3: iqr - identifies and removes outliers from the dataset that fall outside 1.5 times the Inter-Quartile Range (IQR) of the data. remove_outliers = iqr is the best option for most situations
  - Option 4: all – (recommended) special option to test all combinations of null, cooksd and iqr to seek out the best possible model based on adjusted R². Can be used in conjunction with normalize = all to examine all models
- use_box_cox: a boolean (use WorkspaceValue TrueString or FalseString) that defines whether you want to use Box-Cox power transformation on the input response variable (y). Normally, an MVR model benefits from Box-Cox power transformation
- use_box_tidwell: a boolean (use WorkspaceValue TrueString or FalseString) that defines whether you want to use Box-Tidwell power transformations on the input predictor variables (x’s). Normally, an MVR model benefits from Box-Tidwell power transformations of the predictors
- mvr_model_entity: the name of the entity that will be used to store the information related to the model once it is built. Any entity can be used to store an MVR model, and the default name is "MVR Models"
- project: a character string representing the name of the project. project is optional. If you do not want to use a project, enter none
- sendto: a comma-separated list of email addresses to send the MVR results to. sendto is optional. If you do not want to include sendto email addresses, enter none
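Because several of these arguments accept only specific values, it can help to check a configuration before entering it in the workflow. The following is a hypothetical pre-flight check, not part of the package: the function name `check_settings` is invented for this sketch, and it assumes that "alpha-numeric" means ASCII letters and digits and that no spaces are tolerated around the commas.

```python
import re

# Option values as listed in the argument descriptions above.
NORMALIZE_OPTIONS = {
    "null", "feature", "feature_shift", "standard",
    "standard_shift", "standard_ratio", "all",
}
COLLINEAR_OPTIONS = {"null", "vif"}
OUTLIER_OPTIONS = {"null", "cooksd", "iqr", "all"}

# Assumption: only ASCII letters, digits, and underscores are allowed.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def check_settings(predictor_names, normalize, remove_colinear_preds, remove_outliers):
    """Return a list of problems; an empty list means the settings look valid."""
    problems = []

    # predictor_names is one comma-separated string; split strictly on commas.
    names = predictor_names.split(",")
    if len(names) != 10:
        problems.append(f"expected 10 predictor names, got {len(names)}")
    if len(set(names)) != len(names):
        problems.append("predictor names must be unique")
    bad = [n for n in names if not NAME_PATTERN.fullmatch(n)]
    if bad:
        problems.append("invalid predictor names: " + ", ".join(repr(n) for n in bad))

    # Each option argument must be one of its documented values.
    for label, value, allowed in (
        ("normalize", normalize, NORMALIZE_OPTIONS),
        ("remove_colinear_preds", remove_colinear_preds, COLLINEAR_OPTIONS),
        ("remove_outliers", remove_outliers, OUTLIER_OPTIONS),
    ):
        if value not in allowed:
            problems.append(f"{label} must be one of {sorted(allowed)}, got {value!r}")

    return problems


print(check_settings("a,b,c,d,e,f,g,h,i,j", "feature_shift", "vif", "iqr"))
```

A call with ten unique, well-formed names and documented option values returns an empty list; any other result lists what to fix.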
R Activity Output Arguments
- mvr_model_info: this must be a static string signal that will store a JSON string containing the model information for the entity named in the mvr_model_entity input above. Give the signal a descriptive name that uniquely identifies the MVR model (e.g., eur oil equivalent model). Remember to include the square brackets "[ ]" as the units when entering the signal in the workflow options.
The stored JSON string has the following format:
{
'x1 input':
{
'name': 'string',
'unit': 'string'
},
'x2 input':
{
'name': 'string',
'unit': 'string'
},
'x3 input':
{
'name': 'string',
'unit': 'string'
},
'x4 input':
{
'name': 'string',
'unit': 'string'
},
'x5 input':
{
'name': 'string',
'unit': 'string'
},
'x6 input':
{
'name': 'string',
'unit': 'string'
},
'x7 input':
{
'name': 'string',
'unit': 'string'
},
'x8 input':
{
'name': 'string',
'unit': 'string'
},
'x9 input':
{
'name': 'string',
'unit': 'string'
},
'x10 input':
{
'name': 'string',
'unit': 'string'
},
'y input':
{
'name': 'string',
'unit': 'string'
},
'y coefficients':
{
'record': 0,
'name': 'string',
'type': 'string',
'is_cat': false,
'a': 0,
'b': 0,
'shift': 0,
'scale': 0,
'add': 0,
'p': 0,
'min': 0,
'mean': 0,
'max': 0
},
'x1 coefficients':
{
'record': 0,
'name': 'string',
'type': 'string',
'is_cat': false,
'a': 0,
'b': 0,
'shift': 0,
'scale': 0,
'add': 0,
'p': 0,
'min': 0,
'mean': 0,
'max': 0
},
'x2 coefficients':
{
'record': 0,
'name': 'string',
'type': 'string',
'is_cat': false,
'a': 0,
'b': 0,
'shift': 0,
'scale': 0,
'add': 0,
'p': 0,
'min': 0,
'mean': 0,
'max': 0
},
'x3 coefficients':
{
'record': 0,
'name': 'string',
'type': 'string',
'is_cat': false,
'a': 0,
'b': 0,
'shift': 0,
'scale': 0,
'add': 0,
'p': 0,
'min': 0,
'mean': 0,
'max': 0
},
'x4 coefficients':
{
'record': 0,
'name': 'string',
'type': 'string',
'is_cat': false,
'a': 0,
'b': 0,
'shift': 0,
'scale': 0,
'add': 0,
'p': 0,
'min': 0,
'mean': 0,
'max': 0
},
'x5 coefficients':
{
'record': 0,
'name': 'string',
'type': 'string',
'is_cat': false,
'a': 0,
'b': 0,
'shift': 0,
'scale': 0,
'add': 0,
'p': 0,
'min': 0,
'mean': 0,
'max': 0
},
'x6 coefficients':
{
'record': 0,
'name': 'string',
'type': 'string',
'is_cat': false,
'a': 0,
'b': 0,
'shift': 0,
'scale': 0,
'add': 0,
'p': 0,
'min': 0,
'mean': 0,
'max': 0
},
'x7 coefficients':
{
'record': 0,
'name': 'string',
'type': 'string',
'is_cat': false,
'a': 0,
'b': 0,
'shift': 0,
'scale': 0,
'add': 0,
'p': 0,
'min': 0,
'mean': 0,
'max': 0
},
'x8 coefficients':
{
'record': 0,
'name': 'string',
'type': 'string',
'is_cat': false,
'a': 0,
'b': 0,
'shift': 0,
'scale': 0,
'add': 0,
'p': 0,
'min': 0,
'mean': 0,
'max': 0
},
'x9 coefficients':
{
'record': 0,
'name': 'string',
'type': 'string',
'is_cat': false,
'a': 0,
'b': 0,
'shift': 0,
'scale': 0,
'add': 0,
'p': 0,
'min': 0,
'mean': 0,
'max': 0
},
'x10 coefficients':
{
'record': 0,
'name': 'string',
'type': 'string',
'is_cat': false,
'a': 0,
'b': 0,
'shift': 0,
'scale': 0,
'add': 0,
'p': 0,
'min': 0,
'mean': 0,
'max': 0
},
'training entities': 'pipe delimited string',
'outlier entities': 'pipe delimited string'
}
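Once the string is stored, it can be read back and unpacked. The sketch below is illustrative only: it assumes the stored value parses as standard JSON (double-quoted keys and strings; the single-quoted listing above would not parse as-is), and the function name, sample signal names, units, and entity names are all hypothetical.

```python
import json


def read_model_info(json_text):
    """Extract input names, the response name, and entity lists from the model JSON."""
    info = json.loads(json_text)

    # Collect (name, unit) for each of the x1..x10 input slots that is present.
    predictors = {}
    for i in range(1, 11):
        entry = info.get(f"x{i} input")
        if entry is not None:
            predictors[f"x{i}"] = (entry["name"], entry["unit"])

    def split_entities(key):
        # Entity lists are stored as one pipe-delimited string.
        return [name for name in info.get(key, "").split("|") if name]

    return {
        "predictors": predictors,
        "response": (info["y input"]["name"], info["y input"]["unit"]),
        "training_entities": split_entities("training entities"),
        "outlier_entities": split_entities("outlier entities"),
    }


# Hypothetical, heavily trimmed sample; a real string carries all keys listed above.
sample = json.dumps({
    "x1 input": {"name": "lateral_length", "unit": "ft"},
    "y input": {"name": "eur_oil_equivalent", "unit": "bbl"},
    "training entities": "Well A|Well B|Well C",
    "outlier entities": "Well D",
})
summary = read_model_info(sample)
print(summary["training_entities"])
```

Splitting the two entity strings on the pipe character gives the list of entities used to train the model and the list removed as outliers.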
Data Requirements
There is no specific data required to run this package. The model can be created using any static numeric signals as inputs.
Further Comments
The MVR Generate Model workflow must be run in an Rserve cloud-based R environment.