The creditmodel package provides an efficient R tool suite for credit modeling, analysis, and visualization. It contains infrastructure functionality for data exploration and preparation, missing-value treatment, outlier treatment, variable derivation, variable selection, dimensionality reduction, grid search for hyperparameters, data mining and visualization, model evaluation, strategy analysis, and more. With creditmodel, you can build reliable predictive models (such as xgboost or a scorecard) and run data analysis on a standard laptop within minutes. This introductory vignette gives a brief look at the training_model module of the package.
When I first wrote the creditmodel package, its primary purpose was to make the development of binary classification models (machine-learning models as well as credit scorecards) simpler and faster. I therefore designed the package to build models automatically. However, as the package grew in functionality, this choice became increasingly problematic.
Importantly, the creditmodel package now provides a set of complementary tools with different missions.
Now, let’s start with quick modeling.
## -- Building -------------------------------------------------------------------------- UCICreditCard --
## -- Creating the model output file path ----------------------------------------------------------------
## -- Seting model output file path:
## * model      : C:\Users\28142\AppData\Local\Temp\RtmpgPrU8i/UCICreditCard/model
## * data       : C:\Users\28142\AppData\Local\Temp\RtmpgPrU8i/UCICreditCard/data
## * variable   : C:\Users\28142\AppData\Local\Temp\RtmpgPrU8i/UCICreditCard/variable
## * performance: C:\Users\28142\AppData\Local\Temp\RtmpgPrU8i/UCICreditCard/performance
## * predict    : C:\Users\28142\AppData\Local\Temp\RtmpgPrU8i/UCICreditCard/predict
## -- Checking datasets and target -----------------------------------------------------------------------
## -- Cleansing & Prepocessing data ----------------------------------------------------------------------
## -- Cleansing data
## -- Checking data and target format...
## -- Replacing null or blank or miss_values with NA
## -- Formating time variables
## -- Deleting low variance variables
## -- Processing NAs & special value rate is more than 0.98
## -- Transfering character variables which are actually numerical to numeric
## -- Removing duplicated observations
## -- Merging categories...
## -- Saving data_cleansing to:
## * C:\Users\28142\AppData\Local\Temp\RtmpgPrU8i/UCICreditCard/data/data_cleansing.csv
## -- Logarithmic transformation
## -- Following variables are log transformed:
## * LIMIT_BAL -> LIMIT_BAL_log
## * PAY_0     -> PAY_0_log
## * PAY_2     -> PAY_2_log
## * PAY_AMT1  -> PAY_AMT1_log
## * PAY_AMT2  -> PAY_AMT2_log
## * PAY_AMT3  -> PAY_AMT3_log
## * PAY_AMT4  -> PAY_AMT4_log
## * PAY_AMT5  -> PAY_AMT5_log
## * PAY_AMT6  -> PAY_AMT6_log
## -- Spliting train & test ------------------------------------------------------------------------------
## -- train_test_split:
## * Total: 30000 (100%)
## * Train: 21000 (70%)
## * Test : 9000 (30%)
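The split reported above (21,000 training rows and 9,000 test rows out of 30,000) is a 70/30 partition. As a rough illustration only, the base-R sketch below reproduces those counts with a simple random split. The helper name `split_train_test`, the toy data frame, and the seed are assumptions made for this example; this is not the package's own splitting code, which may use a different strategy.

```r
# Hypothetical helper (not from creditmodel): random train/test split by row.
split_train_test <- function(dat, prop = 0.7, seed = 1) {
  set.seed(seed)                                   # reproducible partition
  n <- nrow(dat)
  train_idx <- sample.int(n, size = round(prop * n))
  list(
    train = dat[train_idx, , drop = FALSE],
    test  = dat[-train_idx, , drop = FALSE]
  )
}

# Synthetic data with the same number of rows as in the log above.
toy <- data.frame(id = seq_len(30000), x = rnorm(30000))
parts <- split_train_test(toy, prop = 0.7)
c(train = nrow(parts$train), test = nrow(parts$test))
```

Fixing the seed keeps the partition stable across runs, which matters when train and test metrics are compared between iterations.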
## -- Processing outliers using Kmeans and LOF
## * LIMIT_BAL_log  0%  no_outlier
## * AGE    0%  no_outlier
## * PAY_0_log  0%  no_outlier
## * PAY_2_log  0%  no_outlier
## * PAY_3  0%  no_outlier
## * PAY_4  0%  no_outlier
## * PAY_5  0%  no_outlier
## * PAY_6  0%  no_outlier
## * BILL_AMT1  0%  no_outlier
## * BILL_AMT2  0%  no_outlier
## * BILL_AMT3  0%  no_outlier
## * BILL_AMT4  0%  no_outlier
## * BILL_AMT5  0%  no_outlier
## * BILL_AMT6  0%  no_outlier
## * PAY_AMT1_log   0%  no_outlier
## * PAY_AMT2_log   0%  no_outlier
## * PAY_AMT3_log   0%  no_outlier
## * PAY_AMT4_log   0%  no_outlier
## * PAY_AMT5_log   0%  no_outlier
## * PAY_AMT6_log   0%  no_outlier
## -- Saving data_outlier_proc to:
## * C:\Users\28142\AppData\Local\Temp\RtmpgPrU8i/UCICreditCard/data/data_outlier_proc.csv
## -- Filtering features ---------------------------------------------------------------------------------
## -- Feature filtering by IV
## -- Feature filtering by Correlation
## -- Saving feature_filter to:
## * C:\Users\28142\AppData\Local\Temp\RtmpgPrU8i/UCICreditCard/variable/feature_filter.csv
## -- Saving feature_filter_table to:
## * C:\Users\28142\AppData\Local\Temp\RtmpgPrU8i/UCICreditCard/variable/feature_filter_table.csv
## -- Training logistic regression model/scorecard -------------------------------------------------------
## -- Searching optimal binning & feature selection parameters -------------------------------------------
## [1]  train_ks:0.4226  test_ks:0.4161  psi:0.002
## * tree_control:{ p:0.02, cp:0.00000001, xval:5, maxdepth:15 }
## * bins_control:{ bins_num:10, bins_pct:0.02, b_chi:0.03, b_odds:0.1, b_psi:0.02, b_or:0.2, mono:0.1, odds_psi:0.1, kc:1 }
## * thresholds:{ cor_p:0.8, iv_i:0.02, psi_i:0.1, cos_i:0.5 }
## [2]  train_ks:0.4246  test_ks:0.4194  psi:0
## * tree_control:{ p:0.02, cp:0.00001, xval:5, maxdepth:10 }
## * bins_control:{ bins_num:10, bins_pct:0.05, b_chi:0.01, b_odds:0.1, b_psi:0.06, b_or:0.15, mono:0.5, odds_psi:0.2, kc:1 }
## * thresholds:{ cor_p:0.8, iv_i:0.02, psi_i:0.1, cos_i:0.5 }
## -- [best iter] ----------------------------------------------------------------------------------------
## [2]  train_ks:0.4246 test_ks:0.4194  psi:0
## * tree_control:{ p:0.02, cp:0.00001, xval:5, maxdepth:10 }
## * bins_control:{ bins_num:10, bins_pct:0.05, b_chi:0.01, b_odds:0.1, b_psi:0.06, b_or:0.15, mono:0.5, odds_psi:0.2, kc:1 }
## * thresholds:{ cor_p:0.8, iv_i:0.02, psi_i:0.1, cos_i:0.5 }
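Each search iteration above reports `train_ks` and `test_ks`. The Kolmogorov-Smirnov (KS) statistic is the largest gap between the cumulative distributions of predicted scores for bad and good accounts. The following is a minimal base-R sketch of that definition, assuming a binary target coded 1 = bad and 0 = good. The function `ks_stat` and the toy vectors are hypothetical and are not taken from creditmodel, whose implementation may differ.

```r
# Hypothetical helper (not from creditmodel): KS statistic for a binary target.
# prob   : predicted probability (or score) of being "bad"
# target : 1 = bad, 0 = good
ks_stat <- function(prob, target) {
  ord <- order(prob, decreasing = TRUE)
  prob <- prob[ord]
  target <- target[ord]
  cum_bad  <- cumsum(target == 1) / sum(target == 1)
  cum_good <- cumsum(target == 0) / sum(target == 0)
  # Evaluate the gap only after the last row of each tied score,
  # so that tied scores are treated as a single threshold.
  at_threshold <- !duplicated(prob, fromLast = TRUE)
  max(abs(cum_bad - cum_good)[at_threshold])
}

# Toy example with synthetic values.
prob   <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
target <- c(1,   1,   0,   1,   0,   0,   1,   0)
ks_toy <- ks_stat(prob, target)
ks_toy
```

A small gap between train and test KS, as in both iterations of the log, suggests the model separates the two classes about equally well on unseen data.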
## -- Constrained optimal binning of varibles ------------------------------------------------------------
## -- Getting optimal binning breaks
## * PAY_0_log: -0.5,0.346573590279972,Inf
## * PAY_2_log: -0.5,0.346573590279972,Inf
## * PAY_3: -1,0,Inf
## * PAY_4: -1,0,Inf
## * PAY_5: -1,1,Inf
## * PAY_6: -1,1,Inf
## * LIMIT_BAL_log: 10.7082065087532,11.2230162173437,11.8838941373349,12.4088051995843,Inf
## * PAY_AMT1_log: 1.49786613677699,8.42343166081953,9.6010640182797,Inf
## * PAY_AMT2_log: 4.51079912368965,7.39602845120143,8.51228114200585,Inf
## * PAY_AMT3_log: 2.86179255097619,8.51147683890271,9.85702382079673,Inf
## * PAY_AMT5_log: 1.70059869083108,5.95971501632653,7.56553428216372,7.97796803432616,8.50603087617906,9.60214591732222,Inf
## * PAY_AMT4_log: 1.70059869083108,8.4766835992804,Inf
## * PAY_AMT6_log: 0.346573590279972,7.31421983197877,8.28727672413841,9.77109771291889,Inf
## -- Saving breaks_list.breaks_list to:
## * C:\Users\28142\AppData\Local\Temp\RtmpgPrU8i/UCICreditCard/variable/LR/breaks_list.breaks_list.csv
## -- Filtering variables by IV & PSI --------------------------------------------------------------------
## -- Selecting variables by PSI & IV
## -- Calculating PSI
## --PAY_0_log
## * PSI: 0  -->  Very stable
## --PAY_2_log
## * PSI: 0  -->  Very stable
## --PAY_3
## * PSI: 0  -->  Very stable
## --PAY_4
## * PSI: 0  -->  Very stable
## --PAY_5
## * PSI: 0  -->  Very stable
## --PAY_6
## * PSI: 0  -->  Very stable
## --LIMIT_BAL_log
## * PSI: 0  -->  Very stable
## --PAY_AMT1_log
## * PSI: 0  -->  Very stable
## --PAY_AMT2_log
## * PSI: 0  -->  Very stable
## --PAY_AMT3_log
## * PSI: 0  -->  Very stable
## --PAY_AMT5_log
## * PSI: 0  -->  Very stable
## --PAY_AMT4_log
## * PSI: 0  -->  Very stable
## --PAY_AMT6_log
## * PSI: 0  -->  Very stable
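The PSI (population stability index) values above measure how much a variable's binned distribution differs between two samples. A common definition is PSI = sum((a_i - e_i) * ln(a_i / e_i)) over bins, where e_i and a_i are the expected and actual bin shares. The sketch below implements that definition in base R. The function `psi_stat`, the `eps` floor for empty bins, and the toy data are assumptions for illustration. It also assumes that break lists such as `-1,0,Inf` are upper bin edges, so it prepends `-Inf`. The package's own calculation may differ in detail.

```r
# Hypothetical helper (not from creditmodel): population stability index.
# upper_edges is assumed to hold upper bin bounds, with Inf as the last one.
psi_stat <- function(expected, actual, upper_edges, eps = 1e-6) {
  breaks <- c(-Inf, upper_edges)
  e <- prop.table(table(cut(expected, breaks = breaks)))
  a <- prop.table(table(cut(actual,   breaks = breaks)))
  e <- pmax(as.numeric(e), eps)           # floor empty bins to avoid log(0)
  a <- pmax(as.numeric(a), eps)
  sum((a - e) * log(a / e))
}

# Synthetic samples: the first is split 50/50 across two bins, the second 40/60.
train_x <- c(rep(-1, 5), rep(2, 5))
test_x  <- c(rep(-1, 4), rep(2, 6))
psi_same  <- psi_stat(train_x, train_x, upper_edges = c(0, Inf))
psi_shift <- psi_stat(train_x, test_x,  upper_edges = c(0, Inf))
c(same = psi_same, shifted = psi_shift)
```

Identical distributions give a PSI of exactly zero, which matches the "Very stable" label the log assigns to every variable here.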
## -- Calculating IV
## --PAY_0_log
## * IV: 0.711  -->  Very Strong
## --PAY_2_log
## * IV: 0.552  -->  Very Strong
## --PAY_3
## * IV: 0.413  -->  Very Strong
## --PAY_4
## * IV: 0.364  -->  Very Strong
## --PAY_5
## * IV: 0.337  -->  Very Strong
## --PAY_6
## * IV: 0.293  -->  Strong
## --LIMIT_BAL_log
## * IV: 0.181  -->  Strong
## --PAY_AMT1_log
## * IV: 0.187  -->  Strong
## --PAY_AMT2_log
## * IV: 0.155  -->  Strong
## --PAY_AMT3_log
## * IV: 0.127  -->  Strong
## --PAY_AMT5_log
## * IV: 0.102  -->  Strong
## --PAY_AMT4_log
## * IV: 0.103  -->  Strong
## --PAY_AMT6_log
## * IV: 0.098  -->  Medium
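The IV (information value) figures above summarize how strongly each binned variable separates bad from good accounts. In the usual formulation, each bin gets a weight of evidence (WOE), the log ratio of its share of bads to its share of goods, and IV is the sum over bins of the share difference times WOE. The base-R sketch below follows that formulation. The function `woe_iv`, the sign convention (bad over good), the `eps` adjustment for zero counts, and the toy data are all assumptions for illustration and may not match creditmodel's internals.

```r
# Hypothetical helper (not from creditmodel): per-bin WOE and total IV.
# target: 1 = bad, 0 = good; upper_edges: upper bin bounds ending in Inf.
woe_iv <- function(x, target, upper_edges, eps = 0.5) {
  bins <- cut(x, breaks = c(-Inf, upper_edges))
  tab  <- table(bins, factor(target, levels = c(0, 1)))
  good <- pmax(as.numeric(tab[, "0"]), eps)   # floor zero counts so log() is finite
  bad  <- pmax(as.numeric(tab[, "1"]), eps)
  pct_good <- good / sum(good)
  pct_bad  <- bad / sum(bad)
  woe <- log(pct_bad / pct_good)              # assumed convention: bad over good
  list(
    bins = data.frame(bin = rownames(tab), good = good, bad = bad, woe = woe),
    iv   = sum((pct_bad - pct_good) * woe)
  )
}

# Synthetic data: bin 1 has 10 bads / 90 goods, bin 2 has 40 bads / 60 goods.
x      <- c(rep(-1, 100), rep(2, 100))
target <- c(rep(1, 10), rep(0, 90), rep(1, 40), rep(0, 60))
res <- woe_iv(x, target, upper_edges = c(0, Inf))
res$bins
res$iv
```

IV does not depend on the WOE sign convention, because flipping the ratio flips both factors in each term of the sum.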
## -- Saving feature.IV_PSI to:
## * C:\Users\28142\AppData\Local\Temp\RtmpgPrU8i/UCICreditCard/variable/LR/feature.IV_PSI.csv
## -- Saving feature.PSI to:
## * C:\Users\28142\AppData\Local\Temp\RtmpgPrU8i/UCICreditCard/variable/LR/feature.PSI.csv
## -- Saving feature.IV to:
## * C:\Users\28142\AppData\Local\Temp\RtmpgPrU8i/UCICreditCard/variable/LR/feature.IV.csv
## -- Saving LR.IV_PSI_features to:
## * C:\Users\28142\AppData\Local\Temp\RtmpgPrU8i/UCICreditCard/variable/LR/LR.IV_PSI_features.csv
## -- Transforming WOE -----------------------------------------------------------------------------------
## -- Transforming variables to woe
## -- Saving lr_train.dat.woe to:
## * C:\Users\28142\AppData\Local\Temp\RtmpgPrU8i/UCICreditCard/data/LR/lr_train.dat.woe.csv
## -- Filtering variables by correlation -----------------------------------------------------------------
## -- Processing bins table
## * PAY_0_log IV: 0.712 PSI: 0
## * PAY_2_log IV: 0.552 PSI: 0
## * PAY_3 IV: 0.413 PSI: 0
## * PAY_4 IV: 0.364 PSI: 0
## * PAY_5 IV: 0.337 PSI: 0
## * PAY_6 IV: 0.293 PSI: 0
## * PAY_AMT1_log IV: 0.187 PSI: 0
## * LIMIT_BAL_log IV: 0.181 PSI: 0
## * PAY_AMT2_log IV: 0.155 PSI: 0
## * PAY_AMT3_log IV: 0.127 PSI: 0
## * PAY_AMT4_log IV: 0.102 PSI: 0
## * PAY_AMT5_log IV: 0.102 PSI: 0
## * PAY_AMT6_log IV: 0.097 PSI: 0
## -- Filtering variables by LASSO -----------------------------------------------------------------------
## Saving 8 x 5 in image
## -- Saving lr_premodel_features to:
## * C:\Users\28142\AppData\Local\Temp\RtmpgPrU8i/UCICreditCard/variable/LR/lr_premodel_features.csv
## -- Start training lr model ----------------------------------------------------------------------------
## -- Saving lr_model_features to:
## * C:\Users\28142\AppData\Local\Temp\RtmpgPrU8i/UCICreditCard/variable/LR/lr_model_features.csv
## -- Saving UCICreditCard.lr_coef to:
## * C:\Users\28142\AppData\Local\Temp\RtmpgPrU8i/UCICreditCard/performance/LR/UCICreditCard.lr_coef.csv
## -- Generating standard socrecard ----------------------------------------------------------------------
## -- Using scorecard to predict the train and test
## -- Saving lr_train_score to:
## * C:\Users\28142\AppData\Local\Temp\RtmpgPrU8i/UCICreditCard/predict/LR/lr_train_score.csv
## -- Saving lr_test_score to:
## * C:\Users\28142\AppData\Local\Temp\RtmpgPrU8i/UCICreditCard/predict/LR/lr_test_score.csv
## -- Saving lr_train_prob to:
## * C:\Users\28142\AppData\Local\Temp\RtmpgPrU8i/UCICreditCard/predict/LR/lr_train_prob.csv
## -- Saving lr_test_prob to:
## * C:\Users\28142\AppData\Local\Temp\RtmpgPrU8i/UCICreditCard/predict/LR/lr_test_prob.csv
## -- Producing plots that characterize performance of scorecard
## Saving 12 x 5 in image
## -- Saving UCICreditCard.LR.performance_table to:
## * C:\Users\28142\AppData\Local\Temp\RtmpgPrU8i/UCICreditCard/performance/LR/UCICreditCard.LR.performance_table.csv
## -- Saving LR.params to:
## * C:\Users\28142\AppData\Local\Temp\RtmpgPrU8i/UCICreditCard/performance/LR/LR.params.csv
In a few minutes, the program completed data cleaning and preprocessing, variable screening, and the development and evaluation of four models: scorecard, XGBoost, GBDT, and random forest.
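The scorecard step in the log turns the fitted logistic regression into points. One widely used scaling maps the odds of good versus bad to a score so that a chosen base odds corresponds to a base score and every doubling of the odds adds a fixed number of points (PDO, "points to double the odds"). The sketch below shows that scaling in base R. The function `prob_to_score` and the values for `base_score`, `base_odds`, and `pdo` are illustrative assumptions, not the settings creditmodel used in the run above.

```r
# Hypothetical helper (not from creditmodel): map a predicted probability of
# "bad" to scorecard points. The default parameters are illustrative only.
prob_to_score <- function(p_bad, base_score = 600, base_odds = 50, pdo = 20) {
  points_per_log_odds <- pdo / log(2)             # points per unit of log-odds
  offset <- base_score - points_per_log_odds * log(base_odds)
  good_odds <- (1 - p_bad) / p_bad                # odds of good versus bad
  offset + points_per_log_odds * log(good_odds)
}

# Odds of 50:1 and 100:1 in favour of "good", and an even chance.
p_bad  <- c(1 / 51, 1 / 101, 0.5)
scores <- prob_to_score(p_bad)
round(scores, 1)
```

With these assumed parameters, the first account sits at the base score, and the second, whose odds are twice as high, scores exactly one PDO above it.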