Churn Dataset Analysis

Giorgio Garziano

Contesto

Un fornitore di servizi di musica in streaming vorrebbe predire se un utente non sottoscriverà piu’ alcuno dei loro servizi dopo 30 giorni dalla scadenza della sottoscrizione corrente (churn).

I sottoscrittori possono scegliere di effettuare un rinnovo esplicito o automatico della loro sottoscrizione attiva.

Un utente puo’ cancellare la sottoscrizione perche’ vuole cambiare il proprio piano di sottoscrizione o per altre ragioni.

Il criterio di “churn” consiste nel non sottoscrivere alcun nuovo servizio entro 30 giorni dalla scadenza di una sottoscrizione già effettuata o dalla sua esplicita cancellazione.

Notare che il campo “is_cancel” indica se l’utente ha effettuato una cancellazione esplicitamente, ma non e’ conclusivo sul fatto che cio’ implichi un “churn”.

Dati di partenza

churn.tab: tabella dei clienti

  1. msno: identificativo cliente
  2. is_churn: indicatore di churn

Dati di partenza (cont.)

members.tab: informazioni generali sugli utenti

  1. msno: identificativo cliente
  2. city: città
  3. age: età
  4. gender: genere
  5. registered_via: metodo di registrazione
  6. registration_init_time: data di inizio sottoscrizione

Dati di partenza (cont.)

transactions.tab: tabella delle transazioni

  1. msno: identificativo cliente
  2. payment_method_id: metodo di pagamento
  3. payment_plan_days: durata della sottoscrizione scelta
  4. plan_list_price: piano di pagamento
  5. actual_amount_paid: pagamenti effettuati
  6. is_auto_renew: indicatore di autorinnovo
  7. transaction_date: data di transazione
  8. membership_expire_date: data di scadenza della sottoscrizione
  9. is_cancel: indicatore del fatto che l’utente abbia cancellato la sottoscrizione nella sua storia di utente di questo servizio

Dati di partenza (cont.)

user_logs.tab: tabella di log attività utenti che descrive comportamento giorno per giorno

  1. msno: identificativo cliene
  2. date: data
  3. num_25: numero di canzoni ascoltate per meno del loro 25% della loro durata
  4. num_50: numero di canzoni ascoltate tra il 25% ed il 50% della loro durata
  5. num_75: numero di canzoni ascoltate tra il 50% ed il 75% della loro durata
  6. num_985: numero di canzoni ascoltate tra il 75% ed il 98.5% della loro durata
  7. num_100: numero di canzoni ascoltate per piu’ del 98.5% della loro durata
  8. num_unq: numero di canzoni ascoltate una volta sola

Goals

Analisi esplorativa

L’analisi esplorativa e’ suddivisa nei seguenti passi:

Individuazione dei predittori

L’individuazione dei potenziali predittori ha utilizzato i seguenti strumenti:

Modelli di Machine Learning

Modelli “white-box”:

Modelli “black-box”:

Possibili cause del churn

I clienti che risultano piu’ propensi al churn sono quelli che hanno sottoscritto i piani di abbonamento piu’ costosi, e tendenzialmente senza l’opzione di auto-rinnovo.

Inoltre, il tipo di pagamento scelto e il numero di transazioni effettuate sembrano avere una certa importanza.

Metriche della Confusion Matrix

   Predicted        Observed

                   CF              CT
             +-----------------------------+
             |              |              |
          CF |      A       |      B       |
             |              |              |
             +-------------+---------------+
             |              |              |
          CT |      C       |      D       |
             |              |              |
             +-----------------------------+

Positive Class: “CF”

Accuracy := (A + D) / (A + B + C + D)

Sensitivity := A / (A + C)

Specificity := D / (B + D)

PPV := A / (A + B)

NPV := D / (C + D)

Metriche della Confusion Matrix (cont.)

   Predicted        Observed

                   CF              CT
             +-----------------------------+
             |              |              |
          CF |      A       |      B       |
             |              |              |
             +-------------+---------------+
             |              |              |
          CT |      C       |      D       |
             |              |              |
             +-----------------------------+

Prevalence := (A + C) / (A + B + C + D)

DetectionRate := A / (A + B + C + D)

Detection Prevalence := (A + B) / (A + B + C + D)

BalancedAccuracy := (Sensitivity + Specificity) / 2

Metriche della Confusion Matrix (cont.)

   Predicted        Observed

                   CF              CT
             +-----------------------------+
             |              |              |
          CF |      A       |      B       |
             |              |              |
             +-------------+---------------+
             |              |              |
          CT |      C       |      D       |
             |              |              |
             +-----------------------------+

No Information Rate := (A + C)/(A + B + C + D), accuratezza se si rispondesse sempre CF

P-Value [Acc > NIR] := p-value per la null-hypothesis che l’accuratezza non sia maggiore del No Information Rate

Kappa di Cohen := misura di quanto meglio si classifica rispetto al classificatore randomico; chiave di lettura:

            [0,0.2]: scarso
            [0.2,0.4]: discreto 
            [0.4,0.6]: buono
            [0.6,0.8]: molto buono 
            [0.8, 1]: ottimo

Mcnemar’s Test P-Value := p-value per la null hypothesis che FP == FN

Metriche della campagna di retention

   Predicted        Observed

                   CF              CT
             +-----------------------------+
             |              |              |
          CF |      A       |      B       |
             |              |              |
             +-------------+---------------+
             |              |              |
          CT |      C       |      D       |
             |              |              |
             +-----------------------------+

Workflow

  1. KPI (Key Performance Indicator) ricevuti dal Business Management/Marketing:

    1. valore di churn raggiunto a seguito della campagna di retention dei clienti

    2. frazione massima di clienti sul totale dell’intero portafoglio che si intende contattare durante la campagna di retention

    3. efficienza minima desiderata della campagna di retention, percentuale di clienti churn sul totale di quelli contattati

    4. rendimento atteso di conversione del cliente da churn a non churn

  2. Dati dall’Information Technology:

    1. dati storici con label churn/no churn sulla base del quale costruire il modello di supervised learning

    2. dati di portafoglio clienti corrente

  3. Costruzione e regolazione dei modelli

    1. definizione di uno o piu’ modelli di statistical learning

    2. regolazione in termini dei KPI (a.1, a.2, a.3, a.4)

    3. selezione dei modelli candidati

  4. Esecuzione della predizione con i dati di portafoglio clienti corrente

Strumenti utilizzati

Si e’ fatto uso del linguaggio R in ambiente RStudio su Windows. Le funzionalità utilizzate sono fornite da 9 librerie R (“packages”).

Questa presentazione viene generata per mezzo della stesura di un file R markdown composto sia da codice che testo. I grafici sono generati dal codice incluso.

La compilazione del file R markdown produce un file HTML che puo’ essere visualizzato in modalità di presentazione a slide.

##      packages version
## 1       dplyr   0.7.8
## 2     ggplot2   3.1.0
## 3       caret  6.0.81
## 4   gridExtra     2.3
## 5    corrplot    0.84
## 6         rms   5.1.2
## 7        ROCR   1.0.7
## 8  rpart.plot   3.0.6
## 9   lubridate   1.7.4
## 10       boot  1.3.20

Analisi Esplorativa

df_churn <- read.delim("churn.tab", sep = "\t", stringsAsFactors = FALSE)
dim(df_churn)
## [1] 37757     2
head(df_churn)
##                                           msno is_churn
## 1 E1vsPZT5WGaqbFaXAV2Flx5LUfuMUHW8NoYSypf7oJU=        0
## 2 EMpiEFU1Fiv2LM714kYqvJcz/xp5qeNWbRheNZyExYc=        0
## 3 Fk7Qb26rnUGm8jeenM25HnnhteiGT9xjN6jDpOqdrck=        0
## 4 CLtKi0Tlx7XzUpy/Z8l979QBURHsOmvLU821wWclWrw=        0
## 5 ykKt0hHAFeJzXg2xXUrOKlHLyQEMoyjLFAT4odSLvP0=        0
## 6 nGu4EyiVLPy96WpruK4CoeZbCxS7becR9CWQVDbtuxY=        0

length(unique(df_churn$msno))/nrow(df_churn)
## [1] 1
sum(complete.cases(df_churn))/nrow(df_churn)
## [1] 1
sum(df_churn$is_churn)/nrow(df_churn)
## [1] 0.1292211
df_churn$is_churn <- factor(df_churn$is_churn)
levels(df_churn$is_churn) <- c("CF", "CT")

df_members <- read.delim("members.tab", sep = "\t", stringsAsFactors = FALSE)
dim(df_members)
## [1] 37761     6
colnames(df_members)
## [1] "msno"                   "city"                  
## [3] "age"                    "gender"                
## [5] "registered_via"         "registration_init_time"
head(df_members)
##                                           msno city age gender
## 1 E1vsPZT5WGaqbFaXAV2Flx5LUfuMUHW8NoYSypf7oJU=   22  43   male
## 2 EMpiEFU1Fiv2LM714kYqvJcz/xp5qeNWbRheNZyExYc=   12  27 female
## 3 Fk7Qb26rnUGm8jeenM25HnnhteiGT9xjN6jDpOqdrck=    6  46 female
## 4 CLtKi0Tlx7XzUpy/Z8l979QBURHsOmvLU821wWclWrw=   15  45   male
## 5 ykKt0hHAFeJzXg2xXUrOKlHLyQEMoyjLFAT4odSLvP0=    8  22   male
## 6 nGu4EyiVLPy96WpruK4CoeZbCxS7becR9CWQVDbtuxY=   15  44   male
##   registered_via registration_init_time
## 1              7             2013/06/04
## 2              3             2013/04/06
## 3              3             2013/01/31
## 4              7             2016/03/07
## 5              3             2013/08/21
## 6              9             2007/01/13

length(unique(df_members$msno))/nrow(df_members)
## [1] 0.9998941
sum(complete.cases(df_members))/nrow(df_members)
## [1] 1

Msno clean-up

mem_dup <- duplicated(df_members$msno)
mem_dup_pos <- which(mem_dup == TRUE)
dup_list <- lapply(mem_dup_pos, function(x) {
  which(df_members$msno == df_members[x,]$msno)
  })

to_del <- sapply(dup_list, function(x) { x[2]})
df_members <- df_members[setdiff(1:nrow(df_members),to_del),]
dim(df_members)
## [1] 37757     6

City

length(unique(df_members$city))
## [1] 21
do.call(rbind, lapply(split(df_members, df_members$city), nrow))
##    [,1]
## 1  1101
## 3   451
## 4  4067
## 5  6191
## 6  2239
## 7   215
## 8   654
## 9   792
## 10  603
## 11  774
## 12 1060
## 13 8471
## 14 1850
## 15 3770
## 16   92
## 17  457
## 18  682
## 19   11
## 20   62
## 21  463
## 22 3752

df_members$city <- factor(df_members$city)

unique(df_members$registered_via)
## [1]  7  3  9  4 13
df_members$registered_via <- factor(df_members$registered_via)

table(df_members$city, df_members$registered_via)
##     
##         3    4    7    9   13
##   1   472  312  249   66    2
##   3    94   22  107  228    0
##   4   916  266  842 2037    6
##   5  1329  493 1255 3114    0
##   6   559  172  315 1191    2
##   7    49    8   35  123    0
##   8   158   40  143  312    1
##   9   160   43  174  413    2
##   10  140   42  149  272    0
##   11  202   59   83  429    1
##   12  219   73  210  557    1
##   13 1778  404 1241 5043    5
##   14  338  102  477  932    1
##   15  942  243  556 2025    4
##   16   22    2   12   56    0
##   17  102   23  101  230    1
##   18  133   45  132  372    0
##   19    2    0    7    2    0
##   20    6    2   18   36    0
##   21  113   30   79  241    0
##   22  847  314  561 2028    2

ggplot(data=df_members, aes(x=city)) + geom_histogram(stat="count", fill='palegreen4') + guides(fill=FALSE)

ggplot(data=df_members, aes(x=city)) + geom_histogram(stat="count", fill='palegreen4') + guides(fill=FALSE) + 
facet_grid( ~ registered_via) + coord_flip()

Age

summary(df_members$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   24.00   28.00   30.26   34.00 1421.00
do.call(rbind, lapply(split(df_members, df_members$age), nrow))
##      [,1]
## 2       1
## 4       1
## 5       1
## 6       1
## 7       2
## 11      1
## 12      2
## 13     10
## 14     19
## 15     73
## 16    262
## 17    560
## 18    805
## 19    938
## 20   1325
## 21   1482
## 22   1728
## 23   1891
## 24   2039
## 25   2027
## 26   2230
## 27   2330
## 28   1933
## 29   1876
## 30   1674
## 31   1522
## 32   1340
## 33   1206
## 34   1192
## 35   1084
## 36   1013
## 37    850
## 38    763
## 39    642
## 40    527
## 41    512
## 42    421
## 43    371
## 44    312
## 45    296
## 46    292
## 47    291
## 48    241
## 49    221
## 50    178
## 51    187
## 52    174
## 53    165
## 54    105
## 55     98
## 56     78
## 57     72
## 58     54
## 59     54
## 60     39
## 61     31
## 62     23
## 63     27
## 64     20
## 65     15
## 66     13
## 67      9
## 68     11
## 69      8
## 70      5
## 71      1
## 72      4
## 73      4
## 74      5
## 75      1
## 77      2
## 78      2
## 79      2
## 80      2
## 81      1
## 83      1
## 84      3
## 87      3
## 88      1
## 90      1
## 91      3
## 92      1
## 93      1
## 95      3
## 99      2
## 100     2
## 101     1
## 102     1
## 103     3
## 104     1
## 105     1
## 106     8
## 107     2
## 112     2
## 117     6
## 126     1
## 153     1
## 948     1
## 951     1
## 1017    1
## 1031    1
## 1032    1
## 1033    1
## 1036    1
## 1037    2
## 1048    1
## 1051    1
## 1421    1

too_old <- which(df_members$age > 100)
length(too_old)
## [1] 39
if (length(too_old) > 0) {
  df_members[too_old,]$age <- NA
}

too_young <- which(df_members$age < 10)
length(too_young)
## [1] 6
if (length(too_young) > 0) {
   df_members[too_young,]$age <- NA
}

summary(df_members$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   11.00   24.00   28.00   29.88   34.00  100.00      45
boxplot(df_members$age, col = 'palegreen4')

ggplot(data=df_members, aes(y=age, x=city)) + geom_boxplot(fill='palegreen4') + guides(fill=FALSE)

ggplot(data=df_members, aes(y=age, x=registered_via)) + geom_boxplot(fill='palegreen4') + guides(fill=FALSE)

Gender

unique(df_members$gender)
## [1] "male"   "female" ""
no_gender <- which(df_members$gender %in% c(" ", "", "  "))
if (length(no_gender) > 0) df_members[no_gender,]$gender <- NA
df_members$gender <- factor(df_members$gender)

ggplot(data=subset(df_members, !is.na(gender)), aes(y=age, x=gender)) + geom_boxplot(aes(fill=gender)) + guides(fill=FALSE) + scale_fill_manual(values=c("#FF00AA", "#0000FF"))

ggplot(data=subset(df_members, !is.na(gender)), aes(x=gender)) + geom_bar(position = "dodge", aes(fill=gender)) + guides(fill=FALSE) + scale_fill_manual(values=c("#FF00AA", "#0000FF"))

ggplot(data=subset(df_members, !is.na(gender)), aes(x=gender)) + geom_bar(position = "dodge", aes(fill=gender)) + guides(fill=FALSE)+ scale_fill_manual(values=c("#FF00AA", "#0000FF")) + facet_grid( ~ registered_via)

df_trans <- read.delim("transaction.tab", sep = "\t", stringsAsFactors = FALSE)
dim(df_trans)
## [1] 46079     9
head(df_trans)
##                                        ï..msno payment_method_id
## 1 ++/ZHqwUNa7U21Qz+zqteiXlZapxey86l6eEorrak/g=                14
## 2 ++/ZHqwUNa7U21Qz+zqteiXlZapxey86l6eEorrak/g=                14
## 3 ++6P09mCSJSh+Ft2pvZ0FWTrtcI3v1A7h3/coh8dBKw=                39
## 4 ++6P09mCSJSh+Ft2pvZ0FWTrtcI3v1A7h3/coh8dBKw=                39
## 5 ++8XaEDast796K/DQcP7sSoRQ8JRnA7X5S+oyXEw62E=                37
## 6 ++BW1PJYU5OZi3n3+IDLiU+d1IL1VE/GLx6p64TDs6U=                34
##   payment_plan_days plan_list_price actual_amount_paid is_auto_renew
## 1                30             149           140.3510             1
## 2                30             149           139.7265             1
## 3                30             149           141.5193             1
## 4                30             149           136.6532             1
## 5                30             149           137.9952             1
## 6                30             149           148.2081             1
##   transaction_date membership_expire_date is_cancel
## 1       2017/03/31             2017/05/04         0
## 2       2017/02/28             2017/04/04         0
## 3       2017/02/28             2017/04/13         0
## 4       2017/03/31             2017/05/13         0
## 5       2017/03/21             2017/04/20         0
## 6       2017/03/31             2017/04/30         0
colnames(df_trans)[1] <- "msno"

df_log <- read.delim("user_log.tab", sep = "\t", stringsAsFactors = FALSE)
dim(df_log)
## [1] 678793      9
head(df_log)
##                                           msno       date num_25 num_50
## 1 ++/ZHqwUNa7U21Qz+zqteiXlZapxey86l6eEorrak/g= 2017/03/25      5      4
## 2 ++/ZHqwUNa7U21Qz+zqteiXlZapxey86l6eEorrak/g= 2017/03/21      0      0
## 3 ++6P09mCSJSh+Ft2pvZ0FWTrtcI3v1A7h3/coh8dBKw= 2017/03/25      0      2
## 4 ++6P09mCSJSh+Ft2pvZ0FWTrtcI3v1A7h3/coh8dBKw= 2017/03/03      0      0
## 5 ++6P09mCSJSh+Ft2pvZ0FWTrtcI3v1A7h3/coh8dBKw= 2017/03/15      0      0
## 6 ++8XaEDast796K/DQcP7sSoRQ8JRnA7X5S+oyXEw62E= 2017/03/09      0      0
##   num_75 num_985 num_100 num_unq total_secs
## 1      3       2      26      25   6815.238
## 2      1       0      14      14   3427.827
## 3      0       0       5       7   1357.419
## 4      0       0       3       3    688.298
## 5      0       0       4       4    949.230
## 6      0       0       1       1    262.000

cnt_churn <- df_churn %>% dplyr::filter(is_churn=="CT")
(customer_churn_percentage <- round(100*nrow(cnt_churn)/nrow(df_churn), 2))
## [1] 12.92

Tabella dei clienti

df_join <- left_join(df_churn, df_members, by="msno")
dim(df_join)
## [1] 37757     7
colnames(df_join)
## [1] "msno"                   "is_churn"              
## [3] "city"                   "age"                   
## [5] "gender"                 "registered_via"        
## [7] "registration_init_time"

Tabella delle transazioni

colnames(df_trans) <- c("msno", colnames(df_trans[-1]))
trans_list <- split(df_trans, df_trans$msno)

last_transaction_idx <- function(dt) {
  tr_ymd <- ymd(dt$transaction_date)
  lt <- which.max(tr_ymd)
  lt
}


analyse_trans_ltow <- function(dt, levels, levels_names) {
  lst <- strsplit(as.character(dt$payment_method_id),"d_")
  pm_sub_table <- do.call(rbind, lapply(lst, function(x) 
    as.logical(table(factor(x, levels=levels)))))
  
  payment_method_wide <- data.frame(pm_sub_table, stringsAsFactors=FALSE)
  colnames(payment_method_wide) <- levels_names
  membership_expire_date <- max(ymd(dt$membership_expire_date))
  nr <- nrow(dt)
  lt <- last_transaction_idx(dt)

  analysis <- data.frame(dt$msno[1],
                         nr, 
                         sum(dt$actual_amount_paid),
                         dt[lt,]$is_auto_renew,
                         dt[lt,]$is_cancel,
                         length(unique(dt$payment_method_id)),
                         sum(as.numeric(dt$payment_plan_days)),
                         sum(as.numeric(dt$plan_list_price)),
                         membership_expire_date,
                         stringsAsFactors = FALSE)

  colnames(analysis) <- c("msno", "hist_trans", "sum_paid", "is_auto_renew",
                          "is_cancel", "payment_method_num",
                          "payment_plan_days_sum", "plan_list_price_sum",
                          "membership_expire_date")
  
  result_df <- as.data.frame(cbind(analysis, payment_method_wide))
  result_df[1,,drop=FALSE]
}

df_trans$payment_method_id <- factor(df_trans$payment_method_id)
lvl <- levels(df_trans$payment_method_id)
level_names <- paste("payment_method_id_", lvl, sep="")
temp_list <- lapply(trans_list, function(x){ analyse_trans_ltow(x, lvl, level_names) })
transan_df <- do.call(rbind, temp_list)
rownames(transan_df) <- NULL

dim(transan_df)
## [1] 35464    41
sum(complete.cases(transan_df))
## [1] 35463
head(transan_df)
##                                           msno hist_trans  sum_paid
## 1 //4hBneqk/4/TtgL1XXQ+eKx7fJTeSvSNt0ktxjSIYE=          1  92.41383
## 2 //8WziQXa8NgLpNL75Y0DpPGxy6Em79sos20XA1l8tw=          1 138.74138
## 3 //98KXjMykxaDiasDXcTK7C3BBr+gBFI9L/2OQQixy8=          2 272.98474
## 4 //aWvY7Y+dy+OCzivW4e/ojzfGRRv4QeN3E4gfBojVI=          1 174.01733
## 5 //Bj6yyexnZbioH2eELlbqRXgOCRPlWQjBYtqi5iJK8=          1 137.85528
## 6 //fXDIz34vI4O2VFpJ8CMLgVYquz5V14oeKgwy0pBHM=          1 120.45214
##   is_auto_renew is_cancel payment_method_num payment_plan_days_sum
## 1             0         0                  1                    30
## 2             1         1                  1                    30
## 3             1         0                  1                    60
## 4             0         0                  1                    30
## 5             1         0                  1                    30
## 6             1         0                  1                    30
##   plan_list_price_sum membership_expire_date payment_method_id_3
## 1                 100             2017-04-08               FALSE
## 2                 149             2017-03-09               FALSE
## 3                 298             2017-05-12               FALSE
## 4                 180             2017-04-23               FALSE
## 5                 149             2017-04-30               FALSE
## 6                 129             2017-04-05               FALSE
##   payment_method_id_8 payment_method_id_10 payment_method_id_11
## 1               FALSE                FALSE                FALSE
## 2               FALSE                FALSE                FALSE
## 3               FALSE                FALSE                FALSE
## 4               FALSE                FALSE                FALSE
## 5               FALSE                FALSE                FALSE
## 6               FALSE                FALSE                FALSE
##   payment_method_id_12 payment_method_id_13 payment_method_id_14
## 1                FALSE                FALSE                FALSE
## 2                FALSE                FALSE                FALSE
## 3                FALSE                FALSE                FALSE
## 4                FALSE                FALSE                FALSE
## 5                FALSE                FALSE                FALSE
## 6                FALSE                FALSE                FALSE
##   payment_method_id_15 payment_method_id_16 payment_method_id_17
## 1                FALSE                FALSE                FALSE
## 2                FALSE                FALSE                FALSE
## 3                FALSE                FALSE                FALSE
## 4                FALSE                FALSE                FALSE
## 5                FALSE                FALSE                FALSE
## 6                FALSE                FALSE                FALSE
##   payment_method_id_18 payment_method_id_19 payment_method_id_20
## 1                FALSE                FALSE                FALSE
## 2                FALSE                FALSE                FALSE
## 3                FALSE                FALSE                FALSE
## 4                FALSE                FALSE                FALSE
## 5                FALSE                FALSE                FALSE
## 6                FALSE                FALSE                FALSE
##   payment_method_id_21 payment_method_id_22 payment_method_id_23
## 1                FALSE                FALSE                FALSE
## 2                FALSE                FALSE                FALSE
## 3                FALSE                FALSE                FALSE
## 4                FALSE                FALSE                FALSE
## 5                FALSE                FALSE                FALSE
## 6                FALSE                FALSE                FALSE
##   payment_method_id_26 payment_method_id_27 payment_method_id_28
## 1                FALSE                FALSE                FALSE
## 2                FALSE                FALSE                FALSE
## 3                FALSE                FALSE                FALSE
## 4                FALSE                FALSE                FALSE
## 5                FALSE                FALSE                FALSE
## 6                FALSE                FALSE                FALSE
##   payment_method_id_29 payment_method_id_30 payment_method_id_31
## 1                FALSE                FALSE                FALSE
## 2                FALSE                FALSE                FALSE
## 3                FALSE                FALSE                FALSE
## 4                FALSE                FALSE                FALSE
## 5                FALSE                FALSE                 TRUE
## 6                FALSE                FALSE                FALSE
##   payment_method_id_32 payment_method_id_33 payment_method_id_34
## 1                FALSE                FALSE                FALSE
## 2                FALSE                FALSE                FALSE
## 3                FALSE                FALSE                FALSE
## 4                FALSE                FALSE                FALSE
## 5                FALSE                FALSE                FALSE
## 6                FALSE                FALSE                FALSE
##   payment_method_id_35 payment_method_id_36 payment_method_id_37
## 1                FALSE                FALSE                FALSE
## 2                FALSE                FALSE                 TRUE
## 3                FALSE                FALSE                FALSE
## 4                FALSE                 TRUE                FALSE
## 5                FALSE                FALSE                FALSE
## 6                FALSE                FALSE                FALSE
##   payment_method_id_38 payment_method_id_39 payment_method_id_40
## 1                 TRUE                FALSE                FALSE
## 2                FALSE                FALSE                FALSE
## 3                FALSE                 TRUE                FALSE
## 4                FALSE                FALSE                FALSE
## 5                FALSE                FALSE                FALSE
## 6                FALSE                FALSE                FALSE
##   payment_method_id_41
## 1                FALSE
## 2                FALSE
## 3                FALSE
## 4                FALSE
## 5                FALSE
## 6                 TRUE

df_join <- left_join(df_join, transan_df, by="msno")

Tabella dei log

analyse_log <- function(dt) {
  total_sums <- sum(dt$num_25) +sum(dt$num_50) + sum(dt$num_75) + sum(dt$num_100) + 
               sum(dt$num_985) + sum(dt$num_unq)
  
  analysis <- data.frame(dt$msno[1], 
                         nrow(dt),
                         sum(dt$total_secs),
                         round(100*sum(dt$num_25)/total_sums,2),
                         round(100*sum(dt$num_50)/total_sums,2),
                         round(100*sum(dt$num_75)/total_sums,2),
                         round(100*sum(dt$num_100)/total_sums,2),
                         round(100*sum(dt$num_985)/total_sums,2), 
                         round(100*sum(dt$num_unq)/total_sums,2), 
                         stringsAsFactors = FALSE)

  colnames(analysis) <- c("msno", "hist_log", "total_secs",
                          "num_25", "num_50", "num_75",
                          "num_100", "num_985", "num_unq")
  
  analysis
}

log_list <- split(df_log, df_log$msno)

logan_df <- do.call(rbind, lapply(log_list, function(x){analyse_log(x)}))
rownames(logan_df) <- NULL
dim(logan_df)
## [1] 35317     9

df_join2 <- left_join(df_churn, logan_df, by="msno")
df_join2 <- df_join2[complete.cases(df_join2),]

numeric_vars <- names(which(sapply(logan_df, class) == "numeric"))
numeric_vars <- c(numeric_vars, names(which(sapply(logan_df, class) == "integer")))

df_join2$is_churn <- factor(df_join2$is_churn)

gp <- invisible(lapply(numeric_vars, function(x) { 
  ggplot(data=df_join2, aes(x = is_churn, y=log(eval(parse(text=x))))) + geom_boxplot(aes(fill = is_churn)) + xlab("is_churn") + ylab(x) + ggtitle("") + scale_fill_manual(values=c("#009900", "#CC0000")) + theme(legend.position="none")}))
grob_plots <- invisible(lapply(chunk(1, length(gp), 4), function(x) {
  marrangeGrob(grobs=lapply(gp[x], ggplotGrob), nrow=2, ncol=2)}))
grob_plots$chunk1

grob_plots$chunk2

gp <- invisible(lapply(numeric_vars, function(x) { 
  ggplot(data=df_join2, aes(x=eval(parse(text=x)))) + geom_density(aes(fill = is_churn), alpha = 0.4) + xlab(x) + scale_fill_manual(values=c("#009900", "#CC0000")) + scale_x_continuous(trans='log2') + ggtitle(paste(x, "density", sep= " "))}))
grob_plots <- invisible(lapply(chunk(1, length(gp), 4), function(x) {
  marrangeGrob(grobs=lapply(gp[x], ggplotGrob), nrow=2, ncol=2)}))
grob_plots$chunk1

grob_plots$chunk2

ANOVA Analysis

aov_l <- lapply(numeric_vars, function(x) { summary(aov(eval(parse(text=x)) ~  is_churn, data = df_join2))})
names(aov_l) <- numeric_vars
aov_l
## $total_secs
##                Df    Sum Sq   Mean Sq F value Pr(>F)    
## is_churn        1 4.440e+12 4.440e+12   104.8 <2e-16 ***
## Residuals   35315 1.496e+15 4.237e+10                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## $num_25
##                Df  Sum Sq Mean Sq F value   Pr(>F)    
## is_churn        1     746   746.5   12.35 0.000443 ***
## Residuals   35315 2135447    60.5                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## $num_50
##                Df Sum Sq Mean Sq F value  Pr(>F)    
## is_churn        1    202  202.21   22.91 1.7e-06 ***
## Residuals   35315 311641    8.82                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## $num_75
##                Df Sum Sq Mean Sq F value   Pr(>F)    
## is_churn        1     72   71.87    21.6 3.38e-06 ***
## Residuals   35315 117520    3.33                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## $num_100
##                Df  Sum Sq Mean Sq F value   Pr(>F)    
## is_churn        1    6027    6027   36.36 1.66e-09 ***
## Residuals   35315 5853318     166                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## $num_985
##                Df Sum Sq Mean Sq F value  Pr(>F)   
## is_churn        1     40   39.97   8.338 0.00388 **
## Residuals   35315 169297    4.79                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## $num_unq
##                Df  Sum Sq Mean Sq F value  Pr(>F)   
## is_churn        1     453   453.3   9.791 0.00175 **
## Residuals   35315 1634782    46.3                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## $hist_log
##                Df  Sum Sq Mean Sq F value Pr(>F)    
## is_churn        1   40645   40645   471.3 <2e-16 ***
## Residuals   35315 3045696      86                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tukey <- lapply(numeric_vars, function(x) { TukeyHSD(aov(eval(parse(text=x)) ~  is_churn, data = df_join2))})
names(tukey) <- numeric_vars
tukey
## $total_secs
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join2)
## 
## $is_churn
##            diff       lwr       upr p adj
## CT-CF -34115.04 -40647.05 -27583.02     0
## 
## 
## $num_25
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join2)
## 
## $is_churn
##            diff       lwr       upr     p adj
## CT-CF 0.4423427 0.1955868 0.6890987 0.0004423
## 
## 
## $num_50
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join2)
## 
## $is_churn
##            diff       lwr       upr   p adj
## CT-CF 0.2302275 0.1359625 0.3244926 1.7e-06
## 
## 
## $num_75
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join2)
## 
## $is_churn
##            diff       lwr       upr   p adj
## CT-CF 0.1372562 0.0793696 0.1951429 3.4e-06
## 
## 
## $num_100
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join2)
## 
## $is_churn
##            diff       lwr       upr p adj
## CT-CF -1.256892 -1.665423 -0.848362     0
## 
## 
## $num_985
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join2)
## 
## $is_churn
##            diff        lwr       upr     p adj
## CT-CF 0.1023596 0.03288145 0.1718377 0.0038826
## 
## 
## $num_unq
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join2)
## 
## $is_churn
##            diff       lwr       upr     p adj
## CT-CF 0.3446871 0.1287869 0.5605874 0.0017534
## 
## 
## $hist_log
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join2)
## 
## $is_churn
##            diff       lwr       upr p adj
## CT-CF -3.264055 -3.558746 -2.969365     0

set.seed(1023)
trControl <- trainControl(method = "repeatedcv", number = 10, repeats = 5, verboseIter = FALSE, sampling = "down")

fit <- train(is_churn ~ total_secs, data = df_join2, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join2, type = "raw")
confusionMatrix(factor(predict_res), df_join2$is_churn)$overall[2]
##      Kappa 
## 0.02688027
fit <- train(is_churn ~ num_25, data = df_join2, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join2, type = "raw")
confusionMatrix(factor(predict_res), df_join2$is_churn)$overall[2]
##       Kappa 
## 0.007623977

fit <- train(is_churn ~ num_50, data = df_join2, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join2, type = "raw")
confusionMatrix(factor(predict_res), df_join2$is_churn)$overall[2]
##      Kappa 
## 0.01327058
fit <- train(is_churn ~ num_75, data = df_join2, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join2, type = "raw")
confusionMatrix(factor(predict_res), df_join2$is_churn)$overall[2]
##       Kappa 
## 0.009044729
fit <- train(is_churn ~ num_100, data = df_join2, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join2, type = "raw")
confusionMatrix(factor(predict_res), df_join2$is_churn)$overall[2]
##      Kappa 
## 0.02014215

fit <- train(is_churn ~ num_985, data = df_join2, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join2, type = "raw")
confusionMatrix(factor(predict_res), df_join2$is_churn)$overall[2]
##       Kappa 
## 0.004943829
fit <- train(is_churn ~ num_unq, data = df_join2, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join2, type = "raw")
confusionMatrix(factor(predict_res), df_join2$is_churn)$overall[2]
##       Kappa 
## 0.005372912
fit <- train(is_churn ~ hist_log, data = df_join2, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join2, type = "raw")
confusionMatrix(factor(predict_res), df_join2$is_churn)$overall[2]
##      Kappa 
## 0.07486046

Dataset per costruzione modelli

df_join <- left_join(df_join, logan_df[, c("msno", "hist_log")], by="msno")

df_join$is_cancel <- as.logical(df_join$is_cancel)
df_join$is_auto_renew <- as.logical(df_join$is_auto_renew)
df_join$payment_method_num <- factor(df_join$payment_method_num)

sum(complete.cases(df_join))/nrow(df_join)
## [1] 0.8819557
df_join <- df_join[complete.cases(df_join),]
df_join$membership_duration <- as.numeric(ymd(df_join$membership_expire_date) - ymd(df_join$registration_init_time))
df_join$registration_init_time <- NULL
df_join$membership_expire_date <- NULL

dim(df_join)
## [1] 33300    47
str(df_join)
## 'data.frame':    33300 obs. of  47 variables:
##  $ msno                 : chr  "E1vsPZT5WGaqbFaXAV2Flx5LUfuMUHW8NoYSypf7oJU=" "EMpiEFU1Fiv2LM714kYqvJcz/xp5qeNWbRheNZyExYc=" "Fk7Qb26rnUGm8jeenM25HnnhteiGT9xjN6jDpOqdrck=" "CLtKi0Tlx7XzUpy/Z8l979QBURHsOmvLU821wWclWrw=" ...
##  $ is_churn             : Factor w/ 2 levels "CF","CT": 1 1 1 1 1 2 1 1 1 1 ...
##  $ city                 : Factor w/ 21 levels "1","3","4","5",..: 21 11 5 14 7 17 4 14 3 4 ...
##  $ age                  : int  43 27 46 45 22 34 25 28 31 19 ...
##  $ gender               : Factor w/ 2 levels "female","male": 2 1 1 2 2 1 1 1 2 2 ...
##  $ registered_via       : Factor w/ 5 levels "3","4","7","9",..: 3 1 1 3 1 4 1 4 4 4 ...
##  $ hist_trans           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ sum_paid             : num  145.6 171.7 144.7 95.3 146.2 ...
##  $ is_auto_renew        : logi  TRUE TRUE FALSE TRUE FALSE TRUE ...
##  $ is_cancel            : logi  FALSE FALSE FALSE FALSE FALSE TRUE ...
##  $ payment_method_num   : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ payment_plan_days_sum: num  30 30 30 30 30 30 30 30 30 30 ...
##  $ plan_list_price_sum  : num  149 180 149 99 149 180 180 149 149 149 ...
##  $ payment_method_id_3  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_8  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_10 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_11 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_12 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_13 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_14 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_15 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_16 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_17 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_18 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_19 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_20 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_21 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_22 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_23 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_26 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_27 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_28 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_29 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_30 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_31 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_32 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_33 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_34 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_35 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_36 : logi  FALSE TRUE FALSE FALSE FALSE TRUE ...
##  $ payment_method_id_37 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_38 : logi  FALSE FALSE TRUE FALSE TRUE FALSE ...
##  $ payment_method_id_39 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_40 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ payment_method_id_41 : logi  TRUE FALSE FALSE TRUE FALSE FALSE ...
##  $ hist_log             : int  4 31 30 5 29 14 26 21 16 30 ...
##  $ membership_duration  : num  1401 1468 1537 396 1329 ...

VIsualizzazioni differenziate secondo l’attributo is_churn

ggplot(data=df_join, aes(y=age, x=is_churn)) + geom_boxplot(aes(fill=is_churn)) + guides(fill=FALSE) + scale_fill_manual(values=c("#009900", "#CC0000"))

ggplot(data=df_join, aes(y=sum_paid, x=is_churn)) + geom_boxplot(aes(fill=is_churn)) + guides(fill=FALSE) + scale_fill_manual(values=c("#009900", "#CC0000"))

ggplot(data=df_join, aes(y=hist_log, x=is_churn)) + geom_boxplot(aes(fill=is_churn)) + guides(fill=FALSE)  + scale_fill_manual(values=c("#009900", "#CC0000"))

ggplot(data=df_join, aes(y=membership_duration, x=is_churn)) + geom_boxplot(aes(fill=is_churn)) + guides(fill=FALSE) + scale_fill_manual(values=c("#009900", "#CC0000"))

ggplot(data=df_join, aes(y=plan_list_price_sum, x=is_churn)) + geom_boxplot(aes(fill=is_churn)) + guides(fill=FALSE)  + scale_fill_manual(values=c("#009900", "#CC0000"))

ggplot(data=df_join, aes(y=membership_duration, x=city)) + geom_boxplot(aes(fill=is_churn)) + guides(fill=FALSE) + facet_wrap ( ~is_churn) + coord_flip() +  scale_fill_manual(values=c("#009900", "#CC0000"))

ggplot(data=df_join, aes(y=membership_duration, x=cut(df_join$age, breaks=c(0,20,40,60,80,100)))) + geom_boxplot(aes(fill=is_churn)) + xlab("age") + guides(fill=FALSE) + facet_wrap ( ~is_churn) + coord_flip() + scale_fill_manual(values=c("#009900", "#CC0000"))

ggplot(data=df_join, aes(y=age, x=city)) + geom_boxplot(aes(fill=is_churn)) + guides(fill=FALSE) + facet_wrap ( ~is_churn) + coord_flip() + scale_fill_manual(values=c("#009900", "#CC0000"))

ggplot(data=df_join, aes(y=payment_plan_days_sum, x=city)) + geom_boxplot(aes(fill=is_churn)) + guides(fill=FALSE) + facet_wrap ( ~ is_churn) + coord_flip() + scale_fill_manual(values=c("#009900", "#CC0000"))

ggplot(data=df_join, aes(y=age, x=gender)) + geom_boxplot(aes(fill=is_churn)) + guides(fill=FALSE) + facet_wrap ( ~ is_churn) + coord_flip() + scale_fill_manual(values=c("#009900", "#CC0000"))

ggplot(data=df_join, aes(y=membership_duration, x=registered_via)) + geom_boxplot(aes(fill=is_churn)) + guides(fill=FALSE) + facet_wrap ( ~is_churn) + coord_flip() + scale_fill_manual(values=c("#009900", "#CC0000"))

ggplot(data=df_join, aes(x=city, fill=is_churn)) + geom_bar(position = 'dodge2') + scale_fill_manual(values=c("#009900", "#CC0000"))

ggplot(data=df_join, aes(x=cut(age, breaks = c(0,20,30,40,50,100)), fill=is_churn)) + geom_bar(position = 'dodge2') + xlab("age")  + scale_fill_manual(values=c("#009900", "#CC0000"))

ggplot(data=df_join, aes(x=registered_via, fill=is_churn)) + geom_bar(position = 'dodge2') + xlab("registered via")  + scale_fill_manual(values=c("#009900", "#CC0000"))

ggplot(data=df_join, aes(x=gender, fill=is_churn)) + geom_bar(position = 'dodge2') + xlab("gender")  + scale_fill_manual(values=c("#009900", "#CC0000"))

ggplot(data=df_join, aes(x=is_cancel, fill=is_churn)) + geom_bar(position = 'dodge2') + xlab("is cancel") + scale_fill_manual(values=c("#009900", "#CC0000"))

table(df_join$is_cancel, df_join$is_churn)
##        
##            CF    CT
##   FALSE 30202  2412
##   TRUE    193   493

ggplot(data=df_join, aes(x=is_auto_renew, fill=is_churn)) + geom_bar(position = 'dodge2') + xlab("auto renew") + scale_fill_manual(values=c("#009900", "#CC0000")) 

ggplot(data=df_join, aes(x=cut(membership_duration, breaks=seq(1, 5475, by=365)), fill=is_churn)) + geom_bar(position = 'dodge2') + xlab("years of subscription") + coord_flip() + scale_x_discrete(labels=as.character(1:15))  + scale_fill_manual(values=c("#009900", "#CC0000"))

Tipi di variabili

factor_vars <- names(which(sapply(df_join, class) == "factor"))
factor_vars
## [1] "is_churn"           "city"               "gender"            
## [4] "registered_via"     "payment_method_num"
character_vars <- names(which(sapply(df_join, class) == "character"))
character_vars
## [1] "msno"
logical_vars <- names(which(sapply(df_join, class) == "logical"))
logical_vars
##  [1] "is_auto_renew"        "is_cancel"            "payment_method_id_3" 
##  [4] "payment_method_id_8"  "payment_method_id_10" "payment_method_id_11"
##  [7] "payment_method_id_12" "payment_method_id_13" "payment_method_id_14"
## [10] "payment_method_id_15" "payment_method_id_16" "payment_method_id_17"
## [13] "payment_method_id_18" "payment_method_id_19" "payment_method_id_20"
## [16] "payment_method_id_21" "payment_method_id_22" "payment_method_id_23"
## [19] "payment_method_id_26" "payment_method_id_27" "payment_method_id_28"
## [22] "payment_method_id_29" "payment_method_id_30" "payment_method_id_31"
## [25] "payment_method_id_32" "payment_method_id_33" "payment_method_id_34"
## [28] "payment_method_id_35" "payment_method_id_36" "payment_method_id_37"
## [31] "payment_method_id_38" "payment_method_id_39" "payment_method_id_40"
## [34] "payment_method_id_41"
numeric_vars <- setdiff(colnames(df_join), factor_vars)
numeric_vars <- setdiff(numeric_vars, character_vars)
numeric_vars <- setdiff(numeric_vars, logical_vars)

numeric_vars <- setdiff(numeric_vars, "is_churn")
numeric_vars
## [1] "age"                   "hist_trans"            "sum_paid"             
## [4] "payment_plan_days_sum" "plan_list_price_sum"   "hist_log"             
## [7] "membership_duration"
numeric_vars_mat <- as.matrix(df_join[, numeric_vars, drop=FALSE])

Summary

is_churn_num <- as.numeric(df_join$is_churn) - 1

summary_features <- colnames(df_join)
summary_features <- setdiff(summary_features, c("msno", "is_churn", "registration_init_time", "membership_expire_date"))

summary_string <- paste(summary_features, collapse = "+")
frm_sum <- as.formula(paste("is_churn_num ~ ", summary_string))
frm_sum
## is_churn_num ~ city + age + gender + registered_via + hist_trans + 
##     sum_paid + is_auto_renew + is_cancel + payment_method_num + 
##     payment_plan_days_sum + plan_list_price_sum + payment_method_id_3 + 
##     payment_method_id_8 + payment_method_id_10 + payment_method_id_11 + 
##     payment_method_id_12 + payment_method_id_13 + payment_method_id_14 + 
##     payment_method_id_15 + payment_method_id_16 + payment_method_id_17 + 
##     payment_method_id_18 + payment_method_id_19 + payment_method_id_20 + 
##     payment_method_id_21 + payment_method_id_22 + payment_method_id_23 + 
##     payment_method_id_26 + payment_method_id_27 + payment_method_id_28 + 
##     payment_method_id_29 + payment_method_id_30 + payment_method_id_31 + 
##     payment_method_id_32 + payment_method_id_33 + payment_method_id_34 + 
##     payment_method_id_35 + payment_method_id_36 + payment_method_id_37 + 
##     payment_method_id_38 + payment_method_id_39 + payment_method_id_40 + 
##     payment_method_id_41 + hist_log + membership_duration

Summary

(sum_res <- summary(frm_sum, data = df_join))
## is_churn_num     N= 33300 
## 
## +---------------------+-----------+-----+------------+
## |                     |           |N    |is_churn_num|
## +---------------------+-----------+-----+------------+
## |city                 |1          |  889|0.10911136  |
## |                     |3          |  386|0.07512953  |
## |                     |4          | 3588|0.08584169  |
## |                     |5          | 5421|0.08836008  |
## |                     |6          | 1971|0.08878742  |
## |                     |7          |  187|0.10695187  |
## |                     |8          |  574|0.06794425  |
## |                     |9          |  699|0.09442060  |
## |                     |10         |  526|0.07414449  |
## |                     |11         |  686|0.08309038  |
## |                     |12         |  942|0.09554140  |
## |                     |13         | 7592|0.08667018  |
## |                     |14         | 1648|0.07949029  |
## |                     |15         | 3341|0.09398384  |
## |                     |16         |   82|0.07317073  |
## |                     |17         |  393|0.06870229  |
## |                     |18         |  598|0.08361204  |
## |                     |19         |   10|0.00000000  |
## |                     |20         |   56|0.07142857  |
## |                     |21         |  405|0.08148148  |
## |                     |22         | 3306|0.08560194  |
## +---------------------+-----------+-----+------------+
## |age                  |[12, 25)   | 9720|0.12016461  |
## |                     |[25, 29)   | 7757|0.07580250  |
## |                     |[29, 35)   | 7883|0.07446404  |
## |                     |[35,100]   | 7940|0.07078086  |
## +---------------------+-----------+-----+------------+
## |gender               |female     |15875|0.08692913  |
## |                     |male       |17425|0.08751793  |
## +---------------------+-----------+-----+------------+
## |registered_via       |3          | 7403|0.11144131  |
## |                     |4          | 2220|0.14369369  |
## |                     |7          | 6255|0.05307754  |
## |                     |9          |17395|0.08203507  |
## |                     |13         |   27|0.07407407  |
## +---------------------+-----------+-----+------------+
## |hist_trans           |1          |25908|0.07337502  |
## |                     |[2,208]    | 7392|0.13582251  |
## +---------------------+-----------+-----+------------+
## |sum_paid             |[  0, 138) | 8325|0.02894895  |
## |                     |[138, 146) | 8325|0.03291291  |
## |                     |[146, 181) | 8325|0.04828829  |
## |                     |[181,6468] | 8325|0.23879880  |
## +---------------------+-----------+-----+------------+
## |is_auto_renew        |No         | 5489|0.31881946  |
## |                     |Yes        |27811|0.04153033  |
## +---------------------+-----------+-----+------------+
## |is_cancel            |No         |32614|0.07395597  |
## |                     |Yes        |  686|0.71865889  |
## +---------------------+-----------+-----+------------+
## |payment_method_num   |1          |32844|0.08111071  |
## |                     |2          |  435|0.51724138  |
## |                     |3          |   21|0.76190476  |
## +---------------------+-----------+-----+------------+
## |payment_plan_days_sum|[ 1,  31)  |24871|0.03490008  |
## |                     |[31,  61)  | 6482|0.06155508  |
## |                     |[61,1456]  | 1947|0.84129430  |
## +---------------------+-----------+-----+------------+
## |plan_list_price_sum  |[  0, 150) |20327|0.03099326  |
## |                     |[150, 199) | 4648|0.06454389  |
## |                     |[199,6456] | 8325|0.23723724  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_3  |No         |33299|0.08720983  |
## |                     |Yes        |    1|1.00000000  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_8  |No         |33297|0.08718503  |
## |                     |Yes        |    3|0.66666667  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_10 |FALSE      |33300|0.08723724  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_11 |No         |33295|0.08725034  |
## |                     |Yes        |    5|0.00000000  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_12 |No         |33282|0.08680368  |
## |                     |Yes        |   18|0.88888889  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_13 |No         |33276|0.08657892  |
## |                     |Yes        |   24|1.00000000  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_14 |No         |33258|0.08734741  |
## |                     |Yes        |   42|0.00000000  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_15 |No         |33197|0.08470645  |
## |                     |Yes        |  103|0.90291262  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_16 |No         |33227|0.08697746  |
## |                     |Yes        |   73|0.20547945  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_17 |No         |33237|0.08577790  |
## |                     |Yes        |   63|0.85714286  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_18 |No         |33282|0.08725437  |
## |                     |Yes        |   18|0.05555556  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_19 |No         |33203|0.08746198  |
## |                     |Yes        |   97|0.01030928  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_20 |No         |33247|0.08578218  |
## |                     |Yes        |   53|1.00000000  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_21 |No         |33239|0.08730708  |
## |                     |Yes        |   61|0.04918033  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_22 |No         |33254|0.08597462  |
## |                     |Yes        |   46|1.00000000  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_23 |No         |33173|0.08745064  |
## |                     |Yes        |  127|0.03149606  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_26 |No         |33289|0.08699570  |
## |                     |Yes        |   11|0.81818182  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_27 |No         |33133|0.08752603  |
## |                     |Yes        |  167|0.02994012  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_28 |No         |33195|0.08685043  |
## |                     |Yes        |  105|0.20952381  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_29 |No         |32342|0.08839899  |
## |                     |Yes        |  958|0.04801670  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_30 |No         |32810|0.08747333  |
## |                     |Yes        |  490|0.07142857  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_31 |No         |32440|0.08911837  |
## |                     |Yes        |  860|0.01627907  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_32 |No         |32709|0.07117307  |
## |                     |Yes        |  591|0.97631134  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_33 |No         |32260|0.08905766  |
## |                     |Yes        | 1040|0.03076923  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_34 |No         |31188|0.09208670  |
## |                     |Yes        | 2112|0.01562500  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_35 |No         |33274|0.08652401  |
## |                     |Yes        |   26|1.00000000  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_36 |No         |29179|0.08831694  |
## |                     |Yes        | 4121|0.07959233  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_37 |No         |30296|0.09400581  |
## |                     |Yes        | 3004|0.01897470  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_38 |No         |30543|0.07487804  |
## |                     |Yes        | 2757|0.22415669  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_39 |No         |28184|0.09395402  |
## |                     |Yes        | 5116|0.05023456  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_40 |No         |27749|0.09416556  |
## |                     |Yes        | 5551|0.05260313  |
## +---------------------+-----------+-----+------------+
## |payment_method_id_41 |No         |27643|0.09637159  |
## |                     |Yes        | 5657|0.04260209  |
## +---------------------+-----------+-----+------------+
## |hist_log             |[ 1,13)    | 8484|0.09028760  |
## |                     |[13,23)    | 9096|0.08256376  |
## |                     |[23,29)    | 8242|0.08517350  |
## |                     |[29,31]    | 7478|0.09173576  |
## +---------------------+-----------+-----+------------+
## |membership_duration  |[   4, 857)| 8327|0.09667347  |
## |                     |[ 857,1691)| 8325|0.09297297  |
## |                     |[1691,2794)| 8324|0.08637674  |
## |                     |[2794,5393]| 8324|0.07292167  |
## +---------------------+-----------+-----+------------+
## |Overall              |           |33300|0.08723724  |
## +---------------------+-----------+-----+------------+

Summary

plot(summary(is_churn_num ~ city + age + gender, data = df_join))

Summary

plot(summary(is_churn_num ~ is_auto_renew, data = df_join))

Churn proportion

(tbl_df_join <- table(df_join$is_churn))
## 
##    CF    CT 
## 30395  2905
tbl_df_join[2]/sum(tbl_df_join)
##         CT 
## 0.08723724

Box plot

gp <- invisible(lapply(numeric_vars, function(x) { 
  ggplot(data=df_join, aes(x = is_churn, y=log(eval(parse(text=x))))) + geom_boxplot(aes(fill = is_churn)) + xlab("is churn") + ylab(x) + scale_fill_manual(values=c("#009900", "#CC0000")) + ggtitle("") + theme(legend.position="none")}))
grob_plots <- invisible(lapply(chunk(1, length(gp), 4), function(x) {
  marrangeGrob(grobs=lapply(gp[x], ggplotGrob), nrow=2, ncol=2)}))

grob_plots$chunk1

grob_plots$chunk2

Density plot

gp <- invisible(lapply(numeric_vars, function(x) { 
  ggplot(data=df_join, aes(x=eval(parse(text=x)))) + geom_density(aes(fill = is_churn), alpha = 0.4) + xlab(x) + scale_fill_manual(values=c("#009900", "#CC0000")) + scale_x_continuous(trans='log2') + ggtitle(paste(x, "density", sep= " "))}))
grob_plots <- invisible(lapply(chunk(1, length(gp), 4), function(x) {
  marrangeGrob(grobs=lapply(gp[x], ggplotGrob), nrow=2, ncol=2)}))

grob_plots$chunk1

grob_plots$chunk2

ANOVA Analysis

aov_l <- lapply(numeric_vars, function(x) { summary(aov(eval(parse(text=x)) ~  is_churn, data = df_join))})
names(aov_l) <- numeric_vars
aov_l
## $age
##                Df  Sum Sq Mean Sq F value Pr(>F)    
## is_churn        1    9210    9210   122.7 <2e-16 ***
## Residuals   33298 2499077      75                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## $hist_trans
##                Df Sum Sq Mean Sq F value Pr(>F)    
## is_churn        1    508   508.1   226.5 <2e-16 ***
## Residuals   33298  74708     2.2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## $sum_paid
##                Df    Sum Sq   Mean Sq F value Pr(>F)    
## is_churn        1 4.916e+08 491640517   10374 <2e-16 ***
## Residuals   33298 1.578e+09     47390                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## $payment_plan_days_sum
##                Df    Sum Sq  Mean Sq F value Pr(>F)    
## is_churn        1  37440131 37440131   12431 <2e-16 ***
## Residuals   33298 100285816     3012                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## $plan_list_price_sum
##                Df    Sum Sq   Mean Sq F value Pr(>F)    
## is_churn        1 5.543e+08 554274226   10705 <2e-16 ***
## Residuals   33298 1.724e+09     51777                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## $hist_log
##                Df  Sum Sq Mean Sq F value Pr(>F)
## is_churn        1      19   18.55   0.219  0.639
## Residuals   33298 2815301   84.55               
## 
## $membership_duration
##                Df    Sum Sq  Mean Sq F value   Pr(>F)    
## is_churn        1 4.177e+07 41768610   26.79 2.29e-07 ***
## Residuals   33298 5.192e+10  1559357                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tukey <- lapply(numeric_vars, function(x) { TukeyHSD(aov(eval(parse(text=x)) ~  is_churn, data = df_join))})
names(tukey) <- numeric_vars
tukey
## $age
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join)
## 
## $is_churn
##           diff       lwr       upr p adj
## CT-CF -1.86369 -2.193434 -1.533946     0
## 
## 
## $hist_trans
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join)
## 
## $is_churn
##           diff       lwr       upr p adj
## CT-CF 0.437752 0.3807394 0.4947645     0
## 
## 
## $sum_paid
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join)
## 
## $is_churn
##           diff      lwr      upr p adj
## CT-CF 430.5978 422.3119 438.8837     0
## 
## 
## $payment_plan_days_sum
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join)
## 
## $is_churn
##           diff      lwr      upr p adj
## CT-CF 118.8274 116.7386 120.9163     0
## 
## 
## $plan_list_price_sum
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join)
## 
## $is_churn
##           diff      lwr      upr p adj
## CT-CF 457.2043 448.5434 465.8653     0
## 
## 
## $hist_log
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join)
## 
## $is_churn
##              diff        lwr       upr     p adj
## CT-CF -0.08364616 -0.4336312 0.2663388 0.6394771
## 
## 
## $membership_duration
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join)
## 
## $is_churn
##            diff       lwr       upr p adj
## CT-CF -125.5085 -173.0386 -77.97835 2e-07

Separabilità della variabile target su base singolo predittore

set.seed(1023)
trControl <- trainControl(method = "repeatedcv", number = 10, repeats = 5, verboseIter = FALSE, sampling = "down")
fit <- train(is_churn ~ age, data = df_join, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join, type = "raw")
confusionMatrix(factor(predict_res), df_join$is_churn)$overall[2]
##      Kappa 
## 0.02200541
fit <- train(is_churn ~ sum_paid, data = df_join, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join, type = "raw")
confusionMatrix(factor(predict_res), df_join$is_churn)$overall[2]
##     Kappa 
## 0.6256314

fit <- train(is_churn ~ hist_trans, data = df_join, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join, type = "raw")
confusionMatrix(factor(predict_res), df_join$is_churn)$overall[2]
##      Kappa 
## 0.07974486
fit <- train(is_churn ~ payment_plan_days_sum, data = df_join, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join, type = "raw")
confusionMatrix(factor(predict_res), df_join$is_churn)$overall[2]
##     Kappa 
## 0.5489028
fit <- train(is_churn ~ hist_log, data = df_join, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join, type = "raw")
confusionMatrix(factor(predict_res), df_join$is_churn)$overall[2]
##       Kappa 
## 0.003055255

fit <- train(is_churn ~ plan_list_price_sum, data = df_join, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join, type = "raw")
confusionMatrix(factor(predict_res), df_join$is_churn)$overall[2]
##     Kappa 
## 0.6252373
fit <- train(is_churn ~ membership_duration, data = df_join, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join, type = "raw")
confusionMatrix(factor(predict_res), df_join$is_churn)$overall[2]
##     Kappa 
## 0.0144657

Correlation Analysis

numeric_vars_cor <- cor(numeric_vars_mat)
corrplot(numeric_vars_cor, type = "lower", sig.level = 0.7, diag=FALSE)

Scelta predittore tra le variabili correlate

fit <- train(is_churn ~ plan_list_price_sum, data = df_join, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join, type = "raw")
confusionMatrix(factor(predict_res), df_join$is_churn)$overall[2]
##     Kappa 
## 0.6252373
fit <- train(is_churn ~ payment_plan_days_sum, data = df_join, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join, type = "raw")
confusionMatrix(factor(predict_res), df_join$is_churn)$overall[2]
##     Kappa 
## 0.5489028
fit <- train(is_churn ~ sum_paid, data = df_join, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join, type = "raw")
confusionMatrix(factor(predict_res), df_join$is_churn)$overall[2]
##     Kappa 
## 0.6256314

Chi-square test

factor_vars
## [1] "is_churn"           "city"               "gender"            
## [4] "registered_via"     "payment_method_num"
factor_vars_red <- setdiff(factor_vars, "is_churn")
chisq_test_res <- lapply(factor_vars_red, function(x) { 
  chisq.test(df_join[,x], df_join[, "is_churn"], simulate.p.value = TRUE)
})
names(chisq_test_res) <- factor_vars_red

chisq_test_res$city
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  df_join[, x] and df_join[, "is_churn"]
## X-squared = 19.023, df = NA, p-value = 0.5192

chisq_test_res$gender
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  df_join[, x] and df_join[, "is_churn"]
## X-squared = 0.036167, df = NA, p-value = 0.8571

chisq_test_res$registered_via
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  df_join[, x] and df_join[, "is_churn"]
## X-squared = 240.96, df = NA, p-value = 0.0004998

tb <- table(df_join$is_auto_renew, df_join$is_churn)
chisq.test(tb, simulate.p.value = TRUE)
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  tb
## X-squared = 4426.6, df = NA, p-value = 0.0004998

tb <- table(df_join$is_cancel, df_join$is_churn)
chisq.test(tb, simulate.p.value = TRUE)
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  tb
## X-squared = 3507.1, df = NA, p-value = 0.0004998

chisq_test_res$payment_method_num
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  df_join[, x] and df_join[, "is_churn"]
## X-squared = 1145.6, df = NA, p-value = 0.0004998

Tabelle tipi di pagamento

payment_method_id_set <- colnames(df_join)[grep("payment_method_id", x=colnames(df_join))]

(table_pay <- sapply(payment_method_id_set, function(x) { table(df_join[,x], df_join$is_churn)}))
## $payment_method_id_3
##        
##            CF    CT
##   FALSE 30395  2904
##   TRUE      0     1
## 
## $payment_method_id_8
##        
##            CF    CT
##   FALSE 30394  2903
##   TRUE      1     2
## 
## $payment_method_id_10
##        
##            CF    CT
##   FALSE 30395  2905
## 
## $payment_method_id_11
##        
##            CF    CT
##   FALSE 30390  2905
##   TRUE      5     0
## 
## $payment_method_id_12
##        
##            CF    CT
##   FALSE 30393  2889
##   TRUE      2    16
## 
## $payment_method_id_13
##        
##            CF    CT
##   FALSE 30395  2881
##   TRUE      0    24
## 
## $payment_method_id_14
##        
##            CF    CT
##   FALSE 30353  2905
##   TRUE     42     0
## 
## $payment_method_id_15
##        
##            CF    CT
##   FALSE 30385  2812
##   TRUE     10    93
## 
## $payment_method_id_16
##        
##            CF    CT
##   FALSE 30337  2890
##   TRUE     58    15
## 
## $payment_method_id_17
##        
##            CF    CT
##   FALSE 30386  2851
##   TRUE      9    54
## 
## $payment_method_id_18
##        
##            CF    CT
##   FALSE 30378  2904
##   TRUE     17     1
## 
## $payment_method_id_19
##        
##            CF    CT
##   FALSE 30299  2904
##   TRUE     96     1
## 
## $payment_method_id_20
##        
##            CF    CT
##   FALSE 30395  2852
##   TRUE      0    53
## 
## $payment_method_id_21
##        
##            CF    CT
##   FALSE 30337  2902
##   TRUE     58     3
## 
## $payment_method_id_22
##        
##            CF    CT
##   FALSE 30395  2859
##   TRUE      0    46
## 
## $payment_method_id_23
##        
##            CF    CT
##   FALSE 30272  2901
##   TRUE    123     4
## 
## $payment_method_id_26
##        
##            CF    CT
##   FALSE 30393  2896
##   TRUE      2     9
## 
## $payment_method_id_27
##        
##            CF    CT
##   FALSE 30233  2900
##   TRUE    162     5
## 
## $payment_method_id_28
##        
##            CF    CT
##   FALSE 30312  2883
##   TRUE     83    22
## 
## $payment_method_id_29
##        
##            CF    CT
##   FALSE 29483  2859
##   TRUE    912    46
## 
## $payment_method_id_30
##        
##            CF    CT
##   FALSE 29940  2870
##   TRUE    455    35
## 
## $payment_method_id_31
##        
##            CF    CT
##   FALSE 29549  2891
##   TRUE    846    14
## 
## $payment_method_id_32
##        
##            CF    CT
##   FALSE 30381  2328
##   TRUE     14   577
## 
## $payment_method_id_33
##        
##            CF    CT
##   FALSE 29387  2873
##   TRUE   1008    32
## 
## $payment_method_id_34
##        
##            CF    CT
##   FALSE 28316  2872
##   TRUE   2079    33
## 
## $payment_method_id_35
##        
##            CF    CT
##   FALSE 30395  2879
##   TRUE      0    26
## 
## $payment_method_id_36
##        
##            CF    CT
##   FALSE 26602  2577
##   TRUE   3793   328
## 
## $payment_method_id_37
##        
##            CF    CT
##   FALSE 27448  2848
##   TRUE   2947    57
## 
## $payment_method_id_38
##        
##            CF    CT
##   FALSE 28256  2287
##   TRUE   2139   618
## 
## $payment_method_id_39
##        
##            CF    CT
##   FALSE 25536  2648
##   TRUE   4859   257
## 
## $payment_method_id_40
##        
##            CF    CT
##   FALSE 25136  2613
##   TRUE   5259   292
## 
## $payment_method_id_41
##        
##            CF    CT
##   FALSE 24979  2664
##   TRUE   5416   241

lapply(table_pay, function(x) {chisq.test(x, simulate.p.value = TRUE)})
## $payment_method_id_3
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 10.463, df = NA, p-value = 0.09495
## 
## 
## $payment_method_id_8
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 12.65, df = NA, p-value = 0.02249
## 
## 
## $payment_method_id_10
## 
##  Chi-squared test for given probabilities with simulated p-value
##  (based on 2000 replicates)
## 
## data:  x
## X-squared = 22694, df = NA, p-value = 0.0004998
## 
## 
## $payment_method_id_11
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 0.47795, df = NA, p-value = 1
## 
## 
## $payment_method_id_12
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 145.35, df = NA, p-value = 0.0004998
## 
## 
## $payment_method_id_13
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 251.29, df = NA, p-value = 0.0004998
## 
## 
## $payment_method_id_14
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 4.0192, df = NA, p-value = 0.05047
## 
## 
## $payment_method_id_15
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 863.29, df = NA, p-value = 0.0004998
## 
## 
## $payment_method_id_16
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 12.846, df = NA, p-value = 0.002999
## 
## 
## $payment_method_id_17
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 469.87, df = NA, p-value = 0.0004998
## 
## 
## $payment_method_id_18
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 0.22702, df = NA, p-value = 0.7406
## 
## 
## $payment_method_id_19
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 7.2301, df = NA, p-value = 0.01049
## 
## 
## $payment_method_id_20
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 555.42, df = NA, p-value = 0.0004998
## 
## 
## $payment_method_id_21
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 1.1116, df = NA, p-value = 0.3878
## 
## 
## $payment_method_id_22
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 481.96, df = NA, p-value = 0.0004998
## 
## 
## $payment_method_id_23
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 4.9746, df = NA, p-value = 0.02499
## 
## 
## $payment_method_id_26
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 73.832, df = NA, p-value = 0.0004998
## 
## 
## $payment_method_id_27
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 6.92, df = NA, p-value = 0.01149
## 
## 
## $payment_method_id_28
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 19.781, df = NA, p-value = 0.0004998
## 
## 
## $payment_method_id_29
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 19.055, df = NA, p-value = 0.0004998
## 
## 
## $payment_method_id_30
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 1.5609, df = NA, p-value = 0.2299
## 
## 
## $payment_method_id_31
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 55.822, df = NA, p-value = 0.0004998
## 
## 
## $payment_method_id_32
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 5972.8, df = NA, p-value = 0.0004998
## 
## 
## $payment_method_id_33
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 42.989, df = NA, p-value = 0.0004998
## 
## 
## $payment_method_id_34
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 145.23, df = NA, p-value = 0.0004998
## 
## 
## $payment_method_id_35
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 272.25, df = NA, p-value = 0.0004998
## 
## 
## $payment_method_id_36
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 3.4519, df = NA, p-value = 0.05947
## 
## 
## $payment_method_id_37
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 193.23, df = NA, p-value = 0.0004998
## 
## 
## $payment_method_id_38
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 707.68, df = NA, p-value = 0.0004998
## 
## 
## $payment_method_id_39
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 103.94, df = NA, p-value = 0.0004998
## 
## 
## $payment_method_id_40
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 100.35, df = NA, p-value = 0.0004998
## 
## 
## $payment_method_id_41
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  x
## X-squared = 170.51, df = NA, p-value = 0.0004998

Modelli White-Box

C50 Rules

C5.0 evoluzione dei precedenti algoritmi di decision tree (C4.5) e costruisce un albero decisionale sulla base della minimizzazione di un costo associato allo diramazione dal nodo corrente sulla base della scelta di un predittore.

Il normalized information gain (differenza di valore di entropia) e’ la metrica usata per scegliere l’attribute che ne fornisce il valore piu’ alto ai fini della decisione.

C5.0 “Rules based” puo’ creare un albero decisionale che poi decompone in un set di regole mutuamente esclusive, sulle quali si puo’ fare “pruning” e determinare un insieme piu’ piccolo di regole e che potentialmente si sovrappongono

RPART tree

Basato sull’algoritmo CART che costruisce un albero decisionale per partizionamento ricorsivo. La metrica utilizzata per la scelta del predittore da associare al nodo di split e’ il Gini Index.

Tra C5.0 e CART ci sono differenze nel modo in cui implementano il pruning e gestiscono i missing data.

relevant_features <- colnames(df_join)
relevant_features <- setdiff(relevant_features, 
                             c("msno", "is_churn", "age", "city", "registration_init_time", 
                               "gender",  "sum_paid", "payment_plan_days_sum", "membership_duration"))

feat_sum <- paste(relevant_features, collapse = "+")
frm <- as.formula(paste("is_churn ~ ", feat_sum))
frm
## is_churn ~ registered_via + hist_trans + is_auto_renew + is_cancel + 
##     payment_method_num + plan_list_price_sum + payment_method_id_3 + 
##     payment_method_id_8 + payment_method_id_10 + payment_method_id_11 + 
##     payment_method_id_12 + payment_method_id_13 + payment_method_id_14 + 
##     payment_method_id_15 + payment_method_id_16 + payment_method_id_17 + 
##     payment_method_id_18 + payment_method_id_19 + payment_method_id_20 + 
##     payment_method_id_21 + payment_method_id_22 + payment_method_id_23 + 
##     payment_method_id_26 + payment_method_id_27 + payment_method_id_28 + 
##     payment_method_id_29 + payment_method_id_30 + payment_method_id_31 + 
##     payment_method_id_32 + payment_method_id_33 + payment_method_id_34 + 
##     payment_method_id_35 + payment_method_id_36 + payment_method_id_37 + 
##     payment_method_id_38 + payment_method_id_39 + payment_method_id_40 + 
##     payment_method_id_41 + hist_log

C5.0 Rules

set.seed(1023)
train_idx <- createDataPartition(df_join$is_churn, p=0.6, list=FALSE)

trControl <- trainControl(method = "repeatedcv",  
                          number = 10, 
                          repeats = 5, 
                          verboseIter = FALSE,
                          sampling = "down"
                          )

c50_fit <- train(frm, 
                 data = df_join[train_idx,], 
                 method = "C5.0Rules",
                 trControl = trControl,
                 metric = "Kappa")

summary(c50_fit)
## 
## Call:
## C50:::C5.0.default(x = x, y = y, rules = TRUE, weights = wts)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sun Feb 03 11:30:46 2019
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 3486 cases (44 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (125, lift 2.0)
##  is_cancelTRUE <= 0
##  plan_list_price_sum <= 427
##  payment_method_id_34TRUE > 0
##  ->  class CF  [0.992]
## 
## Rule 2: (7, lift 1.8)
##  payment_method_id_19TRUE > 0
##  ->  class CF  [0.889]
## 
## Rule 3: (41/11, lift 1.4)
##  is_auto_renewTRUE <= 0
##  payment_method_id_29TRUE > 0
##  ->  class CF  [0.721]
## 
## Rule 4: (2575/848, lift 1.3)
##  plan_list_price_sum <= 427
##  ->  class CF  [0.671]
## 
## Rule 5: (911/16, lift 2.0)
##  plan_list_price_sum > 427
##  ->  class CT  [0.981]
## 
## Rule 6: (271/14, lift 1.9)
##  is_cancelTRUE > 0
##  plan_list_price_sum <= 427
##  ->  class CT  [0.945]
## 
## Rule 7: (1224/180, lift 1.7)
##  is_auto_renewTRUE <= 0
##  payment_method_id_29TRUE <= 0
##  ->  class CT  [0.852]
## 
## Rule 8: (557/97, lift 1.6)
##  hist_trans > 1
##  payment_method_id_39TRUE <= 0
##  ->  class CT  [0.825]
## 
## Default class: CF
## 
## 
## Evaluation on training data (3486 cases):
## 
##          Rules     
##    ----------------
##      No      Errors
## 
##       8  392(11.2%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##    1494   249    (a): class CF
##     143  1600    (b): class CT
## 
## 
##  Attribute usage:
## 
##  100.00% plan_list_price_sum
##   36.29% is_auto_renewTRUE
##   36.29% payment_method_id_29TRUE
##   15.98% hist_trans
##   15.98% payment_method_id_39TRUE
##   11.36% is_cancelTRUE
##    3.59% payment_method_id_34TRUE
##    0.20% payment_method_id_19TRUE
## 
## 
## Time: 0.0 secs

varImp(c50_fit)
## C5.0Rules variable importance
## 
##   only 20 most important variables shown (out of 43)
## 
##                          Overall
## plan_list_price_sum       100.00
## is_auto_renewTRUE          36.29
## payment_method_id_29TRUE   36.29
## payment_method_id_39TRUE   15.98
## hist_trans                 15.98
## is_cancelTRUE              11.36
## payment_method_id_34TRUE    3.59
## payment_method_id_19TRUE    0.20
## payment_method_id_20TRUE    0.00
## payment_method_id_35TRUE    0.00
## payment_method_id_26TRUE    0.00
## payment_method_num3         0.00
## payment_method_id_21TRUE    0.00
## payment_method_id_17TRUE    0.00
## payment_method_id_15TRUE    0.00
## payment_method_id_37TRUE    0.00
## payment_method_id_11TRUE    0.00
## payment_method_id_27TRUE    0.00
## payment_method_id_13TRUE    0.00
## payment_method_id_8TRUE     0.00

train_set <- df_join[train_idx,]
c50_train_pred <- predict(c50_fit, train_set, type="raw")
(cm_mat <- confusionMatrix(factor(c50_train_pred), train_set$is_churn))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    CF    CT
##         CF 15461   143
##         CT  2776  1600
##                                           
##                Accuracy : 0.8539          
##                  95% CI : (0.8489, 0.8588)
##     No Information Rate : 0.9128          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.455           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8478          
##             Specificity : 0.9180          
##          Pos Pred Value : 0.9908          
##          Neg Pred Value : 0.3656          
##              Prevalence : 0.9128          
##          Detection Rate : 0.7738          
##    Detection Prevalence : 0.7810          
##       Balanced Accuracy : 0.8829          
##                                           
##        'Positive' Class : CF              
## 

train_set <- df_join[train_idx,]
c50_train_prob <- predict(c50_fit, train_set, type="prob")
c50_train_pred <- ifelse(c50_train_prob[,2] > 0.8, "CT", "CF")
(cm_mat <- confusionMatrix(factor(c50_train_pred), train_set$is_churn))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    CF    CT
##         CF 18011   792
##         CT   226   951
##                                           
##                Accuracy : 0.949           
##                  95% CI : (0.9459, 0.9521)
##     No Information Rate : 0.9128          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.625           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9876          
##             Specificity : 0.5456          
##          Pos Pred Value : 0.9579          
##          Neg Pred Value : 0.8080          
##              Prevalence : 0.9128          
##          Detection Rate : 0.9015          
##    Detection Prevalence : 0.9411          
##       Balanced Accuracy : 0.7666          
##                                           
##        'Positive' Class : CF              
## 

test_set <- df_join[-train_idx,]
c50_test_prob <- predict(c50_fit, test_set, type="prob")
c50_test_pred <- ifelse(c50_test_prob[,2] > 0.8, "CT", "CF")
(cm_mat1 <- confusionMatrix(factor(c50_test_pred), test_set$is_churn))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    CF    CT
##         CF 12022   508
##         CT   136   654
##                                           
##                Accuracy : 0.9517          
##                  95% CI : (0.9479, 0.9552)
##     No Information Rate : 0.9128          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.645           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9888          
##             Specificity : 0.5628          
##          Pos Pred Value : 0.9595          
##          Neg Pred Value : 0.8278          
##              Prevalence : 0.9128          
##          Detection Rate : 0.9026          
##    Detection Prevalence : 0.9407          
##       Balanced Accuracy : 0.7758          
##                                           
##        'Positive' Class : CF              
## 

Confusion Matrix

   Predicted        Observed

                   CF              CT
             +-----------------------------+
             |              |              |
          CF |      A       |      B       |
             |              |              |
             +-------------+---------------+
             |              |              |
          CT |      C       |      D       |
             |              |              |
             +-----------------------------+

Metriche della Campagna di retention clienti

Metriche della Campagna di retention clienti - help function

retention_campaign_metrics <- function(cm_mat_i, current_churn_percentage, rho) {

  # percentuale di clienti che considerero' churn potenziali
  #
  # (C + D) / (A + B + C + D)
  #
  target_perc <- 100*(cm_mat_i[2,1] + cm_mat_i[2,2])/sum(cm_mat_i)

  # percentuale di efficienza della campagna di retention, cioe' percentuale dei veri
  # churn che raggiungo sul totale dei clienti interessati alla campagna (e' pari al negative predicted value)
  #
  # D / (C + D)
  #
  campaign_efficiency_perc <- 100*(cm_mat_i[2,2]/(cm_mat_i[2,1] + cm_mat_i[2,2]))
  
  # percentuale di veri potenziali churn che avro' raggiunto (uguale alla specificity)
  #
  # D / (B + D)
  #
  churn_reach_perc <- 100*(cm_mat_i[2,2]/(cm_mat_i[1,2] + cm_mat_i[2,2]))
  
  # tiene conto dell'efficienza e del reah sui clienti churn
  effectiveness_perc = campaign_efficiency_perc * churn_reach_perc/100
  
  # esprime quale sia la piu' bassa percentuale di churn raggiungibile finita la
  # campagna di retention, tenendo conto di un fattore rho di rendimento della
  # campagna stessa sul campione di effettivi clienti churn
  #
  # Se contatto D clienti, suppongo di convincerne rho * D a restare cliente
  # mentre (1-rho) D saranno ancora clienti churn che si sommeranno ai B clienti
  # che non ho raggiunto con la campagna perche' il modello li ritiene non churn
  #
  # Idealmente con rho = 1, il numero di clienti churn dopo la campgna di retention
  # sarebbe uguale a B
  #
  #  (B + D - rho * D) / (A + B + C + D)
  #
  min_reachable_churn_percentage <- round((100*(cm_mat_i[1,2] + cm_mat_i[2,2] - rho * cm_mat_i[2,2]))/sum(cm_mat_i), 4)

  roi <- round(100*((current_churn_percentage - min_reachable_churn_percentage)/target_perc), 1)

  v <- c(target_perc,
         campaign_efficiency_perc,
         churn_reach_perc,
         effectiveness_perc,
         min_reachable_churn_percentage,
         roi)
  v <- round(v, 1)
  names(v) <- c("target%", "efficiency%", "reach%", "effectiveness%", "min_reachable_churn%", "roi")

  v
}

(is_churn_cnt_test_set <- sum((test_set$is_churn=="CT")))
## [1] 1162
(customer_churn_percentage <- 100*(is_churn_cnt_test_set/nrow(test_set)))
## [1] 8.723724

(rcm1_wb <- retention_campaign_metrics(cm_mat1$table, customer_churn_percentage, 1))
##              target%          efficiency%               reach% 
##                  5.9                 82.8                 56.3 
##       effectiveness% min_reachable_churn%                  roi 
##                 46.6                  3.8                 82.8

RPART tree model

trControl <- trainControl(method = "repeatedcv",  
                          number = 10, 
                          repeats = 5, 
                          verboseIter = FALSE,
                          sampling = "down",
                          classProbs = TRUE,
                          summaryFunction = twoClassSummary,
                          savePredictions = TRUE,
                          returnResamp = "all")

rpart.grid <- data.frame(cp = c(0.005, 0.01, 0.03, 0.05))

rpart_fit <- train(frm, 
                 data = df_join[train_idx,], 
                 method = "rpart",
                 trControl = trControl,
                 metric = "ROC",
                 tuneGrid = rpart.grid
                 )

rpart_fit
## CART 
## 
## 19980 samples
##    39 predictor
##     2 classes: 'CF', 'CT' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 17981, 17982, 17983, 17983, 17983, 17982, ... 
## Addtional sampling using down-sampling
## 
## Resampling results across tuning parameters:
## 
##   cp     ROC        Sens       Spec     
##   0.005  0.9183860  0.8510720  0.8955908
##   0.010  0.9091038  0.8539671  0.8859494
##   0.030  0.9090771  0.8534298  0.8864079
##   0.050  0.8985119  0.8695068  0.8592631
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.005.

plot(rpart_fit)

varImp(rpart_fit)
## rpart variable importance
## 
##   only 20 most important variables shown (out of 43)
## 
##                           Overall
## is_auto_renewTRUE        100.0000
## plan_list_price_sum       88.1809
## is_cancelTRUE             48.3262
## payment_method_id_32TRUE  32.0795
## hist_trans                21.9960
## payment_method_id_38TRUE  17.6008
## payment_method_id_39TRUE   6.1409
## registered_via4            2.5091
## payment_method_id_40TRUE   1.7250
## payment_method_id_41TRUE   0.6481
## registered_via9            0.5880
## registered_via7            0.5771
## payment_method_id_34TRUE   0.3982
## hist_log                   0.2396
## payment_method_id_18TRUE   0.0000
## payment_method_id_27TRUE   0.0000
## payment_method_id_22TRUE   0.0000
## payment_method_id_33TRUE   0.0000
## payment_method_id_10TRUE   0.0000
## payment_method_id_36TRUE   0.0000

pred1 <- ROCR::prediction(as.vector(predict(rpart_fit, df_join[train_idx,], type="prob")[,2]), df_join[train_idx,]$is_churn)
perf1 <- performance(pred1, "tpr", "fpr")
plot(perf1, colorize=TRUE, print.cutoffs.at=c(0.2, 0.4, 0.6, 0.8))

rpart.plot(rpart_fit$finalModel, cex = 0.8)

rpart.rules(rpart_fit$finalModel)
##  .outcome                                                                                                                                                                
##      0.05 when plan_list_price_sum <  299 & is_cancelTRUE is 0 & is_auto_renewTRUE is 1 & payment_method_id_40TRUE is 0                   & payment_method_id_41TRUE is 0
##      0.09 when plan_list_price_sum <  299 & is_cancelTRUE is 0 & is_auto_renewTRUE is 1 & payment_method_id_40TRUE is 0 & hist_trans <  2 & payment_method_id_41TRUE is 1
##      0.19 when plan_list_price_sum <  299 & is_cancelTRUE is 0 & is_auto_renewTRUE is 1 & payment_method_id_40TRUE is 1 & hist_trans <  2                                
##      0.62 when plan_list_price_sum <  299 & is_cancelTRUE is 0 & is_auto_renewTRUE is 0                                                                                  
##      0.71 when plan_list_price_sum <  299 & is_cancelTRUE is 0 & is_auto_renewTRUE is 1 & payment_method_id_40TRUE is 1 & hist_trans >= 2                                
##      0.93 when plan_list_price_sum <  299 & is_cancelTRUE is 0 & is_auto_renewTRUE is 1 & payment_method_id_40TRUE is 0 & hist_trans >= 2 & payment_method_id_41TRUE is 1
##      0.96 when plan_list_price_sum >= 299                                                                                                                                
##      0.97 when plan_list_price_sum <  299 & is_cancelTRUE is 1

ggplot(df_join, aes(x=is_churn, y=plan_list_price_sum, fill = is_churn)) + geom_boxplot() + scale_fill_manual(values=c("#009900", "#CC0000"))

ggplot(df_join, aes(x=is_churn, y=payment_plan_days_sum, fill = is_churn)) + geom_boxplot() + scale_fill_manual(values=c("#009900", "#CC0000"))

ggplot(df_join, aes(x=cut(plan_list_price_sum, breaks = c(0,100,200,300,1000,2000)))) + geom_histogram(position = 'dodge2', stat='count', aes(fill = is_churn)) + xlab("plan list price sum") + scale_fill_manual(values=c("#009900", "#CC0000")) 

ggplot(data=df_join, aes(x=cut(hist_trans, breaks=c(0,1,2,3,4,5,10,50,100,208)))) + geom_bar(position = 'dodge2', aes(fill=is_churn)) + xlab("Transaction history length") + scale_fill_manual(values=c("#009900", "#CC0000")) + scale_x_discrete(labels = c("1","2","3","4","(5, 10]","(10,50]","(100,208]"))

ggplot(data=df_join, aes(x=cut(plan_list_price_sum/hist_trans, breaks=c(0,150,299,600,3000)))) + geom_bar(position = 'dodge2', aes(fill=is_churn)) + xlab("average subscribed plan list price") + scale_fill_manual(values=c("#009900", "#CC0000"))

ggplot(df_join, aes(x=cut(payment_plan_days_sum, breaks = c(0,100,200,300,1000,2000)))) + geom_histogram(position = 'dodge2', stat='count', aes(fill = is_churn)) + xlab("payment plan days total") + scale_fill_manual(values=c("#009900", "#CC0000")) 

ggplot(df_join, aes(x=cut(payment_plan_days_sum/hist_trans, breaks = c(0,100,200,300,1000,2000)))) + geom_histogram(position = 'dodge2', stat='count', aes(fill = is_churn)) + xlab("average payment plan days") + scale_fill_manual(values=c("#009900", "#CC0000")) 

ggplot(data=df_join, aes(x=is_auto_renew, fill=is_churn)) + geom_bar(position = 'dodge2') + xlab("auto renew") + scale_fill_manual(values=c("#009900", "#CC0000"))

df_join_pay <- left_join(df_churn, df_trans)
df_join_pay <- df_join_pay[complete.cases(df_join_pay),]
ggplot(df_join_pay %>% dplyr::select(is_churn, payment_method_id), aes(x = payment_method_id)) + geom_histogram(position = 'dodge2', stat='count', aes(fill = is_churn)) + xlab("payment method id") + scale_fill_manual(values=c("#009900", "#CC0000"))

Visualizzazioni con dati filtrati

rpart.plot(rpart_fit$finalModel, cex = 0.8)

df_filter <- df_join %>% filter(payment_plan_days_sum > 299)
ggplot(data=df_filter, aes(x=cut(hist_trans, breaks=c(0,1,2,3,4,5,10,50,100,208)))) + geom_bar(position = 'dodge2', aes(fill=is_churn)) + xlab("Transaction history length") + scale_fill_manual(values=c("#009900", "#CC0000")) + scale_x_discrete(labels = c("1","2","3","4","(5, 10]","(10,50]","(100,208]"))

rpart.plot(rpart_fit$finalModel, cex = 0.8)

df_filter <- df_join %>% filter(payment_plan_days_sum < 299) %>% filter (is_cancel == TRUE)
ggplot(data=df_filter, aes(x=cut(hist_trans, breaks=c(0,1,2,3,4,5,10,50,100,208)))) + geom_bar(position = 'dodge2', aes(fill=is_churn)) + xlab("Transaction history length") + scale_fill_manual(values=c("#009900", "#CC0000")) + scale_x_discrete(labels = c("1","2","3","4","(5, 10]","(10,50]","(100,208]"))

rpart.plot(rpart_fit$finalModel, cex = 0.8)

df_filter <- df_join %>% filter(payment_plan_days_sum < 299) %>% filter(is_cancel == FALSE) %>% filter(is_auto_renew == FALSE)
ggplot(data=df_filter, aes(x=cut(hist_trans, breaks=c(0,1,2,3,4,5,10,50,100,208)))) + geom_bar(position = 'dodge2', aes(fill=is_churn)) + xlab("Transaction history length") + scale_fill_manual(values=c("#009900", "#CC0000")) + scale_x_discrete(labels = c("1","2","3","4","(5, 10]","(10,50]","(100,208]"))

rpart.plot(rpart_fit$finalModel, cex = 0.8)

df_join_pay <- left_join(df_join, df_trans)
df_join_pay <- df_join_pay
df_filter <- df_join_pay %>% filter(payment_plan_days_sum < 299) %>% filter(is_cancel == FALSE) %>% filter(is_auto_renew == TRUE) %>% filter (hist_trans >= 2)
ggplot(df_filter %>% dplyr::select(is_churn, payment_method_id), aes(x = payment_method_id)) + geom_histogram(position = 'dodge2', stat='count', aes(fill = is_churn)) + xlab("payment method id") + scale_fill_manual(values=c("#009900", "#CC0000"))

df_join_pay <- left_join(df_join, df_trans)
df_join_pay <- df_join_pay
df_filter <- df_join_pay %>% filter(payment_plan_days_sum < 299) %>% filter(is_cancel == FALSE) %>% filter(is_auto_renew == TRUE) %>% filter (hist_trans < 2)
ggplot(df_filter %>% dplyr::select(is_churn, payment_method_id), aes(x = payment_method_id)) + geom_histogram(position = 'dodge2', stat='count', aes(fill = is_churn)) + xlab("payment method id") + scale_fill_manual(values=c("#009900", "#CC0000"))

Regitration Initial Time vs Membership Duration

df_join_ext <- left_join(df_join, df_members[,c("msno", "registration_init_time")])
df_join_ext <- left_join(df_join_ext, df_trans[,c("msno", "membership_expire_date")])
ggplot(df_join_ext %>% dplyr::select(is_churn, registration_init_time) %>% mutate(registration_init_time_year = year(ymd(registration_init_time))), aes(x = registration_init_time_year)) + geom_histogram(position = 'dodge2', stat='count', aes(fill = is_churn)) + scale_x_continuous(labels = as.character(seq(2004,2018,1)), breaks = seq(2004,2018,1)) + xlab("registration initial year") + scale_fill_manual(values=c("#009900", "#CC0000"))

ggplot(df_join_ext %>% dplyr::select(is_churn, membership_expire_date) %>% mutate(membership_expire_year = year(ymd(membership_expire_date))), aes(x = membership_expire_year)) + geom_histogram(position = 'dodge2', stat='count', aes(fill = is_churn)) + xlab("membership expiration year") + scale_fill_manual(values=c("#009900", "#CC0000"))

par(mfrow=c(1,2))
d <- density(df_join[df_join$is_churn == "CF",]$membership_duration)
plot(d, main = "Churn false membership duration")
polygon(d, col = "#009900")
grid()

d <- density(df_join[df_join$is_churn == "CT",]$membership_duration)
plot(d, main = "Churn true membership duration")
polygon(d, col = "#CC0000")
grid()

par(mfrow=c(1,1))

train_set <- df_join[train_idx,]
rpart_train_prob <- predict(rpart_fit, train_set, type="prob")
rpart_train_pred <- ifelse(rpart_train_prob[,2] > 0.5, "CT", "CF")
(cm_mat <- confusionMatrix(factor(rpart_train_pred), train_set$is_churn))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    CF    CT
##         CF 15425   150
##         CT  2812  1593
##                                           
##                Accuracy : 0.8518          
##                  95% CI : (0.8467, 0.8567)
##     No Information Rate : 0.9128          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4494          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8458          
##             Specificity : 0.9139          
##          Pos Pred Value : 0.9904          
##          Neg Pred Value : 0.3616          
##              Prevalence : 0.9128          
##          Detection Rate : 0.7720          
##    Detection Prevalence : 0.7795          
##       Balanced Accuracy : 0.8799          
##                                           
##        'Positive' Class : CF              
## 

rpart.plot(rpart_fit$finalModel, cex = 0.8)

train_set <- df_join[train_idx,]
rpart_train_prob <- predict(rpart_fit, train_set, type="prob")
rpart_train_pred <- ifelse(rpart_train_prob[,2] > 0.65, "CT", "CF")
(cm_mat <- confusionMatrix(factor(rpart_train_pred), train_set$is_churn))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    CF    CT
##         CF 17550   460
##         CT   687  1283
##                                           
##                Accuracy : 0.9426          
##                  95% CI : (0.9393, 0.9458)
##     No Information Rate : 0.9128          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6596          
##  Mcnemar's Test P-Value : 2.505e-11       
##                                           
##             Sensitivity : 0.9623          
##             Specificity : 0.7361          
##          Pos Pred Value : 0.9745          
##          Neg Pred Value : 0.6513          
##              Prevalence : 0.9128          
##          Detection Rate : 0.8784          
##    Detection Prevalence : 0.9014          
##       Balanced Accuracy : 0.8492          
##                                           
##        'Positive' Class : CF              
## 

test_set <- df_join[-train_idx,]
rpart_test_prob <- predict(rpart_fit, test_set, type="prob")
rpart_test_pred <- ifelse(rpart_test_prob[,2] > 0.65, "CT", "CF")
(cm_mat2 <- confusionMatrix(factor(rpart_test_pred), test_set$is_churn))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    CF    CT
##         CF 11725   275
##         CT   433   887
##                                           
##                Accuracy : 0.9468          
##                  95% CI : (0.9429, 0.9506)
##     No Information Rate : 0.9128          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6856          
##  Mcnemar's Test P-Value : 3.626e-09       
##                                           
##             Sensitivity : 0.9644          
##             Specificity : 0.7633          
##          Pos Pred Value : 0.9771          
##          Neg Pred Value : 0.6720          
##              Prevalence : 0.9128          
##          Detection Rate : 0.8803          
##    Detection Prevalence : 0.9009          
##       Balanced Accuracy : 0.8639          
##                                           
##        'Positive' Class : CF              
## 

(rcm2_wb <- retention_campaign_metrics(cm_mat2$table, customer_churn_percentage, 1))
##              target%          efficiency%               reach% 
##                  9.9                 67.2                 76.3 
##       effectiveness% min_reachable_churn%                  roi 
##                 51.3                  2.1                 67.2

Modelli “Black Box”

Random Forest

E’ un classificatore d’insieme che è composto da molti alberi di decisione che sono usati per determinare la risposta. Nel caso di classificazione, viene usato un meccanismo di votazione che raccoglie la risposta di ciascun albero decisionale

C50 Boost

L’algoritmo di boosting tree si basa sull’idea generale che computa una sequenza di alberi e dove ogni albero successivo al primo e’ costruito sulla base dei residui della predizione dell’albero che lo precede. In altre parole, costruiamo il primo albero, calcoliamo i residui della predizione sulla base del training set, calcoliamo un secondo albero sulla base di tali residui e sommiamo al primo albero il secondo con associato un peso. Da tale somma si e’ costruito un nuovo classificatore. Se ne calcola i residui corrispondenti e si procede alla costruzione di un nuovo albero e cosi’ via.

Per cui dalla costruzione di singoli weak learner, con questa fitting incrementale basato su residui si puo’ costruire un buon predittore.

Random Forest

trControl <- trainControl(method = "repeatedcv",  
                          number = 10, 
                          repeats = 5, 
                          verboseIter = FALSE,
                          sampling = "down",
                          classProbs = TRUE,
                          summaryFunction = twoClassSummary,
                          savePredictions = TRUE,
                          returnResamp = "all")

rf_fit <- train(frm,
                data = df_join[train_idx,],
                method = "rf",
                trControl = trControl,
                metric = 'ROC',
                tuneLength = 3
                )

rf_fit
## Random Forest 
## 
## 19980 samples
##    39 predictor
##     2 classes: 'CF', 'CT' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 17981, 17981, 17983, 17983, 17982, 17983, ... 
## Addtional sampling using down-sampling
## 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##    2    0.9091029  0.8977462  0.6973117
##   22    0.9434049  0.8861766  0.8641353
##   43    0.9383752  0.8800789  0.8409537
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 22.

varImp(rf_fit)
## rf variable importance
## 
##   only 20 most important variables shown (out of 43)
## 
##                           Overall
## plan_list_price_sum      100.0000
## is_cancelTRUE             51.1344
## is_auto_renewTRUE         41.8769
## hist_log                  18.8259
## hist_trans                11.3488
## payment_method_id_32TRUE   6.8600
## payment_method_id_38TRUE   3.8820
## payment_method_id_40TRUE   3.8567
## registered_via9            3.2208
## payment_method_id_39TRUE   2.7976
## payment_method_id_29TRUE   2.5014
## registered_via4            2.2488
## payment_method_id_37TRUE   2.0728
## payment_method_id_36TRUE   1.8239
## registered_via7            1.7188
## payment_method_id_41TRUE   1.3161
## payment_method_num2        1.0348
## payment_method_id_34TRUE   0.4855
## payment_method_id_30TRUE   0.3519
## payment_method_id_33TRUE   0.3474

plot(rf_fit)

pred1 <- ROCR::prediction(as.vector(predict(rf_fit, df_join[train_idx,], type="prob")[,2]), df_join[train_idx,]$is_churn)
perf1 <- performance(pred1, "tpr", "fpr")
plot(perf1, colorize=TRUE, print.cutoffs.at=c(0.2, 0.4, 0.6, 0.8))

train_set <- df_join[train_idx,]
rf_train_probs <- predict(rf_fit, train_set, type = "prob")
rf_train_pred <- ifelse(rf_train_probs[,2] > 0.5, "CT", "CF")
(rf_cm_mat <- confusionMatrix(factor(rf_train_pred), train_set$is_churn))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    CF    CT
##         CF 16144   164
##         CT  2093  1579
##                                           
##                Accuracy : 0.887           
##                  95% CI : (0.8826, 0.8914)
##     No Information Rate : 0.9128          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.5273          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8852          
##             Specificity : 0.9059          
##          Pos Pred Value : 0.9899          
##          Neg Pred Value : 0.4300          
##              Prevalence : 0.9128          
##          Detection Rate : 0.8080          
##    Detection Prevalence : 0.8162          
##       Balanced Accuracy : 0.8956          
##                                           
##        'Positive' Class : CF              
## 

train_set <- df_join[train_idx,]
rf_train_probs <- predict(rf_fit, train_set, type = "prob")
rf_train_pred <- ifelse(rf_train_probs[,2] > 0.9, "CT", "CF")
(rf_cm_mat <- confusionMatrix(factor(rf_train_pred), train_set$is_churn))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    CF    CT
##         CF 17635   386
##         CT   602  1357
##                                           
##                Accuracy : 0.9506          
##                  95% CI : (0.9475, 0.9535)
##     No Information Rate : 0.9128          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.706           
##  Mcnemar's Test P-Value : 7.916e-12       
##                                           
##             Sensitivity : 0.9670          
##             Specificity : 0.7785          
##          Pos Pred Value : 0.9786          
##          Neg Pred Value : 0.6927          
##              Prevalence : 0.9128          
##          Detection Rate : 0.8826          
##    Detection Prevalence : 0.9020          
##       Balanced Accuracy : 0.8728          
##                                           
##        'Positive' Class : CF              
## 

test_set <- df_join[-train_idx,]
rf_test_probs <- predict(rf_fit, test_set, type = "prob")
rf_test_pred <- ifelse(rf_test_probs[,2] > 0.9, "CT", "CF")
(cm_mat3 <- confusionMatrix(factor(rf_test_pred), test_set$is_churn))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    CF    CT
##         CF 11729   302
##         CT   429   860
##                                           
##                Accuracy : 0.9451          
##                  95% CI : (0.9411, 0.9489)
##     No Information Rate : 0.9128          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6716          
##  Mcnemar's Test P-Value : 3.158e-06       
##                                           
##             Sensitivity : 0.9647          
##             Specificity : 0.7401          
##          Pos Pred Value : 0.9749          
##          Neg Pred Value : 0.6672          
##              Prevalence : 0.9128          
##          Detection Rate : 0.8806          
##    Detection Prevalence : 0.9032          
##       Balanced Accuracy : 0.8524          
##                                           
##        'Positive' Class : CF              
## 

(rcm1_bb <- retention_campaign_metrics(cm_mat3$table, customer_churn_percentage, 1))
##              target%          efficiency%               reach% 
##                  9.7                 66.7                 74.0 
##       effectiveness% min_reachable_churn%                  roi 
##                 49.4                  2.3                 66.7

C50 boosting

trControl <- trainControl(method = "repeatedcv",  
                          number = 10, 
                          repeats = 5, 
                          verboseIter = FALSE,
                          sampling = "down",
                          classProbs = TRUE,
                          summaryFunction = twoClassSummary,
                          savePredictions = TRUE,
                          returnResamp = "all")

grid <- expand.grid(.winnow = c(FALSE), .trials=c(25, 30, 35, 40), .model="tree" )

c50_boost_fit <- train(frm,
                       data = df_join[train_idx,], 
                       method = "C5.0",
                       trControl = trControl,
                       metric = 'ROC',
                       tuneGrid=grid
                       )

plot(c50_boost_fit)

varImp(c50_boost_fit)
## C5.0 variable importance
## 
##   only 20 most important variables shown (out of 43)
## 
##                          Overall
## payment_method_id_15TRUE  100.00
## payment_method_id_34TRUE  100.00
## is_cancelTRUE             100.00
## is_auto_renewTRUE         100.00
## hist_trans                100.00
## plan_list_price_sum       100.00
## payment_method_id_33TRUE  100.00
## payment_method_id_32TRUE  100.00
## payment_method_id_31TRUE  100.00
## payment_method_id_39TRUE   99.68
## payment_method_id_40TRUE   97.73
## payment_method_id_37TRUE   91.65
## payment_method_num2        89.19
## payment_method_id_17TRUE   88.38
## payment_method_id_41TRUE   86.69
## payment_method_id_29TRUE   78.57
## payment_method_id_38TRUE   66.27
## payment_method_id_36TRUE   64.77
## registered_via4            57.49
## registered_via9            47.59

summary(c50_boost_fit$finalModel)
## 
## Call:
## (function (x, y, trials = 1, rules = FALSE, weights = NULL, control
##  = 2, fuzzyThreshold = FALSE, sample = 0, earlyStopping = TRUE, label
##  = "outcome", seed = 1692L))
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sun Feb 03 11:42:19 2019
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 3486 cases (44 attributes) from undefined.data
## 
## -----  Trial 0:  -----
## 
## Decision tree:
## 
## plan_list_price_sum > 398: CT (908/13)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE > 0: CT (266/9)
##     is_cancelTRUE <= 0:
##     :...is_auto_renewTRUE <= 0:
##         :...payment_method_id_29TRUE <= 0: CT (524/183)
##         :   payment_method_id_29TRUE > 0: CF (37/11)
##         is_auto_renewTRUE > 0:
##         :...payment_method_id_39TRUE > 0: CF (277)
##             payment_method_id_39TRUE <= 0:
##             :...hist_trans <= 1: CF (1313/132)
##                 hist_trans > 1: CT (161/54)
## 
## -----  Trial 1:  -----
## 
## Decision tree:
## 
## plan_list_price_sum > 398: CT (735.1/34.7)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE <= 0: CF (2525.8/732.1)
##     is_cancelTRUE > 0: CT (225.1/24)
## 
## -----  Trial 2:  -----
## 
## Decision tree:
## 
## payment_method_id_32TRUE > 0: CT (255.4/3.4)
## payment_method_id_32TRUE <= 0:
## :...payment_method_id_15TRUE > 0: CT (50.5)
##     payment_method_id_15TRUE <= 0:
##     :...plan_list_price_sum > 398: CT (387.3/41.3)
##         plan_list_price_sum <= 398:
##         :...payment_method_id_34TRUE > 0: CF (80.2/3.9)
##             payment_method_id_34TRUE <= 0:
##             :...is_cancelTRUE > 0: CT (192.7/31)
##                 is_cancelTRUE <= 0:
##                 :...payment_method_id_39TRUE > 0: CF (178.5)
##                     payment_method_id_39TRUE <= 0:
##                     :...payment_method_id_33TRUE > 0: CF (41.4)
##                         payment_method_id_33TRUE <= 0:
##                         :...payment_method_id_31TRUE > 0: CF (36)
##                             payment_method_id_31TRUE <= 0:
##                             :...payment_method_id_37TRUE > 0: CF (165.4/37.2)
##                                 payment_method_id_37TRUE <= 0:
##                                 :...hist_trans > 1: CT (306.6/105.4)
##                                     hist_trans <= 1:
##                                     :...registered_via7 > 0: CF (305.6/99.6)
##                                         registered_via7 <= 0:
##                                         :...plan_list_price_sum <= 105: CT (31.3/4.5)
##                                             plan_list_price_sum > 105: [S1]
## 
## SubTree [S1]
## 
## payment_method_id_40TRUE > 0: CT (412.1/167.5)
## payment_method_id_40TRUE <= 0:
## :...payment_method_id_29TRUE <= 0: CF (955/455)
##     payment_method_id_29TRUE > 0: CT (88/32.9)
## 
## -----  Trial 3:  -----
## 
## Decision tree:
## 
## plan_list_price_sum > 398: CT (540.8/51.8)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE > 0: CT (180.1/35.8)
##     is_cancelTRUE <= 0:
##     :...is_auto_renewTRUE <= 0: CT (1047.4/376)
##         is_auto_renewTRUE > 0:
##         :...plan_list_price_sum <= 120: CF (131.1/32.1)
##             plan_list_price_sum > 120:
##             :...payment_method_id_36TRUE <= 0: CF (1340.5/433.7)
##                 payment_method_id_36TRUE > 0: CT (246/114.8)
## 
## -----  Trial 4:  -----
## 
## Decision tree:
## 
## payment_method_id_32TRUE > 0: CT (187.7/4.6)
## payment_method_id_32TRUE <= 0:
## :...payment_method_id_15TRUE > 0: CT (36.7)
##     payment_method_id_15TRUE <= 0:
##     :...plan_list_price_sum > 398: CT (306.5/55.1)
##         plan_list_price_sum <= 398:
##         :...payment_method_id_31TRUE > 0: CF (27.5/1.4)
##             payment_method_id_31TRUE <= 0:
##             :...payment_method_id_33TRUE > 0: CF (32.9/2.8)
##                 payment_method_id_33TRUE <= 0:
##                 :...payment_method_id_39TRUE > 0: CF (160.1/30.4)
##                     payment_method_id_39TRUE <= 0:
##                     :...is_cancelTRUE > 0: CT (130.3/41.3)
##                         is_cancelTRUE <= 0:
##                         :...hist_trans > 1: CT (374/140.1)
##                             hist_trans <= 1:
##                             :...is_auto_renewTRUE > 0: CF (1262.2/498.5)
##                                 is_auto_renewTRUE <= 0:
##                                 :...payment_method_id_29TRUE > 0: CF (74.1/27.5)
##                                     payment_method_id_29TRUE <= 0:
##                                     :...registered_via9 <= 0: CF (562.4/258.3)
##                                         registered_via9 > 0: CT (331.5/136)
## 
## -----  Trial 5:  -----
## 
## Decision tree:
## 
## payment_method_id_32TRUE > 0: CT (166.4/5.1)
## payment_method_id_32TRUE <= 0:
## :...payment_method_id_15TRUE > 0: CT (32.3)
##     payment_method_id_15TRUE <= 0:
##     :...payment_method_id_17TRUE > 0: CT (25.6/2.9)
##         payment_method_id_17TRUE <= 0:
##         :...plan_list_price_sum > 398: CT (276.1/61.7)
##             plan_list_price_sum <= 398:
##             :...is_cancelTRUE > 0: CT (200.2/46.3)
##                 is_cancelTRUE <= 0:
##                 :...payment_method_id_39TRUE > 0: CF (114.2)
##                     payment_method_id_39TRUE <= 0:
##                     :...payment_method_id_33TRUE > 0: CF (26.5)
##                         payment_method_id_33TRUE <= 0:
##                         :...payment_method_id_37TRUE > 0: CF (142.1/56.3)
##                             payment_method_id_37TRUE <= 0:
##                             :...payment_method_id_36TRUE > 0: CF (653.2/300.5)
##                                 payment_method_id_36TRUE <= 0:
##                                 :...registered_via9 <= 0: CT (1114/490)
##                                     registered_via9 > 0: CF (735.3/326.1)
## 
## -----  Trial 6:  -----
## 
## Decision tree:
## 
## payment_method_id_32TRUE > 0: CT (150/5.4)
## payment_method_id_32TRUE <= 0:
## :...payment_method_id_15TRUE > 0: CT (28.9)
##     payment_method_id_15TRUE <= 0:
##     :...hist_trans > 6: CF (39.4/7)
##         hist_trans <= 6:
##         :...plan_list_price_sum > 398: CT (223.9/32.4)
##             plan_list_price_sum <= 398:
##             :...is_cancelTRUE > 0: CT (186.6/48.6)
##                 is_cancelTRUE <= 0:
##                 :...payment_method_id_39TRUE > 0: CF (102.4)
##                     payment_method_id_39TRUE <= 0:
##                     :...payment_method_id_29TRUE > 0: CF (131.9/54.7)
##                         payment_method_id_29TRUE <= 0:
##                         :...registered_via4 > 0: CT (315.1/132.4)
##                             registered_via4 <= 0:
##                             :...payment_method_id_40TRUE > 0: CT (550/239.6)
##                                 payment_method_id_40TRUE <= 0:
##                                 :...is_auto_renewTRUE <= 0: CT (818.2/371.3)
##                                     is_auto_renewTRUE > 0: CF (939.5/339.9)
## 
## -----  Trial 7:  -----
## 
## Decision tree:
## 
## plan_list_price_sum > 398: CT (370.6/69.2)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE > 0: CT (174.4/52)
##     is_cancelTRUE <= 0:
##     :...payment_method_id_39TRUE > 0: CF (90.8)
##         payment_method_id_39TRUE <= 0:
##         :...hist_trans > 1: CT (402.5/159.4)
##             hist_trans <= 1:
##             :...is_auto_renewTRUE > 0: CF (1380.6/546.1)
##                 is_auto_renewTRUE <= 0:
##                 :...payment_method_id_38TRUE <= 0: CT (457.2/196.9)
##                     payment_method_id_38TRUE > 0: CF (610/269)
## 
## -----  Trial 8:  -----
## 
## Decision tree:
## 
## payment_method_id_32TRUE > 0: CT (121.3/6.3)
## payment_method_id_32TRUE <= 0:
## :...plan_list_price_sum > 398: CT (246.9/69.4)
##     plan_list_price_sum <= 398:
##     :...is_cancelTRUE > 0: CT (166.3/56.5)
##         is_cancelTRUE <= 0:
##         :...payment_method_id_38TRUE > 0: CT (665.7/310.4)
##             payment_method_id_38TRUE <= 0:
##             :...plan_list_price_sum <= 120: CF (148.7/54.8)
##                 plan_list_price_sum > 120:
##                 :...payment_method_id_41TRUE <= 0: CF (1876.1/815.4)
##                     payment_method_id_41TRUE > 0: CT (261/99.9)
## 
## -----  Trial 9:  -----
## 
## Decision tree:
## 
## payment_method_id_15TRUE > 0: CT (26)
## payment_method_id_15TRUE <= 0:
## :...payment_method_id_32TRUE > 0: CT (112.4/6.5)
##     payment_method_id_32TRUE <= 0:
##     :...payment_method_id_34TRUE > 0: CF (42.5/8.2)
##         payment_method_id_34TRUE <= 0:
##         :...payment_method_id_33TRUE > 0: CF (32.1/6.5)
##             payment_method_id_33TRUE <= 0:
##             :...hist_trans > 4: CF (71.4/18.5)
##                 hist_trans <= 4:
##                 :...plan_list_price_sum > 398: CT (139.5/13.1)
##                     plan_list_price_sum <= 398:
##                     :...hist_trans > 2: CF (57.8/17.4)
##                         hist_trans <= 2:
##                         :...is_cancelTRUE > 0: CT (125.9/32.7)
##                             is_cancelTRUE <= 0:
##                             :...payment_method_id_39TRUE > 0: CF (75)
##                                 payment_method_id_39TRUE <= 0:
##                                 :...hist_trans > 1: CT (373.3/157.8)
##                                     hist_trans <= 1:
##                                     :...payment_method_id_37TRUE > 0: CF (77.9/25.2)
##                                         payment_method_id_37TRUE <= 0:
##                                         :...registered_via7 > 0: CF (355.6/150.7)
##                                             registered_via7 <= 0: [S1]
## 
## SubTree [S1]
## 
## plan_list_price_sum <= 105: CT (30.7/4.4)
## plan_list_price_sum > 105:
## :...registered_via9 > 0: CT (920.4/434.1)
##     registered_via9 <= 0:
##     :...hist_log <= 24: CF (517.5/221.6)
##         hist_log > 24: CT (528.2/249.9)
## 
## -----  Trial 10:  -----
## 
## Decision tree:
## 
## payment_method_id_32TRUE > 0: CT (102.8/6.8)
## payment_method_id_32TRUE <= 0:
## :...plan_list_price_sum > 398: CT (235.3/67.5)
##     plan_list_price_sum <= 398:
##     :...is_cancelTRUE > 0: CT (153.3/57.7)
##         is_cancelTRUE <= 0:
##         :...payment_method_id_39TRUE > 0: CF (68)
##             payment_method_id_39TRUE <= 0:
##             :...hist_trans <= 1: CF (2514.6/1148.6)
##                 hist_trans > 1: CT (412/189.1)
## 
## -----  Trial 11:  -----
## 
## Decision tree:
## 
## payment_method_id_31TRUE > 0: CF (31/3.1)
## payment_method_id_31TRUE <= 0:
## :...payment_method_id_33TRUE > 0: CF (32.9/9.5)
##     payment_method_id_33TRUE <= 0:
##     :...payment_method_id_34TRUE > 0: CF (42.1/12.7)
##         payment_method_id_34TRUE <= 0:
##         :...payment_method_id_39TRUE > 0: CF (119.3/55.9)
##             payment_method_id_39TRUE <= 0:
##             :...hist_trans <= 2: CT (3082.1/1425.8)
##                 hist_trans > 2: CF (178.6/68)
## 
## -----  Trial 12:  -----
## 
## Decision tree:
## 
## plan_list_price_sum > 398: CT (310.3/74.5)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE > 0: CT (151/59.4)
##     is_cancelTRUE <= 0:
##     :...is_auto_renewTRUE <= 0: CT (1207.4/567.1)
##         is_auto_renewTRUE > 0: CF (1817.3/783.3)
## 
## -----  Trial 13:  -----
## 
## Decision tree:
## 
## payment_method_id_32TRUE > 0: CT (88.1/7)
## payment_method_id_32TRUE <= 0:
## :...plan_list_price_sum > 398: CT (225.9/71.1)
##     plan_list_price_sum <= 398:
##     :...payment_method_id_40TRUE <= 0: CF (2471.3/1145.5)
##         payment_method_id_40TRUE > 0: CT (700.7/327.8)
## 
## -----  Trial 14:  -----
## 
## Decision tree:
## 
## plan_list_price_sum > 398: CT (289/79.4)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE > 0: CT (163.7/58.8)
##     is_cancelTRUE <= 0:
##     :...payment_method_id_39TRUE > 0: CF (53.9)
##         payment_method_id_39TRUE <= 0:
##         :...plan_list_price_sum <= 180: CF (2537/1216.4)
##             plan_list_price_sum > 180: CT (442.4/182.5)
## 
## -----  Trial 15:  -----
## 
## Decision tree:
## 
## payment_method_id_31TRUE > 0: CF (26.5/3.4)
## payment_method_id_31TRUE <= 0:
## :...payment_method_id_33TRUE > 0: CF (30.7/9.5)
##     payment_method_id_33TRUE <= 0:
##     :...payment_method_id_34TRUE > 0: CF (36.4/12.5)
##         payment_method_id_34TRUE <= 0:
##         :...hist_trans > 4: CF (78.8/25.2)
##             hist_trans <= 4:
##             :...plan_list_price_sum > 540: CT (103.4/7.3)
##                 plan_list_price_sum <= 540:
##                 :...plan_list_price_sum > 398: CT (79/14.7)
##                     plan_list_price_sum <= 398:
##                     :...plan_list_price_sum > 300: CF (217.1/86.1)
##                         plan_list_price_sum <= 300:
##                         :...is_cancelTRUE > 0: CT (113.8/29)
##                             is_cancelTRUE <= 0:
##                             :...payment_method_id_39TRUE <= 0: CT (2749.2/1284.4)
##                                 payment_method_id_39TRUE > 0: CF (51.1)
## 
## -----  Trial 16:  -----
## 
## Decision tree:
## 
## plan_list_price_sum > 398: CT (273.8/78.4)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE > 0: CT (154.8/59)
##     is_cancelTRUE <= 0:
##     :...payment_method_id_39TRUE > 0: CF (47.6)
##         payment_method_id_39TRUE <= 0:
##         :...is_auto_renewTRUE <= 0:
##             :...payment_method_id_29TRUE <= 0: CT (1138.7/530.6)
##             :   payment_method_id_29TRUE > 0: CF (86.7/33.4)
##             is_auto_renewTRUE > 0:
##             :...payment_method_num2 > 0: CF (26.9/5.6)
##                 payment_method_num2 <= 0:
##                 :...hist_trans <= 1: CF (1427.4/629.4)
##                     hist_trans > 1: CT (330.2/150.9)
## 
## -----  Trial 17:  -----
## 
## Decision tree:
## 
## payment_method_id_32TRUE > 0: CT (64.2)
## payment_method_id_32TRUE <= 0:
## :...plan_list_price_sum <= 398: CF (3214.5/1532.3)
##     plan_list_price_sum > 398: CT (206.3/75)
## 
## -----  Trial 18:  -----
## 
## Decision tree:
## 
## payment_method_id_32TRUE > 0: CT (62)
## payment_method_id_32TRUE <= 0:
## :...hist_trans > 8: CF (29.6/6.7)
##     hist_trans <= 8:
##     :...plan_list_price_sum > 398: CT (157.7/37.6)
##         plan_list_price_sum <= 398:
##         :...is_cancelTRUE > 0: CT (172.3/59.9)
##             is_cancelTRUE <= 0:
##             :...payment_method_id_39TRUE > 0: CF (43.1)
##                 payment_method_id_39TRUE <= 0:
##                 :...registered_via4 <= 0: CF (2616.3/1276.6)
##                     registered_via4 > 0: CT (402/181.8)
## 
## -----  Trial 19:  -----
## 
## Decision tree:
## 
## is_cancelTRUE > 0: CT (152.7/38)
## is_cancelTRUE <= 0:
## :...payment_method_id_39TRUE > 0: CF (56.7/15.5)
##     payment_method_id_39TRUE <= 0:
##     :...payment_method_id_34TRUE <= 0: CT (3242/1516.1)
##         payment_method_id_34TRUE > 0: CF (27.6/7.9)
## 
## -----  Trial 20:  -----
## 
## Decision tree:
## 
## payment_method_id_32TRUE > 0: CT (57)
## payment_method_id_32TRUE <= 0:
## :...is_cancelTRUE > 0: CT (134.2/24.6)
##     is_cancelTRUE <= 0:
##     :...plan_list_price_sum <= 398: CF (3129.1/1512.5)
##         plan_list_price_sum > 398: CT (153.7/34.5)
## 
## -----  Trial 21:  -----
## 
## Decision tree:
## 
## payment_method_id_31TRUE > 0: CF (25.6/3.7)
## payment_method_id_31TRUE <= 0:
## :...plan_list_price_sum > 398: CT (186.1/23.2)
##     plan_list_price_sum <= 398:
##     :...is_cancelTRUE > 0: CT (122.7/25.5)
##         is_cancelTRUE <= 0:
##         :...payment_method_id_39TRUE > 0: CF (38)
##             payment_method_id_39TRUE <= 0:
##             :...payment_method_id_37TRUE <= 0: CT (2973.2/1414)
##                 payment_method_id_37TRUE > 0: CF (123.5/45.4)
## 
## -----  Trial 22:  -----
## 
## Decision tree:
## 
## plan_list_price_sum > 398: CT (156.6)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE > 0: CT (94.5)
##     is_cancelTRUE <= 0:
##     :...payment_method_id_39TRUE > 0: CF (36)
##         payment_method_id_39TRUE <= 0:
##         :...payment_method_id_34TRUE > 0: CF (30.2)
##             payment_method_id_34TRUE <= 0:
##             :...plan_list_price_sum > 180: CT (460.6/193.3)
##                 plan_list_price_sum <= 180:
##                 :...is_auto_renewTRUE <= 0: CT (1208.1/603.2)
##                     is_auto_renewTRUE > 0: CF (1474.8/653.4)
## 
## -----  Trial 23:  -----
## 
## Decision tree:
## 
## is_cancelTRUE > 0: CT (94.5)
## is_cancelTRUE <= 0:
## :...plan_list_price_sum <= 99: CF (107.9/23.6)
##     plan_list_price_sum > 99:
##     :...payment_method_id_37TRUE > 0: CF (111.5/39)
##         payment_method_id_37TRUE <= 0:
##         :...payment_method_id_41TRUE > 0: CT (335/139.2)
##             payment_method_id_41TRUE <= 0:
##             :...payment_method_id_40TRUE > 0: CT (747/345.8)
##                 payment_method_id_40TRUE <= 0:
##                 :...payment_method_num2 > 0: CT (50/12.8)
##                     payment_method_num2 <= 0:
##                     :...plan_list_price_sum > 298: CT (276/71.8)
##                         plan_list_price_sum <= 298:
##                         :...hist_trans > 1: CF (107.6/11.3)
##                             hist_trans <= 1:
##                             :...plan_list_price_sum <= 150: CF (854.5/329.3)
##                                 plan_list_price_sum > 150:
##                                 :...hist_log <= 30: CF (695.2/307.4)
##                                     hist_log > 30: CT (69.9/20.2)
## 
## -----  Trial 24:  -----
## 
## Decision tree:
## 
## plan_list_price_sum > 398: CT (141.1)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE > 0: CT (84.8)
##     is_cancelTRUE <= 0:
##     :...plan_list_price_sum <= 0: CT (25.4)
##         plan_list_price_sum > 0:
##         :...is_auto_renewTRUE > 0: CF (1744.4/578)
##             is_auto_renewTRUE <= 0:
##             :...plan_list_price_sum > 180: CT (95.3/28.4)
##                 plan_list_price_sum <= 180:
##                 :...payment_method_id_29TRUE > 0: CF (100.4/38.3)
##                     payment_method_id_29TRUE <= 0:
##                     :...registered_via9 <= 0: CF (770.1/363.9)
##                         registered_via9 > 0: CT (447.4/204.8)
## 
## -----  Trial 25:  -----
## 
## Decision tree:
## 
## plan_list_price_sum > 398: CT (126.3)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE > 0: CT (75.9)
##     is_cancelTRUE <= 0:
##     :...payment_method_id_39TRUE > 0: CF (28.9)
##         payment_method_id_39TRUE <= 0:
##         :...plan_list_price_sum > 180: CT (509.8/182.6)
##             plan_list_price_sum <= 180:
##             :...is_auto_renewTRUE <= 0:
##                 :...registered_via9 <= 0: CT (926.8/401.6)
##                 :   registered_via9 > 0: CF (494.1/230.4)
##                 is_auto_renewTRUE > 0:
##                 :...payment_method_id_40TRUE <= 0: CF (471/4.3)
##                     payment_method_id_40TRUE > 0: CT (764.1/365.9)
## 
## -----  Trial 26:  -----
## 
## Decision tree:
## 
## is_cancelTRUE > 0: CT (80.4)
## is_cancelTRUE <= 0:
## :...plan_list_price_sum > 180: CT (639/193.5)
##     plan_list_price_sum <= 180:
##     :...plan_list_price_sum <= 0: CT (25.4)
##         plan_list_price_sum > 0:
##         :...is_auto_renewTRUE > 0: CF (1037.3/72.7)
##             is_auto_renewTRUE <= 0:
##             :...payment_method_id_29TRUE > 0: CF (112.6/47.1)
##                 payment_method_id_29TRUE <= 0:
##                 :...hist_log <= 23: CF (618.7/289.1)
##                     hist_log > 23: CT (815.7/360.3)
## 
## -----  Trial 27:  -----
## 
## Decision tree:
## 
## plan_list_price_sum > 358: CT (215/10.9)
## plan_list_price_sum <= 358:
## :...is_cancelTRUE > 0: CT (64.4)
##     is_cancelTRUE <= 0:
##     :...is_auto_renewTRUE > 0: CF (1350.3/158.1)
##         is_auto_renewTRUE <= 0:
##         :...hist_trans > 1: CT (37.2)
##             hist_trans <= 1:
##             :...plan_list_price_sum <= 120: CT (29.4)
##                 plan_list_price_sum > 120:
##                 :...hist_log <= 22: CT (612.3/256.2)
##                     hist_log > 22: CF (989.4/454.5)
## 
## -----  Trial 28:  -----
## 
## Decision tree:
## 
## plan_list_price_sum > 358: CT (175.9)
## plan_list_price_sum <= 358:
## :...is_cancelTRUE > 0: CT (55.5)
##     is_cancelTRUE <= 0:
##     :...payment_method_id_32TRUE > 0: CT (34.5)
##         payment_method_id_32TRUE <= 0:
##         :...payment_method_id_39TRUE > 0: CF (220.2)
##             payment_method_id_39TRUE <= 0:
##             :...is_auto_renewTRUE > 0:
##                 :...plan_list_price_sum <= 180: CF (721.8)
##                 :   plan_list_price_sum > 180: CT (313.8/85.8)
##                 is_auto_renewTRUE <= 0:
##                 :...hist_trans > 1: CT (31.7)
##                     hist_trans <= 1:
##                     :...plan_list_price_sum <= 120: CT (25.3)
##                         plan_list_price_sum > 120:
##                         :...payment_method_id_29TRUE > 0: CF (95.5/28.3)
##                             payment_method_id_29TRUE <= 0:
##                             :...hist_log <= 23: CF (696.9/327.1)
##                                 hist_log > 23: CT (905.8/329.6)
## 
## -----  Trial 29:  -----
## 
## Decision tree:
## 
## is_auto_renewTRUE > 0:
## :...is_cancelTRUE <= 0: CF (1163.8/238)
## :   is_cancelTRUE > 0: CT (53.2)
## is_auto_renewTRUE <= 0:
## :...payment_method_id_29TRUE > 0: CF (82.1/22.6)
##     payment_method_id_29TRUE <= 0:
##     :...hist_log > 30: CT (168.7)
##         hist_log <= 30:
##         :...plan_list_price_sum > 180: CT (134.7)
##             plan_list_price_sum <= 180:
##             :...registered_via9 > 0: CT (619.4/107.4)
##                 registered_via9 <= 0:
##                 :...payment_method_id_36TRUE <= 0: CT (625.1/173)
##                     payment_method_id_36TRUE > 0: CF (331.1/144.7)
## 
## 
## Evaluation on training data (3486 cases):
## 
## Trial        Decision Tree   
## -----      ----------------  
##    Size      Errors  
## 
##    0      7  402(11.5%)
##    1      3  613(17.6%)
##    2     15  707(20.3%)
##    3      6  573(16.4%)
##    4     12  495(14.2%)
##    5     11  881(25.3%)
##    6     11  673(19.3%)
##    7      7  452(13.0%)
##    8      7  631(18.1%)
##    9     16  807(23.1%)
##   10      6  489(14.0%)
##   11      6 1566(44.9%)
##   12      4  470(13.5%)
##   13      4  950(27.3%)
##   14      5  488(14.0%)
##   15     10 1352(38.8%)
##   16      8  398(11.4%)
##   17      3  830(23.8%)
##   18      7  599(17.2%)
##   19      4 1445(41.5%)
##   20      4  582(16.7%)
##   21      6 1256(36.0%)
##   22      7  415(11.9%)
##   23     11  833(23.9%)
##   24      8  476(13.7%)
##   25      8  663(19.0%)
##   26      7  684(19.6%)
##   27      7  512(14.7%)
##   28     11  410(11.8%)
##   29      8  616(17.7%)
## boost            394(11.3%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##    1523   220    (a): class CF
##     174  1569    (b): class CT
## 
## 
##  Attribute usage:
## 
##  100.00% hist_trans
##  100.00% is_auto_renewTRUE
##  100.00% is_cancelTRUE
##  100.00% plan_list_price_sum
##  100.00% payment_method_id_15TRUE
##  100.00% payment_method_id_31TRUE
##  100.00% payment_method_id_32TRUE
##  100.00% payment_method_id_33TRUE
##  100.00% payment_method_id_34TRUE
##   99.68% payment_method_id_39TRUE
##   97.73% payment_method_id_40TRUE
##   91.65% payment_method_id_37TRUE
##   89.19% payment_method_num2
##   88.38% payment_method_id_17TRUE
##   86.69% payment_method_id_41TRUE
##   78.57% payment_method_id_29TRUE
##   66.27% payment_method_id_38TRUE
##   64.77% payment_method_id_36TRUE
##   57.49% registered_via4
##   47.59% registered_via9
##   45.55% hist_log
##   41.05% registered_via7
## 
## 
## Time: 0.3 secs

pred1 <- ROCR::prediction(as.vector(predict(c50_boost_fit, df_join[train_idx,], type="prob")[,2]), df_join[train_idx,]$is_churn)
perf1 <- performance(pred1, "tpr", "fpr")
plot(perf1, colorize=TRUE, print.cutoffs.at=c(0.2, 0.4, 0.6, 0.8))

train_set <- df_join[train_idx,]
c50_boost_train_probs <- predict(c50_boost_fit, train_set, type = "prob")
c50_boost_train_pred <- ifelse(c50_boost_train_probs[,2] > 0.5, "CT", "CF")
(c50_boost_cm_mat <- confusionMatrix(factor(c50_boost_train_pred), train_set$is_churn))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    CF    CT
##         CF 15758   174
##         CT  2479  1569
##                                           
##                Accuracy : 0.8672          
##                  95% CI : (0.8624, 0.8719)
##     No Information Rate : 0.9128          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4782          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8641          
##             Specificity : 0.9002          
##          Pos Pred Value : 0.9891          
##          Neg Pred Value : 0.3876          
##              Prevalence : 0.9128          
##          Detection Rate : 0.7887          
##    Detection Prevalence : 0.7974          
##       Balanced Accuracy : 0.8821          
##                                           
##        'Positive' Class : CF              
## 

train_set <- df_join[train_idx,]
c50_boost_train_probs <- predict(c50_boost_fit, train_set, type = "prob")
c50_boost_train_pred <- ifelse(c50_boost_train_probs[,2] > 0.7, "CT", "CF")
(c50_boost_cm_mat <- confusionMatrix(factor(c50_boost_train_pred), train_set$is_churn))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    CF    CT
##         CF 17899   502
##         CT   338  1241
##                                           
##                Accuracy : 0.958           
##                  95% CI : (0.9551, 0.9607)
##     No Information Rate : 0.9128          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7243          
##  Mcnemar's Test P-Value : 1.865e-08       
##                                           
##             Sensitivity : 0.9815          
##             Specificity : 0.7120          
##          Pos Pred Value : 0.9727          
##          Neg Pred Value : 0.7859          
##              Prevalence : 0.9128          
##          Detection Rate : 0.8958          
##    Detection Prevalence : 0.9210          
##       Balanced Accuracy : 0.8467          
##                                           
##        'Positive' Class : CF              
## 

test_set <- df_join[-train_idx,]
c50_boost_test_probs <- predict(c50_boost_fit, test_set, type = "prob")
c50_boost_test_pred <- ifelse(c50_boost_test_probs[,2] > 0.7, "CT", "CF")
(cm_mat4 <- confusionMatrix(factor(c50_boost_test_pred), test_set$is_churn))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    CF    CT
##         CF 11934   296
##         CT   224   866
##                                           
##                Accuracy : 0.961           
##                  95% CI : (0.9575, 0.9642)
##     No Information Rate : 0.9128          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7478          
##  Mcnemar's Test P-Value : 0.001848        
##                                           
##             Sensitivity : 0.9816          
##             Specificity : 0.7453          
##          Pos Pred Value : 0.9758          
##          Neg Pred Value : 0.7945          
##              Prevalence : 0.9128          
##          Detection Rate : 0.8959          
##    Detection Prevalence : 0.9182          
##       Balanced Accuracy : 0.8634          
##                                           
##        'Positive' Class : CF              
## 

(rcm2_bb <- retention_campaign_metrics(cm_mat4$table, customer_churn_percentage, 1))
##              target%          efficiency%               reach% 
##                  8.2                 79.4                 74.5 
##       effectiveness% min_reachable_churn%                  roi 
##                 59.2                  2.2                 79.4

cbind(rcm1_wb, rcm2_wb, rcm1_bb, rcm2_bb)
##                      rcm1_wb rcm2_wb rcm1_bb rcm2_bb
## target%                  5.9     9.9     9.7     8.2
## efficiency%             82.8    67.2    66.7    79.4
## reach%                  56.3    76.3    74.0    74.5
## effectiveness%          46.6    51.3    49.4    59.2
## min_reachable_churn%     3.8     2.1     2.3     2.2
## roi                     82.8    67.2    66.7    79.4

Plot del ROI della campagna di “retention” basata sul rendimento (“rho”) di convincimento dei clienti della stessa.

rho <- seq(0, 1, by = 0.01)

roi_wb1 <- sapply(rho, function(x) retention_campaign_metrics(cm_mat1$table, customer_churn_percentage, x)["roi"])
roi_wb2 <- sapply(rho, function(x) retention_campaign_metrics(cm_mat2$table, customer_churn_percentage, x)["roi"])
roi_bb1 <- sapply(rho, function(x) retention_campaign_metrics(cm_mat3$table, customer_churn_percentage, x)["roi"])
roi_bb2 <- sapply(rho, function(x) retention_campaign_metrics(cm_mat4$table, customer_churn_percentage, x)["roi"])

mrc_wb1 <- sapply(rho, function(x) retention_campaign_metrics(cm_mat1$table, customer_churn_percentage, x)["min_reachable_churn%"])
mrc_wb2 <- sapply(rho, function(x) retention_campaign_metrics(cm_mat2$table, customer_churn_percentage, x)["min_reachable_churn%"])
mrc_bb1 <- sapply(rho, function(x) retention_campaign_metrics(cm_mat3$table, customer_churn_percentage, x)["min_reachable_churn%"])
mrc_bb2 <- sapply(rho, function(x) retention_campaign_metrics(cm_mat4$table, customer_churn_percentage, x)["min_reachable_churn%"])


rho_df <- data.frame(rho = rho, c50rules = roi_wb1, rpart = roi_wb2, rf = roi_bb1, c50boost = roi_bb2)
mrc_df <- data.frame(rho = rho, c50rules = mrc_wb1, rpart = mrc_wb2, rf = mrc_bb1, c50boost = mrc_bb2)

rho_plot <- ggplot(data = rho_df) + geom_line(aes(x = rho, y = c50rules, colour = "c50rules"),  size = 1.2) + geom_line(aes(x = rho, y = rpart, colour = "rpart"), size = 1.2) + geom_line(aes(x = rho, y = rf, colour = "rf"), size = 1.2) + geom_line(aes(x = rho, y = c50boost, colour = "c50boost"), size = 1.2) + ylab("ROI")  + scale_colour_manual("model", values = c("red", "blue", "green", "violet")) + theme_bw() + ggtitle("Models ROI as a function of retention campaign rho")

rho_plot

mrc_plot <- ggplot(data = mrc_df) + geom_line(aes(x = rho, y = c50rules, colour = "c50rules"),  size = 1.2) + geom_line(aes(x = rho, y = rpart, colour = "rpart"), size = 1.2) + geom_line(aes(x = rho, y = rf, colour = "rf"), size = 1.2) + geom_line(aes(x = rho, y = c50boost, colour = "c50boost"), size = 1.2) + ylab("Minimum Reachable Churn Percentage")  + scale_colour_manual("model", values = c("red", "blue", "green", "violet")) + theme_bw() + ggtitle("Models Minimum Churn Reach as a function of retention campaign rho")
mrc_plot

Confronto Accuratezza Modelli

model_results <- resamples(list(RPART_TREE=rpart_fit, RANDOM_FOREST=rf_fit, C50_BOOST=c50_boost_fit))

summary(model_results)
## 
## Call:
## summary.resamples(object = model_results)
## 
## Models: RPART_TREE, RANDOM_FOREST, C50_BOOST 
## Number of resamples: 50 
## 
## ROC 
##                    Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## RPART_TREE    0.8892074 0.9028988 0.9178273 0.9183860 0.9321598 0.9537219
## RANDOM_FOREST 0.9241416 0.9341439 0.9437298 0.9434049 0.9525766 0.9638306
## C50_BOOST     0.9351967 0.9498406 0.9526146 0.9532042 0.9575689 0.9692604
##               NA's
## RPART_TREE       0
## RANDOM_FOREST    0
## C50_BOOST        0
## 
## Sens 
##                    Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## RPART_TREE    0.8267544 0.8449197 0.8516996 0.8510720 0.8567708 0.8749314
## RANDOM_FOREST 0.8617663 0.8796601 0.8873355 0.8861766 0.8941741 0.9040044
## C50_BOOST     0.8360746 0.8543037 0.8626645 0.8696716 0.8803454 0.9248904
##               NA's
## RPART_TREE       0
## RANDOM_FOREST    0
## C50_BOOST        0
## 
## Spec 
##                    Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## RPART_TREE    0.8333333 0.8750000 0.8965517 0.8955908 0.9182266 0.9540230
## RANDOM_FOREST 0.7816092 0.8469294 0.8678161 0.8641353 0.8800000 0.9310345
## C50_BOOST     0.7586207 0.8692529 0.9022989 0.8893695 0.9141626 0.9425287
##               NA's
## RPART_TREE       0
## RANDOM_FOREST    0
## C50_BOOST        0

bwplot(model_results)

Bootstrap Confidence Interval - specificity

boot.fn <- function(data, index) {
  test_set <- data[index,]
  rpart_resample_probs <- predict(rpart_fit, test_set, type = "prob")
  rpart_resample_test_pred <- ifelse(rpart_resample_probs[,2] > 0.65, "CT", "CF")
  cm_mat_resample <- confusionMatrix(factor(rpart_resample_test_pred), test_set$is_churn)
  cm_mat_resample$byClass[2]
}

set.seed(1023)        
results <- boot(data=df_join[-train_idx,], boot.fn, R = 100)
results
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = df_join[-train_idx, ], statistic = boot.fn, R = 100)
## 
## 
## Bootstrap Statistics :
##      original       bias    std. error
## t1* 0.7633391 -0.002040165  0.01379297

Bootstrap Confidence Interval - specificity

boot.ci(results, type = "norm")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 100 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, type = "norm")
## 
## Intervals : 
## Level      Normal        
## 95%   ( 0.7383,  0.7924 )  
## Calculations and Intervals on Original Scale

Bootstrap Confidence Interval - negative predicted value

boot.fn <- function(data, index) {
  test_set <- data[index,]
  rpart_resample_probs <- predict(rpart_fit, test_set, type = "prob")
  rpart_resample_test_pred <- ifelse(rpart_resample_probs[,2] > 0.65, "CT", "CF")
  cm_mat_resample <- confusionMatrix(factor(rpart_resample_test_pred), test_set$is_churn)
  cm_mat_resample$byClass[4]
}

set.seed(1023)        
results <- boot(data=df_join[-train_idx,], boot.fn, R=100)
results
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = df_join[-train_idx, ], statistic = boot.fn, R = 100)
## 
## 
## Bootstrap Statistics :
##      original       bias    std. error
## t1* 0.6719697 -0.001793563  0.01135488

Bootstrap Confidence Interval - negative predicted value

boot.ci(results, type = "norm")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 100 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, type = "norm")
## 
## Intervals : 
## Level      Normal        
## 95%   ( 0.6515,  0.6960 )  
## Calculations and Intervals on Original Scale

Bootstrap Confidence Interval - Minimum Reachable Churn

boot.fn <- function(data, index) {
  test_set <- data[index,]
  rpart_resample_probs <- predict(rpart_fit, test_set, type = "prob")
  rpart_resample_test_pred <- ifelse(rpart_resample_probs[,2] > 0.65, "CT", "CF")
  cm_mat_resample <- confusionMatrix(factor(rpart_resample_test_pred), test_set$is_churn)
  rcm_boot <- retention_campaign_metrics(cm_mat_resample$table, customer_churn_percentage, 1)
  rcm_boot[5]
}

set.seed(1023)        
results <- boot(data=df_join[-train_idx,], boot.fn, R = 100)
results
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = df_join[-train_idx, ], statistic = boot.fn, R = 100)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*      2.1  -0.018   0.1431112

Bootstrap Confidence Interval - Minimum Reachable Churn

boot.ci(results, type = "norm")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 100 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, type = "norm")
## 
## Intervals : 
## Level      Normal        
## 95%   ( 1.838,  2.398 )  
## Calculations and Intervals on Original Scale

Bootstrap Confidence Interval - ROI

boot.fn <- function(data, index) {
  test_set <- data[index,]
  rpart_resample_probs <- predict(rpart_fit, test_set, type = "prob")
  rpart_resample_test_pred <- ifelse(rpart_resample_probs[,2] > 0.65, "CT", "CF")
  cm_mat_resample <- confusionMatrix(factor(rpart_resample_test_pred), test_set$is_churn)
  rcm_boot <- retention_campaign_metrics(cm_mat_resample$table, customer_churn_percentage, 1)
  rcm_boot[6]
}

set.seed(1023)        
results <- boot(data=df_join[-train_idx,], boot.fn, R = 100)
results
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = df_join[-train_idx, ], statistic = boot.fn, R = 100)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     67.2    0.04    2.209529

Bootstrap Confidence Interval - ROI

boot.ci(results, type = "norm")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 100 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, type = "norm")
## 
## Intervals : 
## Level      Normal        
## 95%   (62.83, 71.49 )  
## Calculations and Intervals on Original Scale

Ensemble Model

ensemble_train <- cbind(c50_train_pred, rpart_train_pred, rf_train_pred, c50_boost_train_pred)

weights <- c(1,1,2,2)
thresh <- 4

train_res <- apply(ensemble_train, 1, function(x) { sum(weights %*% as.numeric (x =="CT"))})
train_res_churn <- ifelse(train_res > thresh, "CT", "CF")

(cm_mat_train_ensemble <- confusionMatrix(factor(train_res_churn), train_set$is_churn))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    CF    CT
##         CF 18126   589
##         CT   111  1154
##                                           
##                Accuracy : 0.965           
##                  95% CI : (0.9623, 0.9675)
##     No Information Rate : 0.9128          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7489          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9939          
##             Specificity : 0.6621          
##          Pos Pred Value : 0.9685          
##          Neg Pred Value : 0.9123          
##              Prevalence : 0.9128          
##          Detection Rate : 0.9072          
##    Detection Prevalence : 0.9367          
##       Balanced Accuracy : 0.8280          
##                                           
##        'Positive' Class : CF              
## 

ensemble_test <- cbind(c50_test_pred, rpart_test_pred, rf_test_pred, c50_boost_test_pred)

test_res <- apply(ensemble_test, 1, function(x) { sum(weights %*% as.numeric (x =="CT"))})
test_res_churn <- ifelse(test_res > thresh, "CT", "CF")

(cm_mat_test_ensemble <- confusionMatrix(factor(test_res_churn), test_set$is_churn))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    CF    CT
##         CF 12066   378
##         CT    92   784
##                                           
##                Accuracy : 0.9647          
##                  95% CI : (0.9614, 0.9678)
##     No Information Rate : 0.9128          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7507          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9924          
##             Specificity : 0.6747          
##          Pos Pred Value : 0.9696          
##          Neg Pred Value : 0.8950          
##              Prevalence : 0.9128          
##          Detection Rate : 0.9059          
##    Detection Prevalence : 0.9342          
##       Balanced Accuracy : 0.8336          
##                                           
##        'Positive' Class : CF              
## 

(rcm_ensemble <- retention_campaign_metrics(cm_mat_test_ensemble$table, customer_churn_percentage, 1))
##              target%          efficiency%               reach% 
##                  6.6                 89.5                 67.5 
##       effectiveness% min_reachable_churn%                  roi 
##                 60.4                  2.8                 89.5

cbind(rcm1_wb, rcm2_wb, rcm1_bb, rcm2_bb, rcm_ensemble)
##                      rcm1_wb rcm2_wb rcm1_bb rcm2_bb rcm_ensemble
## target%                  5.9     9.9     9.7     8.2          6.6
## efficiency%             82.8    67.2    66.7    79.4         89.5
## reach%                  56.3    76.3    74.0    74.5         67.5
## effectiveness%          46.6    51.3    49.4    59.2         60.4
## min_reachable_churn%     3.8     2.1     2.3     2.2          2.8
## roi                     82.8    67.2    66.7    79.4         89.5

Model Stacking

df_stack <- data.frame(c50_train_pred, rpart_train_pred, rf_train_pred, c50_boost_train_pred, train_set$is_churn)

colnames(df_stack) <- c("C50", "RPART", "RF", "C50BOOST", "is_churn")

trControl <- trainControl(method = "repeatedcv",  
                          number = 10, 
                          repeats = 5, 
                          verboseIter = FALSE)

stack_fit <- train(is_churn ~ .,
                    data = df_stack, 
                    method = "glm",
                    trControl = trControl,
                    metric = 'Accuracy',
                    tuneLength = 10)

summary(stack_fit)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1366  -0.1866  -0.1866  -0.1866   2.8494  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.04222    0.05525 -73.166  < 2e-16 ***
## C50CT        0.38531    0.16359   2.355   0.0185 *  
## RPARTCT      0.81355    0.13805   5.893 3.79e-09 ***
## RFCT         2.93218    0.09323  31.452  < 2e-16 ***
## C50BOOSTCT   2.08604    0.16462  12.672  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11832.1  on 19979  degrees of freedom
## Residual deviance:  5143.7  on 19975  degrees of freedom
## AIC: 5153.7
## 
## Number of Fisher Scoring iterations: 6

stack_train_probs <- predict(stack_fit, df_stack, type = "prob")
stack_train_pred <- ifelse(stack_train_probs[,2] > 0.5, "CT", "CF")
(stack_boost_cm_mat <- confusionMatrix(factor(stack_train_pred), train_set$is_churn))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    CF    CT
##         CF 18115   541
##         CT   122  1202
##                                           
##                Accuracy : 0.9668          
##                  95% CI : (0.9642, 0.9693)
##     No Information Rate : 0.9128          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7662          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9933          
##             Specificity : 0.6896          
##          Pos Pred Value : 0.9710          
##          Neg Pred Value : 0.9079          
##              Prevalence : 0.9128          
##          Detection Rate : 0.9067          
##    Detection Prevalence : 0.9337          
##       Balanced Accuracy : 0.8415          
##                                           
##        'Positive' Class : CF              
## 

df_stack_test <- data.frame(c50_test_pred, rpart_test_pred, rf_test_pred, c50_boost_test_pred, test_set$is_churn)
colnames(df_stack_test) <- c("C50", "RPART", "RF", "C50BOOST", "is_churn")

stack_probs <- predict(stack_fit, df_stack_test, type = "prob")
stack_test_pred <- ifelse(stack_probs[,2] > 0.5, "CT", "CF")
(cm_mat_stack <- confusionMatrix(factor(stack_test_pred), test_set$is_churn))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    CF    CT
##         CF 12045   353
##         CT   113   809
##                                           
##                Accuracy : 0.965           
##                  95% CI : (0.9618, 0.9681)
##     No Information Rate : 0.9128          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7577          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9907          
##             Specificity : 0.6962          
##          Pos Pred Value : 0.9715          
##          Neg Pred Value : 0.8774          
##              Prevalence : 0.9128          
##          Detection Rate : 0.9043          
##    Detection Prevalence : 0.9308          
##       Balanced Accuracy : 0.8435          
##                                           
##        'Positive' Class : CF              
## 

(rcm_stacking <- retention_campaign_metrics(cm_mat_stack$table, customer_churn_percentage, 1))
##              target%          efficiency%               reach% 
##                  6.9                 87.7                 69.6 
##       effectiveness% min_reachable_churn%                  roi 
##                 61.1                  2.7                 87.7

cbind(rcm1_wb, rcm2_wb, rcm1_bb, rcm2_bb, rcm_ensemble, rcm_stacking)
##                      rcm1_wb rcm2_wb rcm1_bb rcm2_bb rcm_ensemble
## target%                  5.9     9.9     9.7     8.2          6.6
## efficiency%             82.8    67.2    66.7    79.4         89.5
## reach%                  56.3    76.3    74.0    74.5         67.5
## effectiveness%          46.6    51.3    49.4    59.2         60.4
## min_reachable_churn%     3.8     2.1     2.3     2.2          2.8
## roi                     82.8    67.2    66.7    79.4         89.5
##                      rcm_stacking
## target%                       6.9
## efficiency%                  87.7
## reach%                       69.6
## effectiveness%               61.1
## min_reachable_churn%          2.7
## roi                          87.7

ensemble <- sapply(rho, function(x) retention_campaign_metrics(cm_mat_test_ensemble$table, customer_churn_percentage, x)["roi"])
stack <- sapply(rho, function(x) retention_campaign_metrics(cm_mat_stack$table, customer_churn_percentage, x)["roi"])

rho_df <- rho_df %>% mutate(ensemble = ensemble) %>% mutate(stack = stack)
rho_plot + geom_line(aes(x = rho, y = ensemble, colour = "ensemble"), size = 1.2) + geom_line(aes(x = rho, y = stack, colour = "stack"), size = 1.2) + scale_colour_manual("model", values = c("red", "blue", "green", "violet", "yellow", "black")) 

ensemble <- sapply(rho, function(x) retention_campaign_metrics(cm_mat_test_ensemble$table, customer_churn_percentage, x)["min_reachable_churn%"])
stack <- sapply(rho, function(x) retention_campaign_metrics(cm_mat_stack$table, customer_churn_percentage, x)["min_reachable_churn%"])

mrc_df <- mrc_df %>% mutate(ensemble = ensemble) %>% mutate(stack = stack)

mrc_plot + geom_line(aes(x = rho, y = ensemble, colour = "ensemble"), size = 1.2) + geom_line(aes(x = rho, y = stack, colour = "stack"), size = 1.2) + scale_colour_manual("model", values = c("red", "blue", "green", "violet", "yellow", "black"))