Giorgio Garziano
Un fornitore di servizi di musica in streaming vorrebbe predire se un utente non sottoscriverà piu’ alcuno dei loro servizi dopo 30 giorni dalla scadenza della sottoscrizione corrente (churn).
I sottoscrittori possono scegliere di effettuare un rinnovo esplicito o automatico della loro sottoscrizione attiva.
Un utente puo’ cancellare la sottoscrizione perche’ vuole cambiare il proprio piano di sottoscrizione o per altre ragioni.
Il criterio di “churn” consiste nel non sottoscrivere alcun nuovo servizio entro 30 giorni dalla scadenza di una sottoscrizione già effettuata o dalla sua esplicita cancellazione.
Notare che il campo “is_cancel” indica se l’utente ha effettuato una cancellazione esplicitamente, ma non e’ conclusivo sul fatto che cio’ implichi un “churn”.
churn.tab: tabella dei clienti
members.tab: informazioni generali sugli utenti
transactions.tab: tabella delle transazioni
user_logs.tab: tabella di log attività utenti che descrive comportamento giorno per giorno
Predizione del churn: supervised learning con indicatore di churn quale variabile target
Spiegazione del churn: modello di tipo “white box” + post-analisi esplorativa
Definizione di metriche relative alla campagna di “retention” sulla base delle Confusion Matrix
L’analisi esplorativa e’ suddivisa nei seguenti passi:
sommarizzazione e graficazione
clean-up dei dati
costruzione di un dataset quale risultato delle operazioni di “join” sulla chiave primaria “identificativo utente”
trasformazione long-to-wide del metodo di pagamento
feature engineering
L’individuazione dei potenziali predittori ha utilizzato i seguenti strumenti:
box-plot, density-plot e linear discriminant analysis per capire il grado di separabilità della variable target (is_churn) sulla base di uno specifico predittore candidato
correlation plot delle variabili quantitative
chi-square test delle variabili categoriche
Modelli “white-box”:
C5.0 Rules
Recursive Partitioning Tree (rpart)
Modelli “black-box”:
Random Forest
C5.0 boosting
I clienti che risultano piu’ propensi al churn sono quelli che hanno sottoscritto i piani di abbonamento piu’ costosi, e tendenzialmente senza l’opzione di auto-rinnovo.
Inoltre, il tipo di pagamento scelto e il numero di transazioni effettuate sembrano avere una certa importanza.
Predicted Observed
CF CT
+-----------------------------+
| | |
CF | A | B |
| | |
+-------------+---------------+
| | |
CT | C | D |
| | |
+-----------------------------+
Positive Class: “CF”
Accuracy := (A + D) / (A + B + C + D)
Sensitivity := A / (A + C)
Specificity := D / (B + D)
PPV := A / (A + B)
NPV := D / (C + D)
Predicted Observed
CF CT
+-----------------------------+
| | |
CF | A | B |
| | |
+-------------+---------------+
| | |
CT | C | D |
| | |
+-----------------------------+
Prevalence := (A + C) / (A + B + C + D)
DetectionRate := A / (A + B + C + D)
Detection Prevalence := (A + B) / (A + B + C + D)
BalancedAccuracy := (Sensitivity + Specificity) / 2
Predicted Observed
CF CT
+-----------------------------+
| | |
CF | A | B |
| | |
+-------------+---------------+
| | |
CT | C | D |
| | |
+-----------------------------+
No Information Rate := (A + C)/(A + B + C + D), accuratezza se si rispondesse sempre CF
P-Value [Acc > NIR] := p-value per la null-hypothesis che l’accuratezza non sia maggiore del No Information Rate
Kappa di Cohen := misura di quanto meglio si classifica rispetto al classificatore randomico; chiave di lettura:
[0,0.2]: scarso
[0.2,0.4]: discreto
[0.4,0.6]: buono
[0.6,0.8]: molto buono
[0.8, 1]: ottimo
Mcnemar’s Test P-Value := p-value per la null hypothesis che FP == FN
Predicted Observed
CF CT
+-----------------------------+
| | |
CF | A | B |
| | |
+-------------+---------------+
| | |
CT | C | D |
| | |
+-----------------------------+
frazione di clienti churn := (B + D) / (A + B + C + D) (current_churn) oppure da dati storici
frazione di clienti che considerero’ churn potenziali = (C + D) / (A + B + C + D) (campaign target)
efficienza della campagna di retention = D / (C + D) (efficiency)
veri potenziali clienti churn raggiunti = D / (B + D) (churn_reach)
efficacia := efficiency * churn_reach
valore minimo di churn post-campagna: (B + D - rho * D) / (A + B + C + D) (indicato come min_reachable_churn)
con rho rendimento di conversione cliente da churn a non-churn
ROI (Return On Investment) := (current_churn - min_reachable_churn)/campaign_target = rho * D / (C + D)
Il Return e’ interpretato come decremento della frazione (percentuale) di clienti churn L’Investiment come la frazione di target di clienti da contattare nella campagna di retention
KPI (Key Performance Indicator) ricevuti dal Business Management/Marketing:
valore di churn raggiunto a seguito della campagna di retention dei clienti
frazione massima di clienti sul totale dell’intero portafoglio che si intende contattare durante la campagna di retention
efficienza minima desiderata della campagna di retention, percentuale di clienti churn sul totale di quelli contattati
rendimento atteso di conversione del cliente da churn a non churn
Dati dall’Information Technology:
dati storici con label churn/no churn sulla base del quale costruire il modello di supervised learning
dati di portafoglio clienti corrente
Costruzione e regolazione dei modelli
definizione di uno o piu’ modelli di statistical learning
regolazione in termini dei KPI (a.1, a.2, a.3, a.4)
selezione dei modelli candidati
Esecuzione della predizione con i dati di portafoglio clienti corrente
Si e’ fatto uso del linguaggio R in ambiente RStudio su Windows. Le funzionalità utilizzate sono fornite da 9 librerie R (“packages”).
Questa presentazione viene generata per mezzo della stesura di un file R markdown composto sia da codice che testo. I grafici sono generati dal codice incluso.
La compilazione del file R markdown produce un file HTML che puo’ essere visualizzato in modalità di presentazione a slide.
## packages version
## 1 dplyr 0.7.8
## 2 ggplot2 3.1.0
## 3 caret 6.0.81
## 4 gridExtra 2.3
## 5 corrplot 0.84
## 6 rms 5.1.2
## 7 ROCR 1.0.7
## 8 rpart.plot 3.0.6
## 9 lubridate 1.7.4
## 10 boot 1.3.20
df_churn <- read.delim("churn.tab", sep = "\t", stringsAsFactors = FALSE)
dim(df_churn)
## [1] 37757 2
head(df_churn)
## msno is_churn
## 1 E1vsPZT5WGaqbFaXAV2Flx5LUfuMUHW8NoYSypf7oJU= 0
## 2 EMpiEFU1Fiv2LM714kYqvJcz/xp5qeNWbRheNZyExYc= 0
## 3 Fk7Qb26rnUGm8jeenM25HnnhteiGT9xjN6jDpOqdrck= 0
## 4 CLtKi0Tlx7XzUpy/Z8l979QBURHsOmvLU821wWclWrw= 0
## 5 ykKt0hHAFeJzXg2xXUrOKlHLyQEMoyjLFAT4odSLvP0= 0
## 6 nGu4EyiVLPy96WpruK4CoeZbCxS7becR9CWQVDbtuxY= 0
length(unique(df_churn$msno))/nrow(df_churn)
## [1] 1
sum(complete.cases(df_churn))/nrow(df_churn)
## [1] 1
sum(df_churn$is_churn)/nrow(df_churn)
## [1] 0.1292211
df_churn$is_churn <- factor(df_churn$is_churn)
levels(df_churn$is_churn) <- c("CF", "CT")
df_members <- read.delim("members.tab", sep = "\t", stringsAsFactors = FALSE)
dim(df_members)
## [1] 37761 6
colnames(df_members)
## [1] "msno" "city"
## [3] "age" "gender"
## [5] "registered_via" "registration_init_time"
head(df_members)
## msno city age gender
## 1 E1vsPZT5WGaqbFaXAV2Flx5LUfuMUHW8NoYSypf7oJU= 22 43 male
## 2 EMpiEFU1Fiv2LM714kYqvJcz/xp5qeNWbRheNZyExYc= 12 27 female
## 3 Fk7Qb26rnUGm8jeenM25HnnhteiGT9xjN6jDpOqdrck= 6 46 female
## 4 CLtKi0Tlx7XzUpy/Z8l979QBURHsOmvLU821wWclWrw= 15 45 male
## 5 ykKt0hHAFeJzXg2xXUrOKlHLyQEMoyjLFAT4odSLvP0= 8 22 male
## 6 nGu4EyiVLPy96WpruK4CoeZbCxS7becR9CWQVDbtuxY= 15 44 male
## registered_via registration_init_time
## 1 7 2013/06/04
## 2 3 2013/04/06
## 3 3 2013/01/31
## 4 7 2016/03/07
## 5 3 2013/08/21
## 6 9 2007/01/13
length(unique(df_members$msno))/nrow(df_members)
## [1] 0.9998941
sum(complete.cases(df_members))/nrow(df_members)
## [1] 1
mem_dup <- duplicated(df_members$msno)
mem_dup_pos <- which(mem_dup == TRUE)
dup_list <- lapply(mem_dup_pos, function(x) {
which(df_members$msno == df_members[x,]$msno)
})
to_del <- sapply(dup_list, function(x) { x[2]})
df_members <- df_members[setdiff(1:nrow(df_members),to_del),]
dim(df_members)
## [1] 37757 6
length(unique(df_members$city))
## [1] 21
do.call(rbind, lapply(split(df_members, df_members$city), nrow))
## [,1]
## 1 1101
## 3 451
## 4 4067
## 5 6191
## 6 2239
## 7 215
## 8 654
## 9 792
## 10 603
## 11 774
## 12 1060
## 13 8471
## 14 1850
## 15 3770
## 16 92
## 17 457
## 18 682
## 19 11
## 20 62
## 21 463
## 22 3752
df_members$city <- factor(df_members$city)
unique(df_members$registered_via)
## [1] 7 3 9 4 13
df_members$registered_via <- factor(df_members$registered_via)
table(df_members$city, df_members$registered_via)
##
## 3 4 7 9 13
## 1 472 312 249 66 2
## 3 94 22 107 228 0
## 4 916 266 842 2037 6
## 5 1329 493 1255 3114 0
## 6 559 172 315 1191 2
## 7 49 8 35 123 0
## 8 158 40 143 312 1
## 9 160 43 174 413 2
## 10 140 42 149 272 0
## 11 202 59 83 429 1
## 12 219 73 210 557 1
## 13 1778 404 1241 5043 5
## 14 338 102 477 932 1
## 15 942 243 556 2025 4
## 16 22 2 12 56 0
## 17 102 23 101 230 1
## 18 133 45 132 372 0
## 19 2 0 7 2 0
## 20 6 2 18 36 0
## 21 113 30 79 241 0
## 22 847 314 561 2028 2
ggplot(data=df_members, aes(x=city)) + geom_histogram(stat="count", fill='palegreen4') + guides(fill=FALSE)
ggplot(data=df_members, aes(x=city)) + geom_histogram(stat="count", fill='palegreen4') + guides(fill=FALSE) +
facet_grid( ~ registered_via) + coord_flip()
summary(df_members$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 24.00 28.00 30.26 34.00 1421.00
do.call(rbind, lapply(split(df_members, df_members$age), nrow))
## [,1]
## 2 1
## 4 1
## 5 1
## 6 1
## 7 2
## 11 1
## 12 2
## 13 10
## 14 19
## 15 73
## 16 262
## 17 560
## 18 805
## 19 938
## 20 1325
## 21 1482
## 22 1728
## 23 1891
## 24 2039
## 25 2027
## 26 2230
## 27 2330
## 28 1933
## 29 1876
## 30 1674
## 31 1522
## 32 1340
## 33 1206
## 34 1192
## 35 1084
## 36 1013
## 37 850
## 38 763
## 39 642
## 40 527
## 41 512
## 42 421
## 43 371
## 44 312
## 45 296
## 46 292
## 47 291
## 48 241
## 49 221
## 50 178
## 51 187
## 52 174
## 53 165
## 54 105
## 55 98
## 56 78
## 57 72
## 58 54
## 59 54
## 60 39
## 61 31
## 62 23
## 63 27
## 64 20
## 65 15
## 66 13
## 67 9
## 68 11
## 69 8
## 70 5
## 71 1
## 72 4
## 73 4
## 74 5
## 75 1
## 77 2
## 78 2
## 79 2
## 80 2
## 81 1
## 83 1
## 84 3
## 87 3
## 88 1
## 90 1
## 91 3
## 92 1
## 93 1
## 95 3
## 99 2
## 100 2
## 101 1
## 102 1
## 103 3
## 104 1
## 105 1
## 106 8
## 107 2
## 112 2
## 117 6
## 126 1
## 153 1
## 948 1
## 951 1
## 1017 1
## 1031 1
## 1032 1
## 1033 1
## 1036 1
## 1037 2
## 1048 1
## 1051 1
## 1421 1
too_old <- which(df_members$age > 100)
length(too_old)
## [1] 39
if (length(too_old) > 0) {
df_members[too_old,]$age <- NA
}
too_young <- which(df_members$age < 10)
length(too_young)
## [1] 6
if (length(too_young) > 0) {
df_members[too_young,]$age <- NA
}
summary(df_members$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 11.00 24.00 28.00 29.88 34.00 100.00 45
boxplot(df_members$age, col = 'palegreen4')
ggplot(data=df_members, aes(y=age, x=city)) + geom_boxplot(fill='palegreen4') + guides(fill=FALSE)
ggplot(data=df_members, aes(y=age, x=registered_via)) + geom_boxplot(fill='palegreen4') + guides(fill=FALSE)
unique(df_members$gender)
## [1] "male" "female" ""
no_gender <- which(df_members$gender %in% c(" ", "", " "))
if (length(no_gender) > 0) df_members[no_gender,]$gender <- NA
df_members$gender <- factor(df_members$gender)
ggplot(data=subset(df_members, !is.na(gender)), aes(y=age, x=gender)) + geom_boxplot(aes(fill=gender)) + guides(fill=FALSE) + scale_fill_manual(values=c("#FF00AA", "#0000FF"))
ggplot(data=subset(df_members, !is.na(gender)), aes(x=gender)) + geom_bar(position = "dodge", aes(fill=gender)) + guides(fill=FALSE) + scale_fill_manual(values=c("#FF00AA", "#0000FF"))
ggplot(data=subset(df_members, !is.na(gender)), aes(x=gender)) + geom_bar(position = "dodge", aes(fill=gender)) + guides(fill=FALSE)+ scale_fill_manual(values=c("#FF00AA", "#0000FF")) + facet_grid( ~ registered_via)
df_trans <- read.delim("transaction.tab", sep = "\t", stringsAsFactors = FALSE)
dim(df_trans)
## [1] 46079 9
head(df_trans)
## ï..msno payment_method_id
## 1 ++/ZHqwUNa7U21Qz+zqteiXlZapxey86l6eEorrak/g= 14
## 2 ++/ZHqwUNa7U21Qz+zqteiXlZapxey86l6eEorrak/g= 14
## 3 ++6P09mCSJSh+Ft2pvZ0FWTrtcI3v1A7h3/coh8dBKw= 39
## 4 ++6P09mCSJSh+Ft2pvZ0FWTrtcI3v1A7h3/coh8dBKw= 39
## 5 ++8XaEDast796K/DQcP7sSoRQ8JRnA7X5S+oyXEw62E= 37
## 6 ++BW1PJYU5OZi3n3+IDLiU+d1IL1VE/GLx6p64TDs6U= 34
## payment_plan_days plan_list_price actual_amount_paid is_auto_renew
## 1 30 149 140.3510 1
## 2 30 149 139.7265 1
## 3 30 149 141.5193 1
## 4 30 149 136.6532 1
## 5 30 149 137.9952 1
## 6 30 149 148.2081 1
## transaction_date membership_expire_date is_cancel
## 1 2017/03/31 2017/05/04 0
## 2 2017/02/28 2017/04/04 0
## 3 2017/02/28 2017/04/13 0
## 4 2017/03/31 2017/05/13 0
## 5 2017/03/21 2017/04/20 0
## 6 2017/03/31 2017/04/30 0
colnames(df_trans)[1] <- "msno"
df_log <- read.delim("user_log.tab", sep = "\t", stringsAsFactors = FALSE)
dim(df_log)
## [1] 678793 9
head(df_log)
## msno date num_25 num_50
## 1 ++/ZHqwUNa7U21Qz+zqteiXlZapxey86l6eEorrak/g= 2017/03/25 5 4
## 2 ++/ZHqwUNa7U21Qz+zqteiXlZapxey86l6eEorrak/g= 2017/03/21 0 0
## 3 ++6P09mCSJSh+Ft2pvZ0FWTrtcI3v1A7h3/coh8dBKw= 2017/03/25 0 2
## 4 ++6P09mCSJSh+Ft2pvZ0FWTrtcI3v1A7h3/coh8dBKw= 2017/03/03 0 0
## 5 ++6P09mCSJSh+Ft2pvZ0FWTrtcI3v1A7h3/coh8dBKw= 2017/03/15 0 0
## 6 ++8XaEDast796K/DQcP7sSoRQ8JRnA7X5S+oyXEw62E= 2017/03/09 0 0
## num_75 num_985 num_100 num_unq total_secs
## 1 3 2 26 25 6815.238
## 2 1 0 14 14 3427.827
## 3 0 0 5 7 1357.419
## 4 0 0 3 3 688.298
## 5 0 0 4 4 949.230
## 6 0 0 1 1 262.000
cnt_churn <- df_churn %>% dplyr::filter(is_churn=="CT")
(customer_churn_percentage <- round(100*nrow(cnt_churn)/nrow(df_churn), 2))
## [1] 12.92
df_join <- left_join(df_churn, df_members, by="msno")
dim(df_join)
## [1] 37757 7
colnames(df_join)
## [1] "msno" "is_churn"
## [3] "city" "age"
## [5] "gender" "registered_via"
## [7] "registration_init_time"
colnames(df_trans) <- c("msno", colnames(df_trans[-1]))
trans_list <- split(df_trans, df_trans$msno)
last_transaction_idx <- function(dt) {
tr_ymd <- ymd(dt$transaction_date)
lt <- which.max(tr_ymd)
lt
}
analyse_trans_ltow <- function(dt, levels, levels_names) {
lst <- strsplit(as.character(dt$payment_method_id),"d_")
pm_sub_table <- do.call(rbind, lapply(lst, function(x)
as.logical(table(factor(x, levels=levels)))))
payment_method_wide <- data.frame(pm_sub_table, stringsAsFactors=FALSE)
colnames(payment_method_wide) <- levels_names
membership_expire_date <- max(ymd(dt$membership_expire_date))
nr <- nrow(dt)
lt <- last_transaction_idx(dt)
analysis <- data.frame(dt$msno[1],
nr,
sum(dt$actual_amount_paid),
dt[lt,]$is_auto_renew,
dt[lt,]$is_cancel,
length(unique(dt$payment_method_id)),
sum(as.numeric(dt$payment_plan_days)),
sum(as.numeric(dt$plan_list_price)),
membership_expire_date,
stringsAsFactors = FALSE)
colnames(analysis) <- c("msno", "hist_trans", "sum_paid", "is_auto_renew",
"is_cancel", "payment_method_num",
"payment_plan_days_sum", "plan_list_price_sum",
"membership_expire_date")
result_df <- as.data.frame(cbind(analysis, payment_method_wide))
result_df[1,,drop=FALSE]
}
df_trans$payment_method_id <- factor(df_trans$payment_method_id)
lvl <- levels(df_trans$payment_method_id)
level_names <- paste("payment_method_id_", lvl, sep="")
temp_list <- lapply(trans_list, function(x){ analyse_trans_ltow(x, lvl, level_names) })
transan_df <- do.call(rbind, temp_list)
rownames(transan_df) <- NULL
dim(transan_df)
## [1] 35464 41
sum(complete.cases(transan_df))
## [1] 35463
head(transan_df)
## msno hist_trans sum_paid
## 1 //4hBneqk/4/TtgL1XXQ+eKx7fJTeSvSNt0ktxjSIYE= 1 92.41383
## 2 //8WziQXa8NgLpNL75Y0DpPGxy6Em79sos20XA1l8tw= 1 138.74138
## 3 //98KXjMykxaDiasDXcTK7C3BBr+gBFI9L/2OQQixy8= 2 272.98474
## 4 //aWvY7Y+dy+OCzivW4e/ojzfGRRv4QeN3E4gfBojVI= 1 174.01733
## 5 //Bj6yyexnZbioH2eELlbqRXgOCRPlWQjBYtqi5iJK8= 1 137.85528
## 6 //fXDIz34vI4O2VFpJ8CMLgVYquz5V14oeKgwy0pBHM= 1 120.45214
## is_auto_renew is_cancel payment_method_num payment_plan_days_sum
## 1 0 0 1 30
## 2 1 1 1 30
## 3 1 0 1 60
## 4 0 0 1 30
## 5 1 0 1 30
## 6 1 0 1 30
## plan_list_price_sum membership_expire_date payment_method_id_3
## 1 100 2017-04-08 FALSE
## 2 149 2017-03-09 FALSE
## 3 298 2017-05-12 FALSE
## 4 180 2017-04-23 FALSE
## 5 149 2017-04-30 FALSE
## 6 129 2017-04-05 FALSE
## payment_method_id_8 payment_method_id_10 payment_method_id_11
## 1 FALSE FALSE FALSE
## 2 FALSE FALSE FALSE
## 3 FALSE FALSE FALSE
## 4 FALSE FALSE FALSE
## 5 FALSE FALSE FALSE
## 6 FALSE FALSE FALSE
## payment_method_id_12 payment_method_id_13 payment_method_id_14
## 1 FALSE FALSE FALSE
## 2 FALSE FALSE FALSE
## 3 FALSE FALSE FALSE
## 4 FALSE FALSE FALSE
## 5 FALSE FALSE FALSE
## 6 FALSE FALSE FALSE
## payment_method_id_15 payment_method_id_16 payment_method_id_17
## 1 FALSE FALSE FALSE
## 2 FALSE FALSE FALSE
## 3 FALSE FALSE FALSE
## 4 FALSE FALSE FALSE
## 5 FALSE FALSE FALSE
## 6 FALSE FALSE FALSE
## payment_method_id_18 payment_method_id_19 payment_method_id_20
## 1 FALSE FALSE FALSE
## 2 FALSE FALSE FALSE
## 3 FALSE FALSE FALSE
## 4 FALSE FALSE FALSE
## 5 FALSE FALSE FALSE
## 6 FALSE FALSE FALSE
## payment_method_id_21 payment_method_id_22 payment_method_id_23
## 1 FALSE FALSE FALSE
## 2 FALSE FALSE FALSE
## 3 FALSE FALSE FALSE
## 4 FALSE FALSE FALSE
## 5 FALSE FALSE FALSE
## 6 FALSE FALSE FALSE
## payment_method_id_26 payment_method_id_27 payment_method_id_28
## 1 FALSE FALSE FALSE
## 2 FALSE FALSE FALSE
## 3 FALSE FALSE FALSE
## 4 FALSE FALSE FALSE
## 5 FALSE FALSE FALSE
## 6 FALSE FALSE FALSE
## payment_method_id_29 payment_method_id_30 payment_method_id_31
## 1 FALSE FALSE FALSE
## 2 FALSE FALSE FALSE
## 3 FALSE FALSE FALSE
## 4 FALSE FALSE FALSE
## 5 FALSE FALSE TRUE
## 6 FALSE FALSE FALSE
## payment_method_id_32 payment_method_id_33 payment_method_id_34
## 1 FALSE FALSE FALSE
## 2 FALSE FALSE FALSE
## 3 FALSE FALSE FALSE
## 4 FALSE FALSE FALSE
## 5 FALSE FALSE FALSE
## 6 FALSE FALSE FALSE
## payment_method_id_35 payment_method_id_36 payment_method_id_37
## 1 FALSE FALSE FALSE
## 2 FALSE FALSE TRUE
## 3 FALSE FALSE FALSE
## 4 FALSE TRUE FALSE
## 5 FALSE FALSE FALSE
## 6 FALSE FALSE FALSE
## payment_method_id_38 payment_method_id_39 payment_method_id_40
## 1 TRUE FALSE FALSE
## 2 FALSE FALSE FALSE
## 3 FALSE TRUE FALSE
## 4 FALSE FALSE FALSE
## 5 FALSE FALSE FALSE
## 6 FALSE FALSE FALSE
## payment_method_id_41
## 1 FALSE
## 2 FALSE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 TRUE
df_join <- left_join(df_join, transan_df, by="msno")
analyse_log <- function(dt) {
total_sums <- sum(dt$num_25) +sum(dt$num_50) + sum(dt$num_75) + sum(dt$num_100) +
sum(dt$num_985) + sum(dt$num_unq)
analysis <- data.frame(dt$msno[1],
nrow(dt),
sum(dt$total_secs),
round(100*sum(dt$num_25)/total_sums,2),
round(100*sum(dt$num_50)/total_sums,2),
round(100*sum(dt$num_75)/total_sums,2),
round(100*sum(dt$num_100)/total_sums,2),
round(100*sum(dt$num_985)/total_sums,2),
round(100*sum(dt$num_unq)/total_sums,2),
stringsAsFactors = FALSE)
colnames(analysis) <- c("msno", "hist_log", "total_secs",
"num_25", "num_50", "num_75",
"num_100", "num_985", "num_unq")
analysis
}
log_list <- split(df_log, df_log$msno)
logan_df <- do.call(rbind, lapply(log_list, function(x){analyse_log(x)}))
rownames(logan_df) <- NULL
dim(logan_df)
## [1] 35317 9
df_join2 <- left_join(df_churn, logan_df, by="msno")
df_join2 <- df_join2[complete.cases(df_join2),]
numeric_vars <- names(which(sapply(logan_df, class) == "numeric"))
numeric_vars <- c(numeric_vars, names(which(sapply(logan_df, class) == "integer")))
df_join2$is_churn <- factor(df_join2$is_churn)
gp <- invisible(lapply(numeric_vars, function(x) {
ggplot(data=df_join2, aes(x = is_churn, y=log(eval(parse(text=x))))) + geom_boxplot(aes(fill = is_churn)) + xlab("is_churn") + ylab(x) + ggtitle("") + scale_fill_manual(values=c("#009900", "#CC0000")) + theme(legend.position="none")}))
grob_plots <- invisible(lapply(chunk(1, length(gp), 4), function(x) {
marrangeGrob(grobs=lapply(gp[x], ggplotGrob), nrow=2, ncol=2)}))
grob_plots$chunk1
grob_plots$chunk2
gp <- invisible(lapply(numeric_vars, function(x) {
ggplot(data=df_join2, aes(x=eval(parse(text=x)))) + geom_density(aes(fill = is_churn), alpha = 0.4) + xlab(x) + scale_fill_manual(values=c("#009900", "#CC0000")) + scale_x_continuous(trans='log2') + ggtitle(paste(x, "density", sep= " "))}))
grob_plots <- invisible(lapply(chunk(1, length(gp), 4), function(x) {
marrangeGrob(grobs=lapply(gp[x], ggplotGrob), nrow=2, ncol=2)}))
grob_plots$chunk1
grob_plots$chunk2
aov_l <- lapply(numeric_vars, function(x) { summary(aov(eval(parse(text=x)) ~ is_churn, data = df_join2))})
names(aov_l) <- numeric_vars
aov_l
## $total_secs
## Df Sum Sq Mean Sq F value Pr(>F)
## is_churn 1 4.440e+12 4.440e+12 104.8 <2e-16 ***
## Residuals 35315 1.496e+15 4.237e+10
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## $num_25
## Df Sum Sq Mean Sq F value Pr(>F)
## is_churn 1 746 746.5 12.35 0.000443 ***
## Residuals 35315 2135447 60.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## $num_50
## Df Sum Sq Mean Sq F value Pr(>F)
## is_churn 1 202 202.21 22.91 1.7e-06 ***
## Residuals 35315 311641 8.82
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## $num_75
## Df Sum Sq Mean Sq F value Pr(>F)
## is_churn 1 72 71.87 21.6 3.38e-06 ***
## Residuals 35315 117520 3.33
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## $num_100
## Df Sum Sq Mean Sq F value Pr(>F)
## is_churn 1 6027 6027 36.36 1.66e-09 ***
## Residuals 35315 5853318 166
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## $num_985
## Df Sum Sq Mean Sq F value Pr(>F)
## is_churn 1 40 39.97 8.338 0.00388 **
## Residuals 35315 169297 4.79
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## $num_unq
## Df Sum Sq Mean Sq F value Pr(>F)
## is_churn 1 453 453.3 9.791 0.00175 **
## Residuals 35315 1634782 46.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## $hist_log
## Df Sum Sq Mean Sq F value Pr(>F)
## is_churn 1 40645 40645 471.3 <2e-16 ***
## Residuals 35315 3045696 86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tukey <- lapply(numeric_vars, function(x) { TukeyHSD(aov(eval(parse(text=x)) ~ is_churn, data = df_join2))})
names(tukey) <- numeric_vars
tukey
## $total_secs
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join2)
##
## $is_churn
## diff lwr upr p adj
## CT-CF -34115.04 -40647.05 -27583.02 0
##
##
## $num_25
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join2)
##
## $is_churn
## diff lwr upr p adj
## CT-CF 0.4423427 0.1955868 0.6890987 0.0004423
##
##
## $num_50
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join2)
##
## $is_churn
## diff lwr upr p adj
## CT-CF 0.2302275 0.1359625 0.3244926 1.7e-06
##
##
## $num_75
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join2)
##
## $is_churn
## diff lwr upr p adj
## CT-CF 0.1372562 0.0793696 0.1951429 3.4e-06
##
##
## $num_100
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join2)
##
## $is_churn
## diff lwr upr p adj
## CT-CF -1.256892 -1.665423 -0.848362 0
##
##
## $num_985
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join2)
##
## $is_churn
## diff lwr upr p adj
## CT-CF 0.1023596 0.03288145 0.1718377 0.0038826
##
##
## $num_unq
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join2)
##
## $is_churn
## diff lwr upr p adj
## CT-CF 0.3446871 0.1287869 0.5605874 0.0017534
##
##
## $hist_log
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join2)
##
## $is_churn
## diff lwr upr p adj
## CT-CF -3.264055 -3.558746 -2.969365 0
set.seed(1023)
trControl <- trainControl(method = "repeatedcv", number = 10, repeats = 5, verboseIter = FALSE, sampling = "down")
fit <- train(is_churn ~ total_secs, data = df_join2, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join2, type = "raw")
confusionMatrix(factor(predict_res), df_join2$is_churn)$overall[2]
## Kappa
## 0.02688027
fit <- train(is_churn ~ num_25, data = df_join2, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join2, type = "raw")
confusionMatrix(factor(predict_res), df_join2$is_churn)$overall[2]
## Kappa
## 0.007623977
fit <- train(is_churn ~ num_50, data = df_join2, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join2, type = "raw")
confusionMatrix(factor(predict_res), df_join2$is_churn)$overall[2]
## Kappa
## 0.01327058
fit <- train(is_churn ~ num_75, data = df_join2, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join2, type = "raw")
confusionMatrix(factor(predict_res), df_join2$is_churn)$overall[2]
## Kappa
## 0.009044729
fit <- train(is_churn ~ num_100, data = df_join2, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join2, type = "raw")
confusionMatrix(factor(predict_res), df_join2$is_churn)$overall[2]
## Kappa
## 0.02014215
fit <- train(is_churn ~ num_985, data = df_join2, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join2, type = "raw")
confusionMatrix(factor(predict_res), df_join2$is_churn)$overall[2]
## Kappa
## 0.004943829
fit <- train(is_churn ~ num_unq, data = df_join2, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join2, type = "raw")
confusionMatrix(factor(predict_res), df_join2$is_churn)$overall[2]
## Kappa
## 0.005372912
fit <- train(is_churn ~ hist_log, data = df_join2, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join2, type = "raw")
confusionMatrix(factor(predict_res), df_join2$is_churn)$overall[2]
## Kappa
## 0.07486046
df_join <- left_join(df_join, logan_df[, c("msno", "hist_log")], by="msno")
df_join$is_cancel <- as.logical(df_join$is_cancel)
df_join$is_auto_renew <- as.logical(df_join$is_auto_renew)
df_join$payment_method_num <- factor(df_join$payment_method_num)
sum(complete.cases(df_join))/nrow(df_join)
## [1] 0.8819557
df_join <- df_join[complete.cases(df_join),]
df_join$membership_duration <- as.numeric(ymd(df_join$membership_expire_date) - ymd(df_join$registration_init_time))
df_join$registration_init_time <- NULL
df_join$membership_expire_date <- NULL
dim(df_join)
## [1] 33300 47
str(df_join)
## 'data.frame': 33300 obs. of 47 variables:
## $ msno : chr "E1vsPZT5WGaqbFaXAV2Flx5LUfuMUHW8NoYSypf7oJU=" "EMpiEFU1Fiv2LM714kYqvJcz/xp5qeNWbRheNZyExYc=" "Fk7Qb26rnUGm8jeenM25HnnhteiGT9xjN6jDpOqdrck=" "CLtKi0Tlx7XzUpy/Z8l979QBURHsOmvLU821wWclWrw=" ...
## $ is_churn : Factor w/ 2 levels "CF","CT": 1 1 1 1 1 2 1 1 1 1 ...
## $ city : Factor w/ 21 levels "1","3","4","5",..: 21 11 5 14 7 17 4 14 3 4 ...
## $ age : int 43 27 46 45 22 34 25 28 31 19 ...
## $ gender : Factor w/ 2 levels "female","male": 2 1 1 2 2 1 1 1 2 2 ...
## $ registered_via : Factor w/ 5 levels "3","4","7","9",..: 3 1 1 3 1 4 1 4 4 4 ...
## $ hist_trans : int 1 1 1 1 1 1 1 1 1 1 ...
## $ sum_paid : num 145.6 171.7 144.7 95.3 146.2 ...
## $ is_auto_renew : logi TRUE TRUE FALSE TRUE FALSE TRUE ...
## $ is_cancel : logi FALSE FALSE FALSE FALSE FALSE TRUE ...
## $ payment_method_num : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ payment_plan_days_sum: num 30 30 30 30 30 30 30 30 30 30 ...
## $ plan_list_price_sum : num 149 180 149 99 149 180 180 149 149 149 ...
## $ payment_method_id_3 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_8 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_10 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_11 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_12 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_13 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_14 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_15 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_16 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_17 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_18 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_19 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_20 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_21 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_22 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_23 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_26 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_27 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_28 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_29 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_30 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_31 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_32 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_33 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_34 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_35 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_36 : logi FALSE TRUE FALSE FALSE FALSE TRUE ...
## $ payment_method_id_37 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_38 : logi FALSE FALSE TRUE FALSE TRUE FALSE ...
## $ payment_method_id_39 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_40 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ payment_method_id_41 : logi TRUE FALSE FALSE TRUE FALSE FALSE ...
## $ hist_log : int 4 31 30 5 29 14 26 21 16 30 ...
## $ membership_duration : num 1401 1468 1537 396 1329 ...
ggplot(data=df_join, aes(y=age, x=is_churn)) + geom_boxplot(aes(fill=is_churn)) + guides(fill=FALSE) + scale_fill_manual(values=c("#009900", "#CC0000"))
ggplot(data=df_join, aes(y=sum_paid, x=is_churn)) + geom_boxplot(aes(fill=is_churn)) + guides(fill=FALSE) + scale_fill_manual(values=c("#009900", "#CC0000"))
ggplot(data=df_join, aes(y=hist_log, x=is_churn)) + geom_boxplot(aes(fill=is_churn)) + guides(fill=FALSE) + scale_fill_manual(values=c("#009900", "#CC0000"))
ggplot(data=df_join, aes(y=membership_duration, x=is_churn)) + geom_boxplot(aes(fill=is_churn)) + guides(fill=FALSE) + scale_fill_manual(values=c("#009900", "#CC0000"))
ggplot(data=df_join, aes(y=plan_list_price_sum, x=is_churn)) + geom_boxplot(aes(fill=is_churn)) + guides(fill=FALSE) + scale_fill_manual(values=c("#009900", "#CC0000"))
ggplot(data=df_join, aes(y=membership_duration, x=city)) + geom_boxplot(aes(fill=is_churn)) + guides(fill=FALSE) + facet_wrap ( ~is_churn) + coord_flip() + scale_fill_manual(values=c("#009900", "#CC0000"))
ggplot(data=df_join, aes(y=membership_duration, x=cut(df_join$age, breaks=c(0,20,40,60,80,100)))) + geom_boxplot(aes(fill=is_churn)) + xlab("age") + guides(fill=FALSE) + facet_wrap ( ~is_churn) + coord_flip() + scale_fill_manual(values=c("#009900", "#CC0000"))
ggplot(data=df_join, aes(y=age, x=city)) + geom_boxplot(aes(fill=is_churn)) + guides(fill=FALSE) + facet_wrap ( ~is_churn) + coord_flip() + scale_fill_manual(values=c("#009900", "#CC0000"))
ggplot(data=df_join, aes(y=payment_plan_days_sum, x=city)) + geom_boxplot(aes(fill=is_churn)) + guides(fill=FALSE) + facet_wrap ( ~ is_churn) + coord_flip() + scale_fill_manual(values=c("#009900", "#CC0000"))
ggplot(data=df_join, aes(y=age, x=gender)) + geom_boxplot(aes(fill=is_churn)) + guides(fill=FALSE) + facet_wrap ( ~ is_churn) + coord_flip() + scale_fill_manual(values=c("#009900", "#CC0000"))
ggplot(data=df_join, aes(y=membership_duration, x=registered_via)) + geom_boxplot(aes(fill=is_churn)) + guides(fill=FALSE) + facet_wrap ( ~is_churn) + coord_flip() + scale_fill_manual(values=c("#009900", "#CC0000"))
ggplot(data=df_join, aes(x=city, fill=is_churn)) + geom_bar(position = 'dodge2') + scale_fill_manual(values=c("#009900", "#CC0000"))
ggplot(data=df_join, aes(x=cut(age, breaks = c(0,20,30,40,50,100)), fill=is_churn)) + geom_bar(position = 'dodge2') + xlab("age") + scale_fill_manual(values=c("#009900", "#CC0000"))
ggplot(data=df_join, aes(x=registered_via, fill=is_churn)) + geom_bar(position = 'dodge2') + xlab("registered via") + scale_fill_manual(values=c("#009900", "#CC0000"))
ggplot(data=df_join, aes(x=gender, fill=is_churn)) + geom_bar(position = 'dodge2') + xlab("gender") + scale_fill_manual(values=c("#009900", "#CC0000"))
ggplot(data=df_join, aes(x=is_cancel, fill=is_churn)) + geom_bar(position = 'dodge2') + xlab("is cancel") + scale_fill_manual(values=c("#009900", "#CC0000"))
table(df_join$is_cancel, df_join$is_churn)
##
## CF CT
## FALSE 30202 2412
## TRUE 193 493
ggplot(data=df_join, aes(x=is_auto_renew, fill=is_churn)) + geom_bar(position = 'dodge2') + xlab("auto renew") + scale_fill_manual(values=c("#009900", "#CC0000"))
ggplot(data=df_join, aes(x=cut(membership_duration, breaks=seq(1, 5475, by=365)), fill=is_churn)) + geom_bar(position = 'dodge2') + xlab("years of subscription") + coord_flip() + scale_x_discrete(labels=as.character(1:15)) + scale_fill_manual(values=c("#009900", "#CC0000"))
factor_vars <- names(which(sapply(df_join, class) == "factor"))
factor_vars
## [1] "is_churn" "city" "gender"
## [4] "registered_via" "payment_method_num"
character_vars <- names(which(sapply(df_join, class) == "character"))
character_vars
## [1] "msno"
logical_vars <- names(which(sapply(df_join, class) == "logical"))
logical_vars
## [1] "is_auto_renew" "is_cancel" "payment_method_id_3"
## [4] "payment_method_id_8" "payment_method_id_10" "payment_method_id_11"
## [7] "payment_method_id_12" "payment_method_id_13" "payment_method_id_14"
## [10] "payment_method_id_15" "payment_method_id_16" "payment_method_id_17"
## [13] "payment_method_id_18" "payment_method_id_19" "payment_method_id_20"
## [16] "payment_method_id_21" "payment_method_id_22" "payment_method_id_23"
## [19] "payment_method_id_26" "payment_method_id_27" "payment_method_id_28"
## [22] "payment_method_id_29" "payment_method_id_30" "payment_method_id_31"
## [25] "payment_method_id_32" "payment_method_id_33" "payment_method_id_34"
## [28] "payment_method_id_35" "payment_method_id_36" "payment_method_id_37"
## [31] "payment_method_id_38" "payment_method_id_39" "payment_method_id_40"
## [34] "payment_method_id_41"
numeric_vars <- setdiff(colnames(df_join), factor_vars)
numeric_vars <- setdiff(numeric_vars, character_vars)
numeric_vars <- setdiff(numeric_vars, logical_vars)
numeric_vars <- setdiff(numeric_vars, "is_churn")
numeric_vars
## [1] "age" "hist_trans" "sum_paid"
## [4] "payment_plan_days_sum" "plan_list_price_sum" "hist_log"
## [7] "membership_duration"
numeric_vars_mat <- as.matrix(df_join[, numeric_vars, drop=FALSE])
is_churn_num <- as.numeric(df_join$is_churn) - 1
summary_features <- colnames(df_join)
summary_features <- setdiff(summary_features, c("msno", "is_churn", "registration_init_time", "membership_expire_date"))
summary_string <- paste(summary_features, collapse = "+")
frm_sum <- as.formula(paste("is_churn_num ~ ", summary_string))
frm_sum
## is_churn_num ~ city + age + gender + registered_via + hist_trans +
## sum_paid + is_auto_renew + is_cancel + payment_method_num +
## payment_plan_days_sum + plan_list_price_sum + payment_method_id_3 +
## payment_method_id_8 + payment_method_id_10 + payment_method_id_11 +
## payment_method_id_12 + payment_method_id_13 + payment_method_id_14 +
## payment_method_id_15 + payment_method_id_16 + payment_method_id_17 +
## payment_method_id_18 + payment_method_id_19 + payment_method_id_20 +
## payment_method_id_21 + payment_method_id_22 + payment_method_id_23 +
## payment_method_id_26 + payment_method_id_27 + payment_method_id_28 +
## payment_method_id_29 + payment_method_id_30 + payment_method_id_31 +
## payment_method_id_32 + payment_method_id_33 + payment_method_id_34 +
## payment_method_id_35 + payment_method_id_36 + payment_method_id_37 +
## payment_method_id_38 + payment_method_id_39 + payment_method_id_40 +
## payment_method_id_41 + hist_log + membership_duration
(sum_res <- summary(frm_sum, data = df_join))
## is_churn_num N= 33300
##
## +---------------------+-----------+-----+------------+
## | | |N |is_churn_num|
## +---------------------+-----------+-----+------------+
## |city |1 | 889|0.10911136 |
## | |3 | 386|0.07512953 |
## | |4 | 3588|0.08584169 |
## | |5 | 5421|0.08836008 |
## | |6 | 1971|0.08878742 |
## | |7 | 187|0.10695187 |
## | |8 | 574|0.06794425 |
## | |9 | 699|0.09442060 |
## | |10 | 526|0.07414449 |
## | |11 | 686|0.08309038 |
## | |12 | 942|0.09554140 |
## | |13 | 7592|0.08667018 |
## | |14 | 1648|0.07949029 |
## | |15 | 3341|0.09398384 |
## | |16 | 82|0.07317073 |
## | |17 | 393|0.06870229 |
## | |18 | 598|0.08361204 |
## | |19 | 10|0.00000000 |
## | |20 | 56|0.07142857 |
## | |21 | 405|0.08148148 |
## | |22 | 3306|0.08560194 |
## +---------------------+-----------+-----+------------+
## |age |[12, 25) | 9720|0.12016461 |
## | |[25, 29) | 7757|0.07580250 |
## | |[29, 35) | 7883|0.07446404 |
## | |[35,100] | 7940|0.07078086 |
## +---------------------+-----------+-----+------------+
## |gender |female |15875|0.08692913 |
## | |male |17425|0.08751793 |
## +---------------------+-----------+-----+------------+
## |registered_via |3 | 7403|0.11144131 |
## | |4 | 2220|0.14369369 |
## | |7 | 6255|0.05307754 |
## | |9 |17395|0.08203507 |
## | |13 | 27|0.07407407 |
## +---------------------+-----------+-----+------------+
## |hist_trans |1 |25908|0.07337502 |
## | |[2,208] | 7392|0.13582251 |
## +---------------------+-----------+-----+------------+
## |sum_paid |[ 0, 138) | 8325|0.02894895 |
## | |[138, 146) | 8325|0.03291291 |
## | |[146, 181) | 8325|0.04828829 |
## | |[181,6468] | 8325|0.23879880 |
## +---------------------+-----------+-----+------------+
## |is_auto_renew |No | 5489|0.31881946 |
## | |Yes |27811|0.04153033 |
## +---------------------+-----------+-----+------------+
## |is_cancel |No |32614|0.07395597 |
## | |Yes | 686|0.71865889 |
## +---------------------+-----------+-----+------------+
## |payment_method_num |1 |32844|0.08111071 |
## | |2 | 435|0.51724138 |
## | |3 | 21|0.76190476 |
## +---------------------+-----------+-----+------------+
## |payment_plan_days_sum|[ 1, 31) |24871|0.03490008 |
## | |[31, 61) | 6482|0.06155508 |
## | |[61,1456] | 1947|0.84129430 |
## +---------------------+-----------+-----+------------+
## |plan_list_price_sum |[ 0, 150) |20327|0.03099326 |
## | |[150, 199) | 4648|0.06454389 |
## | |[199,6456] | 8325|0.23723724 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_3 |No |33299|0.08720983 |
## | |Yes | 1|1.00000000 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_8 |No |33297|0.08718503 |
## | |Yes | 3|0.66666667 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_10 |FALSE |33300|0.08723724 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_11 |No |33295|0.08725034 |
## | |Yes | 5|0.00000000 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_12 |No |33282|0.08680368 |
## | |Yes | 18|0.88888889 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_13 |No |33276|0.08657892 |
## | |Yes | 24|1.00000000 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_14 |No |33258|0.08734741 |
## | |Yes | 42|0.00000000 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_15 |No |33197|0.08470645 |
## | |Yes | 103|0.90291262 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_16 |No |33227|0.08697746 |
## | |Yes | 73|0.20547945 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_17 |No |33237|0.08577790 |
## | |Yes | 63|0.85714286 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_18 |No |33282|0.08725437 |
## | |Yes | 18|0.05555556 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_19 |No |33203|0.08746198 |
## | |Yes | 97|0.01030928 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_20 |No |33247|0.08578218 |
## | |Yes | 53|1.00000000 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_21 |No |33239|0.08730708 |
## | |Yes | 61|0.04918033 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_22 |No |33254|0.08597462 |
## | |Yes | 46|1.00000000 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_23 |No |33173|0.08745064 |
## | |Yes | 127|0.03149606 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_26 |No |33289|0.08699570 |
## | |Yes | 11|0.81818182 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_27 |No |33133|0.08752603 |
## | |Yes | 167|0.02994012 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_28 |No |33195|0.08685043 |
## | |Yes | 105|0.20952381 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_29 |No |32342|0.08839899 |
## | |Yes | 958|0.04801670 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_30 |No |32810|0.08747333 |
## | |Yes | 490|0.07142857 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_31 |No |32440|0.08911837 |
## | |Yes | 860|0.01627907 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_32 |No |32709|0.07117307 |
## | |Yes | 591|0.97631134 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_33 |No |32260|0.08905766 |
## | |Yes | 1040|0.03076923 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_34 |No |31188|0.09208670 |
## | |Yes | 2112|0.01562500 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_35 |No |33274|0.08652401 |
## | |Yes | 26|1.00000000 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_36 |No |29179|0.08831694 |
## | |Yes | 4121|0.07959233 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_37 |No |30296|0.09400581 |
## | |Yes | 3004|0.01897470 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_38 |No |30543|0.07487804 |
## | |Yes | 2757|0.22415669 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_39 |No |28184|0.09395402 |
## | |Yes | 5116|0.05023456 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_40 |No |27749|0.09416556 |
## | |Yes | 5551|0.05260313 |
## +---------------------+-----------+-----+------------+
## |payment_method_id_41 |No |27643|0.09637159 |
## | |Yes | 5657|0.04260209 |
## +---------------------+-----------+-----+------------+
## |hist_log |[ 1,13) | 8484|0.09028760 |
## | |[13,23) | 9096|0.08256376 |
## | |[23,29) | 8242|0.08517350 |
## | |[29,31] | 7478|0.09173576 |
## +---------------------+-----------+-----+------------+
## |membership_duration |[ 4, 857)| 8327|0.09667347 |
## | |[ 857,1691)| 8325|0.09297297 |
## | |[1691,2794)| 8324|0.08637674 |
## | |[2794,5393]| 8324|0.07292167 |
## +---------------------+-----------+-----+------------+
## |Overall | |33300|0.08723724 |
## +---------------------+-----------+-----+------------+
plot(summary(is_churn_num ~ city + age + gender, data = df_join))
plot(summary(is_churn_num ~ is_auto_renew, data = df_join))
(tbl_df_join <- table(df_join$is_churn))
##
## CF CT
## 30395 2905
tbl_df_join[2]/sum(tbl_df_join)
## CT
## 0.08723724
gp <- invisible(lapply(numeric_vars, function(x) {
ggplot(data=df_join, aes(x = is_churn, y=log(eval(parse(text=x))))) + geom_boxplot(aes(fill = is_churn)) + xlab("is churn") + ylab(x) + scale_fill_manual(values=c("#009900", "#CC0000")) + ggtitle("") + theme(legend.position="none")}))
grob_plots <- invisible(lapply(chunk(1, length(gp), 4), function(x) {
marrangeGrob(grobs=lapply(gp[x], ggplotGrob), nrow=2, ncol=2)}))
grob_plots$chunk1
grob_plots$chunk2
gp <- invisible(lapply(numeric_vars, function(x) {
ggplot(data=df_join, aes(x=eval(parse(text=x)))) + geom_density(aes(fill = is_churn), alpha = 0.4) + xlab(x) + scale_fill_manual(values=c("#009900", "#CC0000")) + scale_x_continuous(trans='log2') + ggtitle(paste(x, "density", sep= " "))}))
grob_plots <- invisible(lapply(chunk(1, length(gp), 4), function(x) {
marrangeGrob(grobs=lapply(gp[x], ggplotGrob), nrow=2, ncol=2)}))
grob_plots$chunk1
grob_plots$chunk2
aov_l <- lapply(numeric_vars, function(x) { summary(aov(eval(parse(text=x)) ~ is_churn, data = df_join))})
names(aov_l) <- numeric_vars
aov_l
## $age
## Df Sum Sq Mean Sq F value Pr(>F)
## is_churn 1 9210 9210 122.7 <2e-16 ***
## Residuals 33298 2499077 75
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## $hist_trans
## Df Sum Sq Mean Sq F value Pr(>F)
## is_churn 1 508 508.1 226.5 <2e-16 ***
## Residuals 33298 74708 2.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## $sum_paid
## Df Sum Sq Mean Sq F value Pr(>F)
## is_churn 1 4.916e+08 491640517 10374 <2e-16 ***
## Residuals 33298 1.578e+09 47390
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## $payment_plan_days_sum
## Df Sum Sq Mean Sq F value Pr(>F)
## is_churn 1 37440131 37440131 12431 <2e-16 ***
## Residuals 33298 100285816 3012
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## $plan_list_price_sum
## Df Sum Sq Mean Sq F value Pr(>F)
## is_churn 1 5.543e+08 554274226 10705 <2e-16 ***
## Residuals 33298 1.724e+09 51777
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## $hist_log
## Df Sum Sq Mean Sq F value Pr(>F)
## is_churn 1 19 18.55 0.219 0.639
## Residuals 33298 2815301 84.55
##
## $membership_duration
## Df Sum Sq Mean Sq F value Pr(>F)
## is_churn 1 4.177e+07 41768610 26.79 2.29e-07 ***
## Residuals 33298 5.192e+10 1559357
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tukey <- lapply(numeric_vars, function(x) { TukeyHSD(aov(eval(parse(text=x)) ~ is_churn, data = df_join))})
names(tukey) <- numeric_vars
tukey
## $age
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join)
##
## $is_churn
## diff lwr upr p adj
## CT-CF -1.86369 -2.193434 -1.533946 0
##
##
## $hist_trans
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join)
##
## $is_churn
## diff lwr upr p adj
## CT-CF 0.437752 0.3807394 0.4947645 0
##
##
## $sum_paid
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join)
##
## $is_churn
## diff lwr upr p adj
## CT-CF 430.5978 422.3119 438.8837 0
##
##
## $payment_plan_days_sum
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join)
##
## $is_churn
## diff lwr upr p adj
## CT-CF 118.8274 116.7386 120.9163 0
##
##
## $plan_list_price_sum
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join)
##
## $is_churn
## diff lwr upr p adj
## CT-CF 457.2043 448.5434 465.8653 0
##
##
## $hist_log
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join)
##
## $is_churn
## diff lwr upr p adj
## CT-CF -0.08364616 -0.4336312 0.2663388 0.6394771
##
##
## $membership_duration
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = eval(parse(text = x)) ~ is_churn, data = df_join)
##
## $is_churn
## diff lwr upr p adj
## CT-CF -125.5085 -173.0386 -77.97835 2e-07
set.seed(1023)
trControl <- trainControl(method = "repeatedcv", number = 10, repeats = 5, verboseIter = FALSE, sampling = "down")
fit <- train(is_churn ~ age, data = df_join, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join, type = "raw")
confusionMatrix(factor(predict_res), df_join$is_churn)$overall[2]
## Kappa
## 0.02200541
fit <- train(is_churn ~ sum_paid, data = df_join, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join, type = "raw")
confusionMatrix(factor(predict_res), df_join$is_churn)$overall[2]
## Kappa
## 0.6256314
fit <- train(is_churn ~ hist_trans, data = df_join, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join, type = "raw")
confusionMatrix(factor(predict_res), df_join$is_churn)$overall[2]
## Kappa
## 0.07974486
fit <- train(is_churn ~ payment_plan_days_sum, data = df_join, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join, type = "raw")
confusionMatrix(factor(predict_res), df_join$is_churn)$overall[2]
## Kappa
## 0.5489028
fit <- train(is_churn ~ hist_log, data = df_join, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join, type = "raw")
confusionMatrix(factor(predict_res), df_join$is_churn)$overall[2]
## Kappa
## 0.003055255
fit <- train(is_churn ~ plan_list_price_sum, data = df_join, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join, type = "raw")
confusionMatrix(factor(predict_res), df_join$is_churn)$overall[2]
## Kappa
## 0.6252373
fit <- train(is_churn ~ membership_duration, data = df_join, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join, type = "raw")
confusionMatrix(factor(predict_res), df_join$is_churn)$overall[2]
## Kappa
## 0.0144657
numeric_vars_cor <- cor(numeric_vars_mat)
corrplot(numeric_vars_cor, type = "lower", sig.level = 0.7, diag=FALSE)
fit <- train(is_churn ~ plan_list_price_sum, data = df_join, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join, type = "raw")
confusionMatrix(factor(predict_res), df_join$is_churn)$overall[2]
## Kappa
## 0.6252373
fit <- train(is_churn ~ payment_plan_days_sum, data = df_join, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join, type = "raw")
confusionMatrix(factor(predict_res), df_join$is_churn)$overall[2]
## Kappa
## 0.5489028
fit <- train(is_churn ~ sum_paid, data = df_join, method = "lda", trControl = trControl, metric = 'Accuracy', tuneLength = 10)
predict_res <- predict(fit, df_join, type = "raw")
confusionMatrix(factor(predict_res), df_join$is_churn)$overall[2]
## Kappa
## 0.6256314
factor_vars
## [1] "is_churn" "city" "gender"
## [4] "registered_via" "payment_method_num"
factor_vars_red <- setdiff(factor_vars, "is_churn")
chisq_test_res <- lapply(factor_vars_red, function(x) {
chisq.test(df_join[,x], df_join[, "is_churn"], simulate.p.value = TRUE)
})
names(chisq_test_res) <- factor_vars_red
chisq_test_res$city
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: df_join[, x] and df_join[, "is_churn"]
## X-squared = 19.023, df = NA, p-value = 0.5192
chisq_test_res$gender
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: df_join[, x] and df_join[, "is_churn"]
## X-squared = 0.036167, df = NA, p-value = 0.8571
chisq_test_res$registered_via
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: df_join[, x] and df_join[, "is_churn"]
## X-squared = 240.96, df = NA, p-value = 0.0004998
tb <- table(df_join$is_auto_renew, df_join$is_churn)
chisq.test(tb, simulate.p.value = TRUE)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: tb
## X-squared = 4426.6, df = NA, p-value = 0.0004998
tb <- table(df_join$is_cancel, df_join$is_churn)
chisq.test(tb, simulate.p.value = TRUE)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: tb
## X-squared = 3507.1, df = NA, p-value = 0.0004998
chisq_test_res$payment_method_num
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: df_join[, x] and df_join[, "is_churn"]
## X-squared = 1145.6, df = NA, p-value = 0.0004998
payment_method_id_set <- colnames(df_join)[grep("payment_method_id", x=colnames(df_join))]
(table_pay <- sapply(payment_method_id_set, function(x) { table(df_join[,x], df_join$is_churn)}))
## $payment_method_id_3
##
## CF CT
## FALSE 30395 2904
## TRUE 0 1
##
## $payment_method_id_8
##
## CF CT
## FALSE 30394 2903
## TRUE 1 2
##
## $payment_method_id_10
##
## CF CT
## FALSE 30395 2905
##
## $payment_method_id_11
##
## CF CT
## FALSE 30390 2905
## TRUE 5 0
##
## $payment_method_id_12
##
## CF CT
## FALSE 30393 2889
## TRUE 2 16
##
## $payment_method_id_13
##
## CF CT
## FALSE 30395 2881
## TRUE 0 24
##
## $payment_method_id_14
##
## CF CT
## FALSE 30353 2905
## TRUE 42 0
##
## $payment_method_id_15
##
## CF CT
## FALSE 30385 2812
## TRUE 10 93
##
## $payment_method_id_16
##
## CF CT
## FALSE 30337 2890
## TRUE 58 15
##
## $payment_method_id_17
##
## CF CT
## FALSE 30386 2851
## TRUE 9 54
##
## $payment_method_id_18
##
## CF CT
## FALSE 30378 2904
## TRUE 17 1
##
## $payment_method_id_19
##
## CF CT
## FALSE 30299 2904
## TRUE 96 1
##
## $payment_method_id_20
##
## CF CT
## FALSE 30395 2852
## TRUE 0 53
##
## $payment_method_id_21
##
## CF CT
## FALSE 30337 2902
## TRUE 58 3
##
## $payment_method_id_22
##
## CF CT
## FALSE 30395 2859
## TRUE 0 46
##
## $payment_method_id_23
##
## CF CT
## FALSE 30272 2901
## TRUE 123 4
##
## $payment_method_id_26
##
## CF CT
## FALSE 30393 2896
## TRUE 2 9
##
## $payment_method_id_27
##
## CF CT
## FALSE 30233 2900
## TRUE 162 5
##
## $payment_method_id_28
##
## CF CT
## FALSE 30312 2883
## TRUE 83 22
##
## $payment_method_id_29
##
## CF CT
## FALSE 29483 2859
## TRUE 912 46
##
## $payment_method_id_30
##
## CF CT
## FALSE 29940 2870
## TRUE 455 35
##
## $payment_method_id_31
##
## CF CT
## FALSE 29549 2891
## TRUE 846 14
##
## $payment_method_id_32
##
## CF CT
## FALSE 30381 2328
## TRUE 14 577
##
## $payment_method_id_33
##
## CF CT
## FALSE 29387 2873
## TRUE 1008 32
##
## $payment_method_id_34
##
## CF CT
## FALSE 28316 2872
## TRUE 2079 33
##
## $payment_method_id_35
##
## CF CT
## FALSE 30395 2879
## TRUE 0 26
##
## $payment_method_id_36
##
## CF CT
## FALSE 26602 2577
## TRUE 3793 328
##
## $payment_method_id_37
##
## CF CT
## FALSE 27448 2848
## TRUE 2947 57
##
## $payment_method_id_38
##
## CF CT
## FALSE 28256 2287
## TRUE 2139 618
##
## $payment_method_id_39
##
## CF CT
## FALSE 25536 2648
## TRUE 4859 257
##
## $payment_method_id_40
##
## CF CT
## FALSE 25136 2613
## TRUE 5259 292
##
## $payment_method_id_41
##
## CF CT
## FALSE 24979 2664
## TRUE 5416 241
lapply(table_pay, function(x) {chisq.test(x, simulate.p.value = TRUE)})
## $payment_method_id_3
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 10.463, df = NA, p-value = 0.09495
##
##
## $payment_method_id_8
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 12.65, df = NA, p-value = 0.02249
##
##
## $payment_method_id_10
##
## Chi-squared test for given probabilities with simulated p-value
## (based on 2000 replicates)
##
## data: x
## X-squared = 22694, df = NA, p-value = 0.0004998
##
##
## $payment_method_id_11
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 0.47795, df = NA, p-value = 1
##
##
## $payment_method_id_12
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 145.35, df = NA, p-value = 0.0004998
##
##
## $payment_method_id_13
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 251.29, df = NA, p-value = 0.0004998
##
##
## $payment_method_id_14
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 4.0192, df = NA, p-value = 0.05047
##
##
## $payment_method_id_15
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 863.29, df = NA, p-value = 0.0004998
##
##
## $payment_method_id_16
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 12.846, df = NA, p-value = 0.002999
##
##
## $payment_method_id_17
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 469.87, df = NA, p-value = 0.0004998
##
##
## $payment_method_id_18
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 0.22702, df = NA, p-value = 0.7406
##
##
## $payment_method_id_19
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 7.2301, df = NA, p-value = 0.01049
##
##
## $payment_method_id_20
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 555.42, df = NA, p-value = 0.0004998
##
##
## $payment_method_id_21
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 1.1116, df = NA, p-value = 0.3878
##
##
## $payment_method_id_22
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 481.96, df = NA, p-value = 0.0004998
##
##
## $payment_method_id_23
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 4.9746, df = NA, p-value = 0.02499
##
##
## $payment_method_id_26
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 73.832, df = NA, p-value = 0.0004998
##
##
## $payment_method_id_27
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 6.92, df = NA, p-value = 0.01149
##
##
## $payment_method_id_28
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 19.781, df = NA, p-value = 0.0004998
##
##
## $payment_method_id_29
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 19.055, df = NA, p-value = 0.0004998
##
##
## $payment_method_id_30
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 1.5609, df = NA, p-value = 0.2299
##
##
## $payment_method_id_31
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 55.822, df = NA, p-value = 0.0004998
##
##
## $payment_method_id_32
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 5972.8, df = NA, p-value = 0.0004998
##
##
## $payment_method_id_33
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 42.989, df = NA, p-value = 0.0004998
##
##
## $payment_method_id_34
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 145.23, df = NA, p-value = 0.0004998
##
##
## $payment_method_id_35
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 272.25, df = NA, p-value = 0.0004998
##
##
## $payment_method_id_36
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 3.4519, df = NA, p-value = 0.05947
##
##
## $payment_method_id_37
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 193.23, df = NA, p-value = 0.0004998
##
##
## $payment_method_id_38
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 707.68, df = NA, p-value = 0.0004998
##
##
## $payment_method_id_39
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 103.94, df = NA, p-value = 0.0004998
##
##
## $payment_method_id_40
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 100.35, df = NA, p-value = 0.0004998
##
##
## $payment_method_id_41
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: x
## X-squared = 170.51, df = NA, p-value = 0.0004998
C50 Rules
C5.0 evoluzione dei precedenti algoritmi di decision tree (C4.5) e costruisce un albero decisionale sulla base della minimizzazione di un costo associato allo diramazione dal nodo corrente sulla base della scelta di un predittore.
Il normalized information gain (differenza di valore di entropia) e’ la metrica usata per scegliere l’attribute che ne fornisce il valore piu’ alto ai fini della decisione.
C5.0 “Rules based” puo’ creare un albero decisionale che poi decompone in un set di regole mutuamente esclusive, sulle quali si puo’ fare “pruning” e determinare un insieme piu’ piccolo di regole e che potentialmente si sovrappongono
RPART tree
Basato sull’algoritmo CART che costruisce un albero decisionale per partizionamento ricorsivo. La metrica utilizzata per la scelta del predittore da associare al nodo di split e’ il Gini Index.
Tra C5.0 e CART ci sono differenze nel modo in cui implementano il pruning e gestiscono i missing data.
relevant_features <- colnames(df_join)
relevant_features <- setdiff(relevant_features,
c("msno", "is_churn", "age", "city", "registration_init_time",
"gender", "sum_paid", "payment_plan_days_sum", "membership_duration"))
feat_sum <- paste(relevant_features, collapse = "+")
frm <- as.formula(paste("is_churn ~ ", feat_sum))
frm
## is_churn ~ registered_via + hist_trans + is_auto_renew + is_cancel +
## payment_method_num + plan_list_price_sum + payment_method_id_3 +
## payment_method_id_8 + payment_method_id_10 + payment_method_id_11 +
## payment_method_id_12 + payment_method_id_13 + payment_method_id_14 +
## payment_method_id_15 + payment_method_id_16 + payment_method_id_17 +
## payment_method_id_18 + payment_method_id_19 + payment_method_id_20 +
## payment_method_id_21 + payment_method_id_22 + payment_method_id_23 +
## payment_method_id_26 + payment_method_id_27 + payment_method_id_28 +
## payment_method_id_29 + payment_method_id_30 + payment_method_id_31 +
## payment_method_id_32 + payment_method_id_33 + payment_method_id_34 +
## payment_method_id_35 + payment_method_id_36 + payment_method_id_37 +
## payment_method_id_38 + payment_method_id_39 + payment_method_id_40 +
## payment_method_id_41 + hist_log
set.seed(1023)
train_idx <- createDataPartition(df_join$is_churn, p=0.6, list=FALSE)
trControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
verboseIter = FALSE,
sampling = "down"
)
c50_fit <- train(frm,
data = df_join[train_idx,],
method = "C5.0Rules",
trControl = trControl,
metric = "Kappa")
summary(c50_fit)
##
## Call:
## C50:::C5.0.default(x = x, y = y, rules = TRUE, weights = wts)
##
##
## C5.0 [Release 2.07 GPL Edition] Sun Feb 03 11:30:46 2019
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 3486 cases (44 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (125, lift 2.0)
## is_cancelTRUE <= 0
## plan_list_price_sum <= 427
## payment_method_id_34TRUE > 0
## -> class CF [0.992]
##
## Rule 2: (7, lift 1.8)
## payment_method_id_19TRUE > 0
## -> class CF [0.889]
##
## Rule 3: (41/11, lift 1.4)
## is_auto_renewTRUE <= 0
## payment_method_id_29TRUE > 0
## -> class CF [0.721]
##
## Rule 4: (2575/848, lift 1.3)
## plan_list_price_sum <= 427
## -> class CF [0.671]
##
## Rule 5: (911/16, lift 2.0)
## plan_list_price_sum > 427
## -> class CT [0.981]
##
## Rule 6: (271/14, lift 1.9)
## is_cancelTRUE > 0
## plan_list_price_sum <= 427
## -> class CT [0.945]
##
## Rule 7: (1224/180, lift 1.7)
## is_auto_renewTRUE <= 0
## payment_method_id_29TRUE <= 0
## -> class CT [0.852]
##
## Rule 8: (557/97, lift 1.6)
## hist_trans > 1
## payment_method_id_39TRUE <= 0
## -> class CT [0.825]
##
## Default class: CF
##
##
## Evaluation on training data (3486 cases):
##
## Rules
## ----------------
## No Errors
##
## 8 392(11.2%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 1494 249 (a): class CF
## 143 1600 (b): class CT
##
##
## Attribute usage:
##
## 100.00% plan_list_price_sum
## 36.29% is_auto_renewTRUE
## 36.29% payment_method_id_29TRUE
## 15.98% hist_trans
## 15.98% payment_method_id_39TRUE
## 11.36% is_cancelTRUE
## 3.59% payment_method_id_34TRUE
## 0.20% payment_method_id_19TRUE
##
##
## Time: 0.0 secs
varImp(c50_fit)
## C5.0Rules variable importance
##
## only 20 most important variables shown (out of 43)
##
## Overall
## plan_list_price_sum 100.00
## is_auto_renewTRUE 36.29
## payment_method_id_29TRUE 36.29
## payment_method_id_39TRUE 15.98
## hist_trans 15.98
## is_cancelTRUE 11.36
## payment_method_id_34TRUE 3.59
## payment_method_id_19TRUE 0.20
## payment_method_id_20TRUE 0.00
## payment_method_id_35TRUE 0.00
## payment_method_id_26TRUE 0.00
## payment_method_num3 0.00
## payment_method_id_21TRUE 0.00
## payment_method_id_17TRUE 0.00
## payment_method_id_15TRUE 0.00
## payment_method_id_37TRUE 0.00
## payment_method_id_11TRUE 0.00
## payment_method_id_27TRUE 0.00
## payment_method_id_13TRUE 0.00
## payment_method_id_8TRUE 0.00
train_set <- df_join[train_idx,]
c50_train_pred <- predict(c50_fit, train_set, type="raw")
(cm_mat <- confusionMatrix(factor(c50_train_pred), train_set$is_churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction CF CT
## CF 15461 143
## CT 2776 1600
##
## Accuracy : 0.8539
## 95% CI : (0.8489, 0.8588)
## No Information Rate : 0.9128
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.455
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8478
## Specificity : 0.9180
## Pos Pred Value : 0.9908
## Neg Pred Value : 0.3656
## Prevalence : 0.9128
## Detection Rate : 0.7738
## Detection Prevalence : 0.7810
## Balanced Accuracy : 0.8829
##
## 'Positive' Class : CF
##
train_set <- df_join[train_idx,]
c50_train_prob <- predict(c50_fit, train_set, type="prob")
c50_train_pred <- ifelse(c50_train_prob[,2] > 0.8, "CT", "CF")
(cm_mat <- confusionMatrix(factor(c50_train_pred), train_set$is_churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction CF CT
## CF 18011 792
## CT 226 951
##
## Accuracy : 0.949
## 95% CI : (0.9459, 0.9521)
## No Information Rate : 0.9128
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.625
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9876
## Specificity : 0.5456
## Pos Pred Value : 0.9579
## Neg Pred Value : 0.8080
## Prevalence : 0.9128
## Detection Rate : 0.9015
## Detection Prevalence : 0.9411
## Balanced Accuracy : 0.7666
##
## 'Positive' Class : CF
##
test_set <- df_join[-train_idx,]
c50_test_prob <- predict(c50_fit, test_set, type="prob")
c50_test_pred <- ifelse(c50_test_prob[,2] > 0.8, "CT", "CF")
(cm_mat1 <- confusionMatrix(factor(c50_test_pred), test_set$is_churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction CF CT
## CF 12022 508
## CT 136 654
##
## Accuracy : 0.9517
## 95% CI : (0.9479, 0.9552)
## No Information Rate : 0.9128
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.645
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9888
## Specificity : 0.5628
## Pos Pred Value : 0.9595
## Neg Pred Value : 0.8278
## Prevalence : 0.9128
## Detection Rate : 0.9026
## Detection Prevalence : 0.9407
## Balanced Accuracy : 0.7758
##
## 'Positive' Class : CF
##
Predicted Observed
CF CT
+-----------------------------+
| | |
CF | A | B |
| | |
+-------------+---------------+
| | |
CT | C | D |
| | |
+-----------------------------+
frazione di clienti che considerero’ churn potenziali = (C + D) / (A + B + C + D) (campaign target)
efficienza della campagna di retention = D / (C + D)
veri potenziali clienti churn raggiunti = D / (B + D) (indicato come churn_reach)
efficacia := efficiency * churn_reach
valore minimo di churn post-campagna: (B + D - rho * D) / (A + B + C + D) (indicato come min_reachable_churn)
con rho rendimento di conversione cliente da churn a non-churn
ROI (Return On Investment) := (current_churn - min_reachable_churn)/campaign_target
Il ritorno e’ interpretato come decremento della frazione (percentuale) di clienti churn
L’investimento, come la frazione di target di clienti da contattare durante la campagna di retention
retention_campaign_metrics <- function(cm_mat_i, current_churn_percentage, rho) {
# percentuale di clienti che considerero' churn potenziali
#
# (C + D) / (A + B + C + D)
#
target_perc <- 100*(cm_mat_i[2,1] + cm_mat_i[2,2])/sum(cm_mat_i)
# percentuale di efficienza della campagna di retention, cioe' percentuale dei veri
# churn che raggiungo sul totale dei clienti interessati alla campagna (e' pari al negative predicted value)
#
# D / (C + D)
#
campaign_efficiency_perc <- 100*(cm_mat_i[2,2]/(cm_mat_i[2,1] + cm_mat_i[2,2]))
# percentuale di veri potenziali churn che avro' raggiunto (uguale alla specificity)
#
# D / (B + D)
#
churn_reach_perc <- 100*(cm_mat_i[2,2]/(cm_mat_i[1,2] + cm_mat_i[2,2]))
# tiene conto dell'efficienza e del reah sui clienti churn
effectiveness_perc = campaign_efficiency_perc * churn_reach_perc/100
# esprime quale sia la piu' bassa percentuale di churn raggiungibile finita la
# campagna di retention, tenendo conto di un fattore rho di rendimento della
# campagna stessa sul campione di effettivi clienti churn
#
# Se contatto D clienti, suppongo di convincerne rho * D a restare cliente
# mentre (1-rho) D saranno ancora clienti churn che si sommeranno ai B clienti
# che non ho raggiunto con la campagna perche' il modello li ritiene non churn
#
# Idealmente con rho = 1, il numero di clienti churn dopo la campgna di retention
# sarebbe uguale a B
#
# (B + D - rho * D) / (A + B + C + D)
#
min_reachable_churn_percentage <- round((100*(cm_mat_i[1,2] + cm_mat_i[2,2] - rho * cm_mat_i[2,2]))/sum(cm_mat_i), 4)
roi <- round(100*((current_churn_percentage - min_reachable_churn_percentage)/target_perc), 1)
v <- c(target_perc,
campaign_efficiency_perc,
churn_reach_perc,
effectiveness_perc,
min_reachable_churn_percentage,
roi)
v <- round(v, 1)
names(v) <- c("target%", "efficiency%", "reach%", "effectiveness%", "min_reachable_churn%", "roi")
v
}
(is_churn_cnt_test_set <- sum((test_set$is_churn=="CT")))
## [1] 1162
(customer_churn_percentage <- 100*(is_churn_cnt_test_set/nrow(test_set)))
## [1] 8.723724
(rcm1_wb <- retention_campaign_metrics(cm_mat1$table, customer_churn_percentage, 1))
## target% efficiency% reach%
## 5.9 82.8 56.3
## effectiveness% min_reachable_churn% roi
## 46.6 3.8 82.8
trControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
verboseIter = FALSE,
sampling = "down",
classProbs = TRUE,
summaryFunction = twoClassSummary,
savePredictions = TRUE,
returnResamp = "all")
rpart.grid <- data.frame(cp = c(0.005, 0.01, 0.03, 0.05))
rpart_fit <- train(frm,
data = df_join[train_idx,],
method = "rpart",
trControl = trControl,
metric = "ROC",
tuneGrid = rpart.grid
)
rpart_fit
## CART
##
## 19980 samples
## 39 predictor
## 2 classes: 'CF', 'CT'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 17981, 17982, 17983, 17983, 17983, 17982, ...
## Addtional sampling using down-sampling
##
## Resampling results across tuning parameters:
##
## cp ROC Sens Spec
## 0.005 0.9183860 0.8510720 0.8955908
## 0.010 0.9091038 0.8539671 0.8859494
## 0.030 0.9090771 0.8534298 0.8864079
## 0.050 0.8985119 0.8695068 0.8592631
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.005.
plot(rpart_fit)
varImp(rpart_fit)
## rpart variable importance
##
## only 20 most important variables shown (out of 43)
##
## Overall
## is_auto_renewTRUE 100.0000
## plan_list_price_sum 88.1809
## is_cancelTRUE 48.3262
## payment_method_id_32TRUE 32.0795
## hist_trans 21.9960
## payment_method_id_38TRUE 17.6008
## payment_method_id_39TRUE 6.1409
## registered_via4 2.5091
## payment_method_id_40TRUE 1.7250
## payment_method_id_41TRUE 0.6481
## registered_via9 0.5880
## registered_via7 0.5771
## payment_method_id_34TRUE 0.3982
## hist_log 0.2396
## payment_method_id_18TRUE 0.0000
## payment_method_id_27TRUE 0.0000
## payment_method_id_22TRUE 0.0000
## payment_method_id_33TRUE 0.0000
## payment_method_id_10TRUE 0.0000
## payment_method_id_36TRUE 0.0000
pred1 <- ROCR::prediction(as.vector(predict(rpart_fit, df_join[train_idx,], type="prob")[,2]), df_join[train_idx,]$is_churn)
perf1 <- performance(pred1, "tpr", "fpr")
plot(perf1, colorize=TRUE, print.cutoffs.at=c(0.2, 0.4, 0.6, 0.8))
rpart.plot(rpart_fit$finalModel, cex = 0.8)
rpart.rules(rpart_fit$finalModel)
## .outcome
## 0.05 when plan_list_price_sum < 299 & is_cancelTRUE is 0 & is_auto_renewTRUE is 1 & payment_method_id_40TRUE is 0 & payment_method_id_41TRUE is 0
## 0.09 when plan_list_price_sum < 299 & is_cancelTRUE is 0 & is_auto_renewTRUE is 1 & payment_method_id_40TRUE is 0 & hist_trans < 2 & payment_method_id_41TRUE is 1
## 0.19 when plan_list_price_sum < 299 & is_cancelTRUE is 0 & is_auto_renewTRUE is 1 & payment_method_id_40TRUE is 1 & hist_trans < 2
## 0.62 when plan_list_price_sum < 299 & is_cancelTRUE is 0 & is_auto_renewTRUE is 0
## 0.71 when plan_list_price_sum < 299 & is_cancelTRUE is 0 & is_auto_renewTRUE is 1 & payment_method_id_40TRUE is 1 & hist_trans >= 2
## 0.93 when plan_list_price_sum < 299 & is_cancelTRUE is 0 & is_auto_renewTRUE is 1 & payment_method_id_40TRUE is 0 & hist_trans >= 2 & payment_method_id_41TRUE is 1
## 0.96 when plan_list_price_sum >= 299
## 0.97 when plan_list_price_sum < 299 & is_cancelTRUE is 1
ggplot(df_join, aes(x=is_churn, y=plan_list_price_sum, fill = is_churn)) + geom_boxplot() + scale_fill_manual(values=c("#009900", "#CC0000"))
ggplot(df_join, aes(x=is_churn, y=payment_plan_days_sum, fill = is_churn)) + geom_boxplot() + scale_fill_manual(values=c("#009900", "#CC0000"))
ggplot(df_join, aes(x=cut(plan_list_price_sum, breaks = c(0,100,200,300,1000,2000)))) + geom_histogram(position = 'dodge2', stat='count', aes(fill = is_churn)) + xlab("plan list price sum") + scale_fill_manual(values=c("#009900", "#CC0000"))
ggplot(data=df_join, aes(x=cut(hist_trans, breaks=c(0,1,2,3,4,5,10,50,100,208)))) + geom_bar(position = 'dodge2', aes(fill=is_churn)) + xlab("Transaction history length") + scale_fill_manual(values=c("#009900", "#CC0000")) + scale_x_discrete(labels = c("1","2","3","4","(5, 10]","(10,50]","(100,208]"))
ggplot(data=df_join, aes(x=cut(plan_list_price_sum/hist_trans, breaks=c(0,150,299,600,3000)))) + geom_bar(position = 'dodge2', aes(fill=is_churn)) + xlab("average subscribed plan list price") + scale_fill_manual(values=c("#009900", "#CC0000"))
ggplot(df_join, aes(x=cut(payment_plan_days_sum, breaks = c(0,100,200,300,1000,2000)))) + geom_histogram(position = 'dodge2', stat='count', aes(fill = is_churn)) + xlab("payment plan days total") + scale_fill_manual(values=c("#009900", "#CC0000"))
ggplot(df_join, aes(x=cut(payment_plan_days_sum/hist_trans, breaks = c(0,100,200,300,1000,2000)))) + geom_histogram(position = 'dodge2', stat='count', aes(fill = is_churn)) + xlab("average payment plan days") + scale_fill_manual(values=c("#009900", "#CC0000"))
ggplot(data=df_join, aes(x=is_auto_renew, fill=is_churn)) + geom_bar(position = 'dodge2') + xlab("auto renew") + scale_fill_manual(values=c("#009900", "#CC0000"))
df_join_pay <- left_join(df_churn, df_trans)
df_join_pay <- df_join_pay[complete.cases(df_join_pay),]
ggplot(df_join_pay %>% dplyr::select(is_churn, payment_method_id), aes(x = payment_method_id)) + geom_histogram(position = 'dodge2', stat='count', aes(fill = is_churn)) + xlab("payment method id") + scale_fill_manual(values=c("#009900", "#CC0000"))
rpart.plot(rpart_fit$finalModel, cex = 0.8)
df_filter <- df_join %>% filter(payment_plan_days_sum > 299)
ggplot(data=df_filter, aes(x=cut(hist_trans, breaks=c(0,1,2,3,4,5,10,50,100,208)))) + geom_bar(position = 'dodge2', aes(fill=is_churn)) + xlab("Transaction history length") + scale_fill_manual(values=c("#009900", "#CC0000")) + scale_x_discrete(labels = c("1","2","3","4","(5, 10]","(10,50]","(100,208]"))
rpart.plot(rpart_fit$finalModel, cex = 0.8)
df_filter <- df_join %>% filter(payment_plan_days_sum < 299) %>% filter (is_cancel == TRUE)
ggplot(data=df_filter, aes(x=cut(hist_trans, breaks=c(0,1,2,3,4,5,10,50,100,208)))) + geom_bar(position = 'dodge2', aes(fill=is_churn)) + xlab("Transaction history length") + scale_fill_manual(values=c("#009900", "#CC0000")) + scale_x_discrete(labels = c("1","2","3","4","(5, 10]","(10,50]","(100,208]"))
rpart.plot(rpart_fit$finalModel, cex = 0.8)
df_filter <- df_join %>% filter(payment_plan_days_sum < 299) %>% filter(is_cancel == FALSE) %>% filter(is_auto_renew == FALSE)
ggplot(data=df_filter, aes(x=cut(hist_trans, breaks=c(0,1,2,3,4,5,10,50,100,208)))) + geom_bar(position = 'dodge2', aes(fill=is_churn)) + xlab("Transaction history length") + scale_fill_manual(values=c("#009900", "#CC0000")) + scale_x_discrete(labels = c("1","2","3","4","(5, 10]","(10,50]","(100,208]"))
rpart.plot(rpart_fit$finalModel, cex = 0.8)
df_join_pay <- left_join(df_join, df_trans)
df_join_pay <- df_join_pay
df_filter <- df_join_pay %>% filter(payment_plan_days_sum < 299) %>% filter(is_cancel == FALSE) %>% filter(is_auto_renew == TRUE) %>% filter (hist_trans >= 2)
ggplot(df_filter %>% dplyr::select(is_churn, payment_method_id), aes(x = payment_method_id)) + geom_histogram(position = 'dodge2', stat='count', aes(fill = is_churn)) + xlab("payment method id") + scale_fill_manual(values=c("#009900", "#CC0000"))
df_join_pay <- left_join(df_join, df_trans)
df_join_pay <- df_join_pay
df_filter <- df_join_pay %>% filter(payment_plan_days_sum < 299) %>% filter(is_cancel == FALSE) %>% filter(is_auto_renew == TRUE) %>% filter (hist_trans < 2)
ggplot(df_filter %>% dplyr::select(is_churn, payment_method_id), aes(x = payment_method_id)) + geom_histogram(position = 'dodge2', stat='count', aes(fill = is_churn)) + xlab("payment method id") + scale_fill_manual(values=c("#009900", "#CC0000"))
df_join_ext <- left_join(df_join, df_members[,c("msno", "registration_init_time")])
df_join_ext <- left_join(df_join_ext, df_trans[,c("msno", "membership_expire_date")])
ggplot(df_join_ext %>% dplyr::select(is_churn, registration_init_time) %>% mutate(registration_init_time_year = year(ymd(registration_init_time))), aes(x = registration_init_time_year)) + geom_histogram(position = 'dodge2', stat='count', aes(fill = is_churn)) + scale_x_continuous(labels = as.character(seq(2004,2018,1)), breaks = seq(2004,2018,1)) + xlab("registration initial year") + scale_fill_manual(values=c("#009900", "#CC0000"))
ggplot(df_join_ext %>% dplyr::select(is_churn, membership_expire_date) %>% mutate(membership_expire_year = year(ymd(membership_expire_date))), aes(x = membership_expire_year)) + geom_histogram(position = 'dodge2', stat='count', aes(fill = is_churn)) + xlab("membership expiration year") + scale_fill_manual(values=c("#009900", "#CC0000"))
par(mfrow=c(1,2))
d <- density(df_join[df_join$is_churn == "CF",]$membership_duration)
plot(d, main = "Churn false membership duration")
polygon(d, col = "#009900")
grid()
d <- density(df_join[df_join$is_churn == "CT",]$membership_duration)
plot(d, main = "Churn true membership duration")
polygon(d, col = "#CC0000")
grid()
par(mfrow=c(1,1))
train_set <- df_join[train_idx,]
rpart_train_prob <- predict(rpart_fit, train_set, type="prob")
rpart_train_pred <- ifelse(rpart_train_prob[,2] > 0.5, "CT", "CF")
(cm_mat <- confusionMatrix(factor(rpart_train_pred), train_set$is_churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction CF CT
## CF 15425 150
## CT 2812 1593
##
## Accuracy : 0.8518
## 95% CI : (0.8467, 0.8567)
## No Information Rate : 0.9128
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4494
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8458
## Specificity : 0.9139
## Pos Pred Value : 0.9904
## Neg Pred Value : 0.3616
## Prevalence : 0.9128
## Detection Rate : 0.7720
## Detection Prevalence : 0.7795
## Balanced Accuracy : 0.8799
##
## 'Positive' Class : CF
##
rpart.plot(rpart_fit$finalModel, cex = 0.8)
train_set <- df_join[train_idx,]
rpart_train_prob <- predict(rpart_fit, train_set, type="prob")
rpart_train_pred <- ifelse(rpart_train_prob[,2] > 0.65, "CT", "CF")
(cm_mat <- confusionMatrix(factor(rpart_train_pred), train_set$is_churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction CF CT
## CF 17550 460
## CT 687 1283
##
## Accuracy : 0.9426
## 95% CI : (0.9393, 0.9458)
## No Information Rate : 0.9128
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6596
## Mcnemar's Test P-Value : 2.505e-11
##
## Sensitivity : 0.9623
## Specificity : 0.7361
## Pos Pred Value : 0.9745
## Neg Pred Value : 0.6513
## Prevalence : 0.9128
## Detection Rate : 0.8784
## Detection Prevalence : 0.9014
## Balanced Accuracy : 0.8492
##
## 'Positive' Class : CF
##
test_set <- df_join[-train_idx,]
rpart_test_prob <- predict(rpart_fit, test_set, type="prob")
rpart_test_pred <- ifelse(rpart_test_prob[,2] > 0.65, "CT", "CF")
(cm_mat2 <- confusionMatrix(factor(rpart_test_pred), test_set$is_churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction CF CT
## CF 11725 275
## CT 433 887
##
## Accuracy : 0.9468
## 95% CI : (0.9429, 0.9506)
## No Information Rate : 0.9128
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6856
## Mcnemar's Test P-Value : 3.626e-09
##
## Sensitivity : 0.9644
## Specificity : 0.7633
## Pos Pred Value : 0.9771
## Neg Pred Value : 0.6720
## Prevalence : 0.9128
## Detection Rate : 0.8803
## Detection Prevalence : 0.9009
## Balanced Accuracy : 0.8639
##
## 'Positive' Class : CF
##
(rcm2_wb <- retention_campaign_metrics(cm_mat2$table, customer_churn_percentage, 1))
## target% efficiency% reach%
## 9.9 67.2 76.3
## effectiveness% min_reachable_churn% roi
## 51.3 2.1 67.2
Random Forest
E’ un classificatore d’insieme che è composto da molti alberi di decisione che sono usati per determinare la risposta. Nel caso di classificazione, viene usato un meccanismo di votazione che raccoglie la risposta di ciascun albero decisionale
C50 Boost
L’algoritmo di boosting tree si basa sull’idea generale che computa una sequenza di alberi e dove ogni albero successivo al primo e’ costruito sulla base dei residui della predizione dell’albero che lo precede. In altre parole, costruiamo il primo albero, calcoliamo i residui della predizione sulla base del training set, calcoliamo un secondo albero sulla base di tali residui e sommiamo al primo albero il secondo con associato un peso. Da tale somma si e’ costruito un nuovo classificatore. Se ne calcola i residui corrispondenti e si procede alla costruzione di un nuovo albero e cosi’ via.
Per cui dalla costruzione di singoli weak learner, con questa fitting incrementale basato su residui si puo’ costruire un buon predittore.
Random Forest
trControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
verboseIter = FALSE,
sampling = "down",
classProbs = TRUE,
summaryFunction = twoClassSummary,
savePredictions = TRUE,
returnResamp = "all")
rf_fit <- train(frm,
data = df_join[train_idx,],
method = "rf",
trControl = trControl,
metric = 'ROC',
tuneLength = 3
)
rf_fit
## Random Forest
##
## 19980 samples
## 39 predictor
## 2 classes: 'CF', 'CT'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 17981, 17981, 17983, 17983, 17982, 17983, ...
## Addtional sampling using down-sampling
##
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.9091029 0.8977462 0.6973117
## 22 0.9434049 0.8861766 0.8641353
## 43 0.9383752 0.8800789 0.8409537
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 22.
varImp(rf_fit)
## rf variable importance
##
## only 20 most important variables shown (out of 43)
##
## Overall
## plan_list_price_sum 100.0000
## is_cancelTRUE 51.1344
## is_auto_renewTRUE 41.8769
## hist_log 18.8259
## hist_trans 11.3488
## payment_method_id_32TRUE 6.8600
## payment_method_id_38TRUE 3.8820
## payment_method_id_40TRUE 3.8567
## registered_via9 3.2208
## payment_method_id_39TRUE 2.7976
## payment_method_id_29TRUE 2.5014
## registered_via4 2.2488
## payment_method_id_37TRUE 2.0728
## payment_method_id_36TRUE 1.8239
## registered_via7 1.7188
## payment_method_id_41TRUE 1.3161
## payment_method_num2 1.0348
## payment_method_id_34TRUE 0.4855
## payment_method_id_30TRUE 0.3519
## payment_method_id_33TRUE 0.3474
plot(rf_fit)
pred1 <- ROCR::prediction(as.vector(predict(rf_fit, df_join[train_idx,], type="prob")[,2]), df_join[train_idx,]$is_churn)
perf1 <- performance(pred1, "tpr", "fpr")
plot(perf1, colorize=TRUE, print.cutoffs.at=c(0.2, 0.4, 0.6, 0.8))
train_set <- df_join[train_idx,]
rf_train_probs <- predict(rf_fit, train_set, type = "prob")
rf_train_pred <- ifelse(rf_train_probs[,2] > 0.5, "CT", "CF")
(rf_cm_mat <- confusionMatrix(factor(rf_train_pred), train_set$is_churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction CF CT
## CF 16144 164
## CT 2093 1579
##
## Accuracy : 0.887
## 95% CI : (0.8826, 0.8914)
## No Information Rate : 0.9128
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.5273
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8852
## Specificity : 0.9059
## Pos Pred Value : 0.9899
## Neg Pred Value : 0.4300
## Prevalence : 0.9128
## Detection Rate : 0.8080
## Detection Prevalence : 0.8162
## Balanced Accuracy : 0.8956
##
## 'Positive' Class : CF
##
train_set <- df_join[train_idx,]
rf_train_probs <- predict(rf_fit, train_set, type = "prob")
rf_train_pred <- ifelse(rf_train_probs[,2] > 0.9, "CT", "CF")
(rf_cm_mat <- confusionMatrix(factor(rf_train_pred), train_set$is_churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction CF CT
## CF 17635 386
## CT 602 1357
##
## Accuracy : 0.9506
## 95% CI : (0.9475, 0.9535)
## No Information Rate : 0.9128
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.706
## Mcnemar's Test P-Value : 7.916e-12
##
## Sensitivity : 0.9670
## Specificity : 0.7785
## Pos Pred Value : 0.9786
## Neg Pred Value : 0.6927
## Prevalence : 0.9128
## Detection Rate : 0.8826
## Detection Prevalence : 0.9020
## Balanced Accuracy : 0.8728
##
## 'Positive' Class : CF
##
test_set <- df_join[-train_idx,]
rf_test_probs <- predict(rf_fit, test_set, type = "prob")
rf_test_pred <- ifelse(rf_test_probs[,2] > 0.9, "CT", "CF")
(cm_mat3 <- confusionMatrix(factor(rf_test_pred), test_set$is_churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction CF CT
## CF 11729 302
## CT 429 860
##
## Accuracy : 0.9451
## 95% CI : (0.9411, 0.9489)
## No Information Rate : 0.9128
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6716
## Mcnemar's Test P-Value : 3.158e-06
##
## Sensitivity : 0.9647
## Specificity : 0.7401
## Pos Pred Value : 0.9749
## Neg Pred Value : 0.6672
## Prevalence : 0.9128
## Detection Rate : 0.8806
## Detection Prevalence : 0.9032
## Balanced Accuracy : 0.8524
##
## 'Positive' Class : CF
##
(rcm1_bb <- retention_campaign_metrics(cm_mat3$table, customer_churn_percentage, 1))
## target% efficiency% reach%
## 9.7 66.7 74.0
## effectiveness% min_reachable_churn% roi
## 49.4 2.3 66.7
C50 boosting
trControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
verboseIter = FALSE,
sampling = "down",
classProbs = TRUE,
summaryFunction = twoClassSummary,
savePredictions = TRUE,
returnResamp = "all")
grid <- expand.grid(.winnow = c(FALSE), .trials=c(25, 30, 35, 40), .model="tree" )
c50_boost_fit <- train(frm,
data = df_join[train_idx,],
method = "C5.0",
trControl = trControl,
metric = 'ROC',
tuneGrid=grid
)
plot(c50_boost_fit)
varImp(c50_boost_fit)
## C5.0 variable importance
##
## only 20 most important variables shown (out of 43)
##
## Overall
## payment_method_id_15TRUE 100.00
## payment_method_id_34TRUE 100.00
## is_cancelTRUE 100.00
## is_auto_renewTRUE 100.00
## hist_trans 100.00
## plan_list_price_sum 100.00
## payment_method_id_33TRUE 100.00
## payment_method_id_32TRUE 100.00
## payment_method_id_31TRUE 100.00
## payment_method_id_39TRUE 99.68
## payment_method_id_40TRUE 97.73
## payment_method_id_37TRUE 91.65
## payment_method_num2 89.19
## payment_method_id_17TRUE 88.38
## payment_method_id_41TRUE 86.69
## payment_method_id_29TRUE 78.57
## payment_method_id_38TRUE 66.27
## payment_method_id_36TRUE 64.77
## registered_via4 57.49
## registered_via9 47.59
summary(c50_boost_fit$finalModel)
##
## Call:
## (function (x, y, trials = 1, rules = FALSE, weights = NULL, control
## = 2, fuzzyThreshold = FALSE, sample = 0, earlyStopping = TRUE, label
## = "outcome", seed = 1692L))
##
##
## C5.0 [Release 2.07 GPL Edition] Sun Feb 03 11:42:19 2019
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 3486 cases (44 attributes) from undefined.data
##
## ----- Trial 0: -----
##
## Decision tree:
##
## plan_list_price_sum > 398: CT (908/13)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE > 0: CT (266/9)
## is_cancelTRUE <= 0:
## :...is_auto_renewTRUE <= 0:
## :...payment_method_id_29TRUE <= 0: CT (524/183)
## : payment_method_id_29TRUE > 0: CF (37/11)
## is_auto_renewTRUE > 0:
## :...payment_method_id_39TRUE > 0: CF (277)
## payment_method_id_39TRUE <= 0:
## :...hist_trans <= 1: CF (1313/132)
## hist_trans > 1: CT (161/54)
##
## ----- Trial 1: -----
##
## Decision tree:
##
## plan_list_price_sum > 398: CT (735.1/34.7)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE <= 0: CF (2525.8/732.1)
## is_cancelTRUE > 0: CT (225.1/24)
##
## ----- Trial 2: -----
##
## Decision tree:
##
## payment_method_id_32TRUE > 0: CT (255.4/3.4)
## payment_method_id_32TRUE <= 0:
## :...payment_method_id_15TRUE > 0: CT (50.5)
## payment_method_id_15TRUE <= 0:
## :...plan_list_price_sum > 398: CT (387.3/41.3)
## plan_list_price_sum <= 398:
## :...payment_method_id_34TRUE > 0: CF (80.2/3.9)
## payment_method_id_34TRUE <= 0:
## :...is_cancelTRUE > 0: CT (192.7/31)
## is_cancelTRUE <= 0:
## :...payment_method_id_39TRUE > 0: CF (178.5)
## payment_method_id_39TRUE <= 0:
## :...payment_method_id_33TRUE > 0: CF (41.4)
## payment_method_id_33TRUE <= 0:
## :...payment_method_id_31TRUE > 0: CF (36)
## payment_method_id_31TRUE <= 0:
## :...payment_method_id_37TRUE > 0: CF (165.4/37.2)
## payment_method_id_37TRUE <= 0:
## :...hist_trans > 1: CT (306.6/105.4)
## hist_trans <= 1:
## :...registered_via7 > 0: CF (305.6/99.6)
## registered_via7 <= 0:
## :...plan_list_price_sum <= 105: CT (31.3/4.5)
## plan_list_price_sum > 105: [S1]
##
## SubTree [S1]
##
## payment_method_id_40TRUE > 0: CT (412.1/167.5)
## payment_method_id_40TRUE <= 0:
## :...payment_method_id_29TRUE <= 0: CF (955/455)
## payment_method_id_29TRUE > 0: CT (88/32.9)
##
## ----- Trial 3: -----
##
## Decision tree:
##
## plan_list_price_sum > 398: CT (540.8/51.8)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE > 0: CT (180.1/35.8)
## is_cancelTRUE <= 0:
## :...is_auto_renewTRUE <= 0: CT (1047.4/376)
## is_auto_renewTRUE > 0:
## :...plan_list_price_sum <= 120: CF (131.1/32.1)
## plan_list_price_sum > 120:
## :...payment_method_id_36TRUE <= 0: CF (1340.5/433.7)
## payment_method_id_36TRUE > 0: CT (246/114.8)
##
## ----- Trial 4: -----
##
## Decision tree:
##
## payment_method_id_32TRUE > 0: CT (187.7/4.6)
## payment_method_id_32TRUE <= 0:
## :...payment_method_id_15TRUE > 0: CT (36.7)
## payment_method_id_15TRUE <= 0:
## :...plan_list_price_sum > 398: CT (306.5/55.1)
## plan_list_price_sum <= 398:
## :...payment_method_id_31TRUE > 0: CF (27.5/1.4)
## payment_method_id_31TRUE <= 0:
## :...payment_method_id_33TRUE > 0: CF (32.9/2.8)
## payment_method_id_33TRUE <= 0:
## :...payment_method_id_39TRUE > 0: CF (160.1/30.4)
## payment_method_id_39TRUE <= 0:
## :...is_cancelTRUE > 0: CT (130.3/41.3)
## is_cancelTRUE <= 0:
## :...hist_trans > 1: CT (374/140.1)
## hist_trans <= 1:
## :...is_auto_renewTRUE > 0: CF (1262.2/498.5)
## is_auto_renewTRUE <= 0:
## :...payment_method_id_29TRUE > 0: CF (74.1/27.5)
## payment_method_id_29TRUE <= 0:
## :...registered_via9 <= 0: CF (562.4/258.3)
## registered_via9 > 0: CT (331.5/136)
##
## ----- Trial 5: -----
##
## Decision tree:
##
## payment_method_id_32TRUE > 0: CT (166.4/5.1)
## payment_method_id_32TRUE <= 0:
## :...payment_method_id_15TRUE > 0: CT (32.3)
## payment_method_id_15TRUE <= 0:
## :...payment_method_id_17TRUE > 0: CT (25.6/2.9)
## payment_method_id_17TRUE <= 0:
## :...plan_list_price_sum > 398: CT (276.1/61.7)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE > 0: CT (200.2/46.3)
## is_cancelTRUE <= 0:
## :...payment_method_id_39TRUE > 0: CF (114.2)
## payment_method_id_39TRUE <= 0:
## :...payment_method_id_33TRUE > 0: CF (26.5)
## payment_method_id_33TRUE <= 0:
## :...payment_method_id_37TRUE > 0: CF (142.1/56.3)
## payment_method_id_37TRUE <= 0:
## :...payment_method_id_36TRUE > 0: CF (653.2/300.5)
## payment_method_id_36TRUE <= 0:
## :...registered_via9 <= 0: CT (1114/490)
## registered_via9 > 0: CF (735.3/326.1)
##
## ----- Trial 6: -----
##
## Decision tree:
##
## payment_method_id_32TRUE > 0: CT (150/5.4)
## payment_method_id_32TRUE <= 0:
## :...payment_method_id_15TRUE > 0: CT (28.9)
## payment_method_id_15TRUE <= 0:
## :...hist_trans > 6: CF (39.4/7)
## hist_trans <= 6:
## :...plan_list_price_sum > 398: CT (223.9/32.4)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE > 0: CT (186.6/48.6)
## is_cancelTRUE <= 0:
## :...payment_method_id_39TRUE > 0: CF (102.4)
## payment_method_id_39TRUE <= 0:
## :...payment_method_id_29TRUE > 0: CF (131.9/54.7)
## payment_method_id_29TRUE <= 0:
## :...registered_via4 > 0: CT (315.1/132.4)
## registered_via4 <= 0:
## :...payment_method_id_40TRUE > 0: CT (550/239.6)
## payment_method_id_40TRUE <= 0:
## :...is_auto_renewTRUE <= 0: CT (818.2/371.3)
## is_auto_renewTRUE > 0: CF (939.5/339.9)
##
## ----- Trial 7: -----
##
## Decision tree:
##
## plan_list_price_sum > 398: CT (370.6/69.2)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE > 0: CT (174.4/52)
## is_cancelTRUE <= 0:
## :...payment_method_id_39TRUE > 0: CF (90.8)
## payment_method_id_39TRUE <= 0:
## :...hist_trans > 1: CT (402.5/159.4)
## hist_trans <= 1:
## :...is_auto_renewTRUE > 0: CF (1380.6/546.1)
## is_auto_renewTRUE <= 0:
## :...payment_method_id_38TRUE <= 0: CT (457.2/196.9)
## payment_method_id_38TRUE > 0: CF (610/269)
##
## ----- Trial 8: -----
##
## Decision tree:
##
## payment_method_id_32TRUE > 0: CT (121.3/6.3)
## payment_method_id_32TRUE <= 0:
## :...plan_list_price_sum > 398: CT (246.9/69.4)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE > 0: CT (166.3/56.5)
## is_cancelTRUE <= 0:
## :...payment_method_id_38TRUE > 0: CT (665.7/310.4)
## payment_method_id_38TRUE <= 0:
## :...plan_list_price_sum <= 120: CF (148.7/54.8)
## plan_list_price_sum > 120:
## :...payment_method_id_41TRUE <= 0: CF (1876.1/815.4)
## payment_method_id_41TRUE > 0: CT (261/99.9)
##
## ----- Trial 9: -----
##
## Decision tree:
##
## payment_method_id_15TRUE > 0: CT (26)
## payment_method_id_15TRUE <= 0:
## :...payment_method_id_32TRUE > 0: CT (112.4/6.5)
## payment_method_id_32TRUE <= 0:
## :...payment_method_id_34TRUE > 0: CF (42.5/8.2)
## payment_method_id_34TRUE <= 0:
## :...payment_method_id_33TRUE > 0: CF (32.1/6.5)
## payment_method_id_33TRUE <= 0:
## :...hist_trans > 4: CF (71.4/18.5)
## hist_trans <= 4:
## :...plan_list_price_sum > 398: CT (139.5/13.1)
## plan_list_price_sum <= 398:
## :...hist_trans > 2: CF (57.8/17.4)
## hist_trans <= 2:
## :...is_cancelTRUE > 0: CT (125.9/32.7)
## is_cancelTRUE <= 0:
## :...payment_method_id_39TRUE > 0: CF (75)
## payment_method_id_39TRUE <= 0:
## :...hist_trans > 1: CT (373.3/157.8)
## hist_trans <= 1:
## :...payment_method_id_37TRUE > 0: CF (77.9/25.2)
## payment_method_id_37TRUE <= 0:
## :...registered_via7 > 0: CF (355.6/150.7)
## registered_via7 <= 0: [S1]
##
## SubTree [S1]
##
## plan_list_price_sum <= 105: CT (30.7/4.4)
## plan_list_price_sum > 105:
## :...registered_via9 > 0: CT (920.4/434.1)
## registered_via9 <= 0:
## :...hist_log <= 24: CF (517.5/221.6)
## hist_log > 24: CT (528.2/249.9)
##
## ----- Trial 10: -----
##
## Decision tree:
##
## payment_method_id_32TRUE > 0: CT (102.8/6.8)
## payment_method_id_32TRUE <= 0:
## :...plan_list_price_sum > 398: CT (235.3/67.5)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE > 0: CT (153.3/57.7)
## is_cancelTRUE <= 0:
## :...payment_method_id_39TRUE > 0: CF (68)
## payment_method_id_39TRUE <= 0:
## :...hist_trans <= 1: CF (2514.6/1148.6)
## hist_trans > 1: CT (412/189.1)
##
## ----- Trial 11: -----
##
## Decision tree:
##
## payment_method_id_31TRUE > 0: CF (31/3.1)
## payment_method_id_31TRUE <= 0:
## :...payment_method_id_33TRUE > 0: CF (32.9/9.5)
## payment_method_id_33TRUE <= 0:
## :...payment_method_id_34TRUE > 0: CF (42.1/12.7)
## payment_method_id_34TRUE <= 0:
## :...payment_method_id_39TRUE > 0: CF (119.3/55.9)
## payment_method_id_39TRUE <= 0:
## :...hist_trans <= 2: CT (3082.1/1425.8)
## hist_trans > 2: CF (178.6/68)
##
## ----- Trial 12: -----
##
## Decision tree:
##
## plan_list_price_sum > 398: CT (310.3/74.5)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE > 0: CT (151/59.4)
## is_cancelTRUE <= 0:
## :...is_auto_renewTRUE <= 0: CT (1207.4/567.1)
## is_auto_renewTRUE > 0: CF (1817.3/783.3)
##
## ----- Trial 13: -----
##
## Decision tree:
##
## payment_method_id_32TRUE > 0: CT (88.1/7)
## payment_method_id_32TRUE <= 0:
## :...plan_list_price_sum > 398: CT (225.9/71.1)
## plan_list_price_sum <= 398:
## :...payment_method_id_40TRUE <= 0: CF (2471.3/1145.5)
## payment_method_id_40TRUE > 0: CT (700.7/327.8)
##
## ----- Trial 14: -----
##
## Decision tree:
##
## plan_list_price_sum > 398: CT (289/79.4)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE > 0: CT (163.7/58.8)
## is_cancelTRUE <= 0:
## :...payment_method_id_39TRUE > 0: CF (53.9)
## payment_method_id_39TRUE <= 0:
## :...plan_list_price_sum <= 180: CF (2537/1216.4)
## plan_list_price_sum > 180: CT (442.4/182.5)
##
## ----- Trial 15: -----
##
## Decision tree:
##
## payment_method_id_31TRUE > 0: CF (26.5/3.4)
## payment_method_id_31TRUE <= 0:
## :...payment_method_id_33TRUE > 0: CF (30.7/9.5)
## payment_method_id_33TRUE <= 0:
## :...payment_method_id_34TRUE > 0: CF (36.4/12.5)
## payment_method_id_34TRUE <= 0:
## :...hist_trans > 4: CF (78.8/25.2)
## hist_trans <= 4:
## :...plan_list_price_sum > 540: CT (103.4/7.3)
## plan_list_price_sum <= 540:
## :...plan_list_price_sum > 398: CT (79/14.7)
## plan_list_price_sum <= 398:
## :...plan_list_price_sum > 300: CF (217.1/86.1)
## plan_list_price_sum <= 300:
## :...is_cancelTRUE > 0: CT (113.8/29)
## is_cancelTRUE <= 0:
## :...payment_method_id_39TRUE <= 0: CT (2749.2/1284.4)
## payment_method_id_39TRUE > 0: CF (51.1)
##
## ----- Trial 16: -----
##
## Decision tree:
##
## plan_list_price_sum > 398: CT (273.8/78.4)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE > 0: CT (154.8/59)
## is_cancelTRUE <= 0:
## :...payment_method_id_39TRUE > 0: CF (47.6)
## payment_method_id_39TRUE <= 0:
## :...is_auto_renewTRUE <= 0:
## :...payment_method_id_29TRUE <= 0: CT (1138.7/530.6)
## : payment_method_id_29TRUE > 0: CF (86.7/33.4)
## is_auto_renewTRUE > 0:
## :...payment_method_num2 > 0: CF (26.9/5.6)
## payment_method_num2 <= 0:
## :...hist_trans <= 1: CF (1427.4/629.4)
## hist_trans > 1: CT (330.2/150.9)
##
## ----- Trial 17: -----
##
## Decision tree:
##
## payment_method_id_32TRUE > 0: CT (64.2)
## payment_method_id_32TRUE <= 0:
## :...plan_list_price_sum <= 398: CF (3214.5/1532.3)
## plan_list_price_sum > 398: CT (206.3/75)
##
## ----- Trial 18: -----
##
## Decision tree:
##
## payment_method_id_32TRUE > 0: CT (62)
## payment_method_id_32TRUE <= 0:
## :...hist_trans > 8: CF (29.6/6.7)
## hist_trans <= 8:
## :...plan_list_price_sum > 398: CT (157.7/37.6)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE > 0: CT (172.3/59.9)
## is_cancelTRUE <= 0:
## :...payment_method_id_39TRUE > 0: CF (43.1)
## payment_method_id_39TRUE <= 0:
## :...registered_via4 <= 0: CF (2616.3/1276.6)
## registered_via4 > 0: CT (402/181.8)
##
## ----- Trial 19: -----
##
## Decision tree:
##
## is_cancelTRUE > 0: CT (152.7/38)
## is_cancelTRUE <= 0:
## :...payment_method_id_39TRUE > 0: CF (56.7/15.5)
## payment_method_id_39TRUE <= 0:
## :...payment_method_id_34TRUE <= 0: CT (3242/1516.1)
## payment_method_id_34TRUE > 0: CF (27.6/7.9)
##
## ----- Trial 20: -----
##
## Decision tree:
##
## payment_method_id_32TRUE > 0: CT (57)
## payment_method_id_32TRUE <= 0:
## :...is_cancelTRUE > 0: CT (134.2/24.6)
## is_cancelTRUE <= 0:
## :...plan_list_price_sum <= 398: CF (3129.1/1512.5)
## plan_list_price_sum > 398: CT (153.7/34.5)
##
## ----- Trial 21: -----
##
## Decision tree:
##
## payment_method_id_31TRUE > 0: CF (25.6/3.7)
## payment_method_id_31TRUE <= 0:
## :...plan_list_price_sum > 398: CT (186.1/23.2)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE > 0: CT (122.7/25.5)
## is_cancelTRUE <= 0:
## :...payment_method_id_39TRUE > 0: CF (38)
## payment_method_id_39TRUE <= 0:
## :...payment_method_id_37TRUE <= 0: CT (2973.2/1414)
## payment_method_id_37TRUE > 0: CF (123.5/45.4)
##
## ----- Trial 22: -----
##
## Decision tree:
##
## plan_list_price_sum > 398: CT (156.6)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE > 0: CT (94.5)
## is_cancelTRUE <= 0:
## :...payment_method_id_39TRUE > 0: CF (36)
## payment_method_id_39TRUE <= 0:
## :...payment_method_id_34TRUE > 0: CF (30.2)
## payment_method_id_34TRUE <= 0:
## :...plan_list_price_sum > 180: CT (460.6/193.3)
## plan_list_price_sum <= 180:
## :...is_auto_renewTRUE <= 0: CT (1208.1/603.2)
## is_auto_renewTRUE > 0: CF (1474.8/653.4)
##
## ----- Trial 23: -----
##
## Decision tree:
##
## is_cancelTRUE > 0: CT (94.5)
## is_cancelTRUE <= 0:
## :...plan_list_price_sum <= 99: CF (107.9/23.6)
## plan_list_price_sum > 99:
## :...payment_method_id_37TRUE > 0: CF (111.5/39)
## payment_method_id_37TRUE <= 0:
## :...payment_method_id_41TRUE > 0: CT (335/139.2)
## payment_method_id_41TRUE <= 0:
## :...payment_method_id_40TRUE > 0: CT (747/345.8)
## payment_method_id_40TRUE <= 0:
## :...payment_method_num2 > 0: CT (50/12.8)
## payment_method_num2 <= 0:
## :...plan_list_price_sum > 298: CT (276/71.8)
## plan_list_price_sum <= 298:
## :...hist_trans > 1: CF (107.6/11.3)
## hist_trans <= 1:
## :...plan_list_price_sum <= 150: CF (854.5/329.3)
## plan_list_price_sum > 150:
## :...hist_log <= 30: CF (695.2/307.4)
## hist_log > 30: CT (69.9/20.2)
##
## ----- Trial 24: -----
##
## Decision tree:
##
## plan_list_price_sum > 398: CT (141.1)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE > 0: CT (84.8)
## is_cancelTRUE <= 0:
## :...plan_list_price_sum <= 0: CT (25.4)
## plan_list_price_sum > 0:
## :...is_auto_renewTRUE > 0: CF (1744.4/578)
## is_auto_renewTRUE <= 0:
## :...plan_list_price_sum > 180: CT (95.3/28.4)
## plan_list_price_sum <= 180:
## :...payment_method_id_29TRUE > 0: CF (100.4/38.3)
## payment_method_id_29TRUE <= 0:
## :...registered_via9 <= 0: CF (770.1/363.9)
## registered_via9 > 0: CT (447.4/204.8)
##
## ----- Trial 25: -----
##
## Decision tree:
##
## plan_list_price_sum > 398: CT (126.3)
## plan_list_price_sum <= 398:
## :...is_cancelTRUE > 0: CT (75.9)
## is_cancelTRUE <= 0:
## :...payment_method_id_39TRUE > 0: CF (28.9)
## payment_method_id_39TRUE <= 0:
## :...plan_list_price_sum > 180: CT (509.8/182.6)
## plan_list_price_sum <= 180:
## :...is_auto_renewTRUE <= 0:
## :...registered_via9 <= 0: CT (926.8/401.6)
## : registered_via9 > 0: CF (494.1/230.4)
## is_auto_renewTRUE > 0:
## :...payment_method_id_40TRUE <= 0: CF (471/4.3)
## payment_method_id_40TRUE > 0: CT (764.1/365.9)
##
## ----- Trial 26: -----
##
## Decision tree:
##
## is_cancelTRUE > 0: CT (80.4)
## is_cancelTRUE <= 0:
## :...plan_list_price_sum > 180: CT (639/193.5)
## plan_list_price_sum <= 180:
## :...plan_list_price_sum <= 0: CT (25.4)
## plan_list_price_sum > 0:
## :...is_auto_renewTRUE > 0: CF (1037.3/72.7)
## is_auto_renewTRUE <= 0:
## :...payment_method_id_29TRUE > 0: CF (112.6/47.1)
## payment_method_id_29TRUE <= 0:
## :...hist_log <= 23: CF (618.7/289.1)
## hist_log > 23: CT (815.7/360.3)
##
## ----- Trial 27: -----
##
## Decision tree:
##
## plan_list_price_sum > 358: CT (215/10.9)
## plan_list_price_sum <= 358:
## :...is_cancelTRUE > 0: CT (64.4)
## is_cancelTRUE <= 0:
## :...is_auto_renewTRUE > 0: CF (1350.3/158.1)
## is_auto_renewTRUE <= 0:
## :...hist_trans > 1: CT (37.2)
## hist_trans <= 1:
## :...plan_list_price_sum <= 120: CT (29.4)
## plan_list_price_sum > 120:
## :...hist_log <= 22: CT (612.3/256.2)
## hist_log > 22: CF (989.4/454.5)
##
## ----- Trial 28: -----
##
## Decision tree:
##
## plan_list_price_sum > 358: CT (175.9)
## plan_list_price_sum <= 358:
## :...is_cancelTRUE > 0: CT (55.5)
## is_cancelTRUE <= 0:
## :...payment_method_id_32TRUE > 0: CT (34.5)
## payment_method_id_32TRUE <= 0:
## :...payment_method_id_39TRUE > 0: CF (220.2)
## payment_method_id_39TRUE <= 0:
## :...is_auto_renewTRUE > 0:
## :...plan_list_price_sum <= 180: CF (721.8)
## : plan_list_price_sum > 180: CT (313.8/85.8)
## is_auto_renewTRUE <= 0:
## :...hist_trans > 1: CT (31.7)
## hist_trans <= 1:
## :...plan_list_price_sum <= 120: CT (25.3)
## plan_list_price_sum > 120:
## :...payment_method_id_29TRUE > 0: CF (95.5/28.3)
## payment_method_id_29TRUE <= 0:
## :...hist_log <= 23: CF (696.9/327.1)
## hist_log > 23: CT (905.8/329.6)
##
## ----- Trial 29: -----
##
## Decision tree:
##
## is_auto_renewTRUE > 0:
## :...is_cancelTRUE <= 0: CF (1163.8/238)
## : is_cancelTRUE > 0: CT (53.2)
## is_auto_renewTRUE <= 0:
## :...payment_method_id_29TRUE > 0: CF (82.1/22.6)
## payment_method_id_29TRUE <= 0:
## :...hist_log > 30: CT (168.7)
## hist_log <= 30:
## :...plan_list_price_sum > 180: CT (134.7)
## plan_list_price_sum <= 180:
## :...registered_via9 > 0: CT (619.4/107.4)
## registered_via9 <= 0:
## :...payment_method_id_36TRUE <= 0: CT (625.1/173)
## payment_method_id_36TRUE > 0: CF (331.1/144.7)
##
##
## Evaluation on training data (3486 cases):
##
## Trial Decision Tree
## ----- ----------------
## Size Errors
##
## 0 7 402(11.5%)
## 1 3 613(17.6%)
## 2 15 707(20.3%)
## 3 6 573(16.4%)
## 4 12 495(14.2%)
## 5 11 881(25.3%)
## 6 11 673(19.3%)
## 7 7 452(13.0%)
## 8 7 631(18.1%)
## 9 16 807(23.1%)
## 10 6 489(14.0%)
## 11 6 1566(44.9%)
## 12 4 470(13.5%)
## 13 4 950(27.3%)
## 14 5 488(14.0%)
## 15 10 1352(38.8%)
## 16 8 398(11.4%)
## 17 3 830(23.8%)
## 18 7 599(17.2%)
## 19 4 1445(41.5%)
## 20 4 582(16.7%)
## 21 6 1256(36.0%)
## 22 7 415(11.9%)
## 23 11 833(23.9%)
## 24 8 476(13.7%)
## 25 8 663(19.0%)
## 26 7 684(19.6%)
## 27 7 512(14.7%)
## 28 11 410(11.8%)
## 29 8 616(17.7%)
## boost 394(11.3%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 1523 220 (a): class CF
## 174 1569 (b): class CT
##
##
## Attribute usage:
##
## 100.00% hist_trans
## 100.00% is_auto_renewTRUE
## 100.00% is_cancelTRUE
## 100.00% plan_list_price_sum
## 100.00% payment_method_id_15TRUE
## 100.00% payment_method_id_31TRUE
## 100.00% payment_method_id_32TRUE
## 100.00% payment_method_id_33TRUE
## 100.00% payment_method_id_34TRUE
## 99.68% payment_method_id_39TRUE
## 97.73% payment_method_id_40TRUE
## 91.65% payment_method_id_37TRUE
## 89.19% payment_method_num2
## 88.38% payment_method_id_17TRUE
## 86.69% payment_method_id_41TRUE
## 78.57% payment_method_id_29TRUE
## 66.27% payment_method_id_38TRUE
## 64.77% payment_method_id_36TRUE
## 57.49% registered_via4
## 47.59% registered_via9
## 45.55% hist_log
## 41.05% registered_via7
##
##
## Time: 0.3 secs
pred1 <- ROCR::prediction(as.vector(predict(c50_boost_fit, df_join[train_idx,], type="prob")[,2]), df_join[train_idx,]$is_churn)
perf1 <- performance(pred1, "tpr", "fpr")
plot(perf1, colorize=TRUE, print.cutoffs.at=c(0.2, 0.4, 0.6, 0.8))
train_set <- df_join[train_idx,]
c50_boost_train_probs <- predict(c50_boost_fit, train_set, type = "prob")
c50_boost_train_pred <- ifelse(c50_boost_train_probs[,2] > 0.5, "CT", "CF")
(c50_boost_cm_mat <- confusionMatrix(factor(c50_boost_train_pred), train_set$is_churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction CF CT
## CF 15758 174
## CT 2479 1569
##
## Accuracy : 0.8672
## 95% CI : (0.8624, 0.8719)
## No Information Rate : 0.9128
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4782
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8641
## Specificity : 0.9002
## Pos Pred Value : 0.9891
## Neg Pred Value : 0.3876
## Prevalence : 0.9128
## Detection Rate : 0.7887
## Detection Prevalence : 0.7974
## Balanced Accuracy : 0.8821
##
## 'Positive' Class : CF
##
train_set <- df_join[train_idx,]
c50_boost_train_probs <- predict(c50_boost_fit, train_set, type = "prob")
c50_boost_train_pred <- ifelse(c50_boost_train_probs[,2] > 0.7, "CT", "CF")
(c50_boost_cm_mat <- confusionMatrix(factor(c50_boost_train_pred), train_set$is_churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction CF CT
## CF 17899 502
## CT 338 1241
##
## Accuracy : 0.958
## 95% CI : (0.9551, 0.9607)
## No Information Rate : 0.9128
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7243
## Mcnemar's Test P-Value : 1.865e-08
##
## Sensitivity : 0.9815
## Specificity : 0.7120
## Pos Pred Value : 0.9727
## Neg Pred Value : 0.7859
## Prevalence : 0.9128
## Detection Rate : 0.8958
## Detection Prevalence : 0.9210
## Balanced Accuracy : 0.8467
##
## 'Positive' Class : CF
##
test_set <- df_join[-train_idx,]
c50_boost_test_probs <- predict(c50_boost_fit, test_set, type = "prob")
c50_boost_test_pred <- ifelse(c50_boost_test_probs[,2] > 0.7, "CT", "CF")
(cm_mat4 <- confusionMatrix(factor(c50_boost_test_pred), test_set$is_churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction CF CT
## CF 11934 296
## CT 224 866
##
## Accuracy : 0.961
## 95% CI : (0.9575, 0.9642)
## No Information Rate : 0.9128
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7478
## Mcnemar's Test P-Value : 0.001848
##
## Sensitivity : 0.9816
## Specificity : 0.7453
## Pos Pred Value : 0.9758
## Neg Pred Value : 0.7945
## Prevalence : 0.9128
## Detection Rate : 0.8959
## Detection Prevalence : 0.9182
## Balanced Accuracy : 0.8634
##
## 'Positive' Class : CF
##
(rcm2_bb <- retention_campaign_metrics(cm_mat4$table, customer_churn_percentage, 1))
## target% efficiency% reach%
## 8.2 79.4 74.5
## effectiveness% min_reachable_churn% roi
## 59.2 2.2 79.4
cbind(rcm1_wb, rcm2_wb, rcm1_bb, rcm2_bb)
## rcm1_wb rcm2_wb rcm1_bb rcm2_bb
## target% 5.9 9.9 9.7 8.2
## efficiency% 82.8 67.2 66.7 79.4
## reach% 56.3 76.3 74.0 74.5
## effectiveness% 46.6 51.3 49.4 59.2
## min_reachable_churn% 3.8 2.1 2.3 2.2
## roi 82.8 67.2 66.7 79.4
Plot del ROI della campagna di “retention” basata sul rendimento (“rho”) di convincimento dei clienti della stessa.
rho <- seq(0, 1, by = 0.01)
roi_wb1 <- sapply(rho, function(x) retention_campaign_metrics(cm_mat1$table, customer_churn_percentage, x)["roi"])
roi_wb2 <- sapply(rho, function(x) retention_campaign_metrics(cm_mat2$table, customer_churn_percentage, x)["roi"])
roi_bb1 <- sapply(rho, function(x) retention_campaign_metrics(cm_mat3$table, customer_churn_percentage, x)["roi"])
roi_bb2 <- sapply(rho, function(x) retention_campaign_metrics(cm_mat4$table, customer_churn_percentage, x)["roi"])
mrc_wb1 <- sapply(rho, function(x) retention_campaign_metrics(cm_mat1$table, customer_churn_percentage, x)["min_reachable_churn%"])
mrc_wb2 <- sapply(rho, function(x) retention_campaign_metrics(cm_mat2$table, customer_churn_percentage, x)["min_reachable_churn%"])
mrc_bb1 <- sapply(rho, function(x) retention_campaign_metrics(cm_mat3$table, customer_churn_percentage, x)["min_reachable_churn%"])
mrc_bb2 <- sapply(rho, function(x) retention_campaign_metrics(cm_mat4$table, customer_churn_percentage, x)["min_reachable_churn%"])
rho_df <- data.frame(rho = rho, c50rules = roi_wb1, rpart = roi_wb2, rf = roi_bb1, c50boost = roi_bb2)
mrc_df <- data.frame(rho = rho, c50rules = mrc_wb1, rpart = mrc_wb2, rf = mrc_bb1, c50boost = mrc_bb2)
rho_plot <- ggplot(data = rho_df) + geom_line(aes(x = rho, y = c50rules, colour = "c50rules"), size = 1.2) + geom_line(aes(x = rho, y = rpart, colour = "rpart"), size = 1.2) + geom_line(aes(x = rho, y = rf, colour = "rf"), size = 1.2) + geom_line(aes(x = rho, y = c50boost, colour = "c50boost"), size = 1.2) + ylab("ROI") + scale_colour_manual("model", values = c("red", "blue", "green", "violet")) + theme_bw() + ggtitle("Models ROI as a function of retention campaign rho")
rho_plot
mrc_plot <- ggplot(data = mrc_df) + geom_line(aes(x = rho, y = c50rules, colour = "c50rules"), size = 1.2) + geom_line(aes(x = rho, y = rpart, colour = "rpart"), size = 1.2) + geom_line(aes(x = rho, y = rf, colour = "rf"), size = 1.2) + geom_line(aes(x = rho, y = c50boost, colour = "c50boost"), size = 1.2) + ylab("Minimum Reachable Churn Percentage") + scale_colour_manual("model", values = c("red", "blue", "green", "violet")) + theme_bw() + ggtitle("Models Minimum Churn Reach as a function of retention campaign rho")
mrc_plot
model_results <- resamples(list(RPART_TREE=rpart_fit, RANDOM_FOREST=rf_fit, C50_BOOST=c50_boost_fit))
summary(model_results)
##
## Call:
## summary.resamples(object = model_results)
##
## Models: RPART_TREE, RANDOM_FOREST, C50_BOOST
## Number of resamples: 50
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## RPART_TREE 0.8892074 0.9028988 0.9178273 0.9183860 0.9321598 0.9537219
## RANDOM_FOREST 0.9241416 0.9341439 0.9437298 0.9434049 0.9525766 0.9638306
## C50_BOOST 0.9351967 0.9498406 0.9526146 0.9532042 0.9575689 0.9692604
## NA's
## RPART_TREE 0
## RANDOM_FOREST 0
## C50_BOOST 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## RPART_TREE 0.8267544 0.8449197 0.8516996 0.8510720 0.8567708 0.8749314
## RANDOM_FOREST 0.8617663 0.8796601 0.8873355 0.8861766 0.8941741 0.9040044
## C50_BOOST 0.8360746 0.8543037 0.8626645 0.8696716 0.8803454 0.9248904
## NA's
## RPART_TREE 0
## RANDOM_FOREST 0
## C50_BOOST 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## RPART_TREE 0.8333333 0.8750000 0.8965517 0.8955908 0.9182266 0.9540230
## RANDOM_FOREST 0.7816092 0.8469294 0.8678161 0.8641353 0.8800000 0.9310345
## C50_BOOST 0.7586207 0.8692529 0.9022989 0.8893695 0.9141626 0.9425287
## NA's
## RPART_TREE 0
## RANDOM_FOREST 0
## C50_BOOST 0
bwplot(model_results)
boot.fn <- function(data, index) {
test_set <- data[index,]
rpart_resample_probs <- predict(rpart_fit, test_set, type = "prob")
rpart_resample_test_pred <- ifelse(rpart_resample_probs[,2] > 0.65, "CT", "CF")
cm_mat_resample <- confusionMatrix(factor(rpart_resample_test_pred), test_set$is_churn)
cm_mat_resample$byClass[2]
}
set.seed(1023)
results <- boot(data=df_join[-train_idx,], boot.fn, R = 100)
results
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = df_join[-train_idx, ], statistic = boot.fn, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 0.7633391 -0.002040165 0.01379297
boot.ci(results, type = "norm")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 100 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "norm")
##
## Intervals :
## Level Normal
## 95% ( 0.7383, 0.7924 )
## Calculations and Intervals on Original Scale
boot.fn <- function(data, index) {
test_set <- data[index,]
rpart_resample_probs <- predict(rpart_fit, test_set, type = "prob")
rpart_resample_test_pred <- ifelse(rpart_resample_probs[,2] > 0.65, "CT", "CF")
cm_mat_resample <- confusionMatrix(factor(rpart_resample_test_pred), test_set$is_churn)
cm_mat_resample$byClass[4]
}
set.seed(1023)
results <- boot(data=df_join[-train_idx,], boot.fn, R=100)
results
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = df_join[-train_idx, ], statistic = boot.fn, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 0.6719697 -0.001793563 0.01135488
boot.ci(results, type = "norm")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 100 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "norm")
##
## Intervals :
## Level Normal
## 95% ( 0.6515, 0.6960 )
## Calculations and Intervals on Original Scale
boot.fn <- function(data, index) {
test_set <- data[index,]
rpart_resample_probs <- predict(rpart_fit, test_set, type = "prob")
rpart_resample_test_pred <- ifelse(rpart_resample_probs[,2] > 0.65, "CT", "CF")
cm_mat_resample <- confusionMatrix(factor(rpart_resample_test_pred), test_set$is_churn)
rcm_boot <- retention_campaign_metrics(cm_mat_resample$table, customer_churn_percentage, 1)
rcm_boot[5]
}
set.seed(1023)
results <- boot(data=df_join[-train_idx,], boot.fn, R = 100)
results
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = df_join[-train_idx, ], statistic = boot.fn, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 2.1 -0.018 0.1431112
boot.ci(results, type = "norm")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 100 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "norm")
##
## Intervals :
## Level Normal
## 95% ( 1.838, 2.398 )
## Calculations and Intervals on Original Scale
boot.fn <- function(data, index) {
test_set <- data[index,]
rpart_resample_probs <- predict(rpart_fit, test_set, type = "prob")
rpart_resample_test_pred <- ifelse(rpart_resample_probs[,2] > 0.65, "CT", "CF")
cm_mat_resample <- confusionMatrix(factor(rpart_resample_test_pred), test_set$is_churn)
rcm_boot <- retention_campaign_metrics(cm_mat_resample$table, customer_churn_percentage, 1)
rcm_boot[6]
}
set.seed(1023)
results <- boot(data=df_join[-train_idx,], boot.fn, R = 100)
results
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = df_join[-train_idx, ], statistic = boot.fn, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 67.2 0.04 2.209529
boot.ci(results, type = "norm")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 100 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "norm")
##
## Intervals :
## Level Normal
## 95% (62.83, 71.49 )
## Calculations and Intervals on Original Scale
ensemble_train <- cbind(c50_train_pred, rpart_train_pred, rf_train_pred, c50_boost_train_pred)
weights <- c(1,1,2,2)
thresh <- 4
train_res <- apply(ensemble_train, 1, function(x) { sum(weights %*% as.numeric (x =="CT"))})
train_res_churn <- ifelse(train_res > thresh, "CT", "CF")
(cm_mat_train_ensemble <- confusionMatrix(factor(train_res_churn), train_set$is_churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction CF CT
## CF 18126 589
## CT 111 1154
##
## Accuracy : 0.965
## 95% CI : (0.9623, 0.9675)
## No Information Rate : 0.9128
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7489
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9939
## Specificity : 0.6621
## Pos Pred Value : 0.9685
## Neg Pred Value : 0.9123
## Prevalence : 0.9128
## Detection Rate : 0.9072
## Detection Prevalence : 0.9367
## Balanced Accuracy : 0.8280
##
## 'Positive' Class : CF
##
ensemble_test <- cbind(c50_test_pred, rpart_test_pred, rf_test_pred, c50_boost_test_pred)
test_res <- apply(ensemble_test, 1, function(x) { sum(weights %*% as.numeric (x =="CT"))})
test_res_churn <- ifelse(test_res > thresh, "CT", "CF")
(cm_mat_test_ensemble <- confusionMatrix(factor(test_res_churn), test_set$is_churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction CF CT
## CF 12066 378
## CT 92 784
##
## Accuracy : 0.9647
## 95% CI : (0.9614, 0.9678)
## No Information Rate : 0.9128
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7507
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9924
## Specificity : 0.6747
## Pos Pred Value : 0.9696
## Neg Pred Value : 0.8950
## Prevalence : 0.9128
## Detection Rate : 0.9059
## Detection Prevalence : 0.9342
## Balanced Accuracy : 0.8336
##
## 'Positive' Class : CF
##
(rcm_ensemble <- retention_campaign_metrics(cm_mat_test_ensemble$table, customer_churn_percentage, 1))
## target% efficiency% reach%
## 6.6 89.5 67.5
## effectiveness% min_reachable_churn% roi
## 60.4 2.8 89.5
cbind(rcm1_wb, rcm2_wb, rcm1_bb, rcm2_bb, rcm_ensemble)
## rcm1_wb rcm2_wb rcm1_bb rcm2_bb rcm_ensemble
## target% 5.9 9.9 9.7 8.2 6.6
## efficiency% 82.8 67.2 66.7 79.4 89.5
## reach% 56.3 76.3 74.0 74.5 67.5
## effectiveness% 46.6 51.3 49.4 59.2 60.4
## min_reachable_churn% 3.8 2.1 2.3 2.2 2.8
## roi 82.8 67.2 66.7 79.4 89.5
df_stack <- data.frame(c50_train_pred, rpart_train_pred, rf_train_pred, c50_boost_train_pred, train_set$is_churn)
colnames(df_stack) <- c("C50", "RPART", "RF", "C50BOOST", "is_churn")
trControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
verboseIter = FALSE)
stack_fit <- train(is_churn ~ .,
data = df_stack,
method = "glm",
trControl = trControl,
metric = 'Accuracy',
tuneLength = 10)
summary(stack_fit)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1366 -0.1866 -0.1866 -0.1866 2.8494
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.04222 0.05525 -73.166 < 2e-16 ***
## C50CT 0.38531 0.16359 2.355 0.0185 *
## RPARTCT 0.81355 0.13805 5.893 3.79e-09 ***
## RFCT 2.93218 0.09323 31.452 < 2e-16 ***
## C50BOOSTCT 2.08604 0.16462 12.672 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11832.1 on 19979 degrees of freedom
## Residual deviance: 5143.7 on 19975 degrees of freedom
## AIC: 5153.7
##
## Number of Fisher Scoring iterations: 6
stack_train_probs <- predict(stack_fit, df_stack, type = "prob")
stack_train_pred <- ifelse(stack_train_probs[,2] > 0.5, "CT", "CF")
(stack_boost_cm_mat <- confusionMatrix(factor(stack_train_pred), train_set$is_churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction CF CT
## CF 18115 541
## CT 122 1202
##
## Accuracy : 0.9668
## 95% CI : (0.9642, 0.9693)
## No Information Rate : 0.9128
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7662
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9933
## Specificity : 0.6896
## Pos Pred Value : 0.9710
## Neg Pred Value : 0.9079
## Prevalence : 0.9128
## Detection Rate : 0.9067
## Detection Prevalence : 0.9337
## Balanced Accuracy : 0.8415
##
## 'Positive' Class : CF
##
df_stack_test <- data.frame(c50_test_pred, rpart_test_pred, rf_test_pred, c50_boost_test_pred, test_set$is_churn)
colnames(df_stack_test) <- c("C50", "RPART", "RF", "C50BOOST", "is_churn")
stack_probs <- predict(stack_fit, df_stack_test, type = "prob")
stack_test_pred <- ifelse(stack_probs[,2] > 0.5, "CT", "CF")
(cm_mat_stack <- confusionMatrix(factor(stack_test_pred), test_set$is_churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction CF CT
## CF 12045 353
## CT 113 809
##
## Accuracy : 0.965
## 95% CI : (0.9618, 0.9681)
## No Information Rate : 0.9128
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7577
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9907
## Specificity : 0.6962
## Pos Pred Value : 0.9715
## Neg Pred Value : 0.8774
## Prevalence : 0.9128
## Detection Rate : 0.9043
## Detection Prevalence : 0.9308
## Balanced Accuracy : 0.8435
##
## 'Positive' Class : CF
##
(rcm_stacking <- retention_campaign_metrics(cm_mat_stack$table, customer_churn_percentage, 1))
## target% efficiency% reach%
## 6.9 87.7 69.6
## effectiveness% min_reachable_churn% roi
## 61.1 2.7 87.7
cbind(rcm1_wb, rcm2_wb, rcm1_bb, rcm2_bb, rcm_ensemble, rcm_stacking)
## rcm1_wb rcm2_wb rcm1_bb rcm2_bb rcm_ensemble
## target% 5.9 9.9 9.7 8.2 6.6
## efficiency% 82.8 67.2 66.7 79.4 89.5
## reach% 56.3 76.3 74.0 74.5 67.5
## effectiveness% 46.6 51.3 49.4 59.2 60.4
## min_reachable_churn% 3.8 2.1 2.3 2.2 2.8
## roi 82.8 67.2 66.7 79.4 89.5
## rcm_stacking
## target% 6.9
## efficiency% 87.7
## reach% 69.6
## effectiveness% 61.1
## min_reachable_churn% 2.7
## roi 87.7
ensemble <- sapply(rho, function(x) retention_campaign_metrics(cm_mat_test_ensemble$table, customer_churn_percentage, x)["roi"])
stack <- sapply(rho, function(x) retention_campaign_metrics(cm_mat_stack$table, customer_churn_percentage, x)["roi"])
rho_df <- rho_df %>% mutate(ensemble = ensemble) %>% mutate(stack = stack)
rho_plot + geom_line(aes(x = rho, y = ensemble, colour = "ensemble"), size = 1.2) + geom_line(aes(x = rho, y = stack, colour = "stack"), size = 1.2) + scale_colour_manual("model", values = c("red", "blue", "green", "violet", "yellow", "black"))
ensemble <- sapply(rho, function(x) retention_campaign_metrics(cm_mat_test_ensemble$table, customer_churn_percentage, x)["min_reachable_churn%"])
stack <- sapply(rho, function(x) retention_campaign_metrics(cm_mat_stack$table, customer_churn_percentage, x)["min_reachable_churn%"])
mrc_df <- mrc_df %>% mutate(ensemble = ensemble) %>% mutate(stack = stack)
mrc_plot + geom_line(aes(x = rho, y = ensemble, colour = "ensemble"), size = 1.2) + geom_line(aes(x = rho, y = stack, colour = "stack"), size = 1.2) + scale_colour_manual("model", values = c("red", "blue", "green", "violet", "yellow", "black"))