Estudo sobre o treinamento de classificadores

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from time import time
%autosave 300
%matplotlib notebook
Autosaving every 300 seconds
In [108]:
# Supress unnecessary warnings so that presentation looks clean
import warnings
warnings.filterwarnings('ignore')
In [125]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

def test_model(clf, X_test, y_test):
    print('\nClassification Report:')
    test_time = time()
    y_pred = clf.predict(X_test)
    print('Prediciton time: {:.3f}s'.format(time() - test_time))
    print('\n', classification_report(y_test, y_pred))
    
    print('\nAccuracy: {:.3f}'.format(accuracy_score(y_test, y_pred)))
    
    print('\nConfusion Matrix:')
    print('\n', confusion_matrix(y_test, y_pred),'\n\n')

def train_model(model, parameters, scores, X_train, X_test, y_train, y_test, cv=5, name='model'):
    clfs = []
    for score in scores:
        print('Training {} for {}'.format(name, score))
        train_time = time()
        clf = GridSearchCV(model, parameters, cv=cv, scoring=score)
        clf.fit(X_train, y_train)
        print('Finished Traininig in {:.3f}s'.format(time() - train_time))

        print('Best parameters found:')
        print(clf.best_params_)
        test_model(clf, X_test, y_test)
        clfs.append((score, clf.best_params_, clf))
    
    return clfs

Árvore de decisão

Foi testado o algoritmo de árvore de decisão para todo o dataset e ao final conseguiu um resultado de 93% de acurácia média.

In [6]:
data = pd.read_csv('covtype.data', names=label, index_col=None)
In [9]:
from sklearn.model_selection import train_test_split
x_train, x_test , y_train, y_test = train_test_split(data.iloc[:,:-1], data.iloc[:,-1], test_size=0.3,random_state=40)

model = DecisionTreeClassifier()

model.fit(x_train,y_train)

prediction = model.predict(data.iloc[:,:-1])

print(model.score(x_test,y_test))
0.934218377088
In [12]:
from sklearn.metrics import confusion_matrix
confu = confusion_matrix(data.iloc[:,-1], prediction)
print(confu)
[[207543   3916      8      0     64     10    299]
 [  3828 278631    245      2    394    165     36]
 [     7    201  34968    103     41    434      0]
 [     0      0    106   2607      0     34      0]
 [    52    402     29      0   8996     13      1]
 [    20    211    430     45     10  16651      0]
 [   312     48      0      0      0      0  20150]]

Extra-Trees

Extra-Trees é um algoritmo de floresta de árvores de decisão, que significa árvores extremamente randomizadas (extremely randomized trees). Foi proposta por Geurts Pierre, et al, em 2006 para classificações e regressões supervisionadas.

Esse método consiste em randomizar fortemente tanto atrivutos e escolhas de pontos de poda para nós da árvore. Onde o a decisão de um ponto ótimo de poda é responsável, em grande parte, pela variância na árvore inferida. Portanto, esse método em vez de utilizar cópias da amostra de aprendizado tenta achar o ponto ótimo de poda para cada uma das K caracteristicas escolhidas randomicamente em cada nó, onde seleciona um ponto de poda aleatório. Dessa forma, esse método aumenta a acurácia ao custo de diminuir um pouco a precisão do modelo.

Esse método randômico é bom para contextos onde problemas são categorizados por um grande número de caracteristicas númericas continuas. Onde o principal objetivo do algoritmo é a eficiência computacional, além da acurácia, de forma a otimizar o aprendizado em grandes volumes de dados.

Teste para o dataset de 40-60

Classes 1 e 2 representam 40% do dataset e 60% para as classes restantes (3-7).

In [2]:
dataset = pd.read_csv('143k_std.csv')
In [33]:
from sklearn.model_selection import train_test_split

target_col = dataset.shape[1]-1

X = dataset.iloc[:, :target_col]
y = list(map(int, dataset.iloc[:, target_col].values))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
In [107]:
from sklearn.ensemble import ExtraTreesClassifier

parameters = [
    {'n_estimators': [target_col, 100], 'max_features':[0.75,0.8,0.85,0.9], 'n_jobs':[-1]}
]

scores = ['f1_macro', 'f1_micro','accuracy']

et_clfs = train_model(ExtraTreesClassifier(),
                   parameters,
                   scores,
                   X_train,
                   X_test,
                   y_train,
                   y_test,
                   name='ExtraTrees'
                  )
Training ExtraTrees for f1_macro
Finished Traininig in 439.453s
Best parameters found:
{'max_features': 0.9, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 0.857s

              precision    recall  f1-score   support

          1       0.95      0.93      0.94      8357
          2       0.96      0.96      0.96     15132
          3       0.98      0.98      0.98     11434
          4       0.98      0.99      0.99      3343
          5       0.98      0.99      0.98      6393
          6       0.97      0.97      0.97      6864
          7       0.98      0.99      0.99      5724

avg / total       0.97      0.97      0.97     57247


Confusion Matrix:

 [[ 7781   474     2     0    19     4    77]
 [  336 14559    63     1   111    44    18]
 [    1    11 11245    52     8   117     0]
 [    0     0    24  3313     0     6     0]
 [    0    41     8     0  6334    10     0]
 [    0    20   133    15     4  6692     0]
 [   30     3     0     0     0     0  5691]] 


Training ExtraTrees for f1_micro
Finished Traininig in 407.805s
Best parameters found:
{'max_features': 0.75, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 0.851s

              precision    recall  f1-score   support

          1       0.96      0.93      0.94      8357
          2       0.96      0.96      0.96     15132
          3       0.98      0.98      0.98     11434
          4       0.98      0.99      0.99      3343
          5       0.98      0.99      0.98      6393
          6       0.97      0.97      0.97      6864
          7       0.98      0.99      0.99      5724

avg / total       0.97      0.97      0.97     57247


Confusion Matrix:

 [[ 7771   483     2     0    19     3    79]
 [  329 14566    66     2   109    45    15]
 [    0    10 11257    44     8   115     0]
 [    0     0    21  3317     0     5     0]
 [    0    36     8     0  6340     9     0]
 [    0    18   132    17     5  6692     0]
 [   29     3     0     0     0     0  5692]] 


Training ExtraTrees for accuracy
Finished Traininig in 439.705s
Best parameters found:
{'max_features': 0.85, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 0.857s

              precision    recall  f1-score   support

          1       0.96      0.93      0.94      8357
          2       0.96      0.96      0.96     15132
          3       0.98      0.98      0.98     11434
          4       0.98      0.99      0.99      3343
          5       0.98      0.99      0.98      6393
          6       0.97      0.97      0.97      6864
          7       0.98      0.99      0.99      5724

avg / total       0.97      0.97      0.97     57247


Confusion Matrix:

 [[ 7764   485     2     0    21     3    82]
 [  337 14562    62     1   107    47    16]
 [    0     9 11258    48     7   112     0]
 [    0     0    21  3317     0     5     0]
 [    0    38     9     0  6337     9     0]
 [    0    21   134    15     5  6689     0]
 [   28     2     0     0     0     0  5694]] 


Resultados Parciais

Visto o resultado de 97%, podemos ter certo ceticismo, já que na literatura podemos verificar apenas o resultado de 71%.

Sendo assim, verificamos a quantidade e a distribuição das classes no conjunto de dados para avaliação. Visto que, dependendo da distribuição das classes podemos experenciar diferentes resultados.

Pode-se notar que o conjunto de dados para validação está distribuido de forma satisfatória, porém por desencargo de consciência podemos testar o modelo para as amostras retiradas na fase de modelagem.

In [102]:
def plot_multbar(bars, title='', xlabel='', ylabel=''):
    fig, ax = plt.subplots()
    ind = np.arange(1,len(bars)+1)
    plt.bar(ind, bars)
    ax.set_xticks(ind)
    ax.set_title('{}'.format(title))
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    plt.show()
In [103]:
target_sums = [y_test.count(i) for i in range(1, 8)]
print(target_sums)
plot_multbar(target_sums, 'Distribuição de Classes - Avaliação', 'Classes', 'Qnt')
[8357, 15132, 11434, 3343, 6393, 6864, 5724]

Testando para classes 1 e 2

Esse teste é realizado tendo em mente que as duas classes são as que mais se sobrepõem dentro do domínio.

In [78]:
from sklearn.preprocessing import StandardScaler
from sklearn.externals import joblib

ssc = joblib.load('stdscaler.pkl')

def pre_process(dataset):
    try:
        dataset = dataset.drop(['Soil7','Soil8','Soil15'], axis=1)
    except Exception as e:
        print(e)
    
    y = dataset.loc[:, 'Target'].values
    X = dataset.iloc[:, :-1].values

    X = ssc.transform(X[:, 0:10])
    X = np.concatenate((X, dataset.iloc[:, 10:-1].values), axis=1)
    
    return X, y
    
In [298]:
dataset_val = pd.read_csv('143k_val.csv')
X_val, y_val = pre_process(dataset_val)
In [299]:
X_val = np.concatenate((X_val, X_test), axis=0)
y_val = np.concatenate((y_val, y_test), axis=0)
In [128]:
for s, p, clf in et_clfs:
    print('\nTesting ET - {}'.format(p))
    test_model(clf, X_val, y_val)
Testing ET - {'max_features': 0.9, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 6.656s

              precision    recall  f1-score   support

          1       0.89      0.84      0.86    194436
          2       0.90      0.86      0.88    266948
          3       0.67      0.98      0.80     11434
          4       0.98      0.99      0.98      3343
          5       0.49      0.99      0.65      6393
          6       0.59      0.97      0.74      6864
          7       0.47      0.99      0.64      5724

avg / total       0.88      0.86      0.87    495142


Accuracy: 0.863

Confusion Matrix:

 [[163218  24419     98      0   1081    283   5337]
 [ 19880 230863   5255      6   5609   4193   1142]
 [     1     11  11245     52      8    117      0]
 [     0      0     24   3313      0      6      0]
 [     0     41      8      0   6334     10      0]
 [     0     20    133     15      4   6692      0]
 [    30      3      0      0      0      0   5691]] 



Testing ET - {'max_features': 0.75, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 7.421s

              precision    recall  f1-score   support

          1       0.89      0.84      0.86    194436
          2       0.90      0.86      0.88    266948
          3       0.67      0.98      0.80     11434
          4       0.98      0.99      0.99      3343
          5       0.48      0.99      0.65      6393
          6       0.59      0.97      0.73      6864
          7       0.46      0.99      0.63      5724

avg / total       0.88      0.86      0.87    495142


Accuracy: 0.863

Confusion Matrix:

 [[163051  24386     98      0   1080    278   5543]
 [ 19733 230858   5327     13   5673   4272   1072]
 [     0     10  11257     44      8    115      0]
 [     0      0     21   3317      0      5      0]
 [     0     36      8      0   6340      9      0]
 [     0     18    132     17      5   6692      0]
 [    29      3      0      0      0      0   5692]] 



Testing ET - {'max_features': 0.85, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 7.532s

              precision    recall  f1-score   support

          1       0.89      0.84      0.86    194436
          2       0.90      0.86      0.88    266948
          3       0.66      0.98      0.79     11434
          4       0.98      0.99      0.99      3343
          5       0.48      0.99      0.65      6393
          6       0.60      0.97      0.74      6864
          7       0.46      0.99      0.63      5724

avg / total       0.88      0.86      0.87    495142


Accuracy: 0.862

Confusion Matrix:

 [[162996  24468    122      0   1106    298   5446]
 [ 19900 230669   5420      6   5658   4115   1180]
 [     0      9  11258     48      7    112      0]
 [     0      0     21   3317      0      5      0]
 [     0     38      9      0   6337      9      0]
 [     0     21    134     15      5   6689      0]
 [    28      2      0      0      0      0   5694]] 


In [335]:
target_sums = [list(y_val).count(i) for i in range(1, 8)]
print(target_sums)
plot_multbar(target_sums, 'Distribuição de Classes - Avaliação', 'Classes', 'Qnt')
[343749, 468841, 14363, 1128, 3805, 6931, 8349]

Resultados

Com os resultados da matriz de confusão e os resultados das métricas, podemos notar que as classes 1 e 2 estão sobrepostas, sendo difícil classificá-las. Porém, mesmo com essa dificuldade conseguiu-se um resultado de 86% de acurácia ao final, uma melhora de 15% do artigo original.

50-50 Dataset

Esse dataset tenta melhorar a acurácia para as classes 1 e 2, aumentando o número de amostras dessas classes para treinamento, totalizando 50% paras as duas primeiras classes e 50% para as demais.

In [129]:
dataset_50 = pd.read_csv('50-50_std.csv')
In [132]:
from sklearn.model_selection import train_test_split

target_col = dataset_50.shape[1]-1

X = dataset_50.iloc[:, :target_col]
y = list(map(int, dataset_50.iloc[:, target_col].values))

X_train_50, X_test_50, y_train_50, y_test_50 = train_test_split(X, y, test_size=0.4)
In [133]:
from sklearn.ensemble import ExtraTreesClassifier

parameters = [
    {'n_estimators': [target_col, 100], 'max_features':[0.75,0.8,0.85,0.9], 'n_jobs':[-1]}
]

scores = ['f1_macro', 'f1_micro','accuracy']

et_clfs_50 = train_model(ExtraTreesClassifier(),
                   parameters,
                   scores,
                   X_train_50,
                   X_test_50,
                   y_train_50,
                   y_test_50,
                   name='ExtraTrees'
                  )
Training ExtraTrees for f1_macro
Finished Traininig in 714.502s
Best parameters found:
{'max_features': 0.9, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 1.688s

              precision    recall  f1-score   support

          1       0.94      0.92      0.93     15596
          2       0.93      0.93      0.93     18639
          3       0.96      0.98      0.97     14517
          4       0.90      0.87      0.89      1110
          5       0.94      0.94      0.94      3772
          6       0.95      0.95      0.95      6882
          7       0.98      0.99      0.98      8181

avg / total       0.95      0.95      0.95     68697


Accuracy: 0.946

Confusion Matrix:

 [[14303  1068     1     0    44     4   176]
 [  840 17384   126     0   185    90    14]
 [    0    27 14174    81    12   223     0]
 [    0     0   104   971     0    35     0]
 [   12   132    49     0  3558    20     1]
 [    3    17   289    26     2  6545     0]
 [   83    16     0     0     2     0  8080]] 


Training ExtraTrees for f1_micro
Finished Traininig in 749.117s
Best parameters found:
{'max_features': 0.9, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 1.397s

              precision    recall  f1-score   support

          1       0.94      0.92      0.93     15596
          2       0.93      0.93      0.93     18639
          3       0.96      0.98      0.97     14517
          4       0.91      0.87      0.89      1110
          5       0.93      0.94      0.94      3772
          6       0.94      0.95      0.95      6882
          7       0.98      0.99      0.98      8181

avg / total       0.95      0.95      0.95     68697


Accuracy: 0.947

Confusion Matrix:

 [[14343  1040     1     0    46     5   161]
 [  805 17396   126     0   197    99    16]
 [    0    31 14177    75    14   220     0]
 [    0     0   107   969     0    34     0]
 [   10   131    51     0  3557    22     1]
 [    3    22   305    24     3  6525     0]
 [   92    12     0     0     2     0  8075]] 


Training ExtraTrees for accuracy
Finished Traininig in 721.092s
Best parameters found:
{'max_features': 0.85, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 1.365s

              precision    recall  f1-score   support

          1       0.94      0.92      0.93     15596
          2       0.93      0.93      0.93     18639
          3       0.96      0.98      0.97     14517
          4       0.91      0.87      0.89      1110
          5       0.94      0.95      0.94      3772
          6       0.95      0.95      0.95      6882
          7       0.98      0.99      0.98      8181

avg / total       0.95      0.95      0.95     68697


Accuracy: 0.947

Confusion Matrix:

 [[14305  1076     2     0    41     3   169]
 [  825 17393   124     0   189    93    15]
 [    0    30 14181    72    14   220     0]
 [    0     0   106   966     0    38     0]
 [   11   125    45     0  3569    21     1]
 [    3    18   288    25     2  6546     0]
 [   93    15     0     0     2     0  8071]] 


In [134]:
dataset_val_50 = pd.read_csv('50-50_val.csv')
X_val_50, y_val_50 = pre_process(dataset_val)
X_val_50 = np.concatenate((X_val_50, X_test_50), axis=0)
y_val_50 = np.concatenate((y_val_50, y_test_50), axis=0)
for s, p, clf in et_clfs_50:
    print('\nTesting ET 50-50 - {}'.format(p))
    test_model(clf, X_val_50, y_val_50)
Testing ET 50-50 - {'max_features': 0.9, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 9.491s

              precision    recall  f1-score   support

          1       0.94      0.92      0.93    201675
          2       0.95      0.94      0.94    270455
          3       0.87      0.98      0.92     14517
          4       0.90      0.87      0.89      1110
          5       0.57      0.94      0.71      3772
          6       0.79      0.95      0.86      6882
          7       0.78      0.99      0.87      8181

avg / total       0.94      0.93      0.93    506592


Accuracy: 0.934

Confusion Matrix:

 [[185850  13424     30      0    314     92   1965]
 [ 10867 253940   1634      0   2301   1409    304]
 [     0     27  14174     81     12    223      0]
 [     0      0    104    971      0     35      0]
 [    12    132     49      0   3558     20      1]
 [     3     17    289     26      2   6545      0]
 [    83     16      0      0      2      0   8080]] 



Testing ET 50-50 - {'max_features': 0.9, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 7.946s

              precision    recall  f1-score   support

          1       0.94      0.92      0.93    201675
          2       0.95      0.94      0.94    270455
          3       0.87      0.98      0.92     14517
          4       0.91      0.87      0.89      1110
          5       0.57      0.94      0.71      3772
          6       0.79      0.95      0.86      6882
          7       0.78      0.99      0.87      8181

avg / total       0.94      0.93      0.93    506592


Accuracy: 0.934

Confusion Matrix:

 [[185924  13291     28      0    336     92   2004]
 [ 10799 253941   1635      1   2373   1400    306]
 [     0     31  14177     75     14    220      0]
 [     0      0    107    969      0     34      0]
 [    10    131     51      0   3557     22      1]
 [     3     22    305     24      3   6525      0]
 [    92     12      0      0      2      0   8075]] 



Testing ET 50-50 - {'max_features': 0.85, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 9.241s

              precision    recall  f1-score   support

          1       0.94      0.92      0.93    201675
          2       0.95      0.94      0.94    270455
          3       0.87      0.98      0.92     14517
          4       0.91      0.87      0.89      1110
          5       0.57      0.95      0.71      3772
          6       0.79      0.95      0.86      6882
          7       0.78      0.99      0.87      8181

avg / total       0.94      0.93      0.93    506592


Accuracy: 0.934

Confusion Matrix:

 [[185748  13493     35      0    327     81   1991]
 [ 10831 253997   1632      0   2322   1359    314]
 [     0     30  14181     72     14    220      0]
 [     0      0    106    966      0     38      0]
 [    11    125     45      0   3569     21      1]
 [     3     18    288     25      2   6546      0]
 [    93     15      0      0      2      0   8071]] 


Resultados Parciais

Podemos notar com o tal teste, que um número maior de amostras nas classes 1 e 2 melhorou o dicernimento do modelo quanto essas classes. Porém, ainda podemos utilizar mais amostras para o treinamento e talvez tal treinamento possa melhorar ainda mais os resultados.


Testando para todos os datasets

Foi criado um modelo para cada dataset, onde cada um foi manipulado de forma que as duas primeiras classes tivessem mais ou menos amostras. Para a validação de cada modelo foi utilizada o restante das amostras não treinadas.

Dataset % Classes 1 e 2 % Demais Classes
143k 40% 60%
50-50 50% 50%
60-40 60% 40%
70-30 70% 30%
All 85% 15%
In [142]:
datasets_names = ['143k', '50-50', '60-40', '70-30', 'All']

datasets_train = []
datasets_test = []
for name in datasets_names:
    datasets_train.append(pd.read_csv('{}_std.csv'.format(name)))
    datasets_test.append(pd.read_csv('{}_val.csv'.format(name)))
In [143]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier

parameters = [
    {'n_estimators': [target_col, 100], 'max_features':[0.75,0.8,0.85,0.9], 'n_jobs':[-1]}
]

scores = ['f1_macro', 'f1_micro','accuracy']



for d_train, d_test, name in zip(datasets_train, datasets_test, datasets_names):
    print('\nTrain for dataset: {}'.format(name))
    
    target_col = d_train.shape[1]-1

    X = d_train.iloc[:, :target_col]
    y = list(map(int, d_train.iloc[:, target_col].values))

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
    
    X_val, y_val = pre_process(d_test)
    X_test = np.concatenate((X_val, X_test), axis=0)
    y_test = np.concatenate((y_val, y_test), axis=0)
    
    clfs = train_model(ExtraTreesClassifier(),
                   parameters,
                   scores,
                   X_train,
                   X_test,
                   y_train,
                   y_test,
                   name='ExtraTrees - {}'.format(name)
                  )
Train for dataset: 143k
Training ExtraTrees - 143k for f1_macro
Finished Traininig in 391.480s
Best parameters found:
{'max_features': 0.85, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 5.853s

              precision    recall  f1-score   support

          1       0.89      0.84      0.86    194482
          2       0.90      0.86      0.88    266862
          3       0.66      0.98      0.79     11303
          4       0.98      0.99      0.98      3341
          5       0.49      0.99      0.65      6464
          6       0.59      0.98      0.74      6973
          7       0.46      1.00      0.63      5717

avg / total       0.88      0.86      0.86    495142


Accuracy: 0.861

Confusion Matrix:

 [[163095  24243    115      0   1109    318   5602]
 [ 20988 229698   5398      9   5551   4270    948]
 [     3      6  11125     53     10    106      0]
 [     0      0     25   3296      0     20      0]
 [     2     44     18      0   6393      7      0]
 [     0      4    139     12      6   6812      0]
 [    25      2      0      0      0      0   5690]] 


Training ExtraTrees - 143k for f1_micro
Finished Traininig in 431.449s
Best parameters found:
{'max_features': 0.9, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 6.241s

              precision    recall  f1-score   support

          1       0.88      0.84      0.86    194482
          2       0.90      0.86      0.88    266862
          3       0.66      0.98      0.79     11303
          4       0.98      0.99      0.98      3341
          5       0.49      0.99      0.65      6464
          6       0.59      0.98      0.74      6973
          7       0.46      1.00      0.63      5717

avg / total       0.88      0.86      0.86    495142


Accuracy: 0.860

Confusion Matrix:

 [[163123  24163    124      0   1065    316   5691]
 [ 21309 229143   5510     10   5662   4269    959]
 [     3      7  11107     59     12    115      0]
 [     0      0     27   3296      0     18      0]
 [     2     43     19      0   6396      4      0]
 [     0      3    126     13      6   6825      0]
 [    24      1      0      0      0      0   5692]] 


Training ExtraTrees - 143k for accuracy
Finished Traininig in 394.433s
Best parameters found:
{'max_features': 0.9, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 6.169s

              precision    recall  f1-score   support

          1       0.89      0.84      0.86    194482
          2       0.90      0.86      0.88    266862
          3       0.66      0.98      0.79     11303
          4       0.98      0.99      0.99      3341
          5       0.48      0.99      0.65      6464
          6       0.59      0.98      0.74      6973
          7       0.46      1.00      0.63      5717

avg / total       0.88      0.86      0.86    495142


Accuracy: 0.860

Confusion Matrix:

 [[162984  24272    111      0   1102    302   5711]
 [ 21146 229366   5460      3   5685   4261    941]
 [     3      6  11127     49     10    108      0]
 [     0      0     20   3305      0     16      0]
 [     3     41     19      0   6394      7      0]
 [     0      2    124     12      5   6830      0]
 [    22      1      0      0      0      0   5694]] 



Train for dataset: 50-50
Training ExtraTrees - 50-50 for f1_macro
Finished Traininig in 593.642s
Best parameters found:
{'max_features': 0.9, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 6.715s

              precision    recall  f1-score   support

          1       0.94      0.91      0.92    188660
          2       0.94      0.93      0.94    254991
          3       0.86      0.98      0.92     14228
          4       0.92      0.89      0.90      1085
          5       0.58      0.95      0.72      3824
          6       0.79      0.95      0.86      7023
          7       0.77      0.99      0.87      8157

avg / total       0.93      0.93      0.93    477968


Accuracy: 0.927

Confusion Matrix:

 [[171936  14237     34      0    340     85   2028]
 [ 11293 237883   1750      0   2273   1447    345]
 [     0     20  13893     62     19    234      0]
 [     0      0     93    961      0     31      0]
 [    15    140     30      0   3622     17      0]
 [     1     42    305     25      4   6646      0]
 [    94      6      0      0      3      0   8054]] 


Training ExtraTrees - 50-50 for f1_micro
Finished Traininig in 595.562s
Best parameters found:
{'max_features': 0.9, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 6.682s

              precision    recall  f1-score   support

          1       0.94      0.91      0.92    188660
          2       0.94      0.93      0.94    254991
          3       0.86      0.98      0.92     14228
          4       0.92      0.88      0.90      1085
          5       0.58      0.95      0.72      3824
          6       0.78      0.95      0.86      7023
          7       0.78      0.99      0.87      8157

avg / total       0.93      0.93      0.93    477968


Accuracy: 0.927

Confusion Matrix:

 [[171937  14293     35      0    325     84   1986]
 [ 11395 237777   1720      0   2303   1462    334]
 [     0     23  13882     62     21    240      0]
 [     0      0     97    957      0     31      0]
 [     8    145     31      0   3621     19      0]
 [     1     40    300     21      4   6657      0]
 [    82      5      0      0      3      0   8067]] 


Training ExtraTrees - 50-50 for accuracy
Finished Traininig in 585.436s
Best parameters found:
{'max_features': 0.85, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 6.394s

              precision    recall  f1-score   support

          1       0.94      0.91      0.92    188660
          2       0.94      0.93      0.94    254991
          3       0.86      0.98      0.92     14228
          4       0.92      0.89      0.90      1085
          5       0.58      0.95      0.72      3824
          6       0.79      0.95      0.86      7023
          7       0.77      0.99      0.87      8157

avg / total       0.93      0.93      0.93    477968


Accuracy: 0.926

Confusion Matrix:

 [[171829  14374     33      0    323     84   2017]
 [ 11373 237823   1758      0   2292   1416    329]
 [     0     23  13876     66     24    239      0]
 [     0      0     93    961      0     31      0]
 [     9    147     32      0   3618     18      0]
 [     2     38    307     23      3   6650      0]
 [    86      8      0      0      3      0   8060]] 



Train for dataset: 60-40
Training ExtraTrees - 60-40 for f1_macro
Finished Traininig in 808.672s
Best parameters found:
{'max_features': 0.9, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 7.012s

              precision    recall  f1-score   support

          1       0.95      0.93      0.94    177095
          2       0.95      0.95      0.95    240826
          3       0.89      0.98      0.93     14306
          4       0.91      0.89      0.90      1049
          5       0.65      0.94      0.77      3731
          6       0.84      0.95      0.89      6907
          7       0.83      0.99      0.90      8293

avg / total       0.94      0.94      0.94    452207


Accuracy: 0.942

Confusion Matrix:

 [[164708  10691     13      0    237     67   1379]
 [  8503 228328   1261      0   1589    891    254]
 [     0     56  13961     63     18    208      0]
 [     0      0     79    931      0     39      0]
 [    11    158     41      0   3507     13      1]
 [     5     37    270     30      7   6558      0]
 [    86     11      0      0      2      0   8194]] 


Training ExtraTrees - 60-40 for f1_micro
Finished Traininig in 819.215s
Best parameters found:
{'max_features': 0.85, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 6.741s

              precision    recall  f1-score   support

          1       0.95      0.93      0.94    177095
          2       0.95      0.95      0.95    240826
          3       0.89      0.98      0.93     14306
          4       0.91      0.89      0.90      1049
          5       0.66      0.94      0.77      3731
          6       0.84      0.95      0.89      6907
          7       0.83      0.99      0.90      8293

avg / total       0.94      0.94      0.94    452207


Accuracy: 0.942

Confusion Matrix:

 [[164633  10743     11      0    234     69   1405]
 [  8490 228307   1278      0   1584    913    254]
 [     2     51  13966     61     18    208      0]
 [     0      0     84    931      0     34      0]
 [    14    152     39      0   3512     13      1]
 [     4     32    258     27      8   6578      0]
 [    94     15      0      0      2      0   8182]] 


Training ExtraTrees - 60-40 for accuracy
Finished Traininig in 818.508s
Best parameters found:
{'max_features': 0.9, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 6.670s

              precision    recall  f1-score   support

          1       0.95      0.93      0.94    177095
          2       0.95      0.95      0.95    240826
          3       0.89      0.98      0.93     14306
          4       0.91      0.89      0.90      1049
          5       0.65      0.94      0.77      3731
          6       0.84      0.95      0.89      6907
          7       0.83      0.99      0.90      8293

avg / total       0.94      0.94      0.94    452207


Accuracy: 0.943

Confusion Matrix:

 [[164794  10592     12      0    234     67   1396]
 [  8547 228269   1267      0   1588    897    258]
 [     0     54  13964     59     17    212      0]
 [     0      0     83    929      0     37      0]
 [    11    164     37      0   3505     13      1]
 [     5     31    271     30      7   6563      0]
 [    87     12      0      0      2      0   8192]] 



Train for dataset: 70-30
Training ExtraTrees - 70-30 for f1_macro
Finished Traininig in 1209.638s
Best parameters found:
{'max_features': 0.85, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 7.257s

              precision    recall  f1-score   support

          1       0.96      0.95      0.95    157556
          2       0.96      0.96      0.96    217151
          3       0.93      0.97      0.95     14367
          4       0.91      0.88      0.89      1096
          5       0.74      0.93      0.82      3768
          6       0.90      0.94      0.92      7065
          7       0.90      0.98      0.94      8268

avg / total       0.96      0.95      0.95    409271


Accuracy: 0.954

Confusion Matrix:

 [[148996   7550      8      0    156     31    815]
 [  6478 208351    696      0   1039    465    122]
 [     0     69  13997     62     18    221      0]
 [     0      0     97    964      0     35      0]
 [    20    209     41      0   3491      7      0]
 [     2     73    277     34      7   6672      0]
 [   112     26      0      0      2      0   8128]] 


Training ExtraTrees - 70-30 for f1_micro
Finished Traininig in 1185.479s
Best parameters found:
{'max_features': 0.9, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 6.561s

              precision    recall  f1-score   support

          1       0.96      0.95      0.95    157556
          2       0.96      0.96      0.96    217151
          3       0.93      0.97      0.95     14367
          4       0.91      0.89      0.90      1096
          5       0.74      0.93      0.82      3768
          6       0.90      0.94      0.92      7065
          7       0.90      0.98      0.94      8268

avg / total       0.96      0.95      0.95    409271


Accuracy: 0.955

Confusion Matrix:

 [[148990   7539      8      0    165     30    824]
 [  6408 208439    692      0   1041    449    122]
 [     0     66  14001     62     20    218      0]
 [     0      0     90    972      0     34      0]
 [    21    205     39      0   3497      6      0]
 [     1     79    277     32      7   6669      0]
 [   115     23      0      0      2      0   8128]] 


Training ExtraTrees - 70-30 for accuracy
Finished Traininig in 1183.985s
Best parameters found:
{'max_features': 0.9, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 6.600s

              precision    recall  f1-score   support

          1       0.96      0.95      0.95    157556
          2       0.96      0.96      0.96    217151
          3       0.93      0.97      0.95     14367
          4       0.91      0.89      0.90      1096
          5       0.74      0.93      0.82      3768
          6       0.90      0.95      0.92      7065
          7       0.90      0.98      0.94      8268

avg / total       0.96      0.95      0.95    409271


Accuracy: 0.954

Confusion Matrix:

 [[148954   7578      9      0    161     34    820]
 [  6457 208352    712      0   1051    459    120]
 [     0     72  13979     64     19    233      0]
 [     0      0     91    972      0     33      0]
 [    21    212     40      0   3487      8      0]
 [     2     75    265     31      9   6683      0]
 [   108     24      0      0      1      0   8135]] 



Train for dataset: All
Training ExtraTrees - All for f1_macro
Finished Traininig in 1171.862s
Best parameters found:
{'max_features': 0.9, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 2.206s

              precision    recall  f1-score   support

          1       0.96      0.95      0.95     36395
          2       0.95      0.96      0.96     43924
          3       0.96      0.98      0.97     14173
          4       0.92      0.90      0.91      1115
          5       0.94      0.92      0.93      3770
          6       0.95      0.94      0.94      6880
          7       0.97      0.98      0.98      8238

avg / total       0.96      0.96      0.96    114495


Accuracy: 0.956

Confusion Matrix:

 [[34449  1720     5     0    25     3   193]
 [ 1292 42153   141     0   194   117    27]
 [    0    75 13826    50    15   207     0]
 [    0     0    83  1001     0    31     0]
 [   21   215    50     0  3468    16     0]
 [    0    62   285    33     5  6495     0]
 [  145    30     0     0     2     0  8061]] 


Training ExtraTrees - All for f1_micro
Finished Traininig in 1189.586s
Best parameters found:
{'max_features': 0.9, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 2.115s

              precision    recall  f1-score   support

          1       0.96      0.95      0.95     36395
          2       0.95      0.96      0.96     43924
          3       0.96      0.98      0.97     14173
          4       0.92      0.88      0.90      1115
          5       0.94      0.92      0.93      3770
          6       0.94      0.94      0.94      6880
          7       0.97      0.98      0.98      8238

avg / total       0.96      0.96      0.96    114495


Accuracy: 0.956

Confusion Matrix:

 [[34424  1745     5     0    25     4   192]
 [ 1308 42147   133     0   193   119    24]
 [    0    80 13826    47    12   208     0]
 [    0     0    92   985     0    38     0]
 [   26   219    46     0  3465    14     0]
 [    1    63   280    34     7  6495     0]
 [  141    31     0     0     2     0  8064]] 


Training ExtraTrees - All for accuracy
Finished Traininig in 1180.970s
Best parameters found:
{'max_features': 0.9, 'n_estimators': 100, 'n_jobs': -1}

Classification Report:
Prediciton time: 2.036s

              precision    recall  f1-score   support

          1       0.96      0.95      0.95     36395
          2       0.95      0.96      0.96     43924
          3       0.96      0.98      0.97     14173
          4       0.92      0.89      0.91      1115
          5       0.93      0.92      0.93      3770
          6       0.95      0.94      0.95      6880
          7       0.97      0.98      0.98      8238

avg / total       0.96      0.96      0.96    114495


Accuracy: 0.956

Confusion Matrix:

 [[34457  1715     5     0    26     3   189]
 [ 1298 42156   137     0   194   113    26]
 [    0    75 13821    49    15   213     0]
 [    0     0    92   993     0    30     0]
 [   24   215    46     0  3471    14     0]
 [    1    59   279    33     9  6499     0]
 [  142    29     0     0     2     0  8065]] 


Resultados Parciais

Após executar o treinamento para todos os datasets gerados, encontramos o melhor resultado, de 95.6% de acurácia, para o dataset completo. Porém, para o dataset onde as duas primeiras classes representam 70% do total, foi encontrado um resultado de 95.5%. Visto que o último é avaliado com um conjunto de avaliação significantemente maior, é mais seguro o utilizarmos como modelo final.

Portanto, utilizamos desse modelo para realizar ajustes finos e tentar melhorar o resultado final.

Ajuste fino do modelo 70-30 para n_estimators maior que 100

In [146]:
parameters = [
    {'n_estimators': [100, 150, 200], 'max_features':[0.85,0.9,0.95], 'n_jobs':[-1]}
]

scores = ['f1_macro', 'f1_micro','accuracy']
d_train = datasets_train[3] # 70-30
d_test = datasets_test[3]
name = datasets_names[3]

print('\nTrain for dataset: {}'.format(name))

target_col = d_train.shape[1]-1

X = d_train.iloc[:, :target_col]
y = list(map(int, d_train.iloc[:, target_col].values))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

X_val, y_val = pre_process(d_test)
X_test = np.concatenate((X_val, X_test), axis=0)
y_test = np.concatenate((y_val, y_test), axis=0)

clfs = train_model(ExtraTreesClassifier(),
               parameters,
               scores,
               X_train,
               X_test,
               y_train,
               y_test,
               name='ExtraTrees - {}'.format(name)
              )
Train for dataset: 70-30
Training ExtraTrees - 70-30 for f1_macro
Finished Traininig in 2754.610s
Best parameters found:
{'max_features': 0.95, 'n_estimators': 200, 'n_jobs': -1}

Classification Report:
Prediciton time: 12.941s

              precision    recall  f1-score   support

          1       0.96      0.94      0.95    157947
          2       0.96      0.96      0.96    217243
          3       0.92      0.97      0.95     14132
          4       0.93      0.88      0.91      1093
          5       0.75      0.93      0.83      3739
          6       0.90      0.95      0.92      6853
          7       0.89      0.98      0.93      8264

avg / total       0.95      0.95      0.95    409271


Accuracy: 0.954

Confusion Matrix:

 [[149168   7760      6      0    140     35    838]
 [  6393 208550    725      0   1013    423    139]
 [     1     82  13777     51     17    204      0]
 [     0      0     82    963      0     48      0]
 [    11    219     36      0   3459     14      0]
 [     4     53    285     17     11   6483      0]
 [   152     17      0      0      2      0   8093]] 


Training ExtraTrees - 70-30 for f1_micro
Finished Traininig in 2773.835s
Best parameters found:
{'max_features': 0.95, 'n_estimators': 200, 'n_jobs': -1}

Classification Report:
Prediciton time: 12.705s

              precision    recall  f1-score   support

          1       0.96      0.94      0.95    157947
          2       0.96      0.96      0.96    217243
          3       0.93      0.98      0.95     14132
          4       0.93      0.88      0.91      1093
          5       0.75      0.92      0.83      3739
          6       0.90      0.95      0.92      6853
          7       0.89      0.98      0.93      8264

avg / total       0.95      0.95      0.95    409271


Accuracy: 0.954

Confusion Matrix:

 [[149110   7804      6      0    138     34    855]
 [  6405 208591    709      0    993    413    132]
 [     1     74  13786     52     17    202      0]
 [     0      0     80    964      0     49      0]
 [    14    222     35      0   3454     14      0]
 [     3     62    277     18     11   6482      0]
 [   146     19      0      0      2      0   8097]] 


Training ExtraTrees - 70-30 for accuracy
Finished Traininig in 2764.112s
Best parameters found:
{'max_features': 0.95, 'n_estimators': 200, 'n_jobs': -1}

Classification Report:
Prediciton time: 12.682s

              precision    recall  f1-score   support

          1       0.96      0.94      0.95    157947
          2       0.96      0.96      0.96    217243
          3       0.92      0.97      0.95     14132
          4       0.93      0.88      0.90      1093
          5       0.75      0.93      0.83      3739
          6       0.90      0.95      0.92      6853
          7       0.89      0.98      0.93      8264

avg / total       0.95      0.95      0.95    409271


Accuracy: 0.954

Confusion Matrix:

 [[149189   7715      6      0    136     36    865]
 [  6444 208529    720      0    997    413    140]
 [     1     77  13775     50     18    211      0]
 [     0      0     83    959      0     51      0]
 [    13    209     33      0   3471     13      0]
 [     3     58    278     18     11   6485      0]
 [   148     18      0      0      2      0   8096]] 


Comentários

Vimos que apesar do aumento do n_estimators não houve um aumento significativo no resultado final.

Sendo assim, isso levanta uma dúvida: "Será que existe um n_estimators menor que mantenha o resultado de 95%?"

Teste para n_estimators igual e menor que o número de features

In [161]:
parameters = [
    {'n_estimators': [40, 45, target_col], 'max_features':[0.85,0.9,0.95,1], 'n_jobs':[-1]}
]

scores = ['f1_macro', 'f1_micro','accuracy']
d_train = datasets_train[3] # 70-30
d_test = datasets_test[3]
name = datasets_names[3]

print('\nTrain for dataset: {}'.format(name))

target_col = d_train.shape[1]-1

X = d_train.iloc[:, :target_col]
y = list(map(int, d_train.iloc[:, target_col].values))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

X_val, y_val = pre_process(d_test)
X_test = np.concatenate((X_val, X_test), axis=0)
y_test = np.concatenate((y_val, y_test), axis=0)

clfs_min_nest = train_model(ExtraTreesClassifier(),
               parameters,
               scores,
               X_train,
               X_test,
               y_train,
               y_test,
               name='ExtraTrees - {}'.format(name)
              )
Train for dataset: 70-30
Training ExtraTrees - 70-30 for f1_macro
Finished Traininig in 1120.444s
Best parameters found:
{'max_features': 0.9, 'n_estimators': 51, 'n_jobs': -1}

Classification Report:
Prediciton time: 4.667s

              precision    recall  f1-score   support

          1       0.96      0.94      0.95    157689
          2       0.96      0.96      0.96    217162
          3       0.92      0.98      0.95     14162
          4       0.92      0.89      0.90      1090
          5       0.76      0.92      0.83      3888
          6       0.90      0.94      0.92      7024
          7       0.89      0.98      0.94      8256

avg / total       0.95      0.95      0.95    409271


Accuracy: 0.954

Confusion Matrix:

 [[148640   8017     12      0    146     42    832]
 [  6284 208622    699      0    981    436    140]
 [     0     69  13828     55     13    197      0]
 [     0      0     95    965      0     30      0]
 [    16    262     34      0   3562     14      0]
 [     4     68    289     28      5   6630      0]
 [   122     19      0      0      2      0   8113]] 


Training ExtraTrees - 70-30 for f1_micro
Finished Traininig in 1025.393s
Best parameters found:
{'max_features': 0.95, 'n_estimators': 51, 'n_jobs': -1}

Classification Report:
Prediciton time: 3.462s

              precision    recall  f1-score   support

          1       0.96      0.94      0.95    157689
          2       0.96      0.96      0.96    217162
          3       0.93      0.98      0.95     14162
          4       0.91      0.89      0.90      1090
          5       0.76      0.92      0.83      3888
          6       0.91      0.95      0.93      7024
          7       0.89      0.98      0.94      8256

avg / total       0.95      0.95      0.95    409271


Accuracy: 0.954

Confusion Matrix:

 [[148601   8053     13      0    148     35    839]
 [  6255 208666    709      0    970    424    138]
 [     0     60  13841     66     12    183      0]
 [     0      0     92    970      0     28      0]
 [    12    247     32      0   3582     15      0]
 [     5     64    257     27      5   6666      0]
 [   117     16      0      0      2      0   8121]] 


Training ExtraTrees - 70-30 for accuracy
Finished Traininig in 947.853s
Best parameters found:
{'max_features': 0.9, 'n_estimators': 51, 'n_jobs': -1}

Classification Report:
Prediciton time: 3.362s

              precision    recall  f1-score   support

          1       0.96      0.94      0.95    157689
          2       0.96      0.96      0.96    217162
          3       0.93      0.98      0.95     14162
          4       0.91      0.89      0.90      1090
          5       0.76      0.91      0.83      3888
          6       0.90      0.95      0.93      7024
          7       0.90      0.98      0.94      8256

avg / total       0.95      0.95      0.95    409271


Accuracy: 0.954

Confusion Matrix:

 [[148690   8003     13      0    135     37    811]
 [  6219 208675    719      0    983    432    134]
 [     0     63  13830     68     14    187      0]
 [     0      0     89    972      0     29      0]
 [    13    274     33      0   3551     17      0]
 [     3     67    267     29      4   6654      0]
 [   133     16      0      0      1      0   8106]] 


Após o treinamento utilizando o GridSearch podemos visualizar todos os treinamentos e avaliar a diferença entre a categoria mean_test_score, que é o resultado final de treinamento. Podemos ver que apenas para a nona tentativa o resultado é discrepante. Todo o restante é coerente com os restantes dos treinamentos.

In [162]:
pd.DataFrame(clfs_min_nest[1][2].cv_results_)
Out[162]:
mean_fit_time mean_score_time mean_test_score mean_train_score param_max_features param_n_estimators param_n_jobs params rank_test_score split0_test_score ... split2_test_score split2_train_score split3_test_score split3_train_score split4_test_score split4_train_score std_fit_time std_score_time std_test_score std_train_score
0 21.698520 0.398194 0.947834 1.0 0.85 40 -1 {'max_features': 0.85, 'n_estimators': 40, 'n_... 9 0.946202 ... 0.947479 1.0 0.948322 1.0 0.948726 1.0 1.509012 0.045673 0.000916 0.0
1 17.430413 0.382023 0.948207 1.0 0.85 45 -1 {'max_features': 0.85, 'n_estimators': 45, 'n_... 8 0.946232 ... 0.947071 1.0 0.949399 1.0 0.949629 1.0 1.192426 0.056983 0.001333 0.0
2 19.567503 0.458790 0.948253 1.0 0.85 51 -1 {'max_features': 0.85, 'n_estimators': 51, 'n_... 6 0.947804 ... 0.946518 1.0 0.948584 1.0 0.949367 1.0 2.615536 0.045648 0.001011 0.0
3 15.798809 0.333966 0.948213 1.0 0.9 40 -1 {'max_features': 0.9, 'n_estimators': 40, 'n_j... 7 0.947716 ... 0.946402 1.0 0.948991 1.0 0.950066 1.0 1.313390 0.003249 0.001239 0.0
4 17.137032 0.338014 0.948812 1.0 0.9 45 -1 {'max_features': 0.9, 'n_estimators': 45, 'n_j... 3 0.948007 ... 0.947653 1.0 0.950068 1.0 0.949716 1.0 0.255022 0.005749 0.000941 0.0
5 19.421373 0.436335 0.948865 1.0 0.9 51 -1 {'max_features': 0.9, 'n_estimators': 51, 'n_j... 2 0.948066 ... 0.948323 1.0 0.949341 1.0 0.949600 1.0 0.319634 0.002438 0.000586 0.0
6 15.756043 0.342739 0.948609 1.0 0.95 40 -1 {'max_features': 0.95, 'n_estimators': 40, 'n_... 4 0.948735 ... 0.947537 1.0 0.949748 1.0 0.949250 1.0 0.083579 0.005854 0.000845 0.0
7 18.118283 0.341180 0.948422 1.0 0.95 45 -1 {'max_features': 0.95, 'n_estimators': 45, 'n_... 5 0.948298 ... 0.947770 1.0 0.948351 1.0 0.949105 1.0 0.314696 0.006299 0.000433 0.0
8 20.235556 0.390792 0.948970 1.0 0.95 51 -1 {'max_features': 0.95, 'n_estimators': 51, 'n_... 1 0.948182 ... 0.947915 1.0 0.949515 1.0 0.950444 1.0 0.120122 0.047527 0.000920 0.0
9 3.816082 0.472003 0.898522 1.0 1 40 -1 {'max_features': 1, 'n_estimators': 40, 'n_job... 12 0.896102 ... 0.895394 1.0 0.899817 1.0 0.900306 1.0 0.121977 0.048016 0.002306 0.0
10 4.205809 0.558517 0.900303 1.0 1 45 -1 {'max_features': 1, 'n_estimators': 45, 'n_job... 10 0.896015 ... 0.901071 1.0 0.901563 1.0 0.902053 1.0 0.101525 0.006473 0.002186 0.0
11 4.659386 0.573008 0.900001 1.0 1 51 -1 {'max_features': 1, 'n_estimators': 51, 'n_job... 11 0.898576 ... 0.899965 1.0 0.897487 1.0 0.901092 1.0 0.091691 0.040171 0.001889 0.0

12 rows × 23 columns

Resultados Finais

Visto que o ajuste fino, para incremento e decremento no número de estimadores não causou mudanças significativas nas métricas (mudança de +-0.001% para acurácia), foi escolhido o modelo com base no tempo de treinamento e no tamanho final do modelo gerado. Para o modelo com >100 estimadores, um modelo final serializado tem um tamanho de >100MB, onde do outro lado, um modelo com o mesmo número de features, ou menos, para estimadores tem seu tamanho em torno de 40MB.

Para o modelo final foi selecionado o algoritmo Extra Trees, com os parâmetros: {'max_features': 0.95, 'n_estimators': 51, 'n_jobs': -1}, treinado sobre o dataset com proporções de 70-30. Tal modelo alcancou uma acurácia de 95.4%, F1 score, recall e precisão de 95%.

In [336]:
from sklearn.externals import joblib

parameters = [
    {'n_estimators': [51], 'max_features':[0.95], 'n_jobs':[-1]}
]

scores = ['f1_micro']
d_train = datasets_train[3] # 70-30
d_test = datasets_test[3]
name = datasets_names[3]

print('\nTrain for dataset: {}'.format(name))

target_col = d_train.shape[1]-1

X = d_train.iloc[:, :target_col]
y = list(map(int, d_train.iloc[:, target_col].values))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

X_val, y_val = pre_process(d_test)
X_test = np.concatenate((X_val, X_test), axis=0)
y_test = np.concatenate((y_val, y_test), axis=0)

final = train_model(ExtraTreesClassifier(),
               parameters,
               scores,
               X_train,
               X_test,
               y_train,
               y_test,
               name='ExtraTrees - {}'.format(name)
              )

joblib.dump(final[0][2].best_estimator_,'ExtraTree5195.pkl', compress=3)
Train for dataset: 70-30
Training ExtraTrees - 70-30 for f1_micro
Finished Traininig in 164.468s
Best parameters found:
{'max_features': 0.95, 'n_estimators': 51, 'n_jobs': -1}

Classification Report:
Prediciton time: 3.864s

              precision    recall  f1-score   support

          1       0.96      0.94      0.95    157753
          2       0.96      0.96      0.96    217165
          3       0.92      0.98      0.95     14237
          4       0.91      0.88      0.90      1147
          5       0.75      0.92      0.83      3828
          6       0.90      0.94      0.92      6916
          7       0.90      0.98      0.94      8225

avg / total       0.95      0.95      0.95    409271


Accuracy: 0.953

Confusion Matrix:

 [[148905   7824      8      0    162     41    813]
 [  6707 208179    756      0    965    429    129]
 [     0     70  13908     65     14    180      0]
 [     0      0     88   1014      0     45      0]
 [    14    237     45      0   3518     13      1]
 [     1     61    293     30     10   6521      0]
 [   141     20      0      0      2      0   8062]] 


Out[336]:
['ExtraTree5195.pkl']

Por fim podemos plotar os gráficos ROC para melhor ver a performance do modelo classificador.

In [240]:
def find_optimal_cutoff(false_pos_rate, true_pos_rate, threshold):
    i = np.arange(len(true_pos_rate)) 
    roc = pd.DataFrame({
        'fpr':false_pos_rate,
        'tpr':true_pos_rate, 
        'tf' : pd.Series(true_pos_rate-(1-false_pos_rate), index=i)
    })
    roc_t = roc.iloc[(roc.tf).abs().argsort()[0]]
    
    return roc_t, roc_t['fpr'], roc_t['tpr']
In [296]:
from sklearn.metrics import roc_curve, auc
%matplotlib inline

y_pred = final[0][2].predict_proba(X_test)

for index, label in enumerate(range(1, 8)):
    y_pred_i = y_pred[:, index]
    
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred_i, pos_label=label)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    optimal_threshold = find_optimal_cutoff(false_positive_rate, true_positive_rate, thresholds)

    plt.figure(figsize=(20,10))
    plt.title('Receiver Operating Characteristic (ROC)\nClasse {}'.format(label), fontsize=18)
    plt.plot(false_positive_rate, true_positive_rate,
             color='darkorange',
             lw=2,
             label='Curva ROC (Area = {:.4f})'.format(roc_auc))
    plt.plot(optimal_threshold[1], optimal_threshold[2], 'b*', ms=15, label='Cutoff otimo')
    plt.plot([0,1],[0,1], color='navy', lw=2, linestyle='--')
    plt.xlim([-0.1,1.2])
    plt.ylim([-0.1,1.2])
    plt.ylabel('Taxa de Verdadeiro Positivo (Sensitivity)', fontsize=16)
    plt.xlabel('Taxa de Falso Positivo (Specificity)', fontsize=16)
    plt.legend(loc="lower right", fontsize=16)
    plt.show()