lab3

Objetivos¶

Praticar as etapas de entendimento do problema
praticar as etapas de preparação dos dados
Praticar os algoritmos de aprendizado de maquina e otimização de hiperparâmetros
Praticar as metricas de validação dos resultados

Classificador de renda¶

Você foi contratado por uma empresa para prestar um serviço de consultor.

Nesse sentido a empresa disponibilizou uma base em dados demográficos e ocupacionais dos seus clientes e gostaria de saber se é possível criar um modelo que preve se determinada pessoa ganha mais ou menos de 50k dólares por ano.

Nosso dataset:http://archive.ics.uci.edu/ml/datasets/Adult

Importa as bibliotecas¶

In [12]:

Copied!





%matplotlib inline

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
%matplotlib inline

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

Carrega o dataset¶

In [2]:

Copied!

df = pd.read_csv('df.csv', header = None)

columns_name = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship',
             'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']

df.columns = columns_name

df.shape
df = pd.read_csv('df.csv', header = None)

columns_name = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship',
             'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']

df.columns = columns_name

df.shape

Out[2]:

(32561, 15)

Descrição do dataset:

Age – idade 
Workclass – classe de trabalho
fnlwgt – número de pessoas que amostra representa comparada a população.
Education – educação
Education_Num – anos de escolaridade
Martial_Status – Estado Civil
Occupation – ocupação, cargo que ocupa
Relationship – parentesco
Race – raça
Sex – sexo
Capital_Gain – ganho capital
Capital_Loss – perda capital
Hours_per_week – horas por semana
Country – Nacionalidade

income – renda anual

In [3]:

Copied!

df.head()
df.head()

Out[3]:

	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	hours_per_week	native_country	income
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

In [11]:

Copied!

df.groupby('native_country').size()
df.groupby('native_country').size()

Out[11]:

native_country
?                               583
Cambodia                         19
Canada                          121
China                            75
Columbia                         59
Cuba                             95
Dominican-Republic               70
Ecuador                          28
El-Salvador                     106
England                          90
France                           29
Germany                         137
Greece                           29
Guatemala                        64
Haiti                            44
Holand-Netherlands                1
Honduras                         13
Hong                             20
Hungary                          13
India                           100
Iran                             43
Ireland                          24
Italy                            73
Jamaica                          81
Japan                            62
Laos                             18
Mexico                          643
Nicaragua                        34
Outlying-US(Guam-USVI-etc)       14
Peru                             31
Philippines                     198
Poland                           60
Portugal                         37
Puerto-Rico                     114
Scotland                         12
South                            80
Taiwan                           51
Thailand                         18
Trinadad&Tobago                  19
United-States                 29170
Vietnam                          67
Yugoslavia                       16
dtype: int64

In [ ]:

Copied!

## verifica as informações do dataset
## Seu código aqui....
## verifica as informações do dataset
## Seu código aqui....

Verificar se possui dados ausentes¶

In [4]:

Copied!

## Seu código aqui....

df.isnull().sum()
## Seu código aqui....

df.isnull().sum()

Out[4]:

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64

Aparentemente, não possui dados ausente. Vamos visualizar de forma diferente...

In [5]:

Copied!

for v2 in df:
    print(df[v2].value_counts())
for v2 in df:
    print(df[v2].value_counts())

age
36    898
31    888
34    886
23    877
35    876
     ... 
83      6
88      3
85      3
86      1
87      1
Name: count, Length: 73, dtype: int64
workclass
Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: count, dtype: int64
fnlwgt
164190    13
203488    13
123011    13
148995    12
121124    12
          ..
232784     1
325573     1
140176     1
318264     1
257302     1
Name: count, Length: 21648, dtype: int64
education
HS-grad         10501
Some-college     7291
Bachelors        5355
Masters          1723
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           646
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           333
1st-4th           168
Preschool          51
Name: count, dtype: int64
education_num
9     10501
10     7291
13     5355
14     1723
11     1382
7      1175
12     1067
6       933
4       646
15      576
5       514
8       433
16      413
3       333
2       168
1        51
Name: count, dtype: int64
marital_status
Married-civ-spouse       14976
Never-married            10683
Divorced                  4443
Separated                 1025
Widowed                    993
Married-spouse-absent      418
Married-AF-spouse           23
Name: count, dtype: int64
occupation
Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
?                    1843
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: count, dtype: int64
relationship
Husband           13193
Not-in-family      8305
Own-child          5068
Unmarried          3446
Wife               1568
Other-relative      981
Name: count, dtype: int64
race
White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: count, dtype: int64
sex
Male      21790
Female    10771
Name: count, dtype: int64
capital_gain
0        29849
15024      347
7688       284
7298       246
99999      159
         ...  
1111         1
2538         1
22040        1
4931         1
5060         1
Name: count, Length: 119, dtype: int64
capital_loss
0       31042
1902      202
1977      168
1887      159
1848       51
        ...  
2080        1
1539        1
1844        1
2489        1
1411        1
Name: count, Length: 92, dtype: int64
hours_per_week
40    15217
50     2819
45     1824
60     1475
35     1297
      ...  
82        1
92        1
87        1
74        1
94        1
Name: count, Length: 94, dtype: int64
native_country
United-States                 29170
Mexico                          643
?                               583
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Guatemala                        64
Japan                            62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
France                           29
Greece                           29
Ecuador                          28
Ireland                          24
Hong                             20
Cambodia                         19
Trinadad&Tobago                  19
Laos                             18
Thailand                         18
Yugoslavia                       16
Outlying-US(Guam-USVI-etc)       14
Honduras                         13
Hungary                          13
Scotland                         12
Holand-Netherlands                1
Name: count, dtype: int64
income
<=50K    24720
>50K      7841
Name: count, dtype: int64

Faça a interpretação das informações, observe o caracter ?.

Em quais colunas ele aparece?

Faça o replace dos ? por np.NaN.

In [6]:

Copied!

## Seu código aqui...

df=df.replace(' ?', np.nan)
## Seu código aqui...

df=df.replace(' ?', np.nan)

Vamos ver como ficou

In [7]:

Copied!

for v2 in df:
    print(df[v2].value_counts())
    
df.isnull().sum()
for v2 in df:
    print(df[v2].value_counts())
    
df.isnull().sum()

age
36    898
31    888
34    886
23    877
35    876
     ... 
83      6
88      3
85      3
86      1
87      1
Name: count, Length: 73, dtype: int64
workclass
Private             22696
Self-emp-not-inc     2541
Local-gov            2093
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: count, dtype: int64
fnlwgt
164190    13
203488    13
123011    13
148995    12
121124    12
          ..
232784     1
325573     1
140176     1
318264     1
257302     1
Name: count, Length: 21648, dtype: int64
education
HS-grad         10501
Some-college     7291
Bachelors        5355
Masters          1723
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           646
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           333
1st-4th           168
Preschool          51
Name: count, dtype: int64
education_num
9     10501
10     7291
13     5355
14     1723
11     1382
7      1175
12     1067
6       933
4       646
15      576
5       514
8       433
16      413
3       333
2       168
1        51
Name: count, dtype: int64
marital_status
Married-civ-spouse       14976
Never-married            10683
Divorced                  4443
Separated                 1025
Widowed                    993
Married-spouse-absent      418
Married-AF-spouse           23
Name: count, dtype: int64
occupation
Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: count, dtype: int64
relationship
Husband           13193
Not-in-family      8305
Own-child          5068
Unmarried          3446
Wife               1568
Other-relative      981
Name: count, dtype: int64
race
White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: count, dtype: int64
sex
Male      21790
Female    10771
Name: count, dtype: int64
capital_gain
0        29849
15024      347
7688       284
7298       246
99999      159
         ...  
1111         1
2538         1
22040        1
4931         1
5060         1
Name: count, Length: 119, dtype: int64
capital_loss
0       31042
1902      202
1977      168
1887      159
1848       51
        ...  
2080        1
1539        1
1844        1
2489        1
1411        1
Name: count, Length: 92, dtype: int64
hours_per_week
40    15217
50     2819
45     1824
60     1475
35     1297
      ...  
82        1
92        1
87        1
74        1
94        1
Name: count, Length: 94, dtype: int64
native_country
United-States                 29170
Mexico                          643
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Guatemala                        64
Japan                            62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
France                           29
Greece                           29
Ecuador                          28
Ireland                          24
Hong                             20
Cambodia                         19
Trinadad&Tobago                  19
Laos                             18
Thailand                         18
Yugoslavia                       16
Outlying-US(Guam-USVI-etc)       14
Honduras                         13
Hungary                          13
Scotland                         12
Holand-Netherlands                1
Name: count, dtype: int64
income
<=50K    24720
>50K      7841
Name: count, dtype: int64

Out[7]:

age                  0
workclass         1836
fnlwgt               0
education            0
education_num        0
marital_status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital_gain         0
capital_loss         0
hours_per_week       0
native_country     583
income               0
dtype: int64

note que agora conseguimos ver que existem dados faltantes no dataset.

Use o método .fillna(), para substituir os nulos (np.nan) por zero.

In [ ]:

Copied!

## Seu código aqui...

df = df.fillna(0)
df.isnull().sum()
## Seu código aqui...

df = df.fillna(0)
df.isnull().sum()

In [9]:

Copied!

df['occupation'].value_counts()
df['occupation'].value_counts()

Out[9]:

occupation
Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: count, dtype: int64

Ainda temos um problema, os classificadores que estudamos até agora não se dão bem com variáveis categóricas.

Dentre as diversas formas de se fazer isso.....

Vamos utilizar a técnica de OneHotEncoder com a biblioteca category_encoders apenas para economizar tempo.

In [ ]:

Copied!





from sklearn.preprocessing import OneHotEncoder

colunas_categoricas = ["workclass", "education", "marital_status", "occupation", "relationship", "race", "sex", "native_country"]
onehot_encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")

encoded = onehot_encoder.fit_transform(df[colunas_categoricas])
encoded_df = pd.DataFrame(
    encoded,
    columns=onehot_encoder.get_feature_names_out(colunas_categoricas),
    index=df.index,
)

df = pd.concat([df.drop(columns=colunas_categoricas), encoded_df], axis=1)
df.head()
from sklearn.preprocessing import OneHotEncoder

colunas_categoricas = ["workclass", "education", "marital_status", "occupation", "relationship", "race", "sex", "native_country"]
onehot_encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")

encoded = onehot_encoder.fit_transform(df[colunas_categoricas])
encoded_df = pd.DataFrame(
    encoded,
    columns=onehot_encoder.get_feature_names_out(colunas_categoricas),
    index=df.index,
)

df = pd.concat([df.drop(columns=colunas_categoricas), encoded_df], axis=1)
df.head()

In [12]:

Copied!

df.head()
df.head()

Out[12]:

	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	hours_per_week	native_country	income
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

Criamos um novo problema, os valores estão em escalas muito diferentes, o que pode prejudicar o aprendizado.

Podemos tomar algumas descições: Vamos normatizar os dados, mas antes, faça o drop da coluna income para não mudar a escala dela; Da um replace na coluna income para 0 ou 1.

Variavel independente: --> X Variavel dependente: --> y

In [ ]:

Copied!

# Variável dependente (target)
y = (df["income"].str.strip() == ">50K").astype(int)

# Variáveis independentes
X = df.drop(columns=["income"])

X.shape, y.shape
# Variável dependente (target)
y = (df["income"].str.strip() == ">50K").astype(int)

# Variáveis independentes
X = df.drop(columns=["income"])

X.shape, y.shape

In [ ]:

Copied!

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_scaled.shape
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_scaled.shape

Agora sim! Temos uma base limpa e organizada para rodar diversos modelos de ML.

X_scaled contém as variáveis independentes normalizadas.
y contém a variável dependente (classe alvo).

In [ ]:

Copied!





# Separar os dados em treino e teste
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Treino: {X_train.shape} | Teste: {X_test.shape}")
# Separar os dados em treino e teste
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Treino: {X_train.shape} | Teste: {X_test.shape}")

Classificadores Naive Bayes¶

Trata-se de "classificadores probabilísticos" simples, baseados na aplicação do teorema de Bayes com fortes pressupostos de independência entre os atributos. Eles estão entre os modelos de rede bayesianos mais simples.

Existem 3 tipos diferentes de Naive Bayes:

1.Gaussian Naive Bayes

2.Multinomial Naive Bayes

3.Bernoulli Naive Bayes

São muiiiitttooo utilizados em aplicações com texto e NLP e aplicações como:

[X] Filtro de Spam
[X] Classificação de Texto
[X] Analise de sentimento
[X] Sistemas de recomendação

Para mais referências: https://en.wikipedia.org/wiki/Naive_Bayes_classifier

In [ ]:

Copied!

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)
y_pred
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)
y_pred

In [ ]:

Copied!





from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print(f"Acurácia (teste): {accuracy_score(y_test, y_pred):.4f}")
print("\nMatriz de confusão:")
print(confusion_matrix(y_test, y_pred))
print("\nRelatório de classificação:")
print(classification_report(y_test, y_pred, digits=3))
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print(f"Acurácia (teste): {accuracy_score(y_test, y_pred):.4f}")
print("\nMatriz de confusão:")
print(confusion_matrix(y_test, y_pred))
print("\nRelatório de classificação:")
print(classification_report(y_test, y_pred, digits=3))

In [ ]:

Copied!

y_pred_train = gnb.predict(X_train)
print(f"Acurácia (treino): {accuracy_score(y_train, y_pred_train):.4f}")
y_pred_train = gnb.predict(X_train)
print(f"Acurácia (treino): {accuracy_score(y_train, y_pred_train):.4f}")

In [ ]:

Copied!

print(f"Acurácia (teste): {accuracy_score(y_test, y_pred):.4f}")
print(f"Acurácia (teste): {accuracy_score(y_test, y_pred):.4f}")

Compare o resultado de acurácia com pelo menos 1 outro método de machine learning.¶

In [ ]:

Copied!





# Exemplo: comparar com Regressão Logística
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred_logreg = logreg.predict(X_test)

print(f"Acurácia (LogReg - teste): {accuracy_score(y_test, y_pred_logreg):.4f}")
# Exemplo: comparar com Regressão Logística
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred_logreg = logreg.predict(X_test)

print(f"Acurácia (LogReg - teste): {accuracy_score(y_test, y_pred_logreg):.4f}")