lab3

Objectives¶

Practice the steps for understanding the problem
Practice the data preparation steps
Practice machine learning algorithms and hyperparameter optimization
Practice the metrics used to validate the results

Income classifier¶

You have been hired by a company to work as a consultant.

The company has provided a dataset with demographic and occupational data about its customers and wants to know whether it is possible to build a model that predicts whether a given person earns more or less than 50k dollars per year.

Our dataset: http://archive.ics.uci.edu/ml/datasets/Adult

Import the libraries¶

In [12]:

%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Load the dataset¶

In [2]:

# The CSV file has no header row, so the column names are assigned manually
df = pd.read_csv('df.csv', header=None)

columns_name = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship',
                'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']

df.columns = columns_name

df.shape

Out[2]:

(32561, 15)

Dataset description:

Age – age
Workclass – work class (type of employer)
fnlwgt – final weight: the number of people in the population that the sample represents
Education – education level
Education_Num – years of schooling
Marital_Status – marital status
Occupation – occupation, job held
Relationship – family relationship
Race – race
Sex – sex
Capital_Gain – capital gain
Capital_Loss – capital loss
Hours_per_week – hours worked per week
Native_Country – country of origin
income – annual income

In [3]:

df.head()

Out[3]:

	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	hours_per_week	native_country	income
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

In [11]:

df.groupby('native_country').size()

Out[11]:

native_country
?                               583
Cambodia                         19
Canada                          121
China                            75
Columbia                         59
Cuba                             95
Dominican-Republic               70
Ecuador                          28
El-Salvador                     106
England                          90
France                           29
Germany                         137
Greece                           29
Guatemala                        64
Haiti                            44
Holand-Netherlands                1
Honduras                         13
Hong                             20
Hungary                          13
India                           100
Iran                             43
Ireland                          24
Italy                            73
Jamaica                          81
Japan                            62
Laos                             18
Mexico                          643
Nicaragua                        34
Outlying-US(Guam-USVI-etc)       14
Peru                             31
Philippines                     198
Poland                           60
Portugal                         37
Puerto-Rico                     114
Scotland                         12
South                            80
Taiwan                           51
Thailand                         18
Trinadad&Tobago                  19
United-States                 29170
Vietnam                          67
Yugoslavia                       16
dtype: int64

In [ ]:

## Check the dataset information
## Your code here....
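
A minimal sketch of the kind of inspection this cell asks for. It runs on a tiny hypothetical frame (`toy`) rather than on the lab's `df`, so the values are illustrative only: `info()` reports dtypes and non-null counts, and `describe()` summarizes the numeric columns.

```python
import pandas as pd

# Hypothetical three-row sample that mimics a few columns of the Adult dataset
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
    "hours_per_week": [40, 13, 40],
})

toy.info()                  # dtypes and non-null counts per column
summary = toy.describe()    # count, mean, std and quartiles of the numeric columns
dtypes = toy.dtypes         # one dtype per column

print(summary)
print(dtypes)
```

On the real data the same three calls would be made on `df` directly.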

Check for missing data¶

In [4]:

## Your code here....

df.isnull().sum()

Out[4]:

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64

Apparently there are no missing values. Let's look at the data from a different angle...

In [5]:

for v2 in df:
    print(df[v2].value_counts())

age
36    898
31    888
34    886
23    877
35    876
     ... 
83      6
88      3
85      3
86      1
87      1
Name: count, Length: 73, dtype: int64
workclass
Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: count, dtype: int64
fnlwgt
164190    13
203488    13
123011    13
148995    12
121124    12
          ..
232784     1
325573     1
140176     1
318264     1
257302     1
Name: count, Length: 21648, dtype: int64
education
HS-grad         10501
Some-college     7291
Bachelors        5355
Masters          1723
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           646
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           333
1st-4th           168
Preschool          51
Name: count, dtype: int64
education_num
9     10501
10     7291
13     5355
14     1723
11     1382
7      1175
12     1067
6       933
4       646
15      576
5       514
8       433
16      413
3       333
2       168
1        51
Name: count, dtype: int64
marital_status
Married-civ-spouse       14976
Never-married            10683
Divorced                  4443
Separated                 1025
Widowed                    993
Married-spouse-absent      418
Married-AF-spouse           23
Name: count, dtype: int64
occupation
Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
?                    1843
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: count, dtype: int64
relationship
Husband           13193
Not-in-family      8305
Own-child          5068
Unmarried          3446
Wife               1568
Other-relative      981
Name: count, dtype: int64
race
White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: count, dtype: int64
sex
Male      21790
Female    10771
Name: count, dtype: int64
capital_gain
0        29849
15024      347
7688       284
7298       246
99999      159
         ...  
1111         1
2538         1
22040        1
4931         1
5060         1
Name: count, Length: 119, dtype: int64
capital_loss
0       31042
1902      202
1977      168
1887      159
1848       51
        ...  
2080        1
1539        1
1844        1
2489        1
1411        1
Name: count, Length: 92, dtype: int64
hours_per_week
40    15217
50     2819
45     1824
60     1475
35     1297
      ...  
82        1
92        1
87        1
74        1
94        1
Name: count, Length: 94, dtype: int64
native_country
United-States                 29170
Mexico                          643
?                               583
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Guatemala                        64
Japan                            62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
France                           29
Greece                           29
Ecuador                          28
Ireland                          24
Hong                             20
Cambodia                         19
Trinadad&Tobago                  19
Laos                             18
Thailand                         18
Yugoslavia                       16
Outlying-US(Guam-USVI-etc)       14
Honduras                         13
Hungary                          13
Scotland                         12
Holand-Netherlands                1
Name: count, dtype: int64
income
<=50K    24720
>50K      7841
Name: count, dtype: int64

Interpret the information and note the ? character.

In which columns does it appear?

Replace each ? with np.nan.

In [6]:

## Your code here...

# The raw values carry a leading space, hence ' ?' rather than '?'
df = df.replace(' ?', np.nan)
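
One possible way to answer "in which columns does it appear?" programmatically, sketched on a hypothetical frame (`toy`) whose strings carry a leading space like the raw file. Stripping the whitespace first is an assumption of this sketch, not a step the lab requires.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; as in the raw file, string values carry a leading space
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": [" State-gov", " ?", " Private"],
    "occupation": [" ?", " Sales", " ?"],
})

# Strip surrounding whitespace from the text columns only
text_cols = toy.select_dtypes(exclude="number").columns
toy[text_cols] = toy[text_cols].apply(lambda s: s.str.strip())

# Count the "?" placeholders per text column, then turn them into real missing values
question_marks = (toy[text_cols] == "?").sum()
toy = toy.replace("?", np.nan)

print(question_marks[question_marks > 0])
print(toy.isnull().sum())
```

After the replacement, `isnull().sum()` reports the same counts that the "?" tally found.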

Let's see the result

In [7]:

for v2 in df:
    print(df[v2].value_counts())

df.isnull().sum()

age
36    898
31    888
34    886
23    877
35    876
     ... 
83      6
88      3
85      3
86      1
87      1
Name: count, Length: 73, dtype: int64
workclass
Private             22696
Self-emp-not-inc     2541
Local-gov            2093
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: count, dtype: int64
fnlwgt
164190    13
203488    13
123011    13
148995    12
121124    12
          ..
232784     1
325573     1
140176     1
318264     1
257302     1
Name: count, Length: 21648, dtype: int64
education
HS-grad         10501
Some-college     7291
Bachelors        5355
Masters          1723
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           646
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           333
1st-4th           168
Preschool          51
Name: count, dtype: int64
education_num
9     10501
10     7291
13     5355
14     1723
11     1382
7      1175
12     1067
6       933
4       646
15      576
5       514
8       433
16      413
3       333
2       168
1        51
Name: count, dtype: int64
marital_status
Married-civ-spouse       14976
Never-married            10683
Divorced                  4443
Separated                 1025
Widowed                    993
Married-spouse-absent      418
Married-AF-spouse           23
Name: count, dtype: int64
occupation
Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: count, dtype: int64
relationship
Husband           13193
Not-in-family      8305
Own-child          5068
Unmarried          3446
Wife               1568
Other-relative      981
Name: count, dtype: int64
race
White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: count, dtype: int64
sex
Male      21790
Female    10771
Name: count, dtype: int64
capital_gain
0        29849
15024      347
7688       284
7298       246
99999      159
         ...  
1111         1
2538         1
22040        1
4931         1
5060         1
Name: count, Length: 119, dtype: int64
capital_loss
0       31042
1902      202
1977      168
1887      159
1848       51
        ...  
2080        1
1539        1
1844        1
2489        1
1411        1
Name: count, Length: 92, dtype: int64
hours_per_week
40    15217
50     2819
45     1824
60     1475
35     1297
      ...  
82        1
92        1
87        1
74        1
94        1
Name: count, Length: 94, dtype: int64
native_country
United-States                 29170
Mexico                          643
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Guatemala                        64
Japan                            62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
France                           29
Greece                           29
Ecuador                          28
Ireland                          24
Hong                             20
Cambodia                         19
Trinadad&Tobago                  19
Laos                             18
Thailand                         18
Yugoslavia                       16
Outlying-US(Guam-USVI-etc)       14
Honduras                         13
Hungary                          13
Scotland                         12
Holand-Netherlands                1
Name: count, dtype: int64
income
<=50K    24720
>50K      7841
Name: count, dtype: int64

Out[7]:

age                  0
workclass         1836
fnlwgt               0
education            0
education_num        0
marital_status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital_gain         0
capital_loss         0
hours_per_week       0
native_country     583
income               0
dtype: int64

Note that we can now see that the dataset has missing values.

Use the .fillna() method to replace the nulls (np.nan) with zero.

In [8]:

## Your code here...

# fillna returns a new DataFrame; assign it (df = df.fillna(0)) to keep the change
df.fillna(0)

Out[8]:

	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	capital_loss	hours_per_week	native_country	income
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	0	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	0	40	Cuba	<=50K
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
32556	27	Private	257302	Assoc-acdm	12	Married-civ-spouse	Tech-support	Wife	White	Female	0	0	38	United-States	<=50K
32557	40	Private	154374	HS-grad	9	Married-civ-spouse	Machine-op-inspct	Husband	White	Male	0	0	40	United-States	>50K
32558	58	Private	151910	HS-grad	9	Widowed	Adm-clerical	Unmarried	White	Female	0	0	40	United-States	<=50K
32559	22	Private	201490	HS-grad	9	Never-married	Adm-clerical	Own-child	White	Male	0	0	20	United-States	<=50K
32560	52	Self-emp-inc	287927	HS-grad	9	Married-civ-spouse	Exec-managerial	Wife	White	Female	15024	0	40	United-States	>50K

32561 rows × 15 columns
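
`fillna` returns a new DataFrame and leaves the original untouched unless the result is assigned back. A hedged sketch on a hypothetical frame (`toy`) that shows the difference, plus one possible alternative for text columns: a per-column label such as "Unknown", which is an assumption of this sketch rather than part of the lab.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing values in a text (object dtype) column
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "occupation": pd.Series(["Sales", np.nan, np.nan], dtype=object),
})

toy.fillna(0)                                    # returns a filled copy; `toy` is unchanged
still_missing = toy["occupation"].isnull().sum()

filled_zero = toy.fillna(0)                      # assign the result to keep it
filled_label = toy.fillna({"occupation": "Unknown"})  # per-column fill value

print(still_missing)
print(filled_label["occupation"].value_counts())
```

Filling a categorical column with a label keeps the column's values of one type, which may be easier to encode later than a mix of strings and the number 0.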

In [9]:

df['occupation'].value_counts()

Out[9]:

occupation
Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: count, dtype: int64

We still have a problem: the classifiers we have studied so far do not work well with categorical variables.

There are several ways to convert them into numbers...

We will use one-hot encoding through the OneHotEncoder class of the category_encoders library, simply to save time.

In [11]:

## If needed: pip install category_encoders

# `cols` is an argument of the category_encoders encoder; scikit-learn's OneHotEncoder does not accept it
from category_encoders import OneHotEncoder

onehot_encoder = OneHotEncoder(cols=["----Put the categorical columns to be transformed here----"])

df = onehot_encoder.fit_transform(df)

In [12]:

df.head()

Out[12]:

	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	hours_per_week	native_country	income
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

This creates a new problem: the values are on very different scales, which can hurt learning.

We can make a few decisions: we will normalize the data, but first drop the income column so that its scale is not changed, and replace the values in the income column with 0 or 1.

Independent variables: --> X
Dependent variable: --> y

In [ ]:

# Drop the target column `income` (y) from the features
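
A minimal sketch of this step on a hypothetical frame (`toy`), assuming the labels still carry the leading space found in the raw file. The mapping of ">50K" to 1 is a choice of this sketch.

```python
import pandas as pd

# Hypothetical sample; the raw labels carry a leading space, as in the file
toy = pd.DataFrame({
    "age": [39, 50, 52],
    "hours_per_week": [40, 13, 40],
    "income": [" <=50K", " <=50K", " >50K"],
})

# Target: 1 for ">50K", 0 for "<=50K" (strip the space before mapping)
y = toy["income"].str.strip().map({"<=50K": 0, ">50K": 1})

# Features: every column except the target
X = toy.drop(columns="income")

print(X.columns.tolist())
print(y.tolist())
```

Any label missing from the mapping would become NaN, so a quick `y.isnull().sum()` is a cheap sanity check on the real data.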

In [13]:

## Normalize the data
## choose one of the methods....
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import minmax_scale
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer

## your code here...
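
A hedged sketch using StandardScaler, one of the options imported above, on hypothetical numeric arrays. It assumes the data has already been split, so that the scaler learns its statistics from the training rows only. That ordering is a choice of this sketch and differs from the order of the cells in this lab.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features on very different scales (age, capital_gain)
X_train = np.array([[39.0, 2174.0], [50.0, 0.0], [38.0, 0.0], [53.0, 0.0]])
X_test = np.array([[28.0, 0.0], [37.0, 5178.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std on the training data
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics on the test data

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
```

The other scalers in the import list follow the same `fit_transform` / `transform` pattern, so swapping one in only changes the constructor line.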

Now we have a clean, organized dataset on which to run several ML models, where:

X -> holds the independent variables.
y -> is our target variable.

Split the data into training and test sets. What proportion will be used for each subset?

In [14]:

# Split the data into training and test sets
from sklearn.model_selection import train_test_split
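
A minimal sketch of the split on synthetic data. The 70/30 proportion, the `stratify` argument and the `random_state` value are assumptions of this sketch; the lab leaves the proportion open.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 3 features, 24% positive labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 24 + [0] * 76)

# test_size=0.3 is an assumed proportion; stratify keeps the class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

Stratifying matters when the classes are unbalanced, as they are in this dataset, because a purely random split could leave the test set with a different class ratio.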

Naive Bayes classifiers¶

These are simple "probabilistic classifiers" based on applying Bayes' theorem with strong independence assumptions between the features. They are among the simplest Bayesian network models.
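
In symbols, for a class $y$ and features $x_1, \dots, x_n$, the independence assumption lets the posterior factor into one term per feature. This is the standard textbook form, not something specific to this lab:

$$
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The variants differ only in the distribution assumed for each $P(x_i \mid y)$.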

Three common variants of Naive Bayes are:

1. Gaussian Naive Bayes

2. Multinomial Naive Bayes

3. Bernoulli Naive Bayes

They are very widely used in text and NLP applications such as:

[X] Spam filtering
[X] Text classification
[X] Sentiment analysis
[X] Recommender systems

For further reference: https://en.wikipedia.org/wiki/Naive_Bayes_classifier

In [15]:

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()

gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
y_pred

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[15], line 5
      1 from sklearn.naive_bayes import GaussianNB
      3 gnb = GaussianNB()
----> 5 gnb.fit(X_train, y_train)
      6 y_pred = gnb.predict(X_test)
      7 y_pred

NameError: name 'X_train' is not defined

In [16]:

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[16], line 3
      1 from sklearn.metrics import accuracy_score
----> 3 accuracy_score(y_test, y_pred)

NameError: name 'y_test' is not defined
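
Accuracy alone can mislead here: 24720 of the 32561 rows (about 76%) belong to the <=50K class, so a model that always predicts that class already reaches roughly 0.76. A hedged sketch on hypothetical labels that shows such a baseline next to a confusion matrix and per-class metrics:

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels: 0 = "<=50K", 1 = ">50K"
y_test = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0])

# Baseline: always predict the majority class
majority = np.bincount(y_test).argmax()
baseline_acc = accuracy_score(y_test, np.full_like(y_test, majority))

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class

print(baseline_acc, acc)
print(cm)
print(classification_report(y_test, y_pred, target_names=["<=50K", ">50K"]))
```

Precision and recall for the >50K class show how well the minority class is recovered, which the overall accuracy hides.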

In [17]:

y_pred_train = gnb.predict(X_train)
y_pred_train

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[17], line 1
----> 1 y_pred_train = gnb.predict(X_train)
      2 y_pred_train

NameError: name 'X_train' is not defined

In [18]:

accuracy_score(y_train, y_pred_train)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[18], line 1
----> 1 accuracy_score(y_train, y_pred_train)

NameError: name 'y_train' is not defined

Compare the accuracy with at least one other machine learning method.¶

In [ ]:

### your answer here....
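
One possible shape for this comparison, sketched on synthetic data from `make_classification` rather than on the Adult dataset. The choice of a decision tree as the second model and the small hyperparameter grid are assumptions of this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared X and y (not the Adult data)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Baseline model from the lab
gnb = GaussianNB().fit(X_train, y_train)
gnb_acc = gnb.score(X_test, y_test)

# Second model, with a small assumed grid tuned by 5-fold cross-validation
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)
tree_acc = grid.score(X_test, y_test)

print(f"GaussianNB accuracy:    {gnb_acc:.3f}")
print(f"Decision tree accuracy: {tree_acc:.3f} with {grid.best_params_}")
```

The grid search touches only the training split, so the test accuracy of both models stays an honest estimate.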

In [ ]: