lab3
Objectives¶
- Practice the problem-understanding steps
- Practice the data-preparation steps
- Practice machine learning algorithms and hyperparameter optimization
- Practice the metrics used to validate the results
Income classifier¶
You have been hired by a company as a consultant.
The company has provided a dataset with demographic and occupational data about its customers and would like to know whether it is possible to build a model that predicts whether a given person earns more or less than 50k dollars per year.
Our dataset: http://archive.ics.uci.edu/ml/datasets/Adult
Import the libraries¶
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Load the dataset¶
df = pd.read_csv('df.csv', header = None)
columns_name = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship',
'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']
df.columns = columns_name
df.shape
(32561, 15)
Dataset description:
Age – age
Workclass – employment class
fnlwgt – final weight: the number of people in the population that the sample represents
Education – education level
Education_Num – years of schooling
Marital_Status – marital status
Occupation – occupation, job held
Relationship – family relationship
Race – race
Sex – sex
Capital_Gain – capital gain
Capital_Loss – capital loss
Hours_per_week – hours worked per week
Native_Country – country of origin
income – annual income
df.head()
| age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
df.groupby('native_country').size()
native_country
?                               583
Cambodia                         19
Canada                          121
China                            75
Columbia                         59
Cuba                             95
Dominican-Republic               70
Ecuador                          28
El-Salvador                     106
England                          90
France                           29
Germany                         137
Greece                           29
Guatemala                        64
Haiti                            44
Holand-Netherlands                1
Honduras                         13
Hong                             20
Hungary                          13
India                           100
Iran                             43
Ireland                          24
Italy                            73
Jamaica                          81
Japan                            62
Laos                             18
Mexico                          643
Nicaragua                        34
Outlying-US(Guam-USVI-etc)       14
Peru                             31
Philippines                     198
Poland                           60
Portugal                         37
Puerto-Rico                     114
Scotland                         12
South                            80
Taiwan                           51
Thailand                         18
Trinadad&Tobago                  19
United-States                 29170
Vietnam                          67
Yugoslavia                       16
dtype: int64
## check the dataset information
## Your code here....
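One possible way to inspect the dataset is sketched below on a tiny hand-made frame. The name `sample` and its three rows (typed from the `df.head()` output above) are only illustrative; on the real data the same methods would be called on `df`.

```python
import pandas as pd

# Tiny stand-in for the real DataFrame (values typed from df.head() above).
sample = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
    "hours_per_week": [40, 13, 40],
    "income": ["<=50K", "<=50K", "<=50K"],
})

sample.info()                # column names, non-null counts and dtypes
summary = sample.describe()  # count, mean, std, min, quartiles, max (numeric columns only)
print(summary)
```

`info()` is usually the quickest check that each column was read with the expected type, while `describe()` gives a first idea of the numeric ranges.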
Check for missing data¶
## Your code here....
df.isnull().sum()
age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64
Apparently there is no missing data. Let's look at it a different way...
for v2 in df:
    print(df[v2].value_counts())
age 36 898 31 888 34 886 23 877 35 876 ... 83 6 88 3 85 3 86 1 87 1 Name: count, Length: 73, dtype: int64 workclass Private 22696 Self-emp-not-inc 2541 Local-gov 2093 ? 1836 State-gov 1298 Self-emp-inc 1116 Federal-gov 960 Without-pay 14 Never-worked 7 Name: count, dtype: int64 fnlwgt 164190 13 203488 13 123011 13 148995 12 121124 12 .. 232784 1 325573 1 140176 1 318264 1 257302 1 Name: count, Length: 21648, dtype: int64 education HS-grad 10501 Some-college 7291 Bachelors 5355 Masters 1723 Assoc-voc 1382 11th 1175 Assoc-acdm 1067 10th 933 7th-8th 646 Prof-school 576 9th 514 12th 433 Doctorate 413 5th-6th 333 1st-4th 168 Preschool 51 Name: count, dtype: int64 education_num 9 10501 10 7291 13 5355 14 1723 11 1382 7 1175 12 1067 6 933 4 646 15 576 5 514 8 433 16 413 3 333 2 168 1 51 Name: count, dtype: int64 marital_status Married-civ-spouse 14976 Never-married 10683 Divorced 4443 Separated 1025 Widowed 993 Married-spouse-absent 418 Married-AF-spouse 23 Name: count, dtype: int64 occupation Prof-specialty 4140 Craft-repair 4099 Exec-managerial 4066 Adm-clerical 3770 Sales 3650 Other-service 3295 Machine-op-inspct 2002 ? 1843 Transport-moving 1597 Handlers-cleaners 1370 Farming-fishing 994 Tech-support 928 Protective-serv 649 Priv-house-serv 149 Armed-Forces 9 Name: count, dtype: int64 relationship Husband 13193 Not-in-family 8305 Own-child 5068 Unmarried 3446 Wife 1568 Other-relative 981 Name: count, dtype: int64 race White 27816 Black 3124 Asian-Pac-Islander 1039 Amer-Indian-Eskimo 311 Other 271 Name: count, dtype: int64 sex Male 21790 Female 10771 Name: count, dtype: int64 capital_gain 0 29849 15024 347 7688 284 7298 246 99999 159 ... 1111 1 2538 1 22040 1 4931 1 5060 1 Name: count, Length: 119, dtype: int64 capital_loss 0 31042 1902 202 1977 168 1887 159 1848 51 ... 2080 1 1539 1 1844 1 2489 1 1411 1 Name: count, Length: 92, dtype: int64 hours_per_week 40 15217 50 2819 45 1824 60 1475 35 1297 ... 
82 1 92 1 87 1 74 1 94 1 Name: count, Length: 94, dtype: int64 native_country United-States 29170 Mexico 643 ? 583 Philippines 198 Germany 137 Canada 121 Puerto-Rico 114 El-Salvador 106 India 100 Cuba 95 England 90 Jamaica 81 South 80 China 75 Italy 73 Dominican-Republic 70 Vietnam 67 Guatemala 64 Japan 62 Poland 60 Columbia 59 Taiwan 51 Haiti 44 Iran 43 Portugal 37 Nicaragua 34 Peru 31 France 29 Greece 29 Ecuador 28 Ireland 24 Hong 20 Cambodia 19 Trinadad&Tobago 19 Laos 18 Thailand 18 Yugoslavia 16 Outlying-US(Guam-USVI-etc) 14 Honduras 13 Hungary 13 Scotland 12 Holand-Netherlands 1 Name: count, dtype: int64 income <=50K 24720 >50K 7841 Name: count, dtype: int64
Interpret this output and note the ? character. In which columns does it appear?
Replace every ? with np.nan.
## Your code here...
## The raw values carry a leading space, hence ' ?' rather than '?'
df = df.replace(' ?', np.nan)
Let's see how it turned out
for v2 in df:
    print(df[v2].value_counts())
df.isnull().sum()
age 36 898 31 888 34 886 23 877 35 876 ... 83 6 88 3 85 3 86 1 87 1 Name: count, Length: 73, dtype: int64 workclass Private 22696 Self-emp-not-inc 2541 Local-gov 2093 State-gov 1298 Self-emp-inc 1116 Federal-gov 960 Without-pay 14 Never-worked 7 Name: count, dtype: int64 fnlwgt 164190 13 203488 13 123011 13 148995 12 121124 12 .. 232784 1 325573 1 140176 1 318264 1 257302 1 Name: count, Length: 21648, dtype: int64 education HS-grad 10501 Some-college 7291 Bachelors 5355 Masters 1723 Assoc-voc 1382 11th 1175 Assoc-acdm 1067 10th 933 7th-8th 646 Prof-school 576 9th 514 12th 433 Doctorate 413 5th-6th 333 1st-4th 168 Preschool 51 Name: count, dtype: int64 education_num 9 10501 10 7291 13 5355 14 1723 11 1382 7 1175 12 1067 6 933 4 646 15 576 5 514 8 433 16 413 3 333 2 168 1 51 Name: count, dtype: int64 marital_status Married-civ-spouse 14976 Never-married 10683 Divorced 4443 Separated 1025 Widowed 993 Married-spouse-absent 418 Married-AF-spouse 23 Name: count, dtype: int64 occupation Prof-specialty 4140 Craft-repair 4099 Exec-managerial 4066 Adm-clerical 3770 Sales 3650 Other-service 3295 Machine-op-inspct 2002 Transport-moving 1597 Handlers-cleaners 1370 Farming-fishing 994 Tech-support 928 Protective-serv 649 Priv-house-serv 149 Armed-Forces 9 Name: count, dtype: int64 relationship Husband 13193 Not-in-family 8305 Own-child 5068 Unmarried 3446 Wife 1568 Other-relative 981 Name: count, dtype: int64 race White 27816 Black 3124 Asian-Pac-Islander 1039 Amer-Indian-Eskimo 311 Other 271 Name: count, dtype: int64 sex Male 21790 Female 10771 Name: count, dtype: int64 capital_gain 0 29849 15024 347 7688 284 7298 246 99999 159 ... 1111 1 2538 1 22040 1 4931 1 5060 1 Name: count, Length: 119, dtype: int64 capital_loss 0 31042 1902 202 1977 168 1887 159 1848 51 ... 2080 1 1539 1 1844 1 2489 1 1411 1 Name: count, Length: 92, dtype: int64 hours_per_week 40 15217 50 2819 45 1824 60 1475 35 1297 ... 
82 1 92 1 87 1 74 1 94 1 Name: count, Length: 94, dtype: int64 native_country United-States 29170 Mexico 643 Philippines 198 Germany 137 Canada 121 Puerto-Rico 114 El-Salvador 106 India 100 Cuba 95 England 90 Jamaica 81 South 80 China 75 Italy 73 Dominican-Republic 70 Vietnam 67 Guatemala 64 Japan 62 Poland 60 Columbia 59 Taiwan 51 Haiti 44 Iran 43 Portugal 37 Nicaragua 34 Peru 31 France 29 Greece 29 Ecuador 28 Ireland 24 Hong 20 Cambodia 19 Trinadad&Tobago 19 Laos 18 Thailand 18 Yugoslavia 16 Outlying-US(Guam-USVI-etc) 14 Honduras 13 Hungary 13 Scotland 12 Holand-Netherlands 1 Name: count, dtype: int64 income <=50K 24720 >50K 7841 Name: count, dtype: int64
age                  0
workclass         1836
fnlwgt               0
education            0
education_num        0
marital_status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital_gain         0
capital_loss         0
hours_per_week       0
native_country     583
income               0
dtype: int64
Note that we can now see that the dataset has missing data.
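As an alternative sketch, the ? markers can be turned into missing values while the file is being read. The three rows below are a made-up excerpt reduced to four columns, and the "comma followed by a space" layout is an assumption inferred from the leading space in ' ?'.

```python
import io
import pandas as pd

# Hypothetical excerpt in the assumed "comma + space" layout of the raw file.
raw = io.StringIO(
    "39, State-gov, United-States, <=50K\n"
    "54, ?, United-States, >50K\n"
    "32, Private, ?, <=50K\n"
)

sample = pd.read_csv(
    raw,
    header=None,
    names=["age", "workclass", "native_country", "income"],
    na_values="?",          # treat '?' as a missing value
    skipinitialspace=True,  # drop the space that follows each comma
)
missing = sample.isnull().sum()
print(missing)
```

With `skipinitialspace=True` the remaining text values also lose their leading space, so later comparisons can use 'Private' instead of ' Private'.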
Use the .fillna() method to replace the nulls (np.nan) with zero.
## Your code here...
## Note: fillna returns a new DataFrame; assign it (df = df.fillna(0)) to keep the change
df.fillna(0)
| age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
32556 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States | <=50K |
32557 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
32558 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
32559 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |
32560 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States | >50K |
32561 rows × 15 columns
df['occupation'].value_counts()
occupation
Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: count, dtype: int64
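Filling with zero places the number 0 inside text columns such as workclass and occupation. Two common alternatives are sketched below: an explicit "Unknown" label, or the most frequent value of each column. Neither is the method the lab asks for, and the frame `sample` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value in each text column.
sample = pd.DataFrame({
    "workclass": ["Private", np.nan, "Private", "State-gov"],
    "occupation": ["Sales", "Sales", np.nan, "Tech-support"],
    "age": [39, 50, 38, 53],
})
cat_cols = ["workclass", "occupation"]

# Option 1: explicit placeholder category.
with_label = sample.copy()
with_label[cat_cols] = with_label[cat_cols].fillna("Unknown")

# Option 2: most frequent value (mode) of each column.
with_mode = sample.copy()
for col in cat_cols:
    with_mode[col] = with_mode[col].fillna(with_mode[col].mode()[0])

print(with_label)
print(with_mode)
```

The placeholder keeps "missing" visible as its own category after encoding, whereas the mode hides it inside the largest class.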
We still have a problem: the classifiers we have studied so far do not handle categorical variables well, so these columns must be converted to numbers.
Among the several ways of doing this, we will use the OneHotEncoder technique from the category_encoders library, just to save time.
## If necessary: pip install category_encoders
from category_encoders import OneHotEncoder
## `cols` is a category_encoders argument; sklearn's OneHotEncoder does not accept it
onehot_encoder = OneHotEncoder(cols=["----Put the categorical columns to transform here----"])
df = onehot_encoder.fit_transform(df)
df.head()
| age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
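To make the idea of one-hot encoding concrete, here is a minimal sketch that uses `pd.get_dummies` from pandas instead of category_encoders, so it runs without the extra install. The frame `sample` is made up; only the column names come from the dataset.

```python
import pandas as pd

# Illustrative frame: one numeric column and two categorical columns.
sample = pd.DataFrame({
    "age": [39, 50, 38],
    "sex": ["Male", "Male", "Female"],
    "race": ["White", "Black", "White"],
})

# One new 0/1 column per category; numeric columns pass through unchanged.
encoded = pd.get_dummies(sample, columns=["sex", "race"], dtype=int)
print(encoded)
```

Each category becomes its own column, so a column with many distinct values (native_country, for example) adds many columns to the frame.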
We have created a new problem: the values are on very different scales, which can hurt learning.
We can make a few decisions:
- Normalize the data, but first drop the income column so that its scale is not changed;
- Replace the values in the income column with 0 or 1.
Independent variables: --> X
Dependent variable: --> y
# Drop the target column `income` (y).
## Normalize the data
## choose one of the methods....
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import minmax_scale
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer
## your code here...
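The two decisions above could be sketched as follows. StandardScaler is an arbitrary pick from the list of imports, the frame `sample` is made up, and the leading space in the income labels is an assumption based on the ' ?' values seen earlier.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative frame; the income labels carry the assumed leading space.
sample = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "hours_per_week": [40, 13, 40, 40],
    "income": [" <=50K", " >50K", " <=50K", " >50K"],
})

# Target: strip the leading space, then map the two labels to 0 / 1.
y = sample["income"].str.strip().map({"<=50K": 0, ">50K": 1})

# Features: everything except the target, rescaled to mean 0 and unit variance.
X = sample.drop(columns="income")
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)

print(y.tolist())
print(X_scaled.round(2))
```

In a full pipeline the scaler would normally be fit on the training split only and then applied to the test split, so that no information from the test data leaks into training.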
Now we have a clean, organized dataset on which to run several ML models, where:
X -> holds the independent variables. y -> is our variable of interest.
Split the data into training and test sets. What proportion will be used for each subset?
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
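A minimal sketch of the split, assuming an 80/20 proportion and a stratified shuffle. Both are choices made here for illustration rather than requirements of the lab, and the random arrays stand in for the prepared X and y.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 100 rows, 3 features, imbalanced binary target (75 / 25).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 75 + [1] * 25)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # assumed 80/20 split
    stratify=y,       # keep the class ratio in both subsets
    random_state=42,  # reproducible shuffle
)
print(X_train.shape, X_test.shape)
```

Stratifying matters for this dataset because the two income classes are not balanced.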
Naive Bayes classifiers¶
These are simple "probabilistic classifiers" based on applying Bayes' theorem
with strong independence assumptions between the attributes. They are among the simplest Bayesian network models.
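In symbols (the notation is introduced here for illustration: $y$ is the class and $x_1, \dots, x_n$ are the attributes), Bayes' theorem combined with the independence assumption gives:

```latex
P(y \mid x_1, \dots, x_n)
  = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
  \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The denominator does not depend on $y$, so it can be ignored when choosing the most probable class.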
The three most common variants are:
1. Gaussian Naive Bayes
2. Multinomial Naive Bayes
3. Bernoulli Naive Bayes
They are very widely used in text and NLP applications, such as:
[X] Spam filtering
[X] Text classification
[X] Sentiment analysis
[X] Recommender systems
For further reference: https://en.wikipedia.org/wiki/Naive_Bayes_classifier
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
y_pred
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[15], line 5
      1 from sklearn.naive_bayes import GaussianNB
      3 gnb = GaussianNB()
----> 5 gnb.fit(X_train, y_train)
      6 y_pred = gnb.predict(X_test)
      7 y_pred

NameError: name 'X_train' is not defined
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[16], line 3
      1 from sklearn.metrics import accuracy_score
----> 3 accuracy_score(y_test, y_pred)

NameError: name 'y_test' is not defined
y_pred_train = gnb.predict(X_train)
y_pred_train
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[17], line 1
----> 1 y_pred_train = gnb.predict(X_train)
      2 y_pred_train

NameError: name 'X_train' is not defined
accuracy_score(y_train, y_pred_train)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[18], line 1
----> 1 accuracy_score(y_train, y_pred_train)

NameError: name 'y_train' is not defined
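The NameError messages above appear because X_train, X_test, y_train and y_test only exist once the split cell has been completed. The sketch below shows the whole sequence on synthetic data: `make_classification` is a stand-in for the prepared X and y, with class weights chosen to resemble the roughly 76% / 24% income split seen earlier.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the prepared X / y (about 76% / 24% classes).
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.76, 0.24], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

gnb = GaussianNB()
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
train_acc = accuracy_score(y_train, gnb.predict(X_train))
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
print(cm)
```

On the real data, 24720 of the 32561 rows are <=50K, so a model that always predicts that class already reaches an accuracy of about 0.76. The confusion matrix helps show whether a classifier is doing better than that baseline on the >50K class.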
Compare the accuracy with at least one other machine learning method.¶
### your answer here....
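One way such a comparison might look is sketched below, again on synthetic data. The decision tree and the `max_depth` grid are arbitrary choices made here to show hyperparameter optimization; any other classifier could take its place.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared X / y.
X, y = make_classification(n_samples=600, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline: Gaussian Naive Bayes with default settings.
nb_acc = accuracy_score(y_test, GaussianNB().fit(X_train, y_train).predict(X_test))

# Challenger: decision tree, with max_depth chosen by 5-fold cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8]},
    scoring="accuracy",
    cv=5,
)
search.fit(X_train, y_train)
tree_acc = accuracy_score(y_test, search.predict(X_test))

print(f"GaussianNB:   {nb_acc:.3f}")
print(f"DecisionTree: {tree_acc:.3f} (best params: {search.best_params_})")
```

The grid search only sees the training split, so the test accuracy remains an honest estimate for both models.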