ML income classifier
Objectives¶
- Practice the steps of understanding the problem
- Practice the data preparation steps
- Practice machine learning algorithms and hyperparameter optimization
- Practice the metrics used to validate results
Income classifier¶
You have been hired by a company as a consultant.
The company has provided a dataset with demographic and occupational data about its customers and wants to know whether it is possible to build a model that predicts whether a given person earns more or less than 50k dollars per year.
Our dataset: http://archive.ics.uci.edu/ml/datasets/Adult
Import the libraries¶
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Load the dataset¶
df = pd.read_csv('df.csv', header = None)
columns_name = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship',
'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']
df.columns = columns_name
df.shape
(32561, 15)
Dataset description:
Age – age
Workclass – employment class
fnlwgt – final weight: the number of people in the population that the sampled record represents.
Education – education level
Education_Num – years of schooling
Marital_Status – marital status
Occupation – occupation, job held
Relationship – family relationship
Race – race
Sex – sex
Capital_Gain – capital gain
Capital_Loss – capital loss
Hours_per_week – hours worked per week
Native_Country – country of origin
income – annual income (<=50K or >50K)
df.head()
| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
df.groupby('native_country').size()
native_country
? 583
Cambodia 19
Canada 121
China 75
Columbia 59
Cuba 95
Dominican-Republic 70
Ecuador 28
El-Salvador 106
England 90
France 29
Germany 137
Greece 29
Guatemala 64
Haiti 44
Holand-Netherlands 1
Honduras 13
Hong 20
Hungary 13
India 100
Iran 43
Ireland 24
Italy 73
Jamaica 81
Japan 62
Laos 18
Mexico 643
Nicaragua 34
Outlying-US(Guam-USVI-etc) 14
Peru 31
Philippines 198
Poland 60
Portugal 37
Puerto-Rico 114
Scotland 12
South 80
Taiwan 51
Thailand 18
Trinadad&Tobago 19
United-States 29170
Vietnam 67
Yugoslavia 16
dtype: int64
## Check the dataset information
## Your code here....
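One possible way to fill in this cell, as a minimal sketch: `info()` lists column dtypes and non-null counts, and `describe()` summarizes the numeric columns. The small `sample` frame is hypothetical stand-in data so the snippet runs on its own; in the notebook the same methods would be called on `df`.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of df (values taken from df.head())
sample = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": [" State-gov", " Self-emp-not-inc", " Private"],
    "hours_per_week": [40, 13, 40],
    "income": [" <=50K", " <=50K", " <=50K"],
})

sample.info()                  # dtypes and non-null counts per column
summary = sample.describe()    # count, mean, std, min, quartiles, max (numeric columns only)
print(summary)

# Split the columns by type, useful later when deciding what to encode or scale
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, categorical_cols)
```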
Check for missing data¶
## Your code here....
df.isnull().sum()
age 0
workclass 0
fnlwgt 0
education 0
education_num 0
marital_status 0
occupation 0
relationship 0
race 0
sex 0
capital_gain 0
capital_loss 0
hours_per_week 0
native_country 0
income 0
dtype: int64
Apparently there is no missing data. Let's look at it a different way...
for v2 in df:
    print(df[v2].value_counts())
age
36 898
31 888
34 886
23 877
35 876
...
83 6
88 3
85 3
86 1
87 1
Name: count, Length: 73, dtype: int64
workclass
Private 22696
Self-emp-not-inc 2541
Local-gov 2093
? 1836
State-gov 1298
Self-emp-inc 1116
Federal-gov 960
Without-pay 14
Never-worked 7
Name: count, dtype: int64
fnlwgt
164190 13
203488 13
123011 13
148995 12
121124 12
..
232784 1
325573 1
140176 1
318264 1
257302 1
Name: count, Length: 21648, dtype: int64
education
HS-grad 10501
Some-college 7291
Bachelors 5355
Masters 1723
Assoc-voc 1382
11th 1175
Assoc-acdm 1067
10th 933
7th-8th 646
Prof-school 576
9th 514
12th 433
Doctorate 413
5th-6th 333
1st-4th 168
Preschool 51
Name: count, dtype: int64
education_num
9 10501
10 7291
13 5355
14 1723
11 1382
7 1175
12 1067
6 933
4 646
15 576
5 514
8 433
16 413
3 333
2 168
1 51
Name: count, dtype: int64
marital_status
Married-civ-spouse 14976
Never-married 10683
Divorced 4443
Separated 1025
Widowed 993
Married-spouse-absent 418
Married-AF-spouse 23
Name: count, dtype: int64
occupation
Prof-specialty 4140
Craft-repair 4099
Exec-managerial 4066
Adm-clerical 3770
Sales 3650
Other-service 3295
Machine-op-inspct 2002
? 1843
Transport-moving 1597
Handlers-cleaners 1370
Farming-fishing 994
Tech-support 928
Protective-serv 649
Priv-house-serv 149
Armed-Forces 9
Name: count, dtype: int64
relationship
Husband 13193
Not-in-family 8305
Own-child 5068
Unmarried 3446
Wife 1568
Other-relative 981
Name: count, dtype: int64
race
White 27816
Black 3124
Asian-Pac-Islander 1039
Amer-Indian-Eskimo 311
Other 271
Name: count, dtype: int64
sex
Male 21790
Female 10771
Name: count, dtype: int64
capital_gain
0 29849
15024 347
7688 284
7298 246
99999 159
...
1111 1
2538 1
22040 1
4931 1
5060 1
Name: count, Length: 119, dtype: int64
capital_loss
0 31042
1902 202
1977 168
1887 159
1848 51
...
2080 1
1539 1
1844 1
2489 1
1411 1
Name: count, Length: 92, dtype: int64
hours_per_week
40 15217
50 2819
45 1824
60 1475
35 1297
...
82 1
92 1
87 1
74 1
94 1
Name: count, Length: 94, dtype: int64
native_country
United-States 29170
Mexico 643
? 583
Philippines 198
Germany 137
Canada 121
Puerto-Rico 114
El-Salvador 106
India 100
Cuba 95
England 90
Jamaica 81
South 80
China 75
Italy 73
Dominican-Republic 70
Vietnam 67
Guatemala 64
Japan 62
Poland 60
Columbia 59
Taiwan 51
Haiti 44
Iran 43
Portugal 37
Nicaragua 34
Peru 31
France 29
Greece 29
Ecuador 28
Ireland 24
Hong 20
Cambodia 19
Trinadad&Tobago 19
Laos 18
Thailand 18
Yugoslavia 16
Outlying-US(Guam-USVI-etc) 14
Honduras 13
Hungary 13
Scotland 12
Holand-Netherlands 1
Name: count, dtype: int64
income
<=50K 24720
>50K 7841
Name: count, dtype: int64
Interpret the information and note the ? character.
In which columns does it appear?
Replace each ? with np.nan.
## Your code here...
df=df.replace(' ?', np.nan)
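To answer which columns contain the `?` marker, one option is to count matches per column before replacing. This is a minimal sketch on hypothetical toy data; the leading space in `' ?'` mirrors the `replace` call above and assumes the file was read without stripping whitespace.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows mimicking the raw file, where values carry a leading space
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": [" State-gov", " ?", " Private", " Private"],
    "occupation": [" Adm-clerical", " ?", " Sales", " ?"],
    "native_country": [" United-States", " Cuba", " ?", " United-States"],
})

# Count cells equal to ' ?' in each column and keep only columns with at least one
marker_counts = (toy == " ?").sum()
cols_with_marker = marker_counts[marker_counts > 0].index.tolist()
print(cols_with_marker)

# Replace the marker with NaN so that pandas treats it as missing
toy = toy.replace(" ?", np.nan)
missing_per_column = toy.isnull().sum().to_dict()
print(missing_per_column)
```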
Let's see how it turned out
for v2 in df:
    print(df[v2].value_counts())
df.isnull().sum()
age
36 898
31 888
34 886
23 877
35 876
...
83 6
88 3
85 3
86 1
87 1
Name: count, Length: 73, dtype: int64
workclass
Private 22696
Self-emp-not-inc 2541
Local-gov 2093
State-gov 1298
Self-emp-inc 1116
Federal-gov 960
Without-pay 14
Never-worked 7
Name: count, dtype: int64
fnlwgt
164190 13
203488 13
123011 13
148995 12
121124 12
..
232784 1
325573 1
140176 1
318264 1
257302 1
Name: count, Length: 21648, dtype: int64
education
HS-grad 10501
Some-college 7291
Bachelors 5355
Masters 1723
Assoc-voc 1382
11th 1175
Assoc-acdm 1067
10th 933
7th-8th 646
Prof-school 576
9th 514
12th 433
Doctorate 413
5th-6th 333
1st-4th 168
Preschool 51
Name: count, dtype: int64
education_num
9 10501
10 7291
13 5355
14 1723
11 1382
7 1175
12 1067
6 933
4 646
15 576
5 514
8 433
16 413
3 333
2 168
1 51
Name: count, dtype: int64
marital_status
Married-civ-spouse 14976
Never-married 10683
Divorced 4443
Separated 1025
Widowed 993
Married-spouse-absent 418
Married-AF-spouse 23
Name: count, dtype: int64
occupation
Prof-specialty 4140
Craft-repair 4099
Exec-managerial 4066
Adm-clerical 3770
Sales 3650
Other-service 3295
Machine-op-inspct 2002
Transport-moving 1597
Handlers-cleaners 1370
Farming-fishing 994
Tech-support 928
Protective-serv 649
Priv-house-serv 149
Armed-Forces 9
Name: count, dtype: int64
relationship
Husband 13193
Not-in-family 8305
Own-child 5068
Unmarried 3446
Wife 1568
Other-relative 981
Name: count, dtype: int64
race
White 27816
Black 3124
Asian-Pac-Islander 1039
Amer-Indian-Eskimo 311
Other 271
Name: count, dtype: int64
sex
Male 21790
Female 10771
Name: count, dtype: int64
capital_gain
0 29849
15024 347
7688 284
7298 246
99999 159
...
1111 1
2538 1
22040 1
4931 1
5060 1
Name: count, Length: 119, dtype: int64
capital_loss
0 31042
1902 202
1977 168
1887 159
1848 51
...
2080 1
1539 1
1844 1
2489 1
1411 1
Name: count, Length: 92, dtype: int64
hours_per_week
40 15217
50 2819
45 1824
60 1475
35 1297
...
82 1
92 1
87 1
74 1
94 1
Name: count, Length: 94, dtype: int64
native_country
United-States 29170
Mexico 643
Philippines 198
Germany 137
Canada 121
Puerto-Rico 114
El-Salvador 106
India 100
Cuba 95
England 90
Jamaica 81
South 80
China 75
Italy 73
Dominican-Republic 70
Vietnam 67
Guatemala 64
Japan 62
Poland 60
Columbia 59
Taiwan 51
Haiti 44
Iran 43
Portugal 37
Nicaragua 34
Peru 31
France 29
Greece 29
Ecuador 28
Ireland 24
Hong 20
Cambodia 19
Trinadad&Tobago 19
Laos 18
Thailand 18
Yugoslavia 16
Outlying-US(Guam-USVI-etc) 14
Honduras 13
Hungary 13
Scotland 12
Holand-Netherlands 1
Name: count, dtype: int64
income
<=50K 24720
>50K 7841
Name: count, dtype: int64
age 0
workclass 1836
fnlwgt 0
education 0
education_num 0
marital_status 0
occupation 1843
relationship 0
race 0
sex 0
capital_gain 0
capital_loss 0
hours_per_week 0
native_country 583
income 0
dtype: int64
Note that we can now see that the dataset has missing values.
Use the .fillna() method to replace the nulls (np.nan) with zero.
## Your code here...
# fillna returns a new DataFrame; without assignment (df = df.fillna(0)) df itself keeps its NaN values
df.fillna(0)
| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32556 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States | <=50K |
| 32557 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 32558 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 32559 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |
| 32560 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States | >50K |
32561 rows × 15 columns
df['occupation'].value_counts()
occupation
Prof-specialty 4140
Craft-repair 4099
Exec-managerial 4066
Adm-clerical 3770
Sales 3650
Other-service 3295
Machine-op-inspct 2002
Transport-moving 1597
Handlers-cleaners 1370
Farming-fishing 994
Tech-support 928
Protective-serv 649
Priv-house-serv 149
Armed-Forces 9
Name: count, dtype: int64
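`fillna` returns a new DataFrame and leaves the original unchanged unless the result is assigned, which is why `occupation` above still shows no `0` category. Below is a small sketch on hypothetical data showing the difference. The `'Unknown'` label is an alternative assumed for this example only, not part of the exercise.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    # object dtype, as pandas uses for mixed string/NaN columns read from CSV
    "occupation": pd.Series([" Sales", np.nan, " Tech-support", np.nan], dtype=object),
    "hours_per_week": [40, 50, 38, 40],
})

toy.fillna(0)                                  # result discarded: toy still contains NaN
still_missing = int(toy["occupation"].isnull().sum())

filled_zero = toy.fillna(0)                    # assigned: NaN replaced by 0, as the exercise asks
filled_label = toy.fillna({"occupation": "Unknown"})  # alternative: keep the column all-string

print(still_missing)
print(filled_zero["occupation"].tolist())
print(filled_label["occupation"].tolist())
```

Filling a categorical column with `0` mixes integers and strings in one column; a string label keeps the column uniform, which tends to be easier to encode later.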
There is still a problem: the classifiers we have studied so far do not handle categorical variables well, so these columns need to be converted to numbers.
Among the several ways to do this, we will use one-hot encoding (OneHotEncoder) from the category_encoders library, simply to save time.
## If needed, pip install category_encoders
from sklearn.preprocessing import OneHotEncoder
onehot_encoder = OneHotEncoder(sparse_output=False, cols=["----Put the categorical columns to transform here----"])
df = onehot_encoder.fit_transform(df)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[11], line 5
      1 ## If needed, pip install category_encoders
      3 from sklearn.preprocessing import OneHotEncoder
----> 5 onehot_encoder = OneHotEncoder(sparse_output=False, cols=["----Put the categorical columns to transform here----"])
      7 df = onehot_encoder.fit_transform(df)

TypeError: OneHotEncoder.__init__() got an unexpected keyword argument 'cols'
The call fails because `cols` is a parameter of `category_encoders.OneHotEncoder`, while the class imported here is scikit-learn's `OneHotEncoder`, which does not accept it. Importing the encoder with `from category_encoders import OneHotEncoder` and dropping `sparse_output` (a scikit-learn argument) matches the intent of the text above. Because of the error, `df` is left unencoded in the output that follows.
df.head()
| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
We now have a new problem: the values are on very different scales, which can hurt learning.
We can make a few decisions:
- Normalize the data, but first drop the income column so that its scale is not changed;
- Replace the values in the income column with 0 or 1.
Independent variables: --> X
Dependent variable: --> y
# Drop the column of interest, `income` (y).
## Normalize the data
## choose one of the methods....
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import minmax_scale
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer
## your code here...
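A sketch of the steps described above, assuming `MinMaxScaler` as the chosen method (any of the imported scalers could be substituted) and a hypothetical toy frame in place of the encoded `df`.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "capital_gain": [2174, 0, 0, 15024],
    "income": [" <=50K", " <=50K", " <=50K", " >50K"],
})

# Target: 1 when income is above 50K, 0 otherwise (strip handles the leading space)
y = (toy["income"].str.strip() == ">50K").astype(int)

# Features: everything except the target, so the target's scale is left untouched
X = toy.drop(columns="income")

scaler = MinMaxScaler()   # rescales each column to the [0, 1] range
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)
print(X_scaled)
print(y.tolist())
```

In a full pipeline the scaler is usually fit on the training split only and then applied to the test split, so that no information from the test data leaks into training.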
Now we have a clean, organized dataset ready for running different ML models, where:
X -> holds the independent variables. y -> is our variable of interest.
Split the data into training and test sets. What proportion will be used for each subset?
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
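A sketch of the split, assuming a 70/30 proportion (a common choice, not one prescribed by the exercise) and synthetic data standing in for `X` and `y`. `stratify=y` keeps the class ratio similar in both subsets, which matters here because about 76% of the rows are `<=50K` (24720 of 32561).

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((100, 3))              # synthetic features, stand-in for the scaled X
y = np.array([0] * 76 + [1] * 24)     # synthetic labels with roughly the dataset's class ratio

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())  # share of the positive class in each subset
```

Fixing `random_state` makes the split reproducible, so accuracy values can be compared across runs and models.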
Naive Bayes classifiers¶
These are simple "probabilistic classifiers" based on applying Bayes' theorem with strong independence assumptions between the attributes. They are among the simplest Bayesian network models.
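For reference, the standard form of the rule these classifiers apply, for a class $y$ and attributes $x_1, \dots, x_n$ (notation introduced here, not taken from the notebook):

```latex
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The product over attributes is where the independence assumption enters: each $P(x_i \mid y)$ is estimated separately, and the variants of Naive Bayes differ in the distribution assumed for it.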
There are three main types of Naive Bayes:
1. Gaussian Naive Bayes
2. Multinomial Naive Bayes
3. Bernoulli Naive Bayes
They are very widely used in text and NLP tasks and in applications such as:
[X] Spam filtering
[X] Text classification
[X] Sentiment analysis
[X] Recommender systems
For more references: https://en.wikipedia.org/wiki/Naive_Bayes_classifier
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
y_pred
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[15], line 5
      1 from sklearn.naive_bayes import GaussianNB
      3 gnb = GaussianNB()
----> 5 gnb.fit(X_train, y_train)
      6 y_pred = gnb.predict(X_test)
      7 y_pred

NameError: name 'X_train' is not defined
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[16], line 3
      1 from sklearn.metrics import accuracy_score
----> 3 accuracy_score(y_test, y_pred)

NameError: name 'y_test' is not defined
y_pred_train = gnb.predict(X_train)
y_pred_train
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[17], line 1
----> 1 y_pred_train = gnb.predict(X_train)
      2 y_pred_train

NameError: name 'X_train' is not defined
accuracy_score(y_train, y_pred_train)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[18], line 1
----> 1 accuracy_score(y_train, y_pred_train)

NameError: name 'y_train' is not defined
Compare the accuracy result with at least one other machine learning method.¶
### your answer here....
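One possible way to set up the comparison, sketched on synthetic data from `make_classification` (an assumption; in the notebook the real `X_train`, `X_test`, `y_train` and `y_test` would be used). `LogisticRegression` is chosen here only as an example of a second method.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary problem with a class imbalance similar to the income column
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.76, 0.24], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

models = {
    "GaussianNB": GaussianNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Fit each model and record accuracy on both subsets
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "train": accuracy_score(y_train, model.predict(X_train)),
        "test": accuracy_score(y_test, model.predict(X_test)),
    }
    print(name, results[name])
```

Comparing train and test accuracy for each model also gives a first hint of overfitting. Because always predicting `<=50K` would already score about 0.76 on this dataset, accuracy is worth complementing with a confusion matrix or F1 score.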