UD120: Final Project¶

Enron POI: Classifier¶

Question 1¶

Goal of this project¶

The goal of this project is to use the data from the final_project dataset to build a POI (person of interest) identifier. Machine learning is useful here because the dataset has many features and we don't know how much (or whether) each one affects the likelihood of someone being a POI.

Data Exploration¶

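import pickle

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)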

# Feature subsets for slicing (excluding the total column(s))
payments_features = ['salary', 'bonus', 'long_term_incentive', 'deferred_income', 'deferral_payments',
                     'loan_advances', 'other', 'expenses', 'director_fees']

stock_features = ['exercised_stock_options', 'restricted_stock', 'restricted_stock_deferred']

financial_features = payments_features + stock_features

email_features = ['to_messages', 'from_poi_to_this_person',
                  'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi']

poi_label = ['poi']

# Replace 'NaN' strings with None in the dataset
for outer_key, inner_dict in data_dict.items():
    for k, v in inner_dict.items():
        if v == 'NaN':
            data_dict[outer_key][k] = None

df = pd.DataFrame.from_dict(data_dict,
                            orient='index'  # use outer dict keys as row labels (the index)
                            )
# Turn the remaining None values (e.g. in the email_address field) into NaN.
# pd.np was removed in pandas 2.0, so numpy is used directly.
df.fillna(value=np.nan, inplace=True)
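As an alternative to looping over the dictionary, the placeholder strings can also be handled after the DataFrame is built. The sketch below is a minimal illustration on a made-up two-record dictionary (`toy_dict`, its names, and its values are hypothetical, not taken from the dataset); it assumes the numeric columns are known in advance.

```python
import pandas as pd

# Hypothetical two-record stand-in for data_dict (values are made up).
toy_dict = {
    "PERSON A": {"salary": 100000, "bonus": "NaN", "poi": False},
    "PERSON B": {"salary": "NaN", "bonus": 5000, "poi": True},
}

toy_df = pd.DataFrame.from_dict(toy_dict, orient="index")

# Columns mixing numbers and 'NaN' strings load as object dtype.
# errors="coerce" turns anything non-numeric into a real NaN.
numeric_cols = ["salary", "bonus"]
toy_df[numeric_cols] = toy_df[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(toy_df.dtypes)
print(toy_df.isna().sum())
```

This keeps the numeric columns as floats, which is what the later plotting and scaling steps expect.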

Total Number of data points¶

print("Records: {0}\nFeatures: {1}".format(*df.shape))  # unpack shape tuple into formatted str

Records: 146
Features: 21

Allocation across classes (POI/non-POI)¶

df.poi.value_counts()

False    128
True      18
Name: poi, dtype: int64

df.poi.value_counts(normalize=True)

False    0.876712
True     0.123288
Name: poi, dtype: float64

Feature Names¶

print(*df.columns, sep="\n")

salary
to_messages
deferral_payments
total_payments
loan_advances
bonus
email_address
restricted_stock_deferred
deferred_income
total_stock_value
expenses
from_poi_to_this_person
exercised_stock_options
from_messages
other
from_this_person_to_poi
poi
long_term_incentive
shared_receipt_with_poi
restricted_stock
director_fees

Are there features with many missing values?¶

cm = sns.light_palette("red", as_cmap=True)

nan_percents = (df.isna().sum() / len(df)).sort_values(ascending=False).to_frame().\
    rename(columns={0:"Missing values"}).style.\
    background_gradient(cmap=cm).\
    format("{:.2%}")
nan_percents

Outlier Investigation¶

From the "Outliers" mini-project we learned that the biggest outlier was the record named 'TOTAL'. This was an issue resulting from the spreadsheet import of the data. This was discovered from visualization as shown below.

# salary by bonus plot before 'TOTAL' removal
sns.lmplot(x='salary', y='bonus', data=df, hue='poi', fit_reg=False)
plt.title("Salary by Bonus");

# the 'TOTAL' record
TOTAL = df.sort_values(by=['salary'], ascending=False, na_position='last').head(1)
TOTAL
# dropping computed 'TOTAL' observation
df.drop(index='TOTAL', inplace=True)

# salary by bonus plot AFTER 'TOTAL' removal
sns.lmplot(x='salary', y='bonus', data=df, hue='poi', fit_reg=False)
plt.title("Salary by Bonus")
plt.savefig('sal_bonus_clean.png')
plt.show();

There are four other potential outliers here, but those are valid data points.

Any record where every feature is NaN gets removed too. Since 'poi' is never missing, thresh=2 keeps only the rows that have at least one other non-missing value.

df.dropna(thresh=2, inplace=True)
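To see what `thresh=2` does, here is a small sketch on a made-up frame (the row names and values are hypothetical). Only the row that has nothing but its 'poi' label is dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame: 'poi' is always present, as in the real data.
toy = pd.DataFrame(
    {
        "poi": [False, False, True],
        "salary": [np.nan, 200000.0, np.nan],
        "bonus": [np.nan, np.nan, 1000000.0],
    },
    index=["EMPTY RECORD", "SALARY ONLY", "BONUS ONLY"],
)

# thresh=2 keeps rows with at least two non-missing values,
# i.e. 'poi' plus at least one real feature.
kept = toy.dropna(thresh=2)
print(kept.index.tolist())
```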

Lastly, we'll remove the record for 'THE TRAVEL AGENCY IN THE PARK', as it is neither a person nor a person of interest.

df.drop(index='THE TRAVEL AGENCY IN THE PARK', inplace=True)

# DF shape after outlier removal 
df.shape

(143, 21)

Question 2¶

What features did you end up using in your POI identifier?¶

The final 'POI identifier' uses 'exercised_stock_options', 'bonus', and 'total_stock_value' as features in the classifier, with 'poi' as the label.

What selection process did you use to pick them?¶

I used an automated GridSearchCV search over SelectKBest that tried each value of K from 1 up to, but not including, the number of feature columns (20). The resulting feature scores are shown in a table lower down.

Did you have to do any scaling? Why or why not?¶

To use the linear kernel in the support vector classifier, and to get better results overall, I scaled the dataset with StandardScaler from the sklearn.preprocessing module. This helped deal with the disparity between large values like 'total_stock_value' and smaller values like email counts.
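A minimal sketch of what StandardScaler does to columns on very different scales. The numbers are made up for illustration and are not from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: one dollar-scale column and one count-scale column.
X_toy = np.array([
    [1_000_000.0, 10.0],
    [5_000_000.0, 200.0],
    [20_000_000.0, 50.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

# After scaling, each column has mean 0 and unit variance,
# so neither column dominates distance-based calculations.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```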

Create own feature¶

New feature named 'has_email' created

Explain created feature¶

The created feature 'has_email' stores a boolean based on whether an email address is present (True) or absent (False) in the email_address field.

Explain rationale¶

'email_address' is currently a feature filled with strings (objects) and NaNs. It cannot be used in classifiers in a meaningful way. Translating it to a boolean makes it usable and potentially useful in ML classification.

Effect of feature measured¶

The effect of the created has_email feature can be seen in the Feature score table below, generated during classifier testing.

# New Feature - has_email - creation
df['has_email'] = df.email_address.notna()
df.has_email.head(4)

ALLEN PHILLIP K     True
BADUM JAMES P      False
Name: has_email, dtype: bool

Automating feature selection¶

A pipeline together with GridSearchCV was used for somewhat automated feature selection and parameter tuning.

# List the SelectKBest score of each feature, highest first
pd.DataFrame(list(zip(X.columns, grid_search.best_estimator_.steps[1][1].scores_))).sort_values(by=1, ascending=False)
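The pipeline itself is not shown in this report, so the following is only a sketch of how such a setup could look. The step names 'features' and 'svm' follow the prefixes used in the parameter grid; the 'scaler' step name, the f_classif score function, the reduced grid, and the synthetic data are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the Enron features: 143 rows, 20 columns, imbalanced.
X_demo, y_demo = make_classification(
    n_samples=143, n_features=20, n_informative=4,
    weights=[0.87, 0.13], random_state=0,
)

# 'features' and 'svm' match the param_grid prefixes; 'scaler' is an assumed name.
pipeline_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("features", SelectKBest(score_func=f_classif)),
    ("svm", SVC()),
])

# A deliberately small grid so the sketch runs quickly.
grid_demo = GridSearchCV(
    pipeline_demo,
    param_grid={"features__k": [1, 3, 5], "svm__C": [0.1, 1, 10]},
    cv=5,
    scoring="f1",
)
grid_demo.fit(X_demo, y_demo)

# Looking the step up by name avoids depending on its position in the pipeline.
selector = grid_demo.best_estimator_.named_steps["features"]
print(grid_demo.best_params_)
print(selector.scores_.shape)
```

Because scaling and feature selection live inside the pipeline, they are refit on each training fold, so the held-out fold never influences which features get picked.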

Question 3¶

Test at least 2 ML algorithms¶

POI identifiers were tested with a support vector classifier (SVC()) and Gaussian Naive Bayes (GaussianNB()).

Compare Performance¶

Classifier	Precision	Recall
SVC	.56446	.16200
GaussianNB	.45092	.34450

Even with parameter tuning, SVC wasn't able to hit the .3 benchmark for recall. The GaussianNB classifier was, though, so it gets the honor of becoming a pickle (how exciting!).

What algorithm was chosen?¶

GaussianNB was chosen since it fulfills the performance metric requirements.
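A comparison like this can be sketched with cross_validate, which reports several metrics in one pass. The data here is synthetic, and the StratifiedShuffleSplit settings are an assumption rather than the procedure behind the table above, so the printed numbers will not match it.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the selected features.
X_demo, y_demo = make_classification(
    n_samples=143, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.87, 0.13], random_state=0,
)

# Assumed validation scheme: repeated stratified 70/30 splits.
cv = StratifiedShuffleSplit(n_splits=20, test_size=0.3, random_state=42)

candidates = {
    "SVC": make_pipeline(StandardScaler(), SVC()),
    "GaussianNB": GaussianNB(),
}

results = {}
for name, clf in candidates.items():
    scores = cross_validate(clf, X_demo, y_demo, cv=cv,
                            scoring=("precision", "recall"))
    results[name] = (scores["test_precision"].mean(),
                     scores["test_recall"].mean())
    print(f"{name}: precision={results[name][0]:.3f} recall={results[name][1]:.3f}")
```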

Question 4¶

Define parameter tuning¶

Parameter tuning is the process of trying different combinations of parameter values to improve the performance (speed, accuracy metrics, etc.) of ML algorithms.

Explain why parameter tuning is important¶

Parameter tuning is important because it can improve things like accuracy or speed, and help avoid problems like overfitting.

How I tuned my parameters¶

My parameters were tuned via GridSearchCV with a parameter grid. For each parameter in the grid I listed the values I wanted tested. GridSearchCV then tried every combination of those values and returned the estimator whose combination gave the best accuracy score during cross-validation.

Which parameters did you tune?¶

In the pipeline part of my gridsearch I tuned:

K value for SelectKBest
C value for SVM
kernel for SVM
gamma for SVM

# Parameter grid for GridSearchCV
param_grid = dict(features__k=np.arange(1, X.shape[1]),
                  svm__C=[0.1, 1, 10],
                  svm__kernel=['linear', 'rbf', 'sigmoid'],
                  svm__gamma=['scale', 'auto'])
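To get a feel for how much work this grid implies, the number of candidate combinations can be counted with ParameterGrid. The sketch assumes X has 20 columns (matching the 20 rows of the feature score table); `n_columns` and `param_grid_demo` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

n_columns = 20  # assumed value of X.shape[1]

param_grid_demo = dict(
    features__k=np.arange(1, n_columns),  # 1..19; arange excludes the stop value
    svm__C=[0.1, 1, 10],
    svm__kernel=["linear", "rbf", "sigmoid"],
    svm__gamma=["scale", "auto"],
)

n_candidates = len(ParameterGrid(param_grid_demo))
print(n_candidates)      # 19 * 3 * 3 * 2 = 342 combinations
print(n_candidates * 5)  # 1710 model fits with cv=5
```

Note that gamma has no effect with the linear kernel, so some of these combinations are effectively duplicates.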

Question 5¶

What is validation and why is it important?¶

Validation means holding out part of the dataset for testing and training on the rest. Training and testing on the same full dataset gives an unrealistic accuracy for a classifier or regression: the score is specious because the model has already seen the outcome of every record it is asked to predict. Splitting the dataset into training and testing sets handles this issue.

What is a classic mistake if validation is done incorrectly?¶

If validation is done incorrectly, the reported accuracies can be misleadingly high or low. A classic case is splitting, without shuffling, a dataset that is ordered by outcome (i.e. all True outcomes listed first, then all False outcomes), so that a training or testing set ends up with only one class. ShuffleSplit or StratifiedKFold can prevent this.
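The ordering problem can be illustrated with a hypothetical label vector sorted by outcome (18 POIs followed by 125 non-POIs, mirroring the class counts after cleaning). The sketch counts how many POIs land in each test fold for an unshuffled KFold versus StratifiedKFold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical labels sorted by outcome: 18 POIs first, then 125 non-POIs.
y_sorted = np.array([1] * 18 + [0] * 125)
X_dummy = np.zeros((len(y_sorted), 1))  # features are irrelevant to the split


def positives_per_test_fold(splitter):
    """Count the POIs that fall into each test fold."""
    return [int(y_sorted[test_idx].sum())
            for _, test_idx in splitter.split(X_dummy, y_sorted)]


plain = positives_per_test_fold(KFold(n_splits=5))
stratified = positives_per_test_fold(StratifiedKFold(n_splits=5))

print(plain)       # unshuffled KFold puts every POI in the first test fold
print(stratified)  # StratifiedKFold spreads the POIs across all folds
```

With the plain split, four of the five test folds contain no POI at all, so precision and recall cannot be measured meaningfully on them.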

How did I validate my analysis?¶

My analysis was validated during GridSearchCV. As the name suggests, GridSearchCV incorporates cross-validation as part of its functionality.

Note the specific type of cross validation performed.¶

By specifying 5 in the 'cv' parameter of GridSearchCV, a StratifiedKFold with 5 folds is used during the grid search. This divides the dataset into 5 'folds', each containing roughly the same percentage of samples from each target class ('poi'). This is useful since the target distribution in the dataset is heavily imbalanced. Each fold is then used once as the test set while the other four are used for training, and the five results are averaged.

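# Note: the iid parameter was deprecated in scikit-learn 0.22 and removed in 0.24; omit it on newer versions
grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=5, iid=True)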

Question 6¶

The two performance metrics we care about for this project are Precision and Recall. Specifically I need a score above .3 for each of them.

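For my classifier, precision is the ratio of correct 'POI' predictions to all 'POI' predictions. In other words, how likely is it that a person I predict to be 'guilty' actually is guilty? Precision alone can be gamed: a classifier that only flags the few people it is most sure about rarely 'convicts' anyone falsely, but it also misses most POIs, which is why recall matters as well. Predicting 'innocent' for every person does not give perfect precision; with no 'POI' predictions at all, the ratio is 0/0 and undefined.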

True positive - Person predicted 'POI' from the features, and they are a POI
False positive - Person predicted 'POI' from the features, but they are not a POI
True negative - Person predicted not 'POI' from the features, and they are not a POI
False negative - Person predicted not 'POI' from the features, but they are a POI

Logic	Prediction	Actual
T-Pos	'guilty'	'guilty'
F-Pos	'guilty'	'innocent'
T-Neg	'innocent'	'innocent'
F-Neg	'innocent'	'guilty'

Precision is a ratio that penalizes me if I falsely convict innocent people.

Recall is a ratio that penalizes me if I let guilty people go free.

F1 combines these two metrics as their harmonic mean. Since this project required equal amounts of precision and recall, I optimized my GridSearchCV for the f1 scoring metric. This helped me hit the .3 requirement using GaussianNB.
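The three metrics follow directly from the confusion counts defined above. Here is a small sketch with made-up counts (6 true positives, 4 false positives, 12 false negatives); the function name and the numbers are illustrative only.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion counts.

    Returns 0.0 for a metric whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts: 6 POIs flagged correctly, 4 innocents flagged, 12 POIs missed.
p, r, f1 = precision_recall_f1(tp=6, fp=4, fn=12)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.600 recall=0.333 f1=0.429
```

In this example both metrics clear the .3 requirement, and F1 sits between them, closer to the lower of the two.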

	Missing values
loan_advances	97.26%
director_fees	88.36%
restricted_stock_deferred	87.67%
deferral_payments	73.29%
deferred_income	66.44%
long_term_incentive	54.79%
bonus	43.84%
shared_receipt_with_poi	41.10%
to_messages	41.10%
from_this_person_to_poi	41.10%
from_messages	41.10%
from_poi_to_this_person	41.10%
other	36.30%
expenses	34.93%
salary	34.93%
exercised_stock_options	30.14%
restricted_stock	24.66%
email_address	23.97%
total_payments	14.38%
total_stock_value	13.70%
poi	0.00%

	Feature	SelectKBest score
11	exercised_stock_options	28.908665
8	total_stock_value	21.349612
5	bonus	14.657780
0	salary	13.010581
3	total_payments	7.688710
6	restricted_stock_deferred	7.283996
17	restricted_stock	7.252413
19	has_email	6.106925
16	shared_receipt_with_poi	5.799246
7	deferred_income	5.578147
10	from_poi_to_this_person	4.046245
15	long_term_incentive	3.965588
4	loan_advances	3.881639
18	director_fees	3.033245
13	other	1.884540
12	from_messages	1.094739
14	from_this_person_to_poi	0.758045
9	expenses	0.452792
2	deferral_payments	0.377957
1	to_messages	0.143532