UD120: Final Project


Enron POI Classifier

Question 1

Goal of this project

The goal of this project is to use the data from the final_project dataset to create a POI identifier. Machine learning will be useful since there are many features in the dataset and we don't know how much (or if) those features affect the likelihood of someone being a POI.

Data Exploration

In [2]:
# imports used throughout this notebook
import pickle
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)

# Feature subsets for slicing (excluding the total column(s))
payments_features = ['salary', 'bonus', 'long_term_incentive', 'deferred_income', 'deferral_payments',
                     'loan_advances', 'other', 'expenses', 'director_fees']

stock_features = ['exercised_stock_options', 'restricted_stock', 'restricted_stock_deferred']

financial_features = payments_features + stock_features

email_features = ['to_messages', 'from_poi_to_this_person',
                  'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi']

poi_label = ['poi']

# replace 'NaN' strings with None in the dataset
for outer_key, inner_dict in data_dict.items():
    for k, v in inner_dict.items():
        if v == 'NaN':
            data_dict[outer_key][k] = None

df = pd.DataFrame.from_dict(data_dict,
                            orient='index'  # use outer dict keys as the row index
                            )
# Convert remaining None values to NaN (handles the email_address field)
df.fillna(value=np.nan, inplace=True)

Total Number of data points

In [3]:
print("Records: {0}\nFeatures: {1}".format(*df.shape))  # unpack shape tuple into formatted str
Records: 146
Features: 21

Allocation across classes (POI/non-POI)

In [4]:
df.poi.value_counts()
Out[4]:
False    128
True      18
Name: poi, dtype: int64
In [5]:
df.poi.value_counts(normalize=True)
Out[5]:
False    0.876712
True     0.123288
Name: poi, dtype: float64

Feature Names

In [6]:
print(*df.columns, sep='\n')  # one feature name per line
salary
to_messages
deferral_payments
total_payments
loan_advances
bonus
email_address
restricted_stock_deferred
deferred_income
total_stock_value
expenses
from_poi_to_this_person
exercised_stock_options
from_messages
other
from_this_person_to_poi
poi
long_term_incentive
shared_receipt_with_poi
restricted_stock
director_fees

Are there features with many missing values?

In [7]:
cm = sns.light_palette("red", as_cmap=True)

nan_percents = ((df.isna().sum() / len(df))
                .sort_values(ascending=False)
                .to_frame()
                .rename(columns={0: "Missing values"})
                .style
                .background_gradient(cmap=cm)
                .format("{:.2%}"))
nan_percents
Out[7]:
Missing values
loan_advances 97.26%
director_fees 88.36%
restricted_stock_deferred 87.67%
deferral_payments 73.29%
deferred_income 66.44%
long_term_incentive 54.79%
bonus 43.84%
shared_receipt_with_poi 41.10%
to_messages 41.10%
from_this_person_to_poi 41.10%
from_messages 41.10%
from_poi_to_this_person 41.10%
other 36.30%
expenses 34.93%
salary 34.93%
exercised_stock_options 30.14%
restricted_stock 24.66%
email_address 23.97%
total_payments 14.38%
total_stock_value 13.70%
poi 0.00%

Outlier Investigation

From the "Outliers" mini-project we learned that the biggest outlier was the record named 'TOTAL'. This was an issue resulting from the spreadsheet import of the data. This was discovered from visualization as shown below.

In [8]:
# salary by bonus plot before 'TOTAL' removal
sns.lmplot(x='salary', y='bonus', data=df, hue='poi', fit_reg=False)
plt.title("Salary by Bonus");
In [9]:
# the 'TOTAL' record has the highest salary
TOTAL = df.sort_values(by=['salary'], ascending=False, na_position='last').head(1)
print(TOTAL)

# drop the spreadsheet-total observation
df.drop(index='TOTAL', inplace=True)
In [10]:
# salary by bonus plot AFTER 'TOTAL' removal
sns.lmplot(x='salary', y='bonus', data=df, hue='poi', fit_reg=False)
plt.title("Salary by Bonus")
plt.savefig('sal_bonus_clean.png')
plt.show();

There are four other potential outliers here, but those are valid data points.

If there are any records where every feature other than 'poi' is NaN, I'll remove those too.

In [11]:
df.dropna(thresh=2, inplace=True)  # drop rows with fewer than 2 non-NaN values ('poi' is always present)

Lastly, we'll remove the record for 'THE TRAVEL AGENCY IN THE PARK', as it is neither a person nor a person of interest.

In [12]:
df.drop(index='THE TRAVEL AGENCY IN THE PARK', inplace=True)

# DF shape after outlier removal 
df.shape
Out[12]:
(143, 21)

Question 2

What features did you end up using in your POI identifier?

The final 'POI identifier' uses 'poi' (the label) together with 'exercised_stock_options', 'bonus', and 'total_stock_value' as the feature list for the classifier.

What selection process did you use to pick them?

I used an automated GridSearchCV search that tried feature subsets selected by SelectKBest, with k ranging from 1 up to the number of feature columns (20). The k-scores are shown in a table lower down.

Did you have to do any scaling? Why or why not?

In order to use the linear kernel in the support vector classifier, and to get better results overall, I performed scaling on the dataset with StandardScaler from the sklearn.preprocessing namespace. This helped deal with the disparity between large values like 'total_stock_value' and smaller values like email counts.
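As a rough illustration only (in the final setup the scaler is fitted inside the cross-validation pipeline described under "Automating feature selection"), the scaling step looks roughly like this, assuming X is the numeric feature DataFrame built from df:

# Sketch only: scaling the feature matrix (NaNs filled with 0 just for this sketch)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X.fillna(0))  # each column now has zero mean and unit variance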

Create own feature

New feature named 'has_email' created

Explain created feature

Created feature 'has_email' stores a boolean based on whether an email address is present (True) or absent (False) in the email_address field.

Explain rationale

'email_address' is currently a feature filled with strings (objects) and NaNs. It cannot be used in classifiers in a meaningful way. Translating it to a boolean makes it usable and potentially useful in ML classification.

Effect of feature measured

The effect of the created has_email feature can be seen in the Feature score table below, generated during classifier testing.

In [13]:
# New Feature - has_email - creation
df['has_email'] = df.email_address.notna()
df.has_email.head(2)
Out[13]:
ALLEN PHILLIP K     True
BADUM JAMES P      False
Name: has_email, dtype: bool

Automating feature selection

Using a pipeline together with GridSearchCV to do somewhat automated feature selection and parameter tuning. A sketch of the setup follows below.
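The full pipeline construction isn't reproduced here, so the following is only a minimal sketch under some assumptions: X is the feature DataFrame (roughly df minus the 'poi' label and the raw 'email_address' column, with NaNs filled), y = df['poi'], and the step names ('features', 'svm') follow the prefixes in the parameter grid shown in Question 4.

# Sketch only: pipeline + grid search setup consistent with the param_grid prefixes
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('scaler', StandardScaler()),          # scaling, as discussed in Question 2
    ('features', SelectKBest(f_classif)),  # k is tuned by the grid search
    ('svm', SVC()),                        # classifier under test
])

# the same grid shown in Question 4
param_grid = dict(features__k=np.arange(1, X.shape[1]),
                  svm__C=[0.1, 1, 10],
                  svm__kernel=['linear', 'rbf', 'sigmoid'],
                  svm__gamma=['scale', 'auto'])

grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=5)
grid_search.fit(X.fillna(0), y)  # NaNs filled with 0 only for this sketch

With this step ordering, grid_search.best_estimator_.steps[1][1] is the fitted SelectKBest step, which is where the feature scores in the table below come from.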

In [113]:
# SelectKBest feature scores from the best estimator
feature_scores = pd.DataFrame(list(zip(X.columns, grid_search.best_estimator_.steps[1][1].scores_)),
                              columns=['feature', 'score'])
feature_scores.sort_values(by='score', ascending=False)
Out[113]:
feature score
11 exercised_stock_options 28.908665
8 total_stock_value 21.349612
5 bonus 14.657780
0 salary 13.010581
3 total_payments 7.688710
6 restricted_stock_deferred 7.283996
17 restricted_stock 7.252413
19 has_email 6.106925
16 shared_receipt_with_poi 5.799246
7 deferred_income 5.578147
10 from_poi_to_this_person 4.046245
15 long_term_incentive 3.965588
4 loan_advances 3.881639
18 director_fees 3.033245
13 other 1.884540
12 from_messages 1.094739
14 from_this_person_to_poi 0.758045
9 expenses 0.452792
2 deferral_payments 0.377957
1 to_messages 0.143532

Question 3


Test at least 2 ML algorithms

POI identifiers were tested with a support vector classifier (SVC()) and Gaussian Naive Bayes (GaussianNB()).

Compare Performance

Classifier    Precision    Recall
SVC           0.56446      0.16200
GaussianNB    0.45092      0.34450

Even with parameter tuning, SVC wasn't able to hit the .3 benchmark for recall. The GaussianNB classifier was, though, so it gets the honor of becoming a pickle (how exciting!).

What algorithm was chosen?

GaussianNB was chosen since it fulfills the performance metric requirements.
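For reference, persisting the chosen classifier might look roughly like this; the feature list matches Question 2, while X, y, and the output filename are assumptions for the sketch.

# Sketch only: train the chosen classifier on the selected features and pickle it
import pickle
from sklearn.naive_bayes import GaussianNB

selected = ['exercised_stock_options', 'bonus', 'total_stock_value']
clf = GaussianNB().fit(X[selected].fillna(0), y)  # NaNs filled with 0 just for this sketch

with open("my_classifier.pkl", "wb") as clf_file:  # filename is an assumption
    pickle.dump(clf, clf_file)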

Question 4


Define parameter tuning

Parameter tuning is the process in which different parameter value combinations are tried to improve the performance (speed, accuracy metrics, etc.) of ML algorithms.

Explain why parameter tuning is important

Parameter tuning is important because it can improve things like accuracy or speed, and help avoid problems like overfitting.

How I tuned my parameters

My parameters were tuned via GridSearchCV with a parameter grid. For each parameter in the grid I supplied a list of values I wanted tested. GridSearchCV then tried each combination of parameters before returning the estimator with the parameter value combination that produced the best cross-validation score (a short sketch of inspecting the result follows the grid below).

Which parameters did you tune?

In the pipeline part of my grid search I tuned:

  • K value for SelectKBest
  • C value for SVM
  • kernel for SVM
  • gamma for SVM
# Parameter grid for GridSearchCV
param_grid = dict(features__k=np.arange(1, X.shape[1]),
                  svm__C=[0.1, 1, 10],
                  svm__kernel=['linear', 'rbf', 'sigmoid'],
                  svm__gamma=['scale', 'auto'])
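After the grid search is fitted (see the sketch under "Automating feature selection"), the winning combination can be read back, for example:

# Sketch only: inspecting the fitted grid search
print(grid_search.best_params_)    # the chosen k, C, kernel, and gamma
print(grid_search.best_score_)     # best mean cross-validated score
clf = grid_search.best_estimator_  # the pipeline refit with those parameters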

Question 5


What is validation and why is it important?

Validation involves dividing a dataset into a training set and a testing set. This is done because training and testing on the same full dataset gives an unrealistic classifier/regression accuracy. The accuracy is specious because the model has already seen the outcomes of the records it's trying to classify. Splitting the dataset into training and testing sets handles this issue.

What is a classic mistake if validation is done incorrectly?

If done incorrectly, accuracies from validation testing may be extremely high or low. This can occur when the dataset is ordered such that observations with the same outcome are grouped together (i.e. all True outcomes listed sequentially, then all False outcomes). ShuffleSplit or StratifiedKFold can prevent this.
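As a rough illustration (X and y as assumed in the earlier sketches), a stratified, shuffled split avoids that ordering problem and keeps the POI/non-POI ratio similar in every fold:

# Sketch only: stratified, shuffled 5-fold splitting
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    # fit on the training fold and score on the test fold here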

How did I validate my analysis?

My analysis was validated during GridSearchCV. As the name suggests, GridSearchCV incorporates cross-validation as part of its functionality.

Note the specific type of cross validation performed.

By specifying 5 for the 'cv' parameter of GridSearchCV, a StratifiedKFold with 5 folds is used during the grid search. This divides the dataset into 5 'folds', each containing roughly the same percentage of samples from each target class ('poi'). This is useful since our target distribution in the dataset is heavily imbalanced. The folds are then iterated over, with 4 used for training and one for testing, until every fold has served as the test set once. The results are then averaged.

grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=5, iid=True)

Question 6


The two performance metrics we care about for this project are Precision and Recall. Specifically, I need a score above .3 for each of them.

Precision in this case means, for my classifier, the ratio of correct 'POI' predictions to all 'POI' predictions made. Or: how likely is it that someone I predict as 'guilty' is actually guilty? Very high precision could be obtained by almost never predicting 'POI', since I would rarely 'convict' someone falsely (though recall would then suffer).

  • True positive - Person predicted 'POI' from the features, and they are a POI
  • False positive - Person predicted 'POI' from the features, but they are not a POI
  • True Negative - Person predicted not 'POI' from the features, and they are not a POI
  • False Negative - Person predicted not 'POI' from the features, but they are a POI
Logic   Prediction    Actual
T-Pos   'guilty'      'guilty'
F-Pos   'guilty'      'innocent'
T-Neg   'innocent'    'innocent'
F-Neg   'innocent'    'guilty'

Precision is a ratio that penalizes me if I falsely convict innocent people.

Recall is a ratio that penalizes me if I let guilty people go free.

f1 is a combined measure of these two performance metrics. Since this project required both precision and recall to clear .3, I optimized my GridSearchCV for the f1 scoring metric, which weights them equally. This helped me hit the .3 requirement using GaussianNB.
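As a minimal sketch of how these scores can be computed with sklearn.metrics (the labels below are made up for illustration, not the project's actual results):

# Sketch only: computing precision, recall, and f1 from hypothetical predictions
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [True, False, True, False, False, True]   # hypothetical actual POI labels
y_pred = [True, False, False, False, True, True]   # hypothetical classifier predictions

print("Precision: {:.3f}".format(precision_score(y_true, y_pred)))  # TP / (TP + FP) = 2/3
print("Recall:    {:.3f}".format(recall_score(y_true, y_pred)))     # TP / (TP + FN) = 2/3
print("f1:        {:.3f}".format(f1_score(y_true, y_pred)))         # harmonic mean of the two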