The goal of this project is to use the data from the final_project dataset to create a POI identifier. Machine learning is useful here because the dataset has many features and we don't know how much (or whether) each one affects the likelihood of someone being a POI.
import pickle

# Load the dictionary containing the dataset
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)
# Feature subsets for slicing (excluding the total columns)
payments_features = ['salary', 'bonus', 'long_term_incentive', 'deferred_income', 'deferral_payments',
                     'loan_advances', 'other', 'expenses', 'director_fees']
stock_features = ['exercised_stock_options', 'restricted_stock', 'restricted_stock_deferred']
financial_features = payments_features + stock_features
email_features = ['to_messages', 'from_poi_to_this_person',
                  'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi']
poi_label = ['poi']
# Replace 'NaN' strings with None in the dataset
for outer_key, inner_dict in data_dict.items():
    for k, v in inner_dict.items():
        if v == 'NaN':
            data_dict[outer_key][k] = None
df = pd.DataFrame.from_dict(data_dict,
                            orient='index'  # use outer dict keys as the row index
                            )
# Convert remaining None values (including those in email_address) to NaN
df.fillna(value=np.nan, inplace=True)
print("Records: {0}\nFeatures: {1}".format(*df.shape)) # unpack shape tuple into formatted str
df.poi.value_counts()
df.poi.value_counts(normalize=True)
print(*df.columns, sep="\n")  # one column name per line
cm = sns.light_palette("red", as_cmap=True)
# Share of missing values per column, highest first, shaded by severity
nan_percents = (
    (df.isna().sum() / len(df))
    .sort_values(ascending=False)
    .to_frame()
    .rename(columns={0: "Missing values"})
    .style.background_gradient(cmap=cm)
    .format("{:.2%}")
)
nan_percents
From the "Outliers" mini-project we learned that the biggest outlier was the record named 'TOTAL'. This was an issue resulting from the spreadsheet import of the data. This was discovered from visualization as shown below.
# salary by bonus plot before 'TOTAL' removal
sns.lmplot(x='salary', y='bonus', data=df, hue='poi',fit_reg=False)
plt.title("Salary by Bonus");
# the 'TOTAL' record
TOTAL = df.sort_values(by=['salary'], ascending=False, na_position='last').head(1)
TOTAL
# dropping computed 'TOTAL' observation
df.drop(index='TOTAL', inplace=True)
# salary by bonus plot AFTER 'TOTAL' removal
sns.lmplot(x='salary', y='bonus', data=df, hue='poi',fit_reg=False)
plt.title("Salary by Bonus")
plt.savefig('sal_bonus_clean.png')
plt.show();
There are four other potential outliers here, but those are valid data points.
If there are any records where all features are NaN, I'll remove those too.
df.dropna(thresh=2, inplace=True)
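As a minimal sketch of what `thresh=2` does, assume a toy frame (the names and values below are made up, not taken from the dataset) in which `poi` is always filled in. A row whose only non-missing value is `poi` has fewer than two non-NaN entries, so it is dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'poi' is always filled, the other columns may be NaN
toy = pd.DataFrame(
    {
        "poi": [False, True, False],
        "salary": [100.0, np.nan, np.nan],
        "bonus": [np.nan, 50.0, np.nan],
    },
    index=["PERSON A", "PERSON B", "PERSON C"],
)

# Keep only rows that have at least 2 non-NaN values
cleaned = toy.dropna(thresh=2)
print(cleaned.index.tolist())
```

Here "PERSON C" has only the `poi` value, so it is the one row removed.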
Lastly, we'll remove the record for 'THE TRAVEL AGENCY IN THE PARK', as it is neither a person nor a person of interest.
df.drop(index='THE TRAVEL AGENCY IN THE PARK', inplace=True)
# DF shape after outlier removal
df.shape
The final 'POI identifier' uses 'exercised_stock_options', 'bonus', and 'total_stock_value' as features in the classifier, with 'poi' as the target label.
I used an automated GridSearchCV search that tried the top K features for each K value, ranging from 1 to the number of columns (20). The K-scores are shown in a table lower down.
To use the linear kernel in the support vector classifier, and to get better results overall, I scaled the dataset with StandardScaler from the sklearn.preprocessing module. This helped deal with the disparity between large values like 'total_stock_value' and smaller values like email counts.
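The effect of scaling can be sketched on a small made-up array with one large-magnitude column and one small-magnitude column. The values are hypothetical and only illustrate how `StandardScaler` brings both columns onto a comparable scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: a stock-like column and an email-count-like column
X_toy = np.array([
    [1_000_000.0, 10.0],
    [2_500_000.0, 40.0],
    [500_000.0, 25.0],
    [4_000_000.0, 5.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

# After scaling, each column has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

Without this step, a distance- or margin-based model such as an SVC would be dominated by whichever column has the largest raw values.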
New feature created: 'has_email'
The created feature 'has_email' stores a boolean indicating whether an email address is present (True) or absent (False) in the email_address field.
'email_address' is currently a feature filled with strings (objects) and NaNs. It cannot be used in classifiers in a meaningful way. Translating it to a boolean makes it usable and potentially useful in ML classification.
The effect of the created has_email feature can be seen in the feature score table below, generated during classifier testing.
# New Feature - has_email - creation
df['has_email'] = df.email_address.notna()
df.has_email.head(4)
I used a pipeline together with GridSearchCV to do somewhat automated feature selection and parameter tuning.
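A pipeline of this kind might look like the sketch below. The step names `features` and `svm` match the parameter names used in this notebook, but the `scaler` step name, the use of `SelectKBest` with `f_classif`, the synthetic data, and the reduced grid are all assumptions made for illustration, not the exact setup used here.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (not the project dataset)
X_demo, y_demo = make_classification(
    n_samples=120, n_features=6, n_informative=3,
    weights=[0.8, 0.2], random_state=0,
)

# Assumed step layout: scaling -> univariate feature selection -> classifier
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("features", SelectKBest(score_func=f_classif)),
    ("svm", SVC()),
])

# A deliberately small grid so the sketch runs quickly
demo_grid = {
    "features__k": [2, 4, 6],
    "svm__C": [0.1, 1, 10],
}

demo_search = GridSearchCV(demo_pipeline, param_grid=demo_grid, cv=5, scoring="f1")
demo_search.fit(X_demo, y_demo)
print(demo_search.best_params_)
```

Because the scaler and the selector sit inside the pipeline, they are refit on the training folds only during each cross-validation split, which keeps information from the held-out fold out of the preprocessing.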
# List feature scores from the selection step (index 1 of the pipeline), highest first
pd.DataFrame(list(zip(X.columns, grid_search.best_estimator_.steps[1][1].scores_))).sort_values(by=1, ascending=False)
POI identifiers were tested with a support vector classifier (SVC()) and Gaussian Naive Bayes (GaussianNB()).
Classifier | Precision | Recall |
---|---|---|
SVC | .56446 | .16200 |
GaussianNB | .45092 | .34450 |
Even with parameter tuning, SVC wasn't able to hit the .3 benchmark for recall. The GaussianNB classifier was, though, so it gets the honor of becoming a pickle (how exciting!).
GaussianNB was chosen since it fulfills the performance metric requirements.
Parameter tuning is the process of trying different combinations of parameter values to improve the performance (speed, accuracy metrics, etc.) of ML algorithms.
Parameter tuning is important because it can improve things like accuracy or speed and help avoid problems like overfitting.
My parameters were tuned via GridSearchCV with a parameter grid. For each parameter in the grid I listed the values I wanted tested. GridSearchCV then tried each combination of parameters before returning the estimator with the combination that produced the best score during cross-validation.
In the pipeline part of my gridsearch I tuned:
# Parameter grid for GridSearchCV
# Note: np.arange excludes the stop value, so k runs from 1 to X.shape[1] - 1
param_grid = dict(features__k=np.arange(1, X.shape[1]),
                  svm__C=[0.1, 1, 10],
                  svm__kernel=['linear', 'rbf', 'sigmoid'],
                  svm__gamma=['scale', 'auto'])
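To get a feel for how large this search is, the combinations can be enumerated by hand. This is a rough sketch that assumes 20 candidate columns (so `k` takes the values 1 to 19) and 5 cross-validation folds. The name `demo_param_grid` is hypothetical.

```python
from itertools import product

# Assumption: 20 candidate columns, so k takes the values 1..19
demo_param_grid = {
    "features__k": list(range(1, 20)),
    "svm__C": [0.1, 1, 10],
    "svm__kernel": ["linear", "rbf", "sigmoid"],
    "svm__gamma": ["scale", "auto"],
}

# Every combination of one value per parameter
combinations = [
    dict(zip(demo_param_grid, values))
    for values in product(*demo_param_grid.values())
]

n_folds = 5
print(len(combinations))            # candidate parameter settings
print(len(combinations) * n_folds)  # model fits during cross-validation
```

Under these assumptions that is 19 × 3 × 3 × 2 = 342 candidate settings, each fitted once per fold.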
Validation involves dividing a dataset into a training set and a testing set. This is done because training and testing on the same full dataset gives an unrealistically high accuracy: the model has already seen the outcomes of the records it is trying to classify. Splitting the dataset into training and testing sets handles this issue.
If done incorrectly, scores from validation may be misleadingly high or low. This can occur when the dataset is ordered such that observations with the same outcome are grouped together (i.e. all True outcomes listed sequentially, then all False outcomes). ShuffleSplit or StratifiedKFold can prevent this.
My analysis was validated during GridSearchCV. As the name suggests, GridSearchCV incorporates cross-validation as part of its functionality.
By specifying 5 in the 'cv' parameter of GridSearchCV, a StratifiedKFold with 5 folds is used during the grid search. This divides the dataset into 5 folds, each containing roughly the same percentage of samples from each target class ('poi'). This is useful since the target distribution in the dataset is heavily imbalanced. Each fold is used once as the test set while the other 4 are used for training, and the 5 results are then averaged.
# The `iid` argument was removed in scikit-learn 0.24, so it is omitted here
grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=5)
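The stratification described above can be illustrated on made-up labels. The sketch assumes a toy target with 10 positives out of 100 samples (not the real class counts) and checks how many positives land in each test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced labels: 10 positives out of 100 samples
y_labels = np.array([1] * 10 + [0] * 90)
X_dummy = np.zeros((len(y_labels), 1))  # feature values do not affect the split

skf = StratifiedKFold(n_splits=5)
positives_per_fold = [
    int(y_labels[test_idx].sum()) for _, test_idx in skf.split(X_dummy, y_labels)
]
print(positives_per_fold)
```

Each test fold receives the same share of the minority class, so no fold ends up without any positive examples to score against.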
The two performance metrics we care about for this project are Precision and Recall. Specifically I need a score above .3 for each of them.
Precision in this case means, for my classifier, the ratio of correct 'POI' predictions to all 'POI' predictions it made. Or, how likely is it that someone I predict to be 'guilty' is actually guilty? Predicting 'innocent' for every person would never falsely 'convict' anyone, but it would also make no positive predictions at all, so precision would be undefined and recall would be 0.
Logic | Prediction | Actual |
---|---|---|
T-Pos | 'guilty' | 'guilty' |
F-Pos | 'guilty' | 'innocent' |
T-Neg | 'innocent' | 'innocent' |
F-Neg | 'innocent' | 'guilty' |
Precision is a ratio that penalizes me if I falsely convict innocent people.
Recall is a ratio that penalizes me if I let guilty people go free.
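Both ratios follow directly from the four outcome counts in the table above. The sketch below uses made-up labels (True stands for 'guilty', False for 'innocent') purely to show the arithmetic.

```python
# Made-up labels: True means 'guilty' (POI), False means 'innocent'
actual = [True, True, True, False, False, False, False, False]
predicted = [True, False, False, True, False, False, False, False]

pairs = list(zip(predicted, actual))
tp = sum(p and a for p, a in pairs)          # predicted guilty, actually guilty
fp = sum(p and not a for p, a in pairs)      # predicted guilty, actually innocent
fn = sum(not p and a for p, a in pairs)      # predicted innocent, actually guilty
tn = sum(not p and not a for p, a in pairs)  # predicted innocent, actually innocent

precision = tp / (tp + fp)  # share of 'guilty' predictions that were right
recall = tp / (tp + fn)     # share of truly guilty people who were caught
print(tp, fp, fn, tn, precision, recall)
```

In this toy case one of the two 'guilty' predictions is correct (precision 0.5), while only one of the three guilty people is caught (recall of one third).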
F1 is a combined measure of these performance metrics (their harmonic mean). Since this project weighs precision and recall equally, I optimized my GridSearchCV for the f1 scoring metric. This helped me hit the .3 requirement using GaussianNB.
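As a quick check of how the harmonic mean behaves, F1 can be computed from the precision and recall values reported in the classifier table. The helper name `f1_score_from` is hypothetical and is only used for this illustration.

```python
def f1_score_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs reported in the classifier comparison table
reported = {
    "SVC": (0.56446, 0.16200),
    "GaussianNB": (0.45092, 0.34450),
}

f1_by_model = {name: f1_score_from(p, r) for name, (p, r) in reported.items()}
for name, score in f1_by_model.items():
    print(f"{name}: F1 = {score:.3f}")
```

The harmonic mean is pulled toward the lower of the two values, so the SVC's low recall drags its F1 below GaussianNB's even though its precision is higher.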