The main purpose of this file is to estimate classifier's parameters using grid search with cross-validation

For Random Forest, two of the most important parameters are "max_features", and "n_estimators".

It turns out that "max_features" should always be "sqrt". (as expected for categorical data) "n_estimators" is changing between data sets:
03 to 04 Data: n_estimators = 70;
03 to 05 Data: 40;
03 to 06 Data: 110;
03 to 07 Data: 80;
03 to 08 Data: 80;

In [1]:
import pandas as pd
import numpy as np

As an example, I am using March to May data to predict June, so I need to train a model based on March to May data

In [2]:
# load March to May data (cleaned data)
DF00 = pd.read_csv('./FB_data_ML_with_uid_2015_35.csv')
In [3]:
DF00.drop('Unnamed: 0',1,inplace=True)
print DF00.shape
DF00.head()
(229378, 6)
Out[3]:
property_userId Goal Gender Habit Day_of_Week Action
0 fabbc998-2a02-46b8-8442-45445096913b Energy male Drink Water 0 0
1 fabbc998-2a02-46b8-8442-45445096913b Energy male Drink Water 0 0
2 fabbc998-2a02-46b8-8442-45445096913b Energy male Meditate 0 0
3 fabbc998-2a02-46b8-8442-45445096913b Energy male Drink Water 0 0
4 fabbc998-2a02-46b8-8442-45445096913b Energy male Clean & Tidy up 0 0
In [4]:
# I am mapping categorical variable to integer numbers.
# The next commented out line is to map userID, but I will not use user ID as a predictor.
# uid_dict =  dict(zip(set(list(DF00['property_userId'])), range(len(set(list(DF00['property_userId']))))))
Goal_dict = dict(zip(set(list(DF00['Goal'])), range(len(set(list(DF00['Goal']))))))
Gender_dict = dict(zip(set(list(DF00['Gender'])),range(len(set(list(DF00['Gender']))))))
Habit_dict = dict(zip(set(list(DF00['Habit'])),range(len(set(list(DF00['Habit']))))))
print Goal_dict
print Gender_dict
print Habit_dict
{'old_user': 0, 'Weight': 1, 'Energy': 2, 'InputNAN': 3, 'Focus': 4, 'Sleep': 5}
{'InputNAN': 0, 'male': 1, 'other': 2, 'female': 3}
{'Write in my Journal': 0, 'Yoga': 1, 'Disconnect & Create': 2, 'Stretch': 3, 'Reach to Friends': 4, 'Morning Pages': 5, 'Floss': 6, 'Weigh myself': 7, 'Meditate': 8, 'Drink Water': 9, 'Get Inspired': 10, 'Exercise': 11, 'Groom Myself': 12, 'Power Nap': 13, 'Read': 14, 'Take Medicine': 15, 'Clean & Tidy up': 16, 'Eat a Great Breakfast': 17, 'Take Vitamins': 18, 'Eat More Fruit & Vegetables': 19, 'Study': 20, 'I feel Great Today!': 21, 'Celebrate!': 22, 'Shower': 23, 'Darker, Quieter, Cooler': 24, 'Be Grateful': 25, 'Call Mother & Father': 26, 'Walk': 27, 'Work on a secret project': 28, 'Drink Tea': 29}
In [5]:
# DF00['UID_int'] = DF00['property_userId'].map(uid_dict);
DF00['Goal_int'] = DF00['Goal'].map(Goal_dict);
DF00['Gender_int'] = DF00['Gender'].map(Gender_dict);
DF00['Habit_int'] = DF00['Habit'].map(Habit_dict);
In [6]:
# This is how the new data frame looks like
DF00.head()
Out[6]:
property_userId Goal Gender Habit Day_of_Week Action Goal_int Gender_int Habit_int
0 fabbc998-2a02-46b8-8442-45445096913b Energy male Drink Water 0 0 2 1 9
1 fabbc998-2a02-46b8-8442-45445096913b Energy male Drink Water 0 0 2 1 9
2 fabbc998-2a02-46b8-8442-45445096913b Energy male Meditate 0 0 2 1 8
3 fabbc998-2a02-46b8-8442-45445096913b Energy male Drink Water 0 0 2 1 9
4 fabbc998-2a02-46b8-8442-45445096913b Energy male Clean & Tidy up 0 0 2 1 16
In [7]:
import time
start_time = time.time()

from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report

X = DF00[['Gender_int','Goal_int', 'Habit_int', 'Day_of_Week']]
y = DF00['Action']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2015)
print X_train.shape, X_test.shape, y_train.shape, y_test.shape

# The following line is to estimate "n_estimators" and "max_features", and it will take a while, depends on data size.
tuned_parameters = [{'n_estimators':range(40,130,10), 'max_features': ['sqrt',None]}]

# if you don't want to estimate "max_features", feel free to use this line, instead the one above.
# tuned_parameters = [{'n_estimators':range(40,130,10)}]

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    classifier = RandomForestClassifier(random_state=10, max_depth=None, 
                                        min_samples_split=1, class_weight = 'auto')
    clf = GridSearchCV(classifier, tuned_parameters, cv = 5, n_jobs=4)
    clf.fit(X_train, y_train)
    print("Best parameters set found on development set:")
    print(clf.best_params_)
    print("Grid scores on development set:")
    for params, mean_score, scores in clf.grid_scores_:
        print("%0.3f (+/-%0.03f) for %r"
              % (mean_score, scores.std() * 2, params))
    print("Detailed classification report:")
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    
print("--- %s seconds ---" % (time.time() - start_time))
(160564, 4) (68814, 4) (160564,) (68814,)
# Tuning hyper-parameters for precision
Best parameters set found on development set:
{'max_features': 'sqrt', 'n_estimators': 40}
Grid scores on development set:
0.704 (+/-0.009) for {'max_features': 'sqrt', 'n_estimators': 40}
0.704 (+/-0.007) for {'max_features': 'sqrt', 'n_estimators': 50}
0.703 (+/-0.010) for {'max_features': 'sqrt', 'n_estimators': 60}
0.703 (+/-0.006) for {'max_features': 'sqrt', 'n_estimators': 70}
0.702 (+/-0.003) for {'max_features': 'sqrt', 'n_estimators': 80}
0.702 (+/-0.004) for {'max_features': 'sqrt', 'n_estimators': 90}
0.702 (+/-0.006) for {'max_features': 'sqrt', 'n_estimators': 100}
0.701 (+/-0.008) for {'max_features': 'sqrt', 'n_estimators': 110}
0.701 (+/-0.007) for {'max_features': 'sqrt', 'n_estimators': 120}
0.704 (+/-0.009) for {'max_features': None, 'n_estimators': 40}
0.704 (+/-0.007) for {'max_features': None, 'n_estimators': 50}
0.703 (+/-0.010) for {'max_features': None, 'n_estimators': 60}
0.703 (+/-0.006) for {'max_features': None, 'n_estimators': 70}
0.702 (+/-0.003) for {'max_features': None, 'n_estimators': 80}
0.702 (+/-0.004) for {'max_features': None, 'n_estimators': 90}
0.702 (+/-0.006) for {'max_features': None, 'n_estimators': 100}
0.701 (+/-0.008) for {'max_features': None, 'n_estimators': 110}
0.701 (+/-0.007) for {'max_features': None, 'n_estimators': 120}
Detailed classification report:
The model is trained on the full development set.
The scores are computed on the full evaluation set.
             precision    recall  f1-score   support

          0       0.96      0.71      0.81     64471
          1       0.11      0.53      0.18      4343

avg / total       0.90      0.69      0.77     68814

# Tuning hyper-parameters for recall
Best parameters set found on development set:
{'max_features': 'sqrt', 'n_estimators': 40}
Grid scores on development set:
0.704 (+/-0.009) for {'max_features': 'sqrt', 'n_estimators': 40}
0.704 (+/-0.007) for {'max_features': 'sqrt', 'n_estimators': 50}
0.703 (+/-0.010) for {'max_features': 'sqrt', 'n_estimators': 60}
0.703 (+/-0.006) for {'max_features': 'sqrt', 'n_estimators': 70}
0.702 (+/-0.003) for {'max_features': 'sqrt', 'n_estimators': 80}
0.702 (+/-0.004) for {'max_features': 'sqrt', 'n_estimators': 90}
0.702 (+/-0.006) for {'max_features': 'sqrt', 'n_estimators': 100}
0.701 (+/-0.008) for {'max_features': 'sqrt', 'n_estimators': 110}
0.701 (+/-0.007) for {'max_features': 'sqrt', 'n_estimators': 120}
0.704 (+/-0.009) for {'max_features': None, 'n_estimators': 40}
0.704 (+/-0.007) for {'max_features': None, 'n_estimators': 50}
0.703 (+/-0.010) for {'max_features': None, 'n_estimators': 60}
0.703 (+/-0.006) for {'max_features': None, 'n_estimators': 70}
0.702 (+/-0.003) for {'max_features': None, 'n_estimators': 80}
0.702 (+/-0.004) for {'max_features': None, 'n_estimators': 90}
0.702 (+/-0.006) for {'max_features': None, 'n_estimators': 100}
0.701 (+/-0.008) for {'max_features': None, 'n_estimators': 110}
0.701 (+/-0.007) for {'max_features': None, 'n_estimators': 120}
Detailed classification report:
The model is trained on the full development set.
The scores are computed on the full evaluation set.
             precision    recall  f1-score   support

          0       0.96      0.71      0.81     64471
          1       0.11      0.53      0.18      4343

avg / total       0.90      0.69      0.77     68814

--- 896.787469864 seconds ---
In [ ]:
### end of parameters estimation ###