I'm a night owl. I naturally enjoy staying up. That being said, over the years I've had to reduce how late I stay up in order to get enough sleep. I find that eight hours is my sweet sleep spot, and getting less than that really does affect my work and my life.
I was interested in whether, in a sample of 10,000 adults, a set of surveyed variables could predict the sleep-quality class each adult reported (Poor, Fair, Good, Excellent).
I want to be strict about my inference because, unlike a traditional statistical test, I am not starting with a hypothesis ("Does caffeine affect sleep?"). Instead I'm throwing in a bunch of variables, using LassoCV to select the best ones, and using that model on my test set. That workflow runs against the assumptions of classical statistics.
"The classical inferential theory of mathematical statistics is based on the philosophy that all the models to fit, all the hypotheses to test, and all the parameters to do inference for are fixed prior to seeing the data. This is not how statistics is practiced. The analyst often explores the data to find the right model to fit to the data, the right hypothesis to test, and so on. As Ronald Coase once said, 'if you torture the data long enough, it will confess.'"
(Post-Selection Inference, by Arun K. Kuchibhotla, John E. Kolassa, and Todd A. Kuffner)
To deal with this issue, I'm adding a method used for post-selection inference called sample splitting: I split the data in half, use the first half for selection, and use the second half for inference.
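In sketch form, the idea looks like this (a minimal sketch on synthetic placeholder data, not the sleep dataset; the real split happens later in the notebook):
import numpy as np
from sklearn.model_selection import train_test_split
# Sample splitting in miniature: one half selects, the other half infers
rng = np.random.default_rng(0)
X_all = rng.normal(size=(100, 5))      # toy data: 100 rows, 5 candidate features
y_all = rng.integers(0, 4, size=100)   # toy 4-class target
X_select, X_infer, y_select, y_infer = train_test_split(
    X_all, y_all, test_size=0.5, random_state=0)
# Selection (e.g., LassoCV) uses only (X_select, y_select); inference and
# evaluation use only (X_infer, y_infer), so the selection step cannot
# leak into the inferential results.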
First Steps
First, read in the libraries.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
Read in the data set. This comes from Kaggle. The data are synthetic but reflect real-world patterns.
sleep = pd.read_csv("synthetic_coffee_health_10000.csv")
sleep
 | ID | Age | Gender | Country | Coffee_Intake | Caffeine_mg | Sleep_Hours | Sleep_Quality | BMI | Heart_Rate | Stress_Level | Physical_Activity_Hours | Health_Issues | Occupation | Smoking | Alcohol_Consumption
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 40 | Male | Germany | 3.5 | 328.1 | 7.5 | Good | 24.9 | 78 | Low | 14.5 | NaN | Other | 0 | 0 |
1 | 2 | 33 | Male | Germany | 1.0 | 94.1 | 6.2 | Good | 20.0 | 67 | Low | 11.0 | NaN | Service | 0 | 0 |
2 | 3 | 42 | Male | Brazil | 5.3 | 503.7 | 5.9 | Fair | 22.7 | 59 | Medium | 11.2 | Mild | Office | 0 | 0 |
3 | 4 | 53 | Male | Germany | 2.6 | 249.2 | 7.3 | Good | 24.7 | 71 | Low | 6.6 | Mild | Other | 0 | 0 |
4 | 5 | 32 | Female | Spain | 3.1 | 298.0 | 5.3 | Fair | 24.1 | 76 | Medium | 8.5 | Mild | Student | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9995 | 9996 | 50 | Female | Japan | 2.1 | 199.8 | 6.0 | Fair | 30.5 | 50 | Medium | 10.1 | Moderate | Healthcare | 0 | 1 |
9996 | 9997 | 18 | Female | UK | 3.4 | 319.2 | 5.8 | Fair | 19.1 | 71 | Medium | 11.6 | Mild | Service | 0 | 0 |
9997 | 9998 | 26 | Male | China | 1.6 | 153.4 | 7.1 | Good | 25.1 | 66 | Low | 13.7 | NaN | Student | 1 | 1 |
9998 | 9999 | 40 | Female | Finland | 3.4 | 327.1 | 7.0 | Good | 19.3 | 80 | Low | 0.1 | NaN | Student | 0 | 0 |
9999 | 10000 | 42 | Female | Brazil | 2.9 | 277.5 | 6.4 | Good | 28.1 | 72 | Low | 9.8 | NaN | Student | 1 | 0 |
10000 rows × 16 columns
Capture the variables that I want to potentially use in the model.
sleep = sleep[['Coffee_Intake', 'Age', 'BMI', 'Heart_Rate', 'Stress_Level','Alcohol_Consumption', 'Physical_Activity_Hours', 'Sleep_Quality']]
Confirm that two of the variables are categorical and not continuous.
sleep['Stress_Level'].unique()
array(['Low', 'Medium', 'High'], dtype=object)
sleep['Sleep_Quality'].unique()
array(['Good', 'Fair', 'Excellent', 'Poor'], dtype=object)
Pre-Processing
Declare the following variables:
- X as the Feature Matrix (features of sleep without the response)
- y as the response vector (target)
X_table = sleep[['Stress_Level','Coffee_Intake', 'Age', 'BMI', 'Heart_Rate', 'Alcohol_Consumption', 'Physical_Activity_Hours']]
X = sleep[['Stress_Level','Coffee_Intake', 'Age', 'BMI', 'Heart_Rate','Alcohol_Consumption', 'Physical_Activity_Hours']].values
X
array([['Low', 3.5, 40, ..., 78, 0, 14.5],
       ['Low', 1.0, 33, ..., 67, 0, 11.0],
       ['Medium', 5.3, 42, ..., 59, 0, 11.2],
       ...,
       ['Low', 1.6, 26, ..., 66, 1, 13.7],
       ['Low', 3.4, 40, ..., 80, 0, 0.1],
       ['Low', 2.9, 42, ..., 72, 0, 9.8]], dtype=object)
Convert the Stress_Level categories to numeric codes using scikit-learn's LabelEncoder, overriding classes_ so the codes follow the natural Low < Medium < High order rather than alphabetical order.
from sklearn import preprocessing
# Encode Stress_Level ordinally; overriding classes_ forces the
# Low=0, Medium=1, High=2 order instead of LabelEncoder's alphabetical order
le_SL = preprocessing.LabelEncoder()
le_SL.fit(['Low', 'Medium', 'High'])
le_SL.classes_ = np.array(['Low', 'Medium', 'High'])
X[:,0] = le_SL.transform(X[:,0])
X
array([[0, 3.5, 40, ..., 78, 0, 14.5],
       [0, 1.0, 33, ..., 67, 0, 11.0],
       [1, 5.3, 42, ..., 59, 0, 11.2],
       ...,
       [0, 1.6, 26, ..., 66, 1, 13.7],
       [0, 3.4, 40, ..., 80, 0, 0.1],
       [0, 2.9, 42, ..., 72, 0, 9.8]], dtype=object)
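For reference, the same ordinal mapping could be done directly in pandas with an ordered Categorical. This is just an alternative sketch, not what the rest of the notebook uses:
# Alternative ordinal encoding: Low -> 0, Medium -> 1, High -> 2
stress_codes = pd.Categorical(
    sleep['Stress_Level'], categories=['Low', 'Medium', 'High'], ordered=True
).codes
stress_codes[:5]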
Now we can define the target variable.
y = sleep['Sleep_Quality']
y
 | Sleep_Quality
---|---
0 | Good
1 | Good
2 | Fair
3 | Good
4 | Fair
... | ...
9995 | Fair
9996 | Fair
9997 | Good
9998 | Good
9999 | Good
10000 rows × 1 columns
# Same approach for the target: force the order Poor=0, Fair=1, Good=2, Excellent=3
le_SQ = preprocessing.LabelEncoder()
le_SQ.fit(['Poor', 'Fair', 'Good', 'Excellent'])
le_SQ.classes_ = np.array(['Poor', 'Fair', 'Good', 'Excellent'])
y = le_SQ.transform(y)
y
array([2, 2, 1, ..., 2, 2, 2])
np.unique(y)
array([0, 1, 2, 3])
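Before modeling, it's worth checking how balanced the four classes are; the decision tree later suggests very few Excellent records. A quick check using the encoder we just fit:
# Counts per decoded Sleep_Quality label
pd.Series(le_SQ.inverse_transform(y)).value_counts()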
Training
train_test_split (imported above) returns four arrays, which we name:
X_trainset, X_testset, y_trainset, y_testset
We call it with the parameters X, y, test_size=0.5, and random_state=3: X and y are the arrays to split, test_size=0.5 holds out half of the data (the sample-splitting step), and random_state makes the split reproducible.
X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.5, random_state=3)
X_trainset
array([[0, 1.7, 41, ..., 74, 0, 3.5],
       [1, 3.4, 60, ..., 74, 1, 0.8],
       [0, 2.7, 22, ..., 69, 0, 14.4],
       ...,
       [1, 3.4, 59, ..., 75, 0, 0.3],
       [0, 0.0, 38, ..., 69, 1, 5.3],
       [0, 2.4, 45, ..., 71, 0, 2.6]], dtype=object)
Practice
Print the shape of X_trainset and y_trainset, and ensure that the dimensions match.
print('Shape of X training set {}'.format(X_trainset.shape), '&', 'Shape of y training set {}'.format(y_trainset.shape))
Shape of X training set (5000, 7) & Shape of y training set (5000,)
Print the shape of X_testset and y_testset, and ensure that the dimensions match.
print('Shape of X test set {}'.format(X_testset.shape), '&', 'Shape of y test set {}'.format(y_testset.shape))
Shape of X test set (5000, 7) & Shape of y test set (5000,)
# Additional pre-processing: LassoCV with 10-fold CV picks a regularization
# strength; the ordinally encoded Sleep_Quality is treated as a numeric
# target here purely for feature selection
lasso_cv = LassoCV(cv=10)
lasso_cv.fit(X_trainset, y_trainset)
LassoCV(cv=10)
# Feature selection: keep the columns whose Lasso coefficients are
# (effectively) non-zero
sfm = SelectFromModel(lasso_cv, prefit=True)
X_train_selected = sfm.transform(X_trainset)
X_test_selected = sfm.transform(X_testset)
print(X_train_selected)
[[0 1.7 41 ... 74 0 3.5]
 [1 3.4 60 ... 74 1 0.8]
 [0 2.7 22 ... 69 0 14.4]
 ...
 [1 3.4 59 ... 75 0 0.3]
 [0 0.0 38 ... 69 1 5.3]
 [0 2.4 45 ... 71 0 2.6]]
Six of the seven features were selected: Stress Level, Coffee Intake, Age, Heart Rate, Alcohol Consumption, and Physical Activity Hours. BMI was dropped.
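Rather than counting columns by eye, we can ask the selector directly which columns survived; this check uses the X_table column order defined earlier:
# Map SelectFromModel's boolean support mask back to column names
selected_features = np.array(X_table.columns)[sfm.get_support()]
print(selected_features)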
Modeling
We will first create an instance of the DecisionTreeClassifier called sleepTree. Inside the classifier, specify criterion="entropy" so we can see the information gain of each node, and max_depth=5 to keep the tree readable.
sleepTree = DecisionTreeClassifier(criterion="entropy", max_depth = 5)
sleepTree # show the parameters we set
DecisionTreeClassifier(criterion='entropy', max_depth=5)
Next, we will fit the model using the selected training feature matrix X_train_selected and the training response vector y_trainset.
sleepTree.fit(X_train_selected,y_trainset)
DecisionTreeClassifier(criterion='entropy', max_depth=5)
Prediction
predTree = sleepTree.predict(X_test_selected)
We can print out predTree and y_testset if we want to visually compare the predictions to the actual values.
print(predTree[0:5])
print(y_testset[0:5])
[2 1 1 2 2]
[2 1 1 2 2]
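Because the labels are ordinally encoded, we can also decode a few predictions back to the original category names:
# Decode the first few predictions and the corresponding true labels
print(le_SQ.inverse_transform(predTree[0:5]))
print(le_SQ.inverse_transform(y_testset[0:5]))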
Evaluation
Next, let's import metrics from sklearn and check the accuracy of our model.
from sklearn import metrics
import matplotlib.pyplot as plt
print("DecisionTrees's Accuracy: ", metrics.accuracy_score(y_testset, predTree))
DecisionTrees's Accuracy: 0.8672
X_test_selected
array([[0, 1.3, 18, ..., 53, 0, 6.3],
       [1, 3.7, 18, ..., 87, 0, 3.1],
       [1, 4.5, 49, ..., 69, 1, 1.0],
       ...,
       [2, 4.8, 32, ..., 78, 0, 7.8],
       [0, 3.6, 54, ..., 51, 0, 13.5],
       [0, 0.5, 43, ..., 62, 0, 4.6]], dtype=object)
X_testset
array([[0, 1.3, 18, ..., 53, 0, 6.3],
       [1, 3.7, 18, ..., 87, 0, 3.1],
       [1, 4.5, 49, ..., 69, 1, 1.0],
       ...,
       [2, 4.8, 32, ..., 78, 0, 7.8],
       [0, 3.6, 54, ..., 51, 0, 13.5],
       [0, 0.5, 43, ..., 62, 0, 4.6]], dtype=object)
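The two arrays above look similar because numpy elides the middle columns. A quicker check than eyeballing the reprs is to compare shapes directly; if BMI really was dropped, the selected matrix should have one fewer column:
# Column counts before and after selection
print('Full test set:    ', X_testset.shape)
print('Selected test set:', X_test_selected.shape)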
The accuracy score computed here is subset accuracy: the label predicted for each sample must exactly match the corresponding label in y_testset. In multilabel classification, the entire set of predicted labels for a sample must match the true set for that sample to score 1.0; otherwise it scores 0.0.
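Accuracy alone hides per-class performance, which matters here because the classes are imbalanced. classification_report (imported at the top but not yet used) breaks out per-class precision and recall; the labels and target_names below follow the Poor/Fair/Good/Excellent encoding defined earlier:
# Per-class precision, recall, and F1 alongside overall accuracy
print(classification_report(y_testset, predTree,
                            labels=[0, 1, 2, 3],
                            target_names=['Poor', 'Fair', 'Good', 'Excellent'],
                            zero_division=0))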
Visualization
Let's visualize the tree.
from sklearn import tree
%matplotlib inline
class_names= np.unique(y_trainset)
class_names
array([0, 1, 2, 3])
plt.figure(figsize=(35,30))
featureNames = ['Stress Level', 'Coffee Intake', 'Age', 'Heart Rate', 'Alcohol Consumption', 'Physical Activity Hours']  # the six selected features; BMI was dropped
classes = ['Poor', 'Fair', 'Good', 'Excellent']
tree.plot_tree(sleepTree, feature_names=featureNames, class_names=classes, filled=True, rounded=True, fontsize=10)
plt.show()
This decision tree surfaces a few interesting results.
Excellent sleep, of which there are only 15 records, seems to correspond to a very leisurely life -- low heart rate, low activity, low coffee intake. That said, it is too small a sample to draw any conclusions from.
The big result here is that, compared to stress level, none of the other variables seem to matter much for sleep quality. High stress is correlated with poor sleep, and medium stress with "fair" sleep. I think that is interesting.
This of course feels like a self-perpetuating cycle -- bad sleep increases stress, and more stress leads to more bad sleep.
This all makes me think about how, every so often, there's an interview with the oldest woman in the world. She tends to have some habits that might be characterized as "bad", like eating saturated fats and smoking, but she is also unmarried, has not worked a day in her life, and has not made a housing payment in 50 years. Low stress right there. Probably good sleep.
Obviously, that's one individual, but it does remind me of how important trying to reduce stress is to every facet of one's life, including sleep.