Predicting Sleep Quality¶

Decision Trees with LassoCV¶

I'm a night owl. I naturally enjoy staying up. That being said, over the years I've had to reduce how late I stay up in order to get enough sleep. I find that eight hours is my sleep sweet spot, and getting less than that really does affect my work and my life.

I was interested in seeing if, in a sample of 10,000 adults, I could discern whether a set of surveyed variables could be used to predict the classes that adults said their sleep patterns fell into (Poor, Fair, Good, Excellent).

I want to be strict about my inference because, unlike a traditional statistical test, I am not starting with a fixed hypothesis ("Does caffeine affect sleep?"). Instead, I'm throwing in a bunch of variables, using LassoCV to select the best ones, and then evaluating that model on my test set. This workflow runs counter to classical statistics:

The classical inferential theory of mathematical statistics is based on the philosophy that all the models to fit, all the hypotheses to test, and all the parameters to do inference for are fixed prior to seeing the data. This is not how statistics is practiced. The analyst often explores the data to find the right model to fit to the data, the right hypothesis to test, and so on. As Ronald Coase once said, “if you torture the data long enough, it will confess.”

From "Post-Selection Inference" by Arun K. Kuchibhotla, John E. Kolassa, and Todd A. Kuffner.

To deal with this issue, I'm using a method for post-selection inference called sample splitting: split the data in half, use the first half for model selection, and use the second half for inference. A minimal sketch of the idea follows.
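Here's a minimal sketch of sample splitting, using toy stand-in arrays (X_demo and y_demo are hypothetical placeholders; the real split happens in the Training section below with the actual data):

In [ ]:
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real feature matrix and target (hypothetical data)
X_demo = np.random.rand(100, 5)
y_demo = np.random.randint(0, 4, size=100)

# Half the rows are used only to *select* the features/model;
# the untouched half is reserved for fitting and inference.
X_select, X_infer, y_select, y_infer = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0)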

First Steps¶

First, read in the libraries.

In [ ]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

Read in the data set. This comes from Kaggle. The data are synthetic but reflect real-world patterns.

In [ ]:
sleep = pd.read_csv("synthetic_coffee_health_10000.csv")
sleep
Out[ ]:
ID Age Gender Country Coffee_Intake Caffeine_mg Sleep_Hours Sleep_Quality BMI Heart_Rate Stress_Level Physical_Activity_Hours Health_Issues Occupation Smoking Alcohol_Consumption
0 1 40 Male Germany 3.5 328.1 7.5 Good 24.9 78 Low 14.5 NaN Other 0 0
1 2 33 Male Germany 1.0 94.1 6.2 Good 20.0 67 Low 11.0 NaN Service 0 0
2 3 42 Male Brazil 5.3 503.7 5.9 Fair 22.7 59 Medium 11.2 Mild Office 0 0
3 4 53 Male Germany 2.6 249.2 7.3 Good 24.7 71 Low 6.6 Mild Other 0 0
4 5 32 Female Spain 3.1 298.0 5.3 Fair 24.1 76 Medium 8.5 Mild Student 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9995 9996 50 Female Japan 2.1 199.8 6.0 Fair 30.5 50 Medium 10.1 Moderate Healthcare 0 1
9996 9997 18 Female UK 3.4 319.2 5.8 Fair 19.1 71 Medium 11.6 Mild Service 0 0
9997 9998 26 Male China 1.6 153.4 7.1 Good 25.1 66 Low 13.7 NaN Student 1 1
9998 9999 40 Female Finland 3.4 327.1 7.0 Good 19.3 80 Low 0.1 NaN Student 0 0
9999 10000 42 Female Brazil 2.9 277.5 6.4 Good 28.1 72 Low 9.8 NaN Student 1 0

10000 rows × 16 columns

Capture the variables that I want to potentially use in the model.

In [ ]:
sleep = sleep[['Coffee_Intake', 'Age', 'BMI', 'Heart_Rate', 'Stress_Level','Alcohol_Consumption', 'Physical_Activity_Hours', 'Sleep_Quality']]

Confirm that two of the variables are categorical and not continuous.

In [ ]:
sleep['Stress_Level'].unique()
Out[ ]:
array(['Low', 'Medium', 'High'], dtype=object)
In [ ]:
sleep['Sleep_Quality'].unique()
Out[ ]:
array(['Good', 'Fair', 'Excellent', 'Poor'], dtype=object)
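Since these are the classes we'll be predicting, it's also worth a quick look at how balanced they are (output not shown here):

In [ ]:
# How many records fall into each sleep-quality class?
sleep['Sleep_Quality'].value_counts()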

Pre-Processing¶

Declare the following variables:

  • X as the Feature Matrix (features of sleep without the response)
  • y as the response vector (target)
In [ ]:
X_table = sleep[['Stress_Level','Coffee_Intake', 'Age', 'BMI', 'Heart_Rate', 'Alcohol_Consumption', 'Physical_Activity_Hours']]
X = sleep[['Stress_Level','Coffee_Intake', 'Age', 'BMI', 'Heart_Rate','Alcohol_Consumption', 'Physical_Activity_Hours']].values

X
Out[ ]:
array([['Low', 3.5, 40, ..., 78, 0, 14.5],
       ['Low', 1.0, 33, ..., 67, 0, 11.0],
       ['Medium', 5.3, 42, ..., 59, 0, 11.2],
       ...,
       ['Low', 1.6, 26, ..., 66, 1, 13.7],
       ['Low', 3.4, 40, ..., 80, 0, 0.1],
       ['Low', 2.9, 42, ..., 72, 0, 9.8]], dtype=object)

Convert the categorical Stress_Level values to numerical values using scikit-learn's LabelEncoder, overriding its alphabetical default so the codes follow the natural Low < Medium < High order.

In [ ]:
from sklearn import preprocessing

le_SL = preprocessing.LabelEncoder()
le_SL.fit(['Low', 'Medium', 'High'])
# Override the alphabetical default so Low=0, Medium=1, High=2
le_SL.classes_ = np.array(['Low', 'Medium', 'High'])
X[:,0] = le_SL.transform(X[:,0])
In [ ]:
X
Out[ ]:
array([[0, 3.5, 40, ..., 78, 0, 14.5],
       [0, 1.0, 33, ..., 67, 0, 11.0],
       [1, 5.3, 42, ..., 59, 0, 11.2],
       ...,
       [0, 1.6, 26, ..., 66, 1, 13.7],
       [0, 3.4, 40, ..., 80, 0, 0.1],
       [0, 2.9, 42, ..., 72, 0, 9.8]], dtype=object)
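To double-check the override, the fitted encoder can report its own mapping (a quick sanity check; output not shown):

In [ ]:
# Confirm the ordinal mapping: Low=0, Medium=1, High=2
dict(zip(le_SL.classes_, le_SL.transform(le_SL.classes_)))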

Now we can extract the target variable.

In [ ]:
y = sleep['Sleep_Quality']
y
Out[ ]:
0       Good
1       Good
2       Fair
3       Good
4       Fair
        ...
9995    Fair
9996    Fair
9997    Good
9998    Good
9999    Good
Name: Sleep_Quality, Length: 10000, dtype: object


In [ ]:
le_SQ = preprocessing.LabelEncoder()
le_SQ.fit(['Poor', 'Fair', 'Good', 'Excellent'])
# Override the alphabetical default so Poor=0, Fair=1, Good=2, Excellent=3
le_SQ.classes_ = np.array(['Poor', 'Fair', 'Good', 'Excellent'])
y = le_SQ.transform(y)
y
Out[ ]:
array([2, 2, 1, ..., 2, 2, 2])
In [ ]:
np.unique(y)
Out[ ]:
array([0, 1, 2, 3])

Training¶

Import train_test_split.

In [ ]:
from sklearn.model_selection import train_test_split

train_test_split returns four arrays, which we name:
X_trainset, X_testset, y_trainset, y_testset

We call it with the parameters:
X, y, test_size=0.5, and random_state=3.

X and y are the arrays to split; test_size is the proportion of the data held out for testing, set to 0.5 here because sample splitting needs equal halves for selection and inference; and random_state makes the split reproducible.

In [ ]:
X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.5, random_state=3)
In [ ]:
X_trainset
Out[ ]:
array([[0, 1.7, 41, ..., 74, 0, 3.5],
       [1, 3.4, 60, ..., 74, 1, 0.8],
       [0, 2.7, 22, ..., 69, 0, 14.4],
       ...,
       [1, 3.4, 59, ..., 75, 0, 0.3],
       [0, 0.0, 38, ..., 69, 1, 5.3],
       [0, 2.4, 45, ..., 71, 0, 2.6]], dtype=object)

Practice

Print the shape of X_trainset and y_trainset. Ensure that the dimensions match.
In [ ]:
# your code
print('Shape of X training set {}'.format(X_trainset.shape),'&','Shape of y training set {}'.format(y_trainset.shape))
Shape of X training set (5000, 7) & Shape of y training set (5000,)

Print the shape of X_testset and y_testset. Ensure that the dimensions match.

In [ ]:
# your code
print('Shape of X test set {}'.format(X_testset.shape),'&','Shape of y test set {}'.format(y_testset.shape))
Shape of X test set (5000, 7) & Shape of y test set (5000,)
In [ ]:
# Feature selection, step 1: fit LassoCV on the training (selection) half
# (the label-encoded sleep-quality classes are treated as ordinal targets)
lasso_cv = LassoCV(cv=10)
lasso_cv.fit(X_trainset, y_trainset)
Out[ ]:
LassoCV(cv=10)
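Before selecting features, we can also peek at the regularization strength that cross-validation settled on (output not shown):

In [ ]:
# The alpha chosen by 10-fold cross-validation
print('Chosen alpha:', lasso_cv.alpha_)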
In [ ]:
# Feature selection
sfm = SelectFromModel(lasso_cv, prefit=True)
X_train_selected = sfm.transform(X_trainset )
X_test_selected = sfm.transform(X_testset)
In [ ]:
print(X_train_selected)
[[0 1.7 41 ... 74 0 3.5]
 [1 3.4 60 ... 74 1 0.8]
 [0 2.7 22 ... 69 0 14.4]
 ...
 [1 3.4 59 ... 75 0 0.3]
 [0 0.0 38 ... 69 1 5.3]
 [0 2.4 45 ... 71 0 2.6]]

Six of the seven candidate features were selected: Stress Level, Coffee Intake, Age, Heart Rate, Alcohol Consumption, and Physical Activity Hours. BMI was the one dropped, though the truncated print above makes that hard to see.
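Here's a direct check of what the selector kept, assuming the column order used to build X above:

In [ ]:
feature_names = ['Stress_Level', 'Coffee_Intake', 'Age', 'BMI',
                 'Heart_Rate', 'Alcohol_Consumption', 'Physical_Activity_Hours']
# Lasso coefficients per feature; coefficients at or near zero get dropped
print(dict(zip(feature_names, np.round(lasso_cv.coef_, 4))))
print('Selected:', [n for n, keep in zip(feature_names, sfm.get_support()) if keep])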


Modeling

We will first create an instance of the DecisionTreeClassifier called sleepTree.
Inside of the classifier, specify criterion="entropy" so we can see the information gain at each node.
In [ ]:
sleepTree = DecisionTreeClassifier(criterion="entropy", max_depth = 5)
sleepTree # show the parameters we set
Out[ ]:
DecisionTreeClassifier(criterion='entropy', max_depth=5)

Next, we will fit the tree with the selected training features X_train_selected and the training response vector y_trainset.

In [ ]:
sleepTree.fit(X_train_selected,y_trainset)
Out[ ]:
DecisionTreeClassifier(criterion='entropy', max_depth=5)

Prediction¶

In [ ]:
predTree = sleepTree.predict(X_test_selected)

We can print out predTree and y_testset if we want to visually compare the predictions to the actual values.

In [ ]:
print(predTree[0:5])
print(y_testset[0:5])
[2 1 1 2 2]
[2 1 1 2 2]

Evaluation

Next, let's import metrics from sklearn and check the accuracy of our model.
In [ ]:
from sklearn import metrics
import matplotlib.pyplot as plt
print("DecisionTrees's Accuracy: ", metrics.accuracy_score(y_testset, predTree))
DecisionTrees's Accuracy:  0.8672
In [ ]:
X_test_selected
Out[ ]:
array([[0, 1.3, 18, ..., 53, 0, 6.3],
       [1, 3.7, 18, ..., 87, 0, 3.1],
       [1, 4.5, 49, ..., 69, 1, 1.0],
       ...,
       [2, 4.8, 32, ..., 78, 0, 7.8],
       [0, 3.6, 54, ..., 51, 0, 13.5],
       [0, 0.5, 43, ..., 62, 0, 4.6]], dtype=object)
In [ ]:
X_testset
Out[ ]:
array([[0, 1.3, 18, ..., 53, 0, 6.3],
       [1, 3.7, 18, ..., 87, 0, 3.1],
       [1, 4.5, 49, ..., 69, 1, 1.0],
       ...,
       [2, 4.8, 32, ..., 78, 0, 7.8],
       [0, 3.6, 54, ..., 51, 0, 13.5],
       [0, 0.5, 43, ..., 62, 0, 4.6]], dtype=object)
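The elided printouts above look identical, but the selected matrix has one fewer column; comparing shapes makes the dropped BMI column visible:

In [ ]:
# 7 columns before selection, 6 after
print(X_testset.shape, X_test_selected.shape)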

Accuracy classification score computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding labels in y_true. In multilabel classification, a sample's subset accuracy is 1.0 if its entire predicted label set strictly matches the true set, and 0.0 otherwise.
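Since classification_report was imported at the start, we can also get a per-class breakdown, which matters here because the classes are imbalanced (output not shown):

In [ ]:
# Precision, recall, and F1 for each sleep-quality class
print(classification_report(y_testset, predTree, labels=[0, 1, 2, 3],
                            target_names=le_SQ.classes_))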


Visualization

Let's visualize the tree

In [ ]:
from sklearn import tree
%matplotlib inline
In [ ]:
class_names= np.unique(y_trainset)
class_names
Out[ ]:
array([0, 1, 2, 3])
In [ ]:
plt.figure(figsize=(35,30))
featureNames = ['Stress Level', 'Coffee Intake', 'Age', 'Heart Rate', 'Alcohol Consumption', 'Physical Activity Hours']
classes = ['Poor', 'Fair', 'Good', 'Excellent']
tree.plot_tree(sleepTree,feature_names=featureNames, class_names = classes, filled = True, rounded = True,  fontsize = 10)
plt.show()
[Figure: the fitted decision tree rendered by tree.plot_tree]
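If the plot is hard to read at this size, scikit-learn's export_text gives a plain-text view of the same tree (truncated here to the top levels):

In [ ]:
from sklearn.tree import export_text
# Text rendering of the first two levels of the fitted tree
print(export_text(sleepTree, feature_names=list(featureNames), max_depth=2))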

We can see a few interesting results from this decision tree.

  • Excellent sleep, of which there are only 15 records, seems to correspond to a very leisurely life: low heart rate, low activity, low coffee intake. That being said, 15 records is too small a sample to draw any conclusions from.

  • The big result here is that, compared to stress level, none of the other variables seem to matter much for sleep quality. High stress is correlated with poor sleep, and medium stress with "fair" sleep. I think that is interesting.

  • This of course feels like a self-perpetuating cycle: bad sleep increases stress, and more stress leads to more bad sleep.

This all makes me think about how, every so often, there's an interview with the oldest woman in the world. She tends to be a person with habits that might be characterized as "bad", like eating saturated fats and smoking, but she is also unmarried, has not worked a day in her life, and has not made a housing payment in 50 years. Low stress right there. Probably good sleep.

Obviously, that's one individual, but it does remind me of how important trying to reduce stress is to every facet of one's life, including sleep.