In this project, I will perform data analysis on a dataset about smartwatches that collected information such as steps, heart rate, device and gender, among others. The dataset contains more than 6,000 observations and is available on Harvard Dataverse.
In this study, researchers wanted to see if popular wearable devices, like the Apple Watch and Fitbit, could accurately predict different activities like lying down, sitting, and various levels of physical activity in a controlled lab setting. They recruited 46 participants to wear these devices while going through a 65-minute protocol, which included treadmill walking and periods of sitting or lying down. They used a method called “indirect calorimetry” to measure energy expenditure.
The researchers focused on the data collected by the devices, such as heart rate, steps, distance, and calories burned, and used machine learning models to analyze it. They tested different types of machine learning models and found that a particular model called “rotation forest” gave the best results. The accuracy of predictions was around 82.6% for the Apple Watch and 89.3% for the Fitbit.
In conclusion, this study showed that commercial wearable devices like the Apple Watch and Fitbit can reasonably predict different types of physical activities. The findings suggest that using minute-by-minute data from these devices with machine learning approaches can be a scalable way to classify physical activity types in larger populations.
Disclaimer: This project involves the analysis of health data from smartwatches, sourced from a Harvard Dataverse dataset. The conclusions are for educational purposes only and not medical advice. I am not a medical professional. Please consult a medical provider for personalized recommendations.
Dataset Introduction
Data Source
The data used in this project is from the paper titled “Replication Data for: Using machine learning methods to predict physical activity types with Apple Watch and Fitbit data using indirect calorimetry as the criterion” authored by Daniel Fuller. The paper’s dataset can be found on Harvard Dataverse.
- Paper Title: Replication Data for: Using machine learning methods to predict physical activity types with Apple Watch and Fitbit data using indirect calorimetry as the criterion
- Dataset: Harvard Dataverse - Replication Data for: Using machine learning methods to predict physical activity types with Apple Watch and Fitbit data using indirect calorimetry as the criterion
- Author: Daniel Fuller
- Year: 2020
- Version: V1
Importing
Importing libraries
Now we will import all the necessary libraries that we will use for the project.
import pandas as pd
import numpy as np
import plotly.express as px
import missingno as msno
import matplotlib.pyplot as plt
Apart from these libraries, which will mainly be used for visualization and data handling, we will also need some libraries for model creation, which we will use at the end of the project.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
Importing Data
In the official research we can find three datasets: one with all the data combined, and two others separating the Apple Watch and Fitbit information.
health = pd.read_csv('aw_fb_data.csv', index_col = 0)
health.head()
 | age | gender | height | weight | steps | heart_rate | calories | distance | resting_heart | device | activity
---|---|---|---|---|---|---|---|---|---|---|---
1 | 20 | Male | 168 | 65.4 | 10.7714 | 78.5313 | 0.344533 | 0.00832686 | 59 | apple watch | Lying
2 | 20 | Male | 168 | 65.4 | 11.4753 | 78.4534 | 3.28763 | 0.00889635 | 59 | apple watch | Lying
3 | 20 | Male | 168 | 65.4 | 12.1792 | 78.5408 | 9.484 | 0.00946584 | 59 | apple watch | Lying
4 | 20 | Male | 168 | 65.4 | 12.8831 | 78.6283 | 10.1546 | 0.0100353 | 59 | apple watch | Lying
5 | 20 | Male | 168 | 65.4 | 13.587 | 78.7157 | 10.8251 | 0.0106048 | 59 | apple watch | Lying
Cleaning Data
Now we are going to clean the data in the DataFrame. This is a multi-stage process: we will need to check the missing values first, then the text and categorical values, and finally the uniformity of the data.
Missing Values
To check for missing values, first we are going to count the null values in the dataframe, and then we will be able to choose how to proceed; we can also use the missingno package to visualize null values.
health.isnull().sum()
Data | Null | Data | Null | |
---|---|---|---|---|
X1 | 0 | age | 0 | |
gender | 0 | height | 0 | |
weight | 0 | steps | 0 | |
heart_rate | 0 | calories | 0 | |
distance | 0 | entropy_heart | 0 | |
entropy_setps | 0 | resting_heart | 0 | |
corr_heart_steps | 0 | norm_heart | 0 | |
activity | 0 | sd_norm_heart | 0 | |
steps_times_distance | 0 | device | 0 |
In this case we do not have any missing values in the dataset, which means we can continue with the rest of the process. As a side note, a useful way to check for missing values visually is the missingno package, as sketched below.
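This is a minimal sketch of that visual check, using the missingno import and the dataframe defined above:

# Visualize missingness: each column appears as a solid bar when it has no
# null values, so a fully solid matrix confirms the dataframe is complete.
msno.matrix(health)
plt.show()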
Value Consistency
In this case we will check two of the available columns, device and activity. The things to check are capitalization and potential blank spaces in the data inputs.
device = health['device']
device.value_counts()
Device | Count |
---|---|
Apple Watch | 3656 |
Fitbit | 2608 |
activity = health['activity']
activity.value_counts()
Activity | Count |
---|---|
Lying | 1379 |
Running 7 METs | 1114 |
Running 5 METs | 1002 |
Running 3 METs | 950 |
Sitting | 930 |
Self Pace Walk | 889 |
We can see from the previous results that there are no value-consistency errors: there is no need to strip blank spaces or change the capitalization.
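If inconsistencies had appeared, a typical cleanup would strip whitespace and normalize capitalization. This is a hypothetical sketch only, since no changes are needed for this dataset:

# Hypothetical cleanup (not required here): trim stray blank spaces;
# .str.lower() could additionally be chained to unify capitalization.
for col in ['device', 'activity']:
    health[col] = health[col].str.strip()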
Data Type
The columns contain different types of data, from categorical values to integers and floats, so we will check whether each type is correct.
health.dtypes
Data | Type | Data | Type | |
---|---|---|---|---|
X1 | int64 | age | int64 | |
gender | int64 | height | float64 | |
weight | float64 | steps | float64 | |
heart_rate | float64 | calories | float64 | |
distance | float64 | entropy_heart | float64 | |
entropy_setps | float64 | resting_heart | float64 | |
corr_heart_steps | float64 | norm_heart | float64 | |
activity | object | sd_norm_heart | float64 | |
steps_times_distance | float64 | device | object |
In this case we would not need to change any data type except for gender. Even though we will later replace its values with names to make it easier to read, for now let's convert it to categorical data. First we are going to check whether there are any errors in that column, i.e. numbers outside the expected boundaries; this corresponds to the part of the cleaning process known as inconsistent data.
health.gender.value_counts()
Gender | Count |
---|---|
0 | 3279 |
1 | 2985 |
We do not have any errors in the data inputs, so let's convert it to categorical data.
health.gender = health.gender.astype('category')
Outliers
In this case, with health data, it is more difficult to evaluate outliers unless a value is really extreme, as we do not know the premises of the study; one person may, for example, spend a lot of time walking. Other data where potential outliers are easier to detect is the heart rate.
health[['hear_rate', 'height','weight', 'resting_heart']].describe()
hear_rate | height | weight | resting_heart | |
---|---|---|---|---|
count | 6264 | 6264 | 6264 | 6264 |
mean | 86.1423 | 169.709 | 69.6145 | 65.8699 |
std | 28.6484 | 10.3247 | 13.4519 | 21.203 |
min | 2.22222 | 143 | 43 | 3 |
25% | 75.5981 | 160 | 60 | 58.1343 |
50% | 77.2677 | 168 | 68 | 75 |
75% | 95.6691 | 180 | 77.3 | 76.1387 |
max | 194.333 | 191 | 115 | 155 |
Let’s analyze the data:
Heart Rate: For this variable most of the statistics look reasonable, except the minimum heart rate, which is 2.22 bpm.
Resting Heart: Similar to the previous case, the minimum is too low, indicating a potential error, and the maximum value is also very high.
Most of the checks we can perform relate to the heart rate, as that is where potential problems are easiest to spot; the other variables, such as steps, calories or distance, are harder to assess for the reasons indicated previously.
Another heart measure that can be important is the heart entropy, as it is used in relation to heart rate variability.
The graphs show a clear mode, but we can also see some really low heart rates; we will examine those cases separately, but first let's extract them from the main dataset. We are going to consider a low heart rate as valid, since we do not have information on whether any member of the group has bradycardia; according to this article from the Mayo Clinic, a heart rate of 40 bpm is not necessarily a reason to worry.
Bradycardia: Bradycardia (brad-e-KAHR-dee-uh) is a slow heart rate. The hearts of adults at rest usually beat between 60 and 100 times a minute. If you have bradycardia, your heart beats fewer than 60 times a minute. A slow heart rate isn’t always a concern. For example, a resting heart rate between 40 and 60 beats a minute is quite common during sleep and in some people, particularly healthy young adults and trained athletes.
We will now extract the rows with a heart rate below 40 bpm.
lower_data = health[health['hear_rate'] <= 40.0]
We can see that there are 237 rows where the heart rate is below this threshold; now we will remove them.
checker = health['hear_rate'] <= 40.0
health = health[~checker]
We can use an assert to verify that the minimum value is above the boundary; if the assertion produces no error, every remaining value is above the limit.
assert health.hear_rate.min() > 40
After extracting the outliers from the main dataframe, we can take a look at the distributions of heart rate and resting heart rate for the outliers.
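A minimal sketch of how those distributions could be plotted with plotly express, using the lower_data subset extracted above (column names follow the code earlier in the notebook):

# Distributions of the low-heart-rate rows that were removed from the main dataframe.
px.histogram(lower_data, x='hear_rate', nbins=30, title='Heart rate of removed rows').show()
px.histogram(lower_data, x='resting_heart', nbins=30, title='Resting heart rate of removed rows').show()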
As we can see, the cases where the heart rate and resting heart rate are low belong to the same group, so there could be a problem with how the device was used or a measurement error; we cannot know without more information about the dataset.
Data Analysis
Height by Gender
With all the data clean, and after eliminating the outliers, we can proceed to analyse the data and investigate correlations and trends.
Before starting, to make the information easier to read, I will replace the values in the gender column. The source does not indicate which value corresponds to male and which to female, so I will use the assumption that height is on average greater for males than for females.
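A quick way to check this assumption is to compare the heights of the two encoded groups; this is a sketch of one possible check, not necessarily the exact plot used:

# Height by encoded gender value; the group with the greater heights is assumed to be male.
px.box(health, x='gender', y='height', title='Height by Gender').show()
print(health.groupby('gender')['height'].mean())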
We can see that group 0 has lower heights than group 1, so we will use 0 for Female and 1 for Male.
health.gender = health.gender.replace({0: "Female", 1: "Male"})
Weight by Gender
There is a deviation: males present a higher average weight compared to women. This is caused by the correlation between height and weight; even though we can find outliers, greater height usually implies greater mass. We can check that visually by using a scatter plot.
Weight Vs Height
As we can see, the taller the person, the higher the weight tends to be; there are some outliers, as should be expected, but the trendline is positive.
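A sketch of how such a scatter plot with a linear trendline could be produced with plotly express; the trendline='ols' option assumes statsmodels is installed:

# Scatter of weight against height with an ordinary-least-squares trendline.
px.scatter(health, x='height', y='weight', trendline='ols', title='Weight vs Height').show()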
Device by Gender
For this, let's generate two new dataframes to differentiate the devices.
fitbit = health[health['device'] == 'fitbit']
fitbit_gender = fitbit.gender.value_counts()
Gender | Count |
---|---|
Female | 1214 |
Male | 1174 |
apple_watch = health[health['device'] == 'apple watch']
gender_apple = apple_watch.gender.value_counts()
Gender | Count |
---|---|
Female | 1925 |
Male | 1714 |
We can see that in general there are more females in both categories, with no apparent difference; we can compare this by looking at the proportion of female vs male in the whole dataframe and then for each device.
Group | Pct. deviation Female vs Male
---|---
Total | 8.69%
Apple Watch | 12.31%
Fitbit | 3.41%
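A sketch of one way these proportions could be computed, assuming the deviation is defined as (female − male) / male, which reproduces the figures above; the helper function name is mine, not from the original:

# Percentage by which female rows outnumber male rows in a dataframe.
def female_male_deviation(df):
    counts = df['gender'].value_counts()
    return (counts['Female'] - counts['Male']) / counts['Male'] * 100

print(f"Total: {female_male_deviation(health):.2f}%")
print(f"Apple Watch: {female_male_deviation(apple_watch):.2f}%")
print(f"Fitbit: {female_male_deviation(fitbit):.2f}%")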
We can see a bigger difference in the proportion of women for the Apple Watch compared to the dataset total, which indicates that in our dataframe a larger proportion of women use the Apple Watch; we cannot extrapolate any trend outside the dataframe.
Calories Burned by Device
This shows a clear skew: the calories burned recorded by the Fitbit are higher. This may indicate that people using the Fitbit tend to perform more intense activities, compared to Apple Watch users who may use the watch not only for sport but also in their day-to-day routine.
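A sketch of how this comparison could be drawn, overlaying the calorie distributions of the two devices:

# Distribution of calories burned, split by device.
px.histogram(health, x='calories', color='device', barmode='overlay',
             title='Calories Burned by Device').show()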
Calories Burned by Activity
We can see that, in total, the members of the dataframe burned more calories running at 5 METs, and that sitting accounts for fewer calories than lying. We can also check which activity burned more calories on average, but we have to understand that there are many factors affecting how many calories are burned.
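A sketch of how the total and average calories per activity could be computed:

# Total and mean calories burned for each activity, sorted by total.
print(health.groupby('activity')['calories'].agg(['sum', 'mean']).sort_values('sum', ascending=False))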
Heart Rate by Gender
The graph shows a wider spread in the data for females compared to men, with a higher median.
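A sketch of a box plot for this comparison; the heart-rate column name follows the code used earlier in the notebook:

# Heart rate distribution by gender.
px.box(health, x='gender', y='hear_rate', title='Heart Rate by Gender').show()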
Steps by Gender
In this case I have displayed the individual points to give a better representation of the data.
We can see that there is not a lot of difference between the two genders; females record higher step counts in general, but their median is lower than their mean.
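A sketch of a box plot that also displays the individual points, as described above:

# Steps by gender, with every underlying observation shown next to the box.
px.box(health, x='gender', y='steps', points='all', title='Steps by Gender').show()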
Steps by Device
We can observe a big difference between the Apple Watch and the Fitbit. This can be due to different factors; such a big difference could indicate a problem recording the steps.
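A sketch of a quick numeric comparison of the step counts recorded by each device:

# Mean and median steps recorded by each device.
print(health.groupby('device')['steps'].agg(['mean', 'median']))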
Model Creation
Now that we have all this data, we can create a model to predict the calories burned. To create this model I will discard some of the columns to keep it simpler, as this is for demonstration and observation purposes, not for prediction accuracy.
First we will create a subset of the columns to use.
model_columns = ['age', 'gender', 'height', 'weight', 'steps', 'hear_rate', 'distance', 'resting_heart', 'device', 'activity', 'calories']
model_data = health[model_columns].copy()  # copy so the mappings below do not trigger SettingWithCopyWarning
The next step is to convert the categorical data to integers to train the model.
model_data["gender"] = model_data["gender"].map({"Male": 1, "Female": 0})
model_data["device"] = model_data["device"].map({"apple watch": 0, "fitbit": 1})
model_data["activity"] = model_data["activity"].map({"Lying": 0, "Running 7 METs": 1, "Running 5 METs": 2, "Running 3 METs": 3, "Sitting": 4, "Self Pace walk": 5})
After that we can proceed to train the model.
x = np.array(model_data[['age', 'gender', 'height', 'weight', 'steps', 'hear_rate', 'distance', 'resting_heart', 'device', 'activity']])
y = np.array(model_data[['calories']])
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.10, random_state=42)
model = SVR()
model.fit(xtrain, ytrain)
Model Testing
After creating the model we can try to make a prediction with it. It is important to remember that we mapped the categorical values, so when we introduce new data we need to encode it using the same system.
Example Case: a 29-year-old male, 160 cm tall, weighing 80 kg, with 2,100 steps, a heart rate of 90 bpm, a distance of 1,000 m, and a resting heart rate of 60 bpm. The device is an Apple Watch and the activity is running at 7 METs.
features = np.array([[29, 1, 160, 80, 2100, 90, 1000, 60, 0, 1]])
print(model.predict(features))
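Beyond a single example, a quick sanity check is to score the fitted model on the held-out test split; these metrics are not part of the original write-up, just a sketch:

from sklearn.metrics import mean_absolute_error, r2_score

# Evaluate the fitted SVR on the 10% test split.
test_predictions = model.predict(xtest)
print("R^2:", r2_score(ytest, test_predictions))
print("MAE:", mean_absolute_error(ytest, test_predictions))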