This document describes the analysis I conducted for my final project in the Johns Hopkins Coursera course “Practical Machine Learning,” part of the Data Science specialization. You can learn more about the course here, and you can find more information about me here. All of my work was conducted in RStudio.
The data for this assignment come from here and contain measurements from belt, forearm, arm, and dumbbell accelerometers. The data are split into a training group (19,622 observations) and a testing group (20 observations). Participants in this study were asked to perform a “Dumbbell Biceps Curl” in five different ways: using correct form and making four common mistakes. Participants were equipped with sensors on the arm, belt, and the dumbbell itself.
First, I split the training set into 90/10 subsamples.
set.seed(614)
library(lattice); library(ggplot2); library(caret)
pml.training <- read.csv("~/R/Coursera /Machine Learning/Project/pml-training.csv")
inTrain <- createDataPartition(y=pml.training$classe, p=0.9, list=FALSE)
training <- pml.training[inTrain,]
testing <- pml.training[-inTrain,]
Note: to run this code, you will need to change the path inside read.csv() to the location of the data on your machine. The 90 percent subsample is used to train the model, and the 10 percent subsample is used for cross-validation. I chose this simple holdout cross-validation rather than something like K-fold via the cv.folds option in order to cut down on execution time, which was already quite lengthy.
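For reference, K-fold cross-validation can also be requested through caret's trainControl() rather than gbm's own cv.folds argument. Below is a minimal sketch, not run here because of the added execution time; the fold count is illustrative and the formula is abbreviated.

ctrl <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation
modFitCV <- train(classe ~ roll_belt + yaw_belt + pitch_belt,  # abbreviated formula for illustration
                  method = "gbm", data = training,
                  trControl = ctrl, verbose = FALSE)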
Next, I implement a Stochastic Gradient Boosting algorithm via the gbm package.
ptm <- proc.time()
modFit <- train(classe ~ user_name + pitch_arm + yaw_arm + roll_arm + roll_belt + pitch_belt + yaw_belt + gyros_belt_x + gyros_belt_y + gyros_belt_z + accel_belt_x + accel_belt_y + accel_belt_z + magnet_belt_x + magnet_belt_y + magnet_belt_z + gyros_arm_x + gyros_arm_y + gyros_arm_z + accel_arm_x + accel_arm_y + accel_arm_z + magnet_arm_x + magnet_arm_y + magnet_arm_z + roll_dumbbell + pitch_dumbbell + yaw_dumbbell, method="gbm", data=training, verbose=FALSE)
## Loading required package: gbm
## Loading required package: survival
## Loading required package: splines
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: parallel
## Loaded gbm 2.1
## Loading required package: plyr
proc.time() - ptm
## user system elapsed
## 1369.55 18.49 1393.17
To capture the execution time, I've used proc.time() to store a starting point in ptm and then subtracted it after the fit. On my 2012 MacBook Air (1.8 GHz Intel Core i5, 8 GB of RAM) the elapsed time was roughly 23 minutes.
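An equivalent, more compact way to capture the timing is to wrap the call in system.time(), which returns the same user/system/elapsed figures in one step. A minimal sketch, with an abbreviated formula for illustration:

timing <- system.time(
  fitSmall <- train(classe ~ roll_belt + yaw_belt + pitch_belt,  # abbreviated formula for illustration
                    method = "gbm", data = training, verbose = FALSE)
)
timing["elapsed"]  # wall-clock seconds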
print(modFit)
## Stochastic Gradient Boosting
##
## 17662 samples
## 159 predictors
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 17662, 17662, 17662, 17662, 17662, 17662, ...
##
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa Accuracy SD Kappa SD
## 1 50 0.7 0.6 0.005 0.007
## 1 100 0.7 0.6 0.007 0.009
## 1 200 0.7 0.7 0.008 0.01
## 2 50 0.8 0.7 0.008 0.01
## 2 100 0.8 0.8 0.005 0.006
## 2 200 0.9 0.8 0.005 0.006
## 3 50 0.8 0.8 0.006 0.007
## 3 100 0.9 0.9 0.005 0.006
## 3 200 0.9 0.9 0.004 0.005
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3 and shrinkage = 0.1.
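caret tuned over its default grid and selected the final model by resampled accuracy. If you wanted to search a different set of candidate models, a custom grid can be supplied through the tuneGrid argument. A minimal sketch with illustrative parameter values and an abbreviated formula (on older caret versions the n.minobsinnode column may need to be dropped):

gbmGrid <- expand.grid(interaction.depth = c(3, 5),
                       n.trees = c(150, 300),
                       shrinkage = 0.1,
                       n.minobsinnode = 10)
modFitGrid <- train(classe ~ roll_belt + yaw_belt + pitch_belt,  # abbreviated formula for illustration
                    method = "gbm", data = training,
                    tuneGrid = gbmGrid, verbose = FALSE)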
predictTr <- predict(modFit,training)
table(predictTr, training$classe)
##
## predictTr A B C D E
## A 4749 145 51 39 37
## B 75 3131 119 17 34
## C 62 113 2839 138 18
## D 98 14 63 2678 15
## E 38 15 8 23 3143
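The overall training accuracy implied by this table can be checked directly with a quick calculation on the objects created above:

mean(predictTr == training$classe)  # proportion of training observations classified correctly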
The model correctly classifies 93.6 percent of the observations in the training sample using 150 trees. The “roll_belt” and “yaw_belt” features were by far the most important in terms of variable influence.
summary(modFit,n.trees=150)
## var rel.inf
## roll_belt roll_belt 24.15343
## yaw_belt yaw_belt 13.33063
## magnet_belt_z magnet_belt_z 7.01097
## roll_dumbbell roll_dumbbell 6.55280
## pitch_belt pitch_belt 6.23819
## magnet_arm_x magnet_arm_x 4.59912
## accel_arm_x accel_arm_x 3.74037
## roll_arm roll_arm 3.64898
## magnet_arm_z magnet_arm_z 3.25981
## gyros_belt_z gyros_belt_z 3.12695
## user_nameeurico user_nameeurico 3.00163
## yaw_dumbbell yaw_dumbbell 2.63164
## accel_arm_z accel_arm_z 2.63096
## pitch_dumbbell pitch_dumbbell 2.59930
## yaw_arm yaw_arm 2.33455
## magnet_belt_x magnet_belt_x 1.82863
## pitch_arm pitch_arm 1.67995
## accel_belt_z accel_belt_z 1.42771
## magnet_arm_y magnet_arm_y 1.38889
## gyros_arm_y gyros_arm_y 1.29886
## magnet_belt_y magnet_belt_y 1.26592
## user_namecharles user_namecharles 0.74598
## gyros_belt_y gyros_belt_y 0.63706
## gyros_arm_x gyros_arm_x 0.36557
## gyros_belt_x gyros_belt_x 0.25119
## accel_arm_y accel_arm_y 0.17786
## accel_belt_x accel_belt_x 0.07305
## user_namecarlitos user_namecarlitos 0.00000
## user_namejeremy user_namejeremy 0.00000
## user_namepedro user_namepedro 0.00000
## accel_belt_y accel_belt_y 0.00000
## gyros_arm_z gyros_arm_z 0.00000
A plot of these top two features colored by outcome demonstrates their relative importance.
qplot(roll_belt, yaw_belt,colour=classe,data=training)
Even though these are the top features, neither is a great predictor in its own right. Nonetheless, you can see some clustering by class in this simple plot, which supports the choice of a boosting algorithm given the large set of relatively weak predictors. The next plot further demonstrates how performance improves with additional boosting iterations.
ggplot(modFit)
Next, I check the performance on the 10 percent subsample to get an estimate of the algorithm’s out-of-sample performance.
predictTe <- predict(modFit,testing)
table(predictTe, testing$classe)
##
## predictTe A B C D E
## A 530 19 4 2 6
## B 3 346 16 0 5
## C 4 13 312 19 3
## D 15 1 10 297 0
## E 6 0 0 3 346
The algorithm does only slightly worse on the 10 percent holdout subsample than it did on the 90 percent training subsample, correctly classifying 93.4 percent of the observations.
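caret's confusionMatrix() reports this accuracy directly, along with kappa and per-class statistics. A minimal check using the objects above (coercing classe to a factor in case it was read in as character):

cm <- confusionMatrix(predictTe, factor(testing$classe))
cm$overall["Accuracy"]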
Finally, I use the model to predict the 20 observations in the testing set. The results are run through the pml_write_files() function provided on the course site and stored for submission.
pml.testing <- read.csv("~/R/Coursera /Machine Learning/Project/pml-testing.csv")
answers <- as.character(predict(modFit, pml.testing))
pml_write_files = function(x) {
  # Write each prediction to its own problem_id_*.txt file for submission
  n = length(x)
  for (i in 1:n) {
    filename = paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(answers)
After submitting these answers, it turned out that the algorithm correctly predicted the outcome for 20 of the 20 test observations, further confirming its strong out-of-sample classification accuracy.