Looking Back on the Coursera Data Science Specialization

It’s been around six months since I completed the Data Science specialization on Coursera, so I decided to look back at some of the work I did.

My vague recollection was that my work was rushed and sloppy. This was partly because I was relatively new to R, but mainly because the specialization uses a subscription-based payment model and I was determined to finish it in just a couple of months so that I wouldn’t have to pay too much. It also didn’t help that, as I progressed through the specialization, I grew increasingly frustrated that the courses seemed abandoned: the materials clearly hadn’t been updated in years despite numerous obvious errors, and it was difficult to find help.

Anyway, that’s enough negativity. Despite my frustrations, I don’t regret paying for and completing the specialization at all. It provided the exercise and consistent structure I needed to help me develop my R skills quickly. It also emphasized maintaining a reproducible workflow, which I now find invaluable. And looking back on my work, I found that some of it was pretty neat.

First, we have my final project for the Practical Machine Learning course. You can see the GitHub repo here. In this project we had to use accelerometer data to predict how someone performed a barbell lift. The problem is multiclass classification with five classes: the one correct way to perform the lift, and four distinct incorrect ways. I liked that the project didn’t just hand us perfectly clean data already split into training and test sets. Instead, we were given one big, messy data set that we had to clean and partition ourselves, just like in real work. (A “test set” was provided, but it contained only 20 cases and was used for a course quiz, not for model validation.) Unlike real work, though, the random forest I used achieved 100% accuracy on the test set. I think that was by design, for learning purposes, to make it easy to tell whether you’d done things correctly.
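For anyone curious what that split-then-train workflow looks like, here’s a minimal sketch along the lines of what I mean. It assumes a cleaned data frame called `lifts` with a factor outcome `classe`; those names are just placeholders, not taken from my actual code.

```r
# Minimal sketch: partition the data ourselves, then fit a random forest.
library(caret)

set.seed(1234)                                            # reproducibility
in_train <- createDataPartition(lifts$classe, p = 0.75, list = FALSE)
training <- lifts[in_train, ]
testing  <- lifts[-in_train, ]

fit <- train(classe ~ ., data = training, method = "rf")  # random forest
confusionMatrix(predict(fit, testing), testing$classe)    # hold-out accuracy
```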

The biggest stumbling block I remember was not noticing right away that one of the variables was a row id and that the classes were listed in order, so the row id served as a perfect predictor of the class. That was a good reminder to get familiar with the data before just dropping a random forest on it and calling it a day. Overall, I’m pleased with my work, in particular my adherence to a coherent and reproducible workflow.
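To illustrate the kind of sanity check that catches this sort of leakage (the column name `X` below is purely hypothetical, not necessarily what the data used):

```r
# If a variable like a row id dominates the importance ranking,
# it is almost certainly leaking the outcome rather than predicting it.
varImp(fit)

# Drop the offending column and refit before trusting any accuracy numbers.
lifts <- subset(lifts, select = -X)
```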

The other work I want to talk about is my capstone project. You can see the repo here. The capstone course was pretty cool because we had to take raw data and turn it into a finished web app; at the end, I really felt like I’d accomplished something. The course provided the data I used: anonymized tweets scraped from Twitter. The suggested project was a predictive keyboard app, and we were encouraged to use an n-gram backoff model. I really wanted to build an app that could predict at the character level, like a real smartphone keyboard, so I tried to train a character-level LSTM neural net with Keras and TensorFlow. In the end I couldn’t get it to work well enough. I think the problem was that the tweets had been randomly truncated to help with anonymity, so when I built a training corpus by concatenating all the tweets together, I ended up with a lot of strings that spanned two tweets and didn’t make sense.
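For context, a character-level LSTM in R’s keras interface looks roughly like this. This is just a sketch of the general shape, not my actual model: it assumes `maxlen` characters of one-hot-encoded context over a vocabulary of `n_chars` characters, both of which are placeholder values.

```r
# Rough sketch of a character-level next-character model.
library(keras)

maxlen  <- 40   # placeholder: characters of context per training example
n_chars <- 60   # placeholder: size of the character vocabulary

model <- keras_model_sequential() %>%
  layer_lstm(units = 128, input_shape = c(maxlen, n_chars)) %>%
  layer_dense(units = n_chars, activation = "softmax")  # next-character distribution

model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = "rmsprop"
)

# model %>% fit(x_train, y_train, batch_size = 128, epochs = 20)
```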

[Screenshot of the predictive keyboard Shiny app]

The final model I used was the n-gram backoff model. I remember having to tinker quite a bit to make the app really fast. R doesn’t have an obvious hash map or dictionary type, so at first I stored the n-grams in a list, but the linear-time lookup was too slow. I ended up using R environments as hash maps to get the fast lookups I needed, which I was quite proud of. I also spent a lot of time getting the UI just the way I wanted. I wanted it to behave somewhat like an actual smartphone keyboard, so I made the predicted words appear on buttons that you can press to insert the word. Despite the spartan appearance (see the screenshot above), I think the app works well, and I’m proud of it. It’s still hosted on shinyapps.io here, so you can check it out yourself.
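To give a flavor of the environment trick, here’s a toy version; the table contents and function names are made up for illustration, not lifted from my app.

```r
# An R environment with hash = TRUE behaves like a hash map,
# so prefix lookups are fast instead of linear-time list scans.
ngram_table <- new.env(hash = TRUE, parent = emptyenv())

# Key: the (n-1)-gram prefix; value: candidate next words, best first.
assign("thanks_for_the", c("follow", "rt", "support"), envir = ngram_table)

lookup <- function(prefix) {
  if (exists(prefix, envir = ngram_table, inherits = FALSE)) {
    get(prefix, envir = ngram_table, inherits = FALSE)
  } else {
    NULL  # in the real model, back off to a shorter prefix here
  }
}

lookup("thanks_for_the")  # "follow" "rt" "support"
```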

As I said earlier, I’m glad I completed the data science specialization. It really helped me learn about R and reproducibility, and it gave me a couple of decent pieces of work to show for it.

Written on January 20, 2019