Custom Infix Operators in Python

R has the neat feature that you can easily define your own infix operators. I’ve sometimes wished Python had a similar feature, so I created a short notebook showing a hacky way to implement custom infix operators in Python. While you should probably never need to use this hack, it does show off an interesting application of dunder methods and decorators. So it’s worth taking a look at, especially if you’ve never seen those cool Python features.
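
The core of the hack looks something like this (a minimal sketch of the idea, not necessarily the notebook’s exact code): wrap a two-argument function in a class that overloads __ror__ and __or__, so that x |op| y chains into a call of the wrapped function.

```python
class Infix:
    """Wrap a two-argument function so it can be used like an infix
    operator, e.g. `x |op| y`, by overloading `__ror__` and `__or__`."""
    def __init__(self, func):
        self.func = func

    def __ror__(self, left):
        # Handles `left |op`: capture the left operand in a closure.
        return Infix(lambda right: self.func(left, right))

    def __or__(self, right):
        # Handles `(left |op)| right`: apply the captured closure.
        return self.func(right)

@Infix
def dot(a, b):
    """A hypothetical example operator: dot product of two sequences."""
    return sum(x * y for x, y in zip(a, b))

print([1, 2, 3] |dot| [4, 5, 6])  # 32
```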

Basic Bayesian Survival Analysis

During the fall of 2018 I took intro courses in Bayesian statistics and survival analysis. Both were taught by the same professor, and he made the interesting offer that we could do one final project for both classes, so long as we found a way to incorporate the core material of each course. He actually recommended against this, claiming it would be harder overall, but I couldn’t resist. I ended up gathering my own data and writing my own Metropolis-within-Gibbs MCMC code, which I was quite proud of. I typed up the paper in R Markdown with text and code interleaved, so you can see exactly how I produced my results. Overall, it’s a bit rough around the edges, but I still like it. You can check out the paper on GitHub here.
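
For anyone unfamiliar with the sampler, here’s a generic Metropolis-within-Gibbs sketch (my illustration of the algorithm, not the code from the paper): within each sweep, every coordinate of the parameter vector gets its own Gaussian random-walk Metropolis step.

```python
import numpy as np

def metropolis_within_gibbs(log_post, theta0, prop_sd, n_iter=5000, seed=0):
    """Sample from a posterior via coordinate-wise Metropolis updates.

    log_post: function returning the log posterior density at theta.
    prop_sd:  per-coordinate proposal standard deviations.
    """
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    lp = log_post(theta)
    samples = np.empty((n_iter, theta.size))
    for i in range(n_iter):
        for j in range(theta.size):
            proposal = theta.copy()
            proposal[j] += rng.normal(0.0, prop_sd[j])
            lp_prop = log_post(proposal)
            # Accept with probability min(1, posterior ratio).
            if np.log(rng.uniform()) < lp_prop - lp:
                theta, lp = proposal, lp_prop
        samples[i] = theta
    return samples
```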

Visualizing Name Popularity Over Time With Tableau

I’ve heard a lot of praise for Tableau, but in general I’m pretty turned off by point-and-click interfaces, especially when there are excellent code-based alternatives, such as ggplot2 for R and Seaborn for Python, not to mention the plethora of JavaScript plotting libraries. But I figured I might as well give it a try, so I made a basic dashboard that you can find here. I’ll give a short overview of the dashboard in the full post, but first let me say that I’m pleasantly surprised by how intuitive it is to create polished interactive plots in Tableau.

Neural Nets from Scratch

A couple of years ago, when I was first learning about deep learning, I wrote my own little computational graph “library.” I put “library” in quotes because it was really just one small Python script implementing the core operations used for building up deep learning models. I found the code the other day while looking through some old folders, and I thought I’d throw it up on my GitHub with a little readme and a short demo. Check it out here. If you want to learn more about deep learning and computational graphs, I highly recommend both the cs231n notes and the Deep Learning book.
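
If you just want the flavor of the thing, here’s a toy version of the core idea (a fresh sketch, not the code from the repo): each node records its value, its parents, and the local derivatives needed to backpropagate with the chain rule.

```python
class Node:
    """A scalar value in a tiny computational graph."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # upstream Nodes
        self.local_grads = local_grads  # d(self)/d(parent) for each parent
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other),
                    (other.value, self.value))

    def backward(self, upstream=1.0):
        """Naively accumulate gradients along every path (fine for toys)."""
        self.grad += upstream
        for parent, local in zip(self.parents, self.local_grads):
            parent.backward(upstream * local)

# z = x * y + x, so dz/dx = y + 1 and dz/dy = x.
x, y = Node(2.0), Node(3.0)
z = x * y + x
z.backward()
print(x.grad, y.grad)  # 4.0 2.0
```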

Cleaning and Exploring Mountain Project Climbing Data

I’ve just put up a new Jupyter Notebook that shows off some basic data cleaning using Python and Pandas. You can view it on my GitHub here. Here’s a brief description from the intro of the notebook:

Mountain Project is an online climbing guidebook for hundreds of thousands of routes all around the world. One of its cool features is a route finder that will create a table of routes based on certain parameters you give it, such as location and difficulty. You can then export this table as a .csv file. So I used this feature to download information about every roped climb in the Red River Gorge and New River Gorge, since these are the two major climbing destinations nearest to where I live, and they are places where I have actually climbed before.

The main focus of this notebook is demonstrating some basic data cleaning with Pandas, but at the end I’ll do a little exploratory analysis as a payoff for all that hard cleaning work. As an added bonus, I demonstrate some quick and easy (but primitive) web scraping for gathering data that the route finder doesn’t export.

Let’s see if there’s anything interesting we can learn about the climbing at the Red and the New!
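
To give a sense of the kind of cleaning involved, here’s a snippet in the spirit of the notebook (the file name and column names here are illustrative assumptions, not necessarily what the export actually contains):

```python
import pandas as pd

# Hypothetical name for the route finder's .csv export.
routes = pd.read_csv("route-finder.csv")

# Illustrative cleaning steps: split a combined location string into
# separate columns, coerce star ratings to numeric, and drop unrated rows.
routes[["area", "wall"]] = routes["Location"].str.split(" > ", n=1, expand=True)
routes["Avg Stars"] = pd.to_numeric(routes["Avg Stars"], errors="coerce")
routes = routes.dropna(subset=["Avg Stars"])
```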

Nonparametric Bayesian Methods Slides

I’ve uploaded some slides from talks I gave during my Advanced Topics in Applied Statistics course. Both of the talks were about nonparametric Bayesian methods, which I find quite neat. The links to the slides are below in case you want to check them out. The first talk was a shorter one about Gaussian process regression. I tried to avoid getting bogged down in the math, since I didn’t have the time to explain it, so the material is quite light. Those slides are here. The second talk was about Dirichlet process clustering. This one was longer and involved more of the math behind the method, but it is still a non-rigorous overview. Those slides are here.
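
As a taste of the first topic, here is a bare-bones Gaussian process regression posterior in NumPy (an RBF kernel on 1-D inputs; my sketch of the math the slides cover, not code from the slides):

```python
import numpy as np

def gp_posterior(X_train, y_train, X_test, length_scale=1.0, noise=0.1):
    """Posterior mean and variance for GP regression with an RBF kernel."""
    def rbf(A, B):
        # Squared-exponential kernel on 1-D inputs.
        return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / length_scale**2)

    K = rbf(X_train, X_train) + noise**2 * np.eye(len(X_train))
    Ks = rbf(X_train, X_test)
    Kss = rbf(X_test, X_test)
    mean = Ks.T @ np.linalg.solve(K, y_train)
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mean, np.diag(cov)
```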

Iterables, Iterators, and Generators in Python

I’ve just finished working on a little Jupyter notebook that explains iterables, iterators, and generators in Python. I’ve seen a lot of confusion regarding these topics (my own included: “I guess iterators are some kind of stream thingy? And then generators are… also some kind of stream thingy?”), so I decided to figure out exactly what these things are and then write up what I learned for others (mainly future me). You can check out the notebook on my GitHub here.
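
The one-paragraph version, for the impatient: an iterable is anything you can get an iterator from, an iterator is the stateful object that actually hands you values one at a time, and a generator is just a convenient way to write an iterator. For example:

```python
nums = [1, 2, 3]   # iterable: each call to iter() gives a fresh iterator
it = iter(nums)    # iterator: holds state and is consumed as you go
print(next(it))    # 1
print(next(it))    # 2

def countdown(n):
    """Generator function: calling it builds an iterator for you."""
    while n > 0:
        yield n
        n -= 1

gen = countdown(3)
print(list(gen))   # [3, 2, 1]
print(list(gen))   # [] (generators, like all iterators, exhaust)
```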

The Monty Hall Paradox Through a Causal Lens

I’ve recently finished reading Pearl and Mackenzie’s The Book of Why: The New Science of Cause and Effect. In the chapter on paradoxes, the book discusses the famous Monty Hall problem, but with a twist. Rather than basing their analysis of the problem solely on probability theory, they show how formal causal reasoning can elucidate why most people find it so confusing. I found this discussion so illuminating that I’d like to share it here, in my own words.
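
The book’s analysis is causal rather than computational, but if you’ve never quite believed the basic result itself, it’s easy to check by simulation (a quick sketch of my own, not anything from the book):

```python
import random

def monty_hall(switch, trials=100_000):
    """Estimate the win rate for the stay and switch strategies."""
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)
        pick = random.randrange(3)
        # Monty opens a door that is neither the pick nor the car.
        # (When he has two choices, which one he opens doesn't change
        # the win rates, so we let him pick deterministically.)
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print(monty_hall(switch=False))  # ~1/3
print(monty_hall(switch=True))   # ~2/3
```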

My Master's Thesis

I’m all but finished with my thesis now. I’ve titled it “Elo Regression: Extending the Elo Rating System.” The idea behind the work is to create a modified version of the Elo rating system that handles ratings over time in a principled way. Since Elo ratings are updated over time, you can look at how the ratings change to get an ad hoc model of skill over time. But this ad hoc method has some pathologies. For example, we might observe someone’s rating climb and then sharply plummet, where the obvious explanation is that they became temporarily overrated and their rating was then corrected, something that can definitely happen under Elo. If we later look back to see how strong different players were at different times, we should probably discount that temporary spike somehow. The methods in my thesis do this automatically. You can read it on my GitHub here. There’s also a slide deck in that same repository, if you just want to see the main ideas.
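
For context, the classic Elo update that the thesis builds on looks like this (the textbook rule, not my modified method): the expected score is a logistic function of the rating difference, and both ratings move toward the observed result.

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One standard Elo update. score_a is 1, 0.5, or 0 for player A."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

# A 1600 player beating a 1500 player gains only a little rating.
print(elo_update(1600, 1500, score_a=1))  # roughly (1611.5, 1488.5)
```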

Names and Entropy

The babynames package/dataset is a great playground for using tidyverse skills to answer questions with data. In this post, I’ll explore the dataset a little bit before diving into my main question (“How is name diversity changing over time?”) and answering it using the Shannon entropy as a measure of diversity. Check out the R Notebook on RPubs here.
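
For the curious, the entropy computation itself is tiny. The post works in R, but here is the same idea sketched in Python: treat a year’s name counts as a probability distribution and compute its Shannon entropy, where higher entropy means names are spread more evenly.

```python
import numpy as np

def shannon_entropy(counts):
    """H = -sum(p * log2(p)) for a vector of name counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # by convention, 0 * log(0) contributes nothing
    return -(p * np.log2(p)).sum()

# Toy example: one dominant name vs. an even spread.
print(shannon_entropy([90, 5, 5]))    # ~0.57 bits
print(shannon_entropy([34, 33, 33]))  # ~1.58 bits
```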

The Rule of Three and Confidence Intervals

The “Rule of Three” is a neat trick for stating a confidence interval for a binomial proportion in the case where you’ve observed all successes or all failures. Thinking about this led me to some general meditations on confidence intervals.
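
The rule itself: if you observe 0 successes in n trials, an approximate 95% upper confidence bound for the true proportion is 3/n. It falls out of solving (1 - p)^n = 0.05 for p, since -ln(0.05) is roughly 3. A quick numerical check (my sketch, not from the post):

```python
def rule_of_three_upper(n):
    """Approximate 95% upper bound for p after 0 successes in n trials."""
    return 3 / n

def exact_upper(n, alpha=0.05):
    """Exact bound: the p at which 0 successes has probability alpha,
    i.e. the solution of (1 - p)**n = alpha."""
    return 1 - alpha ** (1 / n)

for n in (10, 30, 100):
    print(n, rule_of_three_upper(n), round(exact_upper(n), 4))
# 10 0.3 0.2589
# 30 0.1 0.095
# 100 0.03 0.0295
```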
