Data Mining Strava: Running for the World Record. And Beyond

I’ve been running a little more seriously this year. On Strava, I’ve registered 427km, including a few races: EcoMarathon (14km, +600m), San Francisco Marathon (41km), Golden Gate Trail Run (30km, +1200m) and Piatra Craiului Marathon (38km, +2300m). During these races I’ve noticed I’m really slow – finishing somewhere in the last 10%–20% of my category. So the questions that emerged in my mind were:
– taking training out of the equation, am I just slower than others?
– how important is training in improving my running pace? If I trained more, how much should I expect to improve?

Analysis Procedure

I chose the personal record over 10 kilometers as a reference. I would get this figure for a bunch of users, along with how much they’ve run this year. I would remove users that are new to Strava – since I can’t determine whether they just started running or just started using Strava while having already run a lot.

Having this data, I would see how much the 10k time improves as an athlete trains more. I would also see how I stand compared to others with similar training and how much I can expect to improve, given more training.

Getting the Data

First off, let’s get some data out of Strava, so I have people to compare myself to. Since Strava doesn’t have a public API, I had to scrape their website. I got a list of users from a monthly running challenge. Scraping here was straightforward – the call didn’t require authentication and returned JSON. In the end, I had 17,000 IDs of active Strava users.
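
In essence, the collection loop just pages through that JSON endpoint. Here is a minimal sketch of the idea – the URL, the pagination parameters and the field names are my assumptions for illustration, not Strava’s actual (undocumented) interface:

```python
# Sketch of the unauthenticated leaderboard scrape. The URL, parameters
# and field names below are assumptions for illustration only.
import requests

LEADERBOARD = "https://www.strava.com/challenges/monthly-run/leaderboard"  # hypothetical

athlete_ids = set()
page = 1
while True:
    resp = requests.get(LEADERBOARD, params={"page": page, "per_page": 100})
    resp.raise_for_status()
    entries = resp.json()  # the call returned JSON directly, no auth needed
    if not entries:
        break
    athlete_ids.update(entry["athlete_id"] for entry in entries)
    page += 1

print(f"{len(athlete_ids)} active Strava user ids")
```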

Then, I needed some statistics on each user. Since those statistics required authentication, I used a script that handled the login, so I could access the site from within Python. Getting the data out of the HTML took a little more work.
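
Conceptually, the script boils down to keeping an authenticated session alive and parsing each profile page. A rough sketch with requests and BeautifulSoup follows; the login form fields and CSS selectors are guesses you would adapt after inspecting the real pages:

```python
# Sketch of authenticated scraping with requests + BeautifulSoup. The
# form fields and CSS selectors are illustrative guesses, not Strava's
# actual markup.
import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Fetch the login page and pull out the CSRF token (Rails-style form).
login_page = session.get("https://www.strava.com/login")
soup = BeautifulSoup(login_page.text, "html.parser")
token = soup.find("input", {"name": "authenticity_token"})["value"]

session.post("https://www.strava.com/session", data={
    "email": "me@example.com",
    "password": "secret",
    "authenticity_token": token,
})

def athlete_stats(athlete_id):
    """Scrape one athlete's profile page; the selector is hypothetical."""
    html = session.get(f"https://www.strava.com/athletes/{athlete_id}").text
    page = BeautifulSoup(html, "html.parser")
    ytd = page.select_one(".ytd-distance")
    return ytd.text.strip() if ytd else None
```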

After removing users that started using Strava recently, I was left with around 7000 data points. I also had to remove users with erroneous GPS tracks – it appears Garmin has a bug that sometimes sends you to the point with coordinates (0, 0).

I also removed people who were very slow. If you need more than 2 hours for 10km, you’re walking. And you’re also messing up my chart.
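
Put together, the cleaning pass might look like the following pandas sketch, where the column names and the cutoff for “new” users are assumptions about the scraped schema:

```python
# Cleaning the scraped records; column names and thresholds are assumptions.
import pandas as pd

df = pd.read_csv("athletes.csv")

df = df[df["months_on_strava"] >= 6]                      # drop recent joiners
df = df[(df["start_lat"] != 0) | (df["start_lng"] != 0)]  # Garmin (0, 0) bug
df = df[df["best_10k_seconds"] <= 2 * 3600]               # over 2h for 10k = walking

print(len(df), "data points left")
```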

Analysis

I put all the points on a scatter plot. I trained an SVR (support vector regression) model on them, so I could also show a pretty trend line (in green). You can see my record on the chart as a red point.
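
A handful of scikit-learn and matplotlib lines suffice for this step. The sketch below assumes the cleaned DataFrame from earlier; the SVR kernel and hyperparameters are guesses, not the exact values behind the chart:

```python
# Scatter plot of 10k PR vs. training volume, with an SVR trend line.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVR

X = df[["km_this_year"]].to_numpy()     # yearly training volume (km)
y = df["best_10k_seconds"].to_numpy()   # 10k personal record (seconds)

model = SVR(kernel="rbf", C=100.0, gamma=1e-4)
model.fit(X, y)

grid = np.linspace(X.min(), X.max(), 200).reshape(-1, 1)
plt.scatter(X, y, s=2, alpha=0.3)
plt.plot(grid, model.predict(grid), color="green")  # the pretty line
plt.scatter([427], [3600], color="red")             # me: 427km, ~1h 10k
plt.xlabel("km run this year")
plt.ylabel("10k personal record (seconds)")
plt.show()
```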

You can see some fishy stuff on the chart. The fishiest is the group of points clustered right around the 10k world record (26:17, or 1577 seconds). I looked over a couple of the tracks that generated those points and they seem to be mislabeled bike rides. Yet I can’t explain why so many points sit around the world record value. Most likely a Strava bug.

There are a lot of people faster than the world record. Some of them go over 500 km/h. Mislabeled… flights?

Let’s zoom in on the area with the highest concentration of points. It looks like this:

[Figure: strava-chart3 – zoomed scatter plot of the main cluster]

What you can notice from this chart:
– if you train more, you get faster. The improvement is not linear: the better you become, the more training each further gain takes;
– I am indeed slower than other people with similar training.

The SVR model estimated that an average runner with the same training as me would be 5 minutes faster. Another result from the model: I would need 867km of running to reach a time of 50 minutes. Hmm… seems science is telling me to move on to a new sport.
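
Both numbers come from querying the fitted curve. Here is a sketch of that lookup, reusing the hypothetical model and column names from the previous snippets:

```python
# Querying the fitted SVR: expected gap at my training volume, and the
# training volume where the predicted time first dips under 50 minutes.
import numpy as np

my_km, my_seconds = 427.0, 3600.0          # my yearly volume and ~1h record

expected = model.predict([[my_km]])[0]     # average runner with my training
print("minutes slower than average:", (my_seconds - expected) / 60)

grid = np.linspace(0, 3000, 3001).reshape(-1, 1)
preds = model.predict(grid)
under_50 = grid[preds <= 50 * 60]          # volumes with predicted sub-50 times
print("km needed for a 50-minute 10k:",
      under_50[0, 0] if len(under_50) else "not reached on this grid")
```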

All code is available on GitHub.

9 thoughts on “Data Mining Strava: Running for the World Record. And Beyond”

  1. This is really fascinating. I developed a site called smashrun.com, and one of the things I think is really interesting about what we do is calculating percentile rank across various metrics: speed at various distances, but also distance run and days run/not run. For example, you can see my results here: http://smashrun.com/chris.lukic/medals/speed
    I’ve wanted to do some analysis similar to what you’ve done here with improvement vs training load, but as always finding the time is a challenge.

    One of the things I can say is that filtering out the outliers is really challenging; there’s just not much difference between the performance of a slow bike rider and a world-class runner. While you can be reasonably sure that the sub-30-minute 10ks are bike rides, what you can’t be sure of is how many of the 40- and 50-minute 10ks are “slow” bike rides. Not to mention all of the people with uncalibrated Nike chips, or, even more likely, mobile apps with dubious recordings. But I think most people have an interest in getting their data right, and to a large extent the errors tend to wash one another out in both directions, I hope…

    On Smashrun a 1-hour 10k time puts you right at about the 25th percentile. But from the looks of it you’re running some serious elevation. Most of these record 10ks are done on a straight flat with a tail wind, and that alone could make a huge difference. All you need to do is run 1 km/h faster to be in the 50th percentile, and a few hills can easily make that difference.

    By the way, I might be able to send you some anonymized, aggregated data if you’re interested in doing some more analysis. Just shoot me an email. However, I’d appreciate it if you didn’t crawl the site, because we’re not funded like Strava, and I’m running our poor little servers out of my own pocket.

    • Smashrun looks like an awesome tool for the goal-oriented runner. Great job with it! I particularly enjoy the rank history chart – I’ve been feeling the need for something like that. I’ve actually started importing my tracks from Strava (the ability to select multiple files when uploading would surely help people import historical data).

      The 25th percentile seems about right for me :). I got slightly worse results, but I was also filtering out new (and presumably slower) users. And my one-hour time was on flat terrain. I get even slower when trail running.

      Don’t worry, I won’t crawl Smashrun :). I’m thinking of doing some more analyses, but it will be a while until I find the time. That data would sure come in handy (and it would simplify things a lot). I’ll keep this in mind. Thanks 🙂

      • Very cool. Some of that terrain that you’re running looks absolutely epic. Envious.
        The import via email is the easiest way to get old data into the site right now. You can attach quite a few files to one email, and it makes it much simpler. I’m working on a zip import though, which would make things even easier.
        Glad you like the site so far, shoot me an email if you run into any problems.

  2. Pingback: Mining the Strava Data | Oliver Nash's Blog

  3. Hello, I am doing some mining on Strava data as well. I just have one question: how did you manage to collect the strava_id of 17,000 users?
