Mining the Best Cycling Roads around Bucharest with Strava

This year I tried to do a little more cycling during the week. I have a road bike, so I’m interested in quality roads – good tarmac, no potholes, enough shoulder to allow trucks to overtake me and live to talk about it. But I don’t know all the roads around Bucharest. The best roads for biking are the smaller ones, with little traffic, that lead nowhere important. How can I find them?

I’ve worked on Strava data before and I saw here an opportunity to do it again. Using this simple script, I downloaded (over the course of several days) a total of 6 GB of GPS tracks. I started with the major Strava bike clubs in Bucharest, took all their members, then fetched all rides between April and the middle of June. Starting from 9 clubs, I looked over 1414 users. Only 674 had biked during the analyzed period. The average number of rides for those two and a half months was 25, with [10, 25, 50, 75, 90]-percentile values of [2, 6, 17, 36, 64]. These are reasonable values, considering some people commute by bike (even over 20 km per day).

Ok, how can you determine the road quality from 6 GB of data? Simplest solution: take the speed every second (you already have that from Strava) and put it on a map. I did this by converting (lat, lng) pairs to (x, y) coordinates on an image. Each pixel was essentially a bucket, holding a list of speed values.
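A minimal sketch of that bucketing step, assuming the GPS points are already parsed into (lat, lng, speed) tuples; the bounding box and image size below are made-up values:

```python
import numpy as np

# Illustrative bounding box around Bucharest and output image size
LAT_MIN, LAT_MAX = 44.2, 44.7
LNG_MIN, LNG_MAX = 25.8, 26.4
WIDTH, HEIGHT = 2000, 2000

def to_pixel(lat, lng):
    """Map a (lat, lng) pair to (x, y) image coordinates."""
    x = int((lng - LNG_MIN) / (LNG_MAX - LNG_MIN) * (WIDTH - 1))
    y = int((LAT_MAX - lat) / (LAT_MAX - LAT_MIN) * (HEIGHT - 1))  # flip y so north is up
    return x, y

def bucket_speeds(points):
    """points: iterable of (lat, lng, speed_kmph) samples, one per second.
    Returns a dict mapping each pixel to the list of speeds observed there."""
    buckets = {}
    for lat, lng, speed in points:
        if LAT_MIN <= lat <= LAT_MAX and LNG_MIN <= lng <= LNG_MAX:
            buckets.setdefault(to_pixel(lat, lng), []).append(speed)
    return buckets

def pixel_speed(speeds):
    """The per-pixel value used for coloring: the 80th percentile of speed."""
    return np.percentile(speeds, 80)
```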

Let’s check out the area around Bucharest (high res version available here):


In this chart, bright green lines signal roads where many riders go over 35 kmph (technically speaking, the 80th percentile of speed is at 35). On the opposite side, red roads are those where you’ll be lucky to go 25 kmph. The color intensity signals how popular a route is, with rarely used roads being barely visible. I already knew the good roads north of Bucharest, with Moara Vlasiei, Dascalu, Snagov and Izvorani. I saw on Strava that the SE ride to Galbinasi is popular, but I’ve never ridden it. From this analysis, I can see there are many good roads to the south and a couple of segments to the west. Unfortunately for me, for anything that’s not in the north I have to cross Bucharest, which is a buzzkill. Also notice the course from Prima Evadare (the jagged red line in the north) and the MTB trails in Cernica, Baneasa and Comana.

Let’s zoom in a little and see what the city looks like:


For this chart, I relaxed the colors a little, with 35 kmph for green and only 20 kmph for red. Things to notice:

  • the RedBull MoonTimeBike course in Tineretului (in red); notice that the whole of Tineretului and Herastrau are red, which means you can’t (and shouldn’t) bike fast in parks; please don’t bike fast in parks, it’s unpleasant (and dangerous) for bikers and pedestrians alike
  • the abandoned road circuit in Baneasa (in bright green)
  • National Arena in green
  • the two slopes around The People’s Palace in green (given how the map is built, all slopes will show up green, which is not a problem, since Bucharest is pretty much flat)

Full code available here.

7th Place in Kaggle’s Driver Telematics Challenge

The purpose of the AXA Driver Telematics Challenge was to discover outliers in a dataset of trips. We were given a total of 2730 drivers, each with 200 trips. We were told that a few of those 200 trips per driver weren’t actually his and the task was to identify which ones. Somewhat similar to authorship attribution on texts.

A drive was composed of a series of GPS measurements taken each second. Each drive started at (0, 0) and all the other points were given relative to the origin, in meters. In order to make it harder to match trips to road networks, the trips were randomly rotated and parts from the start and the end were removed.

If the dataset had been composed of only one driver, then this would actually be outlier detection. But more drivers means more information is available, and a classification approach makes it possible to incorporate that extra info.

System Overview
Local Testing
For each driver, take 180 of their trips and label them as 1s. Take 180 trips from other drivers and label them as 0s. Train a classifier, then test it on the remaining 20 trips from this driver and on another 20 trips from other drivers. That’s it.

Can we improve on this? Yes. Take more than just 180 trips from other drivers. The best results I got were with values between 4×180 and 10×180. In order to avoid an unbalanced training set, I also duplicated the data from the current driver.
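A rough sketch of this local validation step, assuming a hypothetical `features` helper that turns a raw trip into a feature vector (the classifier here is just a placeholder):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def local_fold(driver_trips, other_trips, features, ratio=5):
    """driver_trips: the 200 trips of the current driver; other_trips: trips
    sampled from other drivers; `features` is a hypothetical trip -> vector helper."""
    pos = [features(t) for t in driver_trips[:180]]
    neg = [features(t) for t in other_trips[:ratio * 180]]
    # duplicate the current driver's trips so the training set stays balanced
    X_train = np.array(pos * ratio + neg)
    y_train = np.array([1] * len(pos) * ratio + [0] * len(neg))

    X_test = np.array([features(t) for t in driver_trips[180:]] +
                      [features(t) for t in other_trips[ratio * 180:ratio * 180 + 20]])
    y_test = np.array([1] * 20 + [0] * 20)

    clf = GradientBoostingClassifier().fit(X_train, y_train)
    return clf.predict_proba(X_test)[:, 1], y_test  # aggregate into an AUC over all drivers
```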

Since some of the trips I was labeling as 1s when testing locally actually belonged to other drivers, it wasn’t possible to get the same score locally as on the leaderboard. Yet the difference between the two scores was very predictable, at around 0.045, with variations of around 0.001. The only times I was unsure of my submissions were when I had a data sampling bug and when I tried a clustering approach using the results from local testing.

Leaderboard Submissions
Pretty much the same logic as above. Train on 190 trips from this driver, along with 190 trips from other drivers, and test on the remaining 10 trips from this driver. Repeat 20 times, in order to cover all trips (similar to cross-validation). If the process was too slow, repeat 10 times with 180+20 splits instead. Also apply the data enlargement trick used in local testing.
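Sketched out, the per-driver submission loop could look like this (again with a placeholder classifier and the hypothetical `features` helper); each of the 20 folds predicts 10 of the driver’s trips:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def predict_driver(driver_trips, other_trips, features, n_folds=20):
    """Return one probability per trip of the current driver."""
    X_all = np.array([features(t) for t in driver_trips])   # the driver's 200 trips
    X_neg = np.array([features(t) for t in other_trips])
    preds = np.zeros(len(driver_trips))
    fold_size = len(driver_trips) // n_folds                 # 10 trips per fold
    for fold in range(n_folds):
        test_idx = np.arange(fold * fold_size, (fold + 1) * fold_size)
        train_idx = np.setdiff1d(np.arange(len(driver_trips)), test_idx)
        X_train = np.vstack([X_all[train_idx], X_neg])
        y_train = np.array([1] * len(train_idx) + [0] * len(X_neg))
        clf = LogisticRegression().fit(X_train, y_train)
        preds[test_idx] = clf.predict_proba(X_all[test_idx])[:, 1]
    return preds
```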

In production systems, you usually try to balance the system’s performance with the computational resources it needs. In Kaggle competitions, you don’t. Ensembling is essential for getting top results. I used a linear model to combine several predictions. The Lasso model in sklearn gave the best results, mainly because it can be constrained to nonnegative weights. Having a linear blend with negative weights is just wrong.
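In sklearn that constraint is the `positive=True` flag on `Lasso`; a minimal blending sketch, with the regularization strength as a made-up value:

```python
import numpy as np
from sklearn.linear_model import Lasso

def blend(model_preds, labels):
    """model_preds: (n_trips, n_models) out-of-fold predictions from the individual
    models; labels: the 0/1 targets used in local validation."""
    blender = Lasso(alpha=1e-4, positive=True)  # alpha is illustrative
    blender.fit(model_preds, labels)
    print("blend weights:", blender.coef_)      # nonnegative by construction
    return blender
```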

When you have an ensemble (but not only then), you will need the same results again and again for different experiments. Having a system for caching results will make your life easier. I cached results for each model, both for local validation, as well as leaderboard submissions. With some models needing a couple of days to run (with 4 cores at 100%), caching proved useful.
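One simple way to get that is to cache each model’s predictions on disk, keyed by a model name; a minimal sketch (the cache directory and naming scheme are up to you):

```python
import os
import pickle

CACHE_DIR = "cache"  # illustrative location

def cached(name, compute):
    """Return cached results for `name`, computing and saving them on a miss."""
    path = os.path.join(CACHE_DIR, name + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

# usage: preds = cached("gbt_trip_features_local", lambda: run_model(...))
```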

Feature Extraction
Trip Features
These were the best approaches overall, with a maximum local score of 0.877 using Gradient Boosting Trees (that would be an estimated LB score of about 0.922). Some of the features used were histograms and percentiles over speeds, accelerations, angles, speed * angles, accelerations over accelerations, and speeds and accelerations computed over larger windows.
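A sketch of such trip-level features, assuming each trip is an (n, 2) array of positions in meters sampled once per second; the bins below are illustrative, not the competition values:

```python
import numpy as np

def trip_features(trip):
    """trip: (n, 2) array of positions in meters, one row per second."""
    diffs = np.diff(trip, axis=0)
    speeds = np.hypot(diffs[:, 0], diffs[:, 1])      # m/s
    accels = np.diff(speeds)                         # m/s^2
    headings = np.unwrap(np.arctan2(diffs[:, 1], diffs[:, 0]))
    turns = np.abs(np.diff(headings))                # heading change per second

    feats = []
    for series, bins in [(speeds, np.linspace(0, 40, 21)),
                         (accels, np.linspace(-5, 5, 21)),
                         (turns, np.linspace(0, np.pi, 21)),
                         (speeds[1:] * turns, np.linspace(0, 30, 21))]:
        hist, _ = np.histogram(series, bins=bins, density=True)
        feats.extend(hist)
        feats.extend(np.percentile(series, [10, 25, 50, 75, 90]))
    return np.array(feats)
```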

Road segment features
I’ve seen a lot of talk on the forum about matching trips using Euclidean distance. I tried to go a little further and detect similar road segments. This approach should detect repeated trips, but it should also detect if different trips have certain segments in common. I applied a trajectory simplification algorithm (Ramer-Douglas-Peucker, available for Python here), then binned the resulting segments and applied SVMs or Logistic Regression, like on a text corpus. Local results went up to 0.812 (estimated LB score of around 0.857).
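A rough sketch of that pipeline using the `rdp` package, binning the simplified segments by length and treating each trip as a “document” of segment tokens (epsilon and the bin size are made-up values):

```python
import numpy as np
from rdp import rdp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def segment_tokens(trip, epsilon=10.0):
    """trip: (n, 2) array in meters. One token per simplified segment,
    binned by segment length (epsilon and the 50 m bins are illustrative)."""
    simplified = np.asarray(rdp(trip, epsilon=epsilon))
    diffs = np.diff(simplified, axis=0)
    lengths = np.hypot(diffs[:, 0], diffs[:, 1])
    return ["seg_%d" % int(l // 50) for l in lengths]

def segment_model(trips, labels):
    docs = [" ".join(segment_tokens(t)) for t in trips]
    vec = CountVectorizer(ngram_range=(1, 3))
    X = vec.fit_transform(docs)
    return vec, LogisticRegression().fit(X, labels)
```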

One thing I didn’t like about the RDP algorithm was how similar curves could be segmented differently due to how the threshold used by the algorithm was affected by GPS noise. So I built another model. I thought the best way to encode a trip was as a list of instructions, similar to Google Maps directions. I looked at changes in heading, encoding them as left or right turns. This model scored up to 0.803 locally, lower than the RDP model, but in the final ensemble it got a bigger weight.
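A minimal sketch of that instruction-style encoding, with the turn threshold as a made-up value; the tokens then go through the same text-classification pipeline:

```python
import numpy as np

def turn_instructions(trip, threshold=np.radians(30)):
    """trip: (n, 2) array in meters. Encode the trip as left/right turn tokens;
    the 30-degree threshold is illustrative."""
    diffs = np.diff(trip, axis=0)
    headings = np.unwrap(np.arctan2(diffs[:, 1], diffs[:, 0]))
    changes = np.diff(headings)
    tokens = []
    for c in changes:
        if c > threshold:
            tokens.append("left")
        elif c < -threshold:
            tokens.append("right")
    return tokens  # e.g. ["left", "left", "right", ...]
```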

Movement features
I think these types of features best capture somebody’s driving style, although to some degree they also capture particular junction shapes and other road features. The main idea is to compute a few measurements each second, bin them, then treat them as text data and apply SVMs or Logistic Regression. For example, the best scoring model in this category (local score of just under 0.87) binned together the distances and the angles computed over 3-second intervals on smoothed versions of the trips (smoothing done using the Savitzky-Golay filter in scipy). After binning and converting to text data, I applied Logistic Regression on n-grams, with n between 1 and 5.
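A sketch of that last model: smooth the trip with `scipy.signal.savgol_filter`, compute distances and heading changes over 3-second windows, bin them into tokens, and run Logistic Regression on word n-grams (the window, bin sizes and vectorizer settings are illustrative):

```python
import numpy as np
from scipy.signal import savgol_filter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def movement_tokens(trip, window=3):
    """trip: (n, 2) array in meters, one row per second."""
    x = savgol_filter(trip[:, 0], window_length=11, polyorder=3)  # illustrative smoothing
    y = savgol_filter(trip[:, 1], window_length=11, polyorder=3)
    dx, dy = x[window:] - x[:-window], y[window:] - y[:-window]
    dist = np.hypot(dx, dy)                      # distance covered over 3 seconds
    turn = np.abs(np.diff(np.unwrap(np.arctan2(dy, dx))))
    # bin distance and heading change together into one token per step
    return ["d%d_a%d" % (int(d // 10), int(t / 0.2)) for d, t in zip(dist[1:], turn)]

def movement_model(trips, labels):
    docs = [" ".join(movement_tokens(t)) for t in trips]
    vec = CountVectorizer(ngram_range=(1, 5))
    X = vec.fit_transform(docs)
    return vec, LogisticRegression().fit(X, labels)
```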

I tested over 100 individual models to select a final ensemble of 23 models. Local score was 0.9195 and the final LB score was 0.9645 (public LB), 0.9653 (private LB).

My code is available on GitHub.

Data Mining Strava: Running for the World Record. And Beyond

I’ve been running a little more seriously this year. On Strava, I’ve registered 427 km, including a few races: EcoMarathon (14 km, +600 m), San Francisco Marathon (41 km), Golden Gate Trail Run (30 km, +1200 m) and Piatra Craiului Marathon (38 km, +2300 m). During these races I’ve noticed I’m really slow – finishing somewhere in the last 10%–20% of my category. So the questions that emerged in my mind were:
– taking training out of the equation, am I just slower than others?
– how important is training in improving my running pace? If I trained more, how much should I expect to improve?

Analysis Procedure

I chose as reference the personal record over 10 kilometers. I would get this info about a bunch of users, along with how much they’ve run this year. I would remove users that are new to Strava – since I can’t determine whether they just started running or whether they just started using Strava, having already run a lot before.

Having this data, I would see how much the 10k time improves as an athlete trains more. I would also see how I stand compared to others with similar training, and how much I can expect to improve, given more training.

Getting the Data

First off, let’s get some data out of Strava, so I have people to compare myself to. Since Strava doesn’t have a public API, I had to scrape their website. I got a list of users from a monthly running challenge. Scraping here was straightforward – the call didn’t require authentication and gave data in JSON. In the end, I had 17000 ids for active Strava users.
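For flavor, the call looked roughly like this; the URL and JSON field names below are purely illustrative, not the actual Strava endpoints:

```python
import requests

# Hypothetical challenge leaderboard endpoint; the real URL and response fields differ.
challenge_id = 123456  # placeholder
url = "https://www.strava.com/running-challenges/%d/leaderboard" % challenge_id

athlete_ids = set()
for page in range(1, 100):
    data = requests.get(url, params={"page": page, "per_page": 200}).json()
    if not data:
        break
    athlete_ids.update(entry["athlete_id"] for entry in data)  # field name is assumed
```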

Then, I needed some statistics on each user. Since those statistics required authentication, I used a script that handled all that, so I could access the site from within Python. It took a little more work to get the data out of the HTML.

After removing users that started using Strava recently, I was left with around 7000 data points. I also had to remove users with erroneous GPS tracks – it appears Garmin has a bug that sometimes sends you to the point with coordinates (0, 0):

I also removed people that were very slow. If you need more than 2 hours for 10km, you’re walking. And you’re also messing up my chart.


I put all the points on a scatter plot. I trained an SVR on them, so I could also show a pretty line (in green). You can see my record on the chart as a red point.
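Fitting that green line takes only a few lines of sklearn; a sketch assuming arrays of per-athlete training volume and 10k personal records, with illustrative hyperparameters:

```python
import numpy as np
from sklearn.svm import SVR

def fit_pace_curve(distance_km, best_10k_s):
    """distance_km: (n,) km run this year; best_10k_s: (n,) 10k PRs in seconds."""
    model = SVR(kernel="rbf", C=100, gamma=1e-3)   # illustrative hyperparameters
    model.fit(np.asarray(distance_km).reshape(-1, 1), best_10k_s)
    return model

# usage sketch:
# model = fit_pace_curve(distance_km, best_10k_s)
# grid = np.linspace(0, 3000, 200).reshape(-1, 1)  # x axis for the green line
# curve = model.predict(grid)
# model.predict([[427]])                           # expected PR for my 427 km of training
```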

You can see some fishy stuff on the chart. The fishiest is the group of points very similar to the 10k World Record (26:17 or 1577 seconds). I looked over a couple of the tracks that generated those points and they seem to be mislabeled bike rides. Yet, I can’t explain why there are so many points around the world record value. Most likely a Strava bug.

There are a lot of people faster than the World Record. Some of them are going up to over 500 kmph. Mislabeled… flight?

Let’s zoom in on the area with the highest concentration of points. It looks like this:

What you can notice from this chart:
– if you train more, you get faster. The improvement is not linear – the better you become, the more training it takes to keep improving;
– I am indeed slower than other people with similar training.

The SVR model estimated an average runner, having the same training as me, would be 5 minutes faster. Another result from the model: I would need 867km of running in order to reach a time of 50 minutes. Hmm… seems science is telling me to move on to a new sport.

All code is available on Github.