Tuesday, June 5, 2018

Analyzing Commutes with Google Location History Data

Studying a year's commutes using Google Location Services data.

Introduction

I'm a physicist at the University of Chicago, and often work on experiments at Fermilab. UChicago is in Hyde Park on the south side of Chicago, while Fermilab is a bit west near Batavia, IL. I live in the middle, and thus spend a lot of time on Chicagoland's glorious highway system getting to these places.

As a scientist I love data, and I want to optimize my commute so that I can get on with the science-ing. I realized that Google Location Services has been my trusty companion on all these travels, with all that data just waiting to be analyzed.

In this post I'll do some analysis on my own commutes, and encourage you to download your location history and do the same. You can grab this and much more from https://takeout.google.com. It's your data!

For this project, we'll use Python and a few excellent packages: Jupyter (this notebook), pandas, NumPy and SciPy, and matplotlib for plotting, plus we'll take statsmodels' regression modeling with R-style formulas for a spin.

Preparing the Data

First, let's understand the Location History data, which is available for download as a giant JSON file from https://takeout.google.com/.

The schema goes something like:

"locations": [
    {
        "timestampMs": "< UTC timestamp in milliseconds >",
        "latitudeE7: < Latitude * 10^7 >,
        "longitudeE7: < Longitude * 10^7 >,
        "accuracy": < Units? >,
        "altitude": < If accurate, probably in meters >,
        "verticalAccuracy": < Also meters? >,
        "velocity": < Units? >,
        "heading": < Degrees >,
        "activity": [
            {
                "timestampMs": "< Slightly different than above [ms] >",
                "activity": [
                    {
                        "type": "< An activity type, see below >",
                        "confidence": < Number 0-100, doesn't add to 100 >,
                    },
                    ...
                ],
                "extra": [
                    {
                        "type": "VALUE",
                        "name": "vehicle_personal_confidence",
                        "intVal": 100
                    }
                ]
            },
            ...
        ]
    },
    ...
]

The velocity, heading, and activity fields are optional. When the extra block under activity appears (only sporadically), it always looks exactly as shown, at least in my data. A single record can carry a whole bunch of candidate activities, each with a confidence score from 0-100, and they are not mutually exclusive.

Activities include: 'IN_VEHICLE', 'UNKNOWN', 'ON_FOOT', 'TILTING', 'STILL', 'EXITING_VEHICLE', 'ON_BICYCLE', 'WALKING', 'RUNNING', 'IN_ROAD_VEHICLE', 'IN_RAIL_VEHICLE'. I assume this is based on the activity recognition done by my phone, so it may be device-dependent.

Loading the Data

First, import all the things!

In [1]:
%matplotlib inline
from operator import itemgetter
import json
import warnings
from datetime import datetime, timedelta
from IPython.display import display
import numpy as np
import pandas as pd
import scipy.stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.sandbox.regression.predstd import wls_prediction_std
from matplotlib import pyplot as plt
from matplotlib import rcParams
rcParams['figure.figsize'] = 8, 6
from mpl_toolkits.basemap import Basemap
import geopy
from geopy.geocoders import Nominatim
import pytz
import requests

And actually load the JSON data we've grabbed from Google:

In [2]:
with open('./data/Location History.json') as f:
    locations = json.load(f)['locations']

For this project, I'm only interested in records where I'm in a vehicle. I'll trust the activity recognition to get that mostly right, and reduce the activity list down to just the most likely activity string:

In [3]:
def choose_activity(activities):
    '''Pick the activity with the highest confidence from a list.
    
    :param activities: A list of activity objects
    :returns: The highest-confidence activity name
    '''
    if len(activities) == 0:
        return 'UNKNOWN'
    act = activities[0]['activity']
    return sorted(act, key=itemgetter('confidence'))[-1]['type']

# Reduce the activity list for each record to a single string
for loc in locations:
    loc['activity'] = choose_activity(loc.get('activity', []))
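
As a quick sanity check (a small sketch, not part of the original notebook), we can tally how often each reduced activity label shows up:

# Tally the reduced activity labels (sketch; counts will vary with your data)
from collections import Counter

activity_counts = Counter(loc['activity'] for loc in locations)
for activity, count in activity_counts.most_common():
    print('%16s %d' % (activity, count))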

Now, we load the data into a pandas.DataFrame, and do a little tidying up: convert the timestamp to a datetime with the correct timezone (Chicago), convert latitude and longitude to floating point, and drop a few columns.

In [4]:
df = pd.DataFrame.from_dict(locations)

df['timestampMs'] = pd.to_numeric(df.timestampMs)

df['timestamp'] = (pd.to_datetime(df.timestampMs, unit='ms', utc=True)
                       .dt.round('s')
                       .dt.tz_convert(pytz.timezone('America/Chicago')))

df['latitudeE7'] /= 1e7
df['longitudeE7'] /= 1e7

df.rename(columns={'latitudeE7': 'latitude', 'longitudeE7': 'longitude'}, inplace=True)
df.sort_values(by='timestampMs', inplace=True)
df.drop(['accuracy', 'verticalAccuracy', 'heading', 'altitude', 'velocity'], axis=1, inplace=True)
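
A quick peek at the tidied frame (a sketch; the exact output depends on your own history):

# Summarize the cleaned-up records
print('%d records from %s to %s' % (len(df),
                                    df.timestamp.min().date(),
                                    df.timestamp.max().date()))
print(df[['timestamp', 'latitude', 'longitude', 'activity']].head())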

Visualization

Using matplotlib's Basemap, we can draw these records on a map, sort of like what https://www.google.com/maps/timeline does.

In [5]:
# Somebody set us up the map
warnings.filterwarnings('ignore')
fig = plt.figure(figsize=[12, 12 * 120000 / float(100000)], dpi=100)
ax = fig.add_subplot(111, facecolor='white', frame_on=False)

geolocator = Nominatim()
hyde_park = geolocator.geocode('Hyde Park, IL')
chicago = geolocator.geocode('Chicago, IL')

m = Basemap(ellps='WGS84',
            projection='tmerc',
            lon_0=hyde_park.longitude-0.02, 
            lat_0=hyde_park.latitude+0.035,
            width=25*1000, 
            height=12.5*1000,
            resolution='h',
            area_thresh=10)

m.fillcontinents(color='bisque', lake_color='lightblue')
m.drawstates(color='dimgray')
m.drawmapboundary(fill_color='black')

# Plot location points for a random commute
someday = pytz.timezone('America/Chicago').localize(datetime(2017, 10, 10, 7, 45))
cut = (df.timestamp > someday) & (df.timestamp < someday + timedelta(days=0.25))
x, y = m(df[cut].longitude.values, df[cut].latitude.values)
m.plot(x, y, '-o', color='red', zorder=3)

# Draw points for some locations of interest
for loc in [hyde_park, chicago]:
    x, y = m(loc.longitude, loc.latitude)
    m.plot(x, y, marker='o', color='blue', zorder=4)

plt.show()

Extracting Commutes

Now, there is some work to do to turn a series of timestamped records into discrete commutes with durations. Incorrect activity recognition and other noise will complicate things, and we'll need to make some assumptions. I define a commute as:

  • In a vehicle
  • On a weekday
  • Between 30 minutes and 3 hours long
  • Occurring between 5 and 10 AM (morning) or 2 and 10 PM (evening)

As a trick for extracting the start and end times, I assume that commutes are isolated in time, with at least $N$ non-vehicular minutes on either side. Programmatically, this means finding a set of VEHICLE-tagged samples meeting the above cuts, where the $\Delta t$ to the previous/next VEHICLE-tagged sample is at least $N$ minutes. This should yield the start and end timestamps, and if there are more than two such samples for a given time window, we take the first two. Later we'll apply more filters to the commutes to further clean things up.
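
To make the $\Delta t$ gap trick concrete, here's a toy sketch with made-up timestamps: only each trip's first and last samples have a gap of more than $N$ minutes to a neighboring vehicle-tagged sample.

# Toy illustration of the edge-detection trick (made-up timestamps)
toy = pd.Series(pd.to_datetime([
    '2017-10-09 18:00',                                          # tail of the previous evening's trip
    '2017-10-10 07:45', '2017-10-10 07:55', '2017-10-10 08:05',
    '2017-10-10 08:20', '2017-10-10 08:30',                      # morning trip
    '2017-10-10 17:05', '2017-10-10 17:20', '2017-10-10 17:35',
    '2017-10-10 17:50',                                          # evening trip
    '2017-10-11 07:40',                                          # next morning's trip
]))
min_dt = timedelta(minutes=20)
edge = ((abs(toy - toy.shift(1)) > min_dt) |
        (abs(toy.shift(-1) - toy) > min_dt))
print(toy[edge])  # only the first and last sample of each trip survive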

In [6]:
# Grab records when we're probably in a car, on a weekday
chi_tz = pytz.timezone('America/Chicago')
car = df[(df.activity.str.contains('VEHICLE')) &
         (df.timestamp.dt.dayofweek < 5) &
         (df.timestamp > chi_tz.localize(datetime(2016, 9, 10)))].copy()

# Add extra columns for evening (True if the timestamp is after noon)
# and the number of days since an arbitrary epoch date
car['evening'] = car.timestamp.dt.hour > 12
car['day'] = (car.timestamp - chi_tz.localize(datetime(2016, 1, 1))).dt.days

# Day of week as factor, for convenience
car['dow'] = pd.Categorical.from_codes(car.timestamp.dt.dayofweek,
                                       ('MON', 'TUE', 'WED', 'THU', 'FRI'))

# Records in a vehicle separated by more than `min_dt` from other
# such records, and occurring during the right time window
min_dt = timedelta(minutes=20)
car = car[((car.timestamp.dt.hour > 5) &
           (car.timestamp.dt.hour < 10) &
            ((abs(car.timestamp - car.timestamp.shift(1)) > min_dt) |
             (abs(car.timestamp.shift(-1) - car.timestamp) > min_dt))) |
          ((car.timestamp.dt.hour > 14) &
           (car.timestamp.dt.hour < 22) &
           ((abs(car.timestamp - car.timestamp.shift(1)) > min_dt) |
            (abs(car.timestamp.shift(-1) - car.timestamp) > min_dt)))].copy()

# Group into morning/evening for each day to build aggregates for estimating commute
# properties (destination and duration).
commutes = (car.groupby(by=['day','evening'])
               .filter(lambda x: len(x) >= 2)
               .groupby(by=['day','evening'])
               .agg({
                   'latitude': {'first': 'first', 'second': lambda x: x.iloc[1]},
                   'longitude': {'first': 'first', 'second': lambda x: x.iloc[1]},
                   'timestamp': {'first': 'first', 'second': lambda x: x.iloc[1]},
               }).reset_index())

# Update columns after aggregation
commutes.columns = ['_'.join(x).strip('_') for x in commutes.columns.ravel()]
commutes['duration'] = commutes.timestamp_second - commutes.timestamp_first
commutes.rename(columns={'timestamp_first': 'timestamp'}, inplace=True)

print(len(commutes), 'commutes')
726 commutes

Next, we look at the start and end points, to tag the commutes with the destination and ensure they match up with work/home as they should for morning/evening.

In [7]:
# Define the destinations
geolocator = Nominatim()
hyde_park = geolocator.geocode('Hyde Park, IL')
chicago = geolocator.geocode('Chicago, IL')
fnal = geopy.location.Location(point=(41.838406,-88.2636437))
home = geopy.location.Location(point=(0,0))  # At home on null island ;)

def close(pos, loc, cut=2):
    '''Check if `pos` is close to `loc` with an approximate calculation.
    
    :param pos: An Nx2 array of (lat, lon) points
    :param loc: A location (returned by the geolocator)
    :param cut: A proximity cut in km
    :returns: Boolean array, True where the distance is within the cut
    '''
    p = np.array(pos) * np.pi / 180
    llat, llon = np.array((loc.latitude, loc.longitude)) * np.pi / 180
    x = (llon  - p[:,1]) * np.cos(0.5 * (llat + p[:,0]))
    y = llat - p[:,0]
    d = 6371 * np.sqrt(np.square(x) + np.square(y))
    return d < cut

# Commute start coordinates
fpos = np.array(list(zip(*(commutes.latitude_first.values,
                           commutes.longitude_first.values))))

# Commute end coordinates
lpos = np.array(list(zip(*(commutes.latitude_second.values,
                           commutes.longitude_second.values))))

# Boolean arrays that indicate whether a commute started or ended near a given location
start_fnal, end_fnal = close(fpos, fnal), close(lpos, fnal)
start_hydepark, end_hydepark = close(fpos, hyde_park), close(lpos, hyde_park)
start_home, end_home = close(fpos, home, 0.75), close(lpos, home, 0.75)

# An aggregate "location" for a commute, i.e. the workplace involved
location = np.zeros_like(start_fnal, dtype=int)
location[start_fnal|end_fnal] = 1
location[start_hydepark|end_hydepark] = 2
commutes['location'] = pd.Categorical.from_codes(location, ('UNKNOWN', 'FERMILAB', 'HYDEPARK'))

# Clean up some columns
commutes.drop(['latitude_first', 'latitude_second',
               'longitude_first', 'longitude_second',
               'timestamp_second'], axis=1, inplace=True)
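
As a quick check on the flat-earth distance approximation in close (a sketch, using geopy's great-circle distance, which the analysis itself doesn't rely on):

# Compare the equirectangular approximation against a great-circle distance
from geopy.distance import great_circle

downtown = np.array([[chicago.latitude, chicago.longitude]])
print(close(downtown, hyde_park, cut=15))   # downtown is within ~15 km of Hyde Park
print('%.1f km' % great_circle((chicago.latitude, chicago.longitude),
                               (hyde_park.latitude, hyde_park.longitude)).km)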

Finally, we apply some quality checks to the commutes: morning trips should start at home and end at work, evening trips should start at work and end at home, and we cut outliers that are implausibly long, for instance from days I didn't go straight home.

In [8]:
# Morning commute ends at work, evening commute starts at work
commutes = commutes[((np.logical_not(commutes.evening)&(end_fnal|end_hydepark)&start_home)|
                     (commutes.evening&(start_fnal|start_hydepark)&end_home))]

# Require a known destination, and cut out implausibly long outliers
commutes = commutes[(commutes.location != 'UNKNOWN') & (commutes.duration < timedelta(minutes=100))]

print(len(commutes), 'commutes')
357 commutes

We do one more round of cleanup based on a look at the location samples themselves.

Scanning through scatter plots of all sampled points between the start and stop timestamps, there are some clear examples of times I didn't go straight home. Most of these can be cut by detecting when I go too far afield.

I also noticed that the sampling becomes far more frequent (about double) on September 22, 2017. Wonder what's with that! Some of the earlier data is so sparsely sampled that it's unlikely to yield an accurate travel time, so we also remove any commute containing fewer than 30 samples.

In [9]:
bad = []
for idx, s in commutes.iterrows():
    dr = df[(df.timestamp>=s.timestamp)&(df.timestamp<=s.timestamp+s.duration)]
    if (dr.latitude.values > home.latitude + 0.003).any() or len(dr) < 30:
        bad.append(idx)
commutes.drop(bad, inplace=True)
print('Dropped %i commutes' % len(bad))
Dropped 10 commutes

Here's a snapshot of the cleaned data:

In [10]:
display(commutes[103:107])
day evening timestamp duration location
296 511 False 2017-05-26 07:50:06-05:00 00:50:16 FERMILAB
297 511 True 2017-05-26 17:10:51-05:00 01:21:35 FERMILAB
298 515 False 2017-05-30 07:34:21-05:00 00:46:08 HYDEPARK
303 517 True 2017-06-01 18:20:45-05:00 00:50:29 HYDEPARK

Analysis

Now we can make some plots, starting with histograms of commute times for Hyde Park and Fermilab in the morning and evening.

In [11]:
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(10,4), dpi=100)
bins = np.arange(10, 100, 5)

def fmt(ax):
    ax.set_xlim(bins[0], bins[-1])
    ax.set_xlabel('Duration (minutes)')
    ax.set_ylabel('Commutes/bin')
    ax.legend(loc='best')

# Convenient boolean masks for subsetting
hp = commutes.location == 'HYDEPARK'
fnal = commutes.location == 'FERMILAB'
eve = commutes.evening
morn = np.logical_not(eve)

# Just the durations in minutes
hp_morn_data = commutes[hp&morn].duration.dt.total_seconds() / 60
nal_morn_data = commutes[fnal&morn].duration.dt.total_seconds() / 60
hp_eve_data = commutes[hp&eve].duration.dt.total_seconds() / 60
nal_eve_data = commutes[fnal&eve].duration.dt.total_seconds() / 60

# Plot!
ax1.hist(hp_morn_data,
         alpha=0.6, bins=bins, label='Hyde Park')
ax1.hist(nal_morn_data,
         alpha=0.6, bins=bins, label='Fermilab')
ax1.set_title('Morning')
fmt(ax1)

ax2.hist(hp_eve_data,
         alpha=0.6, bins=bins, label='Hyde Park')
ax2.hist(nal_eve_data,
         alpha=0.6, bins=bins, label='Fermilab')

ax2.set_title('Evening')
fmt(ax2)

plt.tight_layout()
plt.show()

And some basic statistics:

Hyde Park mornings

In [12]:
print(hp_morn_data.describe(percentiles=[]))
count    136.000000
mean      49.992279
std        8.725789
min       26.666667
50%       49.450000
max       78.350000
Name: duration, dtype: float64

Fermilab mornings

In [13]:
print(nal_morn_data.describe(percentiles=[]))
count    48.000000
mean     53.775000
std       6.234873
min      43.933333
50%      53.308333
max      76.900000
Name: duration, dtype: float64

Hyde Park evenings

In [14]:
print(hp_eve_data.describe(percentiles=[]))
count    137.000000
mean      47.412652
std       12.136662
min       24.716667
50%       46.400000
max       96.866667
Name: duration, dtype: float64

Fermilab evenings

In [15]:
print(nal_eve_data.describe(percentiles=[]))
count    26.000000
mean     65.216026
std      14.227382
min      40.500000
50%      66.408333
max      90.750000
Name: duration, dtype: float64

Duration and Departure Time

Something I've noticed is that it sure feels like commutes get worse through the week. Is it so? Just growing impatience? Or leaving at different times as the week goes on and catching more traffic? Plots will tell.

Here, we look at the distribution of departure times and trip durations by day of the week using a set of box plots, and also take a look at the correlations through the joint distribution. There, we'll bring in statsmodels to fit a linear model.

In [16]:
# Plot the trip duration and departure time by weekday, plus the
# joint distribution
duration_v_depart = []
for name, dest in [('Hyde Park', hp), ('Fermilab', fnal)]:
    fig, axes = plt.subplots(2,3,figsize=(12,12), dpi=100)

    for title, dft, ax in [
            ('Morning, %s' % name, commutes[dest&morn].copy(), 0),
            ('Evening, %s' % name, commutes[dest&eve].copy(), 1)]:
        dft['depart'] = dft.timestamp.dt.hour + dft.timestamp.dt.minute / 60
        dft['duration_mins'] = dft.duration.dt.total_seconds() / 60
        dft['dow'] = pd.Categorical.from_codes(dft.timestamp.dt.dayofweek,
                                               ('MON', 'TUE', 'WED', 'THU', 'FRI'))
        dft.boxplot(column='duration_mins', by='dow', ax=axes[ax,0])
        dft.boxplot(column='depart', by='dow', ax=axes[ax,1])
        dft.plot.scatter( 'depart','duration_mins', ax=axes[ax,2])

        # Fit a linear model to duration vs. departure time
        dft.sort_values(by='depart', inplace=True)
        res = smf.ols('duration_mins ~ depart', data=dft).fit()
        duration_v_depart.append((title, res.params, res.bse))
        prstd, iv_l, iv_u = wls_prediction_std(res)

        x = dft['depart']
        y = dft['duration_mins']
        axes[ax,2].plot(x, y, 'o', label='Data')
        axes[ax,2].plot(x, res.fittedvalues, '--.', color='orange', label='Model')
        axes[ax,2].plot(x, iv_u, '--', color='orange')
        axes[ax,2].plot(x, iv_l, '--', color='orange')
        axes[ax,2].legend(loc='best')
        
        axes[ax,0].set_title(title)
        axes[ax,0].set_ylabel('Trip duration (minutes)')
        axes[ax,0].set_xlabel('Day of the Week')
        axes[ax,0].set_ylim(20, 100)
        
        axes[ax,1].set_title(title)
        axes[ax,1].set_ylabel('Departure time (hour of day)')
        axes[ax,1].set_xlabel('Day of the Week')
        lim = (6.5, 9) if ax == 0 else (15, 21)
        axes[ax,1].set_ylim(*lim)
        
        axes[ax,2].set_title(title)
        axes[ax,2].set_xlabel('Departure time (hour of day)')
        axes[ax,2].set_ylabel('Trip duration (minutes)')
        axes[ax,2].set_ylim(20, 100)
        axes[ax,2].set_xlim(*lim)

    plt.suptitle('')
    plt.tight_layout()
    plt.show()

The right panels show a linear model fit to the joint distribution, along with the standard errors. The fits are not excellent, but they capture the general trend in a situation where the correct underlying model is difficult to know without more data. The model parameters quantify how trip duration changes as departure is delayed:

In [17]:
for title, params, bse in duration_v_depart:
    print('%18s: %6.2f +/- %4.2f minutes per hour delay' % (title, params.depart, bse.depart))
Morning, Hyde Park:   4.86 +/- 1.78 minutes per hour delay
Evening, Hyde Park:  -7.00 +/- 1.16 minutes per hour delay
 Morning, Fermilab:  -2.20 +/- 2.74 minutes per hour delay
 Evening, Fermilab: -13.55 +/- 1.80 minutes per hour delay

That is, leaving an hour later on morning commutes to Hyde Park costs 5 extra minutes, while waiting an extra hour to leave Fermilab in the evening knocks 14 minutes off the trip. For morning commutes to Fermilab, it doesn't much matter when I leave... good to know!
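
Since res still holds the last fit from the loop (evening commutes from Fermilab), we can also use the model to predict durations directly; a sketch, with a couple of hypothetical departure times:

# Predict evening Fermilab trip durations for two hypothetical departure times
pred = res.predict(pd.DataFrame({'depart': [17.0, 18.5]}))
print(np.round(pred, 1))  # predicted duration in minutes for 5:00 PM and 6:30 PM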

Weather

Next, let's see if there's any correlation between commute time and weather conditions like snow or rain. We can use the DarkSky weather API (see https://darksky.net/dev/docs). You'll need an API key, which provides 1000 free API calls per day.

In [18]:
api_key = 'secret'  # Secret!
In [19]:
def get_weather(timestamp):
    '''Get historical weather for the specified hour from the DarkSky API.
    
    See https://darksky.net/dev/docs for API details. N.B. You must provide a key.
    
    The request timestamp format is "%Y-%m-%dT%H:%M:%S%z"
    
    :param timestamp: Timestamp for weather data
    :returns: A string: RAIN, WINTRY, or CLEAR
    '''
    params = {'exclude': 'currently,minutely,daily,alerts,flags'}
    ts = timestamp.strftime('%Y-%m-%dT%H:%M:%S%z')
    s = 'https://api.darksky.net/forecast/%s/%f,%f,%s' % (api_key, chicago.latitude, chicago.longitude, ts)
    r = requests.get(s, params=params)
    p = pd.DataFrame(r.json()['hourly']['data'])
    p['timestamp'] = (pd.to_datetime(p.time, unit='s', utc=True)
                        .dt.round('s')
                        .dt.tz_convert(pytz.timezone('America/Chicago')))
    nearest = p.iloc[np.argmin(np.abs(p.timestamp - timestamp))]

    if nearest.icon in ['snow', 'sleet'] or (nearest.precipProbability > 0.5 and nearest.temperature < 32):
        return 'WINTRY'
    elif nearest.icon in ['rain', 'fog'] or (nearest.precipProbability > 0.5 and nearest.temperature > 32):
        return 'WET'

    return 'CLEAR'
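
With only 1000 free calls per day and one call per commute, it can be worth caching results so that re-running the lookup doesn't repeat calls for timestamps we've already seen. A minimal sketch (the get_weather_cached wrapper is hypothetical, not part of the original notebook):

# Hypothetical in-memory cache keyed by timestamp, so repeated lookups in the
# same session reuse the earlier API response instead of hitting the quota
_weather_cache = {}

def get_weather_cached(timestamp):
    if timestamp not in _weather_cache:
        _weather_cache[timestamp] = get_weather(timestamp)
    return _weather_cache[timestamp]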

Add a column with a summary of the weather data:

In [20]:
commutes['weather'] = commutes.timestamp.apply(get_weather)
In [21]:
print('Rain/snow %1.1f%% of commutes' % (100.0 * np.sum(commutes.weather != 'CLEAR') / len(commutes)))
Rain/snow 9.2% of commutes

Unfortunately the statistics are a bit too low to learn much about the impact of weather on travel time, but a preliminary look suggests the impact is not very significant. I guess we're already moving pretty slowly, after all.

As an illustration, let's look at a higher-statistics sample, morning commutes to Hyde Park:

In [22]:
fig, ax = plt.subplots(1,1,figsize=(4,4), dpi=100)
bins = np.arange(0, 120, 5)

# Nice weather
nice = commutes.weather == 'CLEAR'

# Just the durations in minutes
hp_nice = commutes[hp&morn&nice].duration.dt.total_seconds() / 60
hp_bad = commutes[hp&morn&np.logical_not(nice)].duration.dt.total_seconds() / 60

# Plot!
ax.hist(hp_nice, alpha=0.6, bins=bins, label='Nice')
ax.hist(hp_bad, alpha=0.6, bins=bins, label='Not Nice')
ax.set_title('Hyde Park, Morning')

plt.tight_layout()
plt.legend(loc='best')
plt.show()

A look at the p-value for a two-sample Kolmogorov-Smirnov test shows that the distributions are highly compatible:

In [23]:
print('KS p-value = %1.3f' % scipy.stats.ks_2samp(hp_nice, hp_bad).pvalue)
KS p-value = 0.700
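
As a complementary check (a sketch, not in the original notebook), a rank-based Mann-Whitney U test can be run on the same two samples for a second opinion:

# Rank-based two-sample test as a second opinion on the KS result
u_stat, p_value = scipy.stats.mannwhitneyu(hp_nice, hp_bad, alternative='two-sided')
print('Mann-Whitney U p-value = %1.3f' % p_value)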
In [24]:
print('== Hyde Park mornings ==')
print('-- Nice Weather :) --')
print(hp_nice.describe(percentiles=[]))
print('-- Bad Weather :( --')
print(hp_bad.describe(percentiles=[]))
== Hyde Park mornings ==
-- Nice Weather :) --
count    119.000000
mean      49.800420
std        8.763247
min       26.666667
50%       49.250000
max       78.350000
Name: duration, dtype: float64
-- Bad Weather :( --
count    17.000000
mean     51.335294
std       8.595143
min      34.316667
50%      51.833333
max      63.366667
Name: duration, dtype: float64

Conclusions

I hope this has been an interesting illustration of how to use your Location Services data to draw your own insights. Here I've looked at my commutes to better understand how I can minimize my car time, and have a more quantitative model for the impact of departure time. But this same approach could be used for all sorts of studies. It's great that Google makes it relatively easy to access this data -- have some fun with it!

And happy commuting 🚘!