Hubway Data Challenge Part 2

Once I had the Hubway bike system data loaded into a database, and modeled into MicroStrategy, I could start to play with the data and do some basic profiling.  The more I looked at the data, the more I wanted to add things.  For example, the trips table as the birth year of subscribing riders, which lends itself to creating an age attribute.  To model in age, I created an attribute and used the pass-through applysimple function.  This is the basic syntax needed: ApplySimple(“(year(#0)-#1)”, [start_date_desc], [birth_date]).

When added to a report by itself, the age attribute will generate the following SQL:

SELECT DISTINCT (YEAR(a11.start_date_desc)-a11.birth_date) CustCol_3
FROM trips a11

As mentioned in the part 1 post, the data offers the opportunity to add more layers and texture because the dimensions are so generic.  Latitude and longitude coordinates can be used to derive the elevation, which would answer one of the questions on the Hubway Data Challenge web site, Do Hubway cyclists only ride downhill?  A data dimension could be used to correlate against the academic schedule, or even gas prices.  Anyway, on to the eye candy…

For those of you who read Stephen Few you know that visual design isn’t easy.  Few’s philosophy espouses simplicity and elegance over complexity and flash.  If you can’t generally understand the data in less than ten seconds you have failed your audience.  Basic, muted colors that make careful use of highlights is preferred over harsh and bright color tones throughout.  These are all great recommendations, and as I progress through the different phases of my interaction with the data I will adhere more closely to these recommendations.  In the meantime, I simply want to profile the data using some basic charts and graphs.  The alternative to graphing the data is that you get wide and long grids of data with visual appeal.  The tradeoff is that you get to pivot the data, sort it, filter it, etc., but exceptions, trends, and a general sense of the data quality doesn’t readily present itself.

So, a quick and dirty way to start to understand the data is to graph it.  I have gotten used to the MicroStrategy graphing options, but many developers will cite the core graphing technology as one of the weaker aspects of the platform.  The widgets and visual insight graphics have exceeded the Desktop graphing capabilities, but I still like to use the graph formatting to create vertical / horizontal bar charts, scatterplots,  and time-series analyses. So, simply to get a flavor of the data I created a few graphs.

This graph shows the activity (trips) for a month — in the page-by — and I tried see if there was a way to quickly tell whether temperature spikes led to a decrease in usage.  To do this correctly I’d likely want to average out the trips by weekday and get a rolling temperature average.  Only with the means in place can I get a true understanding of whether a 10 degree shift in temperature leads to an n% variation in usage.

Trips and Temp, dual axis

One of the data challenge sample questions asks whether rentals after 2 AM are more likely to come from the under 25 population.  I extended this question to ask whether usage varies by gender.  I took liberties with the coloring for effect, but I would mute these tones in a dashboard.  I also incorporated another axis (trip distance) to see whether rides are longer at certain times in the day, but since I didn’t use an average metric, the second axis isn’t very meaningful.

Male Female Usage by Hour

No basic correlation study should go without a scatterplot.  The r values are included, but aren’t very telling.  To make this graph work I had to clear out the trips that involved 0 distance (i.e., the rent and return location are the same).  Because this graph also had month in the page-by, some months showed a higher r value than others.  Again, I’m simply using this to get a feel for the data and get some general answers to high level questions.

Scatterplot, male female correlation

Based on some feedback I got from a colleague, I was advised to try and label the axes.   I tried to do something that tied the color of the axis to the metric, and this is what I got.  To me this graph is telling in that it appears to suggest that as the bike rental program became part of the city culture, people started taking longer rides.

dual axis trending

So, it’s a start.  With some basic profiling underway I am starting to compile a list of some high level questions that might be telling or informative about the data.  Station analysis and trip patterns are a good place to go with the data, and some of the questions that I’ve started to formulate go along these lines:

  • Which station sees the most usage?
  • What percent of trips end and start in the same place?
  • What bikes see the most usage, and of them, what side of the river do they spend the most time on?
  • How has usage changed this year versus last year?  Can the data be used to illustrate the growth of the program in some neighborhoods versus others?
These questions have their parallels to the business community, and represent the typical deep dive that a business analyst would do.  The next layer of analysis is to take this data set and make predictions against the data.  For example, looking at the data at the end of July how closely could I predict usage at the various stations using the historical data, especially the trends elicited from July 2011?  Given a time of day, could I predict what percentage of bikes rented from station x will wind up at station y?  If overcrowding at a station is a problem, and people can’t drop their bikes off because the racks are full, do I need to transport bikes away from certain stations at a certain time of day?



Correlation, Causation, and … lag()

I gave a presentation back in June at a MicroStrategy Meetup and I used a simple data set to illustrate that even one dimension and three data points can yield interesting results.  My data included the following three things:

  • Daily closing price of gold
  • Daily closing price of oil
  • Daily closing value of the VIX (fear index)
Oil, Gold, and Vix by Time
One dimension, three data points

The recent 4 year data set, when visualized looks like this:

Three values, graphed over time

The complete data set:

Gold, Oil, VIX graphed
The three values, back to 1983

My thesis in working with these three data points was that somewhere in this data we could find evidence of correlation.  So, I went about the task of building out some reports that correlated some of the combinations of the data and I plotted them out:

The VIX and Oil saw a high correlation swing between 2007 and 2009, but the overall trend leading up to 2007 was inching upwards to 1.  The sudden drop in crude prices in 2008/2009 could partially explain the easing of the VIX since the financial crisis.

When I plotted the VIX against gold, I saw more dramatic correlation swings year over year.  I found these variation differences to be more interesting than the oil and fear relationship because I had assumed that these two would stay generally correlated above 0.  To see the VIX and gold dip so low in 2008 suggests that one wasn’t keeping up with other.

In the last step I plotted oil and gold together, and found similar precipitous changes year over year.  With the first two correlations there at least seems to be a pattern, but with this last one not so much.   What I was looking for was some consistency (stay above or below 0) in the correlation, but I did not observe this with this data.

Rather than looking for the obvious perfect match between these variables over time, my next thought was to insert a lag into the data and see whether some sort of offset would smooth out the relationships.  The thinking behind this being that socionomic forces exist behind these data points, but the shifts are either reactive or proactive.  For example, it is possible that the fear index responds to changes in oil prices, or that the daily price of gold  reflects the speculation that the economy is worsening and that the only good place to invest assets is in a common precious metal.  To accomplish this I created a series of objects in MicroStrategy that allowed to quickly change the lag parameter and test my assumption.

Metric Edito
Using an embedded lag function

My correlation metric is defined as Correlation([Gold Close (lag n)], [Fear Index Close (lag n)]) {Year}

Gold Close (lag n) is defined as:  Lag([Gold Close], ?[Lag Value Gold (Number)], 0) where the “?” represents a prompt value.

From this I could run a series of quick tests and using the standard deviation of the results I could start to see that embedding a negative lag (-30, -60, then -90) into the data started to lower the dispersion of values.

Standard Deviation - Lag -90

I could certainly do more with this data, and if I was desperate to find that perfect leading indicator that could predict where commodities or the S&P were headed I suppose I was start by extending this and looking for variation of the data that yielded the lowest possible standard deviation in correlation coefficients.  Beyond the sheer number of possibilities this small data set affords me, one could easily start to add more variables into the mix — the DOW closing price, pork belly futures, or the foreign currency exchange rates.