Hubway Data Challenge – Part 3

I’m out of my league.

I’ve seen two submissions for the data challenge already, and yeah, some people are really good.  For the rest of us who model data and try to create a narrative out of it on a daily basis, a very discrete data set with a very precise way of looking at it is a luxury.  Usually we work with a user population that thinks “I want a dashboard, but I don’t know what I want…can you build it for me?”  Not an issue here…the developers working on this exercise know exactly what they want to see.  The Hubway data set lends itself to being mapped, and while there are a lot of ways to do this, this submission by Russell Goldenberg is very creative.  It can be downloaded as an .app or as an .exe — or previewed (click on the picture to see it on Vimeo).

Then there’s this Neo4j graph database version by Max De Marzi.  I like how the width of the station corresponds to its traffic.

These aren’t dashboards, these are visualizations.  Appealing — yes — but very specific, and custom coded.  I have to keep that in mind, because working with business intelligence tools means that I am often trying to spice up relatively unexciting data.  But MicroStrategy 9.3 and the features added to Visual Insight (as well as the Report Services widget) mean that I at least have something to work with now in terms of network maps.  The three layout options are as follows:

  1. force-directed
  2. circular
  3. linear
To put myself in a position to visualize the bike system data, I created a cube that had both the bikes and the stations as attributes, as well as the basic metrics (trips, distance, duration).  With these levers to work with I came up with three views of the data for bike number B00079 in the months of August 2011 and 2012:
Since dashboards tend to be a collection of visualizations that, when combined, compel the user to quickly glean some information from the data set, these network diagrams might have some utility.  If I wanted to create a dashboard that allowed the user to understand which bikes have been used more heavily, the network view along with some numeric stats would be useful for devising a maintenance strategy for bikes.  The trap that I fell into when I first started playing with the network graphs was trying to put a lot of data into them all at once.  Looking at all of the starting and ending stations at once was a mess, but adding the bike attribute narrowed the data set down and brought clarity.
One of the big advantages of Visual Insight is that all of the hooks and interdependencies between data sets are taken care of.  As opposed to a report services dashboard, I don’t need to set the selectors or worry that I’ve forgotten to establish a target data set.  The tradeoff comes with customization, or the lack thereof.  I don’t have a place to drop a customer logo, and I don’t have control over many other things, like the grid formatting.
So, the question comes down to this: to make a nice visualization of the Hubway data set, should I create a Report Services dashboard or use Visual Insight?

Hubway Data Challenge Part 2

Once I had the Hubway bike system data loaded into a database, and modeled into MicroStrategy, I could start to play with the data and do some basic profiling.  The more I looked at the data, the more I wanted to add things.  For example, the trips table has the birth year of subscribing riders, which lends itself to creating an age attribute.  To model in age, I created an attribute and used the pass-through ApplySimple function.  This is the basic syntax needed: ApplySimple("(year(#0)-#1)", [start_date_desc], [birth_date]).

When added to a report by itself, the age attribute will generate the following SQL:

SELECT DISTINCT (YEAR(a11.start_date_desc)-a11.birth_date) CustCol_3
FROM trips a11
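
As a quick sanity check outside the database, the same arithmetic can be sketched in Python (the function name and sample dates here are mine, not part of the model):

```python
from datetime import date

def rider_age(start_date: date, birth_year: int) -> int:
    # Mirrors the ApplySimple expression: year(start_date) - birth_year
    return start_date.year - birth_year

print(rider_age(date(2012, 8, 15), 1985))  # prints 27
```

Like the generated SQL, this is year arithmetic only, so riders who haven’t yet had their birthday in the trip year come out a year older than they actually are.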

As mentioned in the part 1 post, the data offers the opportunity to add more layers and texture because the dimensions are so generic.  Latitude and longitude coordinates can be used to derive the elevation, which would answer one of the questions on the Hubway Data Challenge web site: do Hubway cyclists only ride downhill?  A date dimension could be used to correlate against the academic schedule, or even gas prices.  Anyway, on to the eye candy…

For those of you who read Stephen Few, you know that visual design isn’t easy.  Few’s philosophy espouses simplicity and elegance over complexity and flash.  If your audience can’t generally understand the data in less than ten seconds, you have failed them.  Basic, muted colors that make careful use of highlights are preferred over harsh and bright tones throughout.  These are all great recommendations, and as I progress through the different phases of my interaction with the data I will adhere more closely to them.  In the meantime, I simply want to profile the data using some basic charts and graphs.  The alternative to graphing the data is wide and long grids of data with little visual appeal.  The tradeoff is that you get to pivot the data, sort it, filter it, etc., but exceptions, trends, and a general sense of the data quality don’t readily present themselves.

So, a quick and dirty way to start to understand the data is to graph it.  I have gotten used to the MicroStrategy graphing options, but many developers will cite the core graphing technology as one of the weaker aspects of the platform.  The widgets and visual insight graphics have exceeded the Desktop graphing capabilities, but I still like to use the graph formatting to create vertical / horizontal bar charts, scatterplots,  and time-series analyses. So, simply to get a flavor of the data I created a few graphs.

This graph shows the activity (trips) for a month — in the page-by — and I tried to see if there was a way to quickly tell whether temperature spikes led to a decrease in usage.  To do this correctly I’d likely want to average out the trips by weekday and get a rolling temperature average.  Only with those means in place can I get a true understanding of whether a 10 degree shift in temperature leads to an n% variation in usage.

Trips and Temp, dual axis
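To make those two smoothing steps concrete, here is a rough Python sketch — averaging trips by weekday, then taking a trailing rolling mean of the temperature series.  The daily numbers below are invented for illustration:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical daily records: (weekday 0=Mon..6=Sun, trips, avg temp in F)
days = [
    (0, 310, 68), (1, 290, 71), (2, 305, 75),
    (3, 150, 92), (4, 320, 70), (5, 410, 66), (6, 380, 64),
]

# Step 1: average trips by weekday, so a hot Thursday is compared
# against typical Thursdays rather than against Saturdays
by_weekday = defaultdict(list)
for wd, trips, temp in days:
    by_weekday[wd].append(trips)
weekday_avg = {wd: mean(v) for wd, v in by_weekday.items()}

# Step 2: a trailing rolling mean of the temperature series
def rolling_mean(values, window=3):
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

temps = [t for _, _, t in days]
print(rolling_mean(temps))
```

With both means in place, a day’s usage can be expressed as a percent deviation from its weekday average and compared against the temperature deviation.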

One of the data challenge sample questions asks whether rentals after 2 AM are more likely to come from the under 25 population.  I extended this question to ask whether usage varies by gender.  I took liberties with the coloring for effect, but I would mute these tones in a dashboard.  I also incorporated another axis (trip distance) to see whether rides are longer at certain times in the day, but since I didn’t use an average metric, the second axis isn’t very meaningful.

Male Female Usage by Hour

No basic correlation study should go without a scatterplot.  The r values are included, but aren’t very telling.  To make this graph work I had to clear out the trips that involved 0 distance (i.e., the rent and return location are the same).  Because this graph also had month in the page-by, some months showed a higher r value than others.  Again, I’m simply using this to get a feel for the data and get some general answers to high level questions.

Scatterplot, male female correlation
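The r values themselves are easy enough to sanity-check outside the tool.  A sketch in Python, with invented (duration, distance) pairs and the same zero-distance filter applied:

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical (duration in minutes, distance in miles) trip pairs;
# round trips show a distance of 0 and would distort the fit
trips = [(12, 1.4), (25, 2.9), (40, 0.0), (8, 0.9), (33, 3.6)]
nonzero = [(d, m) for d, m in trips if m > 0]
r = pearson_r([d for d, _ in nonzero], [m for _, m in nonzero])
print(round(r, 3))
```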

Based on some feedback I got from a colleague, I tried labeling the axes.  I tied the color of each axis to its metric, and this is what I got.  To me this graph is telling in that it appears to suggest that as the bike rental program became part of the city culture, people started taking longer rides.

dual axis trending

So, it’s a start.  With some basic profiling underway I am starting to compile a list of some high level questions that might be telling or informative about the data.  Station analysis and trip patterns are a good place to go with the data, and some of the questions that I’ve started to formulate go along these lines:

  • Which station sees the most usage?
  • What percent of trips end and start in the same place?
  • What bikes see the most usage, and of them, what side of the river do they spend the most time on?
  • How has usage changed this year versus last year?  Can the data be used to illustrate the growth of the program in some neighborhoods versus others?
These questions have parallels in the business community, and represent the typical deep dive that a business analyst would do.  The next layer of analysis is to take this data set and make predictions against it.  For example, looking at the data at the end of July, how closely could I predict usage at the various stations using the historical data, especially the trends elicited from July 2011?  Given a time of day, could I predict what percentage of bikes rented from station x will wind up at station y?  If overcrowding at a station is a problem, and people can’t drop their bikes off because the racks are full, do I need to transport bikes away from certain stations at a certain time of day?



Hubway Data Challenge Part 1

I was interested to see that a data set had been posted and that a competition had been started to visualize data collected from the Hubway bike system.  For the uninitiated, Hubway is a bike rental system with racks scattered across Boston.  Users pay with a credit card or have a subscription to use the bikes.  When I was working in downtown Boston I would see these bikes all over the place, especially along the Esplanade and going down Boylston Street.

The data set itself is a pair of Excel files — stations and trips — totaling about 10 MB zipped.  While quite simple, the data by itself represents an opportunity to do some interesting analysis based on the lat/long pairs associated with the start and end points of the bike rental system.  The date pairs also lend themselves to time-series analysis.  With date as a hinge, other data can be incorporated; in my example I added a comprehensive Date Dim table that extends the data into a time hierarchy (weeks, months, years), and I pulled in weather data to give myself an opportunity to do some basic correlations.

Some of the challenges that I faced in working with this data in MicroStrategy included:

  1. Modeling the same table (stations) for the start and end points
  2. Calculating distance from a lat/long pair
  3. Using a web service to automate the elevation of the stations
  4. Plotting the lat/long coordinates on a map

I have yet to overcome items 3 & 4, but the first two were interesting problems.  The ultimate goal of this exercise is to produce a meaningful visualization, and since MicroStrategy 9.3 was just released, this data set provides an opportunity to test some of the network diagrams, mapping widgets, and Visual Insight capabilities.

For problem #1, the solution in MicroStrategy is to use table aliases.  Basically, from a modeling standpoint aliases mean that architects do not need to create views to replicate a table.

The table alias within MicroStrategy tells the SQL generation engine that the same table can be used twice.

To create a table alias, go to the schema → tables folder and right-click on a table that has already been modeled.  Select “Create Table Alias” and a new copy of the table will appear.  For my purposes I created two stations tables, one referencing the start station and one the end.  Within the attributes that reference the table, make sure that the mapping is set to manual; otherwise the automatic mapping will try to point to both the old and the aliased table.

The resulting SQL for a report that wants to join Start Station and End Station would look something like this:

select a11.end_station_id  id,
a11.start_station_id  id0,
count(distinct a11.id)  WJXBFS1
from trips a11
join stations a12
on (a11.end_station_id = a12.id)
join stations a13
on (a11.start_station_id = a13.id)
group by a11.end_station_id,
a11.start_station_id

By aliasing the stations table twice, the engine is forced to join the table against itself, but the overhead from the database side is minimal.  From this we can start to glean some basic information from the data.  The South Station / North Station (TD Garden) ride is the most commonly used, and this is explained by the fact that there is no good way to get to South Station from North Station or vice versa!  Taking a bike probably constitutes a roughly seven-minute ride.  I would speculate that these rides happen during rush hour, but I’ll table that speculation for future analysis.

The next challenge was to calculate distances between stations.  I found a good site that showed how to do this in Excel, and fortunately transposing Excel syntax into MicroStrategy is straightforward since the functions are named exactly the same.  Here is what the calculation looks like in Excel:

and here is what it looks like in MicroStrategy:

With this calculation in place, the previous report could be enhanced to include distance, and then by combining the distance with the trips you could derive a total mileage value.
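Since the screenshots don’t carry the formula itself: both the Excel and MicroStrategy versions are transcriptions of the haversine great-circle formula.  A Python sketch of the same math (the station coordinates below are approximations I looked up, not values from the data set):

```python
from math import asin, cos, radians, sin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long points, in miles."""
    r = 3959  # mean Earth radius in miles
    p1, p2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlam / 2) ** 2
    return 2 * r * asin(sqrt(a))

# Approximate coordinates: South Station to North Station (TD Garden)
print(round(haversine_miles(42.3522, -71.0552, 42.3663, -71.0621), 2))
```

This rack-to-rack, as-the-crow-flies distance understates the actual ride, of course, but it is good enough for profiling.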

The downside of this is that unless the start and end stations are different, then the total distance will be 0, as is the case with the Boston Public Library bike rack.

So, this is how I started the data analysis, and I have continued to build out other attributes to fully form the data and make it more interesting.  The next steps are to start to visualize the data.  I started to play with this, and with the availability of Cloud Personal, I threw up some data slices and created a first pass of a visualization.

In the coming weeks some submissions should start coming online.  I have been more focused on pulling outside data together to add flavor and color to the raw data set, and a colleague suggested I work other events, like Red Sox games or holidays, into my analysis.  Any other suggestions?





MicroStrategy Task API

MicroStrategy’s Task API is an advanced topic within the MicroStrategy framework, but in short, it is a protocol that allows for programmatic access to MicroStrategy in a lightweight manner.  The Task API differs from the URL API, which is a very cheap and easy way to call reports and control what renders and which features are turned on or off.  Bryan Brandow has a very good explanation on his blog.  Rather than carry all of the overhead and expense of passing a full URL, with the user name and password embedded in it, the Task API can pass instructions to MicroStrategy and return the smallest possible result set.

My interest in the Task API came from a need to trim MicroStrategy functionality down to its lightest possible format. I needed to pass instructions to MicroStrategy either through SOAP or HTTP and get a small XML or JSON file back. MicroStrategy’s SDK provides the bulk of what is needed to set up the Task API (i.e., where the .war file is if you want to deploy on Tomcat). The base URL for the API with a Tomcat web server is:

http://localhost:8080/MicroStrategy/servlet/taskProc


In release 9.2.1 the Task API included a new feature called reportDataService. The beauty of this task is that it handles the login, execution, and then logout of a user. This kind of drive-by execution in one call greatly simplifies getting data from MicroStrategy quickly. My example does not pass prompt parameters, but it could. To keep it simple I created a three row, four column static report in MicroStrategy Desktop.

With the Task API framework set up, and the report created (right-click on it to get the Report ID), I could now create a simple .jsp page that called the report. My final output looks like this (some liberties taken with a .css file and a superfluous call to a jQuery library, and voilà!):

Since I used HTTP to make the call to MicroStrategy, I had to wrap and capture the response so that I could see the output. Even though the Task API has an interface to do this on the taskAdmin → Builder page, I wanted my own wrapper so that I could do some things with the output.

First, I need to bring in a handful of Java libraries to handle the HTTP calls and data parsing:

<%@ page import="java.util.*" %>
<%@ page import="java.io.*" %>
<%@ page import="java.net.*" %>

Here are the fundamental snippets:

String sUserName = request.getParameter("userName");
String sPassword = request.getParameter("password");
String sReportID = request.getParameter("reportID");

URL uUrl1 = new URL("http://localhost:8080/MicroStrategy/servlet/taskProc?"
    + "taskId=reportDataService&taskEnv=xml&taskContentType=html"
    + "&server=SERVER2008&project=MicroStrategy+Tutorial"
    + "&userid=" + sUserName
    + "&password=" + sPassword
    + "&reportID=" + sReportID
    + "&styleName=CustomXMLReportStyle");

String sUrl1 = uUrl1.toString();
String sOutputLine = "";   

sOutputLine = GetContent(sUrl1).toString() ;

This handles the construction of the URL from the form. The GetContent() function does the work:

StringBuffer GetContent(String sUrl1) throws Exception {
    URL uURL1 = new URL(sUrl1);
    BufferedReader oReader = new BufferedReader( new InputStreamReader( uURL1.openStream()));

    StringBuffer sResult = new StringBuffer("") ;
    String sInputLine = null ;
    while ((sInputLine = oReader.readLine()) != null) {
        sResult.append(sInputLine) ;
    }
    oReader.close() ;
    return(sResult) ;
}

The output in this example will be XML with some HTML wrapping.

value="<taskResponse statusCode="200">
<?xml version="1.0" encoding="utf-8"?>
      <col index="1">Category</col>
      <col index="2">Promotion Type</col>
      <col index="3">Gross Revenue</col>
      <col index="4">Cost</col>
      <col index="5">Profit</col>
      <col index="6">Profit Margin</col>
   <row index="1">
      <col index="1">Books</col>
      <col index="2">No Promotion</col>
      <col index="3">$893,845</col>
      <col index="4">$679,891</col>
      <col index="5">$213,954</col>
      <col index="6">23.94%</col>
   </row>
   <row index="2">
      <col index="1">Books</col>
      <col index="2">Seasonal Sale</col>
      <col index="3">$237,616</col>
      <col index="4">$180,750</col>
      <col index="5">$27,377</col>
      <col index="6">13.15%</col>
   </row>
   <row index="3">
      <col index="1">Books</col>
      <col index="2">Special Sale</col>
      <col index="3">$24,655</col>
      <col index="4">$18,756</col>
      <col index="5">$968</col>
      <col index="6">4.91%</col>
   </row>
</taskResponse>"

There are several use cases for this level of simple query-response execution, and an external data API to extend an existing MicroStrategy environment for data consumption seems like the easiest to comprehend. Although MicroStrategy has portlets available, the Task API lets developers go one level deeper into the platform. When the result is only XML or JSON, the consuming application can do whatever it wants with the chunk that is returned.
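As a sketch of that consumption pattern, here is how a client could pick apart the XML chunk with nothing but Python’s standard library (the snippet below is a trimmed-down, hypothetical version of the response shown above):

```python
import xml.etree.ElementTree as ET

# Trimmed-down, hypothetical version of the reportDataService chunk
xml_chunk = """
<taskResponse statusCode="200">
   <row index="1">
      <col index="1">Books</col>
      <col index="2">No Promotion</col>
      <col index="3">$893,845</col>
   </row>
</taskResponse>
"""

root = ET.fromstring(xml_chunk)
# Turn each <row> into a plain list of its <col> text values
rows = [[col.text for col in row] for row in root.findall("row")]
print(rows)
```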

Correlation, Causation, and … lag()

I gave a presentation back in June at a MicroStrategy Meetup and I used a simple data set to illustrate that even one dimension and three data points can yield interesting results.  My data included the following three things:

  • Daily closing price of gold
  • Daily closing price of oil
  • Daily closing value of the VIX (fear index)
Oil, Gold, and Vix by Time
One dimension, three data points

The recent 4-year data set, when visualized, looks like this:

Three values, graphed over time

The complete data set:

Gold, Oil, VIX graphed
The three values, back to 1983

My thesis in working with these three data points was that somewhere in this data we could find evidence of correlation.  So, I went about the task of building out some reports that correlated some of the combinations of the data and I plotted them out:

The VIX and Oil saw a high correlation swing between 2007 and 2009, but the overall trend leading up to 2007 was inching upwards to 1.  The sudden drop in crude prices in 2008/2009 could partially explain the easing of the VIX since the financial crisis.

When I plotted the VIX against gold, I saw more dramatic correlation swings year over year.  I found these variations more interesting than the oil and fear relationship because I had assumed that these two would stay generally correlated above 0.  To see the VIX and gold dip so low in 2008 suggests that one wasn’t keeping up with the other.

In the last step I plotted oil and gold together, and found similar precipitous changes year over year.  With the first two correlations there at least seems to be a pattern, but with this last one not so much.   What I was looking for was some consistency (stay above or below 0) in the correlation, but I did not observe this with this data.

Rather than looking for the obvious perfect match between these variables over time, my next thought was to insert a lag into the data and see whether some sort of offset would smooth out the relationships.  The thinking behind this is that socionomic forces exist behind these data points, but the shifts are either reactive or proactive.  For example, it is possible that the fear index responds to changes in oil prices, or that the daily price of gold reflects speculation that the economy is worsening and that the only good place to invest assets is in a common precious metal.  To accomplish this I created a series of objects in MicroStrategy that allowed me to quickly change the lag parameter and test my assumption.

Metric Editor
Using an embedded lag function

My correlation metric is defined as Correlation([Gold Close (lag n)], [Fear Index Close (lag n)]) {Year}

Gold Close (lag n) is defined as:  Lag([Gold Close], ?[Lag Value Gold (Number)], 0) where the “?” represents a prompt value.

From this I could run a series of quick tests and using the standard deviation of the results I could start to see that embedding a negative lag (-30, -60, then -90) into the data started to lower the dispersion of values.

Standard Deviation - Lag -90
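The same sweep can be sketched outside of MicroStrategy.  The snippet below is a toy Python version of the idea — not my actual metric definitions — using synthetic series in which the “fear” series trails the “gold” series by a known 60 observations, so the lag search should land on -60:

```python
import math
from statistics import mean

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def lag_correlation(a, b, lag):
    """Correlate a[t] against b[t + lag]; a negative lag looks back in b."""
    if lag < 0:
        a, b = a[-lag:], b[:lag]
    elif lag > 0:
        a, b = a[:-lag], b[lag:]
    return pearson_r(a, b)

# Synthetic stand-ins: "vix" trails "gold" by 60 observations
gold = [math.sin(i / 40) for i in range(400)]
vix = [math.sin((i - 60) / 40) for i in range(400)]

# Sweep the same lags as the prompted metric (0, -30, -60, -90)
scores = {lag: lag_correlation(vix, gold, lag) for lag in (0, -30, -60, -90)}
best_lag = max(scores, key=scores.get)
print(best_lag, round(scores[best_lag], 3))
```

On real, noisy series the winning lag would not be this clean, which is exactly why the standard deviation of the year-by-year coefficients is the more honest scorecard.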

I could certainly do more with this data, and if I were desperate to find that perfect leading indicator that could predict where commodities or the S&P were headed, I suppose I would start by extending this and looking for the variation of the data that yielded the lowest possible standard deviation in correlation coefficients.  Beyond the sheer number of possibilities this small data set affords me, one could easily start to add more variables into the mix — the Dow closing price, pork belly futures, or foreign currency exchange rates.



Still prepping the site but mostly there…

Still tweaking the widgets and the plugins for the site.  Below was the original screenshot I used for the banner on my site.  I needed to blur it a bit, hence I reversed the colors and made the white background black.

Original dual-axis chart from the time series widget in a MicroStrategy dashboard — this was from one of the network dashboards we created.

As I have been working through the list of things I want to initially cover, I’ve realized how many sites and blogs get started and then die from lack of momentum.  Reading through other blogs and books about the topic, I realize that writing like this requires a cadence and a discipline.  I can’t imagine keeping a site with the regularity that Paul Krugman does, but keeping the site fresh for me means writing about a litany of topics.

I’ve been debating whether to include some of the product ideas that I’ve had over the years, but I think I can write about aspects of my ideas without revealing the whole concept.  If my readers can put the sum of the parts together, then all the better.

The last idea I mocked up was a simplified BI interface that borrowed from the Windows 8 theme of boxed simplicity.

make it easier
simplified UI for BI – too many colors?

I’ll write more on this one in a future post, but the exercise of putting together a wireframe like this was almost as much fun as coming up with the idea.

In the meantime my to do list is getting longer by the day and the time I have to knock these items off is getting shorter.  The wish list includes:

  • MicroStrategy install on CentOS
  • Reviewing the enhancements in MicroStrategy 9.3
  • Hooking the Cloudera VM to 9.3
  • Completing the Coursera class
  • Skinning MicroStrategy Mobile in Xcode

…and next week will include the TDWI conference in Boston and the MicroStrategy Meetup.

Prepping the site…

Around this time of year in New Mexico the sunflowers are in full bloom and the nights get a touch colder.  I took this picture in the fall of 2009, just as I was preparing to move back to Massachusetts.  The truck’s suspension has been lifted since this photo was taken, a result of too many close calls getting stuck in Stubblefield with Conor, and a need to have a beefier set of tires to make the trip back east.

We can’t say that Boston has been bad to us.  It’s just that Santa Fe was very good to us. I went there with a girlfriend and dog, and came back with a wife and a baby girl.  Rest in peace, Jeb…you were a great dog.