On November 29, 2013, I wrote a piece titled “Not Worth the Cost: A 17-Month Case Study of Congestion Pricing in the SF Bay Area.” In that piece, I presented data I manually collected on toll costs for the westbound Express Lane on Highway 237 (running from Milpitas to Sunnyvale, CA) versus the drive time on the highway’s general purpose lanes. I was disappointed to find that the relationship between the two was not very reliable. Moreover, I concluded that neither the toll nor the overall projects costs are worth paying.
Motivated by comments and questions from a reader, I decided to take a deeper look at the data to see whether I could tease out some more complex relationships. I will be doing this analysis in stages. In this first stage, I developed a simple machine learning model using a regression tree to predict drive times based on a full array of variables.
I broke up the data into the following independent variables:
- Cost: price of the toll on the Express Lane in dollars.
- Month: index for month of the year (1=Jan, 2=Feb, etc…) using the date of the data collection.
- DayofWeek: index for the day of the week (2=Mon, 3=Tue, etc..) using the date of the data collection. Note that the tolls only apply on non-holiday weekdays.
- WeekOfYear: index for the week of the year (1=the first week which is Jan 1st, 2 = the second week, etc..) using the date of the data collection. Note that the week starts on Monday.
- Hour: the hour component of the start time of the drive on the general purpose lane.
- Minute: the minute component of the start time of the drive on the general purpose lane.
The dependent variable (what the model is trying to predict/classify) is the duration of the drive on the general purpose lane in seconds. I coded this as DriveTimeSeconds.
I used the e1071 package in R for creating the best regression tree using 10-fold cross-validation.
The initial results are promising. The regression tree below shows that drive time on the general purpose lane is influenced by the day of the week and the fraction of the hour (but NOT the hour itself!). Adding these variables to the cost information from the toll lanes provides a richer understanding of resulting drive times. Of course, the model cannot know whether congestion itself depends on the day of the week and the time, but, for now, congestion does not appear to be material to this model since I generally did not observe congestion in the Express Lane during my data collection. Ironically, the last two days that I added to the data – December 3rd and 6th – DO include some observed (very minor) congestion in the Express Lane.
Here is how to interpret the branches of the tree (from left to right of the leaf or end nodes):
- If the cost of the Express Lane is less than $2.05, then I can an average drive time of 479.2 seconds (8.0 minutes).
- If the cost is less than $2.55 but at least $2.05 AND the weekday is a Wednesday, Thursday, or Friday, then I can expect an average drive time of 641.5 seconds (10.7 minutes).
- If the cost is less than $2.55 but at least $2.05 AND the weekday is a Monday or Tuesday, then I can expect an average drive time of 700.7 seconds (11.7 minutes).
- If the cost is at least $2.55 AND the time is before half past the hour AND the time is at least 15.5 minutes past the hour, then I can expect an average drive time of 701.2 seconds (11.7 minutes).
- If the cost is at least $2.55 AND the time is before 15.5 minutes past the hour, then I can expect an average drive time of 766.8 seconds (12.8 minutes).
- If the cost is at least $2.55 AND the time at least 30 minutes past the hour, then I can expect an average drive time of 803 seconds (13.4 minutes).
With these results, I can move beyond the disappointing scatter of the 2-dimensional graph of drive time versus cost and see the more complex relationships at work. It is VERY interesting to see that while the tolls ranged from $0.85 to $4.25, the tree only contains two branching points based on cost. This verifies that cost is not a sufficient determinant of average driving time from the perspective of the driver in the general purpose lane.
The chart below recasts the original chart: it color-codes the points according to the rules from the regression tree. You can now visualize how the algorithm partitioned the data. The “nodes” in the legend are ordered and numbered as shown in the list above.
With this format, you can also visualize which parts of the model have the highest error rates. The very first rule, “Node1”, has the highest error rate given that with a cost less than $2.05 drive time can range from 200 to 800 seconds (3.3 to 13.3 minutes). If I had additional variables at my disposable, I might be able to reduce the error rate of this region of data. This model can also be a starting point to help the VTA generate a more consistent congestion pricing model (again, from the perspective of the general purpose driver).
In a future analyze, I will apply k-means clustering to these data to see whether I can generate even richer results. I think the partitioning routine of k-means should be well-suited to this problem. I will also explore metrics of performance of these models. Stay tuned!
(Author’s addendum for December 7, 2013: I neglected to include a variable for the year in the above analysis. Such a variable is very effective in detecting whether the VTA’s pricing algorithm has experienced significant change over time. After adding in the year, the model did not change. However, going forward, I will keep this variable so that any significant changes do get flagged.)