On November 29, 2013, I wrote a piece titled “Not Worth the Cost: A 17-Month Case Study of Congestion Pricing in the SF Bay Area.” In that piece, I presented data I manually collected on toll costs for the westbound Express Lane on Highway 237 (running from Milpitas to Sunnyvale, CA) versus the drive time on the highway’s general purpose lanes. I was disappointed to find that the relationship between the two was not very reliable. Moreover, I concluded that neither the toll nor the overall project cost is worth paying.

Motivated by comments and questions from a reader, I decided to take a deeper look at the data to see whether I could tease out some more complex relationships. I will be doing this analysis in stages. In this first stage, I developed a simple machine learning model using a regression tree to predict drive times based on a full array of variables.

I broke up the data into the following independent variables:

- **Cost**: price of the toll on the Express Lane in dollars.
- **Month**: index for the month of the year (1=Jan, 2=Feb, etc.) using the date of the data collection.
- **DayofWeek**: index for the day of the week (2=Mon, 3=Tue, etc.) using the date of the data collection. Note that the tolls only apply on non-holiday weekdays.
- **WeekOfYear**: index for the week of the year (1=the first week, which contains Jan 1st; 2=the second week; etc.) using the date of the data collection. Note that the week starts on Monday.
- **Hour**: the hour component of the start time of the drive on the general purpose lane.
- **Minute**: the minute component of the start time of the drive on the general purpose lane.

The dependent variable (what the model is trying to predict/classify) is the duration of the drive on the general purpose lane in seconds. I coded this as **DriveTimeSeconds**.

I used the e1071 package in R to select the best regression tree via 10-fold cross-validation.
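Since the R code is not reproduced here, the core mechanic of fitting a regression tree is worth sketching: at each node, the algorithm tries candidate thresholds on a feature and keeps the split that minimizes the within-group sum of squared errors (SSE) of the target. A minimal sketch in Python, using made-up toy data (not the post's actual measurements):

```python
# Minimal sketch of how a regression tree picks a single split point:
# try every candidate threshold on one feature and keep the one that
# minimizes the total within-group sum of squared errors (SSE).
# The data below are illustrative only, not the post's measurements.

def sse(values):
    """Sum of squared errors around the group mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(costs, drive_times):
    """Return (threshold, total_sse) for the best binary split on cost."""
    best = (None, float("inf"))
    # candidate thresholds: the distinct observed cost values
    for c in sorted(set(costs)):
        left = [t for x, t in zip(costs, drive_times) if x < c]
        right = [t for x, t in zip(costs, drive_times) if x >= c]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (c, total)
    return best

# Toy data: cheap tolls pair with short drives, expensive with long
costs = [0.85, 1.50, 2.05, 2.55, 3.00, 4.25]
times = [300, 350, 480, 700, 770, 800]
threshold, err = best_split(costs, times)
# threshold -> 2.55: the split that best separates short from long drives
```

A real fit (as in e1071/rpart) repeats this search recursively over all features and uses cross-validation to decide when to stop splitting, but the threshold search above is the basic building block.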

The initial results are promising. The regression tree below shows that drive time on the general purpose lane is influenced by the day of the week and the fraction of the hour (but NOT the hour itself!). Adding these variables to the cost information from the toll lanes provides a richer understanding of resulting drive times. Of course, the model cannot know whether congestion itself depends on the day of the week and the time, but, for now, congestion does not appear to be material to this model since I generally did not observe congestion in the Express Lane during my data collection. Ironically, the last two days that I added to the data – December 3rd and 6th – DO include some observed (very minor) congestion in the Express Lane.

Here is how to interpret the branches of the tree (from left to right of the leaf or end nodes):

- If the cost of the Express Lane is less than $2.05, then I can expect an average drive time of 479.2 seconds (8.0 minutes).
- If the cost is less than $2.55 but at least $2.05 AND the weekday is a Wednesday, Thursday, or Friday, then I can expect an average drive time of 641.5 seconds (10.7 minutes).
- If the cost is less than $2.55 but at least $2.05 AND the weekday is a Monday or Tuesday, then I can expect an average drive time of 700.7 seconds (11.7 minutes).
- If the cost is at least $2.55 AND the time is before half past the hour AND the time is at least 15.5 minutes past the hour, then I can expect an average drive time of 701.2 seconds (11.7 minutes).
- If the cost is at least $2.55 AND the time is before 15.5 minutes past the hour, then I can expect an average drive time of 766.8 seconds (12.8 minutes).
- If the cost is at least $2.55 AND the time is at least 30 minutes past the hour, then I can expect an average drive time of 803 seconds (13.4 minutes).
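The six leaf rules above can be transcribed as a small piecewise function. A sketch in Python (the tree itself was fit in R; the thresholds and leaf averages below are copied directly from the list above):

```python
# The fitted tree's leaf rules, transcribed from the list above.
# Inputs: toll cost in dollars, day of week (2=Mon ... 6=Fri),
# and the minute component of the drive's start time.

def predicted_drive_time(cost, day_of_week, minute):
    """Return the tree's predicted average drive time in seconds."""
    if cost < 2.05:
        return 479.2
    if cost < 2.55:
        # Wed/Thu/Fri branch vs. Mon/Tue branch
        return 641.5 if day_of_week >= 4 else 700.7
    # cost >= 2.55: split on the minute of the hour
    if minute < 15.5:
        return 766.8
    if minute < 30:
        return 701.2
    return 803.0
```

Writing it out this way makes the piecewise structure explicit: only two thresholds on cost ($2.05 and $2.55) appear, even though the observed tolls spanned $0.85 to $4.25.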

With these results, I can move beyond the disappointing scatter of the 2-dimensional graph of drive time versus cost and see the more complex relationships at work. It is VERY interesting to see that while the tolls ranged from $0.85 to $4.25, the tree only contains two branching points based on cost. This verifies that cost is not a sufficient determinant of average driving time from the perspective of the driver in the general purpose lane.

The chart below recasts the original chart: it color-codes the points according to the rules from the regression tree. You can now visualize how the algorithm partitioned the data. The “nodes” in the legend are ordered and numbered as shown in the list above.

With this format, you can also visualize which parts of the model have the highest error rates. The very first rule, “Node1”, has the highest error rate given that with a cost less than $2.05 drive time can range from 200 to 800 seconds (3.3 to 13.3 minutes). If I had additional variables at my disposal, I might be able to reduce the error rate of this region of data. This model can also be a starting point to help the VTA generate a more consistent congestion pricing model (again, from the perspective of the general purpose driver).
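The within-node error described here can be quantified simply: group the observations by leaf and measure the spread of drive times inside each group. A sketch with hypothetical observations (the post's raw data are not reproduced in this article):

```python
# Hypothetical (cost, drive_time_seconds) observations, chosen to
# mirror the pattern described above: the cheap-toll node ("Node1",
# cost < $2.05) has a wide spread of drive times, while the
# high-cost leaves are comparatively tight.
observations = [
    (1.00, 210), (1.50, 420), (1.85, 790),   # Node1: wide spread
    (3.00, 760), (3.50, 780), (4.25, 800),   # high-cost leaf: tight
]

def node_spread(obs, in_node):
    """Range (max - min) of drive times among observations in a node."""
    times = [t for cost, t in obs if in_node(cost)]
    return max(times) - min(times)

node1_spread = node_spread(observations, lambda c: c < 2.05)
other_spread = node_spread(observations, lambda c: c >= 2.55)
```

A large spread within a leaf means the tree's single predicted average is a poor summary of that region, which is exactly the Node1 problem the chart exposes.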

In a future analysis, I will apply k-means clustering to these data to see whether I can generate even richer results. I think the partitioning routine of k-means should be well-suited to this problem. I will also explore metrics of performance of these models. Stay tuned!

(Author’s addendum for December 7, 2013: I neglected to include a variable for the year in the above analysis. Such a variable is very effective in detecting whether the VTA’s pricing algorithm has experienced significant change over time. After adding in the year, the model did not change. However, going forward, I will keep this variable so that any significant changes do get flagged.)

Hey Duru! I’m loving these analyses a lot! I see what you did. Creating a decision tree for commute times and color coding the nodes on the same plot from last time is way better than arbitrarily assigning low/med/high categories. A few questions and comments: 1) Is it possible to use other variables to break the decision tree, like weather or a binary variable for accident/no accident? Or I guess, would you want to?

2) Instead of using the current minute and hour variable, would it be better to use a minute variable that contains values from 0 (12am) – 1440 to denote the time of day?

3) Can’t wait for the k-means on this data. Based on your updated graph, I’m seeing 3 clusters, first for node 1, second for node 2-3 and third for node 4-6. Can that represent different tiers of traffic conditions?

4) Based on your results, I’m trying to interpret the meaning of a high error present for nodes 1-3. For example, if they price it <= $2.05, there's a wide range of commute times for general purpose, meaning it's a toss up for how many people actually use the express lane at that price point? Priced above $2.55, it's not going to affect the commute time in the general purpose lane. But is that because it's too expensive for drivers or because it has no impact on the general use lanes due to the sheer amount of existing traffic?

5) High prices are associated with high commute times in the general use lanes. That's probably normal because high traffic times are when they would increase the price to use the express lane. I'm wondering if there's a way to generate a baseline commute time to understand the true effects of the pricing of the express lane. For example, take the average commute times for nodes 4-6 (or 2-3); how does that compare to the average commute time on those days without the express lane? Of course, there's no way to get that number, but are there any methods for generating a proxy for it?

Thanks so much for doing this analysis, I'm really enjoying it!

Thanks again for reading!

Answers to your questions.

1. You can use any binary/indicator/dummy variable you want. To be completely rigorous, I should be checking for independence (or collinearity). In this case, I never observed any accidents during any measurement day, so I have no data for including it. However, I will say that the pricing logic *should* be independent of the presence of an accident. Accidents are just another driver of congestion, which is the final variable we really care about. We don’t necessarily care where the congestion came from (a bunch of drivers having a bad day, a bunch of slow-moving cement trucks, etc.). The cause of congestion could matter if the VTA’s pricing system is very slow to respond to real-time driving conditions.

2. Since the vast majority of my driving was in the same 30-minute-or-so time frame, I decided not to go granular with the specific time of morning. With more data, that time would indeed be more interesting. I was actually hoping the model would be independent of the minute and hour and was VERY surprised to see it wasn’t. It is likely most accurate to interpret this as meaning it mattered whether I hit 237 before or after 9am.

3. Definitely. I think there are multiple tiers of traffic going on. I can’t wait to try k-means (or even SVM)! Stay tuned. This could take me a while though. 🙂

4. From the VTA’s perspective, their first priority is to make sure traffic is flowing smoothly through the Express Lane. So, theoretically, they are first concerned with making sure the price is high enough not to encourage “too many” drivers to spill into the Express Lane. Since I never observed congestion on the Express Lanes except for my last two observations, I am actually a bit at a loss at how the VTA varied pricing. The congestion on the general purpose lanes *must* be taken into account. But if so, it is very surprising to see such a wide range of performance in their logic. From the perspective of a driver trying to make a decision, if you see a price of $2.05 or below, you might as well just slog it out on the general purpose lanes. You are almost as likely to end up with a short drive as a very long one. The exception is if you have an extremely low tolerance for delay on these few miles. Then of course you will pay anything offered up to you. Assuming I am correct about the lack of congestion on the Express Lane, I think the VTA is chronically over-pricing, but there may be some revenue-generating concerns that keep the price higher. That is, the VTA may be most interested in hitting a sweet spot of price elasticity amongst the relatively wealthy drivers who must be tossing their money out the window when using the Express Lane.

5. It would have been best to have data from the time BEFORE the VTA constructed the Express Lane. Maybe a Freedom of Information Act request can get me that. 🙂 I am sure there are several ways to get at a proxy, but they would be messy and subject to interpretation. Perhaps the best way is to get the specific data on paying users of the Express Lane. If we assumed all those people were using the general purpose lanes, we could develop a proxy for baseline traffic (with assumptions about when they would commute without the benefit of the Express Lane).

Thanks again for all the great questions!

Thanks Duru for answering my questions! (and apologies if some of them sound amateurish) I just have a couple of follow-up questions if you don’t mind 🙂 1) How would you measure the “flow” of traffic? Average mph? Or some other metric? 2) Let’s say you worked at VTA and were in charge of implementing this program. Imagine you had all the data that you needed, even a historical baseline for traffic before the express lane. How would you determine the “best” price to set? I’m just curious to know what methodology you’d use to find the “best” price point. Thanks again Duru!

Hi Kevin, I am FINALLY getting back to this topic over three years later. I will answer your questions for future readers:

The flow of traffic should be measured by average MPH. That way, you have an approximate measure of congestion, with speeds below some threshold defining congestion. If I had all the data I needed, I would just set the price to maximize revenue. The dynamic pricing algorithm would need to respond to existing patterns in real-time. That is, historical data would not be enough. I think pricing to manage congestion in the express lane is the correct way to go.