In my Statistics I class at Northwestern, we were split into teams for projects applying statistics to real-world data. My group chose daily ridership counts from the Chicago Transit Authority's "L" train system, which are gathered from the turnstiles and released by the City of Chicago on its data portal.
We sought to predict future ridership stop by stop using regression, so that the CTA could anticipate where additional capacity would be needed and where stops could be downsized or closed. This approach also meant that stop-level predictions could be aggregated into broader predictions about specific train lines, neighborhoods, or the system as a whole.
We used Python with the statsmodels library to fit curves based on an equation we devised after observing the data. Our regression equation accounted for day-of-the-week and day-of-the-year changes as well as general year-over-year trends. We plotted actual ridership against our predicted curves using matplotlib on graphs like the one below:
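For a sense of what a fit like this can look like in code, here is a minimal sketch. It is not our project's actual equation or data: the ridership numbers are synthetic, the stop is hypothetical, and the specific terms (a categorical day-of-week effect, one sine/cosine pair for the time of year, and a linear yearly trend) are assumptions chosen only to illustrate the three kinds of effects mentioned above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic daily ridership for one hypothetical stop (NOT real CTA data).
rng = np.random.default_rng(0)
dates = pd.date_range("2010-01-01", "2012-12-31", freq="D")
df = pd.DataFrame({"date": dates})
df["dow"] = df["date"].dt.dayofweek      # 0 = Monday ... 6 = Sunday
df["doy"] = df["date"].dt.dayofyear      # 1 ... 365 (or 366)
df["t"] = np.arange(len(df)) / 365.25    # years since the first day

# Day-of-year seasonality encoded as one sine/cosine pair (an assumption).
angle = 2 * np.pi * df["doy"] / 365.25
df["season_sin"] = np.sin(angle)
df["season_cos"] = np.cos(angle)

# Made-up "true" pattern: busier weekdays, a seasonal swing,
# and growth of 100 riders per year, plus random noise.
weekday_effect = np.where(df["dow"] < 5, 1500.0, 0.0)
df["rides"] = (
    3000
    + weekday_effect
    + 400 * df["season_sin"]
    + 100 * df["t"]
    + rng.normal(0, 150, len(df))
)

# Ordinary least squares: C(dow) gives each day of the week its own offset.
model = smf.ols("rides ~ C(dow) + season_sin + season_cos + t", data=df).fit()
df["predicted"] = model.fittedvalues

print(model.params["t"])   # estimated year-over-year trend (riders per year)
print(model.rsquared)      # share of variance explained by the fit
```

Plotting `df["rides"]` and `df["predicted"]` against `df["date"]` with matplotlib would then give an actual-versus-predicted graph for the stop.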
We were also able to identify outlier stops that our regression couldn't account for, such as those with large student populations (highly periodic, depending on whether school is in session) or those near sporting arenas like Wrigley Field (likely highly correlated with whether it is a game day). Here's the Wrigley Field plot as an example:
The code for the project and all of the generated plots are on GitHub here, and the slides from our final presentation are embedded below.