This website generates crowd-level predictions for outdoor parks using Twitter and weather data. It is intended to be useful for people planning their recreation time, for companies that hope to make more effective use of advertising time and money, and for planners who want to deploy services such as transportation efficiently. In theory, the models developed here could be extended to any park, so long as enough people find it an interesting place to post about on social media! Scientific research has documented both the relationship between social media posts and crowds and the relationship between weather and outdoor park use, and this project combines these ideas to gain predictive insights.
Future plans for the project include natural language processing (NLP) of tweet content to model when users report crowds and how they feel about their experience at the park.
When the user chooses a park, the app retrieves the current local weather forecast for that park from the wunderground API and uses the forecast to generate crowd predictions for the next few days. These predictions come from cross-validated machine-learning models, built with scikit-learn in Python, that use tweets as a proxy for crowd levels.
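The prediction step can be sketched roughly as follows. This is a minimal illustration, not the site's actual code: the forecast record fields (`time`, `temp_f`, `precip_in`), the `forecast_to_features` helper, and the stand-in model trained on synthetic numbers are all assumptions, and the real wunderground response format is not reproduced here.

```python
from datetime import datetime

import numpy as np
from sklearn.linear_model import LinearRegression


def forecast_to_features(forecast):
    """Turn hourly forecast records (hypothetical format) into feature rows.

    Columns: temperature, precipitation, hour of day, weekday, month.
    """
    rows = []
    for record in forecast:
        when = datetime.fromisoformat(record["time"])
        rows.append([
            record["temp_f"],
            record["precip_in"],
            when.hour,
            when.weekday(),  # Monday = 0
            when.month,
        ])
    return np.array(rows, dtype=float)


# Stand-in model fitted on synthetic numbers (assumption); the real app
# would load its already-trained models instead of fitting one here.
rng = np.random.default_rng(0)
train_X = np.column_stack([
    rng.uniform(50, 90, 200),   # temperature in Fahrenheit
    rng.uniform(0, 0.5, 200),   # precipitation in inches
    rng.integers(0, 24, 200),   # hour of day
    rng.integers(0, 7, 200),    # weekday
    rng.integers(1, 13, 200),   # month
])
train_y = 2.0 * train_X[:, 0] - 100.0 * train_X[:, 1] + rng.normal(0, 5, 200)
model = LinearRegression().fit(train_X, train_y)

# Two made-up forecast hours for a single afternoon.
forecast = [
    {"time": "2015-07-04T13:00", "temp_f": 78.0, "precip_in": 0.0},
    {"time": "2015-07-04T14:00", "temp_f": 80.0, "precip_in": 0.0},
]
X_new = forecast_to_features(forecast)

# Tweet counts cannot be negative, so clip the regression output at zero.
predicted = np.clip(model.predict(X_new), 0, None)
```

Here `predicted` holds one estimated tweet count per forecast hour, which the app could then present as a relative crowd level.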
The current models combine linear regression with k-nearest neighbors and Random Forest models of the regression residuals. The models are trained on historical weather and Twitter datasets collected using urllib and a Selenium web scraper, respectively, parsed using BeautifulSoup, and stored in a MySQL database using MySQLdb. The current training dataset consists of one full year of tweets from Disneyland and California Adventure (totaling over 150,000 tweets) and hourly weather data from wunderground.com. To build the models, tweets and weather reports were binned by hour (or by day) to generate crowd-level estimates per time period (see the map below for an interactive visualization of the Twitter dataset). Several calendar indicator variables (e.g., time of day, weekday, month of the year) are also included in the models to account for daily, weekly, and seasonal periodicity in park attendance, as well as for holiday periods. Only the most recent models are stored in the website repository, and they are loaded using dill when the user requests predictions.
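As a rough sketch of the approach described above, the following bins timestamps by hour, adds calendar columns, and fits a linear regression followed by a Random Forest on its residuals. It runs on synthetic data. The names `hourly_counts`, `calendar_features`, and `ResidualStack`, the feature set, and the hyperparameters are assumptions for illustration, and the k-nearest neighbors residual model is left out because the description does not say how the two residual models are combined.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression


def hourly_counts(tweet_times, hours):
    """Bin tweet timestamps by hour; hours with no tweets count as zero."""
    ones = pd.Series(1, index=pd.DatetimeIndex(tweet_times)).sort_index()
    return ones.resample("h").sum().reindex(hours, fill_value=0)


def calendar_features(hours):
    """Calendar columns derived from an hourly DatetimeIndex."""
    return pd.DataFrame(
        {
            "hour": np.asarray(hours.hour),
            "weekday": np.asarray(hours.dayofweek),  # Monday = 0
            "month": np.asarray(hours.month),
        },
        index=hours,
    )


class ResidualStack:
    """Linear regression plus a Random Forest fitted to its residuals."""

    def __init__(self):
        self.linear = LinearRegression()
        self.forest = RandomForestRegressor(n_estimators=50, random_state=0)

    def fit(self, X, y):
        self.linear.fit(X, y)
        residuals = y - self.linear.predict(X)
        self.forest.fit(X, residuals)
        return self

    def predict(self, X):
        # Final estimate = linear trend + predicted residual correction.
        return self.linear.predict(X) + self.forest.predict(X)


# Synthetic stand-in data: 30 days of hours with a daily cycle in volume.
rng = np.random.default_rng(0)
hours = pd.date_range("2015-01-01", periods=24 * 30, freq="h")
hour_of_day = np.asarray(hours.hour)
temp_f = 65 + 10 * np.sin(2 * np.pi * (hour_of_day - 9) / 24)
temp_f = temp_f + rng.normal(0, 2, len(hours))
rate = 5 + 4 * np.sin(2 * np.pi * (hour_of_day - 6) / 24)
n_tweets = rng.poisson(rate)

# Scatter each hour's tweets at random seconds within that hour.
offsets = pd.to_timedelta(rng.integers(0, 3600, n_tweets.sum()), unit="s")
tweet_times = hours.repeat(n_tweets) + offsets

counts = hourly_counts(tweet_times, hours)
X = calendar_features(hours).assign(temp_f=temp_f)
y = counts.to_numpy(dtype=float)

# Train on the first 25 days and predict the last 5.
split = 24 * 25
model = ResidualStack().fit(X.iloc[:split], y[:split])
preds = model.predict(X.iloc[split:])
```

A `KNeighborsRegressor` from scikit-learn could be substituted for the forest inside `ResidualStack` to model the residuals with k-nearest neighbors instead.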
Click and drag the slider, or use the right arrow key, to view the hourly Twitter data for the parks in 2015.