This series features mid-course projects for our Data Science Bootcamp. Students were first tasked with posing an interesting data question and finding a dataset to address that question. Next, they spent time cleaning, wrangling, and exploring the data, before designing and building an interactive Shiny app to display their findings and allow for further exploration.
With the NHL playoffs just around the corner, hockey teams across the country continue to battle it out for a playoff spot, hoping for a chance to play a series of playoff games at their home arena. But how much of an advantage does that actually create for home teams? Does the “7th Man,” as the Nashville Predators refer to Smashville, really make much of a difference for the players on the ice? Hayden Greer set out to learn more about the statistical advantage home teams have over away teams in the NHL with his mid-course capstone project for Nashville Software School’s Data Science Cohort 6.
The Data Question
When looking at the sport of hockey, Hayden shares that fans act much differently than fans of other sports. “Most noise from the fans comes after a significant event has already occurred, like a goal. Most of the time there are also long wait times between significant events in the game which mellows out the crowd.”
Hayden started by gathering data from 11 seasons of the NHL and calculating the win percentage over that span. “The home team has a win percentage of 54%,” he states. “Though this does not really prove there is an advantage at home and it does not quantify the advantage for being the home team.” Hayden recognized that simply taking the overall winning percentage would not account for any differences in the skill level of the teams playing.
Cleaning & Analyzing The Data
To get a better estimate of home-ice advantage, Hayden needed a way to control for the strength of each team. He chose an Elo rating system, a method for calculating the relative skill levels of players in zero-sum games such as chess. It is widely regarded as one of the best rating systems available, and variants of it are used in many different rating systems today.
Once he had gathered the necessary data, Hayden cleaned and prepared it using the tidyverse libraries. He then needed to run the calculations to find each team's Elo rating over time. “The hardest and most interesting part was I had to develop an Elo rating system,” he explains. “This was done through a series of for loops over about 10 years of NHL data.”
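To give a feel for what looping over games to build Elo ratings can look like, here is a minimal sketch in Python. It is not Hayden's code (his project was written in R), and the starting rating of 1500, the update size K of 20, the 400-point scale, and the three sample games are all assumptions chosen for illustration.

```python
from collections import defaultdict

K = 20          # assumed update size; not taken from the project
START = 1500.0  # assumed starting rating for every team


def expected_score(rating_a, rating_b):
    """Standard Elo expected score for team A against team B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def run_elo(games):
    """Walk through games in order and update both teams after each one.

    `games` is a list of (home, away, home_won) tuples.
    Returns the final ratings and the pre-game ratings for every game.
    """
    ratings = defaultdict(lambda: START)
    history = []
    for home, away, home_won in games:
        r_home, r_away = ratings[home], ratings[away]
        # Record the ratings both teams carried into the game.
        history.append((home, away, r_home, r_away))
        # The winner gains exactly what the loser gives up (zero-sum).
        delta = K * ((1.0 if home_won else 0.0) - expected_score(r_home, r_away))
        ratings[home] = r_home + delta
        ratings[away] = r_away - delta
    return dict(ratings), history


# Made-up results, only to show the call.
sample_games = [("NSH", "CHI", True), ("CHI", "NSH", False), ("NSH", "STL", False)]
final_ratings, pregame_history = run_elo(sample_games)
```

The pre-game ratings kept in the history are the ones a later model would use, since they reflect what was known about each team before the puck dropped.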
With the Elo rating system in place, Hayden continued by running two models.
The first model Hayden ran was a simple linear model. “I used the difference in ratings for both teams to predict the differential of goals for a certain game,” he explains. “I was only interested in the intercept, since I wanted the only difference between the teams to be where the game is played.” The linear model indicated a slight advantage for the home team: the 95% confidence interval for the estimated goal differential is 0.24 to 0.32. “This was cool to see when teams are equal in skill the home team has a slight advantage but it is hard to interpret how much of an advantage from that number.”
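The reason the intercept is the number of interest can be seen in a small sketch. The Python below fits a one-variable least-squares line by hand; the function name and the toy numbers are made up for illustration and are not Hayden's data or code.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data (invented): home rating minus away rating, and home goals minus away goals.
rating_diff = [-120, -60, 0, 40, 90, -30, 150, 10]
goal_diff = [-1, 1, 0, 2, 1, -2, 3, 1]

slope, intercept = fit_line(rating_diff, goal_diff)

# When the rating difference is zero the slope term drops out, so the
# intercept is the predicted goal differential between evenly matched teams.
predicted_at_even_strength = intercept + slope * 0
```

Because the prediction at a rating difference of zero equals the intercept, a positive intercept reads as the home team's expected goal edge when the teams are otherwise equal.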
This led Hayden to build a logistic regression model to estimate the probability of the home team winning after controlling for differences in Elo rating.
“This model returned with a confidence interval of 95% that the estimated probability the home team will win is 58.2% to 51.2%,” he explains. “So if two equally skilled teams were to play each other the home team already has a more than 50% chance to win the game.” After looking further into it, Hayden learned that away teams needed roughly 60 more rating points than the home team to overcome the advantage of playing at home.
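One way to see how a rating gap that cancels out home ice can be read off a logistic model is sketched below in Python. The coefficient values are round-number placeholders, not fitted values from Hayden's project, and the function names are hypothetical.

```python
import math


def home_win_probability(rating_diff, intercept, slope):
    """Logistic model: P(home win) given home rating minus away rating."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * rating_diff)))


def break_even_gap(intercept, slope):
    """Rating points the away team must lead by for a 50/50 game.

    The probability is 0.5 when intercept + slope * rating_diff = 0,
    so rating_diff = -intercept / slope, meaning the away team leads
    by intercept / slope points.
    """
    return intercept / slope


# Placeholder coefficients for illustration only.
INTERCEPT = 0.2
SLOPE = 0.004

gap = break_even_gap(INTERCEPT, SLOPE)
even_teams = home_win_probability(0, INTERCEPT, SLOPE)      # above 0.5 when the intercept is positive
away_leads = home_win_probability(-gap, INTERCEPT, SLOPE)   # back to a coin flip
```

With real fitted coefficients in place of the placeholders, the same division gives the number of rating points an away team needs to offset playing on the road.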
Visualizing The Data
Hayden’s Shiny App
With his data cleaned and analyzed, Hayden used ggplot to build visuals that compare home and away stats side by side, highlighting the difference between how well teams typically play at home vs. away.
“These were mainly used to build confidence in my statistical model that returned teams have an advantage when playing at home,” Hayden shares. “The second set of visuals I have is just time series data of the elo ratings changing between seasons. This was just for fun since it took a while to create the elo rating system.”
By the end of his mid-course capstone, Hayden had a clear answer. “There is a pretty sizable advantage just for playing at home in the NHL.”
According to his analysis, home ice is worth about 2-3 places in the NHL standings. This means that if the best team in the league goes on the road to play the third-best team in the league, the away team would have only about a 50/50 chance of winning.
Hayden points out that the Elo rating system may be a better tool than the typical standings for determining which teams are more likely to win the Stanley Cup. “However, the models are far from perfect and there are many steps that can be taken to improve their estimated returns,” he concludes.
For more insights from Hayden, visit his Shiny app or his project on GitHub.
Interested in learning how to clean and present data for yourself? Check out all that our Data Science Bootcamp has to offer you in data exploration! Visit our program page to learn more and apply.