This series features mid-course projects for our Data Science Bootcamp. Students were first tasked with posing an interesting data question and finding a dataset to address that question. Next, they spent time cleaning, wrangling, and exploring the data, before designing and building an interactive Shiny app to display their findings and allow for further exploration.
An avid video gamer who has spent over 5,000 hours playing Dota 2, Tomo Umer of Data Science Cohort 6 is committed to staying young at heart. His passion for Valve and its Steam platform for PC gaming, which supports Linux, inspired him to explore Steam data for his mid-course capstone project.
Tomo was curious about how prevalent indie (independent) game developers have been on the market over the years, whether multiplayer games and games for Linux and macOS have become more common, and whether COVID-19 has affected the number of games released on the platform.
To answer these questions, Tomo first attempted to scrape data from Valve. However, he discovered that this approach would mostly return data on recent or popular games and could exclude many older games or games that had not developed as large a following. That would have risked survivorship bias, which occurs when a visible, successful subgroup is mistaken for the entire group because the unsuccessful subgroup is not visible. Tomo therefore sought an alternative data source and discovered the official Steam API. This new source had enough data to answer his questions, including information about the apps available on Steam, release year, genre, and platform.
But even with an API, gathering and cleaning the data was challenging. “The API had erratic behaviors at times which prompted me to update my code several times (with if-else statements). I also had to remove duplicates that were there and think hard on what would make sense to present and how. The trickiest part was dealing with strings of variable sizes and dates,” he explains on his personal blog.
Tomo used the httr and rvest libraries for gathering the data. “For data cleaning, I limited myself to libraries contained in the tidyverse - that way, I can count on the library being supported in the future and potentially redoing the analysis a year from now,” he recounts.
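The cleaning steps Tomo describes, removing duplicates and handling dates that arrive in varying formats, can be sketched roughly as follows. This is a minimal illustration written in Python, not the R and tidyverse code he actually used. The field names (`appid`, `name`, `release_date`), the list of date formats, and the sample records are all assumptions made up for the example, not taken from his project or from the Steam API.

```python
from datetime import datetime

# Hypothetical date formats; the formats the real API returns are not
# described in the article, so this list is an assumption.
DATE_FORMATS = ("%d %b, %Y", "%b %d, %Y", "%b %Y", "%Y")


def parse_release_year(raw):
    """Return the release year as an int, or None if no format matches."""
    text = (raw or "").strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).year
        except ValueError:
            continue
    return None


def clean_records(records):
    """Drop repeated app ids (keeping the first) and attach a release year."""
    seen = set()
    cleaned = []
    for record in records:
        app_id = record.get("appid")
        if app_id is None or app_id in seen:
            continue
        seen.add(app_id)
        year = parse_release_year(record.get("release_date"))
        cleaned.append({**record, "release_year": year})
    return cleaned


# Made-up sample records, for illustration only.
sample = [
    {"appid": 10, "name": "Example Game A", "release_date": "9 Jul, 2013"},
    {"appid": 10, "name": "Example Game A", "release_date": "9 Jul, 2013"},
    {"appid": 20, "name": "Example Game B", "release_date": "Coming soon"},
]
cleaned = clean_records(sample)
```

In this sketch, records whose dates match none of the listed formats keep a release year of `None` rather than being dropped, so they can be inspected later instead of silently disappearing from the analysis.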
Explore Tomo’s Shiny App
With clean data in hand, Tomo used bar charts and line plots to showcase the growth of games on Steam, as well as the trends over time for genre, categories, and operating systems. “Initially, I drew those in plotly, only to redraw them later in ggplot2 for efficiency,” he shares. “The only plot that I kept in plotly is the Network Graph (igraph + plotly) of relations between video game genres [since] it allowed me to showcase the graph, while not making it too crowded (names and counts are shown by hovering over the points).”
Tomo was intrigued by the “incredible proliferation of independent developers” from 2008 to 2018. Independent developers still accounted for a high percentage of games from 2020 to 2022, but the number of games they released has dropped. Tomo hypothesizes that the pandemic may have played a role in this decline.
Although Tomo prefers Linux to Windows, Windows continues to dominate the Steam market over Linux and macOS.
Tomo shares on his blog that the help of his instructor, Michael Holloway, and his experience as a lifelong gamer gave him the perseverance to see this project through to completion. “The same amount of determination and willpower required to beat Malenia, Blade of Miquella in Elden Ring, is the one I harnessed in tackling this project,” he recalls with pride. “When people get tired and need a break from the computer screen every couple of hours, I’m just getting warmed up. Countless times I’ve queued Dota 2 into the night with friends making poor choices with my sleep and yet persisting through another match, gaining considerable mental fortitude in the process.”
For more insights from Tomo, visit his Shiny app or his project on GitHub.