This series features mid-course projects for our Data Science Bootcamp. Students were tasked with asking an interesting data question and finding a dataset to answer the question. Next, they spent time cleaning, wrangling, and exploring the data, before designing and building an interactive Shiny app.
Stack Overflow is a question and answer forum that connects programmers from around the globe to help each other troubleshoot. Users can ask questions, share solutions, and search topics based on tags.
The Data Question
As a frequent user of both the English version of Stack Overflow and its Russian counterpart, Sergey Motorny was curious whether any trends would emerge when comparing tags between the two sites.
He also defined a null and an alternative hypothesis to evaluate:
Null Hypothesis: Russian and English tags on Stack Overflow belong to the same population group (Russian Stack Overflow is simply a smaller subset of the same tag space).
Alternative Hypothesis: Russian and English tags of Stack Overflow represent two different population groups (with distinct tags and popularity).
Cleaning The Data
Sergey’s data source, StackExchange Data Explorer, keeps its data clean and ready to use. With a well-defined query, little data cleaning was left to do.
Visualizing The Data
See Sergey’s Shiny app, Visual Tag Analysis of Russian and English Stack Overflow Communities.
He started his visualizations with ggplot2, but switched to plotly's ggplotly function to add interactivity to the graphs. “I find that having features, such as a clickable legend, positively impacts user experience,” shared Sergey.
Additional statistical analysis is needed to determine whether his null hypothesis can be rejected.
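As one possible starting point for that analysis, the sketch below computes a chi-square test of homogeneity on tag counts from the two sites. It is not Sergey's method. The function name `chi_square_homogeneity` and the tag counts are hypothetical, invented purely for illustration, and the test itself assumes independent observations, which may not hold because a single question can carry several tags.

```python
# Hypothetical tag counts, invented for illustration only.
# These are NOT real Stack Overflow figures.
english_counts = {"javascript": 500, "python": 400, "java": 300}
russian_counts = {"javascript": 60, "python": 30, "java": 30}


def chi_square_homogeneity(group_a, group_b):
    """Return (statistic, degrees_of_freedom) for a 2 x k chi-square
    test of homogeneity over the tags present in either group."""
    # Use the union of tags, skipping any tag with no observations at all.
    tags = sorted(
        t for t in set(group_a) | set(group_b)
        if group_a.get(t, 0) + group_b.get(t, 0) > 0
    )
    rows = [[group.get(t, 0) for t in tags] for group in (group_a, group_b)]
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            # Expected count if both sites shared one tag distribution.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    degrees_of_freedom = (len(rows) - 1) * (len(tags) - 1)
    return statistic, degrees_of_freedom


statistic, dof = chi_square_homogeneity(english_counts, russian_counts)
print(f"chi-square = {statistic:.3f}, degrees of freedom = {dof}")
```

The statistic would then be compared against a chi-square distribution with the reported degrees of freedom, for example with `scipy.stats.chi2.sf(statistic, dof)` if SciPy is available. A small p-value would count as evidence against the null hypothesis that both sites draw from the same tag distribution.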