Two of our Data Science bootcamp students, Evan Lancaster and Mahesh Rao, and our Data Science instructor, Mary van Valkenburg, were invited to present at the inaugural Metro Data Day last Friday. Metro Data Day is a way for Metro Nashville to gather data champions from across all agencies of the city government to learn.
Keith Durbin, Metro Nashville’s CIO, opened the meeting by describing Nashville’s Smart City projects and how data will help the city make better decisions to serve the Nashville community.
Robyn Mace, Metro Nashville’s CDO (Chief Data Officer), then kicked off the inaugural Metro Data Day. Last fall, Robyn spoke with our Data Science cohort about Metro open data and provided a data question. She wanted to reduce more than 750 property violation categories to fewer than 20. She shared strategies with the students for consolidating the categories and applying the new categories to the open dataset of property violations.
After an introduction from Mary about NSS and the Data Science bootcamp, Evan and Mahesh shared the insights they gained from their work with the Metro Nashville property violations data project.
With categories in place, Evan now faced the challenge of categorizing each violation. He noticed that the violation data included descriptions that, most of the time, used words from the code that had been violated. Since he already had categories for the codes, he turned to Natural Language Processing (NLP) to identify and categorize codes. He trained an NLP model on two-thirds of the property codes dataset and tested it on the remaining third.
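To make the train-and-test idea concrete, here is a minimal sketch in Python. It is not Evan's actual code: the post does not say which NLP technique or library he used, so this assumes a simple bag-of-words Naive Bayes classifier written with the standard library only. The descriptions, category names, and function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical (description, category) pairs, invented for illustration.
SAMPLES = [
    ("high grass and weeds in front yard", "lawn"),
    ("grass over twelve inches tall", "lawn"),
    ("overgrown weeds along fence", "lawn"),
    ("tall grass in rear yard", "lawn"),
    ("weeds and high grass near sidewalk", "lawn"),
    ("overgrown grass on vacant lot", "lawn"),
    ("inoperable vehicle parked in driveway", "vehicle"),
    ("junk car with no tags", "vehicle"),
    ("abandoned vehicle on street", "vehicle"),
    ("vehicle on blocks in driveway", "vehicle"),
    ("unregistered car parked in yard", "vehicle"),
    ("inoperable car with flat tires", "vehicle"),
    ("trash and debris piled in yard", "trash"),
    ("debris left at curb", "trash"),
    ("trash piled behind house", "trash"),
    ("household trash scattered on porch", "trash"),
    ("construction debris in alley", "trash"),
    ("piles of trash and debris", "trash"),
]

def tokenize(text):
    """Split a description into lowercase words."""
    return text.lower().split()

def train(pairs):
    """Count words per category (the 'training' step of Naive Bayes)."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, label in pairs:
        doc_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return word_counts, doc_counts, vocab

def predict(model, text):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                # Add-one smoothing avoids zero probabilities.
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hold out every third sample for testing; train on the other two-thirds.
test_set = SAMPLES[2::3]
train_set = [s for i, s in enumerate(SAMPLES) if i % 3 != 2]

model = train(train_set)
correct = sum(predict(model, text) == label for text, label in test_set)
accuracy = correct / len(test_set)
print(f"Trained on {len(train_set)}, tested on {len(test_set)}, accuracy {accuracy:.0%}")
```

A toy dataset this clean is far easier to classify than real violation descriptions, so the accuracy it reports says nothing about the real project. The point is only the workflow: train on one portion of labeled data, then measure on data the model has not seen.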
The NLP model was more accurate for some categories than for others. After reviewing the data, Evan realized that to make it more accurate, he would have to train it on the edge cases. With accuracy already at 75-80%, he decided that improving it for edge cases was not the best use of time. He explained that how accurate you need to be depends on the type of data you’re using. For example, if you were using NLP for cancer research, you would want your accuracy levels to be very high.
The model was then applied to the open dataset of property violations over a two-year period.
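One way to see that accuracy differs by category is to break the overall score down per label. The sketch below is an illustration only: the (actual, predicted) pairs are hypothetical, not results from the project, and `per_category_accuracy` is an assumed helper name.

```python
from collections import Counter

# Hypothetical (actual, predicted) category pairs from a held-out test set.
RESULTS = [
    ("lawn", "lawn"), ("lawn", "lawn"), ("lawn", "lawn"), ("lawn", "lawn"),
    ("vehicle", "vehicle"), ("vehicle", "vehicle"),
    ("vehicle", "vehicle"), ("vehicle", "trash"),
    ("other", "lawn"), ("other", "trash"),
    ("other", "other"), ("other", "vehicle"),
]

def per_category_accuracy(pairs):
    """Return the fraction of correct predictions for each actual category."""
    totals = Counter()
    correct = Counter()
    for actual, predicted in pairs:
        totals[actual] += 1
        if actual == predicted:
            correct[actual] += 1
    return {label: correct[label] / totals[label] for label in totals}

by_category = per_category_accuracy(RESULTS)
overall = sum(a == p for a, p in RESULTS) / len(RESULTS)

for label, score in sorted(by_category.items()):
    print(f"{label:8s} {score:.0%}")
print(f"overall  {overall:.0%}")
```

In this made-up example, a catch-all category drags the overall figure down while the well-defined categories score well, which is the kind of pattern that points to edge cases needing more training data.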
He also showed a strip plot marking each violation for a handful of categories over a two-year window. That data showed that lawn care violations are seasonal, with few incidents occurring during the winter months. It also showed a rise in violations from short-term rentals as they have grown in popularity. While these align with what one might expect, it’s important to have data to back up our expectations.
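The seasonal pattern a strip plot reveals visually can also be checked with a simple monthly tally. The following sketch assumes each record is a (date, category) pair. The dates and category names are hypothetical and do not come from the Metro dataset.

```python
from collections import Counter
from datetime import date

# Hypothetical (date, category) violation records over a two-year window.
VIOLATIONS = [
    (date(2017, 1, 15), "lawn"),
    (date(2017, 5, 3), "lawn"),
    (date(2017, 6, 10), "lawn"),
    (date(2017, 6, 21), "lawn"),
    (date(2017, 7, 8), "lawn"),
    (date(2017, 7, 30), "lawn"),
    (date(2017, 8, 12), "lawn"),
    (date(2018, 5, 19), "lawn"),
    (date(2018, 6, 2), "lawn"),
    (date(2018, 7, 14), "lawn"),
    (date(2017, 3, 4), "short-term rental"),
    (date(2017, 11, 20), "short-term rental"),
    (date(2018, 2, 9), "short-term rental"),
    (date(2018, 4, 27), "short-term rental"),
    (date(2018, 6, 16), "short-term rental"),
]

def monthly_counts(records, category):
    """Return a 12-item list of violation counts, January through December."""
    counts = Counter(day.month for day, cat in records if cat == category)
    return [counts.get(month, 0) for month in range(1, 13)]

lawn = monthly_counts(VIOLATIONS, "lawn")
winter = lawn[11] + lawn[0] + lawn[1]   # December, January, February
summer = lawn[5] + lawn[6] + lawn[7]    # June, July, August
print(f"lawn violations: winter={winter}, summer={summer}")
```

A strip plot shows the same information one point per violation, which also makes gaps and clusters within a month visible.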
Our Data Science bootcamp has been fortunate to work with real-world data provided by organizations across Nashville. This data enriches our students’ learning experience by letting them work with the kind of realistic, messy datasets they will encounter in their jobs. But using real-world data is not just a benefit for our students. It can benefit the organization that provided the data as well.
The agency representatives in attendance were excited about the results and how this data can be used in their jobs. They provided some examples of how this information could help them.
While the scope of the project within the class did not involve a working application with a user interface, our hope is that we can continue to work with Metro as they advocate for the use of data across the city and their agencies.