Wrapping Up: A Geospatial Dive into the U.S. Broadband Divide
For my MSDS692 Data Science Practicum project, I decided to tackle a topic close to my professional background: the U.S. Broadband Divide. My goal was to move beyond simple maps and explore whether socioeconomic factors like income and race contribute to disparities in internet access. I wanted to visualize these patterns and ultimately see if we could predict underserved areas using data science.
Using large datasets from the FCC's Broadband Data Collection (provider-reported speeds) and the US Census/ACS (income, population demographics, and geographic boundaries), I analyzed broadband availability nationwide at the Census Block Group level.
The core data science workflow involved several key steps. First came the significant task of downloading, cleaning, and aggregating hundreds of files to create unified national datasets for speed, income, and minority population percentage at the block group level. This part was time-consuming but crucial for building a reliable foundation.
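To make the aggregation step concrete, here is a minimal sketch of how per-file block records could be rolled up to block groups with pandas. It is not the project's actual pipeline: the column names (`block_geoid`, `max_download_mbps`), the two tiny in-memory frames standing in for downloaded files, and the choice to keep the maximum speed per block group are all assumptions made for illustration. The one fixed fact it relies on is that a 15-digit census block GEOID begins with its 12-digit block group GEOID.

```python
import pandas as pd

# Hypothetical stand-ins for two downloaded files; real column names may differ.
state_a = pd.DataFrame({
    "block_geoid": ["080010078011000", "080010078011001", "080010078012000"],
    "max_download_mbps": [100, 940, 25],
})
state_b = pd.DataFrame({
    "block_geoid": ["560019627001000", "560019627001001"],
    "max_download_mbps": [10, 50],
})

def to_block_groups(frames):
    """Stack per-file frames and keep the best available speed per block group."""
    blocks = pd.concat(frames, ignore_index=True)
    # Restore leading zeros (lost if IDs were read as integers), then take the
    # first 12 digits of the 15-digit block GEOID: the block group GEOID.
    blocks["bg_geoid"] = blocks["block_geoid"].astype(str).str.zfill(15).str[:12]
    return (
        blocks.groupby("bg_geoid", as_index=False)["max_download_mbps"]
        .max()  # assumption: aggregate by maximum speed in the block group
        .rename(columns={"max_download_mbps": "bg_max_download_mbps"})
    )

national = to_block_groups([state_a, state_b])
print(national)
```

With real files, the list passed to `to_block_groups` would come from reading each file with `pd.read_csv(..., dtype={"block_geoid": str})` so the GEOIDs keep their leading zeros.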
Feature engineering was a particularly interesting stage. I created a Social Vulnerability Index (SVI) by combining normalized income (inverted, so low income = high vulnerability) and minority percentage data. Recognizing that broadband access is often hyperlocal, I also engineered features representing the average speed and average SVI of neighboring block groups.
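A small sketch of what these two feature families might look like in pandas follows. The equal weighting of the two SVI components, the min-max normalization, the column names, and the hand-written `neighbors` adjacency dictionary are assumptions for illustration; in practice the adjacency would be derived from the block group boundary geometries.

```python
import pandas as pd

# Toy block groups with made-up values, indexed by a placeholder ID.
bg = pd.DataFrame({
    "bg_geoid": ["A", "B", "C", "D"],
    "median_income": [30000, 60000, 90000, 120000],
    "minority_pct": [80.0, 40.0, 20.0, 0.0],
    "max_download_mbps": [25, 100, 940, 940],
}).set_index("bg_geoid")

def min_max(s):
    """Rescale a numeric Series to the 0..1 range."""
    return (s - s.min()) / (s.max() - s.min())

# Invert income so that low income maps to high vulnerability.
income_vulnerability = 1.0 - min_max(bg["median_income"])
minority_norm = min_max(bg["minority_pct"])
# Assumption: the two components are weighted equally.
bg["svi"] = (income_vulnerability + minority_norm) / 2

# Hypothetical adjacency: which block groups border which.
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

def neighbor_mean(values, adjacency):
    """Average a Series over each block group's neighbors."""
    return pd.Series({k: values.loc[v].mean() for k, v in adjacency.items()})

bg["neighbor_avg_speed"] = neighbor_mean(bg["max_download_mbps"], neighbors)
bg["neighbor_avg_svi"] = neighbor_mean(bg["svi"], neighbors)
print(bg)
```

Note that the neighbor averages deliberately exclude the block group's own value, so they describe its surroundings rather than restating its own speed or SVI.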
Next, I applied unsupervised learning (K-Means clustering) to identify distinct “broadband archetypes,” and supervised learning (Random Forest) to predict which block groups were likely underserved, defined as having speeds below 100 Mbps. I also visualized the findings using QGIS.
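The two modeling steps can be sketched with scikit-learn as below. This runs on synthetic placeholder data, not the project's datasets, and several choices are assumptions: four clusters, clustering on standardized speed and SVI, and a feature set for the classifier that leaves out the block group's own speed (since the label is derived from it). Only the 100 Mbps threshold comes from the project itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400

# Synthetic placeholder features for n block groups.
speed = rng.choice([10.0, 50.0, 300.0, 940.0], size=n)
svi = rng.uniform(0, 1, size=n)
neighbor_speed = np.clip(speed + rng.normal(0, 30, size=n), 0, None)
neighbor_svi = np.clip(svi + rng.normal(0, 0.1, size=n), 0, 1)

# Unsupervised step: cluster on standardized speed and SVI.
# n_clusters=4 is an assumption, not the project's setting.
X_cluster = StandardScaler().fit_transform(np.column_stack([speed, svi]))
archetype = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_cluster)

# Supervised step: label block groups below 100 Mbps as underserved.
underserved = (speed < 100).astype(int)
X = np.column_stack([svi, neighbor_speed, neighbor_svi])
X_train, X_test, y_train, y_test = train_test_split(
    X, underserved, test_size=0.25, stratify=underserved, random_state=0
)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)
print("clusters found:", sorted(set(archetype.tolist())))
print("held-out accuracy:", round(test_accuracy, 3))
```

Standardizing before K-Means matters here because speed (tens to hundreds of Mbps) and SVI (0 to 1) are on very different scales, and K-Means relies on Euclidean distance.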
The analysis yielded some fascinating insights. Clustering revealed distinct patterns. One was the “Infrastructure Gap”: areas with moderate income and SVI but very low speeds, representing true digital deserts. Another was the “Adoption Gap,” where low-income/high-SVI areas had high speeds available, suggesting affordability or adoption barriers.
Exploratory Data Analysis initially showed only a weak correlation between SVI and max download speed nationally, highlighting that the relationship isn’t uniform across the country. However, the supervised model performed well, reaching an accuracy of 0.99 and a precision of 0.94 for the underserved class.
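For readers less familiar with these two metrics, the sketch below shows how accuracy and per-class precision are computed from confusion-matrix counts. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the project's results.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    """Share of predicted-underserved block groups that truly were underserved."""
    return tp / (tp + fp)

# Hypothetical counts for the underserved class (illustration only).
tp, fp, fn, tn = 40, 10, 10, 940

print("accuracy :", accuracy(tp, fp, fn, tn))
print("precision:", precision(tp, fp))
```

Accuracy counts every block group, while precision looks only at the ones the model flagged as underserved, which is why the two numbers are reported separately.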
Interestingly, the most important feature for predicting low speeds wasn’t income or SVI directly, but the average speed of neighboring block groups, followed by their average SVI. This strongly supports the idea that broadband access (or lack thereof) is hyperlocal and clustered geographically. I was surprised to see neighbors’ speeds being more predictive than income itself, which is a good reminder that context matters.
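A ranking like this can be read off a fitted scikit-learn Random Forest through its `feature_importances_` attribute. The sketch below shows the mechanics on synthetic data in which the label is driven by neighbors' speed by construction, so the resulting order demonstrates the method and does not reproduce the project's finding. The feature names are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 500

# Synthetic placeholder features (names are illustrative).
neighbor_avg_speed = rng.uniform(0, 1000, size=n)
neighbor_avg_svi = rng.uniform(0, 1, size=n)
svi = rng.uniform(0, 1, size=n)

# By construction, the label depends on neighbors' speed plus noise.
underserved = (neighbor_avg_speed + rng.normal(0, 50, size=n) < 100).astype(int)

names = ["neighbor_avg_speed", "neighbor_avg_svi", "svi"]
X = np.column_stack([neighbor_avg_speed, neighbor_avg_svi, svi])

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, underserved)

# Impurity-based importances sum to 1; sort from most to least important.
ranked = sorted(zip(names, clf.feature_importances_), key=lambda p: p[1], reverse=True)
for name, score in ranked:
    print(f"{name:20s} {score:.3f}")
```

These impurity-based importances are computed on the training data; scikit-learn's `permutation_importance` on a held-out set is a common cross-check when features are correlated.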
Ultimately, this project demonstrated that the U.S. broadband divide isn’t a simple yes/no issue. It comprises distinct challenges, primarily relating to infrastructure gaps in some areas and potential adoption or affordability gaps in others. By combining geospatial data, socioeconomic indicators, and machine learning, we can effectively predict underserved areas with high accuracy.
This predictive capability offers a powerful, data-driven tool that could help inform targeted policy decisions and strategic investments aimed at finally closing the digital divide. It was a challenging but rewarding project, blending data engineering, analysis, machine learning, and visualization to shed light on a national issue.