Posts

Showing posts from October, 2025

Wrapping Up: A Geospatial Dive into the U.S. Broadband Divide

For my MSDS692 Data Science Practicum project, I decided to tackle a topic close to my professional background: the U.S. Broadband Divide. My goal was to move beyond simple maps and explore whether socioeconomic factors like income and race contribute to disparities in internet access. I wanted to visualize these patterns and ultimately see if we could predict underserved areas using data science.

Leveraging large datasets from the FCC's Broadband Data Collection (providing provider speeds) and the US Census/ACS (offering income, population demographics, and geographic boundaries), I analyzed broadband availability at a granular Census Block Group level across the nation.

The core data science workflow involved several key steps. First came the significant task of downloading, cleaning, and aggregating hundreds of files to create unified national datasets for speed, income, and minority population percentage at ...

Data Trimming vs Winsorization

During the data preprocessing stage, I came across several census block groups reporting broadband speeds exceeding 1500 Mbps, some even as high as 7500 Mbps. These extreme values had the potential to skew my statistical models, especially when analyzing national-level patterns. At this point, I had two choices. One option was data trimming, which involves removing extreme outliers entirely from the dataset. This method removed the very high-speed block groups that skewed my analysis. However, trimming also means losing some geographic coverage. My total block group count dropped from about 240,000 to about 55,000, something I wanted to avoid in a spatial study, since it left black spaces in the QGIS maps. The other option was Winsorization, which keeps all records but caps any extreme values at a predefined threshold, in this case 1500 Mbps. Winsorization reduces the influence of outliers while preserving the overall structure and completeness of the data. ...
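The difference between the two options can be sketched in a few lines of plain Python. This is only an illustration, not the project's actual code: the speed values and the helper names `trim` and `winsorize` are made up, and the cap is applied as a fixed one-sided threshold (as described above) rather than the percentile-based limits that Winsorization often uses.

```python
# Hypothetical max download speeds (Mbps) for six block groups.
speeds = [25, 100, 940, 1500, 2000, 7500]

CAP_MBPS = 1500  # threshold mentioned in the post


def trim(values, cap):
    """Trimming: drop every record above the cap, so rows are lost."""
    return [v for v in values if v <= cap]


def winsorize(values, cap):
    """One-sided Winsorization: keep every record, clip values above the cap."""
    return [min(v, cap) for v in values]


trimmed = trim(speeds, CAP_MBPS)
capped = winsorize(speeds, CAP_MBPS)

print(len(speeds), len(trimmed), len(capped))  # 6 4 6
print(capped)  # [25, 100, 940, 1500, 1500, 1500]
```

Trimming shrinks the list, which in a spatial study means missing polygons on the map, while the capped list keeps one value per block group.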

Adding Minority % to Social Vulnerability Index Calculation

During Week 6 of my U.S. broadband divide analysis project, I encountered an issue. The initial correlation matrix for the whole of the USA showed virtually no correlation (-0.00) between the Social Vulnerability Index (SVI) and the maximum download speed. This was puzzling, especially since the earlier Alabama-level analysis suggested a negative correlation. This led me to realize that what worked at a state level didn't necessarily hold true at a national, aggregate level. To address this, I hypothesized that adding another socioeconomic metric might help uncover the hidden associations. I felt that the percentage of the minority (non-white) population could be a significant factor affecting social vulnerability (SVI) and, consequently, broadband access. To test this, I downloaded the necessary U.S. Census Bureau data (Total Population and White-alone population at the block group level) and used these metrics to calculate the minority percentage for each block group. The n...
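As a rough sketch of the minority percentage calculation, assuming it is simply the non-white share of the total population: the function name `minority_pct` and the sample counts below are hypothetical, and returning `None` for unpopulated block groups is an assumption about how zero-population rows might be handled.

```python
def minority_pct(total_pop, white_alone):
    """Percentage of the population that is not White-alone.

    Returns None when the block group has no residents, since the
    share is undefined there (an assumption, not the post's stated rule).
    """
    if total_pop <= 0:
        return None
    return (total_pop - white_alone) / total_pop * 100


# Hypothetical block groups: (total population, White-alone population)
block_groups = {
    "A": (1200, 900),
    "B": (500, 500),
    "C": (0, 0),
}

pct_by_group = {
    name: minority_pct(total, white)
    for name, (total, white) in block_groups.items()
}

print(pct_by_group)  # {'A': 25.0, 'B': 0.0, 'C': None}
```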