Posts

Wrapping Up: A Geospatial Dive into the U.S. Broadband Divide

For my MSDS692 Data Science Practicum project, I decided to tackle a topic close to my professional background: the U.S. Broadband Divide. My goal was to move beyond simple maps and explore whether socioeconomic factors like income and race contribute to disparities in internet access. I wanted to visualize these patterns and ultimately see if we could predict underserved areas using data science.

Leveraging large datasets from the FCC's Broadband Data Collection (providing provider speeds) and the US Census/ACS (offering income, population demographics, and geographic boundaries), I analyzed broadband availability at a granular Census Block Group level across the nation.

The core data science workflow involved several key steps. First came the significant task of downloading, cleaning, and aggregating hundreds of files to create unified national datasets for speed, income, and minority population percentage at ...

Data Trimming vs Winsorization

During the data preprocessing stage, I came across several census block groups reporting broadband speeds exceeding 1500 Mbps, some even as high as 7500 Mbps. These extreme values had the potential to skew my statistical models, especially when analyzing national-level patterns. At this point, I had two choices.

One option was data trimming, which involves removing extreme outliers entirely from the dataset. This method removed the block groups with very high speeds that skewed my analysis. However, trimming also means losing some geographic coverage: my total block groups went from about 240,000 to about 55,000. That was something I wanted to avoid in a spatial study, since it was causing black spaces in the QGIS maps.

The other option was Winsorization, which keeps all records but caps any extreme values at a predefined threshold, in this case 1500 Mbps. Winsorization reduces the influence of outliers while preserving the overall structure and completeness of the data. ...
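To make the difference concrete, here is a minimal sketch in plain Python. The sample speeds are made up for illustration, and the helper names `trim` and `winsorize_upper` are my own, not from the project code. It caps at the fixed 1500 Mbps threshold mentioned above rather than at a percentile, which is how Winsorization is often defined.

```python
CAP_MBPS = 1500  # fixed threshold from the post (not a percentile-based cap)


def trim(speeds, cap):
    """Data trimming: drop every record whose speed exceeds the cap."""
    return [s for s in speeds if s <= cap]


def winsorize_upper(speeds, cap):
    """Upper-tail Winsorization: keep every record, clamp speeds to the cap."""
    return [min(s, cap) for s in speeds]


# Made-up max download speeds (Mbps) for five block groups.
sample_speeds = [100, 940, 1500, 2000, 7500]

trimmed = trim(sample_speeds, CAP_MBPS)
winsorized = winsorize_upper(sample_speeds, CAP_MBPS)

print(len(trimmed), trimmed)        # fewer records remain after trimming
print(len(winsorized), winsorized)  # same record count, extremes capped
```

Trimming shrinks the list, which is what produces holes on a map, while Winsorization keeps one value per block group. With pandas, the same capping can be done with `Series.clip(upper=1500)`.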

Adding Minority % to Social Vulnerability Index Calculation

During Week 6 of my US broadband divide analysis project, I encountered an issue. The initial correlation matrix for the whole of the USA showed virtually no correlation (-0.00) between the Social Vulnerability Index (SVI) and the maximum download speed. This was puzzling, especially since the earlier Alabama-level analysis suggested a negative correlation. It made me realize that what worked at a state level didn't necessarily hold true at a national, aggregate level.

To address this, I hypothesized that adding another socioeconomic metric might help uncover the hidden associations. I felt that the percentage of the minority (non-white) population could be a significant factor affecting social vulnerability (SVI) and, consequently, broadband access. To test this, I downloaded the necessary U.S. Census Bureau data (Total Population and White-alone population at the block group level) and used these metrics to calculate the minority percentage for each block group. The n...
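The minority percentage described above can be sketched as follows. This is an illustration only: the GEOIDs and counts are invented, the function name `minority_pct` is hypothetical, and returning `None` for zero-population block groups is my assumption, not something the post specifies.

```python
def minority_pct(total_pop, white_alone_pop):
    """Percent of the population that is not White alone.

    Returns None for block groups with no population, to avoid dividing by zero.
    """
    if total_pop <= 0:
        return None
    return 100.0 * (total_pop - white_alone_pop) / total_pop


# Invented block-group rows: total population and White-alone population.
rows = [
    {"geoid": "010010201001", "total": 1000, "white_alone": 750},
    {"geoid": "010010201002", "total": 0, "white_alone": 0},
]

for row in rows:
    row["minority_pct"] = minority_pct(row["total"], row["white_alone"])
    print(row["geoid"], row["minority_pct"])
```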

Feature Engineering and Exploratory Data Analysis

This week, I did Feature Engineering and Exploratory Data Analysis (EDA), using the Alabama data as a prototype. I created the custom features that define the originality of the project, and then ran statistical and visual analyses to validate my core hypotheses.

In the Feature Engineering phase, I used a Python program to merge the block-group-level broadband speed data (summarized to show max speed) with the median income data and the U.S. Census shapefiles, creating a unified GeoDataFrame. From this GeoDataFrame, I derived three critical features:

The Socioeconomic Vulnerability Index (SVI): Calculated by inverting the normalized median income, this score quantifies digital inequity risk, with values closer to 1 indicating high vulnerability (low income).

Neighbor Average Speed: This spatial feature captures the average broadband speed of all adjacent block groups, serving as a powerful proxy for regional infrastructure investment. ...
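A rough sketch of the first two features, in plain Python rather than on a GeoDataFrame. The post does not say which normalization was used, so min-max scaling is assumed here. The adjacency mapping is supplied by hand as a stand-in for whatever spatial neighbor lookup the project used. All names and values are illustrative.

```python
def svi_from_income(incomes):
    """Invert min-max normalized income: lowest income -> 1.0, highest -> 0.0."""
    lo, hi = min(incomes), max(incomes)
    if hi == lo:
        return [0.0] * len(incomes)  # no spread, so no relative vulnerability
    return [1.0 - (x - lo) / (hi - lo) for x in incomes]


def neighbor_average_speed(speeds, adjacency):
    """Mean speed of each block group's adjacent block groups (None if isolated)."""
    result = {}
    for geoid, neighbors in adjacency.items():
        values = [speeds[n] for n in neighbors if n in speeds]
        result[geoid] = sum(values) / len(values) if values else None
    return result


# Made-up median incomes for three block groups.
incomes = [30000, 60000, 90000]
svi = svi_from_income(incomes)

# Made-up max speeds (Mbps) and a hand-written adjacency map (assumption).
speeds = {"A": 100, "B": 300, "C": 1000}
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
neighbor_avg = neighbor_average_speed(speeds, adjacency)

print(svi)
print(neighbor_avg)
```

Note that the average for each block group excludes its own speed, matching the description of "adjacent block groups" above.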

Spatial Joins in QGIS

This week, I visualized broadband availability across Alabama at the census block group level using FCC broadband data. Here's how I did it.

Step 1: Clean and Prepare the Data

I started with two FCC datasets (cable and fiber broadband availability) and a Census dataset for median income.

Merge FCC datasets: Using Python and pandas, I combined the cable and fiber datasets, extracted the first 12 digits of the block_geoid to get the block group level, and calculated the maximum advertised download speed for each block group.

Clean Census data: Using Python, I dropped rows from the Census median income file that had missing values ( -666666666 ) and constructed a 12-digit census block group GEOID by combining the state, county, tract, and block group codes.

Output: Two clean CSVs, one for max download speed per block group and one for median income per block group.

Step 2: Load Data into QGIS

Loaded the Alabama census block group shapefile ( tl_2024_01_bg.shp ...
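The Step 1 cleaning can be sketched without any dependencies as shown here. The post used pandas, so this is a simplified stand-in rather than the original script. The records are invented, and the helper names are hypothetical.

```python
MISSING_INCOME = -666666666  # sentinel the Census file uses for missing values


def max_speed_by_block_group(records):
    """Collapse (15-digit block GEOID, speed) pairs to max speed per block group."""
    best = {}
    for block_geoid, speed in records:
        bg_geoid = block_geoid[:12]  # first 12 digits identify the block group
        best[bg_geoid] = max(best.get(bg_geoid, 0), speed)
    return best


def build_bg_geoid(state, county, tract, block_group):
    """Assemble the 12-digit GEOID: state(2) + county(3) + tract(6) + block group(1)."""
    return (str(state).zfill(2) + str(county).zfill(3)
            + str(tract).zfill(6) + str(block_group))


def drop_missing_income(rows):
    """Keep only (geoid, income) rows whose income is not the missing sentinel."""
    return [(g, inc) for g, inc in rows if inc != MISSING_INCOME]


# Invented cable + fiber records, already combined into one list.
fcc_records = [
    ("010010201001000", 300),
    ("010010201001001", 1000),
    ("010010201002000", 100),
]
speed_by_bg = max_speed_by_block_group(fcc_records)

income_rows = [
    (build_bg_geoid("1", "1", "20100", "1"), 52000),
    (build_bg_geoid("1", "1", "20100", "2"), MISSING_INCOME),
]
clean_income = drop_missing_income(income_rows)

print(speed_by_bg)
print(clean_income)
```

Keeping the GEOID as a zero-padded string matters, since Alabama's state code is 01 and a numeric column would lose the leading zero.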

Downloading data from census.gov's ACS

The census.gov website hosts ACS (American Community Survey) data, which holds a wealth of information about demographics, median household income, and more. It provides an API for downloading the data. In this tutorial, I will explain how to download the data using API calls, and how to create a Python script that automates downloads across multiple API calls.

Step 1: Get an API key. The API key is needed to make the API calls. Sign up and get the key at this URL: https://api.census.gov/data/key_signup.html

Step 2: Assemble the API call URL. Depending on the information needed, we have to assemble the URL for the API call. For example: https://api.census.gov/data/2023/acs/acs5?get=NAME,B19013_001E&for=block%20group:*&in=state:36%20county:119&key=xxxxxxxxxxxxxxxxx

In the above API call URL, B19013_001E is the data point for median household income. in=state:36%20county:119 represents New York State and Westchester County. &...
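As a small sketch of Step 2, the URL above can be assembled programmatically with the standard library. The function name `build_acs_url` and the `YOUR_KEY` placeholder are my own, and the second county code in the loop is only an illustrative value. No network request is made here.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.census.gov/data/2023/acs/acs5"


def build_acs_url(variable, state, county, api_key):
    """Build a block-group-level ACS query URL for one state/county pair."""
    params = {
        "get": f"NAME,{variable}",
        "for": "block group:*",
        "in": f"state:{state} county:{county}",
        "key": api_key,
    }
    # quote() turns spaces into %20; keep ':', '*' and ',' readable as in the post.
    return BASE_URL + "?" + urlencode(params, quote_via=quote, safe=":*,")


# B19013_001E = median household income; state 36 / county 119 as in the post.
url = build_acs_url("B19013_001E", "36", "119", "YOUR_KEY")
print(url)

# One URL per county is the basis for automating multiple API calls.
# "005" is just an illustrative second county code.
urls = [build_acs_url("B19013_001E", "36", c, "YOUR_KEY") for c in ["119", "005"]]
print(len(urls))
```

Each URL could then be fetched with any HTTP client and the responses saved to a file per county.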

MSDS Practicum Project - US Broadband divide

The term "US broadband divide" refers to the fact that broadband internet is not available to all Americans. According to the FCC, 100 Mbps is the threshold speed for a connection to be called broadband. The site Ookla.net collects speed tests and shows that 60% of the tests exceed 100 Mbps download speeds. The data from the FCC BDC as of December 2024 demonstrates a steady increase in locations with access to high-speed internet. According to the FCC, an additional one million locations gained access to 100/20 Mbps speeds between June and December 2024. Additionally, nearly 7 million additional locations gained access to even faster gigabit speeds (1 Gbps download and 100 Mbps upload) in the same time frame. While the overall trend is positive, the persistent percentage of unserved and underserved locations represents a significant population that is still being left behind. The good news is that over the years, with expanding cable infrastructure and advancement in other technologies - Terrestrial fixed wireless, Cell...