Feature Engineering and Exploratory Data Analysis

 



This week, I did Feature Engineering and Exploratory Data Analysis (EDA), using the Alabama data as a prototype. I created the custom features that define the originality of the project, and then ran statistical and visual analyses to test my core hypotheses.

In the Feature Engineering phase, I used a Python program to merge the block-group-level broadband speed data (summarized to the maximum speed per block group) with the median income data and the U.S. Census shapefiles, producing a unified GeoDataFrame. From this GeoDataFrame, I derived three critical features:

  1. The Socioeconomic Vulnerability Index (SVI): Calculated by inverting the normalized median income, this score quantifies digital inequity risk, with values closer to 1 indicating high vulnerability (low income).

  2. Neighbor Average Speed: This spatial feature captures the average broadband speed of all adjacent block groups, serving as a proxy for regional infrastructure investment.

  3. Neighbor Average SVI: This feature indicates whether a block group sits within a larger cluster of vulnerable areas, reflecting the idea that the digital divide is a community-wide issue rather than an isolated one.
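Here is a minimal sketch of how the three features above could be computed, using only the Python standard library and made-up toy values. The block group IDs, the adjacency dictionary, and the function names are assumptions for illustration, not the project's actual code. In the real pipeline the neighbors would be derived from the shapefile geometries in the GeoDataFrame, and I am assuming min-max scaling for the "normalized" income.

```python
from statistics import mean


def min_max_normalize(values):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to normalize, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def svi_from_income(incomes):
    """Invert normalized income so that 1 means lowest income (most vulnerable)."""
    return [1.0 - n for n in min_max_normalize(incomes)]


def neighbor_average(values, neighbors):
    """Average a value over each block group's adjacent block groups.

    values:    dict mapping block group ID -> value
    neighbors: dict mapping block group ID -> list of adjacent IDs
    Block groups with no neighbors get None.
    """
    result = {}
    for block_group, adjacent in neighbors.items():
        result[block_group] = (
            mean(values[n] for n in adjacent) if adjacent else None
        )
    return result


# Toy data (hypothetical, not from the Alabama dataset).
block_groups = ["A", "B", "C", "D"]
median_income = {"A": 30000, "B": 50000, "C": 70000, "D": 40000}
max_down_speed = {"A": 25, "B": 100, "C": 1000, "D": 50}
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

svi = dict(zip(block_groups,
               svi_from_income([median_income[b] for b in block_groups])))
neighbor_avg_speed = neighbor_average(max_down_speed, adjacency)
neighbor_avg_svi = neighbor_average(svi, adjacency)

print(svi)
print(neighbor_avg_speed)
print(neighbor_avg_svi)
```

In this toy example, block group A has the lowest income and therefore an SVI of 1, while C has the highest income and an SVI of 0.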

The Exploratory Data Analysis (EDA) phase yielded data-driven insights, primarily from the correlation matrix. The analysis supported my central hypothesis that the digital divide is driven by socioeconomic factors, though only modestly: the correlation between a block group's SVI and its Max Down Speed was 0.11. More importantly, the EDA showed that my engineered features are the strongest predictors: Neighbor Average Speed had a strong positive correlation (+0.62) with a block group's own speed, validating the project's focus on spatial context.
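For readers who want to see what a correlation matrix computes, here is a small sketch of the Pearson correlation in plain Python. I am assuming Pearson correlation, which is the usual default; the column names and toy values are hypothetical and will not reproduce the 0.11 or +0.62 figures from the Alabama data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return covariance / (spread_x * spread_y)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of column name -> values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Toy columns (hypothetical values, one entry per block group).
columns = {
    "svi": [1.0, 0.5, 0.0, 0.75],
    "max_down_speed": [25, 100, 1000, 50],
    "neighbor_avg_speed": [100, 512.5, 75, 1000],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Each diagonal entry is 1 because a column is perfectly correlated with itself, and the matrix is symmetric, so only the values above or below the diagonal need to be read.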
