U.S. Housing Market Analysis

What drives home prices across America — quantified across 2.2M+ listings with Python analytics + Tableau storytelling.

The raw dataset was chaotic: missing values across key fields, wild outliers (impossible bed/bath counts, massive lots), and inconsistent granularity by city and state.

I stabilized it with a cleaning pipeline: datatype fixes, livability constraints (plausible bed, bath, and lot values), IQR-based outlier filtering, minimum per-city sample thresholds, and a U.S.-only scope with engineered regional groupings.
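Two of those steps, IQR-based outlier filtering and the minimum city sample threshold, can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the column names (`city`, `state`, `price`), the conventional 1.5 × IQR fence, the `min_listings` cutoff, and the toy rows are all assumptions.

```python
import pandas as pd

def iqr_filter(df, column, k=1.5):
    """Keep rows whose `column` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]

def drop_small_cities(df, min_listings):
    """Keep only (state, city) groups with at least `min_listings` rows."""
    counts = df.groupby(["state", "city"])["price"].transform("size")
    return df[counts >= min_listings]

# Toy rows for illustration only (not taken from the real dataset).
listings = pd.DataFrame({
    "city": ["Austin"] * 5 + ["Marfa"],
    "state": ["Texas"] * 6,
    "price": [300_000, 320_000, 310_000, 305_000, 9_000_000, 315_000],
})

# Outliers go first, then cities left with too few listings.
cleaned = drop_small_cities(iqr_filter(listings, "price"), min_listings=3)
```

Filtering outliers before applying the city threshold means a city cannot clear the threshold on the strength of rows that are later discarded.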

Net result: a high-signal dataset built for decision-grade insights — shipped as an interactive Tableau story for fast exploration.

Python · Tableau · BEA API · 2.2M listings

Problem

Housing prices in the U.S. don’t just vary — they fracture. The affordability gap across regions raises real questions for buyers, investors, and policy stakeholders.

This project turns that chaos into measurable drivers by connecting property features (size, beds, baths, lot size) with geography and population trends to explain why markets behave so differently.

Dataset

Primary Dataset

U.S. Real Estate Listings (2023 snapshot) — 2.2M+ records.

Includes price + property attributes (size, bedrooms, bathrooms, lot size).

External Dataset

BEA API — U.S. & Texas population time series (1969–2023).

Why it matters: it lets the analysis go beyond “prices are high here” into “prices are high here because…”.
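As a rough sketch of how a population series might be requested and parsed, the snippet below builds a BEA Regional request URL and flattens a response. The table name (`SAINC1`), line code (`2`, population), the Texas `GeoFips` value (`48000`), and the response layout are assumptions to verify against the BEA API documentation; the project itself may have used the `beaapi` package and a different table. The sample payload holds placeholder values, not real population figures.

```python
from urllib.parse import urlencode

BEA_ENDPOINT = "https://apps.bea.gov/api/data"

def population_request_url(api_key, geo_fips, table="SAINC1", line_code=2):
    """Build a GetData URL for the Regional dataset (parameters are assumptions)."""
    params = {
        "UserID": api_key,
        "method": "GetData",
        "datasetname": "Regional",
        "TableName": table,
        "LineCode": line_code,
        "GeoFips": geo_fips,
        "Year": "ALL",
        "ResultFormat": "JSON",
    }
    return f"{BEA_ENDPOINT}?{urlencode(params)}"

def parse_population(payload):
    """Turn the assumed BEAAPI -> Results -> Data layout into {year: value}."""
    rows = payload["BEAAPI"]["Results"]["Data"]
    return {
        int(row["TimePeriod"]): float(row["DataValue"].replace(",", ""))
        for row in rows
    }

# Hand-built payload with placeholder values, shaped like the assumed response.
sample_payload = {"BEAAPI": {"Results": {"Data": [
    {"GeoFips": "48000", "TimePeriod": "2022", "DataValue": "1,000"},
    {"GeoFips": "48000", "TimePeriod": "2023", "DataValue": "1,010"},
]}}}

url = population_request_url("YOUR-API-KEY", "48000")
series = parse_population(sample_payload)
```

Keeping URL construction and parsing as separate pure functions lets both be checked without a network call or an API key.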

Approach

1) Data Wrangling

Cleaned and validated listings at scale (types, missingness, duplicates, sanity checks).
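A minimal sketch of what such a pass could look like in pandas, assuming hypothetical column names (`price`, `bed`, `bath`, `house_size`) and simple positivity checks standing in for the real sanity rules:

```python
import pandas as pd

NUMERIC_COLUMNS = ["price", "bed", "bath", "house_size"]  # assumed names

def basic_clean(df):
    out = df.copy()
    # Types: coerce numeric fields; unparseable entries become NaN.
    for col in NUMERIC_COLUMNS:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Missingness: require the fields the later analysis depends on.
    out = out.dropna(subset=["price", "house_size"])
    # Duplicates: drop exact repeat rows.
    out = out.drop_duplicates()
    # Sanity checks: prices and sizes must be positive.
    out = out[(out["price"] > 0) & (out["house_size"] > 0)]
    return out.reset_index(drop=True)

# Toy rows for illustration only.
raw = pd.DataFrame({
    "price": ["250000", "250000", "n/a", "-1", "410000"],
    "bed": [3, 3, 2, 4, None],
    "bath": [2, 2, 1, 3, 2],
    "house_size": ["1500", "1500", "900", "2000", "2200"],
})
clean = basic_clean(raw)
```

Here the duplicate row, the unparseable price, and the negative price are removed, while the row with a missing bed count is kept because `bed` is not treated as a required field in this sketch.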

2) EDA + Feature Relationships

Computed correlations and compared patterns by region/state. Result: home size consistently dominates as the most informative numeric driver.
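One way to compare the size and price relationship across regions is a per-group Pearson correlation. The sketch below assumes `state`, `house_size`, and `price` columns and a partial state-to-region lookup based on U.S. Census regions; the project's own regional groupings may differ, and the rows are toy values.

```python
import pandas as pd

# Partial lookup for illustration; a full mapping would cover every state.
STATE_TO_REGION = {
    "Texas": "South",
    "Ohio": "Midwest",
    "California": "West",
    "New York": "Northeast",
}

def size_price_corr_by_region(df):
    """Pearson r between house_size and price, computed within each region."""
    with_region = df.assign(region=df["state"].map(STATE_TO_REGION))
    return pd.Series({
        region: group["house_size"].corr(group["price"])
        for region, group in with_region.groupby("region")
    })

# Toy rows for illustration only.
toy = pd.DataFrame({
    "state": ["Texas"] * 3 + ["Ohio"] * 3,
    "house_size": [1000, 2000, 3000, 1000, 2000, 3000],
    "price": [100, 200, 300, 300, 100, 200],
})
corr_by_region = size_price_corr_by_region(toy)
```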

3) Modeling

  • Linear Regression: price ~ house_size, with price log-transformed to reduce skew → R² ≈ 0.40
  • K-Means clustering: standardized price + size, elbow method → k=3 tiers
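The regression step can be illustrated with a one-variable least-squares fit of log(price) on house size. This sketch uses NumPy's `polyfit` in place of scikit-learn to stay dependency-light, and the data is synthetic, generated from arbitrary coefficients rather than drawn from the listings, so the numbers it produces say nothing about the real market.

```python
import numpy as np

def fit_log_price(house_size, price):
    """Ordinary least squares of log(price) on house_size; returns slope, intercept, R^2."""
    x = np.asarray(house_size, dtype=float)
    y = np.log(np.asarray(price, dtype=float))
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Synthetic data: log(price) is linear in size plus Gaussian noise.
rng = np.random.default_rng(0)
size = rng.uniform(500, 4000, 500)
price = np.exp(11.5 + 0.0004 * size + rng.normal(0, 0.5, 500))

slope, intercept, r2 = fit_log_price(size, price)
```

Because the target is log(price), the slope reads as an approximate proportional change in price per extra unit of size rather than a fixed dollar amount.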

4) Time Series (Texas Deep Dive)

Paired Texas housing metrics with BEA population trends to show the growth narrative.
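A simple way to put population and price on a comparable footing is to align both annual series on shared years and index them to a common base year. The sketch below assumes both inputs are pandas Series keyed by year; the values are placeholders, not BEA or listing figures, and the project may have paired the series differently.

```python
import pandas as pd

def align_and_index(population, price, base_year):
    """Inner-join two annual series and rescale each so base_year = 100."""
    joined = pd.concat(
        {"population": population, "price": price}, axis=1, join="inner"
    ).sort_index()
    return joined / joined.loc[base_year] * 100.0

# Placeholder values for illustration only.
population = pd.Series({2019: 100.0, 2020: 102.0, 2021: 104.0})
price = pd.Series({2020: 200.0, 2021: 230.0, 2022: 250.0})

indexed = align_and_index(population, price, base_year=2020)
```

Only years present in both series survive the join, so the indexed frame here covers 2020 and 2021.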

Results

  • Size is the strongest predictor (r ≈ 0.41 nationally).
  • Regional signal is real: correlations appear strongest in South & Midwest.
  • Regression confirms structure: R² ≈ 0.40 after log-transforming price (for comparison, r ≈ 0.41 on untransformed values would imply R² ≈ 0.17 in a one-variable fit).
  • Clustering finds 3 affordability tiers (approximate price ranges): Budget: < $300K · Mid-range: $300K–$600K · Premium: > $700K
  • Texas case: population growth aligns with long-term housing price increases; size/baths dominate more strongly in-state.

Tools

Data Wrangling

pandas, numpy, os

EDA & Visualization

matplotlib, seaborn, folium

Modeling

scikit-learn, statsmodels

Time Series

statsmodels.tsa (ADF test, ACF/PACF)

API

beaapi, requests, json

Storytelling

Tableau Public

Visualizations

This project runs on two lanes: Python = proof (reproducible analytics) and Tableau = product (interactive stakeholder-ready narrative).

Interactive Report (Tableau)

Open in Tableau

Feature Correlation Heatmap

Snapshot from Python EDA (correlation structure).

Price vs House Size (Regression)

Regression scatter plot: price vs house size

Regression view of the size → price relationship, with price log-transformed.

Affordability Tiers (K-Means)

K-means clustering: affordability tiers

k=3 segments derived from standardized price + size.
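The segmentation can be sketched with scikit-learn: standardize price and size, record the K-Means inertia for a range of k (the quantity an elbow plot shows), then assign labels at the chosen k. The three synthetic blobs below are made up for illustration, and `n_init` and `random_state` are arbitrary choices, not the project's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def elbow_inertias(X_std, k_values, seed=0):
    """Within-cluster sum of squares for each candidate k."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_std).inertia_
        for k in k_values
    }

def assign_tiers(X_std, k=3, seed=0):
    """Cluster label for each row at the chosen k."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_std)

# Synthetic (price, size) points around three made-up centers.
rng = np.random.default_rng(42)
centers = np.array([[200_000, 1_200], [450_000, 2_000], [900_000, 3_500]])
X = np.vstack([c + rng.normal(0, [20_000, 100], size=(100, 2)) for c in centers])

# Standardize so price (large units) does not swamp size in the distance metric.
X_std = StandardScaler().fit_transform(X)
inertias = elbow_inertias(X_std, range(1, 7))
labels = assign_tiers(X_std, k=3)
```

Plotting `inertias` against k gives the elbow curve; the k where the drop flattens is the candidate number of tiers.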

Texas Deep Dive (Population Trend)

Texas population time series

Population growth paired with long-term price narrative.

Takeaways

This analysis doesn’t just show where homes are expensive — it explains why, quantifies the strongest levers, and packages it into an interactive story that a non-technical stakeholder can actually use.

If you want a fast read: size drives price, regions reshape the rules, and markets naturally segment into tiers — exactly what you’d need for pricing strategy, risk profiling, or policy planning.