VERTZY CASE STUDY

NYC Property Intelligence Pipeline

Turning 85M+ rows of public property data into targeted solar sales leads

85M+
Records Ingested
3
Public Data Sources
7.5 GB
Raw CSV Data
60 days
Lead Targeting Window

The Problem

A solar sales professional was spending significant time and money relying on his company's internal leads team to identify potential customers. The existing process was manual, slow, and failed to capitalize on the most valuable sales window: the first 60 days after a property purchase, when new owners are actively making investment decisions about their property.

The core insight was simple — publicly available NYC property transaction records contain every residential and commercial building sold in the city. If you could identify new owners fast enough and enrich that data with property characteristics, you'd have a highly targeted lead list that outperforms generic marketing channels.

"What is obvious to you, an engineer steeped in data on a daily basis, is not obvious to non-technical professionals. Most non-technical professionals are unaware of how much money they are leaving on the table by not using available data to its fullest potential."

The Approach

I designed a multi-source data pipeline that unifies three complementary NYC datasets into a single property intelligence layer, hosted on PostgreSQL via Supabase.

Data Sources

ACRIS

NYC Department of Finance

Property transactions: deeds, mortgages, party names and addresses, transaction dates. 85M+ rows across 7 tables.

PLUTO

NYC Department of City Planning

Zoning, building classification, unit counts, assessed values, lot geometry. Filtered to Williamsburg and Greenpoint target zip codes.

StreetEasy

Property Market Data

Rental history, building discovery, market rent data. Web scraped via headless browser with anti-detection.

Technical Implementation

Bulk Load Optimization

The biggest engineering challenge was ingesting 85M+ rows of ACRIS data efficiently into a hosted PostgreSQL instance. A naive INSERT approach would have taken hours. Instead, I implemented a staging table strategy that reduced load time dramatically:

1. Drop all non-primary-key indexes on the target table
2. Truncate the target and create an UNLOGGED staging table (bypasses the write-ahead log for speed)
3. COPY raw CSV data into the staging table in bulk
4. INSERT-SELECT from staging to production in 500K-row batches with ON CONFLICT DO NOTHING
5. Recreate indexes with maintenance_work_mem bumped to 1GB for faster index builds

This approach is borrowed from data warehouse loading patterns and is rarely seen in application-layer code. It was the single most impactful technical decision in the project.
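The load sequence above can be sketched as an ordered list of SQL statements. This is an illustrative sketch, not the project's actual code: the table name, staging name, and index column below are hypothetical placeholders.

```python
def staging_load_plan(target: str, staging: str, index_col: str) -> list[str]:
    """Ordered SQL for a staging-table bulk load (illustrative sketch)."""
    idx = f"idx_{index_col}"
    return [
        # 1. Drop secondary indexes so the bulk insert skips index maintenance.
        f"DROP INDEX IF EXISTS {idx}",
        # 2. Truncate the target and create an UNLOGGED staging table
        #    (UNLOGGED tables bypass the write-ahead log).
        f"TRUNCATE {target}",
        f"CREATE UNLOGGED TABLE IF NOT EXISTS {staging} "
        f"(LIKE {target} INCLUDING DEFAULTS)",
        # 3. Bulk COPY raw CSV into staging (executed via the driver,
        #    e.g. psycopg2's copy_expert, which streams stdin).
        f"COPY {staging} FROM STDIN WITH (FORMAT csv, HEADER)",
        # 4. Move rows into production; ON CONFLICT DO NOTHING dedupes.
        #    The real pipeline ran this step in 500K-row batches.
        f"INSERT INTO {target} SELECT * FROM {staging} ON CONFLICT DO NOTHING",
        # 5. Recreate indexes with a larger maintenance_work_mem.
        "SET maintenance_work_mem = '1GB'",
        f"CREATE INDEX {idx} ON {target} ({index_col})",
    ]

plan = staging_load_plan("acris.master", "acris.master_staging", "doc_date")
```

The ordering is the whole trick: index maintenance and WAL writes are deferred until after the bulk copy, when they can be paid once instead of per-row.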

Web Scraping with Anti-Detection

StreetEasy data required headless browser automation via Playwright with anti-detection measures including CAPTCHA detection, human-like scroll behavior, and Chrome user-agent spoofing. Building discovery was handled by asynchronously parsing StreetEasy's XML sitemaps to extract building URLs before scraping individual property pages.
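The sitemap-discovery step can be sketched with the standard library's XML parser. This is a hedged sketch, not the pipeline's actual code: the `/building/` path filter is an assumption about URL structure, and in the real pipeline the sitemaps were fetched concurrently (e.g. with httpx) before parsing.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace; real sitemap files declare this as the default.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def extract_building_urls(sitemap_xml: str) -> list[str]:
    """Pull <loc> entries that look like building pages out of a sitemap.

    The "/building/" substring filter is a hypothetical heuristic for
    separating building pages from blog posts and other URLs.
    """
    root = ET.fromstring(sitemap_xml)
    locs = root.findall(".//sm:loc", SITEMAP_NS)
    return [el.text for el in locs if el.text and "/building/" in el.text]
```

Parsing sitemaps first means the scraper only ever visits pages it already knows exist, which keeps request volume (and detection risk) down compared with crawling navigation pages.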

Infrastructure

The pipeline uses SQLAlchemy connection pooling (pool size 5, overflow 10, pre-ping health checks) with a custom upsert abstraction that handles composite-key conflicts with 3-attempt retry and exponential backoff. Storage is entirely PostgreSQL on Supabase, organized into three schemas:

• acris — 7 tables of property transactions with parties and legal descriptions
• pluto — Zoning, building, ownership, and valuation data
• streeteasy — Building discovery and rental market data
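The retrying upsert can be sketched as a generic retry wrapper plus a statement builder. This is a minimal sketch under stated assumptions: the table and column names are hypothetical, and the backoff schedule (0.5s doubling per attempt) is an assumption, not the project's actual tuning. The pooled engine itself would be created with SQLAlchemy's `create_engine(url, pool_size=5, max_overflow=10, pool_pre_ping=True)`.

```python
import time

def with_retry(fn, attempts: int = 3, base_delay: float = 0.5):
    """Call fn(), retrying on failure with exponential backoff.

    Sleeps base_delay, then 2x, then 4x... and re-raises after the
    final attempt so callers still see the underlying error.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

def upsert_sql(table: str, columns: list[str], conflict_cols: list[str]) -> str:
    """Build an INSERT ... ON CONFLICT DO NOTHING statement for a
    composite conflict key, with named bind parameters."""
    col_list = ", ".join(columns)
    params = ", ".join(f":{c}" for c in columns)
    conflict = ", ".join(conflict_cols)
    return (
        f"INSERT INTO {table} ({col_list}) VALUES ({params}) "
        f"ON CONFLICT ({conflict}) DO NOTHING"
    )
```

Wrapping each batch execution in `with_retry` absorbs the transient connection drops that hosted Postgres instances occasionally produce, while `pool_pre_ping` catches stale pooled connections before they are handed out.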

Current State & Next Steps

The data ingestion layer is production-ready — all three sources are loaded, validated, and queryable. The enrichment layer (cross-source joins to produce actionable lead lists) and the output layer (automated export and outreach integration) were scoped but not completed due to the end of the client engagement.

The project demonstrates the full data engineering approach: identify a business problem, map it to available public data, build a reliable pipeline, and design for downstream consumption.

Tech Stack

Python · PostgreSQL · Supabase · SQLAlchemy · Pandas · Playwright · BeautifulSoup · httpx

Have a similar data problem? Let's talk.

I write about the consulting journey, data engineering, and AI on Substack — the origin story for this project is there too.