AI COLLABORATION

Data Extraction as Detective Work: Discovering the Hidden Infrastructure of a Nationwide Nonprofit

How AI-assisted browser automation uncovered a sophisticated digital ecosystem supporting 359 chapters - including a secret social media toolkit hiding at /blog/page/2/

Tools Used:
Playwright MCP · Claude Code · Explore Agents · Browser Automation

The Mission That Became a Mystery

“Can you extract all the data from shpbeds.org for a migration?”

Simple request, right? Pull chapters, blog posts, team members - standard extraction stuff. Three hours and 57 files later, we’d uncovered a sophisticated digital infrastructure with 359 chapters, 1,960 product variants, and a hidden social media toolkit that nobody knew existed.

This is the story of how data extraction became detective work, and why questioning assumptions leads to the best discoveries.

Starting Point: The Obvious Stuff

Sleep in Heavenly Peace - 359 chapters nationwide, and a website migration on the horizon.

First instinct? Fire up Playwright MCP and start grabbing the obvious targets:

  • 359 chapters with location data
  • 100+ team member profiles
  • Blog posts and news
  • Donation infrastructure

But here’s where the approach diverged from typical extraction: dual-format output from the start.

Output Strategy:
  - JSON: Machine-readable, migration-ready
  - Markdown: Human-reviewable, sanity-checkable

Why Both?
  "Trying to review 359 chapters in JSON is a recipe
   for missing something important. Markdown docs
   become the human validation layer."

★ Insight ─────────────────────────────────────

Dual-format extraction isn’t redundant - it’s risk management. The JSON feeds your migration scripts, but the Markdown lets humans spot patterns, inconsistencies, and relationships that automated tools might miss. Think of it as “trust but verify” for data extraction.

─────────────────────────────────────────────────
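For the chapter data, the dual-format idea is just a few lines of glue. Here's a minimal sketch, assuming a hypothetical Chapter shape (the real extraction schema isn't reproduced in this post):

  // Dual-format output: one artifact for machines, one for humans.
  // The Chapter interface is illustrative, not the site's actual schema.
  import { writeFileSync } from 'node:fs';

  interface Chapter {
    name: string;
    city: string;
    state: string;
    contactEmail?: string;
  }

  function writeBothFormats(chapters: Chapter[]) {
    // JSON: machine-readable, feeds the migration scripts directly.
    writeFileSync('chapters.json', JSON.stringify(chapters, null, 2));

    // Markdown: human-reviewable, so a person can skim for oddities.
    const md = chapters
      .map(c => `## ${c.name}\n- City: ${c.city}, ${c.state}\n- Contact: ${c.contactEmail ?? 'n/a'}`)
      .join('\n\n');
    writeFileSync('chapters.md', `# Chapters (${chapters.length})\n\n${md}\n`);
  }

  writeBothFormats([
    { name: 'SHP - Example Chapter', city: 'Twin Falls', state: 'ID' },
  ]);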

The Parallel Processing Breakthrough

After extracting chapters and team members sequentially, a realization hit: these sections don’t depend on each other. Why wait?

Spun up 4 parallel Explore agents, each tackling independent sections:

  1. Donation infrastructure
  2. Volunteer portal
  3. Staff authentication
  4. Sponsor data

This parallel approach did something unexpected - it revealed patterns we wouldn’t have caught with sequential extraction.
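Outside the agent tooling, the same idea looks roughly like this in plain Playwright: four extraction tasks that share nothing, launched at once. The URLs and selectors below are placeholders rather than the site's actual structure:

  // Conceptual sketch of the parallel pass. Each section is independent,
  // so nothing blocks on anything else. URLs and selectors are placeholders.
  import { chromium, type Browser } from 'playwright';

  async function extractSection(browser: Browser, name: string, url: string) {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle' });
    const text = await page.locator('main').innerText();
    await page.close();
    return { name, chars: text.length };
  }

  async function main() {
    const browser = await chromium.launch();
    const sections: Array<[string, string]> = [
      ['donations', 'https://shpbeds.org/donate'],
      ['volunteer', 'https://shpbeds.org/volunteer'],
      ['staff-login', 'https://shpbeds.org/login'],
      ['sponsors', 'https://shpbeds.org/sponsors'],
    ];

    // Kick off all four at once instead of waiting for each in turn.
    const results = await Promise.all(
      sections.map(([name, url]) => extractSection(browser, name, url)),
    );
    console.table(results);
    await browser.close();
  }

  main();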

Extraction Timeline

Session 1
🎯 Core Data Extraction - Chapters, team members, blog posts. The obvious targets.
Session 2
⚡ Parallel Discovery - 4 agents simultaneously uncover donation complexity, Shopify scale, OAuth system.
Session 3
💎 The Page 2 Revelation - Questioning pagination assumptions leads to social media toolkit discovery.

Plot Twist #1: The Shopify Surprise

Expected: A simple merch store. A few t-shirts, maybe some stickers.

Reality: 168 products. 1,960 variants.

Think about that for a second. This nonprofit has nearly 2,000 different product configurations. That’s not a merch store - that’s a full-blown e-commerce operation supporting their mission. Each chapter probably has custom items, different sizes, colors, branded merchandise.

The extraction for this alone required handling:

  • Pagination through product catalogs
  • Variant structures (size/color combinations)
  • Pricing tiers
  • Inventory tracking
  • Chapter-specific product routing

This wasn’t “grab product names and prices” - this was understanding an entire business system embedded in their site.
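Many Shopify storefronts expose a public /products.json endpoint that accepts limit and page parameters. Assuming this store does too (the post doesn't confirm it), the catalog walk looks roughly like this; the store URL is a placeholder:

  // Hedged sketch of walking a Shopify catalog via the public JSON endpoint.
  // STORE is a placeholder; not every storefront exposes /products.json.
  const STORE = 'https://example-store.myshopify.com';

  interface ShopifyVariant { id: number; title: string; price: string; }
  interface ShopifyProduct { id: number; title: string; variants: ShopifyVariant[]; }

  async function fetchAllProducts(): Promise<ShopifyProduct[]> {
    const all: ShopifyProduct[] = [];
    // Keep paging until the endpoint returns an empty list.
    for (let page = 1; ; page++) {
      const res = await fetch(`${STORE}/products.json?limit=250&page=${page}`);
      const { products } = (await res.json()) as { products: ShopifyProduct[] };
      if (products.length === 0) break;
      all.push(...products);
    }
    return all;
  }

  fetchAllProducts().then(products => {
    const variants = products.reduce((n, p) => n + p.variants.length, 0);
    console.log(`${products.length} products, ${variants} variants`);
  });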

Plot Twist #2: The Donation Maze (And Why Complexity is Good)

Their donation system revealed serious sophistication:

Donation Infrastructure:
  - Payment processor: FundraiseUp (advanced widget integration)
  - CRM backend: Salesforce (donor relationship management)
  - Giving options: 7 distinct pathways
  - Routing logic: chapter-specific donation allocation

The seven giving options weren’t arbitrary - they solved real donor needs:

  • One-time donations
  • Monthly recurring gifts
  • Tribute/memorial gifts
  • Corporate matching programs
  • Donor-advised funds
  • Stock/crypto donations
  • Legacy/estate planning

Why this matters for extraction: understanding the “why” behind complexity helps you avoid missing critical connections. Each donation pathway had different form fields, routing logic, and follow-up workflows.
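One way to keep those differences straight during extraction is to model each pathway explicitly. The shape below is an assumption about what's worth capturing, not the site's actual schema:

  // Assumed record shape for a giving pathway. Field names are illustrative;
  // the post names FundraiseUp and Salesforce but not the underlying schema.
  type PathwayKind =
    | 'one-time' | 'monthly' | 'tribute' | 'corporate-match'
    | 'donor-advised-fund' | 'stock-crypto' | 'legacy';

  interface DonationPathway {
    kind: PathwayKind;
    processor: 'FundraiseUp';          // payment widget observed on the site
    crm: 'Salesforce';                 // backend named in the post
    formFields: string[];              // each pathway asked for different info
    routing: 'national' | 'chapter';   // where the gift gets allocated
    followUpWorkflow?: string;         // e.g. tribute notification email
  }

  const tribute: DonationPathway = {
    kind: 'tribute',
    processor: 'FundraiseUp',
    crm: 'Salesforce',
    formFields: ['honoree name', 'notification email', 'amount'],
    routing: 'chapter',
    followUpWorkflow: 'send tribute card to the notification email',
  };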

The Page 2 Discovery: When Pagination Gets Weird

After the first extraction push, something bugged me. 100 blog posts on page 1. Modern blogs usually paginate at 10-12 posts per page. If they’re showing 100 on page 1, what’s on page 2?

Clicked the pagination link. It showed 0 blog posts at first. Weird.

Tried again.

Plot twist: Page 2 wasn’t more blog posts at all. It was a complete social media toolkit - 30 holidays throughout the year, each with custom graphics and pre-written captions.

Found at /blog/page/2/:
  - Veterans Day content (November 11)
  - MLK Day content (January 15)
  - National Sleep Day (March 10 - perfect!)
  - Christmas, Thanksgiving, Easter...
  - 30 holidays total
  - Professional graphics for each
  - Pre-written captions
  - On-brand messaging

This wasn’t on the extraction list - nobody expected content management tools hiding behind blog pagination.

Think about the engineering here: 359 chapters, coordinated social media content, zero distribution overhead. They repurposed blog pagination as a content distribution mechanism.

That’s solving a coordination problem with available infrastructure.
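In plain Playwright terms, the check itself is short. The article selector is a guess at the blog markup; the point is to wait for JavaScript-rendered content before deciding a page is empty:

  // Sketch of the "what's actually on page 2?" check.
  // 'article' is a placeholder selector, not the site's confirmed markup.
  import { chromium } from 'playwright';

  async function inspectPage2() {
    const browser = await chromium.launch();
    const page = await browser.newPage();
    await page.goto('https://shpbeds.org/blog/page/2/', { waitUntil: 'networkidle' });

    const posts = await page.locator('article').count();
    if (posts === 0) {
      // The page "looks" empty: dump what's really there before moving on.
      console.log(await page.locator('main').innerText());
      await page.screenshot({ path: 'blog-page-2.png', fullPage: true });
    }
    await browser.close();
  }

  inspectPage2();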

★ Insight ─────────────────────────────────────

This discovery only happened because we questioned assumptions. “Page 2 should have more blog posts” led to “Wait, why doesn’t it?” which led to “Let me check what’s actually there.” Data extraction isn’t just following a script - it’s investigating a system. Budget time for surprises.

─────────────────────────────────────────────────

Authentication Insights: OAuth Architecture

The staff login page revealed Google OAuth 2.0 integration. This wasn’t just a technical detail - it revealed operational workflow:

  • No password management overhead
  • Leverages existing Google Workspace accounts
  • Enterprise-grade security out of the box
  • Single sign-on across tools

Migration implication: You can’t just dump users in a new system and expect them to create passwords. The authentication flow is part of their operational muscle memory.

The GuideStar Deep Dive: Scale Context

When we pulled their external verification profile:

  • Platinum Seal from GuideStar/Candid (highest level)
  • 7 consecutive years of top-tier transparency
  • $18.2 million in revenue
  • Complete financial disclosure

This isn’t a small charity. This is a major operation with serious accountability requirements. The data extraction needed to maintain that level of professionalism and completeness.

Technical Decisions That Mattered

Why Playwright MCP Over Basic Scrapers

Modern websites are JavaScript-heavy, dynamically rendered, with complex interactions. Basic scrapers would’ve:

  • Missed half the dynamic content
  • Failed on form submissions
  • Broken on authenticated pages
  • Missed the social media kit entirely

Playwright MCP gave us:

  • Full browser context
  • JavaScript execution
  • DOM inspection
  • Network traffic analysis (sketched below)
  • Screenshot validation
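Network traffic analysis is the least obvious of those, so here's a minimal sketch of passive listening in Playwright. The content-type filter and target URL are placeholders:

  // Minimal sketch of passive network analysis while browsing a page.
  // Embedded integrations (donation widgets, commerce APIs) tend to reveal
  // themselves in the JSON traffic the page generates.
  import { chromium } from 'playwright';

  async function watchApiTraffic(url: string) {
    const browser = await chromium.launch();
    const page = await browser.newPage();

    // Log every JSON response the page triggers while it loads.
    page.on('response', res => {
      const type = res.headers()['content-type'] ?? '';
      if (type.includes('application/json')) {
        console.log(res.status(), res.url());
      }
    });

    await page.goto(url, { waitUntil: 'networkidle' });
    await browser.close();
  }

  watchApiTraffic('https://shpbeds.org/');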

Why Parallel Agents Cut 75% Off Extraction Time

Four independent sections running at once instead of one after another is roughly a 4x reduction in wall-clock time; that's where the 75% comes from. But more importantly: parallel execution lets you spot patterns across sections simultaneously. When you see donation complexity at the same time as Shopify scale and OAuth architecture, you understand the system holistically.

Why Git From the Start

Not an afterthought. Version control becomes documentation of discovery:

Commit History Tells a Story:
1699682 - "Complete SHP website extraction" (93 files, core data)
d122cf8 - "Add Social Media Kit extraction" (the Page 2 discovery!)
8084854 - "Add comprehensive README" (tying it all together)

Each commit is a milestone. Not “WIP” or “more changes” but actual discoveries documented.

What Got Extracted: The Complete Inventory

Final Output:
  - 14 major components
  - 57+ files
  - ~1.8 MB structured data
  - 100% success rate on targets
  - 1 bonus discovery (social media kit!)

Key Extractions:
  ✓ 359 chapters (JSON + CSV + Markdown)
  ✓ 100+ team members
  ✓ 100 blog posts
  ✓ 168 Shopify products (1,960 variants!)
  ✓ 7 donation pathways
  ✓ 25+ volunteer opportunities
  ✓ 4 major sponsors
  ✓ OAuth authentication system
  ✓ Financial transparency data
  ✓ 30-holiday social media toolkit

Lessons: Data Extraction as System Understanding

1. Expect Surprises

Budget time for discovering things that aren’t documented. The social media kit was pure bonus because we questioned pagination.

2. Parallel Everything You Can

Modern machines can handle it. You’ll spot patterns faster when you see multiple sections simultaneously.

3. Document As You Go

Not after. During. Your future self (and the migration team) will thank you.

4. Validate Early and Often

Better to catch issues at 10 records than 10,000. Our dual-format approach made validation continuous.

5. Understand the Why

Don’t just extract the data - understand the system it represents. Why 7 donation options? Why 1,960 product variants? Each complexity solved a real problem.

The Philosophy: Extraction as Investigation

Every website is someone’s solution to a problem. When you’re extracting data, you’re reverse-engineering their engineering decisions.

The complexity we found:

  • 359 chapter routing
  • Sophisticated donation system
  • Hidden social media toolkit
  • Enterprise authentication
  • E-commerce at scale

This wasn’t accidental. It evolved to solve coordination problems at a scale where email and spreadsheets break down.

That’s what makes extraction projects interesting. You’re not just copying data - you’re learning how systems evolve under real-world constraints.

Technical Stack Reference

Extraction Technology:
  - Primary tool: Playwright MCP (browser automation)
  - Approach: DOM inspection + network analysis
  - Parallelization: 4 Explore subagents simultaneously
  - Output format: dual (JSON + Markdown)
  - Version control: Git from project start
  - Validation: cross-referenced multiple sources

Key Integrations Documented:

  • Shopify (e-commerce platform)
  • FundraiseUp (donation widget)
  • Salesforce (CRM backend)
  • JotForm (contact routing)
  • Google OAuth 2.0 (staff authentication)
  • GuideStar/Candid (transparency verification)

The Bottom Line

We extracted 14 major components, created 57+ files, and produced about 1.8 MB of structured data. 100% success rate on extraction targets, plus one bonus discovery.

But more than that, we understood the system. We know why it’s built the way it is. The migration team isn’t just getting data - they’re getting insight into how the digital infrastructure actually works.

And that social media kit at /blog/page/2/? Perfect example of why you question assumptions during extraction.

The real lesson: Extraction projects are never just about the data. They’re about understanding systems, discovering hidden patterns, and sometimes finding functionality at URLs that make no logical sense.


Project Stats:

  • Duration: ~3 hours across 3 sessions
  • Components Extracted: 14
  • Files Created: 57+
  • Total Data Size: ~1.8 MB
  • Success Rate: 100%
  • Bonus Discoveries: 1 (Social Media Kit)
  • Coffee Consumed: Probably several cups

Final Thought: The best part of this project wasn’t the extraction itself - it was understanding how systems evolve when they’re solving coordination problems at scale. Every complexity had a reason. Every integration served a purpose.

That’s worth documenting.

Outcome

14 components, 57+ files, 1.8 MB structured data, 100% success rate, 1 bonus discovery

Tags: data extraction, playwright, browser automation, nonprofit tech, parallel processing, system investigation, web scraping, discovery