AI COLLABORATION

Data Extraction as Detective Work: Discovering the Hidden Infrastructure of a Nationwide Nonprofit

How AI-assisted browser automation uncovered a sophisticated digital ecosystem supporting 359 chapters - including a secret social media toolkit hiding at /blog/page/2/

Tools Used:
Playwright MCP · Claude Code · Explore Agents · Browser Automation

The Mission That Became a Mystery

“Can you extract all the data from shpbeds.org for a migration?”

Simple request, right? Pull chapters, blog posts, team members - standard extraction stuff. Three hours and 57 files later, we’d uncovered a sophisticated digital infrastructure with 359 chapters, 1,960 product variants, and a hidden social media toolkit that nobody knew existed.

This is the story of how data extraction became detective work, and why questioning assumptions leads to the best discoveries.

Starting Point: The Obvious Stuff

Sleep in Heavenly Peace - 359 chapters nationwide, and a website migration on the horizon.

First instinct? Fire up Playwright MCP and start grabbing the obvious targets:

  • 359 chapters with location data
  • 100+ team member profiles
  • Blog posts and news
  • Donation infrastructure

But here’s where the approach diverged from typical extraction: dual-format output from the start.

Output Strategy:
  - JSON: Machine-readable, migration-ready
  - Markdown: Human-reviewable, sanity-checkable

Why Both?
  "Trying to review 359 chapters in JSON is a recipe
   for missing something important. Markdown docs
   become the human validation layer."

★ Insight ─────────────────────────────────────

Dual-format extraction isn’t redundant - it’s risk management. The JSON feeds your migration scripts, but the Markdown lets humans spot patterns, inconsistencies, and relationships that automated tools might miss. Think of it as “trust but verify” for data extraction.

─────────────────────────────────────────────────
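For the chapter data, the dual-format idea is just a few lines of glue. Here's a minimal sketch, assuming a hypothetical Chapter shape (the real extraction schema isn't reproduced in this post):

  // Dual-format output: one artifact for machines, one for humans.
  // The Chapter interface is illustrative, not the site's actual schema.
  import { writeFileSync } from 'node:fs';

  interface Chapter {
    name: string;
    city: string;
    state: string;
    contactEmail?: string;
  }

  function writeBothFormats(chapters: Chapter[]) {
    // JSON: machine-readable, feeds the migration scripts directly.
    writeFileSync('chapters.json', JSON.stringify(chapters, null, 2));

    // Markdown: human-reviewable, so a person can skim for oddities.
    const md = chapters
      .map(c => `## ${c.name}\n- City: ${c.city}, ${c.state}\n- Contact: ${c.contactEmail ?? 'n/a'}`)
      .join('\n\n');
    writeFileSync('chapters.md', `# Chapters (${chapters.length})\n\n${md}\n`);
  }

  writeBothFormats([
    { name: 'SHP - Example Chapter', city: 'Twin Falls', state: 'ID' },
  ]);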

The Parallel Processing Breakthrough

After extracting chapters and team members sequentially, a realization hit: these sections don’t depend on each other. Why wait?

Spun up 4 parallel Explore agents, each tackling independent sections:

  1. Donation infrastructure
  2. Volunteer portal
  3. Staff authentication
  4. Sponsor data

This parallel approach did something unexpected - it revealed patterns we wouldn’t have caught with sequential extraction.
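Outside the agent tooling, the same idea looks roughly like this in plain Playwright: four extraction tasks that share nothing, launched at once. The URLs and selectors below are placeholders rather than the site's actual structure:

  // Conceptual sketch of the parallel pass. Each section is independent,
  // so nothing blocks on anything else. URLs and selectors are placeholders.
  import { chromium, type Browser } from 'playwright';

  async function extractSection(browser: Browser, name: string, url: string) {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle' });
    const text = await page.locator('main').innerText();
    await page.close();
    return { name, chars: text.length };
  }

  async function main() {
    const browser = await chromium.launch();
    const sections: Array<[string, string]> = [
      ['donations', 'https://shpbeds.org/donate'],
      ['volunteer', 'https://shpbeds.org/volunteer'],
      ['staff-login', 'https://shpbeds.org/login'],
      ['sponsors', 'https://shpbeds.org/sponsors'],
    ];

    // Kick off all four at once instead of waiting for each in turn.
    const results = await Promise.all(
      sections.map(([name, url]) => extractSection(browser, name, url)),
    );
    console.table(results);
    await browser.close();
  }

  main();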

Extraction Timeline

Session 1
🎯 Core Data Extraction - Chapters, team members, blog posts. The obvious targets.
Session 2
⚡ Parallel Discovery - 4 agents simultaneously uncover donation complexity, Shopify scale, OAuth system.
Session 3
💎 The Page 2 Revelation - Questioning pagination assumptions leads to social media toolkit discovery.

Plot Twist #1: The Shopify Surprise

Expected: A simple merch store. A few t-shirts, maybe some stickers.

Reality: 168 products. 1,960 variants.

Think about that for a second. This nonprofit has nearly 2,000 different product configurations. That’s not a merch store - that’s a full-blown e-commerce operation supporting their mission. Each chapter probably has custom items, different sizes, colors, branded merchandise.

The extraction for this alone required handling:

  • Pagination through product catalogs
  • Variant structures (size/color combinations)
  • Pricing tiers
  • Inventory tracking
  • Chapter-specific product routing

This wasn’t “grab product names and prices” - this was understanding an entire business system embedded in their site.
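Many Shopify storefronts expose a public /products.json endpoint that accepts limit and page parameters. Assuming this store does too (the post doesn't confirm it), the catalog walk looks roughly like this; the store URL is a placeholder:

  // Hedged sketch of walking a Shopify catalog via the public JSON endpoint.
  // STORE is a placeholder; not every storefront exposes /products.json.
  const STORE = 'https://example-store.myshopify.com';

  interface ShopifyVariant { id: number; title: string; price: string; }
  interface ShopifyProduct { id: number; title: string; variants: ShopifyVariant[]; }

  async function fetchAllProducts(): Promise<ShopifyProduct[]> {
    const all: ShopifyProduct[] = [];
    // Keep paging until the endpoint returns an empty list.
    for (let page = 1; ; page++) {
      const res = await fetch(`${STORE}/products.json?limit=250&page=${page}`);
      const { products } = (await res.json()) as { products: ShopifyProduct[] };
      if (products.length === 0) break;
      all.push(...products);
    }
    return all;
  }

  fetchAllProducts().then(products => {
    const variants = products.reduce((n, p) => n + p.variants.length, 0);
    console.log(`${products.length} products, ${variants} variants`);
  });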

Plot Twist #2: The Donation Maze (And Why Complexity is Good)

Their donation system revealed serious sophistication:

Donation Infrastructure:
  - Payment processor: FundraiseUp (advanced widget integration)
  - CRM backend: Salesforce (donor relationship management)
  - Giving options: 7 distinct pathways
  - Routing logic: chapter-specific donation allocation

The seven giving options weren’t arbitrary - they solved real donor needs:

  • One-time donations
  • Monthly recurring gifts
  • Tribute/memorial gifts
  • Corporate matching programs
  • Donor-advised funds
  • Stock/crypto donations
  • Legacy/estate planning

Why this matters for extraction: understanding the “why” behind complexity helps you avoid missing critical connections. Each donation pathway had different form fields, routing logic, and follow-up workflows.
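One way to keep those differences straight during extraction is to model each pathway explicitly. The shape below is an assumption about what's worth capturing, not the site's actual schema:

  // Assumed record shape for a giving pathway. Field names are illustrative;
  // the post names FundraiseUp and Salesforce but not the underlying schema.
  type PathwayKind =
    | 'one-time' | 'monthly' | 'tribute' | 'corporate-match'
    | 'donor-advised-fund' | 'stock-crypto' | 'legacy';

  interface DonationPathway {
    kind: PathwayKind;
    processor: 'FundraiseUp';          // payment widget observed on the site
    crm: 'Salesforce';                 // backend named in the post
    formFields: string[];              // each pathway asked for different info
    routing: 'national' | 'chapter';   // where the gift gets allocated
    followUpWorkflow?: string;         // e.g. tribute notification email
  }

  const tribute: DonationPathway = {
    kind: 'tribute',
    processor: 'FundraiseUp',
    crm: 'Salesforce',
    formFields: ['honoree name', 'notification email', 'amount'],
    routing: 'chapter',
    followUpWorkflow: 'send tribute card to the notification email',
  };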

The Page 2 Discovery: When Pagination Gets Weird

After the first extraction push, something bugged me. 100 blog posts on page 1. Modern blogs usually paginate at 10-12 posts per page. If they’re showing 100 on page 1, what’s on page 2?

Clicked the pagination link. It showed 0 blog posts at first. Weird.

Tried again.

Plot twist: Page 2 wasn’t more blog posts at all. It was a complete social media toolkit - 30 holidays throughout the year, each with custom graphics and pre-written captions.

Found at /blog/page/2/:
  - Veterans Day content (November 11)
  - MLK Day content (January 15)
  - National Sleep Day (March 10 - perfect!)
  - Christmas, Thanksgiving, Easter...
  - 30 holidays total
  - Professional graphics for each
  - Pre-written captions
  - On-brand messaging

This wasn’t on the extraction list - nobody expected content management tools hiding behind blog pagination.

Think about the engineering here: 359 chapters, coordinated social media content, zero distribution overhead. They repurposed blog pagination as a content distribution mechanism.

That’s solving a coordination problem with available infrastructure.
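In plain Playwright terms, the check itself is short. The article selector is a guess at the blog markup; the point is to wait for JavaScript-rendered content before deciding a page is empty:

  // Sketch of the "what's actually on page 2?" check.
  // 'article' is a placeholder selector, not the site's confirmed markup.
  import { chromium } from 'playwright';

  async function inspectPage2() {
    const browser = await chromium.launch();
    const page = await browser.newPage();
    await page.goto('https://shpbeds.org/blog/page/2/', { waitUntil: 'networkidle' });

    const posts = await page.locator('article').count();
    if (posts === 0) {
      // The page "looks" empty: dump what's really there before moving on.
      console.log(await page.locator('main').innerText());
      await page.screenshot({ path: 'blog-page-2.png', fullPage: true });
    }
    await browser.close();
  }

  inspectPage2();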

★ Insight ─────────────────────────────────────

This discovery only happened because we questioned assumptions. “Page 2 should have more blog posts” led to “Wait, why doesn’t it?” which led to “Let me check what’s actually there.” Data extraction isn’t just following a script - it’s investigating a system. Budget time for surprises.

─────────────────────────────────────────────────

Authentication Insights: OAuth Architecture

The staff login page revealed Google OAuth 2.0 integration. This wasn’t just a technical detail - it revealed operational workflow:

  • No password management overhead
  • Leverages existing Google Workspace accounts
  • Enterprise-grade security out of the box
  • Single sign-on across tools

Migration implication: You can’t just dump users in a new system and expect them to create passwords. The authentication flow is part of their operational muscle memory.

The GuideStar Deep Dive: Scale Context

When we pulled their external verification profile:

  • Platinum Seal from GuideStar/Candid (highest level)
  • 7 consecutive years of top-tier transparency
  • $18.2 million in revenue
  • Complete financial disclosure

This isn’t a small charity. This is a major operation with serious accountability requirements. The data extraction needed to maintain that level of professionalism and completeness.

Technical Decisions That Mattered

Why Playwright MCP Over Basic Scrapers

Modern websites are JavaScript-heavy, dynamically rendered, with complex interactions. Basic scrapers would’ve:

  • Missed half the dynamic content
  • Failed on form submissions
  • Broken on authenticated pages
  • Missed the social media kit entirely

Playwright MCP gave us:

  • Full browser context
  • JavaScript execution
  • DOM inspection
  • Network traffic analysis (sketched below)
  • Screenshot validation
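Network traffic analysis is the least obvious of those, so here's a minimal sketch of passive listening in Playwright. The content-type filter and target URL are placeholders:

  // Minimal sketch of passive network analysis while browsing a page.
  // Embedded integrations (donation widgets, commerce APIs) tend to reveal
  // themselves in the JSON traffic the page generates.
  import { chromium } from 'playwright';

  async function watchApiTraffic(url: string) {
    const browser = await chromium.launch();
    const page = await browser.newPage();

    // Log every JSON response the page triggers while it loads.
    page.on('response', res => {
      const type = res.headers()['content-type'] ?? '';
      if (type.includes('application/json')) {
        console.log(res.status(), res.url());
      }
    });

    await page.goto(url, { waitUntil: 'networkidle' });
    await browser.close();
  }

  watchApiTraffic('https://shpbeds.org/');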

Why Parallel Agents Cut 75% Off Extraction Time

Four independent sections running at once instead of one after another is roughly a 4x reduction in wall-clock time; that's where the 75% comes from. But more importantly: parallel execution lets you spot patterns across sections simultaneously. When you see donation complexity at the same time as Shopify scale and OAuth architecture, you understand the system holistically.

Why Git From the Start

Not an afterthought. Version control becomes documentation of discovery:

Commit History Tells a Story:
1699682 - "Complete SHP website extraction" (93 files, core data)
d122cf8 - "Add Social Media Kit extraction" (the Page 2 discovery!)
8084854 - "Add comprehensive README" (tying it all together)

Each commit is a milestone. Not “WIP” or “more changes” but actual discoveries documented.

What Got Extracted: The Complete Inventory

Final Output:
  - 14 major components
  - 57+ files
  - ~1.8 MB structured data
  - 100% success rate on targets
  - 1 bonus discovery (social media kit!)

Key Extractions:
  ✓ 359 chapters (JSON + CSV + Markdown)
  ✓ 100+ team members
  ✓ 100 blog posts
  ✓ 168 Shopify products (1,960 variants!)
  ✓ 7 donation pathways
  ✓ 25+ volunteer opportunities
  ✓ 4 major sponsors
  ✓ OAuth authentication system
  ✓ Financial transparency data
  ✓ 30-holiday social media toolkit

Lessons: Data Extraction as System Understanding

1. Expect Surprises

Budget time for discovering things that aren’t documented. The social media kit was pure bonus because we questioned pagination.

2. Parallel Everything You Can

Modern machines can handle it. You’ll spot patterns faster when you see multiple sections simultaneously.

3. Document As You Go

Not after. During. Your future self (and the migration team) will thank you.

4. Validate Early and Often

Better to catch issues at 10 records than 10,000. Our dual-format approach made validation continuous.

5. Understand the Why

Don’t just extract the data - understand the system it represents. Why 7 donation options? Why 1,960 product variants? Each complexity solved a real problem.

The Philosophy: Extraction as Investigation

Every website is someone’s solution to a problem. When you’re extracting data, you’re reverse-engineering their engineering decisions.

The complexity we found:

  • 359 chapter routing
  • Sophisticated donation system
  • Hidden social media toolkit
  • Enterprise authentication
  • E-commerce at scale

This wasn’t accidental. It evolved to solve coordination problems at a scale where email and spreadsheets break down.

That’s what makes extraction projects interesting. You’re not just copying data - you’re learning how systems evolve under real-world constraints.

Technical Stack Reference

Extraction Technology:
  - Primary tool: Playwright MCP (browser automation)
  - Approach: DOM inspection + network analysis
  - Parallelization: 4 Explore subagents simultaneously
  - Output format: dual (JSON + Markdown)
  - Version control: Git from project start
  - Validation: cross-referenced multiple sources

Key Integrations Documented:

  • Shopify (e-commerce platform)
  • FundraiseUp (donation widget)
  • Salesforce (CRM backend)
  • JotForm (contact routing)
  • Google OAuth 2.0 (staff authentication)
  • GuideStar/Candid (transparency verification)

The Bottom Line

We extracted 14 major components, created 57+ files, and produced about 1.8 MB of structured data. 100% success rate on extraction targets, plus one bonus discovery.

But more than that, we understood the system. We know why it’s built the way it is. The migration team isn’t just getting data - they’re getting insight into how the digital infrastructure actually works.

And that social media kit at /blog/page/2/? Perfect example of why you question assumptions during extraction.

The real lesson: Extraction projects are never just about the data. They’re about understanding systems, discovering hidden patterns, and sometimes finding functionality at URLs that make no logical sense.


Project Stats:

  • Duration: ~3 hours across 3 sessions
  • Components Extracted: 14
  • Files Created: 57+
  • Total Data Size: ~1.8 MB
  • Success Rate: 100%
  • Bonus Discoveries: 1 (Social Media Kit)
  • Coffee Consumed: Probably several cups

Final Thought: The best part of this project wasn’t the extraction itself - it was understanding how systems evolve when they’re solving coordination problems at scale. Every complexity had a reason. Every integration served a purpose.

That’s worth documenting.

Outcome

14 components, 57+ files, 1.8 MB structured data, 100% success rate, 1 bonus discovery

Tags: data extraction, playwright, browser automation, nonprofit tech, parallel processing, system investigation, web scraping, discovery