The Mission That Became a Mystery
"Can you extract all the data from shpbeds.org for a migration?"
Simple request, right? Pull chapters, blog posts, team members - standard extraction stuff. Three hours and 57 files later, we'd uncovered a sophisticated digital infrastructure with 359 chapters, 1,960 product variants, and a hidden social media toolkit that nobody knew existed.
This is the story of how data extraction became detective work, and why questioning assumptions leads to the best discoveries.
Starting Point: The Obvious Stuff
Sleep in Heavenly Peace - 359 chapters nationwide, and a full website migration on the horizon.
First instinct? Fire up Playwright MCP and start grabbing the obvious targets:
- 359 chapters with location data
- 100+ team member profiles
- Blog posts and news
- Donation infrastructure
But here's where the approach diverged from typical extraction: dual-format output from the start.
Output Strategy:
- JSON: Machine-readable, migration-ready
- Markdown: Human-reviewable, sanity-checkable
Why Both?
"Trying to review 359 chapters in JSON is a recipe
for missing something important. Markdown docs
become the human validation layer."
Insight ─────────────────────────────────────
Dual-format extraction isn't redundant - it's risk management. The JSON feeds your migration scripts, but the Markdown lets humans spot patterns, inconsistencies, and relationships that automated tools might miss. Think of it as "trust but verify" for data extraction.
─────────────────────────────────────────────
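To make the dual-format idea concrete, here's a minimal sketch of the pattern in TypeScript - the `Chapter` shape and file names are illustrative, not the actual extraction schema:

```typescript
import { writeFileSync } from "fs";

// Illustrative record shape - the real extraction captured more fields per chapter.
interface Chapter {
  name: string;
  city: string;
  state: string;
  url: string;
}

function writeBothFormats(chapters: Chapter[]): void {
  // JSON: machine-readable, feeds the migration scripts directly.
  writeFileSync("chapters.json", JSON.stringify(chapters, null, 2));

  // Markdown: human-reviewable, the validation layer for spotting oddities.
  const markdown = [
    "# Chapters",
    ...chapters.map((c) => `- ${c.name} - ${c.city}, ${c.state} (${c.url})`),
  ].join("\n");
  writeFileSync("chapters.md", markdown);
}
```

Same records, two views: one for scripts, one for eyeballs.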
The Parallel Processing Breakthrough
After extracting chapters and team members sequentially, a realization hit: these sections don't depend on each other. Why wait?
Spun up 4 parallel Explore agents, each tackling independent sections:
- Donation infrastructure
- Volunteer portal
- Staff authentication
- Sponsor data
This parallel approach did something unexpected - it revealed patterns we wouldn't have caught with sequential extraction.
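Outside the MCP wrapper, the same fan-out looks roughly like this in plain Playwright - the section URLs are guesses for illustration, and the per-section extractor here is a stand-in:

```typescript
import { chromium, Page } from "playwright";

// The four independent sections; paths are illustrative, not the site's real URLs.
const sections = {
  donations: "https://shpbeds.org/donate/",
  volunteers: "https://shpbeds.org/volunteer/",
  staffAuth: "https://shpbeds.org/login/",
  sponsors: "https://shpbeds.org/sponsors/",
};

// Stand-in extractor: grab headings so each section returns something reviewable.
async function extractSection(page: Page, url: string): Promise<string[]> {
  await page.goto(url, { waitUntil: "networkidle" });
  return page.locator("h1, h2").allTextContents();
}

async function extractAllSections() {
  const browser = await chromium.launch();

  // Promise.all is the "4 parallel agents" idea in miniature:
  // independent sections, no shared state, run concurrently.
  const results = await Promise.all(
    Object.entries(sections).map(async ([name, url]) => {
      const page = await browser.newPage();
      const data = await extractSection(page, url);
      await page.close();
      return { name, data };
    })
  );

  await browser.close();
  return results;
}
```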
Extraction Timeline (chart omitted)
Plot Twist #1: The Shopify Surprise
Expected: Simple merch store. Few t-shirts, maybe stickers.
Reality: 168 products. 1,960 variants.
Think about that for a second. This nonprofit has nearly 2,000 different product configurations. That's not a merch store - that's a full-blown e-commerce operation supporting their mission. Each chapter probably has custom items, different sizes, colors, branded merchandise.
The extraction for this alone required handling:
- Pagination through product catalogs
- Variant structures (size/color combinations)
- Pricing tiers
- Inventory tracking
- Chapter-specific product routing
This wasn't "grab product names and prices" - this was understanding an entire business system embedded in their site.
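A sketch of what that pagination-plus-variants handling can look like, assuming the store exposes the common public /products.json storefront endpoint (if it doesn't, the same walk can be done through collection pages in the browser):

```typescript
interface ShopifyVariant {
  id: number;
  title: string;     // e.g. "Large / Navy"
  price: string;
  available: boolean;
}

interface ShopifyProduct {
  id: number;
  title: string;
  variants: ShopifyVariant[];
}

// Page through the catalog until an empty page comes back.
async function fetchAllProducts(storeUrl: string): Promise<ShopifyProduct[]> {
  const all: ShopifyProduct[] = [];
  for (let page = 1; ; page++) {
    const res = await fetch(`${storeUrl}/products.json?limit=250&page=${page}`);
    const { products } = (await res.json()) as { products: ShopifyProduct[] };
    if (products.length === 0) break;
    all.push(...products);
  }
  return all;
}

// Flatten to one row per variant - this is how 168 products turn into 1,960 rows.
function flattenVariants(products: ShopifyProduct[]) {
  return products.flatMap((p) =>
    p.variants.map((v) => ({
      product: p.title,
      variant: v.title,
      price: v.price,
      available: v.available,
    }))
  );
}
```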
Plot Twist #2: The Donation Maze (And Why Complexity is Good)
Their donation system revealed serious sophistication:
Donation Infrastructure
The seven giving options weren't arbitrary - they solved real donor needs:
- One-time donations
- Monthly recurring gifts
- Tribute/memorial gifts
- Corporate matching programs
- Donor-advised funds
- Stock/crypto donations
- Legacy/estate planning
Why this matters for extraction: Understanding the "why" behind complexity helps you avoid missing critical connections. Each donation pathway had different form fields, routing logic, and follow-up workflows.
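One way to keep those differences from getting flattened away is to model each pathway explicitly in the extraction output. A sketch - the field names and example values are invented for illustration, though FundraiseUp is the donation widget the site actually uses:

```typescript
// Illustrative model: each giving option keeps its own form fields and routing,
// since the seven pathways behave differently on the live site.
interface DonationPathway {
  name: string;          // e.g. "Monthly recurring gifts"
  formFields: string[];  // fields observed on that pathway's form
  processor: string;     // e.g. the FundraiseUp widget vs. an external provider
  followUp: string;      // what happens after submission (receipt, CRM record, etc.)
}

const pathways: DonationPathway[] = [
  {
    name: "One-time donations",
    formFields: ["amount", "email", "dedication"], // invented example fields
    processor: "FundraiseUp",
    followUp: "email receipt",
  },
  // ...one entry per pathway, seven in total for this site
];
```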
The Page 2 Discovery: When Pagination Gets Weird
After the first extraction push, something bugged me. 100 blog posts on page 1. Modern blogs usually paginate at 10-12 posts per page. If they're showing 100 on page 1, what's on page 2?
Clicked the pagination link. It showed 0 blog posts at first. Weird.
Tried again.
Plot twist: Page 2 wasn't more blog posts at all. It was a complete social media toolkit - 30 holidays throughout the year, each with custom graphics and pre-written captions.
Found at /blog/page/2/:
- Veterans Day content (November 11)
- MLK Day content (January 15)
- National Sleep Day (March 10 - perfect!)
- Christmas, Thanksgiving, Easter...
- 30 holidays total
- Professional graphics for each
- Pre-written captions
- On-brand messaging
This wasn't on the extraction list - nobody expected content management tools hiding behind blog pagination.
Think about the engineering here: 359 chapters, coordinated social media content, zero distribution overhead. They repurposed blog pagination as a content distribution mechanism.
That's solving a coordination problem with available infrastructure.
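The check that surfaced it was trivial - the payoff came from actually running it instead of assuming. A minimal version, with the real URL from the discovery but guessed selectors:

```typescript
import { chromium } from "playwright";

// Sanity-check a suspicious pagination URL instead of assuming what lives there.
async function inspectBlogPage2(): Promise<void> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto("https://shpbeds.org/blog/page/2/", { waitUntil: "networkidle" });

  // Selectors are guesses for illustration; the point is to count what actually renders.
  const postCount = await page.locator("article").count();
  const headings = await page.locator("h2, h3").allTextContents();

  console.log(`posts found: ${postCount}`);
  console.log(headings.slice(0, 10)); // eyeball what the page really contains

  await browser.close();
}
```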
Insight ─────────────────────────────────────
This discovery only happened because we questioned assumptions. "Page 2 should have more blog posts" led to "Wait, why doesn't it?" which led to "Let me check what's actually there." Data extraction isn't just following a script - it's investigating a system. Budget time for surprises.
─────────────────────────────────────────────
Authentication Insights: OAuth Architecture
The staff login page revealed Google OAuth 2.0 integration. That isn't just a technical detail - it points to how the organization actually operates:
- No password management overhead
- Leverages existing Google Workspace accounts
- Enterprise-grade security out of the box
- Single sign-on across tools
Migration implication: You can't just dump users into a new system and expect them to create passwords. The authentication flow is part of their operational muscle memory.
The GuideStar Deep Dive: Scale Context
When we pulled their external verification profile:
- Platinum Seal from GuideStar/Candid (highest level)
- 7 consecutive years of top-tier transparency
- $18.2 million in revenue
- Complete financial disclosure
This isn't a small charity. This is a major operation with serious accountability requirements. The data extraction needed to maintain that level of professionalism and completeness.
Technical Decisions That Mattered
Why Playwright MCP Over Basic Scrapers
Modern websites are JavaScript-heavy, dynamically rendered, with complex interactions. Basic scrapers would've:
- Missed half the dynamic content
- Failed on form submissions
- Broken on authenticated pages
- Missed the social media kit entirely
Playwright MCP gave us:
- Full browser context
- JavaScript execution
- DOM inspection
- Network traffic analysis
- Screenshot validation
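For contrast with a raw-HTML scraper, here's the shape of the browser-context approach - the selector and output path are illustrative:

```typescript
import { chromium } from "playwright";

async function extractWithBrowserContext(url: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // Network traffic analysis: watch the API calls the page makes while rendering.
  page.on("response", (res) => {
    if (res.url().includes("/api/")) console.log("API call:", res.url());
  });

  // JavaScript executes, so dynamic content actually appears before we read anything.
  await page.goto(url, { waitUntil: "networkidle" });

  // DOM inspection against the rendered page, not the raw HTML.
  const items = await page.locator("[data-chapter]").allTextContents(); // illustrative selector

  // Screenshot as a validation artifact a human can review later.
  await page.screenshot({ path: "validation/page.png", fullPage: true });

  await browser.close();
  return items;
}
```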
Why Parallel Agents Cut 75% Off Extraction Time
The arithmetic is simple: four independent sections running concurrently instead of back-to-back collapses four sequential passes into roughly one, which is where the time savings came from. But more importantly, parallel execution lets you spot patterns across sections simultaneously. When you see donation complexity at the same time as Shopify scale and OAuth architecture, you understand the system holistically.
Why Git From the Start
Not an afterthought. Version control becomes documentation of discovery:
Commit History Tells a Story:
1699682 - "Complete SHP website extraction" (93 files, core data)
d122cf8 - "Add Social Media Kit extraction" (the Page 2 discovery!)
8084854 - "Add comprehensive README" (tying it all together)
Each commit is a milestone. Not "WIP" or "more changes" but actual discoveries documented.
What Got Extracted: The Complete Inventory
Final Output:
- 14 major components
- 57+ files
- ~1.8 MB structured data
- 100% success rate on targets
- 1 bonus discovery (social media kit!)
Key Extractions:
✓ 359 chapters (JSON + CSV + Markdown)
✓ 100+ team members
✓ 100 blog posts
✓ 168 Shopify products (1,960 variants!)
✓ 7 donation pathways
✓ 25+ volunteer opportunities
✓ 4 major sponsors
✓ OAuth authentication system
✓ Financial transparency data
✓ 30-holiday social media toolkit
Lessons: Data Extraction as System Understanding
1. Expect Surprises
Budget time for discovering things that aren't documented. The social media kit was pure bonus because we questioned pagination.
2. Parallel Everything You Can
Modern machines can handle it. You'll spot patterns faster when you see multiple sections simultaneously.
3. Document As You Go
Not after. During. Your future self (and the migration team) will thank you.
4. Validate Early and Often
Better to catch issues at 10 records than 10,000. Our dual-format approach made validation continuous (a minimal sanity-check sketch follows this list of lessons).
5. Understand the Why
Don't just extract the data - understand the system it represents. Why 7 donation options? Why 1,960 product variants? Each complexity solved a real problem.
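As a concrete version of lesson 4, a minimal early sanity check - the field names are illustrative, not the real schema:

```typescript
// Validate a small sample before committing to the full extraction run.
function validateSample(
  records: Array<Record<string, unknown>>,
  required: string[]
): string[] {
  const problems: string[] = [];
  records.slice(0, 10).forEach((record, i) => {
    for (const field of required) {
      if (record[field] === undefined || record[field] === "") {
        problems.push(`record ${i}: missing "${field}"`);
      }
    }
  });
  return problems;
}

// Example: check the first few chapters before pulling all 359.
// const issues = validateSample(chapters, ["name", "city", "state", "url"]);
// if (issues.length > 0) console.warn(issues);
```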
The Philosophy: Extraction as Investigation
Every website is someone's solution to a problem. When you're extracting data, you're reverse-engineering their engineering decisions.
The complexity we found:
- Routing across 359 chapters
- Sophisticated donation system
- Hidden social media toolkit
- Enterprise authentication
- E-commerce at scale
This wasn't accidental. It evolved to solve coordination problems at a scale where email and spreadsheets break down.
That's what makes extraction projects interesting. You're not just copying data - you're learning how systems evolve under real-world constraints.
Technical Stack Reference
Extraction Technology
Key Integrations Documented:
- Shopify (e-commerce platform)
- FundraiseUp (donation widget)
- Salesforce (CRM backend)
- JotForm (contact routing)
- Google OAuth 2.0 (staff authentication)
- GuideStar/Candid (transparency verification)
The Bottom Line
We extracted 14 major components and created 57+ files, roughly 1.8 MB of structured data, with a 100% success rate on extraction targets plus one bonus discovery.
But more than that, we understood the system. We know why it's built the way it is. The migration team isn't just getting data - they're getting insight into how the digital infrastructure actually works.
And that social media kit at /blog/page/2/? Perfect example of why you question assumptions during extraction.
The real lesson: Extraction projects are never just about the data. They're about understanding systems, discovering hidden patterns, and sometimes finding functionality at URLs that make no logical sense.
Project Stats:
- Duration: ~3 hours across 3 sessions
- Components Extracted: 14
- Files Created: 57+
- Total Data Size: ~1.8 MB
- Success Rate: 100%
- Bonus Discoveries: 1 (Social Media Kit)
- Coffee Consumed: Probably several cups
Final Thought: The best part of this project wasn't the extraction itself - it was understanding how systems evolve when they're solving coordination problems at scale. Every complexity had a reason. Every integration served a purpose.
Thatβs worth documenting.