StructuredDataExtractor Implementation Plan

Problem Statement

Currently, the content processing pipeline strips all <script> tags before converting HTML to markdown. This discards structured data that could improve LLM extraction accuracy:

JSON-LD product schemas (<script type="application/ld+json">)
Platform-specific product data (Shopify's _RSConfig.product, etc.)

Meta tags survive cleaning but are not explicitly extracted.

Proposed Solution

Create a StructuredDataExtractor service that extracts structured data from HTML before it's cleaned, and includes it as YAML frontmatter in the markdown output.

Architecture

src/Service/Crawler/Extraction/StructuredData/
├── StructuredDataExtractor.php           # Main orchestrator
├── ExtractedStructuredData.php           # Value object for results
└── Extractors/
    ├── StructuredDataExtractorInterface.php
    ├── JsonLdExtractor.php               # JSON-LD schemas (priority 100)
    ├── MetaTagExtractor.php              # OG/meta tags (priority 50)
    └── PlatformSpecific/
        └── ShopifyDataExtractor.php      # Shopify _RSConfig (priority 75)

Tagged Services

Extractors use Symfony's tagged services pattern with priority ordering:

services:
    App\Service\Crawler\Extraction\StructuredData\StructuredDataExtractor:
        arguments:
            $extractors: !tagged_iterator { tag: 'app.structured_data_extractor', default_priority_method: 'getPriority' }
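
With this configuration Symfony reads each extractor's priority from the static method named in default_priority_method and passes the orchestrator an iterable ordered from highest to lowest priority. The sketch below shows what that contract could look like. It assumes every extractor carries the app.structured_data_extractor tag (for example via _instanceof or autoconfiguration). The method signatures are assumptions, since the plan names the interface but not its methods, and the extractor classes are stubs. The usort() call only emulates the ordering the container would provide.

```php
<?php
declare(strict_types=1);

// Hypothetical contract: the plan names this interface but does not define its methods.
interface StructuredDataExtractorInterface
{
    /** @return array<string, mixed> extracted fields, empty if nothing was found */
    public function extract(string $html): array;

    // Static, so the container can read the priority without instantiating the service.
    public static function getPriority(): int;
}

// Stub extractors that only carry the priorities listed in the architecture tree.
final class JsonLdExtractor implements StructuredDataExtractorInterface
{
    public function extract(string $html): array { return []; }
    public static function getPriority(): int { return 100; }
}

final class ShopifyDataExtractor implements StructuredDataExtractorInterface
{
    public function extract(string $html): array { return []; }
    public static function getPriority(): int { return 75; }
}

final class MetaTagExtractor implements StructuredDataExtractorInterface
{
    public function extract(string $html): array { return []; }
    public static function getPriority(): int { return 50; }
}

// Emulate the ordering of !tagged_iterator: highest priority first.
$extractors = [new MetaTagExtractor(), new JsonLdExtractor(), new ShopifyDataExtractor()];
usort(
    $extractors,
    fn (StructuredDataExtractorInterface $a, StructuredDataExtractorInterface $b): int
        => $b::getPriority() <=> $a::getPriority()
);

// Class names in the order the orchestrator would receive them.
$order = array_map(fn (object $extractor): string => get_class($extractor), $extractors);
```

Keeping the priority on the class rather than in services.yaml means a new platform extractor only needs to implement the interface to be picked up in the right position.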

Integration Point

File: src/Service/Crawler/Step/Processors/ContentProcessingStepProcessor.php

Location: In convertHtmlToMarkdown(), call extractor BEFORE $this->htmlCleaner->cleanHtml($html) (which strips script tags).

private function convertHtmlToMarkdown(string $html, string $url): string
{
    // Extract structured data BEFORE cleaning
    $structuredData = $this->structuredDataExtractor->extract($html);

    // Clean HTML (removes script tags)
    $html = $this->htmlCleaner->cleanHtml($html);

    // Convert to markdown
    $markdown = $this->markdownConverter->convert($html);

    // Build frontmatter with structured data
    $metadata = [
        'url'            => $url,
        'processed_at'   => (new DateTimeImmutable())->format('Y-m-d H:i:s'),
        'content_length' => strlen($html), // byte length of the cleaned HTML, not the original
    ];

    // Merge extracted data
    if (!$structuredData->isEmpty()) {
        $metadata = array_merge($metadata, $structuredData->toArray());
    }

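    // Yaml::dump($input, $inline, $indent): nest up to depth 4 before switching to inline YAML, indent by 2 spaces
    $frontmatter = Yaml::dump($metadata, 4, 2);
    return "---\n{$frontmatter}---\n\n{$markdown}";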
}
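
The method above relies on two calls on the result object, isEmpty() and toArray(). The sketch below shows one possible shape for ExtractedStructuredData. It assumes three sections named after the frontmatter keys meta, structured_data and platform_data. The constructor signature is an assumption and not part of the plan.

```php
<?php
declare(strict_types=1);

final class ExtractedStructuredData
{
    /** @var array<string, mixed> */
    private array $meta;

    /** @var array<string, mixed> */
    private array $structuredData;

    /** @var array<string, mixed> */
    private array $platformData;

    // Assumed constructor: one array per frontmatter section.
    public function __construct(array $meta = [], array $structuredData = [], array $platformData = [])
    {
        $this->meta = $meta;
        $this->structuredData = $structuredData;
        $this->platformData = $platformData;
    }

    public function isEmpty(): bool
    {
        return $this->toArray() === [];
    }

    /** Returns only the non-empty sections, so empty ones never reach the frontmatter. */
    public function toArray(): array
    {
        // array_filter() without a callback drops empty arrays and keeps the keys.
        return array_filter([
            'meta'            => $this->meta,
            'structured_data' => $this->structuredData,
            'platform_data'   => $this->platformData,
        ]);
    }
}

// Usage: merge into the base metadata the same way convertHtmlToMarkdown() does.
$data = new ExtractedStructuredData(['title' => 'Ethiopia Yirgacheffe']);
$metadata = array_merge(
    ['url' => 'https://example-roaster.com/products/ethiopia'],
    $data->toArray()
);
```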

Expected Output Format

---
url: https://example-roaster.com/products/ethiopia
processed_at: "2025-11-19 10:30:00"
content_length: 45230
meta:
  title: "Ethiopia Yirgacheffe - Single Origin Coffee"
  description: "Bright and fruity Ethiopian coffee"
  image: "https://example-roaster.com/cdn/ethiopia.jpg"
  canonical: "https://example-roaster.com/products/ethiopia"
structured_data:
  product_name: "Ethiopia Yirgacheffe"
  product_description: "Bright and fruity Ethiopian coffee..."
  brand: "Example Roaster"
  category: "Coffee > Single Origin"
  price: "18.50"
  currency: "EUR"
  availability: "InStock"
platform_data:
  shopify:
    product_tags: ["Single Origin", "Light Roast", "Africa"]
    product_type: "Coffee"
    vendor: "Example Roaster"
---

[markdown content]

Data to Extract

Meta Tags (MetaTagExtractor)

| Field | Sources (priority order) |
| --- | --- |
| title | `og:title`, `<title>` |
| description | `og:description`, `meta[name="description"]` |
| image | `og:image` |
| canonical | `link[rel="canonical"]` |
| product_price | `product:price:amount` |
| product_currency | `product:price:currency` |
| product_availability | `product:availability` |
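
The priority order in the table can be expressed as a list of XPath expressions per field, where the first non-empty match wins. The sketch below uses PHP's DOM extension. The function name extractMetaTags() is hypothetical, and it assumes Open Graph and product tags use the property attribute.

```php
<?php
declare(strict_types=1);

/**
 * Returns the first non-empty value per field, trying sources in priority order.
 *
 * @return array<string, string>
 */
function extractMetaTags(string $html): array
{
    if (trim($html) === '') {
        return []; // DOMDocument::loadHTML() rejects an empty string
    }

    // Field => XPath expressions in priority order, mirroring the table.
    $sources = [
        'title'                => ['//meta[@property="og:title"]/@content', '//title'],
        'description'          => ['//meta[@property="og:description"]/@content', '//meta[@name="description"]/@content'],
        'image'                => ['//meta[@property="og:image"]/@content'],
        'canonical'            => ['//link[@rel="canonical"]/@href'],
        'product_price'        => ['//meta[@property="product:price:amount"]/@content'],
        'product_currency'     => ['//meta[@property="product:price:currency"]/@content'],
        'product_availability' => ['//meta[@property="product:availability"]/@content'],
    ];

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate invalid real-world markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $xpath = new DOMXPath($dom);

    $result = [];
    foreach ($sources as $field => $expressions) {
        foreach ($expressions as $expression) {
            // string() yields the text of the first matching node, or '' if none match.
            $value = trim((string) $xpath->evaluate("string({$expression})"));
            if ($value !== '') {
                $result[$field] = $value;
                break;
            }
        }
    }

    return $result;
}

// Usage: no og:title, so title falls back to <title>; og:description beats the plain meta tag.
$html = '<html><head><title>Fallback Title</title>'
    . '<meta property="og:description" content="OG description">'
    . '<meta name="description" content="Plain description">'
    . '<link rel="canonical" href="https://example-roaster.com/products/ethiopia">'
    . '</head><body></body></html>';
$meta = extractMetaTags($html);
```

Fields without any match are left out of the result, which keeps them out of the frontmatter.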

JSON-LD (JsonLdExtractor)

Extract from <script type="application/ld+json">:

Product schema: name, description, image, brand, category
Offer schema: price, priceCurrency, availability
Handle @graph format (multiple schemas in one block)
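
A JSON-LD block can hold a single node, a list of nodes, or an @graph wrapper, and @type can be a string or a list of strings. The sketch below shows one way to normalise those cases before mapping a Product node to the flat keys used in the expected output. The helper names findJsonLdNodes() and mapProduct() are hypothetical, and the mapping reads only the first offer.

```php
<?php
declare(strict_types=1);

/**
 * Collects every JSON-LD node of the given @type, including nodes inside @graph.
 *
 * @return list<array<string, mixed>>
 */
function findJsonLdNodes(string $html, string $type): array
{
    if (trim($html) === '') {
        return [];
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate invalid markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = [];
    $scripts = (new DOMXPath($dom))->query('//script[@type="application/ld+json"]');
    foreach ($scripts as $script) {
        try {
            $decoded = json_decode($script->textContent, true, 512, JSON_THROW_ON_ERROR);
        } catch (JsonException $e) {
            continue; // malformed block: skip it and keep the others
        }
        if (!is_array($decoded)) {
            continue;
        }

        // Normalise the three block shapes to a list of nodes.
        if (isset($decoded['@graph']) && is_array($decoded['@graph'])) {
            $nodes = $decoded['@graph'];
        } elseif (isset($decoded[0])) {
            $nodes = $decoded;
        } else {
            $nodes = [$decoded];
        }

        foreach ($nodes as $node) {
            if (is_array($node) && in_array($type, (array) ($node['@type'] ?? []), true)) {
                $found[] = $node;
            }
        }
    }

    return $found;
}

/** Maps a Product node to flat frontmatter keys; absent fields are omitted. */
function mapProduct(array $product): array
{
    $brand = $product['brand'] ?? null;
    $offers = is_array($product['offers'] ?? null) ? $product['offers'] : [];
    $offer = isset($offers[0]) ? $offers[0] : $offers; // single Offer or list of Offers
    if (!is_array($offer)) {
        $offer = [];
    }

    return array_filter([
        'product_name'        => $product['name'] ?? null,
        'product_description' => $product['description'] ?? null,
        'brand'               => is_array($brand) ? ($brand['name'] ?? null) : $brand,
        'category'            => $product['category'] ?? null,
        'price'               => isset($offer['price']) ? (string) $offer['price'] : null,
        'currency'            => $offer['priceCurrency'] ?? null,
        // Often a full URL such as https://schema.org/InStock; left as-is here.
        'availability'        => $offer['availability'] ?? null,
    ], fn ($value) => $value !== null);
}

// Usage: one valid @graph block and one malformed block that is skipped.
$html = <<<'HTML'
<html><head>
<script type="application/ld+json">
{"@context":"https://schema.org","@graph":[
  {"@type":"WebSite","name":"Example Roaster"},
  {"@type":"Product","name":"Ethiopia Yirgacheffe","brand":{"@type":"Brand","name":"Example Roaster"},
   "offers":{"@type":"Offer","price":"18.50","priceCurrency":"EUR","availability":"https://schema.org/InStock"}}
]}
</script>
<script type="application/ld+json">{ not valid json </script>
</head><body></body></html>
HTML;

$products = findJsonLdNodes($html, 'Product');
$mapped = mapProduct($products[0]);
```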

Shopify (ShopifyDataExtractor)

Extract from inline scripts:

_RSConfig.product: tags, type, vendor, variants
ShopifyAnalytics.meta.product: similar data
Inline var product = {...} patterns
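
Inline assignments such as var product = {...} cannot be read with a DOM query alone, and a greedy regular expression tends to break on nested objects. One possible approach is to locate the assignment and then scan for the matching closing brace while skipping string literals. This is a sketch only: extractAssignedJson() is a hypothetical helper, and it assumes the assigned literal is valid JSON, which real themes do not guarantee.

```php
<?php
declare(strict_types=1);

/**
 * Returns the decoded object assigned to $variable in a script body, or null.
 * Hypothetical helper; assumes the literal after "=" is valid JSON.
 */
function extractAssignedJson(string $script, string $variable): ?array
{
    // Match "<variable> = {" without matching a longer identifier that ends in the same name.
    $pattern = '/(?<![\w$.])' . preg_quote($variable, '/') . '\s*=\s*\{/';
    if (preg_match($pattern, $script, $match, PREG_OFFSET_CAPTURE) !== 1) {
        return null;
    }

    // The match ends with the opening brace, so its offset is the last matched byte.
    $start = $match[0][1] + strlen($match[0][0]) - 1;
    $depth = 0;
    $inString = false;
    $escaped = false;
    $length = strlen($script);

    for ($i = $start; $i < $length; $i++) {
        $char = $script[$i];

        if ($inString) {
            // Braces inside string literals do not count; track escapes to find the closing quote.
            if ($escaped) {
                $escaped = false;
            } elseif ($char === '\\') {
                $escaped = true;
            } elseif ($char === '"') {
                $inString = false;
            }
            continue;
        }

        if ($char === '"') {
            $inString = true;
        } elseif ($char === '{') {
            $depth++;
        } elseif ($char === '}') {
            $depth--;
            if ($depth === 0) {
                $decoded = json_decode(substr($script, $start, $i - $start + 1), true);

                return is_array($decoded) ? $decoded : null;
            }
        }
    }

    return null; // unbalanced braces
}

// Usage: braces and escaped quotes inside strings do not confuse the scanner.
$script = 'window.foo = 1; var product = {"title":"Ethiopia {Yirgacheffe}",'
    . '"tags":["Single Origin","Light Roast"],"vendor":"Example \"Roaster\"",'
    . '"variants":[{"id":1}]}; var other = {};';
$product = extractAssignedJson($script, 'product');
```

Literals that use single quotes, unquoted keys, or trailing commas fail json_decode() and yield null, so the extractor would simply contribute nothing for that page.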

Implementation Phases

Phase 1: Core Infrastructure

ExtractedStructuredData value object
StructuredDataExtractorInterface
StructuredDataExtractor orchestrator
Service configuration

Phase 2: Basic Extractors

MetaTagExtractor (OG tags, meta description)
JsonLdExtractor (Product/Offer schemas)

Phase 3: Integration

Update ContentProcessingStepProcessor
Unit tests for extractors
Integration tests

Phase 4: Platform Extractors

ShopifyDataExtractor
Additional platforms as needed (Shopware, WooCommerce)

Error Handling

Each extractor wraps JSON parsing in try-catch (json_decode() only throws when called with JSON_THROW_ON_ERROR)
Invalid data logged as warning, not error
Processing continues with remaining extractors
Empty results gracefully omitted from frontmatter
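
The plan places the try-catch inside each extractor. The sketch below adds the same guard one level up, in the orchestrator loop, as one way to ensure that a failing extractor never stops the others. For brevity it assumes extractors are plain callables keyed by the frontmatter section they fill, and that warnings go to a callable. The real service would presumably use the extractor interface and a PSR-3 logger.

```php
<?php
declare(strict_types=1);

/**
 * Runs every extractor and logs failures as warnings instead of aborting.
 *
 * @param iterable<string, callable> $extractors section name => extractor (assumed shape)
 * @param callable                   $logWarning stand-in for a PSR-3 warning() call
 * @return array<string, array> non-empty sections only
 */
function runExtractors(iterable $extractors, string $html, callable $logWarning): array
{
    $sections = [];
    foreach ($extractors as $section => $extractor) {
        try {
            $data = $extractor($html);
        } catch (Throwable $e) {
            // Invalid data is a warning, not an error; continue with the remaining extractors.
            $logWarning(sprintf('Extractor "%s" failed: %s', $section, $e->getMessage()));
            continue;
        }

        if ($data !== []) {
            $sections[$section] = $data; // empty results are omitted
        }
    }

    return $sections;
}

// Usage: the first extractor throws, the second succeeds, the third finds nothing.
$warnings = [];
$result = runExtractors(
    [
        'structured_data' => fn (string $html): array => json_decode('{broken', true, 512, JSON_THROW_ON_ERROR),
        'meta'            => fn (string $html): array => ['title' => 'Ethiopia Yirgacheffe'],
        'platform_data'   => fn (string $html): array => [],
    ],
    '<html></html>',
    function (string $message) use (&$warnings): void {
        $warnings[] = $message;
    }
);
```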

Validation Required

Before implementing, investigate real crawled URLs to confirm:

JSON-LD presence: How many roaster sites include JSON-LD Product schemas?
Data quality: Is the structured data accurate and useful?
Platform coverage: What platforms are most common? (Shopify, Shopware, etc.)
Value add: What specific fields would improve extraction that aren't in the description?

Open Questions

Should we extract variant information (sizes, weights, prices)?
How to handle conflicting data between sources (e.g., different prices)?
Should platform-specific data be normalized to a common format?
Should the frontmatter have a maximum size limit?

Status

Current: Planning / Validation
Next: Investigate sample crawled URLs to validate assumptions