StructuredDataExtractor Implementation Plan

Problem Statement

Currently, the content processing pipeline strips all <script> tags before converting HTML to markdown. This discards structured data that could improve LLM extraction accuracy:

  • JSON-LD product schemas (<script type="application/ld+json">)
  • Platform-specific product data (Shopify's _RSConfig.product, etc.)

Meta tags survive cleaning, but their values are not explicitly extracted.

Proposed Solution

Create a StructuredDataExtractor service that extracts structured data from HTML before it's cleaned, and includes it as YAML frontmatter in the markdown output.

Architecture

src/Service/Crawler/Extraction/StructuredData/
├── StructuredDataExtractor.php           # Main orchestrator
├── ExtractedStructuredData.php           # Value object for results
└── Extractors/
    ├── StructuredDataExtractorInterface.php
    ├── JsonLdExtractor.php               # JSON-LD schemas (priority 100)
    ├── MetaTagExtractor.php              # OG/meta tags (priority 50)
    └── PlatformSpecific/
        └── ShopifyDataExtractor.php      # Shopify _RSConfig (priority 75)

Tagged Services

Extractors use Symfony's tagged services pattern with priority ordering:

services:
    App\Service\Crawler\Extraction\StructuredData\StructuredDataExtractor:
        arguments:
            $extractors: !tagged_iterator { tag: 'app.structured_data_extractor', default_priority_method: 'getPriority' }
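
A minimal sketch of what the extractor contract could look like under this configuration. Symfony resolves default_priority_method by calling the named method statically on the service class, so getPriority() would need to be public static. The extract() return type, the two stub classes, and the orderByPriority() helper are assumptions for illustration; the helper only imitates the descending-priority order that !tagged_iterator would deliver. Each extractor would also need to carry the app.structured_data_extractor tag, for example via an _instanceof rule.

```php
<?php
declare(strict_types=1);

interface StructuredDataExtractorInterface
{
    /** Higher values are iterated first. Static, so it can be read without instantiating the service. */
    public static function getPriority(): int;

    /** @return array<string, mixed> Extracted fields; empty when nothing was found. */
    public function extract(string $html): array;
}

// Hypothetical stubs that only demonstrate the priority contract.
final class StubJsonLdExtractor implements StructuredDataExtractorInterface
{
    public static function getPriority(): int { return 100; }
    public function extract(string $html): array { return ['source' => 'json_ld']; }
}

final class StubMetaTagExtractor implements StructuredDataExtractorInterface
{
    public static function getPriority(): int { return 50; }
    public function extract(string $html): array { return ['source' => 'meta']; }
}

/**
 * Stand-in for the container: sorts extractors by descending priority.
 *
 * @param list<StructuredDataExtractorInterface> $extractors
 * @return list<StructuredDataExtractorInterface>
 */
function orderByPriority(array $extractors): array
{
    usort(
        $extractors,
        static fn (StructuredDataExtractorInterface $a, StructuredDataExtractorInterface $b): int
            => $b::getPriority() <=> $a::getPriority()
    );

    return $extractors;
}
```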

Integration Point

File: src/Service/Crawler/Step/Processors/ContentProcessingStepProcessor.php

Location: In convertHtmlToMarkdown(), call the extractor before $this->htmlCleaner->cleanHtml($html), because that call strips the script tags the extractor needs.

private function convertHtmlToMarkdown(string $html, string $url): string
{
    // Extract structured data BEFORE cleaning
    $structuredData = $this->structuredDataExtractor->extract($html);

    // Clean HTML (removes script tags)
    $html = $this->htmlCleaner->cleanHtml($html);

    // Convert to markdown
    $markdown = $this->markdownConverter->convert($html);

    // Build frontmatter with structured data
    $metadata = [
        'url'            => $url,
        'processed_at'   => (new DateTimeImmutable())->format('Y-m-d H:i:s'),
        'content_length' => strlen($html),
    ];

    // Merge extracted data
    if (!$structuredData->isEmpty()) {
        $metadata = array_merge($metadata, $structuredData->toArray());
    }

    $frontmatter = Yaml::dump($metadata, 4, 2);
    return "---\n{$frontmatter}---\n\n{$markdown}";
}
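
The integration code relies on isEmpty() and toArray() on the result object. One possible shape for ExtractedStructuredData is sketched below, assuming PHP 8.1+, top-level sections keyed like the frontmatter (meta, structured_data, platform_data), and a hypothetical with() method that is not part of the plan. Fields set earlier are kept on conflict, which would let higher-priority extractors win if they run first.

```php
<?php
declare(strict_types=1);

final class ExtractedStructuredData
{
    /** @param array<string, array<string, mixed>> $sections */
    public function __construct(private readonly array $sections = [])
    {
    }

    /** Returns a copy with $fields added to $section; null, '' and [] values are dropped. */
    public function with(string $section, array $fields): self
    {
        $fields = array_filter(
            $fields,
            static fn (mixed $value): bool => $value !== null && $value !== '' && $value !== []
        );

        if ($fields === []) {
            return $this; // nothing to add, so the section stays out of the frontmatter
        }

        $sections = $this->sections;
        // Array union keeps existing keys, so the first writer wins on conflict.
        $sections[$section] = ($sections[$section] ?? []) + $fields;

        return new self($sections);
    }

    public function isEmpty(): bool
    {
        return $this->sections === [];
    }

    /** @return array<string, array<string, mixed>> */
    public function toArray(): array
    {
        return $this->sections;
    }
}
```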

Expected Output Format

---
url: https://example-roaster.com/products/ethiopia
processed_at: "2025-11-19 10:30:00"
content_length: 45230
meta:
  title: "Ethiopia Yirgacheffe - Single Origin Coffee"
  description: "Bright and fruity Ethiopian coffee"
  image: "https://example-roaster.com/cdn/ethiopia.jpg"
  canonical: "https://example-roaster.com/products/ethiopia"
structured_data:
  product_name: "Ethiopia Yirgacheffe"
  product_description: "Bright and fruity Ethiopian coffee..."
  brand: "Example Roaster"
  category: "Coffee > Single Origin"
  price: "18.50"
  currency: "EUR"
  availability: "InStock"
platform_data:
  shopify:
    product_tags: ["Single Origin", "Light Roast", "Africa"]
    product_type: "Coffee"
    vendor: "Example Roaster"
---

[markdown content]

Data to Extract

Meta Tags (MetaTagExtractor)

Field                 | Sources (priority order)
----------------------+----------------------------------------------
title                 | og:title, <title>
description           | og:description, meta[name="description"]
image                 | og:image
canonical             | link[rel="canonical"]
product_price         | product:price:amount
product_currency      | product:price:currency
product_availability  | product:availability
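
The field table above could be implemented roughly as follows, using PHP's bundled DOM extension. The function name extractMetaTags() is hypothetical, and the sketch assumes Open Graph and product tags use the property attribute. Character-encoding handling is left out.

```php
<?php
declare(strict_types=1);

/** @return array<string, string> Only fields with a non-empty value are included. */
function extractMetaTags(string $html): array
{
    if (trim($html) === '') {
        return []; // DOMDocument::loadHTML() rejects empty input
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Field => XPath expressions in priority order; the first non-empty result wins.
    $sources = [
        'title' => ['string(//meta[@property="og:title"]/@content)', 'string(//title)'],
        'description' => [
            'string(//meta[@property="og:description"]/@content)',
            'string(//meta[@name="description"]/@content)',
        ],
        'image' => ['string(//meta[@property="og:image"]/@content)'],
        'canonical' => ['string(//link[@rel="canonical"]/@href)'],
        'product_price' => ['string(//meta[@property="product:price:amount"]/@content)'],
        'product_currency' => ['string(//meta[@property="product:price:currency"]/@content)'],
        'product_availability' => ['string(//meta[@property="product:availability"]/@content)'],
    ];

    $result = [];
    foreach ($sources as $field => $queries) {
        foreach ($queries as $query) {
            $value = trim((string) $xpath->evaluate($query));
            if ($value !== '') {
                $result[$field] = $value;
                break;
            }
        }
    }

    return $result;
}
```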

JSON-LD (JsonLdExtractor)

Extract from <script type="application/ld+json">:

  • Product schema: name, description, image, brand, category
  • Offer schema: price, priceCurrency, availability
  • Handle @graph format (multiple schemas in one block)
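
A rough sketch of the JSON-LD handling listed above, assuming PHP 8.1+. The function names and the mapping to the frontmatter keys from the expected output are assumptions; image is omitted for brevity. A block may hold a single node, a list of nodes, or an @graph wrapper, and all three are flattened before filtering for Product. Only the first offer is read, which sidesteps the open question about variants.

```php
<?php
declare(strict_types=1);

/** @return list<array<string, mixed>> All Product nodes found in JSON-LD script blocks. */
function findJsonLdProducts(string $html): array
{
    preg_match_all(
        '~<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>~is',
        $html,
        $matches
    );

    $products = [];
    foreach ($matches[1] as $raw) {
        try {
            $decoded = json_decode(trim($raw), true, 512, JSON_THROW_ON_ERROR);
        } catch (JsonException) {
            continue; // skip the invalid block, keep processing the others
        }
        if (!is_array($decoded)) {
            continue;
        }

        if (isset($decoded['@graph']) && is_array($decoded['@graph'])) {
            $nodes = $decoded['@graph'];
        } else {
            $nodes = array_is_list($decoded) ? $decoded : [$decoded];
        }

        foreach ($nodes as $node) {
            // @type may be a string or a list of strings
            if (is_array($node) && in_array('Product', (array) ($node['@type'] ?? []), true)) {
                $products[] = $node;
            }
        }
    }

    return $products;
}

/** @return array<string, string> Flat fields; missing values are left out. */
function mapJsonLdProduct(array $product): array
{
    $offer = $product['offers'] ?? [];
    if (!is_array($offer)) {
        $offer = [];
    } elseif (array_is_list($offer)) {
        $offer = is_array($offer[0] ?? null) ? $offer[0] : [];
    }

    $brand = $product['brand'] ?? null;
    $price = $offer['price'] ?? null;
    $availability = $offer['availability'] ?? null;

    return array_filter([
        'product_name' => $product['name'] ?? null,
        'product_description' => $product['description'] ?? null,
        'brand' => is_array($brand) ? ($brand['name'] ?? null) : $brand,
        'category' => $product['category'] ?? null,
        'price' => is_scalar($price) ? (string) $price : null,
        'currency' => $offer['priceCurrency'] ?? null,
        // "https://schema.org/InStock" becomes "InStock"
        'availability' => is_string($availability)
            ? preg_replace('~^https?://schema\.org/~', '', $availability)
            : null,
    ], static fn (mixed $value): bool => $value !== null);
}
```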

Shopify (ShopifyDataExtractor)

Extract from inline scripts:

  • _RSConfig.product: tags, type, vendor, variants
  • ShopifyAnalytics.meta.product: similar data
  • Inline var product = {...} patterns
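
For the inline-script patterns above, a regular expression alone cannot reliably find the end of a nested object, so one option is a small brace-matching scanner. The sketch below assumes the assigned value is a valid JSON literal (double-quoted keys and strings), which has not been verified against real Shopify pages. The function name and the marker strings passed to it are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Finds $marker in $source, then decodes the first {...} object that follows it.
 * Returns null when the marker is missing, the braces never balance, or the object is not valid JSON.
 */
function extractJsonAfterMarker(string $source, string $marker): ?array
{
    $markerPos = strpos($source, $marker);
    if ($markerPos === false) {
        return null;
    }

    $start = strpos($source, '{', $markerPos + strlen($marker));
    if ($start === false) {
        return null;
    }

    $depth = 0;
    $inString = false;
    $escaped = false;
    $length = strlen($source);

    for ($i = $start; $i < $length; $i++) {
        $char = $source[$i];

        if ($inString) {
            // Braces inside string literals must not affect the depth counter.
            if ($escaped) {
                $escaped = false;
            } elseif ($char === '\\') {
                $escaped = true;
            } elseif ($char === '"') {
                $inString = false;
            }
            continue;
        }

        if ($char === '"') {
            $inString = true;
        } elseif ($char === '{') {
            $depth++;
        } elseif ($char === '}') {
            $depth--;
            if ($depth === 0) {
                $decoded = json_decode(substr($source, $start, $i - $start + 1), true);

                return is_array($decoded) ? $decoded : null;
            }
        }
    }

    return null; // unbalanced braces
}
```

If real pages turn out to use JavaScript object syntax that is not valid JSON (unquoted keys, single quotes, trailing commas), this approach returns null and a different strategy would be needed.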

Implementation Phases

Phase 1: Core Infrastructure

  • ExtractedStructuredData value object
  • StructuredDataExtractorInterface
  • StructuredDataExtractor orchestrator
  • Service configuration

Phase 2: Basic Extractors

  • MetaTagExtractor (OG tags, meta description)
  • JsonLdExtractor (Product/Offer schemas)

Phase 3: Integration

  • Update ContentProcessingStepProcessor
  • Unit tests for extractors
  • Integration tests

Phase 4: Platform Extractors

  • ShopifyDataExtractor
  • Additional platforms as needed (Shopware, WooCommerce)

Error Handling

  • Each extractor wraps JSON parsing in try-catch
  • Invalid data logged as warning, not error
  • Processing continues with remaining extractors
  • Empty results gracefully omitted from frontmatter
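
The rules above could be enforced at the orchestrator level as well as inside each extractor. The sketch below is one way to do that, not the planned implementation: extractors are modelled as plain callables keyed by name and the logger as a callable, so that the example stays self-contained. In the real service these would be the tagged extractor objects and an injected logger.

```php
<?php
declare(strict_types=1);

/**
 * Runs every extractor; a failure in one is logged as a warning and does not stop the rest.
 *
 * @param iterable<string, callable(string): array> $extractors
 * @param callable(string): void $logWarning
 * @return array<string, array> Non-empty results keyed by extractor name.
 */
function runExtractors(iterable $extractors, string $html, callable $logWarning): array
{
    $results = [];

    foreach ($extractors as $name => $extract) {
        try {
            $fields = $extract($html);
        } catch (Throwable $e) {
            $logWarning(sprintf('Extractor "%s" failed: %s', $name, $e->getMessage()));
            continue; // processing continues with the remaining extractors
        }

        if ($fields !== []) {
            $results[$name] = $fields; // empty results are omitted
        }
    }

    return $results;
}
```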

Validation Required

Before implementing, investigate real crawled URLs to confirm:

  1. JSON-LD presence: How many roaster sites include JSON-LD Product schemas?
  2. Data quality: Is the structured data accurate and useful?
  3. Platform coverage: What platforms are most common? (Shopify, Shopware, etc.)
  4. Value add: What specific fields would improve extraction that aren't in the description?

Open Questions

  1. Should we extract variant information (sizes, weights, prices)?
  2. How to handle conflicting data between sources (e.g., different prices)?
  3. Should platform-specific data be normalized to a common format?
  4. Maximum frontmatter size limits?

Status

Current: Planning / Validation
Next: Investigate sample crawled URLs to validate assumptions