Fix HtmlCleaner Null Pointer Error in findRadioGroupContainer

Priority: 🔴 CRITICAL - Production Runtime Error
Status: Planning
Sentry Issue: BEANS-BACKEND-4D
Related Plans: refactor-html-cleaner-god-object.md

Problem Statement

Production error occurring in HtmlCleaner::findRadioGroupContainer() causing crawler failures.

Error Details

  • Error Message: Warning: Attempt to read property "nodeType" on null
  • Location: src/Service/Crawler/HtmlCleaner.php:344
  • Frequency: 13 occurrences in under 3 seconds (per the First Seen / Last Seen timestamps below)
  • Environment: Production
  • URL Triggering Error: https://blossomcoffeeroasters.com/products/test-first-light-breakfast-blend
  • First Seen: 2025-11-09T02:22:24.766Z
  • Last Seen: 2025-11-09T02:22:27.000Z

Stack Trace

HtmlCleaner->findRadioGroupContainer() (line 312)
  ↑ called by
HtmlCleaner->replaceRadioGroupsWithSummaries() (line 250)
  ↑ called by
HtmlCleaner->transformRadioGroups() (line 112)
  ↑ called by
HtmlCleaner->transformFormElements() (line 23)
  ↑ called by
HtmlCleaner->cleanHtml() (line 99)
  ↑ called by
ContentProcessingStepProcessor->convertHtmlToMarkdown() (line 70)

Root Cause Analysis

The Bug (Lines 343-344)

// Look for div/container with multiple radios from same group
$parent = $domElement->parentNode;  // Line 343 - Can be null
while ($parent->nodeType === XML_ELEMENT_NODE && $parent instanceof DOMElement) {  // Line 344 - ERROR HERE
    // ...
    $parent = $parent->parentNode;  // Line 350 - Can also become null
}

Issue Breakdown

  1. Line 343: $parent = $domElement->parentNode - No guarantee this isn't null
  2. Line 344: $parent->nodeType - Attempts to access property on potentially null value
  3. Line 350: $parent = $parent->parentNode - Can result in null in subsequent iterations

Why It Happens

  • DOM nodes can have null as parentNode (e.g., document fragments, detached nodes)
  • The while loop condition checks $parent->nodeType before checking if $parent is null
  • Reading a property on null is invalid in PHP; PHP 8 reports it as the warning quoted above (Attempt to read property "nodeType" on null)
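To make the parentNode cases concrete, here is a small standalone sketch using PHP's DOM extension. The variable names are illustrative and are not taken from HtmlCleaner. It contrasts a never-attached element, a detached subtree, and a parsed document, where the upward walk passes through the document node before reaching null.

```php
<?php
// Case 1: an element created but never appended has no parent at all.
$dom = new DOMDocument();
$radio = $dom->createElement('input');
$loneParent = $radio->parentNode;              // null

// Case 2: in a detached subtree, the first parent exists, but the walk
// reaches null as soon as it passes the subtree's root.
$wrapper = $dom->createElement('div');
$wrapper->appendChild($radio);
$firstParentIsWrapper = $radio->parentNode->isSameNode($wrapper);
$wrapperParent = $wrapper->parentNode;         // null

// Case 3: in a parsed document, the walk passes through the document node
// (which is not an element) before parentNode finally becomes null.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><input type="radio" name="size"></body></html>');
$chain = [];
for ($node = $doc->getElementsByTagName('input')->item(0); $node !== null; $node = $node->parentNode) {
    $chain[] = $node->nodeName;
}

var_dump($loneParent, $firstParentIsWrapper, $wrapperParent);
echo implode(' -> ', $chain), PHP_EOL; // input -> body -> html -> #document
```

In case 3 a loop that tests nodeType === XML_ELEMENT_NODE exits at the document node, so the null dereference is only reachable through cases 1 and 2. This suggests the failing page produced a detached node somewhere earlier in the cleaning pipeline, though the plan does not confirm where.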

Impact Assessment

Severity: CRITICAL 🔴

  • Production outage: Crawler fails to process pages with certain form structures
  • Data quality: Coffee bean pages cannot be processed
  • User impact: Products from affected roasters won't appear in search
  • Frequency: Affects specific HTML patterns (radio button groups)

Affected Functionality

  • Content processing step in crawler pipeline
  • HTML to Markdown conversion
  • Form element transformation
  • Radio button group detection and summarization

Solution Design

Option 1: Add Null Checks

// Look for div/container with multiple radios from same group
$parent = $domElement->parentNode;

// Add null check before accessing properties
if (!$parent) {
    return null;
}

while ($parent->nodeType === XML_ELEMENT_NODE && $parent instanceof DOMElement) {
    $groupRadios = $domxPath->query('.//input[@type="radio" and @name="' . $groupName . '"]', $parent);
    if ($groupRadios && $groupRadios->length > 1 && $parent->nodeName !== 'body') {
        return $parent;
    }

    $parent = $parent->parentNode;

    // Add null check for subsequent iterations
    if (!$parent) {
        break;
    }
}

return null;

Pros:

  • Minimal change, low risk
  • Fixes the immediate bug
  • Quick to implement and test
  • Follows defensive programming

Cons:

  • Doesn't address underlying complexity issues
  • Band-aid fix on god object
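As a possible simplification of the null checks shown above: since instanceof evaluates to false for null, a single `$parent instanceof DOMElement` condition can cover both guards. The sketch below uses a hypothetical free function, findMultiRadioAncestor(), that mirrors only the parent walk; it is not the actual HtmlCleaner method and omits the fieldset lookup.

```php
<?php
// Hypothetical stand-in for the parent walk in findRadioGroupContainer().
function findMultiRadioAncestor(DOMXPath $xpath, DOMElement $radio, string $groupName): ?DOMElement
{
    $parent = $radio->parentNode;

    // instanceof is false for null and for non-element nodes (such as the
    // document node), so no separate null checks are needed.
    while ($parent instanceof DOMElement) {
        $groupRadios = $xpath->query('.//input[@type="radio" and @name="' . $groupName . '"]', $parent);
        if ($groupRadios && $groupRadios->length > 1 && $parent->nodeName !== 'body') {
            return $parent;
        }
        $parent = $parent->parentNode;
    }

    return null;
}

// Usage: an attached group resolves to its wrapping div.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><div id="sizes"><input type="radio" name="size"><input type="radio" name="size"></div></body></html>');
$xpath = new DOMXPath($doc);
$container = findMultiRadioAncestor($xpath, $doc->getElementsByTagName('input')->item(0), 'size');
echo $container?->getAttribute('id'), PHP_EOL; // sizes

// A detached radio (null parentNode) returns null without any warning.
var_dump(findMultiRadioAncestor($xpath, $doc->createElement('input'), 'size')); // NULL
```

Every DOMElement has nodeType XML_ELEMENT_NODE, so dropping the explicit nodeType comparison should not change behavior. Whether to adopt this form or keep the explicit checks is a style decision for the hotfix review.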

Option 2: Refactor Entire Method

Combine with the broader HtmlCleaner refactoring (see refactor-html-cleaner-god-object.md).

Pros:

  • Addresses root architectural issues
  • Improves testability
  • Reduces future bugs

Cons:

  • Takes longer to implement
  • Higher risk of introducing new bugs
  • Production error needs immediate fix

Recommended Approach

Phase 1: Immediate Hotfix (This Week)

  1. Implement Option 1 (null checks)
  2. Add unit tests for edge cases
  3. Deploy to production
  4. Monitor Sentry for resolution

Phase 2: Strategic Refactor (Next Sprint)

  1. Follow refactor-html-cleaner-god-object.md
  2. Extract radio group logic to dedicated service
  3. Improve test coverage
  4. Simplify DOM traversal logic

Implementation Plan

Step 1: Add Null Safety Guards

File: src/Service/Crawler/HtmlCleaner.php
Lines: 343-351

private function findRadioGroupContainer(DOMXPath $domxPath, DOMElement $domElement, string $groupName): ?DOMElement
{
    // Look for a fieldset containing this radio group
    $fieldsets = $domxPath->query('.//ancestor::fieldset[1]', $domElement);
    if ($fieldsets && $fieldsets->length > 0) {
        $fieldset = $fieldsets->item(0);
        if ($fieldset instanceof DOMElement) {
            // Check if this fieldset contains other radios from the same group
            $groupRadios = $domxPath->query('.//input[@type="radio" and @name="' . $groupName . '"]', $fieldset);
            if ($groupRadios && $groupRadios->length > 1) {
                return $fieldset;
            }
        }
    }

    // Look for div/container with multiple radios from same group
    $parent = $domElement->parentNode;

    // FIX: Add null check before accessing properties
    if (!$parent) {
        return null;
    }

    while ($parent->nodeType === XML_ELEMENT_NODE && $parent instanceof DOMElement) {
        $groupRadios = $domxPath->query('.//input[@type="radio" and @name="' . $groupName . '"]', $parent);
        if ($groupRadios && $groupRadios->length > 1 && $parent->nodeName !== 'body') {
            return $parent;
        }

        $parent = $parent->parentNode;

        // FIX: Add null check for loop continuation
        if (!$parent) {
            break;
        }
    }

    return null;
}

Step 2: Add Unit Tests

File: tests/Service/Crawler/HtmlCleanerTest.php

Add test cases for:

  1. Radio group with detached parent node
  2. Radio group in document fragment
  3. Radio group with null parentNode
  4. Radio group at document root
  5. Deeply nested radio groups

public function testFindRadioGroupContainerWithNullParent(): void
{
    // Test case: detached DOM element
    $dom = new DOMDocument();
    $radio = $dom->createElement('input');
    $radio->setAttribute('type', 'radio');
    $radio->setAttribute('name', 'test-group');

    // Don't append to document - parentNode will be null

    $xpath = new DOMXPath($dom);
    $cleaner = new HtmlCleaner();

    $result = $this->invokePrivateMethod($cleaner, 'findRadioGroupContainer', [$xpath, $radio, 'test-group']);

    $this->assertNull($result, 'Should return null when parentNode is null');
}

public function testFindRadioGroupContainerWithBodyAsRoot(): void
{
    // Test case: radio at document root (body)
    $html = '<body><input type="radio" name="test-group" /></body>';
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $radio = $xpath->query('//input[@type="radio"]')->item(0);
    $cleaner = new HtmlCleaner();

    $result = $this->invokePrivateMethod($cleaner, 'findRadioGroupContainer', [$xpath, $radio, 'test-group']);

    $this->assertNull($result, 'Should return null when no ancestor contains multiple radios from the group');
}
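The tests above call invokePrivateMethod(), which this plan does not define. A minimal sketch of such a helper, assuming it is a thin wrapper around PHP's reflection API, could look like the following. It is written as a free function for brevity; in the test class it would presumably be a method or trait. The Greeter class is a throwaway example, not part of the codebase.

```php
<?php
// Hypothetical helper: calls a private or protected method on $object.
function invokePrivateMethod(object $object, string $methodName, array $args = []): mixed
{
    $method = new ReflectionMethod($object, $methodName);
    // PHP 8.1+ lets reflection invoke non-public methods directly;
    // on older versions, call $method->setAccessible(true) first.
    return $method->invokeArgs($object, $args);
}

// Throwaway class used only to demonstrate the helper.
final class Greeter
{
    private function greet(string $name): string
    {
        return 'Hello, ' . $name;
    }
}

echo invokePrivateMethod(new Greeter(), 'greet', ['Ada']), PHP_EOL; // Hello, Ada
```

Testing a private method through reflection couples the test to implementation details, so these tests may need to move once the radio group logic is extracted in Phase 2.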

Step 3: Integration Test

File: tests/Service/Crawler/Step/Processors/ContentProcessingStepProcessorTest.php

Add test with real-world HTML from failing URL:

public function testProcessBlossomCoffeeRoastersProductPage(): void
{
    // Reproduce the exact failure scenario from Sentry
    $html = file_get_contents(__DIR__ . '/fixtures/blossom-coffee-roasters-product.html');

    $processor = $this->getContainer()->get(ContentProcessingStepProcessor::class);

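    // NOTE: $crawlUrl is not defined in this sketch; it must be built for the failing URL before calling process()
    // Should not throw error
    $result = $processor->process($crawlUrl, $html);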

    $this->assertNotNull($result);
}

Step 4: Verify Fix in Production

  1. Deploy hotfix to staging
  2. Run crawler on affected URL
  3. Monitor Sentry for errors
  4. Deploy to production
  5. Verify issue BEANS-BACKEND-4D is resolved

Testing Strategy

Unit Tests

  • [ ] Test null parent node scenarios
  • [ ] Test detached DOM elements
  • [ ] Test document fragments
  • [ ] Test deeply nested structures
  • [ ] Test radio groups at document root

Integration Tests

  • [ ] Test with real HTML from Blossom Coffee Roasters
  • [ ] Test with other problematic URLs (if any)
  • [ ] Test end-to-end crawler pipeline

Manual Testing

  • [ ] Test crawler against affected URL locally
  • [ ] Verify markdown output is correct
  • [ ] Check no regressions in radio group transformation

Deployment Plan

Pre-Deployment

  1. Run full test suite: make test
  2. Run PHPStan: make phpstan
  3. Run PHPMD: make phpmd
  4. Manual crawler test on staging

Deployment

  1. Create hotfix branch: hotfix/fix-htmlcleaner-null-pointer
  2. Commit with message: fix: Add null safety guards to HtmlCleaner::findRadioGroupContainer() - Fixes BEANS-BACKEND-4D
  3. Create PR
  4. Get code review
  5. Deploy to staging
  6. Verify fix
  7. Deploy to production

Post-Deployment

  1. Monitor Sentry for 24 hours
  2. Verify BEANS-BACKEND-4D is resolved
  3. Check crawler success rate
  4. Verify affected URLs now process successfully

Success Criteria

  • [ ] Zero occurrences of null pointer error in findRadioGroupContainer()
  • [ ] Sentry issue BEANS-BACKEND-4D marked as resolved
  • [ ] All tests passing
  • [ ] No regressions in radio group processing
  • [ ] Affected URL successfully crawled and processed
  • [ ] Documentation updated

Risk Assessment

Risk: LOW 🟢

  • Minimal code change
  • Defensive programming pattern (early return)
  • Easy to understand and review
  • Low chance of introducing new bugs

Mitigation

  • Comprehensive test coverage
  • Staging deployment first
  • Production monitoring
  • Easy rollback if needed

Follow-up Actions

After Hotfix Deployed

  1. Schedule HtmlCleaner refactoring (see refactor-html-cleaner-god-object.md)
  2. Review other DOM traversal code for similar issues
  3. Consider adding PHPStan rule for null-safe property access
  4. Document DOM handling best practices
  • Search codebase for similar patterns: ->parentNode-> without null checks
  • Review other DOM manipulation code in HtmlCleaner
  • Check if other crawler components have similar issues
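The search for `->parentNode->` can be sketched as a small script. findChainedParentNodeAccess() is a hypothetical helper that flags lines where something is chained directly onto parentNode with the plain arrow operator; the nullsafe form `?->` is not flagged.

```php
<?php
/**
 * Hypothetical helper: returns 1-based line numbers where a property or
 * method is chained directly onto ->parentNode without the nullsafe operator.
 */
function findChainedParentNodeAccess(string $source): array
{
    $hits = [];
    foreach (preg_split('/\R/', $source) as $index => $line) {
        if (preg_match('/->parentNode\s*->/', $line) === 1) {
            $hits[] = $index + 1;
        }
    }
    return $hits;
}

// Sample input covering the direct, nullsafe, and assign-then-use forms.
$sample = <<<'CODE'
$a = $node->parentNode->nodeName;
$b = $node->parentNode?->nodeName;
$parent = $node->parentNode;
while ($parent->nodeType === XML_ELEMENT_NODE) {}
CODE;

print_r(findChainedParentNodeAccess($sample)); // only line 1 is reported
```

Lines 3 and 4 of the sample are not reported, and that is the shape of the original bug: the value is assigned to a variable first and dereferenced later. A textual search is therefore only a first pass, and static analysis is likely better suited to finding the remaining cases.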

Additional Context

Why This Wasn't Caught Earlier

  1. Edge case HTML structure: Not all radio groups trigger this
  2. DOM state dependency: Only affects detached or malformed DOM nodes
  3. Test coverage gap: Missing tests for edge cases
  4. God object complexity: Hard to reason about all code paths

Prevention for Future

  1. Add linting rule for null-safe property access
  2. Improve HtmlCleaner test coverage
  3. Refactor god object to simpler services
  4. Add integration tests with real-world HTML samples

References

  • Sentry Issue: https://alpipego.sentry.io/issues/7009584207/
  • Related Plan: refactor-html-cleaner-god-object.md
  • PHP DOM Documentation: https://www.php.net/manual/en/class.domnode.php
  • SOLID Principles: Single Responsibility, god object anti-pattern

Estimated Effort

  • Hotfix Implementation: 2-4 hours
  • Testing: 2-3 hours
  • Deployment & Monitoring: 1-2 hours
  • Total: 5-9 hours (1 day)

Assignee

TBD

Status Updates

  • 2025-11-09: Plan created after Sentry issue analysis
  • Next: Implement hotfix and tests