Fix HtmlCleaner Null Pointer Error in findRadioGroupContainer¶
Priority: 🔴 CRITICAL - Production Runtime Error
Status: Planning
Sentry Issue: BEANS-BACKEND-4D
Related Plans: refactor-html-cleaner-god-object.md
Problem Statement¶
A production error in HtmlCleaner::findRadioGroupContainer() is causing crawler failures.
Error Details¶
- Error Message: Warning: Attempt to read property "nodeType" on null
- Location: src/Service/Crawler/HtmlCleaner.php:344
- Frequency: 13 occurrences in ~5 seconds
- Environment: Production
- URL Triggering Error: https://blossomcoffeeroasters.com/products/test-first-light-breakfast-blend
- First Seen: 2025-11-09T02:22:24.766Z
- Last Seen: 2025-11-09T02:22:27.000Z
Stack Trace¶
HtmlCleaner->findRadioGroupContainer() (line 312)
↑ called by
HtmlCleaner->replaceRadioGroupsWithSummaries() (line 250)
↑ called by
HtmlCleaner->transformRadioGroups() (line 112)
↑ called by
HtmlCleaner->transformFormElements() (line 23)
↑ called by
HtmlCleaner->cleanHtml() (line 99)
↑ called by
ContentProcessingStepProcessor->convertHtmlToMarkdown() (line 70)
Root Cause Analysis¶
The Bug (Lines 343-344)¶
// Look for div/container with multiple radios from same group
$parent = $domElement->parentNode; // Line 343 - Can be null
while ($parent->nodeType === XML_ELEMENT_NODE && $parent instanceof DOMElement) { // Line 344 - ERROR HERE
    // ...
    $parent = $parent->parentNode; // Line 350 - Can also become null
}
Issue Breakdown¶
- Line 343: $parent = $domElement->parentNode - No guarantee this isn't null
- Line 344: $parent->nodeType - Attempts to access a property on a potentially null value
- Line 350: $parent = $parent->parentNode - Can result in null in subsequent iterations
Why It Happens¶
- DOM nodes can have null as parentNode (e.g., document fragments, detached nodes)
- The while loop condition reads $parent->nodeType before checking whether $parent is null
- In PHP 8, reading a property on null raises the warning shown under Error Details (the object must not be null)
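To make the failure mode concrete, here is a minimal sketch that follows parentNode upward. It is not taken from HtmlCleaner; the helper name topMostAncestor, the group name grind, and the sample markup are assumptions for illustration. For an element attached to a parsed document, the chain ends at the DOMDocument node, which is non-null and not an element, so the original loop condition would exit cleanly there. The chain reaches null straight from an element only when the subtree is detached.

```php
<?php
declare(strict_types=1);

/** Follows parentNode upward and returns the last non-null node reached. */
function topMostAncestor(DOMNode $node): DOMNode
{
    while ($node->parentNode !== null) {
        $node = $node->parentNode;
    }
    return $node;
}

// Attached element: input -> div -> body -> html -> DOMDocument -> null.
$dom = new DOMDocument();
$dom->loadHTML('<html><body><div><input type="radio" name="grind"></div></body></html>');
$attached = $dom->getElementsByTagName('input')->item(0);
var_dump(topMostAncestor($attached) instanceof DOMDocument); // bool(true)

// Detached subtree: the wrapper was created but never appended to the document.
$wrapper = $dom->createElement('div');
$radio = $dom->createElement('input');
$wrapper->appendChild($radio);
var_dump(topMostAncestor($radio) instanceof DOMElement); // bool(true)
var_dump($wrapper->parentNode);                          // NULL
```

In the detached case the loop's last assignment produces null directly after an element, which is the state the guard has to handle.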
Impact Assessment¶
Severity: CRITICAL 🔴¶
- Production outage: Crawler fails to process pages with certain form structures
- Data quality: Coffee bean pages cannot be processed
- User impact: Products from affected roasters won't appear in search
- Frequency: Affects specific HTML patterns (radio button groups)
Affected Functionality¶
- Content processing step in crawler pipeline
- HTML to Markdown conversion
- Form element transformation
- Radio button group detection and summarization
Solution Design¶
Option 1: Add Null Check Before While Loop (Recommended)¶
// Look for div/container with multiple radios from same group
$parent = $domElement->parentNode;
// Add null check before accessing properties
if (!$parent) {
    return null;
}
while ($parent->nodeType === XML_ELEMENT_NODE && $parent instanceof DOMElement) {
    $groupRadios = $domxPath->query('.//input[@type="radio" and @name="' . $groupName . '"]', $parent);
    if ($groupRadios && $groupRadios->length > 1 && $parent->nodeName !== 'body') {
        return $parent;
    }
    $parent = $parent->parentNode;
    // Add null check for subsequent iterations
    if (!$parent) {
        break;
    }
}
return null;
Pros:
- Minimal change, low risk
- Fixes the immediate bug
- Quick to implement and test
- Follows defensive programming
Cons:
- Doesn't address underlying complexity issues
- Band-aid fix on god object
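A more compact variant of the same guard is sketched below, on the assumption that only DOMElement ancestors are of interest. instanceof evaluates to false for null, so a single instanceof check in the loop condition covers both the initial null and a null reached on a later iteration, and it makes the separate nodeType comparison redundant. The function name findGroupContainerSketch and the sample markup are hypothetical; this is a standalone illustration, not the project's method.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical standalone variant of the Option 1 traversal.
 * Walks up from $element and returns the first non-body ancestor element
 * that contains more than one radio input named $groupName.
 */
function findGroupContainerSketch(DOMXPath $xpath, DOMElement $element, string $groupName): ?DOMElement
{
    $parent = $element->parentNode;

    // False for null and for non-element nodes (DOMDocument, DOMDocumentFragment).
    while ($parent instanceof DOMElement) {
        $radios = $xpath->query('.//input[@type="radio" and @name="' . $groupName . '"]', $parent);
        if ($radios !== false && $radios->length > 1 && $parent->nodeName !== 'body') {
            return $parent;
        }
        $parent = $parent->parentNode;
    }

    return null;
}

// Usage: two radios of the same group inside one div.
$dom = new DOMDocument();
$dom->loadHTML(
    '<html><body><div id="opts">'
    . '<input type="radio" name="grind" value="a">'
    . '<input type="radio" name="grind" value="b">'
    . '</div></body></html>'
);
$xpath = new DOMXPath($dom);
$first = $dom->getElementsByTagName('input')->item(0);
$found = findGroupContainerSketch($xpath, $first, 'grind');
var_dump($found?->getAttribute('id')); // string(4) "opts"

// Usage: a detached radio has no parent, so the loop body never runs.
$orphan = $dom->createElement('input');
var_dump(findGroupContainerSketch($xpath, $orphan, 'grind')); // NULL
```

Whether to adopt this form or keep the explicit null checks is a readability choice; both avoid the property read on null.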
Option 2: Refactor Entire Method¶
Combine with the broader HtmlCleaner refactoring (see refactor-html-cleaner-god-object.md).
Pros:
- Addresses root architectural issues
- Improves testability
- Reduces future bugs
Cons:
- Takes longer to implement
- Higher risk of introducing new bugs
- Production error needs immediate fix
Recommended Approach¶
Phase 1: Immediate Hotfix (This Week)¶
- Implement Option 1 (null checks)
- Add unit tests for edge cases
- Deploy to production
- Monitor Sentry for resolution
Phase 2: Strategic Refactor (Next Sprint)¶
- Follow refactor-html-cleaner-god-object.md
- Extract radio group logic to dedicated service
- Improve test coverage
- Simplify DOM traversal logic
Implementation Plan¶
Step 1: Add Null Safety Guards¶
File: src/Service/Crawler/HtmlCleaner.php
Lines: 343-351
private function findRadioGroupContainer(DOMXPath $domxPath, DOMElement $domElement, string $groupName): ?DOMElement
{
    // Look for a fieldset containing this radio group
    $fieldsets = $domxPath->query('.//ancestor::fieldset[1]', $domElement);
    if ($fieldsets && $fieldsets->length > 0) {
        $fieldset = $fieldsets->item(0);
        if ($fieldset instanceof DOMElement) {
            // Check if this fieldset contains other radios from the same group
            $groupRadios = $domxPath->query('.//input[@type="radio" and @name="' . $groupName . '"]', $fieldset);
            if ($groupRadios && $groupRadios->length > 1) {
                return $fieldset;
            }
        }
    }
    // Look for div/container with multiple radios from same group
    $parent = $domElement->parentNode;
    // FIX: Add null check before accessing properties
    if (!$parent) {
        return null;
    }
    while ($parent->nodeType === XML_ELEMENT_NODE && $parent instanceof DOMElement) {
        $groupRadios = $domxPath->query('.//input[@type="radio" and @name="' . $groupName . '"]', $parent);
        if ($groupRadios && $groupRadios->length > 1 && $parent->nodeName !== 'body') {
            return $parent;
        }
        $parent = $parent->parentNode;
        // FIX: Add null check for loop continuation
        if (!$parent) {
            break;
        }
    }
    return null;
}
Step 2: Add Unit Tests¶
File: tests/Service/Crawler/HtmlCleanerTest.php
Add test cases for:
- Radio group with detached parent node
- Radio group in document fragment
- Radio group with null parentNode
- Radio group at document root
- Deeply nested radio groups
public function testFindRadioGroupContainerWithNullParent(): void
{
    // Test case: detached DOM element
    $dom = new DOMDocument();
    $radio = $dom->createElement('input');
    $radio->setAttribute('type', 'radio');
    $radio->setAttribute('name', 'test-group');
    // Don't append to document - parentNode will be null
    $xpath = new DOMXPath($dom);
    $cleaner = new HtmlCleaner();
    $result = $this->invokePrivateMethod($cleaner, 'findRadioGroupContainer', [$xpath, $radio, 'test-group']);
    $this->assertNull($result, 'Should return null when parentNode is null');
}

public function testFindRadioGroupContainerWithBodyAsRoot(): void
{
    // Test case: a single radio directly under body
    $html = '<body><input type="radio" name="test-group" /></body>';
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $radio = $xpath->query('//input[@type="radio"]')->item(0);
    // item(0) returns null when nothing matched; fail early with a clear message
    $this->assertInstanceOf(DOMElement::class, $radio);
    $cleaner = new HtmlCleaner();
    $result = $this->invokePrivateMethod($cleaner, 'findRadioGroupContainer', [$xpath, $radio, 'test-group']);
    $this->assertNull($result, 'Should return null for a lone radio directly under body');
}
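The tests above rely on an invokePrivateMethod helper that this plan does not define. One possible shape is sketched here as a free function built on reflection, assuming PHP 8.1 or later (where a private method can be invoked through ReflectionMethod without calling setAccessible). The names invokePrivateMethodSketch and StandInCleaner are hypothetical; StandInCleaner only stands in for a class with a private method and is not HtmlCleaner.

```php
<?php
declare(strict_types=1);

/**
 * Invokes a non-public method on $object with the given arguments.
 * Assumes PHP 8.1+, where reflection can call private methods directly.
 */
function invokePrivateMethodSketch(object $object, string $method, array $args = []): mixed
{
    $reflection = new ReflectionMethod($object, $method);

    return $reflection->invokeArgs($object, $args);
}

// Hypothetical stand-in class used only to demonstrate the helper.
final class StandInCleaner
{
    private function add(int $a, int $b): int
    {
        return $a + $b;
    }
}

var_dump(invokePrivateMethodSketch(new StandInCleaner(), 'add', [2, 3])); // int(5)
```

In a PHPUnit test case the same body could live in a protected method on a shared base class or trait, so that the calls in the tests keep their $this->invokePrivateMethod(...) form.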
Step 3: Integration Test¶
File: tests/Service/Crawler/Step/Processors/ContentProcessingStepProcessorTest.php
Add test with real-world HTML from failing URL:
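public function testProcessBlossomCoffeeRoastersProductPage(): void
{
    // Reproduce the exact failure scenario from Sentry
    $html = file_get_contents(__DIR__ . '/fixtures/blossom-coffee-roasters-product.html');
    // file_get_contents() returns false when the fixture is missing or unreadable
    $this->assertNotFalse($html, 'Fixture file should be readable');
    $processor = $this->getContainer()->get(ContentProcessingStepProcessor::class);
    // TODO: $crawlUrl is not defined in this plan; build it for the failing product URL
    // Should not throw error
    $result = $processor->process($crawlUrl, $html);
    $this->assertNotNull($result);
}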
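Because the Sentry event is a PHP warning rather than an exception, a "should not throw" test only catches a regression if warnings are surfaced as failures. Below is a minimal sketch of one way to do that, assuming PHP 8. The helper name runWithWarningsAsExceptions is hypothetical, and the project's PHPUnit configuration may already treat warnings as failures, in which case this is unnecessary.

```php
<?php
declare(strict_types=1);

/**
 * Runs $fn with E_WARNING converted to ErrorException, then restores
 * the previous error handler regardless of the outcome.
 */
function runWithWarningsAsExceptions(callable $fn): mixed
{
    set_error_handler(
        static function (int $severity, string $message, string $file, int $line): bool {
            throw new ErrorException($message, 0, $severity, $file, $line);
        },
        E_WARNING
    );

    try {
        return $fn();
    } finally {
        restore_error_handler();
    }
}

// Demonstration: the same kind of property read that Sentry reported.
$caught = null;
try {
    runWithWarningsAsExceptions(static function () {
        $parent = null;

        return $parent->nodeType; // warning in PHP 8, converted to an exception here
    });
} catch (ErrorException $e) {
    $caught = $e->getMessage();
}

var_dump($caught); // string(43) "Attempt to read property "nodeType" on null"
```

Wrapping the $processor->process(...) call in such a helper would make the integration test fail on the warning itself rather than only on a thrown error.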
Step 4: Verify Fix in Production¶
- Deploy hotfix to staging
- Run crawler on affected URL
- Monitor Sentry for errors
- Deploy to production
- Verify issue BEANS-BACKEND-4D is resolved
Testing Strategy¶
Unit Tests¶
- [ ] Test null parent node scenarios
- [ ] Test detached DOM elements
- [ ] Test document fragments
- [ ] Test deeply nested structures
- [ ] Test radio groups at document root
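For the document-fragment and deep-nesting items in this checklist, the DOM states can be built directly with the DOM extension. The sketch below only sets up and inspects the fixtures; it does not call HtmlCleaner. Variable names, the group name grind, and the nesting depth of 50 are assumptions for illustration.

```php
<?php
declare(strict_types=1);

$dom = new DOMDocument();

// Fixture 1: a radio inside a document fragment that was never inserted.
// $fragment stays in scope here, so the parent link remains intact.
$fragment = $dom->createDocumentFragment();
$radio = $dom->createElement('input');
$radio->setAttribute('type', 'radio');
$radio->setAttribute('name', 'grind');
$fragment->appendChild($radio);

var_dump($radio->parentNode instanceof DOMDocumentFragment); // bool(true)
var_dump($radio->parentNode instanceof DOMElement);          // bool(false)
var_dump($fragment->parentNode);                             // NULL

// Fixture 2: a radio group nested inside many wrapper divs.
$depth = 50;
$deep = new DOMDocument();
$deep->loadHTML(
    '<html><body>'
    . str_repeat('<div>', $depth)
    . '<input type="radio" name="grind" value="a">'
    . '<input type="radio" name="grind" value="b">'
    . str_repeat('</div>', $depth)
    . '</body></html>'
);
$innermost = $deep->getElementsByTagName('input')->item(0);

// Count the element ancestors the traversal would have to walk through.
$elementAncestors = 0;
for ($node = $innermost->parentNode; $node instanceof DOMElement; $node = $node->parentNode) {
    $elementAncestors++;
}
var_dump($elementAncestors >= $depth); // bool(true)
```

In the fragment fixture the radio's parent is non-null but not an element, so it exercises the loop's exit condition rather than the null guard; the detached-element test covers the null case.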
Integration Tests¶
- [ ] Test with real HTML from Blossom Coffee Roasters
- [ ] Test with other problematic URLs (if any)
- [ ] Test end-to-end crawler pipeline
Manual Testing¶
- [ ] Test crawler against affected URL locally
- [ ] Verify markdown output is correct
- [ ] Check no regressions in radio group transformation
Deployment Plan¶
Pre-Deployment¶
- Run full test suite: make test
- Run PHPStan: make phpstan
- Run PHPMD: make phpmd
- Manual crawler test on staging
Deployment¶
- Create hotfix branch: hotfix/fix-htmlcleaner-null-pointer
- Commit with message: fix: Add null safety guards to HtmlCleaner::findRadioGroupContainer() - Fixes BEANS-BACKEND-4D
- Create PR
- Get code review
- Deploy to staging
- Verify fix
- Deploy to production
Post-Deployment¶
- Monitor Sentry for 24 hours
- Verify BEANS-BACKEND-4D is resolved
- Check crawler success rate
- Verify affected URLs now process successfully
Success Criteria¶
- [ ] Zero occurrences of null pointer error in findRadioGroupContainer()
- [ ] Sentry issue BEANS-BACKEND-4D marked as resolved
- [ ] All tests passing
- [ ] No regressions in radio group processing
- [ ] Affected URL successfully crawled and processed
- [ ] Documentation updated
Risk Assessment¶
Risk: LOW 🟢¶
- Minimal code change
- Defensive programming pattern (early return)
- Easy to understand and review
- Low chance of introducing new bugs
Mitigation¶
- Comprehensive test coverage
- Staging deployment first
- Production monitoring
- Easy rollback if needed
Follow-up Actions¶
After Hotfix Deployed¶
- Schedule HtmlCleaner refactoring (see refactor-html-cleaner-god-object.md)
- Review other DOM traversal code for similar issues
- Consider adding PHPStan rule for null-safe property access
- Document DOM handling best practices
Related Issues to Check¶
- Search codebase for similar patterns: ->parentNode-> without null checks
- Review other DOM manipulation code in HtmlCleaner
- Check if other crawler components have similar issues
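The codebase search in the first item could start from a crude textual scan like the sketch below. It is a heuristic only, and the function name findChainedParentNodeAccess is hypothetical. It flags lines that chain directly off parentNode without the nullsafe operator. It would not have flagged the bug in this plan, where the null value passes through the intermediate $parent variable, so manual review of the traversal loops is still needed.

```php
<?php
declare(strict_types=1);

/**
 * Returns the 1-based line numbers in $source where a member is accessed
 * directly on ->parentNode (for example ->parentNode->nodeName).
 * Nullsafe chains (->parentNode?->) are not reported.
 *
 * @return list<int>
 */
function findChainedParentNodeAccess(string $source): array
{
    $hits = [];
    foreach (preg_split('/\R/', $source) as $index => $line) {
        if (preg_match('/->parentNode\s*->/', $line) === 1) {
            $hits[] = $index + 1;
        }
    }

    return $hits;
}

// Sample input: only the second line chains directly off parentNode.
$sample = <<<'CODE'
$a = $node->parentNode;
$b = $node->parentNode->nodeName;
$c = $node->parentNode?->nodeName;
CODE;

var_dump(findChainedParentNodeAccess($sample) === [2]); // bool(true)
```

A static analysis rule, as suggested in the follow-up actions, would be more reliable than this kind of pattern match.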
Additional Context¶
Why This Wasn't Caught Earlier¶
- Edge case HTML structure: Not all radio groups trigger this
- DOM state dependency: Only affects detached or malformed DOM nodes
- Test coverage gap: Missing tests for edge cases
- God object complexity: Hard to reason about all code paths
Prevention for Future¶
- Add linting rule for null-safe property access
- Improve HtmlCleaner test coverage
- Refactor god object to simpler services
- Add integration tests with real-world HTML samples
References¶
- Sentry Issue: https://alpipego.sentry.io/issues/7009584207/
- Related Plan: refactor-html-cleaner-god-object.md
- PHP DOM Documentation: https://www.php.net/manual/en/class.domnode.php
- SOLID Principles: Single Responsibility, god object anti-pattern
Estimated Effort¶
- Hotfix Implementation: 2-4 hours
- Testing: 2-3 hours
- Deployment & Monitoring: 1-2 hours
- Total: 5-9 hours (1 day)
Assignee¶
TBD
Status Updates¶
- 2025-11-09: Plan created after Sentry issue analysis
- Next: Implement hotfix and tests