Refactor URL Handling Services¶
Priority: 🟡 MEDIUM Status: Planning Related QA Analysis: qa-analysis-overview.md
Problem Statement¶
Three URL-related services have complexity issues affecting crawler reliability:
RoasterUrlsService¶
File: Service/Admin/RoasterUrlsService.php:34
- Method:
getRoasterUrlsData()(line 34) - NPath Complexity: 1,280/250 (412% over threshold)
UrlNormalizerService¶
File: Service/Crawler/UrlNormalizerService.php:25
- Method:
normalize()(line 25) - Cyclomatic Complexity: 11/10 (10% over)
- NPath Complexity: 576/250 (130% over)
SitemapProcessor¶
File: Service/Crawler/Processing/SitemapProcessor.php:55
- Method:
processSitemap()(line 55) - NPath Complexity: 288/250 (15% over)
Impact¶
- Crawler Reliability: URL handling is critical for crawler functionality
- Bug Risk: Complex URL logic can lead to subtle bugs
- Data Quality: Incorrect URL normalization affects data integrity
- Maintainability: Difficult to add new URL handling rules
Guideline Violations¶
- SOLID - Single Responsibility Principle: Methods doing too much
- Complexity: High NPath indicates many execution paths
Root Cause Analysis¶
RoasterUrlsService (NPath: 1,280)¶
Likely handles:
- Fetching roaster URLs from database
- Filtering roaster data
- Complex query building
- Data transformation for admin interface
- Multiple conditional branches for different data states
UrlNormalizerService (CC: 11, NPath: 576)¶
Likely handles:
- Multiple URL normalization rules
- Query parameter handling
- Fragment removal
- Path normalization
- Protocol handling
- Trailing slash logic
- Special case handling
- Domain normalization
SitemapProcessor (NPath: 288)¶
Likely handles:
- Sitemap XML parsing
- URL extraction
- URL validation
- Sitemap index processing
- Nested sitemap handling
- Error handling
- URL filtering
Proposed Refactoring Strategy¶
Step 1: Analyze Each Service¶
Review existing code to understand:
- All responsibilities of each service
- Conditional logic branches
- URL handling rules
- Edge cases and special handling
- Dependencies and interactions
Step 2: RoasterUrlsService Refactoring¶
Issues:
- Very high NPath (1,280) suggests complex query building
- Likely mixing data access, business logic, and presentation
Proposed Solution:
- Extract query building to repository method
- Create dedicated query builder or criteria
- Extract data transformation to mapper/transformer
- Simplify service to orchestration only
Example:
public function getRoasterUrlsData(RoasterFilterCriteria $criteria): array
{
// Simple orchestration
$roasters = $this->repository->findWithUrls($criteria);
return $this->transformer->transformForAdmin($roasters);
}
Step 3: UrlNormalizerService Refactoring¶
Issues:
- Multiple normalization rules in single method
- Complex conditional logic
Proposed Solution:
- Chain of Responsibility Pattern for normalization rules:
interface NormalizationRule
{
public function apply(string $url): string;
}
class RemoveFragmentRule implements NormalizationRule { ... }
class NormalizePathRule implements NormalizationRule { ... }
class NormalizeQueryParamsRule implements NormalizationRule { ... }
class RemoveTrailingSlashRule implements NormalizationRule { ... }
class UrlNormalizerService
{
public function normalize(string $url): string
{
foreach ($this->rules as $rule) {
$url = $rule->apply($url);
}
return $url;
}
}
Benefits:
- Each rule is independently testable
- Easy to add new normalization rules
- Clear separation of concerns
- Configurable rule order
Step 4: SitemapProcessor Refactoring¶
Issues:
- Multiple responsibilities (parsing, validation, processing)
- Complex control flow
Proposed Solution:
- Extract XML parsing to dedicated parser
- Extract URL validation to validator
- Create sitemap index handler
- Simplify main processing method
Example:
public function processSitemap(string $sitemapUrl): array
{
$content = $this->fetcher->fetch($sitemapUrl);
$sitemap = $this->parser->parse($content);
if ($sitemap->isIndex()) {
return $this->indexHandler->process($sitemap);
}
return $this->urlExtractor->extract($sitemap);
}
Success Criteria¶
RoasterUrlsService¶
- NPath complexity < 250
- Query building in repository
- Data transformation separated
- Service focuses on orchestration
UrlNormalizerService¶
- Cyclomatic complexity < 10
- NPath complexity < 250
- Chain of Responsibility for rules
- Each rule independently testable
- Easy to add/remove/reorder rules
SitemapProcessor¶
- NPath complexity < 250
- Clear separation of parsing, validation, processing
- Easy to extend for new sitemap formats
- Improved error handling
Testing Strategy¶
RoasterUrlsService¶
- Test query building separately
- Test data transformation separately
- Integration test for full flow
- Test various filter combinations
UrlNormalizerService¶
- Unit test each normalization rule
- Integration test for rule chain
- Test URL edge cases:
- International domains
- Encoded characters
- Complex query parameters
- Fragment identifiers
- Trailing slashes
- Ensure idempotency (normalize(normalize(url)) == normalize(url))
SitemapProcessor¶
- Test XML sitemap parsing
- Test sitemap index handling
- Test URL extraction
- Test error cases (malformed XML, invalid URLs)
- Test nested sitemaps
Risk Assessment¶
Medium Risk:
- URL handling affects crawler functionality
- Incorrect normalization can cause duplicate detection issues
- Changes could affect existing crawl data
Mitigation:
- Comprehensive test coverage before refactoring
- Test with real-world URL examples
- Verify URL normalization consistency
- Test backward compatibility
- Monitor crawler metrics post-deployment
Estimated Effort¶
Medium:
- RoasterUrlsService analysis & refactoring: 1.5 days
- UrlNormalizerService refactoring: 2 days (Chain of Responsibility)
- SitemapProcessor refactoring: 1.5 days
- Testing & verification: 2 days
- Total: 7 days
Implementation Order¶
-
UrlNormalizerService (most critical for data quality)
- Highest impact on crawler reliability
- Chain of Responsibility is well-established pattern
-
SitemapProcessor (affects crawler coverage)
- Important for sitemap-based crawling
- Cleaner separation of concerns
-
RoasterUrlsService (admin feature)
- Lower priority (admin interface)
- Can be addressed when working on admin features
Dependencies¶
None - can be addressed independently
Related Issues¶
There's an existing plan: url-normalization.md in plans directory
- Review existing plan for context
- Coordinate with that plan if it addresses same issues
- May supersede or complement existing plan
Notes¶
- URL normalization is critical for duplicate detection
- Consider creating URL normalization specification document
- May want to make normalization rules configurable
- Chain of Responsibility pattern is ideal for URL processing rules
- UrlNormalizerService refactoring could prevent future URL-related bugs
- Consider adding URL normalization monitoring/metrics
- Test with wide variety of real-world URLs from target sites