Refactor URL Handling Services¶

Priority: 🟡 MEDIUM Status: Planning Related QA Analysis: qa-analysis-overview.md

Problem Statement¶

Three URL-related services have complexity issues affecting crawler reliability:

RoasterUrlsService¶

File: Service/Admin/RoasterUrlsService.php:34

Method: getRoasterUrlsData() (line 34)
NPath Complexity: 1,280/250 (412% over threshold)

UrlNormalizerService¶

File: Service/Crawler/UrlNormalizerService.php:25

Method: normalize() (line 25)
Cyclomatic Complexity: 11/10 (10% over)
NPath Complexity: 576/250 (130% over)

SitemapProcessor¶

File: Service/Crawler/Processing/SitemapProcessor.php:55

Method: processSitemap() (line 55)
NPath Complexity: 288/250 (15% over)

Impact¶

Crawler Reliability: URL handling is critical for crawler functionality
Bug Risk: Complex URL logic can lead to subtle bugs
Data Quality: Incorrect URL normalization affects data integrity
Maintainability: Difficult to add new URL handling rules

Guideline Violations¶

SOLID - Single Responsibility Principle: Methods doing too much
Complexity: High NPath indicates many execution paths

Root Cause Analysis¶

RoasterUrlsService (NPath: 1,280)¶

Likely handles:

Fetching roaster URLs from database
Filtering roaster data
Complex query building
Data transformation for admin interface
Multiple conditional branches for different data states

UrlNormalizerService (CC: 11, NPath: 576)¶

Likely handles:

Multiple URL normalization rules
Query parameter handling
Fragment removal
Path normalization
Protocol handling
Trailing slash logic
Special case handling
Domain normalization

SitemapProcessor (NPath: 288)¶

Likely handles:

Sitemap XML parsing
URL extraction
URL validation
Sitemap index processing
Nested sitemap handling
Error handling
URL filtering

Proposed Refactoring Strategy¶

Step 1: Analyze Each Service¶

Review existing code to understand:

All responsibilities of each service
Conditional logic branches
URL handling rules
Edge cases and special handling
Dependencies and interactions

Step 2: RoasterUrlsService Refactoring¶

Issues:

Very high NPath (1,280) suggests complex query building
Likely mixing data access, business logic, and presentation

Proposed Solution:

Extract query building to repository method
Create dedicated query builder or criteria
Extract data transformation to mapper/transformer
Simplify service to orchestration only

Example:

public function getRoasterUrlsData(RoasterFilterCriteria $criteria): array
{
    // Simple orchestration
    $roasters = $this->repository->findWithUrls($criteria);
    return $this->transformer->transformForAdmin($roasters);
}

Step 3: UrlNormalizerService Refactoring¶

Issues:

Multiple normalization rules in single method
Complex conditional logic

Proposed Solution:

Chain of Responsibility Pattern for normalization rules:

interface NormalizationRule
{
    public function apply(string $url): string;
}

class RemoveFragmentRule implements NormalizationRule { ... }
class NormalizePathRule implements NormalizationRule { ... }
class NormalizeQueryParamsRule implements NormalizationRule { ... }
class RemoveTrailingSlashRule implements NormalizationRule { ... }

class UrlNormalizerService
{
    public function normalize(string $url): string
    {
        foreach ($this->rules as $rule) {
            $url = $rule->apply($url);
        }
        return $url;
    }
}

Benefits:

Each rule is independently testable
Easy to add new normalization rules
Clear separation of concerns
Configurable rule order

Step 4: SitemapProcessor Refactoring¶

Issues:

Multiple responsibilities (parsing, validation, processing)
Complex control flow

Proposed Solution:

Extract XML parsing to dedicated parser
Extract URL validation to validator
Create sitemap index handler
Simplify main processing method

Example:

public function processSitemap(string $sitemapUrl): array
{
    $content = $this->fetcher->fetch($sitemapUrl);
    $sitemap = $this->parser->parse($content);

    if ($sitemap->isIndex()) {
        return $this->indexHandler->process($sitemap);
    }

    return $this->urlExtractor->extract($sitemap);
}

Success Criteria¶

RoasterUrlsService¶

NPath complexity < 250
Query building in repository
Data transformation separated
Service focuses on orchestration

UrlNormalizerService¶

Cyclomatic complexity < 10
NPath complexity < 250
Chain of Responsibility for rules
Each rule independently testable
Easy to add/remove/reorder rules

SitemapProcessor¶

NPath complexity < 250
Clear separation of parsing, validation, processing
Easy to extend for new sitemap formats
Improved error handling

Testing Strategy¶

RoasterUrlsService¶

Test query building separately
Test data transformation separately
Integration test for full flow
Test various filter combinations

UrlNormalizerService¶

Unit test each normalization rule
Integration test for rule chain
Test URL edge cases:
- International domains
- Encoded characters
- Complex query parameters
- Fragment identifiers
- Trailing slashes
Ensure idempotency (normalize(normalize(url)) == normalize(url))

SitemapProcessor¶

Test XML sitemap parsing
Test sitemap index handling
Test URL extraction
Test error cases (malformed XML, invalid URLs)
Test nested sitemaps

Risk Assessment¶

Medium Risk:

URL handling affects crawler functionality
Incorrect normalization can cause duplicate detection issues
Changes could affect existing crawl data

Mitigation:

Comprehensive test coverage before refactoring
Test with real-world URL examples
Verify URL normalization consistency
Test backward compatibility
Monitor crawler metrics post-deployment

Estimated Effort¶

Medium:

RoasterUrlsService analysis & refactoring: 1.5 days
UrlNormalizerService refactoring: 2 days (Chain of Responsibility)
SitemapProcessor refactoring: 1.5 days
Testing & verification: 2 days
Total: 7 days

Implementation Order¶

UrlNormalizerService (most critical for data quality)
- Highest impact on crawler reliability
- Chain of Responsibility is well-established pattern
SitemapProcessor (affects crawler coverage)
- Important for sitemap-based crawling
- Cleaner separation of concerns
RoasterUrlsService (admin feature)
- Lower priority (admin interface)
- Can be addressed when working on admin features

Dependencies¶

None - can be addressed independently

There's an existing plan: url-normalization.md in plans directory

Review existing plan for context
Coordinate with that plan if it addresses same issues
May supersede or complement existing plan

Notes¶

URL normalization is critical for duplicate detection
Consider creating URL normalization specification document
May want to make normalization rules configurable
Chain of Responsibility pattern is ideal for URL processing rules
UrlNormalizerService refactoring could prevent future URL-related bugs
Consider adding URL normalization monitoring/metrics
Test with wide variety of real-world URLs from target sites

Refactor URL Handling Services¶

Problem Statement¶

RoasterUrlsService¶

UrlNormalizerService¶

SitemapProcessor¶

Impact¶

Guideline Violations¶

Root Cause Analysis¶

RoasterUrlsService (NPath: 1,280)¶

UrlNormalizerService (CC: 11, NPath: 576)¶

SitemapProcessor (NPath: 288)¶

Proposed Refactoring Strategy¶

Step 1: Analyze Each Service¶

Step 2: RoasterUrlsService Refactoring¶

Step 3: UrlNormalizerService Refactoring¶

Step 4: SitemapProcessor Refactoring¶

Success Criteria¶

RoasterUrlsService¶

UrlNormalizerService¶

SitemapProcessor¶

Testing Strategy¶

RoasterUrlsService¶

UrlNormalizerService¶

SitemapProcessor¶

Risk Assessment¶

Estimated Effort¶

Implementation Order¶

Dependencies¶

Related Issues¶

Notes¶