Skip to content

Refactor URL Handling Services

Priority: 🟡 MEDIUM Status: Planning Related QA Analysis: qa-analysis-overview.md

Problem Statement

Three URL-related services have complexity issues affecting crawler reliability:

RoasterUrlsService

File: Service/Admin/RoasterUrlsService.php:34

  • Method: getRoasterUrlsData() (line 34)
  • NPath Complexity: 1,280/250 (412% over threshold)

UrlNormalizerService

File: Service/Crawler/UrlNormalizerService.php:25

  • Method: normalize() (line 25)
  • Cyclomatic Complexity: 11/10 (10% over)
  • NPath Complexity: 576/250 (130% over)

SitemapProcessor

File: Service/Crawler/Processing/SitemapProcessor.php:55

  • Method: processSitemap() (line 55)
  • NPath Complexity: 288/250 (15% over)

Impact

  • Crawler Reliability: URL handling is critical for crawler functionality
  • Bug Risk: Complex URL logic can lead to subtle bugs
  • Data Quality: Incorrect URL normalization affects data integrity
  • Maintainability: Difficult to add new URL handling rules

Guideline Violations

  • SOLID - Single Responsibility Principle: Methods doing too much
  • Complexity: High NPath indicates many execution paths

Root Cause Analysis

RoasterUrlsService (NPath: 1,280)

Likely handles:

  • Fetching roaster URLs from database
  • Filtering roaster data
  • Complex query building
  • Data transformation for admin interface
  • Multiple conditional branches for different data states

UrlNormalizerService (CC: 11, NPath: 576)

Likely handles:

  • Multiple URL normalization rules
  • Query parameter handling
  • Fragment removal
  • Path normalization
  • Protocol handling
  • Trailing slash logic
  • Special case handling
  • Domain normalization

SitemapProcessor (NPath: 288)

Likely handles:

  • Sitemap XML parsing
  • URL extraction
  • URL validation
  • Sitemap index processing
  • Nested sitemap handling
  • Error handling
  • URL filtering

Proposed Refactoring Strategy

Step 1: Analyze Each Service

Review existing code to understand:

  • All responsibilities of each service
  • Conditional logic branches
  • URL handling rules
  • Edge cases and special handling
  • Dependencies and interactions

Step 2: RoasterUrlsService Refactoring

Issues:

  • Very high NPath (1,280) suggests complex query building
  • Likely mixing data access, business logic, and presentation

Proposed Solution:

  1. Extract query building to repository method
  2. Create dedicated query builder or criteria
  3. Extract data transformation to mapper/transformer
  4. Simplify service to orchestration only

Example:

public function getRoasterUrlsData(RoasterFilterCriteria $criteria): array
{
    // Simple orchestration
    $roasters = $this->repository->findWithUrls($criteria);
    return $this->transformer->transformForAdmin($roasters);
}

Step 3: UrlNormalizerService Refactoring

Issues:

  • Multiple normalization rules in single method
  • Complex conditional logic

Proposed Solution:

  • Chain of Responsibility Pattern for normalization rules:
interface NormalizationRule
{
    public function apply(string $url): string;
}

class RemoveFragmentRule implements NormalizationRule { ... }
class NormalizePathRule implements NormalizationRule { ... }
class NormalizeQueryParamsRule implements NormalizationRule { ... }
class RemoveTrailingSlashRule implements NormalizationRule { ... }

class UrlNormalizerService
{
    public function normalize(string $url): string
    {
        foreach ($this->rules as $rule) {
            $url = $rule->apply($url);
        }
        return $url;
    }
}

Benefits:

  • Each rule is independently testable
  • Easy to add new normalization rules
  • Clear separation of concerns
  • Configurable rule order

Step 4: SitemapProcessor Refactoring

Issues:

  • Multiple responsibilities (parsing, validation, processing)
  • Complex control flow

Proposed Solution:

  1. Extract XML parsing to dedicated parser
  2. Extract URL validation to validator
  3. Create sitemap index handler
  4. Simplify main processing method

Example:

public function processSitemap(string $sitemapUrl): array
{
    $content = $this->fetcher->fetch($sitemapUrl);
    $sitemap = $this->parser->parse($content);

    if ($sitemap->isIndex()) {
        return $this->indexHandler->process($sitemap);
    }

    return $this->urlExtractor->extract($sitemap);
}

Success Criteria

RoasterUrlsService

  • NPath complexity < 250
  • Query building in repository
  • Data transformation separated
  • Service focuses on orchestration

UrlNormalizerService

  • Cyclomatic complexity < 10
  • NPath complexity < 250
  • Chain of Responsibility for rules
  • Each rule independently testable
  • Easy to add/remove/reorder rules

SitemapProcessor

  • NPath complexity < 250
  • Clear separation of parsing, validation, processing
  • Easy to extend for new sitemap formats
  • Improved error handling

Testing Strategy

RoasterUrlsService

  • Test query building separately
  • Test data transformation separately
  • Integration test for full flow
  • Test various filter combinations

UrlNormalizerService

  • Unit test each normalization rule
  • Integration test for rule chain
  • Test URL edge cases:
    • International domains
    • Encoded characters
    • Complex query parameters
    • Fragment identifiers
    • Trailing slashes
  • Ensure idempotency (normalize(normalize(url)) == normalize(url))

SitemapProcessor

  • Test XML sitemap parsing
  • Test sitemap index handling
  • Test URL extraction
  • Test error cases (malformed XML, invalid URLs)
  • Test nested sitemaps

Risk Assessment

Medium Risk:

  • URL handling affects crawler functionality
  • Incorrect normalization can cause duplicate detection issues
  • Changes could affect existing crawl data

Mitigation:

  • Comprehensive test coverage before refactoring
  • Test with real-world URL examples
  • Verify URL normalization consistency
  • Test backward compatibility
  • Monitor crawler metrics post-deployment

Estimated Effort

Medium:

  • RoasterUrlsService analysis & refactoring: 1.5 days
  • UrlNormalizerService refactoring: 2 days (Chain of Responsibility)
  • SitemapProcessor refactoring: 1.5 days
  • Testing & verification: 2 days
  • Total: 7 days

Implementation Order

  1. UrlNormalizerService (most critical for data quality)

    • Highest impact on crawler reliability
    • Chain of Responsibility is well-established pattern
  2. SitemapProcessor (affects crawler coverage)

    • Important for sitemap-based crawling
    • Cleaner separation of concerns
  3. RoasterUrlsService (admin feature)

    • Lower priority (admin interface)
    • Can be addressed when working on admin features

Dependencies

None - can be addressed independently

There's an existing plan: url-normalization.md in plans directory

  • Review existing plan for context
  • Coordinate with that plan if it addresses same issues
  • May supersede or complement existing plan

Notes

  • URL normalization is critical for duplicate detection
  • Consider creating URL normalization specification document
  • May want to make normalization rules configurable
  • Chain of Responsibility pattern is ideal for URL processing rules
  • UrlNormalizerService refactoring could prevent future URL-related bugs
  • Consider adding URL normalization monitoring/metrics
  • Test with wide variety of real-world URLs from target sites