Feature Implementation Plan: URL Normalization for Internationalized Domains

Executive Summary

Handle internationalized domain names (IDN) and special characters in URLs by normalizing them before making HTTP requests, while preserving original URLs in the database.

Key Principle: Normalize URLs just-in-time (during HTTP request building) rather than storing normalized versions, ensuring data integrity and traceability.

Problem Statement

URLs with non-Latin characters (e.g., Korean or Japanese roaster names) cause HTTP request failures. Examples:
- https://나나커피로스터스.com/products/ethiopia (Korean characters)
- https://コーヒー.jp/products/natural (Japanese characters)
- https://example.com/café colombia (special characters in path)

Currently, these URLs are stored in the database but fail when used in HTTP requests because they're not properly encoded.
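
To make the failure mode concrete, the sketch below shows what the path from the third example looks like once percent-encoded, and converts the hosts from the first two examples to their ASCII (Punycode) form when `idn_to_ascii()` is available. It is an illustration only, not the plan's implementation; the helper name `toAsciiHost` is hypothetical, and the exact Punycode strings are deliberately not shown.

```php
<?php

// Hypothetical helper: convert a Unicode host to its ASCII (Punycode) form.
// Falls back to the input when idn_to_ascii() is not defined
// or when the conversion fails.
function toAsciiHost(string $host): string
{
    if (!function_exists('idn_to_ascii')) {
        return $host;
    }

    $ascii = idn_to_ascii($host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);

    return $ascii === false ? $host : $ascii;
}

// Path segment from the third example: the space and "é" must be percent-encoded.
$segment = rawurlencode('café colombia');
echo $segment, PHP_EOL; // caf%C3%A9%20colombia

// Hosts from the first two examples.
foreach (['나나커피로스터스.com', 'コーヒー.jp'] as $host) {
    echo $host, ' => ', toAsciiHost($host), PHP_EOL;
}
```

A plain ASCII host such as `example.com` passes through `toAsciiHost()` unchanged.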

Technical Analysis

Current State

URL Handling:
- URLs with non-Latin characters cause HTTP request failures
- Original URLs are stored in the database but are not normalized for external requests
- The request builders for both the Spider and Crawl4Ai crawlers extend AbstractRequestBuilder, whose build() method calls createRequestPayload($url, $options) (lines 42-44)

Symfony Package Analysis

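Available Packages:
- symfony/intl (already installed): provides access to ICU locale data. Note that idn_to_ascii() is not part of this package; it is a function of PHP's intl extension (ext-intl), with symfony/polyfill-intl-idn available as a fallback when the extension is not loaded.
- symfony/http-client (already installed): HTTP client, but no URL normalization utilities.

Conclusion: Neither installed package offers dedicated URL normalization; we'll use native PHP functions alongside the existing packages.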
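
Because `idn_to_ascii()` is a function of PHP's intl extension (or of a polyfill that defines it), it may be worth confirming where it comes from in each environment before relying on it. The following is a minimal sketch of such a check; the function name `describeIdnSupport` is hypothetical.

```php
<?php

// Report where idn_to_ascii() would come from in the current runtime.
// extension_loaded('intl') is true only for the native extension; a polyfill
// defines the function without loading the extension.
function describeIdnSupport(): string
{
    if (extension_loaded('intl') && function_exists('idn_to_ascii')) {
        return 'native ext-intl';
    }

    if (function_exists('idn_to_ascii')) {
        return 'polyfill (function defined without ext-intl)';
    }

    return 'unavailable';
}

echo 'idn_to_ascii(): ', describeIdnSupport(), PHP_EOL;
```

If the result is `unavailable`, the IDN conversion step cannot run at all, so this is a deployment prerequisite rather than a runtime edge case.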

Integration Points

URL Normalization Injection Point:

AbstractRequestBuilder::build($url, $options)  [line 42]
    ↓
AbstractRequestBuilder::createRequestPayload($url, $options)  [line 44]
    ↓
Spider/CrawlRequestBuilder::createRequestPayload()  [line 38: 'url' => $url]
Crawl4Ai/Crawl4AiRequestBuilder::createRequestPayload()  [line 35: 'urls' => [$url]]

Recommendation: Inject UrlNormalizerService into AbstractRequestBuilder and normalize the URL in build() before passing it to createRequestPayload(). Both crawlers then benefit without code duplication.

Implementation Plan

Step 1: Create UrlNormalizerService

File: src/Service/Crawler/UrlNormalizerService.php

Implementation Details:

<?php

namespace App\Service\Crawler;

use Psr\Log\LoggerInterface;

final readonly class UrlNormalizerService
{
    public function __construct(
        private LoggerInterface $logger,
    ) {
    }

    /**
     * Normalize a URL for HTTP requests
     *
     * Handles:
     * - Internationalized domain names (IDN) -> ASCII
     * - Special characters in path -> URL encoded
     * - Missing scheme -> defaults to https
     *
     * @param string $url The URL to normalize
     * @return string The normalized URL
     */
    public function normalize(string $url): string
    {
        // 1. Add scheme if missing
        if (!preg_match('~^https?://~i', $url)) {
            $url = 'https://' . $url;
        }

        // 2. Parse URL into components
        $parsed = parse_url($url);
        if ($parsed === false || !isset($parsed['host'])) {
            $this->logger->warning('Failed to parse URL for normalization', [
                'url' => $url,
            ]);
            return $url; // Return original on failure
        }

        // 3. Convert IDN domain to ASCII
        $host = $parsed['host'];
        $asciiHost = idn_to_ascii($host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
        if ($asciiHost === false) {
            $this->logger->warning('Failed to convert IDN to ASCII', [
                'host' => $host,
                'url' => $url,
            ]);
            $asciiHost = $host; // Keep original if conversion fails
        }

        // 4. Encode path segments (but preserve slashes)
        $path = $parsed['path'] ?? '/';
        $normalizedPath = $this->encodePath($path);

        // 5. Reconstruct URL
        $normalized = ($parsed['scheme'] ?? 'https') . '://';

        if (isset($parsed['user'])) {
            $normalized .= $parsed['user'];
            if (isset($parsed['pass'])) {
                $normalized .= ':' . $parsed['pass'];
            }
            $normalized .= '@';
        }

        $normalized .= $asciiHost;

        if (isset($parsed['port'])) {
            $normalized .= ':' . $parsed['port'];
        }

        $normalized .= $normalizedPath;

        if (isset($parsed['query'])) {
            $normalized .= '?' . $parsed['query'];
        }

        if (isset($parsed['fragment'])) {
            $normalized .= '#' . $parsed['fragment'];
        }

        // Log normalization if URL changed
        if ($normalized !== $url) {
            $this->logger->debug('URL normalized for HTTP request', [
                'original' => $url,
                'normalized' => $normalized,
            ]);
        }

        return $normalized;
    }

    /**
     * Encode path segments while preserving slashes
     * Avoids double-encoding already-encoded characters
     */
    private function encodePath(string $path): string
    {
        // Split by slash, encode each segment, rejoin
        $segments = explode('/', $path);
        $encoded = array_map(function ($segment) {
            // Decode first to avoid double-encoding
            $decoded = rawurldecode($segment);
            // Re-encode
            return rawurlencode($decoded);
        }, $segments);

        return implode('/', $encoded);
    }
}

Technical Approach:
- Use parse_url() to extract scheme, host, path, query, fragment
- Apply idn_to_ascii() to the host for internationalized domains
- Use rawurlencode() on path segments (split by /, encode, rejoin)
- Decode before encoding to avoid double-encoding
- Leave query strings and fragments as-is
- Return the normalized URL, or the original on error (with logging)
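
As a quick illustration of the first step, the sketch below shows which components `parse_url()` returns for a fully populated URL, and why a default is needed when the path is absent. The URL and credentials are made up for the example.

```php
<?php

// parse_url() returns only the components that are present in the input.
$parts = parse_url('https://user:secret@example.com:8080/products/ethiopia?id=123#notes');

// Keys present for this input: scheme, host, port, user, pass, path, query, fragment.
foreach ($parts as $name => $value) {
    echo $name, ' = ', $value, PHP_EOL;
}

// A URL without a path has no "path" key, which is why normalize() falls back to "/".
$noPath = parse_url('https://example.com');
var_dump(isset($noPath['path'])); // bool(false)
```

Note that the port comes back as an integer while every other component is a string.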

Edge Cases Handled:
- URLs with already-encoded characters (decode then re-encode to avoid double-encoding)
- URLs without a scheme (default to https)
- Malformed URLs (log and return original)
- IDN conversion failures (log and use original host)
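
The decode-then-encode behaviour can be checked in isolation. The sketch below is a standalone copy of the idea behind encodePath(), under the hypothetical name `encodePathSegments`, with a few inputs that show both the intended result and one side effect worth knowing about.

```php
<?php

// Standalone copy of the decode-then-encode idea for path segments.
function encodePathSegments(string $path): string
{
    $encode = static fn (string $segment): string => rawurlencode(rawurldecode($segment));

    return implode('/', array_map($encode, explode('/', $path)));
}

echo encodePathSegments('/products/café colombia'), PHP_EOL; // /products/caf%C3%A9%20colombia
echo encodePathSegments('/products/caf%C3%A9'), PHP_EOL;     // /products/caf%C3%A9 (unchanged)
echo encodePathSegments('/a%2Fb'), PHP_EOL;                  // /a%2Fb (encoded slash survives)

// Side effect: rawurlencode() also encodes characters that are legal in a path, such as "+".
echo encodePathSegments('/blend/a+b'), PHP_EOL;              // /blend/a%2Bb
```

The last case is usually harmless, but a server that treats `+` and `%2B` differently in paths would see a different URL than the one stored.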

Step 2: Inject Normalizer into AbstractRequestBuilder

File: src/Service/Crawler/Implementations/Abstract/Http/AbstractRequestBuilder.php

Changes Required:

Add UrlNormalizerService to constructor
Modify build() method:

use App\Service\Crawler\UrlNormalizerService;

abstract class AbstractRequestBuilder implements RequestBuilderInterface
{
    protected string $apiUrl;
    protected string $apiToken;
    protected array $payload;

    public function __construct(
        protected readonly UrlNormalizerService $urlNormalizer,
    ) {
    }

    // ... existing methods ...

    public function build(string $url, array $options = []): static
    {
        $normalizedUrl = $this->urlNormalizer->normalize($url);
        $this->payload = $this->createRequestPayload($normalizedUrl, $options);

        return $this;
    }

    // ... rest of class ...
}

Impact:
- Both Spider/Http/CrawlRequestBuilder and Crawl4Ai/Http/Crawl4AiRequestBuilder automatically benefit
- No changes needed to the payload logic in crawler-specific classes
- Original URLs in the database remain untouched
- All HTTP requests use normalized URLs
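
The reason a single change in build() reaches both crawlers is that build() is the shared entry point and createRequestPayload() is the per-crawler hook. The following self-contained sketch mimics that shape with hypothetical `Sketch*` classes and a stand-in normalizer that only encodes spaces; it is not the project's code.

```php
<?php

// Hypothetical stand-in for the normalizer: it only percent-encodes spaces
// so that the effect is visible in the output.
$normalize = static fn (string $url): string => str_replace(' ', '%20', $url);

abstract class SketchRequestBuilder
{
    /** @var array<string, mixed> */
    protected array $payload = [];

    public function __construct(private readonly \Closure $normalize)
    {
    }

    // Normalization happens once, here, before the subclass hook runs.
    public function build(string $url): static
    {
        $this->payload = $this->createRequestPayload(($this->normalize)($url));

        return $this;
    }

    /** @return array<string, mixed> */
    public function getPayload(): array
    {
        return $this->payload;
    }

    /** @return array<string, mixed> */
    abstract protected function createRequestPayload(string $url): array;
}

final class SketchSpiderBuilder extends SketchRequestBuilder
{
    protected function createRequestPayload(string $url): array
    {
        return ['url' => $url];
    }
}

final class SketchCrawl4AiBuilder extends SketchRequestBuilder
{
    protected function createRequestPayload(string $url): array
    {
        return ['urls' => [$url]];
    }
}

$spider = (new SketchSpiderBuilder($normalize))->build('https://example.com/a b');
$crawl4ai = (new SketchCrawl4AiBuilder($normalize))->build('https://example.com/a b');

echo $spider->getPayload()['url'], PHP_EOL;       // https://example.com/a%20b
echo $crawl4ai->getPayload()['urls'][0], PHP_EOL; // https://example.com/a%20b
```

Neither subclass knows that normalization took place, which is the property the plan relies on.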

Concrete Request Builders (Spider/Http/CrawlRequestBuilder.php and Crawl4Ai/Http/Crawl4AiRequestBuilder.php):

Update both constructors to pass UrlNormalizerService to the parent:

public function __construct(
    UrlNormalizerService $urlNormalizer,
    ParameterBagInterface $params,
) {
    parent::__construct($urlNormalizer);
    // ... rest of constructor
}

Testing Strategy

Unit Tests

File: tests/Service/Crawler/UrlNormalizerServiceTest.php

<?php

namespace App\Tests\Service\Crawler;

use App\Service\Crawler\UrlNormalizerService;
use PHPUnit\Framework\TestCase;
use Psr\Log\LoggerInterface;

class UrlNormalizerServiceTest extends TestCase
{
    private UrlNormalizerService $normalizer;

    protected function setUp(): void
    {
        $logger = $this->createMock(LoggerInterface::class);
        $this->normalizer = new UrlNormalizerService($logger);
    }

    public function test_normalizes_internationalized_domain_names(): void
    {
        $input = "https://コーヒー.jp/products/ethiopia";
        $result = $this->normalizer->normalize($input);

        $this->assertStringStartsWith('https://xn--', $result);
        $this->assertStringContainsString('/products/ethiopia', $result);
    }

    public function test_encodes_special_characters_in_path(): void
    {
        $input = "https://example.com/café colombia";
        $result = $this->normalizer->normalize($input);

        // The accented character and the space are both percent-encoded
        $this->assertStringContainsString('/caf%C3%A9%20colombia', $result);
    }

    public function test_handles_already_encoded_urls(): void
    {
        $input = "https://example.com/caf%C3%A9";
        $result = $this->normalizer->normalize($input);

        // Should not double-encode
        $this->assertEquals("https://example.com/caf%C3%A9", $result);
        $this->assertStringNotContainsString('%25', $result);
    }

    public function test_preserves_query_parameters(): void
    {
        $input = "https://example.com/product?id=123&name=café";
        $result = $this->normalizer->normalize($input);

        $this->assertStringContainsString('?id=123&name=café', $result);
    }

    public function test_adds_missing_scheme(): void
    {
        $input = "example.com/products";
        $result = $this->normalizer->normalize($input);

        $this->assertStringStartsWith('https://example.com/products', $result);
    }

    public function test_handles_complex_idn_url(): void
    {
        $input = "https://나나커피로스터스.com/products/ethiopia-sidamo";
        $result = $this->normalizer->normalize($input);

        $this->assertStringStartsWith('https://', $result);
        $this->assertStringContainsString('/products/ethiopia-sidamo', $result);
        // Should not contain Korean characters
        $this->assertStringNotContainsString('나나', $result);
    }

    public function test_preserves_port_numbers(): void
    {
        $input = "https://example.com:8080/products";
        $result = $this->normalizer->normalize($input);

        $this->assertEquals("https://example.com:8080/products", $result);
    }

    public function test_handles_url_with_fragment(): void
    {
        $input = "https://example.com/products#section";
        $result = $this->normalizer->normalize($input);

        $this->assertStringContainsString('#section', $result);
    }
}
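
The unit tests above do not exercise the malformed-URL fallback listed among the edge cases. The sketch below isolates the two conditions that the guard in normalize() checks (a `false` return from `parse_url()` and a missing host), using a hypothetical helper `extractHost`; it could serve as a starting point for an additional test case.

```php
<?php

// Hypothetical helper mirroring the guard in normalize():
// returns null when parse_url() fails or finds no host.
function extractHost(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return null;
    }

    return $parts['host'];
}

// Seriously malformed input: parse_url() returns false.
var_dump(extractHost('http:///example.com'));       // NULL

// Parses successfully, but only as a path, so there is no host.
var_dump(extractHost('/only/a/path'));              // NULL

// Well-formed input.
var_dump(extractHost('https://example.com/products')); // string(11) "example.com"
```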

Integration Tests

File: tests/Service/Crawler/Implementations/Abstract/Http/AbstractRequestBuilderTest.php

// UrlNormalizerService is declared final readonly, so PHPUnit cannot mock it.
// These tests use the real service with Psr\Log\NullLogger instead.
// $this->params is assumed to be prepared in setUp() with whatever
// ParameterBagInterface values the concrete builders read.

public function test_normalizes_url_before_creating_payload(): void
{
    $urlNormalizer = new UrlNormalizerService(new NullLogger());

    $builder = new CrawlRequestBuilder($urlNormalizer, $this->params);
    $builder->build('https://コーヒー.jp/products/ethiopia');

    $payload = $builder->getPayload();
    $this->assertStringStartsWith('https://xn--', $payload['url']);
    $this->assertStringEndsWith('/products/ethiopia', $payload['url']);
}

public function test_spider_crawler_uses_normalized_url(): void
{
    $urlNormalizer = new UrlNormalizerService(new NullLogger());

    $builder = new CrawlRequestBuilder($urlNormalizer, $this->params);
    $builder->build('https://example.com/café');

    $payload = $builder->getPayload();
    $this->assertStringContainsString('%C3%A9', $payload['url']);
}

public function test_crawl4ai_crawler_uses_normalized_url(): void
{
    $urlNormalizer = new UrlNormalizerService(new NullLogger());

    $builder = new Crawl4AiRequestBuilder($urlNormalizer, $this->params);
    $builder->build('https://example.com/café');

    $payload = $builder->getPayload();
    $this->assertStringContainsString('%C3%A9', $payload['urls'][0]);
}

Manual Testing

URL Normalization Testing:

Add test CrawlUrl with Korean characters:

https://나나커피로스터스.com/products/ethiopia

Trigger crawl via app:crawler:run --url <crawl-url-id>
Check logs to verify:
"URL normalized for HTTP request" debug message appears
Shows original and normalized URLs
HTTP request uses normalized URL
Verify in database:
Original URL remains unchanged in crawl_url table
CrawlResult shows successful crawl

Test with Japanese domain:

https://コーヒー.jp/products/natural-process

Test with special characters in path:

https://example.com/products/café colombia

Rollout Plan

Development

✅ Create UrlNormalizerService with comprehensive logic
✅ Write unit tests for UrlNormalizerService
✅ Inject UrlNormalizerService into AbstractRequestBuilder
✅ Update concrete request builders (Spider, Crawl4Ai)
✅ Write integration tests for request builders
✅ Run full QA suite: make qa
✅ Fix any PHPStan/PHPCS issues

Testing

Deploy to staging environment
Run manual tests with real internationalized URLs
Monitor logs for normalization behavior:
Check debug logs for "URL normalized" messages
Verify original URLs shown in logs
Verify database integrity:
Original URLs unchanged
Crawls succeed
Test edge cases:
Already-encoded URLs
Malformed URLs
URLs without schemes

Production

Deploy with monitoring enabled
Track metrics:
URLs with normalization applied (count debug log entries)
Crawl success rate for internationalized domains
Any normalization errors/warnings
Monitor for 1 week
Review any failures or edge cases
Adjust normalization logic if needed

Success Criteria

Functional Success

[ ] URLs with internationalized domain names crawl successfully
[ ] URLs with special characters in paths crawl successfully
[ ] Original URLs in database remain unchanged
[ ] Normalized URLs are logged for debugging
[ ] No double-encoding issues
[ ] Edge cases handled gracefully (malformed URLs, missing schemes)

Technical Success

[ ] All QA tools pass (PHPStan, PHPCS, PHPUnit)
[ ] Unit test coverage > 90% for UrlNormalizerService
[ ] Integration tests pass for both Spider and Crawl4Ai
[ ] No regression in existing crawl functionality

Performance & Reliability

[ ] URL normalization adds < 10ms per request
[ ] No increase in failed crawls
[ ] No regression in existing functionality
[ ] Logging provides clear audit trail for debugging
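
One way to check the per-request latency criterion is a small micro-benchmark. The sketch below times a stand-in workload (`parse_url()` plus a decode/encode round trip) rather than the real service; the helper name `averageMillis` and the iteration count are arbitrary choices, and the callable would be swapped for a call to normalize() in practice.

```php
<?php

// Measure the average wall-clock cost of a callable in milliseconds.
function averageMillis(callable $fn, int $iterations): float
{
    $start = hrtime(true); // monotonic clock, in nanoseconds
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }

    return (hrtime(true) - $start) / $iterations / 1_000_000;
}

// Stand-in workload approximating the cheap parts of normalization.
$work = static function (): void {
    $parts = parse_url('https://example.com/products/caf%C3%A9?id=1');
    rawurlencode(rawurldecode($parts['path']));
};

printf("%.4f ms per call\n", averageMillis($work, 10_000));
```

Results from a loop like this exclude logging and IDN conversion, so they give a lower bound rather than a full picture.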

Future Enhancements

URL Handling

Add URL deduplication to detect URLs that normalize to the same value
Consider adding URL validation before normalization
Add metrics tracking for normalization frequency by domain
Support for more complex URL encoding scenarios
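
The deduplication idea in the first item could start as a simple grouping pass. The sketch below groups URLs by their normalized form and reports groups with more than one member; `findNormalizationCollisions` is a hypothetical name, and `rawurldecode()` is used only as a stand-in for the real normalizer.

```php
<?php

/**
 * Group URLs by their normalized form and keep only groups with more than one member.
 *
 * @param list<string>             $urls
 * @param callable(string): string $normalize
 * @return array<string, list<string>>
 */
function findNormalizationCollisions(array $urls, callable $normalize): array
{
    $groups = [];
    foreach ($urls as $url) {
        $groups[$normalize($url)][] = $url;
    }

    return array_filter($groups, static fn (array $group): bool => count($group) > 1);
}

$urls = [
    'https://example.com/caf%C3%A9',
    'https://example.com/café',
    'https://example.com/products',
];

// Stand-in normalizer (assumption): decoding maps both spellings to one key.
$collisions = findNormalizationCollisions($urls, 'rawurldecode');

print_r($collisions);
```

With the sample input, the two spellings of the café URL end up in one group and the products URL is filtered out.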

Monitoring

Dashboard showing:
URLs normalized per day
Normalization patterns (IDN vs special chars)
Domains requiring normalization (identify roasters)
Normalization errors/warnings

Risk Assessment

Low Risk

URL normalization: Non-destructive, only affects HTTP requests
Fallback behavior: Returns original URL on errors

Medium Risk

Edge cases in URL normalization: Malformed URLs, already-encoded URLs
IDN conversion failures: Some domains may not convert properly

Mitigation

Extensive testing before deployment
Logging for all normalization operations
Fallback to original URL on any errors
Monitoring and alerting for first week in production
Rollback plan: Remove normalization injection, redeploy previous version

Checklist

Implementation

[ ] Create UrlNormalizerService with comprehensive logic
[ ] Write unit tests for UrlNormalizerService
[ ] Inject UrlNormalizerService into AbstractRequestBuilder
[ ] Update Spider/Http/CrawlRequestBuilder constructor
[ ] Update Crawl4Ai/Http/Crawl4AiRequestBuilder constructor
[ ] Write integration tests for request builders
[ ] Run full QA suite and fix any issues

Testing

[ ] Manual test with internationalized domain names (Korean)
[ ] Manual test with internationalized domain names (Japanese)
[ ] Manual test with special characters in paths
[ ] Manual test with already-encoded URLs
[ ] Manual test with malformed URLs
[ ] Verify original URLs remain unchanged in database
[ ] Verify normalized URLs appear in logs

Deployment

[ ] Deploy to staging
[ ] Run staging tests
[ ] Deploy to production
[ ] Monitor for 1 week
[ ] Review normalization logs
[ ] Adjust logic if needed
[ ] Document lessons learned