Feature Implementation Plan: URL Normalization for Internationalized Domains¶
Executive Summary¶
Handle internationalized domain names (IDN) and special characters in URLs by normalizing them before making HTTP requests, while preserving original URLs in the database.
Key Principle: Normalize URLs just-in-time (during HTTP request building) rather than storing normalized versions, ensuring data integrity and traceability.
Problem Statement¶
URLs with non-Latin characters (e.g., Korean, Japanese roaster names) cause HTTP request failures. Examples:
- https://나나커피로스터스.com/products/ethiopia (Korean characters)
- https://コーヒー.jp/products/natural (Japanese characters)
- https://example.com/café colombia (special characters in path)
Currently, these URLs are stored in the database but fail when used in HTTP requests because they're not properly encoded.
Technical Analysis¶
Current State¶
URL Handling:
- URLs with non-Latin characters cause HTTP request failures
- Original URLs are stored in database but not normalized for external requests
- Both Spider and Crawl4Ai crawlers inherit from AbstractRequestBuilder which calls createRequestPayload($url, $options) at line 42-44
Symfony Package Analysis¶
Available Packages:
- symfony/intl (already installed) - Provides idn_to_ascii() for internationalized domain names
- symfony/http-client (already installed) - HTTP client but no URL normalization utilities
- Conclusion: No dedicated URL normalization package exists; we'll use native PHP functions with existing packages
Integration Points¶
URL Normalization Injection Point:
AbstractRequestBuilder::build($url, $options) [line 42]
↓
AbstractRequestBuilder::createRequestPayload($url, $options) [line 44]
↓
Spider/CrawlRequestBuilder::createRequestPayload() [line 38: 'url' => $url]
Crawl4Ai/Crawl4AiRequestBuilder::createRequestPayload() [line 35: 'urls' => [$url]]
Recommendation: Inject UrlNormalizerService into AbstractRequestBuilder and normalize URL in build() method before passing to createRequestPayload(). This ensures both crawlers benefit without code duplication.
Implementation Plan¶
Step 1: Create UrlNormalizerService¶
File: src/Service/Crawler/UrlNormalizerService.php
Implementation Details:
<?php
namespace App\Service\Crawler;
use Psr\Log\LoggerInterface;
final readonly class UrlNormalizerService
{
public function __construct(
private LoggerInterface $logger,
) {
}
/**
* Normalize a URL for HTTP requests
*
* Handles:
* - Internationalized domain names (IDN) -> ASCII
* - Special characters in path -> URL encoded
* - Missing scheme -> defaults to https
*
* @param string $url The URL to normalize
* @return string The normalized URL
*/
public function normalize(string $url): string
{
// 1. Add scheme if missing
if (!preg_match('~^https?://~i', $url)) {
$url = 'https://' . $url;
}
// 2. Parse URL into components
$parsed = parse_url($url);
if ($parsed === false || !isset($parsed['host'])) {
$this->logger->warning('Failed to parse URL for normalization', [
'url' => $url,
]);
return $url; // Return original on failure
}
// 3. Convert IDN domain to ASCII
$host = $parsed['host'];
$asciiHost = idn_to_ascii($host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
if ($asciiHost === false) {
$this->logger->warning('Failed to convert IDN to ASCII', [
'host' => $host,
'url' => $url,
]);
$asciiHost = $host; // Keep original if conversion fails
}
// 4. Encode path segments (but preserve slashes)
$path = $parsed['path'] ?? '/';
$normalizedPath = $this->encodePath($path);
// 5. Reconstruct URL
$normalized = ($parsed['scheme'] ?? 'https') . '://';
if (isset($parsed['user'])) {
$normalized .= $parsed['user'];
if (isset($parsed['pass'])) {
$normalized .= ':' . $parsed['pass'];
}
$normalized .= '@';
}
$normalized .= $asciiHost;
if (isset($parsed['port'])) {
$normalized .= ':' . $parsed['port'];
}
$normalized .= $normalizedPath;
if (isset($parsed['query'])) {
$normalized .= '?' . $parsed['query'];
}
if (isset($parsed['fragment'])) {
$normalized .= '#' . $parsed['fragment'];
}
// Log normalization if URL changed
if ($normalized !== $url) {
$this->logger->debug('URL normalized for HTTP request', [
'original' => $url,
'normalized' => $normalized,
]);
}
return $normalized;
}
/**
* Encode path segments while preserving slashes
* Avoids double-encoding already-encoded characters
*/
private function encodePath(string $path): string
{
// Split by slash, encode each segment, rejoin
$segments = explode('/', $path);
$encoded = array_map(function ($segment) {
// Decode first to avoid double-encoding
$decoded = rawurldecode($segment);
// Re-encode
return rawurlencode($decoded);
}, $segments);
return implode('/', $encoded);
}
}
Technical Approach:
- Use parse_url() to extract scheme, host, path, query, fragment
- Apply idn_to_ascii() to host for internationalized domains
- Use rawurlencode() on path segments (split by /, encode, rejoin)
- Decode before encoding to avoid double-encoding
- Leave query strings and fragments as-is
- Return normalized URL or original on error (with logging)
Edge Cases Handled: - URLs with already-encoded characters (decode then re-encode to avoid double-encoding) - URLs without scheme (default to https) - Malformed URLs (log and return original) - IDN conversion failures (log and use original host)
Step 2: Inject Normalizer into AbstractRequestBuilder¶
File: src/Service/Crawler/Implementations/Abstract/Http/AbstractRequestBuilder.php
Changes Required:
- Add
UrlNormalizerServiceto constructor - Modify
build()method:
use App\Service\Crawler\UrlNormalizerService;
abstract class AbstractRequestBuilder implements RequestBuilderInterface
{
protected string $apiUrl;
protected string $apiToken;
protected array $payload;
public function __construct(
protected readonly UrlNormalizerService $urlNormalizer,
) {
}
// ... existing methods ...
public function build(string $url, array $options = []): static
{
$normalizedUrl = $this->urlNormalizer->normalize($url);
$this->payload = $this->createRequestPayload($normalizedUrl, $options);
return $this;
}
// ... rest of class ...
}
Impact:
- Both Spider/Http/CrawlRequestBuilder and Crawl4Ai/Http/Crawl4AiRequestBuilder automatically benefit
- No changes needed in crawler-specific classes
- Original URLs in database remain untouched
- All HTTP requests use normalized URLs
Concrete Request Builders (Spider/Http/CrawlRequestBuilder.php and Crawl4Ai/Http/Crawl4AiRequestBuilder.php):
Need to update constructors to pass UrlNormalizerService to parent:
public function __construct(
UrlNormalizerService $urlNormalizer,
ParameterBagInterface $params,
) {
parent::__construct($urlNormalizer);
// ... rest of constructor
}
Testing Strategy¶
Unit Tests¶
File: tests/Service/Crawler/UrlNormalizerServiceTest.php
<?php
namespace App\Tests\Service\Crawler;
use App\Service\Crawler\UrlNormalizerService;
use PHPUnit\Framework\TestCase;
use Psr\Log\LoggerInterface;
class UrlNormalizerServiceTest extends TestCase
{
private UrlNormalizerService $normalizer;
protected function setUp(): void
{
$logger = $this->createMock(LoggerInterface::class);
$this->normalizer = new UrlNormalizerService($logger);
}
public function test_normalizes_internationalized_domain_names(): void
{
$input = "https://コーヒー.jp/products/ethiopia";
$result = $this->normalizer->normalize($input);
$this->assertStringStartsWith('https://xn--', $result);
$this->assertStringContainsString('/products/ethiopia', $result);
}
public function test_encodes_special_characters_in_path(): void
{
$input = "https://example.com/café colombia";
$result = $this->normalizer->normalize($input);
$this->assertStringContainsString('caf%C3%A9', $result);
$this->assertStringContainsString('colombia', $result);
}
public function test_handles_already_encoded_urls(): void
{
$input = "https://example.com/caf%C3%A9";
$result = $this->normalizer->normalize($input);
// Should not double-encode
$this->assertEquals("https://example.com/caf%C3%A9", $result);
$this->assertStringNotContainsString('%25', $result);
}
public function test_preserves_query_parameters(): void
{
$input = "https://example.com/product?id=123&name=café";
$result = $this->normalizer->normalize($input);
$this->assertStringContainsString('?id=123&name=café', $result);
}
public function test_adds_missing_scheme(): void
{
$input = "example.com/products";
$result = $this->normalizer->normalize($input);
$this->assertStringStartsWith('https://example.com/products', $result);
}
public function test_handles_complex_idn_url(): void
{
$input = "https://나나커피로스터스.com/products/ethiopia-sidamo";
$result = $this->normalizer->normalize($input);
$this->assertStringStartsWith('https://', $result);
$this->assertStringContainsString('/products/ethiopia-sidamo', $result);
// Should not contain Korean characters
$this->assertStringNotContainsString('나나', $result);
}
public function test_preserves_port_numbers(): void
{
$input = "https://example.com:8080/products";
$result = $this->normalizer->normalize($input);
$this->assertEquals("https://example.com:8080/products", $result);
}
public function test_handles_url_with_fragment(): void
{
$input = "https://example.com/products#section";
$result = $this->normalizer->normalize($input);
$this->assertStringContainsString('#section', $result);
}
}
Integration Tests¶
File: tests/Service/Crawler/Implementations/Abstract/Http/AbstractRequestBuilderTest.php
public function test_normalizes_url_before_creating_payload(): void
{
$urlNormalizer = $this->createMock(UrlNormalizerService::class);
$urlNormalizer->expects($this->once())
->method('normalize')
->with('https://コーヒー.jp/products/ethiopia')
->willReturn('https://xn--gckq7d0d.jp/products/ethiopia');
$builder = new CrawlRequestBuilder($urlNormalizer, $params);
$builder->build('https://コーヒー.jp/products/ethiopia');
$payload = $builder->getPayload();
$this->assertEquals('https://xn--gckq7d0d.jp/products/ethiopia', $payload['url']);
}
public function test_spider_crawler_uses_normalized_url(): void
{
$builder = new CrawlRequestBuilder($urlNormalizer, $params);
$builder->build('https://example.com/café');
$payload = $builder->getPayload();
$this->assertStringContainsString('%C3%A9', $payload['url']);
}
public function test_crawl4ai_crawler_uses_normalized_url(): void
{
$builder = new Crawl4AiRequestBuilder($urlNormalizer, $params);
$builder->build('https://example.com/café');
$payload = $builder->getPayload();
$this->assertStringContainsString('%C3%A9', $payload['urls'][0]);
}
Manual Testing¶
URL Normalization Testing:
-
Add test CrawlUrl with Korean characters:
-
Trigger crawl via
app:crawler:run --url <crawl-url-id> -
Check logs to verify:
- "URL normalized for HTTP request" debug message appears
- Shows original and normalized URLs
-
HTTP request uses normalized URL
-
Verify in database:
- Original URL remains unchanged in
crawl_urltable -
CrawlResult shows successful crawl
-
Test with Japanese domain:
-
Test with special characters in path:
Rollout Plan¶
Development¶
- ✅ Create
UrlNormalizerServicewith comprehensive logic - ✅ Write unit tests for
UrlNormalizerService - ✅ Inject
UrlNormalizerServiceintoAbstractRequestBuilder - ✅ Update concrete request builders (Spider, Crawl4Ai)
- ✅ Write integration tests for request builders
- ✅ Run full QA suite:
make qa - ✅ Fix any PHPStan/PHPCS issues
Testing¶
- Deploy to staging environment
- Run manual tests with real internationalized URLs
- Monitor logs for normalization behavior:
- Check debug logs for "URL normalized" messages
- Verify original URLs shown in logs
- Verify database integrity:
- Original URLs unchanged
- Crawls succeed
- Test edge cases:
- Already-encoded URLs
- Malformed URLs
- URLs without schemes
Production¶
- Deploy with monitoring enabled
- Track metrics:
- URLs with normalization applied (count debug log entries)
- Crawl success rate for internationalized domains
- Any normalization errors/warnings
- Monitor for 1 week
- Review any failures or edge cases
- Adjust normalization logic if needed
Success Criteria¶
Functional Success¶
- [ ] URLs with internationalized domain names crawl successfully
- [ ] URLs with special characters in paths crawl successfully
- [ ] Original URLs in database remain unchanged
- [ ] Normalized URLs are logged for debugging
- [ ] No double-encoding issues
- [ ] Edge cases handled gracefully (malformed URLs, missing schemes)
Technical Success¶
- [ ] All QA tools pass (PHPStan, PHPCS, PHPUnit)
- [ ] Unit test coverage > 90% for UrlNormalizerService
- [ ] Integration tests pass for both Spider and Crawl4Ai
- [ ] No regression in existing crawl functionality
Performance & Reliability¶
- [ ] URL normalization adds < 10ms per request
- [ ] No increase in failed crawls
- [ ] No regression in existing functionality
- [ ] Logging provides clear audit trail for debugging
Future Enhancements¶
URL Handling¶
- Add URL deduplication to detect URLs that normalize to the same value
- Consider adding URL validation before normalization
- Add metrics tracking for normalization frequency by domain
- Support for more complex URL encoding scenarios
Monitoring¶
- Dashboard showing:
- URLs normalized per day
- Normalization patterns (IDN vs special chars)
- Domains requiring normalization (identify roasters)
- Normalization errors/warnings
Risk Assessment¶
Low Risk¶
- URL normalization: Non-destructive, only affects HTTP requests
- Fallback behavior: Returns original URL on errors
Medium Risk¶
- Edge cases in URL normalization: Malformed URLs, already-encoded URLs
- IDN conversion failures: Some domains may not convert properly
Mitigation¶
- Extensive testing before deployment
- Logging for all normalization operations
- Fallback to original URL on any errors
- Monitoring and alerting for first week in production
- Rollback plan: Remove normalization injection, redeploy previous version
Checklist¶
Implementation¶
- [ ] Create
UrlNormalizerServicewith comprehensive logic - [ ] Write unit tests for
UrlNormalizerService - [ ] Inject
UrlNormalizerServiceintoAbstractRequestBuilder - [ ] Update
Spider/Http/CrawlRequestBuilderconstructor - [ ] Update
Crawl4Ai/Http/Crawl4AiRequestBuilderconstructor - [ ] Write integration tests for request builders
- [ ] Run full QA suite and fix any issues
Testing¶
- [ ] Manual test with internationalized domain names (Korean)
- [ ] Manual test with internationalized domain names (Japanese)
- [ ] Manual test with special characters in paths
- [ ] Manual test with already-encoded URLs
- [ ] Manual test with malformed URLs
- [ ] Verify original URLs remain unchanged in database
- [ ] Verify normalized URLs appear in logs
Deployment¶
- [ ] Deploy to staging
- [ ] Run staging tests
- [ ] Deploy to production
- [ ] Monitor for 1 week
- [ ] Review normalization logs
- [ ] Adjust logic if needed
- [ ] Document lessons learned