Feature Implementation Plan: Enhanced Product Detection for Whole Bean Coffee

Executive Summary

Improve early filtering of non-whole-bean products (drip bags, pods, capsules, ground-only coffee) through enhanced URL pattern classification and LLM content analysis.

Key Goals:
1. Filter out non-whole-bean products early to avoid wasted crawl processing
2. Correctly identify pages that offer BOTH whole bean and ground options as valid
3. Push LLM to use score 100 for complete whole bean products (not hedging at 85-90)
4. Maintain sensible scoring: drip bags score higher than blogs/equipment but are still filtered

Problem Statement

Current product detection has several issues:

1. Missing product types in URL classification: The prompt mentions "equipment, accessories, merchandise" but MISSES drip bags, pods, capsules, ground coffee
   - Example: https://nanacoffeeroasters.com/products/drip-bag-ethiopia-sidamoguji-anasora may score 40-60 instead of being filtered
2. Ground coffee ambiguity: Many product pages offer BOTH whole bean AND ground options as selectable variations
   - If we're too strict, the ground option could cause the page to be rejected
   - We should only filter EXCLUSIVE ground-only products
3. LLM hedging on scores: Even for clear, complete whole bean products, LLM often scores 85-90 instead of 100
   - Lacks clear criteria for when to use 100 vs 90-99 vs 80-89
4. Inconsistent categorization: Drip bags shouldn't score as low as blogs/equipment - they're still coffee products, just not our target

Technical Analysis

Current State

URL Pattern Classification (UrlPatternClassificationService.php:208-227):

NOT Coffee Beans:
- Equipment (grinders, filters), accessories, merchandise, gift cards, blog posts, collections

Missing: drip bags, pods, capsules, ground coffee, instant coffee

Content Detection (coffee-bean.schema.json:23-26):

Guidelines for scoring:
- 80-100: Clearly a single coffee bean product with specific details AND purchase options
- 60-80: Likely a coffee bean product with purchase options but missing some specific details
...

Issues:
- No explicit criteria for score 100 vs 80-89
- Doesn't address product pages with both whole bean + ground options
- Doesn't distinguish drip bags (still coffee) from equipment (non-coffee)

Current Filtering Threshold

Confidence filtering currently uses a threshold of 70.0 in several places. This plan lowers the threshold to 40.0 and relies on finer-grained score bands to keep non-whole-bean products below it.

Implementation Plan

Phase 1: Enhance URL Pattern Classification

File: src/Service/Crawler/ContentDetection/UrlPatternClassificationService.php

Current Prompt (lines 208-227):

NOT Coffee Beans:
- Equipment (grinders, filters), accessories, merchandise, gift cards, blog posts, collections

Enhanced Prompt:

Analyze these URLs and score each from 0-100 based on how likely they are to be whole bean coffee product pages.

Coffee Bean Indicators:
- Country/region + coffee terms (e.g., /ethiopia-yirgacheffe, /colombia-huila)
- Farm names, altitude markers (1800m), variety names (gesha, sl28)
- Processing terms (anaerobic, washed, natural)

NOT Coffee Beans - Score by Category:
- Non-whole-bean coffee products (20-39): drip bags, pods, capsules, ground-only coffee, instant coffee
- Non-coffee content (0-19): equipment (grinders, brewers), accessories, merchandise, gift cards, tea, wine, blog posts, collections, about pages

Scoring Guide:
- 90-100: Specific origin + farm/process details (e.g., /ethiopia-yirgacheffe-washed-g1)
- 70-89: Clear coffee product with origin (e.g., /brazil-cerrado-natural)
- 40-69: Ambiguous coffee-related (e.g., /products/ethiopia, /products/coffee)
- 20-39: Coffee products but not whole beans (drip-bag, pod, capsule, ground-only in URL)
- 0-19: Non-coffee content (equipment, blogs, wine, tea, merchandise, collections)

Changes:
1. Add "Non-whole-bean coffee products" category with explicit examples
2. Split negative indicators into two ranges:
   - 20-39 for coffee products that aren't whole beans
   - 0-19 for truly non-coffee content
3. Makes scoring more sensible: drip bags > blogs

Expected Impact:
- URLs like /products/drip-bag-* will score 20-39
- URLs with pod, capsule, ground will score 20-39
- Equipment/blogs score 0-19 (truly irrelevant)
- Filtering threshold: Skip if confidence < 40
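The score bands and the skip rule above can be read as a simple lookup. The following is a minimal sketch, not the service's actual code: the function names `urlScoreBand` and `shouldSkipUrl` and the band labels are hypothetical, and it assumes PHP 8.0+ for `match`.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: maps a URL confidence score to the bands of the
 * scoring guide. The label strings are illustrative, not existing constants.
 */
function urlScoreBand(float $confidence): string
{
    if ($confidence < 0.0 || $confidence > 100.0) {
        throw new InvalidArgumentException('Confidence must be between 0 and 100.');
    }

    // Arms are checked top to bottom, so each arm only needs a lower bound.
    return match (true) {
        $confidence >= 90.0 => 'specific-origin-product',   // 90-100
        $confidence >= 70.0 => 'clear-coffee-product',      // 70-89
        $confidence >= 40.0 => 'ambiguous-coffee-related',  // 40-69
        $confidence >= 20.0 => 'non-whole-bean-coffee',     // 20-39
        default => 'non-coffee-content',                    // 0-19
    };
}

/** Skip rule from the plan: URLs scoring below the threshold are not crawled. */
function shouldSkipUrl(float $confidence, float $threshold = 40.0): bool
{
    return $confidence < $threshold;
}

// Print band and decision for a few sample scores.
foreach ([95.0, 55.0, 39.0, 20.0, 5.0] as $score) {
    printf(
        "%5.1f  %-26s %s\n",
        $score,
        urlScoreBand($score),
        shouldSkipUrl($score) ? 'skip' : 'crawl'
    );
}
```

Using lower bounds only means a fractional score such as 39.5 still lands in the 20-39 band and is skipped, which matches the "confidence < 40" rule.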

Phase 2: Enhance Content-Based Detection Schema

File: config/schemas/coffee-bean.schema.json

Current Description (lines 23-26):

"description": "Assess whether this page represents a specific coffee bean product for purchase and provide a confidence score from 0 to 100..."

Enhanced Description:

{
  "isCoffeeBean": {
    "type": "number",
    "description": "Assess whether this page represents a WHOLE COFFEE BEAN product for purchase and provide a confidence score from 0 to 100.\n\nIMPORTANT: This must be WHOLE BEANS. However, many product pages offer BOTH whole bean AND ground options - these should score HIGH.\n\nScoring Guidelines:\n\n100: ABSOLUTELY CERTAIN - All indicators present:\n  • Product is whole beans (or offers whole bean option)\n  • Has specific origin information (country, region, or farm)\n  • Has price and purchase option (add to cart, buy now, etc.)\n  • Focuses on a single, specific coffee product\n  • Page is clearly a product page (not informational content)\n\n90-99: VERY CONFIDENT - Whole bean product with purchase option but:\n  • Missing some specific details (variety, processing, altitude)\n  • OR origin is somewhat generic\n  • Still clearly a purchasable whole bean product\n\n80-89: CONFIDENT - Clearly a whole bean product but:\n  • Missing clear price or purchase mechanism\n  • OR less detailed origin information\n  • Still clearly about whole bean coffee\n\n60-79: LIKELY - Appears to be whole bean product but:\n  • Unclear if purchase option exists\n  • OR missing significant details\n\n40-59: UNCERTAIN - Has coffee information but:\n  • Unclear if whole beans or if purchasable\n  • Could be informational rather than product page\n\n20-39: UNLIKELY - Coffee-related but NOT whole beans:\n  • EXCLUSIVELY drip bags, pods, capsules (no whole bean option)\n  • ONLY available pre-ground (no whole bean option)\n\n0-19: NOT A COFFEE BEAN PRODUCT:\n  • Equipment, grinders, brewers, accessories\n  • Merchandise, apparel, gift cards, subscriptions\n  • Blog posts, brewing guides, producer stories\n  • Collection/category pages\n  • Other beverages (tea, wine, etc.)\n\nIMPORTANT FOR GROUND COFFEE:\nIf the page offers BOTH whole bean AND ground as selectable options/variations, score as whole bean product (70+).\nOnly score low (20-39) if the product is EXCLUSIVELY ground/pods/drip bags with NO whole bean option."
  }
}

Key Changes:
1. Explicit criteria for score 100: Lists exactly what must be present for absolute certainty
2. Clear differentiation between ranges: Each range explains what drops the score
3. Ground coffee intelligence: Pages with BOTH options score high (70+), only EXCLUSIVE ground scores low
4. Sensible categorization: Drip bags/pods (20-39) vs truly irrelevant content (0-19)
5. Pushes LLM to use 100: Clear complete product pages should get 100, not hedging at 85-90

Expected Impact:
- LLM will use score 100 for complete, clear whole bean product pages
- Pages offering both whole bean + ground options will score 70+ (not filtered)
- EXCLUSIVE drip bags/pods/ground score 20-39 (filtered out)
- Equipment/blogs/non-coffee score 0-19 (truly irrelevant)
- Filtering threshold: Skip if isCoffeeBean < 40
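The ground-coffee rule is decided by the LLM from page content, but its logic can be stated deterministically, which may help when choosing expected values for test fixtures. The sketch below is an illustration only: the function name, the regular expression, and the idea of passing the labels of a grind selector are all assumptions, not existing code.

```php
<?php

declare(strict_types=1);

/**
 * Illustration of the BOTH-vs-EXCLUSIVE rule. Hypothetical helper: in the
 * plan this judgement is made by the LLM, not by keyword matching.
 *
 * @param list<string> $optionLabels labels of the selectable variations
 * @return string|null expected score band, or null when no options are known
 */
function expectedBandForOptions(array $optionLabels): ?string
{
    if ($optionLabels === []) {
        // No selector found: the rule says nothing, so make no prediction.
        return null;
    }

    foreach ($optionLabels as $label) {
        // Matches "Whole Bean", "whole-bean", "Whole Beans", "wholebean".
        if (preg_match('/\bwhole[\s-]?beans?\b/i', $label) === 1) {
            // A whole bean option exists, so ground options do not matter.
            return '70+';
        }
    }

    // Options exist but none is whole bean: exclusively ground/pods/drip bags.
    return '20-39';
}

echo expectedBandForOptions(['Whole Bean', 'Ground (Filter)']) ?? 'unknown', PHP_EOL;
echo expectedBandForOptions(['Ground (Espresso)', 'Drip Bag']) ?? 'unknown', PHP_EOL;
```

The point of the sketch is the asymmetry: a single whole bean option is enough for the high band, while the low band requires that every option be non-whole-bean.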

Phase 3: Update Filtering Threshold

Files to Update:
- Any service using confidence thresholds for filtering

Current Threshold: 70.0
New Threshold: 40.0

This allows:
- ✅ Whole bean products: 70+ (pass through)
- ✅ Whole bean + ground options: 70+ (pass through)
- ✅ Ambiguous or incomplete coffee pages: 40-69 (pass through, since they meet the new threshold)
- ❌ Drip bags only: 20-39 (filtered)
- ❌ Pods/capsules: 20-39 (filtered)
- ❌ Equipment/blogs: 0-19 (filtered)
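To make the effect of the threshold change concrete, here is a small sketch of a filter whose threshold is injected rather than hard-coded, so it could be supplied from configuration. The class name `ConfidenceFilter`, its method, and the sample scores are hypothetical. It assumes PHP 8.1+ for `readonly` promoted properties.

```php
<?php

declare(strict_types=1);

/** Hypothetical filter: the threshold is a constructor argument, not a constant. */
final class ConfidenceFilter
{
    public function __construct(private readonly float $threshold = 40.0)
    {
    }

    /**
     * Splits scored URLs into those that pass and those that are filtered.
     *
     * @param array<string, float> $scores URL => confidence (0-100)
     * @return array{kept: list<string>, filtered: list<string>}
     */
    public function partition(array $scores): array
    {
        $kept = [];
        $filtered = [];

        foreach ($scores as $url => $confidence) {
            // "Skip if confidence < threshold": equality passes through.
            if ($confidence >= $this->threshold) {
                $kept[] = $url;
            } else {
                $filtered[] = $url;
            }
        }

        return ['kept' => $kept, 'filtered' => $filtered];
    }
}

// Made-up scores, chosen to fall in three different bands.
$scores = [
    '/products/ethiopia-yirgacheffe-washed-g1' => 95.0,
    '/products/ethiopia' => 55.0,
    '/products/drip-bag-ethiopia' => 30.0,
];

$old = (new ConfidenceFilter(70.0))->partition($scores);
$new = (new ConfidenceFilter(40.0))->partition($scores);

printf("threshold 70: %d kept, threshold 40: %d kept\n", count($old['kept']), count($new['kept']));
```

With these sample scores, the ambiguous `/products/ethiopia` URL is dropped at threshold 70 but kept at 40, while the drip bag URL is filtered in both cases. That is the behaviour change this phase is after.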

Testing Strategy

Unit Tests

Extend: tests/Service/Crawler/ContentDetection/UrlPatternClassificationServiceTest.php

public function test_scores_drip_bag_urls_low_but_coffee_related(): void
{
    $url = 'https://example.com/products/drip-bag-ethiopia';
    $result = $this->service->analyzeUrls([$url]);

    $confidence = $result[$url]->getConfidence();
    $this->assertGreaterThanOrEqual(20, $confidence);
    $this->assertLessThan(40, $confidence);
}

public function test_scores_pod_urls_low_but_coffee_related(): void
{
    $url = 'https://example.com/products/coffee-pod-capsule';
    $result = $this->service->analyzeUrls([$url]);

    $confidence = $result[$url]->getConfidence();
    $this->assertGreaterThanOrEqual(20, $confidence);
    $this->assertLessThan(40, $confidence);
}

public function test_scores_ground_only_coffee_urls_low(): void
{
    $url = 'https://example.com/products/ground-colombia';
    $result = $this->service->analyzeUrls([$url]);

    $confidence = $result[$url]->getConfidence();
    $this->assertGreaterThanOrEqual(20, $confidence);
    $this->assertLessThan(40, $confidence);
}

public function test_scores_equipment_very_low(): void
{
    $url = 'https://example.com/products/coffee-grinder';
    $result = $this->service->analyzeUrls([$url]);

    $confidence = $result[$url]->getConfidence();
    $this->assertLessThan(20, $confidence);
}

public function test_scores_blog_content_very_low(): void
{
    $url = 'https://example.com/blog/brewing-guide';
    $result = $this->service->analyzeUrls([$url]);

    $confidence = $result[$url]->getConfidence();
    $this->assertLessThan(20, $confidence);
}

public function test_scores_whole_bean_urls_high(): void
{
    $url = 'https://example.com/products/ethiopia-yirgacheffe-natural';
    $result = $this->service->analyzeUrls([$url]);

    $confidence = $result[$url]->getConfidence();
    $this->assertGreaterThanOrEqual(70, $confidence);
}

Add: tests/Service/Crawler/Extraction/CoffeeBeanExtractorTest.php

public function test_content_extraction_rejects_exclusive_drip_bag_products(): void
{
    $html = $this->loadFixture('drip_bag_only_product.html');
    $result = $this->extractor->extract($html, 'https://example.com/drip-bag');

    $this->assertGreaterThanOrEqual(20, $result->isCoffeeBean);
    $this->assertLessThan(40, $result->isCoffeeBean);
}

public function test_content_extraction_accepts_whole_bean_with_ground_option(): void
{
    $html = $this->loadFixture('whole_bean_with_ground_variation.html');
    $result = $this->extractor->extract($html, 'https://example.com/product');

    $this->assertGreaterThanOrEqual(70, $result->isCoffeeBean);
}

public function test_content_extraction_gives_100_for_complete_product(): void
{
    $html = $this->loadFixture('complete_whole_bean_product.html');
    $result = $this->extractor->extract($html, 'https://example.com/product');

    $this->assertEquals(100, $result->isCoffeeBean);
}

public function test_content_extraction_rejects_equipment(): void
{
    $html = $this->loadFixture('coffee_grinder_product.html');
    $result = $this->extractor->extract($html, 'https://example.com/grinder');

    $this->assertLessThan(20, $result->isCoffeeBean);
}

Integration Tests

End-to-End Filtering (tests/Service/Crawler/CoffeeBeanCrawlProcessingServiceTest.php):

public function test_drip_bag_url_filtered_by_pattern_classification(): void
{
    $crawlUrl = $this->createCrawlUrl('https://example.com/drip-bag-ethiopia');

    $this->urlClassificationService->analyzeCrawlUrls([$crawlUrl]);

    $this->assertLessThan(40, $crawlUrl->getContentConfidence());
    $this->assertGreaterThanOrEqual(20, $crawlUrl->getContentConfidence());
}

public function test_whole_bean_url_passes_pattern_classification(): void
{
    $crawlUrl = $this->createCrawlUrl('https://example.com/ethiopia-yirgacheffe');

    $this->urlClassificationService->analyzeCrawlUrls([$crawlUrl]);

    $this->assertGreaterThanOrEqual(70, $crawlUrl->getContentConfidence());
}

Manual Testing

Product Filtering Testing:

1. Add test CrawlUrls for:
   - https://example.com/products/drip-bag-ethiopia-sidamo (expect 20-39)
   - https://example.com/products/coffee-pods-variety-pack (expect 20-39)
   - https://example.com/products/ground-colombia-huila (expect 20-39)
   - https://example.com/products/coffee-grinder (expect 0-19)
   - https://example.com/blog/brewing-tips (expect 0-19)
2. Run URL pattern classification: app:crawler:classify-urls
3. Check logs:
   - Drip bags/pods should score 20-39
   - Equipment/blogs should score 0-19
4. Add whole bean URLs and verify they score 70+

Content Detection Testing:

1. Crawl EXCLUSIVE drip bag product page (no whole bean option)
   - Verify LLM extraction assigns isCoffeeBean score 20-39
   - Verify product is not persisted (filtered at threshold 40)
2. Crawl product page with BOTH whole bean + ground options
   - Verify LLM assigns high score (70+) and product IS persisted
3. Crawl complete product page with all details (origin, price, etc.)
   - Verify LLM assigns score 100 (not hedging at 85-90)
4. Check logs for filtering decisions

Rollout Plan

Development

1. Enhance URL pattern classification prompt
2. Enhance coffee-bean.schema.json description
3. Update filtering thresholds (70 → 40)
4. Write/update unit tests
5. Run full QA suite: make qa

Testing

1. Deploy to staging environment
2. Test drip bag/pod/capsule URLs through full pipeline
3. Test whole bean + ground variation pages
4. Verify filtering at both URL and content stages
5. Monitor scores to ensure LLM uses 100 for complete products

Production

1. Deploy with monitoring enabled
2. Track metrics:
   - URLs filtered by pattern classification (confidence < 40)
   - Products filtered by content detection (isCoffeeBean < 40)
   - Distribution of scores (how many 100s vs 85-90s)
3. Review filtered products after 1 week to validate accuracy
4. Adjust prompts if needed based on real-world results

Success Criteria

Product Detection Success

[ ] Drip bag URLs score 20-39 in pattern classification (sensibly low, not confused with blogs)
[ ] Pod/capsule URLs score 20-39 in pattern classification
[ ] Ground-only coffee URLs score 20-39 in pattern classification
[ ] Non-coffee content (equipment, blogs, tea/wine) scores 0-19 in pattern classification
[ ] Whole bean URLs continue to score 70+ in pattern classification
[ ] EXCLUSIVE drip bag product pages score 20-39 in content extraction
[ ] Pages with BOTH whole bean + ground options score 70+ in content extraction (not filtered)
[ ] Complete whole bean product pages score 100 in content extraction (not 85-90)
[ ] Reduction in false positives (non-whole-bean products persisted)
[ ] No reduction in true positives (whole bean products persisted, including pages with ground option)

Quality Metrics

[ ] False positive rate < 5% (non-whole-bean products incorrectly persisted)
[ ] False negative rate < 2% (whole bean products incorrectly filtered)
[ ] LLM uses score 100 for at least 60% of complete product pages
[ ] All QA tools pass (PHPStan, PHPCS, PHPUnit)
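The two rate targets above depend on how the rates are defined, which the plan does not spell out. The sketch below assumes the standard definitions over a manually reviewed sample: false positive rate as the share of non-whole-bean products that were persisted, and false negative rate as the share of whole bean products that were filtered. The class name and the counts are hypothetical. It assumes PHP 8.1+.

```php
<?php

declare(strict_types=1);

/**
 * Counts from a manually reviewed sample (hypothetical structure).
 * "Other" means any non-whole-bean product or non-coffee page.
 */
final class ReviewCounts
{
    public function __construct(
        public readonly int $wholeBeanPersisted, // true positives
        public readonly int $wholeBeanFiltered,  // false negatives
        public readonly int $otherPersisted,     // false positives
        public readonly int $otherFiltered,      // true negatives
    ) {
    }

    /** Share of non-whole-bean items that were incorrectly persisted. */
    public function falsePositiveRate(): float
    {
        $others = $this->otherPersisted + $this->otherFiltered;

        return $others === 0 ? 0.0 : $this->otherPersisted / $others;
    }

    /** Share of whole bean items that were incorrectly filtered. */
    public function falseNegativeRate(): float
    {
        $wholeBean = $this->wholeBeanPersisted + $this->wholeBeanFiltered;

        return $wholeBean === 0 ? 0.0 : $this->wholeBeanFiltered / $wholeBean;
    }
}

// Made-up review sample: 200 whole bean pages, 100 other pages.
$sample = new ReviewCounts(
    wholeBeanPersisted: 197,
    wholeBeanFiltered: 3,
    otherPersisted: 3,
    otherFiltered: 97,
);

printf("False positive rate: %.1f%% (target < 5%%)\n", 100 * $sample->falsePositiveRate());
printf("False negative rate: %.1f%% (target < 2%%)\n", 100 * $sample->falseNegativeRate());
```

If the team instead measures false positives as a share of all persisted products, the denominator changes. Whichever definition is chosen should be written down before the one-week review.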

Future Enhancements

Product Detection

- Consider adding product type field to CoffeeBean entity (WHOLE_BEAN, GROUND, POD, etc.)
- Add automated retraining of confidence thresholds based on manual review feedback
- Consider separate schemas for different product types if we expand beyond whole beans
- A/B test different prompt variations to optimize scoring accuracy

Monitoring

- Dashboard showing:
  - Products filtered by type (drip bag vs pod vs capsule vs ground)
  - Confidence score distribution (0-19, 20-39, 40-59, 60-79, 80-89, 90-99, 100)
  - False positive rate (manual review required)
  - LLM hedging analysis (% using 100 vs 85-90 for complete products)
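The score distribution and the hedging figure can both be derived from a plain list of isCoffeeBean scores. The following is a minimal sketch under that assumption: the function names are hypothetical, and the hedging share only makes sense when the input is limited to pages that a reviewer has already judged complete.

```php
<?php

declare(strict_types=1);

/** Maps a score to one of the dashboard buckets listed above. */
function scoreBucket(float $score): string
{
    return match (true) {
        $score >= 100.0 => '100',
        $score >= 90.0 => '90-99',
        $score >= 80.0 => '80-89',
        $score >= 60.0 => '60-79',
        $score >= 40.0 => '40-59',
        $score >= 20.0 => '20-39',
        default => '0-19',
    };
}

/**
 * Counts scores per bucket; every bucket is present even when empty.
 *
 * @param list<int|float> $scores
 * @return array<int|string, int> bucket label => count
 */
function scoreDistribution(array $scores): array
{
    $labels = ['0-19', '20-39', '40-59', '60-79', '80-89', '90-99', '100'];
    // Note: PHP stores the numeric-string key '100' as integer 100.
    $histogram = array_fill_keys($labels, 0);

    foreach ($scores as $score) {
        $histogram[scoreBucket((float) $score)]++;
    }

    return $histogram;
}

/**
 * Share of pages scored exactly 100. Pass only scores of pages judged complete.
 *
 * @param list<int|float> $scoresOfCompletePages
 */
function shareScoredAt100(array $scoresOfCompletePages): float
{
    if ($scoresOfCompletePages === []) {
        return 0.0;
    }

    $hits = count(array_filter($scoresOfCompletePages, static fn ($s): bool => (float) $s >= 100.0));

    return $hits / count($scoresOfCompletePages);
}

// Made-up scores for five pages a reviewer marked as complete.
$completePages = [100, 100, 100, 100, 85];
print_r(scoreDistribution($completePages));
printf("Scored 100: %.0f%%\n", 100 * shareScoredAt100($completePages));
```

Compared against the 60% success criterion, a share well below that would suggest the prompt still lets the model hedge in the 85-90 range.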

Risk Assessment

Low Risk

- Schema changes: Only improves LLM prompts, doesn't break existing functionality
- Threshold adjustment: 40 is conservative, allows buffer for edge cases

Medium Risk

- Confidence threshold tuning: May need adjustment based on production data
- Ground coffee detection: Need to ensure BOTH-option pages don't get filtered
- LLM behavior changes: Scoring may vary with model updates

Mitigation

- Extensive testing before deployment
- Monitoring and logging for first week in production
- Ability to quickly adjust confidence thresholds via config
- Manual review of filtered products to catch false positives
- Rollback plan: Revert schema changes, restore threshold to 70

Checklist

Implementation

[ ] Enhance UrlPatternClassificationService prompt
[ ] Enhance coffee-bean.schema.json description
[ ] Update confidence threshold from 70 to 40
[ ] Write unit tests for product detection changes
[ ] Create test fixtures (drip bag HTML, whole bean + ground HTML, etc.)
[ ] Run full QA suite and fix any issues

Testing

[ ] Manual test with drip bag URLs (expect 20-39)
[ ] Manual test with pod/capsule URLs (expect 20-39)
[ ] Manual test with ground coffee URLs (expect 20-39)
[ ] Manual test with equipment URLs (expect 0-19)
[ ] Manual test with blog URLs (expect 0-19)
[ ] Manual test with whole bean URLs (ensure no regression, expect 70+)
[ ] Manual test with whole bean + ground variation pages (expect 70+)
[ ] Verify complete products score 100 (not 85-90)

Deployment

[ ] Deploy to staging
[ ] Run staging tests
[ ] Deploy to production
[ ] Monitor for 1 week
[ ] Review filtered products (manual sampling)
[ ] Check score distribution (ensure 100s are being used)
[ ] Adjust prompts/thresholds if needed
[ ] Document lessons learned