Feature Implementation Plan: Enhanced Product Detection for Whole Bean Coffee

Executive Summary

Improve early filtering of non-whole-bean products (drip bags, pods, capsules, ground-only coffee) through enhanced URL pattern classification and LLM content analysis.

Key Goals:

  1. Filter out non-whole-bean products early to avoid wasted crawl processing
  2. Correctly identify pages that offer BOTH whole bean and ground options as valid
  3. Push LLM to use score 100 for complete whole bean products (not hedging at 85-90)
  4. Maintain sensible scoring: drip bags score higher than blogs/equipment but still filtered

Problem Statement

Current product detection has several issues:

  1. Missing product types in URL classification: The prompt mentions "equipment, accessories, merchandise" but MISSES drip bags, pods, capsules, ground coffee
     • Example: https://nanacoffeeroasters.com/products/drip-bag-ethiopia-sidamoguji-anasora may score 40-60 instead of being filtered

  2. Ground coffee ambiguity: Many product pages offer BOTH whole bean AND ground options as selectable variations
     • If we're too strict, the ground option could cause the page to be rejected
     • We should only filter EXCLUSIVE ground-only products

  3. LLM hedging on scores: Even for clear, complete whole bean products, LLM often scores 85-90 instead of 100
     • Lacks clear criteria for when to use 100 vs 90-99 vs 80-89

  4. Inconsistent categorization: Drip bags shouldn't score as low as blogs/equipment - they're still coffee products, just not our target

Technical Analysis

Current State

URL Pattern Classification (UrlPatternClassificationService.php:208-227):

NOT Coffee Beans:
- Equipment (grinders, filters), accessories, merchandise, gift cards, blog posts, collections
Missing: drip bags, pods, capsules, ground coffee, instant coffee

Content Detection (coffee-bean.schema.json:23-26):

Guidelines for scoring:
- 80-100: Clearly a single coffee bean product with specific details AND purchase options
- 60-80: Likely a coffee bean product with purchase options but missing some specific details
...
Issues:

  • No explicit criteria for score 100 vs 80-89
  • Doesn't address product pages with both whole bean + ground options
  • Doesn't distinguish drip bags (still coffee) from equipment (non-coffee)

Current Filtering Threshold

Currently using confidence threshold of 70.0 in various places. This plan will adjust to 40.0 with better scoring granularity.

Implementation Plan

Phase 1: Enhance URL Pattern Classification

File: src/Service/Crawler/ContentDetection/UrlPatternClassificationService.php

Current Prompt (lines 208-227):

NOT Coffee Beans:
- Equipment (grinders, filters), accessories, merchandise, gift cards, blog posts, collections

Enhanced Prompt:

Analyze these URLs and score each from 0-100 based on how likely they are to be whole bean coffee product pages.

Coffee Bean Indicators:
- Country/region + coffee terms (e.g., /ethiopia-yirgacheffe, /colombia-huila)
- Farm names, altitude markers (1800m), variety names (gesha, sl28)
- Processing terms (anaerobic, washed, natural)

NOT Coffee Beans - Score by Category:
- Non-whole-bean coffee products (20-39): drip bags, pods, capsules, ground-only coffee, instant coffee
- Non-coffee content (0-19): equipment (grinders, brewers), accessories, merchandise, gift cards, tea, wine, blog posts, collections, about pages

Scoring Guide:
- 90-100: Specific origin + farm/process details (e.g., /ethiopia-yirgacheffe-washed-g1)
- 70-89: Clear coffee product with origin (e.g., /brazil-cerrado-natural)
- 40-69: Ambiguous coffee-related (e.g., /products/ethiopia, /products/coffee)
- 20-39: Coffee products but not whole beans (drip-bag, pod, capsule, ground-only in URL)
- 0-19: Non-coffee content (equipment, blogs, wine, tea, merchandise, collections)

Changes:

  1. Add "Non-whole-bean coffee products" category with explicit examples
  2. Split negative indicators into two ranges:
     • 20-39 for coffee products that aren't whole beans
     • 0-19 for truly non-coffee content
  3. Makes scoring more sensible: drip bags > blogs

Expected Impact:

  • URLs like /products/drip-bag-* will score 20-39
  • URLs with pod, capsule, ground will score 20-39
  • Equipment/blogs score 0-19 (truly irrelevant)
  • Filtering threshold: Skip if confidence < 40
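The scoring guide above can be read as a simple banding function. The following is a minimal sketch, not code from the project: the function name `urlScoreBand` and the band labels are hypothetical, and only the numeric boundaries are taken from the guide.

```php
<?php
// Hypothetical helper: maps a URL classification score (0-100) to the
// bands of the scoring guide. Names and labels are illustrative only.
function urlScoreBand(float $score): string
{
    if ($score < 0.0 || $score > 100.0) {
        throw new InvalidArgumentException("Score out of range: {$score}");
    }
    if ($score >= 90.0) {
        return 'specific-origin-product';   // 90-100
    }
    if ($score >= 70.0) {
        return 'clear-coffee-product';      // 70-89
    }
    if ($score >= 40.0) {
        return 'ambiguous-coffee-related';  // 40-69
    }
    if ($score >= 20.0) {
        return 'non-whole-bean-coffee';     // 20-39: drip bags, pods, capsules, ground-only
    }
    return 'non-coffee-content';            // 0-19: equipment, blogs, tea, wine
}
```

A helper like this might be useful in logs, so that a filtered URL is reported with its band rather than a bare number.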

Phase 2: Enhance Content-Based Detection Schema

File: config/schemas/coffee-bean.schema.json

Current Description (lines 23-26):

"description": "Assess whether this page represents a specific coffee bean product for purchase and provide a confidence score from 0 to 100..."

Enhanced Description:

{
  "isCoffeeBean": {
    "type": "number",
    "description": "Assess whether this page represents a WHOLE COFFEE BEAN product for purchase and provide a confidence score from 0 to 100.\n\nIMPORTANT: This must be WHOLE BEANS. However, many product pages offer BOTH whole bean AND ground options - these should score HIGH.\n\nScoring Guidelines:\n\n100: ABSOLUTELY CERTAIN - All indicators present:\n  • Product is whole beans (or offers whole bean option)\n  • Has specific origin information (country, region, or farm)\n  • Has price and purchase option (add to cart, buy now, etc.)\n  • Focuses on a single, specific coffee product\n  • Page is clearly a product page (not informational content)\n\n90-99: VERY CONFIDENT - Whole bean product with purchase option but:\n  • Missing some specific details (variety, processing, altitude)\n  • OR origin is somewhat generic\n  • Still clearly a purchasable whole bean product\n\n80-89: CONFIDENT - Clearly a whole bean product but:\n  • Missing clear price or purchase mechanism\n  • OR less detailed origin information\n  • Still clearly about whole bean coffee\n\n60-79: LIKELY - Appears to be whole bean product but:\n  • Unclear if purchase option exists\n  • OR missing significant details\n\n40-59: UNCERTAIN - Has coffee information but:\n  • Unclear if whole beans or if purchasable\n  • Could be informational rather than product page\n\n20-39: UNLIKELY - Coffee-related but NOT whole beans:\n  • EXCLUSIVELY drip bags, pods, capsules (no whole bean option)\n  • ONLY available pre-ground (no whole bean option)\n\n0-19: NOT A COFFEE BEAN PRODUCT:\n  • Equipment, grinders, brewers, accessories\n  • Merchandise, apparel, gift cards, subscriptions\n  • Blog posts, brewing guides, producer stories\n  • Collection/category pages\n  • Other beverages (tea, wine, etc.)\n\nIMPORTANT FOR GROUND COFFEE:\nIf the page offers BOTH whole bean AND ground as selectable options/variations, score as whole bean product (70+).\nOnly score low (20-39) if the product is EXCLUSIVELY ground/pods/drip bags with NO whole bean option."
  }
}

Key Changes:

  1. Explicit criteria for score 100: Lists exactly what must be present for absolute certainty
  2. Clear differentiation between ranges: Each range explains what drops the score
  3. Ground coffee intelligence: Pages with BOTH options score high (70+), only EXCLUSIVE ground scores low
  4. Sensible categorization: Drip bags/pods (20-39) vs truly irrelevant content (0-19)
  5. Pushes LLM to use 100: Clear complete product pages should get 100, not hedging at 85-90

Expected Impact:

  • LLM will use score 100 for complete, clear whole bean product pages
  • Pages offering both whole bean + ground options will score 70+ (not filtered)
  • EXCLUSIVE drip bags/pods/ground score 20-39 (filtered out)
  • Equipment/blogs/non-coffee score 0-19 (truly irrelevant)
  • Filtering threshold: Skip if isCoffeeBean < 40
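The "both options" rule is left to the LLM in this plan, but the same logic can be illustrated deterministically. The sketch below is an assumption-heavy illustration, not part of the planned implementation: the function `classifyGrindOptions`, its return labels, and the idea of inspecting variant labels are all hypothetical, and the regular expressions only cover common English labels.

```php
<?php
// Hypothetical pre-check on a product's variant labels.
// Returns 'whole-bean' when any whole bean option exists (even alongside
// ground options), 'ground-only' when only ground options are found,
// and 'unknown' when the labels say nothing about grind.
function classifyGrindOptions(array $variantLabels): string
{
    $hasWholeBean = false;
    $hasGround = false;

    foreach ($variantLabels as $label) {
        $normalized = strtolower(trim((string) $label));
        // Matches "whole bean", "whole-bean", "wholebean", "whole beans".
        if (preg_match('/whole[\s-]?beans?/', $normalized) === 1) {
            $hasWholeBean = true;
        } elseif (preg_match('/\b(?:ground|grind)\b/', $normalized) === 1) {
            $hasGround = true;
        }
    }

    if ($hasWholeBean) {
        return 'whole-bean';   // BOTH-option pages land here and should score high
    }
    return $hasGround ? 'ground-only' : 'unknown';
}
```

The key design point mirrors the schema text: the presence of a ground option never lowers the result as long as a whole bean option is also offered.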

Phase 3: Update Filtering Threshold

Files to Update:

  • Any service using confidence thresholds for filtering

Current Threshold: 70.0
New Threshold: 40.0

This allows:

  • ✅ Whole bean products: 70+ (pass through)
  • ✅ Whole bean + ground options: 70+ (pass through)
  • ❌ Drip bags only: 20-39 (filtered)
  • ❌ Pods/capsules: 20-39 (filtered)
  • ❌ Equipment/blogs: 0-19 (filtered)

Scores of 40-69 (ambiguous or incomplete pages) are at or above the new threshold, so they also pass through.
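The threshold rule amounts to a strict "less than" comparison, so a score of exactly 40 passes. A minimal sketch, assuming scores arrive as a URL-to-score map; the function name `partitionByThreshold` and the array shape are hypothetical, not the project's API:

```php
<?php
/**
 * Hypothetical helper: split scored URLs into kept and skipped sets.
 *
 * @param array<string, float> $scores    URL => confidence (0-100)
 * @param float                $threshold Skip when confidence < threshold
 * @return array{kept: string[], skipped: string[]}
 */
function partitionByThreshold(array $scores, float $threshold = 40.0): array
{
    $kept = [];
    $skipped = [];

    foreach ($scores as $url => $score) {
        if ($score < $threshold) {
            $skipped[] = $url;   // drip bags, pods, equipment, blogs
        } else {
            $kept[] = $url;      // whole bean and ambiguous (40-69) pages
        }
    }

    return ['kept' => $kept, 'skipped' => $skipped];
}
```

With the old threshold of 70.0 passed as the second argument, the same call shows which ambiguous pages would previously have been dropped.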

Testing Strategy

Unit Tests

Extend: tests/Service/Crawler/ContentDetection/UrlPatternClassificationServiceTest.php

public function test_scores_drip_bag_urls_low_but_coffee_related(): void
{
    $url = 'https://example.com/products/drip-bag-ethiopia';
    $result = $this->service->analyzeUrls([$url]);

    $confidence = $result[$url]->getConfidence();
    $this->assertGreaterThanOrEqual(20, $confidence);
    $this->assertLessThan(40, $confidence);
}

public function test_scores_pod_urls_low_but_coffee_related(): void
{
    $url = 'https://example.com/products/coffee-pod-capsule';
    $result = $this->service->analyzeUrls([$url]);

    $confidence = $result[$url]->getConfidence();
    $this->assertGreaterThanOrEqual(20, $confidence);
    $this->assertLessThan(40, $confidence);
}

public function test_scores_ground_only_coffee_urls_low(): void
{
    $url = 'https://example.com/products/ground-colombia';
    $result = $this->service->analyzeUrls([$url]);

    $confidence = $result[$url]->getConfidence();
    $this->assertGreaterThanOrEqual(20, $confidence);
    $this->assertLessThan(40, $confidence);
}

public function test_scores_equipment_very_low(): void
{
    $url = 'https://example.com/products/coffee-grinder';
    $result = $this->service->analyzeUrls([$url]);

    $confidence = $result[$url]->getConfidence();
    $this->assertLessThan(20, $confidence);
}

public function test_scores_blog_content_very_low(): void
{
    $url = 'https://example.com/blog/brewing-guide';
    $result = $this->service->analyzeUrls([$url]);

    $confidence = $result[$url]->getConfidence();
    $this->assertLessThan(20, $confidence);
}

public function test_scores_whole_bean_urls_high(): void
{
    $url = 'https://example.com/products/ethiopia-yirgacheffe-natural';
    $result = $this->service->analyzeUrls([$url]);

    $confidence = $result[$url]->getConfidence();
    $this->assertGreaterThanOrEqual(70, $confidence);
}

Add: tests/Service/Crawler/Extraction/CoffeeBeanExtractorTest.php

public function test_content_extraction_rejects_exclusive_drip_bag_products(): void
{
    $html = $this->loadFixture('drip_bag_only_product.html');
    $result = $this->extractor->extract($html, 'https://example.com/drip-bag');

    $this->assertGreaterThanOrEqual(20, $result->isCoffeeBean);
    $this->assertLessThan(40, $result->isCoffeeBean);
}

public function test_content_extraction_accepts_whole_bean_with_ground_option(): void
{
    $html = $this->loadFixture('whole_bean_with_ground_variation.html');
    $result = $this->extractor->extract($html, 'https://example.com/product');

    $this->assertGreaterThanOrEqual(70, $result->isCoffeeBean);
}

public function test_content_extraction_gives_100_for_complete_product(): void
{
    $html = $this->loadFixture('complete_whole_bean_product.html');
    $result = $this->extractor->extract($html, 'https://example.com/product');

    $this->assertEquals(100, $result->isCoffeeBean);
}

public function test_content_extraction_rejects_equipment(): void
{
    $html = $this->loadFixture('coffee_grinder_product.html');
    $result = $this->extractor->extract($html, 'https://example.com/grinder');

    $this->assertLessThan(20, $result->isCoffeeBean);
}
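Assertions on exact score ranges (and especially on a score of exactly 100) may be flaky if the tests call a live model, since LLM output can vary between runs. One option, assuming the tests do not already stub the model, is to inject canned scores. The sketch below is illustrative only: `ScoreProvider` and `CannedScoreProvider` are hypothetical names, and the real services may take different dependencies.

```php
<?php
// Hypothetical seam between the classification service and the LLM.
interface ScoreProvider
{
    /**
     * @param string[] $urls
     * @return array<string, float> URL => score
     */
    public function scoreUrls(array $urls): array;
}

// Test double that returns fixed scores instead of calling a model.
final class CannedScoreProvider implements ScoreProvider
{
    /** @var array<string, float> */
    private array $cannedScores;

    /** @param array<string, float> $cannedScores */
    public function __construct(array $cannedScores)
    {
        $this->cannedScores = $cannedScores;
    }

    public function scoreUrls(array $urls): array
    {
        $result = [];
        foreach ($urls as $url) {
            // Fail loudly so a test cannot pass on a missing fixture.
            if (!array_key_exists($url, $this->cannedScores)) {
                throw new InvalidArgumentException("No canned score for {$url}");
            }
            $result[$url] = $this->cannedScores[$url];
        }
        return $result;
    }
}
```

Stubbed tests of this kind check the filtering logic around the score; whether the prompt actually produces those scores still needs the manual and staging checks described in this plan.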

Integration Tests

End-to-End Filtering (tests/Service/Crawler/CoffeeBeanCrawlProcessingServiceTest.php):

public function test_drip_bag_url_filtered_by_pattern_classification(): void
{
    $crawlUrl = $this->createCrawlUrl('https://example.com/drip-bag-ethiopia');

    $this->urlClassificationService->analyzeCrawlUrls([$crawlUrl]);

    $this->assertLessThan(40, $crawlUrl->getContentConfidence());
    $this->assertGreaterThanOrEqual(20, $crawlUrl->getContentConfidence());
}

public function test_whole_bean_url_passes_pattern_classification(): void
{
    $crawlUrl = $this->createCrawlUrl('https://example.com/ethiopia-yirgacheffe');

    $this->urlClassificationService->analyzeCrawlUrls([$crawlUrl]);

    $this->assertGreaterThanOrEqual(70, $crawlUrl->getContentConfidence());
}

Manual Testing

Product Filtering Testing:

  1. Add test CrawlUrls for:
     • https://example.com/products/drip-bag-ethiopia-sidamo (expect 20-39)
     • https://example.com/products/coffee-pods-variety-pack (expect 20-39)
     • https://example.com/products/ground-colombia-huila (expect 20-39)
     • https://example.com/products/coffee-grinder (expect 0-19)
     • https://example.com/blog/brewing-tips (expect 0-19)

  2. Run URL pattern classification: app:crawler:classify-urls

  3. Check logs:
     • Drip bags/pods should score 20-39
     • Equipment/blogs should score 0-19

  4. Add whole bean URLs and verify they score 70+

Content Detection Testing:

  1. Crawl EXCLUSIVE drip bag product page (no whole bean option)
     • Verify LLM extraction assigns isCoffeeBean score 20-39
     • Verify product is not persisted (filtered at threshold 40)

  2. Crawl product page with BOTH whole bean + ground options
     • Verify LLM assigns high score (70+) and product IS persisted

  3. Crawl complete product page with all details (origin, price, etc.)
     • Verify LLM assigns score 100 (not hedging at 85-90)

  4. Check logs for filtering decisions

Rollout Plan

Development

  1. Enhance URL pattern classification prompt
  2. Enhance coffee-bean.schema.json description
  3. Update filtering thresholds (70 → 40)
  4. Write/update unit tests
  5. Run full QA suite: make qa

Testing

  1. Deploy to staging environment
  2. Test drip bag/pod/capsule URLs through full pipeline
  3. Test whole bean + ground variation pages
  4. Verify filtering at both URL and content stages
  5. Monitor scores to ensure LLM uses 100 for complete products

Production

  1. Deploy with monitoring enabled
  2. Track metrics:
     • URLs filtered by pattern classification (confidence < 40)
     • Products filtered by content detection (isCoffeeBean < 40)
     • Distribution of scores (how many 100s vs 85-90s)
  3. Review filtered products after 1 week to validate accuracy
  4. Adjust prompts if needed based on real-world results

Success Criteria

Product Detection Success

  • [ ] Drip bag URLs score 20-39 in pattern classification (sensibly low, not confused with blogs)
  • [ ] Pod/capsule URLs score 20-39 in pattern classification
  • [ ] Ground-only coffee URLs score 20-39 in pattern classification
  • [ ] Non-coffee content (equipment, blogs, tea/wine) scores 0-19 in pattern classification
  • [ ] Whole bean URLs continue to score 70+ in pattern classification
  • [ ] EXCLUSIVE drip bag product pages score 20-39 in content extraction
  • [ ] Pages with BOTH whole bean + ground options score 70+ in content extraction (not filtered)
  • [ ] Complete whole bean product pages score 100 in content extraction (not 85-90)
  • [ ] Reduction in false positives (non-whole-bean products persisted)
  • [ ] No reduction in true positives (whole bean products persisted, including pages with ground option)

Quality Metrics

  • [ ] False positive rate < 5% (non-whole-bean products incorrectly persisted)
  • [ ] False negative rate < 2% (whole bean products incorrectly filtered)
  • [ ] LLM uses score 100 for at least 60% of complete product pages
  • [ ] All QA tools pass (PHPStan, PHPCS, PHPUnit)
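The rate targets above can be computed from a manually reviewed sample. The sketch below assumes one particular definition: false positive rate as persisted non-whole-bean products over all reviewed non-whole-bean products, and false negative rate as filtered whole bean products over all reviewed whole bean products. The record shape and function names are hypothetical.

```php
<?php
/**
 * Hypothetical review record: ['isWholeBean' => bool, 'persisted' => bool].
 * Returns null for a rate when the sample has no items of that kind.
 *
 * @return array{falsePositiveRate: ?float, falseNegativeRate: ?float}
 */
function reviewRates(array $samples): array
{
    $wholeBean = 0;
    $falseNegatives = 0;
    $nonWholeBean = 0;
    $falsePositives = 0;

    foreach ($samples as $sample) {
        if ($sample['isWholeBean']) {
            $wholeBean++;
            if (!$sample['persisted']) {
                $falseNegatives++;   // whole bean product incorrectly filtered
            }
        } else {
            $nonWholeBean++;
            if ($sample['persisted']) {
                $falsePositives++;   // non-whole-bean product incorrectly persisted
            }
        }
    }

    return [
        'falsePositiveRate' => $nonWholeBean > 0 ? (float) ($falsePositives / $nonWholeBean) : null,
        'falseNegativeRate' => $wholeBean > 0 ? (float) ($falseNegatives / $wholeBean) : null,
    ];
}

/** Share of complete product pages that received exactly 100, or null if empty. */
function perfectScoreShare(array $completePageScores): ?float
{
    if ($completePageScores === []) {
        return null;
    }
    $perfect = count(array_filter($completePageScores, static fn ($s) => (float) $s === 100.0));
    return (float) ($perfect / count($completePageScores));
}
```

If the team prefers to measure false positives against everything persisted rather than against all non-whole-bean items, only the denominator changes.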

Future Enhancements

Product Detection

  • Consider adding product type field to CoffeeBean entity (WHOLE_BEAN, GROUND, POD, etc.)
  • Add automated retraining of confidence thresholds based on manual review feedback
  • Consider separate schemas for different product types if we expand beyond whole beans
  • A/B test different prompt variations to optimize scoring accuracy

Monitoring

  • Dashboard showing:
     • Products filtered by type (drip bag vs pod vs capsule vs ground)
     • Confidence score distribution (0-19, 20-39, 40-59, 60-79, 80-89, 90-99, 100)
     • False positive rate (manual review required)
     • LLM hedging analysis (% using 100 vs 85-90 for complete products)
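The score distribution listed for the dashboard could be produced by a small bucketing function. This is a minimal sketch with a hypothetical name, `scoreDistribution`; the bucket labels are the ones listed above, and scores are assumed to lie in 0-100.

```php
<?php
/**
 * Hypothetical helper: count scores per dashboard bucket.
 *
 * @param float[] $scores Scores in the range 0-100
 * @return array<string|int, int> bucket label => count
 */
function scoreDistribution(array $scores): array
{
    $buckets = [
        '0-19' => 0, '20-39' => 0, '40-59' => 0, '60-79' => 0,
        '80-89' => 0, '90-99' => 0, '100' => 0,
    ];

    foreach ($scores as $score) {
        if ($score >= 100) {
            $buckets['100']++;      // only an exact 100 counts as "no hedging"
        } elseif ($score >= 90) {
            $buckets['90-99']++;
        } elseif ($score >= 80) {
            $buckets['80-89']++;
        } elseif ($score >= 60) {
            $buckets['60-79']++;
        } elseif ($score >= 40) {
            $buckets['40-59']++;
        } elseif ($score >= 20) {
            $buckets['20-39']++;
        } else {
            $buckets['0-19']++;
        }
    }

    return $buckets;
}
```

Comparing the `100` bucket against `80-89` and `90-99` over time gives a rough view of whether hedging is decreasing after the prompt change.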

Risk Assessment

Low Risk

  • Schema changes: Only improves LLM prompts, doesn't break existing functionality
  • Threshold adjustment: 40 is conservative, allows buffer for edge cases

Medium Risk

  • Confidence threshold tuning: May need adjustment based on production data
  • Ground coffee detection: Need to ensure BOTH-option pages don't get filtered
  • LLM behavior changes: Scoring may vary with model updates

Mitigation

  • Extensive testing before deployment
  • Monitoring and logging for first week in production
  • Ability to quickly adjust confidence thresholds via config
  • Manual review of filtered products to catch false positives
  • Rollback plan: Revert schema changes, restore threshold to 70
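Adjusting the threshold "via config" and rolling back to 70 are easiest when the value is read from one place. The sketch below shows one possible shape, assuming an environment variable; the variable name `CONTENT_CONFIDENCE_THRESHOLD` and the function `resolveThreshold` are hypothetical and not taken from the project.

```php
<?php
// Hypothetical resolver: parse a raw config value into a threshold,
// falling back to the plan's default of 40.0 when the value is missing,
// non-numeric, or outside the 0-100 score range.
function resolveThreshold(?string $raw, float $default = 40.0): float
{
    if ($raw === null) {
        return $default;
    }

    $trimmed = trim($raw);
    if ($trimmed === '' || !is_numeric($trimmed)) {
        return $default;
    }

    $value = (float) $trimmed;
    if ($value < 0.0 || $value > 100.0) {
        return $default;
    }

    return $value;
}

// Usage: getenv() returns false when the variable is not set.
$raw = getenv('CONTENT_CONFIDENCE_THRESHOLD'); // assumed variable name
$threshold = resolveThreshold($raw === false ? null : $raw);
```

Under this sketch, a rollback to the previous behaviour would mean setting the variable to 70 without a code change.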

Checklist

Implementation

  • [ ] Enhance UrlPatternClassificationService prompt
  • [ ] Enhance coffee-bean.schema.json description
  • [ ] Update confidence threshold from 70 to 40
  • [ ] Write unit tests for product detection changes
  • [ ] Create test fixtures (drip bag HTML, whole bean + ground HTML, etc.)
  • [ ] Run full QA suite and fix any issues

Testing

  • [ ] Manual test with drip bag URLs (expect 20-39)
  • [ ] Manual test with pod/capsule URLs (expect 20-39)
  • [ ] Manual test with ground coffee URLs (expect 20-39)
  • [ ] Manual test with equipment URLs (expect 0-19)
  • [ ] Manual test with blog URLs (expect 0-19)
  • [ ] Manual test with whole bean URLs (ensure no regression, expect 70+)
  • [ ] Manual test with whole bean + ground variation pages (expect 70+)
  • [ ] Verify complete products score 100 (not 85-90)

Deployment

  • [ ] Deploy to staging
  • [ ] Run staging tests
  • [ ] Deploy to production
  • [ ] Monitor for 1 week
  • [ ] Review filtered products (manual sampling)
  • [ ] Check score distribution (ensure 100s are being used)
  • [ ] Adjust prompts/thresholds if needed
  • [ ] Document lessons learned