Product Deep Dive

Content Inspection: 62 Fields, 5 Retailers, Every Day

February 10, 2026 · 7 min read · Crawlbot Team

When a shopper lands on a product detail page, they see a title, a price, a few photos, and maybe some specs. What they don't see is how wildly different that same product can look across different retailers. A laptop listed on Currys might have 12 photos, a full A+ content section, and a 4.7-star review score. The same laptop on Box.co.uk might have 3 photos, no video, and a title that's truncated to the point of being useless. For the brand behind that laptop, the difference is invisible unless someone is checking every single page, every single day.

That is exactly what our Content Inspection product does. We scrape full product detail pages (PDPs) across 5 retailers, parse 62 distinct fields per product, and deliver a daily snapshot of how every product in your catalog actually appears on each retailer's website. Not a sample. Not an estimate. Every product, every day.

What we extract: the 62-column dataset

Our core database table, which we internally call final_boss, stores 62 columns per product. That number isn't arbitrary — it's the result of months of iterating on what brands actually need to audit their retail presence. Here's the breakdown by category.

Identity & basics

Every record starts with the fundamentals: title, brand, MPN (Manufacturer Part Number), EAN, and retailer SKU. The MPN and EAN are critical for cross-retailer matching. When we see the same MPN on Currys and Argos, we know it's the same product and can compare content side by side. Our matching waterfall runs in four stages: MPN exact match, then EAN exact match, then a spec hash fallback, and finally new product creation if nothing matches.

Pricing & promotions

We capture current price, was price, promo status, and promo savings. But we go further than just storing today's price. A MySQL trigger (trg_final_boss_price_change) fires on every update to the price field, logging the old price, new price, old promo status, new promo status, and exact timestamp into a price_history table. This means brands can see not just what a product costs today, but exactly when and how prices changed over time — and whether those changes coincided with a promotional campaign.
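The trigger's behaviour can be sketched in a few lines. This is a minimal illustration rather than the production DDL: it uses Python's built-in sqlite3 module as a stand-in for MySQL so it runs anywhere, and every column name is an assumption (only the final_boss, price_history, and trg_final_boss_price_change names come from the text above). The WHEN clause, which skips updates that leave the price unchanged, is also an assumption.

```python
import sqlite3

# Hypothetical, heavily reduced schema: the real table has 62 columns.
SCHEMA = """
CREATE TABLE final_boss (
    id           INTEGER PRIMARY KEY,
    title        TEXT,
    price        REAL,
    promo_status TEXT
);
CREATE TABLE price_history (
    product_id       INTEGER,
    old_price        REAL,
    new_price        REAL,
    old_promo_status TEXT,
    new_promo_status TEXT,
    changed_at       TEXT
);
-- Assumption: only log a row when the price value actually differs.
CREATE TRIGGER trg_final_boss_price_change
AFTER UPDATE OF price ON final_boss
FOR EACH ROW
WHEN OLD.price IS NOT NEW.price
BEGIN
    INSERT INTO price_history
    VALUES (OLD.id, OLD.price, NEW.price,
            OLD.promo_status, NEW.promo_status, CURRENT_TIMESTAMP);
END;
"""


def demo_price_history():
    """Run two updates and return the rows the trigger logged."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO final_boss VALUES (1, 'Example laptop', 799.0, 'none')"
    )
    # A real price change: the trigger logs one history row.
    conn.execute(
        "UPDATE final_boss SET price = 699.0, promo_status = 'sale' WHERE id = 1"
    )
    # Same price written again: the WHEN clause suppresses a second row.
    conn.execute("UPDATE final_boss SET price = 699.0 WHERE id = 1")
    rows = conn.execute(
        "SELECT product_id, old_price, new_price, "
        "old_promo_status, new_promo_status FROM price_history"
    ).fetchall()
    conn.close()
    return rows
```

Logging in the database rather than in scraper code means any writer that touches the price column produces a history row, whichever scraper made the change.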

Media & content

For each product, we record photo count, individual photo URLs, A+ content presence, and video presence. We also calculate a "Clean Content %" score that measures overall listing completeness. A product with 8 photos, a video, rich A+ content, and fully populated spec fields scores much higher than a listing with 2 photos and a sparse description. This scoring lets brand teams quickly identify which listings need attention — sort by score, fix the worst ones first.
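The exact formula behind "Clean Content %" is not spelled out here, so the sketch below only shows the general shape of a weighted completeness score. The weights, the 8-photo target, and the field names are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    photo_count: int
    has_video: bool
    has_aplus: bool
    specs_filled: int  # populated spec fields
    specs_total: int   # spec fields expected for the category


# Hypothetical weights (sum to 1.0) and photo target.
WEIGHTS = {"photos": 0.35, "video": 0.15, "aplus": 0.20, "specs": 0.30}
TARGET_PHOTOS = 8


def clean_content_pct(listing: Listing) -> float:
    """Return a 0-100 completeness score for one listing."""
    # Photos beyond the target earn no extra credit.
    photos = min(listing.photo_count, TARGET_PHOTOS) / TARGET_PHOTOS
    specs = (
        listing.specs_filled / listing.specs_total
        if listing.specs_total
        else 0.0
    )
    score = (
        WEIGHTS["photos"] * photos
        + WEIGHTS["video"] * listing.has_video
        + WEIGHTS["aplus"] * listing.has_aplus
        + WEIGHTS["specs"] * specs
    )
    return round(100 * score, 1)
```

Under these assumed weights, a listing with 8 photos, a video, A+ content, and all specs filled scores 100, while one with 2 photos, no video, no A+ content, and 4 of 13 specs filled lands far lower. Sorting by this number in ascending order gives the "fix the worst ones first" queue.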

Reviews

We extract review count and review score from every PDP. A product with 3 reviews at 2.8 stars on AO.com but 450 reviews at 4.6 stars on Currys tells a brand a lot about where they need to invest in review generation campaigns.

Hardware specs: the hard part

This is where things get genuinely difficult. Every retailer structures their spec data differently. Some have clean JSON-LD. Some embed specs in HTML tables with inconsistent class names. Some render specs client-side from React state. We extract and parse CPU model, CPU brand, GPU, RAM size, RAM type, storage size, storage type, screen size, resolution, refresh rate, panel type, touchscreen, and OS.

Spec parsing is a massive engineering challenge. We use SQL regex (REGEXP_SUBSTR) for structured extraction and JavaScript parsers with brand-specific patterns for edge cases. An ASUS laptop might list its GPU as "NVIDIA GeForce RTX 4060 8GB" while a Lenovo lists the same chip as "RTX4060". Our parsers normalize these into consistent, comparable values.
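As a rough illustration of the normalization step, the sketch below maps both GPU strings from the example above to one canonical value. It is written in Python for readability, not the JavaScript used by the parsers described here, and the pattern covers only NVIDIA RTX/GTX names; the function name and the regex are assumptions.

```python
import re
from typing import Optional

# Matches names such as "RTX 4060", "RTX4060", or "GTX 1650 Ti".
_GPU_RE = re.compile(
    r"\b(RTX|GTX)\s*(\d{3,4})(?:\s*(Ti|SUPER))?\b",
    re.IGNORECASE,
)


def normalize_gpu(raw: str) -> Optional[str]:
    """Return a canonical GPU name, or None if the pattern does not match."""
    m = _GPU_RE.search(raw)
    if not m:
        return None
    family, model, suffix = m.group(1).upper(), m.group(2), m.group(3)
    name = f"{family} {model}"
    if suffix:
        # Canonical casing for the two suffixes this sketch knows about.
        name += " Ti" if suffix.lower() == "ti" else " SUPER"
    return name
```

Memory size ("8GB") and vendor prefixes ("NVIDIA GeForce") are deliberately dropped, so listings are compared on the chip alone.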

Five retailers, five different battles

Each of our 5 retailers required a completely different scraping approach. There is no universal technique that works across all of them.

Currys

Currys is our highest-volume retailer, at around 1,200 products per day across 7 categories, and is scraped using pure DOM parsing. We target product tile grids on category listing pages to discover products, then visit each PDP for the full data extract. Currys recently broke our scraper by changing their HTML structure: product containers moved from .Product-display to .product-tile, the result count element changed, and title elements moved to h2.pdp-grid-product-name. We adapted within hours.

Box.co.uk

Box added Cloudflare Turnstile CAPTCHA protection, which blocks all headless browsers outright. We rewrote their category scraper to use Bright Data Web Unlocker, a proxy service that handles CAPTCHA solving on our behalf via HTTP requests, with no Playwright needed. The PDPs presented a different challenge: they're entirely client-side rendered by Angular, meaning the raw HTML contains no product data at all. We had to extract from Angular's ng-state on category pages and add DOM fallbacks with proxy-assisted Playwright for individual PDPs.

Argos

Argos runs daily between 6:00 AM and 10:30 AM across 7 categories, with full PDP scraping that includes EAN and MPN extraction. The window is longer because Argos rate-limits more aggressively, so we space out requests to avoid detection. Each product page is visited, parsed, and stored, including fields that Argos uniquely provides, such as EAN barcodes, which are invaluable for cross-retailer matching.
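Spreading requests over a fixed window is mostly arithmetic: divide the window into one slot per request and add a little jitter so the timing is not perfectly regular. The sketch below shows one way to do that. The function name, the jitter fraction, and the 900-product figure in the usage note are hypothetical, since the Argos product count is not stated here.

```python
import random
from datetime import datetime, timedelta
from typing import List, Optional


def spaced_schedule(
    start: datetime,
    end: datetime,
    n_requests: int,
    jitter_frac: float = 0.2,
    seed: Optional[int] = None,
) -> List[datetime]:
    """Return n_requests send times spread evenly across [start, end)."""
    if n_requests <= 0:
        return []
    rng = random.Random(seed)
    slot: timedelta = (end - start) / n_requests
    times = []
    for i in range(n_requests):
        # Centre of slot i, nudged by at most jitter_frac of a slot.
        # Keeping jitter_frac below 0.5 keeps each time inside its own slot,
        # so the schedule stays ordered and inside the window.
        position = i + 0.5 + rng.uniform(-jitter_frac, jitter_frac)
        times.append(start + position * slot)
    return times
```

For a window from 6:00 AM to 10:30 AM (16,200 seconds) and a hypothetical 900 products, each slot is 18 seconds wide, so consecutive requests fall roughly 18 seconds apart.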

AO.com

AO.com is a React single-page application behind Cloudflare, meaning we need both Playwright (to render React) and SmartProxy residential proxies (to bypass Cloudflare). Product photos and videos come from a #product-json script tag embedded in the page, while hardware specs live in accordion sections that are pre-rendered in the DOM but visually collapsed. We scrape 9 categories covering 650+ products, with the scraping window running from 02:30 to 03:26 AM.
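Once the page has been rendered and its HTML captured, pulling the embedded JSON out of a script tag needs nothing beyond the standard library. The sketch below assumes the rendered HTML is already available as a string. The SAMPLE markup and its "images" and "videos" keys are hypothetical, as the real payload's structure is not documented here.

```python
import json
from html.parser import HTMLParser


class ScriptJsonExtractor(HTMLParser):
    """Collect the text inside <script id="..."> and parse it as JSON."""

    def __init__(self, script_id: str):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def payload(self):
        text = "".join(self._chunks).strip()
        return json.loads(text) if text else None


def extract_script_json(html: str, script_id: str = "product-json"):
    """Return the parsed JSON from the matching script tag, or None."""
    parser = ScriptJsonExtractor(script_id)
    parser.feed(html)
    parser.close()
    return parser.payload()


# Hypothetical page fragment, for illustration only.
SAMPLE = (
    '<html><body><script id="product-json" type="application/json">'
    '{"images": ["a.jpg", "b.jpg"], "videos": []}'
    "</script></body></html>"
)
```

Calling extract_script_json(SAMPLE) returns the parsed dictionary, and a page without the tag returns None, which a scraper can treat as a signal to fall back to DOM parsing.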

Amazon (paused)

Amazon was our fifth active retailer, but we've temporarily paused it due to proxy costs. Amazon's anti-bot defenses are the most aggressive in the market, requiring expensive residential proxy bandwidth for every single request. We'll bring it back once we've optimized our proxy routing to reduce per-request costs.

Nine categories, one nightly schedule

We monitor 9 product categories: Laptops (NC), Chromebooks (CB), Gaming Laptops (NG), Monitors (MO), Gaming Monitors (MG), Desktops (DT), Gaming Desktops (DG), All-in-Ones (AIO), and Projectors (PROJ). Each category has its own URL pattern per retailer, and some retailers don't carry all categories.

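The scraping schedule is staggered through the night and early morning to avoid overwhelming our workers and to respect retailer rate limits. AO.com, for example, runs from 02:30 to 03:26 AM, while Argos runs from 6:00 AM to 10:30 AM.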

By the time the team arrives in the morning, every product across every retailer has been freshly scraped. The dashboard shows the latest data, and any price changes from overnight have already been logged to the history table.

Product matching: linking the same product across retailers

A laptop doesn't have the same SKU on Currys as it does on Argos. The same device might be "ASUS Zenbook 14 UX3405MA" on one site and "Zenbook UX3405MA-PP007W 14in Intel Core Ultra 5" on another. Our matching system uses a four-stage waterfall to link products across retailers:

  1. MPN exact match — The manufacturer part number is the gold standard. When two listings share an MPN, they're definitively the same product.
  2. EAN exact match — European Article Numbers (barcodes) provide a secondary unique identifier, especially useful when retailers omit the MPN.
  3. Spec hash fallback — When neither MPN nor EAN is available, we hash the key specs (CPU, RAM, storage, screen) and match on that.
  4. Create new — If nothing matches, a new canonical product entry is created.
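The four stages above can be sketched as a small in-memory index. This is an illustration of the waterfall's control flow, not the production matcher: the class and field names, the choice of SHA-256, and the rule that a spec hash needs all four specs present are assumptions.

```python
import hashlib
from typing import Optional, Tuple


def spec_hash(listing: dict) -> Optional[str]:
    """Hash the key specs; return None if any of them is missing."""
    keys = ("cpu", "ram", "storage", "screen")
    values = [str(listing.get(k) or "").strip().lower() for k in keys]
    if not all(values):
        return None  # assumption: incomplete specs are too weak to match on
    return hashlib.sha256("|".join(values).encode("utf-8")).hexdigest()


class ProductIndex:
    """In-memory lookup tables from identifier to canonical product id."""

    def __init__(self):
        self.by_mpn = {}
        self.by_ean = {}
        self.by_hash = {}
        self._next_id = 1

    def match(self, listing: dict) -> Tuple[int, str]:
        """Return (canonical_id, stage) for a scraped listing."""
        mpn = (listing.get("mpn") or "").strip().upper()
        ean = (listing.get("ean") or "").strip()
        digest = spec_hash(listing)

        if mpn and mpn in self.by_mpn:              # stage 1: MPN
            cid, stage = self.by_mpn[mpn], "mpn"
        elif ean and ean in self.by_ean:            # stage 2: EAN
            cid, stage = self.by_ean[ean], "ean"
        elif digest and digest in self.by_hash:     # stage 3: spec hash
            cid, stage = self.by_hash[digest], "spec_hash"
        else:                                       # stage 4: create new
            cid, stage = self._next_id, "new"
            self._next_id += 1

        # Record every identifier this listing carries so that later
        # listings can match on any of them.
        if mpn:
            self.by_mpn.setdefault(mpn, cid)
        if ean:
            self.by_ean.setdefault(ean, cid)
        if digest:
            self.by_hash.setdefault(digest, cid)
        return cid, stage
```

Because each matched listing registers all of its identifiers, a listing matched by MPN can contribute an EAN that later catches a listing where the MPN was omitted.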

What brands do with this data

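Content Inspection isn't just a data dump. It answers specific business questions that brand teams deal with every week, such as which listings score lowest on content completeness, where review counts and scores lag, and when a price or promotion changed.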

The result is a daily audit of your entire retail presence, automated and delivered before breakfast. No manual spot-checking, no spreadsheets copy-pasted from retailer portals, no gaps in coverage. Sixty-two fields, five retailers, every product, every day.