Anti-bot technology is the single biggest challenge in e-commerce scraping. Every major retailer we monitor employs some form of bot detection, and the techniques evolve constantly. In the past six weeks alone, we have dealt with Cloudflare Turnstile deployments, Amazon's increasingly aggressive request fingerprinting, and retailers quietly switching from server-rendered HTML to fully client-side JavaScript applications. There is no universal bypass. What works on one retailer fails on another, and what worked last month may stop working tomorrow.
We have built a multi-layered approach at Crawlbot that adapts per retailer. Across our 38 scraper files covering 18 retailers in the UK and South Africa, we deploy five distinct anti-detection strategies. Here is a detailed look at each one, including the specific problems we solved and the results we achieved.
Cloudflare Turnstile: The Box.co.uk Challenge
In February 2026, Box.co.uk deployed Cloudflare Turnstile across their entire site. Turnstile is Cloudflare's successor to hCAPTCHA -- a non-interactive challenge that runs JavaScript checks in the background to determine whether a visitor is human. Unlike traditional CAPTCHAs, there is nothing to "solve." The challenge evaluates browser environment signals: WebGL renderer strings, canvas fingerprints, installed fonts, mouse movement patterns, and dozens of other signals that headless browsers get wrong.
Every headless browser we tested failed. Playwright with stealth plugins, Puppeteer Extra with the stealth plugin, even browser automation frameworks that claim Turnstile bypass -- all returned empty pages or infinite redirect loops. The challenge tokens were never issued because the browser environment checks never passed.
Our solution was to remove the browser entirely. We rewrote the Box SoV and category scrapers to use Bright Data's Web Unlocker API -- pure HTTP requests routed through Bright Data's infrastructure, which handles Cloudflare challenges server-side. No Playwright, no headless Chrome, no browser fingerprints to detect. The API returns fully rendered HTML as if a real browser had loaded the page.
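In outline, a Web Unlocker call is just an ordinary HTTP fetch routed through the provider's proxy endpoint. The credential layout, host, and port below are placeholders rather than Bright Data's actual values; take the real ones from your zone configuration:

```python
import urllib.request


def unlocker_proxy(customer: str, zone: str, password: str, host: str) -> str:
    # Hypothetical credential layout -- confirm the exact format in your
    # provider's dashboard before use.
    return f"http://{customer}-zone-{zone}:{password}@{host}"


def fetch_unlocked(url: str, proxy_url: str, timeout: float = 60.0) -> str:
    # Route the request through the unlocker proxy. Challenge solving
    # happens server-side, so the response body is fully rendered HTML.
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    )
    with opener.open(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", "replace")
```

The scraper side stays trivially simple: no browser lifecycle, no fingerprint maintenance, just a GET and a parse.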
The results were immediate: 24 products per category page, 797 laptops across 34 pages with full pagination support. Category pages returned clean JSON from Box's Angular application state, which we parse directly for product titles, prices, SKUs, and sponsored flags.
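Once the rendered HTML comes back, extraction reduces to plain JSON parsing. The script-tag id and state shape below are illustrative assumptions, not Box's actual markup:

```python
import json
import re

# Assumed: the SPA serializes its application state into a <script> tag.
STATE_RE = re.compile(r'<script[^>]*id="app-state"[^>]*>(.*?)</script>', re.S)


def parse_category_state(html: str):
    m = STATE_RE.search(html)
    if not m:
        return []
    state = json.loads(m.group(1))
    return [
        {
            "title": p.get("title"),
            "price": p.get("price"),
            "sku": p.get("sku"),
            "sponsored": bool(p.get("sponsored")),
        }
        for p in state.get("products", [])
    ]
```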
Product detail pages (PDPs) were a different problem. Box had also migrated their PDPs to a fully client-side Angular application. The raw HTML source contains no product data at all -- everything is rendered by JavaScript after page load. For these pages, we needed a real browser. We added Box to our SmartProxy residential proxy rotation and use Playwright with stealth plugins to render the Angular app, then extract product data from the rendered DOM using fallback selectors for product codes, specifications, and EAN numbers.
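The fallback-selector pattern from that PDP scraper can be sketched independently of the browser. Here `get_text` stands in for a Playwright call such as `page.text_content`, and the EAN regex is a simplifying assumption (EAN-13 only):

```python
import re
from typing import Callable, Optional, Sequence


def first_match(selectors: Sequence[str],
                get_text: Callable[[str], Optional[str]]) -> Optional[str]:
    # Try CSS selectors in priority order; return the first non-empty hit.
    for sel in selectors:
        text = get_text(sel)
        if text and text.strip():
            return text.strip()
    return None


# EAN-13 barcodes are 13 digits; pull the first one out of a spec blob.
EAN_RE = re.compile(r"\b\d{13}\b")


def extract_ean(spec_text: str) -> Optional[str]:
    m = EAN_RE.search(spec_text)
    return m.group(0) if m else None
```

When a retailer reshuffles its markup, only the selector list changes; the extraction logic stays put.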
Amazon Bot Detection: From Zero Products to Thirty
Amazon's bot detection is among the most aggressive in e-commerce. Our original Playwright-based SoV scraper was returning zero products. Not some products, not degraded results -- literally zero. Amazon detected the headless browser within milliseconds and served a CAPTCHA page or an empty search results container.
We tried the standard countermeasures: randomized user agents, stealth plugins, residential proxies, human-like delays. None worked consistently. Amazon's detection goes deeper than browser fingerprints. They analyze TLS fingerprints (the way the browser negotiates the HTTPS connection), HTTP/2 frame ordering, header capitalization patterns, and cookie behavior across redirects. A headless Chromium instance simply does not look like a real Chrome browser at the network protocol level.
We took the same approach that worked for Box: we rewrote the Amazon SoV scraper from Playwright to Bright Data Web Unlocker. The scraper now sends pure HTTP requests with country: 'gb' targeting to ensure UK-localized results, then parses the returned HTML with regex-based extraction.
Product extraction targets data-asin attributes and s-result-item blocks in the search results HTML. Sponsored detection looks for the AdHolder CSS class or the literal text "Sponsored" within product card containers. Price extraction targets a-offscreen spans while carefully skipping strikethrough and "was-price" elements that would give us the wrong number.
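A simplified version of that regex extraction looks like this. The patterns and the sample markup are deliberately reduced, not Amazon's full HTML:

```python
import re

# Each result block starts at a data-asin attribute (ASINs are 10 chars).
ASIN_BLOCK_RE = re.compile(
    r'data-asin="(?P<asin>[A-Z0-9]{10})"(?P<body>.*?)(?=data-asin="|$)', re.S)
# Strikethrough "was" prices live inside a-text-price wrappers; drop them
# before reading the visible a-offscreen price.
WAS_PRICE_RE = re.compile(r'class="a-price a-text-price".*?</span></span>', re.S)
PRICE_RE = re.compile(r'class="a-offscreen">\s*£(?P<price>[\d,.]+)')


def parse_results(html: str):
    products = []
    for m in ASIN_BLOCK_RE.finditer(html):
        body = WAS_PRICE_RE.sub("", m.group("body"))
        price_m = PRICE_RE.search(body)
        products.append({
            "asin": m.group("asin"),
            "sponsored": "AdHolder" in body or ">Sponsored<" in body,
            "price": float(price_m.group("price").replace(",", "")) if price_m else None,
        })
    return products
```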
The result: 16 to 30 products per category with correct GBP prices, reliable sponsored/organic classification, and consistent daily coverage across all 9 Amazon UK laptop categories.
SmartProxy Residential Rotation for Moderate Defenses
Not every retailer requires the nuclear option of a Web Unlocker. For sites with moderate anti-bot defenses -- primarily Cloudflare in "managed challenge" mode rather than Turnstile -- we use SmartProxy residential UK proxies with Playwright.
AO.com is a React single-page application behind Cloudflare. The product data is rendered client-side from JSON embedded in script tags, so we need a real browser to execute the JavaScript. But AO's Cloudflare configuration is less aggressive than Box's Turnstile -- it primarily checks IP reputation and basic browser signals. A Playwright instance routed through a UK residential IP passes these checks reliably.
Argos uses similar moderate protection. Their product detail pages require consistent sessions (the same IP must load the category page and then the PDP, or the session cookie is invalidated). SmartProxy's sticky session feature handles this -- we bind a session to a single residential IP for the duration of a product scraping run.
The proxy endpoint is configured at proxy.smartproxy.net:3120 and currently handles three retailers: Argos, AO.com, and Box PDP pages. We route approximately 2,500 requests per day through SmartProxy, with a success rate above 97%.
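In Playwright terms, a sticky session is just a proxy credential that pins the session id. The username format below is a common residential-proxy convention used here as an assumption; check SmartProxy's documentation for the exact syntax:

```python
SMARTPROXY_HOST = "proxy.smartproxy.net:3120"  # endpoint noted above


def sticky_proxy(user: str, password: str, session_id: str) -> dict:
    # Hypothetical credential layout: many residential providers encode
    # the sticky-session id in the username.
    return {
        "server": f"http://{SMARTPROXY_HOST}",
        "username": f"{user}-session-{session_id}",
        "password": password,
    }

# Reusing the same session_id for the category page and the PDP keeps
# both requests on one residential IP, e.g.:
# browser.new_context(proxy=sticky_proxy("cb_user", "pw", "argos-run-42"))
```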
Browser Fingerprint Stealth
For retailers without Cloudflare or advanced bot detection -- sites like Currys, John Lewis, Overclockers, and Scan -- we use Playwright Extra with the stealth plugin. This modifies browser fingerprints to make the headless Chromium instance appear as a regular desktop Chrome installation.
The stealth plugin handles the well-known detection vectors: it patches navigator.webdriver to return false, randomizes WebGL renderer and vendor strings, modifies canvas fingerprinting responses, spoofs the Chrome runtime object, and adjusts the languages and plugins arrays to match a real Chrome installation.
We also apply heavy resource blocking to improve speed and reduce detection surface. Images, fonts, stylesheets, and tracking scripts are all blocked at the network level before they load. A typical category page that would take 8 seconds with full resources loads in under 2 seconds with blocking enabled. The exception is retailers like Box, where blocking resources prevents the client-side application from rendering at all.
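The blocking decision can be kept as a small pure function and wired into Playwright's route interception. The tracker hostnames and the render-sensitive exemption list here are illustrative:

```python
BLOCKED_TYPES = {"image", "font", "stylesheet", "media"}
TRACKER_HINTS = ("google-analytics", "doubleclick", "hotjar")
RENDER_SENSITIVE = {"box"}  # SPAs that break when assets are blocked


def should_block(resource_type: str, url: str, retailer: str) -> bool:
    # Never block for retailers whose client-side app needs every asset.
    if retailer in RENDER_SENSITIVE:
        return False
    if resource_type in BLOCKED_TYPES:
        return True
    return any(hint in url for hint in TRACKER_HINTS)

# Playwright hookup (sketch):
# page.route("**/*", lambda route: route.abort()
#            if should_block(route.request.resource_type,
#                            route.request.url, retailer)
#            else route.continue_())
```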
Per-Retailer Adaptation: No Two Sites Are the Same
The reason generic scraping platforms fail at scale is that they apply one strategy to every website. Our 38 scraper files exist because each retailer presents a unique combination of anti-bot technology, page structure, and data delivery method.
- Game.co.za does not use a standard website at all for its product listings. Behind the consumer-facing pages sits a SAP Hybris OCC API. We POST directly to their API endpoints, bypassing the website entirely. The API requires specific headers and returns JSON, but has aggressive rate limiting -- we throttle to 5-10 second delays between page fetches and implement retry with exponential backoff.
- John Lewis loads products progressively. The initial page shows 24 items, and a "Show more" button must be clicked to reveal the rest. Our scraper programmatically clicks this button until all products are loaded, handling the asynchronous DOM updates between each click.
- Currys changed their entire DOM structure without warning in February 2026. Product containers moved from .Product-display to .product-tile, the result count element changed, the product grid class changed, and title elements shifted from generic divs to h2.pdp-grid-product-name. We detected the breakage through our monitoring dashboard and deployed updated selectors within hours. Across 7 categories, the updated scraper found 1,167 products: 317 notebooks, 45 Chromebooks, 147 gaming laptops, 190 monitors, 222 mini/micro PCs, 110 desktops, and 136 desktop gaming rigs.
- Computermania runs on Shopify, where collection handles are singular (chromebook, not chromebooks). A single character difference in the URL produces a 404 instead of a product listing. Every retailer has quirks like this that only surface through testing.
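The John Lewis "Show more" loop is a good example of logic worth isolating. With the browser calls injected as callables, the termination conditions become explicit and testable without a browser (the function names here are illustrative, not our scraper's actual API):

```python
def load_all_products(click_show_more, count_products, max_clicks=50):
    """Click a 'Show more' button until the product count stops growing.

    click_show_more() returns True if the button was found and clicked;
    count_products() returns the current number of product tiles.
    """
    last = count_products()
    for _ in range(max_clicks):
        if not click_show_more():
            break  # button gone: everything is loaded
        now = count_products()
        if now == last:
            break  # DOM stopped growing despite the click
        last = now
    return last
```

The `max_clicks` cap guards against a page that keeps rendering the button forever, which would otherwise hang the scraping run.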
Smart Request Throttling and Memory Management
Anti-bot defenses are not only about fingerprints and CAPTCHAs. Rate limiting is equally important. Sending requests too quickly triggers IP bans and CAPTCHA challenges even when individual requests look perfectly human.
We implement per-retailer rate limiting based on observed thresholds. Game.co.za's PX rate limiter triggers after just a few rapid requests, so we insert 5-10 second delays between page fetches. Amazon requires careful pacing to avoid triggering their automated review. For less aggressive retailers, we use 1-2 second delays that balance speed with safety.
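A per-retailer delay table keeps that pacing in one place. The Game.co.za and default ranges reflect the thresholds above; the Amazon range is an assumed placeholder to be tuned against observed limits:

```python
import random
import time

# Per-retailer pacing ranges in seconds (min, max).
RETAILER_DELAYS = {
    "game_za": (5.0, 10.0),  # PX rate limiter trips after a few rapid hits
    "amazon": (3.0, 6.0),    # assumption -- tune against observed limits
    "default": (1.0, 2.0),
}


def pick_delay(retailer: str) -> float:
    lo, hi = RETAILER_DELAYS.get(retailer, RETAILER_DELAYS["default"])
    return random.uniform(lo, hi)


def throttle(retailer: str) -> None:
    # Randomized delays avoid the perfectly regular intervals that
    # rate limiters flag as automated traffic.
    time.sleep(pick_delay(retailer))
```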
All request failures trigger exponential backoff -- the delay between retries doubles with each failure, starting at 2 seconds and capping at 60 seconds. This prevents cascading failures where a temporary rate limit escalates into a permanent IP ban.
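That backoff schedule is a few lines of code; the fetch wrapper name below is illustrative:

```python
import time


def backoff_delays(base=2.0, cap=60.0, retries=6):
    # Delays double each failure: 2, 4, 8, ... capped at 60 seconds.
    delay = base
    for _ in range(retries):
        yield min(delay, cap)
        delay *= 2


def fetch_with_backoff(fetch, retries=6):
    for delay in backoff_delays(retries=retries):
        try:
            return fetch()
        except Exception:
            time.sleep(delay)  # consider adding jitter in production
    return fetch()  # final attempt; let the error surface to the caller
```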
Our workers also enforce a 1.5 GB memory ceiling. After each scraping job completes, the worker checks its heap usage. If it exceeds the threshold, the worker exits gracefully and NSSM (our Windows service manager) restarts it within 5 seconds. This prevents memory-degraded performance -- a browser instance that has been running for hours accumulates leaked DOM references, cached resources, and fragmented memory that slows down page loads and makes detection more likely.
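The restart check itself is trivial; the important part is running it between jobs rather than mid-job. Measuring heap usage is runtime-specific, so `current_heap_bytes` is left as an injected callable in this sketch:

```python
import sys

MEMORY_CEILING_BYTES = int(1.5 * 1024 ** 3)  # the 1.5 GB ceiling noted above


def maybe_restart(current_heap_bytes, ceiling=MEMORY_CEILING_BYTES,
                  exit_fn=sys.exit):
    """Run after each scraping job completes.

    Exits the worker gracefully when heap usage crosses the ceiling,
    letting the service manager (NSSM in our case) restart it.
    Returns True if a restart was triggered.
    """
    if current_heap_bytes() >= ceiling:
        exit_fn(0)
        return True
    return False
```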
The Toolkit Approach
There is no single "silver bullet" for anti-bot bypass, and anyone who tells you otherwise is selling something that will stop working next month. The key is having a toolkit of approaches and applying the right one for each retailer's specific anti-bot profile.
Our current toolkit has five layers: Bright Data Web Unlocker for the hardest targets (Cloudflare Turnstile, Amazon), SmartProxy residential rotation for moderate Cloudflare, Playwright stealth for sites with basic detection, direct API access for retailers that expose their data programmatically, and per-retailer DOM selectors that adapt as sites change their markup.
This is why generic scraping platforms fail when applied to competitive intelligence at scale. They use one approach for all sites and inevitably hit a retailer where that approach does not work. Building and maintaining 38 individual scrapers is more work than a single universal scraper, but it is the only approach that delivers consistent, reliable data across 18 retailers, every hour of every day.