Building a Distributed Scraping Network

For the first year of Crawlbot's existence, we relied on Octoparse -- a cloud-based SaaS platform for web scraping. It was a reasonable starting point: visual workflow builder, managed infrastructure, and an API for triggering jobs. But as our monitoring needs grew from a handful of Currys category pages to hundreds of schedules across more than a dozen retailers, the cracks became impossible to ignore. In January 2026, we made the decision to rip it all out and build something better from scratch.

This is the story of how we replaced a cloud scraping service with a distributed queue system running on five physical machines in a local network -- and why it was one of the best engineering decisions we have made.

Why We Left Octoparse

Octoparse served us well initially, but three fundamental problems pushed us toward building our own infrastructure.

First, reliability. The Octoparse API would regularly return ENOTFOUND api.octoparse.com errors -- DNS resolution failures from their own API endpoint. Our orchestrator had to implement retry logic with exponential backoff just to trigger scraping jobs, and even then we would lose hours of data when their infrastructure had issues. For a product that promises hourly Share of Voice monitoring, "the API was down" is not an acceptable excuse.

Second, anti-bot detection. Modern retailers invest heavily in bot protection -- Cloudflare Turnstile, Amazon's automated detection, DataDome, and others. Octoparse's cloud infrastructure uses well-known IP ranges that many retailers have already flagged. We had no way to use residential proxies, stealth plugins, or custom browser fingerprinting. If a retailer blocked Octoparse's IPs, that retailer was simply inaccessible to us.

Third, cost and flexibility. Octoparse charges per task execution, and at 160+ hourly SoV schedules plus daily content scraping across thousands of products, the bill was growing fast. More importantly, every retailer-specific customization -- clicking a "Show more" button on John Lewis, intercepting XHR responses on Scan, forcing a UK location on Amazon -- required workarounds within Octoparse's visual designer that were fragile and hard to maintain.

The Architecture We Built

Our replacement is a TypeScript application built on three core technologies: BullMQ for job queuing, Redis for state management and message brokering, and Playwright for browser automation. The data flow is straightforward:

Scheduler (cron triggers on MASTER)
  --> Producer (reads pending work from MySQL, enqueues to BullMQ)
    --> Redis (two queues: "scraping" for Content, "scraping-sov" for SoV)
      --> Workers (10 processes across 5 machines pick up jobs)
        --> Playwright (headless Chromium scrapes pages)
          --> MySQL (results written to sov_results and final_boss tables)

The Scheduler runs only on the MASTER machine. It reads from a JSON configuration file containing 191 active schedule entries -- 31 for content scraping and 160 for Share of Voice. Each entry specifies a retailer, category, URL, and cron expression. When a schedule fires, the Scheduler invokes the appropriate Producer.

The Producer queries MySQL for pending work (products that need scraping, or category URLs that need SoV monitoring) and pushes jobs into BullMQ queues. There are two queues: scraping for content inspection (full product page scraping) and scraping-sov for Share of Voice (category page monitoring).

The Workers are the workhorses. Each of our five machines runs two worker processes -- one consuming from the content queue, one from the SoV queue. That gives us 10 workers total. When a worker picks up a job, it launches a headless Chromium instance via Playwright, navigates to the target URL, extracts the data, and writes results to MySQL.

Five Machines, One Network

Our scraping cluster consists of five Windows machines on a local network:

MASTER -- the brain. Runs the Scheduler, Producer, Redis (in Docker), the Bull Board dashboard, plus its own pair of Content and SoV workers. This machine orchestrates everything.
BOT1 through BOT4 -- pure worker machines. Each runs one Content worker and one SoV worker, pulling jobs from the Redis queues on MASTER.

All five machines connect to the same Redis instance on MASTER over the local network. This is one of the elegant properties of BullMQ: workers do not need to know about each other. They only need to know where Redis is. Jobs are distributed automatically -- whichever worker is idle picks up the next job from the queue. There is no load balancer, no service mesh, no Kubernetes. Just Redis and five machines pulling from the same queue.

Why Concurrency Is 1

Each worker process runs with a concurrency of 1 -- meaning it processes one job at a time. This might sound like a bottleneck, but Playwright is memory-intensive. A single headless Chromium instance with a fully rendered e-commerce product page can consume 500MB to 1GB of RAM. Running multiple browsers simultaneously on a single machine leads to memory pressure, swapping, and eventually out-of-memory crashes.

Instead of fighting memory limits, we lean into the distributed nature of the system. Ten workers each processing one job at a time gives us a throughput of ten concurrent scraping sessions. That is more than enough to complete 160 hourly SoV schedules (each taking 20-40 seconds) well within the hour, with plenty of capacity left for content scraping jobs.

Memory Management

Even with concurrency of 1, long-running Node.js processes accumulate memory over time -- leaked event listeners, cached DOM trees, Playwright browser contexts that do not fully clean up. We solved this pragmatically: after each job completes, the worker checks its own heap usage. If it exceeds 1.5GB, the worker exits gracefully.

We run all worker processes as Windows Services using NSSM (the Non-Sucking Service Manager). NSSM detects the exit and automatically restarts the process within 5 seconds. The worker comes back up, connects to Redis, and picks up the next job as if nothing happened. From the system's perspective, a worker restart is invisible -- BullMQ handles the brief absence and the job queue keeps flowing.

Per-Retailer Scraper Architecture

One of the biggest advantages of building our own system is that every retailer gets a custom scraper. We currently maintain 38 TypeScript scraper files, each tailored to a specific retailer's page structure and anti-bot measures. The SoV worker dynamically loads the correct scraper based on the retailer name in the job data:

Currys -- DOM parsing of product tiles, sponsored detection via Criteo ad markers injected into the HTML
Amazon -- HTTP-based scraping via Bright Data Web Unlocker (Playwright gets blocked), regex extraction from raw HTML, ASIN deduplication, UK location forcing via UB7 0DQ postcode
John Lewis -- Playwright with "Show more" button clicking to load all results, sponsored detection via [data-testid="sponsored-product-tag"]
Game.co.za -- no browser at all. This South African retailer runs on SAP Hybris, so we hit their OCC API directly via HTTP requests through Bright Data
Box.co.uk -- after they added Cloudflare Turnstile, we rewrote the scraper to use Bright Data Web Unlocker, bypassing the browser entirely for category pages

This per-retailer flexibility is something Octoparse could never offer. When Amazon changes their page structure, we update one file. When a retailer adds Cloudflare protection, we switch that scraper from Playwright to a proxy-based HTTP approach. The rest of the system does not need to know or care.

Monitoring and Reliability

Running a 10-worker distributed system on physical hardware means things will go wrong. Machines lose power. Network connections drop. Windows Update decides it is time to reboot (despite our hardening scripts that disable auto-reboots). We built several layers of defense:

Heartbeats -- every worker writes a key to Redis every 30 seconds with a 60-second TTL. If a worker stops updating, the key expires. A simple check script tells us instantly which workers are alive.
Watchdog -- a PowerShell script runs every 5 minutes on each machine via Task Scheduler. It checks whether the worker processes exist, whether Redis is reachable, and whether heartbeats are fresh (stale = older than 120 seconds). If anything fails, it sends a Discord webhook alert.
NSSM auto-restart -- all services restart automatically on crash with a 5-second delay. Between the memory ceiling exits and occasional Playwright crashes, this happens regularly and is completely invisible to us.
Power hardening -- we run a script on every machine that sets the power plan to High Performance, disables sleep and hibernate, disables USB selective suspend, prevents Windows Update from rebooting, and disables screensavers. These machines exist for one purpose.

The Results

Six weeks after the migration, the numbers speak for themselves. Our SoV data completeness went from approximately 85% (with frequent Octoparse failures and gaps) to over 99%. We can now scrape retailers that were previously inaccessible -- Amazon, Box.co.uk after their Cloudflare upgrade, and several South African retailers that Octoparse could not handle at all. Our monthly infrastructure cost dropped by more than 60% because we are running on hardware we already own.

But the biggest win is iteration speed. When we need to add a new retailer, we write one TypeScript file, add an entry to the schedule config, deploy to all five machines via SCP, and restart the services with NSSM. The entire process takes about 30 minutes. With Octoparse, adding a new retailer could take days of trial and error in their visual designer, only to discover that the retailer's anti-bot measures made it impossible anyway.

Building your own distributed scraping infrastructure is not trivial. But if you need per-retailer customization, proxy support, and full control over your data pipeline, a BullMQ + Redis + Playwright stack running on a handful of dedicated machines is a remarkably effective approach. We have not looked back.

From Cloud to Local: Building a Distributed Scraping Network