When you monitor 18 retailers hourly, every missed scraping window is a gap in the data. A gap at 2 PM on a Tuesday might be the exact hour your competitor launched a flash promotion, or the hour a retailer ran out of sponsored inventory and your brand suddenly dominated the organic results. You'll never know — because the data wasn't collected.
We run 10 workers across 5 physical machines, executing 191 scheduled scraping jobs. Some run hourly, some nightly, some on custom cadences. The infrastructure doesn't live in a managed cloud with auto-scaling and health checks built in. It runs on dedicated machines in a local network, each one a Windows box that can crash, lose power, leak memory, or get rebooted by Windows Update at 3 AM. Building reliability into this kind of environment means layering defenses at every level. Here's how we do it.
Layer 1: NSSM Windows Services
Every worker on every machine runs as a Windows Service managed by NSSM (the Non-Sucking Service Manager). NSSM wraps our Node.js processes and gives them four critical properties that a bare node worker.js process doesn't have:
- Auto-start on boot. When a machine reboots — whether from a power cut, a Windows Update, or someone accidentally pulling the plug — every worker comes back online without anyone logging in.
- Auto-restart on crash. If a worker process exits unexpectedly, NSSM waits 5 seconds and starts it again. No pager, no manual intervention.
- Survives logoff. Interactive processes die when a user logs out. NSSM services run under the SYSTEM account, completely independent of user sessions.
- Log rotation. NSSM rotates stdout and stderr logs at 10MB per file, preventing disk-full scenarios that could take down a machine.
Each machine runs two services: CrawlbotWorker (content scraping) and CrawlbotWorkerSoV (Share of Voice scraping); the MASTER machine additionally runs CrawlbotScheduler (the cron-based job trigger). All are managed through a single setup-nssm.ps1 script that installs, configures, or uninstalls the services with one command.
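Under the hood, setup-nssm.ps1 boils down to a handful of NSSM commands. A sketch for one worker — the service name matches the article, but the paths and log locations are placeholders:

```shell
# Illustrative NSSM setup for one worker (paths are placeholders)
nssm install CrawlbotWorker "C:\Program Files\nodejs\node.exe" "C:\crawlbot\worker.js"
nssm set CrawlbotWorker AppDirectory C:\crawlbot
nssm set CrawlbotWorker Start SERVICE_AUTO_START    # auto-start on boot
nssm set CrawlbotWorker AppExit Default Restart     # restart on any exit
nssm set CrawlbotWorker AppRestartDelay 5000        # wait 5 seconds before restarting
nssm set CrawlbotWorker AppStdout C:\crawlbot\logs\worker.log
nssm set CrawlbotWorker AppStderr C:\crawlbot\logs\worker.err.log
nssm set CrawlbotWorker AppRotateFiles 1            # rotate logs...
nssm set CrawlbotWorker AppRotateBytes 10485760     # ...at 10MB per file
```

NSSM services run under the SYSTEM account by default, which is what gives us the logoff independence described above.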
Layer 2: Memory ceiling
Playwright is memory-hungry. It launches real Chromium instances, navigates pages, waits for JavaScript to render, takes screenshots — and Chromium does not give that memory back gracefully. In a long-running Node.js process, this manifests as a slow, steady memory leak. After 200 or 300 jobs, a worker that started at 200MB of heap usage is now at 1.8GB and climbing.
Our solution is a memory ceiling. After every completed scraping job, the worker checks its V8 heap usage. If it exceeds 1.5GB, the worker logs a message and gracefully exits. NSSM sees the process exit and restarts it in 5 seconds. The fresh worker starts with a clean heap at ~80MB.
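The check itself is a few lines of Node.js. A minimal sketch, using the 1.5GB threshold from above — the function name and the injectable exit hook are illustrative, not our exact code:

```javascript
const MEMORY_CEILING_BYTES = 1.5 * 1024 * 1024 * 1024; // 1.5GB

// Called after every completed scraping job. If the V8 heap has grown past
// the ceiling, log and exit cleanly; the service manager restarts the process
// with a fresh heap. `exit` is injectable so the check is testable.
function checkMemoryCeiling(exit = process.exit) {
  const { heapUsed } = process.memoryUsage();
  if (heapUsed > MEMORY_CEILING_BYTES) {
    const gb = (heapUsed / 1024 ** 3).toFixed(1);
    console.log(`Memory ceiling exceeded: ${gb}GB > 1.5GB, exiting for restart`);
    exit(0); // clean exit; NSSM restarts the service after its 5-second delay
  }
}
```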
This pattern — controlled self-termination followed by automatic restart — is borrowed from how Erlang processes work. Rather than trying to fix memory leaks in Chromium's rendering engine (an impossible task), we accept that the leak exists and plan for it. The result is workers that run indefinitely without degradation.
Layer 3: Watchdog with Discord alerts
NSSM handles crashes. But what about silent failures? A worker process might still be running but stuck in an infinite loop, or unable to connect to Redis, or frozen waiting for a page that will never load. That's where the watchdog comes in.
A PowerShell script called watchdog.ps1 runs every 5 minutes via Windows Task Scheduler on every machine. It performs three checks:
- Is the worker process running? Simple process existence check. If the service crashed and NSSM hasn't restarted it yet (or NSSM itself failed), we catch it here.
- Is Redis reachable? Workers can't function without the job queue. If Redis goes down, every worker on every machine is effectively dead. The watchdog pings Redis and alerts immediately on failure.
- Is the heartbeat fresh? Every worker writes a Redis key with a 60-second TTL, updated every 30 seconds. Content workers write to worker:{ID}:heartbeat and SoV workers to worker:{ID}:sov:heartbeat. If the key has expired or its timestamp is more than 120 seconds old, the worker is considered stale — it might be running, but it's not doing useful work.
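On the worker side, the heartbeat is a small loop. A sketch assuming an ioredis-style client (`set(key, value, 'EX', seconds)`); the injectable client and clock are there only to make the sketch testable:

```javascript
const HEARTBEAT_INTERVAL_MS = 30_000; // write every 30 seconds
const HEARTBEAT_TTL_SECONDS = 60;     // key expires on its own if we die

// Content workers and SoV workers use different key patterns.
function heartbeatKey(workerId, kind = 'content') {
  return kind === 'sov'
    ? `worker:${workerId}:sov:heartbeat`
    : `worker:${workerId}:heartbeat`;
}

// Writes a timestamp under the worker's key with a 60s TTL; a dead worker's
// key simply disappears, which the watchdog treats as a stale heartbeat.
function startHeartbeat(client, workerId, kind, now = Date.now) {
  const key = heartbeatKey(workerId, kind);
  const beat = () => client.set(key, String(now()), 'EX', HEARTBEAT_TTL_SECONDS);
  beat(); // write immediately on startup
  return setInterval(beat, HEARTBEAT_INTERVAL_MS);
}
```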
When any check fails, the watchdog fires a Discord webhook with the machine name, failure type, and timestamp. We see the alert on our phones within seconds of something going wrong. Healthy state is 10 active heartbeats across the cluster: 5 content workers + 5 SoV workers.
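The staleness rule and the alert payload are both simple. A sketch of the logic — in JavaScript here for consistency with the worker examples, though the real watchdog is PowerShell; the message format is illustrative (Discord webhooks accept a JSON body with a `content` field):

```javascript
const STALE_AFTER_MS = 120_000; // heartbeat older than 120s => worker is stale

function isStale(lastBeatMs, nowMs) {
  return nowMs - lastBeatMs > STALE_AFTER_MS;
}

// Minimal Discord webhook payload: machine name, failure type, timestamp.
function buildAlert(machine, failureType, when = new Date()) {
  return { content: `[${machine}] ${failureType} at ${when.toISOString()}` };
}

// Hypothetical usage (webhook URL is a placeholder):
// fetch(WEBHOOK_URL, {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body: JSON.stringify(buildAlert('BOT3', 'stale heartbeat')),
// });
```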
Layer 4: Power hardening
The most frustrating type of downtime is self-inflicted. Windows machines love to sleep, hibernate, dim screens, disable network adapters to save power, and reboot for updates at 3 AM — which is exactly when our Content Inspection scrapers are running.
Our harden-power.ps1 script enforces the following on every machine in the cluster:
- High Performance power plan — No CPU throttling, no power saving states.
- Sleep and hibernate disabled — AC and DC timeouts set to zero.
- Network adapter power management disabled — Prevents the OS from putting NICs to sleep, which would kill Redis connections.
- Windows Update auto-reboot prevention — Active Hours configured to cover the full 24-hour scraping window.
- Screensaver disabled — Even screensavers can interfere with GPU resources that Playwright's Chromium uses.
The script supports a -DryRun flag for auditing what would change, and every setting is idempotent — safe to run repeatedly.
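A few of these settings, expressed as the equivalent powercfg commands the script effectively applies (the Windows Update and screensaver settings live in the registry instead, so they're omitted here):

```shell
# High Performance power plan (built-in Windows scheme GUID)
powercfg /setactive 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c
powercfg /change standby-timeout-ac 0    # never sleep on AC power
powercfg /change standby-timeout-dc 0    # ...or on battery
powercfg /change hibernate-timeout-ac 0  # never hibernate
powercfg /change monitor-timeout-ac 0    # never turn off the display
```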
Layer 5: Docker for core services
Redis is the backbone of the entire system. Every job queue, every heartbeat, every worker coordination mechanism flows through Redis. If Redis goes down, every worker across all 5 machines stalls. We run Redis 7 in a Docker container on the MASTER machine with the --restart unless-stopped flag. Even if the Docker daemon restarts, the Redis container comes back automatically.
The Bull Board dashboard (our queue monitoring UI) runs in a separate Docker container with the same restart policy. This gives us real-time visibility into job status, queue depths, and worker activity at queue.crawlbot.pl.
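The Redis container's restart behavior comes down to one flag. A sketch of the invocation — container name, volume, and persistence options are illustrative, not our exact setup:

```shell
# Redis 7 on MASTER: restarts automatically after crashes and daemon
# restarts, unless someone explicitly stops the container.
docker run -d --name crawlbot-redis \
  --restart unless-stopped \
  -p 6379:6379 \
  -v redis-data:/data \
  redis:7 --appendonly yes
```

`unless-stopped` is the key choice over `always`: a container an operator deliberately stopped stays stopped across daemon restarts, while a crashed one always comes back.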
Layer 6: SSH reverse tunnel
The Bull Board dashboard runs on the MASTER machine inside our local network. To make it accessible from anywhere, we run an SSH reverse tunnel from MASTER to our VPS server. This tunnel itself is managed as an NSSM service (CrawlbotTunnel) with auto-restart, so if the SSH connection drops, it reconnects automatically. Nginx on the VPS proxies queue.crawlbot.pl to the tunnel endpoint with SSL termination via Let's Encrypt. The result: our operations team can monitor queue health from their phones, from anywhere.
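The command NSSM keeps alive is an ordinary ssh reverse forward. A sketch — the ports, user, and hostname are placeholders:

```shell
# Expose the local dashboard (here :3000) on the VPS's localhost:8080,
# where nginx proxies queue.crawlbot.pl to it.
# -N: no remote command, just forwarding
# ServerAliveInterval: detect dead connections so NSSM can restart the tunnel
# ExitOnForwardFailure: fail fast if the remote port can't be bound
ssh -N -R 8080:localhost:3000 tunnel@vps.example.com \
    -o ServerAliveInterval=30 \
    -o ExitOnForwardFailure=yes
```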
Real-world incidents
Theory is nice. Here's how these layers have worked in practice.
The 3 AM Windows Update reboot
BOT3 decided to install updates and reboot at 3:12 AM. The watchdog on MASTER detected stale heartbeats within 5 minutes and fired a Discord alert. But by the time we looked at the alert, BOT3 had already finished rebooting, NSSM had auto-started both workers, and fresh heartbeats were flowing. Total data loss: approximately 10 minutes of scraping capacity on one machine. The other 4 machines continued operating normally throughout.
The Argos memory leak
After a batch of Argos PDP scrapes, one worker's heap climbed to 2.1GB. Argos pages are heavy — lots of embedded JSON, large product images, and complex React component trees. The memory ceiling detected the overshoot after the job completed, logged "Memory ceiling exceeded: 2.1GB > 1.5GB, exiting for restart", and the process exited cleanly. NSSM restarted it in 5 seconds. The next job processed normally at 85MB of initial heap usage. No alert was even needed — the system healed itself.
The BOT2 network migration
BOT2 originally connected via WiFi (IP ending .217). Intermittent WiFi drops were causing the watchdog to fire alerts every few hours. We switched BOT2 to a wired Ethernet connection, which changed its IP to .180. The watchdog immediately detected stale heartbeats on the old IP. After updating our SSH config and NSSM service configuration, heartbeats resumed. The root cause — WiFi instability — was permanently eliminated.
The result: near-perfect uptime
These six layers work together to deliver 99%+ uptime for our hourly SoV monitoring. The system automatically recovers from the vast majority of failure modes: process crashes, memory leaks, power events, network blips, and Windows updates. The only failures that require human intervention are hardware failures (a dead hard drive), infrastructure changes (IP migrations), and novel anti-bot defenses from retailers (which require new scraper code, not reliability engineering).
Reliability engineering is not about building a system that never fails. It's about building a system that fails gracefully and recovers automatically. Every layer we've described assumes that the layer below it will eventually fail. NSSM assumes processes will crash. The memory ceiling assumes Playwright will leak. The watchdog assumes NSSM might not catch everything. Power hardening assumes Windows will try to sabotage the machine. Each layer catches what the previous one misses.
The brands that rely on Crawlbot data for daily competitive decisions need it to be there when they check in the morning. Not most mornings. Every morning. That's what this reliability stack delivers.