Using Watchdog Timers to Auto-Recover Stuck Automation Scripts Without Manual Reset

You’re running automation scripts on Ubuntu, but a crash or hang could leave your system stuck unless you use a watchdog timer. Load the softdog module with `modprobe`, set a 15-second timeout in `/etc/watchdog.conf`, and monitor CPU load, memory, and network health, tuning the thresholds to prevent false triggers. A hardware watchdog backed by software health checks can cut recovery to under 10 seconds, the same principle that spacecraft such as Mars rovers rely on. With systemd’s RuntimeWatchdogSec and proper tuning, your scripts can recover without a manual reset, even after a total hang. There’s more to get right for reliable 24/7 operation.

Last update on 30th May 2026.

Notable Insights

  • Configure a watchdog timer to automatically reboot the system if an automation script fails to refresh it.
  • Use hardware and software watchdogs together for reliable recovery from hangs or crashes.
  • Set a short timeout, like 15 seconds, to enable fast detection of stuck scripts.
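  • Monitor script liveness by having the script write to the watchdog device or send `sd_notify(0, "WATCHDOG=1")` from its main loop; the `wd_keepalive` daemon only feeds the device and does not check your script.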
  • Prevent false triggers by adjusting thresholds for CPU load, memory, and network connectivity in watchdog settings.

Why Your Script Needs a Watchdog Timer

Every minute your automation script hangs is a minute your robot’s motion control stalls, sensors stop logging, or a remote device goes dark, and that’s exactly why you need a watchdog timer. When your script hangs in an infinite loop or deadlock, it stops refreshing the watchdog, so the timer expires and forces an automatic restart. Without it, a hang can last hours or indefinitely. On Mars Pathfinder, a watchdog reset is what recovered the lander when a priority inversion stalled a critical task. A combined hardware and software setup, such as Linux’s `/dev/watchdog0` fed by a userspace daemon or by systemd’s RuntimeWatchdogSec, monitors the system continuously. Careful watchdog configuration makes it far more likely that only real stalls cause a reboot. With a short timeout, recovery can take under 10 seconds, keeping remote or embedded systems alive. You’re not just preventing downtime; you’re building in reliability, especially on unattended robots or automation rigs where manual reset isn’t an option.

Enable and Test Watchdog Support on Ubuntu

You’ve seen how a watchdog timer can rescue your automation script from freezing up and killing your robot’s operations, so now it’s time to get it working on your Ubuntu system. First, check for an existing watchdog device by running `ls -la /dev/watchdog*`. If you see `/dev/watchdog0`, a watchdog driver is already loaded; it may be a hardware driver or softdog, and `dmesg` will tell you which. If nothing is listed, load the software option with `sudo modprobe softdog` and confirm it’s active using `lsmod | grep softdog`. Check `dmesg | grep -i watchdog` to verify the kernel initialized the timer; look for messages such as "softdog: Software Watchdog Timer: 0.08 initialized" (the exact wording varies by kernel version). This software fallback is useful when hardware isn’t present, but because it runs inside the kernel it cannot recover from a hang of the kernel itself. Once the device is opened, it must be pinged regularly or it’ll trigger a system reset. The softdog timeout defaults to 60 seconds, giving your software time to recover before a reboot is forced.
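The same checks can be scripted. The following is a minimal sketch, not a definitive tool: it assumes watchdog nodes appear as `/dev/watchdog*` and that loaded modules are listed one per line in `/proc/modules`. The function names and the overridable path parameters are illustrative choices, added so the logic can be exercised without root.

```python
import glob
import os


def find_watchdog_devices(dev_dir="/dev"):
    """Return the sorted watchdog device nodes found under dev_dir."""
    return sorted(glob.glob(os.path.join(dev_dir, "watchdog*")))


def module_loaded(name, proc_modules="/proc/modules"):
    """Return True if kernel module `name` is listed in proc_modules."""
    try:
        with open(proc_modules) as fh:
            # Each line starts with the module name followed by a space.
            return any(line.split(" ", 1)[0] == name for line in fh)
    except FileNotFoundError:
        return False


if __name__ == "__main__":
    devices = find_watchdog_devices()
    print("watchdog devices:", devices or "none found")
    print("softdog loaded:", module_loaded("softdog"))
```

The script only lists files and reads `/proc/modules`; it never opens a watchdog device, because opening one can start the countdown.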

Install and Configure the Watchdog Daemon

Getting your automation scripts back on track after a freeze starts with installing the watchdog daemon, a lightweight tool that keeps your Ubuntu system alive and responsive. Install it with `sudo apt install watchdog`, then configure it by editing `/etc/watchdog.conf`. Set `watchdog-device` to `/dev/watchdog`; this is a device path in every case. On a virtual or embedded system without dedicated hardware, have the softdog module loaded first, for example by setting `watchdog_module="softdog"` in `/etc/default/watchdog`. Set `watchdog-timeout` to 15 so a hang is detected quickly while the daemon still has time to ping between checks. You can also enable basic health checks such as `max-load-1 = 24`, which triggers a reset if the 1-minute load average exceeds 24. Once configured, enable the service with `sudo systemctl enable watchdog` and start it with `sudo systemctl start watchdog` to begin monitoring, with no manual reset needed.
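If you provision machines from a script, the settings above can be generated rather than typed by hand. This is a minimal sketch under stated assumptions: `render_watchdog_conf` is a hypothetical helper, and the 5-second `interval` is an assumed value chosen only to stay well below the 15-second timeout. The option names are the ones used in `/etc/watchdog.conf`.

```python
def render_watchdog_conf(device="/dev/watchdog", timeout=15, interval=5,
                         max_load_1=24):
    """Build the text of a minimal /etc/watchdog.conf."""
    if interval >= timeout:
        # The daemon must ping more often than the device timeout,
        # otherwise the timer can expire between two pings.
        raise ValueError("interval must be shorter than timeout")
    lines = [
        f"watchdog-device = {device}",
        f"watchdog-timeout = {timeout}",
        f"interval = {interval}",       # assumed value, not from the article
        f"max-load-1 = {max_load_1}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_watchdog_conf(), end="")
```

Review the generated text before writing it to `/etc/watchdog.conf`; a timeout that is too tight for your hardware can cause a reboot loop.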

Avoid False Triggers With Smart Health Checks

Why reboot when a spike isn’t a crash? Your watchdog shouldn’t trigger a reset just because of a temporary load spike; smart health checks keep the system from overreacting. Use targeted checks to distinguish real hangs from normal operation, so that only actual software failures prompt action.

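| Check Type | Purpose |
| --- | --- |
| `max-load-1 = 16` | Tolerates brief CPU spikes; resets only when the 1-minute load average exceeds 16 |
| min-memory = 16K | Resets only when free memory drops below the threshold (the daemon counts memory in pages) |
| `ping` (gateway address) | Confirms a network failure |
| file age (`file` with `change`) / ping | Detects a stalled main application |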
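The file-age row can be sketched in a few lines. In this illustrative example the automation script touches a heartbeat file on every loop, and a separate check reports failure once the file goes stale. The path `/run/myscript.heartbeat`, the 30-second limit, and all function names are hypothetical, not taken from any particular setup.

```python
import os
import time

HEARTBEAT = "/run/myscript.heartbeat"  # hypothetical path
MAX_AGE = 30                           # assumed staleness limit in seconds


def touch_heartbeat(path):
    """Called by the automation script once per loop iteration."""
    with open(path, "a"):
        os.utime(path, None)  # set the modification time to now


def heartbeat_is_fresh(path, max_age, now=None):
    """Return True if path exists and was modified within max_age seconds."""
    now = time.time() if now is None else now
    try:
        return (now - os.stat(path).st_mtime) <= max_age
    except FileNotFoundError:
        return False


def main():
    # Exit status 0 means healthy; non-zero reports a failed check.
    return 0 if heartbeat_is_fresh(HEARTBEAT, MAX_AGE) else 1

# When installed as a check program, end the file with:
#     raise SystemExit(main())
```

One way to wire this in is the daemon's `test-binary` option, which treats a non-zero exit status as a failure. Because a missing file also counts as stale here, make sure the script starts before the check does, or the system may reboot in a loop.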

Combine these checks with `WatchdogSec=30s` in the service unit and `sd_notify(0, "WATCHDOG=1")` in the script’s main loop, so systemd can terminate and restart a script that is stuck in an infinite loop. Relying solely on load risks falsely triggering a reset. Instead, supplement it with a `test-binary` that verifies application liveness; this targets real stalls without disrupting working processes.
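For a Python script, the keep-alive message can be sent without extra libraries. The sketch below is a pure-Python stand-in for libsystemd's `sd_notify`, assuming systemd passes the notification socket path in the `NOTIFY_SOCKET` environment variable. `notify_systemd` and `run` are illustrative names, and the explicit `address` parameter exists only to make the function easy to test.

```python
import os
import socket


def notify_systemd(message, address=None):
    """Send one notification datagram to systemd; return True if sent."""
    address = address or os.environ.get("NOTIFY_SOCKET")
    if not address:
        return False  # not started by systemd with notifications enabled
    if address.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        address = "\0" + address[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), address)
    return True


def run(work, iterations):
    """Do one unit of work per loop, then tell systemd we are alive."""
    for _ in range(iterations):
        work()  # if this call hangs, no keep-alive is sent
        notify_systemd("WATCHDOG=1")
```

The keep-alive is sent only after `work()` returns, so a hang inside the work stops the notifications and lets the `WatchdogSec=` timer expire. Whether systemd then restarts the service depends on the unit's `Restart=` setting, for example `Restart=on-watchdog`.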

Recover Automatically After a Crash

Even if your code crashes completely, a watchdog timer can rescue the system before downtime spirals out of control, especially when you’re automating critical tasks like robotic welding or field irrigation. With a hardware watchdog, your system can recover automatically after a crash in as little as 3 seconds, like the USR-EG628 industrial computer does with stuck welding robots. The Independent Watchdog (IWDG) on STM32 chips is clocked by its own internal LSI oscillator, so it keeps counting even if the main clock fails and still triggers a reset when the firmware hangs, with no manual reset needed. Just don’t forget to kick the watchdog regularly in your code. On Linux, setting RuntimeWatchdogSec=30s makes systemd itself feed the hardware watchdog, so a hung kernel or init process leads to a reboot; alternatively, use softdog with a 10-second timeout. These watchdog resets keep your automation resilient, fast, and self-healing.
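On Linux, kicking the watchdog from your own code means writing to the device node. The following is a minimal sketch, assuming the standard Linux watchdog character-device behaviour: any write resets the countdown, and writing the character `V` just before closing asks the driver to disarm (drivers loaded with `nowayout` ignore that request). `run_with_watchdog`, `step`, and the 5-second interval are illustrative choices, not taken from the article.

```python
import time


def run_with_watchdog(device, step, interval=5.0, sleep=time.sleep):
    """Call step() repeatedly, pinging the watchdog after each success.

    step() returns False to request a clean shutdown. Returns the
    number of pings written.
    """
    pings = 0
    with open(device, "wb", buffering=0) as wd:
        while step():
            wd.write(b"\0")  # any write resets the countdown
            pings += 1
            sleep(interval)
        # Clean exit only: the magic character asks the driver to disarm.
        wd.write(b"V")
    return pings
```

If `step()` hangs, no further pings are written and the timer expires. If it raises, the device is closed without the magic character, so the timer should stay armed and reset the system. Keep `interval` comfortably below the device timeout, and try the function against an ordinary file first, since opening a real watchdog device starts the countdown.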

On a final note

You’ve seen how watchdog timers keep your automation running, even when scripts freeze. On Ubuntu, the watchdog daemon, paired with smart health checks, can cut recovery time from hours to seconds. Tests on Raspberry Pi-driven systems reportedly show 99.8% uptime. Just set thresholds wisely: set them too tight and you risk false resets. Wire it right, test often, and let your rig self-heal. It’s not magic, it’s reliability by design.
