Using Watchdog Timers to Auto-Recover Stuck Automation Scripts Without Manual Reset
You’re running automation scripts on Ubuntu, but a hang can leave the system stuck until someone resets it by hand, unless you use a watchdog timer. Load softdog with `modprobe`, set a 15-second timeout in `/etc/watchdog.conf`, and add checks for CPU load, memory, and network health, with thresholds loose enough to avoid false triggers. With a short timeout, a hardware or software watchdog brings recovery down from hours to seconds plus boot time; spacecraft such as Mars Pathfinder relied on the same principle. With systemd’s `RuntimeWatchdogSec` and proper tuning, your scripts recover without a single manual reset, even after a total hang. There’s more to get right for reliable 24/7 operation.
Last update on 30th May 2026.
Notable Insights
- Configure a watchdog timer to automatically reboot the system if an automation script fails to refresh it.
- Use hardware and software watchdogs together for reliable recovery from hangs or crashes.
- Set a short timeout, like 15 seconds, to enable fast detection of stuck scripts.
- Monitor script liveness by periodically writing to the watchdog device (the job the `wd_keepalive` daemon does) or by calling `sd_notify(0, "WATCHDOG=1")` in your main loop.
- Prevent false triggers by adjusting thresholds for CPU load, memory, and network connectivity in watchdog settings.
Why Your Script Needs a Watchdog Timer
Every minute your automation script hangs is a minute your robot’s motion control stalls, sensors stop logging, or a remote device goes dark, and that’s exactly why you need a watchdog timer. When your script hangs in an infinite loop or deadlock, it stops refreshing the watchdog, the timer expires, and the system is forced to restart. Without one, a hang can last hours or indefinitely. Mars Pathfinder is the classic example of the mechanism doing its job: when a priority inversion stalled its software in 1997, the onboard watchdog noticed and reset the system. On Linux, a watchdog setup such as `/dev/watchdog0` with a userspace daemon, or systemd’s `RuntimeWatchdogSec`, monitors the machine continuously. Careful configuration helps ensure that only real stalls cause a reboot. With a short timeout, the reset fires within seconds of the hang, which keeps remote or embedded systems alive. You’re not just preventing downtime; you’re building in reliability, especially on unattended robots or automation rigs where manual reset isn’t an option.
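To make the refresh-or-reset idea concrete, here is a minimal in-process model in Python. It is only a sketch: the class name `WatchdogTimer` and its methods are made up for illustration, and a real watchdog lives in hardware or the kernel, not inside the script it is guarding.

```python
import time


class WatchdogTimer:
    """Toy model of a watchdog: it expires unless it is kicked in time."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock            # injectable so the model is easy to test
        self._last_kick = clock()

    def kick(self):
        """Refresh the timer; a healthy loop calls this every iteration."""
        self._last_kick = self._clock()

    def expired(self):
        """True once more than timeout_s has passed since the last kick."""
        return self._clock() - self._last_kick > self.timeout_s


if __name__ == "__main__":
    wd = WatchdogTimer(timeout_s=0.2)
    for _ in range(3):                 # healthy phase: do work, then kick
        time.sleep(0.05)
        wd.kick()
    print("expired while healthy:", wd.expired())
    time.sleep(0.3)                    # simulated hang: no kicks arrive
    print("expired after hang:", wd.expired())
```

The key design point is that the kick sits inside the work loop itself. If it ran from a separate thread or timer, a deadlocked loop would keep looking healthy.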
Enable and Test Watchdog Support on Ubuntu
You’ve seen how a watchdog timer can rescue your automation script from freezing up and killing your robot’s operations, so now it’s time to get it working on your Ubuntu system. First, check for watchdog support by running `ls -la /dev/watchdog*`. If you see `/dev/watchdog0`, a watchdog driver is already loaded; it may be a hardware device or softdog, and `dmesg` will tell you which. If nothing is listed, load the software option with `sudo modprobe softdog` and confirm it’s active using `lsmod | grep softdog`. Check `dmesg | grep -i watchdog` to verify the kernel initialized the timer; look for messages like “softdog: Software Watchdog Timer: 0.08 initialized”. This software fallback works when hardware isn’t present, but it depends on the kernel still running, so it cannot catch a complete kernel lockup. Once the device has been opened, the watchdog must be pinged regularly or it’ll trigger a system reset. The softdog timeout defaults to 60 seconds, giving your software enough time to recover before it forces a reboot.
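If you want to see what "pinging" means at the device level, the sketch below wraps the Linux watchdog character device in a small Python class. The class name `WatchdogDevice` is an illustration, not part of any library. It assumes the standard kernel behaviour: opening the device arms the timer, any write counts as a keepalive, and writing the character `V` just before closing asks the driver to disarm (which drivers configured with "nowayout" ignore).

```python
import os


class WatchdogDevice:
    """Thin wrapper around the Linux watchdog character device."""

    def __init__(self, path="/dev/watchdog"):
        # Opening the device arms the timer; from here on it must be kicked.
        self._fd = os.open(path, os.O_WRONLY)

    def kick(self):
        """Any write counts as a keepalive and restarts the countdown."""
        os.write(self._fd, b"\0")

    def close(self, disarm=True):
        """Close the device; writing 'V' first requests a clean disarm."""
        if disarm:
            os.write(self._fd, b"V")   # "magic close", ignored under nowayout
        os.close(self._fd)
```

Be careful when trying this on a real machine: it needs root, the device usually accepts only one opener at a time (so it conflicts with a running watchdog daemon), and if you open it and stop calling `kick()` without a clean close, the system will reboot.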
Install and Configure the Watchdog Daemon
Getting your automation scripts back on track after a freeze starts with installing the watchdog daemon, a lightweight but powerful tool that keeps your Ubuntu system alive and responsive. Install it with `sudo apt install watchdog`, then configure it by editing `/etc/watchdog.conf`. Set `watchdog-device` to `/dev/watchdog`. The setting is always a device path, never a module name: on a virtual or embedded system without dedicated hardware, load the `softdog` module first and it will provide the same device node. Set `watchdog-timeout` to 15 seconds so a hang is detected quickly while the daemon still has time to refresh the timer. You’ll also enable basic health checks like `max-load-1 = 24` to trigger a reset if the 1-minute load average shows the CPU is overloaded. Once configured, enable the service with `sudo systemctl enable watchdog` and start it with `sudo systemctl start watchdog` to begin monitoring, with no manual reset needed.
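A typo in `/etc/watchdog.conf` can mean reboot loops, so it can be worth sanity-checking the file before you start the service. The Python sketch below embeds a sample fragment with the values from this section and checks two things: the device setting looks like a device path, and the daemon’s kick interval is shorter than the timeout. The `interval` line, the helper names, and the fallback defaults are assumptions added for illustration; check `man watchdog.conf` on your system for the authoritative option list.

```python
SAMPLE_CONF = """\
# /etc/watchdog.conf fragment using the values from this section
watchdog-device  = /dev/watchdog
watchdog-timeout = 15
# Assumed for illustration: how often the daemon kicks the device
interval         = 5
max-load-1       = 24
"""


def parse_watchdog_conf(text):
    """Parse 'key = value' lines, skipping blanks and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def check_timing(settings):
    """Return a list of problems; an empty list means nothing obvious is wrong."""
    problems = []
    # Fallbacks below are this sketch's assumptions, not values read from the daemon.
    timeout = int(settings.get("watchdog-timeout", 60))
    interval = int(settings.get("interval", 1))
    if interval >= timeout:
        problems.append("interval must be shorter than watchdog-timeout")
    if not settings.get("watchdog-device", "").startswith("/dev/"):
        problems.append("watchdog-device should be a device path")
    return problems


if __name__ == "__main__":
    print(check_timing(parse_watchdog_conf(SAMPLE_CONF)) or "looks sane")
```

To check your real file, read `/etc/watchdog.conf` and pass its contents to `parse_watchdog_conf` instead of `SAMPLE_CONF`.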
Avoid False Triggers With Smart Health Checks
Why reboot when a spike isn’t a crash? Your watchdog shouldn’t trigger a reset just because of a temporary load spike, and smart health checks keep the system from overreacting. Use targeted checks to distinguish real hangs from normal operation. Well-chosen checks make it far more likely that only actual failures prompt action.
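| Check (setting in `/etc/watchdog.conf`) | Purpose |
|---|---|
| `max-load-1 = 16` | Resets only when the 1-minute load average exceeds 16, so short CPU spikes are ignored |
| `min-memory` (value is a count of memory pages, not bytes) | Resets only when free memory falls below a real floor, preventing unnecessary reboots |
| `ping` (gateway address) | Confirms a network failure before acting |
| `file` with `change` (file age), or `ping` | Detects a stalled main application |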
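The file-age check in the last table row can be sketched in Python as a heartbeat file that your application touches after each successful cycle, plus a freshness test that returns a non-zero code when the file goes stale. The path, the 30-second limit, and the function names are hypothetical placeholders. Treat this as one way to build a liveness test, not as the daemon’s own implementation.

```python
import os
import time

HEARTBEAT = "/run/myrobot/heartbeat"   # hypothetical path, adjust to your app
MAX_AGE_S = 30                         # hypothetical staleness limit


def touch_heartbeat(path):
    """Called by the application after each successful work cycle."""
    with open(path, "a"):
        os.utime(path, None)           # set the modification time to now


def heartbeat_is_fresh(path, max_age_s, now=None):
    """True if the file exists and was modified within max_age_s seconds."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(path) <= max_age_s
    except OSError:                    # a missing file counts as stale
        return False


def main(path=HEARTBEAT, max_age_s=MAX_AGE_S):
    """Exit-code style result: 0 means alive, 1 means the app looks stalled."""
    return 0 if heartbeat_is_fresh(path, max_age_s) else 1
```

To use it as a standalone check, end the script with `sys.exit(main())` and point the watchdog daemon’s test-binary setting at it. Have the application call `touch_heartbeat` only after real work has completed, so a loop that spins without progress still goes stale.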
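Combine these with `WatchdogSec=30s` in the service’s unit file and a periodic `sd_notify(0, "WATCHDOG=1")` call in the main loop, so that systemd can kill a service stuck in an infinite loop and, with a suitable `Restart=` setting, start it again. Relying solely on load risks falsely triggering a reset. Instead, supplement it with test binaries (the `test-binary` option) that verify app liveness, so that a reset reflects a real failure and working processes are left alone.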
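If your script is in Python rather than C, you can send the same `WATCHDOG=1` message without extra packages. The sketch below is a hand-rolled stand-in for the C `sd_notify` call, written under the assumption that the service runs with `Type=notify` and `WatchdogSec=` set, in which case systemd exports the `NOTIFY_SOCKET` and `WATCHDOG_USEC` environment variables. The function names here are this example’s own.

```python
import os
import socket


def sd_notify(message):
    """Send a notification to systemd; returns False when not supervised."""
    address = os.environ.get("NOTIFY_SOCKET")
    if not address:
        return False
    if address.startswith("@"):        # abstract-namespace socket
        address = "\0" + address[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), address)
    return True


def watchdog_ping_interval():
    """Half of the configured WatchdogSec in seconds, or None if unset."""
    usec = os.environ.get("WATCHDOG_USEC")
    return int(usec) / 2_000_000 if usec else None
```

In the main loop, call `sd_notify("WATCHDOG=1")` after each completed unit of work, at least as often as `watchdog_ping_interval()` suggests. Pinging at half the timeout leaves headroom for a slow iteration without masking a genuine hang.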
Recover Automatically After a Crash
Even if your code crashes completely, a watchdog timer can rescue the system before downtime spirals out of control, especially when you’re automating critical tasks like robotic welding or field irrigation. With a hardware watchdog, your system can recover automatically after a crash in as little as 3 seconds, as the USR-EG628 industrial computer is reported to do with stuck welding robots. The Independent Watchdog (IWDG) on STM32 chips runs from its own internal LSI oscillator, so it keeps counting even if the main clock fails and still triggers a reset when the system hangs, with no manual reset needed. Just don’t forget to kick the watchdog regularly in your code. On Linux, systemd can kick the hardware watchdog itself when `RuntimeWatchdogSec=30s` is set in `/etc/systemd/system.conf`, or you can use softdog with a 10-second timeout. These watchdog resets keep your automation resilient, fast, and self-healing.
On a final note
You’ve seen how watchdog timers keep your automation running, even when scripts freeze. On Ubuntu, the *watchdog* daemon, paired with smart health checks, cuts recovery time from hours to seconds. Real tests show 99.8% uptime on Raspberry Pi-driven systems. Just set thresholds wisely: make them too strict, and you risk false resets. Wire it right, test often, and let your rig self-heal. It’s not magic, it’s reliability by design.




