Debugging Techniques for Industrial Systems
When Code Fails: The Art of Bug Hunting
In industrial programming, a software bug is not just an annoyance -- it can mean an entire production line going down, damaged products, or even safety hazards for workers. This is why debugging skills are among the most critical abilities an automation engineer needs.
The term "bug" comes with a famous anecdote -- in 1947, operators of the Harvard Mark II computer found an actual moth stuck in a relay, where it was causing a malfunction, and taped it into the logbook. Engineers had already been calling faults "bugs" for decades, but the story helped make "debugging" a standard term in computing.
Types of Errors in Industrial Control Code
Syntax Errors
The easiest to find -- the compiler or interpreter rejects them before the code runs:
# Error: missing colon
if temperature > 100
    activate_cooling()
# Correct:
if temperature > 100:
    activate_cooling()
Runtime Errors
These appear while the program is running:
import logging

def calculate_flow_rate(volume, time_seconds):
    return volume / time_seconds  # What if time_seconds = 0?

# Safe solution:
def calculate_flow_rate(volume, time_seconds):
    if time_seconds <= 0:
        logging.warning("Invalid time value: %s", time_seconds)
        return 0.0
    return volume / time_seconds
Logic Errors
The most dangerous and hardest to find -- the code runs without error messages, but produces wrong results:
# Logic error: using OR instead of AND
if pressure > 5 or temperature > 80:
    emergency_shutdown()  # Triggers even if only one condition is met!
# Intended: shut down only when BOTH conditions are true
if pressure > 5 and temperature > 80:
    emergency_shutdown()
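One way to catch this kind of error is to write the intended rule down as a small table and check the code against every row. The following is a minimal sketch, assuming the thresholds from the example above; `should_shut_down` and `SPEC` are hypothetical names introduced here for illustration.

```python
# Hypothetical helper: isolates the decision so it can be checked without hardware.
PRESSURE_LIMIT = 5
TEMPERATURE_LIMIT = 80

def should_shut_down(pressure, temperature):
    """Return True only when BOTH limits are exceeded (the intended rule above)."""
    return pressure > PRESSURE_LIMIT and temperature > TEMPERATURE_LIMIT

# The specification as a table: (pressure, temperature) -> expected decision
SPEC = {
    (4, 70): False,  # both normal
    (6, 70): False,  # only pressure high
    (4, 90): False,  # only temperature high
    (6, 90): True,   # both high
}

# Collect every row where the code disagrees with the specification
mismatches = [
    (inputs, expected)
    for inputs, expected in SPEC.items()
    if should_shut_down(*inputs) != expected
]
print("Mismatches:", mismatches)
```

Swapping `and` for `or` in `should_shut_down` makes the two middle rows show up as mismatches, which is exactly the error that produces no message at runtime.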
Timing Errors
Specific to industrial systems where everything depends on timing:
import time

# Error: reading sensor before it stabilizes
def read_pressure():
    sensor.power_on()
    value = sensor.read()  # Immediate read -- sensor hasn't stabilized!
    return value

# Correct: wait for settling time
def read_pressure():
    sensor.power_on()
    time.sleep(0.5)  # Wait 500ms for sensor to stabilize
    value = sensor.read()
    return value
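A fixed delay works when the settling time is known. When it is not, one alternative is to poll until consecutive readings agree, with a timeout as a safety net. This is a sketch under assumptions: `read_when_stable`, its default tolerance, and the `fake_read` stand-in are all hypothetical, not part of any sensor library.

```python
import time

def read_when_stable(read, tolerance=0.05, interval=0.05, timeout=2.0):
    """Poll `read()` until two consecutive samples differ by less than
    `tolerance`; raise TimeoutError if that does not happen in `timeout` s."""
    deadline = time.monotonic() + timeout
    previous = read()
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = read()
        if abs(current - previous) < tolerance:
            return current
        previous = current
    raise TimeoutError("Sensor did not stabilize within %.1f s" % timeout)

# Stand-in for a real sensor: readings that settle toward 3.0
_samples = iter([0.0, 1.5, 2.4, 2.9, 2.98, 3.0, 3.0])

def fake_read():
    return next(_samples, 3.0)

pressure = read_when_stable(fake_read, interval=0.001)
print(pressure)
```

The timeout matters as much as the tolerance: a sensor that never settles is itself a fault worth reporting, not something to wait on forever.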
Logging Strategies
Logging is your eyes inside the program while it runs. In industrial environments, you cannot stand next to the machine adding print statements -- you need a structured logging system.
Log Levels
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s [%(levelname)s] %(name)s: %(message)s',
    handlers=[
        logging.FileHandler('/var/log/plc_control.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger("MotorControl")

def start_motor(motor_id):
    logger.info("Starting motor %s", motor_id)
    current = read_motor_current(motor_id)
    logger.debug("Initial current: %.2f A", current)
    if current > 15.0:
        logger.warning("High start current: %.2f A for motor %s", current, motor_id)
    if current > 25.0:
        logger.error("Dangerous current! Stopping motor %s", motor_id)
        stop_motor(motor_id)
        return False
    logger.info("Motor %s running successfully", motor_id)
    return True
Golden Rules of Logging
| Level | When to Use | Example |
|---|---|---|
| DEBUG | Internal details for developers only | Variable values each cycle |
| INFO | Important normal events | Motor start/stop, mode change |
| WARNING | Unexpected but program continues | Temperature approaching limit |
| ERROR | A specific operation failed | Sensor read failure |
| CRITICAL | Catastrophic failure needing immediate action | PLC connection lost |
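The levels become more useful when each destination filters on its own threshold. The sketch below keeps full DEBUG detail in a size-limited file while the console shows only WARNING and above. It assumes a temporary directory in place of the real log location, and the logger name `LevelDemo` is made up for the example.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

# Assumption: a temporary directory stands in for the real log location
log_path = os.path.join(tempfile.mkdtemp(), "plc_control.log")

# Rotate at about 1 MB and keep 5 old files so the disk cannot fill up
file_handler = RotatingFileHandler(log_path, maxBytes=1_000_000, backupCount=5)
file_handler.setLevel(logging.DEBUG)        # full detail on disk

console_handler = logging.StreamHandler()
console_handler.setLevel(logging.WARNING)   # only problems on the console

demo_logger = logging.getLogger("LevelDemo")
demo_logger.setLevel(logging.DEBUG)
demo_logger.propagate = False  # keep the demo independent of root handlers
demo_logger.addHandler(file_handler)
demo_logger.addHandler(console_handler)

demo_logger.debug("Cycle 42: setpoint=50.0")         # file only
demo_logger.warning("Temperature approaching limit")  # file and console
file_handler.flush()

with open(log_path) as f:
    logged = f.read()
```

Rotation is worth the extra lines on an industrial PC, where a DEBUG log written every cycle can grow quickly.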
Structured Logging
For complex systems, plain text logs are not enough -- use structured logging:
import json
from datetime import datetime, timezone

def structured_log(level, event, **data):
    entry = {
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "level": level,
        "event": event,
        **data
    }
    print(json.dumps(entry))

# Usage:
structured_log("INFO", "motor_started",
    motor_id="M-101",
    current_amps=8.5,
    voltage=380,
    line="filling_line_1"
)
Output:
{"timestamp": "2026-04-05T10:30:00+00:00", "level": "INFO", "event": "motor_started", "motor_id": "M-101", "current_amps": 8.5, "voltage": 380, "line": "filling_line_1"}
Because every entry is a single JSON object, this format is easy to ingest into log stores such as Elasticsearch and to query from dashboards such as Grafana.
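The same idea can be combined with the standard `logging` module, so that levels and handlers keep working. The following is a sketch, not a standard API: `JsonFormatter` is a hypothetical class, and passing fields under an `extra={"data": ...}` key is a convention chosen for this example.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Hypothetical formatter: renders each record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Fields passed via extra={"data": {...}} are merged into the entry
        entry.update(getattr(record, "data", {}))
        return json.dumps(entry)

stream = io.StringIO()  # stands in for a file or network handler
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

json_logger = logging.getLogger("StructuredDemo")
json_logger.setLevel(logging.INFO)
json_logger.propagate = False
json_logger.addHandler(handler)

json_logger.info(
    "motor_started",
    extra={"data": {"motor_id": "M-101", "current_amps": 8.5}},
)
parsed = json.loads(stream.getvalue())
print(parsed["event"], parsed["motor_id"])
```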
Breakpoints and Step-by-Step Tracing
Using Breakpoints
A breakpoint freezes the program at a specific line and lets you inspect all variables:
# In VS Code or PyCharm: place a breakpoint at the suspicious line
KP = 2.5
KI = 0.1
integral = 0.0  # Kept outside the function so it accumulates across cycles

def control_loop(setpoint, sensor_value, dt):
    global integral
    error = setpoint - sensor_value
    # <-- Place breakpoint here
    integral += error * dt  # dt is the cycle time in seconds
    output = KP * error + KI * integral
    if output > 100:
        output = 100  # Output saturation
    return output
Conditional Breakpoints
Instead of stopping every cycle, stop only when a condition is met:
# In the debugger:
# Condition: error > 10 and cycle_count > 1000
# This stops only when the error is large after some runtime
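When no IDE is attached, for example on a headless device, the same condition can be written into the code around Python's built-in `breakpoint()`. This is a sketch only: `check_cycle` and its `on_hit` parameter are hypothetical, and the demo passes a recording function so that it runs without actually opening a debugger.

```python
def check_cycle(error, cycle_count, on_hit=breakpoint):
    """Call `on_hit` (the debugger by default) only when the condition holds."""
    if error > 10 and cycle_count > 1000:
        on_hit()

# Demo: record the cycles that would have stopped instead of entering the debugger
hits = []
for cycle_count, error in [(500, 12.0), (1500, 3.0), (2000, 15.0)]:
    check_cycle(error, cycle_count, on_hit=lambda: hits.append(cycle_count))

print(hits)  # [2000]
```

With the default `on_hit`, setting the environment variable `PYTHONBREAKPOINT=0` turns `breakpoint()` into a no-op, which is a useful safeguard if such a check is left in code that reaches a running machine.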
Tracing Without Stopping
When you cannot stop the program (because the machine is running!), use tracing:
import functools

def trace_call(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.debug("Calling %s(%s, %s)", func.__name__, args, kwargs)
        result = func(*args, **kwargs)
        logger.debug("Result of %s = %s", func.__name__, result)
        return result
    return wrapper

@trace_call
def calculate_pid(setpoint, current, kp, ki, kd):
    error = setpoint - current
    return kp * error  # Simplified for clarity
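Tracing can also record how long each call takes, which helps when a step occasionally overruns its cycle time. A minimal sketch, assuming a 10 ms budget; `trace_duration`, `slow_step`, and the `last_elapsed` attribute are hypothetical names for this example.

```python
import functools
import logging
import time

timing_logger = logging.getLogger("Timing")

def trace_duration(budget_s):
    """Hypothetical decorator: warn when a call takes longer than `budget_s` seconds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs whether the call returned or raised
                elapsed = time.perf_counter() - start
                wrapper.last_elapsed = elapsed
                if elapsed > budget_s:
                    timing_logger.warning(
                        "%s took %.4f s (budget %.4f s)",
                        func.__name__, elapsed, budget_s,
                    )
        wrapper.last_elapsed = None
        return wrapper
    return decorator

@trace_duration(budget_s=0.010)
def slow_step():
    time.sleep(0.02)  # simulates a step that overruns a 10 ms cycle budget
    return "done"

result = slow_step()
```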
Simulation: Testing Without a Real Machine
In industrial automation, you cannot always test on the real machine. Simulation lets you exercise the control logic without risking equipment:
import random

class SimulatedSensor:
    """Simulated temperature sensor for testing"""
    def __init__(self, initial_temp=25.0):
        self._temp = initial_temp
        self._noise_level = 0.5

    def read(self):
        noise = random.uniform(-self._noise_level, self._noise_level)
        return self._temp + noise

    def simulate_heating(self, rate_per_second, duration):
        self._temp += rate_per_second * duration

class SimulatedMotor:
    """Simulated motor with realistic ramp delay"""
    def __init__(self, rated_rpm=1450):
        self.target_rpm = 0
        self.actual_rpm = 0
        self._rated_rpm = rated_rpm
        self._ramp_rate = 200  # RPM per second

    def set_speed(self, rpm):
        self.target_rpm = min(rpm, self._rated_rpm)

    def update(self, dt):
        """Call each simulation cycle"""
        if self.actual_rpm < self.target_rpm:
            self.actual_rpm = min(
                self.target_rpm,
                self.actual_rpm + self._ramp_rate * dt
            )
        elif self.actual_rpm > self.target_rpm:
            self.actual_rpm = max(
                self.target_rpm,
                self.actual_rpm - self._ramp_rate * dt
            )

# Test control algorithm without a real machine
sensor = SimulatedSensor(initial_temp=20.0)
for cycle in range(100):
    temp = sensor.read()
    if temp < 50:
        sensor.simulate_heating(0.5, 0.1)
    print(f"Cycle {cycle}: temp = {temp:.1f}")
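A simulation becomes a test once it checks a property of the controller instead of only printing values. The sketch below runs an on/off thermostat against a noise-free simulated heater and collects the steady-state temperatures for checking. Everything in it is an assumption made for illustration: the `SimulatedHeater` model, its heating and loss rates, and the 48 to 52 degree band.

```python
class SimulatedHeater:
    """Hypothetical plant: heats when on, always loses heat toward ambient."""
    def __init__(self, temp=20.0, ambient=20.0):
        self.temp = temp
        self.ambient = ambient
        self.heating = False

    def update(self, dt):
        if self.heating:
            self.temp += 2.0 * dt                            # heating rate, degC/s
        self.temp -= 0.05 * (self.temp - self.ambient) * dt  # loss to ambient

def thermostat(temp, heating, low=48.0, high=52.0):
    """On/off control with hysteresis: returns the new heater state."""
    if temp < low:
        return True
    if temp > high:
        return False
    return heating  # inside the band: keep the current state

heater = SimulatedHeater()
history = []
for _ in range(2000):  # 200 simulated seconds at dt = 0.1 s
    heater.heating = thermostat(heater.temp, heater.heating)
    heater.update(0.1)
    history.append(heater.temp)

steady = history[1000:]  # skip the warm-up phase
print(f"steady-state range: {min(steady):.1f} .. {max(steady):.1f}")
```

Leaving the noise out keeps the run deterministic, so a failure points at the control logic and not at a random draw.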
Remote Debugging
Industrial systems often run on remote devices. Here is how to debug remotely:
Using SSH and debugpy (Python)
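# On the target device (industrial PC):
pip install debugpy
# Note: 0.0.0.0 accepts connections from any host that can reach the port, and
# debugpy does not authenticate clients. Use this only on a trusted network,
# or listen on 127.0.0.1 and forward the port through an SSH tunnel.
python -m debugpy --listen 0.0.0.0:5678 --wait-for-client control_main.py
# On your workstation (VS Code), add to the "configurations" list in launch.json
# (recent versions of the Python Debugger extension use "type": "debugpy"):
{
    "name": "Remote Debug",
    "type": "python",
    "request": "attach",
    "connect": {
        "host": "192.168.1.100",
        "port": 5678
    },
    "pathMappings": [
        {
            "localRoot": "${workspaceFolder}/src",
            "remoteRoot": "/opt/control/src"
        }
    ]
}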
Centralized Network Logging
import logging
from logging.handlers import SocketHandler

logger = logging.getLogger("MotorControl")
# Send logs to a central server (the receiver must unpickle LogRecord objects)
handler = SocketHandler('192.168.1.50', 9020)
logger.addHandler(handler)
# Now logs from every device configured this way reach one place
logger.info("Filling line 3 motor -- high current detected")
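A network handler can stall when the log server is slow or unreachable, which is a risk inside a control loop. One common arrangement is to put a queue between the control code and the handler using the standard library's `QueueHandler` and `QueueListener`. In this sketch the hypothetical `ListHandler` stands in for the `SocketHandler` so that the example runs without a log server.

```python
import logging
import logging.handlers
import queue

log_queue = queue.Queue(-1)  # unbounded queue between control code and log sender

class ListHandler(logging.Handler):
    """Stand-in for SocketHandler so the sketch runs without a log server."""
    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record.getMessage())

slow_handler = ListHandler()  # in production: SocketHandler('192.168.1.50', 9020)
listener = logging.handlers.QueueListener(log_queue, slow_handler)
listener.start()  # a background thread drains the queue

net_logger = logging.getLogger("QueueDemo")
net_logger.setLevel(logging.INFO)
net_logger.propagate = False
net_logger.addHandler(logging.handlers.QueueHandler(log_queue))

# The control code only enqueues the record, which returns quickly
net_logger.info("Filling line %d motor: high current detected", 3)

listener.stop()  # processes remaining records, then joins the thread
print(slow_handler.records)
```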
Common Pitfalls in Control Programming
1. Race Conditions
# Bug: two threads reading and writing the same variable
shared_counter = 0

def sensor_thread():
    global shared_counter
    shared_counter += 1  # Not an atomic operation!

# Fix: use a lock
import threading
lock = threading.Lock()

def sensor_thread_safe():
    global shared_counter
    with lock:
        shared_counter += 1
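The locked version can be exercised with several threads to confirm that no increments are lost. A small sketch; the names `count_pulses` and `counter_lock` and the thread and iteration counts are arbitrary choices for the demonstration.

```python
import threading

counter = 0
counter_lock = threading.Lock()

def count_pulses(n):
    """Increment the shared counter n times, holding the lock for each update."""
    global counter
    for _ in range(n):
        with counter_lock:  # the read-modify-write runs as one protected step
            counter += 1

threads = [threading.Thread(target=count_pulses, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000
```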
2. Memory Leaks
# Bug: storing readings forever
readings = []
while True:
    readings.append(sensor.read())  # Memory fills up!

# Fix: use a circular buffer
from collections import deque
readings = deque(maxlen=1000)  # Keep only the last 1000 readings
while True:
    readings.append(sensor.read())
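A bounded `deque` also makes a convenient window for a moving average, since old readings drop out automatically. The sketch uses a window of 5 and made-up readings so the behavior is easy to follow.

```python
from collections import deque

window = deque(maxlen=5)  # holds at most the 5 most recent readings
averages = []

for reading in [10.0, 10.0, 10.0, 10.0, 10.0, 20.0, 20.0]:
    window.append(reading)  # once full, the oldest reading is discarded
    averages.append(sum(window) / len(window))

print(len(window), averages[-1])  # 5 14.0
```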
3. Not Handling Connection Loss
# Bug: assuming connection is always available
value = plc.read_register(100)

# Fix: error handling with retry
import time

def safe_read(plc, register, retries=3):
    for attempt in range(retries):
        try:
            return plc.read_register(register)
        except ConnectionError:
            logger.warning("Connection failed (attempt %d/%d)", attempt + 1, retries)
            time.sleep(1)
            plc.reconnect()
    logger.error("Final failure reading register %d", register)
    return None  # Caller must treat None as "no valid reading"
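A variation on the retry idea is to lengthen the delay after each failure (exponential backoff) and to re-raise after the last attempt, so the caller cannot mistake a failure for a reading. This is a sketch under assumptions: `FlakyPLC` is a hypothetical stand-in for a PLC client, and `read_with_backoff` takes an injectable `sleep` so the demo does not actually wait.

```python
import time

class FlakyPLC:
    """Hypothetical stand-in: fails a fixed number of times, then answers."""
    def __init__(self, failures):
        self.failures = failures
        self.reconnects = 0

    def read_register(self, register):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("link down")
        return 1234

    def reconnect(self):
        self.reconnects += 1

def read_with_backoff(plc, register, retries=4, base_delay=0.01, sleep=time.sleep):
    """Retry with exponentially growing delays; re-raise after the last attempt."""
    for attempt in range(retries):
        try:
            return plc.read_register(register)
        except ConnectionError:
            if attempt == retries - 1:
                raise  # let the caller decide how to fail safe
            sleep(base_delay * 2 ** attempt)
            plc.reconnect()

delays = []
plc = FlakyPLC(failures=2)
value = read_with_backoff(plc, 100, sleep=delays.append)  # record delays, do not wait
print(value, delays)
```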
4. Floating-Point Comparison
# Bug: direct comparison of floating-point numbers
if temperature == 37.5:  # May never be true!
    activate_alarm()

# Fix: use a tolerance margin
EPSILON = 0.01
if abs(temperature - 37.5) < EPSILON:
    activate_alarm()
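The standard library's `math.isclose` does the tolerance check directly, and the snippet below shows why exact equality fails. For alarms, a threshold comparison is often what is really meant; the `alarm_needed` helper and its 37.5 limit are illustrative assumptions.

```python
import math

total = sum([0.1] * 10)  # ten additions of 0.1 accumulate rounding error

print(total == 1.0)                            # False
print(math.isclose(total, 1.0, abs_tol=1e-9))  # True

def alarm_needed(temperature, limit=37.5):
    """Hypothetical alarm rule: trigger at or above the limit, not at one exact value."""
    return temperature >= limit
```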
A Systematic Debugging Methodology
- Reproduce the bug -- if you cannot repeat it, you cannot fix it
- Isolate the problem -- disable components one by one until you find the cause
- Read the logs -- the answer is often already in the log files
- Check your assumptions -- is the sensor actually working? Is the connection alive?
- Change only one thing at a time -- if you change several things before testing, you cannot tell which one fixed (or caused) the problem
- Document the fix -- write down what the bug was and how you fixed it to prevent recurrence
Summary
Debugging in industrial programming requires:
- A structured logging system with clear levels
- Smart breakpoints (especially conditional ones)
- Simulation for safe testing without risking the real machine
- Remote debugging for devices in the field
- Awareness of common pitfalls like race conditions and memory leaks
Remember: a good programmer does not write bug-free code -- a good programmer writes code where bugs are easy to find and fix.