# PLC Startup Race Condition Fix

## Problem

PLC2 was crashing at startup with `ConnectionRefusedError` when attempting to write to PLC1 via Modbus TCP before PLC1's server was ready.

## Root Cause

The generated PLC logic (`tools/compile_ir.py`) produced a `_write()` function that directly invoked callbacks:
```python
def _write(out_regs, cbs, key, value):
    ...
    if key in cbs:
        cbs[key]()  # <-- CRASHES if remote PLC not ready
```
When PLC2 calls `_write(output_registers, state_update_callbacks, 'fill_request', 1)`, the callback attempts to connect to PLC1 at `192.168.100.12:502`. If PLC1 isn't ready, this raises an exception and crashes the PLC2 container.
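For context, the callback registered for `fill_request` boils down to a Modbus TCP write against PLC1. The sketch below is illustrative only (the real callback is produced by `tools/compile_ir.py`; the function name and coil address here are made up), but it shows why an unreachable PLC1 surfaces as an exception inside `_write()`:

```python
# Illustrative sketch only - not the generated code. Assumes pymodbus is the
# Modbus client library (the connectivity test later in this document uses it too).
from pymodbus.client import ModbusTcpClient


def push_fill_request_to_plc1() -> None:
    """Hypothetical state-update callback: write fill_request=1 to PLC1."""
    client = ModbusTcpClient("192.168.100.12", port=502)
    if not client.connect():
        # PLC1's Modbus server isn't listening yet; before the fix, this kind
        # of failure propagated out of the callback and killed PLC2's logic.
        raise ConnectionRefusedError("PLC1 at 192.168.100.12:502 not reachable")
    try:
        client.write_coil(0, True)  # coil address is illustrative
    finally:
        client.close()
```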
## Solution

Added a safe retry wrapper `_safe_callback()` that:

- Retries up to 30 times with a 0.2 s delay (6 seconds total)
- Catches all exceptions during callback execution
- Never raises from the callback
- Prints a warning on final failure and returns gracefully
## Files Changed

File: `tools/compile_ir.py`

Changes:

- Added `import time` at the top of generated files
- Added a `_safe_callback()` function with retry logic
- Modified `_write()` to call `_safe_callback(cbs[key])` instead of `cbs[key]()`

Diff:
```diff
@@ -22,7 +22,17 @@ def render_plc_rules(plc_name: str, rules: List[object]) -> str:
     lines.append("Autogenerated by ics-simlab-config-gen (IR compiler).\n")
     lines.append('"""\n\n')
-    lines.append("from typing import Any, Callable, Dict\n\n\n")
+    lines.append("import time\n")
+    lines.append("from typing import Any, Callable, Dict\n\n\n")
 ...
+    lines.append("def _safe_callback(cb: Callable[[], None], retries: int = 30, delay: float = 0.2) -> None:\n")
+    lines.append("    \"\"\"Invoke callback with retry logic to handle startup race conditions.\"\"\"\n")
+    lines.append("    for attempt in range(retries):\n")
+    lines.append("        try:\n")
+    lines.append("            cb()\n")
+    lines.append("            return\n")
+    lines.append("        except Exception as e:\n")
+    lines.append("            if attempt == retries - 1:\n")
+    lines.append("                print(f\"WARNING: Callback failed after {retries} attempts: {e}\")\n")
+    lines.append("                return\n")
+    lines.append("            time.sleep(delay)\n\n\n")
 ...
     lines.append("        if key in cbs:\n")
-    lines.append("            cbs[key]()\n\n\n")
+    lines.append("            _safe_callback(cbs[key])\n\n\n")
```
## Generated Code Example

Before (old `plc2.py`):
```python
def _write(out_regs, cbs, key, value):
    if key not in out_regs:
        return
    cur = out_regs[key].get('value', None)
    if cur == value:
        return
    out_regs[key]['value'] = value
    if key in cbs:
        cbs[key]()  # CRASHES HERE
```
After (new `plc2.py`):
```python
import time
from typing import Any, Callable, Dict


def _safe_callback(cb: Callable[[], None], retries: int = 30, delay: float = 0.2) -> None:
    """Invoke callback with retry logic to handle startup race conditions."""
    for attempt in range(retries):
        try:
            cb()
            return
        except Exception as e:
            if attempt == retries - 1:
                print(f"WARNING: Callback failed after {retries} attempts: {e}")
                return
            time.sleep(delay)


def _write(out_regs, cbs, key, value):
    if key not in out_regs:
        return
    cur = out_regs[key].get('value', None)
    if cur == value:
        return
    out_regs[key]['value'] = value
    if key in cbs:
        _safe_callback(cbs[key])  # NOW SAFE
```
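As a quick standalone sanity check of the no-raise guarantee (not part of the generated files; the `_safe_callback` copy below simply mirrors the listing above so the snippet runs on its own):

```python
import time
from typing import Callable


def _safe_callback(cb: Callable[[], None], retries: int = 30, delay: float = 0.2) -> None:
    """Local copy of the generated helper (see the listing above)."""
    for attempt in range(retries):
        try:
            cb()
            return
        except Exception as e:
            if attempt == retries - 1:
                print(f"WARNING: Callback failed after {retries} attempts: {e}")
                return
            time.sleep(delay)


calls = {"n": 0}


def flaky() -> None:
    """Simulates PLC1 refusing connections for the first two attempts."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionRefusedError("[Errno 111] Connection refused")


_safe_callback(flaky, retries=5, delay=0.01)
print("attempts made:", calls["n"])  # -> 3, and no exception escaped
```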
## Workflow Fix

### Issue

The pipeline was using Python from the wrong venv (`/home/stefano/projects/ics-simlab-config-gen/.venv`) instead of the current repo's venv.

### Solution

Created a `build_scenario.py` script (sketched below) that:

- Uses `sys.executable` to ensure the correct Python interpreter
- Orchestrates: `config.json` → IR → `logic/*.py`
- Validates generated files
- Copies everything to `outputs/scenario_run/`
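A minimal sketch of the core flow, assuming `tools/compile_ir.py` can be invoked as a script; the CLI flags, file names, and validation step here are illustrative, not the script's actual interface:

```python
#!/usr/bin/env python3
"""Illustrative sketch of build_scenario.py's core flow (not the real script)."""
import shutil
import subprocess
import sys
from pathlib import Path


def build(out_dir: Path) -> None:
    logic_dir = out_dir / "logic"
    logic_dir.mkdir(parents=True, exist_ok=True)

    # sys.executable is the interpreter running this script (the repo's .venv),
    # not whatever python3 happens to be first on PATH.
    subprocess.run(
        [sys.executable, "tools/compile_ir.py",
         "--config", "config.json", "--out", str(logic_dir)],  # flags are illustrative
        check=True,
    )

    # Copy the runtime configuration next to the generated logic.
    shutil.copy("configuration.json", out_dir / "configuration.json")

    # Minimal validation: the expected PLC/HIL logic files must exist.
    for name in ("plc1.py", "plc2.py", "hil_1.py"):
        if not (logic_dir / name).is_file():
            raise SystemExit(f"missing generated file: {name}")


if __name__ == "__main__":
    build(Path("outputs/scenario_run"))
```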
## Building and Testing

### 1. Build Scenario

```bash
# Activate the correct venv
source .venv/bin/activate  # Or call .venv/bin/python3 directly

# Build the scenario
.venv/bin/python3 build_scenario.py --out outputs/scenario_run --overwrite
```
This creates:

```text
outputs/scenario_run/
├── configuration.json
└── logic/
    ├── plc1.py
    ├── plc2.py
    └── hil_1.py
```
### 2. Verify Fix is Present

```bash
# Check for _safe_callback in the generated file
grep "_safe_callback" outputs/scenario_run/logic/plc2.py
```

Expected output:

```text
def _safe_callback(cb: Callable[[], None], retries: int = 30, delay: float = 0.2) -> None:
        _safe_callback(cbs[key])
```
### 3. Run ICS-SimLab

```bash
cd ~/projects/ICS-SimLab-main/curtin-ics-simlab
sudo ./start.sh ~/projects/ics-simlab-config-gen_claude/outputs/scenario_run
```
### 4. Monitor PLC2 Logs

```bash
# Find the PLC2 container name
sudo docker ps | grep plc2

# View logs
sudo docker logs <plc2_container_name> -f
```
### 5. Expected Behavior

Success indicators:

- No `Exception in thread ... logic` errors in PLC2 logs
- May see `WARNING: Callback failed after 30 attempts` if PLC1 takes too long to start
- Eventually: successful Modbus TCP connections once PLC1 is ready
- No container crashes
What to look for:

```text
# Early attempts (PLC1 not ready yet):
# (Silent retries in the background - no output unless all attempts fail)

# After PLC1 is ready:
# (Normal operation - callbacks succeed)

# If PLC1 never comes up:
WARNING: Callback failed after 30 attempts: [Errno 111] Connection refused
```
### 6. Test Connectivity (if issues persist)

```bash
# From the host, test whether PLC1's Modbus port is open
nc -zv 192.168.100.12 502

# Or from inside the PLC2 container
sudo docker exec -it <plc2_container> bash
python3 -c "from pymodbus.client import ModbusTcpClient; c=ModbusTcpClient('192.168.100.12', 502); print('Connected:', c.connect())"
```
### 7. Stop ICS-SimLab

```bash
cd ~/projects/ICS-SimLab-main/curtin-ics-simlab
sudo ./stop.sh
```
## Scripts Created

- `build_scenario.py` - Build the complete scenario directory
- `test_simlab.sh` - Interactive ICS-SimLab launcher
- `diagnose_runtime.sh` - Check scenario files and Docker state
## Key Constraints Met

- ✅ Only uses `time.sleep` (stdlib only, no extra dependencies)
- ✅ Never raises from callbacks (catches all exceptions)
- ✅ Preserves the PLC logic contract (no signature changes)
- ✅ Automatic propagation (fix lives in the generator, not in manual patches)
- ✅ Uses the correct Python venv (`sys.executable`)
- ✅ 30 retries × 0.2 s = 6 s total (sufficient for container startup)
## Troubleshooting

### Issue: Still crashes after fix

Verify the fix is present:

```bash
grep "_safe_callback" outputs/scenario_run/logic/plc2.py
```

If missing, rebuild:

```bash
.venv/bin/python3 build_scenario.py --overwrite
```
Issue: "WARNING: Callback failed after 30 attempts"
Cause: PLC1 took >6 seconds to start or isn't running.
Solution: Increase retries or check PLC1 status:
sudo docker ps | grep plc1
sudo docker logs <plc1_container> -f
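To widen the retry window instead, adjust the defaults in the signature that `tools/compile_ir.py` emits (see the diff above) and rebuild the scenario. The values below are only an example: 60 retries × 0.5 s gives a 30-second window.

```python
# In tools/compile_ir.py - example only: 60 x 0.5 s = 30 s instead of 30 x 0.2 s = 6 s.
lines.append("def _safe_callback(cb: Callable[[], None], retries: int = 60, delay: float = 0.5) -> None:\n")
```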
### Issue: Network connectivity

Test from the PLC2 container:

```bash
sudo docker exec -it <plc2_container> bash
ping 192.168.100.12        # Should reach PLC1
telnet 192.168.100.12 502  # Should connect to the Modbus port
```
### Issue: Wrong Python venv

Always use the explicit venv path:

```bash
.venv/bin/python3 build_scenario.py --overwrite
```

Check which Python is active:

```bash
which python3   # Should be .venv/bin/python3
python3 --version
```
## Future Improvements

- Configurable retry parameters: pass retries/delay as IR metadata
- Exponential backoff: improve the retry strategy for slow networks (see the sketch below)
- Connection pooling: reuse Modbus client connections
- Health checks: add startup probes to ICS-SimLab containers
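A possible shape for the exponential-backoff variant (a sketch only; nothing in the generator produces this yet, and the parameter names are illustrative):

```python
import time
from typing import Callable


def _safe_callback_backoff(
    cb: Callable[[], None],
    retries: int = 8,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
) -> None:
    """Like _safe_callback, but doubles the wait between attempts (capped)."""
    delay = base_delay
    for attempt in range(retries):
        try:
            cb()
            return
        except Exception as e:
            if attempt == retries - 1:
                print(f"WARNING: Callback failed after {retries} attempts: {e}")
                return
            time.sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff
```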