# PLC Startup Race Condition Fix ## Problem PLC2 was crashing at startup with `ConnectionRefusedError` when attempting to write to PLC1 via Modbus TCP before PLC1's server was ready. ### Root Cause The generated PLC logic (`tools/compile_ir.py`) produced a `_write()` function that directly invoked callbacks: ```python def _write(out_regs, cbs, key, value): ... if key in cbs: cbs[key]() # <-- CRASHES if remote PLC not ready ``` When PLC2 calls `_write(output_registers, state_update_callbacks, 'fill_request', 1)`, the callback attempts to connect to PLC1 at `192.168.100.12:502`. If PLC1 isn't ready, this raises an exception and crashes the PLC2 container. ## Solution Added a safe retry wrapper `_safe_callback()` that: - Retries up to 30 times with 0.2s delay (6 seconds total) - Catches all exceptions during callback execution - Never raises from the callback - Prints a warning on final failure and returns gracefully ### Files Changed **File:** `tools/compile_ir.py` **Changes:** 1. Added `import time` at top of generated files 2. Added `_safe_callback()` function with retry logic 3. Modified `_write()` to call `_safe_callback(cbs[key])` instead of `cbs[key]()` **Diff:** ```diff @@ -22,7 +22,17 @@ def render_plc_rules(plc_name: str, rules: List[object]) -> str: lines.append("Autogenerated by ics-simlab-config-gen (IR compiler).\n") lines.append('"""\n\n') - lines.append("from typing import Any, Callable, Dict\n\n\n") + lines.append("import time\n") + lines.append("from typing import Any, Callable, Dict\n\n\n") ... + lines.append("def _safe_callback(cb: Callable[[], None], retries: int = 30, delay: float = 0.2) -> None:\n") + lines.append(" \"\"\"Invoke callback with retry logic to handle startup race conditions.\"\"\"\n") + lines.append(" for attempt in range(retries):\n") + lines.append(" try:\n") + lines.append(" cb()\n") + lines.append(" return\n") + lines.append(" except Exception as e:\n") + lines.append(" if attempt == retries - 1:\n") + lines.append(" print(f\"WARNING: Callback failed after {retries} attempts: {e}\")\n") + lines.append(" return\n") + lines.append(" time.sleep(delay)\n\n\n") ... lines.append(" if key in cbs:\n") - lines.append(" cbs[key]()\n\n\n") + lines.append(" _safe_callback(cbs[key])\n\n\n") ``` ### Generated Code Example **Before (old plc2.py):** ```python def _write(out_regs, cbs, key, value): if key not in out_regs: return cur = out_regs[key].get('value', None) if cur == value: return out_regs[key]['value'] = value if key in cbs: cbs[key]() # CRASHES HERE ``` **After (new plc2.py):** ```python import time def _safe_callback(cb: Callable[[], None], retries: int = 30, delay: float = 0.2) -> None: """Invoke callback with retry logic to handle startup race conditions.""" for attempt in range(retries): try: cb() return except Exception as e: if attempt == retries - 1: print(f"WARNING: Callback failed after {retries} attempts: {e}") return time.sleep(delay) def _write(out_regs, cbs, key, value): if key not in out_regs: return cur = out_regs[key].get('value', None) if cur == value: return out_regs[key]['value'] = value if key in cbs: _safe_callback(cbs[key]) # NOW SAFE ``` ## Workflow Fix ### Issue The pipeline was using Python from wrong venv: `/home/stefano/projects/ics-simlab-config-gen/.venv` instead of the current repo's venv. ### Solution Created `build_scenario.py` script that: 1. Uses `sys.executable` to ensure correct Python interpreter 2. Orchestrates: config.json → IR → logic/*.py 3. Validates generated files 4. Copies everything to `outputs/scenario_run/` ## Building and Testing ### 1. Build Scenario ```bash # Activate the correct venv source .venv/bin/activate # Or: .venv/bin/python3 # Build the scenario .venv/bin/python3 build_scenario.py --out outputs/scenario_run --overwrite ``` This creates: ``` outputs/scenario_run/ ├── configuration.json └── logic/ ├── plc1.py ├── plc2.py └── hil_1.py ``` ### 2. Verify Fix is Present ```bash # Check for _safe_callback in generated file grep "_safe_callback" outputs/scenario_run/logic/plc2.py ``` Expected output: ``` def _safe_callback(cb: Callable[[], None], retries: int = 30, delay: float = 0.2) -> None: _safe_callback(cbs[key]) ``` ### 3. Run ICS-SimLab ```bash cd ~/projects/ICS-SimLab-main/curtin-ics-simlab sudo ./start.sh ~/projects/ics-simlab-config-gen_claude/outputs/scenario_run ``` ### 4. Monitor PLC2 Logs ```bash # Find PLC2 container name sudo docker ps | grep plc2 # View logs sudo docker logs -f ``` ### 5. Expected Behavior **Success indicators:** - No `Exception in thread ... logic` errors in PLC2 logs - May see: `WARNING: Callback failed after 30 attempts` if PLC1 takes too long to start - Eventually: Successful Modbus TCP connections once PLC1 is ready - No container crashes **What to look for:** ``` # Early attempts (PLC1 not ready yet): # (Silent retries in background - no output unless all fail) # After PLC1 is ready: # (Normal operation - callbacks succeed) # If PLC1 never comes up: WARNING: Callback failed after 30 attempts: [Errno 111] Connection refused ``` ### 6. Test Connectivity (if issues persist) ```bash # From host, test if PLC1 port is open nc -zv 192.168.100.12 502 # Or from inside PLC2 container sudo docker exec -it bash python3 -c "from pymodbus.client import ModbusTcpClient; c=ModbusTcpClient('192.168.100.12', 502); print('Connected:', c.connect())" ``` ### 7. Stop ICS-SimLab ```bash cd ~/projects/ICS-SimLab-main/curtin-ics-simlab sudo ./stop.sh ``` ## Scripts Created 1. **`build_scenario.py`** - Build complete scenario directory 2. **`test_simlab.sh`** - Interactive ICS-SimLab launcher 3. **`diagnose_runtime.sh`** - Check scenario files and Docker state ## Key Constraints Met - ✅ Only uses `time.sleep` (stdlib only, no extra dependencies) - ✅ Never raises from callbacks (catches all exceptions) - ✅ Preserves PLC logic contract (no signature changes) - ✅ Automatic propagation (fix in generator, not manual patches) - ✅ Uses correct Python venv (`sys.executable`) - ✅ 30 retries × 0.2s = 6s total (sufficient for container startup) ## Troubleshooting ### Issue: Still crashes after fix **Verify fix is present:** ```bash grep "_safe_callback" outputs/scenario_run/logic/plc2.py ``` If missing, rebuild: ```bash .venv/bin/python3 build_scenario.py --overwrite ``` ### Issue: "WARNING: Callback failed after 30 attempts" **Cause:** PLC1 took >6 seconds to start or isn't running. **Solution:** Increase retries or check PLC1 status: ```bash sudo docker ps | grep plc1 sudo docker logs -f ``` ### Issue: Network connectivity **Test from PLC2 container:** ```bash sudo docker exec -it bash ping 192.168.100.12 # Should reach PLC1 telnet 192.168.100.12 502 # Should connect to Modbus ``` ### Issue: Wrong Python venv **Always use explicit venv path:** ```bash .venv/bin/python3 build_scenario.py --overwrite ``` **Check which Python is active:** ```bash which python3 # Should be .venv/bin/python3 python3 --version ``` ## Future Improvements 1. **Configurable retry parameters:** Pass retries/delay as IR metadata 2. **Exponential backoff:** Improve retry strategy for slow networks 3. **Connection pooling:** Reuse Modbus client connections 4. **Health checks:** Add startup probes to ICS-SimLab containers