ics-simlab-config-gen-claude/docs/RUNTIME_FIX.md

7.6 KiB
Raw Blame History

PLC Startup Race Condition Fix

Problem

PLC2 was crashing at startup with ConnectionRefusedError when attempting to write to PLC1 via Modbus TCP before PLC1's server was ready.

Root Cause

The generated PLC logic (tools/compile_ir.py) produced a _write() function that directly invoked callbacks:

def _write(out_regs, cbs, key, value):
    ...
    if key in cbs:
        cbs[key]()  # <-- CRASHES if remote PLC not ready

When PLC2 calls _write(output_registers, state_update_callbacks, 'fill_request', 1), the callback attempts to connect to PLC1 at 192.168.100.12:502. If PLC1 isn't ready, this raises an exception and crashes the PLC2 container.

Solution

Added a safe retry wrapper _safe_callback() that:

  • Retries up to 30 times with 0.2s delay (6 seconds total)
  • Catches all exceptions during callback execution
  • Never raises from the callback
  • Prints a warning on final failure and returns gracefully

Files Changed

File: tools/compile_ir.py

Changes:

  1. Added import time at top of generated files
  2. Added _safe_callback() function with retry logic
  3. Modified _write() to call _safe_callback(cbs[key]) instead of cbs[key]()

Diff:

@@ -22,7 +22,17 @@ def render_plc_rules(plc_name: str, rules: List[object]) -> str:
     lines.append("Autogenerated by ics-simlab-config-gen (IR compiler).\n")
     lines.append('"""\n\n')
-    lines.append("from typing import Any, Callable, Dict\n\n\n")
+    lines.append("import time\n")
+    lines.append("from typing import Any, Callable, Dict\n\n\n")
     ...
+    lines.append("def _safe_callback(cb: Callable[[], None], retries: int = 30, delay: float = 0.2) -> None:\n")
+    lines.append("    \"\"\"Invoke callback with retry logic to handle startup race conditions.\"\"\"\n")
+    lines.append("    for attempt in range(retries):\n")
+    lines.append("        try:\n")
+    lines.append("            cb()\n")
+    lines.append("            return\n")
+    lines.append("        except Exception as e:\n")
+    lines.append("            if attempt == retries - 1:\n")
+    lines.append("                print(f\"WARNING: Callback failed after {retries} attempts: {e}\")\n")
+    lines.append("                return\n")
+    lines.append("            time.sleep(delay)\n\n\n")
     ...
     lines.append("    if key in cbs:\n")
-    lines.append("        cbs[key]()\n\n\n")
+    lines.append("        _safe_callback(cbs[key])\n\n\n")

Generated Code Example

Before (old plc2.py):

def _write(out_regs, cbs, key, value):
    if key not in out_regs:
        return
    cur = out_regs[key].get('value', None)
    if cur == value:
        return
    out_regs[key]['value'] = value
    if key in cbs:
        cbs[key]()  # CRASHES HERE

After (new plc2.py):

import time

def _safe_callback(cb: Callable[[], None], retries: int = 30, delay: float = 0.2) -> None:
    """Invoke callback with retry logic to handle startup race conditions."""
    for attempt in range(retries):
        try:
            cb()
            return
        except Exception as e:
            if attempt == retries - 1:
                print(f"WARNING: Callback failed after {retries} attempts: {e}")
                return
            time.sleep(delay)

def _write(out_regs, cbs, key, value):
    if key not in out_regs:
        return
    cur = out_regs[key].get('value', None)
    if cur == value:
        return
    out_regs[key]['value'] = value
    if key in cbs:
        _safe_callback(cbs[key])  # NOW SAFE

Workflow Fix

Issue

The pipeline was using Python from wrong venv: /home/stefano/projects/ics-simlab-config-gen/.venv instead of the current repo's venv.

Solution

Created build_scenario.py script that:

  1. Uses sys.executable to ensure correct Python interpreter
  2. Orchestrates: config.json → IR → logic/*.py
  3. Validates generated files
  4. Copies everything to outputs/scenario_run/

Building and Testing

1. Build Scenario

# Activate the correct venv
source .venv/bin/activate  # Or: .venv/bin/python3

# Build the scenario
.venv/bin/python3 build_scenario.py --out outputs/scenario_run --overwrite

This creates:

outputs/scenario_run/
├── configuration.json
└── logic/
    ├── plc1.py
    ├── plc2.py
    └── hil_1.py

2. Verify Fix is Present

# Check for _safe_callback in generated file
grep "_safe_callback" outputs/scenario_run/logic/plc2.py

Expected output:

def _safe_callback(cb: Callable[[], None], retries: int = 30, delay: float = 0.2) -> None:
        _safe_callback(cbs[key])

3. Run ICS-SimLab

cd ~/projects/ICS-SimLab-main/curtin-ics-simlab
sudo ./start.sh ~/projects/ics-simlab-config-gen_claude/outputs/scenario_run

4. Monitor PLC2 Logs

# Find PLC2 container name
sudo docker ps | grep plc2

# View logs
sudo docker logs <plc2_container_name> -f

5. Expected Behavior

Success indicators:

  • No Exception in thread ... logic errors in PLC2 logs
  • May see: WARNING: Callback failed after 30 attempts if PLC1 takes too long to start
  • Eventually: Successful Modbus TCP connections once PLC1 is ready
  • No container crashes

What to look for:

# Early attempts (PLC1 not ready yet):
# (Silent retries in background - no output unless all fail)

# After PLC1 is ready:
# (Normal operation - callbacks succeed)

# If PLC1 never comes up:
WARNING: Callback failed after 30 attempts: [Errno 111] Connection refused

6. Test Connectivity (if issues persist)

# From host, test if PLC1 port is open
nc -zv 192.168.100.12 502

# Or from inside PLC2 container
sudo docker exec -it <plc2_container> bash
python3 -c "from pymodbus.client import ModbusTcpClient; c=ModbusTcpClient('192.168.100.12', 502); print('Connected:', c.connect())"

7. Stop ICS-SimLab

cd ~/projects/ICS-SimLab-main/curtin-ics-simlab
sudo ./stop.sh

Scripts Created

  1. build_scenario.py - Build complete scenario directory
  2. test_simlab.sh - Interactive ICS-SimLab launcher
  3. diagnose_runtime.sh - Check scenario files and Docker state

Key Constraints Met

  • Only uses time.sleep (stdlib only, no extra dependencies)
  • Never raises from callbacks (catches all exceptions)
  • Preserves PLC logic contract (no signature changes)
  • Automatic propagation (fix in generator, not manual patches)
  • Uses correct Python venv (sys.executable)
  • 30 retries × 0.2s = 6s total (sufficient for container startup)

Troubleshooting

Issue: Still crashes after fix

Verify fix is present:

grep "_safe_callback" outputs/scenario_run/logic/plc2.py

If missing, rebuild:

.venv/bin/python3 build_scenario.py --overwrite

Issue: "WARNING: Callback failed after 30 attempts"

Cause: PLC1 took >6 seconds to start or isn't running.

Solution: Increase retries or check PLC1 status:

sudo docker ps | grep plc1
sudo docker logs <plc1_container> -f

Issue: Network connectivity

Test from PLC2 container:

sudo docker exec -it <plc2_container> bash
ping 192.168.100.12  # Should reach PLC1
telnet 192.168.100.12 502  # Should connect to Modbus

Issue: Wrong Python venv

Always use explicit venv path:

.venv/bin/python3 build_scenario.py --overwrite

Check which Python is active:

which python3  # Should be .venv/bin/python3
python3 --version

Future Improvements

  1. Configurable retry parameters: Pass retries/delay as IR metadata
  2. Exponential backoff: Improve retry strategy for slow networks
  3. Connection pooling: Reuse Modbus client connections
  4. Health checks: Add startup probes to ICS-SimLab containers