ics-simlab-config-gen-claude/docs/RUNTIME_FIX.md

274 lines
7.6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# PLC Startup Race Condition Fix
## Problem
PLC2 was crashing at startup with `ConnectionRefusedError` when attempting to write to PLC1 via Modbus TCP before PLC1's server was ready.
### Root Cause
The generated PLC logic (`tools/compile_ir.py`) produced a `_write()` function that directly invoked callbacks:
```python
def _write(out_regs, cbs, key, value):
...
if key in cbs:
cbs[key]() # <-- CRASHES if remote PLC not ready
```
When PLC2 calls `_write(output_registers, state_update_callbacks, 'fill_request', 1)`, the callback attempts to connect to PLC1 at `192.168.100.12:502`. If PLC1 isn't ready, this raises an exception and crashes the PLC2 container.
## Solution
Added a safe retry wrapper `_safe_callback()` that:
- Retries up to 30 times with 0.2s delay (6 seconds total)
- Catches all exceptions during callback execution
- Never raises from the callback
- Prints a warning on final failure and returns gracefully
### Files Changed
**File:** `tools/compile_ir.py`
**Changes:**
1. Added `import time` at top of generated files
2. Added `_safe_callback()` function with retry logic
3. Modified `_write()` to call `_safe_callback(cbs[key])` instead of `cbs[key]()`
**Diff:**
```diff
@@ -22,7 +22,17 @@ def render_plc_rules(plc_name: str, rules: List[object]) -> str:
lines.append("Autogenerated by ics-simlab-config-gen (IR compiler).\n")
lines.append('"""\n\n')
- lines.append("from typing import Any, Callable, Dict\n\n\n")
+ lines.append("import time\n")
+ lines.append("from typing import Any, Callable, Dict\n\n\n")
...
+ lines.append("def _safe_callback(cb: Callable[[], None], retries: int = 30, delay: float = 0.2) -> None:\n")
+ lines.append(" \"\"\"Invoke callback with retry logic to handle startup race conditions.\"\"\"\n")
+ lines.append(" for attempt in range(retries):\n")
+ lines.append(" try:\n")
+ lines.append(" cb()\n")
+ lines.append(" return\n")
+ lines.append(" except Exception as e:\n")
+ lines.append(" if attempt == retries - 1:\n")
+ lines.append(" print(f\"WARNING: Callback failed after {retries} attempts: {e}\")\n")
+ lines.append(" return\n")
+ lines.append(" time.sleep(delay)\n\n\n")
...
lines.append(" if key in cbs:\n")
- lines.append(" cbs[key]()\n\n\n")
+ lines.append(" _safe_callback(cbs[key])\n\n\n")
```
### Generated Code Example
**Before (old plc2.py):**
```python
def _write(out_regs, cbs, key, value):
if key not in out_regs:
return
cur = out_regs[key].get('value', None)
if cur == value:
return
out_regs[key]['value'] = value
if key in cbs:
cbs[key]() # CRASHES HERE
```
**After (new plc2.py):**
```python
import time
def _safe_callback(cb: Callable[[], None], retries: int = 30, delay: float = 0.2) -> None:
"""Invoke callback with retry logic to handle startup race conditions."""
for attempt in range(retries):
try:
cb()
return
except Exception as e:
if attempt == retries - 1:
print(f"WARNING: Callback failed after {retries} attempts: {e}")
return
time.sleep(delay)
def _write(out_regs, cbs, key, value):
if key not in out_regs:
return
cur = out_regs[key].get('value', None)
if cur == value:
return
out_regs[key]['value'] = value
if key in cbs:
_safe_callback(cbs[key]) # NOW SAFE
```
## Workflow Fix
### Issue
The pipeline was using Python from wrong venv: `/home/stefano/projects/ics-simlab-config-gen/.venv` instead of the current repo's venv.
### Solution
Created `build_scenario.py` script that:
1. Uses `sys.executable` to ensure correct Python interpreter
2. Orchestrates: config.json → IR → logic/*.py
3. Validates generated files
4. Copies everything to `outputs/scenario_run/`
## Building and Testing
### 1. Build Scenario
```bash
# Activate the correct venv
source .venv/bin/activate # Or: .venv/bin/python3
# Build the scenario
.venv/bin/python3 build_scenario.py --out outputs/scenario_run --overwrite
```
This creates:
```
outputs/scenario_run/
├── configuration.json
└── logic/
├── plc1.py
├── plc2.py
└── hil_1.py
```
### 2. Verify Fix is Present
```bash
# Check for _safe_callback in generated file
grep "_safe_callback" outputs/scenario_run/logic/plc2.py
```
Expected output:
```
def _safe_callback(cb: Callable[[], None], retries: int = 30, delay: float = 0.2) -> None:
_safe_callback(cbs[key])
```
### 3. Run ICS-SimLab
```bash
cd ~/projects/ICS-SimLab-main/curtin-ics-simlab
sudo ./start.sh ~/projects/ics-simlab-config-gen_claude/outputs/scenario_run
```
### 4. Monitor PLC2 Logs
```bash
# Find PLC2 container name
sudo docker ps | grep plc2
# View logs
sudo docker logs <plc2_container_name> -f
```
### 5. Expected Behavior
**Success indicators:**
- No `Exception in thread ... logic` errors in PLC2 logs
- May see: `WARNING: Callback failed after 30 attempts` if PLC1 takes too long to start
- Eventually: Successful Modbus TCP connections once PLC1 is ready
- No container crashes
**What to look for:**
```
# Early attempts (PLC1 not ready yet):
# (Silent retries in background - no output unless all fail)
# After PLC1 is ready:
# (Normal operation - callbacks succeed)
# If PLC1 never comes up:
WARNING: Callback failed after 30 attempts: [Errno 111] Connection refused
```
### 6. Test Connectivity (if issues persist)
```bash
# From host, test if PLC1 port is open
nc -zv 192.168.100.12 502
# Or from inside PLC2 container
sudo docker exec -it <plc2_container> bash
python3 -c "from pymodbus.client import ModbusTcpClient; c=ModbusTcpClient('192.168.100.12', 502); print('Connected:', c.connect())"
```
### 7. Stop ICS-SimLab
```bash
cd ~/projects/ICS-SimLab-main/curtin-ics-simlab
sudo ./stop.sh
```
## Scripts Created
1. **`build_scenario.py`** - Build complete scenario directory
2. **`test_simlab.sh`** - Interactive ICS-SimLab launcher
3. **`diagnose_runtime.sh`** - Check scenario files and Docker state
## Key Constraints Met
- ✅ Only uses `time.sleep` (stdlib only, no extra dependencies)
- ✅ Never raises from callbacks (catches all exceptions)
- ✅ Preserves PLC logic contract (no signature changes)
- ✅ Automatic propagation (fix in generator, not manual patches)
- ✅ Uses correct Python venv (`sys.executable`)
- ✅ 30 retries × 0.2s = 6s total (sufficient for container startup)
## Troubleshooting
### Issue: Still crashes after fix
**Verify fix is present:**
```bash
grep "_safe_callback" outputs/scenario_run/logic/plc2.py
```
If missing, rebuild:
```bash
.venv/bin/python3 build_scenario.py --overwrite
```
### Issue: "WARNING: Callback failed after 30 attempts"
**Cause:** PLC1 took >6 seconds to start or isn't running.
**Solution:** Increase retries or check PLC1 status:
```bash
sudo docker ps | grep plc1
sudo docker logs <plc1_container> -f
```
### Issue: Network connectivity
**Test from PLC2 container:**
```bash
sudo docker exec -it <plc2_container> bash
ping 192.168.100.12 # Should reach PLC1
telnet 192.168.100.12 502 # Should connect to Modbus
```
### Issue: Wrong Python venv
**Always use explicit venv path:**
```bash
.venv/bin/python3 build_scenario.py --overwrite
```
**Check which Python is active:**
```bash
which python3 # Should be .venv/bin/python3
python3 --version
```
## Future Improvements
1. **Configurable retry parameters:** Pass retries/delay as IR metadata
2. **Exponential backoff:** Improve retry strategy for slow networks
3. **Connection pooling:** Reuse Modbus client connections
4. **Health checks:** Add startup probes to ICS-SimLab containers