274 lines
7.6 KiB
Markdown
274 lines
7.6 KiB
Markdown
# PLC Startup Race Condition Fix
|
||
|
||
## Problem
|
||
|
||
PLC2 was crashing at startup with `ConnectionRefusedError` when attempting to write to PLC1 via Modbus TCP before PLC1's server was ready.
|
||
|
||
### Root Cause
|
||
|
||
The generated PLC logic (`tools/compile_ir.py`) produced a `_write()` function that directly invoked callbacks:
|
||
|
||
```python
|
||
def _write(out_regs, cbs, key, value):
|
||
...
|
||
if key in cbs:
|
||
cbs[key]() # <-- CRASHES if remote PLC not ready
|
||
```
|
||
|
||
When PLC2 calls `_write(output_registers, state_update_callbacks, 'fill_request', 1)`, the callback attempts to connect to PLC1 at `192.168.100.12:502`. If PLC1 isn't ready, this raises an exception and crashes the PLC2 container.
|
||
|
||
## Solution
|
||
|
||
Added a safe retry wrapper `_safe_callback()` that:
|
||
- Retries up to 30 times with 0.2s delay (6 seconds total)
|
||
- Catches all exceptions during callback execution
|
||
- Never raises from the callback
|
||
- Prints a warning on final failure and returns gracefully
|
||
|
||
### Files Changed
|
||
|
||
**File:** `tools/compile_ir.py`
|
||
|
||
**Changes:**
|
||
1. Added `import time` at top of generated files
|
||
2. Added `_safe_callback()` function with retry logic
|
||
3. Modified `_write()` to call `_safe_callback(cbs[key])` instead of `cbs[key]()`
|
||
|
||
**Diff:**
|
||
```diff
|
||
@@ -22,7 +22,17 @@ def render_plc_rules(plc_name: str, rules: List[object]) -> str:
|
||
lines.append("Autogenerated by ics-simlab-config-gen (IR compiler).\n")
|
||
lines.append('"""\n\n')
|
||
- lines.append("from typing import Any, Callable, Dict\n\n\n")
|
||
+ lines.append("import time\n")
|
||
+ lines.append("from typing import Any, Callable, Dict\n\n\n")
|
||
...
|
||
+ lines.append("def _safe_callback(cb: Callable[[], None], retries: int = 30, delay: float = 0.2) -> None:\n")
|
||
+ lines.append(" \"\"\"Invoke callback with retry logic to handle startup race conditions.\"\"\"\n")
|
||
+ lines.append(" for attempt in range(retries):\n")
|
||
+ lines.append(" try:\n")
|
||
+ lines.append(" cb()\n")
|
||
+ lines.append(" return\n")
|
||
+ lines.append(" except Exception as e:\n")
|
||
+ lines.append(" if attempt == retries - 1:\n")
|
||
+ lines.append(" print(f\"WARNING: Callback failed after {retries} attempts: {e}\")\n")
|
||
+ lines.append(" return\n")
|
||
+ lines.append(" time.sleep(delay)\n\n\n")
|
||
...
|
||
lines.append(" if key in cbs:\n")
|
||
- lines.append(" cbs[key]()\n\n\n")
|
||
+ lines.append(" _safe_callback(cbs[key])\n\n\n")
|
||
```
|
||
|
||
### Generated Code Example
|
||
|
||
**Before (old plc2.py):**
|
||
```python
|
||
def _write(out_regs, cbs, key, value):
|
||
if key not in out_regs:
|
||
return
|
||
cur = out_regs[key].get('value', None)
|
||
if cur == value:
|
||
return
|
||
out_regs[key]['value'] = value
|
||
if key in cbs:
|
||
cbs[key]() # CRASHES HERE
|
||
```
|
||
|
||
**After (new plc2.py):**
|
||
```python
|
||
import time
|
||
|
||
def _safe_callback(cb: Callable[[], None], retries: int = 30, delay: float = 0.2) -> None:
|
||
"""Invoke callback with retry logic to handle startup race conditions."""
|
||
for attempt in range(retries):
|
||
try:
|
||
cb()
|
||
return
|
||
except Exception as e:
|
||
if attempt == retries - 1:
|
||
print(f"WARNING: Callback failed after {retries} attempts: {e}")
|
||
return
|
||
time.sleep(delay)
|
||
|
||
def _write(out_regs, cbs, key, value):
|
||
if key not in out_regs:
|
||
return
|
||
cur = out_regs[key].get('value', None)
|
||
if cur == value:
|
||
return
|
||
out_regs[key]['value'] = value
|
||
if key in cbs:
|
||
_safe_callback(cbs[key]) # NOW SAFE
|
||
```
|
||
|
||
## Workflow Fix
|
||
|
||
### Issue
|
||
The pipeline was using Python from wrong venv: `/home/stefano/projects/ics-simlab-config-gen/.venv` instead of the current repo's venv.
|
||
|
||
### Solution
|
||
Created `build_scenario.py` script that:
|
||
1. Uses `sys.executable` to ensure correct Python interpreter
|
||
2. Orchestrates: config.json → IR → logic/*.py
|
||
3. Validates generated files
|
||
4. Copies everything to `outputs/scenario_run/`
|
||
|
||
## Building and Testing
|
||
|
||
### 1. Build Scenario
|
||
|
||
```bash
|
||
# Activate the correct venv
|
||
source .venv/bin/activate # Or: .venv/bin/python3
|
||
|
||
# Build the scenario
|
||
.venv/bin/python3 build_scenario.py --out outputs/scenario_run --overwrite
|
||
```
|
||
|
||
This creates:
|
||
```
|
||
outputs/scenario_run/
|
||
├── configuration.json
|
||
└── logic/
|
||
├── plc1.py
|
||
├── plc2.py
|
||
└── hil_1.py
|
||
```
|
||
|
||
### 2. Verify Fix is Present
|
||
|
||
```bash
|
||
# Check for _safe_callback in generated file
|
||
grep "_safe_callback" outputs/scenario_run/logic/plc2.py
|
||
```
|
||
|
||
Expected output:
|
||
```
|
||
def _safe_callback(cb: Callable[[], None], retries: int = 30, delay: float = 0.2) -> None:
|
||
_safe_callback(cbs[key])
|
||
```
|
||
|
||
### 3. Run ICS-SimLab
|
||
|
||
```bash
|
||
cd ~/projects/ICS-SimLab-main/curtin-ics-simlab
|
||
sudo ./start.sh ~/projects/ics-simlab-config-gen_claude/outputs/scenario_run
|
||
```
|
||
|
||
### 4. Monitor PLC2 Logs
|
||
|
||
```bash
|
||
# Find PLC2 container name
|
||
sudo docker ps | grep plc2
|
||
|
||
# View logs
|
||
sudo docker logs <plc2_container_name> -f
|
||
```
|
||
|
||
### 5. Expected Behavior
|
||
|
||
**Success indicators:**
|
||
- No `Exception in thread ... logic` errors in PLC2 logs
|
||
- May see: `WARNING: Callback failed after 30 attempts` if PLC1 takes too long to start
|
||
- Eventually: Successful Modbus TCP connections once PLC1 is ready
|
||
- No container crashes
|
||
|
||
**What to look for:**
|
||
```
|
||
# Early attempts (PLC1 not ready yet):
|
||
# (Silent retries in background - no output unless all fail)
|
||
|
||
# After PLC1 is ready:
|
||
# (Normal operation - callbacks succeed)
|
||
|
||
# If PLC1 never comes up:
|
||
WARNING: Callback failed after 30 attempts: [Errno 111] Connection refused
|
||
```
|
||
|
||
### 6. Test Connectivity (if issues persist)
|
||
|
||
```bash
|
||
# From host, test if PLC1 port is open
|
||
nc -zv 192.168.100.12 502
|
||
|
||
# Or from inside PLC2 container
|
||
sudo docker exec -it <plc2_container> bash
|
||
python3 -c "from pymodbus.client import ModbusTcpClient; c=ModbusTcpClient('192.168.100.12', 502); print('Connected:', c.connect())"
|
||
```
|
||
|
||
### 7. Stop ICS-SimLab
|
||
|
||
```bash
|
||
cd ~/projects/ICS-SimLab-main/curtin-ics-simlab
|
||
sudo ./stop.sh
|
||
```
|
||
|
||
## Scripts Created
|
||
|
||
1. **`build_scenario.py`** - Build complete scenario directory
|
||
2. **`test_simlab.sh`** - Interactive ICS-SimLab launcher
|
||
3. **`diagnose_runtime.sh`** - Check scenario files and Docker state
|
||
|
||
## Key Constraints Met
|
||
|
||
- ✅ Only uses `time.sleep` (stdlib only, no extra dependencies)
|
||
- ✅ Never raises from callbacks (catches all exceptions)
|
||
- ✅ Preserves PLC logic contract (no signature changes)
|
||
- ✅ Automatic propagation (fix in generator, not manual patches)
|
||
- ✅ Uses correct Python venv (`sys.executable`)
|
||
- ✅ 30 retries × 0.2s = 6s total (sufficient for container startup)
|
||
|
||
## Troubleshooting
|
||
|
||
### Issue: Still crashes after fix
|
||
|
||
**Verify fix is present:**
|
||
```bash
|
||
grep "_safe_callback" outputs/scenario_run/logic/plc2.py
|
||
```
|
||
|
||
If missing, rebuild:
|
||
```bash
|
||
.venv/bin/python3 build_scenario.py --overwrite
|
||
```
|
||
|
||
### Issue: "WARNING: Callback failed after 30 attempts"
|
||
|
||
**Cause:** PLC1 took >6 seconds to start or isn't running.
|
||
|
||
**Solution:** Increase retries or check PLC1 status:
|
||
```bash
|
||
sudo docker ps | grep plc1
|
||
sudo docker logs <plc1_container> -f
|
||
```
|
||
|
||
### Issue: Network connectivity
|
||
|
||
**Test from PLC2 container:**
|
||
```bash
|
||
sudo docker exec -it <plc2_container> bash
|
||
ping 192.168.100.12 # Should reach PLC1
|
||
telnet 192.168.100.12 502 # Should connect to Modbus
|
||
```
|
||
|
||
### Issue: Wrong Python venv
|
||
|
||
**Always use explicit venv path:**
|
||
```bash
|
||
.venv/bin/python3 build_scenario.py --overwrite
|
||
```
|
||
|
||
**Check which Python is active:**
|
||
```bash
|
||
which python3 # Should be .venv/bin/python3
|
||
python3 --version
|
||
```
|
||
|
||
## Future Improvements
|
||
|
||
1. **Configurable retry parameters:** Pass retries/delay as IR metadata
|
||
2. **Exponential backoff:** Improve retry strategy for slow networks
|
||
3. **Connection pooling:** Reuse Modbus client connections
|
||
4. **Health checks:** Add startup probes to ICS-SimLab containers
|