VRT hangs indefinitely when the emulation/simulation server crashes
Summary
When VRT is used in EMULATION or SIMULATION mode, it spawns a child process (vpp_emu / vpp_sim) that exposes a ZeroMQ REP socket on tcp://localhost:5555, and talks to it through vrt::ZmqServer over a ZMQ_REQ socket. If that child process never starts, dies during startup, or crashes mid-run, VRT does not detect this — every subsequent socket.recv(...) blocks forever and the calling application hangs with no diagnostic.
VRT should detect that the emulation/simulation peer is unreachable or has gone away and fail fast with a clear error instead of blocking indefinitely.
Where this lives
vrt/src/device.cpp — Device::Device(...) (EMULATION/SIMULATION branches): launches the child via std::system(...) inside a detached std::thread. The PID is never captured, so liveness can't be polled and exit status can't be reported.
vrt/src/utils/zmq_server.cpp — ZmqServer::ZmqServer(): creates a ZMQ_REQ socket and calls connect(). ZMQ's connect succeeds even when no peer is listening; failure only manifests at send / recv time, where the current code blocks unconditionally.
Proposed fix (in increasing order of invasiveness)
- (1) Socket-level timeouts. Set ZMQ_RCVTIMEO, ZMQ_SNDTIMEO, and ZMQ_LINGER=0 on the REQ socket in ZmqServer. Wrap every socket.recv(reply) so a timeout raises std::runtime_error with a clear message ("emulation/simulation server did not respond within Xs; it may have crashed, check the server logs"). Note: a REQ socket whose request has timed out is still waiting for that reply and cannot send a new request, so it must be closed and re-created. The simplest option is to throw and let the caller bail.
- (2) Startup handshake. In the Device constructor, after spawning the child, send a probe command with a short timeout and a few retries. This converts "hang on first real call" into "fail fast at construction with a diagnostic."
- (3) Capture the child PID. Replace std::system() + detached std::thread with posix_spawn (or fork/execvp), store the PID on Device, and call waitpid(pid, &status, WNOHANG) from cleanup() and on any timeout to surface the exit code/signal in the error message. This also catches mid-run crashes, not just startup failures.
(1) alone removes the hang. (1)+(2) gives a clear early failure. (3) is the right long-term fix because it also catches mid-run crashes and reports the underlying cause.
Acceptance criteria
- Pointing VRT at an emu/sim binary that exits immediately produces a clear error within a few seconds (no hang).
- An emu/sim binary that crashes mid-execution causes the next VRT call to throw with a diagnostic that names the failed command and, ideally, the child's exit status or signal.