Skip to content

<fix>[vm]: publish HA start state change#3910

Closed
zstack-robot-1 wants to merge 1 commit into
5.4.8from
sync/hanyu.liang/fix-84266@@2
Closed

<fix>[vm]: publish HA start state change#3910
zstack-robot-1 wants to merge 1 commit into
5.4.8from
sync/hanyu.liang/fix-84266@@2

Conversation

@zstack-robot-1
Copy link
Copy Markdown
Collaborator

Root Cause

VmInstanceBase.handle(HaStartVmInstanceMsg) directly updated VmInstanceVO.state to Stopped through SQL before HA start. This bypassed changeVmStateInDb(), so the VM state transition log, /vm/state/change canonical event, and VmStateChangedExtensionPoint callback were not emitted for the HA internal Unknown -> Stopped transition.

Resolve

  • Add optional stateChangeSource to VmCanonicalEvents.VmStateChangedData; existing events keep the default null value.
  • Route HA start's internal state reset through changeVmStateInDb(VmInstanceStateEvent.stopped, ..., HaStartVmInstanceMsg.class.getName()) instead of direct SQL update.
  • The matching premium MR uses this source to skip HA's own state-change event and avoid recursive NeverStop HA start.

Verification

Remote environment: root@172.20.18.132

  • /home/zstack/header: /usr/local/maven/bin/mvn clean install -Dmaven.test.skip=true -> BUILD SUCCESS
  • /home/zstack/compute: /usr/local/maven/bin/mvn clean install -Dmaven.test.skip=true -> BUILD SUCCESS
  • /home/zstack/premium/mevoco: /usr/local/maven/bin/mvn clean install -Dmaven.test.skip=true -> BUILD SUCCESS
  • /home/zstack/premium/test-premium: /usr/local/maven/bin/mvn test -T 1C -Dtest=VmHaStartStateChangeEventCase -> Tests run: 1, Failures: 0, Errors: 0

Jira: ZSTAC-84266

Supersedes wrong branch-name MR: http://dev.zstack.io:9080/zstackio/zstack/-/merge_requests/9794

sync from gitlab !9795

Resolves: ZSTAC-84266

Change-Id: Ib8851991f035890cac610ac0fb3829e6780ded0f
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 9, 2026

Review Change Stack
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: 043af87e-5ff9-4e01-b2f2-b9028db059a8

📥 Commits

Reviewing files that changed from the base of the PR and between 6de8314 and ae30a7b.

📒 Files selected for processing (2)
  • compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java
  • header/src/main/java/org/zstack/header/vm/VmCanonicalEvents.java

通览

此变更为VM状态变化规范事件添加了源追踪能力。新增stateChangeSource字段到VmStateChangedData,扩展changeVmStateInDb方法接收此参数,并在HA启动处理中将状态转换从手动SQL改为调用新增的API。

变更

VM状态变化源追踪

层级 / 文件 摘要
事件数据契约
header/src/main/java/org/zstack/header/vm/VmCanonicalEvents.java
VmStateChangedData新增private字段stateChangeSource及其public getter/setter方法。
状态转换方法
compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java
新增三参数重载changeVmStateInDb(VmInstanceStateEvent, Runnable, String),原有的两参数重载改为委托此新重载。
事件数据设置
compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java
VM完整状态变化规范事件中,将stateChangeSource值设置到VmStateChangedData对象。
HA启动集成
compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java
HA启动处理将VM停止转换从显式UpdateQuery改为调用changeVmStateInDb(VmInstanceStateEvent.stopped, null, HaStartVmInstanceMsg.class.getName())。

代码审查工作量估计

🎯 2 (简单) | ⏱️ ~12 分钟

诗歌

🐰 一只小兔来报信,
VM状态有了源头印,
HA启动需溯源,
事件标记追踪真,
代码清晰又无尘!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed PR title follows the required [scope]: format and is 40 characters, well under the 72 character limit. Title accurately describes the main change of publishing HA start state change events.
Description check ✅ Passed PR description is comprehensive and directly related to the changeset. It explains the root cause, resolution approach, verification steps, and references the JIRA issue ZSTAC-84266.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch sync/hanyu.liang/fix-84266@@2

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ast-grep (0.42.1)
compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java

Comment @coderabbitai help to get the list of available commands and usage tips.

@zstack-robot-1
Copy link
Copy Markdown
Collaborator Author

Comment from yaohua.wu:

Review: MR !9795 — ZSTAC-84266

总体结论

APPROVED(无 Critical 阻塞项;2 Warning + 4 Suggestion,非阻塞)

修复方向正确:用 changeVmStateInDb() 替代裸 SQL,确实补齐了 HA 内部 Unknown -> Stopped 转换缺失的状态变更日志、/vm/state/change canonical event 和 VmStateChangedExtensionPoint 通知,与 Jira 描述的"227 台触发 HA 但只有 71 条告警"症状根因一致。

关键正确性确认(已逐行核对 changeVmStateInDb 的 SQLBatch 体):旧的裸 SQL 做了 hostUuid = null + lastHostUuid = self.getHostUuid()(仅当非空),新方法在 state == Stopped 分支中做了完全相同的事,并且额外通过 findByUuid 重新加载 self、DataIntegrityViolationException 重试,行为不仅未丢失,反而更鲁棒。


🟡 Warning

W1. stateChangeSource 跨仓库字符串耦合,缺少集中常量定义
位置: VmInstanceBase.java:1012changeVmStateInDb(..., HaStartVmInstanceMsg.class.getName())

zstack 这一端用 HaStartVmInstanceMsg.class.getName() 写入,premium 那一端按字符串过滤跳过 NeverStop 重启。这套 contract 存在两个风险:

  1. 如果未来重命名或移包 HaStartVmInstanceMsg(IDE refactor 不会跨 repo 检查),premium 端的字符串比较就会静默失效,重新出现"NeverStop 对 HA 内部状态修正再次触发 HA start"的递归 — 这正是这次修复想避免的场景,且回归不会被任何编译器/CI 捕获。
  2. 字段语义只在 commit message 里说明,header 里没有任何注释或常量声明哪些字符串是合法的 source。

建议: 在 VmCanonicalEventsVmTracerCanonicalEvents 之类的 header 公共类中定义一组 public static final String SOURCE_HA_START = HaStartVmInstanceMsg.class.getName(); 常量,zstack/premium 都引用这个常量。这样既保留了 IDE refactor 安全,也给 source 字段提供了显式枚举式 contract。


W2. 新增事件广播面,对其他 listener 的影响未审计
位置: VmInstanceBase.java:382data.setStateChangeSource(stateChangeSource); evtf.fire(VM_FULL_STATE_CHANGED_PATH, ...)

修复前 HA-internal 的 Unknown -> Stopped 不发事件;修复后会发出,并触发所有 VmStateChangedExtensionPoint 实现以及 /vm/state/change 上的 canonical event 订阅者。premium HaManager 已通过 stateChangeSource 过滤;但其他 listener(告警、计费、统计、审计、第三方扩展、operator 控制台 SDK 等)这次会首次看到这类事件。

MR description 与 Jira 评论里只列了 premium HaManager 的过滤逻辑,没有列受影响的扩展点清单。请确认(或在 description 里附上):

  • grep -r "VmStateChangedExtensionPoint" --include="*.java" 的全部实现是否都能容忍 Unknown -> Stopped 这种"虚拟"的状态变化(实际上 VM 一直在跑,只是 MN 视图修正)。
  • /vm/state/change 的 alarm/billing/audit 订阅者是否会因此误报"VM 停机"事件 — 这是 Jira 想要的(恢复缺失的 71/227 告警),但应当显式列出预期受益的订阅者,避免后续遗漏。

注:从 Jira 的修复初衷看,广播本身就是目的,所以这条不是要求拦截,而是要求审计预期影响面。


🟢 Suggestion

S1. 缺少 Javadoc
位置:

  • VmCanonicalEvents.java:158private String stateChangeSource;
  • VmInstanceBase.java:329protected VmInstanceVO changeVmStateInDb(..., String stateChangeSource)

新字段和新重载没有任何注释说明语义、合法值与使用约定。下次修 bug 的人只能反向工程 git blame 才能搞清楚这个 source 是干嘛的。建议至少加:

/** Optional. Identifies non-user/non-API internal triggers of this state transition
 *  (e.g. HA recovery). Downstream listeners can use this to filter out internal corrections.
 *  Convention: fully-qualified class name of the originating internal Msg/job. */

S2. 状态机隐式约束
位置: VmInstanceBase.java:1012

旧的裸 SQL 直接覆写 state 字段,绕过 VmInstanceState 状态机;新代码会走 self.getState().nextState(VmInstanceStateEvent.stopped)。这要求状态机配置允许"当前状态 -> Stopped via stopped 事件"对所有可能进入此分支的 state 都成立。

进入 HA judger qualified 分支时,self 在 SQLBatch 内被 findByUuid 重新加载 — 此时 state 不一定是 Unknown

  • 物理机短暂闪断后 host-tracker 已经把 state 改回 Running;
  • 并发的 user 操作让 state 已经是 Stopped;
  • 极端场景下 VM 已被 Destroyed。

如果 nextState(stopped) 不接受这些 source state,会抛 IllegalStateException 中断 HA 流程,回归比 raw SQL 更脆。建议补一个最小化的边界用例(Running / Stopped / Unknown / Destroyed × stopped 事件)确认状态机行为,或在调用前显式判断 state 是否为 Unknown。

S3. zstack 端零增量单测
位置: 整个 MR

VmHaStartStateChangeEventCase 在 premium 仓库(test-premium),覆盖了端到端行为;但 zstack 这边的核心改动 —— 三参 changeVmStateInDbstateChangeSource 透传到 VmStateChangedData —— 没有 unit 级别的回归。建议在 zstack 加一个轻量 SubCase 直接断言:调用三参版本后,/vm/state/change 上抓到的 VmStateChangedData.stateChangeSource 与传入值一致。这样就算未来 premium 测试基础设施波动,header/compute 的 contract 仍有本地兜底。

S4. 文件末尾空行被删除
位置: VmInstanceBase.java:8983 — 删除了文件末尾的尾随空行

POSIX 习惯要求文本文件以换行结尾,绝大多数 Java 项目(包括 zstack 同目录其他文件)都保留尾行。无功能影响,但会导致后续 diff 出现"添加最后一个换行"的噪音。建议恢复尾行。


Coverage

  • 已读完整 changeVmStateInDb 实现,确认 hostUuid/lastHostUuid 语义保持。
  • 已读 Jira 描述、产品评论、MR description 三方对齐 root cause。
  • 已确认 MR 的 merge_status=can_be_mergeddiverged_commits_count=0,与 5.4.8 上游无冲突。
  • 未深读 premium 配套 MR (!13805) — 那边的 source filter 逻辑在另一次 review 中评估。
  • 未跑动态测试 — 信任 MR description 提供的 VmHaStartStateChangeEventCase 通过结果。

Verdict

APPROVED:核心修复正确、行为完全保留、改动定位精准。Warning 都是关于"扩散面"和"contract 显式化"的工程建议,不阻塞合并;Suggestion 全部是可选增强。建议在 follow-up(同一 sprint 内)处理 W1(提取常量)+ S1(加 Javadoc),其余按需。


🤖 Robot Reviewer

@zstack-robot-2
Copy link
Copy Markdown
Collaborator

Comment from hanyu.liang:

W1.msg变化后这个字符还会随之变化,所以没有影响
w2.本身就是状态转变,所以状态转变都按照正常状态变化执行VmStateChangedExtensionPoint是正常的。

@zstack-robot-1 zstack-robot-1 deleted the sync/hanyu.liang/fix-84266@@2 branch May 11, 2026 07:24
@MatheMatrix MatheMatrix restored the sync/hanyu.liang/fix-84266@@2 branch May 11, 2026 07:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants