
[FLINK-39176][runtime] Add configuration support for management node quarantine #27714

Open
featzhang wants to merge 16 commits into apache:master from featzhang:feature/FLINK-39176-blocklist-management

@featzhang featzhang commented Feb 28, 2026

What is the purpose of the change

This is PR-4 in the FLINK-39176 series: Node Health Management & Quarantine Framework.

This PR adds configuration support for the management node quarantine feature and renames the "management blocklist" terminology to "node quarantine", which clearly distinguishes it from Flink's existing blocklist mechanism (FLIP-224, used for batch speculative execution). Administrators can control node quarantine behavior through Flink's configuration system: enabling or disabling the feature, setting the default quarantine duration, and setting the maximum allowed duration.
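
As a rough illustration of how the three settings could interact, the following is a minimal sketch and not code from this PR. The class `QuarantineDurationResolver` and its `resolve` method are hypothetical, the durations used in `main` are placeholder values rather than the PR's defaults, and clamping an over-long request to the maximum is an assumption (the actual handler may reject such requests instead).

```java
import java.time.Duration;
import java.util.Optional;

/** Hypothetical helper, not a class from this PR. */
final class QuarantineDurationResolver {
    private final boolean enabled;
    private final Duration defaultDuration;
    private final Duration maxDuration;

    QuarantineDurationResolver(boolean enabled, Duration defaultDuration, Duration maxDuration) {
        if (defaultDuration.compareTo(maxDuration) > 0) {
            throw new IllegalArgumentException("default-duration must not exceed max-duration");
        }
        this.enabled = enabled;
        this.defaultDuration = defaultDuration;
        this.maxDuration = maxDuration;
    }

    /** Returns the duration to apply, or empty when the feature is disabled. */
    Optional<Duration> resolve(Duration requested) {
        if (!enabled) {
            return Optional.empty();
        }
        if (requested == null) {
            // No explicit duration in the request: fall back to the configured default.
            return Optional.of(defaultDuration);
        }
        if (requested.isZero() || requested.isNegative()) {
            throw new IllegalArgumentException("quarantine duration must be positive");
        }
        // Assumption: over-long requests are clamped; the PR might reject them instead.
        return Optional.of(requested.compareTo(maxDuration) > 0 ? maxDuration : requested);
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        QuarantineDurationResolver resolver =
                new QuarantineDurationResolver(true, Duration.ofMinutes(10), Duration.ofHours(24));
        System.out.println(resolver.resolve(null));
        System.out.println(resolver.resolve(Duration.ofDays(7)));
    }
}
```

Validating that the default does not exceed the maximum at construction time keeps a misconfiguration from surfacing only when the first quarantine request arrives.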

PR Series Overview

This feature is implemented across 6 PRs, each independently reviewable and mergeable:

| PR | Title | Link |
|------|-------|------|
| PR-1 | Introduce NodeHealthManager Abstraction | #27701 |
| PR-2 | Integrate NodeHealthManager with Slot Filtering | #27711 |
| PR-3 | Add REST API for Node Quarantine | #27712 |
| PR-4 | Configuration Support for Node Quarantine | This PR |
| PR-5 | Expiration Cleanup Scheduler | #27715 |
| PR-6 | Web UI Node Health Page | #27716 |

Brief change log

  • Rename management/blocklist package to management/nodequarantine to avoid confusion with the existing speculative execution blocklist (FLIP-224)
  • Add ManagementOptions configuration class with the options enabled, default-duration, and max-duration under the cluster.management.node-quarantine.* prefix
  • Add DefaultManagementNodeQuarantineHandler with automatic expiration cleanup via ScheduledExecutor
  • Add ManagementNodeQuarantineUtils for configuration-based handler factory loading
  • Add REST API handlers: NodeQuarantineAddHandler, NodeQuarantineListManagementHandler, NodeQuarantineRemoveHandler for POST/GET/DELETE /cluster/node-quarantine
  • Extend ResourceManagerGateway with addManagementQuarantinedNode(), removeManagementQuarantinedNode(), getAllManagementQuarantinedNodes()
  • Update ResourceManager, StandaloneResourceManager, ActiveResourceManager to accept ManagementNodeQuarantineHandler.Factory
  • Integrate NodeHealthManager check into FineGrainedSlotManager.isBlockedTaskManager() for resource allocation filtering
  • Add comprehensive test suite covering core functionality, edge cases, and integration scenarios

Verifying this change

This change is covered by existing tests and by new tests added in this PR:

  • SimpleManagementNodeQuarantineTest: Core quarantine handler functionality
  • ManagementNodeQuarantineEdgeCasesTest: Edge cases (null inputs, duplicates, expiration, special characters, large node counts)
  • SimpleNodeQuarantineHandlerTest: REST handler and gateway method validation
  • NodeHealthManagerTest: Node health manager tests
  • NodeQuarantineSlotFilteringITCase: Integration test verifying quarantined nodes are skipped during slot allocation
  • StandaloneResourceManagerTest: ResourceManager with quarantine handler factory
  • ActiveResourceManagerTest: Active ResourceManager integration

The listed tests can be run with:

mvn test -pl flink-runtime -Dtest="SimpleManagementNodeQuarantineTest,ManagementNodeQuarantineEdgeCasesTest,SimpleNodeQuarantineHandlerTest,NodeHealthManagerTest,NodeQuarantineSlotFilteringITCase,StandaloneResourceManagerTest,ActiveResourceManagerTest"

Does this pull request potentially affect one of the following parts

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: no
  • The (de)serialization that stored state depends on: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? docs + JavaDocs


flinkbot commented Feb 28, 2026

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

@featzhang featzhang force-pushed the feature/FLINK-39176-blocklist-management branch from 2a03f2f to 333432a Compare March 2, 2026 13:18
@featzhang featzhang changed the title from "[FLINK-39176][runtime] Add configuration support for management blocklist" to "[FLINK-39176][runtime] Add configuration support for management node quarantine" Mar 2, 2026
@featzhang

@flinkbot run azure

@featzhang featzhang force-pushed the feature/FLINK-39176-blocklist-management branch from 26bc5bb to e3ecbcd Compare March 2, 2026 17:22

featzhang and others added 11 commits March 4, 2026 08:57
This PR introduces the NodeHealthManager abstraction layer for the
upcoming generic blacklist feature.

Changes:
- Add NodeHealthManager interface with methods for checking node health,
  marking nodes as quarantined, removing quarantine, listing all statuses,
  and cleaning up expired entries
- Add NodeHealthStatus data class to hold node health information
- Add NoOpNodeHealthManager implementation that always considers nodes
  healthy (no-op implementation for backward compatibility)
- Add DefaultNodeHealthManager implementation using ConcurrentHashMap
  to manage node health states
- Integrate NodeHealthManager into ResourceManager with NoOpNodeHealthManager
  as the default implementation (no behavior change in this PR)
- Add comprehensive unit tests for all implementations
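
The abstraction listed above might look roughly like the following sketch. Only the type names come from the commit message. Every method name and signature, the fields of `NodeHealthStatus`, and the injected `LongSupplier` clock are assumptions made for illustration, so the real interface in the PR may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Sketch only: method names and signatures are assumed, not taken from the PR. */
interface NodeHealthManager {
    boolean isHealthy(String nodeId);
    void markQuarantined(String nodeId, String reason, long expiresAtMillis);
    boolean removeQuarantine(String nodeId);
    List<NodeHealthStatus> listAll();
    int cleanupExpired();
}

/** Immutable holder for one node's quarantine state (fields are assumptions). */
final class NodeHealthStatus {
    final String nodeId;
    final String reason;
    final long expiresAtMillis;

    NodeHealthStatus(String nodeId, String reason, long expiresAtMillis) {
        this.nodeId = nodeId;
        this.reason = reason;
        this.expiresAtMillis = expiresAtMillis;
    }
}

/** Treats every node as healthy, so wiring it in changes no behavior. */
final class NoOpNodeHealthManager implements NodeHealthManager {
    public boolean isHealthy(String nodeId) { return true; }
    public void markQuarantined(String nodeId, String reason, long expiresAtMillis) {}
    public boolean removeQuarantine(String nodeId) { return false; }
    public List<NodeHealthStatus> listAll() { return Collections.emptyList(); }
    public int cleanupExpired() { return 0; }
}

/** ConcurrentHashMap-backed variant; the clock is injected so tests can control time. */
final class DefaultNodeHealthManager implements NodeHealthManager {
    private final Map<String, NodeHealthStatus> quarantined = new ConcurrentHashMap<>();
    private final LongSupplier clock;

    DefaultNodeHealthManager(LongSupplier clock) { this.clock = clock; }

    public boolean isHealthy(String nodeId) {
        NodeHealthStatus status = quarantined.get(nodeId);
        // An expired entry no longer blocks the node, even before cleanup runs.
        return status == null || status.expiresAtMillis <= clock.getAsLong();
    }

    public void markQuarantined(String nodeId, String reason, long expiresAtMillis) {
        quarantined.put(nodeId, new NodeHealthStatus(nodeId, reason, expiresAtMillis));
    }

    public boolean removeQuarantine(String nodeId) {
        return quarantined.remove(nodeId) != null;
    }

    public List<NodeHealthStatus> listAll() {
        return new ArrayList<>(quarantined.values());
    }

    public int cleanupExpired() {
        long now = clock.getAsLong();
        int removed = 0;
        for (Map.Entry<String, NodeHealthStatus> e : quarantined.entrySet()) {
            // remove(key, value) skips entries that were replaced concurrently.
            if (e.getValue().expiresAtMillis <= now && quarantined.remove(e.getKey(), e.getValue())) {
                removed++;
            }
        }
        return removed;
    }
}
```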

This is the first phase of the generic blacklist feature and does not
change any existing behavior.

This commit implements the integration of NodeHealthManager with the slot allocation process in FineGrainedSlotManager. The changes include:

- Modified FineGrainedSlotManager to filter out quarantined nodes during slot allocation
- Updated ResourceManagerRuntimeServices to accept NodeHealthManager parameter
- Enhanced ResourceManagerFactory to pass NoOpNodeHealthManager as default
- Added comprehensive integration tests for slot filtering functionality
- Fixed compilation issues in test infrastructure
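
The filtering step described in this list can be sketched independently of Flink's classes. The `SlotFilteringSketch` class, its `TaskManagerInfo` type, and the `eligibleForAllocation` method below are hypothetical stand-ins, and the health check is passed as a plain `Predicate<String>` rather than the PR's NodeHealthManager, so this shows the idea only and not the FineGrainedSlotManager code.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Hypothetical stand-in for the slot filtering step; not Flink code. */
final class SlotFilteringSketch {

    /** Minimal view of a TaskManager: its ID, the node it runs on, and its free slots. */
    static final class TaskManagerInfo {
        final String resourceId;
        final String nodeId;
        final int freeSlots;

        TaskManagerInfo(String resourceId, String nodeId, int freeSlots) {
            this.resourceId = resourceId;
            this.nodeId = nodeId;
            this.freeSlots = freeSlots;
        }
    }

    /** Keeps only TaskManagers that have free slots and run on a healthy node. */
    static List<TaskManagerInfo> eligibleForAllocation(
            List<TaskManagerInfo> candidates, Predicate<String> isNodeHealthy) {
        return candidates.stream()
                .filter(tm -> tm.freeSlots > 0)
                .filter(tm -> isNodeHealthy.test(tm.nodeId))
                .collect(Collectors.toList());
    }
}
```

Passing a predicate that always returns true reproduces the no-op default mentioned above, which is one way the backward compatibility claim can be checked in a test.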

The implementation ensures that slots are not allocated on nodes that are marked as unhealthy by the NodeHealthManager, while maintaining backward compatibility with existing code.

Implements REST API endpoints for node quarantine management:
- POST /cluster/nodes/{nodeId}/quarantine - quarantine a node
- DELETE /cluster/nodes/{nodeId}/quarantine - remove quarantine
- GET /cluster/nodes/quarantine - list quarantined nodes
- Extended ResourceManagerGateway with quarantine methods
- Added comprehensive REST handler tests
- Implement NodeQuarantineHandler for quarantining nodes
- Implement NodeQuarantineListHandler for listing quarantined nodes
- Implement NodeRemoveQuarantineHandler for removing nodes from quarantine
- Add REST message classes for quarantine operations
- Register quarantine handlers in WebMonitorEndpoint
- Fix Checkstyle violations and apply Spotless formatting
- Remove test file due to framework complexity
- Fixed compilation errors in Headers classes by implementing RuntimeMessageHeaders
- Resolved EmptyMessageParameters import conflicts
- Updated configuration references to use BatchExecutionOptions.BLOCKLIST_ENABLED
- Fixed checkstyle violations and import ordering
- Added comprehensive API usage documentation
- Verified compilation and existing tests pass

This completes PR-4 of the FLINK-39176 Node Quarantine REST API project,
providing independent blocklist management functionality separate from
speculative execution.

…klist

- Created independent ManagementBlocklistHandler system
- Added ManagementOptions configuration class
- Updated ResourceManagerGateway with management-specific methods
- Modified REST handlers to use management blocklist APIs
- Separated configuration: cluster.management.blocklist.* vs execution.batch.speculative.*
- Updated documentation to clarify the distinction between systems

This ensures management blocklist (manual REST API) is independent
from batch execution blocklist (automatic speculative execution).

- Add SimpleManagementBlocklistTest for core functionality validation
- Add REST handler tests for BlocklistAdd/Remove/Get handlers
- Extend TestingResourceManagerGateway to support management blocklist methods
- Fix timestamp handling in DefaultManagementBlocklistHandler
- Remove obsolete BLOCKLIST_API_USAGE.md documentation

Tests verify:
- Node addition/removal operations
- Blocked status checking
- Automatic expiration cleanup
- REST API request/response handling
- Integration with ResourceManager gateway

All core functionality tests pass successfully.

- Replace complex REST handler tests with simplified SimpleBlocklistHandlerTest
- Add comprehensive edge case testing in ManagementBlocklistEdgeCasesTest
- Fix TestingResourceManagerGateway method signatures to match interface
- Update method calls to use correct names (addBlockedNode, isNodeBlocked, getCause)

Test coverage includes:
- Basic functionality validation (add/remove/check operations)
- Edge cases (null parameters, empty strings, special characters)
- Boundary conditions (very short/long durations, large node counts)
- Concurrent operations and thread safety
- Automatic expiration and cleanup mechanisms
- ResourceManager gateway integration

All tests pass successfully, providing robust validation of the management
blocklist functionality for FLINK-39176.

- Add new management_blocklist.md with complete feature documentation
- Include REST API endpoints, configuration options, and usage examples
- Update rest_api.md to reference Management Blocklist APIs
- Document integration with speculative execution and adaptive scheduler
- Provide troubleshooting guide and best practices

The documentation covers:
- Configuration options (enabled, default-duration, max-duration)
- REST API endpoints (POST/DELETE/GET /cluster/blocklist)
- Usage examples with curl and CLI
- Behavior, limitations, and best practices
- Integration with other Flink features
- Troubleshooting common issues

This completes the documentation requirements for FLINK-39176.

This PR adds comprehensive management blocklist functionality to the Flink runtime:

- Implement BlocklistHandler with management integration
- Add REST API endpoints for blocklist operations
- Integrate with ActiveResourceManager for runtime control
- Provide web monitor UI integration
- Include complete test coverage for core functionality

Signed-off-by: Feat Zhang <featzhang@apache.org>

- Rename management/blocklist package to management/nodequarantine
- Rename ManagementBlocklistHandler to ManagementNodeQuarantineHandler
- Rename config keys: cluster.management.blocklist.* to cluster.management.node-quarantine.*
- Rename REST endpoints: /cluster/blocklist to /cluster/node-quarantine
- Rename gateway methods: *ManagementBlocked* to *ManagementQuarantined*
- Rename REST handler/message classes: Blocklist* to NodeQuarantine*
- Fix FineGrainedSlotManager to check nodeHealthManager in resource allocation strategy
- Preserve existing blocklist package (used for speculative execution) unchanged

…lity in YarnResourceManagerDriverTest

BlockedNodeRetriever was extended with a second abstract method getAllBlockedNodes(),
making it no longer a functional interface. Replace the lambda expression in
YarnResourceManagerDriverTest with an anonymous class implementation that
properly implements both getAllBlockedNodeIds() and getAllBlockedNodes().
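
The Java rule behind this fix can be shown with stand-in interfaces. `SingleMethodRetriever` and `TwoMethodRetriever` below are hypothetical and only mirror the method names from the commit message. The return types are assumed to be `Set<String>` for simplicity and may not match Flink's BlockedNodeRetriever.

```java
import java.util.Collections;
import java.util.Set;

/** Stand-in with one abstract method: a valid lambda target. */
interface SingleMethodRetriever {
    Set<String> getAllBlockedNodeIds();
}

/** Stand-in with two abstract methods: no longer a functional interface. */
interface TwoMethodRetriever {
    Set<String> getAllBlockedNodeIds();

    // Return type is an assumption made for this sketch.
    Set<String> getAllBlockedNodes();
}

final class FunctionalInterfaceSketch {

    static SingleMethodRetriever viaLambda() {
        // Compiles because the target type has exactly one abstract method.
        return () -> Collections.emptySet();
    }

    static TwoMethodRetriever viaAnonymousClass() {
        // "TwoMethodRetriever r = () -> Collections.emptySet();" would be rejected by
        // the compiler, so both methods are implemented explicitly instead.
        return new TwoMethodRetriever() {
            @Override
            public Set<String> getAllBlockedNodeIds() {
                return Collections.emptySet();
            }

            @Override
            public Set<String> getAllBlockedNodes() {
                return Collections.emptySet();
            }
        };
    }
}
```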

Signed-off-by: Feat Zhang <featzhang@apache.org>

…call in TaskManagerDisconnectOnShutdownITCase

Add missing managementNodeQuarantineHandlerFactory argument to
StandaloneResourceManager constructor invocation in
TaskManagerDisconnectOnShutdownITCase. This was introduced when
PR-4 added ManagementNodeQuarantine support to StandaloneResourceManager
but the flink-tests integration test was not updated accordingly.

… node quarantine

- Add management_configuration.html for ManagementOptions (cluster.management.node-quarantine.*)
- Update expert_scheduling_section.html to include node quarantine config options
- Update optimizer_config_configuration.html for new dim-lookup-join.batch options
- Regenerate rest_api_v1.snapshot to reflect compatible API changes

This fixes ConfigOptionsDocsCompletenessITCase and RuntimeRestAPIStabilityTest failures.

…nessITCase

Remove stale table.optimizer.dim-lookup-join.batch.* entries from
generated optimizer_config_configuration.html that no longer exist
in the codebase after rebase on master.

@featzhang featzhang force-pushed the feature/FLINK-39176-blocklist-management branch from 60d0e98 to b2859d4 Compare March 4, 2026 00:57