
Conversation

@lionakhnazarov (Contributor) commented Dec 17, 2025

The Keep Core node now exposes 31+ performance metrics via the /metrics endpoint (port 9601). These metrics provide comprehensive visibility into node operations, network health, and system performance.

Integrated Metrics by Category

1. DKG (Distributed Key Generation) Metrics (6 metrics)

Counters:

  • performance_dkg_joined_total - Total number of DKG joins (members joined)
  • performance_dkg_failed_total - Total number of failed DKG executions
  • performance_dkg_validation_total - Total number of DKG result validations performed
  • performance_dkg_challenges_submitted_total - Total number of DKG challenges submitted on-chain
  • performance_dkg_approvals_submitted_total - Total number of DKG approvals submitted on-chain

Duration Metrics:

  • performance_dkg_duration_seconds - Average duration of DKG operations
  • performance_dkg_duration_seconds_count - Total count of DKG operations

Performance Insights:

  • Success Rate: dkg_joined_total / (dkg_joined_total + dkg_failed_total) - Monitor DKG participation and success rates
  • Duration Monitoring: Alert if dkg_duration_seconds exceeds 300 seconds (5 minutes) - indicates slow DKG operations
  • On-chain Activity: Track dkg_challenges_submitted_total and dkg_approvals_submitted_total to monitor dispute resolution activity
  • Validation Rate: High dkg_validation_total relative to joins indicates active validation of DKG results
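
For instance, the success-rate insight above maps to a Prometheus query roughly like the following (a sketch; the 1h window and any label matching depend on your scrape setup):

```promql
# DKG success rate over the last hour
sum(rate(performance_dkg_joined_total[1h]))
/
(
  sum(rate(performance_dkg_joined_total[1h]))
  + sum(rate(performance_dkg_failed_total[1h]))
)
```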

2. Signing Operations Metrics (5 metrics)

Counters:

  • performance_signing_operations_total - Total number of signing operations attempted
  • performance_signing_success_total - Total number of successful signing operations
  • performance_signing_failed_total - Total number of failed signing operations
  • performance_signing_timeouts_total - Total number of signing operations that timed out

Duration Metrics:

  • performance_signing_duration_seconds - Average duration of signing operations
  • performance_signing_duration_seconds_count - Total count of signing operations

Performance Insights:

  • Success Rate: signing_success_total / signing_operations_total - Critical metric for node reliability
  • Failure Rate: Alert if signing_failed_total rate > 10% of total operations
  • Timeout Rate: signing_timeouts_total / signing_operations_total - Indicates network or coordination issues
  • Performance: Alert if signing_duration_seconds exceeds 60 seconds - indicates slow signing operations
  • Throughput: Monitor signing_operations_total rate to understand signing workload
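
The failure-rate alert above could be sketched as a Prometheus alert expression like this (the 15m window is illustrative):

```promql
# Fire when more than 10% of signing operations fail over 15 minutes
sum(rate(performance_signing_failed_total[15m]))
/
sum(rate(performance_signing_operations_total[15m]))
> 0.10
```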

3. Wallet Dispatcher Metrics (7 metrics)

Counters:

  • performance_wallet_actions_total - Total number of wallet actions dispatched
  • performance_wallet_action_success_total - Total number of successfully completed wallet actions
  • performance_wallet_action_failed_total - Total number of failed wallet actions
  • performance_wallet_dispatcher_rejected_total - Total number of wallet actions rejected (wallet busy)
  • performance_wallet_heartbeat_failures_total - Total number of wallet heartbeat failures

Gauges:

  • performance_wallet_dispatcher_active_actions - Current number of wallets with active actions

Duration Metrics:

  • performance_wallet_action_duration_seconds - Average duration of wallet actions
  • performance_wallet_action_duration_seconds_count - Total count of wallet actions

Performance Insights:

  • Rejection Rate: wallet_dispatcher_rejected_total / wallet_actions_total - Alert if > 5%, which indicates wallet saturation
  • Success Rate: wallet_action_success_total / wallet_actions_total - Monitor wallet action reliability
  • Utilization: wallet_dispatcher_active_actions shows current wallet workload
  • Bottleneck Detection: High rejection rate + high active actions = wallet bottleneck
  • Health Monitoring: wallet_heartbeat_failures_total indicates wallet connectivity issues
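
As a sketch, the rejection-rate alert above could look like this in Prometheus (window per the insight; adjust to your deployment):

```promql
# Wallet saturation: rejections above 5% of dispatched actions
sum(rate(performance_wallet_dispatcher_rejected_total[15m]))
/
sum(rate(performance_wallet_actions_total[15m]))
> 0.05
```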

4. Coordination Operations Metrics (4 metrics)

Counters:

  • performance_coordination_windows_detected_total - Total number of coordination windows detected
  • performance_coordination_procedures_executed_total - Total number of coordination procedures executed successfully
  • performance_coordination_failed_total - Total number of failed coordination procedures

Duration Metrics:

  • performance_coordination_duration_seconds - Average duration of coordination procedures
  • performance_coordination_duration_seconds_count - Total count of coordination procedures

Performance Insights:

  • Execution Rate: coordination_procedures_executed_total / coordination_windows_detected_total - Success rate of coordination
  • Failure Rate: Alert if coordination_failed_total rate > 5% of detected windows
  • Window Detection: Monitor coordination_windows_detected_total to understand coordination frequency
  • Performance: Track coordination_duration_seconds to identify slow coordination operations
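
The execution-rate insight above translates to a query along these lines (sketch; the 1h window is illustrative):

```promql
# Share of detected coordination windows that executed successfully
sum(rate(performance_coordination_procedures_executed_total[1h]))
/
sum(rate(performance_coordination_windows_detected_total[1h]))
```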

5. Network Operations Metrics (10 metrics)

Peer Connection Metrics:

  • performance_peer_connections_total - Total number of peer connections established
  • performance_peer_disconnections_total - Total number of peer disconnections

Message Metrics:

  • performance_message_broadcast_total - Total number of messages broadcast to the network
  • performance_message_received_total - Total number of messages received from the network

Queue Size Metrics (Gauges):

  • performance_incoming_message_queue_size - Current size of incoming message queue (with channel label)
  • performance_message_handler_queue_size - Current size of message handler queues (with channel and handler labels)

Ping Test Metrics:

  • performance_ping_test_total - Total number of ping tests performed
  • performance_ping_test_success_total - Total number of successful ping tests
  • performance_ping_test_failed_total - Total number of failed ping tests
  • performance_ping_test_duration_seconds - Average duration of ping tests
  • performance_ping_test_duration_seconds_count - Total count of ping tests

Performance Insights:

  • Network Health: peer_connections_total vs peer_disconnections_total - Monitor connection stability
  • Message Throughput: Track message_broadcast_total and message_received_total rates
  • Queue Backlog: Alert if incoming_message_queue_size > 3000 (75% of 4096 capacity) - indicates message processing bottleneck
  • Handler Backlog: Alert if message_handler_queue_size > 400 (75% of 512 capacity) - indicates handler saturation
  • Network Latency: ping_test_duration_seconds shows network round-trip time
  • Connectivity: Alert if ping_test_failed_total rate > 10% of ping tests - indicates network issues
  • Message Balance: Compare broadcast vs received to detect message loss
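
For example, the queue-backlog alert above could be expressed like this (sketch; the channel label follows the gauge description above):

```promql
# Backlog alert: incoming queue above 75% of its 4096-message capacity
max by (channel) (performance_incoming_message_queue_size) > 3000
```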

- Introduced a new performance metrics system to monitor various operations within the Keep Core node, including wallet actions, DKG processes, signing operations, coordination procedures, and network activities.
- Metrics are recorded through a new interface, allowing for optional integration without impacting performance when disabled.
- Updated relevant components to wire in metrics recording, ensuring comprehensive coverage of critical operations.
- Added documentation detailing implemented metrics and their usage.

This enhancement provides better visibility into node performance and health, facilitating monitoring and troubleshooting.
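
The optional-integration pattern presumably resembles the nil-guarded recorder calls visible in the diff (e.g. `if wd.metricsRecorder != nil`). A minimal sketch of the idea; the interface and method names here are illustrative, not the PR's exact API:

```go
package node // illustrative placement

import "time"

// MetricsRecorder is a stand-in for the PR's recorder interface.
type MetricsRecorder interface {
	IncrementCounter(name string)
	ObserveDuration(name string, seconds float64)
}

type node struct {
	// metricsRecorder is optional; nil means metrics are disabled.
	metricsRecorder MetricsRecorder
}

// recordSigningSuccess shows the pattern: when metrics are disabled,
// the hot path pays only a nil check.
func (n *node) recordSigningSuccess(duration time.Duration) {
	if n.metricsRecorder == nil {
		return
	}
	n.metricsRecorder.IncrementCounter("performance_signing_success_total")
	n.metricsRecorder.ObserveDuration(
		"performance_signing_duration_seconds",
		duration.Seconds(),
	)
}
```
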
@lionakhnazarov marked this pull request as ready for review on December 31, 2025 at 18:43
- Introduced performance metrics for the deposit and redemption processes, including execution and proof submission metrics.
- Updated the .gitignore file to exclude new directories: data/, logs/, and storage/.
- Enhanced existing code to wire in metrics recording for redemption actions, improving visibility into redemption performance and potential bottlenecks.
- Added documentation outlining the new metrics and their implementation details.
@jose-blockchain (Contributor) left a comment


Updated recommendations:

  1. Fix the deadlock in wallet.go before merge - this will freeze the node if triggered; the issue is confirmed
  2. Add context cancellation to monitorQueueSizes - minor resource leak, not urgent but good to fix
  3. Document that metrics endpoint should be firewalled - standard practice, just worth noting in docs

The code doesn't introduce direct vulnerabilities like injection or auth bypass, and the metrics are useful operational data that node operators need. Just ensure port 9601 isn't exposed publicly (standard practice for any metrics endpoint).

```go
pm.registerAllMetrics()

// Register gauge observers for all gauges
go pm.observeGauges()
```

Just for clarity: this starts a goroutine that calls observeGauges(), which is essentially empty (lines 1077-1080). You might want to either remove the goroutine or add a TODO comment explaining future plans for it?


```go
// Configuration
checkInterval time.Duration
timeout       time.Duration
```

Just a minor comment: it's nice that you have a timeout field configured, but it doesn't seem to be used anywhere in the actual health checks. Was this intended for wrapping the RPC calls with a context timeout? Might be worth adding (see the sketch below), or removing the field if it isn't needed.
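
Something like this sketch, assuming the checker holds a context-aware client (the struct fields mirror the quoted snippet; the client field and BlockNumber probe are illustrative):

```go
package clients // illustrative placement

import (
	"context"
	"time"

	"github.com/ethereum/go-ethereum/ethclient"
)

type rpcHealthChecker struct {
	ethClient     *ethclient.Client
	checkInterval time.Duration
	timeout       time.Duration
}

// checkEthHealth bounds the RPC call with the configured timeout so a
// hung endpoint cannot block a health check indefinitely.
func (r *rpcHealthChecker) checkEthHealth(parent context.Context) error {
	ctx, cancel := context.WithTimeout(parent, r.timeout)
	defer cancel()

	// Illustrative probe: any context-aware client call works here.
	_, err := r.ethClient.BlockNumber(ctx)
	return err
}
```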

```go
}

// monitorQueueSizes periodically records queue sizes as metrics.
func (c *channel) monitorQueueSizes(recorder interface {
```

Potential suggestion: monitorQueueSizes creates its own context, but there's no way to stop it when the channel is closed, so it'll keep running forever once started. Maybe consider passing in a context from the channel, or using a done channel? A sketch follows.
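
One way to sketch that (field names, the recorder interface, and the 10s tick are illustrative; only the method's purpose comes from the PR):

```go
package libp2p // illustrative placement

import (
	"context"
	"time"
)

// QueueSizeRecorder stands in for the PR's recorder interface.
type QueueSizeRecorder interface {
	RecordQueueSize(channel string, size int)
}

type channel struct {
	name             string
	incomingMessages chan []byte // illustrative queue
}

// monitorQueueSizes records queue sizes until ctx is cancelled, so the
// goroutine no longer outlives the channel that started it.
func (c *channel) monitorQueueSizes(ctx context.Context, recorder QueueSizeRecorder) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			recorder.RecordQueueSize(c.name, len(c.incomingMessages))
		}
	}
}
```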

```go
p.broadcastChannelManager.setMetricsRecorder(recorder)
}
// Update notifiee with metrics recorder
p.host.Network().Notify(buildNotifiee(p.host, recorder))
```

Looks like buildNotifiee gets called twice: once at connection time with nil metrics, and again in SetMetricsRecorder. The second call adds a new notifiee but doesn't remove the first one. This should work fine, but you'll have two notifiees registered; just flagging it in case that wasn't intentional.
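
If only one notifiee should stay registered, one option is to remember the previous one and deregister it with libp2p's StopNotify before adding the replacement (sketch; the p.connectionNotifiee field is hypothetical):

```go
// Deregister the notifiee installed with nil metrics at connection time,
// then install the metrics-aware replacement.
if p.connectionNotifiee != nil {
	p.host.Network().StopNotify(p.connectionNotifiee)
}
p.connectionNotifiee = buildNotifiee(p.host, recorder)
p.host.Network().Notify(p.connectionNotifiee)
```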


```go
// Update metrics
if wd.metricsRecorder != nil {
	wd.actionsMutex.Lock()
```

The mutex wd.actionsMutex is already held from other Lock calls, so calling Lock() again here will deadlock: Go mutexes aren't reentrant.
I suggest removing the Lock/Unlock and just using len(wd.actions) directly, if that makes sense (maybe not). A sketch of that fix follows.
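
A sketch of that fix, assuming this code runs on a path that already holds actionsMutex (the recorder method name is illustrative):

```go
// actionsMutex is already held by the caller; sync.Mutex is not
// reentrant, so read the map length directly instead of re-locking.
if wd.metricsRecorder != nil {
	wd.metricsRecorder.RecordActiveWalletActions(len(wd.actions))
}
```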

```go
r.ethLastCheck = startTime
r.ethLastError = err
r.ethMutex.Unlock()
rpcHealthLogger.Warnf(
```

Error messages and RPC response times are logged and exposed as metrics:

  • rpc_eth_health_status - reveals whether the Ethereum RPC is down
  • rpc_btc_health_status - reveals whether the Bitcoin RPC is down
  • rpc_eth_response_time_seconds - reveals RPC latency

An attacker monitoring these metrics knows:

  • when to attack (the RPC is slow or degraded)
  • which backend service to target
  • when their DoS attack on the RPC is succeeding

Not sure this is expected.

- Updated the performance metrics initialization to accept an existing instance, preventing duplicate registrations.
- Improved error handling in the metrics observer to log duplicate registrations at the debug level instead of warnings.
- Added a method to periodically observe gauge metrics, ensuring better monitoring capabilities.