-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[improve][pip] PIP-471: Authorization operation metrics #25513
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
mattisonchao
wants to merge
3
commits into
master
Choose a base branch
from
mattison/pip-298-authorization-metrics
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+302
−0
Open
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,302 @@ | ||
| # PIP-471: Authorization Operation Metrics | ||
|
|
||
| # Background knowledge | ||
|
|
||
| Pulsar brokers perform authorization checks before allowing clients, proxies, and administrative callers to access | ||
| topics, namespaces, tenants, brokers, clusters, and policy operations. These checks are handled through the broker-side | ||
| `AuthorizationService`, which delegates decisions to the configured `AuthorizationProvider`. | ||
|
|
||
| Pulsar already exposes security-related metrics, especially around authentication. These metrics help operators detect | ||
| login failures, unhealthy clients, and changes in access patterns. However, Pulsar does not expose a generic broker-level | ||
| metric stream for authorization outcomes. Authorization denials are mostly visible through request failures and logs, | ||
| which makes them harder to alert on and harder to compare with successful authorization traffic. | ||
|
|
||
| Pulsar also supports both Prometheus-compatible metrics and OpenTelemetry metrics. New broker observability features | ||
| should keep those pipelines aligned when possible, so operators can consume equivalent signals regardless of their | ||
| metrics backend. | ||
|
|
||
| # Motivation | ||
|
|
||
| Operators need a low-cardinality, broker-native signal that shows whether authorization checks are succeeding or failing. | ||
| This is useful for security alerting, baseline monitoring, and compliance-oriented reporting. | ||
|
|
||
| Without a dedicated authorization metric, operators have to infer authorization denials from logs, HTTP status codes, or | ||
| client-side errors. That is brittle and does not support standard monitoring patterns such as: | ||
|
|
||
| - Alerting on spikes in authorization failures. | ||
| - Comparing authorization failures against successful authorizations. | ||
| - Distinguishing authentication failures from authorization failures. | ||
| - Building dashboards by authorization resource category. | ||
| - Exporting equivalent authorization signals through both Prometheus and OpenTelemetry. | ||
|
|
||
| A failure-only metric is also not sufficient. Operators often need success, failure, and error counts together to | ||
| understand whether a denial spike reflects an attack, a rollout issue, a policy mistake, an authorization provider | ||
| problem, or a normal traffic shift. | ||
|
|
||
| # Goals | ||
|
|
||
| ## In Scope | ||
|
|
||
| - Add a low-cardinality broker authorization metric for operation outcomes. | ||
| - Record successful, failed, and errored authorization operations. | ||
| - Expose the metric through the Prometheus-compatible broker metrics endpoint. | ||
| - Expose the same metric through OpenTelemetry. | ||
| - Centralize instrumentation in `AuthorizationService` so broker authorization paths share the same metric model. | ||
| - Avoid identity-bearing or high-cardinality metric dimensions. | ||
|
|
||
| ## Out of Scope | ||
|
|
||
| - Per-role, per-topic, per-tenant, per-namespace, or per-principal labels. | ||
| - Audit-log payloads or structured security event streams. | ||
| - New authorization APIs or binary protocol changes. | ||
| - Alert rule definitions for downstream monitoring stacks. | ||
| - Configuration to enable or disable this specific metric. | ||
|
|
||
| # High Level Design | ||
|
|
||
| Introduce a generic authorization operation counter that is incremented when the broker finishes an authorization | ||
| decision or rejects an authorization request before invoking the configured provider. | ||
|
|
||
| The metric is recorded centrally in `AuthorizationService`, which is the broker-side entry point for authorization checks | ||
| across topic, namespace, tenant, broker, cluster, and policy operations. Each completed provider decision or direct | ||
| authorization rejection emits one result with a small, fixed dimension set: | ||
|
|
||
| - the resource type that was checked | ||
| - the operation that was requested | ||
| - whether the result was a success, failure, or error | ||
|
|
||
| This metric is exported in two equivalent forms by the same helper class: | ||
|
|
||
| - a Prometheus counter for the existing broker metrics endpoint | ||
| - an OpenTelemetry `LongCounter` for OpenTelemetry metrics export | ||
|
|
||
| Invalid original-principal combinations in proxied authorization flows are counted as authorization failures because the | ||
| broker rejects the request during authorization handling. For valid proxied authorization flows, the broker evaluates | ||
| both the proxy role and the original principal, and each completed authorization decision is recorded. | ||
|
|
||
| # Detailed Design | ||
|
|
||
| ## Design & Implementation Details | ||
|
|
||
| This proposal introduces `org.apache.pulsar.broker.authorization.metrics.AuthorizationMetrics`, a broker authorization | ||
| metrics helper that owns: | ||
|
|
||
| - a Prometheus `Counter` for broker metrics scraping | ||
| - an OpenTelemetry `LongCounter` for OpenTelemetry metrics export | ||
|
|
||
| The helper uses the following constants for metric names, instrumentation scope, label values, and OpenTelemetry | ||
| attribute keys: | ||
|
|
||
| | Constant | Value | | ||
| |---|---| | ||
| | `AUTHORIZATION_OPERATIONS_METRIC_NAME` | `pulsar_authorization_operations_total` | | ||
| | `AUTHORIZATION_COUNTER_METRIC_NAME` | `pulsar.authorization.operation.count` | | ||
| | `INSTRUMENTATION_SCOPE_NAME` | `org.apache.pulsar.authorization` | | ||
| | `RESULT_SUCCESS` | `success` | | ||
| | `RESULT_FAILURE` | `failure` | | ||
| | `RESULT_ERROR` | `error` | | ||
| | `RESOURCE_TYPE_KEY` | `pulsar.authorization.resource.type` | | ||
| | `OPERATION_KEY` | `pulsar.authorization.operation` | | ||
| | `RESULT_KEY` | `pulsar.authorization.result` | | ||
|
|
||
| `AuthorizationMetrics` registers a static Prometheus counter with labels `resource_type`, `operation`, and `result`. | ||
| It also builds an OpenTelemetry `LongCounter` from the `OpenTelemetry` instance passed to the constructor. | ||
|
|
||
| The helper exposes three recording methods: | ||
|
|
||
| | Method | Behavior | | ||
| |---|---| | ||
| | `recordSuccess(resourceType, operation)` | Increments the Prometheus counter with `result="success"` and adds `1` to the OpenTelemetry counter with `pulsar.authorization.result="success"`. | | ||
| | `recordFailure(resourceType, operation)` | Increments the Prometheus counter with `result="failure"` and adds `1` to the OpenTelemetry counter with `pulsar.authorization.result="failure"`. | | ||
| | `recordError(resourceType, operation)` | Increments the Prometheus counter with `result="error"` and adds `1` to the OpenTelemetry counter with `pulsar.authorization.result="error"`. | | ||
|
|
||
| `AuthorizationService` owns one `AuthorizationMetrics` instance. The existing `AuthorizationService` constructor remains | ||
| available and delegates to a new constructor with `OpenTelemetry.noop()`. `BrokerService` constructs | ||
| `AuthorizationService` with `pulsar.getOpenTelemetry().getOpenTelemetry()` so the OpenTelemetry counter is exported by | ||
| the broker's OpenTelemetry pipeline. | ||
|
|
||
| `AuthorizationService` records a result after each completed authorization operation. If the provider returns `true`, the | ||
| helper records a success. If the provider returns `false`, the helper records a failure. If the provider future completes | ||
| exceptionally, the helper records an error because authorization evaluation failed before a boolean decision was returned. | ||
|
|
||
| If `AuthorizationService` rejects a request before provider evaluation, such as an invalid original-principal combination | ||
| for proxied requests, it records a failure directly and returns a completed `false` future. Existing | ||
| authorization-disabled short-circuit behavior is preserved; operation methods that already return early when | ||
| authorization is disabled do not emit this metric on that path. | ||
|
|
||
| The instrumentation applies to the following authorization flows: | ||
|
|
||
| - superuser checks | ||
| - tenant-admin checks | ||
| - tenant operations | ||
| - broker operations | ||
| - cluster operations | ||
| - cluster policy operations | ||
| - namespace operations | ||
| - namespace policy operations | ||
| - topic operations | ||
| - topic policy operations | ||
|
|
||
| The metric dimensions are intentionally bounded. The resource type is selected from a fixed set of constants in | ||
| `AuthorizationMetrics`. The operation is `check` for superuser and tenant-admin checks. For enum-backed operations, the | ||
| operation is the lower-case enum name. If an existing authorization path does not provide an operation value, the metric | ||
| uses a fixed `unknown` operation value rather than failing the request path or introducing dynamic labels. | ||
|
|
||
| The metric does not include role names, topic names, tenant names, namespace names, client addresses, provider names, | ||
| exception classes, or error messages. | ||
|
|
||
| ## Public-facing Changes | ||
|
|
||
| ### Public API | ||
|
|
||
| No public client, admin, REST, or `AuthorizationProvider` API changes. | ||
|
|
||
| ### Binary protocol | ||
|
|
||
| No binary protocol changes. | ||
|
|
||
| ### Configuration | ||
|
|
||
| No new configuration is required. | ||
|
|
||
| ### CLI | ||
|
|
||
| No CLI changes. | ||
|
|
||
| ### Metrics | ||
|
|
||
| Prometheus metric: | ||
|
|
||
| | Field | Value | | ||
| |---|---| | ||
| | Full name | `pulsar_authorization_operations_total` | | ||
| | Description | Pulsar authorization operations | | ||
| | Type | Counter | | ||
| | Labels | `resource_type`, `operation`, `result` | | ||
| | Unit | operations | | ||
|
|
||
| OpenTelemetry metric: | ||
|
|
||
| | Field | Value | | ||
| |---|---| | ||
| | Full name | `pulsar.authorization.operation.count` | | ||
| | Description | The number of authorization operations | | ||
| | Type | `LongCounter` | | ||
| | Attributes | `pulsar.authorization.resource.type`, `pulsar.authorization.operation`, `pulsar.authorization.result` | | ||
| | Unit | `{operation}` | | ||
|
|
||
| Result values: | ||
|
|
||
| | Value | Meaning | | ||
| |---|---| | ||
| | `success` | The authorization request was allowed. | | ||
| | `failure` | The authorization request was denied or rejected by authorization handling. | | ||
| | `error` | Authorization evaluation failed before an allow/deny decision was returned. | | ||
|
|
||
| Resource type values: | ||
|
|
||
| | Value | Meaning | | ||
| |---|---| | ||
| | `superuser` | Superuser authorization check. | | ||
| | `tenant_admin` | Tenant-admin authorization check. | | ||
| | `tenant` | Tenant operation authorization check. | | ||
| | `broker` | Broker operation authorization check. | | ||
| | `cluster` | Cluster operation authorization check. | | ||
| | `cluster_policy` | Cluster policy operation authorization check. | | ||
| | `namespace` | Namespace operation authorization check. | | ||
| | `namespace_policy` | Namespace policy operation authorization check. | | ||
| | `topic` | Topic operation authorization check. | | ||
| | `topic_policy` | Topic policy operation authorization check. | | ||
|
|
||
| Operation values are normalized authorization operation names. Examples include `produce`, `consume`, `lookup`, | ||
| `packages`, and `read`. Superuser and tenant-admin checks use `check`. Existing authorization paths that do not provide | ||
| a concrete operation value use `unknown`. | ||
|
|
||
| # Monitoring | ||
|
|
||
| Operators should monitor absolute authorization failures and errors, plus the relationship between failures and | ||
| successes. | ||
| Recommended patterns include: | ||
|
|
||
| - Alert on sustained increases in `result="failure"`. | ||
| - Alert on sustained increases in `result="error"`, which can indicate authorization provider failures or outages. | ||
| - Build dashboards that show `success`, `failure`, and `error` together by `resource_type`. | ||
| - Investigate rollout regressions by comparing failure rates before and after authorization policy changes. | ||
| - Correlate authorization failures with authentication metrics to distinguish authentication incidents from | ||
| authorization incidents. | ||
|
|
||
| This proposal enables ratio-based alerting because success, failure, and error outcomes are reported in the same metric | ||
| family. | ||
|
|
||
| # Security Considerations | ||
|
|
||
| This proposal improves security observability but does not change authorization semantics. | ||
|
|
||
| Authorization decisions can be high volume and may involve sensitive identifiers. The metric therefore avoids | ||
| identity-bearing labels and attributes. It does not include roles, principals, topics, namespaces, tenants, client | ||
| addresses, or error messages. This keeps the metric useful for operations without turning it into an audit-log substitute | ||
| or a high-cardinality data leak. | ||
|
|
||
| Failed proxy original-principal validation is counted as an authorization failure because the broker rejects the request | ||
| during authorization handling. | ||
|
|
||
| # Backward & Forward Compatibility | ||
|
|
||
| ## Upgrade | ||
|
|
||
| No special upgrade action is required. The new metrics appear automatically after upgrading brokers that include this | ||
| feature. | ||
|
|
||
| Monitoring systems should treat these as new metric series. Existing metrics and authorization behavior are unchanged. | ||
|
|
||
| ## Downgrade / Rollback | ||
|
|
||
| Downgrading removes the new metrics. Monitoring systems should tolerate missing-series behavior during rollback. | ||
|
|
||
| ## Pulsar Geo-Replication Upgrade & Downgrade/Rollback Considerations | ||
|
|
||
| No geo-replication protocol, metadata, or wire compatibility changes are introduced. | ||
|
|
||
| # Alternatives | ||
|
|
||
| - Failure-only counter: | ||
| Rejected because operators often need success, failure, and error counts to interpret changes correctly and to build | ||
| ratio-based alerts. | ||
|
|
||
| - OpenTelemetry-only metric: | ||
| Rejected because Pulsar still exposes Prometheus-compatible broker metrics and many deployments rely on the broker | ||
| metrics endpoint. | ||
|
|
||
| - Prometheus-only metric: | ||
| Rejected because Pulsar is adding OpenTelemetry support and new broker observability should keep equivalent | ||
| OpenTelemetry signals where practical. | ||
|
|
||
| - Detailed identity labels such as role, tenant, namespace, or topic: | ||
| Rejected due to cardinality and privacy concerns. | ||
|
|
||
| - Instrument each authorization call site independently: | ||
| Rejected because it would be error-prone and would likely produce inconsistent semantics across broker paths. | ||
|
|
||
| - Cache Prometheus label children or prebuild OpenTelemetry attributes for every resource type, operation, and result | ||
| combination: | ||
| Deferred because the initial implementation keeps the dimension set bounded and simple. This can be added later if | ||
| profiling shows metric recording overhead is significant on hot authorization paths. | ||
|
|
||
| # General Notes | ||
|
|
||
| This proposal is intentionally limited to broker metrics. It does not replace audit logging or structured security | ||
| events. | ||
|
|
||
| The metric dimensions add some per-recording overhead because Prometheus label children and OpenTelemetry attributes | ||
| must be resolved when recording. The proposed dimension set is deliberately small and bounded to keep this overhead | ||
| predictable. | ||
|
|
||
| The implementation includes focused test coverage for both metric export paths: | ||
|
|
||
| - Prometheus samples are validated through `CollectorRegistry.defaultRegistry.getSampleValue(...)`. | ||
| - OpenTelemetry samples are validated through the broker OpenTelemetry metric reader. | ||
|
|
||
| # Links | ||
|
|
||
| * Mailing List discussion thread: | ||
| * Mailing List voting thread: | ||
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this metric label represents an authorization outcome, could we use allowed/denied instead of success/failure? The explicit values will read more clearly in dashboards and PromQL queries, and avoid ambiguity about what success means here (policy evaluated successfully vs. access granted).