apache · mattisonchao · Apr 26, 2026 · lhotari
diff --git a/pip/pip-471.md b/pip/pip-471.md
@@ -0,0 +1,302 @@
+# PIP-471: Authorization Operation Metrics
+
+# Background knowledge
+
+Pulsar brokers perform authorization checks before allowing clients, proxies, and administrative callers to access
+topics, namespaces, tenants, brokers, clusters, and policy operations. These checks are handled through the broker-side
+`AuthorizationService`, which delegates decisions to the configured `AuthorizationProvider`.
+
+Pulsar already exposes security-related metrics, especially around authentication. These metrics help operators detect
+login failures, unhealthy clients, and changes in access patterns. However, Pulsar does not expose a generic broker-level
+metric stream for authorization outcomes. Authorization denials are mostly visible through request failures and logs,
+which makes them harder to alert on and harder to compare with successful authorization traffic.
+
+Pulsar also supports both Prometheus-compatible metrics and OpenTelemetry metrics. New broker observability features
+should keep those pipelines aligned when possible, so operators can consume equivalent signals regardless of their
+metrics backend.
+
+# Motivation
+
+Operators need a low-cardinality, broker-native signal that shows whether authorization checks are succeeding or failing.
+This is useful for security alerting, baseline monitoring, and compliance-oriented reporting.
+
+Without a dedicated authorization metric, operators have to infer authorization denials from logs, HTTP status codes, or
+client-side errors. That is brittle and does not support standard monitoring patterns such as:
+
+- Alerting on spikes in authorization failures.
+- Comparing authorization failures against successful authorizations.
+- Distinguishing authentication failures from authorization failures.
+- Building dashboards by authorization resource category.
+- Exporting equivalent authorization signals through both Prometheus and OpenTelemetry.
+
+A failure-only metric is also not sufficient. Operators often need success, failure, and error counts together to
+understand whether a denial spike reflects an attack, a rollout issue, a policy mistake, an authorization provider
+problem, or a normal traffic shift.
+
+# Goals
+
+## In Scope
+
+- Add a low-cardinality broker authorization metric for operation outcomes.
+- Record successful, failed, and errored authorization operations.
+- Expose the metric through the Prometheus-compatible broker metrics endpoint.
+- Expose the same metric through OpenTelemetry.
+- Centralize instrumentation in `AuthorizationService` so broker authorization paths share the same metric model.
+- Avoid identity-bearing or high-cardinality metric dimensions.
+
+## Out of Scope
+
+- Per-role, per-topic, per-tenant, per-namespace, or per-principal labels.
+- Audit-log payloads or structured security event streams.
+- New authorization APIs or binary protocol changes.
+- Alert rule definitions for downstream monitoring stacks.
+- Configuration to enable or disable this specific metric.
+
+# High Level Design
+
+Introduce a generic authorization operation counter that is incremented when the broker finishes an authorization
+decision or rejects an authorization request before invoking the configured provider.
+
+The metric is recorded centrally in `AuthorizationService`, which is the broker-side entry point for authorization checks
+across topic, namespace, tenant, broker, cluster, and policy operations. Each completed provider decision or direct
+authorization rejection emits one result with a small, fixed dimension set:
+
+- the resource type that was checked
+- the operation that was requested
+- whether the result was a success, failure, or error
+
+This metric is exported in two equivalent forms by the same helper class:
+
+- a Prometheus counter for the existing broker metrics endpoint
+- an OpenTelemetry `LongCounter` for OpenTelemetry metrics export
+
+Invalid original-principal combinations in proxied authorization flows are counted as authorization failures because the
+broker rejects the request during authorization handling. For valid proxied authorization flows, the broker evaluates
+both the proxy role and the original principal, and each completed authorization decision is recorded.
+
+# Detailed Design
+
+## Design & Implementation Details
+
+This proposal introduces `org.apache.pulsar.broker.authorization.metrics.AuthorizationMetrics`, a broker authorization
+metrics helper that owns:
+
+- a Prometheus `Counter` for broker metrics scraping
+- an OpenTelemetry `LongCounter` for OpenTelemetry metrics export
+
+The helper uses the following constants for metric names, instrumentation scope, label values, and OpenTelemetry
+attribute keys:
+
+| Constant | Value |
+|---|---|
+| `AUTHORIZATION_OPERATIONS_METRIC_NAME` | `pulsar_authorization_operations_total` |
+| `AUTHORIZATION_COUNTER_METRIC_NAME` | `pulsar.authorization.operation.count` |
+| `INSTRUMENTATION_SCOPE_NAME` | `org.apache.pulsar.authorization` |
+| `RESULT_SUCCESS` | `success` |
+| `RESULT_FAILURE` | `failure` |
+| `RESULT_ERROR` | `error` |
+| `RESOURCE_TYPE_KEY` | `pulsar.authorization.resource.type` |
+| `OPERATION_KEY` | `pulsar.authorization.operation` |
+| `RESULT_KEY` | `pulsar.authorization.result` |
+
+`AuthorizationMetrics` registers a static Prometheus counter with labels `resource_type`, `operation`, and `result`.
+It also builds an OpenTelemetry `LongCounter` from the `OpenTelemetry` instance passed to the constructor.
+
+The helper exposes three recording methods:
+
+| Method | Behavior |
+|---|---|
+| `recordSuccess(resourceType, operation)` | Increments the Prometheus counter with `result="success"` and adds `1` to the OpenTelemetry counter with `pulsar.authorization.result="success"`. |
+| `recordFailure(resourceType, operation)` | Increments the Prometheus counter with `result="failure"` and adds `1` to the OpenTelemetry counter with `pulsar.authorization.result="failure"`. |
+| `recordError(resourceType, operation)` | Increments the Prometheus counter with `result="error"` and adds `1` to the OpenTelemetry counter with `pulsar.authorization.result="error"`. |
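+
+The shape of the three recording methods can be illustrated with a minimal, standard-library-only sketch. It is not the
+proposed implementation: the class name `AuthorizationMetricsSketch`, the in-memory map, and the `get` accessor are
+assumptions made for illustration, standing in for the Prometheus `Counter` and OpenTelemetry `LongCounter` that the
+real helper would own.
+
+```java
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.atomic.LongAdder;
+
+// Hypothetical stand-in for the proposed helper. A map of LongAdders replaces
+// the Prometheus and OpenTelemetry counters so the sketch needs no dependencies.
+final class AuthorizationMetricsSketch {
+    static final String RESULT_SUCCESS = "success";
+    static final String RESULT_FAILURE = "failure";
+    static final String RESULT_ERROR = "error";
+
+    // One entry per (resource_type, operation, result) combination.
+    private final Map<String, LongAdder> series = new ConcurrentHashMap<>();
+
+    void recordSuccess(String resourceType, String operation) {
+        record(resourceType, operation, RESULT_SUCCESS);
+    }
+
+    void recordFailure(String resourceType, String operation) {
+        record(resourceType, operation, RESULT_FAILURE);
+    }
+
+    void recordError(String resourceType, String operation) {
+        record(resourceType, operation, RESULT_ERROR);
+    }
+
+    // Illustration-only accessor; returns 0 for a combination never recorded.
+    long get(String resourceType, String operation, String result) {
+        LongAdder adder = series.get(key(resourceType, operation, result));
+        return adder == null ? 0L : adder.sum();
+    }
+
+    private void record(String resourceType, String operation, String result) {
+        series.computeIfAbsent(key(resourceType, operation, result), k -> new LongAdder()).increment();
+    }
+
+    private static String key(String resourceType, String operation, String result) {
+        return resourceType + '|' + operation + '|' + result;
+    }
+}
+```
+
+All three methods funnel into one private `record` call, so the two export paths in the real helper would be updated
+from a single place and could not drift apart per result value.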
+
+`AuthorizationService` owns one `AuthorizationMetrics` instance. The existing `AuthorizationService` constructor remains
+available and delegates to a new constructor with `OpenTelemetry.noop()`. `BrokerService` constructs
+`AuthorizationService` with `pulsar.getOpenTelemetry().getOpenTelemetry()` so the OpenTelemetry counter is exported by
+the broker's OpenTelemetry pipeline.
+
+`AuthorizationService` records a result after each completed authorization operation. If the provider future completes
+with `true`, the helper records a success. If it completes with `false`, the helper records a failure. If it completes
+exceptionally, the helper records an error because authorization evaluation failed before a boolean decision was returned.
+
+If `AuthorizationService` rejects a request before provider evaluation, such as an invalid original-principal combination
+for proxied requests, it records a failure directly and returns a completed `false` future. Existing
+authorization-disabled short-circuit behavior is preserved; operation methods that already return early when
+authorization is disabled do not emit this metric on that path.
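+
+The outcome mapping described above can be sketched with `CompletableFuture.whenComplete`. This is an illustration
+under assumptions, not the code in `AuthorizationService`: the class `AuthorizationOutcomeSketch`, its `recorded` list,
+and both method names are hypothetical, and the list stands in for the metrics helper.
+
+```java
+import java.util.List;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.CopyOnWriteArrayList;
+
+// Hypothetical sketch of mapping a provider decision to a result value.
+final class AuthorizationOutcomeSketch {
+    // Result values in the order they were recorded; stands in for the helper.
+    final List<String> recorded = new CopyOnWriteArrayList<>();
+
+    // Records one result when the provider future completes. The returned
+    // future carries the same value or failure as the provider future.
+    CompletableFuture<Boolean> instrument(CompletableFuture<Boolean> providerDecision) {
+        return providerDecision.whenComplete((allowed, error) -> {
+            if (error != null) {
+                recorded.add("error");      // no allow/deny decision was produced
+            } else if (Boolean.TRUE.equals(allowed)) {
+                recorded.add("success");
+            } else {
+                recorded.add("failure");    // false, or a null value, counts as denied
+            }
+        });
+    }
+
+    // Rejection before the provider is consulted, e.g. an invalid
+    // original-principal combination: record a failure, return completed false.
+    CompletableFuture<Boolean> rejectBeforeProvider() {
+        recorded.add("failure");
+        return CompletableFuture.completedFuture(false);
+    }
+}
+```
+
+Using `whenComplete` rather than `handle` leaves the value or exception seen by callers unchanged, which matches the
+goal of adding observability without changing authorization semantics.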
+
+The instrumentation applies to the following authorization flows:
+
+- superuser checks
+- tenant-admin checks
+- tenant operations
+- broker operations
+- cluster operations
+- cluster policy operations
+- namespace operations
+- namespace policy operations
+- topic operations
+- topic policy operations
+
+The metric dimensions are intentionally bounded. The resource type is selected from a fixed set of constants in
+`AuthorizationMetrics`. The operation is `check` for superuser and tenant-admin checks. For enum-backed operations, the
+operation is the lower-case enum name. If an existing authorization path does not provide an operation value, the metric
+uses a fixed `unknown` operation value rather than failing the request path or introducing dynamic labels.
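+
+A minimal sketch of the operation normalization rule follows. The class `OperationLabelSketch`, the method
+`operationLabel`, and the enum `SampleTopicOperation` are hypothetical names introduced for illustration; the sample
+enum stands in for Pulsar's existing operation enums.
+
+```java
+import java.util.Locale;
+
+// Hypothetical sketch of deriving the bounded "operation" label value.
+final class OperationLabelSketch {
+    static final String OPERATION_CHECK = "check";      // superuser and tenant-admin checks
+    static final String OPERATION_UNKNOWN = "unknown";  // path supplied no operation
+
+    // Illustration-only enum standing in for an enum-backed operation type.
+    enum SampleTopicOperation { PRODUCE, CONSUME, LOOKUP }
+
+    // Lower-cases the enum name, or falls back to the fixed "unknown" value
+    // instead of throwing or emitting a dynamic label.
+    static String operationLabel(Enum<?> operation) {
+        if (operation == null) {
+            return OPERATION_UNKNOWN;
+        }
+        return operation.name().toLowerCase(Locale.ROOT);
+    }
+}
+```
+
+`Locale.ROOT` is used here so the label does not depend on the broker's default locale. Because the input is an enum
+or `null`, the set of possible label values stays fixed at compile time.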
+
+The metric does not include role names, topic names, tenant names, namespace names, client addresses, provider names,
+exception classes, or error messages.
+
+## Public-facing Changes
+
+### Public API
+
+No public client, admin, REST, or `AuthorizationProvider` API changes.
+
+### Binary protocol
+
+No binary protocol changes.
+
+### Configuration
+
+No new configuration is required.
+
+### CLI
+
+No CLI changes.
+
+### Metrics
+
+Prometheus metric:
+
+| Field | Value |
+|---|---|
+| Full name | `pulsar_authorization_operations_total` |
+| Description | Pulsar authorization operations |
+| Type | Counter |
+| Labels | `resource_type`, `operation`, `result` |
+| Unit | operations |
+
+OpenTelemetry metric:
+
+| Field | Value |
+|---|---|
+| Full name | `pulsar.authorization.operation.count` |
+| Description | The number of authorization operations |
+| Type | `LongCounter` |
+| Attributes | `pulsar.authorization.resource.type`, `pulsar.authorization.operation`, `pulsar.authorization.result` |
+| Unit | `{operation}` |
+
+Result values:
+
+| Value | Meaning |
+|---|---|
+| `success` | The authorization request was allowed. |
+| `failure` | The authorization request was denied or rejected by authorization handling. |
+| `error` | Authorization evaluation failed before an allow/deny decision was returned. |
+
+Resource type values:
+
+| Value | Meaning |
+|---|---|
+| `superuser` | Superuser authorization check. |
+| `tenant_admin` | Tenant-admin authorization check. |
+| `tenant` | Tenant operation authorization check. |
+| `broker` | Broker operation authorization check. |
+| `cluster` | Cluster operation authorization check. |
+| `cluster_policy` | Cluster policy operation authorization check. |
+| `namespace` | Namespace operation authorization check. |
+| `namespace_policy` | Namespace policy operation authorization check. |
+| `topic` | Topic operation authorization check. |
+| `topic_policy` | Topic policy operation authorization check. |
+
+Operation values are normalized authorization operation names. Examples include `produce`, `consume`, `lookup`,
+`packages`, and `read`. Superuser and tenant-admin checks use `check`. Existing authorization paths that do not provide
+a concrete operation value use `unknown`.
+
+# Monitoring
+
+Operators should monitor absolute authorization failures and errors, plus the relationship between failures and
+successes. Recommended patterns include:
+
+- Alert on sustained increases in `result="failure"`.
+- Alert on sustained increases in `result="error"`, which can indicate authorization provider failures or outages.
+- Build dashboards that show `success`, `failure`, and `error` together by `resource_type`.
+- Investigate rollout regressions by comparing failure rates before and after authorization policy changes.
+- Correlate authorization failures with authentication metrics to distinguish authentication incidents from
+  authorization incidents.
+
+This proposal enables ratio-based alerting because success, failure, and error outcomes are reported in the same metric
+family.
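+
+As a worked illustration of the ratio, the sketch below computes the share of denied operations from per-result
+counts. The class and method names are hypothetical, and two choices are assumptions rather than part of this
+proposal: the inputs are counter increases over one evaluation window (the counters themselves are cumulative), and
+errored operations are included in the denominator.
+
+```java
+// Hypothetical helper showing the arithmetic behind a failure-ratio alert.
+final class AuthorizationRatioSketch {
+    // Each argument is the increase of one result series over the same window.
+    // Returns 0.0 when no operations were observed, avoiding division by zero.
+    static double failureRatio(long success, long failure, long error) {
+        long total = success + failure + error;
+        return total == 0 ? 0.0 : (double) failure / total;
+    }
+}
+```
+
+For example, 90 successes, 10 failures, and 0 errors in a window give a failure ratio of 10 / 100. An alert on this
+ratio reacts to a change in the mix of outcomes rather than to overall traffic growth.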
+
+# Security Considerations
+
+This proposal improves security observability but does not change authorization semantics.
+
+Authorization decisions can be high volume and may involve sensitive identifiers. The metric therefore avoids
+identity-bearing labels and attributes. It does not include roles, principals, topics, namespaces, tenants, client
+addresses, or error messages. This keeps the metric useful for operations without turning it into an audit-log substitute
+or a high-cardinality data leak.
+
+Failed proxy original-principal validation is counted as an authorization failure because the broker rejects the request
+during authorization handling.
+
+# Backward & Forward Compatibility
+
+## Upgrade
+
+No special upgrade action is required. The new metrics appear automatically once brokers run a version that includes
+this feature.
+
+Monitoring systems should treat these as new metric series. Existing metrics and authorization behavior are unchanged.
+
+## Downgrade / Rollback
+
+Downgrading removes the new metrics. Dashboards and alert rules should tolerate the series being absent after rollback.
+
+## Pulsar Geo-Replication Upgrade & Downgrade/Rollback Considerations
+
+No geo-replication protocol, metadata, or wire compatibility changes are introduced.
+
+# Alternatives
+
+- Failure-only counter:
+  Rejected because operators often need success, failure, and error counts to interpret changes correctly and to build
+  ratio-based alerts.
+
+- OpenTelemetry-only metric:
+  Rejected because Pulsar still exposes Prometheus-compatible broker metrics and many deployments rely on the broker
+  metrics endpoint.
+
+- Prometheus-only metric:
+  Rejected because Pulsar also supports OpenTelemetry metrics, and new broker observability should keep equivalent
+  OpenTelemetry signals where practical.
+
+- Detailed identity labels such as role, tenant, namespace, or topic:
+  Rejected due to cardinality and privacy concerns.
+
+- Instrument each authorization call site independently:
+  Rejected because it would be error-prone and would likely produce inconsistent semantics across broker paths.
+
+- Cache Prometheus label children or prebuild OpenTelemetry attributes for every resource type, operation, and result
+  combination:
+  Deferred because the initial implementation keeps the dimension set bounded and simple. This can be added later if
+  profiling shows metric recording overhead is significant on hot authorization paths.
+
+# General Notes
+
+This proposal is intentionally limited to broker metrics. It does not replace audit logging or structured security
+events.
+
+The metric dimensions add some per-recording overhead because Prometheus label children and OpenTelemetry attributes
+must be resolved when recording. The proposed dimension set is deliberately small and bounded to keep this overhead
+predictable.
+
+The implementation includes focused test coverage for both metric export paths:
+
+- Prometheus samples are validated through `CollectorRegistry.defaultRegistry.getSampleValue(...)`.
+- OpenTelemetry samples are validated through the broker OpenTelemetry metric reader.
+
+# Links
+
+* Mailing List discussion thread:
+* Mailing List voting thread: