-
Notifications
You must be signed in to change notification settings - Fork 2
feat(redis): add resilient redis conn layer #511
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||
|---|---|---|---|---|
| @@ -0,0 +1,88 @@ | ||||
| # ADR-001: Resilient Redis Connection with Idempotent Command Retry | ||||
|
|
||||
| **Date:** 2026-02-25 | ||||
| **Status:** Accepted | ||||
| **Authors:** Sebastian Marcet | ||||
|
|
||||
| ## Context | ||||
|
|
||||
| Production runs on a managed Valkey (Redis-compatible) instance on DigitalOcean over TLS. Transient connection failures — network hiccups, TLS renegotiation, server maintenance — cause `Predis\Connection\ConnectionException` errors that propagate up through the application and fail business-critical operations. | ||||
|
|
||||
| The immediate trigger was audit log job dispatch (`EmitAuditLogJob`) failing during Doctrine's `onFlush` event inside `MemberService::synchronizeGroups`. The Redis write failure bubbled up through the Doctrine `UnitOfWork::commit()` and caused the entire member synchronization transaction to fail. Audit logging is non-critical and should never break business operations. | ||||
|
|
||||
| Beyond audit logging, any Redis operation (cache reads/writes, session access, rate limiting) is vulnerable to the same transient failures. | ||||
|
|
||||
| ## Decision | ||||
|
|
||||
| ### 1. Resilient Redis Connection Layer | ||||
|
|
||||
| Introduce a custom Redis driver (`predis_resilient`) that automatically retries **idempotent** commands on `ConnectionException`, with disconnect/reconnect between attempts and exponential backoff. | ||||
|
|
||||
| **Architecture:** | ||||
|
|
||||
| - `ResilientPredisConnection` extends Laravel's `PredisConnection`, overrides `command()` to catch `ConnectionException` and retry only idempotent commands | ||||
| - `ResilientPredisConnector` extends `PredisConnector`, calls `parent::connect()` to reuse all config/TLS/option handling, then wraps the `Predis\Client` in `ResilientPredisConnection` | ||||
| - `RedisResilienceServiceProvider` registers the driver via `RedisManager::extend('predis_resilient', ...)` | ||||
| - Activated by setting `REDIS_CLIENT=predis_resilient` in `.env` — zero behavior change without this flag | ||||
|
|
||||
| **Retry behavior:** | ||||
|
|
||||
| | Command type | On ConnectionException | Rationale | | ||||
| |---|---|---| | ||||
| | Idempotent (GET, SET, DEL, HSET, EXPIRE, ...) | Disconnect, reconnect, retry up to N times with exponential backoff | Executing twice produces the same result | | ||||
| | Non-idempotent (INCR, LPUSH, RPUSH, EVAL, ...) | Rethrow immediately | Command may have executed before the read-side failed; retrying could duplicate data | | ||||
|
|
||||
| **Configuration** (per-connection in `config/database.php`): | ||||
|
|
||||
| | Parameter | Env var | Default | Description | | ||||
| |---|---|---|---| | ||||
| | `retry_limit` | `REDIS_RETRY_LIMIT` | 2 | Max retry attempts (0 disables retries) | | ||||
| | `retry_delay` | `REDIS_RETRY_DELAY` | 50 | Base delay in ms (doubles each attempt: 50, 100, 200) | | ||||
|
|
||||
| ### 2. Job Dispatch Fallback for Audit Logging | ||||
|
|
||||
| Separately, `AuditLogOtlpStrategy` was updated to dispatch `EmitAuditLogJob` via `JobDispatcher::withSyncFallback()` instead of `EmitAuditLogJob::dispatch()`. This catches Redis failures at the job dispatch level and runs the audit log emission synchronously as a fallback — preventing audit logging from ever failing the parent business transaction. | ||||
|
|
||||
| ## Consequences | ||||
|
|
||||
| ### Positive | ||||
|
|
||||
| - Transient Redis failures on idempotent commands (cache GET/SET, session reads, key expiry) are automatically recovered without application-level error handling | ||||
| - Non-idempotent commands (queue pushes, counters, list operations) are never retried, preventing data duplication | ||||
| - Opt-in activation via env var — no risk to existing deployments | ||||
| - Per-connection retry configuration allows tuning (e.g., more retries for cache, fewer for workers) | ||||
| - Audit log failures can no longer crash business transactions | ||||
|
|
||||
| ### Negative | ||||
|
|
||||
| - Retry adds latency on failure (up to ~350ms with defaults: 50 + 100 + 200ms backoff) | ||||
| - `usleep()` in the retry loop blocks the PHP process during backoff — acceptable for 2-3 retries but would need async handling at higher retry counts | ||||
| - The idempotent command list is manually maintained and must be updated if new Redis commands are used | ||||
|
Comment on lines
+58
to
+60
|
||||
|
|
||||
| ### Neutral | ||||
|
|
||||
| - Queue push operations (`EVAL` with Lua scripts) are NOT retried by the resilient connection — they remain protected by `JobDispatcher::withSyncFallback` / `withDbFallback` at the application layer | ||||
| - The `predis_resilient` driver shares the same Predis `Client` configuration as `predis` — no TLS, auth, or timeout differences | ||||
|
|
||||
| ## Alternatives Considered | ||||
|
|
||||
| 1. **Predis built-in retry** — Predis does not offer connection retry on command failure (only on initial connect via `aggregate` connections). Rejected. | ||||
|
|
||||
| 2. **Retry all commands** — Would risk duplicating non-idempotent operations (double LPUSH, double INCR) when the failure occurs after the command was sent but before the response was read. Rejected. | ||||
|
|
||||
| 3. **Catch-and-ignore at every call site** — Would require wrapping every Redis call in try/catch throughout the codebase. Not maintainable. Rejected. | ||||
|
|
||||
| 4. **Switch to phpredis extension** — phpredis has built-in retry support, but would require changing the entire Redis integration layer and testing all connection configurations. Disproportionate effort for the problem at hand. Not pursued. | ||||
|
|
||||
| ## Files | ||||
|
|
||||
| | File | Purpose | | ||||
| |---|---| | ||||
| | `app/Redis/ResilientPredisConnection.php` | Connection with retry logic | | ||||
| | `app/Redis/ResilientPredisConnector.php` | Connector that swaps connection class | | ||||
| | `app/Providers/RedisResilienceServiceProvider.php` | Registers `predis_resilient` driver | | ||||
| | `config/database.php` | Added `retry_limit`, `retry_delay` to Redis connections | | ||||
| | `config/app.php` | Registered service provider | | ||||
| | `app/Audit/AuditLogOtlpStrategy.php` | Changed to `JobDispatcher::withSyncFallback` | | ||||
|
||||
| | `app/Audit/AuditLogOtlpStrategy.php` | Changed to `JobDispatcher::withSyncFallback` | |
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,38 @@ | ||||||
| <?php namespace App\Providers; | ||||||
| /** | ||||||
| * Copyright 2026 OpenStack Foundation | ||||||
| * Licensed under the Apache License, Version 2.0 (the "License"); | ||||||
| * you may not use this file except in compliance with the License. | ||||||
| * You may obtain a copy of the License at | ||||||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||||||
| * Unless required by applicable law or agreed to in writing, software | ||||||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||||||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||||||
| * See the License for the specific language governing permissions and | ||||||
| * limitations under the License. | ||||||
| **/ | ||||||
|
|
||||||
| use App\Redis\ResilientPredisConnector; | ||||||
| use Illuminate\Contracts\Foundation\Application; | ||||||
| use Illuminate\Redis\RedisManager; | ||||||
| use Illuminate\Support\ServiceProvider; | ||||||
|
|
||||||
| /** | ||||||
| * Class RedisResilienceServiceProvider | ||||||
| * | ||||||
| * Registers the "predis_resilient" Redis driver which adds automatic | ||||||
| * retry-with-reconnect for idempotent commands on transient failures. | ||||||
| * | ||||||
| * To activate, set REDIS_CLIENT=predis_resilient in your .env. | ||||||
| */ | ||||||
| class RedisResilienceServiceProvider extends ServiceProvider | ||||||
| { | ||||||
| public function boot(): void | ||||||
| { | ||||||
| $this->app->afterResolving('redis', function (RedisManager $redis, Application $app) { | ||||||
| $redis->extend('predis_resilient', function () { | ||||||
|
||||||
| $redis->extend('predis_resilient', function () { | |
| $redis->extend('predis_resilient', function (...$parameters) { |
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,125 @@ | ||||||
| <?php namespace App\Redis; | ||||||
| /** | ||||||
| * Copyright 2026 OpenStack Foundation | ||||||
| * Licensed under the Apache License, Version 2.0 (the "License"); | ||||||
| * you may not use this file except in compliance with the License. | ||||||
| * You may obtain a copy of the License at | ||||||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||||||
| * Unless required by applicable law or agreed to in writing, software | ||||||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||||||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||||||
| * See the License for the specific language governing permissions and | ||||||
| * limitations under the License. | ||||||
| **/ | ||||||
|
|
||||||
| use Illuminate\Redis\Connections\PredisConnection; | ||||||
| use Illuminate\Support\Facades\Log; | ||||||
| use Predis\Connection\ConnectionException; | ||||||
|
|
||||||
| /** | ||||||
| * Class ResilientPredisConnection | ||||||
| * | ||||||
| * Extends the default PredisConnection to add automatic retry with | ||||||
| * reconnect for idempotent Redis commands on transient connection failures. | ||||||
| * | ||||||
| * Non-idempotent commands (INCR, LPUSH, RPUSH, EVAL, etc.) are never | ||||||
| * retried because the command may have already been executed on the | ||||||
| * server before the read-side of the connection failed. | ||||||
| */ | ||||||
| class ResilientPredisConnection extends PredisConnection | ||||||
| { | ||||||
| private int $retryLimit; | ||||||
|
|
||||||
| private int $retryDelay; | ||||||
|
|
||||||
| /** | ||||||
| * Commands that are safe to retry after a connection failure. | ||||||
| * A command is safe when executing it twice produces the same result. | ||||||
| */ | ||||||
| private const IDEMPOTENT_COMMANDS = [ | ||||||
| // reads | ||||||
| 'GET', 'MGET', 'HGET', 'HGETALL', 'HMGET', 'HEXISTS', 'HLEN', 'HKEYS', 'HVALS', | ||||||
| 'LLEN', 'LRANGE', 'LINDEX', | ||||||
| 'SCARD', 'SMEMBERS', 'SISMEMBER', | ||||||
| 'ZCARD', 'ZCOUNT', 'ZRANGE', 'ZRANGEBYSCORE', 'ZREVRANGEBYSCORE', 'ZSCORE', 'ZRANK', 'ZREVRANK', | ||||||
| 'EXISTS', 'TYPE', 'TTL', 'PTTL', 'KEYS', 'SCAN', 'HSCAN', 'SSCAN', 'ZSCAN', | ||||||
| 'INFO', 'PING', 'DBSIZE', 'TIME', 'STRLEN', 'GETRANGE', | ||||||
| // idempotent writes | ||||||
| 'SET', 'SETEX', 'PSETEX', 'MSET', 'SETNX', 'GETSET', | ||||||
|
||||||
| 'SET', 'SETEX', 'PSETEX', 'MSET', 'SETNX', 'GETSET', | |
| 'SET', 'SETEX', 'PSETEX', 'MSET', |
Copilot
AI
Mar 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Several commands in the "idempotent writes" allowlist (e.g., HSET, DEL, EXPIRE, SADD, ZADD) can return different replies when executed twice (counts / 1-or-0 flags). If callers rely on these return values, a retry after a connection failure can change application behavior. Consider restricting retries to commands with stable replies on repetition, or explicitly redefining/normalizing the semantics you consider safe to retry.
Copilot
AI
Mar 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When retryLimit is 0, command() still routes into retryCommand(), which will emit the "all retries exhausted" error log before rethrowing. That’s not quite "behaves like stock PredisConnection" as described in the constructor doc. Consider short-circuiting in command() (or retryCommand()) when retryLimit <= 0 to rethrow without the retry/error-log path.
Copilot
AI
Mar 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the retry warning log, the 'error' field is always taken from the original $previous exception, so later attempts will log the wrong message if subsequent retries fail with a different ConnectionException. Use the most recent exception for logging (e.g., the caught exception for that attempt) so the log reflects the actual failure being retried.
| 'error' => $previous->getMessage(), | |
| 'error' => $lastException->getMessage(), |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,41 @@ | ||
| <?php namespace App\Redis; | ||
| /** | ||
| * Copyright 2026 OpenStack Foundation | ||
| * Licensed under the Apache License, Version 2.0 (the "License"); | ||
| * you may not use this file except in compliance with the License. | ||
| * You may obtain a copy of the License at | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| **/ | ||
|
|
||
| use Illuminate\Redis\Connectors\PredisConnector; | ||
|
|
||
| /** | ||
| * Class ResilientPredisConnector | ||
| * | ||
| * Wraps the stock PredisConnector, reusing all its config/TLS/option | ||
| * handling, and swaps the returned connection to ResilientPredisConnection. | ||
| */ | ||
| class ResilientPredisConnector extends PredisConnector | ||
| { | ||
| /** | ||
| * @inheritdoc | ||
| */ | ||
| public function connect(array $config, array $options) | ||
| { | ||
| $connection = parent::connect($config, $options); | ||
|
|
||
| $retryLimit = (int) ($config['retry_limit'] ?? 2); | ||
| $retryDelay = (int) ($config['retry_delay'] ?? 50); | ||
|
|
||
| return new ResilientPredisConnection( | ||
| $connection->client(), | ||
| $retryLimit, | ||
| $retryDelay | ||
| ); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ADR states that
AuditLogOtlpStrategywas updated to useJobDispatcher::withSyncFallback(), but that change is not part of this PR’s diffs. Either include the referenced code change in this PR, or adjust the ADR text to avoid documenting behavior that isn’t implemented here.