Investigating Unexpected Behavior in Our Application

Incident Report for OutSmart

Postmortem

WHAT HAPPENED

On May 21, OutSmart experienced a second outage caused by a different but related Redis failure. Rather than running out of memory (as in Incident #1 on May 19), the Redis cluster itself became unstable and began dropping connections. The effect was the same: user sessions, background job processing, and caching all stopped working simultaneously, causing login failures, degraded performance, and a spike in database load across the application.

We recognise this is the second Redis-related outage in three days and take that seriously.

TIMELINE (all times CEST)

14:55 - Redis cluster began experiencing connectivity failures.
15:31 - Incident detected and reported on status page. Investigation started.
15:33 - Root cause identified: Redis cluster instability causing connection failures.
15:40 - Redis cluster stabilised. Application services began recovering.
15:51 - Recovery confirmed. Infrastructure capacity expansion initiated as a structural improvement.
17:19 - Capacity upgrade completed and verified. Incident formally resolved. Monitoring continued.

Total active impact window: approximately 45 minutes (14:55 to 15:40 CEST). Service was confirmed fully stable at 17:19 CEST.
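For readers who want to check the figures, the durations can be recomputed from the timeline entries. The short Python sketch below is illustrative only: the timestamps are copied from the timeline above, while the dictionary keys and the helper name are labels chosen for this example and do not come from our tooling.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all CEST, May 21, 2026).
# The key names are illustrative labels, not terms from any internal system.
FMT = "%Y-%m-%d %H:%M"
TIMELINE = {
    "failure_start": datetime.strptime("2026-05-21 14:55", FMT),
    "detected": datetime.strptime("2026-05-21 15:31", FMT),
    "stabilised": datetime.strptime("2026-05-21 15:40", FMT),
    "resolved": datetime.strptime("2026-05-21 17:19", FMT),
}


def minutes_between(start_key, end_key):
    """Return the whole minutes elapsed between two timeline entries."""
    delta = TIMELINE[end_key] - TIMELINE[start_key]
    return int(delta.total_seconds() // 60)


impact_minutes = minutes_between("failure_start", "stabilised")
detection_minutes = minutes_between("failure_start", "detected")
resolution_minutes = minutes_between("failure_start", "resolved")

print(f"Active impact window: {impact_minutes} min")
print(f"Time from first failure to detection: {detection_minutes} min")
print(f"Time from first failure to formal resolution: {resolution_minutes} min")
```

The same calculation shows that most of the impact window elapsed before the incident was detected, which is why earlier alerting is among the preventive measures listed below.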

IMPACT

Login, session management, and background processing were affected across all OutSmart services. Database load spiked significantly during the impact window before normalising once Redis recovered.

WHAT WE ARE DOING TO PREVENT RECURRENCE

  • Both Redis servers have been expanded to handle higher peak loads and burst traffic.
  • Monitoring and threshold alerts have been enhanced to catch Redis memory pressure and connectivity issues earlier.
  • The overflow configuration has been reviewed so that it triggers sooner in future incidents.
  • We are working directly with AWS Enterprise Support to implement additional infrastructure safeguards and prevent cascading failures of this type.
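
To illustrate the kind of threshold alerting described in the list above, here is a minimal Python sketch. It is not our production monitoring. The threshold values and the function name are hypothetical, and the only Redis-specific names used are the standard INFO fields used_memory, maxmemory, and rejected_connections. Note that rejected_connections counts connections refused because of the client limit, so it is only one possible signal of connectivity trouble.

```python
# Hypothetical thresholds for illustration; the report does not state
# the values actually configured.
MEMORY_WARN_RATIO = 0.80   # alert when 80% or more of maxmemory is in use
REJECTED_CONN_WARN = 1     # alert on any newly rejected connection


def evaluate_redis_health(info, previous_rejected=0):
    """Return alert messages for one snapshot of Redis INFO fields.

    `info` is a plain dict holding the INFO fields `used_memory`,
    `maxmemory`, and `rejected_connections`. `previous_rejected` is the
    counter value seen at the previous check, so only new rejections count.
    """
    alerts = []

    # Memory pressure: only meaningful when a maxmemory limit is set (> 0).
    maxmemory = info.get("maxmemory", 0)
    used = info.get("used_memory", 0)
    if maxmemory > 0 and used / maxmemory >= MEMORY_WARN_RATIO:
        alerts.append(
            f"memory pressure: {used / maxmemory:.0%} of maxmemory in use"
        )

    # Connectivity: compare the cumulative counter with the last reading.
    new_rejected = info.get("rejected_connections", 0) - previous_rejected
    if new_rejected >= REJECTED_CONN_WARN:
        alerts.append(
            f"connectivity: {new_rejected} connections rejected since last check"
        )

    return alerts


snapshot = {"used_memory": 900, "maxmemory": 1000, "rejected_connections": 5}
for message in evaluate_redis_health(snapshot, previous_rejected=2):
    print(message)
```

Checking both signals in one pass reflects the two failure modes seen this week: memory exhaustion in Incident #1 and connection instability in this incident.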
Posted May 22, 2026 - 10:06 CEST

Resolved

The fix has been implemented successfully, and we are continuing to monitor performance.
Posted May 21, 2026 - 17:19 CEST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted May 21, 2026 - 15:51 CEST

Identified

The issue has been identified and a fix is being evaluated.
Posted May 21, 2026 - 15:33 CEST

Investigating

We are currently investigating reports of unexpected behavior within our application. Our technical team is actively working to identify the cause of these issues to address them as quickly as possible. We are committed to maintaining the highest level of service quality and will provide updates as our investigation progresses. We appreciate your patience and understanding.
Posted May 21, 2026 - 15:31 CEST
This incident affected: Integrations & Open API (Integration Page & Settings, Open API), Application (Web-Application, Login & Authentication, Customer Portal, E-Mail Delivery, SMS Service, Location Services, Routing, File Storage), and Mobile Applications (Android Mobile App, iOS Mobile App, Mobile API).