Downtime

Authentik, Plex, and 3 other services are down

Dec 24 at 10:18pm EST
Affected services
Authentik
Plex
Overseer
Fleet
Unifi Controller

Resolved
Dec 24 at 11:40pm EST

Incident Report: Service Outage - December 24, 2024

Status: Resolved

Duration: 17 minutes (22:23 - 22:40 EST)

Impact: All services unavailable

Affected Services: Plex, FleetDM, Authentik, media management applications


What happened

On December 24 at 22:23 EST, all homelab services became unavailable due to a system memory exhaustion event. The host server ran out of available RAM, causing the operating system to terminate processes and eventually become unresponsive, requiring a manual restart.

During the restart, a storage configuration issue stalled the boot process at 22:26 EST. The system did not finish booting until network-dependent storage mounted at 22:40 EST, which extended the outage by roughly 14 minutes.

Timeline

22:21 EST - Host server began experiencing severe memory pressure

22:23 EST - Services became unavailable as system exhausted all available RAM

22:24 EST - System became unresponsive, manual restart initiated

22:26 EST - System boot process stalled due to storage mount configuration issue

22:40 EST - Storage mount completed, services restored

22:42 EST - Monitoring confirmed all services operational

Root cause

Primary issue: Insufficient system resources

The host server has 16GB of RAM, which proved insufficient for the number of services running simultaneously. At the time of failure, memory usage had reached 99%, leaving no headroom for any further allocation.

The largest memory consumer (Plex Media Server) was using 5GB of RAM with no configured limits, while 30+ other services competed for the remaining capacity.
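To put those numbers in perspective, here is a rough back-of-the-envelope calculation using only the figures above (16GB total, 5GB for Plex, 30 other services as a lower bound). It ignores the operating system's own memory use, so the real share per service would be smaller still. The function name is illustrative only.

```python
def average_share_mib(total_gib, largest_gib, other_service_count):
    """Average memory left per remaining service, in MiB.

    Assumes the largest consumer keeps its current usage and that the
    rest is split evenly, which is a simplification.
    """
    remaining_mib = (total_gib - largest_gib) * 1024
    return remaining_mib / other_service_count


# Figures from this report: 16GB host, Plex at 5GB, 30+ other services.
share = average_share_mib(total_gib=16, largest_gib=5, other_service_count=30)
print(f"Average share per remaining service: {share:.0f} MiB")
```

With 30 services the average comes out at about 375 MiB each, before accounting for the kernel, page cache, or any service that needs more than the average.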

Secondary issue: Storage configuration

Network-attached storage was not properly configured to wait for network initialization during boot. When the system restarted, it couldn't access required storage and entered emergency mode until the network became available.
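This report does not say how the storage is mounted. Assuming it is an NFS or similar network filesystem declared in /etc/fstab, a minimal sketch of a check for this class of problem might look like the following. It flags network mounts that lack the `_netdev` option (marks the mount as network-dependent) or the `nofail` option (lets boot continue if the mount fails). The function name and the list of filesystem types are my own choices for illustration.

```python
# Filesystem types treated as network-backed (an assumed, non-exhaustive list).
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs", "smb3", "glusterfs", "ceph"}


def find_risky_network_mounts(fstab_text):
    """Return mount points of network filesystems missing _netdev or nofail."""
    risky = []
    for raw_line in fstab_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type not in NETWORK_FS_TYPES:
            continue
        opts = set(options.split(","))
        if "_netdev" not in opts or "nofail" not in opts:
            risky.append(mount_point)
    return risky
```

To run it against a real host, pass in the contents of /etc/fstab, for example `find_risky_network_mounts(open("/etc/fstab").read())`. An empty list means every network mount carries both options.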

Resolution

Immediate fixes applied:
- Corrected storage mount configuration to properly handle network dependencies
- System verified operational with all services running normally

Permanent fixes in progress:
- Hardware upgrade: adding RAM (16GB → 32GB minimum)
- Implementing memory limits on all services to prevent any single service from consuming excessive resources
- Deploying proactive memory monitoring with alerts before exhaustion occurs
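Memory limits only help if their sum still fits on the host. As a sketch of how a planned set of limits could be sanity-checked, the helper below adds up per-service limits and compares them with total RAM minus a reserve for the operating system. The service limits and the reserve size in the example are hypothetical placeholders, not the values actually being deployed.

```python
def check_memory_budget(limits_mib, total_mib, reserve_mib):
    """Return (fits, spare_mib) for a set of per-service memory limits.

    limits_mib:  mapping of service name -> memory limit in MiB
    total_mib:   physical RAM on the host in MiB
    reserve_mib: memory held back for the OS and page cache
    """
    committed = sum(limits_mib.values())
    spare = total_mib - reserve_mib - committed
    return spare >= 0, spare


# Hypothetical limits for illustration only.
planned = {"plex": 6144, "authentik": 2048, "fleet": 2048}
fits, spare = check_memory_budget(planned, total_mib=32 * 1024, reserve_mib=4096)
print(f"fits={fits}, spare={spare} MiB")
```

A negative spare value means the limits are overcommitted and the host could still run out of memory even if every service stays within its own limit.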

Prevention

To prevent recurrence, I'm implementing:

  1. Capacity upgrade - Adding 16-48GB of RAM (32-64GB total) to provide adequate headroom
  2. Resource controls - Enforcing memory limits on all services
  3. Proactive monitoring - Alerting on memory usage at 80% and 90% thresholds
  4. Service optimization - Migrating non-critical services to additional infrastructure
  5. Automated testing - Validating boot process and storage configuration changes before deployment
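The 80% and 90% thresholds in item 3 could be evaluated with something as small as the sketch below. It assumes a Linux host and parses text in the format of /proc/meminfo, using the `MemTotal` and `MemAvailable` fields. The function names and the "warning"/"critical" labels are illustrative. The monitoring tool actually being deployed is not specified here.

```python
def memory_used_percent(meminfo_text):
    """Compute used memory as a percentage from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])  # first token is the number (kB)
    total = values["MemTotal"]
    available = values["MemAvailable"]
    return 100.0 * (total - available) / total


def alert_level(used_percent, warn=80.0, critical=90.0):
    """Map a usage percentage to an alert level using the two thresholds."""
    if used_percent >= critical:
        return "critical"
    if used_percent >= warn:
        return "warning"
    return "ok"
```

On the host itself, this would be called as `alert_level(memory_used_percent(open("/proc/meminfo").read()))` on a schedule, with the result forwarded to whatever notification channel is in use.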

Impact summary

  • Total outage: 17 minutes
  • Data loss: None
  • Service degradation after recovery: None

All services are operating normally. No user data was affected during this incident.

Updated
Dec 24 at 10:46pm EST

Fleet recovered.

Updated
Dec 24 at 10:43pm EST

Unifi Controller recovered.

Updated
Dec 24 at 10:43pm EST

Overseer recovered.

Updated
Dec 24 at 10:42pm EST

Authentik and Plex recovered.

Updated
Dec 24 at 10:25pm EST

Unifi Controller went down.

Updated
Dec 24 at 10:24pm EST

Authentik and Plex went down.

Updated
Dec 24 at 10:21pm EST

Authentik and Plex recovered.

Updated
Dec 24 at 10:20pm EST

Overseer went down.

Updated
Dec 24 at 10:19pm EST

Fleet went down.

Created
Dec 24 at 10:18pm EST

Authentik and Plex went down.