Authentik, Plex, and 3 other services are down
Resolved
Dec 24 at 11:40pm EST
Incident Report: Service Outage - December 24, 2024
Status: Resolved
Duration: 17 minutes (22:23 - 22:40 EST)
Impact: All services unavailable
Affected Services: Plex, FleetDM, Authentik, media management applications
What happened
On December 24 at 22:23 EST, all homelab services became unavailable because the host server ran out of memory. With no RAM left, the operating system began terminating processes and eventually became unresponsive, requiring a manual restart.
During the restart, a storage configuration issue prevented the system from booting normally. The boot stalled for roughly 14 minutes (22:26 to 22:40 EST) while network-dependent storage initialized, which accounts for most of the outage.
Timeline
22:21 EST - Host server began experiencing severe memory pressure
22:23 EST - Services became unavailable as system exhausted all available RAM
22:24 EST - System became unresponsive, manual restart initiated
22:26 EST - System boot process stalled due to storage mount configuration issue
22:40 EST - Storage mount completed, services restored
22:42 EST - Monitoring confirmed all services operational
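The durations quoted in this report can be re-derived from the timeline entries. The snippet below is a minimal sketch that does the arithmetic; the dictionary keys are illustrative labels, not names from any monitoring tool.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all EST, same day).
# The key names are illustrative labels only.
TIMELINE = {
    "memory_pressure": "22:21",
    "services_down": "22:23",
    "restart_initiated": "22:24",
    "boot_stalled": "22:26",
    "services_restored": "22:40",
    "monitoring_confirmed": "22:42",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


outage = minutes_between(TIMELINE["services_down"], TIMELINE["services_restored"])
boot_stall = minutes_between(TIMELINE["boot_stalled"], TIMELINE["services_restored"])
print(f"Outage: {outage} min, of which boot stall: {boot_stall} min")
```

With the timeline values as listed, this gives a 17-minute outage, 14 minutes of which were spent waiting on the stalled boot.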
Root cause
Primary issue: Insufficient system resources
The host server has 16GB of RAM, which proved insufficient for the number of services running simultaneously. At the time of failure, memory usage had reached 99%, leaving effectively no headroom.
The largest memory consumer (Plex Media Server) was using 5GB of RAM with no configured limits, while 30+ other services competed for the remaining capacity.
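To put those numbers in perspective, here is a rough back-of-the-envelope calculation of the average RAM left for each remaining service. It is only a sketch: it uses 30 as the service count (the report says "30+", so real shares are smaller) and ignores what the operating system and page cache need.

```python
TOTAL_RAM_GB = 16      # from the report
PLEX_RAM_GB = 5        # from the report
OTHER_SERVICES = 30    # the report says "30+", so this is a lower bound


def per_service_share_gb(total_gb: float, largest_gb: float, others: int) -> float:
    """Average RAM left per remaining service after the largest consumer."""
    return (total_gb - largest_gb) / others


current = per_service_share_gb(TOTAL_RAM_GB, PLEX_RAM_GB, OTHER_SERVICES)
upgraded = per_service_share_gb(32, PLEX_RAM_GB, OTHER_SERVICES)
print(f"16 GB host: ~{current:.2f} GB per service")
print(f"32 GB host: ~{upgraded:.2f} GB per service")
```

Under these assumptions each remaining service averages about 0.37 GB today, and about 0.90 GB after an upgrade to 32GB.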
Secondary issue: Storage configuration
Network-attached storage was not properly configured to wait for network initialization during boot. When the system restarted, it couldn't access required storage and entered emergency mode until the network became available.
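This report does not say how the storage is mounted. Assuming it is declared in /etc/fstab on a systemd-based Linux host, one common safeguard is to give network mounts the `_netdev` and `nofail` options so a slow network does not push the boot into emergency mode. The sketch below flags network mounts that lack them. The sample entries, the host name `nas`, and the set of filesystem types are hypothetical.

```python
# Assumed set of network filesystem types; extend as needed.
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs"}


def find_risky_network_mounts(fstab_text: str) -> list[tuple[str, list[str]]]:
    """Return (mount_point, missing_options) for each network mount that
    lacks the `_netdev` or `nofail` option."""
    risky = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type not in NETWORK_FS_TYPES:
            continue
        present = set(options.split(","))
        missing = [opt for opt in ("_netdev", "nofail") if opt not in present]
        if missing:
            risky.append((mount_point, missing))
    return risky


# Hypothetical fstab content for illustration only.
SAMPLE = """\
# hypothetical entries for illustration
UUID=abcd-1234  /            ext4  defaults                 0 1
nas:/media      /mnt/media   nfs   defaults                 0 0
nas:/backup     /mnt/backup  nfs   defaults,_netdev,nofail  0 0
"""

for mount_point, missing in find_risky_network_mounts(SAMPLE):
    print(f"{mount_point}: missing {', '.join(missing)}")
```

Whether `nofail` is appropriate depends on the service: it lets the boot continue without the share, so anything that needs the mount should still check for it before starting.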
Resolution
Immediate fixes applied:
- Corrected storage mount configuration to properly handle network dependencies
- System verified operational with all services running normally
Permanent fixes in progress:
- Hardware upgrade: Adding RAM (16GB → 32GB minimum)
- Implementing memory limits on all services to prevent any single service from consuming excessive resources
- Deploying proactive memory monitoring with alerts before exhaustion occurs
Prevention
To prevent recurrence, I'm implementing:
- Capacity upgrade - Adding 16-48GB of RAM (32-64GB total) to provide adequate headroom
- Resource controls - Enforcing memory limits on all services
- Proactive monitoring - Alerting on memory usage at 80% and 90% thresholds
- Service optimization - Migrating non-critical services to additional infrastructure
- Automated testing - Validating boot process and storage configuration changes before deployment
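As an illustration of the 80% and 90% alert thresholds, the sketch below computes memory usage from /proc/meminfo-style text and maps it to an alert level. It assumes a Linux host and treats "used" as MemTotal minus MemAvailable. The function names and sample values are hypothetical, and the monitoring stack that will actually be deployed may measure this differently.

```python
WARN_PCT = 80.0       # thresholds from the prevention list above
CRITICAL_PCT = 90.0


def used_percent(meminfo_text: str) -> float:
    """Compute used-memory percentage from /proc/meminfo-style text,
    defined here as (MemTotal - MemAvailable) / MemTotal."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])  # first token is the number
    total = values["MemTotal"]
    available = values["MemAvailable"]
    return 100.0 * (total - available) / total


def alert_level(pct: float) -> str:
    """Map a usage percentage to 'ok', 'warning', or 'critical'."""
    if pct >= CRITICAL_PCT:
        return "critical"
    if pct >= WARN_PCT:
        return "warning"
    return "ok"


# Hypothetical sample: 16,000,000 kB total, 2,400,000 kB available.
SAMPLE = "MemTotal:       16000000 kB\nMemAvailable:    2400000 kB\n"
pct = used_percent(SAMPLE)
print(f"{pct:.1f}% used -> {alert_level(pct)}")
```

On a real Linux host the same functions could be fed the contents of /proc/meminfo on a schedule, with "warning" and "critical" results routed to whatever notification channel is in use.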
Impact summary
- Total outage: 17 minutes
- Data loss: None
- Service degradation after recovery: None
All services are operating normally. No user data was affected during this incident.
Status updates (newest first)
Dec 24 at 10:46pm EST - Fleet recovered.
Dec 24 at 10:43pm EST - Unifi Controller recovered.
Dec 24 at 10:43pm EST - Overseer recovered.
Dec 24 at 10:42pm EST - Authentik and Plex recovered.
Dec 24 at 10:25pm EST - Unifi Controller went down.
Dec 24 at 10:24pm EST - Authentik and Plex went down.
Dec 24 at 10:21pm EST - Authentik and Plex recovered.
Dec 24 at 10:20pm EST - Overseer went down.
Dec 24 at 10:19pm EST - Fleet went down.
Dec 24 at 10:18pm EST - Incident created: Authentik and Plex went down.