Incidents | kitzy.net
Incidents reported on status page for kitzy.net
https://status.kitzy.net/

Plex, Overseer, and 1 other service are down (https://status.kitzy.net/incident/773608)
- Tue, 25 Nov 2025 13:37:00 -0000: Unifi Controller recovered.
- Tue, 25 Nov 2025 13:34:01 -0000: Unifi Controller went down.
- Tue, 25 Nov 2025 04:13:25 -0000: Plex recovered.
- Tue, 25 Nov 2025 04:12:21 -0000: Overseer recovered.
- Tue, 25 Nov 2025 04:10:24 -0000: Plex went down.
- Tue, 25 Nov 2025 04:09:30 -0000: Overseer went down.

Overseer is down (https://status.kitzy.net/incident/772523)
- Sun, 23 Nov 2025 08:29:08 -0000: Overseer recovered.
- Sun, 23 Nov 2025 08:23:01 -0000: Overseer went down.

Overseer is down (https://status.kitzy.net/incident/772271)
- Sat, 22 Nov 2025 18:31:55 -0000: Overseer recovered.
- Sat, 22 Nov 2025 18:29:30 -0000: Overseer went down.

Network is down (https://status.kitzy.net/incident/768371)
- Tue, 18 Nov 2025 11:46:31 -0000: Network recovered.
- Tue, 18 Nov 2025 11:34:10 -0000: Network went down.

Authentik, Plex, and 3 other services are down (https://status.kitzy.net/incident/759677)
- Fri, 07 Nov 2025 04:39:59 -0000: Fleet recovered.
- Fri, 07 Nov 2025 04:38:46 -0000: Authentik and Plex recovered.
- Fri, 07 Nov 2025 04:37:46 -0000: Overseer and Unifi Controller recovered.
- Fri, 07 Nov 2025 04:35:01 -0000: Unifi Controller went down.
- Fri, 07 Nov 2025 04:34:41 -0000: Overseer went down.
- Fri, 07 Nov 2025 04:34:20 -0000: Fleet went down.
- Fri, 07 Nov 2025 04:33:15 -0000: Authentik went down.
- Fri, 07 Nov 2025 04:33:04 -0000: Plex went down.

Network and Unifi Controller are down (https://status.kitzy.net/incident/756705)
- Mon, 03 Nov 2025 06:44:36 -0000: Unifi Controller recovered.
- Mon, 03 Nov 2025 06:41:59 -0000: Network recovered.
- Mon, 03 Nov 2025 06:41:46 -0000: Unifi Controller went down.
- Mon, 03 Nov 2025 06:41:16 -0000: Network went down.

Authentik, Plex, and 1 other service are down (https://status.kitzy.net/incident/745241)
- Fri, 17 Oct 2025 07:16:34 -0000: Authentik and Plex recovered.
- Fri, 17 Oct 2025 07:13:14 -0000: Authentik went down.
- Fri, 17 Oct 2025 07:12:57 -0000: Plex went down.
- Fri, 17 Oct 2025 07:09:31 -0000: Unifi Controller recovered.
- Fri, 17 Oct 2025 07:07:27 -0000: Unifi Controller went down.

Plex is down (https://status.kitzy.net/incident/741491)
- Sat, 11 Oct 2025 16:39:00 -0000: Root cause analysis published (full text below).

# Root Cause Analysis: Plex Transcoding Failures on Ubuntu 25.04

**Incident date**: October 11, 2025
**Detected**: October 11, 2025 00:37 UTC
**Resolved**: October 11, 2025 00:48 UTC
**Severity**: High (media streaming unavailable)
**Author**: Kitzy

## Executive summary

Plex transcoding failed completely on clients following the Ubuntu host upgrade from 24.10 to 25.04. All transcode attempts crashed with FFmpeg errors. The root cause was an incompatibility between the LinuxServer.io Plex Docker image's bundled FFmpeg and Ubuntu 25.04. Switching to the official Plex Docker image resolved the issue immediately.

**Impact**: 100% of transcoding attempts failed for 11 minutes. Direct play was unaffected.

## Timeline

- **October 11, 2025 ~00:00**: Upgraded Beelink host from Ubuntu 24.10 to 25.04
- **October 11, 2025 00:17**: Upgraded Plex container from ls280 to ls282
- **October 11, 2025 00:37**: First transcoding failure detected on AppleTV
- **October 11, 2025 00:37-00:48**: Troubleshooting and diagnosis
- **October 11, 2025 00:48**: Switched to official Plex image (plexinc/pms-docker)
- **October 11, 2025 00:48**: Transcoding confirmed working

## Root cause

The LinuxServer.io Plex Docker image (lscr.io/linuxserver/plex) contains a custom-built FFmpeg transcoder that generates commands including the `segment_copyts` option for subtitle segmentation. This option is not recognized by the version of FFmpeg bundled in the LinuxServer image, causing the transcoder to crash with exit code 8.

While this bug exists in all versions of the LinuxServer image, it was not triggered on Ubuntu 24.10. The Ubuntu 25.04 environment (likely kernel 6.14.0 or updated DRM libraries) causes Plex's transcoder command generator to include the problematic `segment_copyts` option in its FFmpeg invocations.

**Error logs**:

```
ERROR - [Req#12c/Transcode] Unrecognized option 'segment_copyts'.
ERROR - [Req#12f/Transcode] Error splitting the argument list: Option not found
Jobs: '/usr/lib/plexmediaserver/Plex Transcoder' exit code for process 513 is 8 (failure)
```

## Contributing factors

1. **Ubuntu version management**: Running non-LTS Ubuntu (24.10), which went EOL, forcing an upgrade to the 25.04 development release
2. **Hardware requirements**: The original rationale for using 24.10 was the newer kernel required for Intel N150 hardware
3. **Image selection**: LinuxServer.io image chosen over the official image without awareness of potential compatibility issues
4. **Testing gap**: No transcoding tests performed immediately after the Ubuntu upgrade
5. **Documentation**: Ubuntu version and Plex image details not documented in the wiki

## Resolution

Switched from the LinuxServer.io Plex image to the official Plex image:

```yaml
# Before
image: lscr.io/linuxserver/plex:latest
environment:
  - PUID=996
  - PGID=988
  - VERSION=latest

# After
image: plexinc/pms-docker:latest
environment:
  - TZ=America/New_York
```

Additional changes:

- Removed PUID/PGID environment variables (not used by the official image)
- Fixed config directory ownership: `chown -R 797:797 /mnt/data/docker/volumes/plex`
- Retained hardware transcoding configuration (device passthrough and render group)

The official Plex image uses a different FFmpeg build that does not have this bug.

## Prevention and action items

### Immediate actions (completed)

- [x] Switch to official Plex image
- [x] Document Ubuntu version in Current Hardware wiki page
- [x] Create troubleshooting runbook for this issue
- [x] Verify hardware transcoding works correctly

### Short-term actions

- [ ] Evaluate Ubuntu 24.04 LTS with HWE kernel as alternative to 25.04
- [ ] Document specific hardware compatibility issue that required 24.10 originally
- [ ] Add transcoding test to post-upgrade checklist (see the smoke-test sketch below)
- [ ] Set up automated testing for critical services after infrastructure changes

### Long-term actions

- [ ] Establish OS selection criteria (prefer LTS over non-LTS)
- [ ] Document image selection rationale (official vs community-maintained)
- [ ] Implement pre-production testing environment for infrastructure changes
- [ ] Add monitoring/alerting for transcoding failures

## Lessons learned

### What went well

- Hardware acceleration configuration (device passthrough, render group) was correct
- Systematic troubleshooting approach identified the root cause quickly
- Official Plex image provided immediate resolution without data loss

### What could be improved

- **OS stability vs hardware support tradeoff**: Running a development release (Ubuntu 25.04) in production introduces unnecessary risk. Should have investigated Ubuntu 24.04 LTS + HWE kernel instead of jumping to non-LTS versions.
- **Testing after upgrades**: No transcoding test was performed after the Ubuntu upgrade. This should have been caught immediately.
- **Image selection**: No documented rationale for choosing the LinuxServer.io image over the official image. Community images may lag behind in compatibility.
- **Documentation**: The Current Hardware page showed "OS: Ubuntu" without a version number, making troubleshooting harder.

### Questions for follow-up

1. What specific hardware feature required kernel >6.8 that 24.04 LTS didn't provide?
2. Does Ubuntu 24.04 LTS + HWE kernel provide adequate support for Intel N150?
3. Should we establish a policy preferring official Docker images over community alternatives?
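The post-upgrade checklist item above calls for a transcoding test. A minimal sketch of such a smoke test is below; it assumes `ffmpeg` and `vainfo` are installed on the host and that `/dev/dri/renderD128` is the Intel render node. It exercises the host's VA-API encode path with a synthetic clip rather than Plex's bundled transcoder, so it is a proxy check, not a full end-to-end Plex test.

```sh
#!/usr/bin/env sh
# Sketch of a post-upgrade hardware-transcode smoke test.
# Assumptions (not from the incident report): ffmpeg and vainfo are installed
# on the host, and /dev/dri/renderD128 is the Intel render node.
set -e

# 1. Confirm the VA-API driver loads on the render node.
vainfo --display drm --device /dev/dri/renderD128 > /dev/null

# 2. Encode a short synthetic clip with the VA-API h264 encoder and discard
#    the output; failure here mirrors the broken-transcode symptom.
ffmpeg -v error -vaapi_device /dev/dri/renderD128 \
  -f lavfi -i testsrc2=duration=5:size=1280x720:rate=30 \
  -vf 'format=nv12,hwupload' -c:v h264_vaapi -f null - \
  && echo "PASS: VA-API h264 encode works" \
  || { echo "FAIL: VA-API h264 encode"; exit 1; }
```

A pass confirms the kernel, iHD driver, and libva stack on the new OS can still feed a hardware encoder; a Plex-side test stream would still be needed to confirm the container's own transcoder.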
## Technical details

### Environment

- Host OS: Ubuntu 25.04 (kernel 6.14.0-33-generic)
- Hardware: Beelink EQ14, Intel N150 (Alder Lake-N), 16GB RAM
- Container runtime: Docker via Portainer
- Previous image: lscr.io/linuxserver/plex:1.42.2.10156-f737b826c-ls282
- Current image: plexinc/pms-docker:1.42.2.10156-f737b826c

### Verification

Intel VA-API drivers confirmed working on host:

```
$ sudo vainfo --display drm --device /dev/dri/renderD128
vainfo: VA-API version: 1.22 (libva 2.22.0)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 25.1.2
```

Hardware transcoding confirmed active in Plex dashboard with "HW" indicator.

## Related documentation

- Troubleshooting runbook: [Plex transcoding failures on Ubuntu 25.04](https://www.notion.so/289f8d994cff818a9c2df3c6eb9e8325)
- Docker Services page: https://www.notion.so/ac72a6815fe745b4954b3b44bc80c9b4
- Current Hardware page: https://www.notion.so/beff2f7c6b4d449b88107f7554032a98

Plex is down (https://status.kitzy.net/incident/741491)
- Sat, 11 Oct 2025 01:13:18 -0000: Plex recovered.
- Sat, 11 Oct 2025 01:07:00 -0000: Plex went down.
- Sat, 11 Oct 2025 00:52:32 -0000: Plex recovered.
- Sat, 11 Oct 2025 00:48:40 -0000: Plex went down.

kitzysound.com, Authentik, and 4 other services are down (https://status.kitzy.net/incident/739575)
- Wed, 08 Oct 2025 01:09:03 -0000: Fleet recovered.
- Wed, 08 Oct 2025 01:07:38 -0000: Authentik and Plex recovered.
- Wed, 08 Oct 2025 01:06:31 -0000: Unifi Controller recovered.
- Wed, 08 Oct 2025 01:05:48 -0000: Fleet went down.
- Wed, 08 Oct 2025 01:05:17 -0000: Authentik and Plex went down.
- Wed, 08 Oct 2025 01:03:37 -0000: Unifi Controller went down.
- Wed, 08 Oct 2025 00:59:05 -0000: Authentik recovered.
- Wed, 08 Oct 2025 00:58:35 -0000: Plex recovered.
- Wed, 08 Oct 2025 00:55:03 -0000: Authentik and Plex went down.

Authentik stack failed to deploy (PostgreSQL unhealthy) (https://status.kitzy.net/incident/733500)
- Sat, 27 Sep 2025 22:27:00 -0000: Root cause analysis published (full text below).

# RCA — Authentik stack failed to deploy (PostgreSQL unhealthy)

## Summary

New deployments of the Authentik stack failed in Portainer because the PostgreSQL service was repeatedly marked **unhealthy** and (earlier) failed to start. The **root cause** was an incorrect Docker volume mount target for the official `postgres` image, which declares `VOLUME /var/lib/postgresql/data`. We mounted our named volume at the **parent** (`/var/lib/postgresql`) instead of the declared data dir. Docker then created an **anonymous child volume** at `/var/lib/postgresql/data`, **masking the real cluster** on the named volume. A non-standard nested layout (`…/data/data`) and some role/healthcheck misconfigurations amplified the symptoms and obscured the root cause.

## Impact

- **User-facing:** New logins to Authentik failed; already-authenticated sessions continued to work.
- **Downstream services affected:** AWS (SSO), Portainer, Fleet.
- **Start:** **2025-09-27 14:49 EDT**
- **End:** **2025-09-27 18:07 EDT**
- **Duration:** **3h 18m**
- **Data loss:** None observed. After restore, the schema contained **178** non-system tables; application state validated as intact.

## Timeline (EDT)

- **14:49** – Portainer deploy fails: mount/propagation errors, then `unhealthy` on PostgreSQL.
- **15:00–16:30** – Investigation finds a valid PG 17 cluster on the host volume under `…/_data/data` with ownership uid/gid 999; inside containers, `/var/lib/postgresql/data` appears empty or newly initialized. Health checks produce `role "root"/"postgres" does not exist` messages (clients connecting without a username).
- **~16:45** – Root cause identified: named volume mounted at `/var/lib/postgresql` + image’s `VOLUME /var/lib/postgresql/data` ⇒ anonymous child volume masks real data.
- **~17:00–18:00** – Recovery plan executed: logical dump (globals + `authentik` DB) from the old cluster, destroy volume, redeploy Postgres 17 with **volume mounted at `/var/lib/postgresql/data`**, restore globals and DB, add `pg_trgm` and `uuid-ossp` extensions, verify counts.
- **18:07** – Stack healthy; Authentik available; AWS/Portainer/Fleet auth restored.

## Technical Details

### Root cause

1. **Mount target mismatch for a data-dir image.** The official `postgres` image declares `VOLUME /var/lib/postgresql/data`. Mounting the named volume at the parent path (`/var/lib/postgresql`) led Docker to mount an anonymous child volume at `/var/lib/postgresql/data`, hiding the real database files. Containers subsequently saw an empty/new data directory and either failed `initdb` (“exists but not empty”) or started against the wrong path.
2. **Non-standard nested layout.** The actual cluster lived under `…/data/data`. With `PGDATA` pointed at `/var/lib/postgresql/data`, the server didn’t see `PG_VERSION` and behaved as uninitialized; pointing `PGDATA` at the nested path worked but remained non-standard and error-prone.

### Contributing factors

- **Health/monitoring confusion.** Docker healthchecks (and some client probes) connected without `-U`, defaulting to OS user `root` inside the container, producing noisy `role "root" does not exist` errors. Production monitoring used an **HTTP check from Better Stack**, which caught the outage but didn’t illuminate the DB mount issue.
- **Role drift.** The legacy cluster at one point lacked a `postgres` role; probes using `-U postgres` failed.
- **Tag drift risk.** Unpinned `library/postgres` tag increases chance of major version bumps during pull.
- **No routine backups.** There was no preexisting logical backup job for Authentik; we relied on ad-hoc dumps during incident response.

### What we tried (and why it didn’t stick)

- **Chown/chmod/ACL normalization** – necessary, but the masked child volume still hid the real cluster.
- **Changing `PGDATA` while mounting the parent** – still subject to masking by the image’s declared VOLUME.
- **Creating roles via psql** – failed until we targeted the actual `PGDATA` or used single-user mode, because the container wasn’t looking at the real data directory.

### Resolution

**Deterministic reset:**

1) Dumped roles (`pg_dumpall --globals-only`) and DB (`pg_dump -Fc authentik`) from the old cluster.
2) Removed the old volume (with an extra tarball snapshot as belt-and-suspenders).
3) Redeployed **`postgres:17`** with **named volume mounted at `/var/lib/postgresql/data`** and `PGDATA=/var/lib/postgresql/data`.
4) Restored globals and DB; ensured `pg_trgm` / `uuid-ossp` present; validated 178 non-system tables owned by `authentik`.
5) Updated healthcheck to a **user-agnostic** probe (`pg_isready -q -h 127.0.0.1 -p 5432`).

## Corrective & Preventive Actions

### Config standards (immediately)

- **Pin the image** to `postgres:17` in Compose.
- **Mount at the child path:** always mount the named volume at **`/var/lib/postgresql/data`** for the official image. Do not mount the parent.
- **Standardize DB env consumption:** both `auth-server` and `auth-worker` read `${POSTGRES_USER}`, `${POSTGRES_DB}`, `${POSTGRES_PASSWORD}` (no hardcoding).
- **Healthcheck:** use a user-agnostic probe (TCP or `pg_isready` with host/port only).
  Avoid checks that require a specific DB user.

### Operational guardrails (this week)

- **Pre-flight verification script** (run before deploy):
  `docker run --rm -v <vol>:/var/lib/postgresql/data --entrypoint sh postgres:17 -lc 'test -f /var/lib/postgresql/data/PG_VERSION || (echo "PG_VERSION missing at target"; ls -la /var/lib/postgresql/data; exit 1)'`
- **Detect anonymous child volumes:** flag when a container has both a named mount at `/var/lib/postgresql` and any mount at `/var/lib/postgresql/data`.
- **Monitoring:** keep Better Stack HTTP checks; add a DB socket/TCP liveness check on port 5432 from within the host or a sidecar to reduce false attribution.

### Backups & restore (this week)

- **Nightly logical backups:**
  - `pg_dumpall --globals-only > pg-globals.sql`
  - `pg_dump -Fc authentik > authentik.dump`
  - Retention: 14–30 days; store off-box (MinIO/S3).
- **Quarterly restore drill** into a throwaway container to ensure backups are viable.

### Documentation / runbooks (this week)

- “Postgres in Docker” one-pager: **must mount `/var/lib/postgresql/data`**, effects of the image’s `VOLUME`, and nested layout pitfalls.
- “Role repair via single-user mode” snippet (for missing `postgres`).
- “Restore procedure” (drop/recreate DB vs temp DB swap).

### Owners & dates

- **Compose standardization & pinning:** **@kitzy** — **Done** (verify in repo).
- **Pre-flight check in CI/Portainer template:** **@kitzy** — **by EOW**.
- **Backups to S3/MinIO + retention policy:** **@kitzy** — **by EOW**.
- **Runbooks (deploy, backup, restore, role repair):** **@kitzy** — **by EOW**.
- **Monitoring additions (DB TCP liveness):** **@kitzy** — **by EOW**.

## Evidence

- Host volume: `…/authentik_postgresql/_data/data/PG_VERSION` (PG 17) with full cluster files.
- Inside-container (parent mount): `/var/lib/postgresql/data` initially empty or re-initialized; after correct mount (child path), cluster is visible.
- Successful restore: `current_user = authentik`, `count = 178` non-system tables.
- Final Compose: `Destination=/var/lib/postgresql/data` mounted from named volume; user-agnostic healthcheck.
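To make the config standards and backup items above concrete, here is a minimal `docker run` sketch rather than the actual Compose stack; the container name `authentik-postgres`, the backup output path, and the cron-style usage are illustrative assumptions, while the pinned tag, child-path mount, named volume, and `pg_isready` probe follow the corrective actions listed above.

```sh
#!/usr/bin/env sh
# Sketch only: pinned image, named volume at the image's declared data dir
# (the child path), user-agnostic healthcheck, and nightly logical backups.
# Container name "authentik-postgres" and /var/backups are assumptions.
set -e

docker run -d --name authentik-postgres \
  -e POSTGRES_USER=authentik \
  -e POSTGRES_DB=authentik \
  -e POSTGRES_PASSWORD="$POSTGRES_PASSWORD" \
  -v authentik_postgresql:/var/lib/postgresql/data \
  --health-cmd 'pg_isready -q -h 127.0.0.1 -p 5432' \
  --health-interval 10s --health-retries 5 \
  postgres:17

# Nightly logical backups (cron-able): globals plus the authentik DB,
# written to the host so they can be shipped off-box (MinIO/S3).
docker exec authentik-postgres pg_dumpall -U authentik --globals-only \
  > /var/backups/pg-globals.sql
docker exec authentik-postgres pg_dump -U authentik -Fc authentik \
  > /var/backups/authentik.dump
```

In the Compose deployment the equivalent is a pinned `image: postgres:17`, a named volume targeting `/var/lib/postgresql/data`, and a `healthcheck` block wrapping the same `pg_isready` probe.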
Authentik stack failed to deploy (PostgreSQL unhealthy) (https://status.kitzy.net/incident/733500)
- Sat, 27 Sep 2025 22:07:00 -0000: Stack healthy; Authentik available; AWS/Portainer/Fleet auth restored.
- Sat, 27 Sep 2025 21:37:00 -0000: Recovery plan executed: logical dump (globals + `authentik` DB) from the old cluster, destroy volume, redeploy Postgres 17 with **volume mounted at `/var/lib/postgresql/data`**, restore globals and DB, add `pg_trgm` and `uuid-ossp` extensions, verify counts.
- Sat, 27 Sep 2025 20:45:00 -0000: Root cause identified: named volume mounted at `/var/lib/postgresql` + image’s `VOLUME /var/lib/postgresql/data` ⇒ anonymous child volume masks real data.
- Sat, 27 Sep 2025 20:02:00 -0000: Investigation finds valid PG 17 cluster on host volume under `…/_data/data` with ownership uid/gid 999; inside containers, `/var/lib/postgresql/data` appears empty or newly initialized. Health checks produce `role "root"/"postgres" does not exist` messages (clients connecting without a username).
- Sat, 27 Sep 2025 19:28:00 -0000: There was an issue updating a postgres container, I am investigating a fix. Authentication to AWS, Portainer, and Fleet is impacted. Existing user sessions should continue to work.
- Sat, 27 Sep 2025 18:50:13 -0000: Authentik went down.