Incidents | kitzy.net
Incidents reported on status page for kitzy.net
https://status.kitzy.net/

Plex, Overseer, and 1 other service are down (https://status.kitzy.net/incident/773608)
- Tue, 25 Nov 2025 13:37:00 -0000: Unifi Controller recovered.
- Tue, 25 Nov 2025 13:34:01 -0000: Unifi Controller went down.
- Tue, 25 Nov 2025 04:13:25 -0000: Plex recovered.
- Tue, 25 Nov 2025 04:12:21 -0000: Overseer recovered.
- Tue, 25 Nov 2025 04:10:24 -0000: Plex went down.
- Tue, 25 Nov 2025 04:09:30 -0000: Overseer went down.

Overseer is down (https://status.kitzy.net/incident/772523)
- Sun, 23 Nov 2025 08:29:08 -0000: Overseer recovered.
- Sun, 23 Nov 2025 08:23:01 -0000: Overseer went down.

Overseer is down (https://status.kitzy.net/incident/772271)
- Sat, 22 Nov 2025 18:31:55 -0000: Overseer recovered.
- Sat, 22 Nov 2025 18:29:30 -0000: Overseer went down.

Network is down (https://status.kitzy.net/incident/768371)
- Tue, 18 Nov 2025 11:46:31 -0000: Network recovered.
- Tue, 18 Nov 2025 11:34:10 -0000: Network went down.

Authentik, Plex, and 3 other services are down (https://status.kitzy.net/incident/759677)
- Fri, 07 Nov 2025 04:39:59 -0000: Fleet recovered.
- Fri, 07 Nov 2025 04:38:46 -0000: Authentik and Plex recovered.
- Fri, 07 Nov 2025 04:37:46 -0000: Overseer and Unifi Controller recovered.
- Fri, 07 Nov 2025 04:35:01 -0000: Unifi Controller went down.
- Fri, 07 Nov 2025 04:34:41 -0000: Overseer went down.
- Fri, 07 Nov 2025 04:34:20 -0000: Fleet went down.
- Fri, 07 Nov 2025 04:33:15 -0000: Authentik went down.
- Fri, 07 Nov 2025 04:33:04 -0000: Plex went down.

Network and Unifi Controller are down (https://status.kitzy.net/incident/756705)
- Mon, 03 Nov 2025 06:44:36 -0000: Unifi Controller recovered.
- Mon, 03 Nov 2025 06:41:59 -0000: Network recovered.
- Mon, 03 Nov 2025 06:41:46 -0000: Unifi Controller went down.
- Mon, 03 Nov 2025 06:41:16 -0000: Network went down.

Authentik, Plex, and 1 other service are down (https://status.kitzy.net/incident/745241)
- Fri, 17 Oct 2025 07:16:34 -0000: Authentik and Plex recovered.
- Fri, 17 Oct 2025 07:13:14 -0000: Authentik went down.
- Fri, 17 Oct 2025 07:12:57 -0000: Plex went down.
- Fri, 17 Oct 2025 07:09:31 -0000: Unifi Controller recovered.
- Fri, 17 Oct 2025 07:07:27 -0000: Unifi Controller went down.

Plex is down (https://status.kitzy.net/incident/741491)
- Sat, 11 Oct 2025 16:39:00 -0000: Root cause analysis published (full text below).

# Root Cause Analysis: Plex Transcoding Failures on Ubuntu 25.04

**Incident date**: October 11, 2025
**Detected**: October 11, 2025 00:37 UTC
**Resolved**: October 11, 2025 00:48 UTC
**Severity**: High (media streaming unavailable)
**Author**: Kitzy

## Executive summary

Plex transcoding failed completely on clients following the Ubuntu host upgrade from 24.10 to 25.04. All transcode attempts crashed with FFmpeg errors. The root cause was an incompatibility between the LinuxServer.io Plex Docker image's bundled FFmpeg and Ubuntu 25.04. Switching to the official Plex Docker image resolved the issue immediately.

**Impact**: 100% of transcoding attempts failed for 11 minutes. Direct play was unaffected.

## Timeline

- **October 11, 2025 ~00:00**: Upgraded Beelink host from Ubuntu 24.10 to 25.04
- **October 11, 2025 00:17**: Upgraded Plex container from ls280 to ls282
- **October 11, 2025 00:37**: First transcoding failure detected on AppleTV
- **October 11, 2025 00:37-00:48**: Troubleshooting and diagnosis
- **October 11, 2025 00:48**: Switched to official Plex image (plexinc/pms-docker)
- **October 11, 2025 00:48**: Transcoding confirmed working

## Root cause

The LinuxServer.io Plex Docker image (lscr.io/linuxserver/plex) contains a custom-built FFmpeg transcoder that generates commands including the `segment_copyts` option for subtitle segmentation. This option is not recognized by the version of FFmpeg bundled in the LinuxServer image, causing the transcoder to crash with exit code 8.

While this bug exists in all versions of the LinuxServer image, it was not triggered on Ubuntu 24.10. The Ubuntu 25.04 environment (likely kernel 6.14.0 or updated DRM libraries) causes Plex's transcoder command generator to include the problematic `segment_copyts` option in its FFmpeg invocations.

**Error logs**:

```
ERROR - [Req#12c/Transcode] Unrecognized option 'segment_copyts'.
ERROR - [Req#12f/Transcode] Error splitting the argument list: Option not found
Jobs: '/usr/lib/plexmediaserver/Plex Transcoder' exit code for process 513 is 8 (failure)
```

## Contributing factors

1. **Ubuntu version management**: Running non-LTS Ubuntu (24.10), which went EOL, forcing an upgrade to the 25.04 development release
2. **Hardware requirements**: The original rationale for using 24.10 was the newer kernel required for Intel N150 hardware
3. **Image selection**: LinuxServer.io image chosen over the official image without awareness of potential compatibility issues
4. **Testing gap**: No transcoding tests performed immediately after the Ubuntu upgrade
5. **Documentation**: Ubuntu version and Plex image details not documented in the wiki

## Resolution

Switched from the LinuxServer.io Plex image to the official Plex image:

```yaml
# Before
image: lscr.io/linuxserver/plex:latest
environment:
  - PUID=996
  - PGID=988
  - VERSION=latest

# After
image: plexinc/pms-docker:latest
environment:
  - TZ=America/New_York
```

Additional changes:

- Removed PUID/PGID environment variables (not used by the official image)
- Fixed config directory ownership: `chown -R 797:797 /mnt/data/docker/volumes/plex`
- Retained hardware transcoding configuration (device passthrough and render group)

The official Plex image uses a different FFmpeg build that does not have this bug.

## Prevention and action items

### Immediate actions (completed)

- [x] Switch to official Plex image
- [x] Document Ubuntu version in Current Hardware wiki page
- [x] Create troubleshooting runbook for this issue
- [x] Verify hardware transcoding works correctly

### Short-term actions

- [ ] Evaluate Ubuntu 24.04 LTS with HWE kernel as alternative to 25.04
- [ ] Document specific hardware compatibility issue that required 24.10 originally
- [ ] Add transcoding test to post-upgrade checklist (see the smoke-test sketch below)
- [ ] Set up automated testing for critical services after infrastructure changes

### Long-term actions

- [ ] Establish OS selection criteria (prefer LTS over non-LTS)
- [ ] Document image selection rationale (official vs community-maintained)
- [ ] Implement pre-production testing environment for infrastructure changes
- [ ] Add monitoring/alerting for transcoding failures

## Lessons learned

### What went well

- Hardware acceleration configuration (device passthrough, render group) was correct
- Systematic troubleshooting approach identified the root cause quickly
- Official Plex image provided immediate resolution without data loss

### What could be improved

- **OS stability vs hardware support tradeoff**: Running a development release (Ubuntu 25.04) in production introduces unnecessary risk. Should have investigated Ubuntu 24.04 LTS + HWE kernel instead of jumping to non-LTS versions.
- **Testing after upgrades**: No transcoding test was performed after the Ubuntu upgrade. This should have been caught immediately.
- **Image selection**: No documented rationale for choosing the LinuxServer.io image over the official image. Community images may lag behind in compatibility.
- **Documentation**: The Current Hardware page showed "OS: Ubuntu" without a version number, making troubleshooting harder.

### Questions for follow-up

1. What specific hardware feature required kernel >6.8 that 24.04 LTS didn't provide?
2. Does Ubuntu 24.04 LTS + HWE kernel provide adequate support for Intel N150?
3. Should we establish a policy preferring official Docker images over community alternatives?
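The post-upgrade checklist item above calls for a transcoding test. A minimal sketch of such a smoke test is below; it assumes `ffmpeg` and `vainfo` are installed on the host and that `/dev/dri/renderD128` is the Intel render node. It exercises the host's VA-API encode path with a synthetic clip rather than Plex's bundled transcoder, so it is a proxy check, not a full end-to-end Plex test.

```sh
#!/usr/bin/env sh
# Sketch of a post-upgrade hardware-transcode smoke test.
# Assumptions (not from the incident report): ffmpeg and vainfo are installed
# on the host, and /dev/dri/renderD128 is the Intel render node.
set -e

# 1. Confirm the VA-API driver loads on the render node.
vainfo --display drm --device /dev/dri/renderD128 > /dev/null

# 2. Encode a short synthetic clip with the VA-API h264 encoder and discard
#    the output; failure here mirrors the broken-transcode symptom.
ffmpeg -v error -vaapi_device /dev/dri/renderD128 \
  -f lavfi -i testsrc2=duration=5:size=1280x720:rate=30 \
  -vf 'format=nv12,hwupload' -c:v h264_vaapi -f null - \
  && echo "PASS: VA-API h264 encode works" \
  || { echo "FAIL: VA-API h264 encode"; exit 1; }
```

A pass confirms the kernel, iHD driver, and libva stack on the new OS can still feed a hardware encoder; a Plex-side test stream would still be needed to confirm the container's own transcoder.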
## Technical details

### Environment

- Host OS: Ubuntu 25.04 (kernel 6.14.0-33-generic)
- Hardware: Beelink EQ14, Intel N150 (Alder Lake-N), 16GB RAM
- Container runtime: Docker via Portainer
- Previous image: lscr.io/linuxserver/plex:1.42.2.10156-f737b826c-ls282
- Current image: plexinc/pms-docker:1.42.2.10156-f737b826c

### Verification

Intel VA-API drivers confirmed working on host:

```
$ sudo vainfo --display drm --device /dev/dri/renderD128
vainfo: VA-API version: 1.22 (libva 2.22.0)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 25.1.2
```

Hardware transcoding confirmed active in Plex dashboard with "HW" indicator.

## Related documentation

- Troubleshooting runbook: [Plex transcoding failures on Ubuntu 25.04](https://www.notion.so/289f8d994cff818a9c2df3c6eb9e8325)
- Docker Services page: https://www.notion.so/ac72a6815fe745b4954b3b44bc80c9b4
- Current Hardware page: https://www.notion.so/beff2f7c6b4d449b88107f7554032a98

Plex is down (https://status.kitzy.net/incident/741491)
- Sat, 11 Oct 2025 01:13:18 -0000: Plex recovered.
- Sat, 11 Oct 2025 01:07:00 -0000: Plex went down.
- Sat, 11 Oct 2025 00:52:32 -0000: Plex recovered.
- Sat, 11 Oct 2025 00:48:40 -0000: Plex went down.

kitzysound.com, Authentik, and 4 other services are down (https://status.kitzy.net/incident/739575)
- Wed, 08 Oct 2025 01:09:03 -0000: Fleet recovered.
- Wed, 08 Oct 2025 01:07:38 -0000: Authentik and Plex recovered.
- Wed, 08 Oct 2025 01:06:31 -0000: Unifi Controller recovered.
- Wed, 08 Oct 2025 01:05:48 -0000: Fleet went down.
- Wed, 08 Oct 2025 01:05:17 -0000: Authentik and Plex went down.
- Wed, 08 Oct 2025 01:03:37 -0000: Unifi Controller went down.
- Wed, 08 Oct 2025 00:59:05 -0000: Authentik recovered.
- Wed, 08 Oct 2025 00:58:35 -0000: Plex recovered.
- Wed, 08 Oct 2025 00:55:03 -0000: Authentik and Plex went down.

Authentik stack failed to deploy (PostgreSQL unhealthy) (https://status.kitzy.net/incident/733500)
- Sat, 27 Sep 2025 22:27:00 -0000: Root cause analysis published (full text below).

# RCA — Authentik stack failed to deploy (PostgreSQL unhealthy)

## Summary

New deployments of the Authentik stack failed in Portainer because the PostgreSQL service was repeatedly marked **unhealthy** and (earlier) failed to start. The **root cause** was an incorrect Docker volume mount target for the official `postgres` image, which declares `VOLUME /var/lib/postgresql/data`. We mounted our named volume at the **parent** (`/var/lib/postgresql`) instead of the declared data dir. Docker then created an **anonymous child volume** at `/var/lib/postgresql/data`, **masking the real cluster** on the named volume. A non-standard nested layout (`…/data/data`) and some role/healthcheck misconfigurations amplified the symptoms and obscured the root cause.

## Impact

- **User-facing:** New logins to Authentik failed; already-authenticated sessions continued to work.
- **Downstream services affected:** AWS (SSO), Portainer, Fleet.
- **Start:** **2025-09-27 14:49 EDT**
- **End:** **2025-09-27 18:07 EDT**
- **Duration:** **3h 18m**
- **Data loss:** None observed. After restore, the schema contained **178** non-system tables; application state validated as intact.

## Timeline (EDT)

- **14:49** – Portainer deploy fails: mount/propagation errors, then `unhealthy` on PostgreSQL.
- **15:00–16:30** – Investigation finds a valid PG 17 cluster on the host volume under `…/_data/data` with ownership uid/gid 999; inside containers, `/var/lib/postgresql/data` appears empty or newly initialized. Health checks produce `role "root"/"postgres" does not exist` messages (clients connecting without a username).
- **~16:45** – Root cause identified: named volume mounted at `/var/lib/postgresql` + image’s `VOLUME /var/lib/postgresql/data` ⇒ anonymous child volume masks real data.
- **~17:00–18:00** – Recovery plan executed: logical dump (globals + `authentik` DB) from the old cluster, destroy volume, redeploy Postgres 17 with **volume mounted at `/var/lib/postgresql/data`**, restore globals and DB, add `pg_trgm` and `uuid-ossp` extensions, verify counts.
- **18:07** – Stack healthy; Authentik available; AWS/Portainer/Fleet auth restored.

## Technical Details

### Root cause

1. **Mount target mismatch for a data-dir image.** The official `postgres` image declares `VOLUME /var/lib/postgresql/data`. Mounting the named volume at the parent path (`/var/lib/postgresql`) led Docker to mount an anonymous child volume at `/var/lib/postgresql/data`, hiding the real database files. Containers subsequently saw an empty/new data directory and either failed `initdb` (“exists but not empty”) or started against the wrong path.
2. **Non-standard nested layout.** The actual cluster lived under `…/data/data`. With `PGDATA` pointed at `/var/lib/postgresql/data`, the server didn’t see `PG_VERSION` and behaved as uninitialized; pointing `PGDATA` at the nested path worked but remained non-standard and error-prone.

### Contributing factors

- **Health/monitoring confusion.** Docker healthchecks (and some client probes) connected without `-U`, defaulting to OS user `root` inside the container, producing noisy `role "root" does not exist` errors. Production monitoring used an **HTTP check from Better Stack**, which caught the outage but didn’t illuminate the DB mount issue.
- **Role drift.** The legacy cluster at one point lacked a `postgres` role; probes using `-U postgres` failed.
- **Tag drift risk.** Unpinned `library/postgres` tag increases chance of major version bumps during pull.
- **No routine backups.** There was no preexisting logical backup job for Authentik; we relied on ad-hoc dumps during incident response.

### What we tried (and why it didn’t stick)

- **Chown/chmod/ACL normalization** – necessary, but the masked child volume still hid the real cluster.
- **Changing `PGDATA` while mounting the parent** – still subject to masking by the image’s declared VOLUME.
- **Creating roles via psql** – failed until we targeted the actual `PGDATA` or used single-user mode, because the container wasn’t looking at the real data directory.

### Resolution

**Deterministic reset:**

1) Dumped roles (`pg_dumpall --globals-only`) and DB (`pg_dump -Fc authentik`) from the old cluster.
2) Removed the old volume (with an extra tarball snapshot as belt-and-suspenders).
3) Redeployed **`postgres:17`** with **named volume mounted at `/var/lib/postgresql/data`** and `PGDATA=/var/lib/postgresql/data`.
4) Restored globals and DB; ensured `pg_trgm` / `uuid-ossp` present; validated 178 non-system tables owned by `authentik`.
5) Updated healthcheck to a **user-agnostic** probe (`pg_isready -q -h 127.0.0.1 -p 5432`).

## Corrective & Preventive Actions

### Config standards (immediately)

- **Pin the image** to `postgres:17` in Compose.
- **Mount at the child path:** always mount the named volume at **`/var/lib/postgresql/data`** for the official image. Do not mount the parent.
- **Standardize DB env consumption:** both `auth-server` and `auth-worker` read `${POSTGRES_USER}`, `${POSTGRES_DB}`, `${POSTGRES_PASSWORD}` (no hardcoding).
- **Healthcheck:** use a user-agnostic probe (TCP or `pg_isready` with host/port only).
  Avoid checks that require a specific DB user.

### Operational guardrails (this week)

- **Pre-flight verification script** (run before deploy):
  `docker run --rm -v <vol>:/var/lib/postgresql/data --entrypoint sh postgres:17 -lc 'test -f /var/lib/postgresql/data/PG_VERSION || (echo "PG_VERSION missing at target"; ls -la /var/lib/postgresql/data; exit 1)'`
- **Detect anonymous child volumes:** flag when a container has both a named mount at `/var/lib/postgresql` and any mount at `/var/lib/postgresql/data`.
- **Monitoring:** keep Better Stack HTTP checks; add a DB socket/TCP liveness check on port 5432 from within the host or a sidecar to reduce false attribution.

### Backups & restore (this week)

- **Nightly logical backups:**
  - `pg_dumpall --globals-only > pg-globals.sql`
  - `pg_dump -Fc authentik > authentik.dump`
  - Retention: 14–30 days; store off-box (MinIO/S3).
- **Quarterly restore drill** into a throwaway container to ensure backups are viable.

### Documentation / runbooks (this week)

- “Postgres in Docker” one-pager: **must mount `/var/lib/postgresql/data`**, effects of the image’s `VOLUME`, and nested layout pitfalls.
- “Role repair via single-user mode” snippet (for missing `postgres`).
- “Restore procedure” (drop/recreate DB vs temp DB swap).

### Owners & dates

- **Compose standardization & pinning:** **@kitzy** — **Done** (verify in repo).
- **Pre-flight check in CI/Portainer template:** **@kitzy** — **by EOW**.
- **Backups to S3/MinIO + retention policy:** **@kitzy** — **by EOW**.
- **Runbooks (deploy, backup, restore, role repair):** **@kitzy** — **by EOW**.
- **Monitoring additions (DB TCP liveness):** **@kitzy** — **by EOW**.

## Evidence

- Host volume: `…/authentik_postgresql/_data/data/PG_VERSION` (PG 17) with full cluster files.
- Inside-container (parent mount): `/var/lib/postgresql/data` initially empty or re-initialized; after correct mount (child path), cluster is visible.
- Successful restore: `current_user = authentik`, `count = 178` non-system tables.
- Final Compose: `Destination=/var/lib/postgresql/data` mounted from named volume; user-agnostic healthcheck.
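To make the config standards and backup items above concrete, here is a minimal `docker run` sketch rather than the actual Compose stack; the container name `authentik-postgres`, the backup output path, and the cron-style usage are illustrative assumptions, while the pinned tag, child-path mount, named volume, and `pg_isready` probe follow the corrective actions listed above.

```sh
#!/usr/bin/env sh
# Sketch only: pinned image, named volume at the image's declared data dir
# (the child path), user-agnostic healthcheck, and nightly logical backups.
# Container name "authentik-postgres" and /var/backups are assumptions.
set -e

docker run -d --name authentik-postgres \
  -e POSTGRES_USER=authentik \
  -e POSTGRES_DB=authentik \
  -e POSTGRES_PASSWORD="$POSTGRES_PASSWORD" \
  -v authentik_postgresql:/var/lib/postgresql/data \
  --health-cmd 'pg_isready -q -h 127.0.0.1 -p 5432' \
  --health-interval 10s --health-retries 5 \
  postgres:17

# Nightly logical backups (cron-able): globals plus the authentik DB,
# written to the host so they can be shipped off-box (MinIO/S3).
docker exec authentik-postgres pg_dumpall -U authentik --globals-only \
  > /var/backups/pg-globals.sql
docker exec authentik-postgres pg_dump -U authentik -Fc authentik \
  > /var/backups/authentik.dump
```

In the Compose deployment the equivalent is a pinned `image: postgres:17`, a named volume targeting `/var/lib/postgresql/data`, and a `healthcheck` block wrapping the same `pg_isready` probe.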
Authentik stack failed to deploy (PostgreSQL unhealthy) (https://status.kitzy.net/incident/733500)
- Sat, 27 Sep 2025 22:07:00 -0000: Stack healthy; Authentik available; AWS/Portainer/Fleet auth restored.
- Sat, 27 Sep 2025 21:37:00 -0000: Recovery plan executed: logical dump (globals + `authentik` DB) from the old cluster, destroy volume, redeploy Postgres 17 with **volume mounted at `/var/lib/postgresql/data`**, restore globals and DB, add `pg_trgm` and `uuid-ossp` extensions, verify counts.
- Sat, 27 Sep 2025 20:45:00 -0000: Root cause identified: named volume mounted at `/var/lib/postgresql` + image’s `VOLUME /var/lib/postgresql/data` ⇒ anonymous child volume masks real data.
- Sat, 27 Sep 2025 20:02:00 -0000: Investigation finds valid PG 17 cluster on host volume under `…/_data/data` with ownership uid/gid 999; inside containers, `/var/lib/postgresql/data` appears empty or newly initialized. Health checks produce `role "root"/"postgres" does not exist` messages (clients connecting without a username).
- Sat, 27 Sep 2025 19:28:00 -0000: There was an issue updating a postgres container, I am investigating a fix. Authentication to AWS, Portainer, and Fleet is impacted. Existing user sessions should continue to work.
- Sat, 27 Sep 2025 18:50:13 -0000: Authentik went down.