Authentik stack failed to deploy (PostgreSQL unhealthy)
Resolved
Sep 27 at 06:27pm EDT
RCA — Authentik stack failed to deploy (PostgreSQL unhealthy)
Summary
New deployments of the Authentik stack failed in Portainer because the PostgreSQL service was repeatedly marked unhealthy and (earlier) failed to start. The root cause was an incorrect Docker volume mount target for the official postgres image, which declares VOLUME /var/lib/postgresql/data. We mounted our named volume at the parent (/var/lib/postgresql) instead of the declared data dir. Docker then created an anonymous child volume at /var/lib/postgresql/data, masking the real cluster on the named volume. A non-standard nested layout (…/data/data) and some role/healthcheck misconfigurations amplified the symptoms and obscured the root cause.
Impact
- User-facing: New logins to Authentik failed; already-authenticated sessions continued to work.
- Downstream services affected: AWS (SSO), Portainer, Fleet.
- Start: 2025-09-27 14:49 EDT
- End: 2025-09-27 18:07 EDT
- Duration: 3h 18m
- Data loss: None observed. After restore, schema contained 178 non-system tables; application state validated as intact.
Timeline (EDT)
- 14:49 – Portainer deploy fails: mount/propagation errors, then `unhealthy` on PostgreSQL.
- 15:00–16:30 – Investigation finds a valid PG 17 cluster on the host volume under `…/_data/data` with ownership uid/gid 999; inside containers, `/var/lib/postgresql/data` appears empty or newly initialized. Health checks produce `role "root"/"postgres" does not exist` messages (clients connecting without a username).
- ~16:45 – Root cause identified: named volume mounted at `/var/lib/postgresql` + the image's `VOLUME /var/lib/postgresql/data` ⇒ anonymous child volume masks the real data.
- ~17:00–18:00 – Recovery plan executed: logical dump (globals + `authentik` DB) from the old cluster, destroy volume, redeploy Postgres 17 with the volume mounted at `/var/lib/postgresql/data`, restore globals and DB, add `pg_trgm` and `uuid-ossp` extensions, verify counts.
- 18:07 – Stack healthy; Authentik available; AWS/Portainer/Fleet auth restored.
Technical Details
Root cause
Mount target mismatch for a data-dir image.
The official `postgres` image declares `VOLUME /var/lib/postgresql/data`. Mounting the named volume at the parent path (`/var/lib/postgresql`) led Docker to mount an anonymous child volume at `/var/lib/postgresql/data`, hiding the real database files. Containers subsequently saw an empty/new data directory and either failed `initdb` ("exists but not empty") or started against the wrong path.
Non-standard nested layout.
The actual cluster lived under `…/data/data`. With `PGDATA` pointed at `/var/lib/postgresql/data`, the server didn't see `PG_VERSION` and behaved as uninitialized; pointing `PGDATA` at the nested path worked but remained non-standard and error-prone.
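The masking behavior is easy to reproduce with a throwaway volume; the names below (`scratch_pg`, `pg-mask-demo`) are illustrative, not our production config:

```sh
# Because postgres:17 declares VOLUME /var/lib/postgresql/data, mounting a
# named volume one level up makes Docker attach an anonymous volume at the
# declared data dir, hiding whatever the named volume holds there.
docker volume create scratch_pg
docker run -d --name pg-mask-demo -e POSTGRES_PASSWORD=demo \
  -v scratch_pg:/var/lib/postgresql \
  postgres:17

# Two mounts appear: the named volume at the parent, plus an anonymous
# volume masking /var/lib/postgresql/data.
docker inspect -f '{{range .Mounts}}{{.Name}} -> {{.Destination}}{{println}}{{end}}' pg-mask-demo
```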
Contributing factors
- Health/monitoring confusion. Docker healthchecks (and some client probes) connected without `-U`, defaulting to the OS user `root` inside the container and producing noisy `role "root" does not exist` errors (see the sketch after this list). Production monitoring used an HTTP check from Better Stack, which caught the outage but didn't illuminate the DB mount issue.
- Role drift. The legacy cluster at one point lacked a `postgres` role; probes using `-U postgres` failed.
- Tag drift risk. An unpinned `library/postgres` tag increases the chance of major version bumps during pull.
- No routine backups. There was no preexisting logical backup job for Authentik; we relied on ad-hoc dumps during incident response.
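A quick illustration of the probe behavior (the container name `authentik-db` is an assumption):

```sh
# Without -U, psql defaults to the calling OS user, which is root inside the
# container, so even a healthy server answers with the noisy error we saw:
docker exec authentik-db psql -c 'SELECT 1'
#   => FATAL: role "root" does not exist

# pg_isready only checks that the server accepts connections; it exits 0
# even when authentication for the probing user would fail:
docker exec authentik-db pg_isready -q -h 127.0.0.1 -p 5432 && echo healthy
```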
What we tried (and why it didn’t stick)
- Chown/chmod/ACL normalization – necessary, but the masked child volume still hid the real cluster.
- Changing `PGDATA` while mounting the parent – still subject to masking by the image's declared `VOLUME`.
- Creating roles via psql – failed until we targeted the actual `PGDATA` or used single-user mode (sketched below), because the container wasn't looking at the real data directory.
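For reference, a hedged sketch of the single-user-mode repair, run with the database container stopped; the volume name is illustrative and `-D` must point at the directory that actually contains `PG_VERSION` (during the incident, the real cluster sat one level deeper):

```sh
# One-off container against the real data directory; single-user mode
# bypasses client authentication, and each input line is one command.
docker run --rm -i \
  -v authentik_postgresql:/var/lib/postgresql/data \
  --user postgres \
  --entrypoint postgres \
  postgres:17 \
  --single -D /var/lib/postgresql/data postgres <<'SQL'
CREATE ROLE postgres SUPERUSER LOGIN
SQL
```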
Resolution
- Deterministic reset:
1) Dumped roles (`pg_dumpall --globals-only`) and the DB (`pg_dump -Fc authentik`) from the old cluster.
2) Removed the old volume (with an extra tarball snapshot as belt-and-suspenders).
3) Redeployed `postgres:17` with the named volume mounted at `/var/lib/postgresql/data` and `PGDATA=/var/lib/postgresql/data`.
4) Restored globals and the DB; ensured `pg_trgm`/`uuid-ossp` were present; validated 178 non-system tables owned by `authentik`.
5) Updated the healthcheck to a user-agnostic probe (`pg_isready -q -h 127.0.0.1 -p 5432`).
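Condensed into commands, the reset looked roughly like this; the container and volume names (`authentik-db`, `authentik_postgresql`) are illustrative, and credentials come from the environment:

```sh
# 1) Logical dump from the old cluster (globals + the authentik DB)
docker exec authentik-db pg_dumpall -U authentik --globals-only > pg-globals.sql
docker exec authentik-db pg_dump -U authentik -Fc authentik > authentik.dump

# 2) Belt-and-suspenders tarball, then drop the old volume
docker run --rm -v authentik_postgresql:/src alpine tar czf - -C /src . > pg-volume-snapshot.tgz
docker stop authentik-db && docker rm authentik-db
docker volume rm authentik_postgresql

# 3) Redeploy pinned postgres:17 with the volume on the declared data dir
docker run -d --name authentik-db \
  -e POSTGRES_USER=authentik \
  -e POSTGRES_PASSWORD="${POSTGRES_PASSWORD}" \
  -e POSTGRES_DB=authentik \
  -e PGDATA=/var/lib/postgresql/data \
  -v authentik_postgresql:/var/lib/postgresql/data \
  postgres:17

# 4) Restore globals, the database, and the required extensions
docker exec -i authentik-db psql -U authentik -d authentik -f - < pg-globals.sql
docker exec -i authentik-db pg_restore -U authentik -d authentik < authentik.dump
docker exec authentik-db psql -U authentik -d authentik \
  -c 'CREATE EXTENSION IF NOT EXISTS pg_trgm' \
  -c 'CREATE EXTENSION IF NOT EXISTS "uuid-ossp"'

# 5) Sanity check: expect 178 non-system tables
docker exec authentik-db psql -U authentik -d authentik -tAc \
  "SELECT count(*) FROM pg_tables WHERE schemaname NOT IN ('pg_catalog','information_schema')"
```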
Corrective & Preventive Actions
Config standards (immediately)
- Pin the image to `postgres:17` in Compose.
- Mount at the child path: always mount the named volume at `/var/lib/postgresql/data` for the official image; never mount the parent.
- Standardize DB env consumption: both `auth-server` and `auth-worker` read `${POSTGRES_USER}`, `${POSTGRES_DB}`, and `${POSTGRES_PASSWORD}` (no hardcoding).
- Healthcheck: use a user-agnostic probe (TCP, or `pg_isready` with host/port only); avoid checks that require a specific DB user. A combined sketch follows below.
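Taken together, a service following these standards would look roughly like the following `docker run` equivalent (a sketch, not our exact Compose config):

```sh
# Pinned tag, child-path mount, env pass-through, user-agnostic healthcheck.
docker run -d --name authentik-db \
  -e POSTGRES_USER="${POSTGRES_USER}" \
  -e POSTGRES_PASSWORD="${POSTGRES_PASSWORD}" \
  -e POSTGRES_DB="${POSTGRES_DB}" \
  -v authentik_postgresql:/var/lib/postgresql/data \
  --health-cmd 'pg_isready -q -h 127.0.0.1 -p 5432' \
  --health-interval 10s \
  --health-timeout 5s \
  --health-retries 5 \
  postgres:17
```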
Operational guardrails (this week)
- Pre-flight verification script (run before deploy):

```sh
docker run --rm -v <vol>:/var/lib/postgresql/data --entrypoint sh postgres:17 -lc \
  'test -f /var/lib/postgresql/data/PG_VERSION || (echo "PG_VERSION missing at target"; ls -la /var/lib/postgresql/data; exit 1)'
```

- Detect anonymous child volumes: flag when a container has both a named mount at `/var/lib/postgresql` and any mount at `/var/lib/postgresql/data` (see the sketch after this list).
- Monitoring: keep Better Stack HTTP checks; add a DB socket/TCP liveness check on port 5432 from within the host or a sidecar to reduce false attribution.
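A minimal detection sketch for the second guardrail, assuming docker CLI access on the host:

```sh
# Flag any container that mounts something at the parent path while a second
# mount (named or anonymous) sits at the declared data dir -- the exact
# combination that masked our cluster.
for c in $(docker ps --format '{{.Names}}'); do
  mounts=$(docker inspect -f '{{range .Mounts}}{{println .Destination}}{{end}}' "$c")
  if echo "$mounts" | grep -qx '/var/lib/postgresql' && \
     echo "$mounts" | grep -qx '/var/lib/postgresql/data'; then
    echo "WARN: $c mounts both /var/lib/postgresql and its data dir (masking risk)"
  fi
done
```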
Backups & restore (this week)
- Nightly logical backups:

```sh
pg_dumpall --globals-only > pg-globals.sql
pg_dump -Fc authentik > authentik.dump
```

- Retention: 14–30 days; store off-box (MinIO/S3).
- Quarterly restore drill into a throwaway container to ensure backups are viable (sketched after this list).
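The drill can be as simple as the following sketch; names and paths are illustrative, and "role already exists" noise from the globals restore is expected:

```sh
# Restore the latest dump into a throwaway container and verify the count.
docker run -d --name restore-drill \
  -e POSTGRES_USER=authentik -e POSTGRES_PASSWORD=drill -e POSTGRES_DB=authentik \
  postgres:17
until docker exec restore-drill pg_isready -q -h 127.0.0.1 -p 5432; do sleep 1; done

docker exec -i restore-drill psql -U authentik -d authentik -f - < pg-globals.sql
docker exec -i restore-drill pg_restore -U authentik -d authentik < authentik.dump

# Expect 178 non-system tables (as of this incident)
docker exec restore-drill psql -U authentik -d authentik -tAc \
  "SELECT count(*) FROM pg_tables WHERE schemaname NOT IN ('pg_catalog','information_schema')"
docker rm -f restore-drill
```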
Documentation / runbooks (this week)
- “Postgres in Docker” one-pager: must mount `/var/lib/postgresql/data`, effects of the image's `VOLUME`, and nested-layout pitfalls.
- “Role repair via single-user mode” snippet (for a missing `postgres` role).
- “Restore procedure” (drop/recreate DB vs temp DB swap).
Owners & dates
- Compose standardization & pinning: @kitzy — Done (verify in repo).
- Pre-flight check in CI/Portainer template: @kitzy — by EOW.
- Backups to S3/MinIO + retention policy: @kitzy — by EOW.
- Runbooks (deploy, backup, restore, role repair): @kitzy — by EOW.
- Monitoring additions (DB TCP liveness): @kitzy — by EOW.
Evidence
- Host volume: `…/authentik_postgresql/_data/data/PG_VERSION` (PG 17) with full cluster files.
- Inside-container (parent mount): `/var/lib/postgresql/data` initially empty or re-initialized; after the correct mount (child path), the cluster is visible.
- Successful restore: `current_user = authentik`, `count = 178` non-system tables.
- Final Compose: `Destination=/var/lib/postgresql/data` mounted from the named volume; user-agnostic healthcheck.
Updated
Sep 27 at 06:07pm EDT
Stack healthy; Authentik available; AWS/Portainer/Fleet auth restored.
Updated
Sep 27 at 05:37pm EDT
Recovery plan executed: logical dump (globals + authentik DB) from the old cluster, destroy volume, redeploy Postgres 17 with volume mounted at /var/lib/postgresql/data, restore globals and DB, add pg_trgm and uuid-ossp extensions, verify counts.
Updated
Sep 27 at 04:45pm EDT
Root cause identified: named volume mounted at /var/lib/postgresql + image’s VOLUME /var/lib/postgresql/data ⇒ anonymous child volume masks real data.
Updated
Sep 27 at 04:02pm EDT
Investigation finds valid PG 17 cluster on host volume under …/_data/data with ownership uid/gid 999; inside containers, /var/lib/postgresql/data appears empty or newly initialized. Health checks produce role "root"/"postgres" does not exist messages (clients connecting without a username).
Updated
Sep 27 at 03:28pm EDT
There was an issue updating a Postgres container; I am investigating a fix.
Authentication to AWS, Portainer, and Fleet is impacted. Existing user sessions should continue to work.
Created
Sep 27 at 02:50pm EDT
Authentik went down.