Downtime

Authentik stack failed to deploy (PostgreSQL unhealthy)

Sep 27 at 02:50pm EDT
Affected services
Authentik
Fleet

Resolved
Sep 27 at 06:27pm EDT

RCA — Authentik stack failed to deploy (PostgreSQL unhealthy)

Summary

New deployments of the Authentik stack failed in Portainer because the PostgreSQL service was repeatedly marked unhealthy and (earlier) failed to start. The root cause was an incorrect Docker volume mount target for the official postgres image, which declares VOLUME /var/lib/postgresql/data. We mounted our named volume at the parent (/var/lib/postgresql) instead of the declared data dir. Docker then created an anonymous child volume at /var/lib/postgresql/data, masking the real cluster on the named volume. A non-standard nested layout (…/data/data) and some role/healthcheck misconfigurations amplified the symptoms and obscured the root cause.

Impact

  • User-facing: New logins to Authentik failed; already-authenticated sessions continued to work.
  • Downstream services affected: AWS (SSO), Portainer, Fleet.
  • Start: 2025-09-27 14:49 EDT
  • End: 2025-09-27 18:07 EDT
  • Duration: 3h 18m
  • Data loss: None observed. After restore, schema contained 178 non-system tables; application state validated as intact.

Timeline (EDT)

  • 14:49 – Portainer deploy fails: mount/propagation errors, then unhealthy on PostgreSQL.
  • 15:00–16:30 – Investigation finds valid PG 17 cluster on host volume under …/_data/data with ownership uid/gid 999; inside containers, /var/lib/postgresql/data appears empty or newly initialized. Health checks produce role "root"/"postgres" does not exist messages (clients connecting without a username).
  • ~16:45 – Root cause identified: named volume mounted at /var/lib/postgresql + image’s VOLUME /var/lib/postgresql/data ⇒ anonymous child volume masks real data.
  • ~17:00–18:00 – Recovery plan executed: logical dump (globals + authentik DB) from the old cluster, destroy volume, redeploy Postgres 17 with volume mounted at /var/lib/postgresql/data, restore globals and DB, add pg_trgm and uuid-ossp extensions, verify counts.
  • 18:07 – Stack healthy; Authentik available; AWS/Portainer/Fleet auth restored.

Technical Details

Root cause

  1. Mount target mismatch for a data-dir image.

    The official postgres image declares VOLUME /var/lib/postgresql/data. Mounting the named volume at the parent path (/var/lib/postgresql) led Docker to mount an anonymous child volume at /var/lib/postgresql/data, hiding the real database files. Containers subsequently saw an empty/new data directory and either failed initdb (“exists but is not empty”) or started against the wrong path.

  2. Non-standard nested layout.

    The actual cluster lived under …/data/data. With PGDATA pointed at /var/lib/postgresql/data, the server didn’t see PG_VERSION and behaved as uninitialized; pointing PGDATA at the nested path worked but remained non-standard and error-prone.
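The two failure modes above (masked mount vs. nested layout) can be told apart with a quick probe of the directory the volume is mounted at. A minimal sketch, assuming the helper name and messages are ours, not an existing tool:

```shell
# Hypothetical layout probe: $1 is the directory the volume presents.
# Distinguishes a correctly placed cluster, the non-standard nested
# ".../data/data" layout, and an empty (possibly masked) directory.
check_pgdata_layout() {
  dir="$1"
  if [ -f "$dir/PG_VERSION" ]; then
    echo "ok: cluster at $dir (PG $(cat "$dir/PG_VERSION"))"
  elif [ -f "$dir/data/PG_VERSION" ]; then
    echo "warn: nested cluster at $dir/data (non-standard layout)"
  else
    echo "error: no PG_VERSION under $dir (empty or masked mount?)"
    return 1
  fi
}
```

Run against the host-side `_data` directory (or inside a throwaway container with the volume mounted), this would have flagged the nested layout immediately.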

Contributing factors

  • Health/monitoring confusion. Docker healthchecks (and some client probes) connected without -U, defaulting to the in-container OS user (root) and producing noisy role "root" does not exist errors. Production monitoring used an HTTP check from Better Stack, which caught the outage but didn’t illuminate the DB mount issue.
  • Role drift. The legacy cluster at one point lacked a postgres role; probes using -U postgres failed.
  • Tag drift risk. The unpinned library/postgres tag increases the chance of an unintended major-version bump during a pull.
  • No routine backups. There was no preexisting logical backup job for Authentik; we relied on ad-hoc dumps during incident response.

What we tried (and why it didn’t stick)

  • Chown/chmod/ACL normalization – necessary, but the masked child volume still hid the real cluster.
  • Changing PGDATA while mounting the parent – still subject to masking by the image’s declared VOLUME.
  • Creating roles via psql – failed until we targeted the actual PGDATA or used single-user mode, because the container wasn’t looking at the real data directory.

Resolution

  • Deterministic reset:
    1) Dumped roles (pg_dumpall --globals-only) and DB (pg_dump -Fc authentik) from the old cluster.
    2) Removed the old volume (with an extra tarball snapshot as belt-and-suspenders).
    3) Redeployed postgres:17 with named volume mounted at /var/lib/postgresql/data and PGDATA=/var/lib/postgresql/data.
    4) Restored globals and DB; ensured pg_trgm / uuid-ossp present; validated 178 non-system tables owned by authentik.
    5) Updated healthcheck to a user-agnostic probe (pg_isready -q -h 127.0.0.1 -p 5432).

Corrective & Preventive Actions

Config standards (immediately)

  • Pin the image to postgres:17 in Compose.
  • Mount at the child path: always mount the named volume at /var/lib/postgresql/data for the official image. Do not mount the parent.
  • Standardize DB env consumption: both auth-server and auth-worker read ${POSTGRES_USER}, ${POSTGRES_DB}, ${POSTGRES_PASSWORD} (no hardcoding).
  • Healthcheck: use user-agnostic probe (TCP or pg_isready with host/port only). Avoid checks that require a specific DB user.
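Taken together, the standards above look roughly like this in Compose. A sketch only: service/volume names and the healthcheck timings are illustrative, not the production file.

```yaml
# Illustrative Compose service; names and intervals are assumptions.
services:
  postgresql:
    image: postgres:17                 # pinned; no floating tag
    environment:
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_DB: ${POSTGRES_DB}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - database:/var/lib/postgresql/data   # child path, never the parent
    healthcheck:
      test: ["CMD", "pg_isready", "-q", "-h", "127.0.0.1", "-p", "5432"]
      interval: 30s
      timeout: 5s
      retries: 5
volumes:
  database:
```

Mounting at the declared VOLUME path means the image never creates an anonymous child volume, and pg_isready succeeds for any accepting server regardless of which roles exist.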

Operational guardrails (this week)

  • Pre-flight verification script (run before deploy):
    docker run --rm -v <vol>:/var/lib/postgresql/data --entrypoint sh postgres:17 \
      -lc 'test -f /var/lib/postgresql/data/PG_VERSION \
        || (echo "PG_VERSION missing at target"; ls -la /var/lib/postgresql/data; exit 1)'
  • Detect anonymous child volumes: flag when a container has both a named mount at /var/lib/postgresql and any mount at /var/lib/postgresql/data.
  • Monitoring: keep Better Stack HTTP checks; add a DB socket/TCP liveness check on port 5432 from within the host or a sidecar to reduce false attribution.

Backups & restore (this week)

  • Nightly logical backups:
    • pg_dumpall --globals-only > pg-globals.sql
    • pg_dump -Fc authentik > authentik.dump
    • Retention: 14–30 days; store off-box (MinIO/S3).
  • Quarterly restore drill into a throwaway container to ensure backups are viable.
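The retention step of the nightly job can be a single find over the dump directory. A sketch (helper name, flat directory layout, and file suffixes are assumptions):

```shell
# Hypothetical retention helper: deletes dump artifacts in $1 whose
# modification time is more than $2 days old. Assumes dumps land as
# *.dump / *.sql in a flat directory.
prune_dumps() {
  dir="$1"; days="$2"
  find "$dir" \( -name '*.dump' -o -name '*.sql' \) -mtime +"$days" -delete
}
```

Usage would be something like `prune_dumps /srv/backups/authentik 30` (path illustrative), run after the pg_dumpall/pg_dump steps succeed.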

Documentation / runbooks (this week)

  • “Postgres in Docker” one-pager: must mount /var/lib/postgresql/data, effects of the image’s VOLUME, and nested layout pitfalls.
  • “Role repair via single-user mode” snippet (for missing postgres).
  • “Restore procedure” (drop/recreate DB vs temp DB swap).

Owners & dates

  • Compose standardization & pinning: @kitzy (done; verify in repo).
  • Pre-flight check in CI/Portainer template: @kitzy (by EOW).
  • Backups to S3/MinIO + retention policy: @kitzy (by EOW).
  • Runbooks (deploy, backup, restore, role repair): @kitzy (by EOW).
  • Monitoring additions (DB TCP liveness): @kitzy (by EOW).

Evidence

  • Host volume: …/authentik_postgresql/_data/data/PG_VERSION (PG 17) with full cluster files.
  • Inside-container (parent mount): /var/lib/postgresql/data initially empty or re-initialized; after correct mount (child path), cluster is visible.
  • Successful restore: current_user = authentik, count = 178 non-system tables.
  • Final Compose: Destination=/var/lib/postgresql/data mounted from named volume; user-agnostic healthcheck.

Updated
Sep 27 at 06:07pm EDT

Stack healthy; Authentik available; AWS/Portainer/Fleet auth restored.

Updated
Sep 27 at 05:37pm EDT

Recovery plan executed: logical dump (globals + authentik DB) from the old cluster, destroy volume, redeploy Postgres 17 with volume mounted at /var/lib/postgresql/data, restore globals and DB, add pg_trgm and uuid-ossp extensions, verify counts.

Updated
Sep 27 at 04:45pm EDT

Root cause identified: named volume mounted at /var/lib/postgresql + image’s VOLUME /var/lib/postgresql/data ⇒ anonymous child volume masks real data.

Updated
Sep 27 at 04:02pm EDT

Investigation finds valid PG 17 cluster on host volume under …/_data/data with ownership uid/gid 999; inside containers, /var/lib/postgresql/data appears empty or newly initialized. Health checks produce role "root"/"postgres" does not exist messages (clients connecting without a username).

Updated
Sep 27 at 03:28pm EDT

There was an issue updating a Postgres container; I am investigating a fix.

Authentication to AWS, Portainer, and Fleet is impacted. Existing user sessions should continue to work.

Created
Sep 27 at 02:50pm EDT

Authentik went down.