
ADR-010: S3-Notifier IAM Roles Managed by ACK in the Service Chart

Status

In-Review — proposes a migration; no implementation in this PR, only the design.

Context

The s3-notifier service is a Lambda function that consumes S3 ObjectCreated events and publishes SearchUploadSavedEvent to RabbitMQ, which the project-management saga consumes to start the systematic-search import pipeline.

Today, ownership of the s3-notifier's resources is split across three sources of truth, and the split is not even consistent across environments. The table below covers every relevant resource, its current owner (verified against live ManagedBy tags on AWS as of 2026-04-28), the desired end state, and how we get there.

Legend: 🅣 = Terraform, 🅐 = ACK in the chart (src/services/s3-notifier/.chart/), 🅙 = imperative Helm Job, ⚠️ = drift between declared and live state.

| # | Resource | Current owner | Desired owner | How achieved |
|---|----------|---------------|---------------|--------------|
| 1 | Production Lambda function (syrfAppUploadS3Notifier) | 🅣 (aws_lambda_function.s3_notifier_production) | 🅐 (function.yaml) | Production cluster-gitops values already declare adoptExisting: true + deletionPolicy: retain and service.enabled: false. Phase 5 flips enabled to true; ACK adopts the existing Lambda non-destructively; then terraform state rm the resource. |
| 2 | Production IAM role (syrfS3NotifierProductionLambdaRole) + 2 policies | 🅣 | 🅐 (new iam-role.yaml + iam-policy.yaml in the chart) | Phase 1 adds chart templates gated by the iam.managedByAck flag (default off → no-op). Phase 5 flips the flag in production values; ACK adopts the existing role; terraform state rm after. |
| 3 | Production S3 → Lambda invoke permission | 🅣 (aws_lambda_permission.s3_invoke_production) | 🅙 — stays imperative for now (see row 7) | Cut over to the chart's permission-job.yaml; no Permission CR exists upstream yet (row 7); terraform state rm after. |
| 4 | Production S3 bucket notification | 🅣 (aws_s3_bucket_notification.uploads) | 🅐 (the Bucket CR's notificationConfiguration) | Extend bucket.yaml with the notification spec; adopt; terraform state rm after. |
| 5 | Staging Lambda function (syrfAppUploadS3Notifier-staging) | 🅐 ✅ | 🅐 | Already done. |
| 6 | Staging IAM role (syrfS3NotifierStagingLambdaRole) + 2 policies | 🅣 ⚠️ — declared, but the policies are missing in AWS (the 2026-04-28 incident) | 🅐 | Two steps: (a) immediate drift fix via Terraform CI/CD (camarades-infrastructure PR #1) so staging starts working again; (b) migration Phase 4 moves the declaration into the chart with adoptExisting. |
| 7 | Staging S3 → Lambda invoke permission | 🅙 (permission-job.yaml runs aws lambda add-permission) | 🅙 — stays imperative for now | The currently deployed ACK lambda-controller has no Permission CR (verified via kubectl explain function.lambda.services.k8s.aws.spec.permissions → field does not exist; upstream lambda-controller v1alpha1 only ships Function/Alias/EventSourceMapping/etc.). Plan item #2 is downgraded accordingly: keep permission-job.yaml until upstream adds Permission support, then revisit. The same applies to all other envs. |
| 8 | Staging bucket + notification (syrfapp-uploads-staging) | 🅐 ✅ | 🅐 | Already done. |
| 9 | Preview Lambda functions (syrfAppUploadS3Notifier-pr-*) | 🅐 mostly ✅; legacy pr-2291 still 🅣 | 🅐 (or close the stale PR) | Phase 3 cleanup: either adopt pr-2291 into ACK or close PR #2291 if it has gone stale (open 70+ days). Confirm var.preview_prs in TF no longer references it; terraform state rm if needed. |
| 10 | Preview IAM role(s) | 🅣 — a single shared syrfS3NotifierPreviewLambdaRole | 🅐 — likely a per-PR role rather than a shared one | A per-PR role gives per-env blast-radius isolation, matching how each PR already has its own bucket. Phases 2–3: the chart renders a Role CR named after the env. Migrate previews one at a time before retiring the shared role. |
| 11 | ACK controllers' IAM role (ack-iam.tf) | 🅣 | 🅣 — stays | This is the role ACK itself uses to authenticate to AWS; putting it in a chart that ACK reconciles would be circular. Out of ADR-010 scope. |
| 12 | GitHub Actions deployer IAM (github-actions-iam.tf) | 🅣 | 🅣 — stays | This is what GHA uses to run Terraform; it can't bootstrap itself from a chart. ADR-010 prerequisite P-1 still widens its resource list to cover the staging role until Phase 6. |
| 13 | GitHub Actions OIDC provider for AWS | — (does not exist yet) | 🅣 | Replaces static AWS keys for the camarades-infrastructure CI workflow. Out of ADR-010 scope; a sensible follow-up to camarades-infrastructure PR #1. |

After migration, rows 1–2, 4–6, and 8–10 are owned by the chart in src/services/s3-notifier/.chart/templates/, with cluster-gitops values toggling per-environment behaviour. Rows 3 and 7 (the S3 → Lambda invoke permissions) stay imperative in the chart's permission-job.yaml until upstream ACK ships a Permission CR — tracked as a follow-up. Rows 11–13 stay in Terraform on purpose (bootstrap / circular-dependency reasons).
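For concreteness, the per-environment toggles referenced above might look like this in a cluster-gitops values file. A minimal sketch: the keys adoptExisting, deletionPolicy, service.enabled, and iam.managedByAck come from the table; the nesting and the file path are illustrative, not the repo's actual layout.

```yaml
# cluster-gitops/syrf/environments/production/s3-notifier/values.yaml (sketch)
service:
  enabled: false            # Phase 5 flips this to true; ACK then adopts the Lambda
  adoptExisting: true       # adopt the existing function instead of recreating it
  deletionPolicy: retain    # deleting the CR must never delete the AWS resource
iam:
  managedByAck: false       # Phase 1 default: the chart renders no IAM CRs (no-op)
```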

Viability research (2026-04-28), confirmed against the live cluster:

  • iam.services.k8s.aws CRDs are not yet installed — only the lambda-controller and s3-controller are running in ack-system.
  • ACK iam-controller v1.6.3 supports the Role and Policy CRDs we need.
  • lambda-controller v1alpha1 has no Permission CR.
  • s3-controller's Bucket CR already declares notification.lambdaFunctionConfigurations (the chart uses this for staging today).

The migration plan's P-0 step now covers deploying the iam-controller as a true prerequisite.
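As a reference point for rows 4 and 8, the notification block the chart already declares for staging looks roughly like the following. A sketch: field names follow the ACK s3-controller v1alpha1 Bucket CRD; the ARN is elided and resolved from values in the real template.

```yaml
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: syrfapp-uploads-staging
spec:
  name: syrfapp-uploads-staging
  notification:
    lambdaFunctionConfigurations:
      - events:
          - "s3:ObjectCreated:*"                   # the event class the s3-notifier consumes
        lambdaFunctionARN: <staging function ARN>  # resolved from values in the real template
```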

This split has produced concrete problems.

The 2026-04-28 incident

A SyRF systematic-search upload to staging hung indefinitely. Investigation found:

  1. Lambda CloudWatch metrics: 3 invocations / 3 errors that day (one upload plus the two default retries for asynchronous invocation).
  2. CloudWatch log group /aws/lambda/syrfAppUploadS3Notifier-staging did not exist — Lambda had never written a log line.
  3. aws iam list-attached-role-policies and aws iam list-role-policies for syrfS3NotifierStagingLambdaRole both returned empty.
  4. The Terraform code at terraform/lambda/main.tf:170-191 declared both aws_iam_role_policy_attachment.staging_lambda_basic (CloudWatch logs) and aws_iam_role_policy.staging_lambda_s3 (s3:GetObject). They had simply never been applied.
  5. The github-actions-deployer IAM policy did not include syrfS3NotifierStagingLambdaRole in its resource list, so a CI-driven terraform apply would have failed even if one existed.

The Lambda fired, hit zero permissions, died silently, and the saga never received SearchUploadSavedEvent. The PM saga's pmSearchImportJobState stayed in Uploading indefinitely.

This is the second IAM-drift incident in this module — see commit 200f072 (fix(iam): align Terraform with actual AWS IAM state) in the infra repo. The split ownership model produces drift faster than it can be detected.

Specific smells

  • permission-job.yaml is imperative: a Helm-templated Job runs aws lambda add-permission and aws lambda remove-permission to manage the resource policy. There's an entire Python verification block to compensate for the lack of declarative reconciliation. This is shell-in-a-chart — readable, but inverts the GitOps model.
  • Function CR consumes an externally-managed ARN: function.yaml references arn:aws:iam::…:role/syrfS3NotifierStagingLambdaRole via a values file (see the sketch after this list). If Terraform hasn't created that role, the Function CR is invalid. There is no way to deploy the s3-notifier service end-to-end from the chart alone.
  • Adding a new environment requires both Terraform and chart changes: today, a new staging-like env needs (a) a Terraform PR for the role, policies, and deployer-policy widening, and (b) a cluster-gitops PR for values. The two cannot deploy atomically.
  • Cross-repo failure recovery is brittle: if either repo lags behind the other, you get partial state — exactly the staging incident.
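A minimal sketch of that coupling, assuming the ACK lambda-controller v1alpha1 Function CRD (spec.role is the CRD's execution-role field; the values keys are hypothetical):

```yaml
# src/services/s3-notifier/.chart/templates/function.yaml (abridged sketch)
apiVersion: lambda.services.k8s.aws/v1alpha1
kind: Function
metadata:
  name: s3-notifier
spec:
  name: {{ .Values.function.name }}        # hypothetical values key
  role: {{ .Values.iam.roleArn | quote }}  # the externally-managed ARN: if Terraform never
                                           # created this role, the rendered CR is invalid
```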

Decision

Move the s3-notifier's IAM role and policies from Terraform into the chart as ACK CRs (Role, Policy from iam.services.k8s.aws/v1alpha1).

After migration:

  • The Lambda role lives next to the Lambda function in src/services/s3-notifier/.chart/templates/.
  • ArgoCD reconciles role + policies + function + bucket atomically, in a single sync.
  • permission-job.yaml is kept for now — see "Out of scope" below.
  • New environments are added by writing a values file in cluster-gitops/syrf/environments/{env}/s3-notifier/. No Terraform change required.
  • The Terraform-managed roles for s3-notifier are deprecated and removed once all environments have migrated.
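For concreteness, a minimal sketch of the new iam-role.yaml, assuming the ACK iam-controller v1alpha1 Role schema. The role name follows the existing convention; the trust policy and inline policy are illustrative reconstructions of what Terraform declares today, not verified contents:

```yaml
# src/services/s3-notifier/.chart/templates/iam-role.yaml (sketch)
apiVersion: iam.services.k8s.aws/v1alpha1
kind: Role
metadata:
  name: s3-notifier-lambda-role
spec:
  name: syrfS3NotifierStagingLambdaRole
  assumeRolePolicyDocument: |
    {
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Principal": { "Service": "lambda.amazonaws.com" },
        "Action": "sts:AssumeRole"
      }]
    }
  policies:
    # CloudWatch Logs — the permission whose absence caused the 2026-04-28 incident
    - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
  inlinePolicies:
    s3-read: |
      {
        "Version": "2012-10-17",
        "Statement": [{
          "Effect": "Allow",
          "Action": "s3:GetObject",
          "Resource": "arn:aws:s3:::syrfapp-uploads-staging/*"
        }]
      }
```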

The ACK iam-controller will be deployed alongside the existing lambda-controller and s3-controller as part of P-0 of the migration plan. Today only the latter two are running in ack-system. The policies the iam-controller needs are constrained to s3-notifier role-name patterns and trivially auditable.

Out of scope for this ADR

  • Other Lambdas (none in current scope).
  • Migrating other Terraform-owned IAM (cluster service accounts, deployer policy itself). Those have different blast radius and need separate ADRs.
  • Replacing permission-job.yaml with a declarative Permission CR. Verified in 2026-04-28 viability research: the deployed ACK lambda-controller v1alpha1 has no Permission CRD (kubectl explain function.lambda.services.k8s.aws.spec.permissions returns "field does not exist"; upstream lambda-controller's only resource kinds are Function, Alias, EventSourceMapping, FunctionURLConfig, LayerVersion, Version, CodeSigningConfig). The imperative Helm Job stays until upstream lands a Permission CR — tracked as an upstream-blocked follow-up. The rest of the migration is unaffected.

Consequences

Positive

  • Single source of truth per service: everything s3-notifier needs is in the chart + cluster-gitops values.
  • Drift detection becomes trivial: ArgoCD shows OutOfSync for any IAM divergence, the same way it does for Function divergence today. No more "did terraform apply run with the right credentials?".
  • Atomic per-environment deploys: a new preview/staging/whatever lands in a single PR, in one repo.
  • Reduces — but does not eliminate — imperative shell: env-vars-job.yaml can be removed if/when the ACK Function CR exposes env-vars-from-secret natively (a separate question); permission-job.yaml stays until upstream ACK ships a Permission CRD, but it then becomes the only remaining imperative shell-out, and its scope is narrow.
  • Better failure modes: a missing role surfaces as a Helm template render error or ArgoCD sync error, not as a silently-failing Lambda invocation.
  • Cuts a class of incidents: the 2026-04-28 incident, by construction, cannot happen with this model — the role and the function would deploy or fail together.

Negative / costs

  • ACK IAM controller must be deployed and carefully scoped: the controller isn't running yet (P-0 deploys it), and its role needs iam:* on a scoped name pattern (e.g. syrfS3Notifier*LambdaRole). This is itself a sensitive change and needs its own review.
  • One-time migration cost: each environment (production, staging, all preview PRs) needs a coordinated cutover. See migration plan.
  • Two ways to manage IAM in the org: until other Terraform-owned roles also migrate, contributors have to know which roles are TF-owned vs ACK-owned. Mitigated by clear naming and documentation.
  • ACK is a moving target: minor version bumps to the ACK IAM controller could change CR fields. We pin the controller version in cluster-gitops (see the sketch after this list); this is the same risk we already accept for the Lambda controller.
  • Loss of terraform plan as a preview tool: argocd app diff is the equivalent for ACK CRs. It is a different UX, and it diffs the CRs rather than the resulting AWS API calls, but it serves the same review purpose.
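One way the pin might look, assuming the controllers are installed via ArgoCD Applications pointing at the upstream OCI Helm charts — the repo layout and field values here are assumptions, not the current setup:

```yaml
# cluster-gitops: ArgoCD Application pinning the ACK iam-controller (hypothetical)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ack-iam-controller
  namespace: argocd
spec:
  project: default
  source:
    repoURL: public.ecr.aws/aws-controllers-k8s   # upstream OCI Helm registry
    chart: iam-chart
    targetRevision: 1.6.3                         # pinned; bump deliberately, staging first
  destination:
    server: https://kubernetes.default.svc
    namespace: ack-system
```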

Risk register

| Risk | Severity | Mitigation |
|------|----------|------------|
| ACK iam-controller fails to reconcile mid-migration, leaving an env without a working role | High | Migrate one env at a time; staging first; verify the upload pipeline works end-to-end before the next. |
| ACK controller version skew between staging and production | Medium | Pin the controller version in cluster-gitops; upgrade staging first. |
| Production Lambda loses CloudWatch Logs perms during cutover | High | Use the ACK services.k8s.aws/adopted: "true" annotation to take ownership of the existing role without a recreate; verify the role is unchanged via aws iam list-role-policies immediately after. |
| Deployer's Terraform-driven IAM policy still references the old role names after migration | Low | The migration plan removes those resource ARNs from github-actions-iam.tf only after the env is fully migrated. |
| A misconfigured ACK CR accidentally deletes the role | High | Use the ACK deletion-policy: retain annotation on the Role CR (the same pattern already used for the Function). |
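The two annotation-based mitigations above combine on the Role CR like this (a sketch; the annotation keys are the ones named in the table):

```yaml
apiVersion: iam.services.k8s.aws/v1alpha1
kind: Role
metadata:
  name: s3-notifier-lambda-role
  annotations:
    services.k8s.aws/adopted: "true"          # adopt the live role; never recreate it
    services.k8s.aws/deletion-policy: retain  # deleting the CR leaves the AWS role intact
spec:
  name: syrfS3NotifierProductionLambdaRole
```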

Migration plan

See companion document: docs/planning/s3-notifier-iam-migration.md.

The plan is phased:

  1. Pre-flight: deploy the ACK iam-controller with suitably scoped permissions (P-0); add the ADR + plan (this PR).
  2. Add Role + Policy CRs to the chart, gated by an iam.managedByAck flag (default false → no-op; see the sketch after this list).
  3. Per-environment cutover: flip the flag in one preview env's values; verify; promote to staging; verify; promote to production.
  4. Cleanup: remove the Terraform-owned roles once all envs have migrated; remove the resource-list entries in github-actions-iam.tf. (permission-job.yaml stays until upstream ACK ships a Permission CR — see "Out of scope".)
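The gate in step 2 is ordinary Helm templating, so the default is a guaranteed no-op. A sketch; the named-template helper and values key are hypothetical:

```yaml
# src/services/s3-notifier/.chart/templates/iam-role.yaml — renders nothing by default
{{- if .Values.iam.managedByAck }}
apiVersion: iam.services.k8s.aws/v1alpha1
kind: Role
metadata:
  name: {{ include "s3-notifier.fullname" . }}-role  # hypothetical named-template helper
spec:
  name: {{ .Values.iam.roleName }}                   # hypothetical values key
  # ...assumeRolePolicyDocument and policies as sketched under "Decision"...
{{- end }}
```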

Each phase has explicit rollback steps. No phase modifies production until staging has been validated for at least one upload-to-import cycle.

Alternatives considered

A. Keep Terraform, add CI/CD around it (status quo+)

Add a workflow that runs terraform plan on every PR and terraform apply on merge with manual gating — see the camarades-infrastructure companion PR (#1).
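A skeletal version of such a workflow, for orientation only — job names, triggers, and the gating environment are assumptions about PR #1, not its actual contents:

```yaml
# .github/workflows/terraform.yml (sketch)
name: terraform
on:
  pull_request:        # plan on every PR
  push:
    branches: [main]   # apply on merge, behind a manually gated environment
jobs:
  plan:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      # AWS credentials step omitted (static keys today; OIDC is a follow-up — see row 13)
      - run: terraform init && terraform plan
  apply:
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    environment: production-infra   # required reviewers provide the manual gate
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init && terraform apply -auto-approve
```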

This is happening regardless of this ADR, because it solves the immediate problem of silent drift even before any migration. But it doesn't address:

  • The cross-repo dependency (Function CR still needs an ARN that lives elsewhere).
  • The imperative permission-job.yaml.
  • The "two PRs to add an environment" friction.

So this is necessary but not sufficient.

B. Move IAM to a separate cluster-gitops chart, but still managed by Terraform

Some orgs keep Terraform but invoke it via Atlantis/Terraform Cloud from cluster-gitops. Less invasive than moving to ACK, but:

  • Still two reconciliation engines (Atlantis + ArgoCD) instead of one.
  • Doesn't co-locate with the Function CR.
  • Requires standing up Atlantis. We don't have it; ACK is already there.

C. AWS CDK or Pulumi instead of either

Wholesale replacement. Out of proportion for this problem.

D. Do nothing

The 2026-04-28 incident says this isn't viable. Drift will happen again.

Option B vs the proposed decision: Option B is a reasonable fallback if the ACK IAM controller turns out to be unstable in practice. We pick the proposed decision because the controller is already running, the pattern is consistent with the existing Function/Bucket CRs, and it eliminates the cross-repo coupling rather than just moving it.

References

  • docs/planning/s3-notifier-iam-migration.md — phased implementation plan.
  • camarades-infrastructure PR #1 — Terraform CI/CD workflow (the companion change). Reduces drift risk during the migration window, regardless of this ADR's outcome.
  • ACK IAM Controller — upstream documentation.
  • 2026-04-28 incident: see commits and saga state evidence in the related PR description.