
S3-Notifier ACK Migration Plan (Terraform → Chart)

Temporary one-time planning document: delete or archive this doc once the migration is fully completed and the steady-state is captured in the service docs / ADR-010.

Phased plan to move every s3-notifier resource that's currently Terraform-managed (or imperative Helm-Job-managed) into the service Helm chart as ACK CRs, per ADR-010. This covers the Lambda function, IAM role + policies, S3 invoke permission, and bucket notification — across staging, production, and all preview environments.

What's moving (refer to ADR-010 ownership table)

Rows 1–10 of the ADR-010 ownership table all migrate. Specifically:

  • Production: Lambda function, IAM role + 2 policies, S3 invoke permission, bucket notification.
  • Staging: IAM role + 2 policies (function + bucket are already ACK). Imperative permission-job.yaml stays in place (see "Permission CR limitation" below).
  • Preview pool: IAM role(s) move to the chart (likely per-PR rather than the current shared role); the legacy pr-2291 Lambda, still tagged Terraform, gets cleaned up.

What stays in Terraform: the ACK controllers' own IAM role (rows 11–13 of the ADR-010 table) — those are bootstrap/circular-dependency exceptions.

Goals

  1. Eliminate cross-repo split between Function (chart) and Role (Terraform).
  2. Replace imperative permission-job.yaml with a declarative ACK Permission CR. Downgraded — upstream blocked (see "Permission CR limitation" below). The Helm Job stays for now; revisit when ACK lambda-controller adds Permission support.
  3. Move production Lambda function + bucket notification from Terraform to ACK.
  4. Allow new environments to be added in a single cluster-gitops PR.
  5. Reach the new state without breaking production or in-flight uploads.

Permission CR limitation (verified 2026-04-28)

kubectl explain function.lambda.services.k8s.aws.spec.permissions returns "field does not exist" against the deployed CRD. Upstream lambda-controller v1alpha1 ships only Function, Alias, EventSourceMapping, FunctionURLConfig, LayerVersion, Version, CodeSigningConfig — no Permission. The chart's permission-job.yaml therefore stays as the only imperative shell-out in the chart until upstream lands a Permission CR. This is a quality-of-implementation issue, not a blocker.
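
For context, that imperative shell-out boils down to a Job wrapping aws lambda add-permission. A hypothetical sketch of its shape (not the chart's actual template; names, image, and values paths are illustrative):

```yaml
# Hypothetical sketch of the invoke-permission Job the chart keeps.
# The real template is src/services/s3-notifier/.chart/templates/permission-job.yaml.
apiVersion: batch/v1
kind: Job
metadata:
  name: s3-notifier-permission
  annotations:
    argocd.argoproj.io/hook: Sync   # re-runs on every sync (see Phase 4 rollback note)
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  template:
    spec:
      restartPolicy: Never
      # serviceAccountName carrying AWS credentials omitted; the real job
      # also has to tolerate the statement-already-exists case to be idempotent.
      containers:
        - name: add-permission
          image: amazon/aws-cli   # entrypoint is `aws`, so args start at the service
          args:
            - lambda
            - add-permission
            - --function-name={{ .Values.functionName }}
            - --statement-id=s3-invoke
            - --action=lambda:InvokeFunction
            - --principal=s3.amazonaws.com
            - --source-arn=arn:aws:s3:::{{ .Values.bucketName }}
```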

Non-goals

  • Touch Terraform-managed IAM that doesn't belong to s3-notifier (cluster service accounts, ACK controllers' own role, GHA deployer beyond the staging-role widening required by P-1).
  • Change the s3-notifier service code itself.
  • Switch GitHub Actions away from static AWS keys (separate work — sensible follow-up to camarades-infrastructure PR #1).

Pre-flight

Must be done before any phase that creates or activates ACK-managed s3-notifier resources beyond what's already there.

P-0. Deploy ACK iam-controller (NEW — viability research found this missing)

Verified 2026-04-28: the cluster has ack-lambda-controller and ack-s3-controller in ack-system but no ack-iam-controller. The iam.services.k8s.aws CRD group is empty. Phase 1 onwards depends on this.

Tasks (cluster-gitops):

  • Add ack-iam-controller Helm chart to cluster-gitops/argocd/applicationsets/ so ArgoCD installs and reconciles it alongside existing controllers. Pin to ACK iam-controller v1.6.3 or later (released 2026-04-22; v1alpha1 CRDs cover Role, Policy, InstanceProfile, OpenIDConnectProvider, ServiceLinkedRole, User).
  • Configure the controller's K8s service account in ack-system namespace.
  • Apply via cluster-gitops PR. Verify with kubectl get crd | grep iam.services.k8s.aws after sync.
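
A sketch of what the entry could look like, assuming it mirrors how the existing ack-lambda-controller and ack-s3-controller are declared and that the controller is pulled from ACK's public OCI Helm registry (repo URL, chart name, and values keys below are assumptions to check against the real entries):

```yaml
# Hypothetical ArgoCD Application for the ack-iam-controller (sketch only).
# Assumes ACK's public OCI chart registry is registered in ArgoCD with
# enableOCI, the same way the lambda/s3 controller charts would be.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ack-iam-controller
  namespace: argocd
spec:
  project: default
  source:
    repoURL: public.ecr.aws/aws-controllers-k8s
    chart: iam-chart
    targetRevision: 1.6.3   # pin per P-0 (assumes chart version tracks controller version)
    helm:
      values: |
        serviceAccount:
          annotations:
            # IRSA binding to the role created in the camarades-infrastructure task below
            eks.amazonaws.com/role-arn: <iam-controller role ARN>
  destination:
    server: https://kubernetes.default.svc
    namespace: ack-system
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```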

Tasks (camarades-infrastructure):

  • Create the AWS IAM role the iam-controller will assume via IRSA (workload-identity bound to its K8s service account). Follow the same pattern as the existing ack_controllers role in ack-iam.tf, but scope it to IAM operations on arn:aws:iam::*:role/syrfS3Notifier*LambdaRole and the matching policy ARNs.

P-1. ACK iam-controller permission widening

The ACK iam-controller currently can't manage the s3-notifier execution roles. Without this widening, every Phase ≥ 2 sync will fail.

Tasks (camarades-infrastructure/terraform/lambda/ack-iam.tf):

  • Identify the IAM role assumed by the ACK iam-controller (workload-identity-bound K8s service account → AWS IAM role).
  • Grant iam:CreateRole, iam:DeleteRole, iam:GetRole, iam:UpdateRole, iam:UpdateAssumeRolePolicy, iam:TagRole, iam:UntagRole, iam:ListRoleTags, iam:AttachRolePolicy, iam:DetachRolePolicy, iam:ListAttachedRolePolicies, iam:PutRolePolicy, iam:GetRolePolicy, iam:DeleteRolePolicy, iam:ListRolePolicies, iam:PutRolePermissionsBoundary scoped to the four role-name patterns:
  • arn:aws:iam::*:role/syrfS3NotifierProductionLambdaRole
  • arn:aws:iam::*:role/syrfS3NotifierStagingLambdaRole
  • arn:aws:iam::*:role/syrfS3NotifierPreviewLambdaRole (legacy shared)
  • arn:aws:iam::*:role/syrfS3NotifierPr*LambdaRole (per-PR, post-migration)
  • Grant iam:PassRole for the same four ARNs so the ACK lambda-controller can pass the role to the Function CR.
  • Define the SyRFS3NotifierLambdaBoundary permissions boundary policy (CloudWatch Logs writes scoped to /aws/lambda/syrfAppUploadS3Notifier* log groups + s3:GetObject*/HeadObject scoped to arn:aws:s3:::syrfapp-uploads*).
  • Gate every role-write action above with Condition.StringEquals."iam:PermissionsBoundary" equal to that boundary's ARN. This is the load-bearing safety: without the condition, iam:PutRolePolicy + iam:UpdateAssumeRolePolicy on the role-name pattern would let a compromised iam-controller grant itself admin in camarades-net (CreateRole with attacker-trusting trust policy + inline Action:"*"). Do NOT grant iam:DeleteRolePermissionsBoundary — removing the boundary defeats the cap.
  • Attach the boundary to the existing TF-managed Lambda execution roles (production_lambda_role, staging_lambda_role, preview_lambda_role) via permissions_boundary so they remain manageable by the iam-controller after ACK adoption in Phase 4-5.
  • Expose the boundary ARN as a Terraform output (s3_notifier_boundary_arn) so the chart values can reference it.
  • Apply via Terraform — the change itself is in source control.
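
For clarity, here is the load-bearing statement sketched in YAML (the real definition is HCL in ack-iam.tf, and the action list is abbreviated to a few of the role-write actions named above):

```yaml
# Sketch of the gated role-write statement on the iam-controller's role.
# Not literal HCL: a YAML rendering of the IAM policy statement.
Effect: Allow
Action:
  - iam:CreateRole
  - iam:PutRolePolicy
  - iam:AttachRolePolicy
  # ...plus the rest of the role-write actions listed above
Resource:
  - arn:aws:iam::*:role/syrfS3NotifierProductionLambdaRole
  - arn:aws:iam::*:role/syrfS3NotifierStagingLambdaRole
  - arn:aws:iam::*:role/syrfS3NotifierPreviewLambdaRole
  - arn:aws:iam::*:role/syrfS3NotifierPr*LambdaRole
Condition:
  StringEquals:
    # Every write must keep the boundary attached, or it is denied.
    iam:PermissionsBoundary: arn:aws:iam::318789018510:policy/SyRFS3NotifierLambdaBoundary
```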

Status: implemented in camarades-infrastructure PR #2. Boundary ARN: arn:aws:iam::318789018510:policy/SyRFS3NotifierLambdaBoundary.

The ACK lambda-controller and s3-controller already have permissions for the Lambda functions and Bucket they currently manage in staging/preview. Verify (don't just assume) those permissions cover the production Lambda + bucket too before Phase 5.

  • aws iam list-attached-role-policies --role-name <ack-controller-role> → confirm Lambda + S3 + IAM permissions adequate for production-scoped ARNs.

P-2. Companion: Terraform CI/CD lands

Migration is much safer with camarades-infrastructure PR #1 in place — drift introduced during cutover gets surfaced via terraform plan on PRs.

  • camarades-infrastructure PR #1 merged.
  • infra-apply GitHub Environment configured with required reviewers.
  • Initial drift reconciliation has run (the staging IAM policy attachments are applied — fixing the 2026-04-28 incident as a side effect).

P-3. Verification rig

Same protocol used in every phase to prove the upload-to-saga path works end-to-end.

Upload smoke test:

  1. Authenticate to the target environment's web UI as a test user.
  2. Create or reuse a test project + systematic search.
  3. Upload a small CSV with at least one row and a screening column.
  4. Within 60 seconds, verify:
     • File present in s3://syrfapp-uploads-{env}/Projects/.../*.csv.
     • CloudWatch metric Invocations for syrfAppUploadS3Notifier-{env} increments by 1.
     • CloudWatch metric Errors for the same Lambda does not increment.
     • PM saga document in pmSearchImportJobState transitions out of Uploading (final state should be Complete or Faulted-with-parse-errors, not still Uploading).
  5. Record the search ID and saga state in the cutover log.

Pass criteria: ≥ 1 successful upload smoke test for the env before promoting to the next env.

Phase 1 — Add ACK IAM CRs to the chart, gated off

Goal: introduce the new resources without changing any environment's behaviour.

Changes

  • src/services/s3-notifier/.chart/templates/iam-role.yaml (new) — ACK Role CR (iam.services.k8s.aws/v1alpha1), rendered only when iam.managedByAck: true. Policies handled inline on the Role CR: attaches the AWS-managed AWSLambdaBasicExecutionRole (CloudWatch Logs) and an inline S3ReadAccess policy mirroring the existing Terraform setup. Annotations: services.k8s.aws/deletion-policy: retain and services.k8s.aws/adopted: "true" when iam.adoptExisting. spec.permissionsBoundary MUST reference the SyRFS3NotifierLambdaBoundary ARN (sourced from iam.permissionsBoundaryArn in values, default arn:aws:iam::318789018510:policy/SyRFS3NotifierLambdaBoundary). Without it the iam-controller is denied on every role write — its grants in P-1 require the boundary as a precondition.
  • src/services/s3-notifier/.chart/templates/adopted-resource.yaml (extended) — third AdoptedResource for the Role, gated by iam.managedByAck && iam.adoptExisting. Sync wave 0, mirroring Function/Bucket adoption.
  • src/services/s3-notifier/.chart/templates/function.yaml (modified) — when iam.managedByAck: true, spec.role is derived from the chart-computed ARN; otherwise unchanged (Terraform-supplied).
  • src/services/s3-notifier/.chart/values.yaml — new iam: section: managedByAck: false, adoptExisting: true, deletionPolicy: retain, permissionsBoundaryArn (required when managedByAck: true), optional roleName override, plus additionalManagedPolicies / additionalInlinePolicies extension hooks.
  • src/services/s3-notifier/.chart/templates/_helpers.tpl — helpers for iamRoleName (env-derived AWS role name), iamRoleArn (full ARN), and iamRoleK8sName (DNS-1123 metadata.name).
  • Not in Phase 1: no lambda-permission.yaml (ACK lambda-controller has no Permission CRD — see limitation note above). The imperative permission-job.yaml stays in place.
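
A hedged sketch of the new iam-role.yaml, assuming the ACK iam-controller v1alpha1 Role spec fields (name, assumeRolePolicyDocument, policies, inlinePolicies, permissionsBoundary) and the helper names above; the "s3-notifier." helper prefix and the exact policy bodies are assumptions to be mirrored from the Terraform source:

```yaml
{{- if .Values.iam.managedByAck }}
# Sketch only. Field names assume the ACK iam-controller v1alpha1 Role CRD;
# policy bodies must mirror the existing Terraform setup exactly.
apiVersion: iam.services.k8s.aws/v1alpha1
kind: Role
metadata:
  name: {{ include "s3-notifier.iamRoleK8sName" . }}
  annotations:
    services.k8s.aws/deletion-policy: {{ .Values.iam.deletionPolicy }}
    {{- if .Values.iam.adoptExisting }}
    services.k8s.aws/adopted: "true"
    {{- end }}
spec:
  name: {{ include "s3-notifier.iamRoleName" . }}
  # P-1's grants deny any role write without this boundary attached.
  permissionsBoundary: {{ required "iam.permissionsBoundaryArn is required when iam.managedByAck" .Values.iam.permissionsBoundaryArn }}
  assumeRolePolicyDocument: |
    {
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Principal": { "Service": "lambda.amazonaws.com" },
        "Action": "sts:AssumeRole"
      }]
    }
  policies:
    - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
  inlinePolicies:
    S3ReadAccess: |
      {
        "Version": "2012-10-17",
        "Statement": [{
          "Effect": "Allow",
          "Action": ["s3:GetObject", "s3:GetObjectVersion"],
          "Resource": "arn:aws:s3:::syrfapp-uploads*/*"
        }]
      }
{{- end }}
```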

Apply

  • Land via PR. ArgoCD reconciles all envs with no diff (flag is off).

Verification

  • argocd app diff syrf-staging-s3-notifier shows no changes.
  • argocd app diff syrf-preview-pr-{N}-s3-notifier (any active PR) shows no changes.
  • (Production app is currently service.enabled: false — no diff possible until Phase 5.)

Rollback

Revert the PR. No live state was changed.

Phase 2 — Cut over a single preview environment

Goal: prove the path on a low-blast-radius env.

Pick the target

  • A short-lived preview PR (not one that someone is actively testing).
  • Note the existing role name; ACK will adopt it.

Changes

  • In cluster-gitops/syrf/environments/preview/pr-{N}/services/s3-notifier.values.yaml:
  • Set iam.managedByAck: true.
  • Set iam.adoptExisting: true so the ACK Role CR adopts the existing AWS role rather than recreating it.
  • Reference the existing role name (syrfS3NotifierPreviewLambdaRole for now — the shared role).
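
Concretely, the values delta is small. A sketch, assuming the key names from the Phase 1 values schema:

```yaml
# cluster-gitops/syrf/environments/preview/pr-{N}/services/s3-notifier.values.yaml
iam:
  managedByAck: true
  adoptExisting: true                         # adopt the existing AWS role, don't recreate it
  roleName: syrfS3NotifierPreviewLambdaRole   # still the shared role at this phase
```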

Apply

  • Open cluster-gitops PR. ArgoCD app diff should show: new ACK Role CR + AdoptedResource (for the Role) and the Function CR's spec.role switching from external ARN to chart-derived ARN; 0 AWS-side changes (because of adoption). The imperative permission-job.yaml keeps running for invoke-permission management.
  • Merge. Watch ArgoCD sync.

Verification (run P-3 protocol)

  • Smoke test passes.
  • aws iam list-role-policies --role-name syrfS3NotifierPreviewLambdaRole matches expected policies (this is still the shared role across all preview environments at this point).
  • CloudWatch shows successful Lambda invocations from the smoke test.

Rollback

  • In the same values file, set iam.managedByAck: false. Re-sync.
  • If the role's policies were modified by ACK during sync (this shouldn't happen under adoptExisting, but drift makes it possible), restore via the existing Terraform module.

Phase 3 — All previews + per-PR role split

Goal: migrate all preview-pool Lambdas, validate the chart at scale, and switch from the shared syrfS3NotifierPreviewLambdaRole to per-PR roles for blast-radius isolation.

Pre-cutover

  • Decide naming convention for per-PR roles: syrfS3NotifierPr{N}LambdaRole.
  • Confirm P-1 widening covers syrfS3NotifierPr*LambdaRole ARN pattern.

Changes

  • Default iam.managedByAck: true in cluster-gitops/syrf/environments/preview/preview.values.yaml (preview defaults).
  • For new previews: chart creates a per-PR role; no Terraform involvement.
  • For previously-migrated previews from Phase 2: still adopting the shared role. They get switched to per-PR in this phase by changing the role name in their values and applying.
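
A sketch of the preview-defaults change; the per-PR derivation is an assumption about how the iamRoleName helper behaves when no override is set:

```yaml
# cluster-gitops/syrf/environments/preview/preview.values.yaml (sketch)
iam:
  managedByAck: true
  # No roleName override here: the chart's iamRoleName helper is assumed to
  # derive syrfS3NotifierPr{N}LambdaRole from the environment name, so each
  # new preview gets its own role. Already-migrated previews from Phase 2
  # keep their explicit shared-role roleName until switched over.
```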

Apply

  • Sync envs in batches of 3, with a 5-minute soak between batches to watch for unexpected ArgoCD errors.
  • Spot-check 2 of the migrated previews with the smoke test.

Phase 3 cleanup tasks

  • Legacy pr-2291 Lambda: still tagged ManagedBy=Terraform while everything else is ACK. Decide: adopt it into ACK, or close PR #2291 if it's gone stale (it has been open for 70+ days). Track and execute the chosen action here.
  • After all active previews are on per-PR roles, the shared syrfS3NotifierPreviewLambdaRole becomes vestigial. Don't delete yet — Phase 6 owns that step.

Rollback

  • Revert the preview-defaults values change. One commit.

Phase 4 — Staging

Goal: cut over the longest-lived non-prod env. Staging Lambda function + bucket are already ACK; this phase moves IAM into the chart. The imperative permission-job.yaml stays in place — replacing it is upstream-blocked on ACK adding a Permission CR.

Pre-cutover

  • Confirm the staging IAM role drift fix from camarades-infrastructure PR #1 has been applied (CloudWatch Logs + S3 read policies present). This phase adopts that state.
  • Run smoke test in current state (Terraform-managed role, ACK-managed function) → pass.

Changes

  • In cluster-gitops/syrf/environments/staging/s3-notifier/values.yaml:
  • iam.managedByAck: true.
  • iam.adoptExisting: true.
  • iam.roleName: syrfS3NotifierStagingLambdaRole.

Apply

  • ArgoCD app diff should show new ACK Role CR + AdoptedResource and Function spec.role updated to chart-derived ARN; 0 AWS-side mutations.
  • Merge. ArgoCD sync.

Verification (P-3 protocol)

  • Smoke test in staging.
  • Critical: tail CloudWatch logs during smoke test; failures here are silent without logging permissions.
  • Confirm the Lambda's resource-based policy is unchanged (permission-job.yaml is still owning it): aws lambda get-policy --function-name syrfAppUploadS3Notifier-staging should match pre-cutover output byte-for-byte.

Rollback

  • Flip iam.managedByAck: false, re-sync. Terraform-managed role takes over again.
  • If permission-job.yaml didn't re-run (it's idempotent but timed off Sync events), trigger an ArgoCD hard refresh.

Phase 5 — Production

Goal: complete the migration. This is the biggest phase: production Lambda function + bucket + IAM + invoke-permission + bucket-notification all migrate together by enabling the chart in production for the first time and adopting all existing AWS resources.

Pre-cutover gates

  • Phase 4 has been stable in staging for at least 7 days.
  • At least 5 successful smoke tests across staging + previews since Phase 4 cutover.
  • In the production cluster-gitops values, s3-notifier/config.yaml currently has service.enabled: false, and values.yaml already declares adoptExisting: true + deletionPolicy: retain. Re-confirm both.
  • On-call coverage during the cutover window.
  • Communicated maintenance window to platform owners.

Changes (cluster-gitops)

  • In cluster-gitops/syrf/environments/production/s3-notifier/config.yaml:
  • service.enabled: true.
  • In cluster-gitops/syrf/environments/production/s3-notifier/values.yaml:
  • iam.managedByAck: true.
  • iam.adoptExisting: true (already present for bucket and function — extend to IAM).
  • iam.roleName: syrfS3NotifierProductionLambdaRole.
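
The full production delta, sketched (key names per the Phase 1 values schema):

```yaml
# cluster-gitops/syrf/environments/production/s3-notifier/config.yaml
service:
  enabled: true          # was false; first time the chart renders in production
---
# cluster-gitops/syrf/environments/production/s3-notifier/values.yaml (iam additions)
iam:
  managedByAck: true
  adoptExisting: true    # already the pattern for bucket + function
  roleName: syrfS3NotifierProductionLambdaRole
```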

Apply

  • Open cluster-gitops PR. ArgoCD app diff should show:
  • New K8s resources: Function, Bucket, Role CRs + matching AdoptedResources (all with adopted: "true" annotation), plus the imperative permission-job Job. The Lambda's S3 invoke permission continues to be managed by permission-job.yaml — no Permission CR (upstream-blocked).
  • Zero AWS-side mutations expected.
  • Merge. Watch ArgoCD sync.
  • Verify aws lambda get-function --function-name syrfAppUploadS3Notifier --query Tags.ManagedBy flips from Terraform to argocd-ack.

Verification

  • Run P-3 smoke test on a non-customer-impacting test project.
  • Watch CloudWatch metrics for the production Lambda for 1 hour post-cutover (no error rate increase).
  • Spot-check user uploads in real customer projects (with their consent / from a known test account).
  • Verify production bucket notification still fires (aws s3api get-bucket-notification-configuration --bucket syrfapp-uploads).

Rollback (production)

  • Fast path: in cluster-gitops values, service.enabled: false → ArgoCD removes the chart's CRs. Because CRs have deletion-policy: retain, the AWS resources stay. Terraform's view of them comes back.
  • Slow path (if AWS state was mutated unexpectedly): run terraform apply to restore from Terraform's record.
  • Worst case (Lambda function got deleted during ACK reconciliation despite retain): re-deploy from s3://camarades-terraform-state-aws/lambda-packages/production.zip via terraform.

Phase 6 — Decommission Terraform IAM (and revisit permission-job once upstream lands a Permission CR)

Goal: eliminate the dead code paths.

Only after all environments have been migrated for at least 14 days with no rollbacks.

Changes (camarades-infrastructure)

  • Remove from terraform/lambda/main.tf:
  • aws_iam_role.production_lambda_role + dependent policy resources.
  • aws_iam_role.preview_lambda_role + dependent policy resources (the shared one).
  • aws_iam_role.staging_lambda_role + dependent policy resources.
  • aws_lambda_function.s3_notifier_production.
  • aws_lambda_function.s3_notifier_preview (for_each).
  • aws_lambda_permission.s3_invoke_production.
  • aws_lambda_permission.s3_invoke_preview (for_each).
  • aws_s3_bucket_notification.uploads.
  • Remove the corresponding entries from terraform/lambda/github-actions-iam.tf (the resource-list ARNs).
  • Use terraform state rm, not destroy. The live resources are now ACK-adopted; we just want to drop them from Terraform state without deleting them.
  • Update docs/terraform-cicd.md to reflect the smaller scope.

Changes (syrf monorepo)

  • Keep src/services/s3-notifier/.chart/templates/permission-job.yaml — only delete once ACK ships a Permission CRD and we've migrated to it in a follow-up phase.
  • Delete src/services/s3-notifier/.chart/templates/env-vars-job.yaml if its functionality has also moved to ACK CRs (likely needs its own decision, may stay).
  • Delete the iam.managedByAck flag (it's now always true) and remove the false branch from each template.
  • Update CLAUDE.md and docs/architecture/ references to the s3-notifier model.

Verification

  • terraform plan is clean (after the state rm, nothing is scheduled to be created or destroyed).
  • ArgoCD apps still sync.
  • Smoke test passes on at least staging + production.

Risk-weighted rollback summary

At every phase except the final cleanup, rollback is a one-line flag flip in a values file (Phases 2–4) or a service.enabled toggle (Phase 5). The Terraform-managed resources stay declared until Phase 6, so reverting to the old model is always a few minutes of ArgoCD sync away. This is the load-bearing reason the plan is safe.

The one phase that can't be rolled back with a one-line change is Phase 5, in the case where AWS state actually changed (i.e. ACK did more than just adopt). Mitigation: deletion-policy: retain on every CR + double-checked adoptExisting: true + restoring from Terraform state as documented above.

Tracking

A ZenHub epic should be created for this migration. Each phase is a story:

  • Epic: s3-notifier ACK migration (ADR-010)
  • Phases P-1, P-2, P-3 → setup stories
  • Phases 1 through 6 → implementation stories
  • Per-environment smoke tests → sub-tasks under each phase
  • Cleanup task: legacy pr-2291 Lambda (Phase 3)

Update ADR-010's zenhub-ticket and this plan's zenhub-ticket once the epic exists.