S3-Notifier ACK Migration Plan (Terraform → Chart)¶
Temporary one-time planning document: delete or archive this doc once the migration is fully completed and the steady-state is captured in the service docs / ADR-010.
Phased plan to move every s3-notifier resource that's currently Terraform-managed (or imperative Helm-Job-managed) into the service Helm chart as ACK CRs, per ADR-010. This covers the Lambda function, IAM role + policies, S3 invoke permission, and bucket notification — across staging, production, and all preview environments.
What's moving (refer to ADR-010 ownership table)¶
Rows 1–10 of the ADR-010 ownership table all migrate. Specifically:
- Production: Lambda function, IAM role + 2 policies, S3 invoke permission, bucket notification.
- Staging: IAM role + 2 policies (function + bucket are already ACK). Imperative `permission-job.yaml` stays in place (see "Permission CR limitation" below).
- Preview pool: IAM role(s) move to chart (likely per-PR rather than the current shared role); the legacy `pr-2291` Lambda, still tagged `Terraform`, is cleaned up.
What stays in Terraform: the ACK controllers' own IAM role (rows 11–13 of the ADR-010 table) — those are bootstrap/circular-dependency exceptions.
Goals¶
- Eliminate cross-repo split between Function (chart) and Role (Terraform).
- Replace imperative `permission-job.yaml` with a declarative ACK `Permission` CR. Downgraded — upstream blocked (see "Permission CR limitation" below). The Helm Job stays for now; revisit when the ACK lambda-controller adds Permission support.
- Move production Lambda function + bucket notification from Terraform to ACK.
- Allow new environments to be added in a single cluster-gitops PR.
- Reach the new state without breaking production or in-flight uploads.
Permission CR limitation (verified 2026-04-28)¶
kubectl explain function.lambda.services.k8s.aws.spec.permissions returns "field does not exist" against the deployed CRD. Upstream lambda-controller v1alpha1 ships only Function, Alias, EventSourceMapping, FunctionURLConfig, LayerVersion, Version, CodeSigningConfig — no Permission. The chart's permission-job.yaml therefore stays as the only imperative shell-out in the chart until upstream lands a Permission CR. This is a quality-of-implementation issue, not a blocker.
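For context, the imperative path is a chart-managed hook Job that shells out to the AWS CLI. A minimal sketch of that shape, with hypothetical names and hook wiring (the chart's actual permission-job.yaml is authoritative):

```yaml
# Illustrative sketch only; the real permission-job.yaml in the chart may differ.
apiVersion: batch/v1
kind: Job
metadata:
  name: s3-notifier-permission-job
  annotations:
    argocd.argoproj.io/hook: PostSync               # re-runs on every ArgoCD sync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  template:
    spec:
      serviceAccountName: s3-notifier-deployer      # hypothetical IRSA-bound SA
      restartPolicy: Never
      containers:
        - name: add-permission
          image: amazon/aws-cli                     # any AWS CLI v2 image
          env:
            - name: ENV
              value: staging                        # templated per environment by the chart
          command: ["/bin/sh", "-c"]
          args:
            - |
              # Idempotent: tolerate "statement already exists" on re-runs.
              aws lambda add-permission \
                --function-name "syrfAppUploadS3Notifier-${ENV}" \
                --statement-id AllowS3Invoke \
                --action lambda:InvokeFunction \
                --principal s3.amazonaws.com \
                --source-arn "arn:aws:s3:::syrfapp-uploads-${ENV}" \
              || true
```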
Non-goals¶
- Touch Terraform-managed IAM that doesn't belong to s3-notifier (cluster service accounts, ACK controllers' own role, GHA deployer beyond the staging-role widening required by P-1).
- Change the s3-notifier service code itself.
- Switch GitHub Actions away from static AWS keys (separate work — sensible follow-up to camarades-infrastructure PR #1).
Pre-flight¶
Must be done before any phase that creates or activates ACK-managed s3-notifier resources beyond what's already there.
P-0. Deploy ACK iam-controller (NEW — viability research found this missing)¶
Verified 2026-04-28: the cluster has ack-lambda-controller and ack-s3-controller in ack-system but no ack-iam-controller. The iam.services.k8s.aws CRD group is empty. Phase 1 onwards depends on this.
Tasks (cluster-gitops):
- Add the `ack-iam-controller` Helm chart to `cluster-gitops/argocd/applicationsets/` so ArgoCD installs and reconciles it alongside the existing controllers (see the sketch after this list). Pin to ACK iam-controller v1.6.3 or later (released 2026-04-22; v1alpha1 CRDs cover Role, Policy, InstanceProfile, OpenIDConnectProvider, ServiceLinkedRole, User).
- Configure the controller's K8s service account in the `ack-system` namespace.
- Apply via cluster-gitops PR. Verify with `kubectl get crd | grep iam.services.k8s.aws` after sync.
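A hedged sketch of the Argo CD Application this could resolve to, assuming the applicationset wires ACK controllers the same way as the existing lambda/s3 ones (the registry path is the public ACK Helm registry; the region and role ARN are placeholders):

```yaml
# Sketch only: the real definition lives in cluster-gitops/argocd/applicationsets/.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ack-iam-controller
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
    namespace: ack-system
  source:
    repoURL: public.ecr.aws/aws-controllers-k8s     # OCI Helm registry for ACK charts
    chart: iam-chart
    targetRevision: 1.6.3                           # per P-0: v1.6.3 or later
    helm:
      values: |
        aws:
          region: eu-west-1                         # assumption: match the existing controllers
        serviceAccount:
          annotations:
            # IRSA role created in the camarades-infrastructure task below (name hypothetical)
            eks.amazonaws.com/role-arn: arn:aws:iam::318789018510:role/ack-iam-controller
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```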
Tasks (camarades-infrastructure):
- Create the AWS IAM role the iam-controller will assume via IRSA (workload-identity bound to its K8s service account). Follow the same pattern as the existing `ack_controllers` role in `ack-iam.tf`, but scope it to IAM operations on `arn:aws:iam::*:role/syrfS3Notifier*LambdaRole` and the matching policy ARNs.
P-1. ACK iam-controller permission widening¶
The ACK iam-controller currently can't manage the s3-notifier execution roles. Without this every Phase ≥ 2 sync will fail.
Tasks (camarades-infrastructure/terraform/lambda/ack-iam.tf):
- Identify the IAM role assumed by the ACK iam-controller (workload-identity-bound K8s service account → AWS IAM role).
- Grant `iam:CreateRole`, `iam:DeleteRole`, `iam:GetRole`, `iam:UpdateRole`, `iam:UpdateAssumeRolePolicy`, `iam:TagRole`, `iam:UntagRole`, `iam:ListRoleTags`, `iam:AttachRolePolicy`, `iam:DetachRolePolicy`, `iam:ListAttachedRolePolicies`, `iam:PutRolePolicy`, `iam:GetRolePolicy`, `iam:DeleteRolePolicy`, `iam:ListRolePolicies`, and `iam:PutRolePermissionsBoundary`, scoped to the four role-name patterns:
    - `arn:aws:iam::*:role/syrfS3NotifierProductionLambdaRole`
    - `arn:aws:iam::*:role/syrfS3NotifierStagingLambdaRole`
    - `arn:aws:iam::*:role/syrfS3NotifierPreviewLambdaRole` (legacy shared)
    - `arn:aws:iam::*:role/syrfS3NotifierPr*LambdaRole` (per-PR, post-migration)
- Grant `iam:PassRole` for the same four ARNs so the ACK lambda-controller can pass the role to the Function CR.
- Define the `SyRFS3NotifierLambdaBoundary` permissions boundary policy (CloudWatch Logs writes scoped to `/aws/lambda/syrfAppUploadS3Notifier*` log groups + `s3:GetObject*`/`HeadObject` scoped to `arn:aws:s3:::syrfapp-uploads*`).
- Gate every role-write action above with `Condition.StringEquals."iam:PermissionsBoundary"` equal to that boundary's ARN (see the sketch after this list). This is the load-bearing safety: without the condition, `iam:PutRolePolicy` + `iam:UpdateAssumeRolePolicy` on the role-name pattern would let a compromised iam-controller grant itself admin in `camarades-net` (CreateRole with an attacker-trusting trust policy + an inline `Action: "*"`). Do NOT grant `iam:DeleteRolePermissionsBoundary` — removing the boundary defeats the cap.
- Attach the boundary to the existing TF-managed Lambda execution roles (`production_lambda_role`, `staging_lambda_role`, `preview_lambda_role`) via `permissions_boundary` so they remain manageable by the iam-controller after ACK adoption in Phases 4–5.
- Expose the boundary ARN as a Terraform output (`s3_notifier_boundary_arn`) so the chart values can reference it.
- Apply via Terraform — the change itself is in source control.
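The load-bearing condition, rendered here as a YAML sketch of the policy statement for readability (the actual definition is Terraform HCL in `ack-iam.tf`; the action list is abridged):

```yaml
# Sketch of the boundary-gated grant on the iam-controller's role (abridged).
Effect: Allow
Action:
  - iam:CreateRole
  - iam:PutRolePolicy
  - iam:UpdateAssumeRolePolicy
  # ...plus the rest of the role-write actions listed above
Resource:
  - arn:aws:iam::*:role/syrfS3NotifierProductionLambdaRole
  - arn:aws:iam::*:role/syrfS3NotifierStagingLambdaRole
  - arn:aws:iam::*:role/syrfS3NotifierPreviewLambdaRole
  - arn:aws:iam::*:role/syrfS3NotifierPr*LambdaRole
Condition:
  StringEquals:
    # Role writes are only allowed when the target role carries this boundary.
    iam:PermissionsBoundary: arn:aws:iam::318789018510:policy/SyRFS3NotifierLambdaBoundary
```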
Status: implemented in camarades-infrastructure PR #2. Boundary ARN: `arn:aws:iam::318789018510:policy/SyRFS3NotifierLambdaBoundary`.
The ACK lambda-controller and s3-controller already have permissions for the Lambda functions and Bucket they currently manage in staging/preview. Verify (don't just assume) those permissions cover the production Lambda + bucket too before Phase 5.
- `aws iam list-attached-role-policies --role-name <ack-controller-role>` → confirm the Lambda + S3 + IAM permissions are adequate for production-scoped ARNs.
P-2. Companion: Terraform CI/CD lands¶
Migration is much safer with camarades-infrastructure PR #1 in place — drift introduced during cutover gets surfaced via terraform plan on PRs.
- camarades-infrastructure PR #1 merged.
- `infra-apply` GitHub Environment configured with required reviewers.
- Initial drift reconciliation has run (the staging IAM policy attachments are applied — fixing the 2026-04-28 incident as a side effect).
P-3. Verification rig¶
Same protocol used in every phase to prove the upload-to-saga path works end-to-end.
Upload smoke test:
- Authenticate to the target environment's web UI as a test user.
- Create or reuse a test project + systematic search.
- Upload a small CSV with at least one row and a screening column.
- Within 60 seconds, verify:
    - File present in `s3://syrfapp-uploads-{env}/Projects/.../*.csv`.
    - CloudWatch metric `Invocations` for `syrfAppUploadS3Notifier-{env}` increments by 1.
    - CloudWatch metric `Errors` for the same Lambda does not increment.
    - PM saga document in `pmSearchImportJobState` transitions out of `Uploading` (final state should be `Complete` or `Faulted-with-parse-errors`, not still `Uploading`).
- Record the search ID and saga state in the cutover log.
Pass criteria: ≥ 1 successful upload smoke test for the env before promoting to the next env.
Phase 1 — Add ACK IAM CRs to the chart, gated off¶
Goal: introduce the new resources without changing any environment's behaviour.
Changes¶
- `src/services/s3-notifier/.chart/templates/iam-role.yaml` (new) — ACK `Role` CR (`iam.services.k8s.aws/v1alpha1`), rendered only when `iam.managedByAck: true`. Policies handled inline on the Role CR: attaches the AWS-managed `AWSLambdaBasicExecutionRole` (CloudWatch Logs) and an inline `S3ReadAccess` policy mirroring the existing Terraform setup. Annotations: `services.k8s.aws/deletion-policy: retain` and `services.k8s.aws/adopted: "true"` when `iam.adoptExisting`. `spec.permissionsBoundary` MUST reference the `SyRFS3NotifierLambdaBoundary` ARN (sourced from `iam.permissionsBoundaryArn` in values, default `arn:aws:iam::318789018510:policy/SyRFS3NotifierLambdaBoundary`). Without it the iam-controller is denied on every role write — its grants in P-1 require the boundary as a precondition. (A rendered sketch follows this list.)
- `src/services/s3-notifier/.chart/templates/adopted-resource.yaml` (extended) — third `AdoptedResource` for the Role, gated by `iam.managedByAck && iam.adoptExisting`. Sync wave 0, mirroring Function/Bucket adoption.
- `src/services/s3-notifier/.chart/templates/function.yaml` (modified) — when `iam.managedByAck: true`, `spec.role` is derived from the chart-computed ARN; otherwise unchanged (Terraform-supplied).
- `src/services/s3-notifier/.chart/values.yaml` — new `iam:` section: `managedByAck: false`, `adoptExisting: true`, `deletionPolicy: retain`, `permissionsBoundaryArn` (required when `managedByAck: true`), optional `roleName` override, plus `additionalManagedPolicies` / `additionalInlinePolicies` extension hooks.
- `src/services/s3-notifier/.chart/templates/_helpers.tpl` — helpers for `iamRoleName` (env-derived AWS role name), `iamRoleArn` (full ARN), and `iamRoleK8sName` (DNS-1123 `metadata.name`).
- Not in Phase 1: no `lambda-permission.yaml` (the ACK lambda-controller has no `Permission` CRD — see the limitation note above). The imperative `permission-job.yaml` stays in place.
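Rough rendered sketches of the new pieces (illustrative only: metadata names, the staging-style example values, and the inline policy wording are assumptions; field names follow the ACK iam-controller and ACK runtime v1alpha1 CRDs):

```yaml
# values.yaml — new iam: section (defaults keep every environment unchanged)
iam:
  managedByAck: false
  adoptExisting: true
  deletionPolicy: retain
  permissionsBoundaryArn: arn:aws:iam::318789018510:policy/SyRFS3NotifierLambdaBoundary
  # roleName: ""                  # optional override of the env-derived name
  # additionalManagedPolicies: []
  # additionalInlinePolicies: {}
---
# iam-role.yaml rendered output (only when iam.managedByAck: true); staging names are examples
apiVersion: iam.services.k8s.aws/v1alpha1
kind: Role
metadata:
  name: s3-notifier-lambda-role             # DNS-1123 name from the iamRoleK8sName helper
  annotations:
    services.k8s.aws/deletion-policy: retain
    services.k8s.aws/adopted: "true"        # only when iam.adoptExisting
spec:
  name: syrfS3NotifierStagingLambdaRole     # AWS role name from the iamRoleName helper
  permissionsBoundary: arn:aws:iam::318789018510:policy/SyRFS3NotifierLambdaBoundary
  assumeRolePolicyDocument: |
    {
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Principal": { "Service": "lambda.amazonaws.com" },
        "Action": "sts:AssumeRole"
      }]
    }
  policies:
    - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
  inlinePolicies:
    S3ReadAccess: |
      {
        "Version": "2012-10-17",
        "Statement": [{
          "Effect": "Allow",
          "Action": "s3:GetObject*",
          "Resource": "arn:aws:s3:::syrfapp-uploads-staging/*"
        }]
      }
---
# adopted-resource.yaml — third AdoptedResource, sync wave 0, mirroring Function/Bucket
apiVersion: services.k8s.aws/v1alpha1
kind: AdoptedResource
metadata:
  name: s3-notifier-lambda-role-adoption
  annotations:
    argocd.argoproj.io/sync-wave: "0"
spec:
  aws:
    nameOrID: syrfS3NotifierStagingLambdaRole
  kubernetes:
    group: iam.services.k8s.aws
    kind: Role
    metadata:
      name: s3-notifier-lambda-role
```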
Apply¶
- Land via PR. ArgoCD reconciles all envs with no diff (flag is off).
Verification¶
- `argocd app diff syrf-staging-s3-notifier` shows no changes.
- `argocd app diff syrf-preview-pr-{N}-s3-notifier` (any active PR) shows no changes.
- (Production app is currently `service.enabled: false` — no diff possible until Phase 5.)
Rollback¶
Revert the PR. No live state was changed.
Phase 2 — Cut over a single preview environment¶
Goal: prove the path on a low-blast-radius env.
Pick the target¶
- A short-lived preview PR (not someone's actively-tested PR).
- Note the existing role name; ACK will adopt it.
Changes¶
- In `cluster-gitops/syrf/environments/preview/pr-{N}/services/s3-notifier.values.yaml` (see the values sketch after this list):
    - Set `iam.managedByAck: true`.
    - Set `iam.adoptExisting: true` so the ACK `Role` CR adopts the existing AWS role rather than recreating it.
    - Reference the existing role name (`syrfS3NotifierPreviewLambdaRole` for now — the shared role).
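A minimal sketch of that values change, using the key names introduced in Phase 1 (the exact cluster-gitops file layout may differ):

```yaml
# cluster-gitops/syrf/environments/preview/pr-{N}/services/s3-notifier.values.yaml (sketch)
iam:
  managedByAck: true
  adoptExisting: true                          # adopt the existing AWS role, don't recreate it
  roleName: syrfS3NotifierPreviewLambdaRole    # the shared preview role, for now
```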
Apply¶
- Open cluster-gitops PR. ArgoCD `app diff` should show: new ACK `Role` CR + `AdoptedResource` (for the Role) and the Function CR's `spec.role` switching from external ARN to chart-derived ARN; 0 AWS-side changes (because of adoption). The imperative `permission-job.yaml` keeps running for invoke-permission management.
- Merge. Watch ArgoCD sync.
Verification (run P-3 protocol)¶
- Smoke test passes.
- `aws iam list-role-policies --role-name syrfS3NotifierPreviewLambdaRole` matches the expected policies (this is still the shared role across all preview environments at this point).
- CloudWatch shows successful Lambda invocations from the smoke test.
Rollback¶
- In the same values file, set `iam.managedByAck: false`. Re-sync.
- If the role's policies were modified by ACK during sync (shouldn't happen under `adoptExisting`, but possible if there was drift), restore via the existing Terraform module.
Phase 3 — All previews + per-PR role split¶
Goal: migrate all preview-pool Lambdas, validate the chart at scale, and switch from the shared syrfS3NotifierPreviewLambdaRole to per-PR roles for blast-radius isolation.
Pre-cutover¶
- Decide the naming convention for per-PR roles: `syrfS3NotifierPr{N}LambdaRole`.
- Confirm the P-1 widening covers the `syrfS3NotifierPr*LambdaRole` ARN pattern.
Changes¶
- Default `iam.managedByAck: true` in `cluster-gitops/syrf/environments/preview/preview.values.yaml` (preview defaults — see the sketch after this list).
- For new previews: the chart creates a per-PR role; no Terraform involvement.
- For previously-migrated previews from Phase 2: still adopting the shared role. They get switched to per-PR in this phase by changing the role name in their values and applying.
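A minimal sketch of the preview-defaults flip; the `adoptExisting: false` setting for fresh per-PR roles is an assumption (a role that doesn't exist yet has nothing to adopt):

```yaml
# cluster-gitops/syrf/environments/preview/preview.values.yaml (sketch)
iam:
  managedByAck: true
  adoptExisting: false   # assumption: new previews create a fresh per-PR role
  # roleName intentionally unset: the chart derives syrfS3NotifierPr{N}LambdaRole per PR
```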
Apply¶
- Sync envs in batches of 3, with a 5-minute soak between batches to watch for unexpected ArgoCD errors.
- Spot-check 2 of the migrated previews with the smoke test.
Phase 3 cleanup tasks¶
- Legacy `pr-2291` Lambda: still tagged `ManagedBy=Terraform` while everything else is ACK. Decide: adopt into ACK, or close PR #2291 if it's gone stale (it's been around 70+ days). Track and execute the chosen action here.
- After all active previews are on per-PR roles, the shared `syrfS3NotifierPreviewLambdaRole` becomes vestigial. Don't delete it yet — Phase 6 owns that step.
Rollback¶
- Revert the preview-defaults values change. One commit.
Phase 4 — Staging¶
Goal: cut over the longest-lived non-prod env. Staging Lambda function + bucket are already ACK; this phase moves IAM into the chart. The imperative permission-job.yaml stays in place — replacing it is upstream-blocked on ACK adding a Permission CR.
Pre-cutover¶
- Confirm the staging IAM role drift fix from camarades-infrastructure PR #1 has been applied (CloudWatch Logs + S3 read policies present). This phase adopts that state.
- Run smoke test in current state (Terraform-managed role, ACK-managed function) → pass.
Changes¶
- In `cluster-gitops/syrf/environments/staging/s3-notifier/values.yaml`:
    - `iam.managedByAck: true`.
    - `iam.adoptExisting: true`.
    - `iam.roleName: syrfS3NotifierStagingLambdaRole`.
Apply¶
- ArgoCD `app diff` should show the new ACK `Role` CR + `AdoptedResource` and the Function `spec.role` updated to the chart-derived ARN; 0 AWS-side mutations.
- Merge. ArgoCD sync.
Verification (P-3 protocol)¶
- Smoke test in staging.
- Critical: tail CloudWatch logs during smoke test; failures here are silent without logging permissions.
- Confirm the Lambda's resource-based policy is unchanged (`permission-job.yaml` still owns it): `aws lambda get-policy --function-name syrfAppUploadS3Notifier-staging` should match the pre-cutover output byte-for-byte.
Rollback¶
- Flip `iam.managedByAck: false`, re-sync. The Terraform-managed role takes over again.
- If `permission-job.yaml` didn't re-run (it's idempotent but timed off Sync events), trigger an ArgoCD hard refresh.
Phase 5 — Production¶
Goal: complete the migration. This is the biggest phase: production Lambda function + bucket + IAM + invoke-permission + bucket-notification all migrate together by enabling the chart in production for the first time and adopting all existing AWS resources.
Pre-cutover gates¶
- Phase 4 has been stable in staging for at least 7 days.
- At least 5 successful smoke tests across staging + previews since Phase 4 cutover.
- Production cluster-gitops values: `s3-notifier/config.yaml` currently has `service.enabled: false`, and `values.yaml` already declares `adoptExisting: true` + `deletionPolicy: retain`. Re-confirm both.
- On-call coverage during the cutover window.
- Communicated maintenance window to platform owners.
Changes (cluster-gitops)¶
- In `cluster-gitops/syrf/environments/production/s3-notifier/config.yaml` (see the sketch after this list):
    - `service.enabled: true`.
- In `cluster-gitops/syrf/environments/production/s3-notifier/values.yaml`:
    - `iam.managedByAck: true`.
    - `iam.adoptExisting: true` (already present for bucket and function — extend to IAM).
    - `iam.roleName: syrfS3NotifierProductionLambdaRole`.
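Sketch of the two production files after this change (layout illustrative; the keys are the ones listed above):

```yaml
# cluster-gitops/syrf/environments/production/s3-notifier/config.yaml (sketch)
service:
  enabled: true
---
# cluster-gitops/syrf/environments/production/s3-notifier/values.yaml (sketch)
iam:
  managedByAck: true
  adoptExisting: true
  roleName: syrfS3NotifierProductionLambdaRole
```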
Apply¶
- Open cluster-gitops PR. ArgoCD `app diff` should show:
    - New K8s resources: `Function`, `Bucket`, `Role` CRs + matching `AdoptedResource`s (all with the `adopted: "true"` annotation), plus the imperative `permission-job` Job. The Lambda's S3 invoke permission continues to be managed by `permission-job.yaml` — no Permission CR (upstream-blocked).
    - Zero AWS-side mutations expected.
- Merge. Watch ArgoCD sync.
- Verify `aws lambda get-function --function-name syrfAppUploadS3Notifier --query Tags.ManagedBy` flips from `Terraform` to `argocd-ack`.
Verification¶
- Run P-3 smoke test on a non-customer-impacting test project.
- Watch CloudWatch metrics for the production Lambda for 1 hour post-cutover (no error rate increase).
- Spot-check user uploads in real customer projects (with their consent / from a known test account).
- Verify the production bucket notification still fires (`aws s3api get-bucket-notification-configuration --bucket syrfapp-uploads`).
Rollback (production)¶
- Fast path: in cluster-gitops values, `service.enabled: false` → ArgoCD removes the chart's CRs. Because the CRs have `deletion-policy: retain`, the AWS resources stay. Terraform's view of them comes back.
- Slow path (if AWS state was mutated unexpectedly): run `terraform apply` to restore from Terraform's record.
- Worst case (the Lambda function got deleted during ACK reconciliation despite `retain`): re-deploy from `s3://camarades-terraform-state-aws/lambda-packages/production.zip` via Terraform.
Phase 6 — Decommission Terraform IAM (and revisit permission-job once upstream lands a Permission CR)¶
Goal: eliminate the dead code paths.
Only after all environments have been migrated for at least 14 days with no rollbacks.
Changes (camarades-infrastructure)¶
- Remove from `terraform/lambda/main.tf`:
    - `aws_iam_role.production_lambda_role` + dependent policy resources.
    - `aws_iam_role.preview_lambda_role` + dependent policy resources (the shared one).
    - `aws_iam_role.staging_lambda_role` + dependent policy resources.
    - `aws_lambda_function.s3_notifier_production`.
    - `aws_lambda_function.s3_notifier_preview` (for_each).
    - `aws_lambda_permission.s3_invoke_production`.
    - `aws_lambda_permission.s3_invoke_preview` (for_each).
    - `aws_s3_bucket_notification.uploads`.
- Remove the corresponding entries from `terraform/lambda/github-actions-iam.tf` (the resource-list ARNs).
- Use `terraform state rm`, not `destroy`. The live resources are now ACK-adopted; we just want to drop them from Terraform state without deleting them.
- Update `docs/terraform-cicd.md` to reflect the smaller scope.
Changes (syrf monorepo)¶
- Keep `src/services/s3-notifier/.chart/templates/permission-job.yaml` — only delete it once ACK ships a Permission CRD and we've migrated to it in a follow-up phase.
- Delete `src/services/s3-notifier/.chart/templates/env-vars-job.yaml` if its functionality has also moved to ACK CRs (likely needs its own decision, may stay).
- Delete the `iam.managedByAck` flag (it's now always true) and remove the `false` branch from each template.
- Update CLAUDE.md and the `docs/architecture/` references to the s3-notifier model.
Verification¶
- `terraform plan` is clean (no removals show up in AWS state).
- ArgoCD apps still sync.
- Smoke test passes on at least staging + production.
Risk-weighted rollback summary¶
At every phase except the final cleanup, rollback is a one-line flag flip in a values file (Phases 2–4) or a service.enabled toggle (Phase 5). The Terraform-managed resources stay declared until Phase 6, so reverting to the old model is always a few minutes of ArgoCD sync away. This is the load-bearing reason the plan is safe.
The one phase that can't be cleanly one-line-rolled-back is Phase 5 if AWS state actually changed (i.e. ACK did more than just adopt). Mitigation: deletion-policy: retain on every CR + double-checked adoptExisting: true + restoring from Terraform state as documented above.
Tracking¶
A ZenHub epic should be created for this migration. Each phase is a story:
- Epic: s3-notifier ACK migration (ADR-010)
- Phases P-1, P-2, P-3 → setup stories
- Phases 1 through 6 → implementation stories
- Per-environment smoke tests → sub-tasks under each phase
- Cleanup task: legacy `pr-2291` Lambda (Phase 3)
Update ADR-010's zenhub-ticket and this plan's zenhub-ticket once the epic exists.