#terraform

3 posts.

Why ECS Task Definitions Kept Changing on Every Apply

Every Terraform apply was creating new ECS task definition revisions even when nothing had actually changed. Our test environment had accumulated 17,000 task definition revisions. ECS task definitions are tracked by AWS Config, so every phantom revision was quietly driving up costs.

In AWS ECS, a task definition is a blueprint that describes how a container should run — what image to use, environment variables, resource limits, health checks, and so on. Every time Terraform sees a difference between what it expects and what AWS has, it creates a new revision. When that happens on every apply with no real change, something is off.
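For reference, a bare-bones task definition in Terraform looks something like this (the names and values are illustrative, not our actual config):

resource "aws_ecs_task_definition" "app" {
  family                   = "app"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "256"
  memory                   = "512"

  # Any change Terraform detects here produces a whole new revision
  container_definitions = jsonencode([
    {
      name      = "app"
      image     = "nginx:1.27"
      essential = true
      environment = [
        { name = "APP_ENV", value = "test" }
      ]
    }
  ])
}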

Finding the root cause

The way to debug this is to compare what Terraform has in state against what the AWS API actually returns. Pull the task definition from Terraform state and diff it against the output of aws ecs describe-task-definition. The differences tell you exactly what Terraform thinks it needs to change.
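Here is a minimal sketch of that comparison, assuming the resource lives in the root module at aws_ecs_task_definition.app with family app (both illustrative):

# Container definitions as Terraform has them in state
terraform show -json \
  | jq -S '.values.root_module.resources[]
           | select(.address == "aws_ecs_task_definition.app")
           | .values.container_definitions | fromjson' > state.json

# Container definitions as AWS actually stored them
aws ecs describe-task-definition --task-definition app \
  --query 'taskDefinition.containerDefinitions' --output json \
  | jq -S . > api.json

# Anything in this diff is what Terraform keeps trying to "fix"
diff state.json api.json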

Two causes came up.

Environment variable ordering. AWS reorders environment variables alphabetically when storing a task definition. Terraform was building the environment variable list using merge(), a built-in function that combines maps but does not guarantee key order. On each plan, the order could differ from what AWS had stored, so Terraform saw a change that wasn’t really there.

Missing properties in the container templates. Some optional fields were absent from the JSON templates we used to define containers. AWS fills these in with default values and includes them in the API response. When Terraform compared its template against the API response, it treated those server-side defaults as fields it needed to remove, which triggered yet another revision.

Fields like healthCheck.interval, healthCheck.retries, healthCheck.timeout, systemControls, mountPoints, and portMappings all fell into this category.

The fix

For environment variable ordering, the fix is to sort the keys explicitly before building the list so Terraform and AWS always agree on the order.

locals {
  merged_env_vars = merge(local.php_environment_json, var.extra_ecs_env_vars)
  php_environment_ecs = [
    for k in sort(keys(local.merged_env_vars)) : {
      name  = k
      value = local.merged_env_vars[k]
    }
  ]
}

For the missing properties, the fix is to add them explicitly to the container JSON templates to match what AWS returns. Once both sides agree, Terraform stops seeing phantom changes.
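As an illustration, a container definition that spells out the defaults AWS would otherwise inject might look like this (the values are examples; use whatever describe-task-definition returns for your containers):

{
  "name": "app",
  "image": "nginx:1.27",
  "essential": true,
  "healthCheck": {
    "command": ["CMD-SHELL", "curl -f http://localhost/ || exit 1"],
    "interval": 30,
    "timeout": 5,
    "retries": 3
  },
  "systemControls": [],
  "mountPoints": [],
  "volumesFrom": [],
  "portMappings": [
    { "containerPort": 80, "hostPort": 80, "protocol": "tcp" }
  ]
}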

After both fixes, running the plan twice in a row showed no changes on the second run.
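A convenient way to check idempotence is plan's -detailed-exitcode flag:

terraform apply
# Exit code 0 means no changes, 2 means changes present, 1 means error
terraform plan -detailed-exitcode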

Working Around a Long-Standing Terraform AWS Provider Bug


Our CloudFront distribution kept showing up in every Terraform plan even when nothing changed. The culprit was origin_shield.

When origin_shield is disabled, AWS does not include it in the refreshed state. Terraform sees it missing and tries to re-add it with enabled = false on every plan. The obvious fix was to make the block dynamic so it only appears when actually enabled.

dynamic "origin_shield" {
  for_each = var.enable_cloudfront_origin_shield ? [1] : []
  content {
    enabled              = var.enable_cloudfront_origin_shield
    origin_shield_region = data.aws_region.current.name
  }
}

That fixed the drift. But it introduced a different problem.

When enable_cloudfront_origin_shield changes from true to false, the dynamic block stops emitting the block. Terraform plans to remove origin_shield. The plan looks correct. But after apply, origin shield is still enabled in the AWS Console. The provider silently does nothing. This is a known bug filed in April 2022 with no movement in over three years.

A static block causes repeated drift. A dynamic block fixes the drift but breaks disabling. Neither works on its own.

The workaround is a null_resource that only exists when origin shield is disabled. It runs a script that calls the AWS API directly to set OriginShield.Enabled = false.

resource "null_resource" "cloudfront_origin_shield_disable" {
  count = var.create_cloudfront == "yes" && !var.enable_cloudfront_origin_shield ? 1 : 0

  triggers = {
    enable_origin_shield = var.enable_cloudfront_origin_shield
    distribution_id      = aws_cloudfront_distribution.default[0].id
    script_hash          = filemd5("${path.module}/bin/disable-cloudfront-origin-shield.sh")
  }

  provisioner "local-exec" {
    command = "${path.module}/bin/disable-cloudfront-origin-shield.sh ${aws_cloudfront_distribution.default[0].id} wordpress"
  }

  depends_on = [aws_cloudfront_distribution.default]
}

The script fetches the current distribution config, checks if origin shield is already disabled, and exits early if so. The core of it patches the config with jq and pushes it back via the AWS CLI.

# Fetch the current distribution config and the ETag required by update-distribution
CLOUDFRONT_CONFIG=$(aws cloudfront get-distribution-config --id "$DISTRIBUTION_ID" --output json)
ETAG=$(echo "$CLOUDFRONT_CONFIG" | jq -r '.ETag')
DISTRIBUTION_CONFIG=$(echo "$CLOUDFRONT_CONFIG" | jq '.DistributionConfig')

# Disable origin shield on the matching origin, then push the patched config back
jq --arg origin_id "$ORIGIN_ID" '
  .Origins.Items = [
    .Origins.Items[] |
    if .Id == $origin_id then
      .OriginShield.Enabled = false
    else
      .
    end
  ]
' <<< "$DISTRIBUTION_CONFIG" | aws cloudfront update-distribution \
    --id "$DISTRIBUTION_ID" \
    --distribution-config file:///dev/stdin \
    --if-match "$ETAG"

The null_resource only runs once, on the transition to disabled. If origin shield gets re-enabled and later disabled again, the changed triggers force a re-run.

Calling the AWS API directly in a provisioner is not ideal. But the bug has been open for years with no fix in sight. Noisy plans erode trust in the plan output. This keeps things clean until the upstream fix lands.

Reducing Terraform Drift in Production

[Screenshot: Terraform plan showing no changes]

Our Terraform runs were showing changes on every apply, even when nothing had actually changed. In production infrastructure, that’s a real risk. When the plan is always noisy, it’s easy to miss a change that actually matters. It also blocks any move toward automation. You can’t safely automate applies when you can’t trust what the plan is telling you.

I picked this up on my own in spare time, originally thinking I'd land one drift fix per sprint. Seeing the change set shrink gave me a dopamine hit, and I just kept going until there was nothing left. Five sources of drift across CloudFront, ECS, Secrets Manager, and Terraform provider bugs, all cleaned up.

Why it matters

Noisy plans increase the risk of accidental changes and make it hard to move towards automation. Cleaning up drift ensures the plan only shows real changes. This gives us a reliable baseline for future improvements like automated releases and introducing guardrails.

The fixes

Each fix targets a specific source of drift, with an emphasis on fixing root causes rather than applying quick workarounds. In cases where AWS provider limitations made some drift unavoidable, ignore_changes was used as a last resort.

Make origin_shield dynamic

  • When origin_shield is disabled on a CloudFront distribution, AWS removes it from the refreshed state entirely, causing drift on every plan
  • Made the block dynamic so it is only included when enabled
  • There was a secondary provider bug when disabling via a dynamic block that needed its own workaround

Fix null_resources

  • timestamp() in triggers caused Terraform to show changes every run since the value always differs
  • Moved script execution to external data sources to remove drift while still running scripts every apply (see the sketch after this list)
  • Split dependency installation into a separate null_resource so it only runs once
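A minimal sketch of the external data source pattern (the script path and output key are made up): the data source is read on every plan, so the script still runs each time, but reading a data source is not a resource change, so it produces no diff as long as its output is stable.

# Read on every plan/apply; unlike a null_resource with timestamp()
# triggers, re-running this does not show up as a planned change
data "external" "sync" {
  program = ["bash", "${path.module}/bin/sync-something.sh"]
}

# The script must print a JSON object of strings, e.g. {"status":"ok"}
output "sync_status" {
  value = data.external.sync.result.status
}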

Fix task definitions

  • Every apply created new ECS task definition revisions even with no real changes. Task definitions are tracked by AWS Config, so this had a cost implication
  • Identified mismatches in environment variable ordering and missing optional properties by comparing Terraform state with the AWS API response
  • Standardized ordering and completed missing fields in the template

Webhook secrets

  • An AWS provider issue caused webhook secrets to always show as changed
  • Added ignore_changes for the secret value

Resources not managed by Terraform

  • Some resources existed in AWS but were intentionally not managed by Terraform, which resulted in drift
  • Added the relevant attributes to ignore_changes (see the sketch after this list)
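Both of the last two fixes lean on the same lifecycle pattern. A minimal sketch, assuming a Secrets Manager secret version (the resource and variable names are illustrative):

resource "aws_secretsmanager_secret_version" "webhook" {
  secret_id     = aws_secretsmanager_secret.webhook.id
  secret_string = var.webhook_secret_value

  lifecycle {
    # The provider reports a phantom diff on the value; suppress it
    ignore_changes = [secret_string]
  }
}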

What this affords us

Safer infrastructure releases. Planned changes now match expected changes, so only genuine discrepancies need manual review before apply.

Faster developer feedback loops. Terraform changes are easier to understand, approve, and debug during speculative planning.

A good foundation for future automation. A clean baseline means we can now move toward automated infrastructure releases, something that wasn't safe to do while the plan couldn't be trusted.

What’s next

With a clean plan and apply baseline, we can safely automate infrastructure releases, introduce policy guardrails that highlight genuine issues, and reduce the time spent on manual maintenance.