#terraform

3 posts.

Finding a Way Around origin_shield Drift

Every Terraform plan flagged a change in our CloudFront distribution, even when nothing was actually changing.

The issue was origin_shield. When it is disabled, AWS drops it from the response. Terraform sees it missing and adds it back with enabled = false every run.

So every run showed CloudFront being changed. That’s a big deal because it’s the CDN: a misconfiguration could bring down a site. We ended up having to inspect the change every single time.

The obvious fix was to make the block dynamic so it only appears when origin shield is actually enabled.

dynamic "origin_shield" {
  for_each = var.enable_cloudfront_origin_shield ? [1] : []
  content {
    enabled              = var.enable_cloudfront_origin_shield
    origin_shield_region = data.aws_region.current.name
  }
}

Drift was gone, plans were clean. But then I found the other side of the problem.

If origin shield was enabled, then later disabled, Terraform would plan to remove the block. The plan looked right. Apply ran. Nothing happened. Origin shield stayed on in the AWS Console: the provider simply dropped enabled = true from the request, but AWS needed an explicit enabled = false. We were back to the original problem.

We filed a bug report in April 2022. Three years later, it’s still open.

I can’t fix the provider, but I really really wanted a clean plan/apply in our infrastructure.

I figured I could work around it with a null_resource that only exists when origin shield is disabled. When it runs, it calls the AWS API directly and forces OriginShield.Enabled = false.

resource "null_resource" "cloudfront_origin_shield_disable" {
  count = var.create_cloudfront == "yes" && !var.enable_cloudfront_origin_shield ? 1 : 0

  triggers = {
    enable_origin_shield = var.enable_cloudfront_origin_shield
    distribution_id      = aws_cloudfront_distribution.default[0].id
    script_hash          = filemd5("${path.module}/bin/disable-cloudfront-origin-shield.sh")
  }

  provisioner "local-exec" {
    command = "${path.module}/bin/disable-cloudfront-origin-shield.sh ${aws_cloudfront_distribution.default[0].id} wordpress"
  }

  depends_on = [aws_cloudfront_distribution.default]
}

The null_resource’s provisioner runs when origin shield goes from enabled to disabled, or when the script itself changes. The script fetches the current config, exits early if origin shield is already off, and only patches it when needed.

CLOUDFRONT_CONFIG=$(aws cloudfront get-distribution-config --id "$DISTRIBUTION_ID" --output json)
ETAG=$(echo "$CLOUDFRONT_CONFIG" | jq -r '.ETag')
DISTRIBUTION_CONFIG=$(echo "$CLOUDFRONT_CONFIG" | jq '.DistributionConfig')

jq --arg origin_id "$ORIGIN_ID" '
  .Origins.Items = [
    .Origins.Items[] |
    if .Id == $origin_id then
      .OriginShield.Enabled = false
    else
      .
    end
  ]
' <<< "$DISTRIBUTION_CONFIG" | aws cloudfront update-distribution \
    --id "$DISTRIBUTION_ID" \
    --distribution-config file:///dev/stdin \
    --if-match "$ETAG"

I distinctly remember how satisfying it was to see it work after spending a couple of hours on this bug, with every re-run taking minutes. When it finally worked, I felt so relieved.

The bug is still open, but at least our plans are finally clean.

Why ECS Task Definitions Kept Changing on Every Apply

Every Terraform apply was creating new ECS task definition revisions even when nothing had actually changed. Our test environment had accumulated 17,000 task definitions. ECS task definitions are tracked by AWS Config, so this was quietly contributing to increased costs.

In AWS ECS, a task definition is a blueprint that describes how a container should run — what image to use, environment variables, resource limits, health checks, and so on. Every time Terraform sees a difference between what it expects and what AWS has, it creates a new revision. When that happens on every apply with no real change, something is off.

Finding the root cause

The way to debug this is to compare what Terraform has in state against what the AWS API actually returns. Pull the task definition from Terraform state and diff it against the output of aws ecs describe-task-definition. The differences tell you exactly what Terraform thinks it needs to change.

Two causes came up.

Environment variable ordering. AWS reorders environment variables alphabetically when storing a task definition. Terraform was building the environment variable list using merge(), a built-in function that combines maps but does not guarantee key order. On each plan, the order could differ from what AWS had stored, so Terraform saw a change that wasn’t really there.

Missing properties in the container templates. Some optional fields were absent from the JSON templates we used to define containers. AWS fills these in with default values and includes them in the API response. When Terraform compared its template against the API response, it saw those extra fields as additions it needed to remove — which triggered another replacement.

Fields like healthCheck.interval, healthCheck.retries, healthCheck.timeout, systemControls, mountPoints, and portMappings all fell into this category.

The fix

For environment variable ordering, the fix is to sort the keys explicitly before building the list so Terraform and AWS always agree on the order.

locals {
  merged_env_vars = merge(local.php_environment_json, var.extra_ecs_env_vars)
  php_environment_ecs = [
    for k in sort(keys(local.merged_env_vars)) : {
      name  = k
      value = local.merged_env_vars[k]
    }
  ]
}

For the missing properties, the fix is to add them explicitly to the container JSON templates to match what AWS returns. Once both sides agree, Terraform stops seeing phantom changes.
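
As a rough illustration of what filling in the missing properties can look like, here is a sketch that builds one container definition with jsonencode() rather than a JSON template file. The container name, image, health check command, and port are hypothetical. The explicit values (interval 30, retries 3, timeout 5, protocol tcp) assume the documented ECS defaults. The reliable source is whatever aws ecs describe-task-definition returns for your own task.

```hcl
locals {
  # Hypothetical container definition; name, image, command, and port are
  # illustrative and not taken from the original templates.
  example_container = {
    name      = "php"
    image     = "example/php:latest"
    essential = true

    # Spell out the values AWS would otherwise fill in, so the rendered JSON
    # matches the API response field for field.
    healthCheck = {
      command  = ["CMD-SHELL", "curl -f http://localhost/ || exit 1"]
      interval = 30 # assumed ECS default, in seconds
      retries  = 3  # assumed ECS default
      timeout  = 5  # assumed ECS default, in seconds
    }

    portMappings = [
      {
        containerPort = 80
        hostPort      = 80
        protocol      = "tcp"
      }
    ]

    # The API returns these as empty lists even when nothing is configured.
    mountPoints    = []
    systemControls = []
  }

  # Pass this string as container_definitions on the task definition resource.
  example_container_definitions = jsonencode([local.example_container])
}
```

If a later plan still shows a diff, the remaining mismatch is usually another field that the API returns and the definition omits.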

After both fixes, running the plan twice in a row showed no changes on the second run.

Reducing Terraform Drift in Production

[Image: Terraform plan showing no changes]

Our Terraform runs were showing changes on every apply, even when nothing had actually changed. When the plan is always noisy, it’s easy to miss a change that actually matters.

I picked this up on my own in my spare time, originally planning on one drift fix per sprint. Seeing the reduced change set gave me a dopamine hit, and I just kept going until there were none left.

Five sources of drift across CloudFront, ECS, Secrets Manager, and Terraform provider bugs. All cleaned up.

The fixes

Make origin_shield dynamic

  • When origin_shield is disabled on a CloudFront distribution, AWS removes it from the refreshed state entirely, causing drift on every plan
  • Made the block dynamic so it is only included when enabled
  • There was a secondary provider bug when disabling via a dynamic block that needed its own workaround

Fix null_resources

  • timestamp() in triggers caused Terraform to show changes every run since the value always differs
  • Moved script execution to external data sources to remove drift while still running scripts every apply
  • Split dependency installation into a separate null_resource so it only runs once
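
A minimal sketch of that pattern, assuming the hashicorp/external and hashicorp/null providers. The script paths and resource names are hypothetical. The external program has to print a JSON object of string values on stdout, and Terraform reads data sources during plan as well as apply, so the script should be safe to run repeatedly.

```hcl
# Before (drifts): timestamp() differs on every run, so the resource is
# always planned for replacement.
#
# resource "null_resource" "run_script" {
#   triggers = { always_run = timestamp() }
#   provisioner "local-exec" {
#     command = "${path.module}/bin/example-script.sh"
#   }
# }

# After: a data source is re-read on every run without appearing as a
# resource change. example-script.sh is hypothetical and must print JSON
# such as {"status":"ok"} on stdout.
data "external" "run_script" {
  program = ["bash", "${path.module}/bin/example-script.sh"]

  # Make sure dependencies are installed before the script runs.
  depends_on = [null_resource.install_dependencies]
}

# Dependency installation lives in its own resource. With no changing
# trigger, the provisioner runs when the resource is first created and
# not again on later applies.
resource "null_resource" "install_dependencies" {
  provisioner "local-exec" {
    command = "${path.module}/bin/install-dependencies.sh" # hypothetical
  }
}
```

One thing to watch: if the data source result feeds a resource attribute, any change in the script output will surface as a diff on that resource.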

Fix task definitions

  • Every apply created new ECS task definition revisions even with no real changes. Task definitions are tracked by AWS Config so this had a cost implication
  • Identified mismatches in environment variable ordering and missing optional properties by comparing Terraform state with the AWS API response
  • Standardized ordering and completed missing fields in the template
  • Full write-up: Why ECS Task Definitions Kept Changing on Every Apply

Webhook secrets

  • An AWS provider issue caused webhook secrets to always show as changed
  • Added ignore_changes for the secret value
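
The bullets above do not say which resource holds the webhook secret, so the following is only a sketch that assumes it is stored as a Secrets Manager secret version. The resource names, secret name, and variable are hypothetical. The lifecycle block tells Terraform to stop comparing the listed attribute once the resource exists.

```hcl
variable "webhook_secret" {
  type      = string
  sensitive = true
}

resource "aws_secretsmanager_secret" "webhook" {
  name = "example/webhook-secret" # hypothetical name
}

resource "aws_secretsmanager_secret_version" "webhook" {
  secret_id     = aws_secretsmanager_secret.webhook.id
  secret_string = var.webhook_secret

  lifecycle {
    # Terraform sets the value on creation, then ignores later differences.
    ignore_changes = [secret_string]
  }
}
```

The trade-off is that real changes to the value are no longer applied either. Rotating the secret then means forcing a replacement (for example with terraform apply -replace) or updating it outside Terraform.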

Resources not managed by Terraform

  • Some resources existed in AWS but were intentionally not managed by Terraform, which resulted in drift
  • Added the relevant attributes to ignore_changes

This has been running across all our environments for a couple of months, and nothing has broken.

This was also a good stress test of how well I understand our infrastructure.

It was a fun exercise.