There’s a recurring genre of AWS war story — “I left a script running, woke up to a $4,000 bill, AWS support waived it.” It’s a real risk on a hobby account: you don’t have an org-level SCP, you don’t have a FinOps team, and one misconfigured loop on Bedrock or a forgotten EC2 instance can quietly run for a week.

So I sat down and built the smallest set of guardrails that would actually stop me, not just notify me. Three layers. All deployable from one CloudFormation stack except the last one, which has a constraint AWS doesn’t document well.

What I’m protecting against

  • Steady drift — I forget the EC2 instance is up, it racks up $30 over a month. Annoying, not catastrophic.
  • Surprise spike — I push a buggy script that hammers Bedrock or starts a too-big instance. Could be hundreds of dollars in hours.
  • Slow leak — Some service I provisioned a year ago is still billing me $3/day. Death by a thousand papercuts.

Different time horizons need different tools.

Layer 1 — Budget alerts (the slow drift)

AWS Budgets fires email alerts at thresholds against a monthly cost target. Not real-time — Budgets evaluates roughly every 8–12 hours — but cheap and zero-maintenance.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
MonthlyBudget:
  Type: AWS::Budgets::Budget
  Properties:
    Budget:
      BudgetName: monthly-cost-budget
      BudgetType: COST
      TimeUnit: MONTHLY
      BudgetLimit:
        Amount: 40
        Unit: USD
    NotificationsWithSubscribers:
      - Notification:
          NotificationType: ACTUAL
          ComparisonOperator: GREATER_THAN
          Threshold: 50
          ThresholdType: PERCENTAGE
        Subscribers:
          - SubscriptionType: SNS
            Address: !Ref CostAlertTopic
      - Notification:
          NotificationType: ACTUAL
          ComparisonOperator: GREATER_THAN
          Threshold: 80
          ThresholdType: PERCENTAGE
        Subscribers:
          - SubscriptionType: SNS
            Address: !Ref CostAlertTopic
      - Notification:
          NotificationType: ACTUAL
          ComparisonOperator: GREATER_THAN
          Threshold: 100
          ThresholdType: PERCENTAGE
        Subscribers:
          - SubscriptionType: SNS
            Address: !Ref CostAlertTopic

50% / 80% / 100% is the common shape. The 50% notification is the one that’s actually useful — if I’m halfway through the month and at 50%, I’m tracking. If I’m one week in and at 50%, something is wrong.

Layer 2 — Auto-stop Lambda (the spike safety net)

Email alerts at 100% are nice but I’m not going to read email at 2am. So the 100% threshold also fires an SNS message that triggers a Lambda, which stops every running EC2 instance tagged with AutoStop=true.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
AutoStopFunction:
  Type: AWS::Lambda::Function
  Properties:
    FunctionName: cost-monitor-auto-stop
    Runtime: python3.13
    Handler: index.handler
    Role: !GetAtt AutoStopRole.Arn
    Timeout: 60
    Code:
      ZipFile: |
        import boto3
        ec2 = boto3.client('ec2')
        def handler(event, context):
            r = ec2.describe_instances(Filters=[
                {'Name': 'tag:AutoStop', 'Values': ['true']},
                {'Name': 'instance-state-name', 'Values': ['running']},
            ])
            ids = [i['InstanceId'] for res in r['Reservations'] for i in res['Instances']]
            if ids:
                ec2.stop_instances(InstanceIds=ids)
            return {'stopped': ids}

AutoStopSubscription:
  Type: AWS::SNS::Subscription
  Properties:
    TopicArn: !Ref CostAlertTopic
    Protocol: lambda
    Endpoint: !GetAtt AutoStopFunction.Arn

This is not a real-time circuit breaker. Budgets evaluates every 8–12 hours, so the auto-stop only kicks in within that window. If you launch a p4d.24xlarge and it costs $40/hour, you can blow past your $40 monthly budget in one hour and Budgets won’t notice until the next evaluation cycle.

The auto-stop is for the case where you’ve been gradually burning toward your monthly limit and want a “no, really, stop spending” backstop. For genuine spike protection, you need Layer 3.

The IAM role is small — just ec2:DescribeInstances and ec2:StopInstances, scoped to the account. No reason to grant Start on the auto-stop role; restarting is a manual decision.

Layer 3 — Cost Anomaly Detection (the spike detector)

This is the one that catches “you launched something weird two hours ago.” Cost Anomaly Detection (CAD) evaluates roughly daily on actual usage, looks for statistical anomalies vs. your historical baseline per service, and emails you when one service’s daily spend deviates significantly.

The trick: you can’t fully manage CAD from CloudFormation if you already have a manually-created monitor. AWS allows exactly one DIMENSIONAL/SERVICE anomaly monitor per account. If you click around in the console and create one, then later try to create one in CloudFormation:

AlreadyExistsException: AnomalyMonitor already exists.

The fix is one of:

  1. Delete the manual one, then create via CloudFormation.
  2. Skip CloudFormation, manage CAD by hand via the console.
  3. Import the existing resource into your stack (aws cloudformation create-change-set --change-set-type IMPORT ...).

I went with option 2 because CAD is set-and-forget. The monitor lives outside the stack; the stack manages the budget + auto-stop + SNS topic.

If you do go via CloudFormation, two more gotchas:

  • ThresholdExpression is a JSON string, not a YAML object. CFN’s parser will accept the YAML and then fail at deploy:
1
2
3
4
5
6
7
8
9
# correct
ThresholdExpression: '{"Dimensions":{"Key":"ANOMALY_TOTAL_IMPACT_ABSOLUTE","MatchOptions":["GREATER_THAN_OR_EQUAL"],"Values":["5"]}}'

# wrong — fails EarlyValidation
ThresholdExpression:
  Dimensions:
    Key: ANOMALY_TOTAL_IMPACT_ABSOLUTE
    Values: ['5']
    MatchOptions: ['GREATER_THAN_OR_EQUAL']
  • Frequency: IMMEDIATE only works with SNS subscribers. If you want email straight to your inbox, the supported frequencies are DAILY and WEEKLY. I went with DAILY and a $5 absolute-impact threshold — anything that bumps a single service by $5 in a day is something I want to know about within 24 hours.

What this catches and what it doesn’t

ScenarioCaught byLatency
Forgotten EC2 instance, ~$30 over a monthBudget at 50% / 80%~12 hours
Bedrock loop bug, $200 in one afternoonAnomaly Detection~24 hours
p4d launched accidentallyNone of the above, fast enoughHours
Slow leak: $3/day for a yearAnomaly Detection threshold (if $5+)DAILY email

Honest assessment: a determined accident in a single hour can still hurt. AWS doesn’t really offer real-time auto-stop at any reasonable price point — Budgets is the closest thing and it’s not real-time. If you genuinely need spike protection on the order of minutes, your options are (a) provision through Service Catalog with hard-coded instance type allowlists, (b) put an SCP on a proper org account, or (c) keep most of your spend on a fixed-cost subscription (Reserved/Savings Plans) so the variable portion can’t run away.

For a hobby account: this three-layer setup is what I’d call “good enough that I don’t lose sleep, cheap enough that it costs me nothing on top of what I’m already spending.”

Pitfalls I actually hit

  • AlreadyExists on the anomaly monitor. If you’ve ever clicked “set up anomaly detection” in the console, you have a DIMENSIONAL/SERVICE monitor and CFN can’t make a second one. Either import it or skip CFN for CAD.
  • ThresholdExpression shape. Must be a JSON-encoded string. CFN templates accept YAML for most properties so this looks like a typo, not a constraint.
  • Frequency: IMMEDIATE with email. Doesn’t work. Use SNS-then-email if you want IMMEDIATE; use DAILY/WEEKLY for direct email.
  • Auto-stop ≠ circuit breaker. Budgets is not real-time. If you write a tight Bedrock loop, the auto-stop fires hours after you’ve already spent the money.
  • Tag your stoppable instances. The Lambda filters on AutoStop=true. If you forget the tag, the Lambda runs successfully and stops nothing. Tag every personal EC2 instance with AutoStop=true by default; remove the tag explicitly when you want something to keep running.
  • Restart is manual on purpose. The Lambda has no ec2:StartInstances. Restarting an instance after auto-stop should be a conscious “okay, I know what’s running and why” decision, not an automatic recovery.

What to do

If you have a personal AWS account and you don’t have at least Layer 1 (a budget with email alerts), set that up today — it’s ten minutes in the console, free, and prevents the steady-drift class of problem.

If you’re running anything that bills by the hour (EC2, RDS, OpenSearch, GPU anything), add Layer 2. The Lambda is twenty lines and a tag.

If you’re using usage-billed services where one bug can spike fast (Bedrock, Lambda invocation count, S3 request count), add Layer 3 — DAILY anomaly emails at a $5 absolute threshold catch the “what is this charge” stuff before it becomes a “what was this charge” conversation.

Three layers, ~20 minutes total, zero ongoing cost. Worth it.