There's a recurring genre of AWS war story — "I left a script running, woke up to a $4,000 bill, AWS support waived it." It's a real risk on a hobby account: you don't have an org-level SCP, you don't have a FinOps team, and one misconfigured loop on Bedrock or a forgotten EC2 instance can quietly run for a week.
So I sat down and built the smallest set of guardrails that would actually stop me, not just notify me. Three layers. All deployable from one CloudFormation stack except the last one, which has a constraint AWS doesn't document well.
## What I'm protecting against
- **Steady drift** — I forget the EC2 instance is up, it racks up $30 over a month. Annoying, not catastrophic.
- **Surprise spike** — I push a buggy script that hammers Bedrock or starts a too-big instance. Could be hundreds of dollars in hours.
- **Slow leak** — Some service I provisioned a year ago is still billing me $3/day. Death by a thousand papercuts.
Different time horizons need different tools.
## Layer 1 — Budget alerts (the slow drift)
AWS Budgets fires email alerts at thresholds against a monthly cost target. Not real-time — Budgets evaluates roughly every 8–12 hours — but cheap and zero-maintenance.
```yaml
MonthlyBudget:
Type: AWS::Budgets::Budget
Properties:
Budget:
BudgetName: monthly-cost-budget
BudgetType: COST
TimeUnit: MONTHLY
BudgetLimit:
Amount: 40
Unit: USD
NotificationsWithSubscribers:
- Notification:
NotificationType: ACTUAL
ComparisonOperator: GREATER_THAN
Threshold: 50
ThresholdType: PERCENTAGE
Subscribers:
- SubscriptionType: SNS
Address: !Ref CostAlertTopic
- Notification:
NotificationType: ACTUAL
ComparisonOperator: GREATER_THAN
Threshold: 80
ThresholdType: PERCENTAGE
Subscribers:
- SubscriptionType: SNS
Address: !Ref CostAlertTopic
- Notification:
NotificationType: ACTUAL
ComparisonOperator: GREATER_THAN
Threshold: 100
ThresholdType: PERCENTAGE
Subscribers:
- SubscriptionType: SNS
Address: !Ref CostAlertTopic
```
50% / 80% / 100% is the common shape. The 50% notification is the one that's actually useful — if I'm halfway through the month and at 50%, I'm tracking. If I'm one week in and at 50%, something is wrong.
## Layer 2 — Auto-stop Lambda (the spike safety net)
Email alerts at 100% are nice but I'm not going to read email at 2am. So the 100% threshold *also* fires an SNS message that triggers a Lambda, which stops every running EC2 instance tagged with `AutoStop=true`.
```yaml
AutoStopFunction:
Type: AWS::Lambda::Function
Properties:
FunctionName: cost-monitor-auto-stop
Runtime: python3.13
Handler: index.handler
Role: !GetAtt AutoStopRole.Arn
Timeout: 60
Code:
ZipFile: |
import boto3
ec2 = boto3.client('ec2')
def handler(event, context):
r = ec2.describe_instances(Filters=[
{'Name': 'tag:AutoStop', 'Values': ['true']},
{'Name': 'instance-state-name', 'Values': ['running']},
])
ids = [i['InstanceId'] for res in r['Reservations'] for i in res['Instances']]
if ids:
ec2.stop_instances(InstanceIds=ids)
return {'stopped': ids}
AutoStopSubscription:
Type: AWS::SNS::Subscription
Properties:
TopicArn: !Ref CostAlertTopic
Protocol: lambda
Endpoint: !GetAtt AutoStopFunction.Arn
```
This is **not** a real-time circuit breaker. Budgets evaluates every 8–12 hours, so the auto-stop only kicks in within that window. If you launch a `p4d.24xlarge` and it costs $40/hour, you can blow past your $40 monthly budget in one hour and Budgets won't notice until the next evaluation cycle.
The auto-stop is for the case where you've been *gradually* burning toward your monthly limit and want a "no, really, stop spending" backstop. For genuine spike protection, you need Layer 3.
The IAM role is small — just `ec2:DescribeInstances` and `ec2:StopInstances`, scoped to the account. No reason to grant `Start` on the auto-stop role; restarting is a manual decision.
## Layer 3 — Cost Anomaly Detection (the spike detector)
This is the one that catches "you launched something weird two hours ago." Cost Anomaly Detection (CAD) evaluates roughly daily on actual usage, looks for statistical anomalies vs. your historical baseline per service, and emails you when one service's daily spend deviates significantly.
The trick: you can't fully manage CAD from CloudFormation if you already have a manually-created monitor. AWS allows exactly **one DIMENSIONAL/SERVICE anomaly monitor per account**. If you click around in the console and create one, then later try to create one in CloudFormation:
```
AlreadyExistsException: AnomalyMonitor already exists.
```
The fix is one of:
1. Delete the manual one, then create via CloudFormation.
2. Skip CloudFormation, manage CAD by hand via the console.
3. Import the existing resource into your stack (`aws cloudformation create-change-set --change-set-type IMPORT ...`).
I went with option 2 because CAD is set-and-forget. The monitor lives outside the stack; the stack manages the budget + auto-stop + SNS topic.
If you do go via CloudFormation, two more gotchas:
- `ThresholdExpression` is a **JSON string**, not a YAML object. CFN's parser will accept the YAML and then fail at deploy:
```yaml
# correct
ThresholdExpression: '{"Dimensions":{"Key":"ANOMALY_TOTAL_IMPACT_ABSOLUTE","MatchOptions":["GREATER_THAN_OR_EQUAL"],"Values":["5"]}}'
# wrong — fails EarlyValidation
ThresholdExpression:
Dimensions:
Key: ANOMALY_TOTAL_IMPACT_ABSOLUTE
Values: ['5']
MatchOptions: ['GREATER_THAN_OR_EQUAL']
```
- `Frequency: IMMEDIATE` only works with SNS subscribers. If you want email straight to your inbox, the supported frequencies are `DAILY` and `WEEKLY`. I went with `DAILY` and a $5 absolute-impact threshold — anything that bumps a single service by $5 in a day is something I want to know about within 24 hours.
## What this catches and what it doesn't
| Scenario | Caught by | Latency |
|---|---|---|
| Forgotten EC2 instance, ~$30 over a month | Budget at 50% / 80% | ~12 hours |
| Bedrock loop bug, $200 in one afternoon | Anomaly Detection | ~24 hours |
| `p4d` launched accidentally | None of the above, fast enough | Hours |
| Slow leak: $3/day for a year | Anomaly Detection threshold (if $5+) | DAILY email |
Honest assessment: a determined accident in a single hour can still hurt. AWS doesn't really offer real-time auto-stop at any reasonable price point — Budgets is the closest thing and it's not real-time. If you genuinely need spike protection on the order of minutes, your options are (a) provision through Service Catalog with hard-coded instance type allowlists, (b) put an SCP on a proper org account, or (c) keep most of your spend on a fixed-cost subscription (Reserved/Savings Plans) so the variable portion can't run away.
For a hobby account: this three-layer setup is what I'd call "good enough that I don't lose sleep, cheap enough that it costs me nothing on top of what I'm already spending."
## Pitfalls I actually hit
- **`AlreadyExists` on the anomaly monitor.** If you've ever clicked "set up anomaly detection" in the console, you have a DIMENSIONAL/SERVICE monitor and CFN can't make a second one. Either import it or skip CFN for CAD.
- **`ThresholdExpression` shape.** Must be a JSON-encoded string. CFN templates accept YAML for most properties so this looks like a typo, not a constraint.
- **`Frequency: IMMEDIATE` with email.** Doesn't work. Use SNS-then-email if you want IMMEDIATE; use DAILY/WEEKLY for direct email.
- **Auto-stop ≠ circuit breaker.** Budgets is not real-time. If you write a tight Bedrock loop, the auto-stop fires hours after you've already spent the money.
- **Tag your stoppable instances.** The Lambda filters on `AutoStop=true`. If you forget the tag, the Lambda runs successfully and stops nothing. Tag every personal EC2 instance with `AutoStop=true` by default; remove the tag explicitly when you want something to keep running.
- **Restart is manual on purpose.** The Lambda has no `ec2:StartInstances`. Restarting an instance after auto-stop should be a conscious "okay, I know what's running and why" decision, not an automatic recovery.
## What to do
If you have a personal AWS account and you don't have at least Layer 1 (a budget with email alerts), set that up today — it's ten minutes in the console, free, and prevents the steady-drift class of problem.
If you're running anything that bills by the hour (EC2, RDS, OpenSearch, GPU anything), add Layer 2. The Lambda is twenty lines and a tag.
If you're using usage-billed services where one bug can spike fast (Bedrock, Lambda invocation count, S3 request count), add Layer 3 — DAILY anomaly emails at a $5 absolute threshold catch the "what is this charge" stuff before it becomes a "what *was* this charge" conversation.
Three layers, ~20 minutes total, zero ongoing cost. Worth it.
There’s a recurring genre of AWS war story — “I left a script running, woke up to a $4,000 bill, AWS support waived it.” It’s a real risk on a hobby account: you don’t have an org-level SCP, you don’t have a FinOps team, and one misconfigured loop on Bedrock or a forgotten EC2 instance can quietly run for a week.
So I sat down and built the smallest set of guardrails that would actually stop me, not just notify me. Three layers. All deployable from one CloudFormation stack except the last one, which has a constraint AWS doesn’t document well.
What I’m protecting against
Steady drift — I forget the EC2 instance is up, it racks up $30 over a month. Annoying, not catastrophic.
Surprise spike — I push a buggy script that hammers Bedrock or starts a too-big instance. Could be hundreds of dollars in hours.
Slow leak — Some service I provisioned a year ago is still billing me $3/day. Death by a thousand papercuts.
Different time horizons need different tools.
Layer 1 — Budget alerts (the slow drift)
AWS Budgets fires email alerts at thresholds against a monthly cost target. Not real-time — Budgets evaluates roughly every 8–12 hours — but cheap and zero-maintenance.
50% / 80% / 100% is the common shape. The 50% notification is the one that’s actually useful — if I’m halfway through the month and at 50%, I’m tracking. If I’m one week in and at 50%, something is wrong.
Layer 2 — Auto-stop Lambda (the spike safety net)
Email alerts at 100% are nice but I’m not going to read email at 2am. So the 100% threshold also fires an SNS message that triggers a Lambda, which stops every running EC2 instance tagged with AutoStop=true.
AutoStopFunction:Type:AWS::Lambda::FunctionProperties:FunctionName:cost-monitor-auto-stopRuntime:python3.13Handler:index.handlerRole:!GetAtt AutoStopRole.ArnTimeout:60Code:ZipFile:| import boto3
ec2 = boto3.client('ec2')
def handler(event, context):
r = ec2.describe_instances(Filters=[
{'Name': 'tag:AutoStop', 'Values': ['true']},
{'Name': 'instance-state-name', 'Values': ['running']},
])
ids = [i['InstanceId'] for res in r['Reservations'] for i in res['Instances']]
if ids:
ec2.stop_instances(InstanceIds=ids)
return {'stopped': ids}AutoStopSubscription:Type:AWS::SNS::SubscriptionProperties:TopicArn:!Ref CostAlertTopicProtocol:lambdaEndpoint:!GetAtt AutoStopFunction.Arn
This is not a real-time circuit breaker. Budgets evaluates every 8–12 hours, so the auto-stop only kicks in within that window. If you launch a p4d.24xlarge and it costs $40/hour, you can blow past your $40 monthly budget in one hour and Budgets won’t notice until the next evaluation cycle.
The auto-stop is for the case where you’ve been gradually burning toward your monthly limit and want a “no, really, stop spending” backstop. For genuine spike protection, you need Layer 3.
The IAM role is small — just ec2:DescribeInstances and ec2:StopInstances, scoped to the account. No reason to grant Start on the auto-stop role; restarting is a manual decision.
Layer 3 — Cost Anomaly Detection (the spike detector)
This is the one that catches “you launched something weird two hours ago.” Cost Anomaly Detection (CAD) evaluates roughly daily on actual usage, looks for statistical anomalies vs. your historical baseline per service, and emails you when one service’s daily spend deviates significantly.
The trick: you can’t fully manage CAD from CloudFormation if you already have a manually-created monitor. AWS allows exactly one DIMENSIONAL/SERVICE anomaly monitor per account. If you click around in the console and create one, then later try to create one in CloudFormation:
Frequency: IMMEDIATE only works with SNS subscribers. If you want email straight to your inbox, the supported frequencies are DAILY and WEEKLY. I went with DAILY and a $5 absolute-impact threshold — anything that bumps a single service by $5 in a day is something I want to know about within 24 hours.
What this catches and what it doesn’t
Scenario
Caught by
Latency
Forgotten EC2 instance, ~$30 over a month
Budget at 50% / 80%
~12 hours
Bedrock loop bug, $200 in one afternoon
Anomaly Detection
~24 hours
p4d launched accidentally
None of the above, fast enough
Hours
Slow leak: $3/day for a year
Anomaly Detection threshold (if $5+)
DAILY email
Honest assessment: a determined accident in a single hour can still hurt. AWS doesn’t really offer real-time auto-stop at any reasonable price point — Budgets is the closest thing and it’s not real-time. If you genuinely need spike protection on the order of minutes, your options are (a) provision through Service Catalog with hard-coded instance type allowlists, (b) put an SCP on a proper org account, or (c) keep most of your spend on a fixed-cost subscription (Reserved/Savings Plans) so the variable portion can’t run away.
For a hobby account: this three-layer setup is what I’d call “good enough that I don’t lose sleep, cheap enough that it costs me nothing on top of what I’m already spending.”
Pitfalls I actually hit
AlreadyExists on the anomaly monitor. If you’ve ever clicked “set up anomaly detection” in the console, you have a DIMENSIONAL/SERVICE monitor and CFN can’t make a second one. Either import it or skip CFN for CAD.
ThresholdExpression shape. Must be a JSON-encoded string. CFN templates accept YAML for most properties so this looks like a typo, not a constraint.
Frequency: IMMEDIATE with email. Doesn’t work. Use SNS-then-email if you want IMMEDIATE; use DAILY/WEEKLY for direct email.
Auto-stop ≠ circuit breaker. Budgets is not real-time. If you write a tight Bedrock loop, the auto-stop fires hours after you’ve already spent the money.
Tag your stoppable instances. The Lambda filters on AutoStop=true. If you forget the tag, the Lambda runs successfully and stops nothing. Tag every personal EC2 instance with AutoStop=true by default; remove the tag explicitly when you want something to keep running.
Restart is manual on purpose. The Lambda has no ec2:StartInstances. Restarting an instance after auto-stop should be a conscious “okay, I know what’s running and why” decision, not an automatic recovery.
What to do
If you have a personal AWS account and you don’t have at least Layer 1 (a budget with email alerts), set that up today — it’s ten minutes in the console, free, and prevents the steady-drift class of problem.
If you’re running anything that bills by the hour (EC2, RDS, OpenSearch, GPU anything), add Layer 2. The Lambda is twenty lines and a tag.
If you’re using usage-billed services where one bug can spike fast (Bedrock, Lambda invocation count, S3 request count), add Layer 3 — DAILY anomaly emails at a $5 absolute threshold catch the “what is this charge” stuff before it becomes a “what was this charge” conversation.
Three layers, ~20 minutes total, zero ongoing cost. Worth it.