Policy Engineering

The Difference Between a Policy That Exists and One That Works

Amit Megiddo November 19, 2024


There is a document called a security policy. There is a set of controls that actually prevent things. These are related — the document describes the controls — but they are not the same thing, and conflating them is the most persistent error in enterprise cloud security.

This isn't a novel observation. Security practitioners have been making the documentation-vs-enforcement distinction for decades. What changed with cloud infrastructure is that the gap got much wider and much harder to see. When your controls are AWS SCPs, Azure Policy assignments, and GCP Organization constraints, you have thousands of policy evaluation decisions happening per second across dozens of accounts and subscriptions. The documentation looks comprehensive because the policy JSON is right there. The enforcement may be hollow because of one wrong condition key, one missing account in the OU scope, or one inherited allow that cancels the deny.

Two Distinct Artifacts That People Call "Policy"

When a security team says "we have a policy against public S3 buckets," they might mean any of the following:

A written statement in the security runbook that says public S3 buckets are prohibited
An SCP that denies s3:PutBucketAcl with a condition that blocks public ACLs
An AWS Config rule that flags buckets with public access enabled
S3 Block Public Access settings enabled at the account level
A Terraform module that always sets block_public_acls = true and is used by all teams

These have entirely different enforcement characteristics. The first one relies on human compliance. The second and fourth are preventive controls — they block the action before it happens. The third is a detective control — it finds the violation after it exists. The fifth is a deployment-time guardrail that can be bypassed by anyone who doesn't use the module.

The problem is that organizations track all five in the same column of their controls spreadsheet. "Public S3 buckets: controlled." The spreadsheet is wrong for three of those five entries, at least in terms of what the word "controlled" implies to an auditor or an incident responder.
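One way to keep the spreadsheet honest is to record the mechanism behind each entry instead of a single "controlled" flag. The sketch below is illustrative only: the `Mechanism` enum, the `S3_CONTROLS` mapping, and the `overstated` helper are hypothetical names, and the entries simply restate the five examples above.

```python
from enum import Enum


class Mechanism(Enum):
    WRITTEN_RULE = "relies on human compliance"
    PREVENTIVE = "blocks the action before it happens"
    DETECTIVE = "finds the violation after it exists"
    DEPLOY_TIME = "guards only deployments that use the module"


# The five "public S3 bucket" controls from the list above (labels are illustrative).
S3_CONTROLS = {
    "runbook statement": Mechanism.WRITTEN_RULE,
    "SCP denying s3:PutBucketAcl": Mechanism.PREVENTIVE,
    "AWS Config rule": Mechanism.DETECTIVE,
    "S3 Block Public Access": Mechanism.PREVENTIVE,
    "Terraform module": Mechanism.DEPLOY_TIME,
}


def overstated(controls):
    """Return the controls for which a bare 'controlled' label implies more than they enforce."""
    return [name for name, mech in controls.items() if mech is not Mechanism.PREVENTIVE]


for name in overstated(S3_CONTROLS):
    print(f"{name}: {S3_CONTROLS[name].value}")
```

Run against the five examples, this flags the runbook statement, the Config rule, and the Terraform module, which are the three entries the spreadsheet overstates.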

The Gap Is Structural, Not Operational

Most security teams understand this distinction conceptually. The gap persists because of structural factors that make it hard to bridge in practice.

Policy authorship and policy enforcement are owned by different people

A security architect writes the SCP. An infrastructure engineer attaches it to the org hierarchy. A cloud operations team manages account provisioning. None of these people typically close the loop: does the SCP as written, as attached, in the actual org structure as it exists today, produce the deny that the security architect intended when the infrastructure engineer's Terraform runs the action?

When we look at the handoff between these roles, we almost always find assumptions that don't hold. The architect assumes the attachment is correct. The infrastructure engineer assumes the JSON is correct. The cloud ops team assumes both. Nobody tests the actual API call from the actual principal in the actual account.

Cloud infrastructure evolves faster than policy review cycles

Accounts get created, OUs get restructured, service control policies get modified to unblock a team that hit a false-positive denial. Each of these changes can alter the effective policy for accounts you're not thinking about. The SCP you wrote six months ago may still exist, still be attached, still look correct in the console — and have an entirely different effective behavior after three infrastructure changes downstream.

A quarterly policy review that reads the SCP JSON doesn't catch this. You need to evaluate the effective policy against the current org structure, not the policy document in isolation.

Detective controls get classified as preventive

AWS Config, Microsoft Defender for Cloud compliance scores, GCP Security Command Center findings: these are all valuable. They also all detect violations after they occur. There's a version of "controlled" that means "we will know within 15 minutes when this happens and remediate." There's another version that means "this cannot happen." Both are legitimate security postures, but they have different risk profiles. A misconfiguration that exists for 15 minutes in a production environment is a different risk than one that is blocked before the resource is ever created.

We're not saying detective controls are bad. We're saying that classifying a detective control as a preventive control in your risk register is a documentation error that has real consequences when you're explaining the timeline of an incident.

A Framework for Categorizing What You Actually Have

When auditing a cloud security posture, we find it useful to evaluate every control statement against four questions:

1. Is this preventive or detective? Preventive: the action is blocked. Detective: a violation creates an alert or finding. This determines your exposure window.

2. Is the enforcement scope correct? For AWS SCPs: which accounts are in scope? Is the management account excluded? Are there direct children of root that bypass OU-level SCPs? For Azure Policy: which management groups and subscriptions are in scope? For GCP: which projects are in the org node hierarchy?

3. Has it been tested against real API calls? Not simulated. Real calls, from real principals with real permissions, in real member accounts. If the answer is "we ran it through the policy simulator," that is better than nothing but is not a test.

4. Does the current org structure match the assumptions the policy was written against? This is the hardest question because it requires understanding the historical context of the policy. The person who wrote the SCP six months ago may have assumed a flat OU structure. The structure is now three levels deep with delegated admin accounts. Does the policy still behave correctly?
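The four questions can be captured as a small record per control, so that an unanswered question stays visible instead of collapsing into "controlled." This is a minimal sketch; the `ControlAudit` class, its field names, and the example values are assumptions made for illustration, not an established schema.

```python
from dataclasses import dataclass


@dataclass
class ControlAudit:
    name: str
    preventive: bool               # Q1: blocked up front (True) or found afterwards (False)
    scope_verified: bool           # Q2: enforcement scope checked against the org
    tested_with_real_calls: bool   # Q3: exercised with real API calls, not only simulated
    org_assumptions_current: bool  # Q4: org structure still matches what the author assumed

    def unverified(self):
        """List the questions (2 to 4) that have not been answered with 'yes'."""
        checks = {
            "enforcement scope": self.scope_verified,
            "real API test": self.tested_with_real_calls,
            "org structure assumptions": self.org_assumptions_current,
        }
        return [label for label, ok in checks.items() if not ok]

    def summary(self):
        kind = "preventive" if self.preventive else "detective"
        gaps = self.unverified()
        status = "verified" if not gaps else "unverified: " + ", ".join(gaps)
        return f"{self.name} [{kind}] {status}"


# Hypothetical example: scope was checked, but nobody has tested or revisited the SCP.
region_scp = ControlAudit("region restriction SCP", True, True, False, False)
print(region_scp.summary())
```

Question 1 is kept as a classification, not a pass/fail check, because a detective control is a legitimate answer as long as it is recorded as one.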

The Enforcement Gap Has a Specific Shape in Cloud Environments

In traditional data center security, the gap between policy-as-documented and policy-as-enforced was mostly about process compliance: did humans follow the procedure? Cloud infrastructure changes the shape of the gap. The enforcement logic is code: SCP JSON, Azure Policy definitions, OPA Rego, Terraform guard rules. Code has bugs. Code has edge cases. Code has unintended interactions with other code.

The gap is now a software correctness problem as much as a process compliance problem. And software correctness problems require testing, not review.

Consider a policy statement that appears in many AWS SCP libraries: deny all actions unless the request targets an approved region. Here's a common implementation:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyNonApprovedRegions",
      "Effect": "Deny",
      "NotAction": [
        "iam:*",
        "organizations:*",
        "support:*",
        "sts:AssumeRole"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": [
            "us-east-1",
            "us-west-2",
            "eu-west-1"
          ]
        }
      }
    }
  ]
}

This is a legitimate pattern. But NotAction is a broad carve-out. Every IAM action, every Organizations action, every Support action, and sts:AssumeRole is excluded from the region restriction. (Other STS actions are not in the list, so they remain subject to it.) If your threat model includes a compromised IAM credential trying to enumerate IAM roles or call sts:AssumeRole in a restricted region, this policy does not cover it. The policy exists. It does something. But whether it covers what the security team intended depends on whether the person who wrote it understood NotAction semantics and whether the list of carved-out actions was intentional.
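To make the carve-out concrete, here is a deliberately simplified model of that single statement. It is a sketch, not the AWS evaluation engine: it ignores every other policy in the chain, and the function names `exempt` and `denied_by_region_scp` are hypothetical. Action matching is treated as case-insensitive with `*` as a wildcard.

```python
from fnmatch import fnmatchcase

# Copied from the DenyNonApprovedRegions statement above.
NOT_ACTION = ["iam:*", "organizations:*", "support:*", "sts:AssumeRole"]
APPROVED_REGIONS = {"us-east-1", "us-west-2", "eu-west-1"}


def exempt(action):
    """True if the action matches a NotAction pattern, so the Deny never applies to it."""
    lowered = action.lower()
    return any(fnmatchcase(lowered, pattern.lower()) for pattern in NOT_ACTION)


def denied_by_region_scp(action, region):
    """Simplified evaluation of this one statement: deny non-exempt actions outside approved regions."""
    return not exempt(action) and region not in APPROVED_REGIONS


for action in ["ec2:RunInstances", "iam:ListRoles", "sts:AssumeRole", "sts:GetCallerIdentity"]:
    print(action, denied_by_region_scp(action, "ap-southeast-2"))
```

In this model, ec2:RunInstances and sts:GetCallerIdentity are denied in ap-southeast-2, while iam:ListRoles and sts:AssumeRole pass straight through the statement. Whether that is acceptable is a threat-model decision, but it should be a deliberate one.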

Bridging the Gap: What Works in Practice

The most reliable way to bridge the documentation-enforcement gap is to treat policy enforcement as a code quality problem with the standard software engineering responses: testing, version control, and continuous verification.

Version control is usually present — SCPs get stored in a git repository. Testing is almost always absent. Continuous verification — checking on an ongoing basis that the live org state matches the intended policy — is rare.

For continuous verification, the starting point is running periodic checks against your org structure and the effective policy at each scope level. This means:

Comparing the list of accounts and their OU placement against expected topology
Evaluating which SCPs are effective for each account (accounting for inheritance)
Checking that explicit-deny statements in your critical SCPs are reachable given the full SCP evaluation chain
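The first of these checks is a plain comparison once both sides are expressed as a mapping from account ID to OU ID. A minimal sketch, assuming the expected topology is kept somewhere reviewable (for example alongside the SCPs in git); the `placement_drift` function and the sample IDs are hypothetical.

```python
def placement_drift(expected, actual):
    """Compare account -> OU placement. Both arguments map an account ID to an OU ID."""
    shared = expected.keys() & actual.keys()
    return {
        # Accounts that exist in both but sit under a different OU than expected.
        "moved": {a: (expected[a], actual[a]) for a in sorted(shared) if expected[a] != actual[a]},
        # Accounts present in the org that the expected topology does not know about.
        "unexpected": sorted(actual.keys() - expected.keys()),
        # Accounts the expected topology lists that are no longer in the org.
        "missing": sorted(expected.keys() - actual.keys()),
    }


# Hypothetical sample data.
expected = {"111111111111": "ou-prod", "222222222222": "ou-dev"}
actual = {"111111111111": "ou-sandbox", "333333333333": "ou-dev"}
print(placement_drift(expected, actual))
```

Any non-empty bucket is a prompt to re-check which SCPs now apply to the affected accounts.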

None of this requires a commercial tool. The AWS Organizations API exposes everything you need: list-policies-for-target, list-children, describe-policy. A Python script that walks the org tree and reports the effective SCP for every account runs in a few minutes and surfaces structural gaps that are invisible when reading a single policy in isolation.
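A minimal sketch of such a walk follows. It assumes an object with the same `list_children` and `list_policies_for_target` methods as the boto3 Organizations client; the helper names `_paged` and `scp_chain_by_account` are hypothetical. It reports the SCPs attached along the path from a starting node down to each account. It does not evaluate their statements, so it shows what is in play for an account, not the final allow or deny for a given call.

```python
def _paged(call, key, **kwargs):
    """Yield items from a paginated Organizations call by following NextToken."""
    while True:
        page = call(**kwargs)
        yield from page.get(key, [])
        token = page.get("NextToken")
        if not token:
            return
        kwargs["NextToken"] = token


def _attached_scps(org, target_id):
    """Return (target id, SCP name) pairs for SCPs attached directly to one target."""
    policies = _paged(org.list_policies_for_target, "Policies",
                      TargetId=target_id, Filter="SERVICE_CONTROL_POLICY")
    return tuple((target_id, p["Name"]) for p in policies)


def scp_chain_by_account(org, parent_id, inherited=()):
    """Map each account ID under parent_id to the SCPs attached along its path."""
    here = tuple(inherited) + _attached_scps(org, parent_id)
    result = {}
    for acct in _paged(org.list_children, "Children",
                       ParentId=parent_id, ChildType="ACCOUNT"):
        result[acct["Id"]] = list(here + _attached_scps(org, acct["Id"]))
    for ou in _paged(org.list_children, "Children",
                     ParentId=parent_id, ChildType="ORGANIZATIONAL_UNIT"):
        result.update(scp_chain_by_account(org, ou["Id"], here))
    return result


# Usage with boto3 (not executed here; needs credentials for the management
# account or a delegated administrator):
#   import boto3
#   org = boto3.client("organizations")
#   root_id = org.list_roots()["Roots"][0]["Id"]
#   for account_id, chain in scp_chain_by_account(org, root_id).items():
#       print(account_id, chain)
```

Accounts whose chain is missing an SCP you consider critical, or that sit directly under the root, are the structural gaps worth looking at first.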

What it cannot tell you without testing: whether the SCP's condition logic produces the intended evaluation result for a real API call. That still requires actually making the call.
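A real-call test can be as small as one helper that makes the call and reports whether it was rejected. The sketch below assumes boto3-style exceptions that carry the error code in `exc.response["Error"]["Code"]`. The `call_is_denied` name is hypothetical, and the set of codes is an assumption that is not exhaustive, since services report authorization failures under different codes.

```python
# Error codes treated as "denied". Assumed and not exhaustive: extend per service.
DENIED_CODES = {"AccessDenied", "AccessDeniedException", "UnauthorizedOperation"}


def call_is_denied(api_call):
    """Run api_call() and report whether it was rejected with an authorization error."""
    try:
        api_call()
    except Exception as exc:
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code in DENIED_CODES:
            return True
        raise  # some other failure: do not count it as a deny
    return False


# Usage with boto3 (not executed here). Run it as a principal in a member account
# whose IAM permissions allow the call, so that a denial points at the SCP:
#   import boto3
#   ec2 = boto3.client("ec2", region_name="ap-southeast-2")  # not an approved region
#   assert call_is_denied(ec2.describe_instances)
```

Unrelated errors are re-raised on purpose: a throttled or malformed request that happens to fail should not be recorded as evidence that the control works.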

The policy that works is the one that has been tested under the conditions it's meant to enforce. The policy that exists is a starting point. Treating those two things as the same is how you end up explaining in an incident review why a control you had on paper didn't stop anything.