Sprache des Inhalts

Im Moment sind nicht alle Inhalte übersetzt.

Lernen
Prove It, Part 2: Formal logic, Cedar policies, and the economics of verification

Prove It, Part 2: Formal logic, Cedar policies, and the economics of verification

Wie war dieser Inhalt?

collaborative team meeting ai project discussion

Automated reasoning adds verification on top of LLM outputs by translating business policies into formal logic and checking them against those rules. It detects hallucinations, enforces compliance constraints, and controls agent actions with full coverage at a fraction of the cost of manual QA.

In Part 1 ('Probably Correct' Is Not Good Enough"), we covered why startups building on AI need deterministic verification, not just probabilistic safeguards. In this post, we go under the hood. If you are a technical co-founder, engineering lead, or senior developer at an AI startup, this is where you learn how automated reasoning actually works, what happens when your LLM generates a response and formal logic checks it, and why the economics makes this a no-brainer compared to hiring QA teams.

No API code in this post. That comes in Part 3. Here we focus on concepts, formal logic, and the verification pipeline so you understand what you are building.

Stochastic vs. deterministic: The fundamental distinction

Before diving into the mechanics, it is worth grounding the core distinction clearly, as it affects every architectural decision you make.

LLMs are stochastic.

They predict the next token based on probability distributions learned from training data. Temperature, top-k, and top-p sampling parameters introduce deliberate randomness. Run the same prompt twice and you may get different answers. This is a feature, not a bug: it is what makes LLMs flexible, creative, and capable of handling diverse natural language inputs. But it also means that any single output is a sample from a probability distribution, not a proven fact.

Automated reasoning (AR) is correct.

Given a set of rules and an input, the result is not a best guess. It is a proof. A statement is either provably valid with respect to your rules, provably invalid, or the system tells you precisely what information is missing or ambiguous. If the answer is wrong, AR does not just flag it; it gives you a counterexample showing exactly why it fails. The output is mathematically verifiable, not probabilistic. There is no temperature knob. There is no sampling. There is a proof you can inspect.

This is not about one approach being "better." LLMs are better at understanding natural language and generating human-like responses. AR is better at checking whether those responses are logically correct. Together, they form the complete stack: the LLM handles the conversation, AR handles correctness. This combination of neural networks and formal logic is what researchers call neurosymbolic AI.

For a five-person startup shipping to your first 100 customers, this distinction has a very practical consequence. You do not have the headcount for a QA team, and you do not have the runway to absorb a compliance incident. Formal verification becomes a force multiplier that lets a small team ship with the confidence of a much larger one.

What are the two layers of protection in Amazon Bedrock?

AWS provides AR at two layers, and they map cleanly to the maturity curve of most AI startups. When you are early-stage, your product is likely a chatbot or assistant: an LLM answering customer questions, generating recommendations, or providing guidance. At this stage, automated reasoning in Amazon Bedrock Guardrails verifies the content of model outputs against defined business rules

As you grow, you start building agentic workflows: your AI books appointments, processes refunds, queries databases, calls external APIs. At this layer, Policy in Amazon Bedrock AgentCore governs what actions an agent is allowed to take, using Cedar policies enforced at the gateway before any tool execution.

Cedar policies govern what an agent can do. But you can also expose the AR checks API as a tool in the agent's toolkit, letting the agent validate its own intermediate conclusions before acting on them. The agent calls ApplyGuardrail against your policy, inspects the findings, and self-corrects without involving the user. An agent planning a multi-step workflow can check whether its proposed sequence violates any constraint, catch a contradiction in its assumptions, or confirm that a derived value actually follows from the inputs. This shifts AR from a boundary check to a reasoning partner: the agent does not just get blocked at the gate, it avoids walking toward the gate in the first place.

Here is how they compare:

AR checks and Policy in Amazon Bedrock AgentCore protect different layers of your application.

AR checks validate what your AI says, i.e. is this mortgage eligibility answer correct per your lending criteria?
Policy in Amazon Bedrock AgentCore controls what your AI does, i.e. is this agent authorized to initiate a refund or query a customer's account?

Depending on what you are building today, you might start with one or both.

If your MVP is a chatbot answering policy questions, AR checks is your first integration.
If you are shipping an agent that books appointments and moves money, Policy is non-negotiable from day one.

Most startups will eventually run both together.

How AR checks work: Encoding your business rules as formal logic

The input to AR is a policy, and a policy starts with documents you already have.

If you are a fintech startup, it could be a lending criteria document. In healthcare, it may be clinical protocols or HIPAA handling procedures. In insurance, underwriting guidelines. These documents define what your AI should and should not say. They have been reviewed by your legal counsel. Your investors asked about them during due diligence. AR turns those existing artifacts into enforceable rules. You are not building something new from scratch. You are activating something you already have.

When you upload a policy document (PDF, Markdown, or plain text) to Amazon Bedrock, the system extracts two things: variables and rules.

Variables represent the concepts in your domain. Each variable has a name, a type, and a description. For example, in a mortgage eligibility policy:

The variable descriptions are the single most important factor in accuracy. Vague descriptions leave the LLM guessing at what a variable represents. Detailed descriptions that explain what the concept means, how your users refer to it, and how it appears in your policy documents give the LLM the context it needs to map natural language to the right formal variables. This is where your domain expertise as a startup builder matters most.

Rules are formal logic expressions that capture relationships between variables. They use a subset of SMT-LIB syntax, and most follow an if-then (implicative) format:

This is what Bedrock generates under the hood when you upload your mortgage policy document. You can review these rules in the console to verify the system captured your intent correctly, but you are not writing SMT-LIB by hand. You write the policy in natural language, and the system translates it into formal logic.

The two-step verification process

Once you have set up an AR policy and attached it to a guardrail (covered in Part 3), every LLM response your application sends to ApplyGuardrail goes through a two-step verification process. Understanding this split is critical because it tells you where to invest your effort.

Step 1: Translation

The system converts the natural language in both the user question and the LLM's response into formal logic predicates, statements like earning “>= 180 & earning <= 220” rather than simple assignments, using the variables declared in your policy. This is where the variable descriptions come in.

For example, if the response says "a customer with $30,000 USD down on a $350,000 USD house qualifies for a conventional mortgage," the translation step produces: downPaymentAmount = 30000, purchasePrice = 350000, mortgageType = CONVENTIONAL.

The fact that we use LLMs to translate your input natural language into logic is why the system claims up to 99 percent verification accuracy. To ensure our translation is correct, we don't just rely on one LLM. Instead, we use multiple LLMs at different temperatures in parallel to perform the same translation. Only when the redundant translations are semantically equivalent can we make a validation result. When the translations disagree, we instead return an AMBIGUOUS result that includes two possible interpretations of the input. Your LLM can use this feedback to disambiguate its answer or ask the user clarifying questions. When we benchmarked our system against Carnegie Mellon's conditional QA data set, in over 99 percent of cases where an answer was flagged as VALID from our system the flag was correct. You can find more details on the methodology in this paper.

Step 2: Verification

An Satisfiability Modulo Theories (SMT) solver evaluates whether the translated assignments satisfy all your policy rules. This step is mathematically sound by definition. Once the translation is correct, the verification is provably correct.

For the mortgage example: the solver computes downPaymentPercentage = (30000 / 350000) * 100 = 8.57 percent, checks this against the rule (=> (< downPaymentPercentage 20.0) (not (= mortgageType CONVENTIONAL))), finds that 8.57 < 20, and concludes that mortgageType = CONVENTIONAL is logically invalid. The system returns an invalid finding with a complete trace showing exactly which rule was violated and why.

The practical takeaway for startup builders? The translation quality is something you control through better variable descriptions. The verification math is something you do not have to worry about. You invest in describing your domain well, and the solver handles the rest.

What do the findings mean?

AR checks do not simply return "pass" or "fail." They return structured findings that tell you exactly what happened and what to do next. Each finding is one of these types:

This is important: AR checks operate in detect mode. They return findings and feedback, but do not block the response. Your application inspects the findings and decides what to do: serve it, rewrite it, ask for clarification, or fall back to a safe default. This gives you full control over the user experience while getting mathematical verification on every response. For a working example of this pattern, see the open source rewriting chatbot implementation, which demonstrates how to take AR findings and feed them back to the LLM for automated correction.

AgentCore Policy: Deterministic boundaries for AI agents

If your startup is building agentic applications, where your AI orchestrates multi-step workflows, calls tools, and takes actions on behalf of users, you have a different trust problem. It is not just about what the AI says. It is about what the AI does.

For startups in healthcare, fintech, or legal, "the agent did something unexpected" is not acceptable from a compliance or regulatory perspective. You need provable boundaries on agent behavior, and those boundaries need to hold even when the agent is manipulated by prompt injection or makes a reasoning error.

Policy in Amazon Bedrock AgentCore solves this with an authorization layer enforced on every agent-tool interaction at the gateway boundary. Policies are written in Cedar, the authorization language built by AWS to be verified with AR. Cedar policies define who (principal) can do what (action) on which resource, under what conditions. No tool invocation passes through the gateway without a matching permit. You can write policies directly in Cedar or describe your rules in natural language and let the system generate Cedar for you.

Before your policies are deployed, automated reasoning runs semantic validation to catch problems, including:

Overly permissive policies that would allow all requests for a given principal/action/resource combination
Overly restrictive policies that would deny everything
Ineffective policies where a Permit permits nothing or a Forbid forbids nothing

The enforcement follows two principles that are particularly important for startups in regulated industries:

Default-deny: if no policy explicitly permits an action, it is blocked. Your agent cannot do anything you did not authorize.
Forbid always wins over permit: you can set hard-stop rules that cannot be overridden, regardless of what other policies exist.

Consider a healthcare startup building an appointment scheduling agent. With AgentCore Policy, you can enforce that patients only access their own records, that appointment bookings are restricted to business hours, and that refund amounts are capped by policy. These rules hold regardless of how the agent is prompted, any bugs that exist in your code, or how creative the user gets with their requests. The enforcement happens at the gateway boundary, outside the agent entirely.

How much does automated reasoning cost compared to manual QA?

To understand the economics of automated reasoning, it is useful to compare it against the baseline most teams currently use: manual QA review of model outputs.

The old way: Manual QA

A reviewer checking AI outputs can handle roughly 50 interactions per hour at a loaded cost of about $40 USD/hour. For a startup handling 100,000 customer interactions monthly, reviewing every interaction costs $80,000 USD/month. Even sampling 10 percent costs $8,000 USD/month, and you are still unverified on the other 90 percent. At $80K to $150K USD per reviewer annually, a team of 5 to 10 reviewers runs $400K to $1.5M USD per year. For most startups, that is a significant chunk of funding spent on checking whether your AI is correct.

The new way: automated reasoning

AR checks are priced per validation request. (Exact pricing should be confirmed on the Bedrock pricing page as rates may vary.) Unlike manual review, every interaction can be verified rather than sampled. Amazon Bedrock Guardrails uses text units as the validation unit, where each text unit is 1,000 characters. Here is what that looks like for a typical startup:

Assumptions

Calculation

Using the example above of a startup handling 100K interactions per month, and assuming that the output of each interaction is ~200 tokens, we can easily estimate the cost of using AR checks for validation. Amazon Bedrock Guardrails measures usage in Text Units, where one text unit is 1000 characters. Using an average of 4 characters per token, we estimate each interaction to be ~800 characters and therefore 0.8 text units.

Validation of 100,000 interactions of 0.8 text units each gives us a total of 80K text units per month. With AR checks price of $0.17 per 1000 text units, our total monthly cost for a single policy would be $13.60 USD. With multiple policies applied, costs scale linearly. Even with multiple policies, the total cost remains dramatically lower than manual review.

The compliance math is even more compelling. A single HIPAA violation category can cost up to $2,067,813 USD per year under current inflation-adjusted figures. For a healthcare startup whose AI processes thousands of patient interactions daily, the regulatory exposure from unverified outputs is existential. In this context, AR functions less like a software expense and more like a risk management control.

For developers, the closest analogy is Rust's borrow checker. Rust catches memory corruption at compile time, before code reaches production. AR catches policy violations at verification time, before they reach your customer. Catching errors before they cause damage is always cheaper than cleaning up afterward.

There is a trade-off: AR validation adds latency to each response, and rewriting means your LLM runs twice on flagged outputs. For most applications, however, those costs are trivial next to the operational, legal, or reputational impact of an incorrect output.

The same argument resonates with investors. Being able to say that AI outputs are systematically verified against formal business policies is not simply a product feature. It is proof that risk is being managed proactively. That is the kind of concrete, verifiable claim that strengthens a startup’s risk profile.

What comes next?

In Part 3: A Step-by-Step Implementation Playbook, we go from concepts to code. You will learn how to create an AR policy from your existing documents, deploy it to a guardrail, integrate with the ApplyGuardrail API, handle findings programmatically, set up AgentCore Policy through AgentCore Gateway, and build an audit trail. Everything you need to go from reading about mathematical verification to shipping it in production.

Harshvardhan Chunawala

Harshvardhan Chunawala ist Solutions Architect bei AWS sowie Authorized Educator der AWS Academy und lebt in den USA. Er arbeitet weltweit mit Führungskräften großer Unternehmen, Startup-Gründern und Vorstandsmitgliedern zusammen, um für verschiedene Branchen skalierbare und sichere Cloud-Infrastrukturen in AWS zu entwickeln. Er wurde mit dem AWS Golden Jacket ausgezeichnet und kooperiert mit mehreren Amazon-Teams, um wegweisende Cloud-Funktionen für Sicherheits- und Satellitenservices sowie für vertrauenswürdige agentenbasierte KI-Services zu gestalten und bereitzustellen. Neben seiner Tätigkeit bei AWS ist er ein weltweit anerkannter Technologe und Experte für Cloud-Sicherheit mit über 10 Jahren Erfahrung. Darüber hinaus ist er an der Carnegie Mellon University tätig, wo er an der Forschung und am Mentoring in den Bereichen Cloud Computing und neue Technologien mitwirkt. Zu seinen Hobbys zählen das Fallschirmspringen und das Fliegen.

Mike Miller

Mike Miller ist Director of AI Product Management bei AWS, wo er bei wichtigen Initiativen im Bereich der generativen KI beratend tätig ist. Dazu gehören Automated-Reasoning-Funktionen zur Vermeidung von Halluzinationen, Amazon Q und Amazon Bedrock. Er machte PartyRock, eine No-Code-Plattform zur Entwicklung generativer KI-Anwendungen, öffentlich zugänglich, nachdem sich eine interne Version davon unter den Amazon-Mitarbeitenden extrem großer Beliebtheit erfreut hatte. Zuvor leitete Mike das AWS-Team Machine Learning Thought Leadership. Dort brachte er AWS DeepLens, AWS DeepRacer und AWS DeepComposer auf den Markt, womit er Entwicklern weltweit auf unterhaltsame und ansprechende Weise praktische Erfahrungen mit Machine Learning ermöglichte. Mike ist seit über 13 Jahren bei Amazon tätig. Vor seinem Wechsel zu AWS leitete er das Produktmanagement für Fire TV bei Lab126.

Rahul Kumar

Dr. Rahul Kumar ist Senior Applied Science Manager bei AWS, wo er die Entwicklung von Verifizierungstechnologien für Rust- und C-Programme und die Weiterentwicklung der neuro-symbolischen KI leitet, die große Sprachmodelle mit Automated Reasoning kombiniert. Bei AWS treibt Rahul Open-Source-Initiativen voran, darunter den Modellprüfer Kani und die Challenge „Verify the Safety of the Rust Standard Library“. Er promovierte an der Brigham Young University und arbeitete zuvor im Bereich der formalen Verifizierung und statischen Analyse bei Microsoft Research und im NASA Jet Propulsion Laboratory sowie als Dozent am Caltech. Er setzt sich leidenschaftlich dafür ein, Automated Reasoning einem breiteren Publikum zugänglich zu machen, und hält Vorträge darüber, wie mathematische Beweisverfahren KI-Halluzinationen beseitigen und die Korrektheit von Software gewährleisten können. Er lebt in Seattle, Washington.

Stefano Buliani

Stefano Buliani ist Principal Product Manager in der Automated Reasoning Group bei AWS. Dort leitet er die Bemühungen, mithilfe des Amazon-Bedrock-Integritätsschutzes formale Verifizierungsfunktionen für generative KI bereitzustellen. Stefano ist gelernter Softwareentwickler und seit über 12 Jahren bei AWS tätig, und zwar sowohl als Specialist Solutions Architect wie auch als Produktmanager in den Teams für Serverless-Lösungen und Automated Reasoning. In seiner früheren Position unterstützte er Kunden bei der Entwicklung und Skalierung von Serverless-Anwendungen in AWS Lambda und Amazon API Gateway. In seiner Freizeit unternimmt Stefano gerne Ausflüge in die Natur im pazifischen Nordwesten Nordamerikas. Er lebt in Vancouver, Kanada.

Wie war dieser Inhalt?