콘텐츠 언어

현재 모든 콘텐츠가 번역되지는 않습니다.

학습
Prove It, Part 2: Formal logic, Cedar policies, and the economics of verification

Prove It, Part 2: Formal logic, Cedar policies, and the economics of verification

이 콘텐츠는 어떠셨나요?

collaborative team meeting ai project discussion

Automated reasoning adds verification on top of LLM outputs by translating business policies into formal logic and checking them against those rules. It detects hallucinations, enforces compliance constraints, and controls agent actions with full coverage at a fraction of the cost of manual QA.

In Part 1 ('Probably Correct' Is Not Good Enough"), we covered why startups building on AI need deterministic verification, not just probabilistic safeguards. In this post, we go under the hood. If you are a technical co-founder, engineering lead, or senior developer at an AI startup, this is where you learn how automated reasoning actually works, what happens when your LLM generates a response and formal logic checks it, and why the economics makes this a no-brainer compared to hiring QA teams.

No API code in this post. That comes in Part 3. Here we focus on concepts, formal logic, and the verification pipeline so you understand what you are building.

Stochastic vs. deterministic: The fundamental distinction

Before diving into the mechanics, it is worth grounding the core distinction clearly, as it affects every architectural decision you make.

LLMs are stochastic.

They predict the next token based on probability distributions learned from training data. Temperature, top-k, and top-p sampling parameters introduce deliberate randomness. Run the same prompt twice and you may get different answers. This is a feature, not a bug: it is what makes LLMs flexible, creative, and capable of handling diverse natural language inputs. But it also means that any single output is a sample from a probability distribution, not a proven fact.

Automated reasoning (AR) is correct.

Given a set of rules and an input, the result is not a best guess. It is a proof. A statement is either provably valid with respect to your rules, provably invalid, or the system tells you precisely what information is missing or ambiguous. If the answer is wrong, AR does not just flag it; it gives you a counterexample showing exactly why it fails. The output is mathematically verifiable, not probabilistic. There is no temperature knob. There is no sampling. There is a proof you can inspect.

This is not about one approach being "better." LLMs are better at understanding natural language and generating human-like responses. AR is better at checking whether those responses are logically correct. Together, they form the complete stack: the LLM handles the conversation, AR handles correctness. This combination of neural networks and formal logic is what researchers call neurosymbolic AI.

For a five-person startup shipping to your first 100 customers, this distinction has a very practical consequence. You do not have the headcount for a QA team, and you do not have the runway to absorb a compliance incident. Formal verification becomes a force multiplier that lets a small team ship with the confidence of a much larger one.

What are the two layers of protection in Amazon Bedrock?

AWS provides AR at two layers, and they map cleanly to the maturity curve of most AI startups. When you are early-stage, your product is likely a chatbot or assistant: an LLM answering customer questions, generating recommendations, or providing guidance. At this stage, automated reasoning in Amazon Bedrock Guardrails verifies the content of model outputs against defined business rules

As you grow, you start building agentic workflows: your AI books appointments, processes refunds, queries databases, calls external APIs. At this layer, Policy in Amazon Bedrock AgentCore governs what actions an agent is allowed to take, using Cedar policies enforced at the gateway before any tool execution.

Cedar policies govern what an agent can do. But you can also expose the AR checks API as a tool in the agent's toolkit, letting the agent validate its own intermediate conclusions before acting on them. The agent calls ApplyGuardrail against your policy, inspects the findings, and self-corrects without involving the user. An agent planning a multi-step workflow can check whether its proposed sequence violates any constraint, catch a contradiction in its assumptions, or confirm that a derived value actually follows from the inputs. This shifts AR from a boundary check to a reasoning partner: the agent does not just get blocked at the gate, it avoids walking toward the gate in the first place.

Here is how they compare:

AR checks and Policy in Amazon Bedrock AgentCore protect different layers of your application.

AR checks validate what your AI says, i.e. is this mortgage eligibility answer correct per your lending criteria?
Policy in Amazon Bedrock AgentCore controls what your AI does, i.e. is this agent authorized to initiate a refund or query a customer's account?

Depending on what you are building today, you might start with one or both.

If your MVP is a chatbot answering policy questions, AR checks is your first integration.
If you are shipping an agent that books appointments and moves money, Policy is non-negotiable from day one.

Most startups will eventually run both together.

How AR checks work: Encoding your business rules as formal logic

The input to AR is a policy, and a policy starts with documents you already have.

If you are a fintech startup, it could be a lending criteria document. In healthcare, it may be clinical protocols or HIPAA handling procedures. In insurance, underwriting guidelines. These documents define what your AI should and should not say. They have been reviewed by your legal counsel. Your investors asked about them during due diligence. AR turns those existing artifacts into enforceable rules. You are not building something new from scratch. You are activating something you already have.

When you upload a policy document (PDF, Markdown, or plain text) to Amazon Bedrock, the system extracts two things: variables and rules.

Variables represent the concepts in your domain. Each variable has a name, a type, and a description. For example, in a mortgage eligibility policy:

The variable descriptions are the single most important factor in accuracy. Vague descriptions leave the LLM guessing at what a variable represents. Detailed descriptions that explain what the concept means, how your users refer to it, and how it appears in your policy documents give the LLM the context it needs to map natural language to the right formal variables. This is where your domain expertise as a startup builder matters most.

Rules are formal logic expressions that capture relationships between variables. They use a subset of SMT-LIB syntax, and most follow an if-then (implicative) format:

This is what Bedrock generates under the hood when you upload your mortgage policy document. You can review these rules in the console to verify the system captured your intent correctly, but you are not writing SMT-LIB by hand. You write the policy in natural language, and the system translates it into formal logic.

The two-step verification process

Once you have set up an AR policy and attached it to a guardrail (covered in Part 3), every LLM response your application sends to ApplyGuardrail goes through a two-step verification process. Understanding this split is critical because it tells you where to invest your effort.

Step 1: Translation

The system converts the natural language in both the user question and the LLM's response into formal logic predicates, statements like earning “>= 180 & earning <= 220” rather than simple assignments, using the variables declared in your policy. This is where the variable descriptions come in.

For example, if the response says "a customer with $30,000 USD down on a $350,000 USD house qualifies for a conventional mortgage," the translation step produces: downPaymentAmount = 30000, purchasePrice = 350000, mortgageType = CONVENTIONAL.

The fact that we use LLMs to translate your input natural language into logic is why the system claims up to 99 percent verification accuracy. To ensure our translation is correct, we don't just rely on one LLM. Instead, we use multiple LLMs at different temperatures in parallel to perform the same translation. Only when the redundant translations are semantically equivalent can we make a validation result. When the translations disagree, we instead return an AMBIGUOUS result that includes two possible interpretations of the input. Your LLM can use this feedback to disambiguate its answer or ask the user clarifying questions. When we benchmarked our system against Carnegie Mellon's conditional QA data set, in over 99 percent of cases where an answer was flagged as VALID from our system the flag was correct. You can find more details on the methodology in this paper.

Step 2: Verification

An Satisfiability Modulo Theories (SMT) solver evaluates whether the translated assignments satisfy all your policy rules. This step is mathematically sound by definition. Once the translation is correct, the verification is provably correct.

For the mortgage example: the solver computes downPaymentPercentage = (30000 / 350000) * 100 = 8.57 percent, checks this against the rule (=> (< downPaymentPercentage 20.0) (not (= mortgageType CONVENTIONAL))), finds that 8.57 < 20, and concludes that mortgageType = CONVENTIONAL is logically invalid. The system returns an invalid finding with a complete trace showing exactly which rule was violated and why.

The practical takeaway for startup builders? The translation quality is something you control through better variable descriptions. The verification math is something you do not have to worry about. You invest in describing your domain well, and the solver handles the rest.

What do the findings mean?

AR checks do not simply return "pass" or "fail." They return structured findings that tell you exactly what happened and what to do next. Each finding is one of these types:

This is important: AR checks operate in detect mode. They return findings and feedback, but do not block the response. Your application inspects the findings and decides what to do: serve it, rewrite it, ask for clarification, or fall back to a safe default. This gives you full control over the user experience while getting mathematical verification on every response. For a working example of this pattern, see the open source rewriting chatbot implementation, which demonstrates how to take AR findings and feed them back to the LLM for automated correction.

AgentCore Policy: Deterministic boundaries for AI agents

If your startup is building agentic applications, where your AI orchestrates multi-step workflows, calls tools, and takes actions on behalf of users, you have a different trust problem. It is not just about what the AI says. It is about what the AI does.

For startups in healthcare, fintech, or legal, "the agent did something unexpected" is not acceptable from a compliance or regulatory perspective. You need provable boundaries on agent behavior, and those boundaries need to hold even when the agent is manipulated by prompt injection or makes a reasoning error.

Policy in Amazon Bedrock AgentCore solves this with an authorization layer enforced on every agent-tool interaction at the gateway boundary. Policies are written in Cedar, the authorization language built by AWS to be verified with AR. Cedar policies define who (principal) can do what (action) on which resource, under what conditions. No tool invocation passes through the gateway without a matching permit. You can write policies directly in Cedar or describe your rules in natural language and let the system generate Cedar for you.

Before your policies are deployed, automated reasoning runs semantic validation to catch problems, including:

Overly permissive policies that would allow all requests for a given principal/action/resource combination
Overly restrictive policies that would deny everything
Ineffective policies where a Permit permits nothing or a Forbid forbids nothing

The enforcement follows two principles that are particularly important for startups in regulated industries:

Default-deny: if no policy explicitly permits an action, it is blocked. Your agent cannot do anything you did not authorize.
Forbid always wins over permit: you can set hard-stop rules that cannot be overridden, regardless of what other policies exist.

Consider a healthcare startup building an appointment scheduling agent. With AgentCore Policy, you can enforce that patients only access their own records, that appointment bookings are restricted to business hours, and that refund amounts are capped by policy. These rules hold regardless of how the agent is prompted, any bugs that exist in your code, or how creative the user gets with their requests. The enforcement happens at the gateway boundary, outside the agent entirely.

How much does automated reasoning cost compared to manual QA?

To understand the economics of automated reasoning, it is useful to compare it against the baseline most teams currently use: manual QA review of model outputs.

The old way: Manual QA

A reviewer checking AI outputs can handle roughly 50 interactions per hour at a loaded cost of about $40 USD/hour. For a startup handling 100,000 customer interactions monthly, reviewing every interaction costs $80,000 USD/month. Even sampling 10 percent costs $8,000 USD/month, and you are still unverified on the other 90 percent. At $80K to $150K USD per reviewer annually, a team of 5 to 10 reviewers runs $400K to $1.5M USD per year. For most startups, that is a significant chunk of funding spent on checking whether your AI is correct.

The new way: automated reasoning

AR checks are priced per validation request. (Exact pricing should be confirmed on the Bedrock pricing page as rates may vary.) Unlike manual review, every interaction can be verified rather than sampled. Amazon Bedrock Guardrails uses text units as the validation unit, where each text unit is 1,000 characters. Here is what that looks like for a typical startup:

Assumptions

Calculation

Using the example above of a startup handling 100K interactions per month, and assuming that the output of each interaction is ~200 tokens, we can easily estimate the cost of using AR checks for validation. Amazon Bedrock Guardrails measures usage in Text Units, where one text unit is 1000 characters. Using an average of 4 characters per token, we estimate each interaction to be ~800 characters and therefore 0.8 text units.

Validation of 100,000 interactions of 0.8 text units each gives us a total of 80K text units per month. With AR checks price of $0.17 per 1000 text units, our total monthly cost for a single policy would be $13.60 USD. With multiple policies applied, costs scale linearly. Even with multiple policies, the total cost remains dramatically lower than manual review.

The compliance math is even more compelling. A single HIPAA violation category can cost up to $2,067,813 USD per year under current inflation-adjusted figures. For a healthcare startup whose AI processes thousands of patient interactions daily, the regulatory exposure from unverified outputs is existential. In this context, AR functions less like a software expense and more like a risk management control.

For developers, the closest analogy is Rust's borrow checker. Rust catches memory corruption at compile time, before code reaches production. AR catches policy violations at verification time, before they reach your customer. Catching errors before they cause damage is always cheaper than cleaning up afterward.

There is a trade-off: AR validation adds latency to each response, and rewriting means your LLM runs twice on flagged outputs. For most applications, however, those costs are trivial next to the operational, legal, or reputational impact of an incorrect output.

The same argument resonates with investors. Being able to say that AI outputs are systematically verified against formal business policies is not simply a product feature. It is proof that risk is being managed proactively. That is the kind of concrete, verifiable claim that strengthens a startup’s risk profile.

What comes next?

In Part 3: A Step-by-Step Implementation Playbook, we go from concepts to code. You will learn how to create an AR policy from your existing documents, deploy it to a guardrail, integrate with the ApplyGuardrail API, handle findings programmatically, set up AgentCore Policy through AgentCore Gateway, and build an audit trail. Everything you need to go from reading about mathematical verification to shipping it in production.

Harshvardhan Chunawala

Harshvardhan Chunawala는 AWS의 솔루션스 아키텍트이자 미국에서 활동하는 AWS Academy Authorized Educator입니다. 그는 전 세계 대기업 리더, Startup 창업자, 최고 경영진과 협력하여 산업 전반에서 확장 가능하고 안전한 AWS 클라우드 인프라를 설계합니다. 또한 AWS Golden Jacket 수상자로서, 여러 Amazon 팀과 협력하여 보안, 위성, 신뢰할 수 있는 에이전틱 AI 서비스 등 여러 분야에서 프론티어 클라우드 기능을 구축하고 제공합니다. AWS에서 일하는 것 외에, 그는 10년 이상의 경력을 갖춘 세계적으로 인정받는 기술자이자 클라우드 보안 전문가이기도 합니다. 또한 카네기 멜론 대학교와 제휴하여 클라우드 컴퓨팅 및 신기술에 대한 연구와 멘토링에 참여하고 있습니다. 여가 시간에는 스카이다이빙과 비행기 조종을 즐깁니다.

Mike Miller

Mike Miller는 AWS의 AI Product Management 부문 Director로, 할루시네이션을 방지하는 자동 추론 기능, Amazon Q, Amazon Bedrock 등, 주요 생성형 AI 이니셔티브에 대해 자문을 제공합니다. 내부 버전이 Amazon 직원들 사이에서 입소문을 타자 그는 생성형 AI 앱 구축을 위한 노코드 플레이그라운드인 PartyRock을 일반에 공개했습니다. 이전에 Mike는 AWS 기계 학습 사고 리더십 팀을 이끌면서 AWS DeepLens, AWS DeepRacer, AWS DeepComposer를 런칭하여 재미있고 매력적인 방식으로 전 세계 개발자들에게 실습 기계 학습을 제공한 바 있습니다. Mike는 Amazon에서 13년 이상 근무했으며, AWS에 합류하기 전에는 Lab126에서 Fire TV 제품 관리를 담당하기도 했습니다.

Rahul Kumar

Rahul Kumar 박사는 AWS의 Senior Applied Science Manager로, Rust 및 C 프로그램을 위한 검증 기술을 구축하고 대규모 언어 모델과 자동 추론을 결합하는 신경 기호 AI를 발전시키기 위한 작업을 이끌고 있습니다. AWS에서 Rahul은 Kani 모델 검사기와 ‘Rust 표준 라이브러리의 안전성 확인’ 챌린지를 비롯한 오픈 소스 이니셔티브를 이끕니다. 브리검 영 대학교에서 박사 학위를 취득했으며, 이전에는 Microsoft Research와 NASA JPL에서 형식 검증 및 정적 분석을 담당했으며 Caltech에서 강사로 재직했습니다. 그는 수학적 증명 기법이 어떻게 AI 할루시네이션을 없애고 소프트웨어 정확성을 보장할 수 있는지에 대해 강연하면서 더 많은 대중에게 자동 추론을 제공하는 것을 열렬히 지지하고 있습니다. 그는 워싱턴주 시애틀에 거주하고 있습니다.

Stefano Buliani

Stefano Buliani는 AWS Automated Reasoning Group의 Principal Product Manager로, Amazon Bedrock Guardrails를 통해 생성형 AI에 형식 검증 기능을 도입하기 위한 프로젝트를 이끌고 있습니다. 소프트웨어 엔지니어 출신인 Stefano는 AWS에서 12년 이상 근무하면서 서버리스 팀과 자동 추론 팀에서 전문 솔루션 아키텍트이자 제품 매니저로 활동했습니다. 초창기에는 고객이 AWS Lambda와 Amazon API Gateway를 기반으로 서버리스 애플리케이션을 구축하고 확장하도록 도왔습니다. Stefano는 업무 외 시간에 태평양 북서부의 자연을 탐험합니다. 그는 캐나다 밴쿠버에 거주하고 있습니다.

이 콘텐츠는 어떠셨나요?