
“Testing shows the presence, not the absence of bugs.”
—Dijkstra
Since the early days of symbolicai, we baked in a return type-casting mechanism (->), a feature I was excited about from the start. I had an intuition that it could be a powerful tool for ensuring correctness, but I never explored its full implications. That changed recently when I started thinking more deeply about its potential, which led to three weeks of research and experimentation, culminating in a working sketch that I’ll share here.
This post is about using contracts to enforce the semantic correctness of a generative process. I’ll walk through how contracts work in symbolicai, the benefits they offer, and the challenges ahead. Most intriguingly, I believe contracts could be a promising approach to mitigating hallucinations. There’s still much to refine, but the potential here is too compelling to ignore.
In software development, correctness is often treated as an afterthought, something verified through testing rather than built into the design itself. But what if we could enforce correctness by design?
This is where Design by Contract (DbC) comes in. Originally introduced by Bertrand Meyer for the Eiffel programming language, DbC proposes that software components should operate under formal agreements: what must be true before execution (preconditions), what they promise after execution (postconditions), and what remains invariant throughout.
Static type-checking guarantees that data is structurally valid, but it doesn’t verify whether outputs make sense in a given context. A function might return a string where a string is expected, but that string could still be gibberish or factually incorrect.
Correctness in traditional programming can be expressed using Hoare triples: {Pre} C {Post}. This means that if a computation C starts in a state satisfying Pre, it will produce a state satisfying Post. However, LLMs introduce uncertainty. Instead of absolute guarantees, we must consider probabilistic correctness. That’s why we need to rely on the probabilistic version of Hoare triples, i.e. {Pre} C {Post} [p], where p represents the probability that C produces a valid output.
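To make the probabilistic reading concrete, here is a minimal sketch (my own illustration, not part of symbolicai) of how one could estimate p empirically: run the computation repeatedly on inputs that satisfy the precondition and count how often the postcondition holds. The functions run_computation, pre, and post are hypothetical placeholders.

def estimate_p(inputs, run_computation, pre, post, n_samples=20):
    """Empirically estimate p in {Pre} C {Post} [p] by repeated sampling."""
    successes, total = 0, 0
    for x in inputs:
        if not pre(x):
            continue  # only states satisfying the precondition count
        for _ in range(n_samples):
            total += 1
            try:
                if post(run_computation(x)):
                    successes += 1
            except Exception:
                pass  # a raised validation error counts as a failed run
    return successes / total if total else 0.0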
LLMs can be understood as semantic parsers, which is the key insight at the core of our SymbolicAI paper. Unlike traditional parsers that process structured languages, LLMs have the ability to break down unstructured human language into meaningful components and transform them into structured forms. This makes them useful tools for bridging symbolic reasoning and generative AI, allowing for richer interactions between probabilistic and logic-based systems.
This perspective naturally leads to an interesting connection: types as semantic structures. Besides syntactic constraints, types inherently carry meaning. A class named Person, for example, immediately implies a specific concept—a human individual—without requiring additional explanation. This aligns with the Curry-Howard isomorphism, which reveals deep correspondences between logic and computation: types correspond to propositions, and programs correspond to proofs. In other words, writing a well-typed program is equivalent to constructing a logical proof.
This connection between types and semantics suggests that types themselves can be used to enforce meaning in a structured way. Type-driven programming encodes semantic intent directly into the program’s structure, ensuring more than just structural correctness. This is particularly relevant when designing DSLs, since DSLs too impose semantic constraints by design.
This is where Pydantic becomes useful. It provides a way to define data models that enforce both structural correctness (types) and domain-specific meaning (validations, constraints, descriptions).
The emphasis on design is important because it shifts our mindset from reactive validation to proactive structuring. When defining a contract, we are architecting the flow of information, ensuring that every input and output aligns with our intended semantics. This process forces us to think carefully about what goes in, what comes out, and what remains invariant.
I believe that this constraint-driven approach makes contracts an interesting paradigm for prompt engineering. Instead of relying purely on heuristics or trial and error, we can systematically design rules, types, and conditions that guide the generative process.
Contracts, in this sense, act as an interface between intent and execution. They define not only what a system should do but also what it should not do. This aligns with the broader idea of meta-design, which is more or less the notion that structuring constraints effectively leads to emergent behaviors that reflect the designer’s intent. A poorly designed contract might overly restrict a system, making it brittle. A well-designed one, however, introduces guardrails without stifling flexibility. I like to think that I’m putting you back in charge, so don’t mess up, we’re both team human.
This approach also has implications beyond correctness. Contracts enable composability by allowing different components to interoperate with clear expectations. This is especially relevant in multi-agent systems, where LLMs act as orchestrators that generate, verify, and refine outputs dynamically. By enforcing contracts at the interface level, we can ensure that agents exchange valid, meaningful, and contextually appropriate information.
At this point, the motivation is in place. Next, we’ll walk through a concrete implementation: I’ll introduce the @contract decorator and demonstrate how it integrates into symbolicai.
Unlike classical DbC, which enforces invariants, this approach introduces a fallback or continuation mechanism instead. The @contract decorator is a class decorator, meaning it augments an entire Expression rather than a single function.
In symbolicai, much like in PyTorch, every Expression implements a forward method. The @contract decorator modifies forward, but crucially, it never prevents execution. Instead, if a contract fails, the system has the opportunity to act, e.g. by returning a default expected type (as indicated by the -> return type annotation). If the contract succeeds, it can simply return the validated result or perform additional actions, such as logging, computation, or transformations before passing the output forward. With this in mind, let’s break down how @contract is implemented in symbolicai.
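To build intuition for the mechanics, here is a deliberately simplified sketch of the idea rather than the actual symbolicai implementation: a class decorator that wraps forward, runs validation first, records the outcome on the instance, and then always hands control back to the user’s forward. The helper _validate_and_generate stands in for the real type and semantic validation machinery.

def _validate_and_generate(instance, input, **kwargs):
    # Placeholder for the real machinery: type checks, pre/post validation,
    # and LLM-backed construction of an object of the declared return type.
    raise NotImplementedError

def contract_sketch(cls):
    original_forward = cls.forward

    def wrapped_forward(self, *, input, **kwargs):
        try:
            self.contract_result = _validate_and_generate(self, input, **kwargs)
            self.contract_successful = True
        except Exception:
            self.contract_result = None
            self.contract_successful = False
        # Execution is never prevented; the user's forward decides what to do next.
        return original_forward(self, input=input, **kwargs)

    cls.forward = wrapped_forward
    return cls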
The full code of what I’m about to show is available as a gist at the following link. I won’t reproduce it entirely here, but will conceptually outline important aspects of the implementation.
Contracts in symbolicai enforce a rigorous type system using Pydantic data models. These contracts only work with LLMDataModel objects for both inputs and outputs. The LLMDataModel is our custom extension of Pydantic’s BaseModel, enhanced with specialized formatting methods that allow us to effectively instruct LLMs. It’s nothing more than that for now, but it might get expanded in the future.
class TripletInput(LLMDataModel):
    """Input for triplet extraction"""
    text: str = Field(
        description="Text to extract triplets from"
    )
    ontology: OntologySchema = Field(
        description="Ontology schema to use for extraction"
    )

class TripletOutput(LLMDataModel):
    """Collection of extracted triplets forming a knowledge graph"""
    triplets: list[Triplet] | None = Field(
        default=None,
        description="List of extracted triplets"
    )
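The supporting models Triplet and OntologySchema are defined in the gist; the sketch below shows plausible definitions for illustration only (confidence is implied by the post-condition shown later, the other field names are assumptions).

class Triplet(LLMDataModel):
    """A single subject-predicate-object relationship with a confidence score"""
    subject: str = Field(description="Entity acting as the subject")
    predicate: str = Field(description="Relationship type defined by the ontology")
    object: str = Field(description="Entity acting as the object")
    confidence: float = Field(description="Extraction confidence between 0 and 1")

class OntologySchema(LLMDataModel):
    """Allowed entity and relationship types for extraction"""
    entity_types: list[str] = Field(description="Entity types the extractor may use")
    relationship_types: list[str] = Field(description="Relationship types the extractor may use")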
When defining your data models, we strongly recommend using Pydantic’s Field objects to represent model attributes. Fields are particularly relevant in this context because the description parameter serves a dual purpose: beyond documenting the field, these descriptions become part of the prompt sent to the LLM. This creates an elegant way to provide contextual instructions directly within your data structures, guiding the model toward correct outputs without redundant code.
When applied, the @contract decorator modifies the class, and particularly its forward method, adding several crucial attributes: contract_successful indicates whether the contract succeeded, contract_result holds the result (or None if the contract failed), and a new method, contract_perf_stats, provides detailed timing metrics for each contract validation phase. The forward method requires a specific signature with an input parameter (named exactly like that) and must declare a return type annotation in the form -> SomeLLMDataModel. This return type serves as a contract guarantee that your method must always return an object of this type. Additionally, it doesn’t support positional args, only keyword arguments. All other keyword arguments besides input get sent to the backend, where they reach the current neuro-symbolic engine (say gpt-4o). How we use kwargs in symbolicai is a discussion of its own, but that’s all you need to know for now, and you won’t need to understand this particular tidbit to use contracts effectively.
def forward(self, input: TripletInput, **kwargs) -> TripletOutput:
    # The contract modified the class, so self now has this new attribute.
    if self.contract_result is None:
        return TripletOutput(triplets=None)
    # … do more work here …
    return self.contract_result
One critical aspect of the contract implementation is the execution order: the contract validation runs first, but the forward method always executes afterward, regardless of validation results. This design puts responsibility on the developer to check if contract_successful is True before proceeding with complex operations, or to construct an appropriate default return value that matches the declared type. While you could raise an exception when validation fails, the recommended pattern is to return a valid but empty or default object.
Keep that in mind, since it’s very important. There’s nothing more wasteful than waiting for a computation to complete only for it to fail in a preventable way. In case you’re wondering, I initially set the default to return the contract result, but defaults undermine the very purpose of what I’m building: a system that prioritizes design awareness. Attention is one of the most vital skills. Think of this as a mindfulness exercise.
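Equivalently, the guard can be written against the contract_successful flag rather than testing contract_result for None; both variants follow the recommended pattern of returning a valid default of the declared type.

def forward(self, input: TripletInput, **kwargs) -> TripletOutput:
    if not self.contract_successful:
        # Valid but empty default that still matches the declared return type.
        return TripletOutput(triplets=None)
    # Only do the expensive post-processing when the contract actually held.
    return self.contract_result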
def pre(self, input: TripletInput) -> bool:
    # No semantic validation for now
    return True

def post(self, output: TripletOutput) -> bool:
    if output.triplets is None:
        # We can consider skipping since the LLM didn't
        # find any triplet for the given input.
        return True
    for triplet in output.triplets:
        if triplet.confidence < self.threshold:
            raise ValueError(
                "Confidence score "
                f"{triplet.confidence} "
                "has to be above threshold "
                f"{self.threshold}! "
                "Extract relationships between entities "
                "that are meaningful and relevant!"
            )
    return True
@property
def prompt(self) -> str:
    return (
        "You are an expert "
        "at extracting semantic relationships from text "
        "according to ontology schemas. "
        "For the given text and ontology:\n"
        "1. Identify entities matching the allowed entity types\n"
        "2. Extract relationships between entities matching the "
        "defined relationship types\n"
        "3. Assign confidence scores based on certainty of extraction\n"
        "4. Ensure all entity and relationship types conform to the ontology\n"
        "5. Do not duplicate triplets\n"
        "6. If triplets can't be found, default to None"
    )
To implement a complete contract, you must define two key methods and one property. The pre and post methods serve as semantic validation containers for input and output respectively, and they can be as complex as you wish. These methods must return True when validation succeeds and raise an error with a descriptive message when validation fails. Raising an error is a deliberate design choice, as these error messages, similar to the Field’s description argument, serve as instructions that guide the LLM toward self-correction. While you might not always need to semantically validate an input you’ve designed yourself, this capability becomes relevant when chaining contracts together. In these scenarios, the output of one contract becomes the input to another, and although it passed validation in its original context, it might fail to meet the standards of the downstream contract. The required prompt property must be a string that describes the task you want to accomplish, and you have complete freedom in how you construct it. Two additional optional properties can enhance your contracts: template provides a mechanism for text infilling (e.g., “text… {fill this thing} … text”), while payload can contain any supplementary data relevant to the current computation.
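Putting the pieces together, a contract-decorated Expression could be laid out roughly as below; the class name, threshold attribute, template placeholder syntax, and import paths are illustrative assumptions, and the decorator parameters are discussed next.

# Expression and the contract decorator come from symbolicai; exact import paths are in the gist.
@contract(pre_remedy=False, post_remedy=True)
class TripletExtractor(Expression):
    threshold = 0.5  # hypothetical confidence cutoff used by post()

    @property
    def prompt(self) -> str:
        # Required: describes the task for the LLM.
        return "Extract semantic triplets from the text according to the ontology."

    @property
    def template(self) -> str:
        # Optional: text infilling; placeholder syntax assumed here.
        return "Extracted triplets:\n{{fill}}"

    @property
    def payload(self):
        # Optional: supplementary data relevant to the current computation.
        return None

    def pre(self, input: TripletInput) -> bool: ...

    def post(self, output: TripletOutput) -> bool: ...

    def forward(self, input: TripletInput, **kwargs) -> TripletOutput:
        if not self.contract_successful:
            return TripletOutput(triplets=None)
        return self.contract_result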
@contract(
    pre_remedy=False,
    post_remedy=True,
    verbose=True,
    remedy_retry_params=dict(
        tries=1,
        delay=0.5,
        max_delay=15,
        jitter=0.1,
        backoff=2,
        graceful=False
    )
)
The @contract decorator accepts several parameters that control its behavior. The remedy_retry_params dictionary configures the underlying retry mechanisms that attempt to fix validation failures, one for type validation and the other for semantic validation. I won’t go into detail about the internals of these two elements, but I will say this: under the hood, they use a component called Function, which was initially inspired by Wolfram’s LLMFunction. We liked that high-level approach, so we added it to our framework. They include parameters like tries (maximum retry attempts) and others to fine-tune the remedy process. Two boolean parameters, pre_remedy and post_remedy, control whether auto-correction should be applied to inputs and outputs respectively. When enabled, the system will retry failed validations up to the specified number of times. The verbose parameter toggles detailed logging of the contract’s internal operations.
The contract execution follows a well-defined sequence of operations. First, it validates that the input is indeed an instance of LLMDataModel, immediately throwing an error if this fundamental type check fails. If pre_remedy is enabled, it then semantically validates the input according to your pre method, retrying with guidance from error messages when validation fails. Next, the system attempts to create a valid output object that conforms to your specified return type. If post_remedy is enabled, it semantically validates this output using your post method, again with retry capabilities.
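As a rough pseudocode summary of that sequence (not the actual symbolicai internals; retry_until and generate_typed_output are placeholders):

def run_contract(self, input, pre_remedy, post_remedy):
    # 1. Hard type check: inputs must be LLMDataModel instances.
    if not isinstance(input, LLMDataModel):
        raise TypeError("contract inputs must be LLMDataModel instances")
    # 2. Optional semantic validation of the input, retried with error-message guidance.
    if pre_remedy:
        input = retry_until(self.pre, input)
    # 3. Build an output object conforming to the declared return type.
    output = generate_typed_output(self, input)
    # 4. Optional semantic validation of the output, again with retries.
    if post_remedy:
        output = retry_until(self.post, output)
    return output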
To demonstrate contracts in action, let’s explore a practical example: extracting structured knowledge from legal documents. In this task, we design an ontology that guides the LLM to extract semantic triplets from a Terms of Service document (in this case, X’s ToS). These triplets then form a knowledge graph representing the document’s key concepts and relationships. We process the document in chunks to increase the likelihood of capturing all relevant entities and relationships. This chunking strategy is important for comprehensive extraction, as attempting to process the entire document at once might cause the model to miss important details or exceed context limitations. We’ll compare results from two different models: gpt-4o-mini and deepseek-r1-distill-qwen-32b-q4_k_m (unsloth), demonstrating how different models interpret and extract information under the same contractual constraints.
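A driver loop for this example could look roughly like the sketch below; TripletExtractor, chunk_text, and load_ontology are assumed helpers (the exact chunking and ontology setup live in the gist), and the decorated Expression is invoked like any other symbolicai Expression.

# Hypothetical driver: extract triplets chunk by chunk and merge the results.
extractor = TripletExtractor()
ontology = load_ontology("tos_ontology.json")  # assumed helper

all_triplets = []
for chunk in chunk_text(tos_document, max_chars=4000):  # assumed helper
    result = extractor(input=TripletInput(text=chunk, ontology=ontology))
    if extractor.contract_successful and result.triplets:
        all_triplets.extend(result.triplets)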
And the results (clickable) for gpt-4o-mini and deepseek-r1-distill-qwen-32b-q4_k_m.
The scope of contracts extends beyond basic validation. One key observation is that a contract is considered fulfilled if both the LLM’s input and output are successfully validated against their specifications. This leads to a deep implication: if two different agents satisfy the same contract, they are functionally equivalent, at least with respect to that specific contract.
This concept of functional equivalence through contracts opens up promising opportunities. In principle, you could replace one LLM with another, or even substitute an LLM with a rule-based system, and as long as both satisfy the same contract, your application should continue functioning correctly. This creates a level of abstraction that shields higher-level components from the implementation details of underlying models.
Looking forward, there are several directions I’m excited to explore:
- How can we formalize the probabilistic aspects of contracts, potentially calculating the probability of success for a given contract (maybe across different models too)?
- Can contracts themselves hallucinate?
- What should be a contract’s lifecycle?
- In security, can contracts prevent prompt injection/contextual breaches?
- How can we develop a standardized format for serializing and deserializing contracts, enabling easier sharing and version control (think .GGUF)?
- How can we create a hub where developers can share and discover contracts for common tasks, a collaborative ecosystem around semantic correctness?
While this approach is still in its early stages, I believe it represents a step toward more trustworthy systems. By shifting our focus from post-hoc evaluation to design-time specification, we create AI components that are more predictable, composable, and aligned with human intent. If you’re interested in exploring these ideas further or collaborating on contract-based approaches to AI safety and reliability, please don’t hesitate to reach out. And of course, if you found this approach valuable, I’d appreciate you sharing it with others working at the intersection of symbolic reasoning and machine learning.