Introduction
At Ardoq, a core part of our evolution involves leveraging Artificial Intelligence. However, standard generative AI is not naturally equipped to deliver the accuracy and reliability that Enterprise Architecture (EA) demands at large data scales. To bridge this gap, Ardoq is focusing on a number of techniques, including neuro-symbolic AI, which combines neural networks (generative AI, pattern recognition) with symbolic logic (reasoning, rules, generated code) to build more efficient and trustworthy solutions.
This article explains the fundamental challenges of using standard Large Language Models (LLMs) for enterprise architecture and how Ardoq utilizes techniques such as aggregation tools, reasoning rubrics, and extensive evaluations to provide trustworthy architectural insights.
The Problem: Capability vs. Reliability
The AI industry has seen rapid advancements in raw capability. For instance, the Time Horizon study from METR (Model Evaluation & Threat Research) shows an exponential increase in the length of tasks, measured in human working time, that AI models are able to complete.
These capability increases in frontier models are often underpinned by large context windows of up to 1 million tokens. This allows users to load entire codebases or massive datasets directly into a model.
Despite this rapid increase, overall reliability and consistency are improving at a much slower rate than raw capability. This is a significant issue when augmenting EA techniques, because it is difficult for non-experts to identify errors in seemingly plausible results.
The model reliability benchmark studies at Princeton (leaderboard, paper) are a good example: they show that agents still fail unpredictably in practice.
Similarly, when evaluating models on complex reasoning over long context, such as the Multi-Round Coreference Resolution (MRCR) benchmark, all frontier models experience a significant fall-off in performance as context size grows.
The best current model on this benchmark, Claude Opus 4.6, reaches 91.9% before dropping off as context size grows.
The most recent Claude Opus 4.7, while improving in multiple capabilities, regresses on this one, which is so important for analyzing large EA data sets (see the Claude Opus Model Card).
Importantly, single benchmark scores obscure a problem that matters a great deal for real-world EA analysis: solving meaningful problems typically requires retrieving not one fact accurately, but many.
Consider what happens when errors compound. If a model has a 91.9% chance of correctly retrieving any given piece of information from context, the probability of getting all of the following right looks like this:
| Facts required | Probability that all were retrieved correctly |
| --- | --- |
| 3 | 77.6% |
| 5 | 65.6% |
| 10 | 43.0% |
At 10 facts, the model is more likely to be wrong than correct. A real EA question such as "which applications supporting this critical business process are approaching end-of-life, have low technical fit, and lack a documented replacement plan?" might require correctly reading and correlating a dozen or more data points. At that scale, even a model with a high individual retrieval score is more likely than not to make at least one mistake, and a single incorrect fact can invalidate the entire recommendation.
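The compounding effect in the table above follows directly from multiplying independent retrieval probabilities. A minimal sketch of the calculation (assuming, for illustration, that each retrieval is independent with the same 91.9% accuracy):

```python
# Probability that ALL of n fact retrievals succeed, assuming each retrieval
# is independent and has the same per-fact accuracy p (an illustrative
# simplification, not a claim about how any specific model behaves).
def all_correct_probability(p: float, n: int) -> float:
    return p ** n

for n in (3, 5, 10):
    print(f"{n} facts: {all_correct_probability(0.919, n):.1%}")
```

Running this reproduces the percentages in the table: 77.6%, 65.6%, and 43.0%.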
While AI is getting better at retrieving specific facts from massive datasets, the ability to retrieve isolated facts does not mean the model can successfully synthesize or reason across the full context.
This unreliability of AI reasoning performance is even worse on specific professional reasoning tasks. Professional reasoning benchmarks exist in domains such as Finance, Legal, and Healthcare. Performing analysis in these areas requires specialized domain knowledge, judgment, and contextual nuance, all of which are difficult to generalize from the open data used in LLM training.
The image above shows the leaderboard for professional reasoning in Finance (accessed 14 April 2026), where even the most advanced models score just above 50% on expert-authored tasks. They show inaccurate judgments, a lack of process transparency, and incomplete reasoning. Even when they reach correct conclusions, incomplete or opaque reasoning processes severely reduce their practical reliability and trustworthiness for professional adoption.
Furthermore, generative AI is consistently poor at addressing complex quantitative questions. When an LLM is asked to analyze rich, interconnected EA models using only a simple instruction, the results can be unpredictable. The model may misunderstand parts of the architecture or hallucinate connections, resulting in inaccurate and inconsistent recommendations. For high-stakes EA decisions, these unpredictable failures are unacceptable.
AI solutions for enterprise architecture that are correct only about 80% of the time, and fall off significantly from there, are not good enough.
How Ardoq is Working to Solve This
To overcome the inherent reasoning and reliability limitations of standard LLMs, Ardoq employs a number of techniques, including neuro-symbolic techniques. This methodology combines the flexible, generative capabilities of neural networks (LLMs) with the strict, deterministic logic of symbolic AI (rules, math, and structured tools).
Here are examples of how we implement this approach:
1. Aggregation Tools for Quantitative Analysis
Because generative AI struggles with math and complex aggregations, we do not rely on the LLM to solve quantitative problems natively. Instead, when quantitative analysis is required, the AI agent is instructed to invoke dedicated aggregation tools for mathematical operations. By offloading these calculations to built-in code tools, we ensure that you receive correct, reliable answers to complex portfolio questions every time.
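To make the pattern concrete, here is a minimal sketch of tool offloading, not Ardoq's actual implementation: the model never does the arithmetic itself, it only selects a tool and its parameters, and a deterministic function performs the count. The function and field names are invented for illustration.

```python
# Hypothetical deterministic aggregation tool. The LLM chooses the tool and
# arguments; the counting itself is plain code and therefore always exact.
def count_applications(apps: list[dict], min_business_value: int, max_technical_fit: int) -> int:
    """Counts applications matching a business-value / technical-fit filter."""
    return sum(
        1
        for app in apps
        if app["business_value"] >= min_business_value
        and app["technical_fit"] <= max_technical_fit
    )

# The agent emits a structured tool call such as:
tool_call = {
    "tool": "count_applications",
    "args": {"min_business_value": 4, "max_technical_fit": 3},
}

# Toy portfolio standing in for repository data:
portfolio = [
    {"name": "CRM", "business_value": 5, "technical_fit": 2},
    {"name": "ERP", "business_value": 4, "technical_fit": 4},
    {"name": "HR Portal", "business_value": 3, "technical_fit": 1},
]
print(count_applications(portfolio, **tool_call["args"]))  # → 1
```

Because the filter runs as code, the same question always yields the same answer, regardless of how large the portfolio grows.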
Here’s a very simple example using a sample data repository with roughly 600 applications, which is typical for a mid-sized organization's portfolio. Using report filters in the tool, I can see how many applications have a business value >= 4 and a technical fit <= 3. The correct answer is 30.
To test an AI model directly using a test harness, the same report is extracted from Ardoq’s API and passed to Claude Sonnet 4.6 with a 1M token limit (the model is accessed through its API on an AWS Bedrock deployment). The report itself is approximately 130K tokens, well below the model's upper limit.
Passing the EA repository data to a leading AI model and asking simple quantitative questions results in completely incorrect responses. These simple quantitative analysis errors are repeated in the vast majority of test cases.
In comparison, performing the same request in Ardoq’s chat assistant repeatedly yields the correct answer.
Importantly, we provide the reasoning to help users understand how the result is determined and how to assess its quality.
2. EA Reasoning Rubrics and Ontology extensions
To prevent hallucinations and ensure the AI understands the nuances of your repository, Ardoq utilizes ontological representations of the metamodel combined with specific Enterprise Architecture reasoning rubrics.
The ontological representation provides far more detail about the information in the metamodel, including formal descriptions of component types, reference types, field types, and field values. These are combined with formal rules detailing how those pieces of information relate to each other. For instance, a rule can be as simple as defining what the values 1..5 mean on a maturity scale, or how application service levels and business criticality correlate. Other examples are predicates that help the AI reason about dates and time.
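As a hypothetical illustration of what such rubric-style rules can look like (the scale labels, field names, and thresholds below are invented, not Ardoq's actual ontology), here is a formal meaning for a 1..5 maturity scale and a consistency rule correlating service level with business criticality:

```python
# Invented example of rubric-style rules layered on a metamodel.
# Giving the 1..5 scale explicit semantics removes ambiguity for the model.
MATURITY_SCALE = {
    1: "initial",
    2: "repeatable",
    3: "defined",
    4: "managed",
    5: "optimizing",
}

def service_level_consistent(component: dict) -> bool:
    """Rule: mission-critical applications should have a 'gold' service level."""
    if component.get("business_criticality") == "mission_critical":
        return component.get("service_level") == "gold"
    return True

app = {"business_criticality": "mission_critical", "service_level": "silver"}
print(service_level_consistent(app))  # → False: flag for review
```

Rules like these give the model a checkable definition to reason against, instead of leaving it to guess what a field value implies.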
These rubrics give the AI detailed, highly relevant, and accurate context to consider alongside your data. By providing the agent with these structured rules and policies, the AI produces much higher-quality insights and recommendations than what could be achieved with regular AI model training alone.
Some of these rubrics are applied at runtime when utilizing AI models.
Other rubrics are applied in the background to analyze the metamodel and generate rule-based or Gremlin-based data quality reports. Using the AI model to generate these reports, rather than to run them, means the rules execute faster, cheaper, and more reliably.
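A hypothetical sketch of this generate-once, run-deterministically pattern (the rule, function, and data below are invented): the model is used to author a data quality rule, which is then executed as ordinary code with no model in the loop.

```python
# Illustrative example of a rule an AI model might generate once. After
# generation, it runs as plain deterministic code: fast, cheap, repeatable.
def missing_owner_report(components: list[dict]) -> list[str]:
    """Data quality check: flags applications that lack a documented owner."""
    return [c["name"] for c in components if not c.get("owner")]

components = [
    {"name": "Billing", "owner": "Finance IT"},
    {"name": "Legacy CMS", "owner": None},
]
print(missing_owner_report(components))  # → ['Legacy CMS']
```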
These rubrics help AI to not just find syntactic problems but also non-obvious insights that are often referred to as EA smells. You can read more about this in our Foundation Insights Agents article.
3. Rigorous AI Evaluations
Building dependable AI agents requires targeted attention beyond simply scaling capabilities. To ensure our agents perform reliably and meet the strict thresholds required for enterprise deployment, our evaluation framework begins with manual evaluations. In this phase, domain experts interact directly with the system to evaluate the AI features. We record these test cases to establish a robust baseline for future regression testing, allowing us to systematically test for consistency, predictability, and robustness.
To evaluate these AI features at scale, we utilize automated evaluations driven by an LLM-as-a-judge system paired with human-in-the-loop verification. Our domain experts curate specific metrics based on the use case, providing the judge LLM with strict evaluation criteria, execution steps, and scoring rubrics. Because single-run accuracy can provide a misleadingly narrow view of an AI agent's capability, each automated evaluation run is rigorously examined by our human experts. They leverage the test cases originally recorded during manual evaluation to verify both the final evaluation score and the judge LLM's underlying reasoning.
Finally, because generative AI inherently struggles with math and complex aggregations, we conduct specialized quantitative evaluations that do not rely on the LLM to solve quantitative problems natively. Instead, we make use of deterministic code execution. By curating targeted datasets and evaluating deterministic queries against known expected outputs, we ensure that the AI consistently produces mathematically accurate and highly reliable results for complex quantitative tasks.
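The shape of such a quantitative evaluation case can be sketched as follows, with an invented dataset and field names standing in for a curated evaluation set: a deterministic query runs against known data and is compared to a known expected answer.

```python
# Minimal sketch of a quantitative evaluation case. The dataset and query
# are illustrative stand-ins for a curated evaluation set.
dataset = [
    {"name": "App A", "business_value": 5, "technical_fit": 2},
    {"name": "App B", "business_value": 4, "technical_fit": 3},
    {"name": "App C", "business_value": 2, "technical_fit": 5},
]

def run_query(data: list[dict]) -> int:
    """Deterministic query under test: high value, low technical fit."""
    return sum(1 for d in data if d["business_value"] >= 4 and d["technical_fit"] <= 3)

expected = 2  # known correct answer, curated alongside the dataset
actual = run_query(dataset)
assert actual == expected, f"evaluation failed: expected {expected}, got {actual}"
print("quantitative evaluation passed")
```

Because both the data and the expected output are fixed, any regression in the pipeline surfaces as a hard failure rather than a subtly wrong number.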
4. Programmatic Tool Execution (Code generation)
Standard LLM tool calling works by allowing the model to call a tool, wait for the result, incorporate it into context, and decide on the next step. For Enterprise Architecture analysis, this creates a compounding problem. EA repositories contain large, interconnected datasets, and tool outputs (component lists, dependency graphs, and field-level reports) can be arbitrarily large and unpredictable in size. Each result added to the context increases the cognitive load on the model and degrades the quality of its reasoning. At scale, the context fills up before the analysis is complete. Even with a very large context window, quality drops significantly at just 10-20% of its capacity. As pointed out earlier in the article, small errors compound quickly, so it is crucial to keep the context as information-dense as possible.
Ardoq addresses this through programmatic code execution: a pattern where the AI agent does not invoke tools one at a time and accumulate their outputs in context. Instead, the agent writes a short executable script that calls the necessary tools as functions, combines their outputs, and returns a processed result. Only the final, synthesized output enters the model's context, not the raw data from each individual tool call.
The LLM contributes the generative reasoning, deciding what to compute and how to structure the logic, while deterministic code execution guarantees the correctness of the computation itself. Filtering, aggregation, combination, and transformation are performed by the generated code, not by the model.
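The pattern described above can be sketched as follows. The tool functions and their data are invented stand-ins; the point is the shape: an agent-generated script calls tools as ordinary functions, does the joining and filtering in code, and only the small final result re-enters the model's context.

```python
# Invented stand-in tool functions; in a real system these would fetch
# repository data. Their raw outputs never enter the model's context.
def list_applications() -> list[dict]:
    return [
        {"id": "a1", "name": "CRM", "end_of_life": True, "technical_fit": 2},
        {"id": "a2", "name": "ERP", "end_of_life": False, "technical_fit": 4},
    ]

def replacement_plans() -> set[str]:
    """Application ids that have a documented replacement plan."""
    return {"a2"}

# Agent-generated analysis script: correlating the tool outputs happens
# here, deterministically, instead of across many in-context tool calls.
def at_risk_applications() -> list[str]:
    plans = replacement_plans()
    return [
        app["name"]
        for app in list_applications()
        if app["end_of_life"] and app["technical_fit"] <= 3 and app["id"] not in plans
    ]

# Only this small, synthesized result is returned to the model:
print(at_risk_applications())  # → ['CRM']
```

However many applications the tools return, the model only ever sees the short final list, keeping its context information-dense.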
Summary
AI has the potential to significantly augment how EAs work. Making the most of that potential requires a strong focus on the strengths of AI and techniques for mitigating its weaknesses. Through the combination of ontological representations, aggregation tools, EA rubrics, and robust evaluation frameworks, Ardoq’s neuro-symbolic AI safely identifies data gaps, anomalies, and architectural smells that foundation AI models alone, and even some human experts, might miss.
Note: While aggregation tools, rubrics, and rigorous evaluations form part of our AI strategy, Ardoq utilizes a variety of advanced, proprietary methodologies behind the scenes. To maintain the security and competitive advantage of our platform, we do not make all of our AI techniques publicly available.

