Introduction
There has been a tremendous response to the Ardoq MCP (Model Context Protocol). The most common feedback has been:
"AI Agents using Ardoq MCP have allowed our colleagues to get the value of architecture information without having to know about architecture details or how to use architecture tools".
Unfortunately, you cannot simply connect an AI chatbot to a large architecture repository, ask it "Where can we save money in our IT Portfolio?", and expect a useful response. To get the most out of Ardoq MCP you need to understand Context Engineering - how to carefully design the information that is fed to the Large Language Model (LLM) to maximize the chance of a useful response. This includes considering how to structure the architecture repository assets and understanding how it will be accessed by Agents to inspect, navigate, and return information to users.
Context Engineering is a set of techniques for designing information, access tools, prompts, and memory management so that Agents can successfully use your architecture information to help users. Knowledge of these techniques will be a necessary skill for architects going forward.
This document therefore discusses Context Engineering, the known challenges with it, and the techniques we can use to overcome them to let you get the most out of Ardoq MCP.
Quick Summary
Understand the strengths and weaknesses of LLMs. Treat your AI like a precocious teenage apprentice, not an architecture expert. Use it to fetch and summarize information so that you can make the decisions.
Evaluate, evaluate, evaluate. Do not blindly trust the answers you get from AI. Test it until you know which prompts produce useful answers.
Limit Ardoq access to specific, high-quality sets of information. And design your reports, dashboards, and viewpoints to answer specific questions that people will ask via their AI client.
Understand the size of your AI client’s context window, which information is filling it up, and when it should be reset.
What is the Context, and why is it important?
When interacting with an LLM, all the information required to generate a response must be provided as part of each interaction; for technical readers, this is stateless communication. This information is the context, and it is much more than just the user prompt. See [1] and [13] for good introductions to the topic.
The following image illustrates the information that constitutes the context that is passed to an LLM from a client:
Context engineering includes instructions, user profiles, history, tools, retrieved documents, and more (from [1]). You can also add “Tool Descriptions” to this list of items.
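To make this concrete, the sketch below shows the kind of context an AI client assembles for a single LLM call, using an OpenAI-style message list. The instructions, history, retrieved text, and tool description are hypothetical placeholders, not the actual content Ardoq or any particular client sends.

```python
# Hypothetical sketch of the context assembled for one LLM call.
system_prompt = "You answer questions using our architecture repository."

conversation_history = [
    {"role": "user", "content": "Which department owns the CRM application?"},
    {"role": "assistant", "content": "CRM is owned by Sales Operations."},
]

retrieved_documents = "Report 'Application Owners': CRM -> Sales Operations; ERP -> Finance."

tool_descriptions = [
    {
        "name": "list_reports",  # the name the model uses to request the tool
        "description": "List available reports with their names, descriptions, and IDs.",
        "parameters": {"type": "object", "properties": {}},
    },
]

user_prompt = "Who should I talk to about the ERP application?"

# Because the LLM is stateless, everything below is sent on every single call.
messages = [
    {"role": "system", "content": system_prompt},
    *conversation_history,
    {"role": "system", "content": f"Retrieved information:\n{retrieved_documents}"},
    {"role": "user", "content": user_prompt},
]
request = {"model": "<your model>", "messages": messages, "tools": tool_descriptions}
```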
When working with a consumer AI chatbot like ChatGPT or Claude Desktop, users rarely see this context beyond their own prompt, or how it is constructed. But it is happening. If you are building your own Agents or working with an Agent configuration platform such as Google Agentspace or Microsoft Copilot Studio, you will need to understand the implications.
There is a lot of information in the context passed to LLMs, and research shows that the quality, reliability, and performance of LLMs degrade as the context becomes larger and more diverse. See [2] for a great deep dive into the topic.
This performance variation is highlighted in two recent (August 2025) benchmarks of leading LLMs on MCP tool usage [6], [12], which showed performance limitations across a range of tasks. Therefore, whether you are connecting a consumer chatbot to Ardoq MCP or building your own Agent, you will need to understand Context Engineering to get the best possible results.
Challenges in Context Engineering
This section presents a summary of the challenges identified in references [1], [3-5]. Review these resources for a deeper understanding of the issues.
Context Poisoning is when a hallucination or other error makes it into the context, which is then repeatedly referenced in the chat history and negatively impacts LLM outputs.
Context Distraction (Focus) occurs when a context grows so long that the model over-focuses on the context, neglecting what it learned during training.
Context Distraction (Irrelevant or Noisy Context) occurs when the model receives too much irrelevant information and becomes confused.
Latency and Resource Costs can arise when long, complex contexts increase compute time and memory usage.
Context Clash (Tool and Knowledge Integration) is when tool outputs or external data, fetched as part of Retrieval Augmented Generation (RAG), conflict with existing information in the context.
Maintaining Coherence Over Multiple Turns can be an issue, as models may hallucinate or lose track of facts if memory management removes parts of the history to maintain the context below a size threshold.
All these issues can impact your ability to get useful results when using Ardoq MCP. Understanding these issues and leveraging techniques for overcoming them can ensure you produce practical tools for your end users.
Context Engineering tips to improve your success with Ardoq MCP
Let's now discuss Context Engineering tips and techniques for overcoming these challenges.
But firstly, it is important to state that as an architect providing useful agentic access to an architecture repository, you need to switch your mindset from the older Static Prompting:
To the newer Dynamic Context Assembly [2], [5]:
Doing so will greatly help you visualize and apply Context Engineering.
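The shift can be sketched in code. The helper functions and data below are hypothetical stubs; the point is that in Dynamic Context Assembly the context is rebuilt for every request from the pieces that are actually relevant, rather than written once by hand.

```python
# Static Prompting: one fixed prompt, written once and sent as-is every time.
STATIC_PROMPT = (
    "You are an enterprise architecture expert.\n"
    "Here is our full application list: <thousands of rows>\n"
    "Question: Where can we save money in our IT portfolio?"
)

# Dynamic Context Assembly: the context is rebuilt per request. The helpers are
# hypothetical stubs for what an agent framework or Ardoq MCP tools would provide.
def find_relevant_assets(question: str) -> list[str]:
    return ["Application owners and experts"]  # e.g. matched via asset names/descriptions

def fetch_assets(asset_names: list[str]) -> str:
    return "Viewpoint 'Application owners and experts': CRM -> owned by Sales Operations"

def trim_history(history: list[dict], max_turns: int = 10) -> list[dict]:
    return history[-max_turns:]  # crude memory management

def assemble_context(question: str, history: list[dict]) -> list[dict]:
    retrieved = fetch_assets(find_relevant_assets(question))
    return [
        {"role": "system", "content": "Answer using the retrieved architecture information."},
        *trim_history(history),
        {"role": "system", "content": f"Retrieved information:\n{retrieved}"},
        {"role": "user", "content": question},
    ]
```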
This section starts with two general tips and then provides specific items to mitigate the aforementioned context challenges. It assumes you are using an existing Chatbot or Agent framework that provides various mechanisms for configuring context engineering. You may also be building a custom agent, which will require you to construct these mechanisms yourself.
This article discusses these mechanisms, and you can follow the links in the resources if you need more information to start building your own. We will use the term ‘Agent’ to collectively refer to these GenAI-enabled tools that may connect to Ardoq MCP. The definition of ‘Agent’ varies throughout the industry, and for this article, we do not mean to imply any specific functionality.
Understand the Strengths and Weaknesses of LLMs
Understanding the strengths, weaknesses, and limitations of GenAI is necessary to both provide useful tools to your organization and to constrain expectations for people using those tools.
Managing expectations is an important consideration. Gartner has observed that GenAI is slipping into the trough of disillusionment of the AI hype cycle [7] as people realize it is not so easy to simply plug in AI to automate complex tasks. A recent MIT study shows that a significant number of companies are struggling to see a return on their AI pilot investments [9].
We do not want to get into a philosophical treatise about the nature of intelligence and whether LLMs can reason like an architect, but there is clearly a mismatch between preconceived notions of what GenAI can provide and what is sensible to expect in practice. Our research shows that it is possible to get good value when combining AI with an architecture repository, but it does require you to stay within the confines of GenAI’s strengths and mitigate its limitations.
Foundational LLMs are quite good at:
Probabilistic text extrusion based on their training data
Semantic inference and matching
Summarization
Categorization
Using external tools. For example, to extract information from external sources, to perform deterministic functions such as math, or to generate user interfaces to display data effectively
Following prescriptive reasoning chains that consist of small, highly specific tasks using the above skillset
Even with these skills, you should not assume that you can simply provide a list of details about your Business and IT Architecture and receive realistic assessments of where your operations should be optimized. Nevertheless, these strengths can be used productively by architects.
We have previously highlighted that LLMs can be used to improve how architects work and improve how architecture deliverables are utilized in the organization. These include:
Summarization of deliverables for external stakeholders
Translation of informal, everyday language used in the organization to the more logically consistent language used in metamodels (see the sketch after this list)
Identification of architecture criteria and techniques in the LLM's foundational training material that architects may not be aware of.
Codifying knowledge, experience, patterns, and principles in the organization to make them available to emerging architecture practitioners.
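As an example of the translation point above, the sketch below asks an LLM to map everyday phrasing onto a metamodel component type. It uses the openai Python package; the model choice, term list, and prompt wording are illustrative assumptions, not an Ardoq feature.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any chat-capable model works

# Hypothetical metamodel component types to map informal language onto.
METAMODEL_TERMS = ["Application", "Business Capability", "Business Process", "Server", "Department"]

def map_to_metamodel(informal_phrase: str) -> str:
    """Ask the LLM to translate an everyday phrase into one metamodel component type."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": f"Map the user's phrase to exactly one of these component types: "
                        f"{METAMODEL_TERMS}. Reply with the type only."},
            {"role": "user", "content": informal_phrase},
        ],
    )
    return response.choices[0].message.content.strip()

print(map_to_metamodel("the tool the sales folks use to track customers"))  # likely "Application"
```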
How well an LLM can perform a specific architecture technique with information from your architecture repository depends heavily on the quality of its training data, and we know the quality and consistency of EA knowledge in that training data is highly variable. Ask yourself: would you post "where should I rationalize my application portfolio?" to a group on LinkedIn or Reddit and expect a consistent, useful response? Probably not. But you might get a useful answer to "which criteria should I consider when performing rationalization, and where can I follow up on those recommendations?".
We have been doing evaluations using a collection of different foundational LLMs, and their responses exhibit a wide spread of quality. Recent research also shows inconsistencies in performance when using the same LLMs hosted by different providers [8]. This field is still immature and rapidly changing, so it is not surprising that AI solutions are failing to meet inflated expectations.
Despite these issues, we are finding ways of constraining LLM use so that they can consistently provide value. And you will need to do the same for your users.
Finally, while you cannot expect an LLM to mimic architecture techniques, there is emerging research on how to ground and fine-tune LLMs with domain-specific knowledge and reasoning approaches. We are working on ways to use that for adding architecture reasoning in our AI functionality. But for now, you need to consider carefully what you are asking AI Agents and whether you should trust the results based on their current strengths and weaknesses.
Evals - test your Agent to ensure it is returning acceptable results
One of the challenges of working with GenAI is that when you ask it a question, you will receive a definitive answer in return. But this may be completely wrong! This “illusion of certainty” is a challenge you will need to overcome, as users of your agent may not be familiar with architecture techniques, nor have the experience and knowledge to evaluate the responses.
Using an Agent to provide natural-language access to a summary of a report from your architecture repository may seem like an innocuous, low-risk feature. However, suppose the user asks different questions, such as how much money could be saved or the total cost attributable to a department; if the answers are wrong, the consequences are much more serious.
You need an evaluation framework to ensure your Agent is using tools and making use of architecture information in an acceptable way. Evals (AI model evaluations) are the emerging set of techniques for performing these tests on AI systems. In Ardoq, a significant part of our AI R&D is spent on performing these Evals.
For a deep dive into the details, you can use Hamel Husain’s excellent field guide [10] and blog series, and Ben Lorica’s summary [11] about performing evals on AI solutions. The quote below captures Husain’s common, initial interaction with teams trying to build AI-related products.
When you use an AI tool to interact with your architecture repository, the key question you need to ask yourself is "How do I know that it's providing a useful result?"
You can start with simple evals and get more sophisticated over time. For instance, begin by performing basic tests on likely requests and responses to your architecture repository and set up a feedback channel from users based on their experience. You can then build up to a full evaluation solution using tools such as Logfire, LangSmith, or DeepEval.
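A minimal starting point could look like the sketch below, which runs a handful of known prompts through the agent and checks that the answers mention facts you know are in the repository. The ask_agent() function, the prompts, and the expected values are hypothetical placeholders for your own setup.

```python
# A simple eval sketch: known prompts plus the facts a good answer must mention.
# ask_agent() is a hypothetical wrapper around your AI client and its Ardoq MCP connection.
EVAL_CASES = [
    {"prompt": "Who should I talk to about the CRM application?",
     "must_mention": ["Sales Operations"]},
    {"prompt": "Which applications support the Order-to-Cash process?",
     "must_mention": ["ERP", "CRM"]},
]

def run_evals(ask_agent) -> None:
    failures = []
    for case in EVAL_CASES:
        answer = ask_agent(case["prompt"])
        missing = [term for term in case["must_mention"] if term.lower() not in answer.lower()]
        if missing:
            failures.append((case["prompt"], missing))
    print(f"{len(EVAL_CASES) - len(failures)}/{len(EVAL_CASES)} cases passed")
    for prompt, missing in failures:
        print(f"FAILED: {prompt!r} did not mention {missing}")
```

Even a crude check like this, rerun after every change to your assets or system prompt, gives you a baseline before moving to a dedicated eval tool.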
Regardless of the approach, you should always ask yourself, “How do we measure that our agent is producing acceptable answers using the architecture repository information?”
Provide a system prompt that will help the Agent use Ardoq
As an LLM context grows, an Agent framework will perform memory management, which may remove or summarize information to keep the context length manageable. You can therefore think of the system prompt as a user prompt that will not be removed by memory management. This is a useful but simplified explanation for this article; check your model and agent framework for how and when the system prompt is handled.
The Ardoq MCP Server provides instructions that explain to the AI client what Ardoq is, how the architecture information is constructed as a knowledge graph, and which tools are available to summarize, traverse, and extract information from that knowledge graph. These instructions have been constructed based on extensive Ardoq MCP evaluations.
You can build on this with your own system prompt that explains the types of information in Ardoq and what that information can be used for. You should experiment in your own evals to see the effect of adjusting the system prompt. Let us know which changes work and which do not work via the Ardoq Product Portal or your Customer Service Manager (CSM).
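As an illustration, a domain-specific system prompt layered on top of the server's own instructions might look like the sketch below. The wording and the configuration dictionary are hypothetical; how a system prompt is attached depends entirely on your Agent framework.

```python
# Hypothetical system prompt added on top of the Ardoq MCP server instructions.
SYSTEM_PROMPT = """
You help employees find information in our architecture repository via the Ardoq tools.
- The repository contains our Application Portfolio, Business Capabilities, and their owners.
- Always look up answers with the Ardoq tools before answering from general knowledge.
- If the repository does not contain the answer, say so instead of guessing.
- Prefer the viewpoint 'Application owners and experts' for questions about who to contact.
""".strip()

# How this is wired up is framework-specific; a generic sketch of the configuration:
agent_config = {"system_prompt": SYSTEM_PROMPT, "tools": ["ardoq_mcp"]}
```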
Prompt Engineering for user prompts is a subset of Context Engineering and is still an important topic for extracting value when connecting AI to Ardoq. We provide examples of useful prompts in our knowledge base, but do not discuss Prompt Engineering in more detail in this article.
Limit Ardoq access to specific, high-quality sets of information
Design the Agent’s access to Ardoq to focus on high-quality information that helps end users. Architecture repositories contain a mix of production information, testing information, and details that are only useful to architects. Use access controls to reduce the chance of Context Distraction from irrelevant information.
Ardoq’s new access control mechanisms let you create user groups and define access to particular hierarchies of information for those groups. Create a new permission group for your Agent and give it read-only access to the subset of high-quality information it should see.
For example, the following screenshot shows an “APM restricted MCP” user group that can read APM information but does not have access to Application Rationalization information:
MCP User Group with Restricted Access
If your evals show that context distraction is still a problem, then consider creating multiple agents where each focuses only on a subset of architecture information or a subset of operating areas within the company.
Configure tools so an Agent can use them to answer common questions
Ardoq MCP provides tools to extract information using Viewpoints, Reports, and Dashboards. Design these to focus on questions that you expect (or have measured) your users to ask. For example, a common query we hear from early customer feedback is:
“People want a simple way to ask: who should I talk to about Application X?”
For this example, design your viewpoints to start with the information that people often know (an Application) and lead them to the information they want to find (the People to contact).
Configuring tools for AI access also includes adding descriptions that help the Agent determine which tool to use to answer a particular question. Be sure to add descriptions to your Assets: Viewpoints, Reports, and Dashboards. The tools are designed so the Agent can first request a list of these Assets (names, descriptions, and IDs) to identify the most appropriate one, and then request that specific Asset to answer the user's question.
If your assets only have names and lack useful descriptions, you are limiting the Agent's ability to find the appropriate ones. Reports, Components, and Viewpoints all allow AI-assisted description generation based on their structural definition. The screenshot above shows the auto-generated description for the ‘Application owners and experts’ viewpoint, based on its component and reference types.
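The sketch below illustrates why descriptions matter when the Agent chooses between assets. The asset list and keyword match are simplified, hypothetical stand-ins; the real selection is done semantically by the LLM against whatever descriptions your Ardoq assets actually have.

```python
# Illustrative only: the kind of asset listing an Agent chooses from. The field
# names here are hypothetical, not the exact Ardoq MCP schema.
viewpoints = [
    {"id": "vp-001",
     "name": "Application owners and experts",
     "description": "Starts from an Application and traverses ownership and expert "
                    "references to the People responsible for it."},
    {"id": "vp-002",
     "name": "App capability map",
     "description": ""},   # an empty description gives the Agent nothing to match against
]

def pick_viewpoint(question: str) -> dict | None:
    """Crude keyword stand-in for the semantic matching the LLM performs."""
    words = question.lower().replace("?", "").split()
    return next((vp for vp in viewpoints
                 if any(word in vp["description"].lower() for word in words)), None)

print(pick_viewpoint("who owns this application?"))  # finds vp-001 via its description
```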
Limit the data returned from Ardoq to the Agent
Consider the size of the information returned from the Ardoq MCP tools to your Agent. Large reports, such as one listing every application in the portfolio, are easy for users to consume with pagination in the UI, but they contain too much information for an Agent to process effectively.
Consider creating reports that focus on subsets of the data or ensure there are column types and field values that the Agent can use to request a filtered report.
Similarly, make dashboards and viewpoints more focused to avoid creating noisy context.
The granularity of reports, dashboards, and viewpoints you need in the user interface might be different from those you need for AI access. The correct size for that AI access will be dependent on the context size and model you are using. So experiment with different subsets to increase your positive evals.
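As a sketch of the difference, compare a tool call that pulls a whole portfolio with one that pulls a filtered subset. The call_mcp_tool() function, tool names, and parameters are hypothetical illustrations, not the actual Ardoq MCP API.

```python
def call_mcp_tool(name: str, params: dict) -> list[dict]:
    """Stand-in for an MCP tool invocation made by your Agent."""
    print(f"calling {name} with {params}")
    return []

# Noisy context: the full application portfolio has to fit into the context window.
all_apps = call_mcp_tool("get_report", {"report_id": "application-portfolio"})

# More focused: a report designed with a filterable column, so only the rows
# relevant to the question are returned to the Agent.
finance_apps = call_mcp_tool(
    "get_report",
    {"report_id": "application-portfolio-by-department", "filter": {"Department": "Finance"}},
)
```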
Consider which tools your Agent uses beyond Ardoq MCP
It might be useful to connect your Agent to both Ardoq and to additional MCP servers to combine information from multiple sources. However, combining multiple MCP servers might create context confusion because of the volume of data being processed.
Similarly, many agents have tools to search the open internet, but information fetched from the internet could conflict with information in the LLM’s training material or information available in Ardoq.
Consider someone asking, “Which business processes are supported by Salesforce?”. The user will be expecting the agent to extract information from the architecture repository about specific business processes supported by Salesforce in your organization. Instead, the Agent may perform an internet search or use its training knowledge to return a general list of processes that could be supported by Salesforce. This answer may be completely disassociated from how Salesforce is used within your organization. More concerning is that the end user will not know that this is the case.
Agents and agentic frameworks provide various mechanisms for configuring tools. You could set the system prompt to instruct the Agent to always search Ardoq before the internet. Or disable search on the internet entirely.
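Both levers can be sketched as configuration. The keys and values below are hypothetical and will differ per Agent framework, but the idea is the same: instruct Ardoq-first behaviour in the system prompt and simply leave web search out of the enabled tools.

```python
# Hypothetical agent configuration; adapt to whatever your framework expects.
agent_config = {
    "system_prompt": (
        "Always answer questions about our organization's applications, processes, and "
        "owners using the Ardoq tools. Only use web search if the user explicitly asks for "
        "public information, and state clearly when an answer did not come from Ardoq."
    ),
    "enabled_tools": ["ardoq_mcp"],  # web search is simply not included
}
```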
Consider the number of tools available to the Agent
Context confusion can also arise when adding too many tools to the Agent. Sometimes you will want to combine information from Ardoq with information from other MCP sources. Documentation about each tool available to an agent needs to be provided as part of the context. This information includes the tool name, a description of the tool, and a specification of the input and output parameters. This can add up to a considerable amount of information being passed to the LLM each time.
Research has shown that agent performance drops noticeably once Agents get close to 30 available tools [4], primarily because the names, descriptions, and other documentation start to overlap and create a noisy context.
Be frugal with the number of tools you make available to your agent and help users to selectively disable tools if that feature is available to you.
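A rough way to see how much of your context window tool documentation consumes is to serialize the tool definitions and estimate their token count. The four-characters-per-token figure below is a common rule of thumb, not an exact tokenizer, and the example schema is hypothetical.

```python
import json

def estimate_tool_context_tokens(tool_schemas: list[dict]) -> int:
    """Very rough estimate: ~4 characters per token."""
    return len(json.dumps(tool_schemas)) // 4

tools = [
    {"name": "list_reports",
     "description": "List available reports with their names, descriptions, and IDs.",
     "parameters": {"type": "object", "properties": {}}},
    # ...every additional MCP server appends its own tools to this list
]
print(f"~{estimate_tool_context_tokens(tools)} tokens spent on tool definitions per request")
```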
Recommend that your users start a separate chat for each topic
Agents perform worse as a context grows, even for those LLMs that advertise large context windows. The conversation history of user/agent interaction is maintained and passed each time in the context to the LLM. This can quickly grow quite large and create a noisy or irrelevant context. Agents often have memory management tools to remove parts of the conversation history or to replace them with shorter summaries. But which information is removed or summarized is not transparent to the end user.
All of these issues create a large context that reduces the AI's ability to provide an acceptable answer to the user's question.
Recommend that your users start a new chat for each new topic when interacting with an Agent and your architecture repository. This resets the conversation history and keeps the context specific to the topic.
Similarly, users should start a new chat if they notice a hallucination in the conversation. Maintaining a hallucination in the conversation history will poison the context and result in invalid answers when the user is not expecting it.
Another available technique is to ask the agent to summarize the conversation so far and to use that summary as the starting input of a new chat, clearing the context.
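A sketch of this "summarize and restart" technique is shown below. The Chat class is a stand-in for whatever session object your AI client provides, and the summarization prompt is only a suggestion.

```python
class Chat:
    """Stand-in for your AI client's chat session; replace with the real thing."""
    def __init__(self) -> None:
        self.history: list[str] = []
    def ask(self, prompt: str) -> str:
        self.history.append(prompt)
        return f"<response to: {prompt[:40]}...>"

def restart_with_summary(old_chat: Chat) -> Chat:
    # Ask the long-running chat to compress itself into a short summary...
    summary = old_chat.ask(
        "Summarize our conversation so far: the question being investigated, "
        "the key facts found in Ardoq, and the open follow-ups."
    )
    # ...then seed a brand-new chat with only that summary, dropping the noisy history.
    fresh_chat = Chat()
    fresh_chat.ask(f"Context from a previous conversation:\n{summary}")
    return fresh_chat
```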
Finally, keep an eye on the tool usage by the agent in the conversation. What may seem like a direct user request to call a particular tool, e.g., to fetch a report, can result in the agent calling multiple tools before arriving at the desired one. All of these try/fail efforts by the agent are included in the conversation history and can muddy the context and reduce performance.
Summary
We are seeing significant benefit when architecture teams make their repository information available to end users via AI clients. However, as we have shown, it is not just a case of connecting the tools and expecting the LLM to handle the huge and disparate collection of architecture information available.
You need to understand the techniques of Context Engineering to get the most out of its potential. Some of these are necessary preconditions, such as understanding the strengths and weaknesses of LLMs to set correct expectations. Others are “no-brainers”, such as setting the system prompt to help the LLM understand how to navigate the tools, whilst for others you will need to experiment and evaluate to tune the approach to your local context.
Resources
[2] Mei, Lingrui, Jiayu Yao, Yuyao Ge, et al. “A Survey of Context Engineering for Large Language Models.” arXiv:2507.13334. Preprint, arXiv, July 21, 2025. https://doi.org/10.48550/arXiv.2507.13334.
[3] How Long Contexts Fail
[4] How to Fix your Context
[6] MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers
[8] gpt-oss-120B (high): API Provider Benchmarking & Analysis
[9] Aditya Challapally, Chris Pease, Ramesh Raskar, and Pradyumna Chari. The GenAI Divide: State Of AI In Business 2025. MIT Nanda, 2025. https://mlq.ai/media/quarterly_decks/v0.1_State_of_AI_in_Business_2025_Report.pdf.
[12] Wang, Zhenting, Qi Chang, Hemani Patel, et al. “MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers.” arXiv:2508.20453. Preprint, arXiv, August 28, 2025. https://doi.org/10.48550/arXiv.2508.20453.
[13] Effective context engineering for AI agents, Anthropic. September, 2025