AIOps and root cause analysis with MCP server using Icinga as an example – Monitoring intelligently automated with AIR-MOC from RISE

April 9, 2025 | relevant products: COMMOC, LOMOC, SIEMOC, AIR

The operation of modern IT infrastructures today faces enormous challenges. Increasingly diverse system landscapes, rising availability requirements and an exponentially growing volume of monitoring data call for intelligent and scalable solutions. Traditional monitoring approaches are increasingly reaching their limits here, especially when it comes to the rapid detection and analysis of causal chains in the event of a fault.

This is precisely where AIOps – Artificial Intelligence for IT Operations – comes into play. With the help of machine learning algorithms, pattern recognition and intelligent correlation of events, it not only identifies symptoms but also uncovers their causes automatically.

A specific example of this approach is the integration of a so-called MCP server (Meta Correlation Point) into existing monitoring landscapes such as Icinga. The focus is on the automated root cause analysis, which is made possible by correlation techniques at different levels of abstraction. In the future, the MCP server will be an integral part of the planned AIR-OPS product from RISE, which operationalises AIOps in an industry-compatible way.

COMMOC based on Icinga as a powerful monitoring framework

COMMOC based on Icinga has established itself in practice as a flexible, scalable and open-source monitoring system. It allows for comprehensive monitoring of hosts, services and applications across distributed instances. The strength of COMMOC lies primarily in its modularity and tight integration into existing system environments. But despite all its capabilities, a central problem remains: identifying the actual cause of the malfunction, especially in highly dynamic and networked IT environments.

While COMMOC is very good at detecting and visualising symptoms in the form of status changes or threshold violations, it lacks semantic intelligence for root cause analysis. This results in a flood of alerts in which operators know that a problem exists, but not immediately where and why.

The MCP server as a core AIOps component

This is where the MCP server comes into play. As the central element of an AIOps strategy, it acts as a meta-correlation point – an instance that receives structured monitoring events from various sources, enriches, weights, historicises and, above all, correlates them. The aim is to condense the multitude of individual symptoms into a meaningful overall picture and to generate automated root cause hypotheses on this basis.

The way the MCP server works can be outlined in several steps:

The MCP server is the centrepiece of an intelligent AIOps architecture and follows a multi-stage analysis process based on structured data flow and machine learning. It all starts with data aggregation: the server receives a multitude of events from various sources – including traditional monitoring systems such as Icinga, log data, SNMP traps and native system status messages. All of this input data is first normalised and transferred into a consistent event model, which forms the basis for further analysis.
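The normalisation step can be illustrated with a minimal Python sketch. The `Event` model, its field names and the simplified input records are assumptions made for illustration only; they are not the actual MCP server data model.

```python
from dataclasses import dataclass


# Hypothetical unified event model; all field names are illustrative assumptions.
@dataclass(frozen=True)
class Event:
    source: str       # e.g. "icinga" or "snmp"
    host: str
    service: str      # empty string for host-level events
    severity: str     # "OK", "WARNING", "CRITICAL" or "UNKNOWN"
    timestamp: float  # Unix time in seconds


# Icinga service states are numeric: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
ICINGA_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def from_icinga(raw: dict) -> Event:
    """Map a simplified Icinga-style state-change record to the common model."""
    return Event(
        source="icinga",
        host=raw["host"],
        service=raw.get("service", ""),
        severity=ICINGA_STATES.get(int(raw["state"]), "UNKNOWN"),
        timestamp=float(raw["timestamp"]),
    )


def from_snmp_trap(raw: dict) -> Event:
    """Map a simplified, hypothetical SNMP trap record to the common model."""
    severity = "CRITICAL" if raw["trap"] == "linkDown" else "OK"
    return Event("snmp", raw["agent"], raw.get("interface", ""), severity, float(raw["time"]))
```

Once every source is mapped to the same structure, the later stages only need to understand one event format, regardless of where an event came from.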

Building on this, context enrichment is carried out, in which additional information is added to the raw event data. This includes, among other things, data from configuration management databases (CMDBs), existing topology models, configuration statuses and historical progressions. This contextual knowledge is essential for a better understanding of correlations and to enable semantic interpretations.
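As a rough sketch of what context enrichment might look like, the following Python example attaches CMDB attributes to an event. The CMDB extract, the host names and the attribute names are invented for illustration.

```python
# Hypothetical CMDB extract: host name -> context attributes (illustrative data).
CMDB = {
    "vm-web01": {"hypervisor": "hv-node-3", "site": "dc1", "owner": "web-team"},
    "vm-db01": {"hypervisor": "hv-node-3", "site": "dc1", "owner": "db-team"},
}


def enrich(event: dict, cmdb: dict) -> dict:
    """Return a copy of the event with CMDB context attached under 'context'."""
    context = cmdb.get(event["host"], {})
    return {
        **event,
        "context": dict(context),
        # Flag events without CMDB data so that later stages can treat them with caution.
        "context_complete": bool(context),
    }
```

Marking events with missing context, rather than silently dropping them, keeps gaps in the CMDB visible to the later analysis.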

The next step is to correlate and weight the information collected. The MCP server analyses temporal coincidences, recognises logical and topological dependencies and compares current patterns with known historical events. This methodology makes it possible to identify relationships between individual events. In addition, relevance is assessed by weighting and probabilities are calculated, which makes it much easier to prioritise potential causes.
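Two of the ideas mentioned here, temporal coincidence and probability-based prioritisation, can be sketched in a few lines of Python. The 60-second gap and the simple sum-to-one normalisation are assumptions chosen for illustration, not the weighting actually used by the MCP server.

```python
def group_by_time(timestamps, max_gap=60.0):
    """Group timestamps into bursts: a new group starts when the gap exceeds max_gap seconds."""
    groups = []
    for ts in sorted(timestamps):
        if groups and ts - groups[-1][-1] <= max_gap:
            groups[-1].append(ts)
        else:
            groups.append([ts])
    return groups


def to_probabilities(scores):
    """Normalise non-negative candidate scores so that they sum to 1."""
    total = sum(scores.values())
    if total == 0:
        return {name: 0.0 for name in scores}
    return {name: value / total for name, value in scores.items()}
```

Events that fall into the same burst are candidates for a shared cause, and the normalised scores give operators a simple ranking of the candidate causes.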

On this basis, the server generates specific hypotheses about the most likely cause of a disruption as part of the root cause determination. These hypotheses can be visualised in a suitable form or transmitted directly to operational teams – for example, as a recommended course of action or an automated incident suggestion.

One crucial aspect is the adaptive operation of the system. Confirmed root cause analyses and operational experience are fed back into the underlying machine learning model via integrated feedback mechanisms. This continuously improves the accuracy of the analysis and dynamically adapts to new system conditions and error patterns.
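The feedback principle can be illustrated with a deliberately simple stand-in for the machine learning model: a counter that tracks how often a hypothesis pattern was confirmed by operators. The class name, the pattern labels and the use of Laplace smoothing are assumptions for this sketch.

```python
from collections import defaultdict


class FeedbackStore:
    """Tracks how often hypotheses of a given pattern were confirmed by operators."""

    def __init__(self):
        self._confirmed = defaultdict(int)
        self._total = defaultdict(int)

    def record(self, pattern: str, confirmed: bool) -> None:
        """Store one piece of operator feedback for a hypothesis pattern."""
        self._total[pattern] += 1
        if confirmed:
            self._confirmed[pattern] += 1

    def prior(self, pattern: str) -> float:
        """Laplace-smoothed confirmation rate; 0.5 for patterns never seen before."""
        return (self._confirmed[pattern] + 1) / (self._total[pattern] + 2)
```

A pattern that is confirmed repeatedly receives a higher prior the next time it appears, while rejected hypotheses lose weight over time.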

Example of use: COMMOC with AIR-MOC extension

A specific use case demonstrates the power of the MCP server: In a production environment, COMMOC is used to monitor several hundred hosts and thousands of services. One day, over 100 service checks fail at the same time – spread across different hosts and applications. The COMMOC interface shows a large number of red and yellow states, but offers no way to narrow down the cause.

The MCP server processes the incoming events in parallel and determines that all affected services run on virtual machines hosted on one specific cluster node of a hypervisor. A correlation with historical events shows that similar patterns in the past were associated with network segment failures on the management interface of the node in question.

Within seconds, the MCP server generates the hypothesis: ‘Network problem on hypervisor node X most likely cause of service outages.’ This hypothesis is assigned a confidence level and transmitted to operations. The team can then take targeted action – a significant time saving compared to manual root cause analysis.
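The topological part of this example can be sketched as follows. The function looks for the hypervisor node that hosts most of the failed machines and reports the share of failures it explains as a crude stand-in for a confidence level. The host-to-node mapping and all names are hypothetical.

```python
from collections import Counter


def common_cause(failed_hosts, host_to_node):
    """Return (node, share): the node hosting most failed hosts and the share it explains."""
    nodes = [host_to_node[h] for h in failed_hosts if h in host_to_node]
    if not nodes:
        return None, 0.0
    node, count = Counter(nodes).most_common(1)[0]
    # Hosts without a known node count against the hypothesis.
    return node, count / len(failed_hosts)
```

In the scenario described above, all affected virtual machines map to the same node, so the share would be 1.0. A real system would combine this with historical patterns before reporting a confidence level.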

Questions that AIR-MOC can answer without having to use COMMOC

The previous example is just one of many. AIR-OPS can use the powerful REST API from COMMOC or Icinga to answer a variety of natural language questions about your own infrastructure. These are the questions that experts ask in the course of a root cause analysis, and whose answers they currently have to work out themselves across several tools. In the future, AIR-OPS will do this work for them. Here is just a sample of the questions that AIR-OPS can answer about your infrastructure:

"Which hosts or services have generated a particularly large number of events in the last 24 hours?"

"How often has the ‘mysql-check’ service switched to a ‘WARNING’ status in the last 30 days?"

"Are there recurring downtimes at certain times of the day or week?"

"Which services or hosts show unstable conditions with more than 5 status changes per day?"

"Are there any services that have remained in a ‘CRITICAL’ state for more than an hour?"

"Were there host or service groups with synchronous event peaks?"

"Are there services whose ‘CRITICAL’ events always correlate with a ‘DOWN’ status of a particular host?"

"How has the number of events changed compared to the previous week?"

"Has there been a significant increase in critical events on a particular host in the last month?"

At first glance, answering these questions with the existing monitoring tools seems like magic – in reference to the famous quote from Arthur C. Clarke: ‘Any sufficiently advanced technology is indistinguishable from magic.’
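To show that such questions reduce to ordinary data processing, here is a minimal Python sketch for the question about services with more than 5 status changes per day. It assumes the state changes have already been exported as (host, service, Unix timestamp) tuples; this input format is an assumption for illustration.

```python
from collections import Counter
from datetime import datetime, timezone


def unstable_services(state_changes, threshold=5):
    """Return (host, service) pairs with more than `threshold` state changes on any UTC day."""
    per_day = Counter()
    for host, service, ts in state_changes:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
        per_day[(host, service, day)] += 1
    return sorted({(h, s) for (h, s, _), n in per_day.items() if n > threshold})
```

Days are counted in UTC here; in practice the local time zone of the operations team may be the better choice.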

RISE's AIR-MOC uses the COMMOC query interface to answer these questions. One concrete example from our MCP server is the question ‘Flapping services of a particular host in the last 14 days’. The following is a simple form that you can try out yourself if you have access to a Linux console (replace USERNAME, PASSWORD, ICINGA_API_URL and HOSTNAME):

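# Queries sent via POST need the X-HTTP-Method-Override: GET header; 1209600 s = 14 days
curl -k -s -u 'USERNAME:PASSWORD' \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -H 'X-HTTP-Method-Override: GET' \
  -X POST 'https://ICINGA_API_URL:5665/v1/objects/services' \
  -d '{
    "filter": "host.name == \"HOSTNAME\" && service.flapping && service.last_state_change >= get_time() - 1209600",
    "attrs": ["host_name", "name", "flapping", "last_state_change"]
  }'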
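For use from a script, the request body of such a query can also be built programmatically. The following Python sketch only constructs the JSON payload and does not send it. It assumes the Icinga 2 API's `filter_vars` mechanism for passing values into the filter and the DSL function `get_time()`; the function and variable names `flapping_query`, `target_host` and `window` are chosen for illustration.

```python
import json


def flapping_query(hostname: str, days: int = 14) -> dict:
    """Build the request body for 'flapping services of a host in the last N days'."""
    return {
        # The values are passed via filter_vars, so no manual quoting or escaping is needed.
        "filter": (
            "host.name == target_host && service.flapping "
            "&& service.last_state_change >= get_time() - window"
        ),
        "filter_vars": {"target_host": hostname, "window": days * 86400},
        "attrs": ["host_name", "name", "flapping", "last_state_change"],
    }


if __name__ == "__main__":
    print(json.dumps(flapping_query("HOSTNAME"), indent=2))
```

Passing the host name as a filter variable instead of pasting it into the filter string avoids quoting problems when host names contain special characters.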

No magic, just the intelligent combination of existing possibilities.

From proof of concept to product: AIR-MOC from RISE

The MCP server is currently part of several pilot projects and will be part of RISE's comprehensive AIOps product AIR-MOC in the future. AIR-MOC aims to provide AIOps in an industry-standard, scalable and integrable form. The following aspects take centre stage:

AIR-MOC is a modern, modular AIOps platform that has been specifically developed for use in complex IT landscapes. Its architecture is divided into various functional units, including the event hub for recording event data, the MCP server for analysis and modelling, a powerful visualisation unit and an integrated feedback module. This modular structure allows companies to introduce the solution step by step and combine it with existing tools such as Icinga, Grafana, Elastic or Splunk. The result is a flexible, scalable solution that integrates seamlessly into existing environments.

A key distinguishing feature of AIR-MOC is the use of Explainable AI. Unlike traditional black box processes, the platform places great value on the transparency and traceability of its decisions. Cause-and-effect relationships are explicitly shown, giving users a better understanding of the results. This explainability not only promotes acceptance in operations, but is also a decisive advantage from an auditability perspective – especially in regulated industries.

AIR-MOC is also extremely easy to integrate. The platform supports a large number of open interfaces, including REST, WebHooks and syslog. This means it can be easily integrated into heterogeneous system landscapes and modern DevOps pipelines. This openness ensures that AIR-MOC does not remain an isolated system, but integrates organically into existing IT processes.

In terms of performance and scalability, AIR-MOC is designed for demanding environments. By using asynchronous processing and a containerised architecture, the platform can also process large volumes of events efficiently. The horizontal scaling of the processing units allows capacities to be expanded as required - a clear advantage in dynamic infrastructures with a high event density.

Another key aspect is security. As a product from a European manufacturer, AIR-MOC fulfils the standard requirements for data protection, traceability and integration security. This represents significant added value, especially for KRITIS-related sectors in which regulatory requirements are particularly strict.

However, despite all these strengths, the introduction of AIOps is not a sure-fire success. The transition from classic rule-based approaches to data-driven, probabilistic models requires a rethink within the organisation. One of the most common challenges is data quality: AIOps systems require clean, well-structured data enriched with context. Missing metadata or inconsistent topologies significantly impair the quality of analyses. Our COMMOC, LOMOC and SIEMOC monitoring solutions have therefore been specifically developed to create an ‘AIOps-ready’ database - the stable foundation for effective AI deployment.

Another obstacle is the complexity of the modelling. The initial creation of correlation topologies and weighting models can be resource-intensive. This is where the MCP server comes in, supporting step-by-step modelling with intelligent wizards and thus making it much easier to get started.

Acceptance in IT operations should also not be underestimated. AI-supported decisions must be comprehensible and trustworthy. AIR-MOC meets this challenge with transparent, explainable results - a decisive factor in reducing reservations and gaining the trust of users. Ultimately, AIOps must be considered an integral part of existing processes. It must not create a parallel world, but must be seamlessly integrated into existing ITSM, CMDB and monitoring structures. Open interfaces and tried-and-tested process adapters are essential for this and an integral part of the AIR-MOC ecosystem.

Challenges in introducing AIOps

Despite all the potential, introducing AIOps is not a sure-fire success. In particular, the transition from classic rule sets to probabilistic, data-driven models requires a rethink. The most common challenges include:

Data quality: AIOps systems depend on clean, well-structured and context-rich data. If metadata is missing or topologies are inconsistent, the quality of analysis suffers. All our monitoring products, COMMOC, LOMOC and SIEMOC, are designed to create an AIOps-ready database – the solid foundation for a helpful use of AIOps.

Complexity of modelling: The initial creation of correlation topologies and weighting models can be time-consuming. However, tools such as the MCP server offer supporting assistants for step-by-step modelling.

Acceptance in the company: The introduction of AI-supported decisions must be done cautiously. Transparent, explainable results, such as those offered by AIR-OPS, are a key factor here.

Integration into existing processes: AIOps must not open up a parallel world, but must be integrable into existing ITSM, CMDB and monitoring processes. Open interfaces and process adapters are therefore essential.

Outlook: the path to autonomous operating models

With solutions such as the MCP server and the AIR-MOC product currently in development, RISE is consistently moving towards intelligent, adaptive IT operations. In the future, such systems could not only identify causes, but also independently initiate measures – from load shifting and rollback to escalation to people.

The goal is to establish an operating model that no longer reacts, but acts proactively and, in some cases, even preventively, by combining domain-specific knowledge, algorithmic intelligence and operational embedding. This is a crucial competitive advantage, especially in highly complex environments with short response times and limited human resources.

Conclusion

The combination of established monitoring frameworks like Icinga and modern AIOps components like the MCP server is an impressive demonstration of how efficiency and quality in IT operations can be increased. Automated root cause analysis can reduce downtime, localise errors faster and deploy resources in a more targeted way.

With AIR-MOC, RISE is preparing a future-oriented product that takes the principles of AIOps to a new level – explainable, integrable and production-ready. This offers new, realistic prospects for companies that want to take the next step towards intelligent IT operations.
