Checkmk

Checkmk is a software system for IT infrastructure monitoring, developed in Python and C++. It is used to monitor servers, applications, networks, cloud infrastructures (public, private, hybrid), containers, storage, databases and environment sensors.

Checkmk is available in four editions: an open source edition (Checkmk Raw Edition), a commercial enterprise-grade edition (Checkmk Enterprise Edition), a commercial edition with advanced cloud monitoring features (Checkmk Cloud Edition), and an edition for managed service providers (Checkmk Managed Services Edition). The editions are available for a range of platforms, in particular various versions of Debian, Ubuntu, SLES and Red Hat, and also as a Docker image. In addition, physical appliances of various sizes and a virtual appliance are offered; these simplify administration of the underlying operating system through a graphical user interface and enable high-availability setups.

The agents used by Checkmk to collect data are available for 11 platforms, including Windows.

History
Checkmk originated in 2008 as a shell script, run via inetd, that took the place of a monitoring agent, and was published under the GPL in April 2009. It was initially based on Nagios, which it extended with a number of new components. The open source edition (Checkmk Raw Edition) is still based on the Nagios core, which it bundles with additional open source components into a complete system.

Over many years, Checkmk's commercial editions have evolved into a self-contained monitoring system that has replaced all of the essential Nagios components with its own, including its own monitoring core. The majority of the developments for the commercial editions, in particular all plug-ins, are also available in the Checkmk Raw Edition.

Checkmk was originally designed for monitoring large and heterogeneous on-premises environments. Since version 1.5 (from patch release 1.5p12) it also supports the monitoring of AWS, Azure, Docker and Kubernetes services.

Checkmk is developed by Checkmk GmbH in Munich, Germany. The company operated as Mathias Kettner GmbH until 16 April 2019, when it was rebranded as tribe29 GmbH and the product name "Check_MK" was changed to "Checkmk". A second rebranding followed on 14 April 2023, when the company was renamed Checkmk GmbH.

Checkmk GmbH follows an open core business model. The open source edition is available under several open source licenses, mostly GPLv2, while large parts of the commercial editions are covered by the proprietary "Checkmk Enterprise License".

The Product
Checkmk combines three types of IT monitoring:
 * Status-based monitoring, which records the "health" of a device or application, via thresholds.
 * Metric-based monitoring, which enables the recording and analysis of time series graphs using an HTML5-based graphing system. An integration with Grafana is available as well.
 * Log-based and event-based monitoring, in which key events can be filtered out and actions can be triggered based on these events.
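
The status-based approach can be illustrated with a minimal sketch. The function and class names are hypothetical and do not reflect Checkmk's plug-in API; the numeric codes follow the Nagios convention (0 = OK, 1 = WARN, 2 = CRIT), and the thresholds are assumed to be upper limits.

```python
from enum import IntEnum


class State(IntEnum):
    OK = 0
    WARN = 1
    CRIT = 2


def evaluate(value: float, warn: float, crit: float) -> State:
    """Map a measured value to a state using upper thresholds."""
    if value >= crit:
        return State.CRIT
    if value >= warn:
        return State.WARN
    return State.OK
```

For example, a filesystem at 85 % usage with thresholds of 80 % (warning) and 90 % (critical) would be reported as WARN.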

To cover a broad range of systems, each Checkmk edition currently ships with more than 2,000 plug-ins, all of which are licensed under GPLv2. These plug-ins are maintained as part of the product and are regularly supplemented with new plug-ins and extensions. Existing legacy Nagios plug-ins can be connected as well.

To simplify setup and operation, all components of Checkmk are delivered fully integrated. A rule-based 1:n configuration, together with a high degree of automation, significantly accelerates workflows. This includes:
 * Auto-discovery of hosts (where applicable)
 * Auto-discovery of services
 * Automated configuration of plug-ins via preconfigured thresholds and rules
 * Automated agent updates (a CEE feature)
 * Automatic and dynamic configuration that enables the monitoring of volatile services with a lifespan of just a few seconds, such as in the Kubernetes environment (starting from CEE v1.6)
 * Automated discovery of tags and labels from sources such as Kubernetes, AWS and Azure (starting from CEE v1.6)
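
The idea behind a rule-based 1:n configuration, in which one rule applies to many hosts, can be sketched as follows. This is an illustration only: the `Rule` class, the label names and the "first matching rule wins" evaluation are assumptions made for the example, not Checkmk's actual rule engine.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Rule:
    conditions: dict[str, str]  # label key -> required value (hypothetical labels)
    value: dict[str, float]     # parameters set by this rule


def first_match(host_labels: dict[str, str], rules: list[Rule], default: dict[str, float]) -> dict[str, float]:
    """Return the parameters of the first rule whose conditions all match."""
    for rule in rules:
        if all(host_labels.get(key) == wanted for key, wanted in rule.conditions.items()):
            return rule.value
    return default


RULES = [
    Rule({"env": "prod", "role": "db"}, {"warn": 70.0, "crit": 80.0}),
    Rule({"env": "prod"}, {"warn": 80.0, "crit": 90.0}),
]
DEFAULT = {"warn": 90.0, "crit": 95.0}
```

With these two rules, every production host receives stricter thresholds without being configured individually, and production database hosts receive stricter ones still.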

In addition, there are also playbooks for the use of configuration and deployment tools such as Ansible or Salt.

Checkmk is often used in very large distributed environments in which a high number of sites (e.g. 300 locations at Faurecia) and/or well over 100,000 devices (e.g. at Edeka) are monitored. This is possible, among other things, because Checkmk's microcore consumes far fewer CPU resources than, for example, the Nagios core, and therefore delivers significantly higher performance on the same hardware. Furthermore, non-persistent data is held in RAM, which significantly shortens access times.

Monitoring core
Checkmk Raw Edition uses the Nagios monitoring core.

The commercial editions use the proprietary "Checkmk Microcore" (CMC), a monitoring core written in C++. It performs better than the Nagios core used in the Checkmk Raw Edition, supports the recording of short-lived objects such as containers, and does not need to be restarted to apply configuration changes.

Configuration & Check Engine
Checkmk provides its own service discovery and configuration generation, and uses its own method to carry out checks. Within each check interval, each host is contacted only once, and the results are submitted to the monitoring core as passive checks. This significantly reduces the load on the monitoring server as well as on the monitored hosts.

Checkmk uses different methods to access data on target systems: agents installed on the target system; "special agents" that run on the monitoring server and communicate with the target system's API; SNMP, for example for network devices and printers; and HTTP/TCP protocols for web and internet services. By default, Checkmk follows the "pull principle": the monitoring system explicitly queries the data, so it quickly notices when a system suddenly fails and no longer responds. Alternatively, a "push" can be configured, in which the system transfers its data directly to Checkmk or to an intermediate host.
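
The pull principle can be sketched as follows: the server connects to the agent once, reads the complete output, and splits it into sections from which the individual services are then evaluated. The sketch assumes an agent that answers in plain text on TCP port 6556 and marks sections with `<<<name>>>` header lines; setups in which the connection is encrypted or otherwise wrapped are not covered. The function names are hypothetical.

```python
import socket


def fetch_agent_output(host: str, port: int = 6556, timeout: float = 10.0) -> str:
    """Pull: connect to the agent and read until it closes the connection."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_sections(output: str) -> dict:
    """Split agent output into {section name: [lines]}."""
    sections = {}
    current = None
    for line in output.splitlines():
        if line.startswith("<<<") and line.endswith(">>>"):
            # Drop header options such as ":sep(9)" and keep only the name.
            name = line[3:-3].split(":", 1)[0]
            current = sections.setdefault(name, [])
        elif current is not None:
            current.append(line)
    return sections
```

Because all sections arrive in a single response, one connection per check interval is enough to evaluate every service on the host.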

Data Interface ("Livestatus")
Livestatus is the main data interface in Checkmk. It provides live access to all data on the monitored hosts and services. The data is read directly from RAM, which avoids slow disk access and keeps queries fast without placing much load on the system. Access uses a simple protocol and is possible from any programming language without a special library.
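
A Livestatus query is a short block of text lines terminated by an empty line, which is why no special library is needed. The following sketch builds such a query and sends it over a Unix socket using only the Python standard library. The socket path is site-specific and is passed in by the caller; the helper names are hypothetical.

```python
import json
import socket


def build_query(table, columns, filters=()):
    """Compose a Livestatus query that requests JSON output."""
    lines = [f"GET {table}", "Columns: " + " ".join(columns)]
    lines += [f"Filter: {condition}" for condition in filters]
    lines.append("OutputFormat: json")
    return "\n".join(lines) + "\n\n"  # an empty line ends the query


def run_query(socket_path, query_text):
    """Send the query to the Livestatus Unix socket and decode the JSON reply."""
    chunks = []
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(query_text.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal that the request is complete
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))
```

A call such as `build_query("hosts", ["name", "state"], ["state = 1"])` produces a query for the names and states of all hosts that are currently down.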

Web-GUI ("Multisite")
Multisite is Checkmk’s web GUI. In addition to fast page rendering, it offers user-definable views and dashboards, distributed monitoring by integrating multiple monitoring instances via Livestatus, integration of NagVis, an integrated LDAP connection, access to status data via web services, and more. Dashboards and views can be tailored to individual users or user groups, for example vSphere-specific views for VMware admins. The web GUI is available in several languages.

Setup
Checkmk can be administered entirely in the browser through its Setup module. This includes managing users, roles, groups, time periods, and more. Permissions can be granted in a granular way using a role concept, and existing role-based access controls (LDAP, AD) can be used for this. Checkmk’s configuration is rule-based, so that it remains manageable and scalable even in complex environments. Automated service discovery and configuration, as well as automatic agent updates, further accelerate the configuration process. An HTTP API can also be used to integrate CMDBs for accelerated configuration.

Alert System
Several notification channels can be set up and configured with different rules for each user. For example, emails can be triggered at any time of the day, while notifications via SMS are sent only for important issues during on-call hours. Notifications can be sent to everyone or to specific teams, e.g. only the storage admins are notified about a failed hard drive. Duplicate notifications are grouped together so that no user is notified twice through a particular channel. Furthermore, users can configure their own notifications themselves. In distributed environments, alerts can be managed centrally. For detected issues, actions can be triggered automatically (alarm control) via scripts. Checkmk includes integrations with email and SMS gateways as well as with communication and IT service management solutions such as Slack, Jira, PagerDuty, OpsGenie, VictorOps, and ServiceNow.
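
The interplay of per-user rules, time restrictions and de-duplication can be sketched as follows. The `NotificationRule` class, its fields and the user names are hypothetical and serve only to illustrate the logic described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NotificationRule:
    user: str
    channel: str               # e.g. "email" or "sms"
    min_state: int = 1         # 1 = WARN, 2 = CRIT
    hours: range = range(0, 24)  # hours of the day in which the rule is active


def recipients(rules, state, hour):
    """Return unique (user, channel) pairs for a problem of the given state."""
    seen = set()
    result = []
    for rule in rules:
        key = (rule.user, rule.channel)
        if state >= rule.min_state and hour in rule.hours and key not in seen:
            seen.add(key)  # a user is notified only once per channel
            result.append(key)
    return result


RULES = [
    NotificationRule("alice", "email"),
    NotificationRule("alice", "email"),                   # duplicate, collapsed below
    NotificationRule("alice", "sms", 2, range(18, 24)),   # critical issues, on-call hours only
]
```

Here a critical problem at 20:00 results in one email and one SMS, whereas the same problem at 09:00 results in the email alone.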

Business Intelligence
The BI module is integrated into the graphical user interface. Using rules, it aggregates the states of many individual hosts and services into an overall status for business processes and for the complex applications and IT infrastructure elements they depend on. It can also be used to represent applications made up of microservices, which in turn consist of Kubernetes pods and deployments. In addition, worst-case scenarios can be simulated in real time and historical data can be analyzed to understand the causes of performance degradation.
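
A rule-based aggregation of this kind can be thought of as a tree whose leaves are individual services and whose inner nodes combine the states of their children. The sketch below assumes only two aggregation functions, "worst" and "best", and the states 0 (OK), 1 (WARN) and 2 (CRIT); the node layout and service names are hypothetical.

```python
def aggregate(node, states):
    """Compute the state of a node.

    A node is either a service name (leaf) or a tuple
    (function, children) with function "worst" or "best".
    """
    if isinstance(node, str):
        return states[node]
    function, children = node
    child_states = [aggregate(child, states) for child in children]
    return max(child_states) if function == "worst" else min(child_states)


# A shop needs its database and at least one of two web servers.
SHOP = ("worst", ["db", ("best", ["web1", "web2"])])
CURRENT = {"db": 0, "web1": 2, "web2": 0}
```

A worst-case scenario can be simulated by overriding a state before aggregating, for example `aggregate(SHOP, {**CURRENT, "web2": 2})`, without touching the live data.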

Event Console
The Event Console integrates the processing of log messages and SNMP traps into the monitoring. It is configured via a flexible set of rules that decide whether incoming messages are discarded or how they are classified. It can count, correlate, expect and rewrite messages, among other things. Similar entries can be grouped into a single event (e.g. multiple failed logins) to keep the number of events manageable. It also has a built-in syslog daemon that receives messages directly on port 514, and an SNMP trap receiver that receives traps on port 162.
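
The classification and grouping steps can be sketched as follows. The rule list, the event identifier and the message texts are invented for the example; in this sketch, a rule mapped to `None` discards the message, and messages that match no rule are ignored.

```python
import re
from collections import Counter

# (pattern, event id); None means "discard". Hypothetical example rules.
RULES = [
    (re.compile(r"Failed password for \S+"), "failed_login"),
    (re.compile(r"debug", re.IGNORECASE), None),
]


def process(messages):
    """Group matching messages into one counted event per (host, event id)."""
    events = Counter()
    for host, text in messages:
        for pattern, event_id in RULES:
            if pattern.search(text):
                if event_id is not None:
                    events[(host, event_id)] += 1
                break  # the first matching rule decides
    return events
```

Two failed logins on the same host thus appear as a single event with a count of 2 rather than as two separate entries.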

Metrics Graphing
The commercial Checkmk editions use their own metric and graphing system. Time series metrics can be analysed over long intervals using interactive HTML5 graphs. The maximum resolution is one second. Data can be imported from a variety of data sources and metrics formats (JSON, XML, SNMP etc.) and is stored on disk for long-term retention.

Alternatively, Graphite or InfluxDB can be connected via an export interface. From CEE version 1.5p16 there is also a plug-in available for integrating data directly from Checkmk into Grafana for visualization purposes. The Checkmk Raw Edition currently uses PNP4Nagios as its graphing system.

Reporting
The reporting module generates PDF reports, either ad hoc or automatically at regular intervals. It includes an availability analysis that shows the history of states over any desired time period with a single click. Availability calculations can exclude unmonitored times, adjust the resolution, or ignore short intervals. In addition to availability calculations, reporting also includes SLA reporting, with which complex SLAs can be monitored. Reporting is only available in the commercial editions of Checkmk.
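
How excluding unmonitored times and ignoring short intervals affect an availability figure can be shown with a small sketch. The interval format and the function name are assumptions made for the example; short outages are counted as available time here, which is one possible interpretation of "ignoring" them.

```python
def availability(intervals, ignore_shorter_than=0.0):
    """Return availability in percent, or None if nothing was monitored.

    intervals: iterable of (start, end, state) with state in
    {"up", "down", "unmonitored"}; times are in arbitrary units.
    """
    up = down = 0.0
    for start, end, state in intervals:
        duration = end - start
        if state == "unmonitored":
            continue  # excluded from the calculation entirely
        if state == "down" and duration >= ignore_shorter_than:
            down += duration
        else:
            up += duration  # "up", or an outage too short to count
    total = up + down
    return 100.0 * up / total if total else None


HISTORY = [
    (0, 80, "up"),
    (80, 90, "down"),
    (90, 100, "unmonitored"),
    (100, 101, "down"),
    (101, 120, "up"),
]
```

With this history, the availability is 90 % when every outage counts, and slightly higher when outages shorter than five time units are ignored.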

Hardware/Software Inventory
The hardware/software inventory can be used, for example, to monitor hardware and software changes, to verify that security updates have been installed, and to enrich static data with dynamic parameters (for example, updating disk usage statistics based on current monitoring data). A deep integration with the configuration management database (CMDB) i-doit enables the exchange of CMDB data with monitoring data.
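
Detecting software changes amounts to comparing two inventory snapshots. The sketch below assumes that a snapshot is a simple mapping of package name to version; the actual inventory data is structured differently and is more detailed, and the package names are invented.

```python
def diff_inventory(old, new):
    """Compare two {package: version} snapshots."""
    added = {name: new[name] for name in new.keys() - old.keys()}
    removed = {name: old[name] for name in old.keys() - new.keys()}
    changed = {
        name: (old[name], new[name])
        for name in old.keys() & new.keys()
        if old[name] != new[name]
    }
    return {"added": added, "removed": removed, "changed": changed}


BEFORE = {"openssl": "3.0.1", "bash": "5.1", "telnet": "0.17"}
AFTER = {"openssl": "3.0.2", "bash": "5.1", "curl": "8.0"}
```

A check for a specific security update then reduces to looking up the expected version in the "changed" part of the result.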