User:Meelislubi

http://www.itlibrary.org/index.php?page=Incident_Management

IT service monitoring
IT monitoring is the checking of system events in order to inform people based on predefined logic. It can also include keeping a record of system event statuses (problem <-> OK), but this is not the primary goal of monitoring.

event -> logic -> trigger -> notification
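The chain above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Event` fields, the "contains the word error" rule and the `notify` callback are all hypothetical and not taken from any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Event:
    source: str  # hypothetical: name of the system that produced the event
    text: str    # raw event text


def logic(event: Event) -> bool:
    """Predefined logic (assumed rule): the event text mentions an error."""
    return "ERROR" in event.text.upper()


def trigger(event: Event) -> Optional[str]:
    """Turn a matching event into a notification message, or None."""
    if logic(event):
        return f"[{event.source}] {event.text}"
    return None


def process(event: Event, notify: Callable[[str], None]) -> bool:
    """Run one event through the chain; return True if someone was notified."""
    message = trigger(event)
    if message is None:
        return False
    notify(message)
    return True


# A list stands in for the real notification channel (SMS, e-mail, ...).
sent = []
process(Event("db01", "Error: disk full"), sent.append)
process(Event("db01", "backup finished"), sent.append)
print(sent)  # ['[db01] Error: disk full']
```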

To monitor IT services you can monitor only the CIs (configuration items); by doing so you also monitor the services that depend on them, as a CI can have a one-to-one relation with an IT service.
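A minimal sketch of deriving affected services from failed CIs follows. The CI and service names are made up, and the entry that maps one CI to two services is an assumption that goes beyond the one-to-one case mentioned above.

```python
# Hypothetical CI-to-service relations; all names are invented for illustration.
CI_TO_SERVICES = {
    "db01": ["online-banking"],
    "mail-gw": ["e-mail"],
    "san-a": ["online-banking", "e-mail"],  # assumed one-to-many relation
}


def affected_services(failed_cis):
    """Return the sorted list of services that depend on any failed CI."""
    services = set()
    for ci in failed_cis:
        services.update(CI_TO_SERVICES.get(ci, []))
    return sorted(services)


print(affected_services(["db01"]))   # ['online-banking']
print(affected_services(["san-a"]))  # ['e-mail', 'online-banking']
```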

SLA OLA

Objects in question
The customer usually wants to be notified about these, but they do not always meet the definition of an incident, as there may be no impact on service or quality (yet)
 * Event - an event in the system
 * Incident - (ITILv3) An unplanned interruption to an IT Service or a reduction in the Quality of an IT Service.
 * Great risk increasing -
 * Problem - (out of monitoring scope, but referenced as the next step)

DEFINE: event DEFINE: message


 * Passive monitoring - monitoring of logs
 * Active monitoring - emulating a user (logging in, checking balance, making a payment ...)
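The difference between the two can be sketched as follows. Both functions are hypothetical: the passive check assumes errors appear in the log as lines containing "ERROR", and the active check uses stand-in callables where a real probe would drive the actual service.

```python
def passive_check(log_lines):
    """Passive monitoring: scan already written log lines for errors."""
    return [line for line in log_lines if "ERROR" in line]


def active_check(steps):
    """Active monitoring: run user-like steps in order.

    `steps` is a list of (name, callable) pairs; each callable returns True
    on success. Returns the name of the first failing step, or None.
    """
    for name, step in steps:
        try:
            ok = step()
        except Exception:
            ok = False  # a crashing step counts as a failure
        if not ok:
            return name
    return None


log = ["INFO started", "ERROR payment timeout", "INFO stopped"]
print(passive_check(log))  # ['ERROR payment timeout']

# Stand-in steps that only return fixed values.
steps = [
    ("log in", lambda: True),
    ("check balance", lambda: True),
    ("make payment", lambda: False),
]
print(active_check(steps))  # make payment
```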

Checking methods

 * Checks for error
 * Alerts on Errors (excludes unknown) (stateless monitoring)
 * (Optional) Alerts on Success after error. (Usually hard to accomplish, as the tool is already designed for error monitoring only) (state based monitoring)
 * Checks for Success (state based monitoring)
 * Alerts on non-Success (includes unknown)
 * Alerts on Success after non-Success
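The practical difference between the two checking methods is how an unknown result is treated. A small sketch, assuming a three-valued check result (the `Result` enum and function names are hypothetical):

```python
from enum import Enum


class Result(Enum):
    SUCCESS = "success"
    ERROR = "error"
    UNKNOWN = "unknown"  # e.g. the check itself could not complete


def alert_on_error(result: Result) -> bool:
    """Check-for-error style: only a known error alerts; unknown is ignored."""
    return result is Result.ERROR


def alert_on_non_success(result: Result) -> bool:
    """Check-for-success style: anything other than success alerts."""
    return result is not Result.SUCCESS


for r in Result:
    print(r.value, alert_on_error(r), alert_on_non_success(r))
# success False False
# error True True
# unknown False True
```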

Stateless
Messages simply arrive, and it is not possible to tell when an error has been fixed. Sometimes it is presumed that if a message does not repeat, the event / incident has ended (error not found). This may not always be so, as the check looks for errors, not for success.

State based
Messages arrive when errors occur (non-success) and are closed automatically when the error situation is over (success reached). A state based event can also have an unknown state.
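A minimal sketch of state based behaviour, assuming one open message per check and treating unknown as non-success for the purpose of opening a message. The class and its method names are hypothetical.

```python
class StateBasedCheck:
    """Tracks the last state of one check and opens/closes a single message."""

    STATES = ("success", "error", "unknown")

    def __init__(self, name):
        self.name = name
        self.state = "unknown"    # nothing is known before the first result
        self.open_message = None  # text of the currently open message, if any

    def update(self, state):
        """Feed one check result; return 'opened', 'closed' or None."""
        if state not in self.STATES:
            raise ValueError(f"unexpected state: {state}")
        self.state = state
        if state == "success":
            if self.open_message is not None:
                self.open_message = None  # error situation is over
                return "closed"
            return None
        # Non-success (error or unknown): open a message only once.
        if self.open_message is None:
            self.open_message = f"{self.name}: {state}"
            return "opened"
        return None


check = StateBasedCheck("db01 ping")
results = ["success", "error", "error", "unknown", "success"]
events = [check.update(s) for s in results]
print(events)  # [None, 'opened', None, None, 'closed']
```

A stateless tool would instead emit a new message for each of the three non-success results and give no signal when the final success arrives.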

Mixed
It is unknown which messages will close automatically and which will not (if the two kinds are not kept separate!)

Actions

 * Automatic - The action is run automatically. Example: error detected -> SMS sent
 * Manual - The action is run manually (needs human involvement). Example: call the admin
 * Semi-Automatic - A human verifies the error situation -> the machine runs the automatic action
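The three action types could be dispatched roughly like this. It is a sketch only: `handle`, the mode strings and the two callbacks are hypothetical, and a list stands in for a real SMS gateway.

```python
def handle(error, mode, run_action, verified_by_human=None):
    """Dispatch an action for `error` according to `mode`.

    mode: 'automatic', 'manual' or 'semi-automatic'.
    run_action: callable standing in for the automatic action (e.g. send SMS).
    verified_by_human: callable returning True once a person has confirmed
        the error; only consulted in semi-automatic mode.
    Returns a short description of what happened.
    """
    if mode == "automatic":
        run_action(error)
        return "action run"
    if mode == "manual":
        return "waiting for human"  # e.g. someone has to call the admin
    if mode == "semi-automatic":
        if verified_by_human is not None and verified_by_human(error):
            run_action(error)
            return "action run"
        return "not verified"
    raise ValueError(f"unknown mode: {mode}")


sms_outbox = []
print(handle("disk full", "automatic", sms_outbox.append))
print(handle("disk full", "manual", sms_outbox.append))
print(handle("disk full", "semi-automatic", sms_outbox.append,
             verified_by_human=lambda e: True))
print(sms_outbox)  # ['disk full', 'disk full']
```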

Impact Analysis
A message basically describes the impact on CIs (No Impact, Working Slow, Partly Working, Not Working). The admin needs the message to start fixing: it identifies the object (CI) and the error (message text).
 * No Impact - no impact on the service (yet)
 * Working Slow - self-explanatory
 * Partly Working - non-critical functionality of the service is affected
 * Not Working - critical functionality of the service is affected
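The four impact levels map naturally onto an ordered enumeration. The sketch below assumes that the levels are listed in increasing severity; the numeric values, `format_message` and `worst_impact` are hypothetical helpers for illustration.

```python
from enum import IntEnum


class Impact(IntEnum):
    # Assumed ordering: a higher value means a more severe impact.
    NO_IMPACT = 0
    WORKING_SLOW = 1
    PARTLY_WORKING = 2
    NOT_WORKING = 3


def format_message(ci, impact, text):
    """Build a message that identifies the object (CI), impact and error text."""
    label = impact.name.replace("_", " ").title()
    return f"{ci} [{label}]: {text}"


def worst_impact(impacts):
    """Return the most severe impact, or NO_IMPACT for an empty input."""
    return max(impacts, default=Impact.NO_IMPACT)


print(format_message("db01", Impact.PARTLY_WORKING, "replication lag"))
# db01 [Partly Working]: replication lag
print(worst_impact([Impact.WORKING_SLOW, Impact.NOT_WORKING]).name)
# NOT_WORKING
```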