Incident & Problem Management

Some definitions:
Events
happen all the time, often thousands of them per minute.
Incidents
are unwanted events, or the absence of events that should have happened.
Problems
are the underlying causes of incidents.
Step one in Incident Management is to find out what is going on and make sure the people who need to know are informed. Most server software has built-in reporting, generating vast numbers of records for expected and innocuous events. The applications running on those servers also report, with quality ranging from good to indifferent. The trick is to ensure that the events that are important and/or unexpected, the incidents, do get reported to the right person or people.
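As a rough illustration of that filtering step, here is a minimal Python sketch. It treats an event as an incident if it is severe enough or flagged as unexpected, and routes it to a recipient based on its source. The severity scale, the `ROUTES` table, the team names and the `expected` flag are all hypothetical; real monitoring tools define their own. Detecting a *missing* event usually needs a separate watchdog or heartbeat check, which is represented here only as an event that such a check might raise.

```python
from dataclasses import dataclass

# Hypothetical severity scale; real logging and monitoring tools define their own.
SEVERITY_RANK = {"debug": 0, "info": 1, "warning": 2, "error": 3, "critical": 4}

# Hypothetical routing table: event source -> who should be told.
ROUTES = {"payments-app": "payments-oncall", "web-server": "platform-oncall"}
DEFAULT_ROUTE = "service-desk"


@dataclass
class Event:
    source: str
    severity: str
    message: str
    expected: bool = True  # False when the event is not part of normal operation


def is_incident(event: Event, threshold: str = "error") -> bool:
    """Treat an event as an incident if it is severe enough or unexpected."""
    return SEVERITY_RANK[event.severity] >= SEVERITY_RANK[threshold] or not event.expected


def triage(events):
    """Return (recipient, event) pairs for incidents only; the rest stay in the logs."""
    return [(ROUTES.get(e.source, DEFAULT_ROUTE), e) for e in events if is_incident(e)]


events = [
    Event("web-server", "info", "GET /index.html 200"),         # routine noise
    Event("payments-app", "error", "card gateway timeout"),      # severe enough
    Event("batch-scheduler", "warning", "nightly export did not start",
          expected=False),                                       # raised by a watchdog
]

for recipient, event in triage(events):
    print(f"notify {recipient}: {event.message}")
```

The routine web-server line is dropped. The gateway timeout goes to the payments team, and the unexpected batch warning falls through to the default recipient. That default route is one way to make sure nothing important goes unreported just because nobody owns its source.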
Incidents may be fixed as an 'exception', but when the exception starts to become a daily routine, the hours spent on it start to mount up. At that point, the problem that is the root cause of these incidents becomes a much stronger candidate for investigation.
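One way to make "when the exception starts to become a daily routine" concrete is to group incident records by a common signature and flag any that recur. The Python sketch below does this under some assumptions: each incident is a `(signature, hours_spent)` pair, and the threshold of five occurrences is an arbitrary, hypothetical choice. Real service-management tools have their own record formats and problem-raising rules.

```python
from collections import defaultdict


def problem_candidates(incidents, min_occurrences=5):
    """Group incidents by signature and flag recurring ones as problem candidates.

    incidents: iterable of (signature, hours_spent) pairs (an assumed format).
    Returns (signature, occurrences, total_hours) tuples for signatures seen at
    least `min_occurrences` times, costliest first.
    """
    occurrences = defaultdict(int)
    hours = defaultdict(float)
    for signature, spent in incidents:
        occurrences[signature] += 1
        hours[signature] += spent
    flagged = [
        (sig, occurrences[sig], hours[sig])
        for sig in occurrences
        if occurrences[sig] >= min_occurrences
    ]
    # Sort by total hours so the most expensive recurring issue is looked at first.
    return sorted(flagged, key=lambda item: item[2], reverse=True)


# A hypothetical week: the same half-hour fix every day, plus one larger one-off.
week = [("report job hung", 0.5)] * 5 + [("disk full", 2.0)]
print(problem_candidates(week))
```

The one-off disk incident cost more on its own, but it is the daily half-hour fix that is flagged for problem investigation. That fits the point above: the cost of a recurring workaround builds up quietly.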
An example: a night-shift operator was loading a computer-controlled furnace with work. They felt unexpected heat on their neck: the furnace heater was moving. Despite the operator's best endeavours to stop the motor, the heater collided with the containment vessel cradle, shearing the nylon gearing. The furnace was out of operation for two days during a busy period; indeed, there would not have been a night shift running otherwise. An incident review eventually exonerated the operator, finding a flaw deep in the furnace programming. If a switch was turned quickly through two positions at a certain stage in the sequence, the program did not register the intermediate position. Registering it would have put the heater into a "halt" state in a safe neutral position. The heater should not have started moving into position until a confirming "go" button was pressed.
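The article does not say how the furnace software read the switch. One common way this kind of flaw arises is polling: the controller reads the switch only at intervals, so a position held briefly between two reads is never seen. The Python simulation below is a hypothetical illustration of that mechanism, not the furnace's actual code. The position names `P1`/`P2`/`P3`, the tick timeline and the polling interval are all invented for the example.

```python
# Hypothetical switch positions: the source does not name them.
# "P2" stands for the intermediate position that should have halted the heater.
TICKS_PER_POLL = 5  # assumed polling period, in simulation ticks


def distinct_positions(samples):
    """Collapse consecutive repeats into the ordered list of positions seen."""
    seen = []
    for position in samples:
        if not seen or seen[-1] != position:
            seen.append(position)
    return seen


def polled(timeline, interval=TICKS_PER_POLL):
    """What a controller sees if it reads the switch only once every `interval` ticks."""
    return timeline[::interval]


# The operator turns the switch from P1 to P3, passing through P2 for just two ticks.
timeline = ["P1"] * 11 + ["P2"] * 2 + ["P3"] * 10

every_change = distinct_positions(timeline)              # ['P1', 'P2', 'P3']
what_poller_saw = distinct_positions(polled(timeline))   # ['P1', 'P3']
halt_registered = "P2" in what_poller_saw                # False: the halt never happens
```

Recording every transition as it happens, rather than sampling, would close this particular gap. The review's recommendation is more robust still: require an explicit "go" before the heater moves at all, so that a missed switch position cannot start motion on its own.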
An incident or problem review should focus on potential weaknesses in the process and look for ways to strengthen it and make it more robust. It's all too easy to blame the operator. A longer article, "Incident Reviews", is published on LinkedIn.
Why hire an IT Service Operations Consultant to carry out incident and problem reviews? Being neutral in an incident review is a tacit skill, and fixes for the underlying problem are not always obvious to those closely involved. An outsider can ask penetrating questions that may not otherwise get asked.
For more information on how we can help you with improving IT incident & problem management, including carrying out an ad hoc review, please write to robert@esm.solutions.