Closed Loop Incident Process
<strong>Closed</strong> <strong>Loop</strong> <strong>Incident</strong> <strong>Process</strong><br />
From fault detection to closure<br />
Andreas Gutzwiller<br />
Presales Consultant, Hewlett-Packard (Schweiz)<br />
HP Software<br />
and Solutions<br />
©2010 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice
<strong>Closed</strong> <strong>Loop</strong> <strong>Incident</strong> <strong>Process</strong> Solution<br />
The CLIP solution is a:<br />
– Highly automated fault-detection-to-recovery<br />
solution<br />
– Focused on end-to-end service<br />
availability and performance<br />
– Reducing mean time to recovery (MTTR)<br />
and improving mean time between<br />
system failures (MTBF)
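The two metrics named above can be computed from outage records. The following is a minimal sketch, assuming a hypothetical log of (failure detected, service restored) timestamps; the record format and function names are illustrative and are not part of the CLIP solution. Definitions of MTBF vary; here it is taken as the average uptime between the end of one outage and the start of the next.

```python
from datetime import datetime, timedelta

def mttr(outages):
    """Mean time to recovery: average duration of the outages."""
    total = sum((end - start for start, end in outages), timedelta())
    return total / len(outages)

def mtbf(outages):
    """Mean time between failures: average uptime between the end of one
    outage and the start of the next (needs at least two outages)."""
    ordered = sorted(outages)
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)

# Hypothetical outage log: (failure detected, service restored)
outages = [
    (datetime(2010, 3, 1, 8, 0), datetime(2010, 3, 1, 10, 0)),
    (datetime(2010, 3, 11, 10, 0), datetime(2010, 3, 11, 11, 0)),
    (datetime(2010, 3, 21, 11, 0), datetime(2010, 3, 21, 14, 0)),
]

print("MTTR:", mttr(outages))
print("MTBF:", mtbf(outages))
```

Shorter outages lower MTTR, and fewer recurring failures raise MTBF, which is the direction of improvement the slide describes.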
Agenda<br />
1. Event and <strong>Incident</strong> <strong>Process</strong>es<br />
2. Closing the <strong>Loop</strong><br />
3. Architecture<br />
4. Why CLIP
ITILv3 Linkage of Event & <strong>Incident</strong> Management<br />
Neither process can stand alone in today’s IT environments<br />
Event – A change of state or alert<br />
that has significance for the<br />
management of a Configuration<br />
Item (CI) or IT Service.<br />
<strong>Incident</strong> – Unplanned interruption,<br />
or reduction of quality, of an IT<br />
service<br />
IT Service – A deliverable combining people,<br />
processes & technology that supports<br />
a customer’s business processes<br />
Event Management<br />
• Responsible for managing events<br />
throughout their lifecycle. Main activity of IT<br />
Operations.<br />
• Event -> Filtered/Correlated -> Resolve or<br />
forward to <strong>Incident</strong> -> Close<br />
<strong>Incident</strong> Management<br />
• Includes any event which disrupts, or could<br />
disrupt, a service. Reported by users or IT staff<br />
• <strong>Incident</strong> -> Categorize /Prioritize -><br />
Diagnose -> Resolve -> Close<br />
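The two lifecycles above can be read as small state machines. The sketch below is an illustration only: the state names follow the slide, while the dictionaries and the `advance` helper are hypothetical and do not represent any HP product API.

```python
# Allowed transitions, taken from the two lifecycles on the slide.
EVENT_FLOW = {
    "new": {"filtered/correlated"},
    "filtered/correlated": {"resolved", "forwarded to incident"},
    "resolved": {"closed"},
    "forwarded to incident": {"closed"},
}
INCIDENT_FLOW = {
    "new": {"categorized/prioritized"},
    "categorized/prioritized": {"diagnosed"},
    "diagnosed": {"resolved"},
    "resolved": {"closed"},
}

def advance(flow, state, target):
    """Return the new state, or raise if the lifecycle does not allow the step."""
    if target not in flow.get(state, set()):
        raise ValueError(f"cannot go from {state!r} to {target!r}")
    return target

# Walk one event through the path that hands it over to Incident Management.
state = "new"
for step in ("filtered/correlated", "forwarded to incident", "closed"):
    state = advance(EVENT_FLOW, state, step)
print(state)
```

Encoding the transitions as data makes skipped steps, such as closing an incident that was never diagnosed, fail loudly instead of passing silently.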
ITIL Areas Involved in CLIP<br />
– Operations Bridge (aka NOC)<br />
• Central coordination point<br />
• Manages various classes of events<br />
• Detects incidents<br />
• Manages routine operational activities<br />
• Reports on the status and performance<br />
• May provide first-level support for those<br />
events which generate an incident<br />
– Service Desk<br />
• Single central point of contact for all<br />
users of IT<br />
• Logs and manages all incidents, service<br />
requests and access requests<br />
• Provides interface to all other Service<br />
Operation processes and activities<br />
“The Service Desk is not typically involved in Event Management …<br />
unless the Service Desk and Operations Bridge have been combined”<br />
Traditional <strong>Incident</strong> Management<br />
From diagnosis to resolution<br />
Process steps:<br />
1. Identify service performance degradation<br />
2. Troubleshoot problem to isolate root cause<br />
3. Identify actionable condition / changes to be implemented<br />
4. Create TT/RFC to implement change<br />
5. Implement and automate change to close RFC<br />
6. Update CMS (Federated CMDB)<br />
Callouts:<br />
1. Service performance notification<br />
2. Gather data to assign SME<br />
3. Bouncing the incident<br />
4. Ticket is finally assigned to the correct SME<br />
5. Impact analysis and change management<br />
6. Update CMDB - timely & correctly?<br />
Diagram labels: End User, Help Desk, “Fire Storms”, CMDB<br />
Multiple un-integrated systems and data stores, manually coordinated<br />
hand-offs → inconsistent troubleshooting, high MTTR<br />
SME: Subject Matter Experts
From Fault Detection To Recovery & Closure<br />
<strong>Closed</strong> <strong>Loop</strong> <strong>Incident</strong> <strong>Process</strong> solution for ITIL Event and <strong>Incident</strong><br />
Management<br />
Process stages (diagram):<br />
1. Event Generation & Detection<br />
2. Event Correlation & Business Impact<br />
3. <strong>Incident</strong> Submission<br />
4. Investigation & Diagnosis<br />
5. Resolution<br />
6. Recovery & Closure<br />
Legend – ITIL <strong>Process</strong>: Event Management, <strong>Incident</strong> Management
Event Generation & Detection<br />
Operations bridge console collects events & alerts<br />
from servers, networks, apps & 3rd party<br />
Challenge<br />
Bottom-up alert and event overload<br />
Lack of high-quality, cross-domain, “actionable”<br />
and causal event data<br />
Solution<br />
All events come to one place, correlated and<br />
enriched against an auto-updated service<br />
model<br />
User Example – Events to single console<br />
End user experience slow<br />
SQL slow query performance alert<br />
J2EE DB connection pool issue
Event Correlation & Business Impact<br />
Business services, business impact relationship,<br />
and SLAs determined<br />
Challenge<br />
Struggle to link causal events to top-down end-user<br />
experience and business impact<br />
Solution<br />
Proactive end-user experience monitoring linked to<br />
business process and business transaction<br />
flow to identify impact on high-revenue-generating<br />
services<br />
User Example - Cause from symptoms and impact<br />
Oracle database is the cause, found by topology-based<br />
correlation<br />
Critical funds transfer business service<br />
impacted
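Topology-based correlation, as in the user example above, can be sketched as a walk over a dependency graph. This is a simplified illustration under stated assumptions: the `DEPENDS_ON` service model, the CI names and the helper functions are hypothetical, and a real correlation engine applies far richer rules.

```python
# Hypothetical service model: each CI maps to the CIs it depends on.
DEPENDS_ON = {
    "funds-transfer service": ["j2ee app server"],
    "j2ee app server": ["oracle database"],
    "oracle database": [],
    "reporting service": ["file server"],
    "file server": [],
}

def depends_transitively(ci, target):
    """True if `ci` depends on `target` directly or through other CIs."""
    stack = list(DEPENDS_ON[ci])
    seen = set()
    while stack:
        cur = stack.pop()
        if cur == target:
            return True
        if cur not in seen:
            seen.add(cur)
            stack.extend(DEPENDS_ON[cur])
    return False

def probable_causes(alerting):
    """Alerting CIs that do not depend on any other alerting CI."""
    return {ci for ci in alerting
            if not any(depends_transitively(ci, other)
                       for other in alerting if other != ci)}

def impacted(cause):
    """All CIs and services that depend on the cause."""
    return {ci for ci in DEPENDS_ON if depends_transitively(ci, cause)}

# Three symptoms arrive; only the deepest alerting CI is reported as cause.
alerting = {"funds-transfer service", "j2ee app server", "oracle database"}
causes = probable_causes(alerting)
print(causes, impacted("oracle database"))
```

The same graph answers both questions on the slide: walking down finds the cause, walking up finds the business service that is impacted.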
<strong>Incident</strong> Submission<br />
Automatic submission to service desk with<br />
annotations and cause area<br />
Challenge<br />
Quality and enrichment of data<br />
Siloed, broken service lifecycle<br />
Duplication of effort wasting time<br />
Solution<br />
Better collaboration<br />
Automation and integration of the event-to-incident<br />
process lifecycle<br />
User Example - Automatic incident ticket creation<br />
Ticket visible to ops bridge<br />
Assignment to subject expert
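Automatic submission without duplicated effort can be sketched as follows. This is a minimal illustration: the `ServiceDesk` class, the field names and the `IM` ticket prefix are hypothetical stand-ins and not the interface of any real service desk product.

```python
from itertools import count

class ServiceDesk:
    """In-memory stand-in for a service desk (hypothetical, for illustration)."""

    def __init__(self):
        self._ids = count(1)
        self.open_by_cause = {}   # cause CI -> open incident

    def submit(self, event):
        """Create an incident for the event, or attach the event to the open
        incident for the same cause so that no duplicate ticket is raised."""
        incident = self.open_by_cause.get(event["cause_ci"])
        if incident is None:
            incident = {
                "id": f"IM{next(self._ids):05d}",
                "cause_area": event["cause_ci"],
                "annotations": [],
                "event_ids": [],
                "assignee": None,
            }
            self.open_by_cause[event["cause_ci"]] = incident
        incident["annotations"].append(event["text"])
        incident["event_ids"].append(event["id"])
        return incident

desk = ServiceDesk()
first = desk.submit({"id": "ev-1", "cause_ci": "oracle database",
                     "text": "SQL slow query performance alert"})
second = desk.submit({"id": "ev-2", "cause_ci": "oracle database",
                      "text": "J2EE DB connection pool issue"})
print(first["id"], first["event_ids"])
```

Keeping the event IDs on the ticket is what later allows the related events to be acknowledged when the incident is closed.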
Investigation & Diagnosis<br />
Problem isolation, SME tools, and knowledge management (KM)<br />
used to determine root cause<br />
Challenge<br />
Much of the resolution time is spent<br />
pinpointing the problem in a dynamic,<br />
heterogeneous IT environment<br />
<strong>Incident</strong> assigned and reassigned to multiple<br />
silos<br />
Solution<br />
Cross domain data visualization and analysis<br />
User Example - Diving deeper to find root cause<br />
Expert sees corrupt DB tables<br />
Finds runbook automation fix in<br />
knowledgebase
Resolution<br />
Change request with attached runbook automation<br />
to repair CIs<br />
Challenge<br />
Little or no automation leads to increased<br />
manual effort, impacting quality and efficiency<br />
Solution<br />
Expert-created/authorized runbook<br />
automation to empower lower-level teams<br />
Manage change, configuration, and release<br />
process<br />
User Example - <strong>Process</strong>ing the change<br />
Get change request approval<br />
Use runbook to reindex database tables
Recovery & Closure<br />
Automatically close the incident & related incidents,<br />
acknowledging related events<br />
Challenge<br />
Struggle to improve speed of restoration,<br />
recovery and closure of incidents, and to verify<br />
SLA/OLA compliance afterwards<br />
Solution<br />
Automate all notifications & updates,<br />
continuously monitor SLA/OLA compliance<br />
User Example – Verify the change worked<br />
User, DB and connection pool OK<br />
Ticket and events closed<br />
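The closure step in the example above can be sketched as a verify-then-close routine. This is an illustration under stated assumptions: the data layout, the `close_loop` function and the health checks are hypothetical, and real checks would query the monitoring tools.

```python
def close_loop(incident, incidents, events, checks):
    """Close the incident and everything linked to it, but only if every
    verification check passes. Returns True when the loop was closed."""
    if not all(check() for check in checks):
        return False
    for inc_id in [incident["id"], *incident["related"]]:
        incidents[inc_id]["status"] = "closed"
        for ev_id in incidents[inc_id]["event_ids"]:
            events[ev_id]["acknowledged"] = True
    return True

# Hypothetical data for the slide's example.
events = {e: {"acknowledged": False} for e in ("ev-1", "ev-2", "ev-3")}
incidents = {
    "IM1": {"id": "IM1", "status": "open", "related": ["IM2"],
            "event_ids": ["ev-1", "ev-2"]},
    "IM2": {"id": "IM2", "status": "open", "related": [],
            "event_ids": ["ev-3"]},
}
checks = [lambda: True,   # end-user response time back to normal
          lambda: True,   # database healthy
          lambda: True]   # connection pool healthy

closed = close_loop(incidents["IM1"], incidents, events, checks)
print(closed, incidents["IM2"]["status"])
```

If any check fails, nothing is closed, so the ticket and its events stay visible until the fix is confirmed.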
<strong>Closed</strong> <strong>Loop</strong> <strong>Incident</strong> <strong>Process</strong> Integration Points<br />
Integrated ITIL event and incident management process optimizing MTTR and MTBF<br />
Components (diagram): Monitoring, Integrated CMDB, Automation, Service Desk<br />
Integration points (numbered links in the diagram):<br />
1. Sharing CIs, topology and state information<br />
2. For creating and updating incidents<br />
3. For updating events<br />
4. <strong>Incident</strong>-, Problem- and Change-Mgmt<br />
5. Runbook automation to remediate
HP’s <strong>Closed</strong> <strong>Loop</strong> <strong>Incident</strong> <strong>Process</strong> Solution<br />
Integrated ITIL event and incident management process optimizing MTTR and MTBF<br />
Components (diagram):<br />
– BSM, receiving CIs, topology, events and status from Net, Ops, App, Other<br />
– UCMDB<br />
– Operations Orchestration with NA, SA, CA, SE, Other<br />
– Service Manager<br />
Integration points (numbered links in the diagram):<br />
1. CIs, topology, events, status measurements<br />
flowing into BSM<br />
2. Sharing events and topology<br />
3. For creating and updating incidents<br />
4. To access Business Impact View for a CI<br />
5. Runbook automation to enrich, diagnose and<br />
remediate<br />
6. Sharing CIs and state information<br />
7. Runbook automation to remediate
<strong>Closed</strong>-<strong>Loop</strong> <strong>Incident</strong> Mgmt <strong>Process</strong><br />
<strong>Incident</strong> management from diagnosis to automated resolution<br />
Process steps:<br />
1. Identify service performance degradation<br />
2. Troubleshoot problem to isolate root cause<br />
3. Identify changes to be implemented<br />
4. Create TT/RFC to implement change<br />
5. Implement and automate change to close RFC<br />
6. Update CMS (Federated CMDB)<br />
Closed-loop flow across business service management, IT service<br />
management and business service automation:<br />
1. Identify service performance issue<br />
2. Gather data to identify root cause<br />
3. Create RFC to make change<br />
4a. Initiate change<br />
4b. Review, assess, plan and govern change<br />
5a. Implement change<br />
5b. Close change request?<br />
6. Update Configuration Management System (Federated CMDB)<br />
• Key processes—incident, change and configuration—need to be tightly linked<br />
• Seamless process linkage requires tools to be consistently service-oriented<br />
<strong>Closed</strong> <strong>Loop</strong> <strong>Incident</strong> <strong>Process</strong> Key Benefits<br />
Drive innovation value of IT<br />
Cost<br />
Quality<br />
Transparency<br />
Agility<br />
Business risk<br />
• Drive efficiency through automation<br />
• Optimize service lifecycle process efficiency<br />
• Eliminate error-prone manual tasks<br />
• Predict and prevent negative business impact<br />
• The cost/value ratio of delivered services is understood by<br />
the business<br />
• Any service from everywhere<br />
• Saved labor can be spent on innovation<br />
• Measure and optimize time to develop and successfully<br />
deploy new services<br />
• Reduce risk of failure when deploying changes<br />
• Enable compliance<br />
72% lower maintenance cost<br />
2.5x increased availability and performance<br />
99.5% availability via integrated delivery<br />
30% faster time to market for new apps<br />
70% fewer bad changes<br />