11.11.2014 Views

Closed Loop Incident Process


<strong>Closed</strong> <strong>Loop</strong> <strong>Incident</strong> <strong>Process</strong><br />

From fault detection to closure<br />

Andreas Gutzwiller<br />

Presales Consultant, Hewlett-Packard (Schweiz)<br />

HP Software and Solutions<br />

©2010 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice


<strong>Closed</strong> <strong>Loop</strong> <strong>Incident</strong> <strong>Process</strong> Solution<br />

The CLIP solution is:<br />

– A highly automated fault detection-to-recovery solution<br />

– Focused on end-to-end service availability and performance<br />

– Built to reduce mean time to recovery (MTTR) and improve mean time between system failures (MTBF)
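
The two metrics named on this slide can be made concrete with a small calculation. The following is a minimal sketch and not part of the original slides. It assumes each incident is recorded as a (detected, restored) timestamp pair, and it uses one common convention in which the time between failures is the uptime from one restoration to the next detection. The function names and the sample data are hypothetical.

```python
from datetime import datetime, timedelta

def mean_time_to_recovery(incidents):
    """Average of (restored - detected) over all incidents."""
    durations = [restored - detected for detected, restored in incidents]
    return sum(durations, timedelta()) / len(durations)

def mean_time_between_failures(incidents):
    """Average uptime between one restoration and the next detection."""
    ordered = sorted(incidents)
    uptimes = [
        next_detected - prev_restored
        for (_, prev_restored), (next_detected, _) in zip(ordered, ordered[1:])
    ]
    return sum(uptimes, timedelta()) / len(uptimes)

# Hypothetical incident log: (detected, restored)
incidents = [
    (datetime(2010, 1, 1, 10, 0), datetime(2010, 1, 1, 11, 0)),
    (datetime(2010, 1, 3, 11, 0), datetime(2010, 1, 3, 11, 30)),
    (datetime(2010, 1, 6, 11, 30), datetime(2010, 1, 6, 12, 0)),
]

print(mean_time_to_recovery(incidents))       # 0:40:00
print(mean_time_between_failures(incidents))  # 2 days, 12:00:00
```

A closed loop aims to push the first number down (faster recovery) and the second number up (fewer, more widely spaced failures).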


Agenda<br />

1. Event and <strong>Incident</strong> <strong>Process</strong>es<br />

2. Closing the <strong>Loop</strong><br />

3. Architecture<br />

4. Why CLIP


ITILv3 Linkage of Event & <strong>Incident</strong> Management<br />

Neither process can stand alone in today’s IT environments<br />

Event – A change of state or alert that has significance for the management of a Configuration Item (CI) or IT Service.<br />

<strong>Incident</strong> – Unplanned interruption, or reduction of quality, of an IT service<br />

IT Service – People, processes & technology deliverable that supports a customer’s business processes<br />

Event Management<br />

• Responsible for managing events throughout their lifecycle. Main activity of IT Operations.<br />

• Event -> Filtered/Correlated -> Resolve or forward to <strong>Incident</strong> -> Close<br />

<strong>Incident</strong> Management<br />

• Includes any event which disrupts, or could disrupt, a service. Reported by users or IT staff<br />

• <strong>Incident</strong> -> Categorize/Prioritize -> Diagnose -> Resolve -> Close
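
The two flows on this slide can be read as a small state machine. The sketch below is illustrative only: the state names follow the arrows on the slide, while the `LOGGED` start state, the `advance` and `handle_event` helpers and the event fields (`significant`, `disrupts_service`) are assumptions made for the example.

```python
from enum import Enum

class IncidentState(Enum):
    LOGGED = 1
    CATEGORIZED = 2   # categorized and prioritized
    DIAGNOSED = 3
    RESOLVED = 4
    CLOSED = 5

# Allowed forward transitions, in the order given on the slide.
NEXT_STATE = {
    IncidentState.LOGGED: IncidentState.CATEGORIZED,
    IncidentState.CATEGORIZED: IncidentState.DIAGNOSED,
    IncidentState.DIAGNOSED: IncidentState.RESOLVED,
    IncidentState.RESOLVED: IncidentState.CLOSED,
}

def advance(state):
    """Move an incident to the next lifecycle state."""
    if state not in NEXT_STATE:
        raise ValueError(f"{state.name} is a terminal state")
    return NEXT_STATE[state]

def handle_event(event):
    """Event path: filter, then resolve in operations or forward as an incident."""
    if not event["significant"]:
        return "filtered"
    if event["disrupts_service"]:
        return "forwarded to incident management"
    return "resolved by operations"

state = IncidentState.LOGGED
while state is not IncidentState.CLOSED:
    state = advance(state)
print(state.name)  # CLOSED
```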


ITIL Areas Involved in CLIP<br />

– Operations Bridge (aka NOC)<br />

• Central coordination point<br />

• Manages various classes of events<br />

• Detects incidents<br />

• Manages routine operational activities<br />

• Reports on the status and performance<br />

• May provide first-level support for those events which generate an incident<br />

– Service Desk<br />

• Single central point of contact for all users of IT<br />

• Logs and manages all incidents, service requests and access requests<br />

• Provides interface to all other Service Operation processes and activities<br />

“The Service Desk is not typically involved in Event Management … unless the Service Desk and Operations Bridge have been combined”


Traditional <strong>Incident</strong> Management<br />

From diagnosis to resolution<br />

Process steps (boxes in the diagram):<br />

1 Identify service performance degradation<br />

2 Troubleshoot problem to isolate root cause<br />

3 Identify actionable condition / changes to be implemented<br />

4 Create TT/RFC to implement change<br />

5 Implement and automate change to close RFC<br />

6 Update CMS (Federated CMDB)<br />

Diagram callouts:<br />

1. Service performance notification<br />

2. Gather data to assign SME<br />

3. Bouncing the incident<br />

4. Ticket is finally assigned to the correct SME<br />

5. Impact analysis and change management<br />

6. Update CMDB - timely & correctly?<br />

Diagram labels: End User, Help Desk, “Fire Storms”, CMDB<br />

Multiple un-integrated systems and data stores, manually coordinated hand-offs → inconsistent troubleshooting, high MTTR<br />

SME: Subject Matter Expert


Agenda<br />

1. Event and <strong>Incident</strong> <strong>Process</strong>es<br />

2. Closing the <strong>Loop</strong><br />

3. Architecture<br />

4. Why CLIP


From Fault Detection To Recovery & Closure<br />

<strong>Closed</strong> <strong>Loop</strong> <strong>Incident</strong> <strong>Process</strong> solution for ITIL Event and <strong>Incident</strong> Management<br />

Phases of the loop (cycle diagram):<br />

1. Event Generation & Detection<br />

2. Event Correlation & Business Impact<br />

3. <strong>Incident</strong> Submission<br />

4. Investigation & Diagnosis<br />

5. Resolution<br />

6. Recovery & Closure<br />

ITIL <strong>Process</strong> legend: Event Management, <strong>Incident</strong> Management


Event Generation & Detection<br />


Operations bridge console collects events & alerts from servers, networks, apps & 3rd party<br />

Challenge<br />

• Bottom-up alert and event overload<br />

• Lack of qualitative cross-domain “actionable” and causal event data<br />

Solution<br />

• All events come to one place, correlated and enriched against an auto-updated service model<br />

User Example – Events to single console<br />

• End user experience slow<br />

• SQL slow query performance alert<br />

• J2EE DB connection pool issue
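
The idea of bringing all events to one place and enriching them against a service model can be sketched in a few lines. This is a simplified illustration, not the product's behavior: plain de-duplication stands in for correlation here, and the `Event` fields, the `SERVICE_MODEL` mapping and the CI names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    source: str   # monitoring domain that raised the event
    ci: str       # configuration item the event refers to
    message: str

# Hypothetical service model: CI -> business service it supports.
SERVICE_MODEL = {
    "web-frontend": "online banking",
    "sql-server": "online banking",
    "j2ee-app": "online banking",
}

def consolidate(events, service_model):
    """Drop duplicate (ci, message) events and tag each with its service."""
    seen = set()
    console = []
    for event in events:
        key = (event.ci, event.message)
        if key in seen:
            continue
        seen.add(key)
        service = service_model.get(event.ci, "unmapped")
        console.append((service, event))
    return console

events = [
    Event("end-user monitor", "web-frontend", "end user experience slow"),
    Event("database monitor", "sql-server", "slow query performance"),
    Event("application monitor", "j2ee-app", "DB connection pool issue"),
    Event("database monitor", "sql-server", "slow query performance"),  # duplicate
]
console = consolidate(events, SERVICE_MODEL)
for service, event in console:
    print(f"[{service}] {event.ci}: {event.message}")
```

The three events from the user example end up on one console, each tagged with the service it affects, which is the input the next phase needs.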


Event Correlation & Business Impact<br />


Business services, business impact relationship, and SLAs determined<br />

Challenge<br />

• Struggle to link causal events to top-down end-user experience and business impact<br />

Solution<br />

• Proactive end-user experience monitoring linked to business process and business transaction flow, to identify impact on high-revenue-generating services<br />

User Example - Cause from symptoms and impact<br />

• Oracle database is the cause (topology-based correlation)<br />

• Critical funds transfer business service impacted
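
Topology-based correlation, as in the user example, can be illustrated with a dependency graph: among the CIs that are raising events, the probable cause is the one that does not itself depend on another alerting CI. The sketch below rests on that simplified rule. The `DEPENDS_ON` graph, the CI names and both helper functions are hypothetical and only mirror the example on this slide.

```python
# Hypothetical topology: each CI lists the CIs it depends on.
DEPENDS_ON = {
    "funds-transfer": ["web-frontend"],
    "web-frontend": ["j2ee-app"],
    "j2ee-app": ["oracle-db"],
    "oracle-db": [],
}

def reachable(ci, depends_on):
    """All CIs that `ci` depends on, directly or transitively."""
    found, stack = set(), list(depends_on.get(ci, []))
    while stack:
        dep = stack.pop()
        if dep not in found:
            found.add(dep)
            stack.extend(depends_on.get(dep, []))
    return found

def probable_causes(alerting, depends_on):
    """Alerting CIs with no alerting CI anywhere beneath them."""
    return sorted(
        ci for ci in alerting if not (reachable(ci, depends_on) & alerting)
    )

def impacted_services(cause, business_services, depends_on):
    """Business services that depend on the cause CI."""
    return sorted(
        s for s in business_services if cause in reachable(s, depends_on)
    )

alerting = {"web-frontend", "j2ee-app", "oracle-db"}
causes = probable_causes(alerting, DEPENDS_ON)
print(causes)                                                      # ['oracle-db']
print(impacted_services(causes[0], {"funds-transfer"}, DEPENDS_ON))  # ['funds-transfer']
```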


<strong>Incident</strong> Submission<br />


Automatic submission to service desk with annotations and cause area<br />

Challenge<br />

• Quality and enrichment of data<br />

• Siloed, broken service lifecycle<br />

• Duplication of effort wasting time<br />

Solution<br />

• Better collaboration<br />

• Automation and integration of the event-to-incident process lifecycle<br />

User Example - Automatic incident ticket creation<br />

• Ticket visible to ops bridge<br />

• Assignment to subject matter expert
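
Automatic submission amounts to building a ticket from the correlated cause event, carrying the symptom events along as annotations, and routing by cause area. The following is a minimal sketch under those assumptions. The `Incident` fields, the `ASSIGNMENT_GROUPS` routing table and the event dictionary keys are hypothetical and do not reflect any particular service desk API.

```python
from dataclasses import dataclass, field

# Hypothetical routing table: cause area -> assignment group.
ASSIGNMENT_GROUPS = {"database": "DBA team", "network": "Network team"}

@dataclass
class Incident:
    title: str
    service: str
    cause_area: str
    assignment_group: str
    annotations: list = field(default_factory=list)

def submit_incident(cause_event, symptom_events, service):
    """Create a ticket from the cause event, annotated with its symptoms."""
    area = cause_event["area"]
    return Incident(
        title=f"{service}: {cause_event['message']}",
        service=service,
        cause_area=area,
        # Fall back to the service desk when no expert group is known.
        assignment_group=ASSIGNMENT_GROUPS.get(area, "Service Desk"),
        annotations=[e["message"] for e in symptom_events],
    )

ticket = submit_incident(
    {"area": "database", "message": "Oracle database degraded"},
    [{"message": "end user experience slow"},
     {"message": "DB connection pool issue"}],
    service="funds transfer",
)
print(ticket.assignment_group)  # DBA team
```

Because the ticket already names the cause area, it can go straight to the right expert group instead of bouncing between teams.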


Investigation & Diagnosis<br />


Problem isolation, SME tools, and KM used to determine root cause<br />

Challenge<br />

• Significant problem resolution time spent on pinpointing problem in a dynamic heterogeneous IT universe<br />

• <strong>Incident</strong> assigned and reassigned to multiple silos<br />

Solution<br />

• Cross-domain data visualization and analysis<br />

User Example - Diving deeper to find root cause<br />

• Expert sees corrupt DB tables<br />

• Finds runbook automation fix in knowledgebase


Resolution<br />


Change request with attached runbook automation to repair CIs<br />

Challenge<br />

• Little or no automation leads to increased manual effort, impacting quality and efficiency<br />

Solution<br />

• Expert-created/authorized runbook automation to empower lower-level teams<br />

• Manage change, configuration, and release process<br />

User Example - <strong>Process</strong>ing the change<br />

• Get change request approval<br />

• Use runbook to reindex database tables
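
The user example pairs an approval gate with an automated runbook. A minimal sketch of that pairing follows. It models a runbook as an ordered list of callables and a change request as a dictionary; both representations, the `run_change` helper and the step functions are hypothetical, and the steps only return strings rather than touching a real database.

```python
def run_change(change, runbook):
    """Execute the runbook steps only if the change request is approved."""
    if not change["approved"]:
        raise PermissionError("change request not approved")
    log = [step() for step in runbook]   # run steps in order
    change["status"] = "implemented"
    return log

# Hypothetical runbook steps for the reindex example.
def check_table_health():
    return "corrupt index found"

def reindex_tables():
    return "tables reindexed"

change = {"id": "RFC-1", "approved": True, "status": "open"}
log = run_change(change, [check_table_health, reindex_tables])
print(log)               # ['corrupt index found', 'tables reindexed']
print(change["status"])  # implemented
```

Keeping the approval check in front of the automation is what lets a lower-level team run an expert-authored fix without bypassing change management.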


Recovery & Closure<br />


Automatically close the incident & related incidents, acknowledging related events<br />

Challenge<br />

• Struggle to improve speed of restoration, recovery and closure of incident, and to verify SLA/OLA compliance afterwards<br />

Solution<br />

• Automate all notifications & updates, continuously monitor SLA/OLA compliance<br />

User Example – Verify the change worked<br />

• User, DB and connection pool OK<br />

• Ticket and events closed
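
Closing the loop means verifying recovery first and only then closing tickets and acknowledging events. The sketch below illustrates that order under simple assumptions: tickets and events are dictionaries, and each health check is a callable returning a boolean. The `close_loop` helper and the check names are hypothetical.

```python
def close_loop(incident, related_incidents, events, health_checks):
    """Close tickets and acknowledge events only if every check passes.

    Returns the names of failed checks; an empty list means closed.
    """
    failed = [name for name, check in health_checks.items() if not check()]
    if failed:
        return failed   # leave everything open for further diagnosis
    for ticket in [incident, *related_incidents]:
        ticket["status"] = "closed"
    for event in events:
        event["acknowledged"] = True
    return []

incident = {"id": "IM-1", "status": "resolved"}
related = [{"id": "IM-2", "status": "open"}]
events = [{"id": "EV-1", "acknowledged": False},
          {"id": "EV-2", "acknowledged": False}]
checks = {
    "end user response time": lambda: True,
    "database": lambda: True,
    "connection pool": lambda: True,
}

print(close_loop(incident, related, events, checks))  # []
print(incident["status"], related[0]["status"])        # closed closed
```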


Agenda<br />

1. Event and <strong>Incident</strong> <strong>Process</strong>es<br />

2. Closing the <strong>Loop</strong><br />

3. Architecture<br />

4. Why CLIP


<strong>Closed</strong> <strong>Loop</strong> <strong>Incident</strong> <strong>Process</strong> Integration Points<br />

Integrated ITIL event and incident management process optimizing MTTR and MTBF<br />

Diagram components: Monitoring, Integrated CMDB, Automation, Service Desk<br />

Integration points (numbered in the diagram):<br />

1. Sharing CIs, topology and state information<br />

2. For creating and updating incidents<br />

3. For updating events<br />

4. <strong>Incident</strong>-, Problem- and Change-Mgmt<br />

5. Runbook automation to remediate


HP’s <strong>Closed</strong> <strong>Loop</strong> <strong>Incident</strong> <strong>Process</strong> Solution<br />

Integrated ITIL event and incident management process optimizing MTTR and MTBF<br />

Diagram components: BSM (CIs, Topo, Events, Status); Net, Ops, App, Other; UCMDB; Operations Orchestration; NA, SA, CA, SE, Other; Service Manager<br />

Integration points (numbered in the diagram):<br />

1. CIs, topology, events, status measurements flowing into BSM<br />

2. Sharing events and topology<br />

3. For creating and updating incidents<br />

4. To access Business Impact View for a CI<br />

5. Runbook automation to enrich, diagnose and remediate<br />

6. Sharing CIs and state information<br />

7. Runbook automation to remediate


Agenda<br />

1. Event and <strong>Incident</strong> <strong>Process</strong>es<br />

2. Closing the <strong>Loop</strong><br />

3. Architecture<br />

4. Why CLIP


<strong>Closed</strong>-<strong>Loop</strong> <strong>Incident</strong> Mgmt <strong>Process</strong><br />

<strong>Incident</strong> management from diagnosis to automated resolution<br />

Process steps (boxes in the diagram):<br />

1 Identify service performance degradation<br />

2 Troubleshoot problem to isolate root cause<br />

3 Identify changes to be implemented<br />

4 Create TT/RFC to implement change<br />

5 Implement and automate change to close RFC<br />

6 Update CMS (Federated CMDB)<br />

Diagram callouts:<br />

1. Identify service performance issue<br />

2. Gather data to identify root cause<br />

3. Create RFC to make change<br />

4a. Initiate change<br />

4b. Review, assess, plan and govern change<br />

5a. Implement change<br />

5b. Close change request?<br />

6. Update Configuration Management System<br />

Diagram labels: Business service management, IT service management, Business service automation, Configuration Management System (Federated CMDB)<br />

• Key processes (incident, change and configuration) need to be tightly linked<br />

• Seamless process linkage requires tools to be consistently service-oriented


<strong>Closed</strong> <strong>Loop</strong> <strong>Incident</strong> <strong>Process</strong> Key Benefits<br />

Drive innovation value of IT<br />

Benefit areas: Cost, Quality, Transparency, Agility, Business risk<br />

• Drive efficiency through automation<br />

• Optimize service lifecycle process efficiency<br />

• Eliminate error-prone manual tasks<br />

• Predict and prevent negative business impact<br />

• The cost/value ratio of delivered services is understood by the business<br />

• Any service from everywhere<br />

• Saved labor can be spent on innovation<br />

• Measure and optimize time to develop and successfully deploy new services<br />

• Reduce risk of failure when deploying changes<br />

• Enable compliance<br />

Figures quoted on the slide:<br />

• 72% lower maintenance cost<br />

• 2.5x increased availability and performance<br />

• 99.5% availability via integrated delivery<br />

• 30% faster time to market for new apps<br />

• 70% fewer bad changes
