11.11.2014 Views

Closed Loop Incident Process

<strong>Closed</strong> <strong>Loop</strong> <strong>Incident</strong> <strong>Process</strong><br />

From fault detection to closure<br />

Andreas Gutzwiller<br />

Presales Consultant, Hewlett-Packard (Schweiz)<br />

HP Software<br />

and Solutions<br />

©2010 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice


<strong>Closed</strong> <strong>Loop</strong> <strong>Incident</strong> <strong>Process</strong> Solution<br />

The CLIP solution is:<br />

– A highly automated fault detection-to-recovery solution<br />

– Focused on end-to-end service availability and performance<br />

– Designed to reduce mean time to recovery (MTTR) and improve mean time between failures (MTBF)
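The two metrics named above can be computed from a log of outages. The following is a minimal sketch, assuming the common definitions (mean time to recovery = total downtime / number of outages, mean time between failures = total uptime / number of outages); the function name and the sample outage data are hypothetical and not taken from the presentation.

```python
from datetime import datetime, timedelta


def mttr_and_mtbf(outages, period_start, period_end):
    """Return (MTTR, MTBF) as timedeltas for a list of (down, up) intervals."""
    if not outages:
        raise ValueError("need at least one outage")
    downtime = sum((up - down for down, up in outages), timedelta())
    uptime = (period_end - period_start) - downtime
    count = len(outages)
    return downtime / count, uptime / count


# Hypothetical data: two outages inside a 30-day observation window.
start = datetime(2010, 1, 1)
end = start + timedelta(days=30)
outages = [
    (datetime(2010, 1, 5, 10, 0), datetime(2010, 1, 5, 12, 0)),  # 2 hours down
    (datetime(2010, 1, 20, 8, 0), datetime(2010, 1, 20, 9, 0)),  # 1 hour down
]
mttr, mtbf = mttr_and_mtbf(outages, start, end)
print("MTTR:", mttr)
print("MTBF:", mtbf)
```

Shorter outages lower the first value and fewer outages raise the second, which is the direction the solution aims for.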


Agenda<br />

1. Event and <strong>Incident</strong> <strong>Process</strong>es<br />

2. Closing the <strong>Loop</strong><br />

3. Architecture<br />

4. Why CLIP


1. Event and <strong>Incident</strong> <strong>Process</strong>es


ITILv3 Linkage of Event & <strong>Incident</strong> Management<br />

Neither process can stand alone in today’s IT environments<br />

Event – A change of state or alert that has significance for the management of a Configuration Item (CI) or IT Service.<br />

<strong>Incident</strong> – An unplanned interruption, or reduction in quality, of an IT service.<br />

IT Service – A deliverable of people, processes and technology that supports a customer’s business processes.<br />

Event Management<br />

• Responsible for managing events throughout their lifecycle. Main activity of IT Operations.<br />

• Event -> Filter/Correlate -> Resolve or forward to <strong>Incident</strong> -> Close<br />

<strong>Incident</strong> Management<br />

• Includes any event that disrupts, or could disrupt, a service. Reported by users or IT staff.<br />

• <strong>Incident</strong> -> Categorize/Prioritize -> Diagnose -> Resolve -> Close
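The two lifecycles on this slide can be read as a small state machine. The sketch below is an illustration only: the stage names follow the slide, while the `LOGGED` starting stage, the event dictionary keys `noise` and `service_affecting`, and both function names are assumptions made for the example.

```python
from enum import Enum


class Stage(Enum):
    LOGGED = 1
    CATEGORIZED = 2   # categorize / prioritize
    DIAGNOSED = 3
    RESOLVED = 4
    CLOSED = 5


def advance(stage):
    """Move an incident to the next stage; a closed incident cannot advance."""
    if stage is Stage.CLOSED:
        raise ValueError("incident is already closed")
    return Stage(stage.value + 1)


def handle_event(event):
    """Filter an event, resolve it in Event Management, or forward it."""
    if event.get("noise"):
        return "filtered"
    if event.get("service_affecting"):
        return "forwarded to incident"
    return "resolved and closed"


stage = Stage.LOGGED
while stage is not Stage.CLOSED:
    stage = advance(stage)
    print(stage.name)
```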


ITIL Areas Involved in CLIP<br />

– Operations Bridge (aka NOC)<br />

• Central coordination point<br />

• Manages various classes of events<br />

• Detects incidents<br />

• Manages routine operational activities<br />

• Reports on the status and performance<br />

• May provide first-level support for those events which generate an incident<br />

– Service Desk<br />

• Single central point of contact for all users of IT<br />

• Logs and manages all incidents, service requests and access requests<br />

• Provides interface to all other Service Operation processes and activities<br />

“The Service Desk is not typically involved in Event Management … unless the Service Desk and Operations Bridge have been combined”


Traditional <strong>Incident</strong> Management<br />

From diagnosis to resolution<br />

Process steps:<br />

1. Identify service performance degradation<br />

2. Troubleshoot problem to isolate root cause<br />

3. Identify actionable condition / changes to be implemented<br />

4. Create TT/RFC to implement change<br />

5. Implement and automate change to close RFC<br />

6. Update CMS (Federated CMDB)<br />

Diagram callouts (diagram labels: End User, Help Desk, “Fire Storms”, CMDB):<br />

1. Service performance notification<br />

2. Gather data to assign SME<br />

3. Bouncing the incident<br />

4. Ticket is finally assigned to the correct SME<br />

5. Impact analysis and change management<br />

6. Update CMDB - timely & correctly?<br />

Multiple un-integrated systems and data stores, manually coordinated hand-offs → inconsistent troubleshooting, high MTTR<br />

SME: Subject Matter Expert


2. Closing the <strong>Loop</strong>


From Fault Detection To Recovery & Closure<br />

<strong>Closed</strong> <strong>Loop</strong> <strong>Incident</strong> <strong>Process</strong> solution for ITIL Event and <strong>Incident</strong> Management<br />

Phases of the loop, spanning the ITIL Event Management and <strong>Incident</strong> Management processes:<br />

1. Event Generation & Detection<br />

2. Event Correlation & Business Impact<br />

3. <strong>Incident</strong> Submission<br />

4. Investigation & Diagnosis<br />

5. Resolution<br />

6. Recovery & Closure


Event Generation & Detection<br />

Operations bridge console collects events & alerts from servers, networks, apps & 3rd party<br />

Challenge<br />

• Bottom-up alert and event overload<br />

• Lack of qualitative cross-domain “actionable” and causal event data<br />

Solution<br />

• All events come to one place, correlated and enriched against an auto-updated service model<br />

User Example – Events to single console<br />

• End user experience slow<br />

• SQL slow query performance alert<br />

• J2EE DB connection pool issue
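As an illustration of events from several domains landing in one console, here is a minimal sketch in which simple de-duplication stands in for correlation and a lookup table stands in for the service model. The `Event` class, the `SERVICE_MODEL` mapping, the CI names and the source names are all hypothetical; none of this represents a product API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str   # monitoring domain that raised the event
    ci: str       # configuration item the event refers to
    message: str


# Hypothetical service model: CI -> business service it supports.
SERVICE_MODEL = {
    "web-frontend": "funds transfer",
    "oracle-db": "funds transfer",
    "j2ee-app-server": "funds transfer",
}


def consolidate(events, service_model):
    """Drop duplicate (ci, message) events and tag each with its service."""
    seen = set()
    console = []
    for event in events:
        key = (event.ci, event.message)
        if key in seen:
            continue
        seen.add(key)
        console.append({
            "source": event.source,
            "ci": event.ci,
            "message": event.message,
            "service": service_model.get(event.ci, "unknown"),
        })
    return console


events = [
    Event("end-user monitor", "web-frontend", "end user experience slow"),
    Event("db monitor", "oracle-db", "SQL slow query performance alert"),
    Event("app monitor", "j2ee-app-server", "DB connection pool issue"),
    Event("db monitor", "oracle-db", "SQL slow query performance alert"),
]
console = consolidate(events, SERVICE_MODEL)
for row in console:
    print(row)
```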


Event Correlation & Business Impact<br />

Business services, business impact relationships, and SLAs determined<br />

Challenge<br />

• Struggle to link causal events to top-down end-user experience and business impact<br />

Solution<br />

• Proactive end-user experience linked to business process and business transaction flow to identify impact on high-revenue-generating services<br />

User Example - Cause from symptoms and impact<br />

• Topology-based correlation identifies the Oracle database as the cause<br />

• Critical funds transfer business service impacted
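One simple way to picture topology-based correlation is sketched below, under the assumption that the topology is a dependency graph: a CI with an active event is treated as a probable cause when nothing it depends on is also alerting, and everything upstream of it counts as impacted. The graph, the CI names and both function names are hypothetical; the presentation does not describe the actual correlation algorithm.

```python
# Hypothetical topology: CI -> CIs it depends on.
DEPENDS_ON = {
    "funds-transfer-service": ["web-frontend"],
    "web-frontend": ["j2ee-app-server"],
    "j2ee-app-server": ["oracle-db"],
    "oracle-db": [],
}


def probable_causes(alerting, depends_on):
    """Alerting CIs none of whose direct or indirect dependencies alert."""
    def has_alerting_dependency(ci, seen):
        for dep in depends_on.get(ci, []):
            if dep in seen:
                continue
            seen.add(dep)
            if dep in alerting or has_alerting_dependency(dep, seen):
                return True
        return False

    return sorted(ci for ci in alerting
                  if not has_alerting_dependency(ci, set()))


def impacted_by(ci, depends_on):
    """Everything that depends, directly or indirectly, on the given CI."""
    result = set()
    frontier = [ci]
    while frontier:
        current = frontier.pop()
        for parent, deps in depends_on.items():
            if current in deps and parent not in result:
                result.add(parent)
                frontier.append(parent)
    return sorted(result)


alerting = {"web-frontend", "j2ee-app-server", "oracle-db"}
causes = probable_causes(alerting, DEPENDS_ON)
print("cause:", causes)
print("impacted:", impacted_by(causes[0], DEPENDS_ON))
```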


<strong>Incident</strong> Submission<br />

Automatic submission to service desk with annotations and cause area<br />

Challenge<br />

• Quality and enrichment of data<br />

• Siloed, broken service lifecycle<br />

• Duplication of effort wasting time<br />

Solution<br />

• Better collaboration<br />

• Automation and integration of the event-to-incident process lifecycle<br />

User Example - Automatic incident ticket creation<br />

• Ticket visible to ops bridge<br />

• Assignment to subject matter expert
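The idea of a ticket that arrives pre-filled with annotations and a cause area can be sketched as follows. The `Incident` fields, the `ASSIGNMENT_GROUPS` routing table and the event dictionary keys are assumptions made for illustration; they do not reflect any service desk product's data model.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    title: str
    cause_ci: str
    impacted_service: str
    assignment_group: str
    annotations: list = field(default_factory=list)
    related_event_ids: list = field(default_factory=list)


# Hypothetical routing table: cause area -> expert group.
ASSIGNMENT_GROUPS = {"database": "dba-team", "network": "network-team"}


def build_incident(cause_event, symptom_events, service):
    """Turn one causal event plus its symptoms into a pre-filled ticket."""
    group = ASSIGNMENT_GROUPS.get(cause_event["area"], "service-desk")
    return Incident(
        title=f"{service} impacted: {cause_event['message']}",
        cause_ci=cause_event["ci"],
        impacted_service=service,
        assignment_group=group,
        annotations=[f"symptom: {e['message']}" for e in symptom_events],
        related_event_ids=[cause_event["id"]] + [e["id"] for e in symptom_events],
    )


cause = {"id": "e2", "ci": "oracle-db", "area": "database",
         "message": "SQL slow query performance alert"}
symptoms = [
    {"id": "e1", "message": "end user experience slow"},
    {"id": "e3", "message": "DB connection pool issue"},
]
ticket = build_incident(cause, symptoms, "funds transfer")
print(ticket)
```

Keeping the related event IDs on the ticket is what later allows the events to be acknowledged when the incident is closed.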


Investigation & Diagnosis<br />

Problem isolation, SME tools, and KM used to determine root cause<br />

Challenge<br />

• Significant problem resolution time spent on pinpointing the problem in a dynamic, heterogeneous IT universe<br />

• <strong>Incident</strong> assigned and reassigned to multiple silos<br />

Solution<br />

• Cross-domain data visualization and analysis<br />

User Example - Diving deeper to find root cause<br />

• Expert sees corrupt DB tables<br />

• Finds runbook automation fix in knowledge base


Resolution<br />

Change request with attached runbook automation to repair CIs<br />

Challenge<br />

• Little or no automation leads to increased manual effort, impacting quality and efficiency<br />

Solution<br />

• Expert-created/authorized runbook automation to empower lower-level teams<br />

• Manage change, configuration, and release process<br />

User Example - <strong>Process</strong>ing the change<br />

• Get change request approval<br />

• Use runbook to reindex database tables
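The user example (approval first, then the runbook) amounts to an approval gate in front of an automated fix. Below is a minimal sketch of that gate; `ChangeRequest`, `RUNBOOKS` and `reindex_tables` are hypothetical names, and the runbook body is a placeholder that only reports what it would do.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    runbook: str
    target_ci: str
    approved: bool = False


def reindex_tables(ci):
    # Placeholder for the real remediation; only reports what it would do.
    return f"reindexed database tables on {ci}"


# Hypothetical catalogue of expert-authored runbooks.
RUNBOOKS = {"reindex_tables": reindex_tables}


def execute_change(change, runbooks=RUNBOOKS):
    """Run the attached runbook only once the change request is approved."""
    if not change.approved:
        raise PermissionError("change request has not been approved")
    return runbooks[change.runbook](change.target_ci)


change = ChangeRequest(runbook="reindex_tables", target_ci="oracle-db")
change.approved = True  # in practice set by the change management process
result = execute_change(change)
print(result)
```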


Recovery & Closure<br />

Automatically close incident & related incidents, acknowledging related events<br />

Challenge<br />

• Struggle to improve speed of restoration, recovery and closure of incidents, and to verify SLA/OLA compliance afterwards<br />

Solution<br />

• Automate all notifications & updates, continuously monitor SLA/OLA compliance<br />

User Example – Verify the change worked<br />

• User, DB and connection pool OK<br />

• Ticket and events closed
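Closing the loop means the ticket and its events are only closed once the fix has been verified. The sketch below shows one possible shape for that step, assuming health checks are plain callables and incidents and events are dictionaries; all names and keys are hypothetical.

```python
def close_loop(incident, events, checks):
    """Close the incident and acknowledge its events if every check passes.

    Returns the names of failed checks; an empty list means the loop closed.
    """
    failed = [name for name, check in checks.items() if not check()]
    if failed:
        return failed  # leave the incident and its events untouched
    incident["status"] = "closed"
    for event in events:
        if event["id"] in incident["related_event_ids"]:
            event["acknowledged"] = True
    return []


incident = {"status": "resolved", "related_event_ids": ["e1", "e2", "e3"]}
events = [{"id": i, "acknowledged": False} for i in ("e1", "e2", "e3", "e9")]
checks = {
    "end_user_response": lambda: True,   # stand-ins for real verification
    "database": lambda: True,
    "connection_pool": lambda: True,
}
failed = close_loop(incident, events, checks)
print(incident["status"], failed)
```

Events that are not linked to the incident (here `e9`) stay open, so unrelated problems are not hidden by the closure.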


3. Architecture


<strong>Closed</strong> <strong>Loop</strong> <strong>Incident</strong> <strong>Process</strong> Integration Points<br />

Integrated ITIL event and incident management process optimizing MTTR and MTBF<br />

Components: Monitoring, Integrated CMDB, Automation, Service Desk<br />

Integration points:<br />

1. Sharing CIs, topology and state information<br />

2. For creating and updating incidents<br />

3. For updating events<br />

4. <strong>Incident</strong>-, Problem- and Change-Mgmt<br />

5. Runbook automation to remediate


HP’s <strong>Closed</strong> <strong>Loop</strong> <strong>Incident</strong> <strong>Process</strong> Solution<br />

Integrated ITIL event and incident management process optimizing MTTR and MTBF<br />

Components:<br />

• Data sources: Net, Ops, App, Other (CIs, Topo, Events, Status)<br />

• BSM<br />

• UCMDB<br />

• Operations Orchestration<br />

• NA, SA, CA, SE, Other<br />

• Service Manager<br />

Integration points:<br />

1. CIs, topology, events, status measurements flowing into BSM<br />

2. Sharing events and topology<br />

3. For creating and updating incidents<br />

4. To access Business Impact View for a CI<br />

5. Runbook automation to enrich, diagnose and remediate<br />

6. Sharing CIs and state information<br />

7. Runbook automation to remediate


4. Why CLIP


<strong>Closed</strong>-<strong>Loop</strong> <strong>Incident</strong> Mgmt <strong>Process</strong><br />

<strong>Incident</strong> management from diagnosis to automated resolution<br />

Process steps:<br />

1. Identify service performance degradation<br />

2. Troubleshoot problem to isolate root cause<br />

3. Identify changes to be implemented<br />

4. Create TT/RFC to implement change<br />

5. Implement and automate change to close RFC<br />

6. Update CMS (Federated CMDB)<br />

Diagram callouts (layers: Business service management, IT service management, Business service automation, Configuration Management System (Federated CMDB)):<br />

1. Identify service performance issue<br />

2. Gather data to identify root cause<br />

3. Create RFC to make change<br />

4a. Initiate change<br />

4b. Review, assess, plan and govern change<br />

5a. Implement change<br />

5b. Close change request?<br />

6. Update Configuration Management System<br />

• Key processes (incident, change and configuration) need to be tightly linked<br />

• Seamless process linkage requires tools to be consistently service-oriented


<strong>Closed</strong> <strong>Loop</strong> <strong>Incident</strong> <strong>Process</strong> Key Benefits<br />

Drive innovation value of IT<br />

Cost: 72% lower maintenance cost<br />

• Drive efficiency through automation<br />

• Optimize service lifecycle process efficiency<br />

Quality: 2.5x increased availability and performance<br />

• Eliminate error-prone manual tasks<br />

• Predict and prevent negative business impact<br />

Transparency: 99.5% availability via integrated delivery<br />

• The cost/value ratio of delivered services is understood by the business<br />

• Any service from everywhere<br />

Agility: 30% faster time to market for new apps<br />

• Saved labor can be spent on innovation<br />

• Measure and optimize time to develop and successfully deploy new services<br />

Business risk: 70% fewer bad changes<br />

• Reduce risk of failure when deploying changes<br />

• Enable compliance
