DeepTriage: Automated Transfer Assistance for Incidents in Cloud Services

11/25/2020
by Phuong Pham, et al.

As cloud services grow and generate high revenues, downtime in these services is becoming increasingly expensive. To reduce loss and service downtime, a critical first step is incident triage, the process of assigning a service incident to the correct responsible team, in a timely manner. An incorrect assignment risks additional incident reroutings and increases the incident's time to mitigate by 10x. However, automated incident triage in large cloud services faces many challenges: (1) a highly imbalanced incident distribution across a large number of teams, (2) a wide variety of input data formats and data sources, (3) scaling to meet production-grade requirements, and (4) gaining engineers' trust in machine learning recommendations. To address these challenges, we introduce DeepTriage, an intelligent incident transfer service that combines multiple machine learning techniques (gradient boosted classifiers, clustering methods, and deep neural networks) in an ensemble to recommend the responsible team to triage an incident. Experimental results on real incidents in Microsoft Azure show that our service achieves an 82.9% F1 score. For highly impacted incidents, DeepTriage achieves F1 scores from 76.3% to 91.3%. We have applied best practices and state-of-the-art frameworks to scale DeepTriage to handle incident routing for all cloud services. DeepTriage has been deployed in Azure since October 2017 and is used by thousands of teams daily.
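
The abstract describes combining several models in an ensemble to recommend a responsible team, but it does not say how their outputs are merged. As a minimal sketch only, the example below assumes simple soft voting: each model's per-team probabilities are averaged and the top-scoring team is recommended. The function names, team names, scores, and the `min_confidence` threshold are all hypothetical illustrations, not details taken from DeepTriage.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Hypothetical type: one model's output, mapping team name -> probability.
TeamScores = Dict[str, float]


def soft_vote(model_scores: List[TeamScores]) -> TeamScores:
    """Average per-team probabilities across models.

    A team that a model does not mention contributes 0 for that model.
    This is plain soft voting, assumed here for illustration only.
    """
    if not model_scores:
        raise ValueError("need at least one model's scores")
    totals: Dict[str, float] = defaultdict(float)
    for scores in model_scores:
        for team, probability in scores.items():
            totals[team] += probability
    n_models = len(model_scores)
    return {team: total / n_models for team, total in totals.items()}


def recommend_team(
    model_scores: List[TeamScores], min_confidence: float = 0.5
) -> Optional[str]:
    """Return the top-scoring team, or None if the averaged score is too low.

    The threshold is an assumption: abstaining on low-confidence cases is
    one common way to avoid showing engineers weak recommendations.
    """
    averaged = soft_vote(model_scores)
    best_team = max(averaged, key=averaged.get)
    return best_team if averaged[best_team] >= min_confidence else None


# Made-up scores from three hypothetical models for a single incident.
gradient_boosted = {"Storage": 0.7, "Networking": 0.3}
clustering = {"Storage": 0.6, "Compute": 0.4}
neural_network = {"Storage": 0.8, "Networking": 0.2}

recommendation = recommend_team([gradient_boosted, clustering, neural_network])
print(recommendation)
```

With these made-up scores the averaged probability for "Storage" is 0.7, so it is recommended at the default threshold; raising `min_confidence` to 0.9 makes the function abstain and return `None`.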

02/12/2020

Building Reliable Cloud Services Using P# (Experience Report)

Cloud services must typically be distributed across a large number of ma...
05/03/2018

On Collaborative Model-driven Development of Microservices

Microservice Architecture (MSA) denotes an emerging architectural style ...
09/26/2019

CapExec: Towards Transparently-Sandboxed Services (Extended Version)

Network services are among the riskiest programs executed by production ...
10/17/2018

Cloud Service Provider Evaluation System using Fuzzy Rough Set Technique

Cloud Service Provider (CSPs) offers a wide variety of scalable, flexibl...
05/15/2019

Predicting Breakdowns in Cloud Services (with SPIKE)

Maintaining web-services is a mission-critical task. Any downtime of web...
08/08/2018

Cognitive system to achieve human-level accuracy in automated assignment of helpdesk email tickets

Ticket assignment/dispatch is a crucial part of service delivery busines...
03/10/2021

Designing a Bot for Efficient Distribution of Service Requests

The tracking and timely resolution of service requests is one of the maj...