Managing the Dynamic Datacenter

Datacenter Automation


Can ‘Big Data’ Prevent Major Service Outages?

New analytics are coming

If Amazon, Bank of America and Microsoft can’t contain service outages before they become colossal PR problems, the rest of us mere mortals have much to fear. It’s a safe bet that sometime in the next 60 days, another major consumer Internet service will make the headlines for melting down.

Given the potential losses, one might think that the IT organizations at these high-profile companies would be bulletproof. But they’re not. Despite countless millions in management investments, the state of the art is still unexpected outages, with recovery times ranging from a minimum of four hours to a maximum of two or three days. The evidence suggests that the same holds true for most Global 2000 enterprises and similarly sized service providers and government organizations. Multiple escalated incidents a week, plus one or two showstoppers a year, keep IT experts tied up for an average of four hours per incident.

The key issue that “prevents preventing” outages lies at the design core of the network and application monitoring systems in use today. Many are based on technology developed when a company’s network could still be visualized on a couple of PowerPoint slides. Applications had a one-to-one relationship with servers, networks were largely point-to-point, and users could be grouped by the router that served their office. The life of an IT manager was so much simpler.

These older monitoring systems are based on the idea that IT experts would define the performance thresholds, rules and exceptions necessary to identify unacceptable behavior. But today, the typical enterprise application infrastructure is so complex that you would have to gather dozens of IT managers to even begin to map it all out. Many infrastructures have reached a level of complexity that defies any IT organization’s ability to fully understand them. The result: unforeseen outages that often take days to resolve.

These monitoring systems are still great at generating the data required to understand system behavior – just ask any operations center that receives tens of thousands of alerts a day. The real challenge lies in making sense of those alerts and taking the right action before the train wreck occurs.
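To give a flavor of the problem, here is a minimal sketch of one step in taming alert floods – collapsing repeated alerts from the same source into a single entry per time window. This is purely illustrative (the alert fields, window size and tuple format are my own assumptions, not any vendor’s actual pipeline):

```python
from collections import defaultdict

def deduplicate(alerts, window=300):
    """Collapse repeated alerts from the same (host, check) pair
    into one kept entry per `window` seconds, tracking raw counts."""
    seen = {}                    # (host, check) -> timestamp of last kept alert
    kept = []
    counts = defaultdict(int)    # raw alert volume per (host, check)
    for ts, host, check, msg in sorted(alerts):
        key = (host, check)
        counts[key] += 1
        if key not in seen or ts - seen[key] >= window:
            seen[key] = ts
            kept.append((ts, host, check, msg))
    return kept, dict(counts)

# Hypothetical raw alert stream: (timestamp_sec, host, check, message)
alerts = [
    (0,  "web01", "cpu",  "CPU > 90%"),
    (30, "web01", "cpu",  "CPU > 90%"),
    (60, "web01", "cpu",  "CPU > 90%"),
    (45, "db01",  "disk", "Disk > 95%"),
]
kept, counts = deduplicate(alerts)
print(len(kept))                  # 2 distinct alert streams survive
print(counts[("web01", "cpu")])   # 3 raw CPU alerts collapsed into one
```

Real operations tools go much further – correlating alerts across hosts and services – but even this simple window-based suppression shows why raw alert counts say little about how many underlying problems exist.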

‘Big Data Analytics’ to the rescue
There’s a lot of chatter about the promise of Big Data in retail, healthcare and manufacturing – but in the realm of IT operations and application performance, not so much. Perhaps we can apply the lessons learned from ‘Big Data’ to solve our operational crises.

Unlocking the promise of Big Data requires two elements: aggregating and managing the data for fast access, and powerful analytics to uncover the information locked within. IT operations environments have been collecting and managing the data for decades – a typical large enterprise generates millions of data points an hour in monitoring metrics, log files and events. What’s missing from these environments is the analytics.

Luckily, a new generation of machine-learning analytics has arisen that is up to the task. These systems can process the information already collected by existing monitoring tools and, by “self-learning” its normal behavior, actually detect problems and identify their root cause as they develop.
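The core idea of “self-learning” can be sketched very simply: learn a baseline from historical metric samples, then flag new samples that deviate sharply from it. The sketch below uses a plain mean/standard-deviation baseline with a three-sigma threshold – a deliberately minimal stand-in, not the actual algorithm used by any commercial product; the metric values and threshold are invented for illustration:

```python
import math

def learn_baseline(history):
    """Learn a per-metric baseline (mean, standard deviation)
    from previously collected monitoring samples."""
    n = len(history)
    mean = sum(history) / n
    var = sum((x - mean) ** 2 for x in history) / n
    return mean, math.sqrt(var)

def is_anomalous(value, mean, std, threshold=3.0):
    """Flag a new sample that deviates more than `threshold`
    standard deviations from the learned baseline."""
    if std == 0:
        return value != mean
    return abs(value - mean) / std > threshold

# Hypothetical response-time samples (ms) for one service
history = [102, 98, 105, 99, 101, 97, 103, 100]
mean, std = learn_baseline(history)

print(is_anomalous(101, mean, std))   # False: within normal variation
print(is_anomalous(450, mean, std))   # True: sudden spike flagged
```

Production systems use far more sophisticated models – seasonal baselines, multivariate correlation, probabilistic scoring – but the principle is the same: the system, not a human expert, defines what “unacceptable behavior” looks like.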

If Big Data principles are applied to our complex IT applications and infrastructures, the service outage that made yesterday’s Wall Street Journal will tomorrow be a blip on a sys admin’s screen that is solved with a single mouse click.

More Stories By Kevin Conklin

Kevin Conklin is a 20-year network management veteran and Vice President of Prelert, which is leveraging recent developments in Artificial Intelligence to provide transformative IT management software solutions. Prelert is founded and managed by a team that has developed considerable expertise in the field through companies like Micromuse, Riversoft, Securify, Axentis, Concord, Smarts and Njini.