Managing the Dynamic Datacenter

Datacenter Automation


Datacenter Automation Authors: Yeshim Deniz, Elizabeth White, Pat Romanski, Liz McMillan, Glenn Rossman



Cloud Hardware and New Memory Controller Designs

An exclusive interview with Barbara P. Aichinger, co-founder of FuturePlus Systems and VP of New Business Development

"The Data Center operators do understand that quality does matter," noted Barbara P. Aichinger, co-founder of FuturePlus Systems and VP of New Business Development, in this exclusive interview with Cloud Expo Conference Chair Jeremy Geelan. "When they experience failures they call the supplier and the Tier 2 and 3 vendors just blame somebody else, like the DIMM vendor or the software."

Cloud Computing Journal: You seem to have some concerns about the actual cloud hardware. Can you explain?

Barbara P. Aichinger: Sure. My company, FuturePlus Systems, makes memory design validation equipment used by the engineers who design cloud hardware. Server and network equipment is governed by technology standards. The advantage of using standards is that you can buy one part from vendor A and another from vendor B, and because they are all designed to the same standard, they work together. The organizations that write these standards are international in nature and in most cases have a compliance standard associated with the technology standard. Vendors not only have to obey the standard itself but also have to pass a test, specified by the compliance portion of the standard, that proves their design meets the specification. This is a stamp of quality and interoperability. The problem we have today with cloud hardware is that at the very heart of all of this hardware is the JEDEC DDR Memory standard, but this standard has no compliance specification per se. Thus there is no third party checking this very critical portion of the design for quality and compliance.

Cloud Computing Journal: Why is it that there is no compliance standard for DDR Memory?

Aichinger: Good question. Last May (2013) at a JEDEC conference (JEDEC is the international standards organization that governs the DDR Memory specification) I asked that very question. The answer was a shrug of the shoulders and a response of 'well, we all work so closely together that we did not need one.' This was probably OK six or seven years ago, when the server market was dominated by a few large vendors and the memory controllers themselves came from only two major silicon vendors. In addition, proving compliance was very difficult, and only the large players could afford the equipment to perform such an analysis. Now, however, there are lots of vendors supplying cloud hardware, and new memory controller designs from smaller vendors are starting to proliferate in the market. As a result, we see memory error rates in the data center accelerating.

Cloud Computing Journal: How big is the problem?

Aichinger: Google, which has one of the largest data centers in the world, has definitely noticed the problem. It has worked with several academic researchers to study it. Two main works have resulted: "DRAM Errors in the Wild: A Large-Scale Field Study" and "Cosmic Rays Don't Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design."

At the Open Compute Project conference in January 2013, Facebook said that DDR Memory failures were the #2 failure in the data center. The data rates are not trivial. Given the growth that we see in data centers, we are seeing memory failures bring down servers hourly. This is a cost not only in downtime but also in labor to replace the system or the failing DIMM.

We have also heard the phrase 'ghost errors'. This is when a server goes down with a hard memory failure. The operators run all sorts of diagnostics and find no error; everything works fine. They boot back up, and the system continues to run for perhaps several weeks before it experiences another error. Because the failures seem to disappear and can never be found, the operators call them 'ghost errors'.
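Editor's sketch (not part of the interview): one way an operator might keep a record of intermittent memory errors between crashes is to snapshot the corrected and uncorrected error counters that the Linux EDAC driver exposes per memory controller, then compare snapshots over time. The example below is a minimal illustration under stated assumptions: it assumes a Linux host with an EDAC driver loaded, which publishes `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc/mcN/`. The function names `read_edac_counts` and `diff_counts` are hypothetical helpers introduced here, not anything described by FuturePlus or the interviewee.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Snapshot corrected (ce) and uncorrected (ue) error counts.

    Returns {controller_name: {"ce": int or None, "ue": int or None}}.
    An empty dict means no EDAC directory was found at `root`.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / fname).read_text().strip())
            except (OSError, ValueError):
                # Counter missing or unreadable on this platform.
                entry[key] = None
        counts[mc.name] = entry
    return counts


def diff_counts(before, after):
    """Return only the counters that increased between two snapshots."""
    changes = {}
    for name, new in after.items():
        old = before.get(name, {})
        delta = {}
        for key in ("ce", "ue"):
            a, b = old.get(key), new.get(key)
            if a is not None and b is not None and b > a:
                delta[key] = b - a
        if delta:
            changes[name] = delta
    return changes


if __name__ == "__main__":
    # Print the current snapshot; a real monitor would store snapshots
    # with timestamps and compare them with diff_counts().
    print(read_edac_counts())
```

Logging such snapshots periodically would not explain a 'ghost error', but a rising corrected-error count on one controller ahead of a crash is the kind of evidence that post-crash diagnostics alone would miss.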

Cloud Computing Journal: How are Data Centers responding?

Aichinger: They are doing a lot of head scratching. They have cost pressures and quality concerns. From what we have been told, there is a push to commoditize the server market, that is, to have no distinction between the Tier 1 and the lower Tier 2 or Tier 3 vendors. The data center operators do understand that quality does matter. When they experience failures they call the supplier, and the Tier 2 and 3 vendors just blame somebody else, like the DIMM vendor or the software. We have seen all sorts of finger pointing. Even the DIMM connector vendors get blamed, even though there is really no proof behind the claim. The Tier 1 vendors will often try to study the problem. They will bring the machine back to their facility and try to recreate the problem. One of our Tier 1 customers told us that they can recreate the failure only 30% of the time.

Cloud Computing Journal: What is the answer here? Can we have low cost and high quality?

Aichinger: I think we can. The first step would be for the customers to demand qualification of the memory subsystem. This is what we at FuturePlus are trying to do. We are trying to alert the end user to the problem. The suppliers of this hardware are more than likely going to take the easy way out and not validate their designs. Oftentimes you have system integrators who have no idea where the motherboard or the memory came from and can't even tell you what speed the memory is operating at. The companies that run these data centers are going to have to come up to speed on basic computer architecture so they don't get the wool pulled over their eyes when buying this hardware.

More Stories By Liz McMillan

News Desk compiles and publishes breaking news stories, press releases and latest news articles as they happen.
