By Liz McMillan | September 30, 2013 03:25 PM EDT
"The Data Center operators do understand that quality does matter," noted Barbara P. Aichinger, co-founder of FuturePlus Systems and VP of New Business Development, in this exclusive interview with Cloud Expo Conference Chair Jeremy Geelan. "When they experience failures they call the supplier and the Tier 2 and 3 vendors just blame somebody else, like the DIMM vendor or the software."
Cloud Computing Journal: You seem to have some concerns about the actual cloud hardware. Can you explain?
Barbara P. Aichinger: Sure. My company, FuturePlus Systems, makes memory design validation equipment used by the engineers who design cloud hardware. This server and network equipment is governed by technology standards. The advantage of using standards is that you can buy one part from vendor A and another from vendor B, and because they are all designed to the same standard they work together. The standards organizations that write these standards are international in nature and in most cases have a compliance standard associated with the technology standard. Vendors have to not only obey the standard itself but also pass a test, specified by the compliance portion of the standard, that proves their design meets the specification. This is a stamp of quality and interoperability. The problem we have today with cloud hardware is that at the very heart of all of it is the JEDEC DDR Memory standard, and this standard has no compliance specification per se. Thus there is no third party checking this very critical portion of the design for quality and compliance.
Cloud Computing Journal: Why is it that there is no compliance standard for DDR Memory?
Aichinger: Good question. Last May (2013) at a JEDEC conference (JEDEC is the international standards organization that governs the DDR Memory specification) I asked that very question. The answer was a shrug of the shoulders and a response of 'well, we all work so closely together that we did not need one.' This was probably okay 6 or 7 years ago, when the server market was dominated by a few large vendors and the memory controllers themselves came from only two major silicon vendors. In addition, proving compliance was very difficult, and only the large major players could afford the equipment to perform such an analysis. Now, however, there are lots of vendors supplying cloud hardware, and new memory controller designs from smaller vendors are starting to proliferate in the market. As a result, we see memory error rates in the data center accelerating.
Cloud Computing Journal: How big is the problem?
Aichinger: Google, which has some of the largest data centers in the world, has definitely noticed the problem. They have worked with several researchers in academia to study it. Two main papers have resulted: "DRAM Errors in the Wild: A Large-Scale Field Study" and "Cosmic Rays Don't Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design."
At the Open Compute Project conference in January 2013, Facebook said that DDR Memory failures were the #2 failure in the data center. The failure rates are not trivial: given the growth we see in data centers, memory failures are bringing down servers hourly. This is a cost not only in downtime but also in the labor to replace the system or the failing DIMM.
We have also heard the phrase 'ghost errors.' This is when a server goes down with a hard memory failure, the operators run all sorts of diagnostics, and they find no error; everything works fine. They boot the machine back up and it runs for perhaps several weeks before it experiences another error. Because they can never find the failures, which seem to disappear, they call them 'ghost errors.'
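The interview does not describe which diagnostics the operators run, but as an illustrative aside: on Linux servers the kernel's EDAC subsystem exposes per-memory-controller counts of correctable and uncorrectable DRAM errors under sysfs, and polling those counters is one low-cost way to spot a DIMM that is degrading before a hard failure. A minimal sketch, assuming a kernel with EDAC enabled (paths and attribute availability vary by platform and driver):

```python
# Minimal sketch: poll Linux EDAC counters for DRAM errors.
# Assumes a kernel with the EDAC subsystem enabled and populated under
# /sys/devices/system/edac/mc; availability varies by platform and driver.
import glob
import os

def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (correctable, uncorrectable)} error counts."""
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(edac_root, "mc*"))):
        try:
            with open(os.path.join(mc_dir, "ce_count")) as f:
                ce = int(f.read().strip())
            with open(os.path.join(mc_dir, "ue_count")) as f:
                ue = int(f.read().strip())
        except (OSError, ValueError):
            continue  # controller not instrumented on this platform
        counts[os.path.basename(mc_dir)] = (ce, ue)
    return counts

if __name__ == "__main__":
    for mc, (ce, ue) in read_edac_counts().items():
        note = "  <-- investigate" if ce or ue else ""
        print(f"{mc}: correctable={ce} uncorrectable={ue}{note}")
```

A steadily climbing correctable-error count on one controller is often the only advance warning before the kind of hard, unreproducible failure described above.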
Cloud Computing Journal: How are Data Centers responding?
Aichinger: They are doing a lot of head-scratching. They have cost pressures and quality concerns. From what we have been told, there is a push to commoditize the server market, that is, to have no distinction between the Tier 1 vendors and the lower Tier 2 or Tier 3 vendors. The data center operators do understand that quality does matter. When they experience failures they call the supplier, and the Tier 2 and 3 vendors just blame somebody else, like the DIMM vendor or the software. We have seen all sorts of finger-pointing. Even the DIMM connector vendors get blamed, though there is really no proof behind the claim. The Tier 1 vendors will often try to study the problem: they will bring the machine back to their facility and try to recreate it. One of our Tier 1 customers told us that they can recreate the failure only 30% of the time.
Cloud Computing Journal: What is the answer here? Can we have low cost and high quality?
Aichinger: I think we can. The first step would be for customers to demand qualification of the memory subsystem. This is what we at FuturePlus are trying to do: we are trying to alert the end user to the problem. The suppliers of this hardware are more than likely going to take the easy way out and not validate their designs. Oftentimes you have system integrators who have no idea where the motherboard or the memory came from and can't even tell you what speed the memory is operating at. The companies that run these data centers are going to have to come up to speed on basic computer architecture so they don't get the wool pulled over their eyes when buying this hardware.
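Aichinger does not name any particular tooling here, but as an illustration of the kind of basic check she is describing: on a Linux server the SMBIOS/DMI tables report each installed DIMM's manufacturer, rated speed, and the speed it is actually configured to run at, and the standard dmidecode utility can dump them. A minimal sketch (requires root; field names such as "Configured Memory Speed" vary with BIOS and dmidecode versions):

```python
# Minimal sketch: report per-DIMM vendor, size, and operating speed from
# SMBIOS data via the standard dmidecode utility (run as root on Linux).
# Field names vary by BIOS and dmidecode version; treat these as examples.
import subprocess

def dimm_report():
    out = subprocess.run(["dmidecode", "-t", "memory"],
                         capture_output=True, text=True, check=True).stdout
    dimms, current = [], None
    for raw in out.splitlines():
        line = raw.strip()
        if line.startswith("Handle "):      # a new DMI record begins
            current = None
        elif line == "Memory Device":       # DMI type 17: an individual DIMM slot
            current = {}
            dimms.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return dimms

if __name__ == "__main__":
    for d in dimm_report():
        if d.get("Size", "No Module Installed") == "No Module Installed":
            continue  # empty slot
        print(d.get("Locator", "?"), d.get("Manufacturer", "?"), d.get("Size", "?"),
              "rated:", d.get("Speed", "?"),
              "running at:", d.get("Configured Memory Speed",
                                   d.get("Configured Clock Speed", "?")))
```

Knowing whether a DIMM rated for one speed is actually being clocked lower is exactly the kind of basic visibility Aichinger argues buyers should insist on.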