Modern data centres are complex systems that need to operate within tolerances if their owners and/or users are to extract maximum efficiency and the best possible value from the IT infrastructure they contain. Dave Wolfenden from Mafi Mushkila explains
Several decades of IT history have shown that, if elements of a data centre – or the entire system – are operated outside normal tolerances, the efficiency and reliability of the hardware and allied systems progressively degrade.
Put simply, this situation decreases the time between equipment failures – a measure that IT professionals and engineers call the Mean Time Between Failures (MTBF).
Although it may sound complex, the MTBF of a component or system is simply a measure of how reliable a hardware product or element is.
For most IT components, the measure is typically in thousands or even tens of thousands of hours between failures. For example, a disk drive system may have a mean time between failures of 300,000 hours.
The MTBF figure is usually developed by the manufacturer or supplier as the result of intensive testing, based on actual product experience, or predicted by analysing known factors.
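An MTBF figure can be translated into a survival probability over a given period. A minimal sketch, under the common assumption of a constant failure rate (an exponential failure model – the modelling choice here is ours, not something stated in the article), using the 300,000-hour disk drive figure from above:

```python
import math

def survival_probability(mtbf_hours: float, period_hours: float) -> float:
    """Probability that a component survives `period_hours` without failing,
    assuming a constant failure rate of 1 / MTBF (exponential model)."""
    return math.exp(-period_hours / mtbf_hours)

# Disk drive with an MTBF of 300,000 hours, over one year (8,760 hours)
p = survival_probability(300_000, 8_760)
```

For a single drive the one-year survival probability is high (roughly 97%), but across a hall containing thousands of such drives, some failures in any given year are a near certainty – which is why MTBF matters at data-centre scale.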
If multiple components or systems start to fail in a data centre, the efficiency of the centre will take a nosedive. As these failures compound, they can trigger an automatic shutdown of the entire data centre.
The reason for this is that, whilst a couple of decades ago a typical data centre was manned on an extended hours basis – or had local engineers available on-call – today’s centres are rarely manned and often located on client sites, requiring an engineer visit if something goes wrong.
Whilst systems redundancy – which adds to the expense of the data centre – can help to ensure 24×7 operations even when a piece of hardware fails, ultimately an engineer will still have to visit the site to swap out or remediate the faulty equipment.
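The benefit of that redundancy can be quantified. A sketch, assuming independent failures and an illustrative per-unit availability figure (the 99% value is our assumption for the example, not a figure from the article): with units in parallel, the system is down only when every unit is down.

```python
def parallel_availability(unit_availability: float, n_units: int) -> float:
    """Availability of n independent units in parallel (N+N-style
    redundancy): the system fails only if every unit fails at once."""
    return 1.0 - (1.0 - unit_availability) ** n_units

# A single unit at 99% availability versus a redundant pair
single = parallel_availability(0.99, 1)  # 0.99 (about 3.7 days' downtime/year)
pair = parallel_availability(0.99, 2)    # ~0.9999 (under an hour/year)
```

This is why duplicated power and cooling paths are worth their capital cost – but note that the arithmetic only holds while each failed unit is promptly repaired, which brings the discussion back to engineer call-outs.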
This can be an expensive option where the centre is located remotely. And it gets really expensive in sparsely populated countries, where an engineer’s visit may necessitate journey times of several hours – and often involve travel by light plane or helicopter.
It’s for this reason that operating a data centre – and allied IT systems – within normal operating tolerances is very important. If tolerances are not maintained, then a partial or complete shutdown can add significantly to the operating costs of the centre, as well as dramatically decreasing client satisfaction levels.
But the effects of running outside tolerances can be subtler than a partial or complete shutdown. The risk of running higher temperatures, for example, does not so much involve electronic breakdowns but generally results in material changes, such as problems with insulation, wiring and connectors.
Some older connectors, for instance, will corrode as temperatures start to creep up. It's worth noting that higher temperatures do not normally present a problem for the people working in the data centres, for the simple reason that there are no staff: a growing number of centres operate on a 'lights out' or dark basis, so as to reduce their energy footprint as well as their staffing requirements.
The downside of this arrangement is that an automated system is far less able than a human member of staff to spot a temperature runaway situation in its early stages.
This is especially true where the affected systems are localised, meaning that air cooling – typically in a cold aisle environment – will compensate for a heat problem for some time, until the situation gets out of hand. Being able to minimise the potential for a temperature runaway scenario is, therefore, a significant advantage where unmanned data centres are concerned.
In an ideal world, modern data centres have a 20-year life span before they need to be replaced, usually for efficiency and obsolescence reasons. When temperature issues raise their ugly head, however, the lifespan is usually reduced – and, by implication, the Opex (operating expenditure) costs of that centre are increased.
It's important here to understand that it does not matter whether the design of a data centre is old or new, since most centres include on-board, integrated monitoring capabilities as standard.
These systems normally allow remote monitoring systems to control most aspects of temperature and power consumption, but their operation – crucially – presumes that the testing and installation phase has been completed correctly.
This situation is similar to the dashboard diagnostic systems in a modern motor vehicle – warning lights will illuminate and alarms will sound in most modern cars, alerting the driver if something goes wrong.
These engine and transmission-related diagnostic systems are, however, only as good as their installation process – put simply, if they are not installed or calibrated correctly, then the diagnostic alerts they generate will not be reliable. The same is true for data centre monitoring and diagnostic systems.
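To illustrate the kind of threshold-based alerting such monitoring systems perform, here is a minimal sketch of a temperature alarm with hysteresis, so a sensor hovering around the limit does not flap between states. The threshold values are illustrative assumptions on our part, not vendor-specific or standard-mandated figures:

```python
# Illustrative temperature alarm check with hysteresis. Thresholds are
# assumptions for the example, not values from any particular standard.
WARN_C = 27.0   # assumed upper warning threshold, in Celsius
CLEAR_C = 25.0  # temperature must fall below this to clear the alarm

def check_alarm(reading_c: float, alarm_active: bool) -> bool:
    """Return the new alarm state given a sensor reading in Celsius."""
    if alarm_active:
        # Once tripped, stay active until the reading cools below CLEAR_C
        return reading_c >= CLEAR_C
    # Otherwise trip only when the reading exceeds the warning threshold
    return reading_c > WARN_C

state = False
for t in (24.0, 26.5, 27.5, 26.0, 24.5):
    state = check_alarm(t, state)
```

The point of the example is the calibration dependency mentioned above: if `WARN_C` is set wrongly at installation time, every downstream alert the system generates is unreliable, however sophisticated the rest of the monitoring stack may be.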
It's also worth noting that, while some IT system vendors offer highly comprehensive warranties on their kit, those warranties can be declared invalid when hardware fails as a result of temperature or power runaway issues – a point worth clarifying before any claims are submitted.
These void warranty situations – whilst perhaps understandable from the vendor’s perspective – can add to the Capex (capital expenditure) costs of the IT systems involved.
Nor is this a theoretical issue: when we recently helped to design a data centre for a major bank, the potential for void warranties caused by IT systems operating outside normal tolerances was a key issue for the commissioning staff concerned.
One interesting issue is the challenge of operating a data centre on a partial occupancy basis. This may be due to the client wanting to build options for future expansion into the centre – or it may simply be due to an occupant company delaying or cancelling its involvement in the project.
Where a data centre is only partially occupied, the operating efficiencies of the centre are rarely anywhere near as good as a fully occupied centre, meaning that loading and temperature testing of the systems are all the more important.
And as the current round of economic issues continues to dog businesses – as it has done over the last six or seven years – there is a significant possibility that occupants of a normally fully occupied, shared data centre may withdraw, leaving a partially occupied centre in operation.
There is a strong need for data centres to operate within tolerances at all times in order to minimise their downtime and maximise efficiencies.
Whilst operating a data centre at lower-than-normal temperatures does not have any significant effect on the reliability and efficiency of the centre, operating at higher-than-normal temperatures will usually result in significant impairments to the efficiency of the systems – even where the excess over tolerances is minimal.
These efficiency impairments – which can also result in a cascade failure of components and systems as the temperature climbs – can directly affect power efficiencies, as well as ongoing Capex and Opex costs.
Using a reputable and tried-and-tested heat loading system means that companies can fully test a centre at the installation stages, rather than having to expensively retrofit technology solutions when problems start to occur.
A dependable, experienced company can also be useful at the design stage of a data centre project, as professionals can advise on the best options to avoid problems after the centre has been commissioned.
One example is the importance of reducing the chances of cooling system failures in a modern data centre: cooling problems increase the likelihood of partial or complete IT systems failures, as well as reducing the effective lifetime of the centre itself.
The best method of testing a data centre is to use a device known as a heat load bank. These units – which can be deployed on a rack or freestanding basis – create an electrical load on the centre’s power facility, as well as producing a heat load. The heat load banks are also useful for facilitating testing of the centre’s electrical and cooling systems in a controlled environment.
In conjunction with several leading mechanical and electrical consultants, Mafi Mushkila has developed two rack-mountable heat load units – a 3U 2kW model and a 3U 3.75kW model – that closely replicate the characteristics of IT servers.
The company also has in excess of 6MW of rack-mounted units available, plus a further 8MW of floor-standing 15kW and 22kW three-phase heat load units.
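Sizing a heat load test is then simple arithmetic: divide the IT load you want to replicate by the rating of the units on hand. The unit ratings below come from the text; the target load figures are illustrative assumptions:

```python
import math

def units_required(target_kw: float, unit_kw: float) -> int:
    """Smallest number of load bank units whose combined rating
    meets or exceeds the target load in kW."""
    return math.ceil(target_kw / unit_kw)

# Replicating an assumed 500 kW hall with 22 kW floor-standing units...
floor_units = units_required(500, 22)  # 23 units
# ...or an assumed 40 kW row with 2 kW rack-mounted units
rack_units = units_required(40, 2)     # 20 units
```

Because rack-mounted units occupy real rack positions and draw from real power strips, a test configured this way also exercises the distribution and cooling paths exactly as live IT equipment would.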
Rack-mounted units can be provided to clients with C14, C20, UK 3-pin 13A or 16A commando plugs.