Re: Recommended hard drive temperature



On 17 Apr 2008 13:22:52 GMT, Arno Wagner <me@xxxxxxxxxxx> put finger
to keyboard and composed:

Previously Franc Zabkar <fzabkar@xxxxxxxxxxxxxxxxx> wrote:
On 16 Apr 2008 22:10:18 GMT, Arno Wagner <me@xxxxxxxxxxx> put finger
to keyboard and composed:

Previously Franc Zabkar <fzabkar@xxxxxxxxxxxxxxxxx> wrote:
On 16 Apr 2008 12:20:06 GMT, Arno Wagner <me@xxxxxxxxxxx> put finger
to keyboard and composed:

Bottom line, the Google study shows that if you can get the drives
consistently down to below 40C, temperature does not matter a lot.
So the recommendation would be to have your drives (under load,
on a hot day) below 40C at all times. Note that this also applies
to external enclosures.

Arno

AFAICS, the Google study conclusively shows that failure rates also
increase when temperatures drop below 35C. In fact lower temps appear
to be more dangerous than slightly higher temps, except when the drive
is getting old, in which case higher temps start to become
significant.

Don't read too much into it. AFAIR they did not separate by
manufacturer, model and manufacturing date. It is quite possible that
the drives running at lower temperatures were actually from a batch
that had less life expectancy from the start and stay at lower
temperatures because of different cooling characteristics, i.e. there
may well be a systematic error in the measurements.
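
To illustrate the population-mix concern above, here is a minimal sketch in
Python. All counts are invented for illustration and are not taken from the
Google paper; the batch names and temperature bands are hypothetical. Within
each batch the failure rate is the same in both bands, yet the pooled figures
make the cool drives look worse simply because the weaker batch happens to run
cooler.

```python
# Hypothetical counts, invented purely for illustration (NOT from the paper).
# (batch, temperature band) -> (drives, failures in one year)
counts = {
    ("batch_A", "cool"): (900, 90),   # weak batch, mostly runs cool
    ("batch_A", "warm"): (100, 10),
    ("batch_B", "cool"): (100, 2),    # strong batch, mostly runs warm
    ("batch_B", "warm"): (900, 18),
}


def failure_rate(rows):
    """Failures divided by drives, summed over the given (drives, failures) rows."""
    drives = sum(d for d, _ in rows)
    failures = sum(f for _, f in rows)
    return failures / drives


def batch_rate(batch, band):
    """Failure rate for one batch in one temperature band."""
    return failure_rate([counts[(batch, band)]])


def pooled_rate(band):
    """Failure rate for a temperature band with all batches mixed together."""
    return failure_rate([v for (_, t), v in counts.items() if t == band])


for band in ("cool", "warm"):
    print(band,
          "A:", batch_rate("batch_A", band),
          "B:", batch_rate("batch_B", band),
          "pooled:", pooled_rate(band))
```

Whether anything like this happened in the real data cannot be told from the
paper; the sketch only shows that a pooled temperature curve alone does not
rule it out.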

Arno

The way I read it, the reliability-versus-temperature result was found
to be consistent across all models and manufacturers.

Indeed. But did they have all models and all manufacturers
at all temperatures?


==================================================================
Failure rates are known to be highly correlated with drive models,
manufacturers and vintages. Our results do not contradict this fact.
For example, Figure 2 [Annualized failure rates broken down by age
groups] changes significantly when we normalize failure rates per each
drive model. Most age-related results are impacted by drive vintages.
However, in this paper, we do not show a breakdown of drives per
manufacturer, model, or vintage due to the proprietary nature of these
data.

Interestingly, this does not change our conclusions. In contrast to
age-related results, we note that all results shown in the rest of the
paper are not affected significantly by the population mix.

==================================================================
The data in this study are collected from a large number of disk
drives, deployed in several types of systems across all of Google's
services. More than one hundred thousand disk drives were used for all
the results presented here. The disks are a combination of serial and
parallel ATA consumer-grade hard disk drives, ranging in speed from
5400 to 7200 rpm, and in size from 80 to 400 GB. All units in this
study were put into production in or after 2001. The population
contains several models from many of the largest disk drive
manufacturers and from at least nine different models.

==================================================================

Hmm, I have to look at the paper again. This smells rather
strongly of a methodological error.

Ok, I have it now. I think you refer to figure 5: "AFR for average
drive Temperature". This one seems to indicate slightly higher failure
rates for the 15...30C window than for the others in drives younger
than 3 years. If you consult figure 4, you see that temperature
extremes are rare. Then there is one thing: Partially defective drives
work slower or not at all. This may result in lower drive temperatures
(spin down, refusal to execute access) and higher drive temperatures
(lots and lots of retries, heat from bearings). This can
significantly skew the results.

I would expect that Google would identify a partially defective drive
(assuming it was detected by SMART) and eventually take it out of
service. Certainly, if the drive does not work at all, then by
definition it must be totally, not partially, defective. Having said
that, the article doesn't really give a satisfactory definition of
failure other than to say that it is the reason that a drive is
replaced. <shrug>

As for spin problems, the article states ...

"Spin Retries. Counts the number of retries when the drive is
attempting to spin up. We did not register a single count within our
entire population."

The basic results could be that
failing drives run hotter or colder than others. I am also missing
more break-downs into different temperature profiles (e.g. mainly
constant, strong variation, etc.) as it is, e.g., possible that the
problem in the low temp section is due to cycling temperatures.

The article states ...

"As is common in server-class deployments, the disks were powered on,
spinning, and generally in service for essentially all of their
recorded life. They were deployed in rack-mounted servers and housed
in professionally managed datacenter facilities."

I think that would discount your temperature cycling hypothesis.

I am not saying the results are wrong, but they are suspicious and
with the data given are _very_ difficult to even understand
properly. It does not seem any statistics expert was consulted by the
writers and the temperature results are by far the weakest in the
paper. I also miss a proof or at least conclusive argument that the
remaining observations are temperature independent, both for absolute
value and different change profiles.

The paper is still very valuable. Figures 7-10 give solid results, and
need no further details. Scanning your disks every 2 weeks or so and
monitoring reallocation counts is a very good idea (and something I
have been doing for several years now). The folks at Google likely
also found that the SMART status alone is typically over-optimistic.
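
A minimal sketch of the reallocation monitoring described above, in Python.
It assumes the attribute table has the ten-column layout that smartmontools
prints for `smartctl -A`, with attribute 5 being the reallocated sector count;
the SAMPLE text is an illustrative stand-in, not output from a real drive, and
the function names are hypothetical.

```python
# Illustrative stand-in for an attribute table (NOT real drive output).
# Assumed layout: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, raw value.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
194 Temperature_Celsius     0x0022   037   045   000    Old_age   Always       -       37
"""


def raw_value(report, attribute_id):
    """Return the raw value of one attribute as an int, or None if absent."""
    for line in report.splitlines():
        fields = line.split()
        # Data rows have at least ten columns and start with the attribute ID.
        if len(fields) >= 10 and fields[0] == str(attribute_id):
            return int(fields[9])
    return None


def reallocations_grew(previous, current):
    """True if the reallocated sector count rose since the last check."""
    return current > previous


current = raw_value(SAMPLE, 5)
print("reallocated sectors:", current)
print("grew since last check:", reallocations_grew(3, current))
```

In practice the previous count would be stored between runs and the report
text would come from the drive itself; both are left out here to keep the
sketch self-contained.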

As to many failures not being predicted by SMART data, my results
are different. It is possible that the drive selection here again
skewed the picture compared to modern drives. Personally I have had
100% prediction by SMART attributes (not SMART status though) in
an admittedly small population of about 50 drives over three
years and with mostly Maxtors that are known to fail gradually.

Arno

With respect, I prefer to accept Google's experience.

"It is difficult to add temperature to this analysis since despite it
being reported as part of SMART there are no crisp thresholds that
directly indicate errors. However, if we arbitrarily assume that
spending more than 50% of the observed time above 40C is an indication
of possible problem, and add those drives to the set of predictable
failures, we still are left with about 36% of all drives with no
failure signals at all."
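
The arbitrary rule in the quote above (more than 50% of the observed time
above 40C) can be sketched in Python as follows. It assumes the temperature
log is a list of equally spaced readings in degrees Celsius, so that the share
of readings equals the share of time; the function names and the sample
readings are hypothetical.

```python
def fraction_above(samples_c, threshold_c=40.0):
    """Share of readings strictly above threshold_c.

    Assumes equally spaced readings, so this approximates the share of time.
    """
    if not samples_c:
        raise ValueError("no temperature readings")
    hot = sum(1 for t in samples_c if t > threshold_c)
    return hot / len(samples_c)


def flagged_as_hot(samples_c, threshold_c=40.0, min_fraction=0.5):
    """Apply the quoted rule: more than half of the observed time above 40C."""
    return fraction_above(samples_c, threshold_c) > min_fraction


# Made-up readings for two drives.
drive_a = [38, 41, 42, 39]   # exactly half above 40C, so not flagged
drive_b = [41, 42, 43, 39]   # three quarters above 40C, so flagged
print(flagged_as_hot(drive_a), flagged_as_hot(drive_b))
```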

I notice also that Google have an interesting observation regarding
seek errors.

"When examining our population, we find that seek errors are
widespread within drives of one manufacturer only, while others are
more conservative in showing this kind of errors. For this one
manufacturer, the trend in seek errors is not clear, changing from one
vintage to another. For other manufacturers, there is no correlation
between failure rates and seek errors."

I wonder if the abovementioned manufacturer is Seagate. IME, when
Seagate drives report a "seek error rate", they are actually reporting
a seek count.

- Franc Zabkar
--
Please remove one 'i' from my address when replying by email.