The Risks Behind Software Recovery Attempts

Most general IT service providers around the world offer data recovery services to their clients. In most cases, these recoveries are attempted using only software tools, by someone with no formal data recovery training. These unprofessional attempts are responsible for countless cases of hard drive failure and permanent data loss every year. This two-part blog post explains some of the main causes of these undesirable outcomes.

Software recoveries will inevitably cause some level of irreversible damage when attempted on an unstable drive. An unstable drive in this case is any drive that has trouble reliably processing read commands because of degraded read/write heads, platter damage or contamination, electronic instability, firmware exceptions, and so on. Instabilities like these tend to show up initially as a few intermittent bad sectors. After the very first symptoms appear, the drive gets progressively worse the more it is used, until it finally crashes. This crash either makes the drive completely unrecoverable or, at the very least, dramatically complicates the recovery process, making it far more difficult and expensive for the customer to eventually recover their data. Professional recovery processes are all about minimizing the damage done and getting a copy of the data before the drive crashes.

The crux of the problem is that all software tools can do is send standard read commands to the drive and hope that it responds to them in good time. If the drive does not respond, software has no mechanism in place to do anything about it. To understand why this is a problem, we have to first understand how hard drives process standard read commands. Several internal operations occur whenever a drive receives a read command. First, the requested sector block is read from the platters by the read/write heads and loaded into the drive's cache. While in cache, the checksum of the user data of each physical sector is calculated by the drive's processor. This checksum is then compared with the checksum in the error correction code that was originally written together with the sector.

If the checksums match, the drive concludes that the data it read from the platters is the same as the data that was originally written and sends it through the ATA channel. If there is a mismatch, the drive will first attempt to repair the corruption using the error correction code section of the problematic sector. If that fails, the drive will automatically reread the problem sector(s) over and over again, trying to obtain a good read. A bad sector means that the drive was unable to obtain a read that passes this integrity check and gave up trying, sending an error without data through the ATA channel. At that point the drive will also use its read/write heads to write to its service area (which is located on the platters) to update various logs, such as the number of reallocated sectors within the SMART attributes. For data recovery purposes, this process wastes time and causes unnecessary extra degradation of the read/write heads. It is also a potential point of serious service area corruption, because degraded read/write heads will not necessarily be capable of accurately writing the log updates that the drive intends. Record keeping procedures like these can be disabled outright with the right hardware equipment, but software tools have no such ability.
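The read sequence described above can be sketched as a toy model. This is only an illustration of the control flow, not actual drive firmware: the `Sector` class, the retry limit of 200, the CRC-32 stand-in for the drive's real error correction code, and the `try_ecc_repair` placeholder are all assumptions made for the example. For simplicity, the sketch attempts the repair after every failed read rather than only once.

```python
import zlib
from dataclasses import dataclass

# Hypothetical retry limit; real firmware limits vary by model and are not public.
MAX_INTERNAL_RETRIES = 200


@dataclass
class Sector:
    data: bytes       # user data as it currently reads back from the platter
    stored_crc: int   # checksum that was written together with the data


def try_ecc_repair(sector):
    """Placeholder for error correction; in this toy model it never succeeds."""
    return None


def read_sector(read_from_platter, max_retries=MAX_INTERNAL_RETRIES):
    """Model a single ATA read command.

    Returns (data, attempts). data is None when the drive gives up and
    reports a bad sector. CRC-32 stands in for the drive's real checksum.
    """
    attempts = 0
    while attempts < max_retries:
        attempts += 1
        sector = read_from_platter()          # one physical pass of the heads
        if zlib.crc32(sector.data) == sector.stored_crc:
            return sector.data, attempts      # checksums match: send the data
        repaired = try_ecc_repair(sector)     # mismatch: try to correct it
        if repaired is not None:
            return repaired, attempts
        # repair failed: loop around and reread the same sector
    return None, attempts                     # gave up: error without data


# A healthy sector is returned on the first attempt.
healthy = Sector(b"abc", zlib.crc32(b"abc"))
print(read_sector(lambda: healthy))

# A damaged sector costs the full retry budget before the error comes back.
damaged = Sector(b"abx", zlib.crc32(b"abc"))
print(read_sector(lambda: damaged))
```

The point of the model is the second call: from the outside it looks like one failed read, but the heads have already passed over the damaged area as many times as the firmware allows.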

A key point to understand is that hard drives do not give up on sectors easily! Even in perfectly healthy devices, an occasional internal read failure on the first or second attempt is relatively normal. Drive firmware is designed with that in mind; automatic internal retries ensure that the drive will continue operating as expected regardless of occasional internal read instabilities. That these errors are common in the first place is a symptom of the enormous market pressure on drive manufacturers to keep increasing storage capacities. The greater the density of data, the more difficult it is for the read/write heads to do their job reliably. If a sector comes back as bad, it means that the drive has already tried to read it hundreds of times internally over multiple seconds without any luck. It is this multitude of failed attempts that causes the drive to degrade further.

It is important to note that all of these internal retries take place after a single ATA read command. In other words, they happen regardless of any "no retries" option in the software, which only means that the software will not send more than one read command for each sector block. Because most bad sectors are caused by degraded read/write heads and/or problems with the surface of the platter(s), these repeated internal read retries further damage the drive, quickly exacerbating the initial problem. For example, a small speck of dirt on the platter can grow into a scratch and then destroy the read/write heads if it is accessed enough times.
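To put rough numbers on this, the arithmetic can be sketched as follows. The figure of 200 internal attempts per command is an assumption chosen to match "hundreds of times" above, and the function name is made up for the example; actual counts depend on the drive.

```python
# Hypothetical number of internal read attempts the drive makes on a bad
# sector before reporting an error for a single ATA read command.
INTERNAL_ATTEMPTS_PER_COMMAND = 200


def physical_read_attempts(bad_blocks, host_retries,
                           internal=INTERNAL_ATTEMPTS_PER_COMMAND):
    """Estimate how many times the heads revisit damaged areas.

    host_retries is the software setting: 0 corresponds to "no retries",
    meaning one read command per sector block.
    """
    host_commands = 1 + host_retries
    return bad_blocks * host_commands * internal


# "No retries" still means 200 physical attempts on each bad block.
print(physical_read_attempts(bad_blocks=50, host_retries=0))   # 10000

# Three software retries multiply that by four.
print(physical_read_attempts(bad_blocks=50, host_retries=3))   # 40000
```

Under these assumptions, the software setting only changes the multiplier; it can never bring the count for a bad block below the drive's own internal retry budget.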

Drive firmware is what ultimately decides when to give up on unreadable sectors. In some cases of instability, the firmware can fail to make this decision correctly, leaving the drive permanently stuck rereading a bad sector in an attempt to obtain a good read. This is the worst possible state during a software recovery, as the drive will be destroying itself very rapidly and the software will have no way of rectifying the problem or even knowing that it is happening.

Properly designed data recovery hardware gives complete control over how long the drive is allowed to spend on internal retries. This alone cuts undesirable processing (and thus drive degradation) down to a small fraction, as we can force the drive to give up after a few hundred milliseconds instead of after a few seconds. Hardware also has direct control over the power going to the drive, which can be cycled automatically as a last resort if all other methods of safeguarding the drive fail.
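As a back-of-the-envelope illustration of that "small fraction", the sketch below compares the time a drive spends retrying with and without an enforced limit. The values of 3000 ms and 300 ms are assumptions standing in for "a few seconds" and "a few hundred milliseconds", and the function name is hypothetical.

```python
def retry_time_seconds(bad_sectors, give_up_after_ms):
    """Total time the drive spends on internal retries across all bad sectors."""
    return bad_sectors * give_up_after_ms / 1000


BAD_SECTORS = 1000  # hypothetical count of bad sectors on the drive

# Firmware left to decide on its own: assume about 3 seconds per bad sector.
firmware_default = retry_time_seconds(BAD_SECTORS, give_up_after_ms=3000)

# Hardware-enforced limit: assume the drive is cut off after 300 ms.
hardware_limited = retry_time_seconds(BAD_SECTORS, give_up_after_ms=300)

print(firmware_default)                     # 3000.0 seconds of retrying
print(hardware_limited)                     # 300.0 seconds of retrying
print(hardware_limited / firmware_default)  # 0.1
```

With these example values, the time spent working over damaged areas drops to one tenth, and the gap widens further if the firmware would otherwise take longer to give up.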

On to part 2...
