Recovering an OCZ Vertex SSD

I was fortunate enough to convince my boss that I needed an SSD to learn how to manage them. She agreed. :) At first I thought I was going to hell for telling such a lie. I thought at the time… an SSD is just a more stable HDD. I thought wrong. I’m glad I bought one. If these things are the future, they’re a whole new category of hard drive. Managing them is not as simple as it would seem.

SSDs work great, when they work. One thing I particularly have a problem with is that they can lock your system up to an absolutely frozen state in a second. No warning. No moving parts means just that… it’s either all or nothing. I’ve never seen an HDD instantaneously lock up a system to a frozen state. They just don’t work that way because of the mechanical parts involved. If you catch a failing HDD in its earlier stages, you’ll likely be able to recover data. If an SSD stops working, it just stops working. That’s it. So, the problem with the system locking up instantaneously is that you’re forced to do a hard reset. With HDDs, this made IT professionals cringe, because you weren’t giving the OS the opportunity to close open files and finish whatever writes it needed to do to the disk. The file system could become corrupt as a result. That doesn’t usually happen, but it can. You’re also instantly removing power from the drive, not giving it a chance to spin down. This can be detrimental to its internals. With SSDs, I’ve personally had a hard reset leave me with an unreadable disk. I’m still not sure exactly why, but it’s something to do with the controller.

If you follow the latest and greatest in the SSD world, you know the drive’s controller is everything. The actual storage space of an SSD itself is virtually bulletproof. Manufacturers are right in saying the storage medium is much more reliable than an HDD’s. However, what good is the data if you can’t access it? The idea of storing data on flash memory is nothing new, as we’ve been doing it with USB thumb drives for a while, and those are pretty reliable. Using an SSD as your primary OS drive means you need a way to effectively manage where the data goes on that flash memory, along with all of the other things an OS expects a drive to be able to do. That’s why the controller is important. In the early days of SSDs, several drives had controller issues, usually due to firmware that was not optimal. Firmware is everything to an SSD. The drive would start off blazing fast, but would degrade to a state slower than an HDD after a while.

The standard tools for managing an HDD don’t apply to an SSD. You don’t ever need to defragment an SSD, and actually, doing so will hurt your drive’s performance. Defragmentation only applies to platter-based HDDs, the idea being that over time the pieces of a file get scattered across the platters, and the read/write head has to work harder and take longer to find what you need. Defragmenting reorganizes the files so that each one’s pieces sit next to each other on the platter. This means less work for the head, and faster seek times (improved performance) for the user. With an SSD, there is no moving, rotating part, so there are no seek times. Everything is just “there”, which is the biggest reason why it’s so fast.

SSDs do need maintenance though. If your drive and its OS support TRIM, that’s your best bet. TRIM takes care of a lot of the maintenance for you. However, the only OS I’ve seen at this time that natively supports TRIM is Windows 7, which is weird because M$ is usually behind the curve on these things. I’ve seen in various forums that Linux does support TRIM, but you have to invoke it manually or script it to run on a schedule. As far as I can tell, OS X doesn’t support TRIM. There is another feature in some drives called Garbage Collection. GC works on the same idea as TRIM, just on a less frequent basis. TRIM happens at the time of a file deletion. GC happens only occasionally. Not having one of these features is one of the reasons an SSD’s performance degrades over time.

Anyway, back to the main problem. I have an OCZ Vertex 256GB Mac SSD. I wanted the Intel SSD more, but it wasn’t big enough for me. The OCZ had one of the best controllers (Indilinx), so at the time it was second only to the Intel 128GB model. Intel’s controller still cannot be beaten in terms of speed and reliability. More on that later.

Since I bought this SSD, I’ve had it crap out on me 4 times.

  1. The first time was when I first got it. I installed it, hooked up my old drive to a SATA to USB adapter, then proceeded to clone the old drive onto the new. The clone froze halfway through, and rendered the drive unusable. I notified my reseller, and they were awesome enough to immediately ship me another. This time it worked, and I used it for about 3 months.
  2. The second failure was more spectacular. I was using Mail.app to compose an email. It just stopped responding and I got a beach ball. At first I thought it was just Mail, but over the course of the next 5 mins, other apps followed suit. I could use them initially, but eventually they all beachballed. I’m guessing the parts I was able to use were already loaded into memory, and the apps beachballed as soon as they actually wanted to write to the filesystem. I had to do a hard reset and, to my surprise (then), the drive wasn’t recognizable upon restart. However, I then pulled it out and left it disconnected from any power source for ~30 mins. I got this tip from OCZ’s forum, by the way. I plugged it back in, it was recognizable again, and I continued working… no real loss there.
  3. The third failure was bad, because I had to do an RMA.  It happened ~4 months after the second.  I, again, was just doing light application work and got the same result as the second.  However, this time, the drive would not come back no matter what I did.  Checking OCZ’s forums gave me a few ideas to try, but none of them worked.  You’ll notice that there are a few OCZ employees that are on their forums every day.  I wonder why.  :)
  4. The 4th failure happened Easter weekend 2010.  This time I was running a sustained write to the disk to check IO.  I was writing a 10GB file, by using this command in OS X:  dd if=/dev/zero of=/tmp/testfile bs=10240 count=102400.  It wrote maybe 3GB, then became unusable… beachballs again.  Hard reset gave the same result… no disk.
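
A side note on the dd parameters in item 4: dd writes bs × count bytes in total, so the size of the test file can be checked ahead of time. The sketch below is only illustrative arithmetic (the helper name dd_total_bytes is made up for this example, it is not part of dd). With the values exactly as quoted above, the product comes out at roughly 1 GB rather than 10 GB, so the command as typed here may not match what was actually run. The second half assumes "10GB" means 10 GiB, which is also just an assumption.

```python
def dd_total_bytes(bs: int, count: int) -> int:
    """Total bytes dd writes: block size times number of blocks."""
    return bs * count


# Values exactly as quoted in the post.
quoted = dd_total_bytes(bs=10240, count=102400)
print(quoted)               # 1048576000 bytes
print(quoted / 1024 ** 2)   # 1000.0 MiB, i.e. about 1 GB

# Assumption: the intended size was 10 GiB with the same block size.
target_bytes = 10 * 1024 ** 3
blocks_needed = target_bytes // 10240
print(blocks_needed)        # 1048576 blocks
```

In other words, keeping bs=10240, a count of around a million blocks would be needed for a file in the 10GB range.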

I want to make this clear… I think my problems are due more to my MacBook Pro’s inadequate internal cooling than to the SSD. I have a 2009 Unibody MacBook Pro, model identifier MacBookPro5,1. This model has the SATA bus and GPU on the Southbridge controller of the logic board, so it can get a little warm. :) By a little warm, I mean a lot warmer than I’d prefer. If I can find enough time, I really want to get some Arctic Silver on the heatsinks, though that may void my AppleCare. It’s no secret Apple messed up when they applied a little too much thermal paste to these MacBook Pros.

Back to the problem, again. My particular SSD has jumper pins on the back; when they’re closed (jumper installed), the SSD goes into “maintenance mode”. This will present the SSD to your system as a YATAPDONG BAREFOOT drive. You can’t actually do anything to the SSD in this mode except flash firmware.

So I’m scouring OCZ’s forums for some ideas on what to do. Believe it or not, this is how OCZ recommends getting customer support. If you want their latest supported firmware, you have to get it from their forums… not their website. :-/ The Indilinx controller has had several firmware revisions since I originally bought mine. The procedure to flash them is laughable, and can make you go insane. Each firmware revision has had a different flash procedure. Believe it or not, if you want to do any low-level SSD work, you’ll need a Windows PC, and one with a motherboard that has a certain chipset. This is because you’ll need the SATA controller on the motherboard to be set to IDE mode with no emulation. This is an awful thing to require.

So I found a thread on OCZ’s forums that describes a process for recovering a drive with symptoms similar to mine, a process approved by one of OCZ’s technical support employees. This process involved flashing the drive’s firmware back to revision 1.10, then incrementally flashing to 1.3, then 1.41, then 1.5 (which is the current version as of this posting). I’m reading this and shaking my head in disbelief. How can they have this kind of business model and expect to succeed? At this point, I don’t care… I just want to get back up and running. Here’s how it went.

I hooked the drive up to an HP dc7700 PC, which is what I had available.  BIOS told me the SATA controller was in IDE mode, which was the important thing.  So I put a jumper on the drive and it shows up on a Windows PC as YATAPDONG BAREFOOT in Device Manager.  I power off, remove the jumper, reconnect, reboot, and can’t see the device at all.  This is at least a little promising… I may be able to get it working again.  So, I re-installed the jumper and proceeded to flash the firmware in order:

  1. Downloaded firmware 1.10, which is considered by OCZ to be their last “destructive” flash.  It will wipe out the drive.  I don’t care… I can’t use it at this point.  It requires the drive to be in maintenance mode, and hooked up as a secondary to a Windows system.  You download the firmware from OCZ’s firmware sub-forum (not their website :) ), which consists of a zip file of an executable along with the firmware files themselves.  You run the executable, and a command prompt window pops up seeing if it can detect the drive.  If it can, it asks you if you want to flash.  You say yes, it spits out a bunch of gibberish about the drive being unclean and finishes.  I do recall on my SSD’s previous failures, I attempted to do this as part of the recovery process and it failed.  I’m still not sure why.  This time it succeeded.  So, I’m back on firmware 1.10.  On to the next step.
  2. Downloaded firmware 1.3. I hate this firmware. It’s clear that at this time OCZ was still trying to figure out how to best flash the drives. It’s supposed to be run the same way as 1.10, but without the drive in maintenance mode (no jumper). I tried this, but it failed. However, they also provide a way to download a bootable ISO which boots to FreeDOS and has the firmware already loaded. Mind you, the ISO was created by one of the forum members, not OCZ. The ISO I downloaded wasn’t bootable. I downloaded it several times, and burned it to several different CDs. It just wasn’t bootable. Luckily, they also provide a bootable USB option… same concept as the ISO. It was easy for me to set up because I followed all the right links in the forums and landed on the how-to that actually worked. So, bootable USB drive created, this time it actually boots… but doesn’t recognize the SSD. BIOS recognizes the SSD, but the firmware updater doesn’t. I searched the forums more, and it seems that 1.3 is very picky about what SATA controller it’s hooked up to. Luckily, I also have a Dell Precision 670 in the lab, and in the forums, people had the most luck with Dells. I tried it there, and it worked. :) Up to 1.3 now.
  3. Downloaded 1.41, which comes in an ISO provided by OCZ. Tried to flash on the HP dc7700, but got the same result as with the 1.3 USB. Then I remembered I upgraded to 1.41 from my MacBook Pro when it first came out. So I popped the drive back into the MacBook Pro, booted to the CD, flashed the firmware. Now on 1.41.
  4. Downloaded 1.5, which also comes on an ISO provided by OCZ. It’s clear OCZ figured out that ISOs are the way to go. Left the drive in the MacBook Pro, booted to the 1.5 CD, flashed the firmware. Now (finally) back to 1.5.

I then hooked the SSD back up to the HP dc7700 and booted to Windows with it as the secondary drive.  I wanted to run OCZ’s sanitary erase tool on it once.  I don’t know why, because I think the firmware flashing takes care of this for you, but I wanted to make sure the drive was in as pristine shape as possible.  Only took a second to do.

This entire procedure took all day.

At this point, you’re probably asking yourself… Why did he have to go through each one?  I’m still not sure.  The forum staff member said to, and he knows more than I do, so I listened.  It worked, didn’t it?  I think it has something to do with needing to get back to 1.10 which is a destructive flash, and you can’t go from 1.10 to 1.5… you have to do them incrementally.
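
For what it’s worth, the order I flashed in can be written down as a simple lookup. This is only a sketch of the sequence from the forum thread; the function name and the idea of computing the "remaining" steps are hypothetical and have nothing to do with any OCZ tool, and the sketch doesn’t capture the different flash procedure each revision needs.

```python
from typing import List

# Firmware revisions in the order the forum thread said to flash them.
FLASH_ORDER = ["1.10", "1.3", "1.41", "1.5"]


def remaining_flashes(current: str, target: str = "1.5") -> List[str]:
    """Return the revisions still to flash, in order, to get from current to target."""
    start = FLASH_ORDER.index(current)
    end = FLASH_ORDER.index(target)
    if end < start:
        # The thread only covers stepping forward after the destructive 1.10 flash.
        raise ValueError("target comes before current in the flash order")
    return FLASH_ORDER[start + 1:end + 1]


print(remaining_flashes("1.10"))  # ['1.3', '1.41', '1.5']
```

So after the destructive 1.10 flash there are three more flashes to go, none of which gets skipped.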

However, I’m typing this now on my MacBook Pro, running on the same SSD that was inoperable yesterday. I’m back up and running, largely thanks to Time Machine, which restored all my data.

After all that trouble, why would I still want to run this SSD as my primary drive? Well, you see, it’s freakin’ fast. It’s like crack. I can’t go back to a platter-based HDD as my primary OS drive. I refuse to. My brain is now wired for SSD. Anything slower is painful. I’m willing to accept the possibility of failure and downtime because I have my Time Machine backups going to my OS X Server at home as well as my OS X Server at work. I also keep my important documents in the “cloud”… Dropbox for personal stuff, and my work-provided EMC SAN for work documents. I still have my platter-based HDD that I can restore my Time Machine backup onto if I absolutely have to (i.e., if I need to RMA the SSD).

Would I recommend OCZ to anyone? Yes, I would. As I stated above, I’m almost certain the majority of my SSD problems are due to my MacBook Pro’s southbridge controller. In a different machine, I think this SSD would shine, and not have any problems (or far fewer problems). If someone did have problems, they could just RMA it. OCZ still needs to get their firmware upgrade process settled. The forums are still rife with users who have seemingly bricked their drives due to a botched firmware upgrade. The 1.5 firmware is a vast improvement over the previous versions, but it’s still more of a “hacker’s tool” than a commercial one. I think OCZ’s newest model uses a different controller altogether. I haven’t read up on that enough to comment on it.

If you can afford it and you can do without the space, my first choice would be an Intel SSD, and I’d strongly suggest getting one. Both of my business managers have one, and they love them. I have had zero problems with either (knock on wood), Intel’s firmware upgrade process works every time, and the Intel SSD toolbox is nice to have. The OCZ would be my second choice, only because other manufacturers seem to have even less support. At least you can communicate with an OCZ staff member through the forums. One (RyderOCZ) even lets you ship your drive to him and he’ll try to recover it using the tools he has. If he can’t, he’ll RMA it for you. If he can, he’ll ship it back to you. I did this the third time my drive died. He couldn’t recover it, and it had to be RMA’d. Got the new one within a few weeks.

Do NOT run an SSD without some sort of automated, live system backup tool. You actually shouldn’t run any OS without this in place, but even more so on an SSD. For OS X, use Time Machine. For Windows, I’m using Seagate Replica and I like it a lot. It provides Time Machine-like functionality for Windows clients. For Linux… um… rsync? If you run Linux, you don’t need me telling you how to back up. :)

Solid State Drives are still an evolving technology. The controllers and their firmware need to mature further before they can overtake platter-based HDDs in consumer systems. They are the future, but the future’s not here yet. :)
