Wednesday, May 25, 2016

With BIOS update, Intel RAID gives "Non-Member Disk" error on RAID1+0 set

In prep to play the new DOOM 4 by installing a Windows SSD in my video editing machine, I ran a BIOS update on my Asus P9X79.  After the update, the Intel onboard C600 chipset RAID controller dropped two of the four drives in my RAID1+0 set from RAID members to "Non-Member" disks.  After researching the problem, it seemed that many people have run into this issue with Intel RAID.
 
Apparently, Intel RAID gets upset when the BIOS gets updated.  At a high level, here is how you resolve it (a quick way to check how Linux sees the set follows these steps):
A. Turn all of the members into non-RAID disks
B. Re-create the RAID set in the exact same order in which it was originally created
- Don't delete the RAID set, as that has caused data loss for some people
C. Usually, the partition table gets blown away when you recreate the RAID.  You then have to go find it again by running a program like TestDisk to locate and rewrite any missing partitions
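Before you touch anything in the RAID option ROM, it helps to see how Linux itself views the set.  This is only a sketch: it assumes the Intel (IMSM) set shows up as /dev/md126 the way mine did, and that /dev/sda is one of the member disks, so adjust the names for your machine.
cat /proc/mdstat                    # list the md arrays the kernel has assembled
sudo mdadm --detail /dev/md126      # state of the set and which disks it thinks are members
sudo mdadm --examine /dev/sda       # dump the Intel RAID metadata stored on one member disk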
------------------
The Path I Took
Of course, when I ran TestDisk, I compounded the problem by choosing the wrong partition table type.  I basically got confused, as TestDisk asks you to select from a couple of choices (a minimal invocation follows this list):
1) Intel/PC partition
2) EFI GPT partition map
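TestDisk itself is interactive; you point it at the assembled RAID device and it walks you through that menu.  A minimal invocation, again assuming the set appears as /dev/md126:
sudo testdisk /dev/md126    # pick the partition table type, then run Analyse / Quick Search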
 
Now, my BIOS is UEFI and the drives all had GPT partition maps, so I thought EFI GPT was the logical choice and I selected it.  It showed me two main partitions:
- MS Data partition, size 200MB
- MS Data partition, size 2.7TB, the size of my RAID set
 
First, I wrote a partition table containing just the 200MB partition.  This was the wrong thing to do.  I got confused because my drives were GPT and because most of the people in the threads I had been reading were Windows users.  It was only after spending Saturday and Sunday reading that I found a couple of posts (out of the hundreds) from people running Linux that actually had the correct instructions for fixing my issue.
 
For my Linux Fedora 22 system, I actually needed to select "1) Intel/PC partition".  Before I did this, I decided to:
1) Pull off any data that I could using PhotoRec, a program that recovers files by reading the disk block-by-block, written by Christophe Grenier, the same developer who wrote TestDisk.  I wrote that data to my 3TB SATA drive.  This took about ten hours.  (A sketch of the invocation follows this list.)
- I ran through this exercise, but of course, the files that PhotoRec outputs are labeled F000001.txt, F000002.sh, F000003.gz and so on.  The names and directory structure are totally lost.  The content of the data files, however, is kept, and I was able to find the video editing scripts that I had spent hundreds of hours developing.  That was good, but it would be an enormous pain-in-the-A to dig through them and rename them all.
 
2) After pulling off the data, I planned to blow away the 200MB MS partition, since creating it had been a mistake.  I did this with some trepidation, obviously, as I didn't want to do any more harm.
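For reference, PhotoRec ships in the same package as TestDisk and is also interactive.  Roughly, the invocation looks something like this, assuming the array is /dev/md126 and the rescue drive is mounted at /mnt/rescue (both are placeholder paths):
sudo photorec /d /mnt/rescue/recup /dev/md126    # /d sets the directory the recovered files get written into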
 
Once those two things were done, I went back into TestDisk and selected "Intel/PC partition."  This time, I saw a Linux partition of the correct size, 2.7TB.  TestDisk gives you the option of viewing the directory structure.  When I did this I got a foreboding message:
“Can’t open filesystem. Filesystem seems damaged.”
 
Ugh.  I pressed on and wrote the partition table to disk.  Mounting the filesystem, I got a new error:
mount: wrong fs type, bad option, bad superblock on /dev/md126p1
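For context, the mount attempt itself was nothing exotic; it was roughly this, with /mnt/raid standing in for whatever mount point you use:
sudo mount /dev/md126p1 /mnt/raid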
 
Ugh.  Doing some research, I found an article on fixing bad superblocks, which entails finding a good backup among the many that ext4 sprinkles throughout the filesystem.
 
I found the backup superblocks by using mke2fs with the -n switch, which only prints what it would do (including where the superblock backups live) without actually writing a filesystem:
sudo mke2fs -n /dev/md126p1
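If the primary superblock is still readable at all, dumpe2fs can list the backups too; shown here just as an alternative sketch:
sudo dumpe2fs /dev/md126p1 | grep -i superblock    # prints the primary and backup superblock locations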
 
mke2fs showed me a number of available backup superblocks.  I prayed they weren't corrupt.  Using e2fsck, I chose one of the superblocks, 32768, from the output of mke2fs:
sudo e2fsck -f -b 32768 /dev/md126p1
 
When the program ran, it found a few errors.  Since it looked like there were going to be many more, I CTRL-C'd out of e2fsck and restarted it with the "-y" switch, which lets it run unattended and accept every fix with a "yes."  I let it run.  When I came back to the terminal, I found the screen full of the program's text output, with thousands of sector id codes scrolling by.  I wasn't sure if this was bad or good..but I let it finish.  Afterward, I mounted the drive.
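Put together, the unattended rerun looked roughly like this: the same backup superblock as before, with -y auto-answering every prompt.
sudo e2fsck -f -y -b 32768 /dev/md126p1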
 
The mount came back without error.  Good.  The moment of truth was when I ran an "ls" and saw my files and directories!  However, I wasn't out of the woods yet.  The real proof would be whether I could actually access them.  I played some music files, opened a few videos, started up a VM..they all worked!  YAHOO!!  I didn't lose my stuff!!  Even though I have pretty much everything backed up, I did not have my scripts backed up, and I've spent hundreds of hours on those.  Thank God I got them back!!
 
So..hooray for me!
:)
Now on to playing DOOM, the reason I got into that mess in the first place!!

Feel free to drop me a line or ask me a question.