Reparing Suspect SQL Database
August 24, 2010 1 Comment
I’m not a DBA but I have a real life experience repairing a suspect databases twice, I guess I can say that I am unlucky at that time as I was presented in a tough situation where I am still in the process of learning how things work and we don’t have a DBA resource (remember I am a developer) but still consider it lucky as even some experienced DBA’s might not have even experienced this situation. I had experienced a similar situation before , and restoring it was easy as we have proper backups in place.
Now in this situation, the backups had failed so the last clean backup we had was 2 weeks old, so I can only rely on that one plus the current transaction log which is not backed up for now. So first thing that came to my mind is to backup the transaction log so I can easily restore the full lot. So I tried to backup the transaction log but i have to remove the suspect status first by running this TSQL
Now it’s not a suspect anymore, I am hoping to run the backup of transaction log by using this TSQL
BACKUP LOG <dbname> TO DISK='X:\YourFolder\YourFileNameLOG.bak',
but to my surprise it did not work as the Database went back to a suspect status, I repeated again the process hoping that it would work but got no luck out of it, with further research it suspect status will always go back when the corruption is severe, so the next thing that came into my mind the disk might fail more so I have copy everything to a remote disc (MDF, LDF and the Backup), so I have to kill all connections to it and stop the database so I can copy the files properly, after waiting for a long time around 15 minutes (remember the file is huge) it presented me an error which I cannot copy further, I cannot remember the exact error anymore but its similar to copying files from a badly scratched CD or DVD, the error is in the lines of an I/O error (Oh shit!). So I go ahead and checked the event logs and to my surprise here is what I got!
Everything is in red nearly all 10K events, don’t ask me why this happened without someone noticing (remember I am the developer).
The message is almost the same from event to event, the HEX part is almost repetitive, and looks like it’s using the same 20 values all over again.
Error: 823, Severity: 24, State: 3 I/O error 2(error not found) detected during write at offset 0x000001d527a000 in file 'E:\Program Files\Data\MyDatabase.mdf'.
So moving ahead first thing came to my mind is to check disk and defragment as it might help as the file system will be repaired if there are errors that are repairable and I am hoping that the data will be moved from the offset mentioned logs to a better location on the drive. It took nearly 6 hrs doing it as the data is large and drive is large but after that I tried now copying the files I mentioned above and it worked, now I have a clean copy in a good disk of the MDF, LDF and the Backup which gave me peace of mind, as I know I can restore from those (but using those is not easy as the LDF is not a backup copy).
From there knowing that the disk was repaired by the checkdisk and defragmentation, I can now backup the transaction logs and the database which did not give me much problems anymore. Also having that in mind I can now perform a full backup of the database which I can restore on a new server. And this is the TSQL I used
BACKUP DATABASE <dbname> TO DISK='X:\YourFolder\YourFileNameDB.bak',
That worked as well, so the checkdisk and defragmentation really helped.
Now I have a real backup, next is to restore it on a new machine which again did not give me a lot of issues until…
Once the applications start connecting to the database I checked the logs and to my surprise the same errors (Error: 823) appeared, now I tried contacting some of the DBA’s I know and they can’t give me a convincing answer so I just trusted my guts and do it on my own with the help of Google. Now I found out that the error might be a remnant of the old database which was restored in that state so the best thing is to repair it before applications start using it so I repaired it using this TSQL Command
USE [master] GO ALTER DATABASE <dbname> SET SINGLE_USER WITH ROLLBACK IMMEDIATE GO ALTER DATABASE <dbname> SET SINGLE_USER GO DBCC CHECKDB <dbname>, REPAIR_REBUILD) WITH ALL_ERRORMSGS, NO_INFOMSGS GO ALTER DATABASE <dbname> SET MULTI_USER GO
So what is does is, you set the database to be used only by you then you repair and rebuild the indexes, suppressing all info messages except for errors then set the database back for multi user use.
There are a lot of repair option and this is the most time-consuming one so you need a lot of patience for at least 6 hrs for a huge database, there are some other options and here it is defined
Here is the list of the complete options
REPAIR_REBUILD – Performs all repairs done by REPAIR_FAST and includes time-consuming repairs while rebuilding indexes. This will not have data loss and this was my first option which worked for me.
REPAIR_FAST – Performs minor, non time-consuming repair actions without rebuilding the indexes and will not have a data loss as well.
REPAIR_ALLOW_DATA_LOSS – Performs all repairs done by REPAIR_REBUILD but you allow data loss as the unrepairable corrupted text will be deleted, this was my last resort.
So after running that script here is my result:
Msg 8909, Level 16, State 1, Line 1 Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:2425877) contains an incorrect page ID in its page header. The PageId in the page header = (0:0). The error has been repaired. CHECKDB found 0 allocation errors and 1 consistency errors not associated with any single object. CHECKDB fixed 0 allocation errors and 1 consistency errors not associated with any single object. Repair: The Clustered index successfully rebuilt for the object "dbo.MySampleTable" in database "MyDatabase". Repair: The page (1:2425877) has been deallocated from object ID 462321894, index ID 1, partition ID 311773704355840, alloc unit ID 30298727645184 (type In-row data). Msg 8945, Level 16, State 1, Line 1 Table error: Object ID 462321894, index ID 1 will be rebuilt. The error has been repaired. Msg 8928, Level 16, State 1, Line 1 Object ID 462321894, index ID 1, partition ID 311773704355840, alloc unit ID 30298727645184 (type In-row data): Page (1:2425877) could not be processed. See other errors for details. The error has been repaired. Msg 8976, Level 16, State 1, Line 1 Table error: Object ID 462321894, index ID 1, partition ID 311773704355840, alloc unit ID 311773704355840 (type In-row data). Page (1:2425877) was not seen in the scan although its parent (1:2425841) and previous (1:2425876) refer to it. Check any previous errors. The error has been repaired. CHECKDB found 0 allocation errors and 2 consistency errors in table 'MySampleTable' (object ID 462321894). CHECKDB fixed 0 allocation errors and 2 consistency errors in table 'MySampleTable' (object ID 462321894). CHECKDB found 0 allocation errors and 3 consistency errors in database 'MyDatabase'. CHECKDB fixed 0 allocation errors and 3 consistency errors in database 'MyDatabase'.
Looks good from that point as it repaired any errors it found, so last thing to do is backup your spanking shiny new repaired database in case something happens and create a good maintenance plans and backup.
So to summarize here are the steps to do in repairing the suspect database if the culprit was an I/O error
- Kill all connections and stop SQL Server.
- Perform a Checkdisk with repair option
- Perform a Defragmentation
- Now copy the files necessary (MDF, LDF and Backups)
- Perform a backup
- Restore the backup
- Repair the Database (DBCC CHECKDB)
- Backup the repaired DB
- Have Fun!