So I found myself in a very precarious position on Monday morning. I was sent an email showing screen shots of a DTS package that had failed with the error message
“A possible Database Consistency problem has been detected on database X DBCC checkdb and Dbcc checkcatalog should be run on Database X”.
I thought the worst of course and went out to investigate this server to see what was going on. I’m new to this position and I had not looked at this server before so I didn’t know what to expect. First bad news was that the most recent backup was October. Not a good thing!
Once I got connected to the machine I found it was a Sql 2000 Ent edition box with 4 processors and 8 GB of ram and what I consider a VLDB. The Main db was 596GB in size and had a table that boasted 929mil rows. I begun to look in the logs and found Assertion failures in the log and long wait times to write records. Messages like this.
SQL Server has encountered 4 occurrence(s) of IO requests taking longer than 15 seconds to complete on file [x:\data.mdf] (21). The OS file handle is 0x000003AC. The offset of the latest long IO is: 0x0000000ce67800
So having seen this message before I realized we must have had some sort of hardware issue either Controller or the Disk subsystem was unavailable. After looking in the windows log I found SCSI controller errors. At this point we already had a maintenance page up for whoever was trying to access this server so I called upon my Hardware/Operations guys to take a look and see what they could find. After a few hours and a few patches to the OS they did what they could do for now without incurring a longer downtime. We thought it would take much longer to fix the DB side so we decided to see what was wrong there. I also made a backup at this time even if it was corrupt some backup is better than no backup at all.
I figured the Very large table would take a long time to run checkdb on so I decided I would run a check table on all other tables(only 38 of them) to try and find any other issues that I could possibly repair while the big one was running. It only took about an hour to run a checktable through everything else and it all came back clean. I then told it to checktable just the large table. To my surprise it only took 56 minutes and there was nothing wrong with it! What luck! So I figured there was something wrong with the DB so I started the checkdb. Again I was amazed at the speed. 1h56min later it was done and still no errors. So at this point I counted myself very lucky that SQL Server did what it was suppose to and didn’t corrupt any data when the controller failed.
Make sure that if you have a Disk/Controller failure you always run CheckDb on the Database after the system is back up. Even if it’s a very large database I think you will be surprised at how fast it can go.
Here are some steps I have used and talked about in the past to approach a corruption issue.
- Identify the issue. 9 out of 10 times corruption is because of a hardware failure of some sort.
- Make a Backup of the DB. Even if it’s corrupt you could save much of the Data in it.
If you can take extra downtime then attempt to correct your hardware issue before running your checks. If your checks cause the issue to happen again you’re not going to be any better off. Usually in the real world you won’t have the time to deal with the hardware side unless your disks are completely unavailable
- Start to identify where the corruption is. If you have a large DB and you’re short on time perhaps running check table on the tables you believe to be the issue might be a good first step. You will have to run the Full checkdb at some point though to make sure you got everything.
- Start fixing the problems. Remember to try and drop and re-create indexes on tables that have corruption. Sometimes just the indexes corrupt and that can be usually fixed by dropping and re-creating. If not go through the levels that DBCC suggests only going to repair_allow_data_loss as a last resort. Sometimes you can export all the data out to another table and then just dropped the corrupted one.
- Make another backup when everything is clean. If you lost data then you can possibly compare to your corrupted DB and get any other data that might be gone.
- Make sure to double check your backups and I suggest running a checkdb on a consistent basis to let you know if there’s any issues
Hope you never have to deal with corruption. But be prepared if you do. 🙂