Tuesday, September 09, 2008

500 errors from IIS and Oracle

For the past couple of years, the Microsoft Win2K/Win2K3 website that I support had been plagued with 500 errors and subsequent IIS resets. Our website is a combination of custom ASP and ASP.NET code running on 32-bit Windows, connecting via an ODBC driver to an Oracle 9.2.0.7 backend on Solaris. When IIS blew up, customers would see a number of these 500 server errors from the server that panicked for a short period of time (usually less than an hour). Usually, this was not much of a problem, as it happened on one or two servers once or twice per day and the number of errors was low. The workaround was to reset IIS.

Recently, we added more servers to our web farm and the problem happened more often, with a greater number of 500 errors being spit out whenever the IIS process on one of these newer, more powerful machines lost its mind. So, we contacted Microsoft and used DebugDiag to set up a crash/hang dump to find out which threads were in play when IIS died.

We sent this information to Microsoft, whose engineers pointed to an interaction between IIS and Oracle. Specifically, there were references to the 9.2.0.7 Oracle ODBC driver that we use to connect our web application to our database. In the past, we have had very little luck getting Oracle to solve any driver-related issues for us. But we know that the first thing Oracle Support will ask is "do you have the latest driver set installed?" Knowing that Oracle was going to require this of us, we started the process of upgrading the driver from 9i to 10G (10.2.0.4) in our development environment. More specifically, we needed the following:
- the Oracle 10G client (1GB)
- the 6810189 patch set for 32-bit Windows (1GB)
- the 7218676 patch set for 32-bit Windows (67MB)

The downloads are quite large, and even our very basic install took about 500MB of disk space. Once we got the new 10G drivers and patch set installed in development, we put them through a battery of tests. They seemed to work fine, but development is no substitute for production traffic on a site that gets millions of hits every day. So, we started rolling out the new drivers to the most problematic web servers.

The first week was very tense, as we let the drivers cook on one server. Previously, this server blew up at least once or twice a day. With the new drivers in place, two days went by without IIS experiencing a problem. Three days went by. Then an entire week went by without a problem! We were psyched! We started rolling out the driver to the rest of the eighteen web servers, two servers every two days.

After a week went by, we could tell that the numbers of 500 errors and IIS resets were diminishing. After a second week went by, we were about halfway through the lot, with ever-decreasing numbers of 500 errors. Best of all, the boxes that were patched weeks back had not reset themselves. This was unheard of! After using Oracle drivers for eight years, this was the only time in the history of our website that we were not seeing 500 errors! Fabulous!
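For anyone who wants to track the same trend on their own servers, here is a minimal sketch of counting 500 responses per day from IIS logs. It assumes the logs are in W3C extended format with the `date` and `sc-status` fields enabled; the function name `count_500s_by_day` and the sample log lines are made up for illustration and are not from our servers.

```python
from collections import Counter


def count_500s_by_day(lines):
    """Count HTTP 500 responses per day in W3C extended log lines.

    Hypothetical helper: column positions are read from the '#Fields:'
    directive, so the log must include the 'date' and 'sc-status' fields.
    """
    fields = []
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            # The directive names the columns for the entries that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if line.startswith("#") or not fields:
            # Skip other directives, and entries seen before any #Fields line.
            continue
        row = dict(zip(fields, line.split()))
        if row.get("sc-status") == "500":
            counts[row.get("date", "unknown")] += 1
    return counts


# Made-up sample data, for illustration only.
sample = [
    "#Software: Microsoft Internet Information Services 6.0",
    "#Fields: date time cs-method cs-uri-stem sc-status",
    "2008-09-01 10:00:01 GET /default.asp 200",
    "2008-09-01 10:00:02 GET /cart.asp 500",
    "2008-09-01 10:00:03 GET /cart.asp 500",
    "2008-09-02 09:15:00 GET /default.asp 500",
]

daily = count_500s_by_day(sample)
for day in sorted(daily):
    print(day, daily[day])
```

Running something like this against each server's log directory before and after the driver swap gives a per-day count to compare, which is easier to trust than a general feeling that things have improved.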

We finished upgrading from 9.2.0.7 to 10.2.0.4 a couple of weeks later and are EXTREMELY happy to report that we are no longer experiencing IIS resets and blasts of 500 errors.

Thank you Oracle, for finally fixing this issue!
