Monday, August 13, 2007

nCipher hardserver process memory leak

We are using nCipher's 500 F2 SSL acceleration cards in a small farm of four Windows 2000 production web servers. About a month and a half ago, we installed these cards into the servers. Last week, one of the servers went down, gave us the lovely blue screen of death and became unbootable. The only way our server admin could bring the server back up was to remove the nCipher card.

Two days after the card was removed, I started seeing these errors in the System logs:
Event ID: 2019
The server was unable to allocate from the system nonpaged pool because the pool was empty.

That error led me to a Microsoft article stating that it can be caused by an application using up too much memory, i.e., an application with a memory leak:
/technet/prodtechnol/windows2000serv/reskit/w2000msgs/4746.mspx
Therefore, I started hunting for a rogue application. I used Performance Monitor to chart the "Pool Nonpaged Bytes" counter for every running process.


I looked at each running process to find the ones using the most nonpaged pool memory. I set the refresh interval to three seconds so that I could watch the Pool Nonpaged Bytes values grow as I searched. With seventy or so processes running, it took a bit of time to identify "hardserver" as the rogue: it was consuming the most nonpaged pool memory, and its usage kept climbing as I watched the chart.
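The same hunt can be done on logged numbers instead of by eye. Here is a minimal sketch, assuming the Pool Nonpaged Bytes readings for each process have already been collected at a fixed interval (for example, from a Performance Monitor counter log). The function names and the sample figures are hypothetical, made up for illustration; they are not values from our servers.

```python
from typing import Dict, List, Tuple


def growth_per_second(samples: List[int], interval_s: float) -> float:
    """Least-squares slope of evenly spaced samples, in bytes per second."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den / interval_s


def rank_by_growth(
    readings: Dict[str, List[int]], interval_s: float = 3.0
) -> List[Tuple[str, float]]:
    """Sort processes by how fast their nonpaged pool usage is growing."""
    rates = {
        name: growth_per_second(values, interval_s)
        for name, values in readings.items()
    }
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Hypothetical readings, one value every three seconds (illustrative only).
readings = {
    "hardserver": [9_000_000, 9_030_000, 9_060_000, 9_090_000],
    "inetinfo": [1_200_000, 1_200_000, 1_201_000, 1_200_000],
    "aspnet_wp": [800_000, 799_000, 800_000, 800_000],
}

ranked = rank_by_growth(readings)
for name, rate in ranked:
    print(f"{name:12s} {rate:12.1f} bytes/s")
```

Ranking by growth rate rather than by absolute size helps separate a process that is merely large from one that is actually leaking.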

But what is "hardserver"? Apparently, hardserver is installed as part of nCipher's SSL card driver software:
http://www.ncipher.com/resources/97/sa14_presence_of_flaws_in_firmware_security

Once I charted hardserver's Pool Nonpaged Bytes against the total Pool Nonpaged Bytes for the system, it was easy to see that the hardserver process was driving up nonpaged pool utilization.
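The comparison between one process and the system total can also be put into a single number. This is a rough sketch, assuming both counters were sampled at the same moments; `share_of_growth` is a hypothetical helper and the figures are invented for illustration.

```python
from typing import List


def share_of_growth(process: List[int], total: List[int]) -> float:
    """Fraction of the total nonpaged pool increase attributable to one process.

    Compares first and last samples only; returns 0.0 if the total did not grow.
    """
    if len(process) != len(total) or len(process) < 2:
        raise ValueError("need two equally long series with at least 2 samples")
    total_delta = total[-1] - total[0]
    if total_delta <= 0:
        return 0.0
    return (process[-1] - process[0]) / total_delta


# Hypothetical samples taken at the same instants (illustrative only).
hardserver_bytes = [9_000_000, 9_030_000, 9_060_000, 9_090_000]
total_bytes = [30_000_000, 30_035_000, 30_065_000, 30_100_000]

share = share_of_growth(hardserver_bytes, total_bytes)
print(f"hardserver accounts for {share:.0%} of the nonpaged pool growth")
```

A share close to 1 means nearly all of the growth in the system total comes from that one process, which is what the chart suggested for hardserver.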


As a side effect of this memory leak, the ASP.NET worker process on the web server went down:
"aspnet_wp.exe could not be started. The error code for the failure is 800705AA. This error can be caused when the worker process account has insufficient rights to read the .NET Framework files. Please ensure that the .NET Framework is correctly installed and that the ACLs on the installation directory allow access to the configured account."

This, in turn, showed the ugly error "Server Application Unavailable" to the end user.


So, either the card itself is bad, or nCipher's software leaks memory when the card's driver software is installed but the card is not available.

We put the card in a test server in the lab and it showed the same bad behaviour as in the production server: a blue screen of death. Speaking to an nCipher engineer, I found out that the blue indicator light on the back of the card tells us whether the card is functioning properly. As expected, instead of blinking steadily every 3 or 4 seconds, the indicator light on the card flashed randomly. This tells us the card is not working correctly for some reason.

The nCipher engineers and customer service folks were very helpful and we soon had an RMA number to return the card. Thanks nCipher!

10/23/07 Update
Here's a concise set of instructions about finding processes that are triggering memory leaks. From Microsoft, no less:
http://support.microsoft.com/kb/130926

'sodo
