Mistress of mystery Agatha Christie would have loved this plot.
The place: Baycrest Centre for Geriatric Care, one of the largest geriatric health care facilities in Canada. The scene of the crime: a locked computer room which houses a 50-server installation that was achieving 99.99 percent uptime. The mystery: a number of servers began to fail intermittently. The culprit: well, you’ll have to read on.
It took the IT department one year of detective work, more than $100,000 in unbudgeted expenses, and hundreds of hours in overtime to finally determine and then correct the cause of the problem. It seems the culprit was underfoot the entire time.
Let’s set the scene: Baycrest’s IT Department comprises two groups: Applications, and Systems and Operations. Systems and Operations is responsible for telephony, hardware, software, and networking. Approximately 1,200 people at Baycrest use the 1,000 PCs and printers on the 50-server Windows® 2000 network.
As the space required for servers in the computer room decreased because of newer, smaller units, the technical team used more and more of the room as a working area. To keep people out of the computer room and thereby reduce the risks posed by physical activity and contamination, a new wall was constructed with servers on one side and a refurbished work area on the other.
Mystery begins
As soon as work began on the wall, intermittent problems began to crop up. Equipment, including backup tapes, started to fail on a regular basis. A team of IT troubleshooters couldn’t find a pattern or specific cause. The problems would disappear, only to reappear later. Consultants were brought in to look at electrical power, air conditioning, and hardware, as were representatives from the equipment manufacturer. Invariably, each consultant identified two or three minor problems, but nothing that could be causing across-the-board intermittent failures.
After a year of frustration, Baycrest’s Hewlett Packard (HP) sales representative came in. He’d just been at another hospital that was going through a similar problem with its servers. The hospital had “zinc whisker contamination.”
Zinc whiskers are micron-sized filaments that grow over long periods on galvanized steel plates. The problem was first discovered in the telephone industry in the 1940s. Zinc conducts electricity, and a filament falling on a circuit board is perfectly capable of causing a short circuit. The devilish part is that the filament disintegrates, leaving little or no trace. Aha! A modus operandi that would have impressed Agatha Christie.
The HP sales rep put Baycrest in touch with Data Clean Corporation in Chicago (www.dataclean.com), a company that specializes in clean controlled environments. They suggested a quick preliminary test for this type of contamination: remove one of the panels in the raised floor in the computer room, turn off the lights, and look at the bottom of the floor panel with a flashlight. If small hair-like filaments can be seen, there is a strong possibility it’s those nefarious zinc whiskers. Sure enough, the culprits were there.
The flooring in the computer room was installed in the early 1980s and was made of a wood core with a linoleum top finish and a galvanized steel bottom. Zinc whiskers grow over time as the galvanized metal ages. The longer the filaments, the more unstable they become. The computer room remodeling project was enough to disturb the filaments, break them into pieces, and send them floating through the air and into the servers. These microscopic filaments can pass through computer filters. Newer servers are at higher risk because components are densely packed on the boards.
The clean up
Baycrest’s Occupational Health Department was notified and treated the cleanup as if it were asbestos contamination. The server room was sealed and personnel could only enter wearing masks. Zinc may cause short-term illness if a person is exposed to massive amounts, but the substance does not accumulate in the body and does not seem to pose long-term health risks.
Over three weekends, the old floor was replaced with contaminant-free panels. The ceiling tiles were also replaced; better safe than sorry, as the tiles were likely hiding spots for the filaments.
During these weekends, all but essential servers were taken offline and wrapped in plastic. Operating servers were tightly wrapped in air filtration media. On the third weekend, 50 server cases were washed, the units taken apart, and compressed air used to clean the boards. The air itself was cleaned by vacuums with high-density HEPA filters.
Insights
Network administrators at Baycrest learned much from this experience. There are a host of possible contaminants in a server room, from sulfides to lead, and air quality and temperature must be monitored even more closely. As the components on circuit boards become smaller and more tightly integrated, the risk of contamination by environmental agents and of heat buildup will grow.
This time it was zinc whiskers: mystery solved. The lesson in all of this: be more knowledgeable about the physical environment of your server room. Who knows what might be lurking beneath the floorboards?