Hundreds of connectors are being re-soldered each week, and Titan, the Oak Ridge National Laboratory supercomputer ranked as the world’s fastest, could be in regular production by May, a lab official said Wednesday.
Jeff Nichols, ORNL associate lab director for computing and computational sciences, said connectors on the $100 million computer’s motherboards had too much gold, and solder was interacting with the gold on connector pins, making the solder unstable and leading to cracks.
There are about 20,000 of the pencil-sized connectors, which link the machine’s central and graphics processing units, or CPUs and GPUs. Each connector has about 100 pins.
Motherboards from Titan’s 200 closet-sized cabinets are being shipped back to Cray Inc., and the company is removing the connectors, laying down new ones with the right amount of gold, and re-soldering them, Nichols said.
ORNL had hoped to complete acceptance testing on Titan, allowing it to be put into production with full-scale user operations, by the end of 2012, Nichols said. But that was an aggressive target and assumed that everything went well, he said.
Lab officials now plan to have all the components back in service by April 6, and they plan to run the acceptance test one more time. It includes a 14-day stability test that will ensure Titan is finishing problems, producing the right answers, and performing appropriately. The acceptance testing could be complete by the end of April.
The testing was almost completed once before, but workers noticed a degradation in communications between the CPUs and GPUs.
While repairs are being made, research is continuing on Titan. The GPUs supply much of the machine’s power, but it can still be used with its CPUs.
“Right now, the users are on it, but they’re not able to take advantage of the full system in the way that they could in the future,” Nichols said.
Titan has 24 pizza box-sized metal “blades” in each of its 200 cabinets. There are four connectors per blade, or about 100 connectors per cabinet. Nichols said Cray is repairing connectors in about 12 to 16 cabinets per week.
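Those figures can be checked with a little arithmetic. The short Python sketch below uses only the numbers Nichols cited. The weeks-to-finish estimate is an illustration, not a lab projection: it assumes every one of the 200 cabinets goes through repair at a steady pace, which the lab did not state.

```python
# Figures quoted in the article.
BLADES_PER_CABINET = 24
CONNECTORS_PER_BLADE = 4
CABINETS = 200
PINS_PER_CONNECTOR = 100  # "about 100" per the article

connectors_per_cabinet = BLADES_PER_CABINET * CONNECTORS_PER_BLADE
total_connectors = connectors_per_cabinet * CABINETS
total_pins = total_connectors * PINS_PER_CONNECTOR


def weeks_to_repair(cabinets, cabinets_per_week):
    """Weeks needed at a constant repair pace (an assumption, see above)."""
    return cabinets / cabinets_per_week


# Cray's stated pace: about 12 to 16 cabinets per week.
slow_weeks = weeks_to_repair(CABINETS, 12)
fast_weeks = weeks_to_repair(CABINETS, 16)

print(f"Connectors per cabinet: {connectors_per_cabinet}")
print(f"Connectors in all cabinets: {total_connectors:,}")
print(f"Pins in all connectors: about {total_pins:,}")
print(f"Weeks for all cabinets: {fast_weeks:.1f} to {slow_weeks:.1f}")
```

The exact counts come out slightly below the rounded figures in the story: 96 connectors per cabinet rather than "about 100," and 19,200 in all rather than "about 20,000."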
He said the lab is not assigning blame for the solder problems on the big, cutting-edge machine. The solder started to crack as Titan heated up and cooled down, and blades were moved in and out of cabinets.
“We have the biggest machine on the planet,” Nichols said. The setbacks are part of “life on the leading edge,” he said.
He said Cray is bearing the cost of the repairs, and the company won’t get all of its money until the machine is accepted.
Titan received a first-place ranking in the semiannual Top500 list that was released in November at the SC12 supercomputing conference in Salt Lake City, Utah. A test showed Titan is capable of reaching a speed of 17.59 petaflops, or more than 17,000 trillion calculations per second. Its theoretical capability is even higher, at 27 petaflops.
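For readers who want to see the unit conversion, the following Python sketch turns the two petaflops figures above into trillions of calculations per second and compares them. The variable names are illustrative only. The ratio is simple division of the two reported numbers, not an official efficiency figure from the lab or the Top500 list.

```python
# One petaflop is 10**15 calculations per second; one trillion is 10**12.
PETA = 10**15
TRILLION = 10**12

measured_petaflops = 17.59    # speed reached in the test
theoretical_petaflops = 27.0  # theoretical capability

# Convert the measured speed to trillions of calculations per second.
measured_trillions = measured_petaflops * PETA / TRILLION

# Share of the theoretical capability reached in the test.
fraction_of_peak = measured_petaflops / theoretical_petaflops

print(f"Measured: about {measured_trillions:,.0f} trillion calculations per second")
print(f"Share of theoretical capability: {fraction_of_peak:.0%}")
```

By that division, the tested speed works out to roughly 65 percent of the theoretical figure.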