Sunday, August 21, 2016

How to Build a Supercomputer and Still Lose the War

The Autobiography of Ben & Bob
Chapter 13: How to Build a Supercomputer and Still Lose the War

“Everyone sees the limits of his own vision as the limits of the world.”
Arthur Schopenhauer. 1788-1860.

“I worked with guys who made some tremendous contributions and you have never heard of them.”
Chuck Elmendorf, quoted in Jon Gertner’s The Idea Factory: Bell Labs and the Great Age of American Innovation

“Men wanted for hazardous journey. Low wages, bitter cold, long hours of complete darkness. Safe return doubtful. Honour and recognition in event of success.”
Advertisement placed in local papers by Ernest Shackleton for the Endurance expedition.

I ran into an old friend recently and we reminisced about the great fun we had working together at MIPS and Silicon Graphics back in the late '80s and early '90s. As we talked, I was reminded of several anecdotes that have colored my memory of those days. Almost none of these are actually related to the projects we were working on but are rather about the experiences we had together as a team.

  • SGI was famous for lavish parties and team offsites - to Hawaii, to Tahoe, to Napa. It was apparently a cost-effective way for one of the hottest high tech companies in the valley to retain talent. I took my team on one of these trips - to Tahoe. Let's just say it was an expensive trip. At one point during the team dinner at the top of Caesar’s hotel, I came to find that practically everyone in the 150-person team had decided to order $200 cigars and $500 shots of Cognac for "dessert". The next day, we went for a group outing on ski mobiles. Several members of the team managed to flip and totally destroy their assigned ski mobiles, some doing so even before we had left the parking lot of the rental agency. I was worried about the massive expense report but no one at the company even batted an eye when I filed it.

  • It's well past midnight on a cold December evening in Mountain View, CA. Several of us have been pulling an all-nighter in the lab, hunched over oscilloscopes and logic analyzers and PROM emulators. I honestly can't remember now which one of a handful of projects this was - probably the SGI Super Challenge or the badass SGI Origin. Those days are mostly a blur now. But I do remember the guys starting to complain that they were hungry and couldn't find a local restaurant that would deliver food after midnight. An hour or so later, someone came up with the brilliant idea that we could go to the local 24-hour Safeway, fire up the grill, and make ourselves some steaks. So it was that I found myself manning a barbecue at 3 am in shorts and a t-shirt in freezing temperatures.

  • We sit around a table at the Prince of Wales Pub in San Mateo egging each other on as we eat our habanero hamburgers, tears streaming down our faces. These burgers earned a place in the Guinness Book of World Records as the hottest hamburgers in the world, a well-deserved distinction given the gastrointestinal havoc they caused. Someone on the team had turned us on to these vile sandwiches and we had turned it into a rite of passage. If you wanted to join my team, you had to eat one. If you wanted to leave the team, well… you had to eat two! We ended up going back to the hole-in-the-wall pub roughly once a month, laughing our asses off and wiping tears from our faces as we introduced new members of the team to “the experience” or said goodbye to departing colleagues. For some reason, no one ever wanted to leave my team.

  • It's 4 pm on a Saturday afternoon. I walk back from the hardware lab to my office down the hall to find my daughter curled up and sleeping on the floor under my desk. I had left her there to study as I went to work in the lab. You cannot imagine the amount of guilt that image brings back. Not only did I just spend the entire day at work but I also managed to ignore my daughter whom I'd taken to work on a weekend day when I was supposed to be spending "quality time" with her. No one was forcing me to be there that day. I was just obsessed with the project, with the company, with the goal, with the code. Work/Life balance? What's that?

  • My wife confronts me at 3 o’clock in the morning as I pull into the driveway. Why are all these memories in the middle of the night!?! "The neighbors think you're having an affair. They won't believe me when I tell them you're just at work." She said it in jest, in exasperation, in an "I give up on you" tone of voice. It took me months to convince the neighbors that nothing nefarious was afoot.

  • We are all hunched over a table in the lab looking at the guts of a system with probes hanging every which way. "I think it said [hex] 82". "No way. You missed a bit. Run it again. One One Zero Zero… See? It’s C2." We were counting LED lights - the only way we had to debug the hardware as we brought up a new processor. Somehow, in all our wisdom, we had decided that we should build a brand new massively complex supercomputer from the ground up. Everything was new. The processor, the memory subsystem, the IO subsystem, the BIOS, the operating system, the compilers used to build those pieces of software, you name it. What balls we had. It was either that or pure naïveté.

System bring-up literally meant blinking LEDs to diagnose error codes, one bit at a time, because that's the only path in the system we could get to work reliably. Every time we changed a line of code, we ran into yet another compiler bug, yet another CPU bug, yet another cache bug, yet another memory corruption, yet another OS bug. Oh what fun we had! We would spend days tracking down a mysterious bug only to find that it disappeared because we had managed to change the timing in the code or we had managed to force a second-level cache eviction, totally invalidating the experiment. You'll have to forgive me when I scoff at developers who tell me they can't debug a problem despite the 30 GB of log files they generate - per hour!
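For anyone who has never had to read an error code off a row of LEDs, here is a rough sketch of the arithmetic behind the "82 versus C2" argument above. It is an illustration only, not the code we ran in the lab: it assumes eight LEDs read most significant bit first, and the names leds_to_byte and differing_bits are made up for this example.

```python
def leds_to_byte(leds):
    """Pack eight LED states (most significant bit first) into one byte.

    Assumption for illustration: 1 means lit, 0 means dark.
    """
    if len(leds) != 8:
        raise ValueError("expected exactly 8 LED states")
    value = 0
    for lit in leds:
        # Shift what we have so far and append the next bit.
        value = (value << 1) | (1 if lit else 0)
    return value


def differing_bits(a, b):
    """Return the bit positions (0 = least significant) where two bytes differ."""
    diff = a ^ b
    return [bit for bit in range(8) if diff & (1 << bit)]


# First reading: the second LED was missed and counted as dark.
first_reading = leds_to_byte([1, 0, 0, 0, 0, 0, 1, 0])
# Second reading: "One One Zero Zero..." followed by the same low nibble.
second_reading = leds_to_byte([1, 1, 0, 0, 0, 0, 1, 0])

print(f"{first_reading:02X} vs {second_reading:02X}")  # 82 vs C2
print(differing_bits(first_reading, second_reading))   # [6]
```

One missed LED in the second position is the whole difference between the two codes, which is why we kept running it again.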

Of course, we had already brought the OS up on a simulator, brought up a 64-bit version of IRIX (SGI's version of UNIX) on an earlier generation of hardware, brought up the CPU on another board, etc. We had done as much as we could with previous generations of hardware and software and simulators. But now we were bringing everything together in the lab for the first time. And, of course, anything that could go wrong did go wrong.

The good news is that the Origin was truly a groundbreaking system. Its shared memory NUMA architecture allowed you to build ever larger systems out of what we called "Lego" nodes. In its first incarnation, we built systems with as many as 128 processors. Its architecture supported up to 1024 processors and the company shipped systems that large several years later. By then, however, the industry had zigged while SGI zagged. Loosely coupled clusters of personal computers were now fast enough to undercut the likes of SGI with their proprietary hardware architecture.

When that project started, the Top 500 List, the list of the fastest supercomputers in the world, was dominated by Cray Research. SGI was making the bold bet that it could dethrone that giant - and it did. By the time I left a few years later, roughly 140 of the top 500 supercomputers in the world were from SGI. Cray was in trouble, so what did SGI do? Of course, they “bought” Cray! Somehow, the executives at SGI convinced themselves that it made sense to buy the dying company - for their "brand name" and “customer loyalty”. It’s odd to use the term "buy" in this case because SGI was actually smaller than Cray in almost every sense - market cap, employee count, revenue, etc. No worries; we'll just borrow some money!

In the end, the combined company collapsed under its own weight. The once lucrative high-end 3D graphics industry had shifted to using clusters of Intel x86-based servers instead of tightly coupled shared memory multiprocessor systems. The Cold War had also ended and the government no longer needed as many supercomputers to simulate nuclear weapons. SGI had missed an inflection point in the industry because they were too busy dealing with their existing customer base and their needs - a typical innovator’s dilemma.

Nevertheless, SGI was an excellent company with a superb cadre of individuals at all levels. I thoroughly enjoyed my years there. The company was hip, a Silicon Valley high flier as well as a Hollywood darling. I honestly don't remember how much my equity in the company was worth at its peak. Not that it matters. I never sold the stock, hanging on to it, continuing to believe in the dream - long after it had crashed and the company was on its way to not one - but two - bankruptcies!