System Administration of the IBM Watson Supercomputer

Aleksey Tsalolikhin

Issue #216, April 2012

Find out how the brains at IBM handle system administration of the Watson supercomputer.

System administrators at the USENIX LISA 2011 conference (LISA is a great system administration conference, by the way) in Boston in December got to hear Michael Perrone's presentation “What Is Watson?”

Michael Perrone is the Manager of Multicore Computing from the IBM T.J. Watson Research Center. The entire presentation (slides, video and MP3) is available on the USENIX Web site at www.usenix.org/events/lisa11/tech, and if you really want to understand how Watson works under the hood, take an hour to listen to Michael's talk (and the sysadmin Q&A at the end).

I approached Michael after his talk and asked if there was a sysadmin on his team who would be willing to answer some questions about handling Watson's system administration, and after a brief introduction to Watson, I include our conversation below.

What Is Watson?

In a nutshell, Watson is an impressive demonstration of the current state of the art in artificial intelligence: a computer's ability to answer questions posed in natural language (text or speech) correctly.

Watson came out of the IBM DeepQA Project and is an application of DeepQA tuned specifically to Jeopardy (a US TV trivia game show). The “QA” in DeepQA stands for Question Answering, which means the computer can answer your questions, spoken in a human language (starting with English). The “Deep” in DeepQA means the computer is able to analyze deeply enough to handle natural language text and speech successfully. Because natural language is unstructured, deep analysis is required to interpret it correctly.

It demonstrates (in a popular format) a computer's capability to interface with us using natural language, to “understand” and answer questions correctly by quickly searching a vast sea of data and correctly picking out the vital facts that answer the question.

Watson is thousands of algorithms running on thousands of cores using terabytes of memory, driving teraflops of CPU operations to deliver an answer to a natural language question in less than five seconds. It is an exciting feat of technology, and it's just a taste of what's to come.

IBM's goal for the DeepQA Project is to drive automatic Question Answering technology to a point where it clearly and consistently rivals the best human performance.

How Does Watson Work?

First, Watson develops a semantic net. Watson takes a large volume of text (the corpus) and parses that with natural language processing to create “syntatic frames” (subject→verb→object). It then uses syntactic frames to create “semantic frames”, which have a degree of probability. Here's an example of semantic frames:

  • Inventors patent inventions (.8).

  • Fluid is a liquid (.6).

  • Liquid is a fluid (.5).

Why isn't the probability 1 in any of these examples? Because of phrases like “I speak English fluently”. They tend to skew the numbers.

To answer questions, Watson uses Massively Parallel Probabilistic Evidence-Based Architecture. It uses the evidence from its semantic net to analyze the hypotheses it builds up to answer the question. You should watch the video of Michael's presentation and look at the slides, as there is really too much under the hood to present in a short article, but in a nutshell, Watson develops huge amounts of hypotheses (potential answers) and uses evidence from its semantic Web to assign probabilities to the answers to pick the most likely answer.

There are many algorithms at play in Watson. Watson even can learn from its mistakes and change its Jeopardy strategy.

Interview with Eddie Epstein on System Administration of the Watson Supercomputer

Eddie Epstein is the IBM researcher responsible for scaling out Watson's computation over thousands of compute cores in order to achieve the speed needed to be competitive in a live Jeopardy game. For the past seven years, Eddie managed the IBM team doing ongoing development of Apache UIMA. Eddie was kind enough to answer my questions about system administration of the Watson cluster.

AT: Why did you decide to use Linux?

EE: The project started with x86-based blades, and the researchers responsible for admin were very familiar with Linux.

AT: What configuration management tools did you use? How did you handle updating the Watson software on thousands of Linux servers?

EE: We had only hundreds of servers. The servers ranged from 4- to 32-core machines. We started with CSM to manage OS installs, then switched to xCat.

AT: xCat sounds like an installation system rather than a change management system. Did you use an SSH-based “push” model to push out changes to your systems?

EE: xCat has very powerful push features, including a multithreaded push that interacts with different machines in parallel. It handles OS patches, upgrades and more.

AT: What monitoring tool did you use and why? Did you have any cool visual models of Watson's physical or logical activity?

EE: The project used a home-grown cluster management system for development activities, which had its own monitor. It also incorporated ganglia. This tool was the basis for managing about 1,500 cores.

The Watson game-playing system used UIMA-AS with a simple SSH-based process launcher. The emphasis there was on measuring every aspect of runtime performance in order to reduce the overall latency. Visualization of performance data was then done after the fact. UIMA-AS managed the work on thousands of cores.

AT: What were the most useful system administration tools for you in handling Watson and why?

EE: clusterSSH (sourceforge.net/apps/mediawiki/clusterssh) was quite useful. That and simple shell scripts with SSH did most of the work.

AT: How did you handle upgrading Watson software? SSH in, shut down the service, update the package, start the service? Or?

EE: Right, the Watson application is just restarted to pick up changes.

AT: How did you handle packaging of Watson software?

EE: The Watson game player was never packaged up to be delivered elsewhere.

AT: How many sysadmins do you have handling how many servers? You mentioned there were hundreds of operating system instances—could you be more specific? (How many humans and how many servers?) Is there actually a dedicated system administration staff, or do some of the researchers wear the system administrator hat along with their researcher duties?

EE: We have in the order of 800 OS instances. After four years we finally hired a sysadmin; before that, it was a part-time job for each of three researchers with root access.

AT: Regarding your monitoring system, how did you output the system status?

EE: We are not a production shop. If the cluster has a problem, only our colleagues complain.

What's Next?

IBM wants to make DeepQA useful, not just entertaining. Possible fields of application include healthcare, life sciences, tech support, enterprise knowledge management and business intelligence, government, improved information sharing and security.

Aleksey Tsalolikhin has been a UNIX/Linux system administrator for 14 years. Wrangling EarthLink's server farms by hand during its growth from 1,000 to 5,000,000 users, he developed an abiding interest in improving the lot of system administrators through training in configuration management, documentation and personal efficiency (including time management for system administrators and improving communication). Aleksey also provides private and public training services; send e-mail to aleksey@verticalsysadmin.com for more information.