From Cluster Documentation Project
The following topics were collected from discussions with community members and from our previous Why Do Clusters Suck? discussion. We invite those with ideas or an interest in the following topics to click on a topic and add to the discussion.
The government labs have played a key role in the development of cluster technology. What are the issues that the government labs are addressing or would like to see addressed? Is there a list of all government-supported cluster projects? How can we work more closely with these cluster powerhouses?
The need for tools and methods to quickly create applications for clusters is often considered the biggest challenge facing HPC today. No one will argue that writing software for clusters is easy. It needs to be easier. How will we write software for the "Blue Collar™ Computing" applications that require more than a handful of processors, but far fewer than the "heroic" projects we hear so much about? Will more software sell more clusters?
When you look inside the cabinet, clusters are still islands of machines separated by a sea of cables. Each machine boots on its own, has its own BIOS, and keeps its own logs of system activity. As clusters grow, how do we manage these factors? IPMI may help with some of these issues, but in general, clusters do not have a single power/reset button. In addition, how do we keep an eye on the system logs beyond using grep?
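As an illustration of going "beyond grep," a small script can aggregate suspect events per node rather than just matching lines. This is only a sketch: the keyword list, the sample log lines, and the assumption that the hostname is the fourth whitespace-separated field of a syslog-style line are all hypothetical and would need adjusting for a real cluster's log format.

```python
import re
from collections import Counter

def count_log_events(lines, patterns=("error", "fail", "panic")):
    """Count suspect events per node in syslog-style lines.

    Assumes the hostname is the fourth whitespace-separated field
    (e.g. "Jan 10 03:12:01 node01 kernel: ..."); adjust the index
    for other log formats.
    """
    counts = Counter()
    pattern = re.compile("|".join(patterns), re.IGNORECASE)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        node = fields[3]
        if pattern.search(line):
            counts[node] += 1
    return counts

# Hypothetical log lines gathered from several nodes:
sample = [
    "Jan 10 03:12:01 node01 kernel: EXT3-fs error (device sda1)",
    "Jan 10 03:12:05 node02 sshd[422]: session opened for user hpc",
    "Jan 10 03:13:44 node01 mce: hardware failure detected",
    "Jan 10 03:14:02 node03 kernel: panic - not syncing",
]
print(count_log_events(sample))  # node01 stands out with two events
```

A per-node summary like this makes a failing node stand out at a glance, which grep alone does not do once logs from hundreds of nodes are merged.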
Recently, there have been some new books on cluster computing. But, in reality, the best way to learn about clusters is to either build one or work directly with a production system. Many people believe this process is too slow. Where can one go to learn about clusters? Is there a college or university that teaches engineers and scientists how to use cluster technology? How do we begin to address this problem?
Comparing cluster performance is similar to asking the question "How tall is a building?" The answer needs more information and always depends on what you are trying to do. Ideally you would like to build a machine to run your application. But, what if you must share a cluster? How can applications run on systems that vary in so many ways? As we focus less on the Top500, can we find a way to measure clusters that provides good engineering data for everyone? Are we ready to create standards like the Linux Standard Base Project? Or should we?
Clusters are most often built from what we can find in the commodity market. In recent years the cluster market has been developing a strong voice, and vendors are noticing. There are some vendors, like those who sell interconnects, that are clearly focused on the HPC cluster market. What about the others? If we had the major vendors (hardware and software) sitting in front of us, what would we ask of them? (And would we buy it if they made it?)
Cluster administration is about managing complexity. There seem to be two methods for handling this issue from an operating system perspective. The more traditional method of deploying Local-Disk-Full (LDF) nodes is easy to implement and requires very little new technology, but these clusters are often the hardest to administer. Clusters that use Remote-Disk-Less (RDL) methods require more engineering, but often offer a much easier administration path. Is there a best way? What about upgrading and version skew? Why do we need a 100GB disk on each node to hold 10MB of OS and libraries? Can we build truly stateless clusters?
Anyone who has stood next to a cluster knows that in addition to computing the size of the universe, they are also excellent space heaters. Many believe we have reached the maximum thermal density for air-cooled clusters. The introduction of dual cores will push this limit to the wall. Furthermore, for many large clusters, electricity and heat removal have become a major cost. How long will this continue? Why don't we rate clusters in terms of GFLOPS/Watt? Should we be using a small number of fast/hot processors or a large number of slow/cool processors? When do we roll in the chilled water systems?
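The GFLOPS/Watt metric mentioned above is simple arithmetic, and a quick calculation shows how it reframes the fast/hot versus slow/cool question. The processor numbers below are invented for illustration, not measured figures for any real hardware.

```python
def gflops_per_watt(gflops, watts):
    """Efficiency metric: GFLOPS delivered per Watt consumed."""
    return gflops / watts

# Hypothetical comparison: a fast/hot processor vs. a slow/cool one.
fast_hot = gflops_per_watt(gflops=12.0, watts=110.0)
slow_cool = gflops_per_watt(gflops=4.0, watts=25.0)

print(round(fast_hot, 3))   # 0.109 GFLOPS/Watt
print(round(slow_cool, 3))  # 0.16 GFLOPS/Watt
```

With these made-up numbers, the slower processor delivers roughly 50% more computation per Watt, even though it would take three times as many of them to match the raw throughput, which is exactly the trade-off the question poses.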
We will summarize what we have learned and note any issues that need more attention or need to be addressed in the future. In addition, we will attempt to rank in order of importance the challenges facing HPC clusters. Anything else?