SPRUCE: An Infrastructure for Emergency, On-Demand, Urgent Computing
This was a talk I attended at NCAR earlier this semester.
It was a rather interesting lecture, and it illustrated some of the novel ways in which computing resources can be used for the public good. The basic idea is that, now that we have supercomputers connected to each other (TeraGrid) in a way that allows data and work to flow between them, we can leverage this to provide resources for urgent computing needs in the event of natural disasters, outbreaks of disease, etc. For example, if wildfires are burning in a certain area of the country, priority can be given to simulation software that predicts the spread of the fire, so that the results are available as quickly as possible and can inform the most effective strategy for focusing the efforts to extinguish it. In essence, the participating supercomputing sites agree to make their systems available to assist in an emergency, and the implementation is left to each host facility.
In initiatives such as this, the political challenges can be as difficult as the technical ones. By participating, the owners of these systems open themselves up to the possibility that their computing resources will be used for purposes other than those that directly benefit the host institution. For example, computers at Argonne National Laboratory could be used in an emergency to predict hurricane landfall in Florida; while this provides no direct benefit to the host institution, it is invaluable to the population that could be affected by the hurricane. It was interesting to see the strategies for addressing these issues so that the users of the systems would be more accepting of this possibility. One strategy was to offer users who volunteer their systems for SPRUCE a 'discount' on the number of hours they are given for their regular jobs. In many cases an effort is also made to make the preemption of jobs as nondestructive as possible. The local site administrator has discretion over how to implement participation in the SPRUCE initiative, which I believe is crucial to the success of the endeavor.
One concern I had is that, while an effort is made to provide a somewhat uniform interface to these systems, they are in fact very different and in many cases will not support certain kinds of software (e.g., applications written in the shared-memory paradigm will not run on the BlueGene/L architecture). This makes the validation and certification of software quite an issue, and there is probably a need to associate some sort of system affinity metric with an application to ensure that it is assigned to a system capable of running it. The initiative is far from complete, so it will be interesting to see how these sorts of administrative issues are addressed.
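To make the affinity idea a little more concrete, here is a minimal sketch in Python of what such a metric might look like. It is purely illustrative and not part of SPRUCE or anything presented in the talk: the `System` and `Application` classes, the feature labels ("mpi", "shared_memory", "large_memory"), and the scoring rule are all assumptions. The score is zero whenever a system lacks a feature the application requires, and otherwise grows with the number of preferred features the system offers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    """A hypothetical compute resource and the capabilities it advertises."""
    name: str
    features: frozenset


@dataclass(frozen=True)
class Application:
    """A hypothetical urgent application and what it needs to run."""
    name: str
    required: frozenset                # features the code cannot run without
    preferred: frozenset = frozenset() # features that help but are optional


def affinity(app, system):
    """Return 0.0 if the system cannot run the app at all; otherwise
    1.0 plus one point for each preferred feature the system provides."""
    if not app.required <= system.features:
        return 0.0
    return 1.0 + len(app.preferred & system.features)


def eligible_systems(app, systems):
    """List (system name, score) pairs for systems able to run the app,
    best match first. Systems with a score of 0.0 are dropped."""
    scored = [(s.name, affinity(app, s)) for s in systems]
    runnable = [pair for pair in scored if pair[1] > 0.0]
    return sorted(runnable, key=lambda pair: pair[1], reverse=True)


# Made-up example data: a distributed-memory cluster and a shared-memory node.
cluster = System("distributed_cluster", frozenset({"mpi"}))
smp = System("smp_node", frozenset({"mpi", "shared_memory", "large_memory"}))

shared_app = Application("shared_memory_model",
                         required=frozenset({"shared_memory"}),
                         preferred=frozenset({"large_memory"}))
mpi_app = Application("mpi_model",
                      required=frozenset({"mpi"}),
                      preferred=frozenset({"large_memory"}))
```

With this made-up data, `eligible_systems(shared_app, [cluster, smp])` keeps only `smp_node`, while `mpi_app` is eligible on both systems and ranks `smp_node` first. A real scheme would presumably also have to account for things like software validation status and current load, which this sketch ignores.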