2007-11-15

Distributed Computing: An Introduction

By Leon Erlanger
Original Link: http://www.extremetech.com/article2/0%2C1697%2C11769%2C00.asp

You can define distributed computing many different ways. Various vendors have created and marketed distributed computing systems for years, and have developed numerous initiatives and architectures to permit distributed processing of data and objects across a network of connected systems.

One flavor of distributed computing has received a lot of attention lately, and it will be a primary focus of this story--an environment where you can harness the idle CPU cycles and storage space of tens, hundreds, or thousands of networked systems to work together on a particularly processing-intensive problem. The growth of such processing models has been limited, however, by a lack of compelling applications, by bandwidth bottlenecks, and by significant security, management, and standardization challenges. But the last year has seen new interest in the idea as the technology has ridden the coattails of the peer-to-peer craze started by Napster. A number of new vendors have appeared to take advantage of the nascent market, including heavy hitters like Intel, Microsoft, Sun, and Compaq, whose involvement has validated the importance of the concept. Also, an innovative worldwide distributed computing project whose goal is to find intelligent life in the universe--SETI@Home--has captured the imaginations, and the desktop processing cycles, of millions of users.

Increasing desktop CPU power and communications bandwidth have also helped to make distributed computing a more practical idea. The number of real applications is still somewhat limited, and the challenges--particularly standardization--are still significant. But there's a new energy in the market, as well as some actual paying customers, so it's about time to take a look at where distributed processing fits and how it works.




Distributed vs Grid Computing

There are actually two similar trends moving in tandem--distributed computing and grid computing. Depending on how you look at the market, the two either overlap, or distributed computing is a subset of grid computing. Grid computing got its name because it strives for an ideal scenario in which the CPU cycles and storage of millions of systems across a worldwide network function as a flexible, readily accessible pool that can be harnessed by anyone who needs it, similar to the way power companies and their users share the electrical grid.

Sun defines a computational grid as "a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to computational capabilities." Grid computing can encompass desktop PCs, but more often than not its focus is on more powerful workstations, servers, and even mainframes and supercomputers working on problems involving huge datasets that can run for days. And grid computing leans more toward dedicated systems than toward systems primarily used for other tasks.

Large-scale distributed computing of the variety we are covering usually refers to a similar concept, but is more geared to pooling the resources of hundreds or thousands of networked end-user PCs, which individually are more limited in their memory and processing power, and whose primary purpose is not distributed computing but rather serving their users. As we mentioned above, there are various levels and types of distributed computing architectures, and neither grid nor distributed computing has to be implemented on a massive scale. They can be limited to CPUs among a group of users, a department, several departments inside a corporate firewall, or a few trusted partners across the firewall.

How It Works

In most cases today, a distributed computing architecture consists of very lightweight software agents installed on a number of client systems, and one or more dedicated distributed computing management servers. There may also be requesting clients with software that allows them to submit jobs along with lists of their required resources.

An agent running on a processing client detects when the system is idle, notifies the management server that the system is available for processing, and usually requests an application package. The client then receives an application package from the server and runs the software when it has spare CPU cycles, and sends the results back to the server. The application may run as a screen saver, or simply in the background, without impacting normal use of the computer. If the user of the client system needs to run his own applications at any time, control is immediately returned, and processing of the distributed application package ends. This must be essentially instantaneous, as any delay in returning control will probably be unacceptable to the user.
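
To make the agent's role concrete, here is a minimal sketch, in Python, of the loop such a processing client might run. The idle check, the work-unit format, and the server calls are all illustrative stand-ins, not any vendor's actual agent API.

# Minimal agent-loop sketch; all server interaction is simulated locally.
import time


def system_is_idle() -> bool:
    """Stand-in for a real idle check (screen saver active, low CPU load, etc.)."""
    return True


def fetch_work_unit():
    """Stand-in for requesting an application package from the management server."""
    return {"task_id": 42, "numbers": list(range(1_000_000))}


def send_results(task_id: int, result) -> None:
    """Stand-in for posting results back to the management server."""
    print(f"task {task_id} -> {result}")


def run_agent(poll_seconds: float = 1.0) -> None:
    while True:
        if not system_is_idle():
            time.sleep(poll_seconds)       # yield immediately to the user
            continue
        unit = fetch_work_unit()
        result = sum(unit["numbers"])      # the "application package" doing its work
        send_results(unit["task_id"], result)
        break                              # a real agent would keep looping


if __name__ == "__main__":
    run_agent()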


Distributed Computing Management Server
The servers have several roles. They take distributed computing requests and divide their large processing tasks into smaller tasks that can run on individual desktop systems (though sometimes this is done by a requesting system). They send application packages and some client management software to the idle client machines that request them. They monitor the status of the jobs being run by the clients. And after the client machines run those packages, the servers assemble the results sent back by the clients and structure them for presentation, usually with the help of a database.
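
Here is a toy illustration of that split-and-reassemble role. The chunking scheme and the per-client computation are arbitrary examples; a real management server would also handle scheduling, monitoring, and security.

# Toy split/compute/assemble cycle; "clients" are simulated by a local loop.
def split_job(data, n_chunks):
    size = max(1, len(data) // n_chunks)
    return [data[i:i + size] for i in range(0, len(data), size)]


def client_compute(chunk):
    # stands in for whatever the application package does on each desktop
    return sum(x * x for x in chunk)


def assemble(partials):
    return sum(partials)


if __name__ == "__main__":
    job = list(range(10_000))
    chunks = split_job(job, n_chunks=100)           # one chunk per idle desktop
    partials = [client_compute(c) for c in chunks]  # in reality, done remotely
    print(assemble(partials) == sum(x * x for x in job))  # True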

If the server doesn't hear from a processing client for a certain period of time, possibly because the user has disconnected his system and gone on a business trip, or simply because he's using his system heavily for long periods, it may send the same application package to another idle system. Alternatively, it may have already sent out the package to several systems at once, assuming that one or more sets of results will be returned quickly. The server is also likely to manage any security, policy, or other management functions as necessary, including handling dialup users whose connections and IP addresses are inconsistent.
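
The bookkeeping behind that re-send behavior can be sketched roughly as follows, assuming the server simply records when each work unit went out and re-queues any unit that has produced no result within a deadline. The class and method names are illustrative.

# Rough timeout/re-dispatch tracker; duplicate results from redundant sends are ignored.
import time


class WorkTracker:
    def __init__(self, timeout_seconds: float = 3600.0):
        self.timeout = timeout_seconds
        self.sent_at = {}                  # unit_id -> time the unit was dispatched

    def dispatched(self, unit_id: int) -> None:
        self.sent_at[unit_id] = time.time()

    def result_received(self, unit_id: int) -> None:
        # first result wins; later copies from redundant sends are simply dropped
        self.sent_at.pop(unit_id, None)

    def units_to_resend(self) -> list:
        now = time.time()
        return [u for u, t in self.sent_at.items() if now - t > self.timeout]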

Obviously the complexity of a distributed computing architecture increases with the size and type of environment. A larger environment that includes multiple departments, partners, or participants across the Web requires complex resource identification, policy management, authentication, encryption, and secure sandboxing functionality. Resource identification is necessary to define the level of processing power, memory, and storage each system can contribute.
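
As a small illustration of resource identification, the snippet below builds the kind of minimal hardware profile a client agent might report when it registers, using only Python's standard library; real platforms collect far more, including storage, network speed, and usage patterns.

# Minimal resource profile a client might report on registration (illustrative only).
import json
import os
import platform

profile = {
    "hostname": platform.node(),
    "os": platform.system(),
    "architecture": platform.machine(),
    "cpu_count": os.cpu_count(),
}
print(json.dumps(profile, indent=2))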

Policy management is used to varying degrees in different types of distributed computing environments. Administrators or others with rights can define which jobs and users get access to which systems, and who gets priority in various situations based on rank, deadlines, and the perceived importance of each project. Obviously, robust authentication, encryption, and sandboxing are necessary to prevent unauthorized access to systems and data within distributed systems that are meant to be inaccessible.
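
A policy engine of this sort can be approximated, very roughly, as an ordering over queued jobs. The sketch below assumes three illustrative fields--deadline, project importance, and submitter rank--and is not modeled on any particular vendor's implementation.

# Illustrative priority policy: earliest deadline, then importance, then seniority.
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Job:
    name: str
    deadline: datetime
    importance: int          # higher means more important
    submitter_rank: int      # lower means more senior


def prioritize(jobs):
    return sorted(jobs, key=lambda j: (j.deadline, -j.importance, j.submitter_rank))


queue = [
    Job("risk-report", datetime(2001, 12, 1), importance=3, submitter_rank=2),
    Job("drug-screen", datetime(2001, 11, 20), importance=5, submitter_rank=4),
]
print([j.name for j in prioritize(queue)])   # ['drug-screen', 'risk-report']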

If you take the ideal of a distributed worldwide grid to the extreme, it requires standards and protocols for dynamic discovery and interaction of resources in diverse network environments and among different distributed computing architectures. Most distributed computing solutions also include toolkits, libraries, and APIs for porting third-party applications to work with their platform, or for creating distributed computing applications from scratch.







What About Peer-to-Peer Features?







Though distributed computing has recently been subsumed by the peer-to-peer craze, the structure described above is not really one of peer-to-peer communication, as the clients don't necessarily talk to each other. Current vendors of distributed computing solutions include Entropia, DataSynapse, Sun, Parabon, Avaki, and United Devices. Sun's open source GridEngine platform is more geared to larger systems, while the others are focusing on PCs, with DataSynapse somewhere in the middle. In the case of the SETI@home project (http://setiathome.ssl.berkeley.edu/), Entropia, and most other vendors, the structure is a typical hub and spoke with the server at the hub, and data is delivered back to the server by each client as a batch job. In the case of DataSynapse's LiveCluster, however, client PCs can work in parallel with other client PCs and share results with each other in 20-ms bursts. The advantage of LiveCluster's architecture is that applications can be divided into tasks that have mutual dependencies and require interprocess communications, while those running on Entropia cannot. But while Entropia and other platforms can work very well across an Internet of modem-connected PCs, DataSynapse's LiveCluster makes more sense on a corporate network or among broadband users across the Net.

The Poor Man's Supercomputer
The advantages of this type of architecture for the right kinds of applications are impressive. The most obvious is the ability to provide access to supercomputer level processing power or better for a fraction of the cost of a typical supercomputer.
SETI@Home's Web site FAQ points out that the most powerful computer, IBM's ASCI White, is rated at 12 TeraFLOPs and costs $110 million, while SETI@home currently gets about 15 TeraFLOPs and has cost about $500K so far. Further savings come from the fact that distributed computing doesn't require all the pricey electrical power, environmental controls, and extra infrastructure that a supercomputer requires. And while supercomputing applications are written in specialized languages like mpC, distributed applications can be written in C, C++, etc.
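Taken at face value, those figures work out to a striking difference in cost per TeraFLOP, as the back-of-the-envelope arithmetic below shows (a rough comparison only; it ignores ongoing power, cooling, and staffing costs on both sides).

supercomputer = 110_000_000 / 12   # ASCI White: roughly $9.2 million per TeraFLOP
seti_at_home = 500_000 / 15        # SETI@home: roughly $33,000 per TeraFLOP
print(f"${supercomputer:,.0f} vs ${seti_at_home:,.0f} per TeraFLOP")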
The performance improvement over typical enterprise servers for appropriate applications can be phenomenal. In a case study that Intel did of a commercial and retail banking organization running Data Synapse's LiveCluster platform, computation time for a series of complex interest rate swap modeling tasks was reduced from 15 hours on a dedicated cluster of four workstations to 30 minutes on a grid of around 100 desktop computers. Processing 200 trades on a dedicated system took 44 minutes, but only 33 seconds on a grid of 100 PCs. According to the company using the technology, the performance improvement running various simulations allowed them to react much more swiftly to market fluctuations (but we have to wonder if that's a good thing…).




Scalability is also a great advantage of distributed computing. Though they provide massive processing power, supercomputers are typically not very scalable once they're installed. A distributed computing installation is infinitely scalable--simply add more systems to the environment. In a corporate distributed computing setting, systems might be added within or beyond the corporate firewall.
A byproduct of distributed computing is more efficient use of existing system resources. Estimates by various analysts have indicated that up to 90 percent of the CPU cycles on a company's client systems go unused. Even servers and other systems spread across multiple departments are typically used inefficiently, with some applications starved for server power while elsewhere in the organization server power is grossly underutilized. And server and workstation obsolescence can be staved off considerably longer by allocating certain applications to a grid of client machines or servers. This leads to the inevitable Total Cost of Ownership, Total Benefit of Ownership, and ROI discussions. Another byproduct: instead of throwing away obsolete desktop PCs and servers, an organization can dedicate them to distributed computing tasks.

Distributed Computing Application Characteristics

Obviously not all applications are suitable for distributed computing. The closer an application gets to running in real time, the less appropriate it is. Even processing tasks that normally take an hour or two may not derive much benefit if the communications among distributed systems and the constantly changing availability of processing clients become a bottleneck. Instead you should think in terms of tasks that take hours, days, weeks, and months. Generally the most appropriate applications, according to Entropia, consist of "loosely coupled, non-sequential tasks in batch processes with a high compute-to-data ratio." The high compute-to-data ratio goes hand-in-hand with a high compute-to-communications ratio, as you don't want to bog down the network by sending large amounts of data to each client, though in some cases you can do so during off hours. Programs with large databases that can be easily parsed for distribution are very appropriate.

Clearly, any application with individual tasks that need access to huge data sets will be more appropriate for larger systems than individual PCs. If terabytes of data are involved, a supercomputer makes sense as communications can take place across the system's very high speed backplane without bogging down the network. Server and other dedicated system clusters will be more appropriate for other slightly less data intensive applications. For a distributed application using numerous PCs, the required data should fit very comfortably in the PC's memory, with lots of room to spare.


Taking this further, United Devices recommends that the application should have the capability to fully exploit "coarse-grained parallelism," meaning it should be possible to partition the application into independent tasks or processes that can be computed concurrently. For most solutions there should not be any need for communication between the tasks except at task boundaries, though DataSynapse allows some interprocess communications. The tasks and small blocks of data should be such that they can be processed effectively on a modern PC and report results that, when combined with other PCs' results, produce coherent output. And the individual tasks should be small enough to produce a result on these systems within a few hours to a few days.
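
The sketch below is a single-machine stand-in for that kind of coarse-grained parallelism: the work is partitioned into independent tasks that communicate only at task boundaries, with a local process pool playing the role of the many PCs. The task itself is an arbitrary CPU-bound example.

# Coarse-grained parallelism in miniature: independent tasks, no cross-talk.
from multiprocessing import Pool


def independent_task(bounds):
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))


if __name__ == "__main__":
    step = 250_000
    tasks = [(i, i + step) for i in range(0, 2_000_000, step)]  # 8 independent chunks
    with Pool() as pool:
        partials = pool.map(independent_task, tasks)  # no inter-task communication
    print(sum(partials))   # combined only at the end, at the "task boundary"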

Types of Distributed Computing Applications
Beyond the very popular poster child SETI@Home application, the following scenarios are examples of other types of application tasks that can be set up to take advantage of distributed computing.
A query search against a huge database that can be split across lots of desktops, with the submitted query running concurrently against each fragment on each desktop.
Complex modeling and simulation techniques that increase the accuracy of results by increasing the number of random trials are also appropriate, as trials can be run concurrently on many desktops and combined to achieve greater statistical significance (a common Monte Carlo approach in various types of financial risk analysis; see the sketch below).
Exhaustive search techniques that require searching through a huge number of results to find solutions to a problem also make sense. Drug screening is a prime example.
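As a concrete, single-machine stand-in for the random-trials scenario above, the sketch below has each simulated "desktop" run an independent batch of Monte Carlo trials (here, estimating pi), with the server-side combine step simply pooling the counts. The number of batches and trials is arbitrary.

# Independent Monte Carlo batches whose counts are pooled for a more accurate estimate.
import random


def run_batch(n_trials, seed):
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


if __name__ == "__main__":
    batches = 50                       # one batch per participating desktop
    trials_per_batch = 100_000
    hits = sum(run_batch(trials_per_batch, seed=i) for i in range(batches))
    print(4 * hits / (batches * trials_per_batch))   # approaches pi as trials grow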
Many of today's vendors, particularly Entropia and United Devices, are aiming squarely at the life sciences market, which has a sudden need for massive computing power. As a result of sequencing the human genome, the number of identifiable biological targets for today's drugs is expected to increase from about 500 to about 10,000. Pharmaceutical firms have repositories of millions of different molecules and compounds, some of which may have characteristics that make them appropriate for inhibiting newly found proteins. The process of matching all these "ligands" to their appropriate targets is an ideal task for distributed computing, and the quicker it's done, the quicker and greater the benefits will be. Another related application is the recent trend of generating new types of drugs solely on computers.
Complex financial modeling, weather forecasting, and geophysical exploration are on the radar screens of these vendors, as well as car crash and other complex simulations.
To enhance their public relations efforts and demonstrate the effectiveness of their platforms, most of the distributed computing vendors have set up philanthropic computing projects that recruit CPU cycles across the Internet. Parabon's Compute Against Cancer harnesses an army of systems to track patient responses to chemotherapy, while Entropia's FightAIDS@Home project evaluates prospective targets for drug discovery. And of course, the SETI@home project has attracted millions of PCs to work on analyzing data from the Arecibo radio telescope for signatures that indicate extraterrestrial intelligence. There are also higher-end grid projects that plan to aid their research communities, including those run by the US National Science Foundation and NASA, as well as the European DataGrid, the Particle Physics Data Grid, the Network for Earthquake Engineering Simulation Grid, and the Grid Physics Network. And IBM has announced that it will help to create a life sciences grid in North Carolina to be used for genomic research.



Porting Applications
The major distributed computing platforms generally have two methods of porting applications, depending on the level of integration needed by the user, and whether the user has access to the source code of the application that needs to be distributed. Most of the vendors have software development kits (SDKs) that can be used to wrap existing applications with their platform without cracking the existing .exe file. The only other task is determining the complexity of pre- and post-processing functions. Entropia in particular boasts that it offers "binary integration," which can integrate applications into the platform without the user having to access the source code.
Other vendors, including DataSynapse and United Devices, offer APIs of varying complexity that require access to the source code but provide tight integration and give the application access to all the security, management, and other features of the platforms. Most of these vendors offer several libraries of proven distributed computing paradigms. DataSynapse comes with C++ and Java software developer kit support. United Devices uses a POSIX-compliant C/C++ API. Integrating the application can take anywhere from half a day to months depending on how much optimization is needed. Some vendors also allow access to their own in-house grids for testing by application developers.
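In generic terms (and not the API of Entropia, DataSynapse, or United Devices), the wrapping approach amounts to surrounding an unmodified function with pre-processing that splits the input and post-processing that merges the partial outputs, as in this hypothetical sketch:

# Generic "wrap an existing application" sketch; names and structure are illustrative.
from typing import Callable


def make_distributable(app: Callable, split: Callable, merge: Callable) -> Callable:
    def distributed(job):
        pieces = split(job)                  # pre-processing on the server side
        partials = [app(p) for p in pieces]  # each piece would run on a remote client
        return merge(partials)               # post-processing of the returned results
    return distributed


# Example: wrapping an existing word-count "application" without touching its code.
def word_count(text: str) -> int:
    return len(text.split())


count_everywhere = make_distributable(
    word_count,
    split=lambda docs: docs,   # each document is an independent work unit
    merge=sum,
)

print(count_everywhere(["one two three", "four five"]))   # prints 5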







Companies and Organizations to Watch

Avaki Corporation
Corporate Headquarters
One Memorial Drive
Cambridge, MA 02142
617-374-2500
http://www.avaki.com/
Makes Avaki 2.0, grid computing software for mixed platform environments and global grid configurations. Includes a PKI-based security infrastructure for grids spanning multiple companies, locations, and domains.

The DataGrid
Dissemination Office:
CNR-CED
Piazzale Aldo Moro 7
00145 Roma (Italy)
+39 06 49933205
http://www.eu-datagrid.org/
A project funded by the European Union and led by CERN and five other partners whose goal is to set up computing grids that can analyze data from scientific exploration across the continent. The project hopes to develop scalable software solutions and testbeds that can handle thousands of users and tens of thousands of grid-connected systems from multiple research institutions.

DataSynapse Inc.
632 Broadway
5th Floor
New York, NY 10012-2614
212-842-8842
http://www.datasynapse.com/
Makes LiveCluster, distributed computing software middleware aimed at the financial services and energy markets. Currently mostly for use inside the firewall. Includes the ability for interprocess communications among distributed application packages.

Distributed.Net
http://www.distributed.net/
Founded in 1997, Distributed.Net was one of the first non-profit distributed computing organizations and the first to create a distributed computing network on the Internet. Distributed.net was highly successful in using distributed computing to take on cryptographic challenges sponsored by RSA Labs and CS Communication & Systems.

Entropia, Inc.
10145 Pacific Heights Blvd., Suite 800
San Diego, CA 92121 USA
858-623-2840
http://www.entropia.com/
Makes the Entropia distributed computing platform aimed at the life sciences market. Currently mostly for use inside the firewall. Boasts binary integration, which lets you integrate your applications using any language without having to access the application's source code. Recently integrated its software with The Globus Toolkit.

Global Grid Forum
http://www.gridforum.org/
A standards organization composed of over 200 companies working to devise and promote standards, best practices, and integrated platforms for grid computing.

The Globus Project
http://www.globus.org/
A research and development project consisting of members of the Argonne National Laboratory, the University of Southern California's Information Sciences Institute, NASA, and others focused on enabling the application of grid concepts to scientific and engineering computing. The team has produced the Globus Toolkit, an open source set of middleware services and software libraries for constructing grids and grid applications. The Toolkit includes software for security, information infrastructure, resource management, data management, communication, fault detection, and portability.

Grid Physics Network
http://www.griphyn.org/
The Grid Physics Network (GriPhyN) is a team of experimental physicists and IT researchers from the University of Florida, University of Chicago, Argonne National Laboratory and about a dozen other research centers working to implement the first worldwide Petabyte-scale computational and data grid for physics and other scientific research. The project is funded by the National Science Foundation.

IBM
International Business Machines Corporation
New Orchard Road
Armonk, NY 10504.
914-499-1900
IBM is heavily involved in setting up over 50 computational grids across the planet using IBM infrastructure for cancer research and other initiatives. IBM was selected in August by a consortium of four U.S. research centers to help create the "world's most powerful grid," which when completed in 2003 will supposedly be capable of processing 13.6 trillion calculations per second. IBM also markets the IBM Globus Toolkit, a version of the Toolkit for its servers running AIX and Linux.

Intel
Intel Corporation
2200 Mission College Blvd.
Santa Clara, California 95052-8119
408-765-8080
http://www.intel.com/
Intel is the principal founder of the Peer-To-Peer Working Group and recently announced the Peer-to-Peer Accelerator Kit for Microsoft.NET, middleware based on the Microsoft.NET platform that provides building blocks for the development of peer-to-peer applications and includes support for location independence, encryption and availability. The technology, source code, demo applications, and documentation will be made available on Microsoft's gotdotnet (www.gotdotnet.com) website. The download will be free. The target release date is early December. Also partners with United Devices on the Intel-United Devices Cancer Research Project, which enlists Internet users in a distributed computing grid for cancer research.

NASA Advanced SuperComputing Division (NAS)
NAS Systems Division Office
NASA Ames Research Center
Moffett Field, CA 94035
650-604-4502
http://www.nas.nasa.gov/
NASA's NAS Division is leading a joint effort among leaders within government, academia, and industry to build and test NASA's Information Power Grid (IPG), a grid of high performance computers, data storage devices, scientific instruments, and advanced user interfaces that will help NASA scientists collaborate with these other institutions to "solve important problems facing the world in the 21st century."

Network for Earthquake Engineering Simulation Grid (NEESgrid)
http://www.neesgrid.org/
In August 2001, the National Science Foundation awarded $10 million to a consortium of institutions led by the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign to build the NEESgrid, which will link earthquake engineering research sites across the country in a national grid, provide data storage facilities and repositories, and offer remote access to research tools.

Parabon Computation
3930 Walnut Street, Suite 100
Fairfax, VA 22030-4738
703-460-4100
http://www.parabon.com/
Makes Frontier server software and Pioneer client software, a distributed computing platform that supposedly can span enterprises or the Internet. Also runs Compute Against Cancer, a distributed computing grid for non-profit cancer organizations.

Particle Physics Data Grid (PPDG)
http://www.ppdg.net/
A collaboration of the Argonne National Laboratory, Brookhaven National Laboratory, Caltech, and others to develop, acquire and deliver the tools for a national computing grid for current and future high-energy and nuclear physics experiments.

Peer-to-Peer Working Group
5440 SW Westgate Drive, Suite 217
Portland, OR 97221
503-291-2572
http://www.peer-to-peerwg.org/
A standards group founded by Intel and composed of over 30 companies with the goal of developing best practices that enable interoperability among peer-to-peer applications.

Platform Computing
3760 14th Ave
Markham, Ontario L3R 3T7
Canada
905-948-8448
http://www.platform.com/
Makes a number of enterprise distributed and grid computing products, including Platform LSF Active Cluster for distributed computing across Windows desktops, Platform LSF for distributed computing across mixed environments of UNIX, Linux, Macintosh and Windows servers, desktops, supercomputers, and clusters. Also offers a number of products for distributed computing management and analysis, and its own commercial distribution of the Globus Toolkit. Targets computer and industrial manufacturing, life sciences, government, and financial services markets.

SETI@Home
http://setiathome.ssl.berkeley.edu/
A worldwide distributed computing grid based at the University of California at Berkeley that allows users connected to the Internet to donate their PCs' spare CPU cycles to the search for extraterrestrial life in the universe. Its task is to sort through the 1.4 billion potential signals picked up by the Arecibo telescope to find signals that repeat. Users receive approximately 350K of data at a time, and the client software runs as a screensaver.

Sun Microsystems Inc.
901 San Antonio Road
Palo Alto, CA 94303
USA
650-960-1300
http://www.sun.com/
Sun is involved in several grid and peer-to-peer products and initiatives, including its open source Grid Engine platform for setting up departmental and campus computing grids (with the eventual goal of a global grid platform) and its JXTA (short for juxtapose) set of protocols and building blocks for developing peer-to-peer applications.

United Devices, Inc.
12675 Research, Bldg A
Austin, Texas 78759
512-331-6016
http://www.ud.com/
Makes the MetaProcessor distributed computing platform aimed at life sciences, geosciences, and industrial design and engineering markets and currently focused inside the firewall. Also partners with Intel on the Intel-United Devices Cancer Research Project, which enlists Internet users in a distributed computing grid for cancer research.
