03.05.2014 Views

GridFTP: A Data Transfer Protocol for the Grid - Open Grid Forum

GridFTP: A Data Transfer Protocol for the Grid - Open Grid Forum

GridFTP: A Data Transfer Protocol for the Grid - Open Grid Forum

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

GRIDFTP: A DATA TRANSFER PROTOCOL FOR THE GRID<br />

<strong><strong>Grid</strong>FTP</strong>: A <strong>Data</strong> <strong>Transfer</strong> <strong>Protocol</strong> <strong>for</strong> <strong>the</strong> <strong>Grid</strong><br />

<strong>Grid</strong> <strong>Forum</strong> <strong>Data</strong> Working Group on <strong><strong>Grid</strong>FTP</strong><br />

Bill Allcock, Lee Liming, Steven Tuecke – ANL<br />

Ann Chervenak – USC/ISI<br />

<br />

Introduction<br />

In <strong>Grid</strong> environments, access to distributed data is typically as important as access to distributed<br />

computational resources. Distributed scientific and engineering applications require:<br />

• transfers of large amounts of data (terabytes or petabytes) between storage<br />

systems, and<br />

• access to large amounts of data (gigabytes or terabytes) by many geographically<br />

distributed applications and users <strong>for</strong> analysis, visualization, etc.<br />

Un<strong>for</strong>tunately, <strong>the</strong> lack of standard protocols <strong>for</strong> transfer and access of data in <strong>the</strong> <strong>Grid</strong> has led<br />

to a fragmented <strong>Grid</strong> storage community. Users who wish to access different storage systems<br />

are <strong>for</strong>ced to use multiple protocols and/or APIs, and it is difficult to efficiently transfer data<br />

between <strong>the</strong>se different storage systems.<br />

We propose a common data transfer and access protocol called <strong><strong>Grid</strong>FTP</strong> that provides secure,<br />

efficient data movement in <strong>Grid</strong> environments. This protocol, which extends <strong>the</strong> standard FTP<br />

protocol, provides a superset of <strong>the</strong> features offered by <strong>the</strong> various <strong>Grid</strong> storage systems<br />

currently in use. We chose <strong>the</strong> FTP protocol because it is <strong>the</strong> most commonly used protocol <strong>for</strong><br />

data transfer on <strong>the</strong> Internet, and of <strong>the</strong> existing candidates from which to start it comes closest<br />

to meeting <strong>the</strong> <strong>Grid</strong>’s needs.<br />

The <strong><strong>Grid</strong>FTP</strong> protocol includes <strong>the</strong> following features:<br />

• <strong>Grid</strong> Security Infrastructure (GSI) and Kerberos support<br />

• Third-party control of data transfer<br />

• Parallel data transfer<br />

• Striped data transfer<br />

• Partial file transfer<br />

• Automatic negotiation of TCP buffer/window sizes<br />

• Support <strong>for</strong> reliable and restartable data transfer<br />

• Integrated instrumentation<br />

Some of <strong>the</strong>se features are supported by FTP extensions that have already been standardized in<br />

<strong>the</strong> IETF, but which are currently seldom implemented. O<strong>the</strong>r features are new extensions FTP.<br />

1


GRIDFTP: A DATA TRANSFER PROTOCOL FOR THE GRID<br />

The <strong>Grid</strong> <strong>Forum</strong> <strong><strong>Grid</strong>FTP</strong> working group was <strong>for</strong>med within <strong>the</strong> <strong>Data</strong> working group to<br />

foster <strong>the</strong> development of a <strong><strong>Grid</strong>FTP</strong> protocol standard, as well as interoperable storage<br />

clients and servers that implement <strong>the</strong> <strong><strong>Grid</strong>FTP</strong> protocol.<br />

Motivation <strong>for</strong> a common protocol<br />

There are already a number of storage systems in use by <strong>the</strong> <strong>Grid</strong> community. These storage<br />

systems have been created in response to specific needs <strong>for</strong> storing and accessing large datasets.<br />

They each focus on a distinct set of requirements and provide distinct services to <strong>the</strong>ir clients.<br />

For example, some storage systems (DPSS, HPSS) focus on high-per<strong>for</strong>mance access to data<br />

and utilize parallel data transfer streams and/or striping across multiple servers to improve<br />

per<strong>for</strong>mance. i ii O<strong>the</strong>r systems (DFS) focus on supporting high-volume usage and utilize dataset<br />

replication and local caching to divide and balance server load. iii The SRB system connects<br />

heterogeneous data collections and provides a uni<strong>for</strong>m client interface to <strong>the</strong>se repositories, and<br />

also provides metadata <strong>for</strong> use in identifying and locating data within <strong>the</strong> storage system. iv Still<br />

o<strong>the</strong>r systems (HDF5) focus on <strong>the</strong> structure of <strong>the</strong> data, and provide client support <strong>for</strong><br />

accessing structured data from a variety of underlying storage systems. v<br />

Un<strong>for</strong>tunately, most of <strong>the</strong>se storage systems utilize incompatible, an often unpublished<br />

protocols <strong>for</strong> accessing data, and <strong>the</strong>re<strong>for</strong>e require <strong>the</strong> use of <strong>the</strong>ir own client libraries to access<br />

data. The use of multiple incompatible protocols and client libraries <strong>for</strong> accessing storage<br />

effectively partitions <strong>the</strong> datasets available on <strong>the</strong> grid. Applications that require access to data<br />

stored in different storage systems must ei<strong>the</strong>r choose to only use a subset of storage systems, or<br />

must use multiple methods to retrieve data from <strong>the</strong> various storage systems.<br />

One approach to breaking down partitions created by <strong>the</strong>se mutually incompatible storage<br />

system protocols is to build a layered client or gateway which can present <strong>the</strong> user with one<br />

interface, but which translates requests into <strong>the</strong> various storage system protocols and/or client<br />

library calls. This approach is attractive to existing storage system providers because it does not<br />

require <strong>the</strong>m adopt support <strong>for</strong> a new protocol. But it also has significant disadvantages,<br />

including:<br />

• Per<strong>for</strong>mance: Costly translations are often required between <strong>the</strong> layered client<br />

and storage system specific client libraries and protocols. In addition, it can be<br />

challenging to efficiently transfer a dataset from one storage system to ano<strong>the</strong>r.<br />

• Complexity: Building and maintaining a client or gateway that supports<br />

numerous storage systems is considerable work. In addition, staying up to date<br />

as each storage system independently evolves is very difficult. This is fur<strong>the</strong>r<br />

exacerbated by <strong>the</strong> need to provide support <strong>for</strong> multiple client languages, such as<br />

C/C++, Java, Perl, Python, shells, etc.<br />

It would be mutually advantageous to both storage providers and users to have a common level<br />

of interoperability between all of <strong>the</strong>se disparate systems: a common—but extensible—<br />

underlying data transfer protocol. Storage providers would gain a broader user base, because<br />

<strong>the</strong>ir data would be available to any client. Storage users would gain access to a broader range<br />

of storage systems and data. In addition, <strong>the</strong>se benefits can be gained without <strong>the</strong> per<strong>for</strong>mance<br />

and complexity problems of <strong>the</strong> layered client or gateway approach.<br />

2


GRIDFTP: A DATA TRANSFER PROTOCOL FOR THE GRID<br />

Fur<strong>the</strong>rmore, establishing a common data transfer protocol would eliminate <strong>the</strong> current<br />

duplication of ef<strong>for</strong>t in developing unique data transfer capabilities <strong>for</strong> different storage systems.<br />

A pooling of ef<strong>for</strong>t in <strong>the</strong> data transfer protocol area would lead to greater reliability,<br />

per<strong>for</strong>mance, and overall features that would <strong>the</strong>n be available to all distributed storage systems.<br />

Characteristics of <strong>the</strong> data transfer protocol<br />

In order to make a common data transfer protocol attractive to users and developers of existing<br />

storage systems, we must provide a transfer protocol that offers a superset of <strong>the</strong> features<br />

offered by systems currently in regular use. In addition, <strong>the</strong> protocol must be extensible, in<br />

order to support future innovations by storage system users and developers.<br />

We have observed that <strong>the</strong> FTP protocol is <strong>the</strong> protocol most commonly used <strong>for</strong> data transfer<br />

on <strong>the</strong> Internet, and <strong>the</strong> most likely candidate <strong>for</strong> meeting <strong>the</strong> <strong>Grid</strong>’s needs. It is attractive in<br />

particular <strong>for</strong> <strong>the</strong> following reasons.<br />

• It is a widely implemented and well-understood IETF standard protocol. There is<br />

a large code base and expertise from which to build.<br />

• It provides a well-defined architecture <strong>for</strong> protocol extensions, and supports<br />

dynamic discovery of <strong>the</strong> extensions supported by a particular implementation.<br />

• Numerous groups have added various extensions through <strong>the</strong> IETF. Some of<br />

<strong>the</strong>se extensions are particularly useful in <strong>the</strong> <strong>Grid</strong>.<br />

• In addition to client/server transfers (i.e. “put/get”, or “remote read/write”), it<br />

also supports transfers directly between two servers, mediated by a third party<br />

client (i.e. “third party transfer”).<br />

• The separation of data and control channels onto different sockets allows <strong>for</strong><br />

easier extensibility <strong>for</strong> parallel and striped transfers, efficiently transiting<br />

firewalls, etc.<br />

Most current FTP implementations support only a subset of <strong>the</strong> features defined in <strong>the</strong> FTP<br />

protocol (RFC 969) and its accepted extensions. Some of <strong>the</strong> seldom-implemented features are<br />

useful to <strong>Grid</strong> applications. But <strong>the</strong> standards also lack several features which <strong>Grid</strong> applications<br />

require.<br />

We intend to select a subset of <strong>the</strong> existing FTP standards and fur<strong>the</strong>r extend it, adding <strong>the</strong><br />

following features. We believe that <strong>the</strong> resulting protocol will be a suitable candidate <strong>for</strong> <strong>the</strong><br />

common data transfer protocol <strong>for</strong> <strong>the</strong> grid, which we call “<strong><strong>Grid</strong>FTP</strong>”.<br />

<strong>Grid</strong> Security Infrastructure (GSI) and Kerberos support<br />

Robust and flexible au<strong>the</strong>ntication, integrity, and confidentiality features are critical when<br />

transferring or accessing files. <strong><strong>Grid</strong>FTP</strong> must support GSI and Kerberos au<strong>the</strong>ntication, with<br />

user controlled setting of various levels of data integrity and/or confidentiality. <strong><strong>Grid</strong>FTP</strong><br />

provides this capability by implementing <strong>the</strong> GSSAPI au<strong>the</strong>ntication mechanisms defined by<br />

RFC 2228, “FTP Security Extensions”.<br />

Third-party control of data transfer<br />

In order to manage large data sets <strong>for</strong> large distributed communities, it is necessary to provide<br />

third-party control of transfers between storage servers. <strong><strong>Grid</strong>FTP</strong> provides this capability by<br />

3


GRIDFTP: A DATA TRANSFER PROTOCOL FOR THE GRID<br />

adding GSSAPI security to <strong>the</strong> existing third-party transfer capability defined in <strong>the</strong> FTP<br />

standard.<br />

Parallel data transfer<br />

On wide-area links, using multiple TCP streams (even between <strong>the</strong> same source and<br />

destination) can improve aggregate bandwidth over using a single TCP stream. This is required<br />

both between a single client and a single server, and between two servers. <strong><strong>Grid</strong>FTP</strong> supports<br />

parallel data transfer through FTP command extensions and data channel extensions defined in<br />

<strong>the</strong> <strong>Grid</strong> <strong>Forum</strong> draft.<br />

Striped data transfer<br />

Using multiple TCP streams to transfer data that is partitioned across multiple servers (ala.<br />

DPSS) can fur<strong>the</strong>r improve aggregate bandwidth. <strong><strong>Grid</strong>FTP</strong> supports striped data transfers<br />

through extensions defined in <strong>the</strong> <strong>Grid</strong> <strong>Forum</strong> draft.<br />

Partial file transfer<br />

Many applications require <strong>the</strong> transfer of partial files. However, standard FTP requires <strong>the</strong><br />

application to transfer <strong>the</strong> entire file, or <strong>the</strong> remainder of a file starting at a particular offset.<br />

<strong><strong>Grid</strong>FTP</strong> introduces new FTP commands, as defined in <strong>the</strong> <strong>Grid</strong> <strong>Forum</strong> draft, to support<br />

transfers of regions of a file.<br />

Automatic negotiation of TCP buffer/window sizes<br />

Manually setting TCP buffer/window sizes is an error-prone process (particularly <strong>for</strong> nonexperts)<br />

and is often simply not done. <strong><strong>Grid</strong>FTP</strong> extends <strong>the</strong> standard FTP command set and<br />

data channel protocol to support both manual setting and automatic negotiation of TCP buffer<br />

sizes both <strong>for</strong> large files and large sets of small files.<br />

Support <strong>for</strong> reliable data transfer<br />

Reliable transfer is important <strong>for</strong> many applications that manage data. Fault recovery methods<br />

<strong>for</strong> handling transient network failures, server outages, etc. are needed. The FTP standard<br />

includes basic features <strong>for</strong> restarting failed transfer that are not widely implemented. The<br />

<strong><strong>Grid</strong>FTP</strong> protocol exploits <strong>the</strong>se features, and extends <strong>the</strong>m cover <strong>the</strong> new data channel<br />

protocol.<br />

Implementation Status<br />

The <strong><strong>Grid</strong>FTP</strong> protocol has been adopted, in full or in part, in <strong>the</strong> following software systems:<br />

• Globus Toolkit: The next version of <strong>the</strong> Globus Toolkit, which is currently<br />

available as an alpha release, includes a reference implementation of <strong>the</strong> full<br />

<strong><strong>Grid</strong>FTP</strong> protocol. Client and server SDKs, command line programs, and sample<br />

servers are included.<br />

• wuftpd: The Washington University ftpd server has been modified to support<br />

GSI, partial files, and parallel transfers. This work was done by <strong>the</strong> ACES group<br />

at NCSA and <strong>the</strong> Globus Project group at Argonne.<br />

• ncftp: The ncftp client has been modified to support GSI. This work was done<br />

by <strong>the</strong> Globus Project group at Argonne.<br />

4


GRIDFTP: A DATA TRANSFER PROTOCOL FOR THE GRID<br />

• HPSS pftpd: The HPSS pftpd server has been modified to support GSI. This<br />

work was done by SDSC.<br />

• Unitree ftpd: The Unitree ftpd server has been modified to support GSI. This<br />

work was done by <strong>the</strong> ACES group at NCSA.<br />

References<br />

i B. Tierney, W. Johnston, J. Lee, G. Hoo, and M. Thompson. End-to-end per<strong>for</strong>mance analysis of high speed distributed<br />

storage systems in wide area ATM networ<br />

ks. In NASA/Goddard Conference on Mass Storage Systems and Technologies, 1996, LBNL-39064.<br />

ii R.W. Watson and R.A. Coyne. The parallel I/O architecture of <strong>the</strong> high-per<strong>for</strong>mance storage system (HPSS). In IEEE<br />

MSS Symposium, 1995.<br />

iii It is interesting to note that DFS also provides <strong>for</strong> cache coherence in <strong>the</strong> face of multiple writers. This is one of its<br />

drawbacks <strong>for</strong> data grid applications, since this feature <strong>for</strong>ces unnecessary overhead. <strong>Data</strong> grid applications seldom<br />

need <strong>the</strong> ability to change existing datasets.<br />

iv In<strong>for</strong>mation on SRB is available on <strong>the</strong> World Wide Web at http://www.npaci.edu/DICE/SRB/.<br />

v In<strong>for</strong>mation about HDF/HDF5 is available on <strong>the</strong> World Wide Web at http://hdf.ncsa.uiuc.edu/.<br />

vi In<strong>for</strong>mation about <strong>the</strong> Globus gsiftp is available on <strong>the</strong> World Wide Web at http://www.globus.org/security/v1.1/.<br />

5

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!