GridFTP: A Data Transfer Protocol for the Grid - Open Grid Forum
GridFTP: A Data Transfer Protocol for the Grid - Open Grid Forum
GridFTP: A Data Transfer Protocol for the Grid - Open Grid Forum
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
GRIDFTP: A DATA TRANSFER PROTOCOL FOR THE GRID<br />
<strong><strong>Grid</strong>FTP</strong>: A <strong>Data</strong> <strong>Transfer</strong> <strong>Protocol</strong> <strong>for</strong> <strong>the</strong> <strong>Grid</strong><br />
<strong>Grid</strong> <strong>Forum</strong> <strong>Data</strong> Working Group on <strong><strong>Grid</strong>FTP</strong><br />
Bill Allcock, Lee Liming, Steven Tuecke – ANL<br />
Ann Chervenak – USC/ISI<br />
<br />
Introduction<br />
In <strong>Grid</strong> environments, access to distributed data is typically as important as access to distributed<br />
computational resources. Distributed scientific and engineering applications require:<br />
• transfers of large amounts of data (terabytes or petabytes) between storage<br />
systems, and<br />
• access to large amounts of data (gigabytes or terabytes) by many geographically<br />
distributed applications and users <strong>for</strong> analysis, visualization, etc.<br />
Un<strong>for</strong>tunately, <strong>the</strong> lack of standard protocols <strong>for</strong> transfer and access of data in <strong>the</strong> <strong>Grid</strong> has led<br />
to a fragmented <strong>Grid</strong> storage community. Users who wish to access different storage systems<br />
are <strong>for</strong>ced to use multiple protocols and/or APIs, and it is difficult to efficiently transfer data<br />
between <strong>the</strong>se different storage systems.<br />
We propose a common data transfer and access protocol called <strong><strong>Grid</strong>FTP</strong> that provides secure,<br />
efficient data movement in <strong>Grid</strong> environments. This protocol, which extends <strong>the</strong> standard FTP<br />
protocol, provides a superset of <strong>the</strong> features offered by <strong>the</strong> various <strong>Grid</strong> storage systems<br />
currently in use. We chose <strong>the</strong> FTP protocol because it is <strong>the</strong> most commonly used protocol <strong>for</strong><br />
data transfer on <strong>the</strong> Internet, and of <strong>the</strong> existing candidates from which to start it comes closest<br />
to meeting <strong>the</strong> <strong>Grid</strong>’s needs.<br />
The <strong><strong>Grid</strong>FTP</strong> protocol includes <strong>the</strong> following features:<br />
• <strong>Grid</strong> Security Infrastructure (GSI) and Kerberos support<br />
• Third-party control of data transfer<br />
• Parallel data transfer<br />
• Striped data transfer<br />
• Partial file transfer<br />
• Automatic negotiation of TCP buffer/window sizes<br />
• Support <strong>for</strong> reliable and restartable data transfer<br />
• Integrated instrumentation<br />
Some of <strong>the</strong>se features are supported by FTP extensions that have already been standardized in<br />
<strong>the</strong> IETF, but which are currently seldom implemented. O<strong>the</strong>r features are new extensions FTP.<br />
1
GRIDFTP: A DATA TRANSFER PROTOCOL FOR THE GRID<br />
The <strong>Grid</strong> <strong>Forum</strong> <strong><strong>Grid</strong>FTP</strong> working group was <strong>for</strong>med within <strong>the</strong> <strong>Data</strong> working group to<br />
foster <strong>the</strong> development of a <strong><strong>Grid</strong>FTP</strong> protocol standard, as well as interoperable storage<br />
clients and servers that implement <strong>the</strong> <strong><strong>Grid</strong>FTP</strong> protocol.<br />
Motivation <strong>for</strong> a common protocol<br />
There are already a number of storage systems in use by <strong>the</strong> <strong>Grid</strong> community. These storage<br />
systems have been created in response to specific needs <strong>for</strong> storing and accessing large datasets.<br />
They each focus on a distinct set of requirements and provide distinct services to <strong>the</strong>ir clients.<br />
For example, some storage systems (DPSS, HPSS) focus on high-per<strong>for</strong>mance access to data<br />
and utilize parallel data transfer streams and/or striping across multiple servers to improve<br />
per<strong>for</strong>mance. i ii O<strong>the</strong>r systems (DFS) focus on supporting high-volume usage and utilize dataset<br />
replication and local caching to divide and balance server load. iii The SRB system connects<br />
heterogeneous data collections and provides a uni<strong>for</strong>m client interface to <strong>the</strong>se repositories, and<br />
also provides metadata <strong>for</strong> use in identifying and locating data within <strong>the</strong> storage system. iv Still<br />
o<strong>the</strong>r systems (HDF5) focus on <strong>the</strong> structure of <strong>the</strong> data, and provide client support <strong>for</strong><br />
accessing structured data from a variety of underlying storage systems. v<br />
Un<strong>for</strong>tunately, most of <strong>the</strong>se storage systems utilize incompatible, an often unpublished<br />
protocols <strong>for</strong> accessing data, and <strong>the</strong>re<strong>for</strong>e require <strong>the</strong> use of <strong>the</strong>ir own client libraries to access<br />
data. The use of multiple incompatible protocols and client libraries <strong>for</strong> accessing storage<br />
effectively partitions <strong>the</strong> datasets available on <strong>the</strong> grid. Applications that require access to data<br />
stored in different storage systems must ei<strong>the</strong>r choose to only use a subset of storage systems, or<br />
must use multiple methods to retrieve data from <strong>the</strong> various storage systems.<br />
One approach to breaking down partitions created by <strong>the</strong>se mutually incompatible storage<br />
system protocols is to build a layered client or gateway which can present <strong>the</strong> user with one<br />
interface, but which translates requests into <strong>the</strong> various storage system protocols and/or client<br />
library calls. This approach is attractive to existing storage system providers because it does not<br />
require <strong>the</strong>m adopt support <strong>for</strong> a new protocol. But it also has significant disadvantages,<br />
including:<br />
• Per<strong>for</strong>mance: Costly translations are often required between <strong>the</strong> layered client<br />
and storage system specific client libraries and protocols. In addition, it can be<br />
challenging to efficiently transfer a dataset from one storage system to ano<strong>the</strong>r.<br />
• Complexity: Building and maintaining a client or gateway that supports<br />
numerous storage systems is considerable work. In addition, staying up to date<br />
as each storage system independently evolves is very difficult. This is fur<strong>the</strong>r<br />
exacerbated by <strong>the</strong> need to provide support <strong>for</strong> multiple client languages, such as<br />
C/C++, Java, Perl, Python, shells, etc.<br />
It would be mutually advantageous to both storage providers and users to have a common level<br />
of interoperability between all of <strong>the</strong>se disparate systems: a common—but extensible—<br />
underlying data transfer protocol. Storage providers would gain a broader user base, because<br />
<strong>the</strong>ir data would be available to any client. Storage users would gain access to a broader range<br />
of storage systems and data. In addition, <strong>the</strong>se benefits can be gained without <strong>the</strong> per<strong>for</strong>mance<br />
and complexity problems of <strong>the</strong> layered client or gateway approach.<br />
2
GRIDFTP: A DATA TRANSFER PROTOCOL FOR THE GRID<br />
Fur<strong>the</strong>rmore, establishing a common data transfer protocol would eliminate <strong>the</strong> current<br />
duplication of ef<strong>for</strong>t in developing unique data transfer capabilities <strong>for</strong> different storage systems.<br />
A pooling of ef<strong>for</strong>t in <strong>the</strong> data transfer protocol area would lead to greater reliability,<br />
per<strong>for</strong>mance, and overall features that would <strong>the</strong>n be available to all distributed storage systems.<br />
Characteristics of <strong>the</strong> data transfer protocol<br />
In order to make a common data transfer protocol attractive to users and developers of existing<br />
storage systems, we must provide a transfer protocol that offers a superset of <strong>the</strong> features<br />
offered by systems currently in regular use. In addition, <strong>the</strong> protocol must be extensible, in<br />
order to support future innovations by storage system users and developers.<br />
We have observed that <strong>the</strong> FTP protocol is <strong>the</strong> protocol most commonly used <strong>for</strong> data transfer<br />
on <strong>the</strong> Internet, and <strong>the</strong> most likely candidate <strong>for</strong> meeting <strong>the</strong> <strong>Grid</strong>’s needs. It is attractive in<br />
particular <strong>for</strong> <strong>the</strong> following reasons.<br />
• It is a widely implemented and well-understood IETF standard protocol. There is<br />
a large code base and expertise from which to build.<br />
• It provides a well-defined architecture <strong>for</strong> protocol extensions, and supports<br />
dynamic discovery of <strong>the</strong> extensions supported by a particular implementation.<br />
• Numerous groups have added various extensions through <strong>the</strong> IETF. Some of<br />
<strong>the</strong>se extensions are particularly useful in <strong>the</strong> <strong>Grid</strong>.<br />
• In addition to client/server transfers (i.e. “put/get”, or “remote read/write”), it<br />
also supports transfers directly between two servers, mediated by a third party<br />
client (i.e. “third party transfer”).<br />
• The separation of data and control channels onto different sockets allows <strong>for</strong><br />
easier extensibility <strong>for</strong> parallel and striped transfers, efficiently transiting<br />
firewalls, etc.<br />
Most current FTP implementations support only a subset of <strong>the</strong> features defined in <strong>the</strong> FTP<br />
protocol (RFC 969) and its accepted extensions. Some of <strong>the</strong> seldom-implemented features are<br />
useful to <strong>Grid</strong> applications. But <strong>the</strong> standards also lack several features which <strong>Grid</strong> applications<br />
require.<br />
We intend to select a subset of <strong>the</strong> existing FTP standards and fur<strong>the</strong>r extend it, adding <strong>the</strong><br />
following features. We believe that <strong>the</strong> resulting protocol will be a suitable candidate <strong>for</strong> <strong>the</strong><br />
common data transfer protocol <strong>for</strong> <strong>the</strong> grid, which we call “<strong><strong>Grid</strong>FTP</strong>”.<br />
<strong>Grid</strong> Security Infrastructure (GSI) and Kerberos support<br />
Robust and flexible au<strong>the</strong>ntication, integrity, and confidentiality features are critical when<br />
transferring or accessing files. <strong><strong>Grid</strong>FTP</strong> must support GSI and Kerberos au<strong>the</strong>ntication, with<br />
user controlled setting of various levels of data integrity and/or confidentiality. <strong><strong>Grid</strong>FTP</strong><br />
provides this capability by implementing <strong>the</strong> GSSAPI au<strong>the</strong>ntication mechanisms defined by<br />
RFC 2228, “FTP Security Extensions”.<br />
Third-party control of data transfer<br />
In order to manage large data sets <strong>for</strong> large distributed communities, it is necessary to provide<br />
third-party control of transfers between storage servers. <strong><strong>Grid</strong>FTP</strong> provides this capability by<br />
3
GRIDFTP: A DATA TRANSFER PROTOCOL FOR THE GRID<br />
adding GSSAPI security to <strong>the</strong> existing third-party transfer capability defined in <strong>the</strong> FTP<br />
standard.<br />
Parallel data transfer<br />
On wide-area links, using multiple TCP streams (even between <strong>the</strong> same source and<br />
destination) can improve aggregate bandwidth over using a single TCP stream. This is required<br />
both between a single client and a single server, and between two servers. <strong><strong>Grid</strong>FTP</strong> supports<br />
parallel data transfer through FTP command extensions and data channel extensions defined in<br />
<strong>the</strong> <strong>Grid</strong> <strong>Forum</strong> draft.<br />
Striped data transfer<br />
Using multiple TCP streams to transfer data that is partitioned across multiple servers (ala.<br />
DPSS) can fur<strong>the</strong>r improve aggregate bandwidth. <strong><strong>Grid</strong>FTP</strong> supports striped data transfers<br />
through extensions defined in <strong>the</strong> <strong>Grid</strong> <strong>Forum</strong> draft.<br />
Partial file transfer<br />
Many applications require <strong>the</strong> transfer of partial files. However, standard FTP requires <strong>the</strong><br />
application to transfer <strong>the</strong> entire file, or <strong>the</strong> remainder of a file starting at a particular offset.<br />
<strong><strong>Grid</strong>FTP</strong> introduces new FTP commands, as defined in <strong>the</strong> <strong>Grid</strong> <strong>Forum</strong> draft, to support<br />
transfers of regions of a file.<br />
Automatic negotiation of TCP buffer/window sizes<br />
Manually setting TCP buffer/window sizes is an error-prone process (particularly <strong>for</strong> nonexperts)<br />
and is often simply not done. <strong><strong>Grid</strong>FTP</strong> extends <strong>the</strong> standard FTP command set and<br />
data channel protocol to support both manual setting and automatic negotiation of TCP buffer<br />
sizes both <strong>for</strong> large files and large sets of small files.<br />
Support <strong>for</strong> reliable data transfer<br />
Reliable transfer is important <strong>for</strong> many applications that manage data. Fault recovery methods<br />
<strong>for</strong> handling transient network failures, server outages, etc. are needed. The FTP standard<br />
includes basic features <strong>for</strong> restarting failed transfer that are not widely implemented. The<br />
<strong><strong>Grid</strong>FTP</strong> protocol exploits <strong>the</strong>se features, and extends <strong>the</strong>m cover <strong>the</strong> new data channel<br />
protocol.<br />
Implementation Status<br />
The <strong><strong>Grid</strong>FTP</strong> protocol has been adopted, in full or in part, in <strong>the</strong> following software systems:<br />
• Globus Toolkit: The next version of <strong>the</strong> Globus Toolkit, which is currently<br />
available as an alpha release, includes a reference implementation of <strong>the</strong> full<br />
<strong><strong>Grid</strong>FTP</strong> protocol. Client and server SDKs, command line programs, and sample<br />
servers are included.<br />
• wuftpd: The Washington University ftpd server has been modified to support<br />
GSI, partial files, and parallel transfers. This work was done by <strong>the</strong> ACES group<br />
at NCSA and <strong>the</strong> Globus Project group at Argonne.<br />
• ncftp: The ncftp client has been modified to support GSI. This work was done<br />
by <strong>the</strong> Globus Project group at Argonne.<br />
4
GRIDFTP: A DATA TRANSFER PROTOCOL FOR THE GRID<br />
• HPSS pftpd: The HPSS pftpd server has been modified to support GSI. This<br />
work was done by SDSC.<br />
• Unitree ftpd: The Unitree ftpd server has been modified to support GSI. This<br />
work was done by <strong>the</strong> ACES group at NCSA.<br />
References<br />
i B. Tierney, W. Johnston, J. Lee, G. Hoo, and M. Thompson. End-to-end per<strong>for</strong>mance analysis of high speed distributed<br />
storage systems in wide area ATM networ<br />
ks. In NASA/Goddard Conference on Mass Storage Systems and Technologies, 1996, LBNL-39064.<br />
ii R.W. Watson and R.A. Coyne. The parallel I/O architecture of <strong>the</strong> high-per<strong>for</strong>mance storage system (HPSS). In IEEE<br />
MSS Symposium, 1995.<br />
iii It is interesting to note that DFS also provides <strong>for</strong> cache coherence in <strong>the</strong> face of multiple writers. This is one of its<br />
drawbacks <strong>for</strong> data grid applications, since this feature <strong>for</strong>ces unnecessary overhead. <strong>Data</strong> grid applications seldom<br />
need <strong>the</strong> ability to change existing datasets.<br />
iv In<strong>for</strong>mation on SRB is available on <strong>the</strong> World Wide Web at http://www.npaci.edu/DICE/SRB/.<br />
v In<strong>for</strong>mation about HDF/HDF5 is available on <strong>the</strong> World Wide Web at http://hdf.ncsa.uiuc.edu/.<br />
vi In<strong>for</strong>mation about <strong>the</strong> Globus gsiftp is available on <strong>the</strong> World Wide Web at http://www.globus.org/security/v1.1/.<br />
5