
Ask Steve! - How to Specify and Implement Data Movement?

Posted by s.crouch on 12 April 2011 - 3:50pm

You may recall back in February I talked about the importance of data formats when choosing a programming language for sustainable development.  Since then, I’ve received the following question…

“We’re currently putting together a machine for data intensive research. The machine will have a data-staging node, and 120 other nodes (configured using ROCKS) which each have several large local disks (>6TB/node).  We want to try out different ways of staging the data to the different nodes, and keep a record of what we’ve done.  The main things we’d like to record are: the number and size of files, the pattern (one to many, many to many, etc), and the locations to which the data are sent.

We’d like to use some kind of standard way of specifying how we want the data to move, and allow for different data transfer implementations to be plugged in behind this.  I’ve heard that OGSA-DMI might be able to help us here. Do you think that could be helpful?  Do you have any other suggestions or advice as to how we could provide a standard interface to the up-loader of the data which could also potentially record what movement has taken place?”

Instead of the format of data, we’re now talking about the format of requirements for moving data.  Essentially, you have data stored in one or more ‘source’ locations and you want to transfer it to one or more ‘sink’ locations – how do you specify this?

The OGSA Data Movement Interface (DMI) from the Open Grid Forum (OGF) is an XML specification aimed at doing just that.  The good news is that the OGSA-DMI Plain Web Service Rendering Specification v1.0, soon to be ratified as an OGF full Proposed Recommendation, seems to meet your requirements.  You can readily specify multiple transfer sources and sinks at a high level (one-to-many, many-to-many), and the specification does not mandate which transfer protocols the service supports (e.g. FTP, SCP, etc.), so you are free to add those you wish to support.  For example, when specifying requirements for moving data, you could specify a Source like this:

      <dmi-plain:SourceDEPR>
        <wsa:Address>
          http://www.ogf.org/ogsa/2007/08/addressing/none
        </wsa:Address>
        ...
        <wsa:Metadata>
          <dmi:DataLocations>
            <dmi:Data
              DataUrl="ftp://ftp.siteA.com/source/example.zip"
              ProtocolUri="http://www.ogf.org/ogsa-dmi/2006/03/im/protocol/ftp">
              <dmi:Credentials>
                <ws-sec:UsernameToken>
                  <Username>foo</Username>
                  <Password>bar</Password>
                </ws-sec:UsernameToken>
              </dmi:Credentials>
            </dmi:Data>
          </dmi:DataLocations>
          ...
        </wsa:Metadata>
        ...
      </dmi-plain:SourceDEPR>

You would then specify a corresponding Sink in much the same high-level way.  Importantly, the specification contains definitions for specifying data movement requirements as well as the web service interface itself.  Perhaps worth investigating!
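To make the idea of sources, sinks and patterns a little more concrete, here is a minimal Python sketch of how a request might be held in memory once it has been read.  It is not part of OGSA-DMI: the class names `DataLocation` and `TransferRequest`, the `pattern()` helper and the `node001`/`node002` host names are all my own illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import urlparse


@dataclass
class DataLocation:
    """One endpoint of a transfer (hypothetical model, not an OGSA-DMI type)."""
    data_url: str

    @property
    def protocol(self) -> str:
        # Derive the protocol from the URL scheme, e.g. "ftp" or "sftp".
        return urlparse(self.data_url).scheme


@dataclass
class TransferRequest:
    """A set of sources and a set of sinks, as in the question above."""
    sources: List[DataLocation]
    sinks: List[DataLocation]

    def pattern(self) -> str:
        # Classify the request as one-to-one, one-to-many, and so on.
        if not self.sources or not self.sinks:
            raise ValueError("a request needs at least one source and one sink")
        src = "one" if len(self.sources) == 1 else "many"
        dst = "one" if len(self.sinks) == 1 else "many"
        return f"{src}-to-{dst}"


# Example: one staged file fanned out to two (made-up) cluster nodes.
request = TransferRequest(
    sources=[DataLocation("ftp://ftp.siteA.com/source/example.zip")],
    sinks=[
        DataLocation("sftp://node001/data/example.zip"),
        DataLocation("sftp://node002/data/example.zip"),
    ],
)
print(request.pattern())
```

Keeping the pattern as something derived from the request, rather than something the up-loader types in, means the record of what was done cannot drift from what was actually asked for.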

Of course, you have to implement the service’s back-end to perform the actual transfers.  You could consider the Commons Virtual File System (VFS) data transfer library, which provides a single Java API for transferring files using a number of protocols (e.g. FTP, SFTP, HTTP).  There are a couple of variants of this: the original Apache Commons-VFS, and commons-vfs-grid on SourceForge, which includes some fixes to the original and more advanced features.  Bear in mind that each node in your cluster would need to run a server for the chosen protocol in order to act as a Sink.  In addition, if the demands on the service are expected to be high, you may have to consider a scalable solution that farms out the transfer ‘jobs’ to nodes on your network.
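The ‘plug in different implementations behind a standard interface’ idea can be sketched in a few lines.  The Python below is only an illustration of the pattern, not the Commons VFS API: `register_backend`, `transfer` and `local_file_backend` are names I have made up, and the only back-end shown copies between `file://` URLs so that the example runs without any servers.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict
from urllib.parse import urlparse
from urllib.request import url2pathname

# A back-end takes (source_url, sink_url) and returns the bytes transferred.
TransferBackend = Callable[[str, str], int]

_BACKENDS: Dict[str, TransferBackend] = {}


def register_backend(scheme: str, backend: TransferBackend) -> None:
    """Plug a transfer implementation in for one URL scheme."""
    _BACKENDS[scheme] = backend


def transfer(source_url: str, sink_url: str) -> int:
    """Dispatch to whichever back-end is registered for the sink's scheme."""
    scheme = urlparse(sink_url).scheme
    if scheme not in _BACKENDS:
        raise ValueError(f"no back-end registered for protocol {scheme!r}")
    return _BACKENDS[scheme](source_url, sink_url)


def _url_to_path(url: str) -> Path:
    # Convert a file:// URL into a local filesystem path.
    return Path(url2pathname(urlparse(url).path))


def local_file_backend(source_url: str, sink_url: str) -> int:
    """Stand-in back-end: copy a local file and report its size in bytes."""
    src, dst = _url_to_path(source_url), _url_to_path(sink_url)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dst)
    return dst.stat().st_size


register_backend("file", local_file_backend)
```

An FTP or SFTP back-end would be registered in the same way, and the code that calls `transfer` would not need to change when you swap one implementation for another.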

As for recording the transfers, much of the information you need is embedded in the requests, so you could add a simple ‘recording’ feature into your implementation.  You would have to think about how to get the file sizes (perhaps only known at transfer time) into the recording log though!
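As a rough sketch of such a ‘recording’ feature, the Python below writes one JSON line per request, filling in the file sizes only once each transfer has reported them.  Everything here is an assumption for illustration: the function name `record_transfers`, the field names in the record, and the simplification that every source is sent to every sink.

```python
import json
import time
from typing import Callable, List, TextIO


def record_transfers(
    sources: List[str],
    sinks: List[str],
    do_transfer: Callable[[str, str], int],
    log: TextIO,
) -> dict:
    """Run each source-to-sink transfer and append one JSON record to `log`."""
    src = "one" if len(sources) == 1 else "many"
    dst = "one" if len(sinks) == 1 else "many"
    files = []
    for source in sources:
        for sink in sinks:
            # The size is only known once the transfer has actually run.
            size = do_transfer(source, sink)
            files.append({"source": source, "sink": sink, "bytes": size})
    record = {
        "timestamp": time.time(),
        "pattern": f"{src}-to-{dst}",
        "file_count": len(files),
        "total_bytes": sum(f["bytes"] for f in files),
        "files": files,
    }
    log.write(json.dumps(record) + "\n")
    return record
```

Here `do_transfer` is whatever function performs a single transfer and returns the number of bytes moved, so the log captures the number and size of files, the pattern, and the destinations in one place.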

I’m aware of two implementations of this specification.  The first is the UNICORE Grid middleware, and the second comes from DataMINX, an open-source project hosted on Google Code.  DataMINX in particular offers a scalable architecture with worker nodes pluggable into a Java Message Service (JMS)-compliant queueing system, and is modular to the extent that you could make use of just the transfer worker client, for example.  Perhaps you’d like to investigate to see whether it is appropriate, and maybe contribute to its development?
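If you want a feel for the queue-and-workers pattern before looking at a real implementation, the following Python sketch shows the general shape using an in-process queue and threads.  It is not DataMINX’s design and it does not use JMS; `run_workers` and its parameters are hypothetical names, and a real deployment would replace the in-memory queue with a message broker and the threads with worker nodes.

```python
import queue
import threading
from typing import Callable, List, Tuple

Job = Tuple[str, str]  # (source_url, sink_url)


def run_workers(
    jobs: List[Job],
    do_transfer: Callable[[str, str], int],
    worker_count: int = 4,
) -> List[Tuple[Job, int]]:
    """Farm transfer jobs out to a pool of workers and collect the byte counts."""
    pending: "queue.Queue[Job]" = queue.Queue()
    for job in jobs:
        pending.put(job)

    results: List[Tuple[Job, int]] = []
    lock = threading.Lock()

    def worker() -> None:
        # Each worker pulls jobs until the queue is empty, then exits.
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return
            size = do_transfer(*job)
            with lock:
                results.append((job, size))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Results come back in whatever order the workers finish, and this sketch does no error handling: a failed transfer simply ends that worker’s thread, which a production service would need to deal with.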

Lastly, the OGSA-DMI Working Group within OGF is working towards an OGSA-DMI ‘Common’ specification, which is designed to specify only the requirements for transfers and not the service interface.  This would mean you can use this ‘Common’ specification within your own service in any way you choose.  Perhaps you’d like to join the group, contribute your use case, and help us work towards the next generation of an OGSA-DMI specification! :)

I hope this helps!
