Most of our readers are familiar with grid computing, a cost-effective way to distribute a high volume of computations across computers separated by large geographical distances. In this column, Reagan Moore takes us back to basics to explain what data grids are, and how they differ from the grid computing we’ve come to know and love.
Sometimes, storing data on a single server makes sense. It is a simple way to ensure that people always know where to store data, and where to access it. But in more complex situations, this model can break down for a variety of reasons.
Perhaps the rate at which remotely recorded data is uploaded, combined with the rate at which users wish to access and download that data, is more than the server and its network can handle. Or perhaps you want a way to reliably and seamlessly access data owned and stored by multiple institutions without having to worry about how recently that single server was synchronized with the other institutions’ servers.
Groups that need to manage distributed data, stored across multiple locations and multiple types of storage devices, can use data grids to impose a common access method, common naming, common authentication, and common management policies. They can use the data grid to integrate archives with disk caches used for data access, automate enforcement of management policies and administrative functions, build a collaboration environment, and publish data in a digital library.
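To make the idea of common naming concrete, here is a minimal sketch in Python. It assumes a logical namespace that maps one location-independent name to replicas held at different sites and on different types of storage. The class names, site names, and paths are all hypothetical, chosen for illustration rather than taken from any particular data grid software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    site: str           # hypothetical institution holding this copy
    storage: str        # type of storage device, e.g. "disk" or "tape"
    physical_path: str  # where the copy actually lives at that site


class LogicalNamespace:
    """Maps one logical name to every physical copy of the data."""

    def __init__(self):
        self._replicas = {}

    def register(self, logical_name, replica):
        self._replicas.setdefault(logical_name, []).append(replica)

    def resolve(self, logical_name, prefer="disk"):
        """Return a replica on the preferred storage type if one exists,
        otherwise fall back to the first registered replica."""
        replicas = self._replicas[logical_name]
        for replica in replicas:
            if replica.storage == prefer:
                return replica
        return replicas[0]


namespace = LogicalNamespace()
name = "/grid/seismology/run42.dat"  # hypothetical logical name
namespace.register(name, Replica("site-a", "tape", "/archive/0007/run42.dat"))
namespace.register(name, Replica("site-b", "disk", "/cache/run42.dat"))

# Users ask for the logical name; the grid decides which copy to serve.
fast_copy = namespace.resolve(name)
archive_copy = namespace.resolve(name, prefer="tape")
```

In this sketch the disk cache at one site and the tape archive at another sit behind the same name, so a user never needs to know which institution holds which copy.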
The original funding for the development of data grids came from the US Defense Advanced Research Projects Agency. An initial application was the creation of a distributed patent digital library for the US Patent and Trademark Office. Within the US National Science Foundation’s Partnerships for Advanced Computational Infrastructure (PACI) program, which was launched in March 1997, data grids were built for major national-scale research projects in seismology, neuroinformatics, digital library initiatives, astronomy, environmental science, and oceanography.
Since those early days, data grid capabilities have evolved from simple organization of distributed collections to now include enforcement of management policies, automation of administrative functions, and validation of assessment criteria. The new capabilities are enabled through the integration of a distributed rule engine into the data management infrastructure.
Within that infrastructure, each policy is mapped to a computer-actionable rule that controls the execution of a data management procedure. At each storage location, a rule engine applies the procedures. The procedures, in turn, are mapped to workflows composed from standard functions, called micro-services. The results of the execution of each procedure are stored as persistent state information in a metadata catalog, tracking the data’s provenance. The state information can be queried to verify assessment criteria.
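The chain described above, from policy to rule to workflow to recorded state, can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the API of any real data grid software: the rule table, the two micro-services, the in-memory catalog, and the example policy ("checksum and replicate every ingested file") are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class MetadataCatalog:
    """Persistent state information, kept in memory for this sketch."""
    records: list = field(default_factory=list)

    def record(self, logical_name, attribute, value):
        self.records.append((logical_name, attribute, value))

    def query(self, attribute):
        return {name: value
                for name, attr, value in self.records if attr == attribute}


# Micro-services: small standard functions that a procedure chains together.
def checksum(name, data, catalog):
    catalog.record(name, "checksum", hashlib.sha256(data).hexdigest())


def replicate(name, data, catalog):
    # A real system would copy the bytes to a second storage resource;
    # this sketch only records the resulting replica count.
    catalog.record(name, "replicas", 2)


# A rule maps a policy (identified here by the event that triggers it)
# to a workflow of micro-services.
RULES = {"on_ingest": [checksum, replicate]}


def apply_rule(event, name, data, catalog):
    """Rule engine: run each micro-service in the workflow for this event."""
    for micro_service in RULES.get(event, []):
        micro_service(name, data, catalog)


catalog = MetadataCatalog()
apply_rule("on_ingest", "/grid/survey/image001.dat", b"example bytes", catalog)

# Assessment: query the state information to find objects that violate
# the policy "every object has at least two replicas".
under_replicated = [name for name, count in catalog.query("replicas").items()
                    if count < 2]
```

Because every micro-service writes its outcome to the catalog, checking whether a policy is being met becomes a query over recorded state rather than a fresh scan of the storage systems.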
This policy-based data management approach makes it possible to build generic infrastructure that can support each stage of the data life cycle. The policies and procedures required for simple distributed data management (traditional data grid applications) can be augmented with the policies required for data publication in a digital library or data preservation in an archive. Furthermore, the same data management infrastructure can be used to re-purpose a collection for a new use, by applying the policies required by the new user community.
One example of a policy-based data management system is the open source integrated Rule-Oriented Data System (iRODS).
The generic capabilities of policy-based data management systems are being demonstrated in international collaborations that include groups in Japan, Taiwan, Australia, Europe, and the United States. In Asia, iRODS technology has been adopted by the T2K neutrino data grid in Japan and the Taiwan Digital Archives Remote Backup system, and Academia Sinica in Taiwan is developing gLite-iRODS interoperability.
For more information about iRODS, please visit the website http://irods.diceresearch.org.