Indiana University

D2I Current Projects

Cloud for Climate: Advancing Data & Resource Management

This project develops a pipeline framework for running ensemble simulations on the cloud. The framework has two key components: ensemble deployment and metadata harvest. For ensemble deployment, commercial cloud platforms typically allow far fewer jobs to be started at any one time than an ensemble requires. An ensemble run therefore needs to be pipelined to a cloud resource, that is, executed in well-controlled batches over a period of time.
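The batching idea can be illustrated with a minimal sketch. This is not the project's framework: the names `batches`, `run_ensemble`, `max_concurrent`, and `launch` are hypothetical, and each batch is run synchronously here, whereas a real deployment would submit a batch to the cloud and wait for it to finish.

```python
from itertools import islice


def batches(jobs, max_concurrent):
    """Yield successive lists of at most max_concurrent jobs."""
    it = iter(jobs)
    while True:
        batch = list(islice(it, max_concurrent))
        if not batch:
            return
        yield batch


def run_ensemble(members, max_concurrent, launch):
    """Run every ensemble member, one controlled batch at a time.

    launch is a hypothetical callable standing in for job submission;
    it is called synchronously here for illustration only.
    """
    results = []
    for batch in batches(members, max_concurrent):
        # The next batch starts only after the current one has completed.
        results.extend(launch(member) for member in batch)
    return results
```

With a cap of three concurrent jobs, seven ensemble members would be executed as three batches of sizes 3, 3, and 1.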
HathiTrust Research Center

The HathiTrust Research Center (HTRC) enables computational access for nonprofit and educational users to published works in the public domain stored within the HathiTrust Digital Library, an extensive collaborative digital library of nearly 10 million volumes and 2 billion pages of archived material maintained by major research institutions and libraries worldwide.

Socio-Ecological Informatics

Social-ecological researchers study the interactions among the environment, users, and governance of environmental resources. The research undertaken by the Social Ecological Informatics group applies database and data management, information retrieval, knowledge management, human-computer interaction design, and ontological tools and approaches to enhancing the value of social-ecological data for research and policy use.
Gigabyte Synthetic Database

Provenance of scientific data is a key piece of the metadata record for the data's ongoing discovery and reuse. Provenance collection systems capture provenance on the fly; however, the protocol between application and provenance tool may not be reliable. Consequently, the provenance record can be partial, partitioned, or simply inaccurate. The Gigabyte Synthetic Database is a noisy data collection generated using the Workflow Emulator Tool (WORKEM) with a number of scientific workflow...
XML Metadata Concept Catalog (XMC Cat)

XMC Cat is a metadata catalog that stores rich metadata describing data objects that are themselves stored in files, storage repositories, or on the web. Its features include adaptability to domain schemata through configuration instead of code changes, support for automatic capture of metadata through the use of curation plugins, and search and browse capabilities through a web-based GUI that is dynamically generated from a domain schema. It can be deployed in different scientific and...
Karma Provenance Collection Tool

The Karma tool is a standalone tool that can be added to existing cyberinfrastructure to collect and represent provenance data. Karma uses a modular architecture that supports multiple instrumentation plugins, making it usable in different architectural settings.
Sustainable Environment-Actionable Data

Awarded through NSF's DataNet program, the Sustainable Environment-Actionable Data (SEAD) project will develop tools and services for active curation and long-term preservation of scientific data, while also engaging researchers through social networking tools. SEAD will enable new modalities of sustainability science -- the study of dynamic interactions between nature and society by advancing...

Secure Computational and Data Environments for Non-Consumptive Research

In this research, researchers at the University of Michigan and the Data to Insight Center are developing a “data capsule framework” that is founded on a principle of “trust but verify”. That is, the informatics scholar is given freedom to experiment with new algorithms on a huge body of copyrighted or otherwise protected information, but technological mechanisms are in place to verify compliance with the policy of non-consumptive research.
Hierarchical MapReduce

We present a hierarchical MapReduce framework that gathers computation resources from different clusters and runs MapReduce jobs across them. The global controller in our framework splits the data set and dispatches the pieces to multiple "local" MapReduce clusters, balancing the workload by assigning tasks according to the capabilities of each cluster and of each node. The local results are then returned to the global controller for global reduction.
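The split, local MapReduce, and global reduction flow described above can be sketched with a toy word count. This is an illustrative simplification under stated assumptions, not the project's implementation: the function names and the single per-cluster `capacities` weight are hypothetical, node-level balancing is omitted, and the "clusters" are processed sequentially in one process.

```python
from collections import Counter


def split_by_capacity(records, capacities):
    """Partition records into contiguous chunks proportional to each cluster's capacity."""
    total = sum(capacities)
    chunks, start, seen = [], 0, 0
    for cap in capacities:
        seen += cap
        # Cumulative boundaries guarantee every record lands in exactly one chunk.
        end = round(len(records) * seen / total)
        chunks.append(records[start:end])
        start = end
    return chunks


def local_mapreduce(chunk):
    """Stand-in for one local cluster: map lines to words, reduce to word counts."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts


def global_reduce(partials):
    """Global controller step: merge the local results into one result."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return merged


def hierarchical_word_count(records, capacities):
    chunks = split_by_capacity(records, capacities)
    return global_reduce(local_mapreduce(chunk) for chunk in chunks)
```

For example, with `capacities=[3, 1]` the first hypothetical cluster receives three quarters of the records and the second receives the rest; the merged counts are the same as a single-pass word count over all records.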
PRAGMA at IU

IU provides a virtual cluster consisting of a frontend node and three compute nodes. Additional virtual clusters may be made available in the future.
Data Catalog

Data Catalog harvests data product metadata from distributed THREDDS catalogs into an XMC Cat instance. The metadata (including data product location) is then available to applications via the XMC Cat API. Data Catalog metadata harvesting employs a shared-nothing ingest pipeline to allow for the indexing of large catalogs such as NEXRADIII, and has indexed over 17 thousand collections and 2 million files.
Streamflow

Streamflow integrates data streams into a standard workflow system through a programming-model approach. It introduces new workflow semantics that let scientific workflow designers incorporate data streams into an experiment without major changes to the infrastructure. It uses XBaya as a graphical client program for workflow composition, execution, and monitoring.
Sigiri

We propose a simple abstraction for interaction with heterogeneous resource managers spanning grid and cloud computing, with an emphasis on features that make the tool useful for the mid-scale physical or natural scientist. Key strengths of the abstraction are its support for multiple standard job specification languages, preservation of direct user interaction with the service, removal of the delay that can come through layers of services, and predictable behaviour under heavy loads.
Linked Environments for Atmospheric Discovery II (LEAD II)

LEAD II is a follow-on to the successful Linked Environments for Atmospheric Discovery (LEAD) project, an NSF-funded large-scale ITR. LEAD II carries the vision of LEAD forward into new areas as it explores research challenges in hybrid computing and in the manipulation and use of weather data in non-weather applications.

