The Data to Insight Center has a strong presence in provenance and metadata research for scientific data, built through numerous
funded projects and collaborations at Indiana University and at other institutions. Below is a brief
description of our provenance and metadata research initiatives, followed by links to current research projects, prior projects, and efforts with our collaborators.
Digital Data Provenance
Digital data created through computational
science experiments and discovery is growing rapidly and
extending to new frontiers as discovery and experiment frameworks gain
acceptance and as computational power and storage become cheaper. As
digital research data collections become more accessible, it becomes
increasingly important to address data validity and
quality: to record and manage information about where each data object
originated, which processes were applied to it, and by whom.
Routinely collecting provenance information about the data
products generated during the scientific discovery process can
have a transformational impact on scientific discovery.
Provenance collection is, in essence, a form of automatic metadata
generation. When metadata collection is automated and performed
at the point of data product generation, the resulting information is
more accurate and complete, largely because users no longer need to
annotate the data after the fact. As digital
library solutions for scientific data collections become more common,
as current trends indicate, it will be important that
specialized metadata catalogs built up around e-Science discovery, such
as the provenance database, be used in archival collections for the
rich contextual metadata they contain.
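As a minimal illustration of capturing provenance at the point of data product generation, the Python sketch below wraps a processing step so that each call records where the product came from, which process produced it, and on whose behalf. All names here (`with_provenance`, `PROVENANCE_LOG`, `fingerprint`, `calibrate`) are hypothetical and are not part of Karma or any other D2I tool; the in-memory list only stands in for a provenance database.

```python
import hashlib
from datetime import datetime, timezone
from functools import wraps

# Hypothetical in-memory store; a real system would write to a provenance database.
PROVENANCE_LOG = []


def fingerprint(value):
    """Return a short, stable identifier derived from a data object's contents."""
    return hashlib.sha256(repr(value).encode("utf-8")).hexdigest()[:12]


def with_provenance(agent):
    """Decorator that records a provenance entry each time a data product is generated."""
    def decorator(process):
        @wraps(process)
        def wrapper(*inputs):
            output = process(*inputs)
            # Recorded automatically at generation time, so no later annotation is needed.
            PROVENANCE_LOG.append({
                "process": process.__name__,                  # which process was applied
                "agent": agent,                               # by whom
                "inputs": [fingerprint(i) for i in inputs],   # where the data originated
                "output": fingerprint(output),                # the resulting data product
                "generated_at": datetime.now(timezone.utc).isoformat(),
            })
            return output
        return wrapper
    return decorator


@with_provenance(agent="example_user")
def calibrate(readings):
    """Toy processing step (assumed for illustration): scale raw readings."""
    return [round(r * 1.5, 2) for r in readings]


product = calibrate([1.0, 2.0, 3.0])
```

After the call, `PROVENANCE_LOG` holds one entry linking the output fingerprint to its input fingerprint, the process name, and the agent, without the user having annotated anything by hand.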
We are developing tools for provenance generation and collection and
for case-based reasoning. The tools and the collected data are available
for download for wider community use.
Metadata for Scientific Data
With the increasing deluge of scientific data, detailed metadata is necessary to enable
scientists to share data and to find the data and scientific results relevant to their research.
In the Data to Insight Center, our research emphasizes capturing detailed metadata
early in the scientific process. Research has also found that as
the distance (spatial or temporal) between data creators and data users increases,
additional structured metadata is required. The XMC Cat suite of tools, which resulted from our
research in the Linked Environments for Atmospheric Discovery (LEAD) project and subsequent
research projects, enables detailed, automated, incremental metadata capture early
in the scientific process using a generalized architecture that can be adapted to the metadata
schemas of different scientific communities.
- The Karma Provenance Collection Tool is a standalone provenance collection toolkit. The most recent version, v3.0, features instrumentation using Axis2 handlers.
- NetKarma is a tool for capturing the workflow of Global Environment for Network
Innovations (GENI) experiments, including slice creation, slice
topology, operational status, and other measurement statistics, and
correlating it with experimental data. NetKarma will allow researchers
to see the exact state of the network and to store the configuration of the
experiment and its slice.
is a tool for collecting and disseminating provenance of Advanced
Microwave Scanning Radiometer - Earth Observing System (AMSR-E)
standard data products to improve the collection, preservation, utility
and dissemination of the provenance information within the NASA Earth
- Gigabyte Provenance Database is a multi-gigabyte collection of provenance information.
- Workflow Emulator (WORKEM) is a tool for executing synthetic workflows.
is a case-based reasoning recommender system. It uses computer models
of case-based reasoning to develop a support system that leverages the
collective experience of the users of the provenance system to provide
- XMC Cat Suite is a set of metadata tools including the XMC Cat web service metadata catalog, the web-based XMC Cat GUI, post-processing plug-ins for automating metadata capture, and configurations for multiple scientific metadata schemata. XMC Cat is no longer an ongoing D2I project, but all of the source code for the metadata catalog and associated utilities is available through the project's SourceForge site as open source.
- Beth Plale [plale at indiana dot edu]
- Mehmet Aktas
- Bina Bhaskar
- Bin Cao
- Kavitha Chandrasekar
- You-Wei Cheah
- Peng Chen
- Sribabu Doddapaneni
- Dennis Gannon
- Devarshi Ghoshal
- Scott Jensen
- Stacy Kowalczyk
- Shobana Krishnan
- David Leake
- Yuan Luo
- Joseph Morwick
- Beth Plale
- Prajakta Purohit
- Lavanya Ramakrishnan
- Aparna Rao
- Ed Robertson
- Kalani Ruwanpathirana
- Bimalee Salpitikorala
- Yogesh Simmhan
- Christopher Small
- Girish Subramanian
- Yiming Sun
Related News, Events and Publications: