Indiana University

Provenance and Metadata

The Data to Insight Center has a strong presence in provenance and metadata research for scientific data, built through numerous funded projects and collaborations at Indiana University and at other institutions. Below is a brief description of our provenance and metadata research initiatives, followed by links to current projects, past projects, and efforts with our collaborators.

Digital Data Provenance
Digital data created through computational science experiment and discovery is growing at a rapid rate and extending to new frontiers as discovery and experiment frameworks gain acceptance and as computational power and storage become cheaper. As research data collections become more accessible, it becomes increasingly important to address data validity and quality: to record and manage information about where each data object originated, which processes were applied to it, and by whom. The ability to routinely collect provenance information about the data products produced during the scientific discovery process can have a transformational impact on scientific discovery.

Provenance collection is, in essence, a form of automatic metadata generation. When metadata collection is automated and performed at the point of data product generation, the resulting information is more accurate and complete, largely because users no longer need to annotate data after the fact. As digital library solutions for scientific data collections become more common, as current trends suggest, it will be important for archival collections to draw on specialized metadata catalogs built around e-Science discovery, such as the provenance database, for the rich contextual metadata they contain.

We are developing tools for provenance generation and collection and for case-based reasoning. The tools and the collected data are also available for download for wider community use.

Metadata for Scientific Data
With the growing deluge of scientific data, detailed metadata is necessary to enable scientists to share data and to find the data and scientific results relevant to their research. In the Data to Insight Center, our research emphasizes capturing detailed metadata early in the scientific process. Research has also found that as the distance (spatial or temporal) between data creators and data users increases, additional structured metadata is required. The XMC Cat suite of tools, which grew out of our research in the Linked Environments for Atmospheric Discovery (LEAD) project and subsequent research projects, enables detailed, automated, and incremental metadata capture early in the scientific process using a generalized architecture that can be adapted to the metadata schemas of different scientific communities.

Current Projects

  • Karma Provenance Collection Tool is a standalone provenance collection toolkit.  The most recent version, v3.0, features instrumentation using Axis2 handlers.

  • NetKarma is a tool for capturing the workflow of Global Environment for Network Innovations (GENI) experiments, including slice creation, the topology of the slice, operational status, and other measurement statistics, and for correlating this information with experimental data.  NetKarma will allow researchers to see the exact state of the network and to store the configuration of the experiment and its slice.

  • InstantKarma is a tool for collecting and disseminating provenance of Advanced Microwave Scanning Radiometer - Earth Observing System (AMSR-E) standard data products to improve the collection, preservation, utility and dissemination of the provenance information within the NASA Earth Science community.

  • Gigabyte Provenance Database is a multi-gigabyte collection of provenance information.

  • Workflow Emulator (WORKEM) is a tool for executing synthetic workflows.

  • Phala is a case-based reasoning recommender system.  It uses computer models of case-based reasoning to develop a support system that leverages the collective experience of the users of the provenance system to provide suggestions.
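
To make the case-based reasoning idea concrete, here is a minimal Python sketch of the retrieve-and-reuse pattern: past workflow runs are stored as cases, the case most similar to a partially built workflow is retrieved, and its next unused step is offered as a suggestion. This is not Phala's actual algorithm or data model; the Case class, the Jaccard similarity measure, and the step names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One past workflow run, reduced to its ordered step names (hypothetical model)."""
    steps: tuple


def similarity(partial, case):
    """Jaccard overlap between the steps used so far and the steps of a stored case."""
    a, b = set(partial), set(case.steps)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_next(partial, case_base):
    """Retrieve the most similar case and reuse its first step not yet in the workflow."""
    best = max(case_base, key=lambda c: similarity(partial, c))
    for step in best.steps:
        if step not in partial:
            return step
    return None  # the best-matching case has nothing new to offer


# Illustrative case base, as might be mined from collected provenance.
case_base = [
    Case(("fetch", "regrid", "forecast", "plot")),
    Case(("fetch", "filter", "aggregate")),
]

print(suggest_next(["fetch", "regrid"], case_base))  # prints "forecast"
```

The sketch assumes a non-empty case base; a fuller system would also adapt the retrieved case to the current context and retain the outcome as a new case.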

Past Projects

  • XMC Cat Suite is a set of metadata tools including the XMC Cat web service metadata catalog, the web-based XMC Cat GUI, post-processing plug-ins for automating metadata capture, and configurations for multiple scientific metadata schemata. XMC Cat is no longer an ongoing D2I project, but all of the source code for the metadata catalog and associated utilities is available through the project's SourceForge site as open source.



  • Beth Plale [plale at indiana dot edu]
Project Contributors
  • Mehmet Aktas
  • Bina Bhaskar
  • Bin Cao
  • Kavitha Chandrasekar
  • You-Wei Cheah
  • Peng Chen
  • Sribabu Doddapaneni
  • Dennis Gannon
  • Devarshi Ghoshal
  • Scott Jensen
  • Stacy Kowalczyk
  • Shobana Krishnan
  • David Leake
  • Yuan Luo
  • Joseph Morwick
  • Beth Plale
  • Prajakta Purohit
  • Lavanya Ramakrishnan
  • Aparna Rao
  • Ed Robertson
  • Kalani Ruwanpathirana
  • Bimalee Salpitikorala
  • Yogesh Simmhan
  • Christopher Small
  • Girish Subramanian
  • Yiming Sun


Related News, Events and Publications:

Provenance as Essential Infrastructure for Data Lakes
ProvErr: System Level Statistical Fault Diagnosis using Dependency Model
Big Data Provenance Analysis and Visualization
Komadu: A Capture and Visualization System for Scientific Data Provenance
Analysis of Memory Constrained Live Provenance
Trust Threads: Minimal Provenance for Data Publication and Reuse
Threads of Trust: Provenance of Data Reuse in Long Tail Science
2014 IEEE International Conference on Big Data

Beth Plale, Ying Ding, and XiaoFeng Wang of Indiana University are on the Program Committee for this conference on big data.

IU develops Komadu, a new suite of data provenance software tools

The Indiana University Data to Insight Center (D2I) has released a new suite of software tools, Komadu, designed to help researchers track and verify digital data, a crucial step in computational research.

Komadu Provenance Collection Tool

Komadu is a W3C PROV compliant standalone provenance collection tool that can be added to an existing cyberinfrastructure for the purpose of collecting and visualizing provenance data. Komadu is the successor of Karma and it comes with a set of new features and a new API to support easier provenance collection.