XML Metadata Concept Catalog (XMC Cat)
XMC Cat is a metadata catalog that stores rich metadata describing data objects that are themselves stored in files, storage repositories, or on the web. It is an open source web service written in Java that utilizes the Axis2 web service engine and Apache Tomcat. Its features include adaptability to domain schemata through configuration instead of code changes, support for automatic capture of metadata through the use of curation plugins, and search and browse capabilities through a web-based GUI that is dynamically generated from a domain schema. This allows XMC Cat to be deployed in different scientific and educational domains without requiring new code to be written. XMC Cat is currently in use in the LEAD Science Gateway.
Role of Metadata Concepts in XMC Cat

Metadata schemas used in science and education are composed of complex concepts that describe the data products generated by a community. XMC Cat exploits this feature of scientific metadata both to store metadata efficiently and to perform detailed data discovery queries. This concept-based approach also enables the automatic generation of the data search GUI and easier deployment of metadata catalogs in diverse scientific domains. An XML metadata schema (or schemas) is partitioned into the concepts it contains, and metadata can be ingested and validated incrementally using concepts as the unit of storage. These concepts are also shredded to allow detailed data discovery through a point-and-click search GUI. Combining concepts as the unit of metadata storage with shredded metadata provides efficiency for both insert and query operations, because the XML metadata can be rebuilt rapidly in response to detailed data discovery queries. This approach also enables the query interface to dynamically adapt to the native data type of each metadata element, be it numeric, string, temporal, or spatial.
Query interfaces customized for the schema are constructed automatically based on the metadata schema for which each community deploys XMC Cat. Additionally, the XML Beans and XSLT code needed to configure XMC Cat for a domain schema can be generated through a point-and-click web interface.
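The concept-based storage described above can be illustrated with a small sketch. This is not XMC Cat's code (XMC Cat is a Java web service backed by a relational database); it is a minimal Python illustration that assumes each child of the document root is one "concept". The sample element names and the helpers `partition_into_concepts`, `shred`, and `find_concepts` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata document; the element names are illustrative only.
SAMPLE = """
<metadata>
  <citation>
    <title>Storm forecast run</title>
    <pubdate>2010-05-17</pubdate>
  </citation>
  <grid>
    <dx>4.0</dx>
    <dy>4.0</dy>
  </grid>
</metadata>
"""

def partition_into_concepts(xml_text):
    """Split a document into concepts (assumed here: each child of the root)."""
    root = ET.fromstring(xml_text)
    return {child.tag: child for child in root}

def typed_value(text):
    """Keep numeric values as numbers so queries can compare them natively."""
    try:
        return float(text)
    except ValueError:
        return text

def shred(concepts):
    """Flatten each concept into (concept, element, value) rows for querying."""
    rows = []
    for name, node in concepts.items():
        for elem in node.iter():
            if elem is not node and elem.text and elem.text.strip():
                rows.append((name, elem.tag, typed_value(elem.text.strip())))
    return rows

def find_concepts(concepts, rows, element, predicate):
    """Query the shredded rows, then return the stored concept XML unchanged."""
    hits = {c for c, e, v in rows if e == element and predicate(v)}
    return [ET.tostring(concepts[c], encoding="unicode").strip()
            for c in sorted(hits)]

concepts = partition_into_concepts(SAMPLE)
rows = shred(concepts)
matches = find_concepts(concepts, rows, "dx",
                        lambda v: isinstance(v, float) and v < 5)
```

The query runs against the flat, typed rows, while the response is assembled from whole stored concepts rather than rebuilt element by element, which is the trade-off the paragraph above describes. A real catalog would also handle repeated concepts, temporal and spatial types, and schema validation, none of which this sketch attempts.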
License
XMC Cat is an open source tool licensed under the Apache 2.0 license. A copy of the license is available at http://www.apache.org/licenses/LICENSE-2.0
News
- D2I's Scott Jensen gave a talk on "Adaptable and Incremental Metadata Capture in e-Science" at University of Chicago's Computation Institute on March 2, 2012.
- Beth Plale and Scott Jensen, along with DataONE colleagues from University of New Mexico and Oak Ridge National Labs, will be presenting the tutorial M13: Big Data Means Your Metadata Must Work at SC11 in Seattle on Monday November 14th. XMC Cat will be one of the metadata tools discussed.
- Stop by the Indiana University booth on the exhibition floor at SC11 on November 15th - 17th. See the exciting work D2I is doing, including demos of XMC Cat!
- Scientific Data Discovery with XMC Cat. Pushing Back on the Data Deluge: Advancements in Metadata, Archival and Workflows. Presented at Supercomputing 2010, Nov 15-19; PPTX
- XMC Cat: An Adaptive Catalog for Scientific Metadata, Improving Observing Network Coordination: A Cyberinformatics Forum, Boulder, CO, US, May 17-18, 2010; PDF; PPTX
- National News Story, May 2010
- Watch a short video of Scott Jensen describing XMC Cat: http://pti.iu.edu/video/xmccat
Contacting Us
If you have questions or comments on XMC Cat, you can contact us at: xmccat [at] cs [dot] indiana [dot] edu
Get Up and Running With XMC Cat
Installation prerequisites, build instructions, installation instructions for the server and client, additional help on configuration settings, sample code for building your own client tools to work with XMC Cat, and the XMC Cat FAQ can all be found on the Data to Insight Center's wiki. Below are direct links to the relevant wiki pages:
Contributors
| Scott Jensen | Scott Jensen's research focus is on metadata management (with a particular focus on scientific data), data management, data provenance, services and SOA, XML, XML-Relational storage, search interfaces, and the Semantic Web. His dissertation work focused on identifying the characteristics of XML-based metadata and differences from general XML storage that can be exploited to provide faster query response for scientific data while using a flexible, scalable, and adaptable generic relational database structure that can be applied to varied scientific domains using different metadata schemas and data hierarchies. |
| Beth Plale | Professor Beth Plale serves as Director of the Data to Insight Center and the Center for Data and Search Informatics for Pervasive Technology Institute. She is an associate professor of Computer Science and Informatics. Plale is a national leader in data and information management and serves on leadership teams of several major grant-funded projects, including the large NSF-funded LEAD project in cyberinfrastructure for mesoscale meteorology forecasting. |
Additional Contributors
| Yiming Sun | Yiming Sun is a PhD candidate whose research areas focus on the long-term preservation of e-Science experiments and artifacts, reusable preservation objects, data provenance, metadata, cyberinfrastructure, and services. He is also a research staff member in the Data to Insight Center, currently working on the HathiTrust Research Center project (HTRC). |
| Shobana Krishnan | |
| Bina Bhaskar | Master's student in the CS department of IUB. |
| Kavitha Chandrasekar | Kavitha Chandrasekar is a Research Software Engineer at the Data to Insight Center. She has worked on the LEAD II project, running workflows with Trident Scientific Workflow Workbench. She is currently working as a programmer on the Sustainable Environment Actionable Data (SEAD) project and is also involved in projects on running workflows on the cloud. |
| Kalani Ruwanpathirana | Kalani is a software analyst in University Information Technology Services (UITS) at Indiana University-Bloomington (IUB). |
| Bimalee Salpitikorala | |
Publications & Tutorials
- Scott Jensen, Beth Plale, Rebecca Koskela, and John Cobb, Big Data Means Your Metadata Must Work, half-day tutorial presented at SC11, Seattle, WA, November 2011.
- Scott Jensen, Devarshi Ghoshal, and Beth Plale, Evaluation of Two XML Storage Approaches for Scientific Metadata, Indiana University CS Technical Report TR698, October 2011.
- Scott Jensen and Beth Plale, Trading Consistency for Scalability in Scientific Metadata, In Proceedings of the 2010 IEEE International Conference on e-Science, Brisbane, Australia, December 2010.
- Yiming Sun, Bin Cao, Jeffery Cox, Chathura Herath, Scott Jensen, and Beth Plale, Event Processing in a Weather Data Science Gateway, 3rd ACM International Conference on Distributed Event-Based Systems, July 2009.
- Scott Jensen and Beth Plale, Extended abstract: Schema-Independent and Schema-Friendly Scientific Metadata Management, Fourth IEEE International Conference on eScience, pp. 428-429, 2008.
- Scott Jensen and Beth Plale, Using Characteristics of Computational Science Schemas for Workflow Metadata Management, In Proceedings of the 2008 IEEE Congress on Services, IEEE 2008 Second International Workshop on Scientific Workflows (SWF 2008), Hawaii, July 2008.
- Yiming Sun, Scott Jensen, Sangmi Lee Pallickara, and Beth Plale, Personal Workspace for Large-scale Data-driven Computational Experimentation, 7th IEEE/ACM International Conference on Grid Computing (Grid'06), Barcelona, September 2006.
- Scott Jensen, Beth Plale, Sangmi Lee Pallickara, and Yiming Sun, A Hybrid XML-Relational Grid Metadata Catalog, Workshop on Web Services-based Grid Applications (WGSA'06) in association with International Conference on Parallel Processing (ICPP-06), August 2006.
- Sangmi Lee Pallickara, Beth Plale, Scott Jensen, and Yiming Sun, Structure, sharing, and preservation of scientific experiment data, IEEE 3rd International Workshop on Challenges of Large Applications in Distributed Environments (CLADE), July 2005.
Sponsors, Oct 2005 - present
Related News, Events and Publications:
Data Catalog |
Data Catalog harvests data product metadata from distributed THREDDS catalogs into an XMC Cat instance. The metadata (including data product location) is then available to applications via the XMC Cat API. Data Catalog metadata harvesting employs a shared-nothing ingest pipeline to allow for the indexing of large catalogs such as NEXRADIII and has indexed over 17 thousand collections and 2 million files. |
|
Adaptable and Incremental Metadata Capture in e-Science |
Presented by Scott Jensen, March 2, 2012. Scientific communities are recognizing an increasing need to enable reuse of the deluge (or bonanza) of scientific data currently being generated. Detailed metadata, or 'data about data', is key to preserving the value, as well as enabling the sharing and reuse of data. Communities have developed detailed XML schemata to capture and communicate metadata describing scientific data. Historically however, to the extent metadata has been captured at all... |
|
D2I's Scott Jensen gives invited talk on metadata capture in e-Science at Computation Institute |
"Adaptable and Incremental Metadata Capture in e-Science", March 2, 2012, Searle 240A, University of Chicago Computation Institute. |
|
D2I: Adaptable and Incremental Metadata Capture in e-Science |
Scott Jensen, Post Doc Research Associate, Data to Insight Center, Indiana University. Scientific communities are recognizing an increasing need to enable reuse of the deluge (or bonanza) of scientific data currently being generated. Detailed metadata, or “data about data”, is key to preserving the value, as well as enabling the sharing and reuse of data. Communities have developed detailed XML schemata to capture and communicate metadata describing scientific data. Historically however, to... |
|
XMC Cat Downloads |
Both the client and server can either be downloaded as binaries or compiled from source code by downloading the source tarball. The source code is configured to be built using Maven2, and the build script will generate the client, server, and some additional utilities used in XMC Cat. Since XMC Cat is a web service described by a WSDL, clients can also be built with the web service tool of your choice. In the server installation of XMC Cat, we use the XML Beans data binding. |
|
Indiana University Pervasive Technology Institute Report to the Lilly Endowment, Inc. Grant Number 2008 1639-00 36 Month Program Report June 1, 2011 - November 30, 2011 |
Bi-Annual report to the Lilly Endowment, Inc. Search for "Lilly Report" to find all reports. |
|
XMC Cat Among the Stars |
Presentation given by Yiming Sun and Scott Jensen at Supercomputing 2011 regarding the use of XMC Cat with ODI (One Degree Imager); data challenges, data requirements, data subsystem role in ODI, ODI metadata requirements. |
|
Tutorial: Big Data Means Your Metadata Must Work |
1/2 day tutorial at Supercomputing 2011 (SC11), Seattle, Washington, November 2011 with 65 attendees. Data intensive computing means a lot of valuable data coming from parallel and graph computations and remote instruments. However, a recent Science article estimated that only 1% of ecological data is accessible after the research has been published. This tutorial addresses how metadata is critical to the use and reuse of scientific data. It discusses techniques for metadata capture,... |
|
Evaluation of Two XML Storage Approaches for Scientific Metadata |
Scientific data are increasingly described by metadata based on detailed XML schemata that capture both general and domain-specific concepts about the underlying data. Metadata captured using detailed XML schemata tailored to specific scientific domains increases the potential for data reuse by providing the ability to discover data products described by detailed concepts. Since such metadata is captured as XML, one alternative for managing scientific metadata is to store and query the metadata... |
|