Semantic Provenance: Modeling, Querying, and Application in Scientific Discovery

Abstract

Provenance metadata, which describes the history or lineage of an entity, is essential for ensuring data quality, verifying correct process execution, and computing trust values. Traditionally, provenance management has been addressed in the context of workflow or relational database systems. However, existing provenance systems cannot meet the requirements of an emerging set of applications in the new eScience or Cyberinfrastructure paradigm and the Semantic Web. Provenance in these applications incorporates complex domain semantics on a large scale and serves a variety of uses, including accurate interpretation by software agents, trustworthy data integration, reproducibility, attribution for commercial or legal applications, and trust computation. In this dissertation, we introduce the notion of “semantic provenance” to address these requirements for eScience and Semantic Web applications. In addition, we describe a framework for managing semantic provenance that addresses three issues: (a) provenance representation, (b) query and analysis, and (c) scalable implementation. First, we introduce a foundational model of provenance called Provenir, which serves as an upper-level reference ontology to facilitate provenance interoperability. Second, we define a classification scheme for provenance queries based on query characteristics and use this scheme to define a set of specialized provenance query operators. Third, we describe the implementation of a highly scalable query engine that supports the provenance query operators and uses a new class of materialized views based on the Provenir ontology, called Materialized Provenance Views (MPV), for query optimization. We also define a novel provenance tracking approach called Provenance Context Entity (PaCE) for the Resource Description Framework (RDF) model used in Semantic Web applications.
PaCE, defined in terms of the Provenir ontology, is an effective and scalable approach for RDF provenance tracking in comparison to the currently used RDF reification vocabulary. Finally, we describe the application of the semantic provenance framework in biomedical and oceanography research projects.

Committee Members


Amit P. Sheth, Ph.D.
(Advisor)

Krishnaprasad Thirunarayan, Ph.D.

Michael Raymer, Ph.D.

Nicholas V. Reo, Ph.D.

Olivier Bodenreider, Ph.D.

William S. York, Ph.D.

Publications

Understanding User-generated Content on Social Media

Abstract

Over the last few years, there has been a growing public and enterprise fascination with ‘social media’ and its role in modern society. At the heart of this fascination is the ability of users to participate, collaborate, consume, create, and share content via a variety of platforms such as blogs, micro-blogs, email, instant messaging services, social network services, collaborative wikis, social bookmarking sites, and multimedia sharing sites. Today, in addition to factual information, we can also access the conversations, opinions, and emotions that these facts evoke among other users. We can ask questions such as: What are people saying about a newsworthy event or entity? Can we use this information to assess a population’s preferences? Can we study how these preferences propagate in a network of friends? Are such crowd-sourced preferences a good substitute for traditional polling methods? This dissertation is devoted to understanding informal user-generated textual content on social media platforms and using the results of the analysis to build Social Intelligence Applications. The body of research presented in this thesis focuses on understanding what a piece of user-generated content is ‘About’ via two sub-goals: Named Entity Recognition and Key Phrase Extraction on informal text. In light of the poor context and informal nature of content on social media platforms, we investigate the role of contextual information from documents, domain models, and the social medium in supplementing and improving the reliability and performance of existing text mining algorithms for Named Entity Recognition and Key Phrase Extraction.
In all cases, we find that using multiple contextual cues together leads to reliable inter-dependent decisions, better than using the cues in isolation, and that such improvements are robust across domains and content of varying characteristics, from micro-blogs like Twitter, social networking forums such as those on MySpace and Facebook, and blogs on the Web. Finally, we showcase two deployed Social Intelligence applications that build on the results of Named Entity Recognition and Key Phrase Extraction algorithms to provide near real-time information about the pulse of an online populace. Specifically, we describe what it takes to build applications that exploit the ‘wisdom of the crowds’, highlighting challenges in data collection, processing informal English text, metadata extraction, and presentation of the resulting information.

Committee Members


Amit P. Sheth, Ph.D.
(Advisor)

John M. Flach, Ph.D.

Daniel Gruhl, Ph.D.

Michael L. Raymer, Ph.D.

Shaojun Wang, Ph.D.

Kevin Haas, M.S.

Publications


Dr. Amit Sheth and Dr. Meena Nagarajan at Dr. Nagarajan's graduation.