Data Intensive Analysis and Computing (DIAC) Lab
Recent advances in computing, communication, and digital storage
technologies have made enormous volumes of data accessible remotely
across geographical and administrative boundaries. There is an increasing demand for
summarizing, understanding, monitoring, learning, and collaboratively mining
these large, evolving, and possibly private data stores. In the DIAC lab, we study research problems and applications
related to such large datasets.
Faculty: Keke Chen, Amit Sheth
Data Analytics with the Cloud
Data clouds, consisting of hundreds or thousands of inexpensive multi-core PCs and disks, are available
for rent at low cost (e.g., Amazon EC2 and S3 services). Many cloud-based applications generate large
amounts of data in the cloud, which need to be processed with cloud-based data analytics tools.
Powered by distributed file systems (e.g., the Hadoop Distributed File System) and the MapReduce programming model,
the cloud has become an economical and scalable platform for large-scale data analytics.
We study a visual cluster exploration framework (CloudVista) for analyzing large datasets hosted in the cloud, as well as
cost models for resource-aware cloud computing.
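The MapReduce model mentioned above can be illustrated with a minimal, single-machine sketch (the function names and the toy word-count data are illustrative, not part of any lab system; a real framework distributes these phases across the cluster):

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record, emitting (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key's grouped list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word-count example.
docs = ["big data in the cloud", "data analytics in the cloud"]
mapped = map_phase(docs, lambda doc: ((w, 1) for w in doc.split()))
counts = reduce_phase(shuffle(mapped), lambda k, vs: sum(vs))
print(counts["data"])  # each document contributes one "data", so 2
```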
Clustering Large/Streaming Numerical/Categorical Data
Large datasets are also
characterized by high complexity and uncertainty, and clustering is an effective
tool for understanding both. In the DIAC lab, we
investigate novel techniques that combine visual analytics and statistical
analysis to help better understand the clustering patterns in large datasets.
In particular, we are interested in visually exploring and validating clustering
patterns in large multi-dimensional datasets (VISTA, iVIBRATE), finding the optimal number of clusters in categorical (ACE and BestK) and transactional (Density and DMDI) datasets, and monitoring changes of clustering
patterns in categorical data streams (CatStream).
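The lab's own methods for choosing the number of clusters (ACE, BestK) are not reproduced here; the sketch below only illustrates the generic problem they address, using a plain 1-D k-means with a within-cluster dispersion ("elbow") criterion on toy data:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain 1-D k-means; returns final centroids and sum of squared errors."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Recompute centroids; keep the old one if a cluster went empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    sse = sum(min((p - c) ** 2 for c in centroids) for p in points)
    return centroids, sse

# Two well-separated groups: the error drops sharply from k=1 to k=2,
# then flattens -- the "elbow" suggests k=2 is the right choice.
points = [1.0, 1.2, 0.8, 10.0, 10.3, 9.7]
errors = {k: kmeans(points, k)[1] for k in (1, 2, 3)}
print(errors)
```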
Privacy Preserving Computing, Trustworthy Computing
When large datasets are shared across boundaries, privacy and trust become major
concerns. In the DIAC lab we study privacy issues in distributed data-intensive
computing, in particular privacy-preserving OLAP and mining on outsourced data, and privacy-preserving multiparty collaborative data mining.
We have proposed the geometric data perturbation (GDP) method, which fully preserves
data utility for classification and clustering modeling while
providing satisfactory privacy guarantees. The GDP method can also be applied to privacy-preserving multiparty
collaborative mining (Multiparty GDP).
Recent developments have focused on the theoretical study of the family of geometric perturbation methods and its application to privacy-preserving OLAP on outsourced data, as well as
privacy and trust in social networks.
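The published GDP method combines rotation, translation, and noise components; the sketch below is only an illustration of the core geometric idea, not the actual algorithm. A random rotation plus translation preserves pairwise Euclidean distances exactly, which is why distance-based classifiers and clustering algorithms retain their accuracy on the perturbed data:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    """Random orthogonal matrix via QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(r))  # fix column signs

def perturb(X, R, t):
    """Rotate every record, then translate: Y = X R^T + t."""
    return X @ R.T + t

X = rng.normal(size=(100, 5))   # 100 records, 5 numeric attributes
R = random_rotation(5)
t = rng.normal(size=5)          # random translation vector
Y = perturb(X, R, t)

def pdist(M):
    """All pairwise Euclidean distances between rows of M."""
    diff = M[:, None, :] - M[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

# Distances are unchanged, so distance-based mining models built on the
# perturbed data Y match those built on the original X.
print(np.allclose(pdist(X), pdist(Y)))  # True
```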
Web Science: Ranking and Adaptation
For large-scale, complicated learning problems, it is very expensive to collect a sufficient amount of labeled
training data; learning to rank in web search is one such problem. There are
multiple ways to extend the training dataset, such as leveraging large
amounts of unlabeled data (i.e., semi-supervised learning), or searching over the
unlabeled data to find the most effective candidate examples for
labeling (i.e., active learning). In learning to rank, we study
novel strategies for enhancing the training data. Concretely, we develop new
algorithms to utilize pairwise preference training data mined from implicit user
feedback (GBRank), to adapt a model trained with a small amount
of labeled data to the pairwise preference data (ClickAdapt), and to adapt a
ranking function trained on one search domain to another (Tree Adaptation, or Trada). Recent developments
include understanding the effectiveness of tree adaptation for ranking and tree adaptation methods for pairwise data.
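GBRank itself fits gradient-boosted trees to pairwise preferences; as a much simpler illustration of learning from pairwise preference data, the sketch below trains a linear scoring function with a perceptron-style update: whenever the preferred document of a pair does not outscore the other, the weights move toward the preferred document's features. All names and the toy data are illustrative assumptions:

```python
def train_pairwise(prefs, dim, epochs=100, lr=0.1):
    """prefs: list of (better, worse) feature-vector pairs.
    Learn weights w so that score(better) > score(worse) on each pair."""
    w = [0.0] * dim
    score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
    for _ in range(epochs):
        for better, worse in prefs:
            if score(better) <= score(worse):      # preference violated
                for i in range(dim):               # nudge w toward `better`
                    w[i] += lr * (better[i] - worse[i])
    return w

# Toy features per document: (text-match score, click count). Each pair
# records that users preferred the first document over the second.
prefs = [((0.9, 5.0), (1.0, 1.0)),
         ((0.4, 4.0), (0.6, 0.0)),
         ((0.7, 3.0), (0.7, 2.0))]
w = train_pairwise(prefs, dim=2)
score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
violations = sum(score(b) <= score(c) for b, c in prefs)
print(violations)  # 0 once all preferences are satisfied
```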