Cloud Computing – DATA SCIENCE LAB

GOFFISH: A GRAPH-ORIENTED FRAMEWORK FOR FORESIGHT AND INSIGHT USING SCALABLE HEURISTICS

Sensors and online instruments performing high-fidelity observations are contributing in large measure to the growing big data analytics challenge. These datasets are unique in that they represent events, observations, and activities that are related to each other even though they are recorded by independent data streams. Existing data processing frameworks such as MapReduce, which operate on file- or row-based data, do not lend themselves to scalable analytics over such an interconnected web of stream-based data. We propose GoFFish, a scalable graph-oriented analytics framework that is well suited for trawling over reservoirs of interconnected data fed by event data streams. Our framework will help design optimized graph algorithms that leverage the specialized graph-oriented data store, GoFs, and are based on the proposed graph programming abstraction, Gopher, which analysts can use to intuitively and rapidly compose graph and event analytical models. The composed application will enable data-parallel analytics at scales far beyond those of traditional MapReduce models, using a novel distributed data partitioning approach based on edge distance heuristics. This will give commanders unprecedented insight from the reservoirs of stream data for causal graph analysis and strategic planning. Further, we propose to close the loop between insight and foresight by coupling event patterns mined from historical stream reservoirs by graph analytics with real-time event streams from sensors. Such an online stream analytics engine will provide operational leaders with augmented situation awareness and advance warning of impending conditions.

FLOE: AN ADAPTIVE FRAMEWORK FOR DYNAMIC APPLICATIONS

Traditional scientific workflows deal with static structures and process data in batch mode. However, emerging applications require continuous operation over dynamic data and changing application needs. This motivates the need for data flow programming frameworks that can adapt to changes in application structure, data feeds and speeds, and latency requirements with minimal interruption to the flow of results. In addition, the advent of elastic platforms such as Clouds requires the execution model of these frameworks to adapt to dynamism in the infrastructure. Floe is an adaptive data flow framework designed for such dynamic applications on Cloud platforms. Floe provides programming abstractions that support traditional data flow and stream processing paradigms, while allowing dynamic application recomposition and changes to streaming data sources at runtime, and leveraging elastic Cloud platforms to optimize resource usage. The many advantages of Clouds are tempered by their limitations for resilient computing, caused by the use of commodity hardware and multi-tenancy. We are investigating ways to prospectively plan the execution of Floe graphs so that they can adapt to resilience exigencies at runtime while maximizing expected net utility on unreliable Clouds. These goals will be achieved through a combination of tunable application specification, distributed resource optimization, and continuous adaptive recovery.

Floe2 is readily available on GitHub: link
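
The idea of recomposing a data flow while it keeps producing results can be illustrated with a minimal sketch. This is not Floe's API: the `Pipeline` class, its `replace` method, and the `run` helper are hypothetical names introduced only for illustration, and the flow here is a simple linear chain running in a single process rather than a distributed graph.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Stage = Callable[[float], float]


class Pipeline:
    """Hypothetical linear data flow whose stages can be swapped at runtime."""

    def __init__(self, stages: List[Tuple[str, Stage]]) -> None:
        # dict preserves insertion order, which defines the flow order.
        self._stages: Dict[str, Stage] = dict(stages)

    def replace(self, name: str, fn: Stage) -> None:
        """Swap the logic of an existing stage without rebuilding the flow."""
        if name not in self._stages:
            raise KeyError(f"unknown stage: {name}")
        # Assigning to an existing key keeps its position in the flow.
        self._stages[name] = fn

    def process(self, message: float) -> float:
        """Push one message through every stage in order."""
        for fn in self._stages.values():
            message = fn(message)
        return message


def run(
    pipeline: Pipeline,
    stream: Iterable[float],
    updates: Optional[Dict[int, Tuple[str, Stage]]] = None,
) -> List[float]:
    """Process a stream, applying any stage update scheduled for a message
    index just before that message is processed."""
    updates = updates or {}
    results = []
    for i, message in enumerate(stream):
        if i in updates:
            pipeline.replace(*updates[i])
        results.append(pipeline.process(message))
    return results
```

In this sketch, messages that arrive before an update are handled by the old stage and later ones by the new stage, so results keep flowing across the change. A real adaptive framework would also have to handle in-flight messages, stage state, and distribution, none of which is modeled here.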

PREGEL.NET: PARALLEL GRAPH PROCESSING USING CLOUD PLATFORMS

The need to analyze large-scale graphs in parallel is increasing with the growth of social networks and other scale-free networks. The Pillcrow project is exploring graph programming abstractions that are well suited for scaling on Cloud platforms. In our initial work, we are investigating the betweenness centrality algorithm, which is popular for finding key vertices in applications such as social networks, bioinformatics, and distribution networks. Several parallel formulations of this algorithm exist for supercomputers and clusters. We have studied betweenness centrality in the context of Microsoft Windows Azure and demonstrated scalable parallel performance. Key issues for a cloud-based implementation include mitigating the penalties associated with VM failures and the impact of communication overheads in the cloud. We use a combination of empirical and analytical evaluation on both synthetic small-world graphs and real-world social interaction graphs. Further, we are comparing such decoupled programming abstractions with loosely coupled ones like MapReduce and Pregel to evaluate their suitability.
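
For readers unfamiliar with betweenness centrality, the following is a minimal sequential sketch using Brandes' algorithm on an unweighted, undirected graph. It is an illustration only: the text above does not say which formulation the project uses, and the function name and adjacency-dict input format are assumptions made here. Each iteration of the outer loop over source vertices is independent of the others, which is the property parallel formulations typically exploit.

```python
from collections import deque
from typing import Dict, Hashable, Iterable


def betweenness_centrality(
    adj: Dict[Hashable, Iterable[Hashable]]
) -> Dict[Hashable, float]:
    """Brandes' algorithm for an unweighted, undirected graph.

    adj maps every vertex to its neighbours; each undirected edge must
    appear in both directions.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Phase 1: BFS from s, counting shortest paths (sigma) and
        # recording each vertex's predecessors on those paths.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:  # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Phase 2: accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # In an undirected graph every pair is counted from both endpoints.
    return {v: c / 2.0 for v, c in bc.items()}
```

On the three-vertex path `a - b - c`, for example, only `b` lies between another pair of vertices, so it is the only vertex with a non-zero score.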

CRYPTONITE: DATA SECURITY AND PRIVACY ON CLOUDS (DORMANT)

As Cloud platforms gain increasing traction among scientific and business communities for outsourcing storage, computing, and content delivery, there is also growing concern about the associated loss of control over private data hosted in the Cloud. In this paper, we present an architecture for a secure data repository service designed on top of a public Cloud infrastructure to support multi-disciplinary scientific communities dealing with personal and human subject data, motivated by the smart power grid domain. Our repository model allows users to securely store and share their data in the Cloud without revealing the plain text to unauthorized users, the Cloud storage provider, or the repository itself. The system masks file names, user permissions, and access patterns while providing auditing capabilities with provable data updates.

OPENPLANET: SCALABLE MACHINE LEARNING USING MAPREDUCE (DORMANT)

The projected increase in the use of smart meters and data collection in a Smart Grid environment means that all applications, including machine learning for demand forecasting, will be data intensive and will require scalable and reliable platforms for operations. For example, the Los Angeles Power Grid, with over 1.4 million customers, will collect and analyze terabytes of smart meter data. This data will grow further as the frequency of data collection increases and newer information sources are added. Power consumption forecasting is one of the analyses performed using machine learning models, such as the regression tree, and it is both compute and data intensive. The problem becomes intractable on a single machine for even 25,000 customers, taking several days to train the model. Our work on OpenPlanet is building scalable machine learning algorithms using the Hadoop MapReduce framework. Specifically, we study the tuning and performance issues of mapping this problem to a Hadoop cluster and investigate incremental learning models that scale over time.
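
One way to see why regression tree training can map onto MapReduce is that choosing a split needs only sums that can be computed per data partition and then merged. The sketch below illustrates this for a single numeric feature in plain Python, without Hadoop. It is a simplified illustration under stated assumptions, not OpenPlanet's implementation: the function names, the `(x, y)` row format, and the fixed list of candidate thresholds are all hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Stats = Tuple[int, float, float]  # (count, sum of y, sum of y squared)
ZERO: Stats = (0, 0.0, 0.0)


def map_partition(
    rows: Iterable[Tuple[float, float]], thresholds: List[float]
) -> Dict[Optional[float], Stats]:
    """Map step: summarise one partition. Key None holds totals for all
    rows; key t holds totals for rows with x < t (the left branch)."""
    stats: Dict[Optional[float], Stats] = {}
    for x, y in rows:
        for key in [None] + [t for t in thresholds if x < t]:
            n, s, ss = stats.get(key, ZERO)
            stats[key] = (n + 1, s + y, ss + y * y)
    return stats


def reduce_stats(
    partials: Iterable[Dict[Optional[float], Stats]]
) -> Dict[Optional[float], Stats]:
    """Reduce step: add up the per-partition summaries key by key."""
    merged: Dict[Optional[float], Stats] = {}
    for stats in partials:
        for key, (n, s, ss) in stats.items():
            mn, ms, mss = merged.get(key, ZERO)
            merged[key] = (mn + n, ms + s, mss + ss)
    return merged


def sse(n: int, s: float, ss: float) -> float:
    """Sum of squared errors around the mean, from the three totals."""
    return ss - s * s / n if n else 0.0


def best_split(
    merged: Dict[Optional[float], Stats], thresholds: List[float]
) -> Optional[Tuple[float, float]]:
    """Return (threshold, cost) minimising left SSE + right SSE, or None
    if no threshold separates the data into two non-empty sides."""
    n, s, ss = merged[None]
    best = None
    for t in thresholds:
        ln, ls, lss = merged.get(t, ZERO)
        rn, rs, rss = n - ln, s - ls, ss - lss  # right side by subtraction
        if ln == 0 or rn == 0:
            continue
        cost = sse(ln, ls, lss) + sse(rn, rs, rss)
        if best is None or cost < best[1]:
            best = (t, cost)
    return best
```

Because each partition emits only three numbers per candidate threshold, the amount of data moved between the map and reduce steps depends on the number of thresholds rather than on the number of meter readings.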