Computer Science

Reusing Results in Big Data Frameworks

Reusing Results in Big Data Frameworks (Computer Science Project Topics)


Big Data analysis has been very hot and active research during the past few years. It is getting hard to execute data analysis tasks with traditional data warehouse solutions efficiently.

Parallel processing platforms and parallel data-flow systems running on top of them are increasingly popular. They have greatly improved the throughput of data analysis tasks. The trade-off is the consumption of more computation resources. Tens or hundreds of nodes run together to execute one task.

However, completing a task might still take hours or even days. It is very important to improve resource utilization and computation efficiency. According to Microsoft’s research, around 30% of common sub-computations exist in usual workloads. Computation redundancy is a waste of time and resources.

Apache Pig is a parallel data-flow system that runs on top of Apache Hadoop, a parallel processing platform. Pig/Hadoop is one of the most popular combinations for large-scale data processing.

This project proposed a framework which materializes and reuses previous computation results to avoid computation redundancy on top of Pig/Hadoop. The idea came from the materialized view technique in Relational Databases. Computation outputs were selected and stored in the Hadoop File System due to their large size.

The execution statistics of the outputs were stored in MySQL Cluster. The framework used a plan matcher and rewriter component to find the maximally shared common computation with the query from MySQL Cluster, and rewrite the query with the materialized outputs. The framework was evaluated with the TPC-H Benchmark.

The results showed that execution time had been significantly reduced by avoiding redundant computation. By reusing sub-computations, the query execution time was reduced by 65% on average; it only took around 30 ˜ 45 seconds when reusing whole computations. Besides, the results showed that the overhead is only around 25% on average.

Source: KTH

Author: Shang, Hui


Kevin S Beyer, Vuk Ercegovac, Rainer Gemulla, Andrey Balmin, Mohamed Eltabakh, Carl-Christian Kanne, Fatma Ozcan, and Eugene J Shekita. Jaql: A scripting language for large-scale semistructured data analysis. In Proceedings of VLDB Conference, 2011.

Pramod Bhatotia, Alexander Wieder, Rodrigo Rodrigues, Umut A Acar, and Rafael Pasquin. Incoop: MapReduce for incremental computations. In Proceedings of the 2nd ACM Symposium on Cloud Computing, page 7. ACM, 2011.

Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou. SCOPE: easy and efficient parallel processing of massive data sets. Proceedings of the VLDB Endowment, 1(2):1265–1276, 2008.

Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

Iman Elghandour and Ashraf Aboulnaga. Restore: Reusing results of MapReduce jobs. Proceedings of the VLDB Endowment, 5(6):586–597, 2012.

Alan Gates. Programming Pig. O’Reilly Media, 2011.

Alan F Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, and Utkarsh Srivastava. Building a high-level dataflow system on top of Map-Reduce: the Pig experience. Proceedings of the VLDB Endowment, 2(2):1414–1425, 2009.

Jonathan Goldstein and Per-Åke Larson. Optimizing queries using materialized views: a practical, scalable solution. In ACM SIGMOD Record, volume 30, pages 331–342. ACM, 2001.

Pradeep Kumar Gunda, Lenin Ravindranath, Chandramohan A Thekkath, Yuan Yu, and Li Zhuang. Nectar: automatic management of data and computation in data centers. In Proceedings of the 9th USENIX conference on Operating systems design and implementation, pages 1–8. USENIX Association, 2010.