Optimizing Multi-TB Market Data Workloads: Advanced Partitioning and Skew Mitigation Strategies for Hive and Spark on EMR

Janardhan Reddy Kasireddy

doi:10.15680/IJCTECE.2023.0603005

Authors

Janardhan Reddy Kasireddy Lead Data Engineer, Info Drive Systems (Finra Contractor), USA Author

DOI:

https://doi.org/10.15680/IJCTECE.2023.0603005

Keywords:

Hive, Spark, EMR, query optimization, parallel processing, partition pruning, skew, performance engineering.

Abstract

One of the main issues of the contemporary big data analytics is the ability to process multi-terabyte data in a cost-effective way. This paper explores extreme parallel processing with Hive and Spark in a large scale market data setup with the view to optimizing query performance and resource consumption. The NYSE Stock Prices dataset was simulated on Kaggle, and experimented on with simulated workloads in the real world such as high-frequency trade data, historical price series and market metadata. The techniques applied were on performance engineering by developed partitioning schemes, query optimization and the mitigation of skew. The improvement of hive queries was based on dynamic partition pruning, vectorized execution, and prudent join ordering whilst Spark workloads were enabled by dataframe caching, broadcast joins, and adaptive query execution. The two environments were implemented on Amazon EMR clusters to test the scalability depending on the number of nodes and the cluster configurations.

We show that with good use of partition pruning and vectorised execution Hive was able to run up to 4x faster on aggregation intensive workloads compared to Spark, and to run up to 6x faster on iterative and join heavy queries with broadcast joins and adaptive execution. Biased data partitions have been found to be a key performance bottleneck and alleviation via salting methods enhanced throughput. The paper also identified trade-offs between allocate compute resources, manage the memory, and job parallelism in the large-scale cluster setting.

These results have a real-world implication to the enterprises working with multi-TB datasets in both Hive and Spark wherein they highlighted the significance of partition-aware design and query optimizations and cluster-level optimizations. The work can be used as a framework of performance engineering as a reference in extreme-scale data processing.

References

[1] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy, “Hive: a warehousing solution over a map-reduce framework,” Proc. VLDB Endow., vol. 2, no. 2, pp. 1626–1629, 2009. [Online]. Available: https://dl.acm.org/doi/10.14778/1687553.1687609

[2] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing,” Proc. 9th USENIX Conf. Networked Systems Design and Implementation (NSDI ’12), pp. 15–28, 2012. [Online]. Available: https://dl.acm.org/doi/10.5555/2228298.2228301

[3] W. Fan, A. Ai, A. Ghodsi, and M. Zaharia, “Adaptive query execution: Speeding up Spark SQL at runtime,” Databricks Blog, May 29, 2020. [Online]. Available: https://www.databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html

[4] E. Costa, C. Costa, and M. Y. Santos, “Partitioning and bucketing in Hive-based big data warehouses,” in WorldCIST’18 - World Conf. Info. Syst. and Technologies, Springer, 2018, pp. 764–774.

[5] E. Costa, C. Costa, and M. Y. Santos, “Efficient big data modelling and organization for Hadoop Hive-based data warehouses,” in 14th European, Mediterranean, and Middle Eastern Conf. (EMCIS), Coimbra: Springer, 2017, pp. 3–16.

[6] A. S. Kumar, “Performance analysis of MySQL partition, Hive partition-bucketing and Apache Pig,” in 1st India Int. Conf. Information Processing (IICIP), 2016, pp. 1–6.

[7] C. L. Philip Chen and C. Y. Zhang, “Data-intensive applications, challenges, techniques and technologies: a survey on big data,” Information Sciences, vol. 275, pp. 314–347, 2014.

[8] M. Y. Santos, C. Costa, J. Galvão, et al., “Evaluating SQL-on-Hadoop for big data warehousing on not-so-good hardware,” in 21st Int. Database Engineering & Applications Symp. (IDEAS), ACM, New York, NY, USA, 2017, pp. 242–252.

[9] S. Shaw, A. F. Vermeulen, A. Gupta, and D. Kjerrumgaard, Practical Hive: A Guide to Hadoop’s Data Warehouse System. New York: Apress, 2016.

[10] D. Du, Apache Hive Essentials. Packt Publishing Ltd., 2015.

[11] P. Zikopoulos and C. Eaton, Understanding Big Data: Analytics for Enterprise-Class Hadoop and Streaming Data, 1st ed. Delhi: McGraw-Hill, 2011.

[12] C. Costa and M. Y. Santos, “The SusCity big data warehousing approach for smart cities,” in 21st Int. Database Engineering & Applications Symp., 2017, pp. 264–273.

[13] A. De Mauro, M. Greco, and M. Grimaldi, “What is big data? A consensual definition and a review of key research topics,” in AIP Conf. Proc., AIP Publishing, 2015, pp. 97–104.

[14] A. S. Kumar, “Performance analysis of Hive partitioning and bucketing,” in 1st India Int. Conf. Information Processing (IICIP), 2016, pp. 1–6.

[15] Zvara, Z., Szabó, P. G. N., Lóránt, B. B., & Benczúr, A. A. (2021). System-aware dynamic partitioning for batch and streaming workloads. arXiv. https://arxiv.org/abs/2105.15023

[16] Li, Y., Chen, J., & Zhou, Y. (2020). ImRP: A predictive partition method for data skew alleviation in Spark streaming environment. Parallel Computing, 100, 102699. https://doi.org/10.1016/j.parco.2020.102699

[17] Kumar, A., Ni, S., Wang, Z., Huang, Y., & Chen, C. (2022). Reshape: Adaptive result-aware skew handling for exploratory analysis on big data. arXiv. https://arxiv.org/abs/2208.13143

Optimizing Multi-TB Market Data Workloads: Advanced Partitioning and Skew Mitigation Strategies for Hive and Spark on EMR

Authors

DOI:

Keywords:

Abstract

References

Downloads

Published

Issue

Section

How to Cite

Make a Submission

open-access

Menu

License

Information