Abstract
MLlib is Apache Spark's library of machine learning algorithms and data processing utilities. Although the default configurations of these algorithms yield satisfactory results for practitioners, further tuning is often needed to use resources efficiently, and tuned MLlib algorithms can run faster than their default counterparts. The size of this improvement, however, depends on several factors, including machine settings, dataset characteristics, and operating system preferences. Previous studies have generally focused on developing sophisticated tuners for MLlib and on evaluating how competitive algorithm-focused optimizers are. While derivative-based and model-free optimizers have been adapted for MLlib, sampling-based optimizers remain largely overlooked. To fill this gap, this study empirically compares sampling-based and model-free techniques for tuning MLlib. First, Monte Carlo and Cross-Entropy sampling algorithms are adapted to optimize MLlib algorithms. Then, model-free techniques, including grid and random search, are compared against these sampling-based algorithms. Extensive experiments highlight the advantages and limitations of each approach. Finally, threats to validity and future directions for unlocking the tuning potential of Apache Spark are discussed by interpreting performance bottlenecks and identifying promising areas for optimization.
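To make the Cross-Entropy idea concrete, the sketch below tunes a single continuous hyperparameter by repeatedly sampling candidates from a Gaussian, keeping the elite fraction with the lowest loss, and refitting the Gaussian to those elites. This is an illustrative sketch, not the study's implementation: in practice the objective would wrap a Spark MLlib training and evaluation run, whereas here a synthetic quadratic loss stands in, and all names (`cross_entropy_search`, the optimum at `regParam = 0.3`) are assumptions made for the example.

```python
import random
import statistics

def cross_entropy_search(objective, mean, std,
                         n_samples=50, elite_frac=0.2, iters=20):
    """Cross-Entropy method over one continuous hyperparameter.

    Each iteration samples candidates from N(mean, std), keeps the
    elite fraction with the lowest objective value, and refits the
    Gaussian to those elites so the search concentrates on good regions.
    """
    n_elite = max(2, int(n_samples * elite_frac))
    for _ in range(iters):
        samples = [random.gauss(mean, std) for _ in range(n_samples)]
        elites = sorted(samples, key=objective)[:n_elite]
        mean = statistics.mean(elites)
        std = statistics.stdev(elites) + 1e-6  # floor to avoid premature collapse
    return mean

# Stand-in objective: pretend validation loss is minimized at regParam = 0.3.
random.seed(0)
loss = lambda reg: (reg - 0.3) ** 2
best = cross_entropy_search(loss, mean=1.0, std=1.0)
```

Monte Carlo tuning corresponds to the degenerate case of a single iteration with a wide fixed distribution, which is one way to see why the adaptive refitting step is the distinguishing feature of the Cross-Entropy method.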

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
