Enhancing Anomaly Detection in Large-Scale Log Data Using Machine Learning: A Comparative Study of SVM and KNN Algorithms with HDFS Dataset

Keywords

Anomaly Detection
KNN
SVM
Machine Learning
HDFS

How to Cite

Enhancing Anomaly Detection in Large-Scale Log Data Using Machine Learning: A Comparative Study of SVM and KNN Algorithms with HDFS Dataset. (2024). ADBA Computer Science, 1(1), 14-18. https://doi.org/10.69882/adba.cs.2024073

Abstract

As information technology advances rapidly, servers and mobile and desktop applications have become frequent targets of attack because of their high value, and cyber attacks have raised serious concern in many fields. Anomaly detection plays a significant role in defending against such attacks, and log records, which capture detailed system runtime information, have consequently become an important object of analysis. Traditional log anomaly detection relies on programmers manually inspecting logs through keyword searches and regular expression matching. Although intrusion detection systems can reduce this workload, the sheer volume of log data, the diversity of attack types, and attackers' improving skills make traditional detection inefficient. To improve on it, many anomaly detection mechanisms, especially machine learning methods, have been proposed in recent years. This study proposes an anomaly detection system for large-scale log data based on two machine learning algorithms, Support Vector Machines (SVM) and K-Nearest Neighbors (KNN). Experiments on the Hadoop Distributed File System (HDFS) log dataset show that the system provides higher detection accuracy and can detect previously unknown anomalies.
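
To make the comparison concrete, the following is a minimal sketch of how KNN and a linear SVM might each label a log session as normal or anomalous. It is not the authors' implementation. The abstract does not describe the feature extraction, so the per-session event-count vectors, the toy data, the names `TRAIN` and `TEST`, and all hyperparameters are illustrative assumptions. The SVM is a small hand-written linear model trained by subgradient descent so that the example needs only the Python standard library; a real experiment would more likely use an established library.

```python
import math
from collections import Counter

# Hypothetical event-count vectors, one per HDFS block session (assumed
# features, not taken from the paper): [blocks received, blocks served,
# exceptions]. Labels: +1 = anomaly, -1 = normal.
TRAIN = [
    ([3, 3, 0], -1), ([4, 3, 0], -1), ([3, 4, 1], -1), ([5, 4, 0], -1),
    ([3, 1, 4], 1), ([2, 0, 5], 1), ([4, 1, 6], 1), ([1, 0, 3], 1),
]
TEST = [([4, 4, 0], -1), ([3, 3, 1], -1), ([2, 1, 5], 1), ([3, 0, 4], 1)]


def knn_predict(train, x, k=3):
    """Label x by majority vote among its k nearest training vectors."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def train_linear_svm(train, epochs=200, lr=0.01, reg=0.01):
    """Fit (w, b) by batch subgradient descent on the L2-regularised hinge loss."""
    dim = len(train[0][0])
    w, b = [0.0] * dim, 0.0
    n = len(train)
    for _ in range(epochs):
        grad_w = [reg * wi for wi in w]  # gradient of the regulariser
        grad_b = 0.0
        for x, y in train:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # only margin violators contribute to the hinge term
                grad_w = [g - y * xi / n for g, xi in zip(grad_w, x)]
                grad_b -= y / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def svm_predict(model, x):
    """Return +1 (anomaly) or -1 (normal) from the sign of w.x + b."""
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


def accuracy(predict, samples):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(predict(x) == y for x, y in samples) / len(samples)


if __name__ == "__main__":
    svm_model = train_linear_svm(TRAIN)
    print("KNN accuracy:", accuracy(lambda x: knn_predict(TRAIN, x), TEST))
    print("SVM accuracy:", accuracy(lambda x: svm_predict(svm_model, x), TEST))
```

KNN stores the training vectors and defers all work to prediction time, whereas the SVM pays a training cost up front and then predicts with a single dot product. On log data at HDFS scale, that trade-off is likely to matter as much as raw accuracy.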

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.