Efficient Malware Classification Using Multiprocessing and Bag-of-Words Vectorization

Avijit Chowdhury; Touhidul Alam Seyam; Moin Uddin Ahmed Babar; Chonchal Khan; Saima Akter

doi:doi:10.11648/j.net.20251201.12

Research Article |

| Peer-Reviewed

Efficient Malware Classification Using Multiprocessing and Bag-of-Words Vectorization

Avijit Chowdhury^*

, Touhidul Alam Seyam

, Moin Uddin Ahmed Babar, Chonchal Khan, Saima Akter

Published in Advances in Networks (Volume 12, Issue 1)

Received: 3 January 2025 Accepted: 20 January 2025 Published: 21 March 2025

Views: Downloads:

Download PDF

Share This Article

Twitter
Linked In
Facebook

Abstract

The increasing incidence of malware presents significant obstacles to cybersecurity, necessitating sophisticated and effective techniques for detection and categorization. This paper presents a novel method that improves the precision and efficacy of malware classification by utilizing multi-processing and Bag-of-Words (BoW) vectorization. The suggested technique utilizes feature vectors to depict malware samples, leveraging the parallel processing capabilities of contemporary computer systems to expedite classification operations. This solution centers on a bespoke HexVectorizer that transforms hexadecimal strings obtained from malware binaries into feature vectors. The approach employs a balanced subset of the Microsoft Malware Classification dataset, meticulously preprocessed for dependable evaluation to assure thorough analysis. Python's multiprocessing module is utilized to meet the computing requirements of extensive datasets, facilitating the parallelization of vectorization processes and markedly enhancing processing performance. The classification system is centered on XGBoost's XGBClassifier, recognized for its superior performance and accuracy in addressing malware detection and classification issues. The experimental findings validate the efficacy of the suggested technique in real-time malware detection, confirming its relevance to diverse cybersecurity contexts. This paper offers an exhaustive elucidation of the implementation procedure, accompanied by thorough performance assessments. The results highlight the method's promise as a scalable and efficient approach for tackling the increasing issues of malware classification in cybersecurity. This research greatly advances the creation of resilient and efficient malware detection systems by integrating sophisticated vectorization techniques with cutting-edge machine learning algorithms, therefore fulfilling essential requirements in a dynamic threat environment.

Published in	Advances in Networks (Volume 12, Issue 1)
DOI	10.11648/j.net.20251201.12
Page(s)	19-28
Creative Commons	This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.
Copyright	Copyright © The Author(s), 2025. Published by Science Publishing Group

Keywords

Bag-of-Words (BoW), Cybersecurity, Multiprocessing, Malware Classification, Machine Learning

1. Introduction

The rapid spread of malware is a persistent and major problem in cybersecurity. There is a pressing need for novel methods to effectively detect and categorize malware, as malevolent actors are always developing more complex ways to avoid detection and corrupt systems. This research proposes a unique malware classification methodology integrating Bag-of-Words (BoW) vectorization techniques and multiprocessing. This research is motivated by the need to defend digital infrastructure against an ever-evolving array of cyber-attacks.

Given the vast number and diversity of malware variants, conventional signature-based detection algorithms have found it challenging to remain current. These strategies are less successful against new, unknown, or polymorphic threats because they depend on well-known malware patterns and traits that have already been recognized. Thus, it is necessary to investigate substitute tactics that may offer stronger and more flexible defenses. The proposed methodology seeks to improve the malware classification accuracy and efficiency by utilizing the parallel processing power of contemporary computer architectures and BoW vectorization to represent malware samples as feature vectors.

This research aims to address the main issue of the inability of standard malware detection algorithms to handle a large and continuously increasing number of malware variants. Because signature-based techniques rely on preexisting knowledge, they are becoming less successful in recognizing novel and complex malware. More dynamic and scalable methods are required to adjust to the ever-changing threat landscape without incurring exorbitant computing expenses.

How might integrating multiprocessing and Bag-of-Words (BoW) vectorization techniques enhance the effectiveness and precision of malware classification compared to conventional signature-based methodologies.

By combining multi-processing and Bag-of-Words (BoW) vectorization approaches, this research seeks to enhance the field of malware classification significantly. The main goal is to provide a solid framework that presents an extensive malware categorization approach that may change to accommodate evolving cyber threats. This work aims to reduce computational costs by improving malware classification accuracy and efficiency by utilizing the parallel processing power of contemporary computer systems.

This study presents a methodical procedure for gathering data, preprocessing, extracting features, and training a malware-specific model. This approach ensures that every stage is thoughtfully planned to maximize the classification system’s overall performance. Moreover, performance metric analysis and experimentation were used for empirical validation to demonstrate the effectiveness of the proposed methodology in practical settings.

2. Literature Review

Malware classification is a critical component of cybersecurity, and the aim is to identify and categorize malicious software to protect computer systems and networks. This section provides a comprehensive review of the existing literature, focusing on the role of multi-processing and Bag-of-Words (BoW) vectorization in malware analysis. Latent Support Measure Machines (latent SMM) enhance Bag-of-Words (BoW) data classification by estimating latent vectors for words, capturing the semantic relationships and co-occurrence patterns that traditional Support Vector Machines (SVMs) miss. This approach leads to state-of-the-art accuracy in BoW text classification and demonstrates robustness in terms of hyperparameters. Additionally, latent SMM is valuable for visualizing the relationships between words. In contrast, SVMs struggle to accurately reflect the co-occurrence of similar words because the kernel values between data points may not be properly defined

[1]

. AttentionXML proposes a label tree-based deep learning model for extreme multi-label text classification, featuring a multilabel attention mechanism that uses raw text input to capture the most relevant parts of the text for each label and a shallow, wide probabilistic label tree (PLT) for efficiently handling millions of labels, especially rare “tail labels.” Whereas traditional Bag-of-Words (BoW) methods overlook word context and deep semantic information, and deep learning methods often struggle with scalability and capturing subtext, AttentionXML outperforms all existing methods on six benchmark datasets, particularly for handling tail labels among label tree-based approaches

[2]

. A novel approach for malware classification leverages multi-processing and Bag-of-Words (BoW) vectorization to enhance efficiency. The method uses multithreading to process data in parallel across all CPU cores, thereby significantly optimizing the computation time. Employing bigram BoW and pixel intensity features improves accuracy and ensures efficient data processing through parallel execution

[3]

. Wang et al. proposed a new malware classification system that achieved 99.87% accuracy using the XGBoost algorithm. This system utilizes a fusion feature set that combines binary and assembly malware features through a forward feature stepwise selection technique based on the BIG2015 dataset. While small n-grams may fail to capture complex code patterns owing to code reuse in malware, the fusion feature set effectively addresses this issue. The LightGBM and CatBoost algorithms were also tested, achieving accuracies of 99.84% and 99.76%, respectively

[4]

. The authors present a hybrid XceptionCNN-LightGBM model for Windows Malware classification utilizing the Malimg malware dataset. This model integrates the XceptionCNN architecture with the LightGBM algorithm to enhance classification accuracy and performance. While the generic LightGBM algorithm achieved a classification accuracy of 99% True Positive Rate (TPR), the proposed XceptionCNN-LightGBM technique achieved a remarkable 100% TPR, indicating superior performance. However, the authors acknowledge the need for further improvements in accuracy and performance, particularly when scaling up to larger sample sizes

[5]

. The article fails to address effective malware classification by multiprocessing and Bag-of-Words vectorization. It emphasizes URL-based malware detection via NLP approaches, notably implementing the ROBERTa model for enhanced accuracy and minimal false-positive rates

[6]

. Employed multiprocessing and Bag-of-Words vectorization for effective malware categorization. Utilized multithreading to process data for optimization concurrently

[7]

. This study presents a hybrid model that combines XceptionCNN and LightGBM for malware classification, attaining a 100% True Positive Rate and decreased prediction durations. It omits the discussion of multiprocessing and Bag-of-Words vectorization for effective malware classification

[8]

. The authors concentrate on a machine learning methodology for malware classification, including data analysis, feature engineering, and modeling techniques. Nevertheless, it does not explicitly encompass multiprocessing or Bag-of-Words vectorization methodologies

[9]

. The study omits the topic of effective malware classification by multiprocessing and Bag-of-Words vectorization. The study emphasizes using byte sequences from executable files to identify and classify IoT malware with machine learning techniques such as SVM, KNN, and MLP

[10]

. The report fails to discuss effective malware classification via multiprocessing and Bag-of-Words vectorization. The framework employs machine learning methods for malware identification and classification, utilizing Cuckoo Sandbox for analysis and the Weka Framework for model construction

[11]

. The study does not address effective malware categorization by multiprocessing or bag-of-words vectorization. The study emphasizes the categorization of malware using extracted printable strings with several classification methods, attaining a classification accuracy of 97% without the aforementioned methodologies

[12]

. The study emphasizes memory-efficient malware detection with a bag-of-words methodology, achieving a 95% reduction in memory consumption during training. Nonetheless, it does not explicitly discuss multiprocessing methodologies for malware categorization

[13]

. The study omits the topic of effective malware classification by multiprocessing and Bag-of-Words vectorization. It emphasizes a hybrid methodology integrating static and dynamic analytic capabilities with machine learning to forecast malware execution duration and family categorization

[14]

3. Methodology

Data preparation, feature extraction, parallel processing, and model training are important phases in the multiprocessing and Bag-of-Words (BoW) vectorization approach for malware classification. This section outlines each procedure to explain the methodology employed in this study thoroughly.

3.1. Data Preprocessing

3.1.1. Data Loading and Exploration

The dataset was loaded from. Byte files, each containing a hexadecimal representation of the malware binaries. Initial exploratory data analysis (EDA) was conducted to understand the distribution and characteristics of the data, including inspecting the malware families and the size of the malware. Bytes files.

3.1.2. Data Balancing

The dataset was imbalanced, with some malware families being more represented than others. To address this, the authors sampled the dataset to create a balanced subset, ensuring each malware family has an equal number of samples. This step is crucial to prevent the model from being biased towards more prevalent families.

3.2. Feature Extraction

3.2.1. Hexadecimal String Extraction

The hexadecimal strings were extracted from. Byte files by removing non-hexadecimal characters and memory addresses. This results in clean sequences of hex values that represent malware binary content.

3.2.2. Custom HexVectorizer

A custom HexVectorizer class was developed to transform hex strings into feature vectors. This class leverages Scikit learnaˆs CountVectorizer to tokenize the hex values and convert them into a sparse matrix of token counts. The HexVectorizer processes the hex values in blocks (e.g., two bytes at a time) to build a Bag-of-Words (BoW) model.

3.2.3. Including Additional Features

In addition to the BoW vectors, here is the file size for each. Byte file is included as an additional feature. This helps to capture the differences in file sizes across various malware families, potentially improving the classification accuracy.

3.3. Parallel Processing (Multiprocessing) for Feature Extraction

Owing to the large size of the dataset, the feature extraction is computationally intensive. To address this, the authrs implemented multi-processing using Python’s multi-processing library. The data are split into chunks, and the parallelize function distributes these chunks across multiple CPU cores for concurrent processing, thereby significantly reducing the time required for vectorization.

[1]	Yoshikawa, Yuya, Tomoharu Iwata, and Hiroshi Sawada. “Latent Support Measure Machines for Bag-of-Words Data Classification.” Proceedings Article, 2014.
[2]	Ronghui, Zihan Zhang, Ziye Wang, Suyang Dai, Hiroshi Mamitsuka, and Shanfeng Zhu. “AttentionXML: Label Tree-based Attention-Aware Deep Model for High- Performance Extreme Multi-Label Text Classification.” Proceedings Article, 2019.
[3]	Banerjee, Shobhan, Bibhuti Bhusan Dash, M. Rath, Tanmaya Swain, and Tapaswini Samant. “Malware Classification using Bigram BOW, Pixel Intensity Features, and Multiprocessing.” Proceedings Article, 2022. https://doi.org/10.1109/CONECCT55679.2022.9865764
[4]	Chen, Zhiguo, and Xuanyu Ren. “An Efficient Boosting- Based Windows Malware Family Classification System Using Multi-Features Fusion.” Journal Article, Applied Sciences, 2023. https://doi.org/10.3390/app13064060
[5]	Onoja, M. Nelson, Abayomi Jegede, N. V Blamah, Abinbola Victor Olawale, and Temidayo Oluwatosin Omotehinwa. “EEMDS: Efficient and Effective Malware Detection System with Hybrid Model based on XceptionCNN and LightGBM Algorithm.” Journal Article, 2022.
[6]	Y. P. Kumar S, S. Mishra and R. K. Singh, "Unleashing the Potential of Machine Learning and NLP Contextual Word Embedding for URL-Based Malicious Traffic Classification," 2024 IEEE 99th Vehicular Technology Conference (VTC2024-Spring), Singapore, Singapore, 2024, pp. 1-5, https://doi.org/10.1109/VTC2024-Spring62846.2024.10683482
[7]	S. Banerjee, B. B. Dash, M. K. Rath, T. Swain and T. Samant, "Malware Classification using Bigram BOW, Pixel Intensity Features, and Multiprocessing," 2022 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), Bangalore, India, 2022, pp. 1-5, https://doi.org/10.1109/CONECCT55679.2022.9865764
[8]	M. Onoja, A. Jegede, N. Blamah, O. V. Abimbola, and T. O. Omotehinwa, “EEMDS: Efficient and Effective Malware Detection System with Hybrid Model based on XceptionCNN and LightGBM Algorithm”, JCSI, vol. 1, no. 2, pp. 42–57, Oct. 2022, https://doi.org/10.33736/jcsi.4739.2022
[9]	Houssem Hosni. Machine learning approach for malware multiclass classification. BRAINS 2019 - 1st Blockchain, Robotics, AI for Networking Security Conference, Mar 2019, Rio de janeiro, Brazil. ffhal-02075139f.
[10]	T. -L. Wan et al., "Efficient Detection and Classification of Internet-of-Things Malware Based on Byte Sequences from Executable Files," in IEEE Open Journal of the Computer Society, vol. 1, pp. 262-275, 2020, https://doi.org/10.1109/OJCS.2020.3033974
[11]	Kamalakanta Sethi. A novel malware analysis for malware detection and classification using machine learning algorithms. In Proceedings of the 10th International Conference on Security of Information and Networks (SIN '17). Association for Computing Machinery, New York, NY, USA, 107–113. https://doi.org/10.1145/3136825.3136883
[12]	R. Tian, L. Batten, R. Islam and S. Versteeg, "An automated classification system based on the strings of trojan and virus families," 2009 4th International Conference on Malicious and Unwanted Software (MALWARE), Montreal, QC, Canada, 2009, pp. 23-30, https://doi.org/10.1109/MALWARE.2009.5403021
[13]	Yueming, Wang, Meng, Zhang. 13. Memory-efficient detection of large-scale obfuscated malware. International Journal of Wireless and Mobile Computing, https://doi.org/10.1504/ijwmc.2024.136586
[14]	Sean, Kilgallon, Leonardo, De, La, Rosa., John, Cavazos. (2017). 14. Improving the effectiveness and efficiency of dynamic malware analysis with machine learning. https://doi.org/10.1109/RWEEK.2017.8088644
[15]	A. Chowdhury, “Advancing Multi-Class Arc Welding Defect Classification: DEEPTLWELD Intelligent System Utilizing Computer Vision, Deep Learning, and Transfer Learning on Radiographic X-ray Images for Bangladesh’s Manufacturing Sector,” 2024 IEEE International Conference on Computing, Applications and Systems (COMPAS), Cox’s Bazar, Bangladesh, 2024, pp. 1-6, https://doi.org/10.1109/COMPAS60761.2024.10796006
[16]	Hossain, Sazzad, et al. "Fine-tuning LLaMA 2 interference: a comparative study of language implementations for optimal efficiency." (2025). https://doi.org/10.48550/arXiv.2502.01651