BODMAS Malware Dataset
Update (10/09/2023) - Since Limin has graduated, please email his labmate Zhi Chen (zhic4@illinois.edu) and CC Dr. Gang Wang (gangw@illinois.edu) for all future requests.
Update (12/15/2021) - Malware category information is available at Google Drive
Update (08/29/2021) - Source code is available at: GitHub
BODMAS is short for Blue Hexagon Open Dataset for Malware AnalysiS. We collaborate with Blue Hexagon to release a dataset containing timestamped malware samples and well-curated family information for research purposes. The BODMAS dataset contains 57,293 malware samples and 77,142 benign samples collected from August 2019 to September 2020, with carefully curated family information (581 families).
We extract the feature vectors using the LIEF project (version 0.9.0), the same as the Ember dataset (details can be found here). Each sample is represented as a 2,381-dimensional feature vector, along with its label (benign or malicious) and, if it is malicious, its malware family. We also release the original binaries for malware samples only.
Further details can be found in our paper “BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware” [PDF], Deep Learning and Security Workshop 2021 (co-located with IEEE Security and Privacy 2021).
If you end up building on this dataset as part of a project or publication, please include a reference to our paper:
@inproceedings{bodmas,
title = {BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware},
author = {Yang, Limin and Ciptadi, Arridhana and Laziuk, Ihar and Ahmadzadeh, Ali and Wang, Gang},
booktitle = {4th Deep Learning and Security Workshop},
year = {2021}
}
Download
- The feature vectors and metadata are open to everyone. Download the data here: Google Drive
  - feature vectors (~250 MB): bodmas.npz
  - metadata (~12 MB): bodmas_metadata.csv
  - Both files are sorted by timestamp in ascending order (i.e., the i-th feature vector corresponds to the i-th row in the metadata file).
- We cannot release the original file for the benign software due to copyright considerations. But we will host the original binaries of malware samples.
To avoid misuse, please read and agree to the following conditions before sending us emails.
- Please email Zhi Chen (zhic4@illinois.edu) and CC Gang (gangw@illinois.edu). Also, please include your Gmail address in the body so that we can add you to the Google Drive folder where the dataset is stored.
- Do not share the data with anyone else (except your co-authors for the project). We are happy to share with other researchers upon request.
- Explain in a few sentences what you plan to do with these binaries. It does not have to be a precise plan.
- If you are in academia, contact us using your institution email and provide us a webpage registered at the university domain that contains your name and affiliation.
- If you are in research (industrial) labs, send us an email from your company’s email account and introduce yourself and company. In the email, please attach a justification letter (in PDF format) in official letterhead. The letter needs to state clearly the reasons why this dataset is being requested.
Please note that an email that does not follow these conditions might be ignored. We keep a public list of the organizations that have accessed these samples at the bottom of this page.
Get Started
- To load the feature vectors, load bodmas.npz (a NumPy compressed format) with the following code. Note that the feature values are unnormalized, which is fine for classifiers such as gradient-boosted decision trees, but you may need to normalize them first when applying an MLP classifier.

    import numpy as np

    filename = 'bodmas.npz'
    data = np.load(filename)
    X = data['X']  # all the feature vectors
    y = data['y']  # labels, 0 as benign, 1 as malicious
    print(X.shape, y.shape)
    # >>> (134435, 2381) (134435,)
- bodmas_metadata.csv has three columns: the SHA-256 hash, when the sample first appeared, and the malware family. If the malware family is empty, then it’s a benign sample.
- Top malware families and their number of samples (>= 1,000) are as follows:
  1. sfone: 4729
  2. wacatac: 4694
  3. upatre: 3901
  4. wabot: 3673
  5. small: 3339
  6. ganelp: 2232
  7. dinwod: 2057
  8. mira: 1960
  9. berbew: 1749
  10. sillyp2p: 1616
  11. ceeinject: 1169
  12. gepys: 1124
  13. benjamin: 1071
  14. musecador: 1054
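The two points above (samples are ordered by first-seen time, and feature values are unnormalized) can be combined into a time-based train/test split followed by feature scaling. The following is only a minimal sketch under stated assumptions: the helper names `temporal_split` and `standardize` are hypothetical, the cutoff date is arbitrary, and the small synthetic arrays stand in for `bodmas.npz` and the timestamp column of `bodmas_metadata.csv`. It also assumes the timestamps are in a form that NumPy can parse as `datetime64`; real metadata may need extra parsing.

```python
import numpy as np


def temporal_split(X, y, timestamps, cutoff):
    """Split samples into those first seen before `cutoff` and those seen on or after it."""
    timestamps = np.asarray(timestamps, dtype="datetime64[s]")
    before = timestamps < np.datetime64(cutoff)
    return (X[before], y[before]), (X[~before], y[~before])


def standardize(X_train, X_test):
    """Scale each feature to zero mean and unit variance using training statistics only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # constant features would otherwise cause division by zero
    return (X_train - mean) / std, (X_test - mean) / std


# Synthetic stand-in data (assumption): row i of the timestamps describes X[i],
# mirroring the row-by-row correspondence between the two released files.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(6, 4))
X[:, 3] = 7.0  # a constant feature, to exercise the zero-variance guard
y = np.array([0, 1, 0, 1, 1, 0])
timestamps = ["2019-08-30", "2019-11-02", "2020-01-15",
              "2020-04-01", "2020-06-20", "2020-09-10"]

# Train on samples first seen before the (arbitrary) cutoff, test on the rest.
(train_X, train_y), (test_X, test_y) = temporal_split(X, y, timestamps, "2020-04-01")
train_X, test_X = standardize(train_X, test_X)
print(train_X.shape, test_X.shape)
```

Computing the mean and standard deviation on the training portion only keeps information from later samples out of the scaling step, which matters when the goal is temporal analysis.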
Organizations That Requested Our Dataset
- Simon Fraser University, Canada
- Oracle Labs
- Columbia University
- Telkom University, Indonesia
- University of Alberta, Canada
- Orange Inc., France
- Beijing Institute of Technology
- College of Engineering Pune, India
- University of Salerno, Italy
- Shanghai Jiao Tong University
- Southeast University
- Beijing University of Posts and Telecommunications
- Guizhou Normal University
- Korea University
- Guilin University of Electronic Technology
- New York University
- University of Chinese Academy of Sciences
- University of the West of England (UWE) Bristol
- University College Dublin, Ireland
- Women Engineering College, Ajmer, India
- Beijing University of Technology
- Air University Islamabad, Pakistan
- Eastern Connecticut State University
- Yonsei University, South Korea
- Arizona State University
- Bandung Institute of Technology, Indonesia
- University of Southampton, United Kingdom
- Xidian University
- University of Balamand, Lebanon
- The University of Chicago
- Xinjiang University
- University of Turin, Italy
- Punjab University College of Information Technology, Pakistan
- Guangzhou University
- Middle East Technical University, Turkey
- Microsoft
- Sana'a University, Yemen
- HarfangLab, France
- Purdue University Northwest
- PSG College of Technology, India
- University of Windsor, Canada
- Georgia Tech
- De Montfort University, United Kingdom
- Ghent University, Belgium
- Iowa State University
- Macquarie University, Australia
- Hongik University, South Korea
- UiTM Shah Alam, Malaysia
- Hanoi University of Science and Technology, Vietnam
- Ain Shams University, Egypt
- Open University of Catalonia, Spain
- Amrita Vishwa Vidyapeetham, India
- National University of Science and Technology, Zimbabwe
- Nagoya University, Japan
- Institute of Information Security, Japan
- Heriot-Watt University, United Kingdom
- Edinburgh Napier University, United Kingdom
- Istanbul University-Cerrahpaşa, Turkey
- Zhejiang University
- Hanyang University, South Korea
- Army Engineering University of PLA
- Purdue University
- University of Molise, Italy
- SharpAI LLC
- Silesian University of Technology, Poland
- Florida State University
- University of Bath, United Kingdom
- National University of Computer and Emerging Sciences, Pakistan
- Chungnam National University, South Korea
- PeeploTech
- Damietta University, Egypt
- Queen's University Belfast, United Kingdom
- Vilnius Tech, Lithuania
- Indian Institute of Technology Roorkee, India
- Beijing University of Civil Engineering and Architecture
- University of Quebec in Outaouais, Canada
- National Institute of Technology Raipur, India
- University of Colorado Colorado Springs
- University of Technology and Applied Sciences, Oman
- University of Portsmouth, United Kingdom
- Brno University of Technology, Czechia
- Royal Holloway, University of London, United Kingdom
- The University of Alabama in Huntsville
- University of Portsmouth, United Kingdom
- Wuhan University
- Guizhou University
- Amrita Vishwa Vidyapeetham, India
- Birkbeck, University of London, United Kingdom
- GoldenEye Inc
- Huazhong University of Science and Technology
- Sam Houston State University
- Hoseo University, South Korea
- East China University of Science and Technology
- Xiamen University Malaysia
- Pamantasan ng Lungsod ng Maynila, Pilipinas
- Sichuan University
- Nanjing University of Information Science and Technology
- University of Information Technology, Ho Chi Minh City, Vietnam
- Seoul National University of Science and Technology, South Korea
- University of Science and Technology of China
- Tsukuba University, Japan
- University of Toronto, Canada
- Charles Darwin University, Australia
- Zoho Corporation, India
- University of Cape Town, South Africa
- Sivas University of Science and Technology, Turkey
- University of Bari Aldo Moro, Italy
- UET Lahore University of Engineering and Technology
- Bandung Institute of Technology, Indonesia
- Sungshin Women's University, South Korea
- Budapest University of Technology and Economics, Hungary
- University of Bari (islab-uniba), Italy
- Dongguk University, South Korea
- People's Public Security University, China
- Fujian Normal University, China
- Qassim University, Saudi Arabia
- Sichuan University, China
- Zhejiang Normal University, China
- University of Minnesota
- Amrita Vishwa Vidyapeetham, India
- Indian Institute of Technology Jammu, India
- Babes-Bolyai University of Cluj-Napoca, Romania
- Texas A&M University
- Ho Chi Minh City University of Technology, Vietnam
- AnxinSec, China
- Czech Technical University in Prague, Czechia
- Koç University, Turkey
- Telkom University, Indonesia
- ShanghaiTech University, China
- University of Electronic Science and Technology of China, China
- VNU-HCM University of Information Technology, Vietnam
- Johns Hopkins University
- Umm Al-Qura University, Kingdom of Saudi Arabia
- Federal University of Parana, Brazil
- University of Sannio in Benevento, Italy
- German University in Cairo, Egypt
- BRAC University, Bangladesh
- University of Piraeus, Greece
- ECIT-Queens University Belfast, Northern Ireland
- Nanjing University of Posts and Telecommunications, China
- National University of Defense Technology, China
- Numidia Institute of Technology, Algeria
- George Washington University
Contributors
Limin Yang, Ph.D. from UIUC.
Arridhana Ciptadi, Blue Hexagon Inc.
Ihar Laziuk, Blue Hexagon Inc.
Ali Ahmadzadeh, Blue Hexagon Inc.
Gang Wang, Associate Professor at UIUC