BODMAS Malware Dataset

Update (10/09/2023) - Since Limin has graduated, please email his labmate Zhi Chen (zhic4@illinois.edu) and CC Dr. Gang Wang (gangw@illinois.edu) for all future requests.

Update (12/15/2021) - Malware category information is available on Google Drive

Update (08/29/2021) - Source code is available at: GitHub

BODMAS is short for Blue Hexagon Open Dataset for Malware AnalysiS. We collaborate with Blue Hexagon to release a dataset containing timestamped malware samples and well-curated family information for research purposes. The BODMAS dataset contains 57,293 malware samples and 77,142 benign samples collected from August 2019 to September 2020, with carefully curated family information (581 families).

We extract the feature vectors using the LIEF project (version 0.9.0), the same as the Ember dataset (details can be found here). Each sample is represented as a 2,381-dimensional feature vector, along with its label (benign or malicious) and, if it is malicious, its malware family. We also release the original binaries, for malware samples only.

Further details can be found in our paper “BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware” [PDF], Deep Learning and Security Workshop 2021 (co-located with IEEE Security and Privacy 2021).

If you end up building on this dataset as part of a project or publication, please include a reference to our paper:

@inproceedings{bodmas,
  title = {BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware},
  author = {Yang, Limin and Ciptadi, Arridhana and Laziuk, Ihar and Ahmadzadeh, Ali and Wang, Gang},
  booktitle = {4th Deep Learning and Security Workshop},
  year = {2021}
}

Download

The feature vectors and metadata are open to everyone. Download the data here: Google Drive
- feature vectors (~250 MB): bodmas.npz
- metadata (~12 MB): bodmas_metadata.csv
- Both files are sorted by timestamp in ascending order and are row-aligned (i.e., each feature vector corresponds to the row at the same position in the metadata file).

We cannot release the original files for the benign software due to copyright considerations, but we do host the original binaries of the malware samples.

To avoid misuse, please read and agree to the following conditions before emailing us.
- Please email ~~Limin (liminy2@illinois.edu)~~ Zhi Chen (zhic4@illinois.edu) and CC Gang (gangw@illinois.edu). Also, please include your Gmail address in the body so that we can add you to the Google Drive folder where the dataset is stored.
- Do not share the data with any others (except your co-authors for the project). We are happy to share with other researchers based upon their requests.
- Explain in a few sentences what you plan to do with these binaries. It does not need to be a precise plan.
- If you are in academia, contact us using your institution email and provide a link to a webpage under the university domain that lists your name and affiliation.
- If you are in an industrial research lab, send us an email from your company’s email account and introduce yourself and your company. In the email, please attach a justification letter (in PDF format) on official letterhead. The letter needs to state clearly the reasons why this dataset is being requested.

Please note that emails that do not follow these conditions might be ignored. We keep a public list of the organizations accessing these samples at the bottom of this page.

Get Started

To load the feature vectors, read bodmas.npz (NumPy's compressed archive format) with the following code. Note that the feature values are unnormalized, which is fine for classifiers like gradient-boosted decision trees, but you may need to normalize them first when applying an MLP classifier.
```
import numpy as np

filename = 'bodmas.npz'
data = np.load(filename)
X = data['X']  # all the feature vectors
y = data['y']  # labels, 0 as benign, 1 as malicious

print(X.shape, y.shape)
# prints: (134435, 2381) (134435,)
```
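The normalization mentioned above can be sketched as follows. This is a minimal example rather than part of the released code: the helper name `standardize` and the choice to take the statistics from the training rows only are assumptions made for illustration.

```python
import numpy as np


def standardize(X_train, X_test, eps=1e-8):
    """Z-score both splits using the mean and std of the training split only."""
    # Accumulate in float64 because some raw feature values can be very large.
    mean = X_train.mean(axis=0, dtype=np.float64)
    std = X_train.std(axis=0, dtype=np.float64)
    # Constant features have (near-)zero variance; divide by 1 so they map to 0.
    std[std < eps] = 1.0

    def scale(A):
        return ((A - mean) / std).astype(np.float32)

    return scale(X_train), scale(X_test)
```

For example, `X_train_s, X_test_s = standardize(X[:100000], X[100000:])` would scale an illustrative split of the loaded array; the cut at row 100000 is arbitrary.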
bodmas_metadata.csv has three columns: the sample's SHA-256 hash, the time the sample first appeared, and its malware family. If the malware family is empty, the sample is benign.
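Because each metadata row corresponds to one feature vector, the timestamp column can drive a time-based train/test split. The sketch below assumes the CSV has a header row and its three columns in the order described here. The column names `sha256`, `first_seen`, and `family` are working names assigned in the code, not necessarily the names used in the file, and the helper names are hypothetical.

```python
import numpy as np
import pandas as pd


def load_metadata(csv_source):
    """Read the metadata CSV and name its three columns by position."""
    meta = pd.read_csv(csv_source)
    # Assumption: columns are ordered as hash, first-seen time, family.
    # This assignment raises ValueError if the file does not have three columns.
    meta.columns = ["sha256", "first_seen", "family"]
    meta["first_seen"] = pd.to_datetime(meta["first_seen"], utc=True)
    return meta


def labels_from_family(meta):
    """Rebuild 0/1 labels: an empty family means benign (0), otherwise malicious (1)."""
    return meta["family"].notna().to_numpy().astype(np.int64)


def temporal_split(meta, cutoff):
    """Boolean masks for rows first seen before the cutoff and on or after it."""
    cutoff = pd.Timestamp(cutoff, tz="UTC")
    train_mask = (meta["first_seen"] < cutoff).to_numpy()
    return train_mask, ~train_mask
```

With the real files, `train_mask, test_mask = temporal_split(load_metadata('bodmas_metadata.csv'), '2020-01-01')` followed by `X[train_mask]` and `X[test_mask]` would select the matching feature vectors; the cutoff date is arbitrary. Comparing `labels_from_family(meta)` against `y` is a quick way to check that the two files are aligned.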

The malware families with at least 1,000 samples, and their sample counts, are as follows:

sfone: 4729
wacatac: 4694
upatre: 3901
wabot: 3673
small: 3339
ganelp: 2232
dinwod: 2057
mira: 1960
berbew: 1749
sillyp2p: 1616
ceeinject: 1169
gepys: 1124
benjamin: 1071
musecador: 1054
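
The counts above can be recomputed from the family column of the metadata. This is a minimal sketch, assuming the family values have already been read into a sequence in which benign samples are missing (`None` or NaN); `top_families` is a hypothetical helper name.

```python
import pandas as pd


def top_families(family_values, min_samples=1000):
    """Count samples per family, most frequent first, keeping families at or above the threshold."""
    counts = pd.Series(family_values).dropna().value_counts()
    return counts[counts >= min_samples]
```

For instance, `top_families(pd.read_csv('bodmas_metadata.csv').iloc[:, 2])` would pass in the third column by position, which avoids relying on a particular column name.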

Organizations That Requested Our Dataset

Simon Fraser University, Canada
Oracle Labs
Columbia University
Telkom University, Indonesia
University of Alberta, Canada
Orange Inc., France
Beijing Institute of Technology
College of Engineering Pune, India
University of Salerno, Italy
Shanghai Jiao Tong University
Southeast University
Beijing University of Posts and Telecommunications
Guizhou Normal University
Korea University
Guilin University of Electronic Technology
New York University
University of Chinese Academy of Sciences
University of the West of England (UWE) Bristol
University College Dublin, Ireland
Women Engineering College, Ajmer, India
Beijing University of Technology
Air University Islamabad, Pakistan
Eastern Connecticut State University
Yonsei University, South Korea
Arizona State University
Bandung Institute of Technology, Indonesia
University of Southampton, United Kingdom
Xidian University
University of Balamand, Lebanon
The University of Chicago
Xinjiang University
University of Turin, Italy
Punjab University College of Information Technology, Pakistan
Guangzhou University
Middle East Technical University, Turkey
Microsoft
Sana'a University, Yemen
HarfangLab, France
Purdue University Northwest
PSG College of Technology, India
University of Windsor, Canada
Georgia Tech
De Montfort University, United Kingdom
Ghent University, Belgium
Iowa State University
Macquarie University, Australia
Hongik University, South Korea
UiTM Shah Alam, Malaysia
Hanoi University of Science and Technology, Vietnam
Ain Shams University, Egypt
Open University of Catalonia, Spain
Amrita Vishwa Vidyapeetham, India
National University of Science and Technology, Zimbabwe
Nagoya University, Japan
Institute of Information Security, Japan
Heriot-Watt University, United Kingdom
Edinburgh Napier University, United Kingdom
Istanbul University-Cerrahpaşa, Turkey
Zhejiang University
Hanyang University, South Korea
Army Engineering University of PLA
Purdue University
University of Molise, Italy
SharpAI LLC
Silesian University of Technology, Poland
Florida State University
University of Bath, United Kingdom
National University of Computer and Emerging Sciences, Pakistan
Chungnam National University, South Korea
PeeploTech
Damietta University, Egypt
Queen's University Belfast, United Kingdom
Vilnius Tech, Lithuania
Indian Institute of Technology Roorkee, India
Beijing University of Civil Engineering and Architecture
University of Quebec in Outaouais, Canada
National Institute of Technology Raipur, India
University of Colorado Colorado Springs
University of Technology and Applied Sciences, Oman
University of Portsmouth, United Kingdom
Brno University of Technology, Czechia
Royal Holloway, University of London, United Kingdom
The University of Alabama in Huntsville
University of Portsmouth, United Kingdom
Wuhan University
Guizhou University
Amrita Vishwa Vidyapeetham, India
Birkbeck, University of London, United Kingdom
GoldenEye Inc
Huazhong University of Science and Technology
Sam Houston State University
Hoseo University, South Korea
East China University of Science and Technology
Xiamen University Malaysia
Pamantasan ng Lungsod ng Maynila, Philippines
Sichuan University
Nanjing University of Information Science and Technology
University of Information Technology, Ho Chi Minh City, Vietnam
Seoul National University of Science and Technology, South Korea
University of Science and Technology of China
University of Tsukuba, Japan
University of Toronto, Canada
Charles Darwin University, Australia
Zoho Corporation, India
University of Cape Town, South Africa
Sivas University of Science and Technology, Turkey
University of Bari Aldo Moro, Italy
UET Lahore University of Engineering and Technology
Bandung Institute of Technology, Indonesia
Sungshin Women's University, South Korea
Budapest University of Technology and Economics, Hungary
University of Bari (islab-uniba), Italy
Dongguk University, South Korea
People's Public Security University, China
Fujian Normal University, China
Qassim University, Saudi Arabia
Sichuan University, China
Zhejiang Normal University, China
University of Minnesota
Amrita Vishwa Vidyapeetham, India
Indian Institute of Technology Jammu, India
Babes-Bolyai University of Cluj-Napoca, Romania
Texas A&M University
Ho Chi Minh City University of Technology, Vietnam
AnxinSec, China
Czech Technical University in Prague, Czechia
Koç University, Turkey
Telkom University, Indonesia
ShanghaiTech University, China
University of Electronic Science and Technology of China, China
VNU-HCM University of Information Technology, Vietnam
Johns Hopkins University
Umm Al-Qura University, Kingdom of Saudi Arabia
Federal University of Parana, Brazil
University of Sannio in Benevento, Italy
German University in Cairo, Egypt
BRAC University, Bangladesh
University of Piraeus, Greece
ECIT, Queen's University Belfast, Northern Ireland
Nanjing University of Posts and Telecommunications, China
National University of Defense Technology, China
Numidia Institute of Technology, Algeria
George Washington University

Contributors

Limin Yang, Ph.D. from UIUC.

Arridhana Ciptadi, Blue Hexagon Inc.

Ihar Laziuk, Blue Hexagon Inc.

Ali Ahmadzadeh, Blue Hexagon Inc.

Gang Wang, Associate Professor at UIUC