BODMAS Malware Dataset


BODMAS is short for Blue Hexagon Open Dataset for Malware AnalysiS. We collaborate with Blue Hexagon to release a dataset containing timestamped malware samples and well-curated family information for research purposes. The BODMAS dataset contains 57,293 malware samples and 77,142 benign samples collected from August 2019 to September 2020, with carefully curated family information (581 families).

We extract the feature vectors using the LIEF project (version 0.9.0), the same as the Ember dataset (details can be found here). Each sample is represented as a 2,381-dimensional feature vector, along with its label (benign or malicious) and, if it is malicious, its malware family. We also release the original binaries, for malware samples only.

Further details can be found in our paper “BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware” [PDF], Deep Learning and Security Workshop 2021 (co-located with IEEE Security and Privacy 2021).

If you end up building on this dataset as part of a project or publication, please include a reference to our paper:

@inproceedings{bodmas,
    title = {BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware},
    author = {Yang, Limin and Ciptadi, Arridhana and Laziuk, Ihar and Ahmadzadeh, Ali and Wang, Gang},
    booktitle = {4th Deep Learning and Security Workshop},
    year = {2021}
}

Download

  1. The feature vectors and metadata are open to everyone. Download the data here: Google Drive
    • feature vectors (~250 MB): bodmas.npz
    • metadata (~12 MB): bodmas_metadata.csv
    • Both files are sorted by timestamp in ascending order, so each feature vector corresponds to the row at the same position in the metadata file.
  2. We cannot release the original files of the benign software due to copyright considerations, but we host the original binaries of the malware samples. To avoid misuse, please read and agree to the following conditions before emailing us.
    • Do not share the data with anyone else (except your co-authors on the project). We are happy to share it with other researchers upon request.
    • Explain in a few sentences what you plan to do with these binaries. It does not need to be a precise plan.
    • If you are in academia, contact us from your institutional email address and provide a link to a webpage under the university domain that shows your name and affiliation.
    • If you are in an industrial research lab, send us an email from your company email account and introduce yourself and your company. Please attach a justification letter (in PDF format) on official letterhead. The letter needs to state clearly why this dataset is being requested.

    Please note that emails that do not follow these conditions may be ignored. We keep a public list of the organizations accessing these samples here.

Get Started

  1. To load the feature vectors, load bodmas.npz (a compressed NumPy archive) with the following code. Note that the feature values are unnormalized. This is fine for classifiers such as gradient-boosted decision trees, but you may need to normalize them before training an MLP classifier.

     import numpy as np
    
     filename = 'bodmas.npz'
     data = np.load(filename)
     X = data['X']  # all the feature vectors
     y = data['y']  # labels, 0 as benign, 1 as malicious
    
     print(X.shape, y.shape)
     # >>> (134435, 2381) (134435,)
    
  2. bodmas_metadata.csv has three columns: the SHA-256 hash, the time the sample first appeared, and the malware family. If the malware family is empty, the sample is benign.
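
The two steps above can be combined into a small preprocessing sketch. This is one possible approach, not the procedure used in the paper. It reads the family column from the metadata, checks that an empty family agrees with a label of 0, splits the rows chronologically by position (possible only because the rows are sorted by timestamp), and standardizes the features using statistics from the training portion only. The helper names, the header row, the 80/20 split, and the tiny synthetic data (including its column names and timestamp format) are assumptions made for illustration; the metadata is read by column position, following the column order described above.

```python
import csv
import io

import numpy as np


def read_families(fileobj):
    """Read the third column (malware family) of the metadata CSV."""
    reader = csv.reader(fileobj)
    next(reader)  # assumption: the first row is a header
    return np.array([row[2] for row in reader])


def chronological_split(X, y, train_fraction=0.8):
    """Split rows by position; meaningful only because rows are sorted by timestamp."""
    cut = int(len(X) * train_fraction)
    return X[:cut], y[:cut], X[cut:], y[cut:]


def standardize(X_train, X_test):
    """Scale to zero mean and unit variance using training statistics only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # constant features: avoid division by zero
    return (X_train - mean) / std, (X_test - mean) / std


# Tiny synthetic stand-in (hypothetical values) so the sketch runs without the real files.
demo_csv = io.StringIO(
    "sha,timestamp,family\n"
    "aa,2019-08-01,\n"
    "bb,2019-09-01,famA\n"
    "cc,2020-01-01,\n"
    "dd,2020-05-01,famB\n"
    "ee,2020-09-01,famA\n"
)
families = read_families(demo_csv)
X = np.array([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0], [4.0, 10.0], [5.0, 10.0]])
y = np.array([0, 1, 0, 1, 1])

# An empty family field marks a benign sample, so it should agree with y == 0.
labels_agree = bool(np.array_equal(families == "", y == 0))

# Older rows go to training, newer rows to testing.
X_tr, y_tr, X_te, y_te = chronological_split(X, y, train_fraction=0.8)
X_tr_s, X_te_s = standardize(X_tr, X_te)
```

With the real files, `X` and `y` would come from `np.load('bodmas.npz')` as shown earlier, and `families` from `read_families(open('bodmas_metadata.csv', newline=''))`. Fitting the scaling statistics on the older rows only keeps information from later samples out of training, which matters when evaluating over time.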

Contributors

Limin Yang, Ph.D. student at UIUC (Contact me via liminy2@illinois.edu).

Arridhana Ciptadi, Blue Hexagon Inc.

Ihar Laziuk, Blue Hexagon Inc.

Ali Ahmadzadeh, Blue Hexagon Inc.

Gang Wang, Assistant Professor at UIUC