BODMAS Malware Dataset

Update (08/29/2021) - Source code is available at: GitHub

BODMAS is short for Blue Hexagon Open Dataset for Malware AnalysiS. We collaborate with Blue Hexagon to release a dataset containing timestamped malware samples and well-curated family information for research purposes. The BODMAS dataset contains 57,293 malware samples and 77,142 benign samples collected from August 2019 to September 2020, with carefully curated family information (581 families).

We extract the feature vectors using the LIEF project (version 0.9.0), the same as the Ember dataset (details can be found here). Each sample is represented as a 2,381-dimensional feature vector, along with its label (benign or malicious) and, for malicious samples, its malware family. We also release the original binaries, for malware samples only.
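As a quick sanity check on the representation described above, the following sketch lists the arrays stored in a NumPy `.npz` archive and verifies that the feature matrix has 2,381 columns. The array names `X` and `y` are assumptions made for illustration and are not confirmed on this page, so it is worth inspecting the archive's actual keys with `describe_npz` first.

```python
import numpy as np

EXPECTED_DIM = 2381  # feature vector length stated in the text above


def describe_npz(path):
    """Return {array_name: (shape, dtype)} for every array in an .npz archive."""
    with np.load(path) as archive:
        return {name: (archive[name].shape, str(archive[name].dtype))
                for name in archive.files}


def load_features(path, x_key="X", y_key="y"):
    """Load features and labels. The default key names are assumptions."""
    with np.load(path) as archive:
        X, y = archive[x_key], archive[y_key]
    # Each sample should be one row of EXPECTED_DIM features.
    if X.ndim != 2 or X.shape[1] != EXPECTED_DIM:
        raise ValueError(f"expected (n, {EXPECTED_DIM}) features, got {X.shape}")
    # One label per sample.
    if X.shape[0] != y.shape[0]:
        raise ValueError("features and labels differ in length")
    return X, y
```

If the keys reported by `describe_npz` differ from the assumed ones, pass the real names through `x_key` and `y_key`.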

Further details can be found in our paper “BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware” [PDF], Deep Learning and Security Workshop 2021 (co-located with IEEE Security and Privacy 2021).

If you end up building on this dataset as part of a project or publication, please include a reference to our paper:

@inproceedings{bodmas,
  title = {BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware},
  author = {Yang, Limin and Ciptadi, Arridhana and Laziuk, Ihar and Ahmadzadeh, Ali and Wang, Gang},
  booktitle = {4th Deep Learning and Security Workshop},
  year = {2021}
}

Download

  1. The feature vectors and metadata are open to everyone. Download the data here: Google Drive
    • feature vectors (~250 MB): bodmas.npz
    • metadata (~12 MB): bodmas_metadata.csv
    • Both files are sorted by timestamp in ascending order and are aligned: each feature vector corresponds to one row in the metadata file.
  2. We cannot release the original files of the benign software due to copyright considerations, but we host the original binaries of the malware samples.

    To avoid misuse, please read and agree to the following conditions before emailing us.

    • Please email Limin (liminy2@illinois.edu) and CC Gang (gangw@illinois.edu). Also, please include your Gmail address in the body so that we can add you to the Google Drive folder where the dataset is stored.
    • Do not share the data with anyone else (except your co-authors on the project). We are happy to share it with other researchers upon request.
    • Explain in a few sentences what you plan to do with these binaries. It does not need to be a precise plan.
    • If you are in academia, contact us from your institutional email address and provide a webpage hosted on the university domain that shows your name and affiliation.
    • If you are in an industrial research lab, email us from your company account and introduce yourself and your company. Please attach a justification letter (in PDF format) on official letterhead that clearly states why the dataset is being requested.

    Please note that emails that do not follow these conditions may be ignored. We keep a public list of the organizations that have accessed these samples at the bottom of this page.

Get Started
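Because the feature vectors and the metadata rows share the same timestamp order, a time-based train/test split is a natural first experiment. The sketch below is a minimal illustration, not the authors' pipeline: it assumes the features `X`, labels `y`, and timestamps have already been loaded into arrays of equal length, and the function name `temporal_split` and the cutoff date are hypothetical.

```python
import numpy as np


def temporal_split(X, y, timestamps, cutoff):
    """Split into (train, test): samples strictly before `cutoff` go to train."""
    timestamps = np.asarray(timestamps, dtype="datetime64[s]")
    cutoff = np.datetime64(cutoff, "s")
    if not (len(X) == len(y) == len(timestamps)):
        raise ValueError("X, y and timestamps must have the same length")
    # The files are described as sorted in ascending time order; verify it.
    if np.any(np.diff(timestamps) < np.timedelta64(0, "s")):
        raise ValueError("timestamps are not in ascending order")
    train = timestamps < cutoff
    return (X[train], y[train]), (X[~train], y[~train])


# Toy data standing in for the real arrays (4 samples, 2 features each).
X = np.arange(8).reshape(4, 2)
y = np.array([0, 1, 0, 1])
ts = ["2019-08-10", "2019-12-31", "2020-01-01", "2020-09-01"]

(train_X, train_y), (test_X, test_y) = temporal_split(X, y, ts, "2020-01-01")
print(train_X.shape, test_X.shape)
```

Reading the timestamps out of bodmas_metadata.csv is left out on purpose, since the column names and timestamp format are not stated on this page; check the file header before parsing.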

Organizations That Requested Our Dataset

  1. Simon Fraser University, Canada
  2. Oracle Labs
  3. Columbia University
  4. Telkom University, Indonesia
  5. University of Alberta, Canada
  6. Orange Inc., France
  7. Beijing Institute of Technology
  8. College Of Engineering Pune, India
  9. University of Salerno, Italy
  10. Shanghai Jiao Tong University
  11. Southeast University
  12. Beijing University of Posts and Telecommunications
  13. Guizhou Normal University
  14. Korea University
  15. Guilin University of Electronic Technology
  16. New York University
  17. University of Chinese Academy of Sciences
  18. University of the West of England (UWE) Bristol
  19. University College Dublin, Ireland
  20. Women Engineering College, Ajmer, India
  21. Beijing University of Technology
  22. Air University Islamabad, Pakistan
  23. Eastern Connecticut State University

Contributors

Limin Yang, Ph.D. student at UIUC (Contact me via liminy2@illinois.edu).

Arridhana Ciptadi, Blue Hexagon Inc.

Ihar Laziuk, Blue Hexagon Inc.

Ali Ahmadzadeh, Blue Hexagon Inc.

Gang Wang, Assistant Professor at UIUC