BODMAS Malware Dataset

Update (12/15/2021) - Malware category information is available on Google Drive

Update (08/29/2021) - Source code is available on GitHub

BODMAS is short for Blue Hexagon Open Dataset for Malware AnalysiS. We collaborated with Blue Hexagon to release a dataset of timestamped malware samples with well-curated family information for research purposes. The BODMAS dataset contains 57,293 malware samples and 77,142 benign samples collected from August 2019 to September 2020, and the malware samples span 581 families.

We extract the feature vectors using the LIEF project (version 0.9.0), the same as the Ember dataset (details can be found here). Each sample is represented as a 2381-dimensional feature vector, along with its label (benign or malicious) and, if it is malicious, its malware family. We also release the original binaries, for malware samples only.

Further details can be found in our paper “BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware” [PDF], Deep Learning and Security Workshop 2021 (co-located with IEEE Security and Privacy 2021).

If you end up building on this dataset as part of a project or publication, please include a reference to our paper:

@inproceedings{bodmas,
  title = {BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware},
  author = {Yang, Limin and Ciptadi, Arridhana and Laziuk, Ihar and Ahmadzadeh, Ali and Wang, Gang},
  booktitle = {4th Deep Learning and Security Workshop},
  year = {2021}
}

Download

  1. The feature vectors and metadata are open to everyone. Download the data here: Google Drive
    • feature vectors (~250 MB): bodmas.npz
    • metadata (~12 MB): bodmas_metadata.csv
    • Both files are sorted by timestamp in ascending order, so the i-th feature vector corresponds to the i-th row of the metadata file.
  2. We cannot release the original files for the benign software due to copyright considerations, but we will host the original binaries of the malware samples.

    To avoid misuse, please read and agree to the following conditions before emailing us.

    • Please email Limin (liminy2@illinois.edu) and CC Gang (gangw@illinois.edu). Also, please include your Gmail address in the body so that we can add you to the Google Drive folder where the dataset is stored.
    • Do not share the data with anyone else (except your co-authors on the project). We are happy to share with other researchers upon request.
    • Explain in a few sentences what you plan to do with these binaries. It does not need to be a precise plan.
    • If you are in academia, contact us using your institutional email and provide a link to a webpage hosted on the university domain that shows your name and affiliation.
    • If you are in an industrial research lab, send us an email from your company email account and introduce yourself and your company. Please attach a justification letter (in PDF format) on official letterhead. The letter needs to state clearly why this dataset is being requested.

    Please note that emails that do not follow these conditions may be ignored. We will keep a public list of organizations accessing these samples at the bottom of this page.

Get Started
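
The following is a minimal sketch for loading the two public files, not the released source code. It assumes NumPy is installed and that the archive stores the feature matrix and labels under the keys X and y (an assumption; the error message lists the available keys if they differ). The helper names load_bodmas and chronological_split are hypothetical. Because the files are sorted by timestamp, an earlier/later train/test split can be taken by position alone.

```python
import csv

import numpy as np


def load_bodmas(npz_path, csv_path, x_key="X", y_key="y"):
    """Load features, labels, and metadata rows, and check that they align.

    x_key and y_key are ASSUMED array names inside the .npz archive.
    """
    with np.load(npz_path) as archive:
        missing = [k for k in (x_key, y_key) if k not in archive.files]
        if missing:
            raise KeyError(f"{missing} not in archive; available: {archive.files}")
        X, y = archive[x_key], archive[y_key]

    # Each CSV row becomes a dict keyed by the header names found in the file.
    with open(csv_path, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))

    # Row i of the metadata should describe feature vector i.
    if not (len(X) == len(y) == len(rows)):
        raise ValueError(
            f"size mismatch: {len(X)} vectors, {len(y)} labels, "
            f"{len(rows)} metadata rows"
        )
    return X, y, rows


def chronological_split(n_samples, train_fraction=0.8):
    """Split indices 0..n_samples-1 into an earlier and a later part.

    This relies on the samples already being sorted by timestamp.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = int(n_samples * train_fraction)
    indices = np.arange(n_samples)
    return indices[:cut], indices[cut:]


# Example usage (paths assume the files sit in the working directory):
# X, y, rows = load_bodmas("bodmas.npz", "bodmas_metadata.csv")
# train_idx, test_idx = chronological_split(len(rows))
# X_train, y_train = X[train_idx], y[train_idx]
```

Splitting by position rather than at random keeps every training sample older than every test sample, which matches the temporal-analysis setting the dataset is intended for. The 0.8 default is an arbitrary illustrative choice.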

Organizations That Requested Our Dataset

  1. Simon Fraser University, Canada
  2. Oracle Labs
  3. Columbia University
  4. Telkom University, Indonesia
  5. University of Alberta, Canada
  6. Orange Inc., France
  7. Beijing Institute of Technology
  8. College Of Engineering Pune, India
  9. University of Salerno, Italy
  10. Shanghai Jiao Tong University
  11. Southeast University
  12. Beijing University of Posts and Telecommunications
  13. Guizhou Normal University
  14. Korea University
  15. Guilin University of Electronic Technology
  16. New York University
  17. University of Chinese Academy of Sciences
  18. University of the West of England (UWE) Bristol
  19. University College Dublin, Ireland
  20. Women Engineering College, Ajmer, India
  21. Beijing University of Technology
  22. Air University Islamabad, Pakistan
  23. Eastern Connecticut State University
  24. Yonsei University, South Korea
  25. Arizona State University
  26. Bandung Institute of Technology, Indonesia
  27. University of Southampton, United Kingdom
  28. Xidian University
  29. University of Balamand, Lebanon
  30. The University of Chicago
  31. Xinjiang University
  32. University of Turin, Italy
  33. Punjab University College of Information Technology, Pakistan
  34. Guangzhou University
  35. Middle East Technical University, Turkey
  36. Microsoft
  37. Sana'a University, Yemen
  38. HarfangLab, France
  39. Purdue University Northwest
  40. PSG College of Technology, India
  41. University of Windsor, Canada
  42. Georgia Tech
  43. De Montfort University, United Kingdom
  44. Ghent University, Belgium
  45. Iowa State University
  46. Macquarie University, Australia
  47. Hongik University, South Korea
  48. UiTM Shah Alam, Malaysia
  49. Hanoi University of Science and Technology, Vietnam
  50. Ain Shams University, Egypt
  51. Open University of Catalonia, Spain
  52. Amrita Vishwa Vidyapeetham, India
  53. National University of Science and Technology, Zimbabwe
  54. Nagoya University, Japan
  55. Institute of Information Security, Japan
  56. Heriot-Watt University, United Kingdom
  57. Edinburgh Napier University, United Kingdom
  58. Istanbul University-Cerrahpaşa, Turkey
  59. Zhejiang University
  60. Hanyang University, South Korea

Contributors

Limin Yang, Ph.D. student at UIUC (Contact me via liminy2@illinois.edu).

Arridhana Ciptadi, Blue Hexagon Inc.

Ihar Laziuk, Blue Hexagon Inc.

Ali Ahmadzadeh, Blue Hexagon Inc.

Gang Wang, Assistant Professor at UIUC