A set of statistical metrics to better understand and qualify malware datasets.
Médéric Hurier e62a24e59f Update 'LICENSE.txt' 3 months ago
.gitignore Initial commit 2 years ago
LICENSE.txt Update 'LICENSE.txt' 3 months ago
README.md Update README.md 2 years ago
ouroboros.py Initial Commit 2 years ago
output.json Initial Commit 2 years ago
requirements.txt Initial Commit 2 years ago
sample.csv.gz Initial Commit 2 years ago
stase.py Initial Commit 2 years ago

README.md

What is STASE?

STASE provides a set of metrics to describe a dataset of malware labels.

Goals:

  • evaluate the properties of malware datasets
  • identify potential biases in experimental studies
  • analyze the decisions and classifications of antivirus products

Usage

Input: a dataset of labels formatted as a CSV or CSV.GZ file

  • columns: antivirus products
  • rows: malware files
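To make the input layout concrete, here is a minimal sketch of reading such a file with the standard library only. It assumes the first row is a header holding the antivirus names and that an empty cell means the product gave no label; neither assumption is confirmed by this README, and `read_labels` is a hypothetical helper, not part of `stase.py`.

```python
import csv
import gzip
import os
import tempfile


def read_labels(path):
    """Return (av_names, rows) from a CSV or CSV.GZ label file.

    Assumption: the first row is a header of antivirus product names,
    and every following row holds the labels given to one malware file.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        av_names = next(reader)          # columns: antivirus products
        rows = [row for row in reader]   # rows: malware files
    return av_names, rows


if __name__ == "__main__":
    # Build a tiny, made-up dataset to show the expected shape.
    path = os.path.join(tempfile.mkdtemp(), "tiny.csv.gz")
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        handle.write("AV1,AV2\nTrojan.A,\n,Worm.B\n")
    avs, rows = read_labels(path)
    print(len(avs), "antivirus products,", len(rows), "malware files")
```

The same function handles both plain and gzip-compressed files because it picks the opener from the file extension.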

Output: the metrics introduced in this research paper (soon to be released)

Example:

python3 stase.py sample.csv.gz output.json

{
    "equiponderance": 0.2422919148,
    "equiponderance_idx": 8.0,
    "exclusivity": 0.2626262626,
    "recognition": 0.1051423324,
    "synchronicity": 0.1677210336,
    "genericity": 0.5233236152,
    "uniformity": 0.2926562999,
    "uniformity_idx": 48.0,
    "divergence": 0.7568027211,
    "consensuality": 0.2227891156,
    "resemblance": 0.6406466991,
    "labels": 328.0,
    "apps": 99.0,
    "avs": 66.0
}
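Once the output file exists, it can be consumed like any JSON document. The sketch below separates the dataset sizes from the metric scores; treating `labels`, `apps`, and `avs` as counts is an assumption based on their names, and `summarize` is a hypothetical helper that does not ship with STASE.

```python
import json


def summarize(metrics):
    """Split a STASE-style result dict into dataset sizes and scores.

    Assumption: "labels", "apps" and "avs" are counts stored as floats.
    """
    size_keys = ("labels", "apps", "avs")
    sizes = {key: int(metrics[key]) for key in size_keys if key in metrics}
    scores = {key: value for key, value in metrics.items()
              if key not in size_keys}
    return sizes, scores


# A subset of the example output shown above.
example = json.loads(
    '{"exclusivity": 0.2626262626, "labels": 328.0, "apps": 99.0, "avs": 66.0}'
)
sizes, scores = summarize(example)
print(sizes)
print(scores)
```

For a real run, replace the inline string with `json.load(open("output.json"))`.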

Technical details:

  • implemented in Python 3 (dependencies in requirements.txt)
  • uses multiprocessing for performance
  • shipped with Ouroboros
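As an illustration of the multiprocessing point above, the following sketch spreads a per-antivirus computation across worker processes. It is not taken from `stase.py`: `detection_rate` is a deliberately simple, hypothetical metric (the share of non-empty labels in a column) and is not one of the STASE metrics.

```python
from multiprocessing import Pool


def detection_rate(column):
    """Fraction of non-empty labels in one antivirus column.

    Illustrative stand-in only; this is not a STASE metric.
    """
    if not column:
        return 0.0
    return sum(1 for label in column if label) / len(column)


def parallel_rates(columns, processes=2):
    """Compute detection_rate for every column using a process pool."""
    with Pool(processes=processes) as pool:
        return pool.map(detection_rate, columns)


if __name__ == "__main__":
    # Two made-up antivirus columns over three malware files.
    columns = [["Trojan.A", "", "Worm.B"], ["", "", "Adware.C"]]
    print(parallel_rates(columns))
```

Each column is independent of the others, so the work parallelizes without shared state; the `__main__` guard keeps the pool from being re-created when worker processes import the module.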

TODO

  • handle more input formats and options

Pull requests accepted!