vulnerability-assessment

CAT: Context-aware, Graph-based, Commit-level Vulnerability Assessment

This repository contains the code for CAT: Context-aware, Graph-based, Commit-level Vulnerability Assessment.

Source Code: https://github.com/Vulnerability-Assessment/vulnerability-assessment#Instruction_to_Run_CAT

Demo: https://github.com/Vulnerability-Assessment/vulnerability-assessment#Demo

Introduction
Dataset
Requirement
Instruction_to_Run_CAT
Demo

Introduction

Software Vulnerabilities (SVs) are security weaknesses and flaws that are exploitable in cyber-attacks. Delays in assessing SVs can have serious consequences because the impact of the SVs on the attacked systems remains unknown. The state-of-the-art approach works directly on committed code changes deemed to contain vulnerable code, and produces assessment grades for the vulnerability without waiting for SV reports to be filed. However, the existing approach still has low accuracy because it represents code changes and their surrounding contexts in a limited way. In this work, we propose CAT, a Context-aware, Graph-based, Commit-level Vulnerability Assessment Model that evaluates vulnerability-introducing code changes and provides the CVSS assessment grades for the vulnerability. One of the key solutions is a novel context-aware, graph-based representation learning (RL) model that learns contextualized embeddings for the code changes, integrating program dependencies and the surrounding contexts of the changes. We use the contextualized embeddings to learn to predict the CVSS assessment grades for the vulnerability. When assessing one aspect, we also consider the impact of the other aspects by leveraging a multi-task learning model (each task learns to assess one aspect) that propagates learning from one task to another. Our empirical evaluation shows that on a large dataset of vulnerabilities in C code, CAT achieves relative improvements of 25.5% in F-score and 26.9% in MCC over the state-of-the-art model in vulnerability assessment. On the dataset of vulnerabilities in Java code, CAT achieves relative improvements of 14.3% in F-score and 8.0% in MCC over the state-of-the-art model.

Dataset

Preprocessed Dataset

We published our processed C dataset at https://drive.google.com/file/d/1kms3Xr5xkSA6gu9AErbFigsgaTgm25id/view?usp=sharing

Please create a folder named data under the root folder of CAT, download the dataset, unzip it, and put all the files in the ./data folder.
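The unzip step above can be scripted. The following is a minimal sketch, assuming the download is a standard zip archive; the helper name `extract_dataset` and the archive file name in the commented call are hypothetical, so substitute the name of the file you actually downloaded.

```python
import zipfile
from pathlib import Path


def extract_dataset(archive_path, data_dir="data"):
    """Extract a downloaded zip archive into data_dir, creating it if needed.

    Returns the sorted names of the top-level entries now in data_dir.
    """
    target = Path(data_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    return sorted(entry.name for entry in target.iterdir())


# Example call, run from the root folder of CAT.
# "cat_c_dataset.zip" is a placeholder, not the real archive name.
# print(extract_dataset("cat_c_dataset.zip"))
```

If the archive wraps its files in a top-level folder, the returned list will show that folder instead of the data files, and the files will need to be moved up into ./data by hand.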

The Java dataset we used comes from the DeepCVA work (DeepCVA: Automated Commit-level Vulnerability Assessment with Deep Multi-task Learning). Please refer to that work for more information.

Use your own dataset

If you want to use your own dataset, please prepare the data as follows:

The data are stored in files named data_n.pt.
Each data_n.pt contains a set of Data objects from torch_geometric. Each Data object represents a method in a vulnerability-introducing commit:

1> Data.x = Node_feature_vector

2> Data.y = [label_1, ..., label_7]

3> Data.edge_index = edge_list

Where Node_feature_vector is an N*R torch tensor that represents the node features of the graph, N is the number of nodes in the graph, and R is the length of the representation vector.

edge_list is the matrix that represents the graph edges for each method. Please refer to the torch_geometric package for more details.

label_1, ..., label_7 are the true labels for the seven vulnerability assessment types.

Requirement

Install PyTorch by following the installation instructions from PyTorch.

Install torch_sparse by following the installation instructions from pytorch_sparse.

Install torch_geometric by following the installation instructions from torch_geometric.

See requirement.txt for the other required packages.
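Before running anything, it can help to confirm that the three packages named above are importable in the current environment. The sketch below uses only the standard library; the helper name `missing_modules` is hypothetical and not part of CAT, and the check does not verify versions or the packages listed in requirement.txt.

```python
import importlib.util

# Top-level module names of the packages listed in this section.
REQUIRED_MODULES = ("torch", "torch_sparse", "torch_geometric")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```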

Instruction_to_Run_CAT

Download the CAT source code and run main.py to see the results of our experiments.

Demo

Because the dataset used in our approach contains very large graphs, training and testing the model may take a long time. To try our model quickly, please download our demo, which contains only a limited amount of data.

Demo download: https://drive.google.com/file/d/1ruxgWuAT4DPcSyRue0beooXNi0Y82YAv/view?usp=sharing

Put model.pt and data in the root folder of CAT, then run run_demo.py to see the results.