This repository contains the code for CAT: Context-aware, Graph-based, Commit-level Vulnerability Assessment.
Software Vulnerabilities (SVs) are security weaknesses and flaws that are exploitable in cyber-attacks. Delays in the assessment of SVs might cause serious consequences due to the unknown impacts of the SVs on the attacked systems. The state-of-the-art approach works directly on the committed code changes that were deemed to contain vulnerable code, and produces the assessment grades for the vulnerability without waiting until the SV reports are filed. However, the existing approach still has low accuracy due to its limited code change representations and surrounding contexts. In this work, we propose CAT, a Context-aware, Graph-based, Commit-level Vulnerability Assessment Model that evaluates a vulnerability-introducing code change and provides the CVSS assessment grades for the vulnerability. One of the key solutions is a novel context-aware, graph-based representation learning (RL) model that learns contextualized embeddings for the code changes, integrating program dependencies and the surrounding contexts of the code changes. We use the contextualized embeddings to learn to provide the CVSS assessment grades for the vulnerability. During the assessment of one aspect, we also consider the impacts of the other aspects by leveraging a multi-task learning model (each task learning to assess one aspect) to propagate the learning from one task to another. Our empirical evaluation shows that on a large dataset of vulnerabilities in C code, CAT achieves an F-score and an MCC that are relatively 25.5% and 26.9% higher, respectively, than those of the state-of-the-art model in vulnerability assessment. On the dataset of vulnerabilities in Java code, CAT achieves an F-score and an MCC that are relatively 14.3% and 8.0% higher, respectively, than those of the state-of-the-art model.
We published our processed C dataset at https://drive.google.com/file/d/1kms3Xr5xkSA6gu9AErbFigsgaTgm25id/view?usp=sharing
Please create a folder named data under the root folder of CAT, download the dataset, unzip it, and put all files in the ./data folder.
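The setup step above can also be scripted. The following is a minimal sketch, assuming the dataset was downloaded as a zip archive; the file name CAT_dataset.zip in the comment is hypothetical, and the helper extract_dataset is not part of the CAT code. If the archive wraps its files in a top-level folder, they would still need to be moved up into ./data by hand.

```python
import zipfile
from pathlib import Path


def extract_dataset(archive_path, repo_root="."):
    """Create <repo_root>/data and unzip the downloaded archive into it.

    Returns the sorted names of the entries found directly under data/.
    """
    data_dir = Path(repo_root) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(data_dir)
    return sorted(entry.name for entry in data_dir.iterdir())


# Example call (hypothetical archive name, run from the CAT root folder):
# print(extract_dataset("CAT_dataset.zip"))
```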
The Java dataset we used comes from the paper "DeepCVA: Automated Commit-level Vulnerability Assessment with Deep Multi-task Learning". Please refer to that work for more information.
If you want to use your own dataset, please prepare the data as follows:
The data are stored in data_n.pt files. Each data_n.pt includes a set of Data objects from torch_geometric. Each Data object represents a method in a vulnerability-introducing commit:
1> Data.x = Node_feature_vector
2> Data.y = [label_1, ..., label_7]
3> Data.edge_index = edge_list
Here, Node_feature_vector is an N*R torch tensor that represents the node features in the graph, where N is the number of nodes in the graph and R is the representation vector length.
edge_list is the matrix that represents the graph edges for each method. Please refer to the torch_geometric package for more details.
label_1, ..., label_7 are the true labels for the seven vulnerability assessment types.
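As an illustration of the format described above, here is a minimal sketch of how one method graph could be built and written to disk. Several points are assumptions rather than facts from the CAT code: the helper names (edges_to_coo, build_method_graph, save_chunk) are hypothetical, the labels are assumed to be integer class indices stored as a 1-D tensor of length 7, and each data_n.pt is assumed to be a Python list of Data objects written with torch.save. The edge layout follows the torch_geometric convention of a 2 x E index matrix (source nodes in the first row, target nodes in the second).

```python
def edges_to_coo(edges, num_nodes):
    """Convert (source, target) pairs into the 2 x E layout of edge_index."""
    for src, dst in edges:
        if not (0 <= src < num_nodes and 0 <= dst < num_nodes):
            raise ValueError(f"edge ({src}, {dst}) refers to a missing node")
    return [[src for src, _ in edges], [dst for _, dst in edges]]


def build_method_graph(node_features, edges, labels):
    """Build one torch_geometric Data object for a single method.

    node_features: N lists of R floats; labels: seven values (assumed ints).
    """
    # Imported here so edges_to_coo stays usable without torch installed.
    import torch
    from torch_geometric.data import Data

    if len(labels) != 7:
        raise ValueError("expected one label per assessment type (7 in total)")
    x = torch.tensor(node_features, dtype=torch.float)      # shape [N, R]
    edge_index = torch.tensor(
        edges_to_coo(edges, len(node_features)), dtype=torch.long
    )                                                        # shape [2, E]
    y = torch.tensor(labels, dtype=torch.long)               # shape [7], assumed
    return Data(x=x, edge_index=edge_index, y=y)


def save_chunk(graphs, path):
    """Write a list of Data objects to one data_n.pt file (assumed layout)."""
    import torch

    torch.save(list(graphs), path)
```

For example, a three-node method with edges 0 -> 1 and 1 -> 2 would be passed as edges=[(0, 1), (1, 2)], and the resulting objects could be collected in a list and saved with save_chunk(graphs, "data/data_0.pt"). Whether main.py expects exactly this label shape and file layout should be checked against the CAT data loader.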
Install Torch by following the instructions from PyTorch.
Install torch_sparse by following the instructions from pytorch_sparse.
Install torch_geometric by following the instructions from torch_geometric.
See requirement.txt for other required packages.
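After installation, a quick check that the three packages are importable can save a failed run later. This is a small sketch using only the Python standard library; the helper name find_missing_modules is hypothetical, and the check only confirms that the modules can be located, not that their versions are compatible with each other or with your CUDA setup.

```python
import importlib.util

# Import names of the packages listed above.
REQUIRED_MODULES = ("torch", "torch_sparse", "torch_geometric")


def find_missing_modules(module_names=REQUIRED_MODULES):
    """Return the module names that cannot be located in this environment."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


missing = find_missing_modules()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```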
Download the CAT source code and run main.py to see the results of our experiment.
Because the dataset used in our approach contains large graphs, the model may take a long time to train and test. To quickly try our model, please download our demo, which contains only a limited amount of data.
Demo download: https://drive.google.com/file/d/1ruxgWuAT4DPcSyRue0beooXNi0Y82YAv/view?usp=sharing
Put model.pt and data in the root folder of CAT and then run run_demo.py to see the results.