---
library_name: transformers
tags:
- Code
- Vulnerability
- Detection
- C/C++
datasets:
- DetectVul/devign
language:
- en
base_model:
- microsoft/graphcodebert-base
license: mit
metrics:
- accuracy
- precision
- f1
- recall
---

## GraphCodeBERT for Code Vulnerability Detection

## Model Summary
This model is a fine-tuned version of **microsoft/graphcodebert-base**, optimized for detecting vulnerabilities in code. It was trained on the **DetectVul/devign** dataset.
The model takes a code snippet as input and classifies it as either **safe (0)** or **vulnerable (1)**.

## Model Details

- **Developed by:** Mukit Mahdin
- **Fine-tuned from:** `microsoft/graphcodebert-base`
- **Language(s):** English (code comments & metadata), C/C++
- **License:** MIT
- **Task:** Code vulnerability detection
- **Dataset Used:** `DetectVul/devign`
- **Architecture:** Transformer-based sequence classification

## Uses

### Direct Use
This model can be used for **static code analysis**, security audits, and automated vulnerability detection in software repositories. It is useful for:
- **Developers**: analyzing their code for potential security flaws.
- **Security teams**: scanning repositories for vulnerable code patterns.
- **Researchers**: studying AI-based vulnerability detection.

### Downstream Use
This model can be integrated into **IDE plugins**, **CI/CD pipelines**, or **security scanners** to provide real-time vulnerability detection.

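As a concrete illustration, a CI job might walk a repository checkout, run each C/C++ file through the classifier, and fail the build when anything is flagged. The sketch below is hypothetical (the helper names `collect_source_files` and `scan_files` are not part of this model's API); the classifier is passed in as a callback so the pipeline wiring stays independent of the model-inference code shown later in this card:

```python
import os

# Hypothetical helper: gather C/C++ sources from a repository checkout.
def collect_source_files(root, exts=(".c", ".cc", ".cpp", ".h", ".hpp")):
    paths = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(exts):
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)

# Hypothetical CI gate: classify each file and collect those flagged as
# vulnerable (label 1). `classify` is any callable mapping source text
# to a 0/1 label, e.g. a wrapper around this model.
def scan_files(paths, classify):
    flagged = []
    for path in paths:
        with open(path, encoding="utf-8", errors="ignore") as f:
            if classify(f.read()) == 1:
                flagged.append(path)
    return flagged

# Pseudo-wiring inside a CI job:
#   flagged = scan_files(collect_source_files("."), model_classify)
#   if flagged: sys.exit(1)   # fail the build
```
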
### Out-of-Scope Use
- The model is **not meant to replace human security experts**.
- It may not generalize well to **languages other than C/C++**.
- False positives/negatives may occur due to dataset limitations.

## Bias, Risks, and Limitations
- **False Positives & False Negatives:** The model may flag safe code as vulnerable or miss actual vulnerabilities.
- **Limited to C/C++:** The model was trained on a dataset primarily composed of **C and C++ code**. It may not perform well on other languages.
- **Dataset Bias:** The training data may not cover all possible vulnerabilities.

### Recommendations
Users should **not rely solely on the model** for security assessments. Instead, it should be used alongside **manual code review and static analysis tools**.

## How to Get Started with the Model
Use the code below to load the model and run inference on a sample code snippet:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the base tokenizer and the fine-tuned classification model
tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = AutoModelForSequenceClassification.from_pretrained("mahdin70/graphcodebert-devign-code-vulnerability-detector")
model.eval()

# Sample code snippet
code_snippet = '''
void process(char *input) {
    char buffer[50];
    strcpy(buffer, input); // Potential buffer overflow
}
'''

# Tokenize the input
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_label = torch.argmax(predictions, dim=1).item()

# Output the result
print("Vulnerable Code" if predicted_label == 1 else "Safe Code")
```

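For clarity, the softmax-and-argmax step at the end of the snippet above can be written out in plain Python (an illustrative re-derivation; the actual inference code uses `torch`):

```python
import math

# Convert a pair of raw logits [safe, vulnerable] into probabilities and
# a hard label, mirroring the softmax + argmax used above.
def predict_label(logits):
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs

# Example: logits leaning toward class 1 ("vulnerable")
label, probs = predict_label([-0.3, 1.2])
print("Vulnerable Code" if label == 1 else "Safe Code")  # prints "Vulnerable Code"
```
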
## Training Details

### Training Data
- **Dataset:** `DetectVul/devign`
- **Classes:** `0 (Safe)`, `1 (Vulnerable)`
- **Size:** 21,800 code snippets

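To inspect the data yourself, the split can be loaded with the `datasets` library. The `class_balance` helper below is illustrative, and the label column name (`target`) is an assumption about the DetectVul/devign schema rather than a documented fact:

```python
from collections import Counter

# Illustrative helper: summarize the 0/1 label distribution of a split.
def class_balance(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(counts[label] / total, 4) for label in sorted(counts)}

# Loading the dataset itself (requires network access; column name assumed):
#   from datasets import load_dataset
#   ds = load_dataset("DetectVul/devign")
#   print(class_balance(ds["train"]["target"]))
```
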
### Training Procedure
- **Optimizer:** AdamW
- **Loss Function:** CrossEntropyLoss
- **Batch Size:** 16
- **Learning Rate:** 2e-5
- **Epochs:** 3
- **Hardware Used:** 2x NVIDIA T4 GPUs

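These hyperparameters imply the following optimizer step count, assuming all 21,800 snippets are used for training and the per-device batch of 16 is replicated across both GPUs for an effective batch of 32 (a back-of-the-envelope sketch, not a logged training figure):

```python
import math

dataset_size = 21_800        # snippets (see Training Data)
per_device_batch = 16
num_gpus = 2
epochs = 3

effective_batch = per_device_batch * num_gpus           # 32
steps_per_epoch = math.ceil(dataset_size / effective_batch)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)   # 682 2046
```
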
### Metrics

| Metric              | Score    |
|---------------------|----------|
| **Train Loss**      | 0.6112   |
| **Evaluation Loss** | 0.605983 |
| **Accuracy**        | 64.27%   |
| **F1 Score**        | 51.80%   |
| **Precision**       | 68.04%   |
| **Recall**          | 41.90%   |

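As a sanity check, the reported F1 score is the harmonic mean of the reported precision and recall; recomputing it from the table values agrees up to rounding:

```python
# F1 = 2PR / (P + R), using the precision and recall reported above
precision = 0.6804
recall = 0.419

f1 = 2 * precision * recall / (precision + recall)
print(round(f1 * 100, 1))   # 51.9 — matches the reported 51.8% up to input rounding
```
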
## Environmental Impact

| Factor            | Value        |
|-------------------|--------------|
| **GPUs Used**     | 2x NVIDIA T4 |
| **Training Time** | ~1 hour      |