Journal publications
2024
Why smart contracts reported as vulnerable were not exploited?
Authors: Tianyuan Hu, Jingyue Li, Bixin Li, and André Storhaug
Location: IEEE Transactions on Dependable and Secure Computing
10.1109/TDSC.2024.3520554
[ABS] [PDF]
Smart contract security is crucial for blockchain applications. While studies suggest that only a small fraction of reported vulnerabilities are exploited, no follow-up research has investigated the reasons behind this. Our goal is to understand the factors contributing to the low exploitation rate to improve vulnerability detection and defense mechanisms. We collected 136,969 real-world smart contracts and analyzed them using seven vulnerability detectors. We applied Strauss’ grounded theory to gain insights into exploitability and analyzed transaction logs to trace the historic exploitations. Among the 4,364 smart contracts flagged as vulnerable, a significant 75.25% were found to be unexploitable, meaning they were either false positives or posed no security risk. We identified ten reasons for reporting unexploitable vulnerabilities. Furthermore, we found that only 66 out of 1,080 (6%) exploitable contracts had been exploited. We compared the characteristics of exploited versus non-exploited vulnerabilities and identified five factors that may reduce the likelihood of exploitation. Our findings highlight the importance of not treating smart contracts as conventional object-oriented (OO) applications. Researchers must account for the unique features of Solidity, smart contract design principles, and execution environments. Based on these insights, we propose six recommendations to improve smart contract vulnerability detection, prioritization, and mitigation.
Conference publications
2023
Evaluating the impact of ChatGPT on exercises of a software security course
Authors: Jingyue Li, Per Håkon Meland, Jakob Svennevik Notland, André Storhaug, and Jostein Hjortland Tysse
Location: 2023 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)
10.1109/ESEM56168.2023.10304857
[ABS] [PDF]
Along with the development of large language models (LLMs) such as ChatGPT, many existing approaches and tools for software security are changing. It is therefore essential to understand how security-aware these models are and how they impact software security practices and education. In the exercises of a software security course at our university, we ask students to identify and fix vulnerabilities we insert into a web application using state-of-the-art tools. With the release of ChatGPT, especially the GPT-4 version of the model, we want to know how students could use ChatGPT to complete the exercise tasks. We input the vulnerable code to ChatGPT and measure its accuracy in vulnerability identification and fixing. In addition, we investigate whether ChatGPT can provide proper sources of information to support its outputs. Results show that, in a white-box setting, ChatGPT identified 20 of the 28 vulnerabilities we inserted in the web application, reported three false positives, and found four additional vulnerabilities beyond the ones we inserted. ChatGPT made nine satisfactory penetration testing and fixing recommendations for the ten vulnerabilities we wanted students to fix and could often point to related sources of information.
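The study posed these tasks to ChatGPT interactively. As a rough illustration of how the same kind of query could be scripted, the sketch below sends a vulnerable snippet to a GPT model through the OpenAI Python API and asks for identification and a fix; the model name, prompt, and snippet are illustrative assumptions, not the exercise material.

    # Illustrative sketch only: the study used ChatGPT interactively; the prompt,
    # snippet, and model choice here are assumptions for demonstration.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    snippet = '''
    query = "SELECT * FROM users WHERE name = '" + request.args["name"] + "'"
    cursor.execute(query)
    '''

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a software security assistant."},
            {"role": "user", "content": "Identify any vulnerabilities in this code and suggest a fix:\n" + snippet},
        ],
    )
    print(response.choices[0].message.content)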
Efficient avoidance of vulnerabilities in auto-completed smart contract code using vulnerability-constrained decoding
Authors: André Storhaug, Jingyue Li, and Tianyuan Hu
Location: 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE)
10.1109/ISSRE59848.2023.00035
[ABS] [PDF] [Code]
Auto-completing code enables developers to speed up coding significantly. Recent advances in transformer-based large language model (LLM) technologies have been applied to code synthesis. However, studies show that much of the code synthesized by such models contains vulnerabilities. We propose a novel vulnerability-constrained decoding approach to reduce the amount of vulnerable code generated by such models. Using a small dataset of labeled vulnerable lines of code, we fine-tune an LLM to include vulnerability labels when generating code, acting as an embedded classifier. Then, during decoding, we prevent the model from generating these labels to avoid generating vulnerable code. To evaluate the method, we chose automatic completion of Ethereum blockchain smart contracts (SCs) as the case study, due to the strict security requirements of SCs. We first fine-tuned the 6-billion-parameter GPT-J model using 186,397 Ethereum SCs obtained after removing duplicates from 2,217,692 SCs. The fine-tuning took more than one week using ten GPUs. The results showed that our fine-tuned model could synthesize SCs with an average BLEU (BiLingual Evaluation Understudy) score of 0.557. However, much of the code in the auto-completed SCs was vulnerable. Using the code preceding the vulnerable lines of 176 SCs containing different types of vulnerabilities as prompts for auto-completion, we found that more than 70% of the auto-completed code was insecure. We therefore further fine-tuned the model on another 941 vulnerable SCs containing the same types of vulnerabilities and applied vulnerability-constrained decoding. This fine-tuning took only one hour with four GPUs. We then auto-completed the 176 SCs again and found that our approach could identify 62% of the code to be generated as vulnerable and avoid generating 67% of it, indicating that the approach can efficiently and effectively avoid vulnerabilities in auto-completed code.
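The core decoding idea, refusing to emit the model's own vulnerability labels, can be sketched with the Hugging Face transformers generation API. The checkpoint name, label token, and Solidity prompt below are hypothetical assumptions for illustration; the paper's actual procedure for handling flagged lines is more involved than simply banning a token.

    # Minimal sketch of label-banning at decoding time (not the paper's released code).
    # "your-org/gptj-sc-vuln-labels" is a hypothetical checkpoint fine-tuned to emit a
    # "<vuln>" label in front of lines it considers vulnerable.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "your-org/gptj-sc-vuln-labels"  # hypothetical fine-tuned model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    prompt = "function withdraw(uint256 amount) public {\n"  # partial Solidity to complete
    inputs = tokenizer(prompt, return_tensors="pt")

    # Forbid the vulnerability label so decoding cannot follow generation paths
    # the model has learned to mark as vulnerable.
    bad_words_ids = tokenizer(["<vuln>"], add_special_tokens=False).input_ids

    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        bad_words_ids=bad_words_ids,  # token sequences that must never be generated
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))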
Preprints
2024
Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study
Authors: André Storhaug and Jingyue Li
Location: arXiv preprint arXiv:2411.02462
10.48550/arXiv.2411.02462
[ABS] [PDF] [Code]
The advent of large language models (LLMs) like GitHub Copilot has significantly enhanced programmers’ productivity, particularly in code generation. However, these models often struggle with real-world tasks without fine-tuning. As LLMs grow larger and more performant, fine-tuning for specialized tasks becomes increasingly expensive. Parameter-efficient fine-tuning (PEFT) methods, which fine-tune only a subset of model parameters, offer a promising solution by reducing the computational costs of tuning LLMs while maintaining their performance. Existing studies have explored using PEFT and LLMs for various code-related tasks and found that the effectiveness of PEFT techniques is task-dependent. The application of PEFT techniques in unit test generation remains underexplored. The state-of-the-art is limited to using LLMs with full fine-tuning to generate unit tests. This paper investigates both full fine-tuning and various PEFT methods, including LoRA, (IA)^3, and prompt tuning, across different model architectures and sizes. We use well-established benchmark datasets to evaluate their effectiveness in unit test generation. Our findings show that PEFT methods can deliver performance comparable to full fine-tuning for unit test generation, making specialized fine-tuning more accessible and cost-effective. Notably, prompt tuning is the most effective in terms of cost and resource utilization, while LoRA approaches the effectiveness of full fine-tuning in several cases.
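As a rough illustration of the parameter-efficient setup compared in this study, the sketch below attaches LoRA adapters to a small causal code model with the Hugging Face peft library. The base model, target modules, and hyperparameters are illustrative assumptions, not the configurations evaluated in the paper.

    # Minimal LoRA sketch with the peft library; model name and hyperparameters are
    # assumptions for illustration, not the paper's experimental setup.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base_model = "Salesforce/codegen-350M-mono"  # assumed small code model
    model = AutoModelForCausalLM.from_pretrained(base_model)

    lora_config = LoraConfig(
        r=8,                          # rank of the low-rank update matrices
        lora_alpha=16,                # scaling factor for the LoRA updates
        lora_dropout=0.05,
        target_modules=["qkv_proj"],  # attention projection(s) to adapt; model-specific
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # only the adapter weights are trainable

    # The wrapped model can then be trained with a standard training loop on
    # (focal code, unit test) pairs; full fine-tuning, (IA)^3, and prompt tuning
    # would swap in different configurations at this step.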