Mining Software Repositories for Vulnerability Prediction: Lessons Learned, Challenges, and Recommendations

Talk Abstract

Software vulnerabilities are weaknesses in source code that might be exploited to cause harm or loss. Over the last decades, the software engineering research community has proposed a number of static and dynamic approaches to assist developers with the (semi-)automatic detection and removal of vulnerabilities. Alongside the promising results obtained so far, a relatively new trend is the use of historical data to predict the emergence of new software vulnerabilities. By exploiting the large amount of data available in software repositories, machine learning algorithms are trained to learn patterns that may indicate the presence of vulnerable constructs in source code, thereby helping developers identify and remediate potential weaknesses early.

This lecture will discuss our experience with using software repository mining techniques to build vulnerability prediction models. Besides providing insights into recent advances in the field, the lecture will discuss the design choices to make when training, testing, and validating machine learning algorithms for vulnerability prediction, as well as the implications that these choices have for the trustworthiness and credibility of the resulting models. We will conclude by discussing the current limitations and challenges that we deem crucial to address in order to further improve vulnerability prediction capabilities.

The lecture will attempt to provide students not only with a theoretical grounding, but also with practical instruments to deal with the perils of mining software repositories for vulnerability prediction.

Lecture shared with Fabio Palomba