RepoPal, Detecting Similar Repositories on GitHub


Time: Aug. 2015 – Jun. 2016


Summary

  • Invented a GitHub repository recommendation algorithm based on three heuristics that leverage two data sources not considered in prior work (GitHub stars and readme files)
  • Developed the corresponding system, RepoPal, using a large amount of data mined from GitHub
  • Demonstrated that RepoPal outperforms CLAN (state-of-the-art) in precision by 20% and confidence by 41%
  • Analyzed other advantages of RepoPal, including language independence and lower computation cost

Abstract
GitHub contains millions of repositories, a number of which implement similar functionalities. Finding similar repositories on GitHub can be helpful for software engineers as it can help them reuse source code, identify alternative implementations, explore related projects, find projects to contribute to, and discover code theft and plagiarism. Previous studies have proposed techniques to detect similar applications by analyzing API usage patterns and software tags. Unfortunately, these prior studies either only make use of a limited source of information or use information not available for projects on GitHub.

In this paper, we propose an approach that can effectively detect similar repositories on GitHub. Our approach is based on three heuristics that leverage additional data sources (i.e., GitHub stars and readme files) not considered in previous works. The three heuristics are: (1) projects that are starred by the same users within a short period of time are likely to be similar to one another, (2) projects that are starred by similar users are likely to be similar to one another, and (3) projects whose readme files contain similar contents are likely to be similar to one another. Based on these three heuristics, we compute two relevance scores (i.e., star-based relevance and readme-based relevance) to assess the similarity between two repositories. By integrating the two relevance scores, we build a recommendation system called RepoPal to detect similar repositories. We compare RepoPal to a prior state-of-the-art approach, CLAN, using one thousand Java repositories on GitHub. Our empirical evaluation demonstrates that RepoPal achieves a higher success rate, precision, and confidence than CLAN.
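
To make the two relevance scores more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not RepoPal's actual implementation: the abstract does not give the formulas, so the function names, the one-hour window, the bag-of-words cosine similarity, and the choice to multiply the two scores are all hypothetical. The sketch covers only the first and third heuristics; the "similar users" heuristic is left out for brevity.

```python
import math
import re
from collections import Counter


def co_star_score(stars_a, stars_b, window_seconds=3600):
    """Count users who starred both repositories within a short time window.

    stars_a / stars_b map a user name to the time (in seconds) at which
    that user starred the repository. The window size is an assumption.
    """
    shared_users = stars_a.keys() & stars_b.keys()
    return sum(
        1 for user in shared_users
        if abs(stars_a[user] - stars_b[user]) <= window_seconds
    )


def readme_score(readme_a, readme_b):
    """Cosine similarity between bag-of-words vectors of two readme texts."""
    vec_a = Counter(re.findall(r"[a-z0-9]+", readme_a.lower()))
    vec_b = Counter(re.findall(r"[a-z0-9]+", readme_b.lower()))
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty readme is treated as similar to nothing
    return dot / (norm_a * norm_b)


def combined_score(stars_a, stars_b, readme_a, readme_b, window_seconds=3600):
    """Integrate the two scores. Multiplication is an assumption made here;
    the abstract only says that the scores are integrated."""
    return (co_star_score(stars_a, stars_b, window_seconds)
            * readme_score(readme_a, readme_b))
```

With this combination, a pair of repositories scores highly only when it has both co-starring users and overlapping readme vocabulary; a weighted sum would be an equally plausible design if either signal alone should be enough.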


Architecture
[Figure: RepoPal Architecture]


Performance
Performance of RepoPal, CLAN, and their combination:

[Figure: Confidence Comparison]

[Figure: Precision Comparison]

[Figure: Success Rate Comparison]


Conclusion and Future Work
Detecting similar repositories on GitHub can help software engineers reuse source code, identify alternative implementations, explore related projects, find projects to contribute to, and discover code theft and plagiarism, among other uses. A number of prior approaches have been proposed to identify similar applications; unfortunately, they are not optimal for GitHub. One approach relies only on similarity in API method invocations, while another relies on tags, which are not present on GitHub. Neither leverages two sources of information that can intuitively help to identify similar repositories: GitHub stars and readme files. In this work, we propose a new technique named RepoPal that leverages these two sources of information. It is based on three heuristics. First, repositories that are starred by the same people within a short period of time are likely to be similar. Second, repositories starred by similar users are likely to be similar. Third, repositories whose readme files share similar contents are likely to be similar. In this study, we evaluated RepoPal on 50 queries run against a pool of 1,000 repositories and compared its effectiveness against CLAN. Our experimental results show that RepoPal outperforms CLAN in terms of success rate, confidence, and precision.
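
Running a query against a pool of repositories, as in the evaluation described above, amounts to ranking every candidate by its similarity to the query. The Python sketch below shows that ranking step only. The helper names, the cutoff `k`, and the toy Jaccard score over stargazer sets are hypothetical stand-ins chosen for illustration; they are not RepoPal's actual relevance function.

```python
def top_k_similar(query, pool, score, k=5):
    """Return the k repositories in `pool` most similar to `query`.

    `score(query, repo)` is any pairwise relevance function; the query
    itself is excluded from the results. The default k is an assumption.
    """
    candidates = [repo for repo in pool if repo != query]
    ranked = sorted(candidates, key=lambda repo: score(query, repo), reverse=True)
    return ranked[:k]


# Toy data: repository name -> set of users who starred it (made up).
STARGAZERS = {
    "repo/a": {"u1", "u2", "u3"},
    "repo/b": {"u1", "u2"},
    "repo/c": {"u3"},
    "repo/d": {"u9"},
}


def jaccard_star_score(repo_x, repo_y):
    """Stand-in score: overlap of stargazer sets (Jaccard index)."""
    x, y = STARGAZERS[repo_x], STARGAZERS[repo_y]
    union = x | y
    return len(x & y) / len(union) if union else 0.0


print(top_k_similar("repo/a", STARGAZERS, jaccard_star_score, k=2))
# prints ['repo/b', 'repo/c']
```

Scoring every candidate takes one pass over the pool per query, which is inexpensive for a pool of 1,000 repositories.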

In future work, we plan to reduce the threats to validity by including additional queries, repositories, and participants in the evaluation of RepoPal. Moreover, we plan to include additional sources of information to further boost the effectiveness of RepoPal.