Summarising big data: public GitHub dataset for software engineering challenges

The textual, numerical, and relationship-based data generated in open-source software development environments are of interest to researchers. Various datasets are available for these data, which are frequently used in areas such as software engineering and natural language processing. However, because these datasets contain all the data in the environment, working with them means processing terabytes of data. For this reason, almost all studies using GitHub data work with data filtered according to certain criteria. As a result, each study uses a different dataset, which makes comparing the accuracy of the studies quite difficult. To solve this problem, a common dataset was created and shared with researchers, allowing work on many software engineering problems.

Cumhuriyet Science Journal
  • ISSN: 2587-2680
  • Publication frequency: 4 issues per year
  • First published: 2002
  • Publisher: Sivas Cumhuriyet University > Faculty of Science