

Chinese version (readme_zh.md)

The DupPR dataset

About this dataset

This dataset is a list of accidentally duplicated pull requests collected from GitHub; the list itself is in dup_prs.md. Readily available information about these pull requests can be found in pullreq_info.
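As a starting point for working with the list, here is a minimal Python sketch that pulls pull request references out of Markdown text. It assumes the text contains full GitHub pull request URLs of the form `https://github.com/<owner>/<repo>/pull/<number>`; the actual layout of dup_prs.md is not described in this readme, so the pattern may need adjusting. The function names and the sample text are hypothetical and not part of the dataset.

```python
import re
from collections import defaultdict

# Assumption: pull requests appear as full GitHub URLs, e.g.
# https://github.com/owner/repo/pull/123
PR_URL = re.compile(r"https://github\.com/([\w.-]+)/([\w.-]+)/pull/(\d+)")


def extract_pull_requests(markdown_text):
    """Return (owner, repo, number) tuples in order of appearance."""
    return [
        (owner, repo, int(number))
        for owner, repo, number in PR_URL.findall(markdown_text)
    ]


def group_by_repo(pull_requests):
    """Map 'owner/repo' to the list of pull request numbers found for it."""
    grouped = defaultdict(list)
    for owner, repo, number in pull_requests:
        grouped[f"{owner}/{repo}"].append(number)
    return dict(grouped)


# Hypothetical sample text, not taken from the dataset.
sample = (
    "https://github.com/example-owner/example-repo/pull/10 "
    "https://github.com/example-owner/example-repo/pull/12"
)
pull_requests = extract_pull_requests(sample)
by_repo = group_by_repo(pull_requests)
```

To try this on the real file, read it with `open("dup_prs.md", encoding="utf-8").read()` and pass the result to `extract_pull_requests`.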

How can I help?

We would appreciate it if you opened an issue or pull request to

  • add new duplicates you have found
  • point out errors in the dataset

Attention: please do not submit duplicate issues or pull requests :)

How can I cite this work?

@inproceedings{yu2018dataset,
  title={A dataset of duplicate pull-requests in github},
  author={Yu, Yue and Li, Zhixing and Yin, Gang and Wang, Tao and Wang, Huaimin},
  booktitle={Proceedings of the 15th International Conference on Mining Software Repositories},
  pages={22--25},
  year={2018}
}

Papers using this dataset

  • Li, Z., Yu, Y., Zhou, M., Wang, T., Yin, G., Lan, L., & Wang, H. (2020). Redundancy, Context, and Preference: An Empirical Study of Duplicate Pull Requests in OSS Projects. IEEE Transactions on Software Engineering (TSE). PDF

  • Wang, Q., Xu, B., Xia, X., Wang, T., & Li, S. (2019, October). Duplicate Pull Request Detection: When Time Matters. In Proceedings of the 11th Asia-Pacific Symposium on Internetware (pp. 1-10).

  • Zhou, S., Vasilescu, B., & Kästner, C. (2019, August). What the fork: a study of inefficient and efficient forking practices in social coding. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE) (pp. 350-361).

  • Ren, L., Zhou, S., Kästner, C., & Wąsowski, A. (2019, February). Identifying redundancies in fork-based development. In Proceedings 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER) (pp. 230-241). IEEE.

  • Li, Z., Yu, Y., Wang, T., Yin, G., Mao, X., & Wang, H. (2019). Detecting Duplicate Contributions in Pull-based Model Combining Textual and Change Similarities. Journal of Computer Science and Technology. PDF

  • Li, Z., Yin, G., Yu, Y., Wang, T., & Wang, H. (2017, September). Detecting duplicate pull-requests in github. In Proceedings of the 9th Asia-Pacific Symposium on Internetware (pp. 1-6). PDF