GFI-Bot

ML-powered 🤖 for finding and labeling good first issues in your GitHub project!

The tool is based on the following paper: W. Xiao, H. He, W. Xu, X. Tan, J. Dong, M. Zhou. Recommending Good First Issues in GitHub OSS Projects. Accepted at ICSE'2022.

Get Started

TODO: Add a quick usage guide after a prototype is finished.

Roadmap

We describe our envisioned use cases for GFI-Bot in USE_CASES.md.

Development

Project Organization

GFI-Bot is organized into five main modules:

  1. gfibot.data: Modules to periodically and incrementally collect the latest issue statistics on registered GitHub projects.
  2. gfibot.model: Modules to periodically train GFI recommendation models based on issue statistics collected by gfibot.data.
  3. gfibot.backend: Modules to provide RESTful APIs for interaction with the frontend.
  4. frontend: A standalone JavaScript (or TypeScript?) project as our website. This website will be used both as the main portal of GFI-Bot and as a control panel for users to find recommended good first issues or track bot status for their projects.
  5. github-app: A standalone JavaScript (or TypeScript?) project for interacting with GitHub.

All modules interact with a MongoDB instance for both reading and writing data (except frontend, which interacts with the backend through RESTful APIs). The MongoDB instance serves as a "single source of truth" and is the main way to decouple the modules. It is used to store and continuously update issue statistics, training progress and performance, recommendation results, etc.

Environment Setup

GFI-Bot uses poetry for dependency management. Run the following commands with poetry to set up a working environment.

poetry shell       # activate a working virtual environment
poetry install     # install all dependencies
pre-commit install # install pre-commit hooks
black .            # format all Python code
pytest             # run all tests to confirm this environment is working

Then, configure a MongoDB instance (4.2 or later) and specify its connection URL in pyproject.toml.

Database Schemas

As mentioned above, the MongoDB instance serves as a "single source of truth" and decouples the modules. Before you start working on any part of GFI-Bot, it is therefore important to know what the data look like in MongoDB. For this purpose, we adopt mongoengine as an ORM-like layer to formally describe and enforce a schema for each MongoDB collection; all collections are defined as Python classes here.

Development Guidelines

Contributions should follow the existing conventions and style of the codebase as far as possible. Please add type annotations for all class members, function parameters, and return values. When writing commit messages, please follow the Conventional Commits specification.

Deployment

First, choose the GitHub projects of interest and specify them in pyproject.toml. Then put a list of GitHub access tokens in tokens.txt, one token per line. The more tokens you provide, the faster GFI-Bot can bootstrap. Run the following script to check that the tokens are configured correctly.

python -m gfibot.check_tokens
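
For illustration only, the expected layout of tokens.txt (one token per line) and a basic validity check can be sketched as below. This is not the code behind gfibot.check_tokens; the helper names `load_tokens` and `remaining_quota` are hypothetical, and the check simply queries GitHub's public rate-limit endpoint with each token.

```python
import json
import urllib.request
from pathlib import Path
from typing import List


def load_tokens(path: str = "tokens.txt") -> List[str]:
    """Read one token per line, skipping blank lines and duplicates."""
    tokens: List[str] = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        token = line.strip()
        if token and token not in tokens:
            tokens.append(token)
    return tokens


def remaining_quota(token: str) -> int:
    """Return the remaining core API quota for a token (hypothetical helper).

    Raises urllib.error.HTTPError if GitHub rejects the token.
    """
    request = urllib.request.Request(
        "https://api.github.com/rate_limit",
        headers={"Authorization": f"token {token}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    return payload["resources"]["core"]["remaining"]
```

Calling `remaining_quota(t)` for each `t` in `load_tokens()` gives a rough idea of how much API budget is available before a long data collection run.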

Dataset Preparation

Next, run the following script to collect historical data for the projects of interest. The first run can take a long time (up to days), but later runs only perform a quick incremental update on the existing database. Run this script periodically (e.g., as a scheduled background task) so that the MongoDB database reflects the latest state of the specified repositories.

python -m gfibot.data.update --nprocess=4 # you can increase parallelism with more GitHub tokens
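
One possible way to run the update as a recurring background task is a small wrapper loop like the sketch below. It is only an assumption about how scheduling could be done, not part of GFI-Bot; `build_update_command` and `run_periodically` are hypothetical names, and the only thing taken from this README is the command line itself.

```python
import subprocess
import sys
import time
from typing import Callable, List, Optional


def build_update_command(nprocess: int = 4) -> List[str]:
    """Build the argv for the update command shown above."""
    return [sys.executable, "-m", "gfibot.data.update", f"--nprocess={nprocess}"]


def run_periodically(
    command: List[str],
    interval_seconds: float,
    max_runs: Optional[int] = None,
    runner: Callable[[List[str]], int] = subprocess.call,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `command` repeatedly, waiting `interval_seconds` between runs.

    Returns the number of runs that exited with a non-zero status.
    `max_runs=None` loops forever; `runner` and `sleep` can be replaced in tests.
    """
    failures = 0
    runs = 0
    while max_runs is None or runs < max_runs:
        if runner(command) != 0:
            failures += 1  # count the failure and try again on the next cycle
        runs += 1
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return failures
```

For example, `run_periodically(build_update_command(4), 24 * 3600)` would start an update once a day. A cron job or a systemd timer that invokes the same command is an equally reasonable choice.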

Then, build a dataset for training and prediction as follows. This script may also take a long time but can be accelerated with more processes.

python -m gfibot.data.dataset --since=2008.01.01 --nprocess=4
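
The `--since` value above is written as YYYY.MM.DD and is presumably a lower bound on the dates considered. As a sketch of how flags in this format could be parsed (an illustration with argparse, not the actual gfibot.data.dataset code; the defaults are assumptions):

```python
import argparse
from datetime import datetime
from typing import List, Optional


def parse_since(value: str) -> datetime:
    """Parse a date written as YYYY.MM.DD, e.g. 2008.01.01."""
    try:
        return datetime.strptime(value, "%Y.%m.%d")
    except ValueError as exc:
        raise argparse.ArgumentTypeError(
            f"expected YYYY.MM.DD, got {value!r}"
        ) from exc


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse --since and --nprocess; both defaults here are hypothetical."""
    parser = argparse.ArgumentParser(description="Hypothetical dataset-builder CLI")
    parser.add_argument("--since", type=parse_since, default=parse_since("2008.01.01"))
    parser.add_argument("--nprocess", type=int, default=1)
    return parser.parse_args(argv)
```

With this sketch, a date in another format such as `2008-01-01` is rejected by `parse_since` instead of being silently misread.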

Model Training

Model training can be simply done by running the following script.


Backend Deployment