new commit

This commit is contained in:
qiangge 2016-11-17 15:28:27 +08:00
parent b98b8509e4
commit 08373c0dea
28 changed files with 176 additions and 355 deletions

6
.gitignore vendored
View File

@@ -1,3 +1,7 @@
*.*
!*.tex
!*.pdf
acm-update.pdf
latex/acmcopyright.sty
latex/sigproc.bib
latex/sig-alternate-05-2015.cls

BIN
latex/main.pdf Normal file

Binary file not shown.

View File

@@ -200,8 +200,8 @@ Yue Yu\\
\section{background and related work}
\input{background}
\section{METHOD}
% \input{method}
\section{DATA PROCESS}
\input{method}
% \begin{figure*}[!htbp]
% \centering

169
latex/method.tex Normal file
View File

@@ -0,0 +1,169 @@
\subsection{Data Collection}
In our previous work \cite{yu2015wait}, we composed a comprehensive dataset to study the pull-based model, involving 951,918 issues across 1185 main-line projects in GitHub (dump dated 10/11/2014, based on GHTorrent \cite{gousios2013ghtorent,gousios2014lean}).
Through the GitHub API, we retrieve the title, content, labels, and reporter of each issue.
In this paper, in order to test which supervised machine learning technique performs best for issue classification, we need projects containing a sufficient number of labeled issues for training and testing.
In addition, to avoid the influence of an unbalanced dataset, we need projects with an appropriate bug rate among their labeled issues.
Thus, we identify 101 candidate projects from GitHub that have at least 500 labeled issues and a bug rate between 20\% and 80\% (the labeling process is described in Section \ref{labeling process}).
This dataset serves as our training and testing set.
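For illustration, a minimal Python sketch of this filtering step could look as follows; the input layout and field names are assumptions made for the example, not part of the actual tooling:
\begin{verbatim}
# Illustrative sketch: each project is a dict with assumed fields
# "name", "labeled_issues" and "bug_issues".
def select_candidates(projects, min_labeled=500, low=0.2, high=0.8):
    candidates = []
    for p in projects:
        labeled = p["labeled_issues"]
        if labeled < min_labeled:
            continue
        bug_rate = p["bug_issues"] / labeled   # share of bug-labeled issues
        if low <= bug_rate <= high:
            candidates.append(p["name"])
    return candidates
\end{verbatim}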
\subsection{Labeling Dataset}
\label{labeling process}
In this process, we need to know which issues are bugs and which are not.
This yields a labeled training dataset, which is then used for supervised machine learning.
Compared with other ITSs such as Bugzilla, it is more difficult to obtain a labeled training dataset in GitHub.
The ITS of GitHub only provides a label system through which developers can add extra structural information such as category, priority, etc.
Moreover, labels in GitHub are user-defined, which means that different projects may use different labels to express the same meaning.
In our dataset, there are 7793 distinct labels across the 101 projects.
This flat and flexible design makes it difficult to understand the label usage of so many projects.
% On the one hand, different with Bugzilla, the ITS in Github do not have categories information for issue reports.
% Users in GitHub distinguish category of issue reports by labels.
% On the other hand, GitHub's label system are user self-defined and not all the projects use the same labels to label bug or feature for issue reports.
Therefore, we first need to comprehend which labels are used to classify issues.
In GitHub, some projects
% There are some projects in GitHub migrated from other platform, and at the same time,
% succeed to the custom of using traditional ITS,
provide extra information in their issue labels compared with other projects. For example, in the project ``WordPress-Android'', issues are labeled with ``[type] bug'', ``[type] enhancement'', etc.; in the project ``angular.js'', issues are labeled with ``type: bug'', ``type: feature'', ``component: form'', etc.
Labels in these projects contain not only the type of the issue but also the category of the label itself.
Although such projects are only a small fraction, they are still helpful for learning which labels are most commonly used to express the category of issues.
To understand which labels users use in GitHub, we design a process to aggregate labels.
First, we pick out all labels of these forms and separate their components.
We use a two-dimensional vector \textit{<C, name>} to represent such a label, where \textit{C} denotes the category of the label and \textit{name} denotes its main information (such as ``bug'', ``feature'', ``enhancement'', etc.).
Second, we group labels with the same \textit{C} item, which completes the preliminary aggregation.
Next, we define the similarity between groups as in Equation \ref{equation:similarity}. We iteratively calculate the similarity between pairs of groups and take the union of groups whose similarity is greater than a threshold.
\begin{equation}
similarity = \frac{\left| Group_i \cap Group_j \right|}{\min \left( \left| Group_i \right|, \left| Group_j \right| \right)}, \quad (i \neq j)
\label{equation:similarity}
\end{equation}
where $Group_i$ is the set of all labels with the same category $C_i$ and different $name$ items.
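A minimal Python sketch of this merging step is shown below; the representation of groups as sets of label names and the threshold value are illustrative assumptions:
\begin{verbatim}
# Groups are a dict mapping a category C to the set of label
# names aggregated under it (data layout is assumed).
def similarity(g1, g2):
    return len(g1 & g2) / min(len(g1), len(g2))

def merge_groups(groups, threshold=0.5):   # threshold is assumed
    merged = True
    while merged:              # repeat until no pair can be merged
        merged = False
        keys = list(groups)
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                if similarity(groups[a], groups[b]) > threshold:
                    groups[a] |= groups.pop(b)   # take the union
                    merged = True
                    break
            if merged:
                break
    return groups
\end{verbatim}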
Finally, we obtain a structured label system by aggregating the selected labels.
To gain an overall perspective on the label system, we aggregate all the remaining labels according to the group information obtained above.
As a result, we find that the three most-used groups of labels are type (\textit{i.e.}, bug, feature, enhancement), status (\textit{i.e.}, duplicate, wontfix, invalid), and adverb (\textit{i.e.}, high, critical, major; such labels are mostly used for priority and severity).
To evaluate the coverage of these three kinds of labels, we select all labels that are used in more than 5 projects, filtering out rarely used labels. We then count the usage frequency of these three kinds of labels and calculate their share among all filtered labels. These three kinds of labels account for more than half of the usage, reaching 58.7\%.
By aggregating all the labels, we finally obtain the 113 labels most used in the category ``type''. From these labels, we select bug-like labels such as ``bug'', ``defect'', ``type:bug'', ``definition:bug'', etc., and feature-like labels such as ``feature'', ``enhancement'', ``new feature'', ``feature request'', etc. We then label issues as bug or feature according to the labels we have distinguished, and these labeled issues serve as our training data in the following process.
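A simplified Python sketch of this labeling rule is shown below; the two label sets are only a small illustrative subset of the aggregated ``type'' labels, and issues matching both sets or neither are skipped here by assumption:
\begin{verbatim}
# Illustrative subset of the aggregated "type" labels; the full
# set contains 113 labels as described in the text.
BUG_LABELS = {"bug", "defect", "type:bug", "definition:bug"}
FEATURE_LABELS = {"feature", "enhancement", "new feature",
                  "feature request"}

def categorize(issue_labels):
    labels = {l.strip().lower() for l in issue_labels}
    is_bug = bool(labels & BUG_LABELS)
    is_feature = bool(labels & FEATURE_LABELS)
    if is_bug and not is_feature:
        return "bug"
    if is_feature and not is_bug:
        return "feature"
    return None   # ambiguous or unlabeled: excluded from training
\end{verbatim}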
Because we need an adequate number of classified issues to train a model, we hand-pick 111 projects that have more than 500 issues carrying category-labeling information. Detailed information about these projects is shown in Table \ref{tag:datacollection}.
\begin{table}[htbp]
\centering
\caption{Summary Statistics for Data Collection}
\begin{tabular}{|c|c|c|} \hline
\textbf{Item} & \textbf{Count} & \textbf{Mean} \\ \hline
Projects & 111 & \\ \hline
Issues & 356256 & 3209.5 (per project) \\ \hline
Labeled issues & 240754 & 2169.0 (per project) \\ \hline
Labels & 470965 & 1.96 (per issue) \\ \hline
\end{tabular}%
\label{tag:datacollection}
\end{table}%
We also select three different projects (phpadmin, piwik, and numpy) as case-study subjects; they have enough labeled issues and different proportions of bugs to features. Detailed information about them is shown in Table \ref{tag:casestudy}.
\begin{table}[htbp]
\centering
\caption{Projects for Case Study}
\begin{tabular}{|c|c|c|} \hline
\textbf{Project} & \textbf{Labeled Issues} & \textbf{Bug Proportion} \\ \hline
phpadmin & 6766 & 0.75 \\ \hline
piwik & 5389 & 0.66 \\ \hline
numpy & 2511 & 0.86 \\ \hline
\end{tabular}%
\label{tag:casestudy}
\end{table}%
\subsection{Data Preprocessing}
Through the preceding process, each issue is characterized by its title and description, and part of the issues can be labeled as ``bug'' or ``feature''. The linguistic features extracted for the machine learning methods then undergo standard processing, i.e., text filtering, stemming, and indexing \cite{frakes1992information}. Here, we do not remove all stop-words, keeping common English terms such as ``should'', ``might'', and ``not''. The study in \cite{antoniol2008bug} indicates that such terms may be important for classifying issues, and \cite{bissyande2013got} also mentions that removing the default stop-word list of common corpora might decrease classification accuracy. For instance, the semantics of the sentence ``This is not a bug'' is completely lost if the standard English stop-words are removed, because the result is ``This is a bug''.
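A rough Python sketch of this preprocessing step is shown below; the Porter stemmer from NLTK and the reduced stop-word list are illustrative choices rather than the exact configuration:
\begin{verbatim}
import re
from nltk.stem import PorterStemmer   # assumed stemmer choice

# Illustrative stop-word list: meaning-bearing words such as
# "should", "might" and "not" are deliberately NOT included,
# so they survive filtering (see the discussion above).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and"}

stemmer = PorterStemmer()

def preprocess(text):
    # text filtering: lowercase, keep alphabetic tokens only
    tokens = re.findall(r"[a-z]+", text.lower())
    # remove only the reduced stop-word list, then stem
    return [stemmer.stem(t) for t in tokens if t not in STOPWORDS]
\end{verbatim}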
We then use the vector space model to represent each issue as a weighted vector. We segment the issue into terms (here, a word is a term); each element of the issue's vector is the weight of a term, and its value stands for the importance of that term for the issue.
We utilize term frequency-inverse document frequency (\textit{tf-idf}) to calculate the weights, based on two assumptions: the more often a given term appears in an issue, the more important it is for that issue; conversely, the more issues a term appears in, the less useful it is for distinguishing among these issues. The calculation of \textit{tf-idf} is given in Equation \ref{e:tfidf}.
\begin{equation}
\begin{array}{l}
\displaystyle tf(t,i) = \frac{n_t}{N_i}\\
\displaystyle idf(t) = \log \left( \frac{N_I}{\left| \{ i \in I : t \in i \} \right|} \right)\\
\displaystyle tfidf(t,i) = tf(t,i) \times idf(t)
\end{array}
\label{e:tfidf}
\end{equation}
where $t$ is a term, $i$ is the corpus of an issue, $I$ is the corpus of all issues in the given project, $n_t$ is the number of occurrences of term $t$ in issue $i$, $N_i$ is the total number of terms in issue $i$, and $N_I$ is the total number of issues in the given project.
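The following Python sketch computes these weights exactly as defined in Equation \ref{e:tfidf}; it is a didactic illustration, and library implementations such as scikit-learn's \texttt{TfidfVectorizer} apply a slightly smoothed variant:
\begin{verbatim}
import math
from collections import Counter

# Didactic sketch: `issues` is assumed to be a list of token
# lists, one list per preprocessed issue of a given project.
def tfidf_vectors(issues):
    n_issues = len(issues)                 # N_I
    doc_freq = Counter()                   # |{i in I : t in i}|
    for tokens in issues:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in issues:
        counts = Counter(tokens)           # n_t per term
        total = len(tokens)                # N_i
        vec = {t: (c / total) * math.log(n_issues / doc_freq[t])
               for t, c in counts.items()}
        vectors.append(vec)
    return vectors
\end{verbatim}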
% \subsection{Improving Classifier}
% We design an improving machine learning process to predict categories of issues process as shown in Figure \ref{figure:process}. The detail of the process will be shown below and then we will introduce some well-known machine learning algorithms used in this paper.
% \begin{figure}[!htb]
% \centering
% \includegraphics[width=8.5cm]{classprocess}
% \caption{The process of the improving classifier}
% \label{figure:process}
% \end{figure}%picture
% Before building a classification model, we need to label issues in training set. Machine utilizing these knowledge to build a prediction model and predict other issues. After deciding the prediction target, an improving defect classification process acts as Figure \ref{figure:process}:
% \textbf{Labeling}: We need to specify the prediction target for the machine, and ``tell'' machine what kind of issues are bug and what kind of issues are feature. Labeling process is important and a precise labeling process is helpful to build an excellent prediction model.
% \textbf{Preprocessing of data}: This step extracts features from issues and prepares a proper form of data (usually using vector space model) for machine to process. We can create a training corpus to be used by a machine learn method through combining labels and features of instances.
% \textbf{Training model}: We utilizing training data to construct a prediction model, such as Support Vector Machines (SVM) or Naive Bayes (NB).
% This step is the core and goal of the process, and we will build a prediction model for each project.
% To evaluate the prediction model, we separate all sample into training set and testing set by 10-fold cross-validation, using 9 folds training model and 1 fold testing model, and carry out this process 10 times which each time uses a different fold as testing set.
% We use accuracy to evaluate the prediction model. Mostly, the prediction model can predict tendentiousness of the issue, and there are four possible outcomes: classifying a bug as a bug (${n_{b \to b}}$), classifying a bug as a feature (${n_{b \to f}}$), classifying a feature as a bug (${n_{f \to b}}$), and classifying a feature as a feature (${n_{f \to f}}$). The accuracy of the model are defined as Equation 3:
% \begin{equation}
% Accuracy = \frac{{{n_{b \to b}} + {n_{f \to f}}}}{{{n_{b \to b}} + {n_{b \to f}} + {n_{f \to b}} + {n_{f \to f}}}}
% \end{equation}
% Here accuracy can reflect the performance of model by two side. The higher accuracy means more issue reports have a right prediction regardless its category. At the same time, it means a higher recall from prediction model. Therefore, only accuracy is sufficient to evaluate the performance of the model in our research.
% There are many ML techniques utilized in different study \cite{antoniol2008bug,herzig2013s,maalej2015bug,zhou2014combining}. These ML techniques are all used in traditional ITS like Bugzilla, but no one use them to test whether they work in ITS of GitHub. We evaluate some widely used ML techniques in our data collection and the package of these ML techniques are shown as Table \ref{tag:packages}:
% \begin{table}[htbp]
% \centering
% \caption{ML techniques and packages}
% \begin{tabular}{|c|c|c|} \hline
% \textbf{ML} & \textbf{Full Name} & \textbf{Package} \\ \hline
% SVM & Support Vector Machine & $sklearn.svm$ \\ \hline
% NB & Naive Bayes & $sklearn.naive\_bayes$ \\ \hline
% LR & Logistic Regression & $sklearn.linear_model$ \\ \hline
% ET & Extra Trees & $sklearn.ensemble$ \\ \hline
% RF & Random Forest & $sklearn.ensemble$ \\ \hline
% \end{tabular}%
% \label{tag:packages}
% \end{table}%
% \textbf{Improving model}: Intuitively, not all the issues are same to the prediction model. The issue reports are hard to classified by machine which are closed to the hyperplane of the prediction model. So figuring out which issue reports are hard to be classified for machine and what they are ``talking about'' may be great useful to improve the performance of the prediction model. For most prediction models, they can calculate the probability the issue report classified to each categories, which can reflect the difference between issue reports. To pick up issues that hard to be classified, we use the probability result of the prediction model and define a variable $\Delta$ to present how difficult an issue is for prediction model:
% \begin{equation}
% \Delta = \left| {prob{a_b} - prob{a_f}} \right|
% \end{equation}
% Where $proba_b$ is the probability that the issue is classified to be a bug, $proba_f$ is the probability that the issue is classified to be a feature, and the sum of them are equal to 1. The greater $\Delta$ is, the classification result is more convinced.
% $\Delta$ is an important attribute for issue reports in our research and the analysis next are based on this variable.
% To improving the performance of the classifier, we focus on the issue reports which are most likely to get a wrong prediction.
% The number of wrong predicted issue reports is ${N_I} \times (1 - Accuracy)$, where $N_I$ is the count of all issues from the given project. For different prediction model, these wrong predicted issue reports are more likely to appear in the interval of small $\Delta$. So we defined a threshold $threshold_\Delta$ to divide all issue reports into convinced part or hesitated part:
% \begin{equation}
% threshol{d_\Delta } = {\Delta _{{N_I} \times (1 - Accuracy)}}
% \label{e:thresholdd}
% \end{equation}
% Where $\Delta_n$ is $\Delta$ of the n-th issue of all ascending ordered training issues by $\Delta$ from the given project. For each issue, we define it convinced when $\Delta$ is greater than $threshold_\Delta$, and hesitated where $\Delta$ is smaller than $threshold_\Delta$.
% We use case study to find whether there are some reasons or patterns of hesitated and convinced issues, which can be utilized to improving the prediction model.
% In our case study, we select most convinced issues which get a wrong prediction from classification model, and classify them manually to see why they have a wrong predation.
% At the same time, we select the most convinced issue reports (according to $\Delta$ of issue reports) which have a wrong prediction and the most hesitated issue reports (according to $\Delta$ of issue reports, too), and read them manually to understand why they are hard to be classified by classification model.
% The rule that we judge an issue as a bug or a feature is the way to solve it. We classify an issue as a bug when it callused by existing co and want developers to fix it.
% And we classify an issue as a feature when it need new feature or function (enhancement) to solve the issue. In the process of manual classification, two raters worked together to develop a stable classification scheme.
% We use a manual classification process like study \cite{knauss2012detecting} and the detail process acts as follows: First, the two raters classify part of issues together to create an initial rule. Then each rater classify a specified set of issues individually.
% Next, compare the result of two raters and judge whether they have the same opinion based on Cohen's Kappa ($\kappa$) \cite{strijbos2006content}.
% If there are too much disagreement between them ($\kappa$< 0.7), the issues where the raters holds different opinion will be discussed until they get enough common understanding about classification ($\kappa$ >= 0.7).

View File

@@ -1,200 +0,0 @@
% This is "sig-alternate.tex" V2.1 April 2013
% This file should be compiled with V2.5 of "sig-alternate.cls" May 2012
%
% This example file demonstrates the use of the 'sig-alternate.cls'
% V2.5 LaTeX2e document class file. It is for those submitting
% articles to ACM Conference Proceedings WHO DO NOT WISH TO
% STRICTLY ADHERE TO THE SIGS (PUBS-BOARD-ENDORSED) STYLE.
% The 'sig-alternate.cls' file will produce a similar-looking,
% albeit, 'tighter' paper resulting in, invariably, fewer pages.
%
% ----------------------------------------------------------------------------------------------------------------
% This .tex file (and associated .cls V2.5) produces:
% 1) The Permission Statement
% 2) The Conference (location) Info information
% 3) The Copyright Line with ACM data
% 4) NO page numbers
%
% as against the acm_proc_article-sp.cls file which
% DOES NOT produce 1) thru' 3) above.
%
% Using 'sig-alternate.cls' you have control, however, from within
% the source .tex file, over both the CopyrightYear
% (defaulted to 200X) and the ACM Copyright Data
% (defaulted to X-XXXXX-XX-X/XX/XX).
% e.g.
% \CopyrightYear{2007} will cause 2007 to appear in the copyright line.
% \crdata{0-12345-67-8/90/12} will cause 0-12345-67-8/90/12 to appear in the copyright line.
%
% ---------------------------------------------------------------------------------------------------------------
% This .tex source is an example which *does* use
% the .bib file (from which the .bbl file % is produced).
% REMEMBER HOWEVER: After having produced the .bbl file,
% and prior to final submission, you *NEED* to 'insert'
% your .bbl file into your source .tex file so as to provide
% ONE 'self-contained' source file.
%
% ================= IF YOU HAVE QUESTIONS =======================
% Questions regarding the SIGS styles, SIGS policies and
% procedures, Conferences etc. should be sent to
% Adrienne Griscti (griscti@acm.org)
%
% Technical questions _only_ to
% Gerald Murray (murray@hq.acm.org)
% ===============================================================
%
% For tracking purposes - this is V2.0 - May 2012
\documentclass{sig-alternate-05-2015}
\usepackage[numbers,sort&compress]{natbib}
\begin{document}
% Copyright
\setcopyright{acmcopyright}
%\CopyrightYear{2007} % Allows default copyright year (20XX) to be over-ridden - IF NEED BE.
%\crdata{0-12345-67-8/90/01} % Allows default copyright data (0-89791-88-6/97/05) to be over-ridden - IF NEED BE.
\doi{10.475/123_4}
% ISBN
\isbn{123-4567-24-567/08/06}
% --- End of Author Metadata ---
\title{Alternate {\ttlit ACM} SIG Proceedings Paper in LaTeX
Format}
\subtitle{[Extended Abstract]
%
% You need the command \numberofauthors to handle the 'placement
% and alignment' of the authors beneath the title.
%
% For aesthetic reasons, we recommend 'three authors at a time'
% i.e. three 'name/affiliation blocks' be placed beneath the title.
%
% NOTE: You are NOT restricted in how many 'rows' of
% "name/affiliations" may appear. We just ask that you restrict
% the number of 'columns' to three.
%
% Because of the available 'opening page real-estate'
% we ask you to refrain from putting more than six authors
% (two rows with three columns) beneath the article title.
% More than six makes the first-page appear very cluttered indeed.
%
% Use the \alignauthor commands to handle the names
% and affiliations for an 'aesthetic maximum' of six authors.
% Add names, affiliations, addresses for
% the seventh etc. author(s) as the argument for the
% \additionalauthors command.
% These 'additional authors' will be output/set for you
% without further effort on your part as the last section in
% the body of your article BEFORE References or any Appendices.
\numberofauthors{8} % in this sample file, there are a *total*
% of EIGHT authors. SIX appear on the 'first-page' (for formatting
% reasons) and the remaining two appear in the \additionalauthors section.
%
\author{
% You can go ahead and credit any number of authors here,
% e.g. one 'row of three' or two rows (consisting of one row of three
% and a second row of one, two or three).
%
% The command \alignauthor (no curly braces needed) should
% precede each author name, affiliation/snail-mail address and
% e-mail address. Additionally, tag each line of
% affiliation/address with \affaddr, and tag the
% e-mail address with \email.
%
% 1st. author
\alignauthor
Ben Trovato\\
\affaddr{Institute for Clarity in Documentation}\\
\affaddr{1932 Wallamaloo Lane}\\
\affaddr{Wallamaloo, New Zealand}\\
\email{trovato@corporation.com}
% 2nd. author
\alignauthor
G.K.M. Tobin\\
\affaddr{Institute for Clarity in Documentation}\\
\affaddr{P.O. Box 1212}\\
\affaddr{Dublin, Ohio 43017-6221}\\
\email{webmaster@marysville-ohio.com}
% 3rd. author
\alignauthor Lars Th{\o}rv{\"a}ld\\
\affaddr{The Th{\o}rv{\"a}ld Group}\\
\affaddr{1 Th{\o}rv{\"a}ld Circle}\\
\affaddr{Hekla, Iceland}\\
\email{larst@affiliation.org}
\and % use '\and' if you need 'another row' of author names
% 4th. author
\alignauthor Lawrence P. Leipuner\\
\affaddr{Brookhaven Laboratories}\\
\affaddr{Brookhaven National Lab}\\
\affaddr{P.O. Box 5000}\\
\email{lleipuner@researchlabs.org}
% 5th. author
\alignauthor Sean Fogarty\\
\affaddr{NASA Ames Research Center}\\
\affaddr{Moffett Field}\\
\affaddr{California 94035}\\
\email{fogartys@amesres.org}
% 6th. author
\alignauthor Charles Palmer\\
\affaddr{Palmer Research Laboratories}\\
\affaddr{8600 Datapoint Drive}\\
\affaddr{San Antonio, Texas 78229}\\
\email{cpalmer@prl.com}
}
% Just remember to make sure that the TOTAL number of authors
% is the number that will appear on the first page PLUS the
% number that will appear in the \additionalauthors section.
\maketitle
\begin{abstract}
\input{abstract}
\end{abstract}
%
% Use this command to print the description
%
% We no longer use \terms command
%\terms{Theory}
\keywords{ACM proceedings; \LaTeX; text tagging}
\section{Introduction}
\input{introduction}
\section{background and related work}
\input{background}
\section{METHOD}
\input{method}
\section{Result and Discussion}
\input{result}
\section{Conclusions}
%\end{document} % This is where a 'short' article might terminate
%ACKNOWLEDGMENTS are optional
\section{Acknowledgments}
This research is supported by National Science Foundation of China (grants)
%
% The following two commands are all you need in the
% initial runs of your .tex file to
% produce the bibliography for the citations in your paper.
\bibliographystyle{abbrv}
\bibliography{sigproc} % sigproc.bib is the name of the Bibliography in this case
% You must have a proper ".bib" file
% and remember to run:
% latex bibtex latex latex
% to resolve all references
%
% ACM needs 'a single self-contained file'!
%
\end{document}

View File

@@ -1,152 +0,0 @@
\subsection{Data Collection}
In our previous work \cite{yu2015wait}, we have composed a comprehensive dataset to study the pull-based model, involving 951,918 issues across 1185 main-line projects in GitHub (dump dated 10/11/2014 based on GHTorrent \cite{gousios2013ghtorent,gousios2014lean}. In this paper, in order to test which supervised machine learning techniques perform best for issue classification, we need the projects containing enough number of labeled issues for training and testing. Thus, we firstly identify candidate projects (127 projects) from GitHub which have at least 1000 labeled issues. Then, for each projects, we gather the meta-data (i.e., title, description, etc.) of issues in ITS through API of GitHub. All these comprehensive dataset can be used as our training and testing set.
\subsection{Labeling Bug and Feature}
Compared with other ITS like Bugzilla, it is more difficult to get labeled train data in GitHub. On the one hand, different with Bugzilla, the ITS in Github do not have categories information for issue reports. Users in GitHub distinguish category of issue reports by labels. On the other hand, GitHub's label system are user self-defined and not all the projects use the same labels to label bug or feature for issue reports. In our data set, there are 7793 different labels in 127 projects. So we need comprehend which labels are related to bug and which labels are related to feature.
In GitHub, there are some projects
% There are some projects in GitHub migrated from other platform, and at the same time,
% succeed to the custom of using traditional ITS,
giving more information to label issues compared with other projects. For example, in project ``WordPress-Android'', issues are labeled like ``[type] bug'', ``[type] enhancement'', etc.; in project ``angular.js'', issues are labeled like ``type: bug'', ``type: feature'', ``component: form'', etc.
Labels in these projects not only contain the type of the issue, but also contain the category of label itself.
Despite projects like those are just a small part, it still helpful for us to know what labels are most used to express the category of issues.
To understand what labels users use in GitHub, we design a process to aggregate labels.
Firstly, we pick out all labels acting as those forms, and separate the information of them.
We use a 2-d vector \textit{<C, name>} to represent such a label, which \textit{C} means category of the label and \textit{name} means the main information of the label (like ``bug'', ``feature'', ``enhancement'', etc.).
Secondly, we group labels with same C items, and the preliminary aggregate process is done.
Next, we define the similarity of different group $Group_i$ as Equation \ref{equation:similarity}. We iteratively calculate the similarity of different group, then get the union of groups whose similarity greater than threshold.
\begin{equation}
similarity = \frac{{\left| {Grou{p_i}\bigcap {Grou{p_j}} } \right|}}{{\min \left( {\left| {Grou{p_i}} \right|,\left| {Grou{p_j}} \right|} \right)}}
, (i \neq j)
\label{equation:similarity}
\end{equation}
Where $Group_i$ is a set of all the labels with the same category $C_i$ and different $name$.
Finally, we get a structure label system through aggregating selected labels.
To get a overall perspective to label system, we aggregate all other labels according to the group information we aggregated before.
In the result, we find the top 3 most used categories of labels are type (\textit{i.e.}, bug, feature, enhancement), status (\textit{i.e.}, duplicate, wontfix, invalid) and adverb (\textit{i.e.}, high, critical, major. This kind of labels are mostly used as priority and severity).
To evaluate the coverage of these 3 kinds of labels, we select all labels that are used in more than 5 projects to filter labels with minority usage. We count the frequency of usage of these 3 kinds of labels and calculate the usage rate for all filtered labels. Finally, these 3 kinds of labels are used more than half, which achieve 58.7\%.
Through aggregating all the labels, we finally get 113 labels that users use most in the category ``type''. From these labels, we distinguish them and select bug-like labels such as ``bug'', ``defect'', ``type:bug'', ``definition:bug'', etc. and select feature-like labels such as ``feature'', ``enhancement'', ``new feature'', ``feature request'', etc. Then, we label issues as bug or feature using labels we have distinguished, and these labeled issues will be our training data in following process.
Because of the need for adequate classified issues for training model, we hand-pick 111 projects which have more than 500 labeled categories information issues. The detail information of these projects are shown in Table \ref{tag:datacollection}.
\begin{table}[htbp]
\centering
\caption{Summary Statistics for Data Collection}
\begin{tabular}{|c|c|c|} \hline
\textbf{} & \textbf{Count} & \textbf{Mean} \\ \hline
Projects & 111 & \\ \hline
Issues & 356256 & 3209.5(per project) \\ \hline
Labeled issues & 240754 & 2169.0(per project) \\ \hline
Labels & 470965 & 1.96(per issue) \\ \hline
\end{tabular}%
\label{tag:datacollection}
\end{table}%
We also select 3 different projects (phpadmin, piwik and numpy) as objects of case study, which have enough labeled issues and different proportion of bug to feature. The detail information about them are exhibited in Table \ref{tag:casestudy}.
\begin{table}[htbp]
\centering
\caption{Projects for Case Study}
\begin{tabular}{|c|c|c|} \hline
\textbf{Projects} & \textbf{Labeled Issues} & \textbf{Bug Proportion (\%)} \\ \hline
phpadmin & 6766 & 0.75 \\ \hline
piwik & 5389 & 0.66 \\ \hline
numpy & 2511 & 0.86 \\ \hline
\end{tabular}%
\label{tag:casestudy}
\end{table}%
\subsection{Preprocess of Data}
Through the former process, each issue is characterized by its title and description, and part of them can be labeled as ``bug'' or ``feature''. Then, linguistic features extracted for machine learning method undergo the standard processing, i.e., text filtering, stemming, and indexing \cite{frakes1992information}. Here, we do not remove all stop-words, and leave common English term, such as ``should'', ``might'', ``not''. In study \cite{antoniol2008bug}, they indicate that it may be important for classifying issues, and study \cite{bissyande2013got} also mentions that removing default list of stop-words in common corpora might decrease the classification accuracy. For instance, the semantic of a sentence ``This is not a bug'' is completely lost if the Standard English stop-words are removed because the result is ``This is a bug''.
We then use vector space model to represent each issues as a weighted vector. We segment the issue into different terms (in here a word means a term) and each element in the vector of the issue is the weight of a term, and the value stands for the importance of the term for the issue.
we utilize term frequency-inverse document frequency (\textit{tf-idf}) to calculate weight, which based on two assumptions: The more a given word appears in the issue, the more important it is for that issue. Contrariwise, the more issue a word appears in, the less useful it is to distinguish among these issues. is utilized to indicate the weight of a term, The process of calculating \textit{tf-idf} acts as Equation \ref{e:tfidf}.
\begin{equation}
\begin{array}{l}
\displaystyle f(t,i) = \frac{{{n_t}}}{{{N_i}}}\\
\displaystyle idf(t) = \log \left( {\frac{{{N_I}}}{{\left| {i \in I:t \in i} \right|}}} \right)\\
\displaystyle tfidf(t,i) = tf(t,i) \times idf(t)
\end{array}
\label{e:tfidf}
\end{equation}
Where \textit{t} is a term, \textit{i} is the corpus of an issue, \textit{I} is the corpus of all issues in the given project, $n_t$ is the count of appearance for term \textit{t} in the issue, $N_i$ is the total number of terms in issue \textit{i} and $N_I$ is the total number of issues in the given project.
\subsection{Improving Classifier}
We design an improving machine learning process to predict categories of issues process as shown in Figure \ref{figure:process}. The detail of the process will be shown below and then we will introduce some well-known machine learning algorithms used in this paper.
\begin{figure}[!htb]
\centering
\includegraphics[width=8.5cm]{classprocess}
\caption{The process of the improving classifier}
\label{figure:process}
\end{figure}%picture
Before building a classification model, we need to label issues in training set. Machine utilizing these knowledge to build a prediction model and predict other issues. After deciding the prediction target, an improving defect classification process acts as Figure \ref{figure:process}:
\textbf{Labeling}: We need to specify the prediction target for the machine, and ``tell'' machine what kind of issues are bug and what kind of issues are feature. Labeling process is important and a precise labeling process is helpful to build an excellent prediction model.
\textbf{Preprocessing of data}: This step extracts features from issues and prepares a proper form of data (usually using vector space model) for machine to process. We can create a training corpus to be used by a machine learn method through combining labels and features of instances.
\textbf{Training model}: We utilizing training data to construct a prediction model, such as Support Vector Machines (SVM) or Naive Bayes (NB).
This step is the core and goal of the process, and we will build a prediction model for each project.
To evaluate the prediction model, we separate all sample into training set and testing set by 10-fold cross-validation, using 9 folds training model and 1 fold testing model, and carry out this process 10 times which each time uses a different fold as testing set.
We use accuracy to evaluate the prediction model. Mostly, the prediction model can predict tendentiousness of the issue, and there are four possible outcomes: classifying a bug as a bug (${n_{b \to b}}$), classifying a bug as a feature (${n_{b \to f}}$), classifying a feature as a bug (${n_{f \to b}}$), and classifying a feature as a feature (${n_{f \to f}}$). The accuracy of the model are defined as Equation 3:
\begin{equation}
Accuracy = \frac{{{n_{b \to b}} + {n_{f \to f}}}}{{{n_{b \to b}} + {n_{b \to f}} + {n_{f \to b}} + {n_{f \to f}}}}
\end{equation}
Here accuracy can reflect the performance of model by two side. The higher accuracy means more issue reports have a right prediction regardless its category. At the same time, it means a higher recall from prediction model. Therefore, only accuracy is sufficient to evaluate the performance of the model in our research.
There are many ML techniques utilized in different study \cite{antoniol2008bug,herzig2013s,maalej2015bug,zhou2014combining}. These ML techniques are all used in traditional ITS like Bugzilla, but no one use them to test whether they work in ITS of GitHub. We evaluate some widely used ML techniques in our data collection and the package of these ML techniques are shown as Table \ref{tag:packages}:
\begin{table}[htbp]
\centering
\caption{ML techniques and packages}
\begin{tabular}{|c|c|c|} \hline
\textbf{ML} & \textbf{Full Name} & \textbf{Package} \\ \hline
SVM & Support Vector Machine & $sklearn.svm$ \\ \hline
NB & Naive Bayes & $sklearn.naive\_bayes$ \\ \hline
LR & Logistic Regression & $sklearn.linear_model$ \\ \hline
ET & Extra Trees & $sklearn.ensemble$ \\ \hline
RF & Random Forest & $sklearn.ensemble$ \\ \hline
\end{tabular}%
\label{tag:packages}
\end{table}%
\textbf{Improving model}: Intuitively, not all the issues are same to the prediction model. The issue reports are hard to classified by machine which are closed to the hyperplane of the prediction model. So figuring out which issue reports are hard to be classified for machine and what they are ``talking about'' may be great useful to improve the performance of the prediction model. For most prediction models, they can calculate the probability the issue report classified to each categories, which can reflect the difference between issue reports. To pick up issues that hard to be classified, we use the probability result of the prediction model and define a variable $\Delta$ to present how difficult an issue is for prediction model:
\begin{equation}
\Delta = \left| {prob{a_b} - prob{a_f}} \right|
\end{equation}
Where $proba_b$ is the probability that the issue is classified to be a bug, $proba_f$ is the probability that the issue is classified to be a feature, and the sum of them are equal to 1. The greater $\Delta$ is, the classification result is more convinced.
$\Delta$ is an important attribute for issue reports in our research and the analysis next are based on this variable.
To improving the performance of the classifier, we focus on the issue reports which are most likely to get a wrong prediction.
The number of wrong predicted issue reports is ${N_I} \times (1 - Accuracy)$, where $N_I$ is the count of all issues from the given project. For different prediction model, these wrong predicted issue reports are more likely to appear in the interval of small $\Delta$. So we defined a threshold $threshold_\Delta$ to divide all issue reports into convinced part or hesitated part:
\begin{equation}
threshol{d_\Delta } = {\Delta _{{N_I} \times (1 - Accuracy)}}
\label{e:thresholdd}
\end{equation}
Where $\Delta_n$ is $\Delta$ of the n-th issue of all ascending ordered training issues by $\Delta$ from the given project. For each issue, we define it convinced when $\Delta$ is greater than $threshold_\Delta$, and hesitated where $\Delta$ is smaller than $threshold_\Delta$.
We use case study to find whether there are some reasons or patterns of hesitated and convinced issues, which can be utilized to improving the prediction model.
In our case study, we select most convinced issues which get a wrong prediction from classification model, and classify them manually to see why they have a wrong predation.
At the same time, we select the most convinced issue reports (according to $\Delta$ of issue reports) which have a wrong prediction and the most hesitated issue reports (according to $\Delta$ of issue reports, too), and read them manually to understand why they are hard to be classified by classification model.
The rule that we judge an issue as a bug or a feature is the way to solve it. We classify an issue as a bug when it callused by existing co and want developers to fix it.
And we classify an issue as a feature when it need new feature or function (enhancement) to solve the issue. In the process of manual classification, two raters worked together to develop a stable classification scheme.
We use a manual classification process like study \cite{knauss2012detecting} and the detail process acts as follows: First, the two raters classify part of issues together to create an initial rule. Then each rater classify a specified set of issues individually.
Next, compare the result of two raters and judge whether they have the same opinion based on Cohen's Kappa ($\kappa$) \cite{strijbos2006content}.
If there are too much disagreement between them ($\kappa$< 0.7), the issues where the raters holds different opinion will be discussed until they get enough common understanding about classification ($\kappa$ >= 0.7).

BIN
text.pdf

Binary file not shown.