Submission version

This commit is contained in:
qiangge 2017-09-18 00:23:05 +08:00
parent 2beeb6a627
commit 10c92e128a
41 changed files with 3945 additions and 3 deletions

BIN
latex/bare_conf.pdf Normal file


620
latex/bare_conf.tex Normal file

@@ -0,0 +1,620 @@
%% bare_conf.tex
%% V1.3
%% 2007/01/11
%% by Michael Shell
%% See:
%% http://www.michaelshell.org/
%% for current contact information.
%%
%% This is a skeleton file demonstrating the use of IEEEtran.cls
%% (requires IEEEtran.cls version 1.7 or later) with an IEEE conference paper.
%%
%% Support sites:
%% http://www.michaelshell.org/tex/ieeetran/
%% http://www.ctan.org/tex-archive/macros/latex/contrib/IEEEtran/
%% and
%% http://www.ieee.org/
%%*************************************************************************
%% Legal Notice:
%% This code is offered as-is without any warranty either expressed or
%% implied; without even the implied warranty of MERCHANTABILITY or
%% FITNESS FOR A PARTICULAR PURPOSE!
%% User assumes all risk.
%% In no event shall IEEE or any contributor to this code be liable for
%% any damages or losses, including, but not limited to, incidental,
%% consequential, or any other damages, resulting from the use or misuse
%% of any information contained here.
%%
%% All comments are the opinions of their respective authors and are not
%% necessarily endorsed by the IEEE.
%%
%% This work is distributed under the LaTeX Project Public License (LPPL)
%% ( http://www.latex-project.org/ ) version 1.3, and may be freely used,
%% distributed and modified. A copy of the LPPL, version 1.3, is included
%% in the base LaTeX documentation of all distributions of LaTeX released
%% 2003/12/01 or later.
%% Retain all contribution notices and credits.
%% ** Modified files should be clearly indicated as such, including **
%% ** renaming them and changing author support contact information. **
%%
%% File list of work: IEEEtran.cls, IEEEtran_HOWTO.pdf, bare_adv.tex,
%% bare_conf.tex, bare_jrnl.tex, bare_jrnl_compsoc.tex
%%*************************************************************************
% *** Authors should verify (and, if needed, correct) their LaTeX system ***
% *** with the testflow diagnostic prior to trusting their LaTeX platform ***
% *** with production work. IEEE's font choices can trigger bugs that do ***
% *** not appear when using other class files. ***
% The testflow support page is at:
% http://www.michaelshell.org/tex/testflow/
% Note that the a4paper option is mainly intended so that authors in
% countries using A4 can easily print to A4 and see how their papers will
% look in print - the typesetting of the document will not typically be
% affected with changes in paper size (but the bottom and side margins will).
% Use the testflow package mentioned above to verify correct handling of
% both paper sizes by the user's LaTeX system.
%
% Also note that the "draftcls" or "draftclsnofoot", not "draft", option
% should be used if it is desired that the figures are to be displayed in
% draft mode.
%
\documentclass[10pt, conference, compsocconf]{IEEEtran}
% Add the compsocconf option for Computer Society conferences.
%
% If IEEEtran.cls has not been installed into the LaTeX system files,
% manually specify the path to it like:
% \documentclass[conference]{../sty/IEEEtran}
% Some very useful LaTeX packages include:
% (uncomment the ones you want to load)
% *** MISC UTILITY PACKAGES ***
%
%\usepackage{ifpdf}
% Heiko Oberdiek's ifpdf.sty is very useful if you need conditional
% compilation based on whether the output is pdf or dvi.
% usage:
% \ifpdf
% % pdf code
% \else
% % dvi code
% \fi
% The latest version of ifpdf.sty can be obtained from:
% http://www.ctan.org/tex-archive/macros/latex/contrib/oberdiek/
% Also, note that IEEEtran.cls V1.7 and later provides a builtin
% \ifCLASSINFOpdf conditional that works the same way.
% When switching from latex to pdflatex and vice-versa, the compiler may
% have to be run twice to clear warning/error messages.
% *** CITATION PACKAGES ***
%
%\usepackage{cite}
% cite.sty was written by Donald Arseneau
% V1.6 and later of IEEEtran pre-defines the format of the cite.sty package
% \cite{} output to follow that of IEEE. Loading the cite package will
% result in citation numbers being automatically sorted and properly
% "compressed/ranged". e.g., [1], [9], [2], [7], [5], [6] without using
% cite.sty will become [1], [2], [5]--[7], [9] using cite.sty. cite.sty's
% \cite will automatically add leading space, if needed. Use cite.sty's
% noadjust option (cite.sty V3.8 and later) if you want to turn this off.
% cite.sty is already installed on most LaTeX systems. Be sure and use
% version 4.0 (2003-05-27) and later if using hyperref.sty. cite.sty does
% not currently provide for hyperlinked citations.
% The latest version can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/cite/
% The documentation is contained in the cite.sty file itself.
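%
% As a hedged illustration of the behavior described above (the citation
% keys ref1...ref9 below are hypothetical, not from this paper's database):
%
% \usepackage{cite}
% ...
% Prior work~\cite{ref1,ref9,ref2,ref7,ref5,ref6} studied this problem.
% % with cite.sty the keys are rendered sorted and compressed,
% % e.g. [1], [2], [5]--[7], [9]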
% *** GRAPHICS RELATED PACKAGES ***
%
\ifCLASSINFOpdf
% \usepackage[pdftex]{graphicx}
% declare the path(s) where your graphic files are
% \graphicspath{{../pdf/}{../jpeg/}}
% and their extensions so you won't have to specify these with
% every instance of \includegraphics
% \DeclareGraphicsExtensions{.pdf,.jpeg,.png}
\else
% or other class option (dvipsone, dvipdf, if not using dvips). graphicx
% will default to the driver specified in the system graphics.cfg if no
% driver is specified.
% \usepackage[dvips]{graphicx}
% declare the path(s) where your graphic files are
% \graphicspath{{../eps/}}
% and their extensions so you won't have to specify these with
% every instance of \includegraphics
% \DeclareGraphicsExtensions{.eps}
\fi
% graphicx was written by David Carlisle and Sebastian Rahtz. It is
% required if you want graphics, photos, etc. graphicx.sty is already
% installed on most LaTeX systems. The latest version and documentation can
% be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/required/graphics/
% Another good source of documentation is "Using Imported Graphics in
% LaTeX2e" by Keith Reckdahl which can be found as epslatex.ps or
% epslatex.pdf at: http://www.ctan.org/tex-archive/info/
%
% latex, and pdflatex in dvi mode, support graphics in encapsulated
% postscript (.eps) format. pdflatex in pdf mode supports graphics
% in .pdf, .jpeg, .png and .mps (metapost) formats. Users should ensure
% that all non-photo figures use a vector format (.eps, .pdf, .mps) and
% not bitmapped formats (.jpeg, .png). IEEE frowns on bitmapped formats
% which can result in "jaggedy"/blurry rendering of lines and letters as
% well as large increases in file sizes.
%
% You can find documentation about the pdfTeX application at:
% http://www.tug.org/applications/pdftex
% *** MATH PACKAGES ***
%
%\usepackage[cmex10]{amsmath}
% A popular package from the American Mathematical Society that provides
% many useful and powerful commands for dealing with mathematics. If using
% it, be sure to load this package with the cmex10 option to ensure that
% only Type 1 fonts will be utilized at all point sizes. Without this option,
% it is possible that some math symbols, particularly those within
% footnotes, will be rendered in bitmap form which will result in a
% document that cannot be IEEE Xplore compliant!
%
% Also, note that the amsmath package sets \interdisplaylinepenalty to 10000
% thus preventing page breaks from occurring within multiline equations. Use:
%\interdisplaylinepenalty=2500
% after loading amsmath to restore such page breaks as IEEEtran.cls normally
% does. amsmath.sty is already installed on most LaTeX systems. The latest
% version and documentation can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/required/amslatex/math/
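%
% A minimal commented sketch of the setup described above (the equation
% symbols are generic placeholders, not from this paper):
%
% \usepackage[cmex10]{amsmath} % cmex10: keep math in Type 1 fonts only
% \interdisplaylinepenalty=2500 % restore page breaks in multiline displays
% ...
% \begin{align}
%   f(x) &= \sum_{i=1}^{n} w_i x_i + b,\\
%   g(x) &= \max\bigl(0, f(x)\bigr).
% \end{align}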
% *** SPECIALIZED LIST PACKAGES ***
%
%\usepackage{algorithmic}
% algorithmic.sty was written by Peter Williams and Rogerio Brito.
% This package provides an algorithmic environment for describing algorithms.
% You can use the algorithmic environment in-text or within a figure
% environment to provide for a floating algorithm. Do NOT use the algorithm
% floating environment provided by algorithm.sty (by the same authors) or
% algorithm2e.sty (by Christophe Fiorio) as IEEE does not use dedicated
% algorithm float types and packages that provide these will not provide
% correct IEEE style captions. The latest version and documentation of
% algorithmic.sty can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/algorithms/
% There is also a support site at:
% http://algorithms.berlios.de/index.html
% Also of interest may be the (relatively newer and more customizable)
% algorithmicx.sty package by Szasz Janos:
% http://www.ctan.org/tex-archive/macros/latex/contrib/algorithmicx/
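%
% A commented sketch of the algorithmic environment inside a figure float,
% as recommended above (the algorithm itself is a generic example, not
% from this paper):
%
% \begin{figure}[!t]
% \begin{algorithmic}[1]          % [1] numbers every line
% \REQUIRE a list $A$ of $n$ numbers
% \STATE $m \leftarrow A_1$
% \FOR{$i = 2$ \TO $n$}
%   \IF{$A_i > m$}
%     \STATE $m \leftarrow A_i$
%   \ENDIF
% \ENDFOR
% \RETURN $m$
% \end{algorithmic}
% \caption{Finding the maximum of a list.}
% \label{alg:max}
% \end{figure}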
% *** ALIGNMENT PACKAGES ***
%
%\usepackage{array}
% Frank Mittelbach's and David Carlisle's array.sty patches and improves
% the standard LaTeX2e array and tabular environments to provide better
% appearance and additional user controls. As the default LaTeX2e table
% generation code is lacking to the point of almost being broken with
% respect to the quality of the end results, all users are strongly
% advised to use an enhanced (at the very least that provided by array.sty)
% set of table tools. array.sty is already installed on most systems. The
% latest version and documentation can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/required/tools/
%\usepackage{mdwmath}
%\usepackage{mdwtab}
% Also highly recommended is Mark Wooding's extremely powerful MDW tools,
% especially mdwmath.sty and mdwtab.sty which are used to format equations
% and tables, respectively. The MDWtools set is already installed on most
% LaTeX systems. The latest version and documentation are available at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/mdwtools/
% IEEEtran contains the IEEEeqnarray family of commands that can be used to
% generate multiline equations as well as matrices, tables, etc., of high
% quality.
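%
% A short sketch of IEEEeqnarray with an rCl column layout (right-aligned
% left-hand side, Centered relation, left-aligned right-hand side); the
% algebra shown is a generic example:
%
% \begin{IEEEeqnarray}{rCl}
%   (a+b)^2 & = & a^2 + 2ab + b^2 \\
%           & = & a^2 + b^2 + 2ab .
% \end{IEEEeqnarray}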
%\usepackage{eqparbox}
% Also of notable interest is Scott Pakin's eqparbox package for creating
% (automatically sized) equal width boxes - aka "natural width parboxes".
% Available at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/eqparbox/
% *** SUBFIGURE PACKAGES ***
%\usepackage[tight,footnotesize]{subfigure}
% subfigure.sty was written by Steven Douglas Cochran. This package makes it
% easy to put subfigures in your figures. e.g., "Figure 1a and 1b". For IEEE
% work, it is a good idea to load it with the tight package option to reduce
% the amount of white space around the subfigures. subfigure.sty is already
% installed on most LaTeX systems. The latest version and documentation can
% be obtained at:
% http://www.ctan.org/tex-archive/obsolete/macros/latex/contrib/subfigure/
% subfigure.sty has been superseded by subfig.sty.
%\usepackage[caption=false]{caption}
%\usepackage[font=footnotesize]{subfig}
% subfig.sty, also written by Steven Douglas Cochran, is the modern
% replacement for subfigure.sty. However, subfig.sty requires and
% automatically loads Axel Sommerfeldt's caption.sty which will override
% IEEEtran.cls handling of captions and this will result in nonIEEE style
% figure/table captions. To prevent this problem, be sure and preload
% caption.sty with its "caption=false" package option. This will preserve
% IEEEtran.cls handling of captions. Version 1.3 (2005/06/28) and later
% (recommended due to many improvements over 1.2) of subfig.sty supports
% the caption=false option directly:
%\usepackage[caption=false,font=footnotesize]{subfig}
%
% The latest version and documentation can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/subfig/
% The latest version and documentation of caption.sty can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/caption/
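%
% A commented sketch of a two-panel figure with subfig.sty loaded as above
% (the graphics file names figa/figb and the labels are hypothetical):
%
% \begin{figure}[!t]
% \centering
% \subfloat[First case.]{\includegraphics[width=1.5in]{figa}%
% \label{fig:a}}
% \hfil
% \subfloat[Second case.]{\includegraphics[width=1.5in]{figb}%
% \label{fig:b}}
% \caption{Example results; panels (a) and (b) are described here.}
% \label{fig:both}
% \end{figure}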
% *** FLOAT PACKAGES ***
%
%\usepackage{fixltx2e}
% fixltx2e, the successor to the earlier fix2col.sty, was written by
% Frank Mittelbach and David Carlisle. This package corrects a few problems
% in the LaTeX2e kernel, the most notable of which is that in current
% LaTeX2e releases, the ordering of single and double column floats is not
% guaranteed to be preserved. Thus, an unpatched LaTeX2e can allow a
% single column figure to be placed prior to an earlier double column
% figure. The latest version and documentation can be found at:
% http://www.ctan.org/tex-archive/macros/latex/base/
%\usepackage{stfloats}
% stfloats.sty was written by Sigitas Tolusis. This package gives LaTeX2e
% the ability to do double column floats at the bottom of the page as well
% as the top. (e.g., "\begin{figure*}[!b]" is not normally possible in
% LaTeX2e). It also provides a command:
%\fnbelowfloat
% to enable the placement of footnotes below bottom floats (the standard
% LaTeX2e kernel puts them above bottom floats). This is an invasive package
% which rewrites many portions of the LaTeX2e float routines. It may not work
% with other packages that modify the LaTeX2e float routines. The latest
% version and documentation can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/sttools/
% Documentation is contained in the stfloats.sty comments as well as in the
% presfull.pdf file. Do not use the stfloats baselinefloat ability as IEEE
% does not allow \baselineskip to stretch. Authors submitting work to the
% IEEE should note that IEEE rarely uses double column equations and
% that authors should try to avoid such use. Do not be tempted to use the
% cuted.sty or midfloat.sty packages (also by Sigitas Tolusis) as IEEE does
% not format its papers in such ways.
% *** PDF, URL AND HYPERLINK PACKAGES ***
%
%\usepackage{url}
% url.sty was written by Donald Arseneau. It provides better support for
% handling and breaking URLs. url.sty is already installed on most LaTeX
% systems. The latest version can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/misc/
% Read the url.sty source comments for usage information. Basically,
% \url{my_url_here}.
% *** Do not adjust lengths that control margins, column widths, etc. ***
% *** Do not use packages that alter fonts (such as pslatex). ***
% There should be no need to do such things with IEEEtran.cls V1.6 and later.
% (Unless specifically asked to do so by the journal or conference you plan
% to submit to, of course. )
% correct bad hyphenation here
\hyphenation{op-tical net-works semi-conduc-tor}
\usepackage{subfigure}
\usepackage{graphicx}
\usepackage{flushend}
\usepackage{booktabs}
\usepackage{multirow}
\usepackage{footnote}
\usepackage{framed}
\usepackage{color,soul}
\usepackage{url}
\makeatletter
\g@addto@macro{\UrlBreaks}{\UrlOrds}
\makeatother
\makeatletter
\def\url@leostyle{%
\@ifundefined{selectfont}{\def\UrlFont{\sf}}{\def\UrlFont{\scriptsize\bf\ttfamily}}}
\makeatother
\urlstyle{leo}
% \usepackage{caption}
\usepackage{dcolumn}
\usepackage{xspace}
\usepackage{balance}
\usepackage{bm}
\usepackage{cite}
\newcommand{\ie}{{\emph{i.e.}},\xspace}
\newcommand{\viz}{{\emph{viz.}},\xspace}
\newcommand{\eg}{{\emph{e.g.}},\xspace}
\newcommand{\etc}{etc.\xspace}
\newcommand{\etal}{{\emph{et al.}}}
\newcommand{\GH}{{\sc GitHub}\xspace}
\newcommand{\BB}{{\sc BitBucket}\xspace}
\IEEEoverridecommandlockouts
\newcommand\correspondingauthor{\thanks{Corresponding author.}}
\begin{document}
%
% paper title
% can use linebreaks \\ within to get better formatting as desired
\title{Where is the Road for Issue Reports Classification Based on Text Mining?}
% author names and affiliations
% use a multiple column layout for up to two different
% affiliations
\author{\IEEEauthorblockN{Qiang Fan, Yue Yu\IEEEauthorrefmark{1}\thanks{*Corresponding author.}, Gang Yin, Tao Wang, Huaimin Wang}
\IEEEauthorblockA{National Laboratory for Parallel and Distributed Processing\\
College of Computer, National University of Defense Technology\\
Changsha, China\\
\{fanqiang09, yuyue, yingang, taowang2005, hmwang\}@nudt.edu.cn}
}
% conference papers do not typically use \thanks and this command
% is locked out in conference mode. If really needed, such as for
% the acknowledgment of grants, issue a \IEEEoverridecommandlockouts
% after \documentclass
% for over three affiliations, or if they all won't fit within the width
% of the page, use this alternative format:
%
%\author{\IEEEauthorblockN{Michael Shell\IEEEauthorrefmark{1},
%Homer Simpson\IEEEauthorrefmark{2},
%James Kirk\IEEEauthorrefmark{3},
%Montgomery Scott\IEEEauthorrefmark{3} and
%Eldon Tyrell\IEEEauthorrefmark{4}}
%\IEEEauthorblockA{\IEEEauthorrefmark{1}School of Electrical and Computer Engineering\\
%Georgia Institute of Technology,
%Atlanta, Georgia 30332--0250\\ Email: see http://www.michaelshell.org/contact.html}
%\IEEEauthorblockA{\IEEEauthorrefmark{2}Twentieth Century Fox, Springfield, USA\\
%Email: homer@thesimpsons.com}
%\IEEEauthorblockA{\IEEEauthorrefmark{3}Starfleet Academy, San Francisco, California 96678-2391\\
%Telephone: (800) 555--1212, Fax: (888) 555--1212}
%\IEEEauthorblockA{\IEEEauthorrefmark{4}Tyrell Inc., 123 Replicant Street, Los Angeles, California 90210--4321}}
% use for special paper notices
%\IEEEspecialpapernotice{(Invited Paper)}
% make the title area
\maketitle
\begin{abstract}
\input{abstract}
\end{abstract}
\begin{IEEEkeywords}
issue tracking system; machine learning techniques; mining software repositories
\end{IEEEkeywords}
% For peer review papers, you can put extra information on the cover
% page as needed:
% \ifCLASSOPTIONpeerreview
% \begin{center} \bfseries EDICS Category: 3-BBND \end{center}
% \fi
%
% For peerreview papers, this IEEEtran command inserts a page break and
% creates the second title. It will be ignored for other modes.
\IEEEpeerreviewmaketitle
\section{Introduction}
% no \IEEEPARstart
\input{introduction}
% You must have at least 2 lines in the paragraph with the drop letter
% (should never be an issue)
\section{Background and Related Work}
\input{background}
\section{Data Set}
\input{dataset}
\section{Methods}
\input{data_process}
\section{Results and Discussion}
\input{result}
\section{Threats to Validity}
\input{threats}
\section{Conclusions}
\input{conclusion}
% An example of a floating figure using the graphicx package.
% Note that \label must occur AFTER (or within) \caption.
% For figures, \caption should occur after the \includegraphics.
% Note that IEEEtran v1.7 and later has special internal code that
% is designed to preserve the operation of \label within \caption
% even when the captionsoff option is in effect. However, because
% of issues like this, it may be the safest practice to put all your
% \label just after \caption rather than within \caption{}.
%
% Reminder: the "draftcls" or "draftclsnofoot", not "draft", class
% option should be used if it is desired that the figures are to be
% displayed while in draft mode.
%
%\begin{figure}[!t]
%\centering
%\includegraphics[width=2.5in]{myfigure}
% where an .eps filename suffix will be assumed under latex,
% and a .pdf suffix will be assumed for pdflatex; or what has been declared
% via \DeclareGraphicsExtensions.
%\caption{Simulation Results}
%\label{fig_sim}
%\end{figure}
% Note that IEEE typically puts floats only at the top, even when this
% results in a large percentage of a column being occupied by floats.
% An example of a double column floating figure using two subfigures.
% (The subfig.sty package must be loaded for this to work.)
% The subfigure \label commands are set within each subfloat command, the
% \label for the overall figure must come after \caption.
% \hfil must be used as a separator to get equal spacing.
% The subfigure.sty package works much the same way, except \subfigure is
% used instead of \subfloat.
%
%\begin{figure*}[!t]
%\centerline{\subfloat[Case I]{\includegraphics[width=2.5in]{subfigcase1}%
%\label{fig_first_case}}
%\hfil
%\subfloat[Case II]{\includegraphics[width=2.5in]{subfigcase2}%
%\label{fig_second_case}}}
%\caption{Simulation results}
%\label{fig_sim}
%\end{figure*}
%
% Note that often IEEE papers with subfigures do not employ subfigure
% captions (using the optional argument to \subfloat), but instead will
% reference/describe all of them (a), (b), etc., within the main caption.
% An example of a floating table. Note that, for IEEE style tables, the
% \caption command should come BEFORE the table. Table text will default to
% \footnotesize as IEEE normally uses this smaller font for tables.
% The \label must come after \caption as always.
%
%\begin{table}[!t]
%% increase table row spacing, adjust to taste
%\renewcommand{\arraystretch}{1.3}
% if using array.sty, it might be a good idea to tweak the value of
% \extrarowheight as needed to properly center the text within the cells
%\caption{An Example of a Table}
%\label{table_example}
%\centering
%% Some packages, such as MDW tools, offer better commands for making tables
%% than the plain LaTeX2e tabular which is used here.
%\begin{tabular}{|c||c|}
%\hline
%One & Two\\
%\hline
%Three & Four\\
%\hline
%\end{tabular}
%\end{table}
% Note that IEEE does not put floats in the very first column - or typically
% anywhere on the first page for that matter. Also, in-text middle ("here")
% positioning is not used. Most IEEE journals/conferences use top floats
% exclusively. Note that, LaTeX2e, unlike IEEE journals/conferences, places
% footnotes above bottom floats. This can be corrected via the \fnbelowfloat
% command of the stfloats package.
% \section{Conclusion}
% The conclusion goes here. this is more of the conclusion
% conference papers do not normally have an appendix
% use section* for acknowledgement
\section*{Acknowledgment}
This research is supported by the National Natural Science Foundation of China (Grant Nos. 61432020, 61472430, 61502512 and 61303064) and the National Key R\&D Program of China (2016-YFB1000805).
% trigger a \newpage just before the given reference
% number - used to balance the columns on the last page
% adjust value as needed - may need to be readjusted if
% the document is modified later
%\IEEEtriggeratref{8}
% The "triggered" command can be changed if desired:
%\IEEEtriggercmd{\enlargethispage{-5in}}
% references section
% can use a bibliography generated by BibTeX as a .bbl file
% BibTeX documentation can be easily obtained at:
% http://www.ctan.org/tex-archive/biblio/bibtex/contrib/doc/
% The IEEEtran BibTeX style support page is at:
% http://www.michaelshell.org/tex/ieeetran/bibtex/
%\bibliographystyle{IEEEtran}
% argument is your BibTeX string definitions and bibliography database(s)
%\bibliography{IEEEabrv,../bib/paper}
%
% <OR> manually copy in the resultant .bbl file
% set second argument of \begin to the number of references
% (used to reserve space for the reference number labels box)
% \begin{thebibliography}{1}
\bibliographystyle{IEEEtran}
\bibliography{sigproc}
% \end{thebibliography}
% that's all folks
\end{document}



@@ -102,10 +102,10 @@ then the data do not provide sufficient evidence to reject the null hypothesis.
Table~\ref{tab:compareML} shows results of procedure $\widetilde{\textbf{T}}$
(the last three rows are the results of Section~\ref{sec:2stage}).
-All the $p-values$ are less than 0.05 in the first four rows.
+All the \textit{p-values} are less than 0.05 in the first four rows.
Thus, a significant difference among the four text-based classifications and the baseline method is observed,
which implies that text-based classification is useful for issue classification in the ITS of GitHub.
-Similarly, all the $p-values$ are less than 0.05 in the next four rows and the lower and upper boundaries are greater than zero,
+Similarly, all the \textit{p-values} are less than 0.05 in the next four rows and the lower and upper boundaries are greater than zero,
which means that the performance of \textit{SVM} is significantly better than the other three classifications.
% \fbox{
@@ -265,7 +265,7 @@ Figure~\ref{figure:2levelresult} shows that the combined method outperforms all
% Table~\ref{table:detailresult} shows that the average values of precision, recall, and F-measure are all better than
% those of the baseline methods.
For procedure $\widetilde{\textbf{T}}$ (last rows in Table~\ref{tab:compareML}),
-all the $p-values$ of combined method versus SVM, developer information method and perplexity information method are less than 0.05,
+all the \textit{p-values} of the combined method versus SVM, the developer information method and the perplexity information method are less than 0.05,
and the lower and upper boundaries are greater than zero,
which means that the combined method significantly outperforms the other approaches.


110
submit/IEEEtranBST/README Normal file

@@ -0,0 +1,110 @@
September 30, 2008
IEEEtran.bst is the official BibTeX style for authors of the Institute of
Electrical and Electronics Engineers (IEEE) Transactions journals and
conferences.
It also may have applications for other academic work such as theses and
technical reports. The alphanumeric and natbib variants extend the
applicability of the IEEEtran bibstyle family to the natural sciences
and beyond.
The IEEEtran bibstyle is a very comprehensive BibTeX style which provides
many features beyond the standard BibTeX styles, including full support
for references of online documents, patents, periodicals and standards.
See the provided user manual for detailed usage information.
The latest version of the IEEEtran BibTeX style can be found at CTAN:
http://www.ctan.org/tex-archive/macros/latex/contrib/IEEEtran/bibtex/
as well as within IEEE's site:
http://www.ieee.org/
Note that the packages at IEEE's site do not contain the natbib and
alphanumeric variants (e.g., IEEEtranN.bst, etc.) as these are not used
for IEEE related work. These files can be obtained on CTAN.
For helpful tips, answers to frequently asked questions, and other support,
visit the IEEEtran home page at my website:
http://www.michaelshell.org/tex/ieeetran/
Enjoy!
Michael Shell
http://www.michaelshell.org/
*******
Version 1.13 (2008/09/30) changes:
1. Fixed bug with edition number to ordinal conversion. Thanks to
Michael Roland for reporting this issue and correcting the algorithm.
2. Added new IEEE journal string definitions.
********************************** Files **********************************
README - This file.
IEEEtran_bst_HOWTO.pdf - The user manual.
IEEEtran.bst - The standard IEEEtran BibTeX style file. For use
with IEEE work.
IEEEtranS.bst - A version of IEEEtran.bst that sorts the entries.
Some IEEE conferences/publications may use/allow
sorted bibliographies.
IEEEtranSA.bst - Like IEEEtranS.bst, but with alphanumeric citation
tags like alpha.bst. Not for normal IEEE use.
IEEEtranN.bst - Like IEEEtran.bst, but based on plainnat.bst and
is compatible with Patrick W. Daly's natbib
package. Not for normal IEEE use.
IEEEtranSN.bst - Sorting version of IEEEtranN.bst. Not for normal
IEEE use.
IEEEexample.bib - An example BibTeX database that contains the
references shown in the user manual.
IEEEabrv.bib - String definitions for the abbreviated names of
IEEE journals. (For use with IEEE work.)
IEEEfull.bib - String definitions for the full names of IEEE
journals. (Do not use for IEEE work.)
***************************************************************************
Legal Notice:
This code is offered as-is without any warranty either expressed or
implied; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE!
User assumes all risk.
In no event shall IEEE or any contributor to this code be liable for
any damages or losses, including, but not limited to, incidental,
consequential, or any other damages, resulting from the use or misuse
of any information contained here.
All comments are the opinions of their respective authors and are not
necessarily endorsed by the IEEE.
This work is distributed under the LaTeX Project Public License (LPPL)
( http://www.latex-project.org/ ) version 1.3, and may be freely used,
distributed and modified. A copy of the LPPL, version 1.3, is included
in the base LaTeX documentation of all distributions of LaTeX released
2003/12/01 or later.
Retain all contribution notices and credits.
** Modified files should be clearly indicated as such, including **
** renaming them and changing author support contact information. **
File list of work: IEEEtran_bst_HOWTO.pdf, IEEEtran.bst, IEEEtranS.bst,
IEEEtranSA.bst, IEEEtranN.bst, IEEEtranSN.bst,
IEEEexample.bib, IEEEabrv.bib, IEEEfull.bib
***************************************************************************

125
submit/README Normal file

@@ -0,0 +1,125 @@
March 5, 2007
IEEEtran is the official LaTeX class for authors of the Institute of
Electrical and Electronics Engineers (IEEE) transactions journals and
conferences. The latest version of the IEEEtran package can be found
at CTAN:
http://www.ctan.org/tex-archive/macros/latex/contrib/IEEEtran/
as well as within IEEE's site:
http://www.ieee.org/
For latest news, helpful tips, answers to frequently asked questions,
beta releases and other support, visit the IEEEtran home page at my
website:
http://www.michaelshell.org/tex/ieeetran/
Version 1.7a is a bug fix release that corrects the two column peer
review title page problem. This problem was not present in the 1.6 series.
V1.7 is a significant update over the 1.6 series with many important
changes. For a full list, please read the file changelog.txt. The most
notable changes include:
1. New class option compsoc to support the IEEE Computer Society format.
2. Several commands and environments have been deprecated in favor of
replacements with IEEE prefixes to better avoid potential future name
clashes with other packages. Legacy code retained to allow the use of
the obsolete forms (for now), but with a warning message to the console
during compilation:
\IEEEauthorblockA, \IEEEauthorblockN, \IEEEauthorrefmark,
\IEEEbiography, \IEEEbiographynophoto, \IEEEkeywords, \IEEEPARstart,
\IEEEproof, \IEEEpubid, \IEEEpubidadjcol, \IEEEQED, \IEEEQEDclosed,
\IEEEQEDopen, \IEEEspecialpapernotice. IEEEtran.cls now redefines
\proof in a way to avoid problems with the amsthm.sty package.
For IED lists:
\IEEEiedlabeljustifyc, \IEEEiedlabeljustifyl, \IEEEiedlabeljustifyr,
\IEEEnocalcleftmargin, \IEEElabelindent, \IEEEsetlabelwidth,
\IEEEusemathlabelsep
These commands/lengths now require the IEEE prefix and do not have
legacy support: \IEEEnormaljot.
For IED lists: \ifIEEEnocalcleftmargin, \ifIEEEnolabelindentfactor,
\IEEEiedlistdecl, \IEEElabelindentfactor
3. New \CLASSINPUT, \CLASSOPTION and \CLASSINFO interface allows for more
user control and conditional compilation.
4. Several bug fixes and improved compatibility with other packages.
A note to those who create classes derived from IEEEtran.cls: Consider the
use of patch code, either in an example .tex file or as a .sty file,
rather than creating a new class. The IEEEtran.cls CLASSINPUT interface
allows IEEEtran.cls to be fully programmable with respect to document
margins, so there is no need for new class files just for altered margins.
In this way, authors can benefit from updates to IEEEtran.cls and the need
to maintain derivative classes and backport later IEEEtran.cls revisions
thereto is avoided. As always, developers who create classes derived from
IEEEtran.cls should use a different name for the derived class, so that it
cannot be confused with the official/base version here, as well as provide
authors with technical support for the derived class. It is generally a bad
idea to produce a new class that is not going to be maintained.
Best wishes for all your publication endeavors,
Michael Shell
http://www.michaelshell.org/
********************************** Files **********************************
README - This file.
IEEEtran.cls - The IEEEtran LaTeX class file.
changelog.txt - The revision history.
IEEEtran_HOWTO.pdf - The IEEEtran LaTeX class user manual.
bare_conf.tex - A bare bones starter file for conference papers.
bare_jrnl.tex - A bare bones starter file for journal papers.
bare_jrnl_compsoc.tex - A bare bones starter file for Computer Society
journal papers.
bare_adv.tex - A bare bones starter file showing advanced
techniques such as conditional compilation,
hyperlinks, PDF thumbnails, etc. The illustrated
format is for a Computer Society journal paper.
***************************************************************************
Legal Notice:
This code is offered as-is without any warranty either expressed or
implied; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE!
User assumes all risk.
In no event shall IEEE or any contributor to this code be liable for
any damages or losses, including, but not limited to, incidental,
consequential, or any other damages, resulting from the use or misuse
of any information contained here.
All comments are the opinions of their respective authors and are not
necessarily endorsed by the IEEE.
This work is distributed under the LaTeX Project Public License (LPPL)
( http://www.latex-project.org/ ) version 1.3, and may be freely used,
distributed and modified. A copy of the LPPL, version 1.3, is included
in the base LaTeX documentation of all distributions of LaTeX released
2003/12/01 or later.
Retain all contribution notices and credits.
** Modified files should be clearly indicated as such, including **
** renaming them and changing author support contact information. **
File list of work: IEEEtran.cls, IEEEtran_HOWTO.pdf, bare_adv.tex,
bare_conf.tex, bare_jrnl.tex, bare_jrnl_compsoc.tex
***************************************************************************

submit/abstract.tex Normal file

@ -0,0 +1,20 @@
Currently, open source projects receive various kinds of issues daily
because of the extreme openness of the Issue Tracking System (ITS) in GitHub.
Categorizing these issues is a labor-intensive and time-consuming task for project managers.
%,and utilizing automatic approach to predict category of issue reports can effectively reduce management costs.
%Many researches have worked on this problem in ITS like Bugzilla, ITracker, etc,
%and adding structured information of ITS to classifier is a common method to improve performance of classification model.
However, a contributor is only required to provide a short textual abstract to report an issue in GitHub.
Thus, most traditional classification approaches, which are based on detailed and structured data
(\eg priority, severity, software version, and so on), are difficult to adopt.
%However, nearly all research works on few projects, and we don't know how these methods perform for most projects.
%Otherwise, for ITS of GitHub, these structured information is hard to collect, which makes the method useless.
In this paper, we investigate issue classification approaches on a large-scale dataset
comprising 80 popular projects and over 252,000 issue reports collected from GitHub.
First, we examine four traditional text-based classification methods and discuss their performance.
Through quantitative and qualitative study, we then find that semantic perplexity (\ie the degree to which an issue's
description mixes bug-related sentences with
nonbug-related sentences) is a crucial factor that affects classification performance.
Finally, we design a two-stage classifier framework based on novel semantic perplexity metrics of issue reports.
Results show that our two-stage classification
can significantly improve issue classification performance.

submit/background.tex Normal file

@ -0,0 +1,300 @@
\subsection{Issue Tracking System}
\label{ITS_T}
Software development generally produces programs with two caveats~\cite{bissyande2013got}:
(1) they are often incomplete with respect to certain features,
and (2) they are usually buggy.
In the development process, developers mostly code and test the programs,
while end-users (not excluding developers) use the programs and provide feedback.
Both developers and end-users can submit issues to ITS
when the software performance does not meet their expectations.
%And submitting issues to ITS is one of the most important contribution way for OSS participants~\cite{jalbert2008automated}.
%Good management of developing tasks is of great benefit to improve develop efficiency.
%During the process of issue management,
%the core team needs to make textual information of the issue clear,
Then, the core team needs to clearly understand the intentions of the contributors,
distinguish categories, and find suitable developers to fix the corresponding issues
at the stage of project maintenance.
% For core team of a project,
% they need to tell what kind of tasks they have and decide who should fix them.
% They are supposed to communicate with other developers and coordinate the work among them.
Using an ITS is a common way to organize and maintain
development tasks in open source practice~\cite{zimmermann2009improving};
it helps project managers keep track of issue reports by monitoring progress,
identifying new issues, discussing potential solutions for fixing bugs, and so on.
The consistent utilization of an ITS is considered a
``hallmark of a good software team''~\cite{spolsky2010painless} in open source communities.
% Using ITS is of great benefit to the management of developing tasks.
\begin{figure}[!htb]
\centering
% \includegraphics[width=8.5cm]{classprocess}
\includegraphics[width=7.5cm]{figure/workflow_bugzilla}
\caption{Management workflow of a traditional ITS}
\label{figure:lifecycle_old}
\end{figure}%picture
Dozens of ITS tools,
\eg \emph{Bugzilla} and \emph{ITracker},
have been popularized with the development of OSS.
%In the development of ITS, many ITS appear, such as Bugzilla and ITracker~\cite{serrano2005bugzilla}.
% Add the usage of traditional management tools
These traditional tools adopt a rigid and complicated data structure
for organizing issues, with fields such as category, priority, assignee, and status.
%These tools provide well organized task management with
%structured information (\eg category, priority, assignee, etc) and in this paper,
%ITS like these are called traditional ITS.
Figure~\ref{figure:lifecycle_old} shows the common workflow.
First, when contributors find bugs in the software,
they are usually asked to provide a basic description of the bug
and to complete as many of the structured fields as they can.
Second, core team members discuss with the contributors to clearly understand the problems.
Thereafter, the structured information is corrected: suitable categories are distinguished,
priorities are determined, plans are made to ensure progress, and the issue is located
(indicating the product, component, and version of the software where the issue appeared).
Finally, based on all the structured information, the corresponding developers are assigned to fix the bugs.
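This workflow can be sketched as a small state machine. The status names and transitions below are purely illustrative, loosely modeled on Bugzilla-style lifecycles; they are not taken from any particular tool or from the paper's dataset:

```python
# Illustrative sketch of a traditional ITS workflow as a state machine.
# Status names and transitions are hypothetical (Bugzilla-style).
ALLOWED = {
    "NEW":      {"TRIAGED"},        # report filed with description + fields
    "TRIAGED":  {"ASSIGNED"},       # category/priority/location corrected
    "ASSIGNED": {"RESOLVED"},       # developer assigned and fixing
    "RESOLVED": {"CLOSED", "NEW"},  # verified and closed, or reopened
}

def advance(status, next_status):
    """Move an issue to next_status, enforcing the workflow order."""
    if next_status not in ALLOWED.get(status, set()):
        raise ValueError(f"illegal transition {status} -> {next_status}")
    return next_status

# A bug report travels the whole pipeline:
s = "NEW"
for step in ("TRIAGED", "ASSIGNED", "RESOLVED", "CLOSED"):
    s = advance(s, step)
```

The point of the rigid structure is visible here: every step is gated, so an issue cannot reach a developer before triage has filled in the structured fields.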
%These ITSs provide many comprehensive but complex structured information compared with their predecessor,
%mailing-list and spreadsheet~\cite{serrano2005bugzilla}, and these structured information is great benefit to manage and retrieve issues.
%Through the long development cycle, projects have collected many informationally complete issue reports and modification history,
%and these issues contain many meaningful informations for managers and researchers.
% Through the whole life cycle, the issue tracking system collect many informationally complete issue reports and modification history of them, which contain many meaningful informations for managers and researchers.
% \begin{figure}[!htb]
% \centering
% % \includegraphics[width=8.5cm]{classprocess}
% \includegraphics[width=8.5cm]{figure/workflow}
% \caption{Overview of issue tracking system workflow}
% \label{figure:lifecycle}
% \end{figure}%picture
Several academic studies focus on ITS
to free the managers from some cumbersome and repetitive work,
%improve the efficiency of project maintenance,
%because of the key role that ITS plays in the development of project,
such as automatically classifying issue reports as bug-prone or nonbug-prone~\cite{antoniol2008bug,herzig2013s,maalej2015bug,zhou2014combining},
bug assignment~\cite{anvik2006should,baysal2009bug},
duplicate issue detection~\cite{wang2008approach,sun2010discriminative},
fixing time prediction~\cite{weiss2007long}, and so on.
%All those researches aim to free the issue managers from some cumbersome and repetitive work.
Most existing approaches depend heavily on structured bug data (\eg priority and severity).
%Otherwise, some transitional ITSs force contributors to apply much information to cut the workload of core team.
However, prior work~\cite{antoniol2008bug,herzig2013s,zhou2014combining}
has shown that OSS contributors often omit important information or leave it at default values,
which results in much wrong or missing information in the ITS.
%which defeats the purpose of ITS.
Thus, improving the efficiency of ITS services is
becoming an important research topic~\cite{zimmermann2009improving}.
%It also leads some research on the construction of improving ITS~\cite{just2008towards,zimmermann2009improving}.
% Developers use unstructured free text to describe issues and use structured information like category, priority, etc, to manage ITS.
% Through these structured information, manager can decide when and who should to fix them and record whether it appeared before.
% In the development of ITS, there are appearing many ITS such as Bugzilla and ITracker \cite{serrano2005bugzilla}.
% These ITSs have many comprehensive and complex features compared with their predecessor, mail-list and spreadsheet \cite{serrano2005bugzilla}.
% These features, such as category, priority, status, etc, are designed for recording developing activity as much as possible, and it works effectively compared with mail-list.
% However, some recently researches \cite{antoniol2008bug,herzig2013s,kim2011dealing,zhou2014combining} observed that a considerable amount of issue reports in ITS marked as defective actually never had a bug.
% Too many default values are selected for options of issue reports, which make most function of issues meaningless.
% Reports with wrong information bring trouble to managing projects, and also hamper the researches on ITS.
% The ITS data with correct information become the key to make a successful analysis.
% To make sure that developer have provided correct information, it cost much time and manpower for manager.
% And this situation seems to be more intense with the rapid growth of the number of issues.
% Manual classification can help reduce the misclassification but the rapid growth of the number of issues make it an impractical option.
% Therefore, automatically classifying issue report would be very useful.
\subsection{Lightweight ITS in GitHub}
\label{ITS_GH}
%GitHub is a social coding OSS community, which allows contributors of GitHub to communicate with each other by social media tools.
%Social media tools play an increasingly important role in software engineering research and practice~\cite{storey2010impact}.
%And the usage of social media drive a rich set of inferences around commitment, work quality,
%community significance and personal relevance~\cite{dabbish2012social},
%which helps GitHub become the largest platform of providing web-based repository hosting service.
\begin{figure}[!htb]
\centering
% \includegraphics[width=8.5cm]{classprocess}
\includegraphics[width=7.5cm]{figure/workflow_git}
\caption{Management workflow in the GitHub ITS}
\label{figure:lifecycle}
\end{figure}%picture
GitHub, the largest social coding community,
released its own ITS, called \emph{Issues 2.0}\footnote{\url{https://github.com/blog/831-issues-2-0-the-next-generation}},
in 2014 to provide a better service for reporting issues.
Figure~\ref{figure:lifecycle} summarizes the typical workflow of issue management in GitHub.
First, the contributor submits an issue report and provides some textual summary to describe it.
Second, the core team of the project discusses the issue with the contributor.
During the discussion, the core team needs to reach an agreement about the issue with the contributor
and select relevant labels for the issue from the pre-defined labels.
Finally, the core team assigns the issue to be fixed by the corresponding developer.
Compared with traditional ITS, GitHub provides a more lightweight ITS
that is flexibly integrated with its label system.
The structured information of issues,
such as category and priority,
is substituted by the label system in GitHub.
Contributors are only required to provide a short textual abstract when submitting issues,
whereas the core team can use labels, besides milestones and assignees,
to mark and manage issue reports.
The label system in GitHub is customizable, and the core team can summarize the information that concerns them most as labels.
% These predefining labels may contain the information of category, status, component, which is concerned most for the core team.
% there are only labels can be used to manage issue reports.
% It is hard to judge whether this change is benefit or not for projects,
Zach Holman, a GitHub engineer, describes his design as follows: \textit{``Our goal is basically to make a flexible, simple product that everyone can enjoy.''} \textit{``In the meantime, we've found that using labels and milestones is a great way to achieve the same result (other functions) in a more flexible system.''}
% The flexible ITS do not force users to provide structured information (e.g., category, priority, etc.) when users submit an issue report.
% Users just need to focus on what they want others know (title and description of issue report), and this flexible feature can reduce the fault of default values.
The lightweight design of the ITS brings changes for both the contributors and the core team.
For contributors, Issues 2.0 reduces the cost of submitting issue reports and stimulates their enthusiasm.
However, its openness also results in the emergence of undesired issues in the ITS.
The loose constraints reduce the work of contributors,
but that work is transferred to the core team as a management task.
Thus, an automatic approach that can effectively identify useful issue reports is significant and urgent for issue management.
For the core team, the flexible label system makes the management task customizable and configurable.
The flexibility breaks the fixed form of management,
and the core team can shape their management style according to their requirements.
However, this customization lacks mandatory enforcement,
so structured information is often omitted.
Moreover, it leads to differences in usage among projects,
which makes the management practices of different projects difficult to understand.
% But the core team \hl{is more likely to forget to add some labels} because of loose constraints,
% which aggravates the omitting of structured information.
% contributors easy to commit issue reports.
% Contributor do not need to provide much information, and transfer
% which correspondingly transfers the task to managers.
% In addition, the usability of this design let contributors commit issue reports more easily, while this results in committing information more casually than before.
% And this situation indirectly increased the management task.
% Many projects give up maintaining work of some information such as priority, component, because of too onerous management task.
% Missed information is also very important, but the limitation of manpower and time makes it impossible to keep information correct and accuracy.
% So import automatic technique to help management is great urgent.
% This makes it more meaningful to introduce automated techniques, e.g., automatic classification.
% Considering the differences between GitHub and traditional ITS, we propose the first RQ:
% Are machine learning methods applicable to the ITS of GitHub, and if so, which performs best?
We performed a study on a large-scale dataset to build a common and effective text-based classification approach for GitHub projects.
This paper focuses on classifying issue reports as bug-prone or nonbug-prone because of the dominance of bugs in ITS.
We expect to build on the achievements of former research on traditional ITS, so we ask:
% Thinking of the different between ITS of GitHub and traditional ITS, we propose the first research question:
\textbf{RQ1: \textit{Do traditional text-based classification approaches still work on the ITS of GitHub?
Which classifier performs best?}}
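As a concrete illustration of the kind of classifier RQ1 evaluates, the sketch below trains a minimal bag-of-words Naive Bayes on invented issue titles. The training data and vocabulary are made up for illustration; a real experiment would use a mature library (e.g., scikit-learn) on the full dataset:

```python
import math
from collections import Counter, defaultdict

# Toy multinomial Naive Bayes over issue titles (bag of words).
# The titles below are invented examples, not dataset entries.
train = [
    ("crash when saving file", "bug"),
    ("null pointer exception on startup", "bug"),
    ("segfault after clicking button", "bug"),
    ("add dark mode support", "nonbug"),
    ("please support export to csv", "nonbug"),
    ("feature request: keyboard shortcuts", "nonbug"),
]

def fit(data):
    """Return per-class word counts, token totals, priors, and vocabulary."""
    words, totals, priors = defaultdict(Counter), Counter(), Counter()
    for text, label in data:
        toks = text.split()
        words[label].update(toks)
        totals[label] += len(toks)
        priors[label] += 1
    vocab = {t for text, _ in data for t in text.split()}
    return words, totals, priors, vocab

def predict(model, text):
    """Pick the class with the highest log-posterior (Laplace smoothing)."""
    words, totals, priors, vocab = model
    n = sum(priors.values())
    return max(
        priors,
        key=lambda c: math.log(priors[c] / n) + sum(
            math.log((words[c][t] + 1) / (totals[c] + len(vocab)))
            for t in text.split()
        ),
    )

model = fit(train)
```

With such a model, `predict(model, "crash on startup")` leans toward the bug class because those tokens appear only in bug-labeled titles.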
In prior research~\cite{zhou2014combining,merten2016software},
combining textual and structured information (\eg priority, assignee, and so on)
is routinely used to improve classifier performance.
However, as previously discussed, the omission of structured information is serious in the ITS of GitHub.
The data that can be used to build a classifier are limited because of this scarcity of structured information.
Thus, only the textual summary and the historical data of submitters can be used for the majority of issues.
Facing this challenge, we expect to study the factors that may influence the performance of classifiers, so we ask:
\textbf{RQ2: \textit{What factors influence the performances of text-based classification approaches?}}
Regression analysis can identify the factors that influence the performance of text-based classifiers,
which would guide us in extracting additional features from the textual summary.
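As a minimal sketch of this analysis, assume we already have one candidate factor per issue (description length, an illustrative choice) and a 0/1 flag recording whether the classifier was right; a simple correlation then stands in for the full regression. All numbers below are invented toy data, not results from the paper:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between a candidate factor and
    a 0/1 'classified correctly' indicator."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented toy data: description length (words) vs. whether the
# text-based classifier predicted the issue's category correctly.
lengths = [5, 8, 12, 20, 35, 50, 80, 120]
correct = [0, 0, 0, 1, 1, 1, 1, 1]

r = pearson(lengths, correct)  # positive: longer descriptions help here
```

A factor with a strong correlation of this kind is a candidate for a new feature in the classifier; a real study would fit a multivariate regression over many such factors at once.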
In this paper, we build a two-stage classifier framework
that can flexibly combine textual information with other types of features.
To evaluate our approach, we ask:
\textbf{RQ3: \textit{How can the classification performance be improved
by integrating different types of features,
especially the semantic perplexity metrics extracted from textual descriptions?}}
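A sketch of the two-stage decision rule: stage one is any probabilistic text classifier, and only issues whose predicted probability is ambiguous fall through to a second, feature-based stage. The threshold values and the way the perplexity score is used here are illustrative assumptions, not the paper's actual parameters:

```python
def two_stage(p_text, perplexity, threshold=0.25, perp_cutoff=0.5):
    """Two-stage decision: trust the text classifier when it is
    confident; otherwise fall back to an extra feature.

    p_text      -- stage-1 probability that the issue is bug-prone
    perplexity  -- illustrative semantic-perplexity score in [0, 1]
    threshold   -- how far from 0.5 p_text must be to count as confident
    perp_cutoff -- stage-2 decision boundary (hypothetical)
    """
    if abs(p_text - 0.5) >= threshold:  # stage 1: confident prediction
        return "bug" if p_text >= 0.5 else "nonbug"
    # Stage 2: the text alone is ambiguous; decide from the extra feature.
    return "nonbug" if perplexity >= perp_cutoff else "bug"
```

The design choice is that the second stage only ever touches the hard cases, so a cheap text classifier handles the bulk of issues while additional features are reserved for the ambiguous ones.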
% Research on combining features has proposed using structured information to improve classification accuracy, but for GitHub such information is too scarce. The only information we can use is the text and the submitters' historical activity, so we hope to find factors in the text that influence classification.
% This flexibility, however, comes at some cost in managing the project.
% On the one hand, few users provide extension information which is important to managers of the project.
% Extensive loss of extension information cause more manpower needed to maintain ITS.
% So automatic issue reports classification is of great signification in GitHub.
% On the other hand, issue reporters may categorize the issues with typographical mistakes, or using various idiosyncrasies \cite{bissyande2013got}.
% Many labels are used to distinguish an issue report as a bug or feature, such as bug, defect, feature, feature request, etc.
% And it is hard to propose a unified method to distinguish these labels because of this complex situation.
% Introduce the issue system in GitHub and related work
%\subsection{Relative Research in Automatically Classifying Issue Reports}
\subsection{Related Work}
Many studies have investigated bug classification \cite{antoniol2008bug,herzig2013s,maalej2015bug,zhou2014combining}
to predict whether an issue is about a bug.
Antoniol et al.~\cite{antoniol2008bug} investigated the automatic classification of issue reports
using conventional text mining techniques on the description part of issue reports.
By extracting the textual parts of issue reports (title, description, and discussions) from the ITS of three case projects
and building classifiers with three supervised MLTs
(alternating decision trees, Naive Bayes classifiers, and logistic regression),
they showed that the linguistic information in ITS is sufficient (82\% best precision over the three case projects) to automatically
distinguish bugs from other activities.
Zhou et al.~\cite{zhou2014combining} considered the misclassification of issue reports in ITS
and proposed a hybrid approach that combines text and data mining techniques.
By taking advantage of structured information together with textual information,
they achieved an excellent result (84.7 on average for five case projects).
A common approach adds information extracted from the ITS~\cite{antoniol2008bug,zhou2014combining,merten2016software}
to improve the performance of the model.
In~\cite{antoniol2008bug}, discussions are involved; in~\cite{zhou2014combining}, structural information, such as severity, priority, and component, is utilized;
and in~\cite{merten2016software}, metadata extracted from the ITS are shown to improve classifiers.
However, these kinds of data are not produced when the issue report is submitted,
and obtaining structured information in GitHub is difficult because of omissions.
The textual summary is therefore the main type of information used to build the classification model.
% Concerning these, we propose an approach to build classification model only extracting title and description of issue report.
% Moreover, mining feature requests from textual summary of user feedbacks is also what researchers are concerned about.
% Walid et al.~\cite{maalej2015bug} utilize MLT on feedback from app store.
% They combine metadata (i.e., star rating, tense, etc.) with natural language processing, and classify app reviews into four types: bug reports, feature requests, user experiences, and ratings.
% Finally, the classification precision got between 70-95\% while the recall between 80-90\%, and they found that multiple binary classifiers outperformed single multiclass classifier.
% Thorsten et al.~\cite{merten2016software} investigates natural language processing and machine learning features to detect software feature requests in natural language data of issue tracking systems.
% They compare traditional linguistic machine learning features, such as “bag of words”, with more advanced features, such as subject-action-object.
% And they find that request can be detected best out of the researched SFR parts.
% Researches introduced before have utilized ML techniques
% to automatically classify user feedbacks in dataset like Bugzilla and user comments in app store.
% Thinking about the huge difference between ITS of GitHub and other dataset,
% there are no one has studied whether these ML techniques are worked on the ITS of GitHub.
% So we further propose the first question:
% \textbf{RQ1: \textit{Is traditional ML techniques also applicable in ITS of GitHub?
% Which ML technique performs best for classifying issue report for most projects in GitHub?}}
% %Different issues are not same for machine learning method to classify.
% The performance of ML techniques are inconsistent among
% the issue reports with different features.
% (e.g., the length of description).
% Some issues are easy to distinguish the category of them,
% but others are hard to tell.
% Instead of directly giving the answer about bug or not a bug,
% Zhou et al.~\cite{zhou2014combining} classify the issues into
% three levels of likelihood of being a corrective bug.
% Actually, for many ML techniques,
% they can output probability of classification results.
% The probability reflects
% how we can trust the prediction results generated by a classification model.
% Then we ask:
% \textbf{RQ2: \textit{What factors affect an issue report to be right classified?}}
% After manually analyzing the patterns why some issues are difficult to automatically classify,
% a nature idea come out that can we utilize above patterns
% to improving accuracy of our classification models.
% So we ask:
% \textbf{RQ3: \textit{If patterns for these issues are exist,
% can we take advantage of it to improve performance of classification model?}}
%\yy{to be determined}
% Through topic model, we can aggregate issues according to topic of issues or the component that issues are involved into, etc. From topics of issues, the distribution of different categories of issues may be imbalanced, which means that there are more bugs appear in a topic and more features appear in another topic. if so, we can use the imbalanced distribution of issues to correct the classification model and improving the performance of classification model. So, we have the last question:
% \textbf{RQ4: \textit{Is precision of issues associate with topics of them?}}

submit/bare_conf.pdf Normal file

Binary file not shown.

submit/bare_conf.tex Normal file

@ -0,0 +1,620 @@
%% bare_conf.tex
%% V1.3
%% 2007/01/11
%% by Michael Shell
%% See:
%% http://www.michaelshell.org/
%% for current contact information.
%%
%% This is a skeleton file demonstrating the use of IEEEtran.cls
%% (requires IEEEtran.cls version 1.7 or later) with an IEEE conference paper.
%%
%% Support sites:
%% http://www.michaelshell.org/tex/ieeetran/
%% http://www.ctan.org/tex-archive/macros/latex/contrib/IEEEtran/
%% and
%% http://www.ieee.org/
%%*************************************************************************
%% Legal Notice:
%% This code is offered as-is without any warranty either expressed or
%% implied; without even the implied warranty of MERCHANTABILITY or
%% FITNESS FOR A PARTICULAR PURPOSE!
%% User assumes all risk.
%% In no event shall IEEE or any contributor to this code be liable for
%% any damages or losses, including, but not limited to, incidental,
%% consequential, or any other damages, resulting from the use or misuse
%% of any information contained here.
%%
%% All comments are the opinions of their respective authors and are not
%% necessarily endorsed by the IEEE.
%%
%% This work is distributed under the LaTeX Project Public License (LPPL)
%% ( http://www.latex-project.org/ ) version 1.3, and may be freely used,
%% distributed and modified. A copy of the LPPL, version 1.3, is included
%% in the base LaTeX documentation of all distributions of LaTeX released
%% 2003/12/01 or later.
%% Retain all contribution notices and credits.
%% ** Modified files should be clearly indicated as such, including **
%% ** renaming them and changing author support contact information. **
%%
%% File list of work: IEEEtran.cls, IEEEtran_HOWTO.pdf, bare_adv.tex,
%% bare_conf.tex, bare_jrnl.tex, bare_jrnl_compsoc.tex
%%*************************************************************************
% *** Authors should verify (and, if needed, correct) their LaTeX system ***
% *** with the testflow diagnostic prior to trusting their LaTeX platform ***
% *** with production work. IEEE's font choices can trigger bugs that do ***
% *** not appear when using other class files. ***
% The testflow support page is at:
% http://www.michaelshell.org/tex/testflow/
% Note that the a4paper option is mainly intended so that authors in
% countries using A4 can easily print to A4 and see how their papers will
% look in print - the typesetting of the document will not typically be
% affected with changes in paper size (but the bottom and side margins will).
% Use the testflow package mentioned above to verify correct handling of
% both paper sizes by the user's LaTeX system.
%
% Also note that the "draftcls" or "draftclsnofoot", not "draft", option
% should be used if it is desired that the figures are to be displayed in
% draft mode.
%
\documentclass[10pt, conference, compsocconf]{IEEEtran}
% Add the compsocconf option for Computer Society conferences.
%
% If IEEEtran.cls has not been installed into the LaTeX system files,
% manually specify the path to it like:
% \documentclass[conference]{../sty/IEEEtran}
% Some very useful LaTeX packages include:
% (uncomment the ones you want to load)
% *** MISC UTILITY PACKAGES ***
%
%\usepackage{ifpdf}
% Heiko Oberdiek's ifpdf.sty is very useful if you need conditional
% compilation based on whether the output is pdf or dvi.
% usage:
% \ifpdf
% % pdf code
% \else
% % dvi code
% \fi
% The latest version of ifpdf.sty can be obtained from:
% http://www.ctan.org/tex-archive/macros/latex/contrib/oberdiek/
% Also, note that IEEEtran.cls V1.7 and later provides a builtin
% \ifCLASSINFOpdf conditional that works the same way.
% When switching from latex to pdflatex and vice-versa, the compiler may
% have to be run twice to clear warning/error messages.
% *** CITATION PACKAGES ***
%
%\usepackage{cite}
% cite.sty was written by Donald Arseneau
% V1.6 and later of IEEEtran pre-defines the format of the cite.sty package
% \cite{} output to follow that of IEEE. Loading the cite package will
% result in citation numbers being automatically sorted and properly
% "compressed/ranged". e.g., [1], [9], [2], [7], [5], [6] without using
% cite.sty will become [1], [2], [5]--[7], [9] using cite.sty. cite.sty's
% \cite will automatically add leading space, if needed. Use cite.sty's
% noadjust option (cite.sty V3.8 and later) if you want to turn this off.
% cite.sty is already installed on most LaTeX systems. Be sure and use
% version 4.0 (2003-05-27) and later if using hyperref.sty. cite.sty does
% not currently provide for hyperlinked citations.
% The latest version can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/cite/
% The documentation is contained in the cite.sty file itself.
% *** GRAPHICS RELATED PACKAGES ***
%
\ifCLASSINFOpdf
% \usepackage[pdftex]{graphicx}
% declare the path(s) where your graphic files are
% \graphicspath{{../pdf/}{../jpeg/}}
% and their extensions so you won't have to specify these with
% every instance of \includegraphics
% \DeclareGraphicsExtensions{.pdf,.jpeg,.png}
\else
% or other class option (dvipsone, dvipdf, if not using dvips). graphicx
% will default to the driver specified in the system graphics.cfg if no
% driver is specified.
% \usepackage[dvips]{graphicx}
% declare the path(s) where your graphic files are
% \graphicspath{{../eps/}}
% and their extensions so you won't have to specify these with
% every instance of \includegraphics
% \DeclareGraphicsExtensions{.eps}
\fi
% graphicx was written by David Carlisle and Sebastian Rahtz. It is
% required if you want graphics, photos, etc. graphicx.sty is already
% installed on most LaTeX systems. The latest version and documentation can
% be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/required/graphics/
% Another good source of documentation is "Using Imported Graphics in
% LaTeX2e" by Keith Reckdahl which can be found as epslatex.ps or
% epslatex.pdf at: http://www.ctan.org/tex-archive/info/
%
% latex, and pdflatex in dvi mode, support graphics in encapsulated
% postscript (.eps) format. pdflatex in pdf mode supports graphics
% in .pdf, .jpeg, .png and .mps (metapost) formats. Users should ensure
% that all non-photo figures use a vector format (.eps, .pdf, .mps) and
% not a bitmapped formats (.jpeg, .png). IEEE frowns on bitmapped formats
% which can result in "jaggedy"/blurry rendering of lines and letters as
% well as large increases in file sizes.
%
% You can find documentation about the pdfTeX application at:
% http://www.tug.org/applications/pdftex
% *** MATH PACKAGES ***
%
%\usepackage[cmex10]{amsmath}
% A popular package from the American Mathematical Society that provides
% many useful and powerful commands for dealing with mathematics. If using
% it, be sure to load this package with the cmex10 option to ensure that
% only type 1 fonts will utilized at all point sizes. Without this option,
% it is possible that some math symbols, particularly those within
% footnotes, will be rendered in bitmap form which will result in a
% document that can not be IEEE Xplore compliant!
%
% Also, note that the amsmath package sets \interdisplaylinepenalty to 10000
% thus preventing page breaks from occurring within multiline equations. Use:
%\interdisplaylinepenalty=2500
% after loading amsmath to restore such page breaks as IEEEtran.cls normally
% does. amsmath.sty is already installed on most LaTeX systems. The latest
% version and documentation can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/required/amslatex/math/
% *** SPECIALIZED LIST PACKAGES ***
%
%\usepackage{algorithmic}
% algorithmic.sty was written by Peter Williams and Rogerio Brito.
% This package provides an algorithmic environment fo describing algorithms.
% You can use the algorithmic environment in-text or within a figure
% environment to provide for a floating algorithm. Do NOT use the algorithm
% floating environment provided by algorithm.sty (by the same authors) or
% algorithm2e.sty (by Christophe Fiorio) as IEEE does not use dedicated
% algorithm float types and packages that provide these will not provide
% correct IEEE style captions. The latest version and documentation of
% algorithmic.sty can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/algorithms/
% There is also a support site at:
% http://algorithms.berlios.de/index.html
% Also of interest may be the (relatively newer and more customizable)
% algorithmicx.sty package by Szasz Janos:
% http://www.ctan.org/tex-archive/macros/latex/contrib/algorithmicx/
% *** ALIGNMENT PACKAGES ***
%
%\usepackage{array}
% Frank Mittelbach's and David Carlisle's array.sty patches and improves
% the standard LaTeX2e array and tabular environments to provide better
% appearance and additional user controls. As the default LaTeX2e table
% generation code is lacking to the point of almost being broken with
% respect to the quality of the end results, all users are strongly
% advised to use an enhanced (at the very least that provided by array.sty)
% set of table tools. array.sty is already installed on most systems. The
% latest version and documentation can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/required/tools/
%\usepackage{mdwmath}
%\usepackage{mdwtab}
% Also highly recommended is Mark Wooding's extremely powerful MDW tools,
% especially mdwmath.sty and mdwtab.sty which are used to format equations
% and tables, respectively. The MDWtools set is already installed on most
% LaTeX systems. The latest version and documentation is available at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/mdwtools/
% IEEEtran contains the IEEEeqnarray family of commands that can be used to
% generate multiline equations as well as matrices, tables, etc., of high
% quality.
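% As a hedged sketch of the IEEEeqnarray family mentioned above (the
% symbols x, y, z here are placeholders, not from this paper), a
% multiline equation aligned on the relation column looks like:
%\begin{IEEEeqnarray}{rCl}
%z & = & x + y \nonumber\\
%  & = & 2x. \label{eq_example}
%\end{IEEEeqnarray}
% The {rCl} column specification right-aligns the left-hand side,
% centers the relation, and left-aligns the right-hand side.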
%\usepackage{eqparbox}
% Also of notable interest is Scott Pakin's eqparbox package for creating
% (automatically sized) equal width boxes - aka "natural width parboxes".
% Available at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/eqparbox/
% *** SUBFIGURE PACKAGES ***
%\usepackage[tight,footnotesize]{subfigure}
% subfigure.sty was written by Steven Douglas Cochran. This package makes it
% easy to put subfigures in your figures. e.g., "Figure 1a and 1b". For IEEE
% work, it is a good idea to load it with the tight package option to reduce
% the amount of white space around the subfigures. subfigure.sty is already
% installed on most LaTeX systems. The latest version and documentation can
% be obtained at:
% http://www.ctan.org/tex-archive/obsolete/macros/latex/contrib/subfigure/
% subfigure.sty has been superseded by subfig.sty.
%\usepackage[caption=false]{caption}
%\usepackage[font=footnotesize]{subfig}
% subfig.sty, also written by Steven Douglas Cochran, is the modern
% replacement for subfigure.sty. However, subfig.sty requires and
% automatically loads Axel Sommerfeldt's caption.sty which will override
% IEEEtran.cls handling of captions and this will result in nonIEEE style
% figure/table captions. To prevent this problem, be sure and preload
% caption.sty with its "caption=false" package option. This will preserve
% IEEEtran.cls handling of captions. Version 1.3 (2005/06/28) and later
% (recommended due to many improvements over 1.2) of subfig.sty supports
% the caption=false option directly:
%\usepackage[caption=false,font=footnotesize]{subfig}
%
% The latest version and documentation can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/subfig/
% The latest version and documentation of caption.sty can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/caption/
% *** FLOAT PACKAGES ***
%
%\usepackage{fixltx2e}
% fixltx2e, the successor to the earlier fix2col.sty, was written by
% Frank Mittelbach and David Carlisle. This package corrects a few problems
% in the LaTeX2e kernel, the most notable of which is that in current
% LaTeX2e releases, the ordering of single and double column floats is not
% guaranteed to be preserved. Thus, an unpatched LaTeX2e can allow a
% single column figure to be placed prior to an earlier double column
% figure. The latest version and documentation can be found at:
% http://www.ctan.org/tex-archive/macros/latex/base/
%\usepackage{stfloats}
% stfloats.sty was written by Sigitas Tolusis. This package gives LaTeX2e
% the ability to do double column floats at the bottom of the page as well
% as the top. (e.g., "\begin{figure*}[!b]" is not normally possible in
% LaTeX2e). It also provides a command:
%\fnbelowfloat
% to enable the placement of footnotes below bottom floats (the standard
% LaTeX2e kernel puts them above bottom floats). This is an invasive package
% which rewrites many portions of the LaTeX2e float routines. It may not work
% with other packages that modify the LaTeX2e float routines. The latest
% version and documentation can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/sttools/
% Documentation is contained in the stfloats.sty comments as well as in the
% presfull.pdf file. Do not use the stfloats baselinefloat ability as IEEE
% does not allow \baselineskip to stretch. Authors submitting work to the
% IEEE should note that IEEE rarely uses double column equations and
% that authors should try to avoid such use. Do not be tempted to use the
% cuted.sty or midfloat.sty packages (also by Sigitas Tolusis) as IEEE does
% not format its papers in such ways.
% *** PDF, URL AND HYPERLINK PACKAGES ***
%
%\usepackage{url}
% url.sty was written by Donald Arseneau. It provides better support for
% handling and breaking URLs. url.sty is already installed on most LaTeX
% systems. The latest version can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/misc/
% Read the url.sty source comments for usage information. Basically,
% \url{my_url_here}.
% *** Do not adjust lengths that control margins, column widths, etc. ***
% *** Do not use packages that alter fonts (such as pslatex). ***
% There should be no need to do such things with IEEEtran.cls V1.6 and later.
% (Unless specifically asked to do so by the journal or conference you plan
% to submit to, of course.)
% correct bad hyphenation here
\hyphenation{op-tical net-works semi-conduc-tor}
\usepackage{subfigure}
\usepackage{graphicx}
\usepackage{flushend}
\usepackage{booktabs}
\usepackage{multirow}
\usepackage{footnote}
\usepackage{framed}
\usepackage{color,soul}
\usepackage{url}
\makeatletter
\g@addto@macro{\UrlBreaks}{\UrlOrds}
\makeatother
\makeatletter
\def\url@leostyle{%
\@ifundefined{selectfont}{\def\UrlFont{\sf}}{\def\UrlFont{\scriptsize\bf\ttfamily}}}
\makeatother
\urlstyle{leo}
% \usepackage{caption}
\usepackage{dcolumn}
\usepackage{xspace}
\usepackage{balance}
\usepackage{bm}
\usepackage{cite}
\newcommand{\ie}{{\emph{i.e.}},\xspace}
\newcommand{\viz}{{\emph{viz.}},\xspace}
\newcommand{\eg}{{\emph{e.g.}},\xspace}
\newcommand{\etc}{etc.\xspace}
\newcommand{\etal}{{\emph{et al.}}}
\newcommand{\GH}{{\sc GitHub}\xspace}
\newcommand{\BB}{{\sc BitBucket}\xspace}
\newcommand\correspondingauthor{\thanks{Corresponding author.}}
\begin{document}
%
% paper title
% can use linebreaks \\ within to get better formatting as desired
\title{Bare Demo of IEEEtran.cls for IEEECS Conferences}
% author names and affiliations
% use a multiple column layout for up to two different
% affiliations
\author{\IEEEauthorblockN{Qiang Fan, Yue Yu\IEEEauthorrefmark{1}\correspondingauthor, Gang Yin, Tao Wang, Huaimin Wang}
\IEEEauthorblockA{National Laboratory for Parallel and Distributed Processing\\
College of Computer, National University of Defense Technology\\
Changsha, China\\
\{fanqiang09, yuyue, yingang, taowang2005, hmwang\}@nudt.edu.cn}
}
% conference papers do not typically use \thanks and this command
% is locked out in conference mode. If really needed, such as for
% the acknowledgment of grants, issue a \IEEEoverridecommandlockouts
% after \documentclass
% for over three affiliations, or if they all won't fit within the width
% of the page, use this alternative format:
%
%\author{\IEEEauthorblockN{Michael Shell\IEEEauthorrefmark{1},
%Homer Simpson\IEEEauthorrefmark{2},
%James Kirk\IEEEauthorrefmark{3},
%Montgomery Scott\IEEEauthorrefmark{3} and
%Eldon Tyrell\IEEEauthorrefmark{4}}
%\IEEEauthorblockA{\IEEEauthorrefmark{1}School of Electrical and Computer Engineering\\
%Georgia Institute of Technology,
%Atlanta, Georgia 30332--0250\\ Email: see http://www.michaelshell.org/contact.html}
%\IEEEauthorblockA{\IEEEauthorrefmark{2}Twentieth Century Fox, Springfield, USA\\
%Email: homer@thesimpsons.com}
%\IEEEauthorblockA{\IEEEauthorrefmark{3}Starfleet Academy, San Francisco, California 96678-2391\\
%Telephone: (800) 555--1212, Fax: (888) 555--1212}
%\IEEEauthorblockA{\IEEEauthorrefmark{4}Tyrell Inc., 123 Replicant Street, Los Angeles, California 90210--4321}}
% use for special paper notices
%\IEEEspecialpapernotice{(Invited Paper)}
% make the title area
\maketitle
\begin{abstract}
\input{abstract}
\end{abstract}
\begin{IEEEkeywords}
issue tracking system; machine learning technique; mining software repositories;
\end{IEEEkeywords}
% For peer review papers, you can put extra information on the cover
% page as needed:
% \ifCLASSOPTIONpeerreview
% \begin{center} \bfseries EDICS Category: 3-BBND \end{center}
% \fi
%
% For peerreview papers, this IEEEtran command inserts a page break and
% creates the second title. It will be ignored for other modes.
\IEEEpeerreviewmaketitle
\section{Introduction}
% no \IEEEPARstart
\input{introduction}
% You must have at least 2 lines in the paragraph with the drop letter
% (should never be an issue)
\section{Background and Related Work}
\input{background}
\section{Dataset}
\input{dataset}
\section{Methods}
\input{data_process}
\section{Results and Discussion}
\input{result}
\section{Threats to validity}
\input{threats}
\section{Conclusions}
\input{conclusion}
% An example of a floating figure using the graphicx package.
% Note that \label must occur AFTER (or within) \caption.
% For figures, \caption should occur after the \includegraphics.
% Note that IEEEtran v1.7 and later has special internal code that
% is designed to preserve the operation of \label within \caption
% even when the captionsoff option is in effect. However, because
% of issues like this, it may be the safest practice to put all your
% \label just after \caption rather than within \caption{}.
%
% Reminder: the "draftcls" or "draftclsnofoot", not "draft", class
% option should be used if it is desired that the figures are to be
% displayed while in draft mode.
%
%\begin{figure}[!t]
%\centering
%\includegraphics[width=2.5in]{myfigure}
% where an .eps filename suffix will be assumed under latex,
% and a .pdf suffix will be assumed for pdflatex; or what has been declared
% via \DeclareGraphicsExtensions.
%\caption{Simulation Results}
%\label{fig_sim}
%\end{figure}
% Note that IEEE typically puts floats only at the top, even when this
% results in a large percentage of a column being occupied by floats.
% An example of a double column floating figure using two subfigures.
% (The subfig.sty package must be loaded for this to work.)
% The subfigure \label commands are set within each subfloat command, the
% \label for the overall figure must come after \caption.
% \hfil must be used as a separator to get equal spacing.
% The subfigure.sty package works much the same way, except \subfigure is
% used instead of \subfloat.
%
%\begin{figure*}[!t]
%\centerline{\subfloat[Case I]{\includegraphics[width=2.5in]{subfigcase1}%
%\label{fig_first_case}}
%\hfil
%\subfloat[Case II]{\includegraphics[width=2.5in]{subfigcase2}%
%\label{fig_second_case}}}
%\caption{Simulation results}
%\label{fig_sim}
%\end{figure*}
%
% Note that often IEEE papers with subfigures do not employ subfigure
% captions (using the optional argument to \subfloat), but instead will
% reference/describe all of them (a), (b), etc., within the main caption.
% An example of a floating table. Note that, for IEEE style tables, the
% \caption command should come BEFORE the table. Table text will default to
% \footnotesize as IEEE normally uses this smaller font for tables.
% The \label must come after \caption as always.
%
%\begin{table}[!t]
%% increase table row spacing, adjust to taste
%\renewcommand{\arraystretch}{1.3}
% if using array.sty, it might be a good idea to tweak the value of
% \extrarowheight as needed to properly center the text within the cells
%\caption{An Example of a Table}
%\label{table_example}
%\centering
%% Some packages, such as MDW tools, offer better commands for making tables
%% than the plain LaTeX2e tabular which is used here.
%\begin{tabular}{|c||c|}
%\hline
%One & Two\\
%\hline
%Three & Four\\
%\hline
%\end{tabular}
%\end{table}
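% Since this preamble loads booktabs, the same example table could also be
% sketched with horizontal rules only (a hedged alternative, not an IEEE
% requirement; the label here is hypothetical):
%\begin{table}[!t]
%\caption{An Example of a Table}
%\label{table_example_booktabs}
%\centering
%\begin{tabular}{cc}
%\toprule
%One & Two\\
%\midrule
%Three & Four\\
%\bottomrule
%\end{tabular}
%\end{table}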
% Note that IEEE does not put floats in the very first column - or typically
% anywhere on the first page for that matter. Also, in-text middle ("here")
% positioning is not used. Most IEEE journals/conferences use top floats
% exclusively. Note that, LaTeX2e, unlike IEEE journals/conferences, places
% footnotes above bottom floats. This can be corrected via the \fnbelowfloat
% command of the stfloats package.
% conference papers do not normally have an appendix
% use section* for acknowledgement
\section*{Acknowledgment}
This research is supported by the National Natural Science Foundation of China (Grant Nos. 61432020, 61472430, 61502512 and 61303064) and the National Key R\&D Program of China (Grant No. 2016YFB1000805).
% trigger a \newpage just before the given reference
% number - used to balance the columns on the last page
% adjust value as needed - may need to be readjusted if
% the document is modified later
%\IEEEtriggeratref{8}
% The "triggered" command can be changed if desired:
%\IEEEtriggercmd{\enlargethispage{-5in}}
% references section
% can use a bibliography generated by BibTeX as a .bbl file
% BibTeX documentation can be easily obtained at:
% http://www.ctan.org/tex-archive/biblio/bibtex/contrib/doc/
% The IEEEtran BibTeX style support page is at:
% http://www.michaelshell.org/tex/ieeetran/bibtex/
%\bibliographystyle{IEEEtran}
% argument is your BibTeX string definitions and bibliography database(s)
%\bibliography{IEEEabrv,../bib/paper}
%
% <OR> manually copy in the resultant .bbl file
% set second argument of \begin to the number of references
% (used to reserve space for the reference number labels box)
% that's all folks
\end{document}
Binary file not shown.

@@ -0,0 +1,584 @@
%% bare_conf.tex
%% V1.3
%% 2007/01/11
%% by Michael Shell
%% See:
%% http://www.michaelshell.org/
%% for current contact information.
%%
%% This is a skeleton file demonstrating the use of IEEEtran.cls
%% (requires IEEEtran.cls version 1.7 or later) with an IEEE conference paper.
%%
%% Support sites:
%% http://www.michaelshell.org/tex/ieeetran/
%% http://www.ctan.org/tex-archive/macros/latex/contrib/IEEEtran/
%% and
%% http://www.ieee.org/
%%*************************************************************************
%% Legal Notice:
%% This code is offered as-is without any warranty either expressed or
%% implied; without even the implied warranty of MERCHANTABILITY or
%% FITNESS FOR A PARTICULAR PURPOSE!
%% User assumes all risk.
%% In no event shall IEEE or any contributor to this code be liable for
%% any damages or losses, including, but not limited to, incidental,
%% consequential, or any other damages, resulting from the use or misuse
%% of any information contained here.
%%
%% All comments are the opinions of their respective authors and are not
%% necessarily endorsed by the IEEE.
%%
%% This work is distributed under the LaTeX Project Public License (LPPL)
%% ( http://www.latex-project.org/ ) version 1.3, and may be freely used,
%% distributed and modified. A copy of the LPPL, version 1.3, is included
%% in the base LaTeX documentation of all distributions of LaTeX released
%% 2003/12/01 or later.
%% Retain all contribution notices and credits.
%% ** Modified files should be clearly indicated as such, including **
%% ** renaming them and changing author support contact information. **
%%
%% File list of work: IEEEtran.cls, IEEEtran_HOWTO.pdf, bare_adv.tex,
%% bare_conf.tex, bare_jrnl.tex, bare_jrnl_compsoc.tex
%%*************************************************************************
% *** Authors should verify (and, if needed, correct) their LaTeX system ***
% *** with the testflow diagnostic prior to trusting their LaTeX platform ***
% *** with production work. IEEE's font choices can trigger bugs that do ***
% *** not appear when using other class files. ***
% The testflow support page is at:
% http://www.michaelshell.org/tex/testflow/
% Note that the a4paper option is mainly intended so that authors in
% countries using A4 can easily print to A4 and see how their papers will
% look in print - the typesetting of the document will not typically be
% affected by changes in paper size (but the bottom and side margins will).
% Use the testflow package mentioned above to verify correct handling of
% both paper sizes by the user's LaTeX system.
%
% Also note that the "draftcls" or "draftclsnofoot", not "draft", option
% should be used if it is desired that the figures are to be displayed in
% draft mode.
%
\documentclass[10pt, conference, compsocconf]{IEEEtran}
% Add the compsocconf option for Computer Society conferences.
%
% If IEEEtran.cls has not been installed into the LaTeX system files,
% manually specify the path to it like:
% \documentclass[conference]{../sty/IEEEtran}
% Some very useful LaTeX packages include:
% (uncomment the ones you want to load)
% *** MISC UTILITY PACKAGES ***
%
%\usepackage{ifpdf}
% Heiko Oberdiek's ifpdf.sty is very useful if you need conditional
% compilation based on whether the output is pdf or dvi.
% usage:
% \ifpdf
% % pdf code
% \else
% % dvi code
% \fi
% The latest version of ifpdf.sty can be obtained from:
% http://www.ctan.org/tex-archive/macros/latex/contrib/oberdiek/
% Also, note that IEEEtran.cls V1.7 and later provides a builtin
% \ifCLASSINFOpdf conditional that works the same way.
% When switching from latex to pdflatex and vice-versa, the compiler may
% have to be run twice to clear warning/error messages.
% *** CITATION PACKAGES ***
%
%\usepackage{cite}
% cite.sty was written by Donald Arseneau
% V1.6 and later of IEEEtran pre-defines the format of the cite.sty package
% \cite{} output to follow that of IEEE. Loading the cite package will
% result in citation numbers being automatically sorted and properly
% "compressed/ranged". e.g., [1], [9], [2], [7], [5], [6] without using
% cite.sty will become [1], [2], [5]--[7], [9] using cite.sty. cite.sty's
% \cite will automatically add leading space, if needed. Use cite.sty's
% noadjust option (cite.sty V3.8 and later) if you want to turn this off.
% cite.sty is already installed on most LaTeX systems. Be sure and use
% version 4.0 (2003-05-27) and later if using hyperref.sty. cite.sty does
% not currently provide for hyperlinked citations.
% The latest version can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/cite/
% The documentation is contained in the cite.sty file itself.
% *** GRAPHICS RELATED PACKAGES ***
%
\ifCLASSINFOpdf
% \usepackage[pdftex]{graphicx}
% declare the path(s) where your graphic files are
% \graphicspath{{../pdf/}{../jpeg/}}
% and their extensions so you won't have to specify these with
% every instance of \includegraphics
% \DeclareGraphicsExtensions{.pdf,.jpeg,.png}
\else
% or other class option (dvipsone, dvipdf, if not using dvips). graphicx
% will default to the driver specified in the system graphics.cfg if no
% driver is specified.
% \usepackage[dvips]{graphicx}
% declare the path(s) where your graphic files are
% \graphicspath{{../eps/}}
% and their extensions so you won't have to specify these with
% every instance of \includegraphics
% \DeclareGraphicsExtensions{.eps}
\fi
% graphicx was written by David Carlisle and Sebastian Rahtz. It is
% required if you want graphics, photos, etc. graphicx.sty is already
% installed on most LaTeX systems. The latest version and documentation can
% be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/required/graphics/
% Another good source of documentation is "Using Imported Graphics in
% LaTeX2e" by Keith Reckdahl which can be found as epslatex.ps or
% epslatex.pdf at: http://www.ctan.org/tex-archive/info/
%
% latex, and pdflatex in dvi mode, support graphics in encapsulated
% postscript (.eps) format. pdflatex in pdf mode supports graphics
% in .pdf, .jpeg, .png and .mps (metapost) formats. Users should ensure
% that all non-photo figures use a vector format (.eps, .pdf, .mps) and
% not bitmapped formats (.jpeg, .png). IEEE frowns on bitmapped formats
% which can result in "jaggedy"/blurry rendering of lines and letters as
% well as large increases in file sizes.
%
% You can find documentation about the pdfTeX application at:
% http://www.tug.org/applications/pdftex
% *** MATH PACKAGES ***
%
%\usepackage[cmex10]{amsmath}
% A popular package from the American Mathematical Society that provides
% many useful and powerful commands for dealing with mathematics. If using
% it, be sure to load this package with the cmex10 option to ensure that
% only type 1 fonts will be utilized at all point sizes. Without this option,
% it is possible that some math symbols, particularly those within
% footnotes, will be rendered in bitmap form which will result in a
% document that cannot be IEEE Xplore compliant!
%
% Also, note that the amsmath package sets \interdisplaylinepenalty to 10000
% thus preventing page breaks from occurring within multiline equations. Use:
%\interdisplaylinepenalty=2500
% after loading amsmath to restore such page breaks as IEEEtran.cls normally
% does. amsmath.sty is already installed on most LaTeX systems. The latest
% version and documentation can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/required/amslatex/math/
% *** SPECIALIZED LIST PACKAGES ***
%
%\usepackage{algorithmic}
% algorithmic.sty was written by Peter Williams and Rogerio Brito.
% This package provides an algorithmic environment for describing algorithms.
% You can use the algorithmic environment in-text or within a figure
% environment to provide for a floating algorithm. Do NOT use the algorithm
% floating environment provided by algorithm.sty (by the same authors) or
% algorithm2e.sty (by Christophe Fiorio) as IEEE does not use dedicated
% algorithm float types and packages that provide these will not provide
% correct IEEE style captions. The latest version and documentation of
% algorithmic.sty can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/algorithms/
% There is also a support site at:
% http://algorithms.berlios.de/index.html
% Also of interest may be the (relatively newer and more customizable)
% algorithmicx.sty package by Szasz Janos:
% http://www.ctan.org/tex-archive/macros/latex/contrib/algorithmicx/
% *** ALIGNMENT PACKAGES ***
%
%\usepackage{array}
% Frank Mittelbach's and David Carlisle's array.sty patches and improves
% the standard LaTeX2e array and tabular environments to provide better
% appearance and additional user controls. As the default LaTeX2e table
% generation code is lacking to the point of almost being broken with
% respect to the quality of the end results, all users are strongly
% advised to use an enhanced (at the very least that provided by array.sty)
% set of table tools. array.sty is already installed on most systems. The
% latest version and documentation can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/required/tools/
%\usepackage{mdwmath}
%\usepackage{mdwtab}
% Also highly recommended is Mark Wooding's extremely powerful MDW tools,
% especially mdwmath.sty and mdwtab.sty which are used to format equations
% and tables, respectively. The MDWtools set is already installed on most
% LaTeX systems. The latest version and documentation is available at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/mdwtools/
% IEEEtran contains the IEEEeqnarray family of commands that can be used to
% generate multiline equations as well as matrices, tables, etc., of high
% quality.
%\usepackage{eqparbox}
% Also of notable interest is Scott Pakin's eqparbox package for creating
% (automatically sized) equal width boxes - aka "natural width parboxes".
% Available at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/eqparbox/
% *** SUBFIGURE PACKAGES ***
%\usepackage[tight,footnotesize]{subfigure}
% subfigure.sty was written by Steven Douglas Cochran. This package makes it
% easy to put subfigures in your figures. e.g., "Figure 1a and 1b". For IEEE
% work, it is a good idea to load it with the tight package option to reduce
% the amount of white space around the subfigures. subfigure.sty is already
% installed on most LaTeX systems. The latest version and documentation can
% be obtained at:
% http://www.ctan.org/tex-archive/obsolete/macros/latex/contrib/subfigure/
% subfigure.sty has been superseded by subfig.sty.
%\usepackage[caption=false]{caption}
%\usepackage[font=footnotesize]{subfig}
% subfig.sty, also written by Steven Douglas Cochran, is the modern
% replacement for subfigure.sty. However, subfig.sty requires and
% automatically loads Axel Sommerfeldt's caption.sty which will override
% IEEEtran.cls handling of captions and this will result in nonIEEE style
% figure/table captions. To prevent this problem, be sure and preload
% caption.sty with its "caption=false" package option. This will preserve
% IEEEtran.cls handling of captions. Version 1.3 (2005/06/28) and later
% (recommended due to many improvements over 1.2) of subfig.sty supports
% the caption=false option directly:
%\usepackage[caption=false,font=footnotesize]{subfig}
%
% The latest version and documentation can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/subfig/
% The latest version and documentation of caption.sty can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/caption/
% *** FLOAT PACKAGES ***
%
%\usepackage{fixltx2e}
% fixltx2e, the successor to the earlier fix2col.sty, was written by
% Frank Mittelbach and David Carlisle. This package corrects a few problems
% in the LaTeX2e kernel, the most notable of which is that in current
% LaTeX2e releases, the ordering of single and double column floats is not
% guaranteed to be preserved. Thus, an unpatched LaTeX2e can allow a
% single column figure to be placed prior to an earlier double column
% figure. The latest version and documentation can be found at:
% http://www.ctan.org/tex-archive/macros/latex/base/
%\usepackage{stfloats}
% stfloats.sty was written by Sigitas Tolusis. This package gives LaTeX2e
% the ability to do double column floats at the bottom of the page as well
% as the top. (e.g., "\begin{figure*}[!b]" is not normally possible in
% LaTeX2e). It also provides a command:
%\fnbelowfloat
% to enable the placement of footnotes below bottom floats (the standard
% LaTeX2e kernel puts them above bottom floats). This is an invasive package
% which rewrites many portions of the LaTeX2e float routines. It may not work
% with other packages that modify the LaTeX2e float routines. The latest
% version and documentation can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/sttools/
% Documentation is contained in the stfloats.sty comments as well as in the
% presfull.pdf file. Do not use the stfloats baselinefloat ability as IEEE
% does not allow \baselineskip to stretch. Authors submitting work to the
% IEEE should note that IEEE rarely uses double column equations and
% that authors should try to avoid such use. Do not be tempted to use the
% cuted.sty or midfloat.sty packages (also by Sigitas Tolusis) as IEEE does
% not format its papers in such ways.
% *** PDF, URL AND HYPERLINK PACKAGES ***
%
%\usepackage{url}
% url.sty was written by Donald Arseneau. It provides better support for
% handling and breaking URLs. url.sty is already installed on most LaTeX
% systems. The latest version can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/misc/
% Read the url.sty source comments for usage information. Basically,
% \url{my_url_here}.
% *** Do not adjust lengths that control margins, column widths, etc. ***
% *** Do not use packages that alter fonts (such as pslatex). ***
% There should be no need to do such things with IEEEtran.cls V1.6 and later.
% (Unless specifically asked to do so by the journal or conference you plan
% to submit to, of course.)
% correct bad hyphenation here
\hyphenation{op-tical net-works semi-conduc-tor}
\begin{document}
%
% paper title
% can use linebreaks \\ within to get better formatting as desired
\title{Bare Demo of IEEEtran.cls for IEEECS Conferences}
% author names and affiliations
% use a multiple column layout for up to two different
% affiliations
\author{\IEEEauthorblockN{Authors Name/s per 1st Affiliation (Author)}
\IEEEauthorblockA{line 1 (of Affiliation): dept. name of organization\\
line 2: name of organization, acronyms acceptable\\
line 3: City, Country\\
line 4: Email: name@xyz.com}
\and
\IEEEauthorblockN{Varun K R}
\IEEEauthorblockA{Student\\
Dept. of Electronics and Communication Engg.\\
M.S. Ramaiah Inst. of Tech.\\
Bangalore-560054, INDIA\\
varunkr95@gmail.com}
\and
\IEEEauthorblockN{Manikantan K\IEEEauthorrefmark{1}}
\IEEEauthorblockA{Associate Professor\\
Dept. of Electronics and Communication Engg.\\
M.S. Ramaiah Inst. of Tech.\\
Bangalore-560054, INDIA\\
kmanikantan@msrit.edu\\
\IEEEauthorrefmark{1}Corresponding author}
}
% conference papers do not typically use \thanks and this command
% is locked out in conference mode. If really needed, such as for
% the acknowledgment of grants, issue a \IEEEoverridecommandlockouts
% after \documentclass
% for over three affiliations, or if they all won't fit within the width
% of the page, use this alternative format:
%
%\author{\IEEEauthorblockN{Michael Shell\IEEEauthorrefmark{1},
%Homer Simpson\IEEEauthorrefmark{2},
%James Kirk\IEEEauthorrefmark{3},
%Montgomery Scott\IEEEauthorrefmark{3} and
%Eldon Tyrell\IEEEauthorrefmark{4}}
%\IEEEauthorblockA{\IEEEauthorrefmark{1}School of Electrical and Computer Engineering\\
%Georgia Institute of Technology,
%Atlanta, Georgia 30332--0250\\ Email: see http://www.michaelshell.org/contact.html}
%\IEEEauthorblockA{\IEEEauthorrefmark{2}Twentieth Century Fox, Springfield, USA\\
%Email: homer@thesimpsons.com}
%\IEEEauthorblockA{\IEEEauthorrefmark{3}Starfleet Academy, San Francisco, California 96678-2391\\
%Telephone: (800) 555--1212, Fax: (888) 555--1212}
%\IEEEauthorblockA{\IEEEauthorrefmark{4}Tyrell Inc., 123 Replicant Street, Los Angeles, California 90210--4321}}
% use for special paper notices
%\IEEEspecialpapernotice{(Invited Paper)}
% make the title area
\maketitle
\begin{abstract}
The abstract goes here. DO NOT USE SPECIAL CHARACTERS, SYMBOLS, OR MATH IN YOUR TITLE OR ABSTRACT.
\end{abstract}
\begin{IEEEkeywords}
component; formatting; style; styling;
\end{IEEEkeywords}
% For peer review papers, you can put extra information on the cover
% page as needed:
% \ifCLASSOPTIONpeerreview
% \begin{center} \bfseries EDICS Category: 3-BBND \end{center}
% \fi
%
% For peerreview papers, this IEEEtran command inserts a page break and
% creates the second title. It will be ignored for other modes.
\IEEEpeerreviewmaketitle
\section{Introduction}
% no \IEEEPARstart
This demo file is intended to serve as a ``starter file''
for IEEE conference papers produced under \LaTeX\ using
IEEEtran.cls version 1.7 and later.
All manuscripts must be in English. These guidelines include complete descriptions of the fonts, spacing, and related information for producing your proceedings manuscripts. Please follow them and if you have any questions, direct them to the production editor in charge of your proceedings at Conference Publishing Services (CPS): Phone +1 (714) 821-8380 or Fax +1 (714) 761-1784.
% You must have at least 2 lines in the paragraph with the drop letter
% (should never be an issue)
\subsection{Subsection Heading Here}
Subsection text here.
\subsubsection{Subsubsection Heading Here}
Subsubsection text here.
\section{Type style and Fonts}
Wherever Times is specified, Times Roman or Times New Roman may be used. If neither is available on your system, please use the font closest in appearance to Times. Avoid using bit-mapped fonts if possible. True-Type 1 or Open Type fonts are preferred. Please embed symbol fonts, as well, for math, etc.
% An example of a floating figure using the graphicx package.
% Note that \label must occur AFTER (or within) \caption.
% For figures, \caption should occur after the \includegraphics.
% Note that IEEEtran v1.7 and later has special internal code that
% is designed to preserve the operation of \label within \caption
% even when the captionsoff option is in effect. However, because
% of issues like this, it may be the safest practice to put all your
% \label just after \caption rather than within \caption{}.
%
% Reminder: the "draftcls" or "draftclsnofoot", not "draft", class
% option should be used if it is desired that the figures are to be
% displayed while in draft mode.
%
%\begin{figure}[!t]
%\centering
%\includegraphics[width=2.5in]{myfigure}
% where an .eps filename suffix will be assumed under latex,
% and a .pdf suffix will be assumed for pdflatex; or what has been declared
% via \DeclareGraphicsExtensions.
%\caption{Simulation Results}
%\label{fig_sim}
%\end{figure}
% Note that IEEE typically puts floats only at the top, even when this
% results in a large percentage of a column being occupied by floats.
% An example of a double column floating figure using two subfigures.
% (The subfig.sty package must be loaded for this to work.)
% The subfigure \label commands are set within each subfloat command, the
% \label for the overall figure must come after \caption.
% \hfil must be used as a separator to get equal spacing.
% The subfigure.sty package works much the same way, except \subfigure is
% used instead of \subfloat.
%
%\begin{figure*}[!t]
%\centerline{\subfloat[Case I]\includegraphics[width=2.5in]{subfigcase1}%
%\label{fig_first_case}}
%\hfil
%\subfloat[Case II]{\includegraphics[width=2.5in]{subfigcase2}%
%\label{fig_second_case}}}
%\caption{Simulation results}
%\label{fig_sim}
%\end{figure*}
%
% Note that often IEEE papers with subfigures do not employ subfigure
% captions (using the optional argument to \subfloat), but instead will
% reference/describe all of them (a), (b), etc., within the main caption.
% An example of a floating table. Note that, for IEEE style tables, the
% \caption command should come BEFORE the table. Table text will default to
% \footnotesize as IEEE normally uses this smaller font for tables.
% The \label must come after \caption as always.
%
%\begin{table}[!t]
%% increase table row spacing, adjust to taste
%\renewcommand{\arraystretch}{1.3}
% if using array.sty, it might be a good idea to tweak the value of
% \extrarowheight as needed to properly center the text within the cells
%\caption{An Example of a Table}
%\label{table_example}
%\centering
%% Some packages, such as MDW tools, offer better commands for making tables
%% than the plain LaTeX2e tabular which is used here.
%\begin{tabular}{|c||c|}
%\hline
%One & Two\\
%\hline
%Three & Four\\
%\hline
%\end{tabular}
%\end{table}
% Note that IEEE does not put floats in the very first column - or typically
% anywhere on the first page for that matter. Also, in-text middle ("here")
% positioning is not used. Most IEEE journals/conferences use top floats
% exclusively. Note that, LaTeX2e, unlike IEEE journals/conferences, places
% footnotes above bottom floats. This can be corrected via the \fnbelowfloat
% command of the stfloats package.
\section{Conclusion}
The conclusion goes here. This is more of the conclusion.
% conference papers do not normally have an appendix
% use section* for acknowledgement
\section*{Acknowledgment}
The authors would like to thank...
more thanks here
% trigger a \newpage just before the given reference
% number - used to balance the columns on the last page
% adjust value as needed - may need to be readjusted if
% the document is modified later
%\IEEEtriggeratref{8}
% The "triggered" command can be changed if desired:
%\IEEEtriggercmd{\enlargethispage{-5in}}
% references section
% can use a bibliography generated by BibTeX as a .bbl file
% BibTeX documentation can be easily obtained at:
% http://www.ctan.org/tex-archive/biblio/bibtex/contrib/doc/
% The IEEEtran BibTeX style support page is at:
% http://www.michaelshell.org/tex/ieeetran/bibtex/
%\bibliographystyle{IEEEtran}
% argument is your BibTeX string definitions and bibliography database(s)
%\bibliography{IEEEabrv,../bib/paper}
%
% <OR> manually copy in the resultant .bbl file
% set second argument of \begin to the number of references
% (used to reserve space for the reference number labels box)
\begin{thebibliography}{1}
\bibitem{IEEEhowto:kopka}
H.~Kopka and P.~W. Daly, \emph{A Guide to \LaTeX}, 3rd~ed.\hskip 1em plus
0.5em minus 0.4em\relax Harlow, England: Addison-Wesley, 1999.
\end{thebibliography}
% that's all folks
\end{document}

%% submit/conclusion.tex
Issue management tasks are cumbersome and repetitive.
Automatic methods that assist with these repetitive but necessary tasks are therefore valuable.
In this paper, automatic classification techniques for issue reports were examined.
% But the different usage of ITS in GitHub from traditional ITS brings a series of challenges for this task.
Four different text-based classification approaches were applied to a large-scale dataset to determine
which approach performs best.
Regression analysis with multiple linear mixed-effects models was used to investigate the key factors
that limit classifier performance;
this analysis revealed that semantic perplexity is a crucial factor affecting classification performance.
A two-stage classifier framework was then established based on the regression analysis.
Experiments were conducted, and the results show the following:
1) In our large-scale study, text-based classification approaches work in the context of GitHub's ITS,
and Support Vector Machines (SVM) achieves the best performance among the four text-based classifiers.
% MLTs also work on classifying issue reports by only using free text from ITS of GitHub.
% Of 4 MLTs, SVM have the best performance compared with 3 other MLTs in classifying issue reports in GitHub.
2) Increasing the size of the training set can effectively improve the performance of the classification model.
Moreover, too many confused issues (defined in Section \ref{sec:regression}) are harmful to building an effective text-based classifier.
3) Our two-stage classifier framework can extract semantic perplexity information from free text, which benefits the classifier.
The quantitative evaluations show that classification performance achieves a significant improvement.
% , which can make use of unstructured free text and structured contributors information.
% The first level can collect perplexity information of issue reports from free text, and the second level combines unstructured features extracted from the first level and structured features extracted from contributors' historical activities of issue reports' authors.
% The result of our approach indicate that use some extra information is an efficient way to build a more effective classification model.
In our future work, we plan to explore the relationship between classifier performance and the topics of issue reports.
We believe that the distribution of categories is associated with the topics of issue reports.
Reusing classification models is another direction we are interested in.
In addition, improving the preprocessing pipeline is an ongoing task.
% That is, issue reports about one component will more likely be a bug and issue reports about one function may be more probability to be a feature.
% If there are some regularities, we can use these regularities to feedback classification model.

%% submit/data_process.tex
% \subsection{Category Extraction}
% \label{labeling process}
% Different with traditional ITS, the ITS in GitHub use label system to manage issues.
% %category and other structured information.
% In order to get a pre-labeled training set from GitHub, we need to extract category information
% from the user-defined label system.
% %However, the custom label system makes it difficult.
% In our dataset, there are 7,793 different labels in 1,185 projects
% (the details of our dataset are presented in Section~\ref{sec:dataset}),
% and many projects use different tags (\ie labels) to express same meanings.
% For example, the tags like ``bug'', ``type:bug'', ``error'' and ``defect'' are used to identify bug-related issues.
% In this paper, we present a qualitative study to comprehend how the core team use tags as categories of issues.
% % The ITS of GitHub use label system to
% % labeling process aims to build a pre-labeled training set,
% % which can be used for supervised text-based classification approaches next.
% % The first task in this process is to know what labels are used to distinguish Bug-prone issue reports from Nonbug-prone ones.
% % Compared with traditional ITS, it is more difficult to get pre-labeled training set in GitHub.
% % Because in the ITS of GitHub, it only provides labels for contributors to add extra structured information like categories, priority, etc.
% % And label system in GitHub is user self-defined, which means different project may use different labels to express same meaning.
% % In our dataset, there are 7793 different labels in 101 projects.
% % This flatten and flexible design makes it difficult to understand label usage from so many projects.
% % % On the one hand, different with Bugzilla, the ITS in Github do not have categories information for issue reports.
% % % Users in GitHub distinguish category of issue reports by labels.
% % % On the other hand, GitHub's label system are user self-defined and not all the projects use the same labels to label bug or feature for issue reports.
% % So we need comprehend which labels are used to distinguish issues first.
% In GitHub, there are many projects
% % There are some projects in GitHub migrated from other platform, and at the same time,
% % succeed to the custom of using traditional ITS,
% giving some extra information in tags. For example, in project ``WordPress-Android'', issues are labeled like ``[type] bug'', ``[type] enhancement'', etc.; in project ``angular.js'', issues are labeled like ``type: bug'', ``type: feature'', ``component: form'', etc.
% Tags in these projects not only contain categories of issues, but also contain the categories of tag itself.
% % Despite projects like those are just a small part,
% These information is great helpful for us to know what labels are used most to express categories of issues.
% We design a process to aggregate these tags by making full use of this extra information.
% Firstly, we pick out all tags acting as those forms, and separate the information of them.
% We use a 2-d vector $<C, name>$ to represent these tags,
% where $C$ means category of the tag (like ``type'', ``component'', etc)
% and $name$ means the main information of the tag (like ``bug'', ``feature'', ``enhancement'', etc).
% Secondly, we group tags with same $C$ items as $Group_C$, and the preliminary aggregating process is done.
% Next, we define the similarity of two groups through Equation~\ref{equation:similarity}.
% We iteratively calculate the similarity of different group, and merge the groups whose similarity greater than threshold.
% \begin{equation}
% similarity = \frac{{\left| {Group_{C_i}\bigcap {Group_{C_j}} } \right|}}{{\min \left( {\left| {Group{C_i}} \right|,\left| {Group{C_j}} \right|} \right)}}
% , (i \neq j)
% \label{equation:similarity}
% \end{equation}
% Where $Group_{C_i}$ is a set of tags with the same category $C_i$ and different $name$.
% Finally, we get a structure tag information through above process.
% % \begin{table}[htbp]
% % \centering
% % \caption{Summary of Labels in GitHub}
% % \begin{tabular}{|l|p{0.65\columnwidth}|} \hline
% % \textbf{Type} & \textbf{Labels} \\ \hline
% % bug & \begin{tabular}[c]{@{}l@{}}bug, defect, bug - functional, Bug - Views, \\ Bug: Framework, Issue Type: Bug, kind/bug, etc.\end{tabular} \\ \hline
% % enhancement & \begin{tabular}[c]{@{}l@{}}Enhancement, Improvement, kind: enhancement, \\ t:enhancement, etc.\end{tabular} \\ \hline
% % feature & \begin{tabular}[c]{@{}l@{}}feature,feature request, Feature - High Priority, \\ Feature/Http, feature:bootstrap, etc.\end{tabular} \\ \hline
% % documentation & \begin{tabular}[c]{@{}l@{}}doc, documentation, type/docs, \\ Kind:Documentation, etc.\end{tabular} \\ \hline
% % question & \begin{tabular}[c]{@{}l@{}}kind/question, type/question, etc.\end{tabular} \\ \hline
% % other & \begin{tabular}[c]{@{}l@{}}task, test, support, design, refactor, etc.\end{tabular} \\ \hline
% % \end{tabular}%
% % \label{tag:datacollection}
% % \end{table}%
% \begin{table}[htbp]
% \centering
% \caption{My caption}
% \label{tag:datacollection}
% \begin{tabular}{lccc}
% \toprule
% label & projects & issues & percent (\%) \\
% \midrule
% bug & 644 & 118,155 & 46.9 \\
% enhancement & 412 & 44,947 & 17.8 \\
% feature & 199 & 14,795 & 5.9 \\
% question & 319 & 13,109 & 5.2 \\
% defect & 15 & 7,604 & 3.0 \\
% feature request & 93 & 6,976 & 2.8 \\
% documentation & 239 & 6,422 & 2.5 \\
% type:bug & 13 & 5,684 & 2.3 \\
% improvement & 45 & 5,592 & 2.2 \\
% docs & 122 & 5,510 & 2.2 \\
% \bottomrule
% \end{tabular}
% \end{table}
% Through distinguishing \textit{name} of group ``$type$'' manually,
% we divide tags into 6 different categories as shown in Table~\ref{tag:datacollection}.
% Bug, enhancement and feature are most used tags, which are observed in 46.9\%, 17.8\% and 5.9\% of the labeled issues respectively.
% Here we use these structure tag information to judge whether an issue is bug-prone or not.
% The number of issues labeled with both categories is 3,869 in 386 projects, which can be ignored compared with 252,084 labeled issues.
% % To get a overall perspective to label system, we aggregate all other labels according to the group information we aggregated before.
% % In the result, we find the top 3 most used group of labels are type (\textit{i.e.}, bug, feature, enhancement), status (\textit{i.e.}, duplicate, wontfix, invalid) and adverb (\textit{i.e.}, high, critical, major, and this kind of labels are mostly used as priority and severity).
% % To evaluate the coverage of these 3 kinds of labels, we select all labels that are used in more than 5 projects to filter labels with minority usage. We count the usage times of these 3 kinds of labels and calculate the usage rate for all filtered labels. Finally, these 3 groups of labels are used more than half, which achieve 58.7\%.
% % Through aggregating all the labels, we finally get 113 labels used most in the category ``type''. From these labels, we distinguish them and select bug-like labels such as ``bug'', ``defect'', ``type:bug'', ``definition:bug'', etc. and select feature-like labels such as ``feature'', ``enhancement'', ``new feature'', ``feature request'', etc. Then, we label issues as bug or feature using labels we have distinguished, and these labeled issues will be our training data in following process.
% % Because of the need for adequate classified issues for training model, we hand-pick 111 projects which have more than 500 labeled categories information issues. The detail information of these projects are shown in Table \ref{tag:datacollection}.
% % \begin{table}[htbp]
% % \centering
% % \caption{Summary Statistics for Data Collection}
% % \begin{tabular}{|c|c|c|} \hline
% % \textbf{} & \textbf{Count} & \textbf{Mean} \\ \hline
% % Projects & 111 & \\ \hline
% % Issues & 356256 & 3209.5(per project) \\ \hline
% % Labeled issues & 240754 & 2169.0(per project) \\ \hline
% % Labels & 470965 & 1.96(per issue) \\ \hline
% % \end{tabular}%
% % \label{tag:datacollection}
% % \end{table}%
% % We also select 3 different projects (phpadmin, piwik and numpy) as objects of case study, which have enough labeled issues and different proportion of bug to feature. The detail information about them are exhibited in Table \ref{tag:casestudy}.
% % \begin{table}[htbp]
% % \centering
% % \caption{Projects for Case Study}
% % \begin{tabular}{|c|c|c|} \hline
% % \textbf{Projects} & \textbf{Labeled Issues} & \textbf{Bug Proportion (\%)} \\ \hline
% % phpadmin & 6766 & 0.75 \\ \hline
% % piwik & 5389 & 0.66 \\ \hline
% % numpy & 2511 & 0.86 \\ \hline
% % \end{tabular}%
% % \label{tag:casestudy}
% % \end{table}%
% \subsection{Preprocessing of Dataset}
% Through the former process, each issue is characterized by its title and description,
% and part of them can be labeled as ``bug'' or ``non-bug''.
% Here, we select issues labeled by former process to do the following steps.
% First, linguistic features extracted for the text-based classifier undergo the standard processing,
% i.e., lowercase, text filtering, stemming, and indexing~\cite{frakes1992information}.
% We do not remove all stop-words, and leave common English terms, such as ``should'', ``might'', ``not''.
% In study~\cite{antoniol2008bug}, they indicate that it may be important for classifying issues,
% and study~\cite{bissyande2013got} also mentions that removing default list of stop-words in common corpora might decrease the classification accuracy.
% For instance, the semantics of a sentence ``This is not a bug'' is completely lost if the standard English stop-words are removed because the result is ``This is a bug''.
% Then we use vector space model to represent each issue as a weighted vector.
% We segment the issue into different terms (in here a word means a term) and each element in the vector of the issue is the weight of a term, and the value stands for the importance of the term for the issue.
% We utilize term frequency-inverse document frequency (\textit{tf-idf}) to calculate weight, which based on two assumptions: The more a given word appears in the issue, the more important it is for that issue.
% Contrariwise, the more issues a word appears in, the less useful it is to distinguish among these issues.
% % is utilized to indicate the weight of a term.
% The process of calculating \textit{tf-idf} acts as Equation~(\ref{equation:tf})(\ref{equation:idf})(\ref{equation:tfidf}).
% \begin{equation}
% tf(t,i) = \frac{{{n_t}}}{{{N_i}}}
% \label{equation:tf}
% \end{equation}
% \begin{equation}
% idf(t) = \log \left( {\frac{{{N_I}}}{{\left| {i \in I:t \in i} \right|}}} \right)
% \label{equation:idf}
% \end{equation}
% \begin{equation}
% tf-idf(t,i) = tf(t,i) \times idf(t)
% \label{equation:tfidf}
% \end{equation}
% Where \textit{t} is a term, \textit{i} is the corpus of an issue, \textit{I} is the corpus of all issues in the given project, $n_t$ is the count of appearance for term \textit{t} in the issue, $N_i$ is the total number of terms in issue \textit{i} and $N_I$ is the total number of issues in the given project.
\subsection{Text-based Classification}
\label{ML}
Given the substantial changes described in Section~\ref{ITS_GH},
it is necessary to determine whether the regular patterns of free text found in prior research~\cite{antoniol2008bug} still hold,
and whether text-based classification models remain effective on large-scale projects.
Many text-based classification approaches have been used to classify issues in prior
studies~\cite{antoniol2008bug,herzig2013s,maalej2015bug,zhou2014combining}.
In this paper, several widely used text-based classifiers,
such as \textit{Naive Bayes} and \textit{Logistic Regression},
were selected to determine which performs best.
Table~\ref{tag:packages} lists the selected classifiers and their parameter settings,
which were chosen from the best-performing configurations over numerous trials.
\begin{table}[htbp]
\centering
\caption{Text-based classifiers and parameter settings}
\begin{tabular}{|c|l|l|} \hline
\textbf{Classifier} & \textbf{API} & \textbf{Parameters Setting} \\ \hline
SVM & SVC & kernel=`linear' \\ \hline
NB & MultinomialNB & class\_prior=`None' \\ \hline
LR & LogisticRegression & penalty=`l2' \\ \hline
RF & RandomForestClassifier & n\_estimators=100, n\_jobs=-1 \\ \hline
\end{tabular}%
\label{tag:packages}
\end{table}%
A well-labeled dataset that can be used to train the classification models is constructed through the dataset labeling process and preprocessing.
Table~\ref{tag:packages} shows the four classifiers built for each project.
We use the APIs of the \textit{sklearn} package to implement these methods.
Ten-fold cross-validation was applied to separate the dataset into training and testing sets for evaluating the classification models.
% The ten-fold cross-validation randomly partitions the dataset into 10 equal-sized subsets.
% One subset is retained as the testing set for evaluating the classifier out of the 10 subsets,
% and the remaining 9 subsets were used as the training set to build the classifier.
The ten-fold cross-validation has a minimal effect on the sample characteristics
and can investigate the stability of item loading on multiple factors~\cite{van2006five}.
% The advantage of this validation method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once .
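The ten-fold protocol above can be sketched as a plain index partition. This is a minimal standard-library illustration, not the paper's implementation (the experiments rely on \textit{sklearn}):

```python
import random

def ten_fold_indices(n_samples, seed=0):
    """Split sample indices into 10 roughly equal folds.

    Each fold serves exactly once as the testing set while the
    remaining nine folds form the training set.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # random partition
    folds = [idx[i::10] for i in range(10)]   # 10 roughly equal folds
    for k in range(10):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test
```

In practice the same behavior is obtained from \textit{sklearn}'s \texttt{model\_selection.KFold(n\_splits=10, shuffle=True)}.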
\subsection{Regression Analysis of Classification Performance}
\label{sec:regression}
Classification performance varies considerably across projects.
This part aims to determine the factors that influence the performance of text-based classification.
Manual analysis was first used to detect candidate factors, and we discovered several misclassified issues containing both bug-prone and nonbug-prone parts.
For example, contributors may discover some unreasonable designs or problems in the project
when they submit an issue about feature request (\eg EX in Section \ref{section:improvingmodel}).
Issue EX is nonbug-prone overall, but its problem-description part is bug-prone,
which may confuse the classifier.
Such issues are termed as \textit{\textbf{confused issues}}.
Regression analysis techniques were then used to verify this observation.
In this paper, multiple linear mixed effect models were used to investigate the factors that affect classifier performance.
In addition to coefficients, the effect size of each variable obtained from ANOVA analyses was reported.
The model's fit can be evaluated by pseudo R-squared, i.e., the marginal (${R_m}^2$) and conditional (${R_c}^2$) coefficients,
as recommended for generalized mixed-effects models~\cite{tsay2014let}.
As implemented in the MuMIn package of R~\cite{jiang2004exploration},
${R_m}^2$ is the proportion of variance explained by the fixed effects alone,
and ${R_c}^2$ is the proportion of variance explained by the fixed and random effects together.
All numeric variables were first log transformed (plus 0.5 if necessary) to stabilize variance
and reduce heteroscedasticity~\cite{gharehyazie2014developer}.
The variance inflation factors (VIFs) for each predictor were computed to test for multicollinearity.
If the VIFs of all the remaining factors are below 3, then multicollinearity is absent~\cite{gharehyazie2014developer}.
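The VIF screen described above can be reproduced numerically. A minimal numpy sketch (the mixed-effects analysis itself is done in R; the data and code here are purely illustrative): each predictor is regressed on the remaining predictors, and $VIF_j = 1/(1-R_j^2)$.

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column of X (n_samples x n_predictors).

    Each predictor is regressed (with an intercept) on all the others;
    VIF_j = 1 / (1 - R^2_j). Values below 3 indicate that
    multicollinearity is not a concern, the threshold used in the paper.
    """
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
        vifs.append(1.0 / (1.0 - r2) if r2 < 1.0 else float("inf"))
    return vifs
```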
\textit{1) Outcome:} The outcome measure is the \textit{average F-measure} (calculated as Equation~\ref{equation:performance}) of the classification.
Because we use ten-fold cross-validation,
we obtain $10 \times n_{project}$ records for the regression analysis,
where $n_{project}$ is the number of projects.
\textit{2) Predictors:} Project- and issue-level measures were computed in this process.
Project-level measures describe the status of the project,
whereas issue-level measures are extracted from the textual summary of issues.
\textbf{\underline{\textit{Project-level measures}}}
\textbf{\textit{Star and watch:}} The number of stars and watchers of the project.
These reflect the popularity of the project in GitHub.
\textbf{\textit{Contributors:}} The number of developers active in the project.
These data are acquired from the project's homepage in GitHub.
\textbf{\textit{Project age:}} The duration of the project since its creation, in timestamps.
\textbf{\textit{Commits:}} The total number of commits of the project.
\textbf{\textit{Issues:}} The number of issues used in training set.
\textbf{\underline{\textit{Issue-level measures}}}
\textbf{\textit{Confused issues:}} The total number of confused issues.
Each sentence of the issue is predicted using the best model built in Section~\ref{ML}.
If the sentences of an issue are not all predicted into the same class, the issue is considered a confused issue.
\textbf{\textit{Median of words:}} The median number of words for each instance in the training set.
More words are likely to contain more information, which may help the classifier.
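The confused-issue rule defined above can be sketched directly. Here \texttt{toy\_predict} is a hypothetical keyword-based stand-in for the best sentence-level model of Section~\ref{ML}:

```python
def is_confused(sentences, predict_label):
    """An issue is 'confused' when its sentences are not all
    predicted into the same class (bug-prone vs. nonbug-prone)."""
    return len({predict_label(s) for s in sentences}) > 1

def toy_predict(sentence):
    """Hypothetical stand-in classifier, for illustration only."""
    bug_words = ("error", "crash", "fail")
    return "bug" if any(w in sentence.lower() for w in bug_words) else "nonbug"
```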
\subsection{Two-stage Classification}
\label{section:improvingmodel}
{\color{red}{
Omitting structured information in GitHub, as described in Section~\ref{ITS_GH},
means that less information is available to build a synthesized classification model.
Apart from the free text of issues, the only firsthand features that can be extracted
relate to the historical information of issue contributors,
\eg the contributor's identity (core or external developer) and historical development activities.
Thus, we propose a two-stage classification approach to combine textual summary information
and developer information, which could be expected to improve the performance of classification.
Each stage of our approach is explained in the following paragraphs,
and the overview of our approach is shown in Figure~\ref{figure:framework}.}}
% which consists of the following steps:}}
% 1) The first stage uses textual summary information, including title and description of issues,
% to build a supervised text-based classification model.
% At this stage, the issue is not directly predicted as bug-prone or not, but the probability of being bug-prone occurs.
% % In this paper, linear SVM was used as our classification model in the first stage.
% Through this process, the semantic perplexity information of each issue was collected.
% All these outputs in the first stage are regarded as features extracted from unstructured free texts,
% which will be used in the second stage.
% 2) The second stage uses the structured information of contributors along with the features extracted from the first stage.
% At this stage, these features were fed into a new machine leaner, which would classify the issue as either bug-prone or non-bug-prone.
% % In the second stage, logistic regression was used as the classification model.
% Each stage of our approach is explained in the following paragraphs,
% and the overview of our approach is shown in Figure~\ref{figure:framework}.
\begin{figure*}[!htbp]
\centering
\setlength{\fboxrule}{0.5pt}
% \setlength{\fboxsep}{1cm}
\fbox{\includegraphics[scale = 0.62]{figure/framework}}
\caption{Overview of two-stage classification framework}
\label{figure:framework}
\end{figure*}
\subsubsection{Stage 1 - Textual Summary Classification}
In this stage, the main task is to extract the information in the free text.
Similar to the process of Section~\ref{ML}, two main textual information sources,
title and description, were used.
% But in this time, we don't directly build a model to predict whether the issue report is bug-prone.
The classification model was trained from the textual information of training set,
and the probability output of the model was applied to predict the testing set.
After that, the title and description of the issue were divided into sentences,
and the classification model built before was used to predict each sentence.
The sentence-level prediction results reveal how the semantics change as contributors report an issue.
% and extract semantic perplexity information contained in the free text.
Free text can be examined further by analyzing the semantic perplexity information of the sentences
and the regular patterns in submitting issues.
The following example illustrates semantic perplexity.
\textbf{EX:} ``\textit{Currently, auto-archiving cannot be used if Piwik's authentication is configured to use the CAS plugin. I ran into this problem with authentication when running archive.php with CAS plugin enabled on my site... Add a feature to auto-archiving, so that it can succeeds when Piwik uses CAS for authentication instead of the default Login module.}''
In EX, the first sentence describes the problem encountered by the contributor.
For the classification model, this sentence is more likely to be predicted as bug-prone.
However, a new feature is proposed in the last sentence, which is more likely to be predicted as non-bug-prone.
For the classification model, issues similar to EX are difficult to classify because of the perplexity of the text.
This situation can be addressed by extracting the features and dividing the issue into sentences.
In Stage 1, the following features from the textual summary were extracted:
\underline{\textbf{Probability:}} The probability that the issue report is predicted as bug-prone.
The probability output of the classification model is used to obtain this feature.
\underline{\textbf{SentenceCount:}} The total number of sentences in the issue report, including title and description.
\underline{\textbf{MostBugProb:}} The maximum probability of all sentences that are predicted as bug-prone.
\underline{\textbf{MostNonbugProb:}} The maximum probability of all sentences that are predicted as nonbug-prone.
\underline{\textbf{Location:}} The sequence number of the most nonbug-prone sentence.
This feature shows where that sentence appears in the issue report.
\underline{\textbf{BugCount:}} The number of sentences that are predicted as bug-prone.
\underline{\textbf{NonbugCount:}} The number of sentences that are predicted as non-bug-prone.
\underline{\textbf{ChangeCount:}} The number of semantic changes.
Over the sentence sequence, each time the predicted class switches from bug-prone to nonbug-prone or vice versa, the change count increases by one.
\underline{\textbf{Perplexity:}} The perplexity of the issue.
For the sentence sequence of the issue, the series of probabilities of being predicted as bug-prone is collected,
and its perplexity is calculated using Equation~\ref{equation:perplexity},
which is borrowed from the notion of perplexity in natural language processing (NLP).
\begin{equation}
Perplexity = \frac {1}{SentenceCount} \sum_{i=1}^{SentenceCount-1} \log (p_{i+1}-p_i)
\label{equation:perplexity}
\end{equation}
where $p_i$ is the bug-prone probability of the $i$-th sentence.
These features must be extracted carefully because ten-fold cross-validation is used.
All of these features are produced from the training set alone:
the training set is used to build a prediction model,
and that model is then used to extract the features of each instance in the training set.
This approach allows features to be extracted without using the labeled information of the testing set,
which would otherwise leak additional information and produce less sound results.
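The Stage-1 feature vector can be sketched from a list of per-sentence bug-prone probabilities. Several details here are assumptions, not specified by the paper: a 0.5 decision threshold, \texttt{MostNonbugProb} taken as $\max(1-p)$ over nonbug-prone sentences, \texttt{Location} as the index of the sentence with the smallest bug-prone probability, and $|p_{i+1}-p_i|$ plus a small epsilon inside the logarithm so Equation~\ref{equation:perplexity} is always defined:

```python
import math

def stage1_features(issue_prob, sentence_probs, threshold=0.5):
    """Sketch of the Stage-1 features; sentence_probs[i] is the
    bug-prone probability of the i-th sentence (assumptions noted above)."""
    n = len(sentence_probs)
    labels = [p >= threshold for p in sentence_probs]   # True = bug-prone
    bug = [p for p in sentence_probs if p >= threshold]
    nonbug = [p for p in sentence_probs if p < threshold]
    eps = 1e-9  # keeps log() defined when consecutive probabilities are equal
    perplexity = sum(math.log(abs(b - a) + eps)
                     for a, b in zip(sentence_probs, sentence_probs[1:])) / n
    return {
        "Probability": issue_prob,
        "SentenceCount": n,
        "MostBugProb": max(bug, default=0.0),
        "MostNonbugProb": max((1 - p for p in nonbug), default=0.0),
        "Location": max(range(n), key=lambda i: 1 - sentence_probs[i]) + 1,
        "BugCount": len(bug),
        "NonbugCount": len(nonbug),
        "ChangeCount": sum(a != b for a, b in zip(labels, labels[1:])),
        "Perplexity": perplexity,
    }
```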
\subsubsection{Stage 2 - Combining Free Text and Developer Information Classification}
\label{section:secondlevel}
In Stage 1, the probability of being bug-prone and the sentence-level perplexity information of each issue were obtained from free text.
These features form part of the input of Stage 2.
The experience of developers may influence the categories of their issues.
For example, skilled developers are more likely to report bug-prone issues
and to provide issue reports that meet the specifications of the core team.
Thus, in Stage 2, structured features about the contributors who submit issue reports were added.
These features capture the identity of the contributor in the project,
historical development activities, and social influence.
The detailed information is as follows:
% Intuition for considering these features: give a summary first, then describe each feature below.
\underline{\textbf{IsCoreTeam:}} This feature shows whether the contributor who reported the issue is a core team member in the project.
If the contributor is in the core team, this feature is set to 1, otherwise, 0.
\underline{\textbf{IssueCountInProject:}} The number of issues the contributor has previously reported in the project.
\underline{\textbf{IssueCountInGitHub:}} The number of issues the contributor has reported across GitHub.
\underline{\textbf{CommentCountInProject:}} The number of comments the contributor has posted in the project.
\underline{\textbf{CommentCountInGitHub:}} The number of comments the contributor has posted across GitHub.
\underline{\textbf{FollowerCount:}} The number of followers the contributor has acquired.
This feature can reveal the social influence of the contributor in GitHub.
\underline{\textbf{RegisterTime:}} The duration since the contributor registered.
The longer a contributor has been registered, the more familiar he or she is with the conventions of GitHub.
% To complete the classification process in Stage 2,
% data grafting process is required to combine features from free text and structured contributor information features.
% During Stage 1, the ID of report is not included, only input is the instances with lists of preprocessed terms,
% as well as the corresponding labels.
% A special subprocess called data grafting was used to smooth the linkage of the two stages.
% Data grafting aims to melt datasets from various sources, and combines features into a regular form according to its source.
% Owing to the use of a ten-fold strategy to partition the training set,
% following the change of dataset is required.
% Thus, the same partition process as training set for ID information, which enables tracing every instance, is performed.
Logistic regression was used as the prediction model in Stage 2.
Care is needed in partitioning the dataset into training and testing sets before the prediction model is built,
because the output of Stage 1 and the input of Stage 2 are associated.
The testing sets of Stages 1 and 2 should be the same to ensure that the same training set is used to build the model
and to avoid introducing extra information from the testing set.
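The Stage 2 combination can be sketched as follows. The feature values and weights are hypothetical illustrations; in the actual framework the weights are learned by fitting logistic regression on the training set.

```python
import math

def stage2_score(text_features, contributor_features, weights, bias=0.0):
    """Combine Stage 1 text features with structured contributor features
    and score the issue with a logistic function."""
    x = text_features + contributor_features
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))   # probability of being bug-prone

# Hypothetical issue: Stage 1 output p(bug) = 0.7 and perplexity = -0.9;
# the reporter is a core-team member (IsCoreTeam = 1) with 12 prior issues.
score = stage2_score([0.7, -0.9], [1, 12], weights=[2.0, 0.5, 0.3, 0.01])
is_bug_prone = score >= 0.5
```

The sigmoid output can be read as the combined probability that the issue is bug-prone.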

\subsection{Data Collection}
\label{sec:dataset}
% In our previous work \cite{yu2015wait}, we have composed a comprehensive dataset to study the pull-based model, involving 951,918 issues across
The dataset from Yu~\cite{yu2015wait}, a comprehensive dataset for studying the pull-based model
that involves 951,918 issues across 1,185 main-line projects in GitHub
(dump dated 10/11/2014, based on GHTorrent~\cite{gousios2013ghtorent,gousios2014lean}),
was used in this paper.
Data on title, content, labels, and contributors for each issue were obtained through GitHub API.
To determine which supervised text-based classification performs best, projects must contain a sufficient number of labeled issues for training and testing.
In addition, a proper balance of bug and non-bug issues in each project is required to avoid the influence of an unbalanced dataset.
Thus, candidate projects from GitHub that have at least 500 labeled issues and a bug rate between 20\% and 80\%
(the labeling process is presented in Section~\ref{labeling process})
were finally identified and used as the training and testing sets.
\subsection{Category Extraction}
\label{labeling process}
Compared to traditional ITS, the ITS in GitHub uses a labeling system to manage issues.
Category information from the user-defined label system must be extracted to obtain a pre-labeled training set from GitHub.
Our dataset has 7,793 different labels across 1,185 projects, and many projects use different tags (\ie labels),
such as ``bug'', ``type:bug'', ``error'', and ``defect'', to express the same meaning and identify bug-related issues.
This section presents a qualitative study of how tags are used to express the categories of issues.
Many projects in GitHub provide additional information in tags.
For example, in project ``WordPress-Android'', issues are labeled as ``[type] bug'', ``[type] enhancement'', and so on.
Tags in these projects contain not only categories of issues but also categories of the tag itself.
The information is useful in knowing what labels are mostly used to express categories of issues.
A process is designed to aggregate these tags by fully utilizing the additional information.
First, all tags of this form were selected, and their information was separated.
A 2D vector $<C, name>$ was used to represent these tags,
where $C$ is the category of the tag (such as ``type'', ``component'')
and $name$ is the main information of the tag (such as ``bug'', ``feature'').
Second, tags with similar $C$ items were grouped as $Group_C$.
Then, the similarity of each pair of groups was computed using Equation~\ref{equation:similarity}.
Groups whose similarity was greater than a threshold were merged.
\begin{equation}
similarity = \frac{\left| Group_{C_i} \cap Group_{C_j} \right|}{\min \left( \left| Group_{C_i} \right|, \left| Group_{C_j} \right| \right)}
, \quad (i \neq j)
\label{equation:similarity}
\end{equation}
$Group_{C_i}$ is a set of tags with the same category $C_i$ and different $name$.
Finally, structured tag information is obtained through the aforementioned process.
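Equation~\ref{equation:similarity} is the overlap coefficient of two tag groups, as the following sketch shows. The tag names and the 0.5 merge threshold are illustrative assumptions; the paper does not state a concrete threshold value.

```python
def group_similarity(group_a, group_b):
    """Overlap coefficient of two tag groups: |A intersect B| / min(|A|, |B|)."""
    a, b = set(group_a), set(group_b)
    return len(a & b) / min(len(a), len(b))

# Hypothetical groups gathered from tags such as "type:bug" / "kind/bug".
type_tags = {"bug", "enhancement", "feature"}
kind_tags = {"bug", "enhancement", "question", "docs"}

sim = group_similarity(type_tags, kind_tags)   # 2 shared / min(3, 4)
THRESHOLD = 0.5                                # assumed merge threshold
merged = type_tags | kind_tags if sim > THRESHOLD else None
```

Dividing by the size of the smaller group makes the measure robust when one category (e.g. ``type'') is used far more widely than another.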
% \begin{table}[htbp]
% \centering
% \caption{Summary of Labels in GitHub}
% \begin{tabular}{|l|p{0.65\columnwidth}|} \hline
% \textbf{Type} & \textbf{Labels} \\ \hline
% bug & \begin{tabular}[c]{@{}l@{}}bug, defect, bug - functional, Bug - Views, \\ Bug: Framework, Issue Type: Bug, kind/bug, etc.\end{tabular} \\ \hline
% enhancement & \begin{tabular}[c]{@{}l@{}}Enhancement, Improvement, kind: enhancement, \\ t:enhancement, etc.\end{tabular} \\ \hline
% feature & \begin{tabular}[c]{@{}l@{}}feature,feature request, Feature - High Priority, \\ Feature/Http, feature:bootstrap, etc.\end{tabular} \\ \hline
% documentation & \begin{tabular}[c]{@{}l@{}}doc, documentation, type/docs, \\ Kind:Documentation, etc.\end{tabular} \\ \hline
% question & \begin{tabular}[c]{@{}l@{}}kind/question, type/question, etc.\end{tabular} \\ \hline
% other & \begin{tabular}[c]{@{}l@{}}task, test, support, design, refactor, etc.\end{tabular} \\ \hline
% \end{tabular}%
% \label{tag:datacollection}
% \end{table}%
\begin{table}[htbp]
\centering
\caption{Tags in GitHub}
\label{tag:datacollection}
\begin{tabular}{clcccc}
\toprule
Category & Label & Projects & Issues & Percent & Total \\
\midrule
\multirow{3}{*}{bug} & bug & 644 & 118,155 & 46.9\% & \multirow{3}{*}{52.2\%} \\
& defect & 15 & 7,604 & 3.0\% & \\
& type:bug & 13 & 5,684 & 2.3\% & \\
\midrule
\multirow{7}{*}{nonbug} & enhancement & 412 & 44,947 & 17.8\% & \multirow{7}{*}{38.6\%} \\
& feature & 199 & 14,795 & 5.9\% & \\
& question & 319 & 13,109 & 5.2\% & \\
& feature request & 93 & 6,976 & 2.8\% & \\
& documentation & 239 & 6,422 & 2.5\% & \\
& improvement & 45 & 5,592 & 2.2\% & \\
& docs & 122 & 5,510 & 2.2\% & \\
\bottomrule
\end{tabular}
\end{table}
{\color{red}{Through the prior process, we extracted 149 tags that indicate the category of issues, collected as group ``$type$''.
Finally, we filtered a total of 252,084 issues with tags in group ``$type$''.}}
Table~\ref{tag:datacollection} shows the most frequently used tags in group ``$type$'' and the number of projects and issues in which they appear.
These tags were manually divided into bug-prone and nonbug-prone categories.
The most frequently used tags are bug, enhancement, and feature, which were observed in 46.9\%, 17.8\%, and 5.9\% of the labeled issues, respectively.
Issues with other tags were excluded.
These structured tags were used in this study to judge whether an issue is bug-prone or not.
The number of issues labeled with both categories is 3,869 in 386 projects, which is negligible compared with the 252,084 labeled issues.
% To get a overall perspective to label system, we aggregate all other labels according to the group information we aggregated before.
% In the result, we find the top 3 most used group of labels are type (\textit{i.e.}, bug, feature, enhancement), status (\textit{i.e.}, duplicate, wontfix, invalid) and adverb (\textit{i.e.}, high, critical, major, and this kind of labels are mostly used as priority and severity).
% To evaluate the coverage of these 3 kinds of labels, we select all labels that are used in more than 5 projects to filter labels with minority usage. We count the usage times of these 3 kinds of labels and calculate the usage rate for all filtered labels. Finally, these 3 groups of labels are used more than half, which achieve 58.7\%.
% Through aggregating all the labels, we finally get 113 labels used most in the category ``type''. From these labels, we distinguish them and select bug-like labels such as ``bug'', ``defect'', ``type:bug'', ``definition:bug'', etc. and select feature-like labels such as ``feature'', ``enhancement'', ``new feature'', ``feature request'', etc. Then, we label issues as bug or feature using labels we have distinguished, and these labeled issues will be our training data in following process.
% Because of the need for adequate classified issues for training model, we hand-pick 111 projects which have more than 500 labeled categories information issues. The detail information of these projects are shown in Table \ref{tag:datacollection}.
% \begin{table}[htbp]
% \centering
% \caption{Summary Statistics for Data Collection}
% \begin{tabular}{|c|c|c|} \hline
% \textbf{} & \textbf{Count} & \textbf{Mean} \\ \hline
% Projects & 111 & \\ \hline
% Issues & 356256 & 3209.5(per project) \\ \hline
% Labeled issues & 240754 & 2169.0(per project) \\ \hline
% Labels & 470965 & 1.96(per issue) \\ \hline
% \end{tabular}%
% \label{tag:datacollection}
% \end{table}%
% We also select 3 different projects (phpadmin, piwik and numpy) as objects of case study, which have enough labeled issues and different proportion of bug to feature. The detail information about them are exhibited in Table \ref{tag:casestudy}.
% \begin{table}[htbp]
% \centering
% \caption{Projects for Case Study}
% \begin{tabular}{|c|c|c|} \hline
% \textbf{Projects} & \textbf{Labeled Issues} & \textbf{Bug Proportion (\%)} \\ \hline
% phpadmin & 6766 & 0.75 \\ \hline
% piwik & 5389 & 0.66 \\ \hline
% numpy & 2511 & 0.86 \\ \hline
% \end{tabular}%
% \label{tag:casestudy}
% \end{table}%
\subsection{Preprocessing of Dataset}
Each issue, which can be labeled as ``bug'' or ``non-bug'',
is characterized by its title and description.
In this paper, issues labeled by the former process were selected to perform the following steps.
First, the linguistic features extracted for the text-based classifier undergo standard processing,
i.e., lowercasing, text filtering, stemming, and indexing~\cite{frakes1992information}.
All stop-words and common English terms, such as ``should'', ``might'', and ``not'', were retained.
The importance of linguistic features for classifying issues was shown in~\cite{antoniol2008bug};
moreover, \cite{bissyande2013got} mentions that removing the default list of stop-words of common corpora might decrease the classification accuracy.
In addition, ``\`{}\`{}\`{}'' is used to identify code blocks in issue reports, because GitHub uses a markdown editor.
Then, a vector space model was used to represent each issue as a weighted vector.
Each issue is segmented into terms (in this paper, a term is a single word);
each element in the vector of the issue is the weight of a term,
whose value represents the importance of that term to the issue.
Term frequency-inverse document frequency (\textit{tf-idf}) is used to calculate weight.
Tf-idf is based on two assumptions:
first, the more frequently a word appears in an issue, the more important it is to that issue;
second, the more issues a word appears in, the less useful it is for distinguishing among those issues.
% tf-idf can be calculated as Equations~(\ref{equation:tf})(\ref{equation:idf})(\ref{equation:tfidf}).
% \begin{equation}
% tf(t,i) = \frac{{{n_t}}}{{{N_i}}}
% \label{equation:tf}
% \end{equation}
% \begin{equation}
% idf(t) = \log \left( {\frac{{{N_I}}}{{\left| {i \in I:t \in i} \right|}}} \right)
% \label{equation:idf}
% \end{equation}
% \begin{equation}
% tf-idf(t,i) = tf(t,i) \times idf(t)
% \label{equation:tfidf}
% \end{equation}
% Where \textit{t} is a term, \textit{i} is the corpus of an issue, \textit{I} is the corpus of all issues in the given project, $n_t$ is the count of appearance for term \textit{t} in the issue, $N_i$ is the total number of terms in issue \textit{i} and $N_I$ is the total number of issues in the given project.
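The tf-idf weighting described above ($tf = n_t/N_i$, $idf = \log(N_I/|i \in I : t \in i|)$) can be sketched as follows; the toy issues and tokenization are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(issues):
    """Toy tf-idf: each issue is a list of tokens (stop-words retained,
    matching the preprocessing above)."""
    n_issues = len(issues)
    # df[t]: number of issues containing term t
    df = Counter(t for issue in issues for t in set(issue))
    vectors = []
    for issue in issues:
        counts, total = Counter(issue), len(issue)
        vectors.append({t: (c / total) * math.log(n_issues / df[t])
                        for t, c in counts.items()})
    return vectors

issues = [["crash", "when", "open", "file"],
          ["add", "dark", "theme", "when", "possible"]]
vecs = tfidf_vectors(issues)
# "when" appears in every issue, so idf = log(1) = 0 and its weight vanishes
```

This illustrates the second assumption above: a term that appears in every issue contributes nothing to distinguishing among them.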

The prosperity of open source software (OSS) resulted in an increasing number of developers joining in and contributing to OSS communities.
GitHub is one of the most popular social coding communities and attracts a large number of developers~\cite{gousios2015work}.
As of April 2016\footnote{\url{https://en.wikipedia.org/wiki/GitHub}},
over 14 million registered users were collaborating in GitHub,
demonstrating great power in driving OSS projects forward.
% At the same time, it brings many problems to project manager.
From the perspective of contributors, reporting issues through the issue tracking system (ITS)
is one of the most important activities in OSS communities~\cite{jalbert2008automated}.
As described in Section~\ref{ITS_T}, the ITS provided by GitHub is more lightweight to use
than traditional ITSs (e.g., Bugzilla).
A contributor only needs to provide a short textual abstract, containing a title and an optional description,
to report a new issue in GitHub.
Therefore, this simplified process of reporting issues decreases the barrier to entry
and attracts more inexperienced external contributors.
According to our statistics, Ruby on Rails,
%\footnote{\url{https://github.com/rails/rails}},
one of the most active projects in GitHub, receives upwards of 700 new issues each month.
%there are 134 issue reports per month in average,
%and they receive a maximum of 723 issue reports a month.
However, the extreme openness of ITS poses a serious challenge for the core team in project maintenance.
In large-scale projects, many undesirable and vague issue reports are submitted by external contributors
(\eg asking questions, as shown in Figure~\ref{figure:example})
because of their reluctance to spend adequate time to read and comprehend the contribution guidelines
(as shown in Figure~\ref{figure:rule}),
which provide details on reporting a high-quality issue and the kind of issue the project prefers.
Thus, issue categorization is a labor-intensive and time-consuming task for project managers.
Furthermore, the core team members have to provide quick responses and resolve the incoming issues in time
to sustain the passion of external contributors~\cite{huang2016effectiveness}.
%~\footnote{The guide of Ruby on Rails highlights that please don't ask question in the ITS.}.
%and just submit issue report casually as shown in Figure \ref{figure:example}.
%Most of time, tasks like these are not directly helpful to the development of the project.
%which results in many problems for project managers.
%To better manage these activities, many projects have made some specification about how to use ITS, as shown in Figure \ref{figure:rule}.
%These docs show the steps of how to submit an issue report and specify what categories and formats of issue report they prefer.
%For one thing, the increasing number of contributors accelerate the exploration of bug and bring many issue reports,
%which adds to managers' task.
%Managers have to deal with these issue reports in time to keep passion of these contributors \cite{huang2016effectiveness}.
%And all that makes managers have to spare much time to handle these tasks.
%For another, the increasing number of contributors makes management task more complicated.
% In the process of project developing,
% in order to better manage ITS and improve the efficiency of cooperation with other developers,
% the core developers may make some specification or follow some potential habit about how to use ITS.
% Some projects write docs of rules for how to contribute to the project, as show in Figure \ref{figure:rule}.
%For non-core developers, they may not spend enough time to comprehend the specification of the project, and just submit issue report casually as shown in Figure \ref{figure:example}.
%Most of time, tasks like these are not directly helpful to the development of the project.
%which results in many problems for project managers.
\begin{figure}[!htbp]
\centering
\setlength{\fboxrule}{0.5pt}
% \setlength{\fboxsep}{1cm}
\fbox{\includegraphics[scale = 0.31]{figure/wrong_usage}}
\caption{Example of undesirable issue in ITS}
\label{figure:example}
\end{figure}
\begin{figure}[!htbp]
\centering
\setlength{\fboxrule}{0.5pt}
% \setlength{\fboxsep}{1cm}
\fbox{\includegraphics[scale = 0.25]{figure/rule}}
\caption{Example of contribution guidelines in GitHub}
\label{figure:rule}
\end{figure}
Most issue management tasks are organized based on the label system in
GitHub\footnote{\url{https://help.github.com/articles/applying-labels-to-issues-and-pull-requests/}}
(as discussed in Section~\ref{ITS_GH}).
One of the most popular practices is distinguishing different types~\cite{antoniol2008bug} of issues
(\eg bug, feature request, and refactoring), which is a manual process maintained by core developers.
Thus, a high-performance issue categorization approach,
especially one that works with limited prior information
(\ie the majority of issues only have a textual summary and the historical data of their submitters),
could significantly reduce the cost of issue management.
This paper focuses on the challenge of distinguishing real bugs from nonbugs among all issues,
similar to prior work~\cite{herzig2013s,zhou2014combining}.
Although a fine-grained classification is deferred for future work,
we argue that this work can 1)~greatly improve the efficiency of issue management in GitHub
because well-established projects prefer to maintain limited types of issues,
especially for bug and feature
(\eg Ruby on Rails\footnote{\url{http://edgeguides.rubyonrails.org/contributing_to_ruby_on_rails.html}},
and angular.js\footnote{\url{https://github.com/angular/angular/blob/master/CONTRIBUTING.md#issue}});
2)~reduce the noise and bias~\cite{herzig2013s,herzig2013predicting}
introduced by confusing real bugs with other types of issue (\eg feature requests),
when building bug prediction~\cite{d2010extensive,hata2012bug} or
software quality~\cite{vasilescu2015quality}
models based on mining the big data from GitHub.
%For the academic research based on mining the big data from GitHub,
%identifying real bugs accurately reduce the bias~\cite{herzig2013s,herzig2013predicting}
%for building bug prediction~\cite{d2010extensive,neuhaus2007predicting,hata2012bug} or
%software quality~\cite{kan2002metrics,vasilescu2015quality,yu2016initial} models.
In summary, the key contributions of this paper include:
\begin{itemize}
\item The study of text-based classification approaches on a large-scale dataset.
Four different machine learning classifiers were evaluated on 80 popular projects in GitHub.
The results show that support vector machines (SVM) achieve the best performance.
% , which has best accuracy on average and performs most stably for different projects.
\item The limitations of text-based classification approaches were analyzed,
and the results showed that semantic perplexity
(\ie an issue's description confuses bug-related sentences with nonbug-related sentences)
is a crucial factor that affects classification performance.
Thus, representative metrics were designed to quantify the semantic perplexity of an issue report.
%In the process of manually analysis, we found that issue reports that are hard to classify always contain perplexing text.
%The result of regression analysis approve our assume that perplexity of free text can significant affect the performance of ML techniques.
% We conclude a scheme of issue reports through manually analyzing 589 issue reports. We find that issue reports whose description discusses both bug and feature are most likely classified incorrectly. The key to distinguish them is to analysis the structure of the description. For features, the feature-like sentences are more likely to appear in the begin or end of the description.
\item A novel two-stage classification framework was designed to improve the performance of traditional classification models.
Features relating to semantic perplexity are extracted from free text in the first stage,
and a synthesized classification model is built in the second stage.
The quantitative evaluations show a significant improvement in classification performance.
%a significant improvement for 101 projects.
% Based on the scheme detected above, we improve the ML classification methods using heuristic method.
% We divide description into sentences and use classification model to classify these sentences,
% then select out issue reports which are most likely to be wrong, according to
% the classification results of individually sentences (see Section 4.3).
% Finally, we evaluate our heuristic method on 111 projects and achieve 7.85\% improvement on average.
\end{itemize}
The remainder of this paper is organized as follows.
Section 2 introduces the background of our study and illustrates related work and our research questions.
Section 3 presents the key process of building effective classification models.
The results and discussion can be found in Section 4.
Finally, we draw our conclusions in Section 5.
%Moreover, there are many different categories of issue in ITS~\cite{antoniol2008bug}: information about bugs, requests of new feature or function, refactoring / restructuring activities, etc.
%Developers use labels to distinguish what category the issue report, but it is not mandatory when submit the issue.
%These loose restrictions make users in GitHub only concern what they are interesting in,
%which result in serious lack of categories information that are very useful.
%Repairing the missing categories of issue reports can be a very useful work, but it takes a lot of time and manpower to maintain it.
%So automatic classifier for issue reports seems necessary.
% Issue Tracking System (ITS) plays an important role on guiding the maintenance activities of software developers \cite{jalbert2008automated},
% which is widely used in Open Source Software (OSS) as well as in the industry software.
% With the help of ITS, project managers can clearly understand what happen on their projects
% (i.e., how many issues are reported by users daily; which module is associated with most bugs?)
% In general, there are many different kinds of issue in ITS~\cite{antoniol2008bug}: information about bugs, requests of new feature or function, refactoring / restructuring activities, etc. Users can label issue reports to distinguish categories of them, but most of issues missing the message of that. It takes a lot of time and manpower to maintain it and automatic classifier for issue seems necessary.
%Distinguishing the categories of issue reports also benefits for research works.
%For some bug prediction methods \cite{d2010extensive,neuhaus2007predicting}, they use bug reports information to map vulnerabilities to components, which help them to build prediction model.
%Thus, it is meaningful to accurately tell which issues are bug-prone and which are nonbug-prone.
% The largest social coding platform, such as GitHub, creates a convenient and comfortable environment for many software developers through integrating many developing tools and social media tools. At the same time, GitHub attracts many programming enthusiasts to contribute and stores much valuable historical data of these activity. However, users in GitHub only concern what they are interesting in, which result in serious lack of information that are very useful for a few people, such as the categories if a issue report. Hence, repairing the missing categories of issue reports can be a very useful work.
% Herzig et al~\cite{herzig2013s} illustrate six categories of issue reports from Bugzilla and Jira, i.e., \{BUG, RFE, IMPR, DOC, REFAC, OTHER\} and the list clearly distinguishes the task of different kinds of maintenance work.
% From these categories, issue report about BUG and RFE (feature) are most in GitHub \cite{bissyande2013got}. As for academic research, mining bug \cite{antoniol2008bug,herzig2013s,maalej2015bug,zhou2014combining} and feature \cite{fischer2003analyzing,maalej2015bug} are concerned most.
%In this paper, since a fine-grained classification of issue reports is beyond the scope of this paper, we are more interested in a binary classification, i.e., Bug and Non-Bug, similar to above prior work.
%
%The key contributions of this paper include:
%
%1) We study issue classification models on a large scale.
%We utilize 4 different MLTs on ITS of 101 popular projects in GitHub.
%From evaluation results, we find that SVM is significantly better than other 3 method.
%% , which has best accuracy on average and performs most stably for different projects.
%
%2) In the process of manually analysis, we found that issue reports that are hard to classify always contain perplexing text.
%The result of regression analysis approve our assume that perplexity of free text can significant affect the performance of ML techniques.
% We conclude a scheme of issue reports through manually analyzing 589 issue reports. We find that issue reports whose description discusses both bug and feature are most likely classified incorrectly. The key to distinguish them is to analysis the structure of the description. For features, the feature-like sentences are more likely to appear in the begin or end of the description.
%3) To improve the performance of classification model, We build a 2-level classifier framework, which takes advantage of what we found before.
%The 2-level classifier framework extract features about perplexity of free text in first level,
%and build a better performance classification model in second level.
%The analysis result shows a significant improvement for 101 projects.
%% Based on the scheme detected above, we improve the ML classification methods using heuristic method.
%% We divide description into sentences and use classification model to classify these sentences,
%% then select out issue reports which are most likely to be wrong, according to
%% the classification results of individually sentences (see Section 4.3).
%% Finally, we evaluate our heuristic method on 111 projects and achieve 7.85\% improvement on average.
%
%The structure of this paper is as follows. In Section 2, we introduce background of our study, and illustrates related work and some research questions. In section 3, we present some key process of building effective classification model. Result and discussion can be found in Section 4. Finally, we draw our conclusions in Section 5.
% As for academic research,
% many research have studied about bug prediction . The task of them is to predict whether the issue is talking about a bug. The key problem they meet is reducing wrong information users provide about categories of issue reports. Most of them label issues manually and it limits scope of projects that they can use to test their method.
% The categories of issue is not limited to bug. Besides recording bug report, feature request is another kind of most used issue reports. It is meaningful to distinguish feature from other categories because of a sufficient number of issue reports about feature and useful information they provide. Similar approach are proposed to mining features from app store. And research \cite{fischer2003analyzing} studied relationship between features through bug report data.

% This is "sig-alternate.tex" V2.1 April 2013
% This file should be compiled with V2.5 of "sig-alternate.cls" May 2012
%
% This example file demonstrates the use of the 'sig-alternate.cls'
% V2.5 LaTeX2e document class file. It is for those submitting
% articles to ACM Conference Proceedings WHO DO NOT WISH TO
% STRICTLY ADHERE TO THE SIGS (PUBS-BOARD-ENDORSED) STYLE.
% The 'sig-alternate.cls' file will produce a similar-looking,
% albeit, 'tighter' paper resulting in, invariably, fewer pages.
%
% ----------------------------------------------------------------------------------------------------------------
% This .tex file (and associated .cls V2.5) produces:
% 1) The Permission Statement
% 2) The Conference (location) Info information
% 3) The Copyright Line with ACM data
% 4) NO page numbers
%
% as against the acm_proc_article-sp.cls file which
% DOES NOT produce 1) thru' 3) above.
%
% Using 'sig-alternate.cls' you have control, however, from within
% the source .tex file, over both the CopyrightYear
% (defaulted to 200X) and the ACM Copyright Data
% (defaulted to X-XXXXX-XX-X/XX/XX).
% e.g.
% \CopyrightYear{2007} will cause 2007 to appear in the copyright line.
% \crdata{0-12345-67-8/90/12} will cause 0-12345-67-8/90/12 to appear in the copyright line.
%
% ---------------------------------------------------------------------------------------------------------------
% This .tex source is an example which *does* use
% the .bib file (from which the .bbl file % is produced).
% REMEMBER HOWEVER: After having produced the .bbl file,
% and prior to final submission, you *NEED* to 'insert'
% your .bbl file into your source .tex file so as to provide
% ONE 'self-contained' source file.
%
% ================= IF YOU HAVE QUESTIONS =======================
% Questions regarding the SIGS styles, SIGS policies and
% procedures, Conferences etc. should be sent to
% Adrienne Griscti (griscti@acm.org)
%
% Technical questions _only_ to
% Gerald Murray (murray@hq.acm.org)
% ===============================================================
%
% For tracking purposes - this is V2.0 - May 2012B
\documentclass{sig-alternate-05-2015}
\usepackage{subfigure}
\usepackage{graphicx}
\usepackage{flushend}
\usepackage{booktabs}
\usepackage{multirow}
%\newcommand{\eg}{{\emph{e.g.}},\xspace}
%\newcommand{\ie}{{\emph{i.e.}},\xspace}
\begin{document}
% Copyright
\setcopyright{acmcopyright}
%\setcopyright{acmlicensed}
%\setcopyright{rightsretained}
%\setcopyright{usgov}
%\setcopyright{usgovmixed}
%\setcopyright{cagov}
%\setcopyright{cagovmixed}
% DOI
\doi{10.475/123_4}
% ISBN
\isbn{123-4567-24-567/08/06}
%Conference
\conferenceinfo{PLDI '13}{June 16--19, 2013, Seattle, WA, USA}
\acmPrice{\$15.00}
%
% --- Author Metadata here ---
\conferenceinfo{WOODSTOCK}{'97 El Paso, Texas USA}
%\CopyrightYear{2007} % Allows default copyright year (20XX) to be over-ridden - IF NEED BE.
%\crdata{0-12345-67-8/90/01} % Allows default copyright data (0-89791-88-6/97/05) to be over-ridden - IF NEED BE.
% --- End of Author Metadata ---
\title{Where is the Road for Issue Report Classification Based on Text Mining?}
% \subtitle{[Extended Abstract]
% \titlenote{A full version of this paper is available as
% \textit{Author's Guide to Preparing ACM SIG Proceedings Using
% \LaTeX$2_\epsilon$\ and BibTeX} at
% \texttt{www.acm.org/eaddress.htm}}}
%
% You need the command \numberofauthors to handle the 'placement
% and alignment' of the authors beneath the title.
%
% For aesthetic reasons, we recommend 'three authors at a time'
% i.e. three 'name/affiliation blocks' be placed beneath the title.
%
% NOTE: You are NOT restricted in how many 'rows' of
% "name/affiliations" may appear. We just ask that you restrict
% the number of 'columns' to three.
%
% Because of the available 'opening page real-estate'
% we ask you to refrain from putting more than six authors
% (two rows with three columns) beneath the article title.
% More than six makes the first-page appear very cluttered indeed.
%
% Use the \alignauthor commands to handle the names
% and affiliations for an 'aesthetic maximum' of six authors.
% Add names, affiliations, addresses for
% the seventh etc. author(s) as the argument for the
% \additionalauthors command.
% These 'additional authors' will be output/set for you
% without further effort on your part as the last section in
% the body of your article BEFORE References or any Appendices.
\numberofauthors{4} % in this sample file, there are a *total*
% of EIGHT authors. SIX appear on the 'first-page' (for formatting
% reasons) and the remaining two appear in the \additionalauthors section.
%
\author{
% You can go ahead and credit any number of authors here,
% e.g. one 'row of three' or two rows (consisting of one row of three
% and a second row of one, two or three).
%
% The command \alignauthor (no curly braces needed) should
% precede each author name, affiliation/snail-mail address and
% e-mail address. Additionally, tag each line of
% affiliation/address with \affaddr, and tag the
% e-mail address with \email.
%
% 1st. author
% \alignauthor
Qiang Fan, Yue Yu, Gang Yin, Tao Wang, Huaimin Wang\\
\affaddr{National University of Defence Technology}\\
%\affaddr{1932 Wallamaloo Lane}\\
\affaddr{Changsha, China}\\
\email{\{fanqiang09, yuyue, yingang, taowang2005, hmwang\}@nudt.edu.cn}
% 2nd. author
% \alignauthor
% Yue Yu\\
% \affaddr{National University of Defence Technology}\\
% %\affaddr{1932 Wallamaloo Lane}\\
% \affaddr{Changsha, China}\\
% \email{yuyue@nudt.edu.cn}
% % 3rd. author
% \alignauthor Gang Yin\\
% \affaddr{National University of Defence Technology}\\
% %\affaddr{1932 Wallamaloo Lane}\\
% \affaddr{Changsha, China}\\
% \email{yingang@nudt.edu.cn}
% \and % use '\and' if you need 'another row' of author names
% % 4th. author
% \alignauthor Tao Wang\\
% \affaddr{National University of Defence Technology}\\
% %\affaddr{1932 Wallamaloo Lane}\\
% \affaddr{Changsha, China}\\
% \email{taowang2005@nudt.edu.cn}
% % 5th. author
% % \alignauthor Zx Li, Mw Chen \titlenote{They are raters of issue reports metioned in the paper}\\
% % \affaddr{National University of Defence Technology}\\
% % %\affaddr{1932 Wallamaloo Lane}\\
% % \affaddr{Changsha, China}\\
% % \email{starleelzx@163.com}
% % 6th. author
% \alignauthor Huaimin Wang\\
% \affaddr{National University of Defence Technology}\\
% %\affaddr{1932 Wallamaloo Lane}\\
% \affaddr{Changsha, China}\\
% \email{hmwang@nudt.edu.cn}
}
% There's nothing stopping you putting the seventh, eighth, etc.
% author on the opening page (as the 'third row') but we ask,
% for aesthetic reasons that you place these 'additional authors'
% in the \additional authors block, viz.
% \additionalauthors{Additional authors: John Smith (The Th{\o}rv{\"a}ld Group,
% email: {\texttt{jsmith@affiliation.org}}) and Julius P.~Kumquat
% (The Kumquat Consortium, email: {\texttt{jpkumquat@consortium.net}}).}
% \date{30 July 1999}
% Just remember to make sure that the TOTAL number of authors
% is the number that will appear on the first page PLUS the
% number that will appear in the \additionalauthors section.
\maketitle
\begin{abstract}
\input{abstract}
\end{abstract}
\keywords{issue tracking system; machine learning algorithms; mining software repositories}
% \\[2ex]
\section{Introduction}
\input{introduction}
\section{Background and Related Work}
\input{background}
\section{Methods}
\input{data_process}
% \begin{figure*}[!htbp]
% \centering
% \includegraphics[scale = 0.46]{precision}
% \caption{Accuracy of different ML}
% \label{f_all}
% \end{figure*}
% \section{Result and Discussion}
% \input{result}
% \section{RQ1:The Performance of ML}
% \input{rq1}
% \label{ML}
\section{Results and Discussion}
\input{result}
% \section{RQ3:The Improved Classification Method}
% \input{rq3}
\section{Conclusions}
\input{conclusion}
%ACKNOWLEDGMENTS are optional
\section{Acknowledgments}
This research is supported by the National Natural Science Foundation of China (Grant Nos. 61432020, 61472430, 61502512, and 61303064).
%
% The following two commands are all you need in the
% initial runs of your .tex file to
% produce the bibliography for the citations in your paper.
\bibliographystyle{abbrv}
\bibliography{sigproc} % sigproc.bib is the name of the Bibliography in this case
% You must have a proper ".bib" file
% and remember to run:
% latex bibtex latex latex
% to resolve all references
%
% ACM needs 'a single self-contained file'!
%
%APPENDICES are optional
%\balancecolumns
% This next section command marks the start of
% Appendix B, and does not continue the present hierarchy
%\balancecolumns % GM June 2007
% That's all folks!
\end{document}

% ========== submit/result.tex ==========
\subsection{Evaluation Metrics}
\textit{Precision}, \textit{recall} and \textit{F-measure} are widely used standard metrics in related work, such as issue assignment \cite{zanetti2013categorizing}, bug prediction \cite{d2010extensive,neuhaus2007predicting} and reviewer recommendation \cite{lee2013patch,jeong2009improving}.
These metrics can measure the performance of models from different perspectives.
For instance, \textit{precision} is used to measure the exactness of the prediction, whereas \textit{recall} evaluates the completeness.
\textit{F-measure} balances \textit{precision} and \textit{recall},
and is defined as their harmonic mean:
\begin{equation}
F\mbox{-}measure = 2\times\frac{precision\times recall}{precision+recall}
\end{equation}
In~\cite{zhou2014combining}, the weighted average value of \textit{F-measure} for both categories
is used to evaluate the classification model.
This metric considers the performance of both categories and provides an overall performance of the classification model.
Thus, this paper adopts a metric similar to~\cite{zhou2014combining}, defined in Equation \ref{equation:performance},
where $f_{avg}$ denotes the \textit{average F-measure}, ${f_{bug}}$ (${f_{nonbug}}$) the \textit{F-measure} of the bug (nonbug) category, and ${n_{bug}}$ (${n_{nonbug}}$) the number of bug (nonbug) issues.
\begin{equation}
{f_{avg}} = \frac{n_{bug}\times f_{bug}+n_{nonbug}\times f_{nonbug}}{n_{bug}+n_{nonbug}}
\label{equation:performance}
\end{equation}
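The two formulas above translate directly into code. A minimal sketch (the function names are ours, not from the paper's artifacts):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (Equation 1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def f_avg(f_bug, n_bug, f_nonbug, n_nonbug):
    """Average F-measure weighted by the size of each category (Equation 2)."""
    return (n_bug * f_bug + n_nonbug * f_nonbug) / (n_bug + n_nonbug)
```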
\subsection{RQ1: Performance of Text-based Classification}
\begin{table*}[htbp]
\centering
\caption{\label{tab:compareML} Comparisons of different text-based classifiers}
\begin{tabular}{lccccc}
\toprule
Group A vs. Group B & Estimator & Lower & Upper & Statistic & p-value \\
\midrule
NB vs. Base Line & 0.028 & 0.010 & 0.074 & -9.61727834 & 0.000000e+00 \\
LR vs. Base Line & 0.005 & 0.001 & 0.025 & -8.85487385 & 0.000000e+00 \\
RF vs. Base Line & 0.001 & 0.000 & 0.016 & -6.53029671 & 3.446741e-10 \\
SVM vs. Base Line & 0.001 & 0.001 & 0.002 & -43.36487241 & 0.000000e+00 \\
SVM vs. NB & 0.127 & 0.067 & 0.226 & -7.62243934 & 1.173506e-13 \\
SVM vs. LR & 0.310 & 0.207 & 0.437 & -4.04251177 & 5.212218e-04 \\
SVM vs. RF & 0.296 & 0.195 & 0.421 & -4.36384024 & 1.115098e-04 \\
\midrule
Combined method vs. SVM & 0.348 & 0.246 & 0.466 & -3.277566 & 0.005725885 \\
Combined method vs. developer information & 0.398 & 0.290 & 0.517 & -2.207413 & 0.012257345 \\
Combined method vs. perplexity information & 0.448 & 0.336 & 0.567 & -1.113729 & 0.048434745 \\
\bottomrule
\end{tabular}
\end{table*}
For issue classification,
four machine learning classifiers (\textit{Naive Bayes}, \textit{Logistic Regression}, \textit{Random Forest}, and \textit{Support Vector Machine}) were applied to our dataset,
and four classification models were built for each project
to determine which text-based classifier performs best.
A grep-based method was set as the baseline: it labels an issue as a bug if its text contains keywords such as ``bug'', ``defect'', or ``fix''.
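Such a grep-based baseline fits in a few lines; the keyword list comes from the text above, while the function itself is an illustrative sketch:

```python
BUG_KEYWORDS = ("bug", "defect", "fix")

def grep_baseline(title, body):
    """Label an issue as a bug iff any keyword occurs in its free text."""
    text = (title + " " + body).lower()
    return any(keyword in text for keyword in BUG_KEYWORDS)
```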
Boxplots were used to exhibit the results of each classifier and summarize the overall performance across all projects.
The \textit{average F-measure} (i.e., $f_{avg}$) was used to evaluate the classifiers, taking the mean of the ten-fold results as the performance.
Figure~\ref{figure:performanceML} shows the $f_{avg}$ of baseline method and four different classifiers,
where the y-axis is $f_{avg}$.
Figure \ref{figure:performanceML} shows that all four classifier approaches outperform the baseline method.
\textit{Logistic Regression} and \textit{Random Forest} achieve similar performance,
better than that of \textit{Naive Bayes} but slightly worse than that of \textit{SVM}.
Statistical analysis was used to verify our conclusions about the difference between text-based classifiers and the baseline method.
Traditionally, the comparison of multiple groups follows a two-step approach:
first, a global null hypothesis is tested,
and then multiple comparisons are used to test the sub-hypotheses pertaining to each pair of groups.
However, the global test null hypothesis may be rejected, whereas none of the sub-hypotheses are rejected, or vice versa \cite{gabriel1969simultaneous}.
Therefore, the one-step approach, multiple contrast test procedure $\widetilde{\textbf{T}}$~\cite{konietschke2012rank,yu2016IST},
is preferred in this study.
The procedure $\widetilde{\textbf{T}}$ was implemented with the \textit{nparcomp} package \cite{fraenkel1993design} in R to evaluate the F-measure of all approaches on the 80 projects in our dataset.
The \textit{Tukey} (all-pairs) contrast was used to compare all groups pairwise.
For each pair of groups, the 95\% confidence interval was analyzed to test whether the corresponding null sub-hypothesis can be rejected.
If the lower boundary of the interval is greater than zero for groups A and B,
then the metric value is higher in A than in B.
Similarly, if the upper boundary of the interval is less than zero for groups A and B,
then the metric value is lower in A than in B.
Finally, if the lower boundary of the interval is less than zero and the upper boundary is greater than zero,
then the data do not provide sufficient evidence to reject the null hypothesis.
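The three-way decision rule above can be stated compactly; a sketch under the same convention, where lower and upper are the 95% confidence-interval boundaries for groups A vs. B:

```python
def interpret_ci(lower, upper):
    """Read off the sub-hypothesis decision from a 95% confidence interval."""
    if lower > 0:
        return "A > B"        # metric is significantly higher in A
    if upper < 0:
        return "A < B"        # metric is significantly lower in A
    return "inconclusive"     # interval spans zero: cannot reject the null
```

For instance, the SVM vs. LR row of Table~\ref{tab:compareML} has a lower boundary of 0.207 > 0, so SVM's metric is significantly higher.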
% \begin{table*}[htbp]
% \centering
% \caption{\label{tab:compareImprove}Performance comparison of different approaches}
% \begin{tabular}{llcccc}
% \toprule
% Case & Group A vs. Group B & Avg. Imp & Min. Imp & Max. Imp & p.Value \\
% \midrule
% Performance < 0.75 & DI + SVM vs. SVM & 0.003 & 0.003 & 0.033 & 0.8743 \\
% & PI + DI + SVM vs. SVM & 0.024 & 0.002 & 0.061 & 0.0041 \\
% & PI + DI + SVM vs. DI + SVM & 0.021 & 0 & 0.076 & 0.0168 \\
% \midrule
% Performance < 0.8 & DI + SVM vs. SVM & 0.001 & -0.031 & 0.033 & 0.8163 \\
% & PI + DI + SVM vs. SVM & 0.014 & -0.008 & 0.061 & 0.0124 \\
% & PI + DI + SVM vs. DI + SVM & 0.013 & 0 & 0.076 & 0.0299 \\
% \bottomrule
% \end{tabular}
% \end{table*}
Table~\ref{tab:compareML} shows results of procedure $\widetilde{\textbf{T}}$
(the last three rows are the results of Section~\ref{sec:2stage}).
All the $p$-values are less than 0.05 in the first four rows.
Thus, a significant difference between the four text-based classifiers and the baseline method is observed,
which implies that text-based classification is useful for issue classification in the ITS of GitHub.
Similarly, all the $p$-values are less than 0.05 in the next three rows and the lower and upper boundaries are greater than zero,
which means that the performance of \textit{SVM} is significantly better than that of the other three classifiers.
% \fbox{
% \parbox{
% \begin{center}
% aaa\\
% bbb
% \end{center}
% }
% }
\begin{framed}
\noindent
\textbf{Result 1:}
\textit{
In the context of GitHub's ITS, text-based classification approaches can achieve
69.7\% to 98.9\% of average F-measure (calculated as Equation~\ref{equation:performance}) on our large-scale dataset,
and the SVM classifier is the most effective approach compared to other typical classifiers.
%For most of projects in GitHub, MLTs still work on classifying issues.
%And SVM performs significant better than other 3 MLTs for 101 projects.
}
\end{framed}
\begin{figure}[!htb]
\centering
% \includegraphics[width=8.5cm]{classprocess}
\includegraphics[width=8.5cm]{figure/boxplot}
\caption{$f_{avg}$ of different ML methods}
\label{figure:performanceML}
\end{figure}%picture
\subsection{RQ2: Regression Analysis}
Table~\ref{tab:fixedmodel} shows the result of regression analysis.
The model achieves a strong fit (${R_c}^2 = 92.5\%$).
In our model, the variance inflation factors of all remaining predictors are well below three,
thereby indicating the absence of multicollinearity~\cite{gharehyazie2014developer}.
Moreover, no interaction among the variables in the models is observed,
making the interpretation of our results easy and maintaining the cleanliness of the models.
For project-level measures, the number of issues ($log(issue\_num)$) is highly significant,
which means that {\color{red}{the number of issues in the training set is far from sufficient; on the current dataset,}} the more training data used,
the higher the \textit{average F-measure} the classification achieves.
Moreover, the other project-level measures show no statistically significant influence on the performance of the classification model.
For issue-level measures, the number of confused issues ($log(confuse\_count + 0.5)$) contained in the dataset is highly significant.
When the dataset contains many confused issues, numerous instances lie close to the hyperplane of the classification model,
which complicates the construction of an effective model.
The median number of words ($log(med\_word\_count)$) in issues is insignificant,
thereby suggesting that longer textual summaries do not help in distinguishing bug-prone from non-bug-prone issues,
and a few key words may be enough to build an effective classification model.
% \begin{table}[htbp]
% \centering
% \caption{\label{tab:compareML}Performance comparison of different ML techniques}
% \begin{tabular}{lrrll}
% \toprule
% & Estimate & t Value & p value & \\
% \midrule
% $log(star + watch)$ & 0.006207 & 3.216 & 0.00173 & ** \\
% $log(post_num)$ & 0.078255 & 13.754 & <2e-16 & *** \\
% $log(confuse_count + 0.5)$ & -0.064934 & -18.108 & <2e-16 & *** \\
% $log(med_word_count)$ & 0.009829 & 1.265 & 0.20883 & \\
% \bottomrule
% \end{tabular}
% \end{table}
\begin{table}[htbp]
\centering
\caption{\label{tab:fixedmodel}Regression result of fixed effect}
\begin{tabular}{l|r@{}lr@{}l}
\hline
& Coeffs & & Sum Sq. & \\ \hline
(intercept) & -3.37543 &* & & \\
$log(star + watch)$ & 0.06316 & & 0.197 & \\
$log(issue\_num)$ & 1.90440 &***& 30.484 &*** \\
$log(contributors)$ & -0.03135 & & 0.022 & \\
$log(age + 0.5)$ & -0.22421 &* & 0.288 & \\
$log(commits)$ & -0.34256 & & 0.842 &* \\ \hline
$log(confuse\_count + 0.5)$ & -1.83346 &***& 134.623 &*** \\
$log(med\_word\_count)$ & 0.12505 & & 0.067 & \\ \hline
marginal R-squared & \multicolumn{4}{c}{0.6798150} \\
conditional R-squared & \multicolumn{4}{c}{0.9251896} \\ \hline
\multicolumn{5}{l}{signif.: $p<0.001$ `***', $p<0.01$ `**', $p<0.05$ `*'} \\
\end{tabular}
\end{table}
\begin{framed}
\textbf{Result 2:}
\textit{
% 增加训练集的数量能够有效的提升模型的性能除此之外训练数据中如果包含太多的confused issue会严重影响训练集的性能。
Increasing the size of the training set can effectively improve the performance of the classification model.
Furthermore, too many confused issues in the training set will seriously affect its performance.
}
\end{framed}
\subsection{RQ3: Two-Stage Classification}
\label{sec:2stage}
% \begin{table*}[htbp]
% \centering
% \caption{Classification results of different methods}
% \label{table:detailresult}
% \begin{tabular}{|l|l|ccc|ccc|ccc|}
% \hline
% & & \multicolumn{3}{c|}{1st baseline} & \multicolumn{3}{c|}{2nd baseline} & \multicolumn{3}{c|}{our approach} \\
% & & Precision & Recall & F-Meature & Precision & Recall & F-Meature & Precision & Recall & F-Meature \\ \hline
% \multirow{3}{*}{First Quartile} & Bug & 0.720 & 0.672 & 0.670 & 0.718 & 0.680 & 0.676 & 0.734 & 0.715 & 0.706 \\
% & Nonbug & 0.747 & 0.708 & 0.708 & 0.747 & 0.708 & 0.711 & 0.754 & 0.737 & 0.730 \\
% & Average & 0.734 & 0.690 & 0.689 & 0.733 & 0.694 & 0.694 & \textbf{0.744} & \textbf{0.726} & \textbf{0.718} \\ \hline
% \multirow{3}{*}{Median} & Bug & 0.748 & 0.719 & 0.715 & 0.748 & 0.722 & 0.719 & 0.761 & 0.746 & 0.731 \\
% & Nonbug & 0.765 & 0.727 & 0.733 & 0.765 & 0.729 & 0.735 & 0.770 & 0.759 & 0.750 \\
% & Average & 0.757 & 0.723 & 0.724 & 0.757 & 0.726 & 0.727 & \textbf{0.766} & \textbf{0.753} & \textbf{0.741} \\ \hline
% \end{tabular}
% \end{table*}
An experiment based on the two-stage classifier was conducted to validate our approach.
{\color{red}{In the first stage, SVM was used as the text classifier based on the conclusion of RQ1.
In the second stage, we selected Logistic Regression as the prediction model, which performed better than the other classifiers in Table~\ref{tag:packages}.}}
As projects that achieve a high $f_{avg}$ (i.e., average F-measure) contain few confused issues,
our approach has a slight effect on these projects.
Thus, to explore the performance of our approach for different projects, the project selection has two cases.
In the first case, projects whose $f_{avg}$ is less than the \textit{first quartile} (0.7521) are selected.
In the second case, projects whose $f_{avg}$ is less than the \textit{median} (0.7935) are selected.
In this paper, \textit{\textbf{SVM}} is selected as the first method,
whose $f_{avg}$ is the best among the four different text-based classifiers.
Two other approaches are used to explore the effect of \textit{\textbf{developer information}}
and \textit{\textbf{perplexity information}} in our two-stage approach.
The developer information method extracts only the probability of being bug-prone from free text in the first stage, because the omission of structured information is common in GitHub;
thus, the historical activities of the reporter were used in Stage 2 to build a classifier similar to the work of~\cite{zhou2014combining},
as described in Section~\ref{section:secondlevel}.
The perplexity information method,
which is used to compare the effect of perplexity information and developer information,
extracts perplexity in the first stage, and does not use structured developer information in the second stage.
The \textit{\textbf{combined method}} is our two-stage approach, which makes use of both perplexity information and developer information.
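The combined two-stage method can be sketched end to end. Everything below (the synthetic issues, the reporter features, and all hyperparameters) is an illustrative stand-in for the paper's actual setup, using scikit-learn as an assumed toolkit:

```python
# Stage 1 turns free text into a bug-probability feature; stage 2 combines
# it with structured reporter features in a logistic regression model.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

titles = [
    "crash when saving file", "null pointer exception on login",
    "segfault in parser", "error: index out of range", "app freezes on start",
    "please add dark mode", "feature request: csv export",
    "improve documentation", "support python 3", "new icon set",
]
is_bug = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
# Hypothetical structured reporter features: [issues reported, is core dev].
reporter = np.array([[5, 1], [3, 1], [8, 0], [2, 0], [1, 0],
                     [0, 0], [4, 1], [1, 0], [6, 1], [0, 0]])

# Stage 1: SVM over TF-IDF text, kept only for its probability output.
X_text = TfidfVectorizer().fit_transform(titles)
svm = SVC(probability=True, random_state=0).fit(X_text, is_bug)
p_bug = svm.predict_proba(X_text)[:, 1:2]  # unstructured feature for stage 2

# Stage 2: logistic regression over [P(bug) | reporter features].
X2 = np.hstack([p_bug, reporter])
clf = LogisticRegression().fit(X2, is_bug)
preds = clf.predict(X2)
```

In practice, the stage-1 probabilities would of course be produced on held-out folds rather than on the training data, as the paper's 10-fold protocol requires.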
% Table~\ref{table:detailresult} shows the comparison results of precision, recall,
% and F-measure for two baseline approaches with the results of our approach.
% The values in Table~\ref{table:detailresult} are the average results of 101 projects.
Figure~\ref{figure:2levelresult} shows the comparison results of $f_{avg}$.
\begin{figure}[!htb]
\centering
% \includegraphics[width=8.5cm]{classprocess}
\includegraphics[width=8.5cm]{figure/final_result3}
\caption{$f_{avg}$ of different methods}
\label{figure:2levelresult}
\end{figure}%picture
Figure~\ref{figure:2levelresult} shows that the combined method outperforms all other methods on the 80 projects.
% Table~\ref{table:detailresult} shows that the average values of precision, recall, and F-measure are all better than
% those of the baseline methods.
For procedure $\widetilde{\textbf{T}}$ (the last three rows in Table~\ref{tab:compareML}),
all the $p$-values of the combined method versus SVM, the developer information method, and the perplexity information method are less than 0.05,
and the lower and upper boundaries are greater than zero,
which means that the combined method significantly outperforms the other approaches.
\begin{framed}
\noindent
\textbf{Result 3:}
\textit{
The two-stage classification approach can achieve a statistically significant improvement
compared to traditional text-based classification by integrating our novel perplexity features.
%significantly improve the classification performances
%with the 1.1\% precision and 3.2\% recall on average.
%The approach that adds perplexity information extracted from free text of issues can significant improve the performance of classification model.
}
\end{framed}
Although the absolute improvement is modest,
the extracted features (semantic perplexity information) are generally effective in improving the performance of the classifier.
Sentence splitting remains imperfect:
issue reports contrast sharply across projects,
mixing free text with code, hyperlinks, and stack traces, which complicates fine-tuning for every project.
This imperfect preprocessing limits the improvement on some projects.
Even facing this challenge, our approach still achieves a stable improvement.
We believe that, when applied in practice,
a highly individualized data preprocessing approach can greatly help in extracting features
that are consistent with our approach.

% ========== submit/rq1.tex ==========
In order to determine which ML technique performs best at classifying issues as bug or feature, we applied 5 different ML techniques (described in Section 3.4) to our dataset and built a classification model for every project. Figure \ref{f_all} shows the average accuracy of the 10-fold cross-validation classification results for different projects. It reflects the performance of the different ML techniques and the baseline on 111 projects, where the baseline is the accuracy obtained by predicting all issues as the larger class (bug or feature). In Figure \ref{f_all}, the x-axis denotes the project ID, and each point on a line represents the accuracy of one ML technique on the given project.
Table \ref{tab:rMLA} presents the statistical results for the different ML techniques. Table \ref{tab:rMLA} shows that, for most projects, \textbf{SVM achieves the best accuracy on average and performs most stably}, whereas NB and LR perform worse and are more volatile. Because of the excellent performance of SVM, we use the classification results of SVM in the following experiments.
% Table generated by Excel2LaTeX from sheet 'Sheet1'
\begin{table}[htbp]
\centering
\caption{Statistical result of accuracy for different ML}
\begin{tabular}{|c|c|c|c|} \hline
%\toprule
\textbf{ML} & \textbf{Mean} & \textbf{Median} & \textbf{St. Dev.} \\ \hline
%\midrule
Baseline & 0.6695 & 0.6464 & 0.1299 \\ \hline
SVM & \textbf{0.8335} & \textbf{0.8226} & \textbf{0.0563} \\ \hline
NB & \textbf{0.7619} & \textbf{0.7488} & 0.0793 \\ \hline
LR & 0.778 & 0.7561 & \textbf{0.0796} \\ \hline
ET & 0.807 & 0.7946 & 0.0603 \\ \hline
RF & 0.7965 & 0.78 & 0.0653 \\ \hline
%\bottomrule
\end{tabular}%
\label{tab:rMLA}%
\end{table}%

% ========== submit/rq2.tex (empty) ==========
% ========== submit/rq3.tex ==========
We design a two-level classification method to improve the classification results, which consists of the following steps:
1) The first level uses free-text information, the title and content of issues, to build a supervised ML model.
At this level, we do not directly predict whether the issue report is bug-prone, but rather how likely it is to be a bug.
In our study, we use a linear SVM as our prediction model.
In addition, for each issue, we split the title and content into sentences, and use the prediction model built before to predict every sentence.
Through this process, we collect the confusion information of issues.
All outputs of the prediction model are regarded as features extracted from the unstructured free text, which will be used in the second level.
2) The second level uses the structured reporter features of each issue report together with the unstructured features extracted from the first level.
At this level, we feed these features into a new machine learner, which predicts whether the issue is bug-prone.
In our improved classification method, we use logistic regression as our prediction model.
We use supervised learning at both levels, which makes use of the labels described in section \ref{labeling process}. Each level of our method is explained below, and an overview of our method is shown in Figure XX.
\subsection{Level 1 - Free Text Classification}
At this level, the main task is to mine the information contained in free text.
As in the process of section \ref{ML}, we use two main information sources: summary and description.
This time, however, we do not directly build a model to predict whether the issue report is bug-prone.
We build a prediction model from the summary and description of the training set, and apply the probability output of the model to predict the testing set.
At the same time, we split the summary and description into sentences, and use the previously built model to predict each sentence.
Sentence-level prediction lets us understand the semantic shifts as developers report an issue, and obtain the confusion information contained in the free text.
By analyzing the category-proneness of sentences, we can look deeply into the free text and mine regular patterns of developers' reporting habits, which helps improve classification performance.
At level 1, we mainly extract the following features from free text:
\textbf{Probability:} how likely the issue report is to be a bug.
We use the probability output of the SVM to obtain this feature.
\textbf{SentenceCount:} the total number of sentences in the issue report, including summary and description.
\textbf{MostBugProb:} the maximum probability among all sentences predicted as bug-prone.
Here we reuse the model built from the summary and description, and obtain the probability for each sentence from its probability output.
\textbf{MostNonbugProb:} the maximum probability among all sentences predicted as non-bug-prone.
We obtain this feature through the same process as \textbf{MostBugProb}.
\textbf{Location:} the sequence number of the most non-bug-prone sentence.
We use this feature to show where the most non-bug-prone sentence appears in the issue report.
\textbf{BugCount:} the number of sentences predicted as bug-prone.
\textbf{NonbugCount:} the number of sentences predicted as non-bug-prone.
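Given the per-sentence bug probabilities from the level-1 model, the listed features can be derived as in this sketch. The model itself is abstracted away as a probability list, and treating a probability of at least 0.5 as bug-prone is our assumption:

```python
def sentence_features(sentence_probs):
    """Derive the level-1 sentence features from per-sentence P(bug).

    sentence_probs: list of P(bug), one entry per sentence, in report order.
    """
    bug = [p for p in sentence_probs if p >= 0.5]         # bug-prone sentences
    nonbug = [1 - p for p in sentence_probs if p < 0.5]   # non-bug-prone, as P(nonbug)
    # Location: 1-based index of the most non-bug-prone sentence.
    location = min(range(len(sentence_probs)),
                   key=lambda i: sentence_probs[i]) + 1
    return {
        "SentenceCount": len(sentence_probs),
        "MostBugProb": max(bug, default=0.0),
        "MostNonbugProb": max(nonbug, default=0.0),
        "Location": location,
        "BugCount": len(bug),
        "NonbugCount": len(nonbug),
    }
```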
With the 10-fold strategy we use, these features must be extracted carefully.
All features are produced from the training set: we use the training set to build a prediction model and use that model to extract the features of each instance in the training set.
This approach lets us extract features without using instances of the testing set,
which would otherwise introduce additional information and make the results less sound.
\subsection{Level 2 - Combining Free Text and Developer Information Classification}
At level 1, we obtain the bug-prone probability and the sentence-level confusion information for each issue report.
These features, extracted from the free text of the issue report, form part of the input of level 2.
For classifying issue reports, we believe the experience of developers may influence the categories of issue reports.
For example, skilled developers may be more likely to report bug-prone issues, and they may be more likely to provide issue reports that meet the expectations of managers.
Therefore, at level 2 we add structured features about the authors who report the issues.
These features cover the identity of the developer in the project, historical development activities, social influence, etc.
The details are described below.
% Explain the intuition for considering these features: give one summary, then list the individual features below
\textbf{IsCoreDeveloper:} whether the author who reported the issue is a core developer of the project.
If the author is a core developer, the feature is set to 1; otherwise, it is set to 0.
\textbf{IssueCountInProject:} the number of issues the author has previously reported in the project.
This feature reveals the author's issue-reporting experience within the project.
\textbf{IssueCountInGitHub:} the number of issues the author has previously reported on GitHub.
Unlike \textbf{IssueCountInProject}, this feature reveals the author's issue-reporting experience across all projects.
\textbf{CommentCountInProject:} the number of comments the author has previously posted in the project.
This feature reveals the author's historical issue activity within the project.
\textbf{CommentCountInGitHub:} the number of comments the author has previously posted on GitHub.
As with the comparison of \textbf{IssueCountInProject} and \textbf{IssueCountInGitHub},
this feature reveals the author's historical issue activity across all projects.
\textbf{FollowerCount:} the number of followers the author has.
This feature reveals the social influence of the author on GitHub.
\textbf{RegisterTime:} how long the developer has been registered.
The longer a developer has been registered, the more familiar they are with the conventions of GitHub.
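A sketch of how these structured features could be assembled for one report. The `author` and `project` dicts are hypothetical stand-ins for the mined GitHub data, not a real API:

```python
def reporter_features(author, project):
    """Assemble the level-2 structured reporter features for one issue report.

    `author` and `project` are illustrative dicts; field names are assumptions.
    """
    return {
        "IsCoreDeveloper": 1 if author["login"] in project["core_developers"] else 0,
        "IssueCountInProject": author["issues_in_project"],
        "IssueCountInGitHub": author["issues_in_github"],
        "CommentCountInProject": author["comments_in_project"],
        "CommentCountInGitHub": author["comments_in_github"],
        "FollowerCount": author["followers"],
        "RegisterTime": author["days_since_registration"],
    }
```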
To complete the level-2 classification, we need a data-grafting process to combine the features from free text with the features from structured developer information.
During level 1, the ID of a report is not carried along; the only input is the instances with lists of preprocessed terms, together with the corresponding labels.
We therefore use a dedicated subprocess, data grafting, to smooth the linkage between the two levels.
Data grafting merges datasets from different sources, and combines features into a regular form according to their source.
Because we use a 10-fold strategy to partition the training set, we must track the changes of the dataset.
We therefore apply the same partitioning to the ID information, which makes it possible to trace every instance.
We use logistic regression as our prediction model at level 2.
Care is needed when partitioning the dataset into training and testing sets before building the prediction model.
The output of level 1 and the input of level 2 are linked:
the testing sets of level 1 and level 2 must be identical, which ensures that we build the model on the same training set and never introduce extra information from the testing set.
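The constraint that both levels share exactly the same folds can be met by partitioning the issue IDs once with a fixed seed and reusing that partition at both levels; a sketch (the function name and seed are ours):

```python
import random

def make_folds(issue_ids, k=10, seed=42):
    """Partition issue IDs into k folds deterministically, so that level 1
    and level 2 can reuse the identical train/test split and avoid leaking
    testing-set information across levels."""
    ids = list(issue_ids)
    random.Random(seed).shuffle(ids)   # fixed seed => reproducible partition
    return [ids[i::k] for i in range(k)]
```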
\subsection{Result of Improved Classification Method}
Based on

% ========== submit/test.tex ==========
\documentclass[11pt]{article}
\usepackage{CJK}
\usepackage[top=2cm, bottom=2cm, left=2cm, right=2cm]{geometry}
\usepackage{algorithm}
\usepackage{algorithmicx}
\usepackage{algpseudocode}
\usepackage{amsmath}
\floatname{algorithm}{algorithm}
\renewcommand{\algorithmicrequire}{\textbf{input:}}
\renewcommand{\algorithmicensure}{\textbf{output:}}
\begin{document}
\begin{CJK*}{UTF8}{gkai}
\begin{algorithm}
\caption{Improving Classification Model}
\begin{algorithmic}[1] % show line numbers on every line
\Require $model$, $issue$
\Ensure $change$ prediction or not
\Function {Improving}{$model, issue$}
\State $change \gets False$
\If {$\Delta < threshold_\Delta$}
\State $sentences \gets$ \Call{SplitSentence}{$issue$}
\State $preds \gets$ \Call{model.predict\_proba}{$sentences$}
\For{$pred \in preds$}
\If {$\Delta_s < threshold_{\Delta_s}$}
\If {$position \in \{begin, end\}$}
\State $change \gets True$
\State $break$
\EndIf
\EndIf
\EndFor
\EndIf
\State \Return{$change$}
\EndFunction
\end{algorithmic}
\label{a:improving}
\end{algorithm}
\end{CJK*}
\begin{table}[h] % table environment; placement option h (here) keeps LaTeX from moving the table freely
\begin{tabular}{p{3.5cm}|p{2.5cm}|p{5cm}} % column widths 3.5cm, 2.5cm, 5cm; p{} forces text wrapping; | draws vertical rules (c, l, r would give centered, left-, or right-aligned columns)
% cell contents follow
Format & Extension & Description \\ % & separates cells; \\ ends the row
\hline % horizontal rule; the remaining 4 rows follow the same pattern
Bitmap & .bmp & Bitmap images are recommended because they offer the most control over the exact image and colors.\\
\hline
Graphics Interchange Format (GIF) & .gif & Compressed image format used for Web pages. Animated GIFs are supported.\\
\hline
Joint Photographic Experts Group (JPEG) & .jpeg, .jpg & Compressed image format used for Web pages.\\
\hline
Portable Network Graphics (PNG) & .png & Compressed image format used for Web pages.
\end{tabular}
\caption{Art File Formats} % table caption
\end{table}
\end{document}
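The improving procedure in Algorithm~\ref{a:improving} above can be sketched in executable form as follows (a hedged illustration: the threshold values, the sentence splitter, and the `predict_proba` callable are assumptions, not the study's actual settings):

```python
# Sketch of the improving procedure: re-examine an uncertain whole-report
# prediction sentence by sentence, changing it only when an uncertain
# sentence sits at the beginning or end of the report.
import re

THRESHOLD_DELTA = 0.2    # assumed report-level confidence margin
THRESHOLD_DELTA_S = 0.3  # assumed sentence-level confidence margin

def split_sentences(issue_text):
    # naive sentence splitter; real preprocessing would be more careful
    return [s for s in re.split(r"[.!?]\s+", issue_text) if s]

def improving(predict_proba, issue_text):
    """Return True if the whole-report prediction should be changed."""
    probs = predict_proba([issue_text])[0]
    delta = abs(probs[0] - probs[1])      # margin between the two classes
    if delta >= THRESHOLD_DELTA:
        return False                      # report-level prediction is confident
    sentences = split_sentences(issue_text)
    preds = predict_proba(sentences)
    for pos, p in enumerate(preds):
        delta_s = abs(p[0] - p[1])
        # uncertain sentence at the beginning or the end of the report
        if delta_s < THRESHOLD_DELTA_S and pos in (0, len(preds) - 1):
            return True
    return False

# Usage with a dummy model that is maximally uncertain everywhere:
dummy = lambda texts: [[0.5, 0.5] for _ in texts]
print(improving(dummy, "Crash on start. Steps to reproduce. Expected fix."))
```

With the dummy model the report-level margin is zero, so the sentence loop runs and the first (uncertain) sentence triggers a change.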

%% submit/threats.tex (13 lines)
Our study has two main threats to validity.
The first concerns the study design for category extraction on our data set.
We used the tags most frequently used in GitHub to determine the category of issues.
We trained the model on issues carrying these tags, so that the category of every sample in the training set was known.
We ignored issues without these tags, which might introduce bias into the dataset.
However, many studies~\cite{antoniol2008bug,maalej2015bug,zhou2014combining} also selected labeled issues as the training set.
Moreover, unlabeled issues were only a small part of the data set, 9.8\% in our study.
The second threat is the number of projects we used.
We filtered projects by the number of labeled issues and by the rate of bug-prone issues in each project.
Our findings are based on 80 projects in GitHub.
Although this is a large-scale dataset compared to other studies, it is still a very small fraction of the projects on GitHub.
For the filtered-out projects, our method needs extra adjustment and further evaluation.
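The project-filtering step described above can be sketched as follows (the thresholds are illustrative assumptions, not the study's actual cutoffs):

```python
# Sketch: keep a project only if it has enough labeled issues and its
# bug-prone rate is not extreme. Thresholds below are assumed values.
MIN_LABELED = 200
MIN_BUG_RATE, MAX_BUG_RATE = 0.1, 0.9

def keep_project(n_labeled, n_bug_prone):
    if n_labeled < MIN_LABELED:
        return False
    rate = n_bug_prone / n_labeled
    return MIN_BUG_RATE <= rate <= MAX_BUG_RATE

# Hypothetical (name, labeled issues, bug-prone issues) triples:
projects = [("p1", 500, 250), ("p2", 50, 25), ("p3", 300, 10)]
kept = [name for name, n, b in projects if keep_project(n, b)]
print(kept)  # ['p1']
```

Here "p2" fails the labeled-issue count and "p3" fails the bug-prone-rate band, illustrating why projects outside the filter would need separate evaluation.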