%-----------------------------------------------------------------------
% Template File for Science China Information Sciences
% Downloaded from http://scis.scichina.com
% Please compile the tex file using LATEX or PDF-LATEX or CCT-LATEX
%-----------------------------------------------------------------------
\documentclass{SCIS2018}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% Author's definitions for this manuscript
%%% Commonly used environments are already loaded and need not be loaded again
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% Begin.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% Authors do not modify the information below
\ArticleType{Supplementary File}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% Title
%%% \title{title}{title for citation}
\title{A Clustering-based Approach for Mining Dockerfile
Evolutionary Trajectories}{A Clustering-based Approach for Mining Dockerfile
Evolutionary Trajectories}
%%% Corresponding author
%%% \author[number]{Full name}{{email@xxx.com}}
%%% General author
%%% \author[number]{Full name}{}
\author[1,2]{Yang ZHANG}{yangzhang15@nudt.edu.cn}
\author[1,2]{Huaimin WANG}{}
\author[3]{Vladimir FILKOV}{}
%%% Author information for page head
\AuthorMark{Zhang Y}
%%% Authors for citation
\AuthorCitation{Zhang Y, Wang H M, Filkov V, et al}
%%% Authors' contribution
%\contributions{Authors A and B have the same contribution to this work.}
%%% Address
%%% \address[number]{Affiliation, City {\rm Postcode}, Country}
\address[1]{Key Laboratory of Parallel and Distributed Computing}
\address[2]{College of Computer, National University of Defense Technology, Changsha, 410073, China}
\address[3]{University of California, Davis, CA, 95616, USA}
\maketitle
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% The main text.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{appendix}
\section{Background}
(1) Docker\footnote{https://www.docker.com/} is an open-source project that implements operating system-level virtualization~\cite{Anderson}. It builds on several technologies from operating systems research, such as LXC~\cite{Dua} (Linux Containers) and OS virtualization~\cite{Boettiger}.
Docker has become the de facto industry standard, and its usage is spreading rapidly~\cite{Cito}. As of December 2017, the most common Docker registry, \emph{i.e.}, Docker Hub\footnote{https://hub.docker.com/}, hosted close to 1,700,000 Docker projects. The Docker technology is primarily intended for developers to create and publish \emph{containers}~\cite{Manu}. With containers, applications can share the same operating system and, whenever possible, libraries and binaries~\cite{Bernstein}. Docker launches its containers from a \emph{Docker image}, which is a series of data layers on top of a base image~\cite{Felter}. When developers make changes to a container, Docker does not write the changes directly to the image of the container; instead, it adds a layer containing the changes to the image~\cite{Mouat}. Since replicas of the production environment can easily be created on local computers, developers can test their changes in a matter of seconds. Changes to containers can also be made rapidly, because only the sections that need changing are updated.
The content of a Docker image is defined by declarations in the \texttt{Dockerfile}, which specifies the Docker commands and the order of their execution for creating the desired image~\cite{Merkel}. To meet the requirements of project development, project maintainers may modify the content of a Dockerfile at different stages, a process we refer to as \emph{Dockerfile evolution}.
In our preliminary data set of more than 57,000 projects from Docker Hub, nearly 70\% of the projects had changed their Dockerfile at least once, and over 5,000 projects had changed it more than 10 times.
The evolution of the Dockerfile may vary among individual projects, because different projects have different code bases and structures. For projects with similar requirements and cultures, however, a similar Dockerfile evolutionary trajectory may exist. Investigating Dockerfile evolution can help in understanding the configuration practices of maintainers and can benefit project maintenance in the Docker environment.
\smallskip
\noindent(2) Dockerfile\footnote{https://docs.docker.com/engine/reference/builder/} is a text document that contains all the commands a user could call on the command line to assemble an image~\cite{Merkel}. With \texttt{docker build}, users can create an automated build that executes several command-line instructions in succession. Docker provides multiple types of instructions for the Dockerfile, including \texttt{FROM}, \texttt{MAINTAINER}, \texttt{RUN}, \texttt{COPY}, \texttt{ADD}, \texttt{ENV}, \emph{etc.} Specifically, the \texttt{FROM} instruction specifies the \emph{base image}, which gives a first indication of what a project uses Docker for~\cite{Cito}. The \texttt{MAINTAINER} instruction provides the name and email of an active maintainer. The \texttt{ENV} instruction sets environment variables. The \texttt{COPY} instruction places files into the container. The \texttt{ADD} instruction places files into the container and unpacks archives (\emph{e.g.}, a local tar archive). The \texttt{RUN} instruction executes shell commands in a new layer on top of the current image and commits the results. Docker runs the instructions in a Dockerfile in order and treats lines that begin with ``\texttt{\#}'' as comments. A Dockerfile must start with a \texttt{FROM} instruction. Other parts are then added on top of the base image~\cite{Jaramillo}. Each instruction represents one layer in the Docker image. Thus, the scale of a Dockerfile can reveal the size and complexity of the corresponding Docker image.
As mentioned above, project maintainers may modify the content of a Dockerfile at different stages to meet the project goals. \emph{E.g.}, during the early stage, the owner \emph{inutano} of project \emph{inutano/wpgsa-docker} added a \texttt{USER} instruction and new Python scripts to the initial Dockerfile. During the later stage, he only updated the version of the plugin \emph{wPGSA}, \emph{e.g.}, ``\emph{0.2.0}''$\to$``\emph{0.3.0}''. Such changes are directly reflected in the evolution of the Dockerfile scale, which depends on the practices of the individual projects. It may vary because different projects have different code bases and structures, but projects with similar goals and cultures may exhibit similar Dockerfile scale evolutionary trajectories. This motivated us to conduct an exploratory study of the Dockerfile scale evolutionary trajectories, in order to quantify Dockerfile evolution and shed light on the different learning curves of project maintainers with respect to Docker configuration, which can benefit project maintenance.
\section{Research Data}
(1) Project selection.
From the container list in Docker Hub\footnote{https://store.docker.com}, we collected basic information for all containers that were listed on or before July 2017. Docker Hub is a cloud-based registry service providing tools for developers to link to their source code repositories, to build image artifacts, and to deploy the images. These images are stored and maintained in Docker Hub repositories, of which there were close to 1,700,000 as of December 2017.
Since Docker's inception in 2013, a large number of GitHub projects have used Docker~\cite{Cito}.
Docker Hub provides good GitHub integration, and developers can easily combine their GitHub and Docker Hub repositories in their workflow. It also provides some featured tools, \emph{e.g.}, automated builds\footnote{https://docs.docker.com/docker-hub/builds/}, which allow developers to build their images automatically from GitHub sources~\cite{Boettiger}. Moreover, the build data and Dockerfile information on Docker Hub are available for download if the repositories are public and use the automated builds tool.
We identified projects that use the automated builds tool by checking the string ``\texttt{is\_automated}'' returned through the Docker Hub API, where a value of true means that the project uses automated builds. After removing the projects that were forked from other GitHub projects or hosted on Bitbucket, we collected the basic information of 47,149 projects, including their names, creation times, and linked GitHub repository addresses.
\smallskip
\noindent(2) Data collection and filtering.
For the selected projects, we extracted the Dockerfile change information from the GitHub commit logs, including \{\emph{repo, changed\_date, commit\_sha}\}. We downloaded the Dockerfile content of each change by using a regular URL pattern\footnote{https://raw.githubusercontent.com/[repo]/[commit\_sha]/Dockerfile}. We then used text parsing to extract detailed data for each Dockerfile, \emph{e.g.}, the base image and the image scale (lines of commands, excluding blank lines and comments).
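The image scale described above can be written compactly. The following is only an illustrative formalization; the symbols $D$, $\ell_i$, and $s(D)$ are notation introduced here and are not taken from the extraction scripts. Let a Dockerfile version $D$ consist of the lines $\ell_1,\dots,\ell_n$. Its scale is then
\[
s(D) = \bigl|\{\, i \in \{1,\dots,n\} : \ell_i \mbox{ is neither blank nor a comment line beginning with \texttt{\#}} \,\}\bigr| .
\]
For instance, a version that contains one comment line, one blank line, and three lines holding a \texttt{FROM}, a \texttt{RUN}, and a \texttt{COPY} instruction has $s(D)=3$.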
To support our quantitative study, we filtered our data according to the following set of conditions:
\begin{itemize}
\item The project should have been created before August 2016, \emph{i.e.}, project age$\ge$12 months;
\item The project should have changed its Dockerfile enough times, \emph{i.e.}, Dockerfile versions$\ge$10;
\item The Dockerfile should have complete information, \emph{i.e.}, the Dockerfile has a base image and its scale$>$0.
\end{itemize}
After this filtering, we obtained our final set of 2,840 projects.
In total, our dataset contains 76,925 Dockerfile versions. On average, each project has changed its Dockerfile 27.1 times (median: 18), and each Dockerfile version has 32.7 (median: 22) lines of commands and 16.1 (median: 12) image layers.
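As a quick consistency check on these figures (an illustrative calculation that assumes each stored version counts as one change), the filter retains about 6\% of the 47,149 candidate projects, and dividing the number of versions by the number of projects reproduces the reported mean:
\[
\frac{2{,}840}{47{,}149} \approx 0.060 , \qquad \frac{76{,}925}{2{,}840} \approx 27.09 \approx 27.1 .
\]
The mean of 27.1 exceeds the median of 18, which suggests a right-skewed distribution in which a minority of projects change their Dockerfile very often.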
\begin{figure}[!b]
\begin{minipage}{1\linewidth}
\centering
\includegraphics[width=0.6\textwidth]{figures/heatmap1.pdf}
\caption{Heatmap of the Dockerfile scale evolutionary trajectories.}
\label{fig:3}
\end{minipage}
\end{figure}
\smallskip
\noindent(3) Pre-clustering.
Before applying our approach, we drew a heatmap of the Dockerfile scale evolutionary trajectories of all 2,840 projects, as shown in Figure~\ref{fig:3}. By manually inspecting the trends in color change, we can see six clusters with significant differences, which we marked with dotted lines. We therefore set the number of clusters to $k=6$ in our K-means clustering process.
After clustering the evolutionary vectors of all projects, we assigned each project to one of the six clusters.
To better understand how the Dockerfile scale evolves in the different clusters, we drew the Dockerfile scale variation curves of some examples in each cluster (see Figure~\ref{fig:4}) and conducted case studies on randomly selected projects. Based on our manual analysis, we summarize the specific paradigms of the six clusters below.
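The clustering step can be summarized in formulas. The following is a sketch under stated assumptions rather than the exact procedure: the common trajectory length $T$, the min-max normalization, and the symbols $s_{p,t}$, $x_p$, $\mu_j$, and $C_j$ are illustrative notation introduced here. Suppose the scale history of project $p$ is resampled to $T$ points $s_{p,1},\dots,s_{p,T}$ and normalized to $[0,1]$ (which requires that the scale is not constant over the history):
\[
x_{p,t} = \frac{s_{p,t} - \min_{u} s_{p,u}}{\max_{u} s_{p,u} - \min_{u} s_{p,u}} , \qquad t = 1,\dots,T .
\]
K-means~\cite{Hartigan} with $k=6$ then seeks clusters $C_1,\dots,C_6$ with centroids $\mu_j$ that minimize the within-cluster sum of squares
\[
\sum_{j=1}^{6} \sum_{p \in C_j} \| x_p - \mu_j \|^2 , \qquad \mu_j = \frac{1}{|C_j|} \sum_{p \in C_j} x_p ,
\]
where $x_p = (x_{p,1},\dots,x_{p,T})$. Under such a normalization, two projects with very different absolute Dockerfile sizes fall into the same cluster when their trajectories have the same shape.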
\section{Dockerfile Evolutionary Paradigms}
(1) C1: Increasing and holding.
21.8\% of projects belong to this cluster. In this cluster, we find that developers added new instructions or new settings to the Dockerfile in the early periods. After the Dockerfile reached a certain size, they only updated basic environment variables or changed the location of instructions, so the Dockerfile size remains stable (see Figure~\ref{fig:4}, \emph{top}, \emph{left}).
For example, in project \emph{inutano/wpgsa-docker}, the initial Dockerfile size was 7 lines. The manager \emph{inutano} then added a \texttt{USER} instruction and new Python scripts, so that the Dockerfile size reached 12 lines. After that, the manager only changed the image or container and the user name, \emph{e.g.}, ``\emph{FROM jupyter/datascience-notebook:4.0}''$\to$``\emph{FROM inutano/research-base:0.1.1}''. Later, he only updated the version of the plugin \emph{wPGSA}, \emph{e.g.}, ``\emph{0.2.0}''$\to$``\emph{0.3.0}''.
\smallskip
\noindent(2) C2: Constantly growing.
31.6\% of projects belong to this cluster. We find that in this cluster, developers kept adding services, support, or plugins to the Dockerfile, which makes the Dockerfile size increase continuously. Compared with the other paradigms, the evolution path of this paradigm is the simplest and the easiest to understand (see Figure~\ref{fig:4}, \emph{top}, \emph{right}).
For example, in project \emph{macintoshplus/php}, the initial size of the Dockerfile was 22 lines. The developer \emph{macintoshplus} then added new \texttt{RUN} instructions, new plugins, or new support, \emph{e.g.}, \emph{SQLite} and \emph{Java}, which increased the size of the Dockerfile to 33 lines.
\smallskip
\noindent(3) C3: Holding and increasing.
19.2\% of projects belong to this cluster. We find that in this cluster, developers only updated basic environment variables in the early periods, which keeps the Dockerfile size stable. Later, new instructions, plugins, or support were added, and the Dockerfile size increased (see Figure~\ref{fig:4}, \emph{middle}, \emph{left}).
For instance, in project \emph{ckeyer/obc-env}, the initial Dockerfile size was 6 lines. During the first 5 changes, the developer \emph{ckeyer} only added new parameters to the \texttt{RUN} instruction or updated the file path, \emph{e.g.}, ``\emph{cd /github.com...}''$\to$``\emph{cd \$GOPATH/src/github.com...}''. The Dockerfile size remained at 6 lines. Then \emph{ckeyer} changed the base image, ``\emph{FROM ckeyer/obc:base}''$\to$``\emph{FROM alpine:edge}'', and added new \texttt{RUN} and \texttt{ENV} instructions. During the last 8 changes, 15 lines of code were added to the Dockerfile, which brought the Dockerfile size up to 21 lines.
\smallskip
\noindent(4) C4: Increasing and decreasing.
10.2\% of projects belong to this cluster. We find that in this cluster, developers kept adding new instructions in the early periods, so the Dockerfile size increases. After the Dockerfile reached a large size, maintainers tried to restructure it by removing useless plugins or services, or by moving some settings to additional script files, which makes the Dockerfile size drop in the later periods (see Figure~\ref{fig:4}, \emph{middle}, \emph{right}).
For example, in project \emph{combro2k/virtualmail}, the initial Dockerfile size was 24 lines. During the first 101 changes, the developer \emph{combro2k} kept adding new instructions, \emph{e.g.}, \texttt{RUN} and \texttt{ADD}, which brought the Dockerfile size up to 163 lines. During the later changes, \emph{combro2k} tried to restructure the Dockerfile by moving some instructions to other script files, \emph{e.g.}, moving 12 \texttt{ENV} settings of software versions to the ``\emph{resource/bin/setup.sh}'' file. Finally, the Dockerfile size dropped to 13 lines.
\smallskip
\noindent(5) C5: Holding and decreasing.
9.5\% of projects belong to this cluster.
We find that in this cluster, developers changed very little in the early periods, so the Dockerfile size remains stable. In the later periods, developers tried to change the Dockerfile structure by moving some instructions to additional script files or by using a more lightweight base image, so the Dockerfile size dropped (see Figure~\ref{fig:4}, \emph{bottom}, \emph{left}).
For instance, in project \emph{mobingidocker/ubuntu-apache2-ruby}, the Dockerfile initially had 22 lines of code. The developer \emph{David Siaw} then added a new script file, logging, a git service, and dependencies to the \texttt{RUN} instruction. After 6 changes, the Dockerfile had 23 lines, \emph{i.e.}, only one line had been added. At the 7th change, \emph{David Siaw} moved 11 \texttt{RUN} instructions to the \emph{provision.sh} file. Finally, the Dockerfile size dropped to 13 lines.
\smallskip
\noindent(6) C6: Gradually reducing.
7.7\% of projects belong to this cluster. We find that in this cluster, developers kept removing useless instructions or changed the base image to reduce image layers, which makes the Dockerfile size decrease continuously. In the later periods, however, the decreasing trend began to slow down (see Figure~\ref{fig:4}, \emph{bottom}, \emph{right}).
For example, in project \emph{keymetrics/pm2-docker-alpine}, the initial Dockerfile size was 28 lines. The developer \emph{Unitech} then changed the base image, ``\emph{FROM alpine:latest}''$\to$``\emph{FROM mhart/alpine-node:4}'', together with its related instructions, so the Dockerfile size dropped to 14 lines. After that, \emph{Unitech} removed some instructions and merged similar instructions to reduce image layers and speed up the Dockerfile build process. Finally, the Dockerfile size dropped to 7 lines.
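As a check on the cluster shares reported above (a simple illustrative calculation), the six percentages cover the full data set:
\[
21.8 + 31.6 + 19.2 + 10.2 + 9.5 + 7.7 = 100.0 .
\]
Grouping the clusters by the direction in which their trajectories end, the paradigms that end with growth or with holding after growth (C1 to C3) account for $21.8 + 31.6 + 19.2 = 72.6\%$ of projects, while the paradigms that end in a decrease (C4 to C6) account for $10.2 + 9.5 + 7.7 = 27.4\%$.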
\begin{figure}[t!]
\begin{minipage}{1\linewidth}
\centering
\includegraphics[width=0.4\textwidth]{figures/cluster11.pdf}
\includegraphics[width=0.4\textwidth]{figures/cluster22.pdf}
\includegraphics[width=0.4\textwidth]{figures/cluster33.pdf}
\includegraphics[width=0.4\textwidth]{figures/cluster44.pdf}
\includegraphics[width=0.4\textwidth]{figures/cluster55.pdf}
\includegraphics[width=0.4\textwidth]{figures/cluster66.pdf}
\caption{Examples of normalized Dockerfile scale variation curves in the different clusters.}
\label{fig:4}
\end{minipage}
\end{figure}
\begin{figure}[!t]
\begin{minipage}{1\linewidth}
\centering
\includegraphics[width=0.6\textwidth]{figures/project_age.pdf}
\caption{Differences in project age between the clusters.}
\label{fig:5}
\end{minipage}
\end{figure}
\begin{figure}[!t]
\begin{minipage}{1\linewidth}
\centering
\includegraphics[width=0.6\textwidth]{figures/dockerfile_size_1.pdf}
\caption{Differences in Dockerfile scale between the clusters.}
\label{fig:6}
\end{minipage}
\end{figure}
\section{Differences Comparison}
These paradigms may indicate different project development stages and goals.
To further explore the differences between projects in the six paradigms, we compared their project age and average Dockerfile scale.
\smallskip
\noindent(1) Project age.
We find that, on average, the project age is 24.7 months in Cluster-1 (median: 23.0), 24.8 in Cluster-2 (median: 23.0), 24.8 in Cluster-3 (median: 22.0), 25.7 in Cluster-4 (median: 24.0), 26.2 in Cluster-5 (median: 24.0), and 24.5 in Cluster-6 (median: 24.0). As shown in Figure~\ref{fig:5}, the last three clusters have higher median project ages (24.0 months each) than the first three clusters (22.0 to 23.0 months), although the mean age of Cluster-6 is the lowest of all six. Overall, projects in \emph{Cluster-1}, \emph{Cluster-2}, and \emph{Cluster-3} may be in the early stages of development, so they need to add content to the Dockerfile to meet their requirements. Projects in \emph{Cluster-4} and \emph{Cluster-5} may be in the later stages of development. In the late periods, their Dockerfile no longer needs new content, and some of its content even needs to be removed or adjusted to make the Dockerfile architecture more reasonable. This reflects the learning curves of project maintainers. Interestingly, we find it difficult to explain why the projects of \emph{Cluster-6} have kept reducing the Dockerfile size. One possible explanation is that maintainers were not satisfied with their original Dockerfile architecture, probably because it contained a lot of unnecessary content, so they kept adjusting and modifying it later.
\smallskip
\noindent(2) Average Dockerfile scale.
We find that the average Dockerfile scale is 28.6 lines in Cluster-1 (median: 22.1), 28.3 lines in Cluster-2 (median: 22.8), 24.0 lines in Cluster-3 (median: 19.1), 26.3 lines in Cluster-4 (median: 18.9), 21.8 lines in Cluster-5 (median: 17.0), and 19.8 lines in Cluster-6 (median: 17.1). As shown in Figure~\ref{fig:6}, the first three clusters have a larger median Dockerfile scale than the last three clusters; the means show the same tendency, except that the mean of Cluster-4 (26.3) exceeds that of Cluster-3 (24.0). This is consistent with the last three clusters having reduced their Dockerfile scale over their history. It indicates that different clusters of projects may have different development goals at different stages, \emph{i.e.}, some projects kept adding components and plugins to make the project more powerful, while other projects tended to make the project more lightweight and easier to maintain by removing unnecessary components or refactoring the entire framework.
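As a rough aggregate of the figures above (an illustrative calculation that is not part of the original analysis and that inherits the rounding of the reported values), weighting the mean scale of each cluster by its share of projects gives, for the first three and the last three clusters respectively,
\[
\frac{0.218 \times 28.6 + 0.316 \times 28.3 + 0.192 \times 24.0}{0.726} \approx 27.3 , \qquad
\frac{0.102 \times 26.3 + 0.095 \times 21.8 + 0.077 \times 19.8}{0.274} \approx 22.9 .
\]
Under this weighting, the group of growing paradigms is about 4 lines larger on average than the group of shrinking paradigms.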
\end{appendix}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% Reference section.
%%% citation in the content using "some words~\cite{1,2}".
%%% ~ is needed to make the reference number is on the same line with the word before it.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{thebibliography}{99}
\bibitem{Cito} J. Cito, G. Schermann, J. E. Wittern, et al. An empirical analysis of the docker container ecosystem on github. In: Proceedings of the 14th International Conference on Mining Software Repositories. IEEE Press, 2017. pp. 323--333.
\bibitem{Dua} R. Dua, A. R. Raja, and D. Kakadia. Virtualization vs containerization to support paas. In: Proceedings of the International Conference on Cloud Engineering. IEEE, 2014. pp. 610--614.
\bibitem{Boettiger} C. Boettiger. An introduction to docker for reproducible research. ACM SIGOPS Operating Systems Review, 2015, 49(1): pp. 71--79.
\bibitem{Manu} A. Manu, J. K. Patel, S. Akhtar, et al. Docker container security via heuristics-based multilateral security-conceptual and pragmatic study. In: Proceedings of the International Conference on Circuit, Power and Computing Technologies. IEEE, 2016. pp. 1--14.
\bibitem{Bernstein} D. Bernstein. Containers and cloud: From lxc to docker to kubernetes. IEEE Cloud Computing, 2014, 1(3): pp. 81--84.
\bibitem{Merkel} D. Merkel. Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014, 2014(239): p. 2.
\bibitem{Felter} W. Felter, A. Ferreira, R. Rajamony, et al. An updated performance comparison of virtual machines and linux containers. In: Proceedings of the International Symposium On Performance Analysis of Systems and Software. IEEE, 2015. pp. 171--172.
\bibitem{Mouat} A. Mouat. Using Docker: Developing and Deploying Software with Containers. O'Reilly Media, Inc., 2015.
\bibitem{Anderson} C. Anderson. Docker [software engineering]. IEEE Software, 2015, 32(3): pp. 102--c3.
\bibitem{Altman} N. S. Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 1992, 46(3): pp. 175--185.
\bibitem{Cawley} G. C. Cawley and N. L. Talbot. Fast exact leave-one-out cross-validation of sparse least-squares support vector machines. Neural Networks, 2004, 17(10): pp. 1467--1475.
\bibitem{Powell} M. J. Powell. A direct search optimization method that models the objective and constraint functions by linear interpolation. In: Advances in Optimization and Numerical Analysis. Springer, 1994. pp. 51--67.
\bibitem{Hartigan} J. A. Hartigan and M. A. Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 1979, 28(1): pp. 100--108.
\bibitem{Jaramillo} D. Jaramillo, D. V. Nguyen, and R. Smart. Leveraging microservices architecture by using docker technology. In: Proceedings of SoutheastCon. IEEE, 2016. pp. 1--5.
\end{thebibliography}
\end{document}