Add docs (#70)

* add ci config

* move docs to fluid project
biubiu-biub 2020-08-25 21:52:26 +08:00 committed by GitHub
parent e44d5ee7bd
commit 5c4a466670
36 changed files with 7321 additions and 533 deletions

24
docs/build.sh Executable file

@ -0,0 +1,24 @@
#!/bin/bash
langs="en zh"

# Stage 1: make an HTML copy of the API doc and merge each language's markdown files by TOC
for lang in $langs
do
    cp ${lang}/dev/api_doc.md ${lang}/dev/api_doc.html
    python3 scripts/mergeByTOC.py ${lang}/
done

# Stage 2: render the merged markdown into per-language PDFs
./scripts/genDoc.sh

# Stage 3: convert the API doc to PDF and append it to each language's PDF
for lang in $langs
do
    xvfb-run wkhtmltopdf ./en/dev/api_doc.html api.pdf
    python3 scripts/mergePDF.py ${lang}
done

# Stage 4: clean up intermediate files
for lang in $langs
do
    rm ${lang}/dev/api_doc.html
    rm ${lang}/doc.md
    rm api.pdf
    rm output_${lang}.pdf
done


@ -1,236 +0,0 @@
# Demo - Accelerate Remote File Access
Using [Alluxio](https://www.alluxio.io), Fluid provides an extremely convenient interface for remote file access, so that programs can access remote files just like local ones. Thanks to the file caching capability provided by Alluxio, repeated access to files that have already been read is greatly accelerated. This document demonstrates these features with a simple example.
## Prerequisites
Before running this example, please follow the [installation guide](../installation_cn/README.md) to complete the installation, and check that all Fluid components are running properly:
```shell script
$ kubectl get pod -n fluid-system
NAME READY STATUS RESTARTS AGE
controller-manager-7fd6457ccf-jnkvn 1/1 Running 0 60s
csi-nodeplugin-fluid-6rhpt 2/2 Running 0 60s
csi-nodeplugin-fluid-6zwgl 2/2 Running 0 60s
```
## Run the Example
**Check the Dataset object to be created**
```shell script
$ cat samples/accelerate/dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: hbase
spec:
  mounts:
    - mountPoint: https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/2.2.5/
      name: hbase
```
> This example uses HBase v2.2.5 files from an Apache mirror site as the remote files in the demonstration.
**Create the Dataset object**
```shell script
$ kubectl create -f samples/accelerate/dataset.yaml
dataset.data.fluid.io/hbase created
```
**Check the status of the Dataset object**
```shell script
$ kubectl get dataset hbase -o yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
...
status:
conditions: []
phase: NotBound
```
The Dataset object is not yet bound to any AlluxioRuntime object, so the `phase` in its `status` is `NotBound`, which means the Dataset object is not usable yet.
**Create the AlluxioRuntime object**
```shell script
$ kubectl create -f samples/accelerate/runtime.yaml
alluxioruntime.data.fluid.io/hbase created
```
Wait a while for the components defined in the AlluxioRuntime object to start up. You should see something like this:
```shell script
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
hbase-fuse-hvxgh 1/1 Running 0 27s
hbase-fuse-sjhxk 1/1 Running 0 27s
hbase-master-0 2/2 Running 0 62s
hbase-worker-92cln 2/2 Running 0 27s
hbase-worker-rlb5w 2/2 Running 0 27s
```
**Check the status of the Dataset object again**
```shell script
$ kubectl get dataset hbase -o yaml
...
...
status:
cacheStates:
cacheCapacity: 4GiB
cached: 0B
cachedPercentage: 0%
conditions:
- lastTransitionTime: "2020-07-29T08:23:44Z"
lastUpdateTime: "2020-07-29T08:26:29Z"
message: The ddc runtime is ready.
reason: DatasetReady
status: "True"
type: Ready
phase: Bound
runtimes:
- category: Accelerate
name: hbase
namespace: default
type: alluxio
ufsTotal: 443.5MiB
```
Because it is now bound to a successfully started AlluxioRuntime, the `status` of the Dataset object has been updated, and basic information about the resource object can be read from it.
**Check the status of the AlluxioRuntime object**
```shell script
$ kubectl get alluxioruntime hbase -o yaml
...
...
status:
cacheStates:
cacheCapacity: 4GiB
cached: 0B
cachedPercentage: 0%
conditions:
- lastProbeTime: "2020-07-29T08:23:05Z"
lastTransitionTime: "2020-07-29T08:23:05Z"
message: The master is initialized.
reason: Master is initialized
status: "True"
type: MasterInitialized
- lastProbeTime: "2020-07-29T08:23:40Z"
lastTransitionTime: "2020-07-29T08:23:05Z"
message: The master is ready.
reason: Master is ready
status: "True"
type: MasterReady
- lastProbeTime: "2020-07-29T08:23:20Z"
lastTransitionTime: "2020-07-29T08:23:20Z"
message: The workers are initialized.
reason: Workers are initialized
status: "True"
type: WorkersInitialized
- lastProbeTime: "2020-07-29T08:23:20Z"
lastTransitionTime: "2020-07-29T08:23:20Z"
message: The fuses are initialized.
reason: Fuses are initialized
status: "True"
type: FusesInitialized
- lastProbeTime: "2020-07-29T08:23:40Z"
lastTransitionTime: "2020-07-29T08:23:40Z"
message: The workers are partially ready.
reason: Workers are ready
status: "True"
type: WorkersReady
- lastProbeTime: "2020-07-29T08:23:40Z"
lastTransitionTime: "2020-07-29T08:23:40Z"
message: The fuses are ready.
reason: Fuses are ready
status: "True"
type: FusesReady
currentFuseNumberScheduled: 2
currentMasterNumberScheduled: 1
currentWorkerNumberScheduled: 2
desiredFuseNumberScheduled: 2
desiredMasterNumberScheduled: 1
desiredWorkerNumberScheduled: 2
fuseNumberAvailable: 2
fuseNumberReady: 2
fusePhase: Ready
masterNumberReady: 1
masterPhase: Ready
valueFile: hbase-alluxio-values
workerNumberAvailable: 2
workerNumberReady: 2
workerPhase: Ready
```
**Check the PersistentVolume and PersistentVolumeClaim associated with the remote files**
```shell script
$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
hbase 100Gi RWX Retain Bound default/hbase 18m
```
```shell script
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
hbase Bound hbase 100Gi RWX 18m
```
The PV and PVC associated with the remote files have been created by Fluid. Applications can mount the remote files into a Pod through this PVC and access them via the mount directory.
## Remote File Access
**Start an application to access the remote files**
```shell script
kubectl create -f samples/accelerate/nginx.yaml
```
Log in to the Nginx Pod:
```shell script
kubectl exec -it nginx -- bash
```
Check how the remote files are mounted:
```shell script
# ls -1 /data/hbase
CHANGES.md
RELEASENOTES.md
api_compare_2.2.5RC0_to_2.2.4.html
hbase-2.2.5-bin.tar.gz
hbase-2.2.5-client-bin.tar.gz
hbase-2.2.5-src.tar.gz
```
```shell script
# du -sh /data/hbase/hbase-2.2.5-client-bin.tar.gz
200M /data/hbase/hbase-2.2.5-client-bin.tar.gz
```
## Accelerate Remote File Access
**Launch a test job**
```shell script
$ kubectl create -f samples/accelerate/test.yaml
job.batch/fluid-test created
```
The test program tries to read a remote file (e.g. `hbase-2.2.5-client-bin.tar.gz`) and prints the time spent doing so:
```shell script
$ kubectl logs fluid-test-cqmwj
real 1m 9.55s
user 0m 0.00s
sys 0m 0.64s
```
As you can see, the first read of the remote file took almost 70s.
**Launch the test job again**
```shell script
kubectl delete -f samples/accelerate/test.yaml
kubectl create -f samples/accelerate/test.yaml
```
Since the remote file has already been cached, this time the test job completes quickly:
```shell script
$ kubectl logs fluid-test-hpzqc
real 0m 2.03s
user 0m 0.00s
sys 0m 0.63s
```
The same file access operation took only 2s.
Because the file has already been cached in Alluxio, access is much faster: Fluid uses Alluxio to accelerate remote file access.
> Note: The access speed above depends on the network conditions of the environment running the example. If file access is too slow, try a smaller remote file.
## Clean Up
```shell script
kubectl delete -f samples/accelerate
```


22
docs/en/TOC.md Normal file

@ -0,0 +1,22 @@
# Fluid Documentation
<!-- markdownlint-disable MD007 -->
<!-- markdownlint-disable MD032 -->
## TOC
+ Userguide
- [Overview](userguide/overview.md)
- [Get Started](userguide/get_started.md)
- [Installation](userguide/install.md)
- [Diagnose](userguide/diagnose.md)
+ Samples
- [Accelerate Data Accessing](samples/accelerate_data_accessing.md)
- [Cache Co-locality](samples/data_co_locality.md)
- [Machine Learning](samples/machinelearning.md)
- [Warm up](samples/warmup.md)
- [Dawnbench](samples/dawnbench.md)
+ Developer Guide
- [How to develop](dev/how_to_develop.md)
- [API_Doc](dev/api_doc.md)

2030
docs/en/dev/api_doc.md Normal file

File diff suppressed because it is too large


@ -0,0 +1,199 @@
# Developer Guide
## Requirements
- git
- golang (version >= 1.13)
- docker (version >= 19.03)
- Kubernetes (version >= 1.14)
- GNU Make
For the installation of golang, please refer to [Install Golang](https://golang.org/dl/).
`make` is usually in a `build-essential` package in your distribution's package manager of choice. Make sure you have `make` on your machine.
Chances are that you will want to run your implementation in a real Kubernetes cluster, so Docker is also needed for operations like building images.
See [Install Docker](https://docs.docker.com/engine/install/) for more information.
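For example, on Debian/Ubuntu-based systems (an assumption; use your own distribution's package manager), you could verify the toolchain like this:
```shell
# install make (shipped in build-essential) and confirm tool versions
$ sudo apt-get install -y build-essential
$ make --version
$ go version
$ docker version
```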
## How to Build, Run and Debug
### Get Source Code
```shell
$ mkdir -p $GOPATH/src/github.com/cloudnativefluid/
$ cd $GOPATH/src/github.com/cloudnativefluid
$ git clone https://github.com/fluid-cloudnative/fluid.git
```
> **NOTE**: In this document, we build, run and debug under non-module environment.
>
> See [Go Modules](https://github.com/golang/go/wiki/Modules) for more information if some issue occurs to you.
### Build binary
The `Makefile` under the project directory provides many tasks you may want to use, including test, build, debug and deploy.
You can simply get a binary by running:
```shell
# build controller manager
$ make manager
# build fluid CSI plugin
$ make csi
```
By default, the binaries will be put under `<fluid-path>/bin`.
### Build image
1. Set names for images
```shell
# image name for controller manager
$ export IMG=<registry>/<namespace>/<img-repo>
# image name for CSI plugin
$ export CSI_IMG=<registry>/<namespace>/<csi-img-repo>
```
The image tag will be automatically generated from the SHA1 of the current git revision.
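A hypothetical example (registry, namespace and repository names are placeholders) of how the pieces fit together:
```shell
# the name you choose plus the current git revision determine the pushed image reference
$ export IMG=registry.example.com/myteam/fluid-controller-manager
$ git rev-parse --short HEAD   # the revision whose SHA1 ends up in the image tag
```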
2. Log in to an image registry
Make sure you've logged in to the docker image registry that you'd like to push your images to:
```shell
$ sudo docker login <docker-registry>
```
3. Build your image and push:
```shell
# build and push image for controller manager
$ make docker-push
# build and push image for CSI plugin
$ make docker-push-csi
```
Alternatively, you can build your images first and then push them manually:
```shell
$ make docker-build
$ make docker-build-csi
$ docker push <IMG>:<IMG_TAG>
```
### Run your Fluid on a Kubernetes cluster
In the following steps, we assume you have properly configured `KUBECONFIG` environment variable or set up `~/.kube/config`. See [Kubeconfig docs](https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/) for more information.
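A minimal sketch (the path is an example; point it at your own config):
```shell
# tell kubectl which cluster to talk to, then confirm connectivity
$ export KUBECONFIG=$HOME/.kube/config
$ kubectl cluster-info
```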
1. Push your images to an image registry accessible to your Kubernetes cluster
If your images are pushed to private repositories, make sure your Kubernetes cluster holds credentials for accessing those repositories.
2. Change the images in the samples we provide:
```yaml
# <fluid-path>/config/fluid/patches/image_in_manager.yaml
...
...
containers:
- name: manager
image: <registry>/<namespace>/<img-repo>:<img-tag>
```
```yaml
# <fluid-path>/config/fluid/patches/image_in_csi-plugin.yaml
...
...
containers:
- name: plugins
image: <registry>/<namespace>/<csi-img-name>:<csi-img-tag>
```
3. Install CRDs
```shell
$ kubectl apply -k config/crd
```
Check CRD with:
```shell
$ kubectl get crd | grep fluid
alluxiodataloads.data.fluid.io 2020-08-22T03:53:46Z
alluxioruntimes.data.fluid.io 2020-08-22T03:53:46Z
datasets.data.fluid.io 2020-08-22T03:53:46Z
```
4. Install your implementation
```shell
$ kubectl apply -k config/fluid
```
Check Fluid system with:
```shell
$ kubectl get pod -n fluid-system
NAME READY STATUS RESTARTS AGE
controller-manager-7fd6457ccf-p7j2x 1/1 Running 0 84s
csi-nodeplugin-fluid-pj9tv 2/2 Running 0 84s
csi-nodeplugin-fluid-t8ctj 2/2 Running 0 84s
```
5. Run samples to verify your implementation
Here is a sample we provide; you may want to rewrite it according to your implementation.
```shell
$ kubectl apply -k config/samples
```
Check sample pods:
```shell
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
cifar10-fuse-vb6l4 1/1 Running 0 6m15s
cifar10-fuse-vtqpx 1/1 Running 0 6m15s
cifar10-master-0 2/2 Running 0 8m24s
cifar10-worker-729xz 2/2 Running 0 6m15s
cifar10-worker-d6kmd 2/2 Running 0 6m15s
nginx-0 1/1 Running 0 8m30s
nginx-1 1/1 Running 0 8m30s
```
6. Check logs to verify your implementation
```shell
$ kubectl logs -n fluid-system <CONTROLLER_MANAGER_NAME>
```
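If you don't know the exact Pod name, one way to look it up first (a sketch; the `controller-manager` name prefix is taken from the sample output above):
```shell
# grab the controller-manager Pod name, then show its logs
$ CONTROLLER_MANAGER_NAME=$(kubectl get pod -n fluid-system -o name | grep controller-manager)
$ kubectl logs -n fluid-system ${CONTROLLER_MANAGER_NAME#pod/}
```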
7. Clean up
```shell
$ kubectl delete -k config/samples
$ kubectl delete -k config/fluid
$ kubectl delete -k config/crd
```
### Debug
You can debug your program in multiple ways; here is just a brief guide to debugging with `go-delve`.
**Prerequisites**
Make sure you have `go-delve` installed. See the [go-delve installation guide](https://github.com/go-delve/delve/tree/master/Documentation/installation) for more information.
**Debug locally**
```shell
# build & debug in one line
$ dlv debug <fluid-path>/cmd/controller/main.go
# debug binary
$ make manager
$ dlv exec bin/manager
```
**Debug remotely**
On remote host:
```shell
$ dlv debug --headless --listen ":12345" --log --api-version=2 cmd/controller/main.go
```
The command above will make `go-delve` start a debug service and listen on port 12345.
On local host, connect to the debug service:
```shell
$ dlv connect "<remote-address>:12345" --api-version=2
```
> Note: To debug remotely, make sure the specified port is not occupied and the firewall has been properly configured.


@ -0,0 +1,370 @@
# DEMO - Speed Up Accessing Remote Files
Powered by [Alluxio](https://www.alluxio.io) and [Fuse](https://github.com/libfuse/libfuse), Fluid provides a simple way for users to access files stored in remote filesystems, just like accessing some ordinary file in local filesystems.
What's more, with the powerful caching capability provided, users can enjoy a great speedup when accessing remote files, especially those with a frequent access pattern.
This demo aims to show you an overview of all the features mentioned above.
## Prerequisites
Before going any further, please refer to the [Installation Guide](../userguide/install.md) to install Fluid on your Kubernetes cluster, and make sure all the components used by Fluid are ready, like this:
```shell
$ kubectl get pod -n fluid-system
NAME READY STATUS RESTARTS AGE
controller-manager-7fd6457ccf-jnkvn 1/1 Running 0 60s
csi-nodeplugin-fluid-6rhpt 2/2 Running 0 60s
csi-nodeplugin-fluid-6zwgl 2/2 Running 0 60s
```
Normally, you should see a Pod named "controller-manager" and several Pods named "csi-nodeplugin".
The number of "csi-nodeplugin" Pods depends on how many nodes your Kubernetes cluster has (e.g. 2 in this demo), so please make sure all "csi-nodeplugin" Pods are working properly.
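As an optional sanity check, the number of "csi-nodeplugin" Pods should match the node count:
```shell
# count the nodes in the cluster
$ kubectl get nodes --no-headers | wc -l
```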
## Set up workspace
```shell
$ mkdir <any-path>/accelerate
$ cd <any-path>/accelerate
```
## Install Resources to Kubernetes
**Check the `Dataset` object to be created**
```shell
$ cat<<EOF >dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: hbase
spec:
  mounts:
    - mountPoint: https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/2.2.5/
      name: hbase
EOF
```
Here, we'd like to create a resource object of kind `Dataset`. `Dataset` is a Custom Resource Definition (CRD) defined by Fluid and used to tell Fluid where to find all the data you'd like to access.
Under the hood, Fluid uses Alluxio to do some mount operations, so `mountPoint` property can be any legal UFS path acknowledged by Alluxio. Here, we use [WebUFS](https://docs.alluxio.io/os/user/stable/en/ufs/WEB.html) for its simplicity.
For more information about UFS, please refer to [Alluxio Docs - Storage Integrations](https://docs.alluxio.io/os/user/stable/en/ufs/HDFS.html).
For more information about properties in `Dataset`, please refer to our [API doc](../dev/api_doc.md).
> We use HBase v2.2.5 on a mirror site of Apache downloads as an example of a remote file. There is nothing special about it; you can change it to any remote file you like. But please note that if you are going to use WebUFS like we do, files on Apache sites are highly recommended, because otherwise you might need some advanced configuration due to the current implementation of WebUFS.
**Create the `Dataset` object**
```shell
$ kubectl create -f dataset.yaml
dataset.data.fluid.io/hbase created
```
**Check status of the `Dataset` object**
```shell
$ kubectl get dataset hbase -o yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
...
status:
conditions: []
phase: NotBound
```
With a `NotBound` phase in status, the dataset is not ready because there isn't any `AlluxioRuntime` object supporting it. We'll create one in the following steps.
**Check the `AlluxioRuntime` object to be created**
```shell
$ cat<<EOF >runtime.yaml
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
name: hbase
spec:
replicas: 2
tieredstore:
levels:
- mediumtype: MEM
path: /dev/shm
quota: 2Gi
high: "0.95"
low: "0.7"
storageType: Memory
properties:
alluxio.user.file.writetype.default: MUST_CACHE
alluxio.master.journal.folder: /journal
alluxio.master.journal.type: UFS
alluxio.user.block.size.bytes.default: 256MB
alluxio.user.streaming.reader.chunk.size.bytes: 256MB
alluxio.user.local.reader.chunk.size.bytes: 256MB
alluxio.worker.network.reader.buffer.size: 256MB
alluxio.user.streaming.data.timeout: 300sec
master:
jvmOptions:
- "-Xmx4G"
worker:
jvmOptions:
- "-Xmx4G"
fuse:
jvmOptions:
- "-Xmx4G "
- "-Xms4G "
# For now, only support local
shortCircuitPolicy: local
args:
- fuse
- --fuse-opts=direct_io,ro,max_read=131072,attr_timeout=7200,entry_timeout=7200,nonempty
EOF
```
**Create an `AlluxioRuntime` object**
```shell
$ kubectl create -f runtime.yaml
alluxioruntime.data.fluid.io/hbase created
```
`AlluxioRuntime` is another CRD defined by Fluid. An `AlluxioRuntime` object describes the specifications used to run an Alluxio instance.
Wait for a while, and make sure all components defined in the `AlluxioRuntime` object are ready. You should see something like this:
```shell
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
hbase-fuse-hvxgh 1/1 Running 0 27s
hbase-fuse-sjhxk 1/1 Running 0 27s
hbase-master-0 2/2 Running 0 62s
hbase-worker-92cln 2/2 Running 0 27s
hbase-worker-rlb5w 2/2 Running 0 27s
```
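Instead of polling manually, you can also watch Pod status changes until everything is up (an optional convenience, not required by the demo):
```shell
# watch Pod status changes; press Ctrl+C to stop
$ kubectl get pod -w
```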
**Check status of the `Dataset` object again**
```shell
$ kubectl get dataset hbase -o yaml
...
...
status:
cacheStates:
cacheCapacity: 4GiB
cached: 0B
cachedPercentage: 0%
conditions:
- lastTransitionTime: "2020-07-29T08:23:44Z"
lastUpdateTime: "2020-07-29T08:26:29Z"
message: The ddc runtime is ready.
reason: DatasetReady
status: "True"
type: Ready
phase: Bound
runtimes:
- category: Accelerate
name: hbase
namespace: default
type: alluxio
ufsTotal: 443.5MiB
```
The status of the `Dataset` object has been updated since a related Alluxio instance is ready and successfully bound to it. As you can see, basic information about the runtime along with some other status info is provided in `status`.
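If you only care about a couple of fields, a jsonpath query keeps it short (field paths taken from the output above):
```shell
# print just the phase and the cached bytes
$ kubectl get dataset hbase -o jsonpath='{.status.phase} {.status.cacheStates.cached}'
```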
**Check status of the `AlluxioRuntime` object**
```shell
$ kubectl get alluxioruntime hbase -o yaml
...
...
status:
cacheStates:
cacheCapacity: 4GiB
cached: 0B
cachedPercentage: 0%
conditions:
- lastProbeTime: "2020-07-29T08:23:05Z"
lastTransitionTime: "2020-07-29T08:23:05Z"
message: The master is initialized.
reason: Master is initialized
status: "True"
type: MasterInitialized
- lastProbeTime: "2020-07-29T08:23:40Z"
lastTransitionTime: "2020-07-29T08:23:05Z"
message: The master is ready.
reason: Master is ready
status: "True"
type: MasterReady
- lastProbeTime: "2020-07-29T08:23:20Z"
lastTransitionTime: "2020-07-29T08:23:20Z"
message: The workers are initialized.
reason: Workers are initialized
status: "True"
type: WorkersInitialized
- lastProbeTime: "2020-07-29T08:23:20Z"
lastTransitionTime: "2020-07-29T08:23:20Z"
message: The fuses are initialized.
reason: Fuses are initialized
status: "True"
type: FusesInitialized
- lastProbeTime: "2020-07-29T08:23:40Z"
lastTransitionTime: "2020-07-29T08:23:40Z"
message: The workers are partially ready.
reason: Workers are ready
status: "True"
type: WorkersReady
- lastProbeTime: "2020-07-29T08:23:40Z"
lastTransitionTime: "2020-07-29T08:23:40Z"
message: The fuses are ready.
reason: Fuses are ready
status: "True"
type: FusesReady
currentFuseNumberScheduled: 2
currentMasterNumberScheduled: 1
currentWorkerNumberScheduled: 2
desiredFuseNumberScheduled: 2
desiredMasterNumberScheduled: 1
desiredWorkerNumberScheduled: 2
fuseNumberAvailable: 2
fuseNumberReady: 2
fusePhase: Ready
masterNumberReady: 1
masterPhase: Ready
valueFile: hbase-alluxio-values
workerNumberAvailable: 2
workerNumberReady: 2
workerPhase: Ready
```
Detailed information about the Alluxio instance is provided here.
**Check related PersistentVolume and PersistentVolumeClaim**
```shell
$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
hbase 100Gi RWX Retain Bound default/hbase 18m
```
```shell
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
hbase Bound hbase 100Gi RWX 18m
```
The related PV and PVC have been created by Fluid since the `Dataset` object is ready (bound).
Workloads are now able to access remote files by mounting the PVC.
## Remote File Access
**Check the app to be created**
```shell
$ cat<<EOF >nginx.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx
      volumeMounts:
        - mountPath: /data
          name: hbase-vol
  volumes:
    - name: hbase-vol
      persistentVolumeClaim:
        claimName: hbase
EOF
```
**Run a demo app to access remote files**
```shell
$ kubectl create -f nginx.yaml
```
Log in to the nginx Pod:
```shell
$ kubectl exec -it nginx -- bash
```
Check file status
```shell
$ ls -1 /data/hbase
CHANGES.md
RELEASENOTES.md
api_compare_2.2.5RC0_to_2.2.4.html
hbase-2.2.5-bin.tar.gz
hbase-2.2.5-client-bin.tar.gz
hbase-2.2.5-src.tar.gz
```
```shell
$ du -h /data/hbase/*
174K /data/hbase/CHANGES.md
106K /data/hbase/RELEASENOTES.md
115K /data/hbase/api_compare_2.2.5RC0_to_2.2.4.html
211M /data/hbase/hbase-2.2.5-bin.tar.gz
200M /data/hbase/hbase-2.2.5-client-bin.tar.gz
34M /data/hbase/hbase-2.2.5-src.tar.gz
```
Logout:
```shell
$ exit
```
As you may have seen, all the files on the WebUFS (e.g. the HBase-related files on the Apache mirror in our case) appear no different from any other file in the local filesystem of the nginx Pod.
## Speed Up Accessing Remote Files
To demonstrate how much of a speedup you may enjoy when accessing remote files, here is a demo job:
**Check the test job to be launched**
```shell
$ cat<<EOF >app.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: fluid-copy-test
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: busybox
          image: busybox
          command: ["/bin/sh"]
          args: ["-c", "set -x; time cp -r /data/hbase ./"]
          volumeMounts:
            - mountPath: /data
              name: hbase-vol
      volumes:
        - name: hbase-vol
          persistentVolumeClaim:
            claimName: hbase
EOF
```
**Launch a test job**
```shell
$ kubectl create -f app.yaml
job.batch/fluid-copy-test created
```
Under the hood, the test job executes a shell command `time cp -r /data/hbase ./` and prints its result.
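The Pod created by the Job gets a random name suffix; if you need to look it up, the `job-name` label that Kubernetes automatically adds to Job Pods is handy:
```shell
# list the Pod(s) created by the fluid-copy-test Job
$ kubectl get pod -l job-name=fluid-copy-test
```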
Wait for a while and make sure the job has completed. You can check its result by:
```shell
$ kubectl logs fluid-copy-test-h59w9
+ time cp -r /data/hbase ./
real 1m 2.74s
user 0m 0.00s
sys 0m 1.35s
```
This is our first read of the file, and it takes about 63s. That may not be as fast as you expected, but:
**Re-Launch the test job**
```shell
$ kubectl delete -f app.yaml
$ kubectl create -f app.yaml
```
This time it will finish very soon after creation:
```shell
$ kubectl logs fluid-copy-test-d9h2x
+ time cp -r /data/hbase ./
real 0m 2.94s
user 0m 0.00s
sys 0m 1.27s
```
The same read operation takes only 3s this time.
The great speedup is owed to the powerful caching capability provided by Alluxio: once you have accessed a remote file, it is cached in Alluxio, so your following operations enjoy local access instead of remote access, and thus a great speedup.
> Note: The time spent by the test job depends on your network environment. If it takes too long to complete, switching to another mirror or a smaller file might help.
## Clean Up
```shell
$ kubectl delete -f .
```


@ -0,0 +1,284 @@
# DEMO - Cache Co-locality for Workload Scheduling
In Fluid, remote files specified in a `Dataset` object are schedulable, which means you are able to control where your data cache is placed in a Kubernetes cluster,
just like what you may have done with Pods. Also, Fluid is able to make cache co-locality scheduling decisions for workloads to minimize overhead costs.
This demo gives you an overview of the features mentioned above.
## Prerequisites
Before going any further, please refer to the [Installation Guide](../userguide/install.md) to install Fluid on your Kubernetes cluster, and make sure all the components used by Fluid are ready, like this:
```shell
$ kubectl get pod -n fluid-system
NAME READY STATUS RESTARTS AGE
controller-manager-7fd6457ccf-jnkvn 1/1 Running 0 60s
csi-nodeplugin-fluid-6rhpt 2/2 Running 0 60s
csi-nodeplugin-fluid-6zwgl 2/2 Running 0 60s
```
Normally, you should see a Pod named "controller-manager" and several Pods named "csi-nodeplugin".
The number of "csi-nodeplugin" Pods depends on how many nodes your Kubernetes cluster has (e.g. 2 in this demo), so please make sure all "csi-nodeplugin" Pods are working properly.
## Set up workspace
```shell
$ mkdir <any-path>/co-locality
$ cd <any-path>/co-locality
```
## Install Resources to Kubernetes
**Check all nodes in your Kubernetes cluster**
```shell
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
cn-beijing.192.168.1.146 Ready <none> 7d14h v1.16.9-aliyun.1
cn-beijing.192.168.1.147 Ready <none> 7d14h v1.16.9-aliyun.1
```
**Label one of the nodes**
```shell
$ kubectl label nodes cn-beijing.192.168.1.146 hbase-cache=true
```
Since we'll use `NodeSelector` to manage where to put our data, we mark the desired node by labeling it.
**Check all nodes again**
```shell
$ kubectl get node -L hbase-cache
NAME STATUS ROLES AGE VERSION HBASE-CACHE
cn-beijing.192.168.1.146 Ready <none> 7d14h v1.16.9-aliyun.1 true
cn-beijing.192.168.1.147 Ready <none> 7d14h v1.16.9-aliyun.1
```
Only one of the two nodes holds the label `hbase-cache=true`. In the following steps, we are going to make sure it's the only location the data cache can be placed on.
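A quick optional check of which nodes carry the label:
```shell
# list only the nodes labeled hbase-cache=true
$ kubectl get nodes -l hbase-cache=true
```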
**Check the `Dataset` object to be created**
```shell
$ cat<<EOF >dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: hbase
spec:
  mounts:
    - mountPoint: https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/2.2.5/
      name: hbase
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: hbase-cache
              operator: In
              values:
                - "true"
EOF
```
We define a `nodeSelectorTerm` in the `Dataset` object's `spec` to make sure that only nodes with the label `hbase-cache=true` are considered available for the dataset.
**Create the dataset object**
```shell
$ kubectl create -f dataset.yaml
dataset.data.fluid.io/hbase created
```
**Check the `AlluxioRuntime` object to be created**
```shell
$ cat<<EOF >runtime.yaml
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
name: hbase
spec:
replicas: 2
tieredstore:
levels:
- mediumtype: MEM
path: /dev/shm
quota: 2Gi
high: "0.95"
low: "0.7"
storageType: Memory
properties:
alluxio.user.file.writetype.default: MUST_CACHE
alluxio.master.journal.folder: /journal
alluxio.master.journal.type: UFS
alluxio.user.block.size.bytes.default: 256MB
alluxio.user.streaming.reader.chunk.size.bytes: 256MB
alluxio.user.local.reader.chunk.size.bytes: 256MB
alluxio.worker.network.reader.buffer.size: 256MB
alluxio.user.streaming.data.timeout: 300sec
master:
jvmOptions:
- "-Xmx4G"
worker:
jvmOptions:
- "-Xmx4G"
fuse:
jvmOptions:
- "-Xmx4G "
- "-Xms4G "
- "-XX:+UseG1GC "
- "-XX:MaxDirectMemorySize=4g "
- "-XX:+UnlockExperimentalVMOptions "
- "-XX:ActiveProcessorCount=8 "
# For now, only support local
shortCircuitPolicy: local
args:
- fuse
- --fuse-opts=direct_io,ro,max_read=131072,attr_timeout=7200,entry_timeout=7200,nonempty
EOF
```
This YAML snippet contains many specifications used by Fluid to launch an Alluxio instance. By creating such an `AlluxioRuntime` object, an Alluxio instance with 1 master and 2 workers is expected to be launched.
**Create the `AlluxioRuntime` object**
```shell
$ kubectl create -f runtime.yaml
alluxioruntime.data.fluid.io/hbase created
$ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
hbase-fuse-42csf 1/1 Running 0 104s 192.168.1.146 cn-beijing.192.168.1.146 <none> <none>
hbase-master-0 2/2 Running 0 3m3s 192.168.1.147 cn-beijing.192.168.1.147 <none> <none>
hbase-worker-l62m4 2/2 Running 0 104s 192.168.1.146 cn-beijing.192.168.1.146 <none> <none>
```
While two running workers are expected, only one is running, on the node with the label `hbase-cache=true`. The `nodeSelectorTerm` prevents the other worker from being deployed.
**Check status of the `AlluxioRuntime` object**
```shell
$ kubectl get alluxioruntime hbase -o yaml
...
status:
cacheStates:
cacheCapacity: 2GiB
cached: 0B
cachedPercentage: 0%
conditions:
...
currentFuseNumberScheduled: 1
currentMasterNumberScheduled: 1
currentWorkerNumberScheduled: 1
desiredFuseNumberScheduled: 2
desiredMasterNumberScheduled: 1
desiredWorkerNumberScheduled: 2
fuseNumberAvailable: 1
fuseNumberReady: 1
fusePhase: PartialReady
masterNumberReady: 1
masterPhase: Ready
valueFile: hbase-alluxio-values
workerNumberAvailable: 1
workerNumberReady: 1
workerPhase: PartialReady
```
As expected, `workerPhase` is `PartialReady` and `currentWorkerNumberScheduled: 1` is less than `desiredWorkerNumberScheduled: 2`.
**Check the workload to be created**
A sample workload is provided to demonstrate how cache co-locality scheduling works. Let's check it out first:
```shell
$ cat<<EOF >app.yaml
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
name: nginx
labels:
app: nginx
spec:
replicas: 2
serviceName: "nginx"
podManagementPolicy: "Parallel"
selector: # define how the deployment finds the pods it manages
matchLabels:
app: nginx
template: # define the pods specifications
metadata:
labels:
app: nginx
spec:
affinity:
# prevent two Nginx Pod from being scheduled at the same Node
# just for demonstrating co-locality demo
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- nginx
topologyKey: "kubernetes.io/hostname"
containers:
- name: nginx
image: nginx
volumeMounts:
- mountPath: /data
name: hbase-vol
volumes:
- name: hbase-vol
persistentVolumeClaim:
claimName: hbase
EOF
```
The `podAntiAffinity` property might be a little confusing, so here is the explanation:
it makes sure that all Pods created by the workload are distributed across different nodes, which gives us a clearer view of how cache co-locality scheduling works.
In short, it's just a property for demonstration; you don't need to focus on it much :)
**Run the workload**
```shell
$ kubectl create -f app.yaml
statefulset.apps/nginx created
```
**Check status of the workload**
```shell
$ kubectl get pod -o wide -l app=nginx
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-0 1/1 Running 0 2m5s 192.168.1.146 cn-beijing.192.168.1.146 <none> <none>
nginx-1 0/1 Pending 0 2m5s <none> <none> <none> <none>
```
Only one Pod is ready, and it is running on the only node that matches the `nodeSelectorTerm`.
**Check the reason why it's still not ready**
```shell
$ kubectl describe pod nginx-1
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler 0/2 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had volume node affinity conflict.
Warning FailedScheduling <unknown> default-scheduler 0/2 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had volume node affinity conflict.
```
As you may have seen, on the one hand, `podAntiAffinity` prevents the `nginx-1` Pod from being scheduled onto the same node as `nginx-0`; on the other hand, there's only one node satisfying the given affinity condition.
**Label another node**
```shell
$ kubectl label node cn-beijing.192.168.1.147 hbase-cache=true
```
Now both nodes hold the label `hbase-cache=true`. Re-check all the Pods:
```shell
$ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
hbase-fuse-42csf 1/1 Running 0 44m 192.168.1.146 cn-beijing.192.168.1.146 <none> <none>
hbase-fuse-kth4g 1/1 Running 0 10m 192.168.1.147 cn-beijing.192.168.1.147 <none> <none>
hbase-master-0 2/2 Running 0 46m 192.168.1.147 cn-beijing.192.168.1.147 <none> <none>
hbase-worker-l62m4 2/2 Running 0 44m 192.168.1.146 cn-beijing.192.168.1.146 <none> <none>
hbase-worker-rvncl 2/2 Running 0 10m 192.168.1.147 cn-beijing.192.168.1.147 <none> <none>
```
There are two running Alluxio workers now.
```shell
$ kubectl get pod -l app=nginx -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-0 1/1 Running 0 21m 192.168.1.146 cn-beijing.192.168.1.146 <none> <none>
nginx-1 1/1 Running 0 21m 192.168.1.147 cn-beijing.192.168.1.147 <none> <none>
```
The other nginx Pod is no longer pending either.
In conclusion, Fluid supports both schedulable data caches and cache co-locality scheduling for workloads. They usually work together and offer a more flexible approach to users who need data management in Kubernetes.
## Clean Up
```shell
$ kubectl delete -f .
# unlabel nodes
$ kubectl label node cn-beijing.192.168.1.146 hbase-cache-
$ kubectl label node cn-beijing.192.168.1.147 hbase-cache-
```


@ -0,0 +1 @@
# dawnbench_en


@ -0,0 +1,294 @@
# Accelerate Machine Learning Training with Fluid
This article describes how to deploy the [ImageNet](http://www.image-net.org/) dataset stored on [Aliyun OSS](https://cn.aliyun.com/product/oss) to a Kubernetes cluster with Fluid, and how to train a ResNet-50 model on it using [arena](https://github.com/kubeflow/arena). In this article, we run the training on 4 nodes, each with 8 GPU cards.
## Prerequisites
- [Fluid](https://github.com/fluid-cloudnative/fluid) (version >= 0.1.0)
- [arena](https://github.com/kubeflow/arena) (version >= 0.4.0)
> **NOTE**:
>
> 1. The document requires Fluid installed on your Kubernetes cluster. Please refer to [Fluid Installation Guide](../userguide/install.md) to finish installation before going to the next step.
>
> 2. Arena is a CLI that is convenient for data scientists to run and monitor machine learning tasks. See [arena-installation-tutorial](https://github.com/kubeflow/arena/blob/master/docs/installation/INSTALL_FROM_BINARY.md) for more information.
## Deploy Dataset on Kubernetes Cluster with Fluid
### Create Dataset and Runtime
The following `dataset.yaml` file defines a `Dataset` and a `Runtime`, separated by `---`.
The dataset is stored on [Alibaba Cloud OSS](https://cn.aliyun.com/product/oss). To ensure that Alluxio can successfully mount the dataset, please make sure that the configurations in `dataset.yaml` are correctly set, including `mountPoint`, `fs.oss.accessKeyId`, `fs.oss.accessKeySecret` and `fs.oss.endpoint`.
> See Alluxio's official document [Aliyun Object Storage Service](https://docs.alluxio.io/os/user/stable/en/ufs/OSS.html) for more examples of using OSS in Alluxio.
This document uses 4 machines for the training task, so `spec.replicas` is set to `4`. In addition, the following `dataset.yaml` also sets many parameters based on our experience to optimize the IO performance of Alluxio for machine learning tasks, at the Alluxio, Fuse and JVM levels. You can adjust these parameters according to your test environment and task requirements.
```shell
$ cat << EOF >> dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
name: imagenet
spec:
mounts:
- mountPoint: oss://<OSS_BUCKET>/<OSS_DIRECTORY>/
name: imagenet
options:
fs.oss.accessKeyId: <OSS_ACCESS_KEY_ID>
fs.oss.accessKeySecret: <OSS_ACCESS_KEY_SECRET>
fs.oss.endpoint: <OSS_ENDPOINT>
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
name: imagenet
spec:
replicas: 4
data:
replicas: 1
# alluxioVersion:
# image: registry.cn-huhehaote.aliyuncs.com/alluxio/alluxio
# imageTag: "2.3.0-SNAPSHOT-bbce37a"
# imagePullPolicy: Always
tieredstore:
levels:
- mediumtype: SSD
path: /var/lib/docker/alluxio
quota: 50Gi
high: "0.99"
low: "0.8"
properties:
# alluxio fuse
alluxio.fuse.jnifuse.enabled: "true"
alluxio.fuse.debug.enabled: "false"
alluxio.fuse.cached.paths.max: "1000000"
alluxio.fuse.logging.threshold: 1000ms
# alluxio master
alluxio.master.metastore: ROCKS
alluxio.master.journal.folder: /journal
alluxio.master.journal.type: UFS
alluxio.master.metastore.inode.cache.max.size: "10000000"
alluxio.master.journal.log.size.bytes.max: 500MB
alluxio.master.metadata.sync.concurrency.level: "128"
alluxio.master.metadata.sync.executor.pool.size: "128"
alluxio.master.metadata.sync.ufs.prefetch.pool.size: "128"
alluxio.master.rpc.executor.max.pool.size: "1024"
alluxio.master.rpc.executor.core.pool.size: "128"
# alluxio worker
alluxio.worker.allocator.class: alluxio.worker.block.allocator.GreedyAllocator
alluxio.worker.network.reader.buffer.size: 32MB
alluxio.worker.file.buffer.size: 320MB
alluxio.worker.block.master.client.pool.size: "1024"
# alluxio user
alluxio.user.block.worker.client.pool.min: "512"
alluxio.user.file.writetype.default: MUST_CACHE
alluxio.user.ufs.block.read.location.policy: alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy
alluxio.user.block.write.location.policy.class: alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy
alluxio.user.block.size.bytes.default: 16MB
alluxio.user.streaming.reader.chunk.size.bytes: 32MB
alluxio.user.local.reader.chunk.size.bytes: 32MB
alluxio.user.metrics.collection.enabled: "false"
alluxio.user.update.file.accesstime.disabled: "true"
alluxio.user.file.passive.cache.enabled: "false"
alluxio.user.block.avoid.eviction.policy.reserved.size.bytes: 2GB
alluxio.user.block.master.client.pool.gc.threshold: 2day
alluxio.user.file.master.client.threads: "1024"
alluxio.user.block.master.client.threads: "1024"
alluxio.user.file.readtype.default: CACHE
alluxio.user.metadata.cache.enabled: "true"
alluxio.user.metadata.cache.expiration.time: 2day
alluxio.user.metadata.cache.max.size: "1000000"
alluxio.user.direct.memory.io.enabled: "true"
alluxio.user.worker.list.refresh.interval: 2min
alluxio.user.logging.threshold: 1000ms
# other alluxio configurations
alluxio.web.ui.enabled: "false"
alluxio.security.stale.channel.purge.interval: 365d
alluxio.job.worker.threadpool.size: "164"
master:
jvmOptions:
- "-Xmx6G"
- "-XX:+UnlockExperimentalVMOptions"
- "-XX:ActiveProcessorCount=8"
worker:
jvmOptions:
- "-Xmx12G"
- "-XX:+UnlockExperimentalVMOptions"
- "-XX:MaxDirectMemorySize=32g"
- "-XX:ActiveProcessorCount=8"
resources:
limits:
cpu: 8
fuse:
# image: registry.cn-huhehaote.aliyuncs.com/alluxio/alluxio-fuse
# imageTag: "2.3.0-SNAPSHOT-bbce37a"
# imagePullPolicy: Always
env:
MAX_IDLE_THREADS: "32"
jvmOptions:
- "-Xmx16G"
- "-Xms16G"
- "-XX:+UseG1GC"
- "-XX:MaxDirectMemorySize=32g"
- "-XX:+UnlockExperimentalVMOptions"
- "-XX:ActiveProcessorCount=24"
resources:
limits:
cpu: 16
shortCircuitPolicy: local
args:
- fuse
- --fuse-opts=kernel_cache,ro,max_read=131072,attr_timeout=7200,entry_timeout=7200,nonempty
EOF
```
Create Dataset and Alluxio Runtime with:
```shell
$ kubectl create -f dataset.yaml
```
Check the status of the Alluxio Runtime; there should be 1 Master, 4 Workers and 4 Fuses running:
```shell
$ kubectl describe alluxioruntime imagenet
Name: imagenet
Namespace: default
Labels: <none>
Annotations: <none>
API Version: data.fluid.io/v1alpha1
Kind: AlluxioRuntime
Metadata:
# more metadata
Spec:
# more spec
Status:
Cache States:
Cache Capacity: 200GiB
Cached: 0B
Cached Percentage: 0%
Conditions:
# more conditions
Current Fuse Number Scheduled: 4
Current Master Number Scheduled: 1
Current Worker Number Scheduled: 4
Desired Fuse Number Scheduled: 4
Desired Master Number Scheduled: 1
Desired Worker Number Scheduled: 4
Fuse Number Available: 4
Fuse Numb Status: True
Type: Ready
Phase: Bound
Runtimes:
Category: Accelerate
Name: imagenet
Namespace: default
Type: alluxio
Ufs Total: 143.7GiB
Events: <none>
```
At the same time, Dataset is bound to Alluxio Runtime:
```shell
$ kubectl describe dataset
Name: imagenet
Namespace: default
Labels: <none>
Annotations: <none>
API Version: data.fluid.io/v1alpha1
Kind: Dataset
Metadata:
# more metadata
Spec:
# more spec
Status:
Cache States:
Cache Capacity: 200GiB
Cached: 0B
Cached Percentage: 0%
Conditions:
Last Transition Time: 2020-08-18T11:01:09Z
Last Update Time: 2020-08-18T11:02:48Z
Message: The ddc runtime is ready.
Reason: DatasetReady
Status: True
Type: Ready
Phase: Bound
Runtimes:
Category: Accelerate
Name: imagenet
Namespace: default
Type: alluxio
Ufs Total: 143.7GiB
Events: <none>
```
A PV and a PVC named `imagenet` are created successfully. So far, the dataset stored on the cloud has been successfully deployed to the Kubernetes cluster.
```shell
$ kubectl get pv,pvc
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
persistentvolume/imagenet 100Gi RWX Retain Bound default/imagenet 7m11s
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/imagenet Bound imagenet 100Gi RWX 7m11s
```
## Example: Run Deep Learning Frameworks Using Arena
`arena` provides a convenient way to help users submit and monitor machine learning tasks. In this article, we use `arena` to simplify the deployment process of machine learning tasks.
If you have installed `arena` and the dataset has been successfully deployed to the local cluster, you can start training a ResNet-50 model by simply executing the following command:
```shell
arena submit mpi \
--name horovod-resnet50-v2-4x8-fluid \
--gpus=8 \
--workers=4 \
--working-dir=/horovod-demo/tensorflow-demo/ \
--data imagenet:/data \
-e DATA_DIR=/data/imagenet \
-e num_batch=1000 \
-e datasets_num_private_threads=8 \
--image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod-benchmark-dawnbench-v2:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 \
./launch-example.sh 4 8
```
Notes:
- `--name`: specifies the job name, `horovod-resnet50-v2-4x8-fluid` in this example
- `--workers`: specifies the number of nodes (workers) participating in the training
- `--gpus`: specifies the number of GPUs used by each worker
- `--working-dir`: specifies the working directory
- `--data`: tells workers to mount a volume named `imagenet` to the directory `/data`
- `-e DATA_DIR`: specifies the directory where the dataset is located
- `./launch-example.sh 4 8`: runs the shell script that launches the training process
Check whether the task is executed normally:
```shell
$ arena get horovod-resnet50-v2-4x8-fluid -e
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 16s
NAME STATUS TRAINER AGE INSTANCE NODE
horovod-resnet50-v2-4x8-fluid RUNNING MPIJOB 16s horovod-resnet50-v2-4x8-fluid-launcher-czlfn 192.168.1.21
horovod-resnet50-v2-4x8-fluid RUNNING MPIJOB 16s horovod-resnet50-v2-4x8-fluid-worker-0 192.168.1.16
horovod-resnet50-v2-4x8-fluid RUNNING MPIJOB 16s horovod-resnet50-v2-4x8-fluid-worker-1 192.168.1.21
horovod-resnet50-v2-4x8-fluid RUNNING MPIJOB 16s horovod-resnet50-v2-4x8-fluid-worker-2 192.168.1.25
horovod-resnet50-v2-4x8-fluid RUNNING MPIJOB 16s horovod-resnet50-v2-4x8-fluid-worker-3 192.168.3.29
```
If you find that `4` workers are in the `RUNNING` state, congratulations! It means that you have successfully started the training.
If you want to see how the training is going, please check the arena logs:
```shell
$ arena logs --tail 100 -f horovod-resnet50-v2-4x8-fluid
```


@ -0,0 +1 @@
# warmup_en


@ -0,0 +1 @@
# diagnose


@ -0,0 +1,182 @@
# Get Started with Fluid
This document mainly describes how to create a Kubernetes cluster environment, complete Fluid installation and deployment with Helm, and use Fluid to create a dataset and speed up your application.
## Create a Kubernetes cluster
A Kubernetes environment is a prerequisite for Fluid. Choose the most suitable way to get one based on your experience:
- If you already have a Kubernetes cluster, you can skip to [Deploy Fluid](#deploy-fluid).
- If you have not used Kubernetes before, you can use Minikube to create a Kubernetes cluster.
[Minikube](https://kubernetes.io/docs/setup/minikube/) can create a Kubernetes cluster in a virtual machine, which can run on macOS, Linux and Windows.
Please ensure that the following requirements are met:
- [Minikube](https://kubernetes.io/docs/tasks/tools/install-minikube/): version 1.0.0+
- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl): version 1.14+
After installing Minikube:
```shell
minikube start
```
If the installation is successful, you will see a prompt message like this:
```shell
minikube v1.12.1 on Darwin 10.14.5
```
Use `kubectl` to access the newly created Kubernetes cluster
```shell
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-deployment-558fc78868-kvjnf 1/1 Running 1 4d12h
nginx-deployment-558fc78868-kx9gt 1/1 Running 1 4d12h
```
## Deploy Fluid
Before the installation, make sure that the following requirements have been met:
- You can access the Kubernetes cluster with `kubectl` successfully.
- [Helm](https://helm.sh/docs/intro/install/): Helm 3 is installed.
- Git: Git is installed
1. Download Fluid
```shell
git clone https://github.com/fluid-cloudnative/fluid.git
cd fluid/charts/fluid
```
2. Install Fluid with Helm
```shell
helm install fluid fluid
NAME: fluid
LAST DEPLOYED: Tue Jul 7 11:22:07 2020
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
```
3. Check installation results
```shell
kubectl get po -n fluid-system
NAME READY STATUS RESTARTS AGE
controller-manager-6b864dfd4f-995gm 1/1 Running 0 32h
csi-nodeplugin-fluid-c6pzj 2/2 Running 0 32h
csi-nodeplugin-fluid-wczmq 2/2 Running 0 32h
```
## Create a dataset
Fluid provides cloud-native data acceleration and management capabilities, and uses `Dataset` as a high-level abstraction to facilitate user management. Here we will show you how to create a dataset with Fluid.
1. Create a Dataset object through the CRD file, which describes the source of the dataset.
```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo
spec:
  mounts:
    - mountPoint: https://mirror.bit.edu.cn/apache/spark/spark-3.0.0/
      name: spark
```
Create dataset with kubectl
```shell
kubectl create -f dataset.yaml
```
After the dataset is created, it is in the `NotBound` state and needs to be bound to a runtime before it can be used.
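As a quick check (the dataset here is named `demo`), you can query just its phase, which should report `NotBound` at this point:
```shell
kubectl get dataset demo -o jsonpath='{.status.phase}'
```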
2. Next, we create an AlluxioRuntime object based on the AlluxioRuntime CRD file, which enables the dataset.
```yaml
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
name: demo
spec:
replicas: 1
tieredstore:
levels:
- mediumtype: MEM
path: /dev/shm
quota: 2Gi
high: "0.95"
low: "0.7"
storageType: Memory
properties:
alluxio.user.file.writetype.default: MUST_CACHE
alluxio.master.journal.folder: /journal
alluxio.master.journal.type: UFS
alluxio.user.block.size.bytes.default: 256MB
alluxio.user.streaming.reader.chunk.size.bytes: 256MB
alluxio.user.local.reader.chunk.size.bytes: 256MB
alluxio.worker.network.reader.buffer.size: 256MB
alluxio.user.streaming.data.timeout: 300sec
master:
jvmOptions:
- "-Xmx4G"
worker:
jvmOptions:
- "-Xmx4G"
fuse:
jvmOptions:
- "-Xmx4G "
- "-Xms4G "
# For now, only support local
shortCircuitPolicy: local
args:
- fuse
- --fuse-opts=direct_io,ro,max_read=131072
```
Create Alluxio Runtime with kubectl
```shell
kubectl create -f runtime.yaml
```
3. Next, we create an application to access this dataset. Here we will access the same data multiple times and compare the time consumed by each access.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: demo
      image: nginx
      volumeMounts:
        - mountPath: /data
          name: demo
  volumes:
    - name: demo
      persistentVolumeClaim:
        claimName: demo
```
Create Pod with kubectl
```shell
kubectl create -f app.yaml
```
4. Exec into the container to access the data; the first access will take longer.
```
kubectl exec -it demo-app -- bash
# du -sh /data/spark/spark-3.0.0-bin-without-hadoop.tgz
150M /data/spark/spark-3.0.0-bin-without-hadoop.tgz
# time cp /data/spark/spark-3.0.0-bin-without-hadoop.tgz /dev/null
real 0m13.171s
user 0m0.002s
sys 0m0.028s
```
5. In order to avoid the influence of other factors like the page cache, we will delete the previous container, create the same application, and try to access the same file. Since the file has been cached by Alluxio at this time, you can see that it takes significantly less time now.
```
kubectl delete -f app.yaml && kubectl create -f app.yaml
...
# time cp /data/spark/spark-3.0.0-bin-without-hadoop.tgz /dev/null
real 0m0.344s
user 0m0.002s
sys 0m0.020s
```
At this point, we have successfully created a dataset and accelerated access to it. For further use and management of datasets, please refer to the [accelerate](../samples/accelerate_data_accessing.md) and [co-locality](../samples/data_co_locality.md) examples.


@ -0,0 +1,87 @@
# Deploy Fluid on Your Kubernetes Cluster
## Prerequisites
- git
- Kubernetes cluster (version >= 1.14, with CSI support)
- kubectl (version >= 1.14)
- [helm](https://helm.sh/) (version >= 3.0)
The following documents assume that you have installed all the above requirements.
For the installation and configuration of kubectl, please refer to [here](https://kubernetes.io/docs/tasks/tools/install-kubectl/).
For the installation and configuration of Helm 3, please refer to [here](https://v3.helm.sh/docs/intro/install/).
## How to Deploy
### Download Fluid Chart
You can execute the following command in any folder to clone source code from [fluid repository](https://github.com/fluid-cloudnative/fluid):
```shell
$ git clone https://github.com/fluid-cloudnative/fluid.git
```
The [helm charts](https://github.com/fluid-cloudnative/fluid/tree/master/charts) used to deploy Fluid are included in the source code.
### Install Fluid with Helm
Enter the cloned local repository:
```shell
$ cd fluid
```
Create namespace:
```shell
$ kubectl create ns fluid-system
```
Install fluid with:
```shell
$ helm install fluid charts/fluid/fluid
NAME: fluid
LAST DEPLOYED: Fri Jul 24 16:10:18 2020
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
```
> The general format of the `helm install` command is `helm install <RELEASE_NAME> <SOURCE>`. In the above command, `fluid` is the release name, and `charts/fluid/fluid` specifies the path to the helm chart.
### Check Status of Components
**Check the CRDs used by Fluid:**
```shell
$ kubectl get crd | grep data.fluid.io
alluxiodataloads.data.fluid.io 2020-07-24T06:54:50Z
alluxioruntimes.data.fluid.io 2020-07-24T06:54:50Z
datasets.data.fluid.io 2020-07-24T06:54:50Z
```
**Check the status of pods:**
```shell
$ kubectl get pod -n fluid-system
NAME READY STATUS RESTARTS AGE
controller-manager-7f99c884dd-894g9 1/1 Running 0 5m28s
csi-nodeplugin-fluid-dm9b8 2/2 Running 0 5m28s
csi-nodeplugin-fluid-hwtvh 2/2 Running 0 5m28s
```
If the Pod status is as shown above, then Fluid is installed on your Kubernetes cluster successfully!
### Uninstall Fluid
```shell
$ helm delete fluid
$ kubectl delete -f charts/fluid/fluid/crds
```
> The `fluid` here refers to the `<RELEASE_NAME>` used during installation.
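If you are not sure which release name was used, you can list the installed releases first:
```shell
$ helm list
```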


@ -0,0 +1,15 @@
# Overview
[Fluid](https://github.com/fluid-cloudnative/fluid) is an open source, Kubernetes-native distributed dataset orchestrator and accelerator for data analysis and machine learning. It provides a full management life-cycle for the data orchestration system (Alluxio), including deployment, scaling and configuration changes. With Fluid, end users can manage their data without touching the underlying data caching system.
> **Note:**
>
> You can only deploy Fluid in a Kubernetes cluster.
The compatibility between Fluid and Alluxio versions is as follows:
| Fluid version | Compatible Alluxio versions |
|:---|:---|
| v0.1 | [Alluxio JNI Fuse 2.3](https://github.com/Alluxio/alluxio/tree/branch-2.3-fuse)|


@ -1,65 +0,0 @@
## Install Fluid
This document assumes that you already have an available and accessible Kubernetes cluster.
### Requirements
- Kubernetes >=1.16, kubectl >= 1.16
- Helm 3
For the installation and configuration of kubectl, please refer to [here](https://kubernetes.io/docs/tasks/tools/install-kubectl/).
For the installation and configuration of Helm 3, please refer to [here](https://v3.helm.sh/docs/intro/install/).
### Steps
1\. Prepare a kubeconfig file via `export KUBECONFIG=<your-kubeconfig-path>` or by creating `~/.kube/config`
2\. Check that helm can manage the Kubernetes cluster properly
```shell script
$ helm list
$ echo $?
```
3\. Get the Fluid chart
```shell script
$ cd <some-dir>
$ wget http://kubeflow.oss-cn-beijing.aliyuncs.com/fluid-0.1.0.tgz
$ tar -xvf fluid-0.1.0.tgz
```
4\. Install Fluid with Helm
```shell script
$ helm install <release-name> fluid
NAME: <release-name>
LAST DEPLOYED: Fri Jul 24 16:10:18 2020
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
```
`<release-name>` can be any name you like (e.g. `fluid-release`); it is used by Helm for release management.
5\. Check the status of the components
**Check the CRDs used by Fluid:**
```shell script
$ kubectl get crd | grep data.fluid.io
alluxiodataloads.data.fluid.io 2020-07-24T06:54:50Z
alluxioruntimes.data.fluid.io 2020-07-24T06:54:50Z
datasets.data.fluid.io 2020-07-24T06:54:50Z
```
**Check the status of the Pods:**
```shell script
$ kubectl get pod -n fluid-system
NAME READY STATUS RESTARTS AGE
controller-manager-7f99c884dd-894g9 1/1 Running 0 5m28s
csi-nodeplugin-fluid-dm9b8 2/2 Running 0 5m28s
csi-nodeplugin-fluid-hwtvh 2/2 Running 0 5m28s
```
If the Pod status looks like the above, Fluid is ready to use.
6\. Uninstall Fluid
```shell script
$ helm del <release-name>
```
`<release-name>` can be found via `helm list | grep fluid`.

BIN
docs/media/logo Normal file

Binary file not shown.


35
docs/scripts/genDoc.sh Executable file

@ -0,0 +1,35 @@
#!/bin/bash
MAINFONT="WenQuanYi Micro Hei"
MONOFONT="WenQuanYi Micro Hei Mono"
# MAINFONT="Tsentsiu Sans HG"
# MONOFONT="Tsentsiu Sans Console HG"
#_version_tag="$(date '+%Y%m%d').$(git rev-parse --short HEAD)"
_version_tag="$(date '+%Y%m%d')"
# default version: `pandoc --latex-engine=xelatex doc.md -s -o output2.pdf`
# used to debug template setting error
lang="en zh"
for d in ${lang}
do
if [ $d = "en" ]; then
docs_title=" Fluid Documentation"
else
docs_title=" Fluid 用户文档"
fi
pandoc -N --toc --smart --latex-engine=xelatex \
--template=templates/template.tex \
--columns=120 \
--listings \
-V title="$docs_title" \
-V author="Fluid" \
-V date="${_version_tag}" \
-V CJKmainfont="${MAINFONT}" \
-V fontsize=12pt \
-V geometry:margin=1in \
"$d/doc.md" -s -o "output_$d.pdf"
done

176
docs/scripts/mergeByTOC.py Normal file

@ -0,0 +1,176 @@
#!/usr/bin/env python3
# coding: utf8
#
# Generate all-in-one Markdown file for ``doc-cn``
# Tip: Chinese file names are not supported
# If an md file (or one of its sub headings) is referenced more than once in readme.md's TOC, only the first occurrence is used
# Each version generates its own PDF
from __future__ import print_function, unicode_literals
import re
import os
import sys
followups = []
in_toc = False
contents = []
lang = sys.argv[1]
# pattern=[]()
hyper_link_pattern = re.compile(r'\[(.*?)\]\((.*?)(#.*?)?\)')
# pattern= -- []()
toc_line_pattern = re.compile(r'([\-\+]+)\s\[(.*?)\]\((.*?)(#.*?)?\)')
# pattern= ! []()
image_link_pattern = re.compile(r'!\[(.*?)\]\((.*?)\)')
# pattern= -+
level_pattern = re.compile(r'(\s*[\-\+]+)\s')
# match all headings
heading_patthern = re.compile(r'(^#+|\n#+)\s')
entry_file = lang + "TOC.md"
# stage 1, parse toc
with open(entry_file) as fp:
level = 0
current_level = ""
for line in fp:
if not in_toc and line.startswith("## "):
in_toc = True
elif in_toc and line.startswith('## '):
in_toc = False
# yes, toc processing done
# contents.append(line[1:]) # skip 1 level TOC
break
## line.strip() avoids adding empty lines
elif in_toc and not line.startswith('#') and line.strip():
## get level from space length
level_space_str = level_pattern.findall(line)[0][:-1]
# pingcap docs: two spaces per level of TOC indentation
level = len(level_space_str) // 2 + 1 ## integer division
matches = toc_line_pattern.findall(line)
if matches:
for match in matches:
fpath = match[2]
if fpath.startswith('http'):
## remove list format character `- `, `+ `
followups.append(('TOC', level, line.strip()[2:]))
elif fpath.endswith('.md'):
# remove first slash from the fpath
key = ('FILE', level, fpath)
if key not in followups:
followups.append(key)
else:
name = line.strip().split(None, 1)[-1]
key = ('TOC', level, name)
if key not in followups:
followups.append(key)
else:
pass
# overview part in README.md
followups.insert(1, ("RAW", 0, fp.read()))
# stage 2, get file heading
file_link_name = {}
title_pattern = re.compile(r'(^#+)\s.*')
for tp, lv, f in followups:
if tp != 'FILE':
continue
try:
for line in open(lang + f).readlines():
if line.startswith("#"):
tag = line.strip()
break
except Exception as e:
print(e)
tag = ""
if tag.startswith('# '):
tag = tag[2:]
elif tag.startswith('## '):
tag = tag[3:]
file_link_name[f] = tag.lower().replace(' ', '-')
def replace_link_wrap(chapter, name):
# Note: only hash matching is supported; headings with the same name in different documents will collide
# Handles links such as ./ddd.md, xxx.md, xxx.md#xxx inside a chapter document
def replace_link(match):
full = match.group(0)
link_name = match.group(1)
link = match.group(2)
frag = match.group(3)
if link.startswith('https'):
return '[{}]({})'.format(link_name,link)
if link.endswith('.md') or '.md#' in link:
if link.startswith('../'):
link=link[3:]
if not frag:
for fpath in file_link_name:
if link == fpath:
frag = '#' + file_link_name[fpath]
return '[%s](%s)' % (link_name, frag)
elif link.endswith('.png') or link.endswith('.jpeg') or link.endswith('.svg') or link.endswith('.gif') or link.endswith('.jpg'):
# special handing for pic
img_link = re.sub(r'[\.\/]*media\/', './media/', link, count=0, flags=0)
return '[%s](%s)' % (link_name, img_link)
else:
return full
return hyper_link_pattern.sub(replace_link, chapter)
def replace_heading_func(diff_level=0):
def replace_heading(match):
if diff_level == 0:
return match.group(0)
else:
return '\n' + '#' * (match.group(0).count('#') + diff_level) + ' '
return replace_heading
def replace_img_link(match):
full = match.group(0)
link_name = match.group(1)
link = match.group(2)
if link.endswith('.png'):
fname = os.path.basename(link)
return '![%s](./media/%s)' % (link_name, fname)
# stage 3, concat files
for type_, level, name in followups:
if type_ == 'TOC':
contents.append("\n{} {}\n".format('#' * level, name))
elif type_ == 'RAW':
contents.append(name)
elif type_ == 'FILE':
ignore = 'api_doc.md'
if ignore in name:
# the API reference is rendered to a separate PDF and merged back in later
contents.append('# api reference\n\n')
continue
try:
with open(lang + name) as fp:
chapter = fp.read()
chapter = replace_link_wrap(chapter, name)
# chapter = image_link_pattern.sub(replace_img_link, chapter)
# fix heading level
off = heading_patthern.findall(chapter)
# shift the chapter's headings so that its top heading matches its TOC level
diff_level = level - off[0].count('#') if off else 0
chapter = heading_patthern.sub(replace_heading_func(diff_level), chapter)
contents.append(chapter)
contents.append('') # add an empty line
except Exception as e:
print(e)
print("generate file error: ignore!")
# stage 4, generate the final doc.md
target_doc_file = lang + 'doc.md'
with open(target_doc_file, 'w') as fp:
fp.write('\n'.join(contents))
contents = []

16
docs/scripts/mergePDF.py Normal file
View File

@ -0,0 +1,16 @@
import PyPDF2
import sys
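# Merge the per-language handbook PDF with the API reference PDF.
# Assumed invocation (mirrors the file names built below):
#   python3 scripts/mergePDF.py zh
# merges output_zh.pdf and api.pdf into docs_zh.pdf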
lang=sys.argv[1]
offset=0
merger=PyPDF2.PdfFileMerger()
target=[]
target.append("output_{}.pdf".format(lang))
target.append("api.pdf")
output="docs_{}.pdf".format(lang)
for pdf in target:
# append each source PDF at the current page offset
merger.merge(offset, pdf)
pn = PyPDF2.PdfFileReader(pdf).getNumPages()
offset += pn
merger.write(output)
merger.close()

293
docs/templates/template.tex vendored Normal file
View File

@ -0,0 +1,293 @@
\documentclass[$if(fontsize)$$fontsize$,$endif$$if(lang)$$lang$,$endif$$if(papersize)$$papersize$,$endif$$for(classoption)$$classoption$$sep$,$endfor$]{$documentclass$}
$if(fontfamily)$
\usepackage{$fontfamily$}
$else$
\usepackage{lmodern}
$endif$
$if(linestretch)$
\usepackage{setspace}
\setstretch{$linestretch$}
$endif$
\usepackage{amssymb,amsmath}
\usepackage{ifxetex,ifluatex}
\usepackage{fixltx2e} % provides \textsubscript
\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
$if(euro)$
\usepackage{eurosym}
$endif$
\else % if luatex or xelatex
\ifxetex
\usepackage{mathspec}
\usepackage{xltxtra,xunicode}
$if(CJKmainfont)$
\usepackage{xeCJK}
\setCJKmainfont{$CJKmainfont$}
$endif$
\else
\usepackage{fontspec}
\fi
\defaultfontfeatures{Mapping=tex-text,Scale=MatchLowercase}
\newcommand{\euro}{}
$if(mainfont)$
\setmainfont{$mainfont$}
$endif$
$if(sansfont)$
\setsansfont{$sansfont$}
$endif$
$if(monofont)$
\setmonofont[Mapping=tex-ansi]{$monofont$}
$endif$
$if(mathfont)$
\setmathfont(Digits,Latin,Greek){$mathfont$}
$endif$
\fi
% use upquote if available, for straight quotes in verbatim environments
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
% use microtype if available
\IfFileExists{microtype.sty}{%
\usepackage{microtype}
\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
}{}
$if(geometry)$
\usepackage[$for(geometry)$$geometry$$sep$,$endfor$]{geometry}
$endif$
$if(natbib)$
\usepackage{natbib}
\bibliographystyle{$if(biblio-style)$$biblio-style$$else$plainnat$endif$}
$endif$
$if(biblatex)$
\usepackage{biblatex}
$if(biblio-files)$
\bibliography{$biblio-files$}
$endif$
$endif$
$if(listings)$
\usepackage{xcolor}
\usepackage{listings}
\lstset{
basicstyle=\ttfamily,
keywordstyle=\color[rgb]{0.13,0.29,0.53}\bfseries,
stringstyle=\color[rgb]{0.31,0.60,0.02},
commentstyle=\color[rgb]{0.56,0.35,0.01}\itshape,
numberstyle=\footnotesize,
frame=single,
showspaces=false, % show spaces everywhere adding particular underscores; it overrides 'showstringspaces'
showstringspaces=false, % underline spaces within strings only
columns=flexible,
breaklines=true,
postbreak=\raisebox{0ex}[0ex][0ex]{\ensuremath{\color{gray}\hookrightarrow\space}}
}
$endif$
$if(lhs)$
\lstnewenvironment{code}{\lstset{columns=flexible,breaklines=true,language=Haskell,basicstyle=\small\ttfamily}}{}
$endif$
$if(highlighting-macros)$
$highlighting-macros$
$endif$
$if(verbatim-in-note)$
\usepackage{fancyvrb}
$endif$
$if(tables)$
\usepackage{longtable,booktabs}
$if(beamer)$
\usepackage{caption}
% Make caption package work with longtable
\makeatletter
\def\fnum@table{\tablename~\thetable}
\makeatother
$else$
% Correct order of tables after \paragraph or \subparagraph
\usepackage{etoolbox}
\makeatletter
\patchcmd\longtable{\par}{\if@noskipsec\mbox{}\fi\par}{}{}
\makeatother
% Allow footnotes in longtable head/foot
\IfFileExists{footnotehyper.sty}{\usepackage{footnotehyper}}{\usepackage{footnote}}
\makesavenoteenv{longtable}
$endif$
$endif$
$if(graphics)$
\usepackage{graphicx}
\makeatletter
\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth\else\Gin@nat@width\fi}
\def\maxheight{\ifdim\Gin@nat@height>\textheight\textheight\else\Gin@nat@height\fi}
\makeatother
% Scale images if necessary, so that they will not overflow the page
% margins by default, and it is still possible to overwrite the defaults
% using explicit options in \includegraphics[width, height, ...]{}
\setkeys{Gin}{width=\maxwidth,height=\maxheight,keepaspectratio}
$endif$
\ifxetex
\usepackage[setpagesize=false, % page size defined by xetex
unicode=false, % unicode breaks when used with xetex
xetex]{hyperref}
\else
\usepackage[unicode=true]{hyperref}
\fi
\hypersetup{breaklinks=true,
bookmarks=true,
pdfauthor={$author-meta$},
pdftitle={$title-meta$},
colorlinks=true,
citecolor=$if(citecolor)$$citecolor$$else$blue$endif$,
urlcolor=$if(urlcolor)$$urlcolor$$else$blue$endif$,
linkcolor=$if(linkcolor)$$linkcolor$$else$magenta$endif$,
pdfborder={0 0 0}}
\urlstyle{same} % don't use monospace font for urls
$if(links-as-notes)$
% Make links footnotes instead of hotlinks:
\renewcommand{\href}[2]{#2\footnote{\url{#1}}}
$endif$
$if(strikeout)$
\usepackage[normalem]{ulem}
% avoid problems with \sout in headers with hyperref:
\pdfstringdefDisableCommands{\renewcommand{\sout}{}}
$endif$
\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 2pt minus 1pt}
\setlength{\emergencystretch}{3em} % prevent overfull lines
\providecommand{\tightlist}{%
\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
$if(numbersections)$
\setcounter{secnumdepth}{5}
$else$
\setcounter{secnumdepth}{0}
$endif$
$if(verbatim-in-note)$
\VerbatimFootnotes % allows verbatim text in footnotes
$endif$
$if(lang)$
\ifxetex
\usepackage{polyglossia}
\setmainlanguage{$mainlang$}
\else
\usepackage[$lang$]{babel}
\fi
$endif$
$if(title)$
\title{$title$$if(subtitle)$\\\vspace{0.5em}{\large $subtitle$}$endif$}
$endif$
$if(author)$
\author{$for(author)$$author$$sep$ \and $endfor$}
$endif$
\date{$date$}
$for(header-includes)$
$header-includes$
$endfor$
% quote style
% http://tex.stackexchange.com/questions/179982/add-a-black-border-to-block-quotations
\usepackage{framed}
% \usepackage{xcolor}
\let\oldquote=\quote
\let\endoldquote=\endquote
\colorlet{shadecolor}{orange!15}
\renewenvironment{quote}{\begin{shaded*}\begin{oldquote}}{\end{oldquote}\end{shaded*}}
% https://www.zhihu.com/question/25082703/answer/30038248
% no cross chapter
\usepackage[section]{placeins}
% no float everywhere
\usepackage{float}
\floatplacement{figure}{H}
% indent the first paragraph after a heading, following Chinese typesetting convention
\usepackage{indentfirst}
\setlength{\parindent}{2em}
\renewcommand{\contentsname}{Table of Contents}
\renewcommand\figurename{Figure}
% fix overlap toc number and title
% https://blog.csdn.net/golden1314521/article/details/39926135
\usepackage{titlesec}
\usepackage{titletoc}
% \titlecontents{section name}[left indent]{title format}{title label}{title without number}{leader and page number}[below skip]
% fix overlap
\titlecontents{subsection}
[4em]
{}%
{\contentslabel{3em}}%
{}%
{\titlerule*[0.5pc]{$$\cdot$$}\contentspage\hspace*{0em}}%
\titlecontents{subsubsection}
[7em]
{}%
{\contentslabel{3.5em}}%
{}%
{\titlerule*[0.5pc]{$$\cdot$$}\contentspage\hspace*{0em}}%
\usepackage[all]{background}
% \backgroundsetup{contents=Fluid.,color=blue,opacity=0.2}
\backgroundsetup{contents=\includegraphics{media/logo},
placement=top,scale=0.2,hshift=1250pt,vshift=-20pt,
opacity=0.8,angle=0}
% avoid level-4, 5 heading to be connected with following content
% https://github.com/jgm/pandoc/issues/1658
\let\oldparagraph\paragraph
\renewcommand{\paragraph}[1]{\oldparagraph{#1}\mbox{}}
\let\oldsubparagraph\subparagraph
\renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}}
\begin{document}
% no bg at title page
\NoBgThispage
$if(title)$
\maketitle
$endif$
$if(abstract)$
\begin{abstract}
$abstract$
\end{abstract}
$endif$
$for(include-before)$
$include-before$
$endfor$
$if(toc)$
{
\hypersetup{linkcolor=black}
\setcounter{tocdepth}{$toc-depth$}
\tableofcontents
}
$endif$
$if(lof)$
\listoffigures
$endif$
\newpage
$body$
$if(natbib)$
$if(biblio-files)$
$if(biblio-title)$
$if(book-class)$
\renewcommand\bibname{$biblio-title$}
$else$
\renewcommand\refname{$biblio-title$}
$endif$
$endif$
\bibliography{$biblio-files$}
$endif$
$endif$
$if(biblatex)$
\printbibliography$if(biblio-title)$[title=$biblio-title$]$endif$
$endif$
$for(include-after)$
$include-after$
$endfor$
\end{document}

22
docs/zh/TOC.md Normal file
View File

@ -0,0 +1,22 @@
# Fluid Documentation
<!-- markdownlint-disable MD007 -->
<!-- markdownlint-disable MD032 -->
## TOC
+ Userguide
- [Overview](userguide/overview.md)
- [Get Started](userguide/get_started.md)
- [Installation](userguide/install.md)
- [Diagnose](userguide/diagnose.md)
+ Samples
- [Accelerate Data Accessing](samples/accelerate_data_accessing.md)
- [Cache Co-locality](samples/data_co_locality.md)
- [Machine Learning](samples/machinelearning.md)
- [Dawnbench](samples/dawnbench.md)
- [Warm up](samples/warmup.md)
+ Developer Guide
- [How to develop](dev/how_to_develop.md)
- [API_Doc](dev/api_doc.md)

2030
docs/zh/dev/api_doc.md Normal file

File diff suppressed because it is too large Load Diff

View File

@ -1,177 +1,184 @@
## Fluid开发文档
### 环境需求
- golang 1.13+
- docker 19.03+
- GNU Make
对于golang的安装与配置请参考[此处](https://golang.org/dl/)
对于docker的安装与配置请参考[此处](https://docs.docker.com/engine/install/)
Fluid需要使用`make`命令进行项目构建,使用以下命令安装`make`
- Linux
- `sudo apt-get install build-essential`
### 获取Fluid源码
不支持Go module
```shell script
mkdir -p $GOPATH/src/github.com/cloudnativefluid/
cd $GOPATH/src/github.com/cloudnativefluid
git clone https://github.com/cheyang/fluid.git
```
支持Go module:
```shell script
cd <any-place-you-like>
git clone https://github.com/cheyang/fluid.git
```
> 有关Go module可以参阅 [golang 官方文档](https://github.com/golang/go/wiki/Modules) 获取更多信息
### 编译
Fluid项目根目录下的`Makefile`文件已经包含了项目开发中的编译、构建、部署等基本逻辑
```shell script
# 构建Controller Manager Binary
make manager
# 构建CSI Binary
make csi
```
构建得到的Binary程序位于`./bin`目录下
>注意如果您正在使用Go Module进行项目开发那么可能需要将Makefile文件中的相关目标的`GO111MODULE=off`修改为`GO111MODULE=on`以使得编译成功
### 镜像构建
```shell script
# 为manager镜像命名
export IMG=<your-registry>/<your-namespace>/<img-name>
# 为CSI插件镜像命名
export CSI_IMG=<your-registry>/<your-namespace>/<csi-img-name>
# 构建manager镜像
make docker-build
# 构建CSI插件镜像
make docker-build-csi
```
在运行Fluid之前需要将构建的镜像推送到可以访问的镜像仓库中
1\. 登录镜像仓库:
```shell script
sudo docker login <docker-registry>
```
2\. 推送镜像:
```shell script
make docker-push
make docker-push-csi
```
### 运行
接下来的内容将假设在本地环境中已经通过`KUBECONFIG`环境变量或是在`~/.kube/config`文件中配置好了可以访问的Kubernetes集群您可以通过`kubectl cluster-info`对该配置进行快速检查。更多有关`kubeconfig`的信息可以参考
[kubernetes官方文档](https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/)
> 以下内容将使用`kustomize``kubectl 1.14+`已经内置了`kustomize`工具,正在使用`kubectl 1.14`版本以下的开发者请参考 [此处](https://kustomize.io/) 获取有关kustomize的更多信息
0\. 将构建的镜像上传到Kubernetes集群可以访问的镜像仓库
> 如果构建并上传的镜像在私有仓库中请确保在kubernetes集群的各个结点上已经成功执行了`sudo docker login <docker-registry>`操作
1\. 修改`config/fluid/patches`中各文件的镜像名
```yaml
# config/fluid/patches/image_in_manager.yaml
...
...
containers:
- name: manager
image: <your-registry>/<your-namespace>/<img-name>:<img-tag>
```
```yaml
# config/fluid/patches/image_in_csi-plugin.yaml
...
...
containers:
- name: plugins
image: <your-registry>/<your-namespace>/<csi-img-name>:<img-tag>
```
2\. 创建CRD
```shell script
kubectl apply -k config/crd
```
3\. 创建Fluid各组件
```shell script
kubectl apply -k config/fluid
```
4\.编写样例或使用提供的样例
```shell script
kubectl apply -k config/samples
```
5\.查看各组件的运行情况,确保各组件和样例资源正常运行
```shell script
$ kubectl get pod -n fluid-system
NAME READY STATUS RESTARTS AGE
controller-manager-7fd6457ccf-p7j2x 1/1 Running 0 84s
csi-nodeplugin-fluid-pj9tv 2/2 Running 0 84s
csi-nodeplugin-fluid-t8ctj 2/2 Running 0 84s
```
```shell script
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
cifar10-fuse-vb6l4 1/1 Running 0 6m15s
cifar10-fuse-vtqpx 1/1 Running 0 6m15s
cifar10-master-0 2/2 Running 0 8m24s
cifar10-worker-729xz 2/2 Running 0 6m15s
cifar10-worker-d6kmd 2/2 Running 0 6m15s
nginx-0 1/1 Running 0 8m30s
nginx-1 1/1 Running 0 8m30s
```
> 注意: 上述命令可能随您组件的不同实现或是不同的样例产生不同的结果
6\.通过日志等方法查看您的组件是否运作正常(e.g. `kubectl logs -n fluid-system controller-manager`)
7\.环境清理
```shell script
kubectl delete -k config/samples
kubectl delete -k config/fluid
kubectl delete -k config/crd
```
### 调试
**前提条件**
确保环境中已经安装了go-delve具体安装过程可以参考[go-delve安装手册](https://github.com/go-delve/delve/tree/master/Documentation/installation)
**本地调试**
```shell script
# 让go-delve完成编译工作
dlv debug cmd/controller/main.go
# 先编译后调试
make manager
dlv exec bin/manager
```
**远程调试**
在开发Fluid时通常情况下更为常用的方式是远程调试确保本机和远程主机均已正确安装了go-delve
在远程主机上:
```shell script
dlv debug --headless --listen ":12345" --log --api-version=2 cmd/controller/main.go
```
这将使得远程主机的调试程序监听指定的端口(e.g. 12345)
在本机上:
```shell script
dlv connect "<remote-addr>:12345" --api-version=2
```
# Fluid开发文档
## 环境需求
- git
- golang (version >= 1.13)
- docker (version >= 19.03)
- Kubernetes (version >= 1.14)
- GNU Make
对于golang的安装与配置请参考[此处](https://golang.org/dl/)。
对于docker的安装与配置请参考[此处](https://docs.docker.com/engine/install/)。
Fluid需要使用`make`命令进行项目构建,使用以下命令安装`make`
- Linux
- `sudo apt-get install build-essential`
## 编译、运行和调试
### 获取Fluid源码
```shell
$ mkdir -p $GOPATH/src/github.com/cloudnativefluid/
$ cd $GOPATH/src/github.com/cloudnativefluid
$ git clone https://github.com/fluid-cloudnative/fluid.git
```
> **注意**本文在非Go Module模式下完成Fluid的编译、运行和调试。
>
> 有关Go module可以参阅 [golang 官方文档](https://github.com/golang/go/wiki/Modules) 获取更多信息。
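如果你的本地 Go 环境默认启用了 Go Module可以参考下面的命令在当前终端中临时关闭仅为示例
```shell
# 临时关闭 Go Module以兼容上述基于 GOPATH 的源码获取方式
export GO111MODULE=off
```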
### 编译
Fluid项目根目录下的`Makefile`文件已经包含了项目开发中的编译、构建、部署等基本逻辑
```shell
# 构建Controller Manager Binary
$ make manager
# 构建CSI Binary
$ make csi
```
构建得到的Binary程序位于`./bin`目录下。
### 镜像构建
1. 设置镜像名称并构建镜像
```shell
# 为manager镜像命名
$ export IMG=<your-registry>/<your-namespace>/<img-name>
# 为CSI插件镜像命名
$ export CSI_IMG=<your-registry>/<your-namespace>/<csi-img-name>
# 构建manager镜像
$ make docker-build
# 构建CSI插件镜像
$ make docker-build-csi
```
在运行Fluid之前需要将构建的镜像推送到可以访问的镜像仓库中
2. 登录镜像仓库:
```shell
$ sudo docker login <docker-registry>
```
3. 推送镜像:
```shell
$ make docker-push
$ make docker-push-csi
```
### 运行
接下来的内容将假设在本地环境中已经通过`KUBECONFIG`环境变量或是在`~/.kube/config`文件中配置好了可以访问的Kubernetes集群您可以通过`kubectl cluster-info`对该配置进行快速检查。更多有关`kubeconfig`的信息可以参考
[kubernetes官方文档](https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/)
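例如,可以通过下面的命令快速确认当前配置指向的集群能够正常访问(仅为示例):
```shell
# 确认 kubectl 能够访问目标集群
kubectl cluster-info
kubectl get nodes
```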
> 以下内容将使用`kustomize``kubectl 1.14+`已经内置了`kustomize`工具,正在使用`kubectl 1.14`版本以下的开发者请参考 [此处](https://kustomize.io/) 获取有关kustomize的更多信息
1. 将构建的镜像上传到Kubernetes集群可以访问的镜像仓库
> 如果构建并上传的镜像在私有仓库中请确保在kubernetes集群的各个结点上已经成功执行了`sudo docker login <docker-registry>`操作
2. 修改`config/fluid/patches`中各文件的镜像名
```yaml
# config/fluid/patches/image_in_manager.yaml
...
...
containers:
- name: manager
image: <your-registry>/<your-namespace>/<img-name>:<img-tag>
```
3. 创建CRD
```shell
$ kubectl apply -k config/crd
```
检查CRD
```shell
$ kubectl get crd | grep fluid
alluxiodataloads.data.fluid.io 2020-08-22T03:53:46Z
alluxioruntimes.data.fluid.io 2020-08-22T03:53:46Z
datasets.data.fluid.io 2020-08-22T03:53:46Z
```
4. 创建Fluid各组件
```shell
$ kubectl apply -k config/fluid
```
检查Fluid组件
```shell
$ kubectl get pod -n fluid-system
NAME READY STATUS RESTARTS AGE
controller-manager-7fd6457ccf-p7j2x 1/1 Running 0 84s
csi-nodeplugin-fluid-pj9tv 2/2 Running 0 84s
csi-nodeplugin-fluid-t8ctj 2/2 Running 0 84s
```
5. 编写样例或使用提供的样例
```shell
$ kubectl apply -k config/samples
```
检查样例pod
```shell
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
cifar10-fuse-vb6l4 1/1 Running 0 6m15s
cifar10-fuse-vtqpx 1/1 Running 0 6m15s
cifar10-master-0 2/2 Running 0 8m24s
cifar10-worker-729xz 2/2 Running 0 6m15s
cifar10-worker-d6kmd 2/2 Running 0 6m15s
nginx-0 1/1 Running 0 8m30s
nginx-1 1/1 Running 0 8m30s
```
> 注意: 上述命令可能随您组件的不同实现或是不同的样例产生不同的结果。
6. 通过日志等方法查看您的组件是否运作正常(e.g. `kubectl logs -n fluid-system controller-manager`)
7. 环境清理
```shell
$ kubectl delete -k config/samples
$ kubectl delete -k config/fluid
$ kubectl delete -k config/crd
```
### 调试
**前提条件**
确保环境中已经安装了go-delve具体安装过程可以参考[go-delve安装手册](https://github.com/go-delve/delve/tree/master/Documentation/installation)
**本地调试**
```shell
# 让go-delve完成编译工作
$ dlv debug cmd/controller/main.go
# 先编译后调试
$ make manager
$ dlv exec bin/manager
```
**远程调试**
在开发Fluid时通常情况下更为常用的方式是远程调试确保本机和远程主机均已正确安装了go-delve
在远程主机上:
```shell
$ dlv debug --headless --listen ":12345" --log --api-version=2 cmd/controller/main.go
```
这将使得远程主机的调试程序监听指定的端口(e.g. 12345)
在本机上:
```shell
$ dlv connect "<remote-addr>:12345" --api-version=2
```
> 注意:要进行远程调试,请确保远程主机指定的端口未被占用并且已经对远程主机的防火墙进行了适当的配置

View File

@ -0,0 +1,367 @@
# 示例 - 远程文件访问加速
通过[Alluxio](https://www.alluxio.io)和[Fuse](https://github.com/libfuse/libfuse)Fluid为用户提供了一种更为简单的文件访问接口使得任意运行在Kubernetes集群上的程序能够像访问本地文件一样轻松访问存储在远程文件系统中的文件。更为重要的是Fluid借助Alluxio提供了强大的文件缓存能力这意味着用户在访问远程文件时尤其是那些具有较高访问频率的远程文件时用户可以享受到大幅度的文件访问速度的提升。
本文档通过一个简单的例子演示了上述功能特性
## 前提条件
在运行该示例之前,请参考[安装文档](../userguide/install.md)完成安装并检查Fluid各组件正常运行
```shell
$ kubectl get pod -n fluid-system
NAME READY STATUS RESTARTS AGE
controller-manager-7fd6457ccf-jnkvn 1/1 Running 0 60s
csi-nodeplugin-fluid-6rhpt 2/2 Running 0 60s
csi-nodeplugin-fluid-6zwgl 2/2 Running 0 60s
```
通常来说你会看到一个名为“controller-manager”的Pod和多个名为“csi-nodeplugin”的Pod正在运行。其中“csi-nodeplugin”这些Pod的数量取决于你的Kubernetes集群中结点的数量。
## 新建工作环境
```shell
$ mkdir <any-path>/accelerate
$ cd <any-path>/accelerate
```
## 运行示例
**查看待创建的Dataset资源对象**
```shell
$ cat<<EOF >dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
name: hbase
spec:
mounts:
- mountPoint: https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/2.2.5/
name: hbase
EOF
```
在这里我们将要创建一个kind为`Dataset`的资源对象(Resource object)。`Dataset`是Fluid所定义的一个Custom Resource Definition(CRD)该CRD被用来告知Fluid在哪里可以找到你所需要的数据。Fluid将该CRD对象中定义的`mountPoint`属性挂载到Alluxio之上因此该属性可以是任何合法的能够被Alluxio识别的UFS地址。在本示例中为了简单我们使用[WebUFS](https://docs.alluxio.io/os/user/stable/en/ufs/WEB.html)进行演示。
更多有关UFS的信息请参考[Alluxio文档-底层存储系统](https://docs.alluxio.io/os/user/stable/cn/ufs/OSS.html)部分。
> 本示例将以Apache镜像站点上的HBase v2.2.5相关资源作为演示中使用的远程文件。这个选择并没有任何特殊之处你可以将这个远程文件修改为任意你喜欢的远程文件。但是如果你想要和我们一样使用WebUFS进行操作的话最好还是选择一个Apache镜像源站点( e.g. [清华镜像源](https://mirrors.tuna.tsinghua.edu.cn/apache) )因为基于目前WebUFS的实现如果你选择其他更加复杂的网页作为WebUFS的挂载点你可能需要进行更多更复杂的配置。
**创建Dataset资源对象**
```shell
$ kubectl create -f dataset.yaml
dataset.data.fluid.io/hbase created
```
**查看Dataset资源对象状态**
```shell
$ kubectl get dataset hbase -o yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
...
status:
conditions: []
phase: NotBound
```
如上所示,`status`中的`phase`属性值为`NotBound`,这意味着该`Dataset`资源对象目前还未与任何`AlluxioRuntime`资源对象绑定,接下来,我们将创建一个`AlluxioRuntime`资源对象。
**查看待创建的AlluxioRuntime资源对象**
```shell
$ cat<<EOF >runtime.yaml
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
name: hbase
spec:
replicas: 2
tieredstore:
levels:
- mediumtype: MEM
path: /dev/shm
quota: 2Gi
high: "0.95"
low: "0.7"
storageType: Memory
properties:
alluxio.user.file.writetype.default: MUST_CACHE
alluxio.master.journal.folder: /journal
alluxio.master.journal.type: UFS
alluxio.user.block.size.bytes.default: 256MB
alluxio.user.streaming.reader.chunk.size.bytes: 256MB
alluxio.user.local.reader.chunk.size.bytes: 256MB
alluxio.worker.network.reader.buffer.size: 256MB
alluxio.user.streaming.data.timeout: 300sec
master:
jvmOptions:
- "-Xmx4G"
worker:
jvmOptions:
- "-Xmx4G"
fuse:
jvmOptions:
- "-Xmx4G "
- "-Xms4G "
# For now, only support local
shortCircuitPolicy: local
args:
- fuse
- --fuse-opts=direct_io,ro,max_read=131072,attr_timeout=7200,entry_timeout=7200,nonempty
EOF
```
**创建AlluxioRuntime资源对象**
```shell
$ kubectl create -f runtime.yaml
alluxioruntime.data.fluid.io/hbase created
```
`AlluxioRuntime`是另一个Fluid定义的CRD。一个`AlluxioRuntime`资源对象描述了在Kubernetes集群中运行一个Alluxio实例所需要的配置信息。
等待一段时间让AlluxioRuntime资源对象中的各个组件得以顺利启动你会看到类似以下状态
```shell
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
hbase-fuse-hvxgh 1/1 Running 0 27s
hbase-fuse-sjhxk 1/1 Running 0 27s
hbase-master-0 2/2 Running 0 62s
hbase-worker-92cln 2/2 Running 0 27s
hbase-worker-rlb5w 2/2 Running 0 27s
```
**再次查看Dataset资源对象状态**
```shell
$ kubectl get dataset hbase -o yaml
...
...
status:
cacheStates:
cacheCapacity: 4GiB
cached: 0B
cachedPercentage: 0%
conditions:
- lastTransitionTime: "2020-07-29T08:23:44Z"
lastUpdateTime: "2020-07-29T08:26:29Z"
message: The ddc runtime is ready.
reason: DatasetReady
status: "True"
type: Ready
phase: Bound
runtimes:
- category: Accelerate
name: hbase
namespace: default
type: alluxio
ufsTotal: 443.5MiB
```
因为已经与一个成功启动的AlluxioRuntime绑定该Dataset资源对象的`Status`得到了更新,此时`phase`属性值已经变为`Bound`状态。从上述状态中可以获知有关资源对象的基本信息
**查看AlluxioRuntime状态**
```shell
$ kubectl get alluxioruntime hbase -o yaml
...
...
status:
cacheStates:
cacheCapacity: 4GiB
cached: 0B
cachedPercentage: 0%
conditions:
- lastProbeTime: "2020-07-29T08:23:05Z"
lastTransitionTime: "2020-07-29T08:23:05Z"
message: The master is initialized.
reason: Master is initialized
status: "True"
type: MasterInitialized
- lastProbeTime: "2020-07-29T08:23:40Z"
lastTransitionTime: "2020-07-29T08:23:05Z"
message: The master is ready.
reason: Master is ready
status: "True"
type: MasterReady
- lastProbeTime: "2020-07-29T08:23:20Z"
lastTransitionTime: "2020-07-29T08:23:20Z"
message: The workers are initialized.
reason: Workers are initialized
status: "True"
type: WorkersInitialized
- lastProbeTime: "2020-07-29T08:23:20Z"
lastTransitionTime: "2020-07-29T08:23:20Z"
message: The fuses are initialized.
reason: Fuses are initialized
status: "True"
type: FusesInitialized
- lastProbeTime: "2020-07-29T08:23:40Z"
lastTransitionTime: "2020-07-29T08:23:40Z"
message: The workers are partially ready.
reason: Workers are ready
status: "True"
type: WorkersReady
- lastProbeTime: "2020-07-29T08:23:40Z"
lastTransitionTime: "2020-07-29T08:23:40Z"
message: The fuses are ready.
reason: Fuses are ready
status: "True"
type: FusesReady
currentFuseNumberScheduled: 2
currentMasterNumberScheduled: 1
currentWorkerNumberScheduled: 2
desiredFuseNumberScheduled: 2
desiredMasterNumberScheduled: 1
desiredWorkerNumberScheduled: 2
fuseNumberAvailable: 2
fuseNumberReady: 2
fusePhase: Ready
masterNumberReady: 1
masterPhase: Ready
valueFile: hbase-alluxio-values
workerNumberAvailable: 2
workerNumberReady: 2
workerPhase: Ready
```
`AlluxioRuntime`资源对象的`status`中包含了更多更详细的信息
**查看与远程文件关联的PersistentVolume以及PersistentVolumeClaim**
```shell
$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
hbase 100Gi RWX Retain Bound default/hbase 18m
```
```shell
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
hbase Bound hbase 100Gi RWX 18m
```
`Dataset`资源对象准备完成后即与Alluxio实例绑定后与该资源对象关联的PV, PVC已经由Fluid生成应用可以通过该PVC完成远程文件在Pod中的挂载并通过挂载目录实现远程文件访问
## 远程文件访问
**查看待创建的应用**
```shell
$ cat<<EOF >nginx.yaml
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
containers:
- name: nginx
image: nginx
volumeMounts:
- mountPath: /data
name: hbase-vol
volumes:
- name: hbase-vol
persistentVolumeClaim:
claimName: hbase
EOF
```
**启动应用进行远程文件访问**
```shell
$ kubectl create -f nginx.yaml
```
登录Nginx Pod:
```shell
$ kubectl exec -it nginx -- bash
```
查看远程文件挂载情况:
```shell
$ ls -1 /data/hbase
CHANGES.md
RELEASENOTES.md
api_compare_2.2.5RC0_to_2.2.4.html
hbase-2.2.5-bin.tar.gz
hbase-2.2.5-client-bin.tar.gz
hbase-2.2.5-src.tar.gz
```
```shell
$ du -h /data/hbase/*
174K /data/hbase/CHANGES.md
106K /data/hbase/RELEASENOTES.md
115K /data/hbase/api_compare_2.2.5RC0_to_2.2.4.html
211M /data/hbase/hbase-2.2.5-bin.tar.gz
200M /data/hbase/hbase-2.2.5-client-bin.tar.gz
34M /data/hbase/hbase-2.2.5-src.tar.gz
```
登出Nginx Pod:
```shell
$ exit
```
正如你所见WebUFS上所存储的全部文件(也就是hbase v2.2.5的相关文件)可以以和本地文件完全没有区别的方式存在于某个Pod中并且可以被该Pod十分方便地访问
## 远程文件访问加速
为了演示在访问远程文件时,你能获得多大的加速效果,我们提供了一个测试作业的样例:
**查看待创建的测试作业**
```shell
$ cat<<EOF >app.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: fluid-copy-test
spec:
template:
spec:
restartPolicy: OnFailure
containers:
- name: busybox
image: busybox
command: ["/bin/sh"]
args: ["-c", "set -x; time cp -r /data/hbase ./"]
volumeMounts:
- mountPath: /data
name: hbase-vol
volumes:
- name: hbase-vol
persistentVolumeClaim:
claimName: hbase
EOF
```
**启动测试作业**
```shell
$ kubectl create -f app.yaml
job.batch/fluid-test created
```
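测试作业创建的 Pod 名称带有随机后缀,可以通过 Kubernetes 为 Job 自动添加的 `job-name` 标签找到它(示例命令),随后即可用 `kubectl logs` 查看其输出:
```shell
# 列出该 Job 创建的 Pod
kubectl get pod -l job-name=fluid-copy-test
```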
该测试程序会执行`time cp -r /data/hbase ./`的shell命令其中`/data/hbase`是远程文件在Pod中挂载的位置该命令完成后会在终端显示命令执行的时长
```shell
kubectl logs fluid-copy-test-h59w9
+ time cp -r /data/hbase ./
real 1m 2.74s
user 0m 0.00s
sys 0m 1.35s
```
可见第一次远程文件的读取耗费了接近63s的时间。当然你可能会觉得这并没有你预期的那么快但是
**再次启动测试作业**
```shell
$ kubectl delete -f app.yaml
$ kubectl create -f app.yaml
```
由于远程文件已经被缓存,此次测试作业能够迅速完成:
```shell
$ kubectl logs fluid-copy-test-d9h2x
+ time cp -r /data/hbase ./
real 0m 2.94s
user 0m 0.00s
sys 0m 1.27s
```
同样的文件访问操作仅耗费了3s
这种大幅度的加速效果归因于Alluxio所提供的强大的缓存能力这种缓存能力意味着只要你访问某个远程文件一次该文件就会被缓存在Alluxio中你的所有接下来的重复访问都不再需要进行远程文件读取而是从Alluxio中直接获取数据因此对于数据的访问加速也就不难解释了。
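此时如果再次查看 Dataset 的状态,应该能观察到 `cached` 和 `cachedPercentage` 字段相应增长(示例命令):
```shell
# 查看数据集当前的缓存容量、已缓存数据量与缓存比例
kubectl get dataset hbase -o yaml | grep -A 3 cacheStates
```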
> 注意: 上述文件的访问速度与示例运行环境的网络条件有关,如果文件访问速度过慢,请更换更小的远程文件尝试
## 环境清理
```shell
$ kubectl delete -f .
```

View File

@ -1,9 +1,11 @@
# 示例 - 数据缓存亲和性调度
Fluid提供了针对数据缓存的调度机制这意味着用户能够像管理Pod一样管理数据缓存在Kubernetes集群中的存放位置这些存放位置同样也会间接地影响相关应用的调度策略。本文档通过一个简单的示例来演示上述功能特性该示例将会尝试将远程文件的数据缓存分布在指定的集群结点之上并启动应用使用这些数据缓存
在Fluid中`Dataset`资源对象中所定义的远程文件是可被调度的这意味着你能够像管理你的Pod一样管理远程文件缓存在Kubernetes集群上的存放位置。另外Fluid同样支持对于应用的数据缓存亲和性调度这种调度方式将应用(e.g. 数据分析任务、机器学习任务等)与所需要的数据缓存放置在一起,以尽可能地减少额外的开销。
本文档将向你简单地展示上述特性
## 前提条件
在运行该示例之前,请参考[安装文档](../installation_cn/README.md)完成安装并检查Fluid各组件正常运行
```shell script
在运行该示例之前,请参考[安装文档](../userguide/install.md)完成安装并检查Fluid各组件正常运行
```shell
$ kubectl get pod -n fluid-system
NAME READY STATUS RESTARTS AGE
controller-manager-7fd6457ccf-jnkvn 1/1 Running 0 60s
@ -11,9 +13,17 @@ csi-nodeplugin-fluid-6rhpt 2/2 Running 0 60s
csi-nodeplugin-fluid-6zwgl 2/2 Running 0 60s
```
通常来说你会看到一个名为“controller-manager”的Pod和多个名为“csi-nodeplugin”的Pod正在运行。其中“csi-nodeplugin”这些Pod的数量取决于你的Kubernetes集群中结点的数量。
## 新建工作环境
```shell
$ mkdir <any-path>/co-locality
$ cd <any-path>/co-locality
```
## 运行示例
**查看全部结点**
```shell script
```shell
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
cn-beijing.192.168.1.146 Ready <none> 7d14h v1.16.9-aliyun.1
@ -21,22 +31,23 @@ cn-beijing.192.168.1.147 Ready <none> 7d14h v1.16.9-aliyun.1
```
**使用标签标识结点**
```shell script
```shell
$ kubectl label nodes cn-beijing.192.168.1.146 hbase-cache=true
```
在接下来的步骤中,我们将使用`NodeSelector`来管理集群中存放数据的位置,所以在这里标记期望的结点
**再次查看结点**
```shell script
```shell
$ kubectl get node -L hbase-cache
NAME STATUS ROLES AGE VERSION HBASE-CACHE
cn-beijing.192.168.1.146 Ready <none> 7d14h v1.16.9-aliyun.1 true
cn-beijing.192.168.1.147 Ready <none> 7d14h v1.16.9-aliyun.1
```
目前在全部2个结点中仅有一个结点添加了`hbase-cache=true`的标签,接下来将使用该标签作为依据进行数据缓存的调度
目前在全部2个结点中仅有一个结点添加了`hbase-cache=true`的标签,接下来,我们希望数据缓存仅会被放置在该结点之上
**检查待创建的Dataset资源对象**
```shell script
$ cat samples/co-locality/dataset.yaml
```shell
$ cat<<EOF >dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
@ -53,41 +64,68 @@ spec:
operator: In
values:
- "true"
EOF
```
在该Dataset资源对象的`spec.nodeAffinity`属性中定义了亲和性调度的相关配置,该配置要求将数据缓存放置在具有`hbase-cache=true`标签的结点之上
在该`Dataset`资源对象的`spec`属性中,我们定义了一个`nodeSelectorTerm`的子属性,该子属性要求数据缓存必须被放置在具有`hbase-cache=true`标签的结点之上
**创建Dataset资源对象**
```shell script
$ kubectl create -f samples/co-locality/dataset.yaml
```shell
$ kubectl create -f dataset.yaml
dataset.data.fluid.io/hbase created
```
**检查待创建的AlluxioRuntime资源对象**
```shell script
cat samples/co-locality/runtime.yaml
```shell
$ cat<<EOF >runtime.yaml
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
name: hbase
spec:
...
replicas: 2
tieredstore:
levels:
- mediumtype: MEM
path: /dev/shm
quota: 2Gi
high: "0.95"
low: "0.7"
storageType: Memory
properties:
alluxio.user.file.writetype.default: MUST_CACHE
alluxio.master.journal.folder: /journal
alluxio.master.journal.type: UFS
alluxio.user.block.size.bytes.default: 256MB
alluxio.user.streaming.reader.chunk.size.bytes: 256MB
alluxio.user.local.reader.chunk.size.bytes: 256MB
alluxio.worker.network.reader.buffer.size: 256MB
alluxio.user.streaming.data.timeout: 300sec
master:
replicas: 1
...
jvmOptions:
- "-Xmx4G"
worker:
...
jvmOptions:
- "-Xmx4G"
fuse:
image: alluxio/alluxio-fuse
imageTag: "2.3.0-SNAPSHOT"
imagePullPolicy: Always
...
jvmOptions:
- "-Xmx4G "
- "-Xms4G "
- "-XX:+UseG1GC "
- "-XX:MaxDirectMemorySize=4g "
- "-XX:+UnlockExperimentalVMOptions "
- "-XX:ActiveProcessorCount=8 "
# For now, only support local
shortCircuitPolicy: local
args:
- fuse
- --fuse-opts=direct_io,ro,max_read=131072,attr_timeout=7200,entry_timeout=7200,nonempty
EOF
```
该配置文件表明希望创建一个AlluxioRuntime资源其中包含1个Alluxio Master和2个Alluxio Worker并且对于任意一个Alluxio Worker均会启动一个Alluxio Fuse组件与其协同工作
该配置文件片段中包含了许多与Alluxio相关的配置信息这些信息将被Fluid用来启动一个Alluxio实例。通过创建这么一个`AlluxioRuntime`资源对象Fluid将会启动一个包含1个Alluxio Master和2个Alluxio Worker的Alluxio实例
**创建AlluxioRuntime资源并查看状态**
```shell script
$ kubectl create -f samples/co-locality/runtime.yaml
```shell
$ kubectl create -f runtime.yaml
alluxioruntime.data.fluid.io/hbase created
$ kubectl get pod -o wide
@ -96,10 +134,10 @@ hbase-fuse-42csf 1/1 Running 0 104s 192.168.1.146 cn-beij
hbase-master-0 2/2 Running 0 3m3s 192.168.1.147 cn-beijing.192.168.1.147 <none> <none>
hbase-worker-l62m4 2/2 Running 0 104s 192.168.1.146 cn-beijing.192.168.1.146 <none> <none>
```
仅有一组Alluxio Worker/Alluxio Fuse成功启动并且均运行在具有指定标签的结点即`cn-beijing.192.168.1.146`之上。
在此处可以看到尽管我们期望看见两个AlluxioWorker被启动但仅有一组Alluxio Worker成功启动并且运行在具有指定标签即`hbase-cache=true`)的结点之上。
**检查AlluxioRuntime状态**
```shell script
```shell
$ kubectl get alluxioruntime hbase -o yaml
...
status:
@ -125,16 +163,31 @@ status:
workerNumberReady: 1
workerPhase: PartialReady
```
与预想一致,无论是Alluxio Worker还是Alluxio Fuse其状态均为PartialReady这是另一个结点无法满足Dataset资源对象的亲和性要求所致
与预想一致,`workerPhase`状态此时为`PartialReady`,并且`currentWorkerNumberScheduled: 1`小于`desiredWorkerNumberScheduled: 2`
**查看待创建的应用**
```shell script
$ cat samples/co-locality/app.yaml
...
我们提供了一个样例应用来演示Fluid是如何进行数据缓存亲和性调度的首先查看该应用
```shell
$ cat<<EOF >app.yaml
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
name: nginx
labels:
app: nginx
spec:
...
replicas: 2
serviceName: "nginx"
podManagementPolicy: "Parallel"
selector: # define how the deployment finds the pods it manages
matchLabels:
app: nginx
template: # define the pods specifications
...
metadata:
labels:
app: nginx
spec:
affinity:
# prevent two Nginx Pod from being scheduled at the same Node
@ -148,28 +201,38 @@ spec:
values:
- nginx
topologyKey: "kubernetes.io/hostname"
...
containers:
- name: nginx
image: nginx
volumeMounts:
- mountPath: /data
name: hbase-vol
volumes:
- name: hbase-vol
persistentVolumeClaim:
claimName: hbase
EOF
```
该应用定义了`PodAntiAffinity`的相关配置这些配置将确保属于相同应用的多个Pod不会被调度到同一结点通过这样的配置能够更加清楚地演示数据缓存的调度对使用该数据缓存的应用的影响
其中的`podAntiAffinity`可能会让人有一点疑惑,关于这个属性的解释如下:`podAntiAffinity`属性将会确保属于相同应用的多个Pod被分散到多个不同的结点这样的配置能够让我们更加清晰的观察到Fluid的数据缓存亲和性调度是怎么进行的。所以简单来说这只是一个专用于演示的属性你不必太过在意它
**运行应用**
```shell script
$ kubectl create -f samples/co-locality/app.yaml
```shell
$ kubectl create -f app.yaml
statefulset.apps/nginx created
```
**查看应用运行状态**
```shell script
kubectl get pod -o wide -l app=nginx
```shell
$ kubectl get pod -o wide -l app=nginx
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-0 1/1 Running 0 2m5s 192.168.1.146 cn-beijing.192.168.1.146 <none> <none>
nginx-1 0/1 Pending 0 2m5s <none> <none> <none> <none>
```
仅有一个Nginx Pod成功启动并且运行在具有指定标签的结点
仅有一个Nginx Pod成功启动并且运行在满足`nodeSelectorTerm`的结点之
**查看应用启动失败原因**
```shell script
```shell
$ kubectl describe pod nginx-1
...
Events:
@ -178,14 +241,14 @@ Events:
Warning FailedScheduling <unknown> default-scheduler 0/2 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had volume node affinity conflict.
Warning FailedScheduling <unknown> default-scheduler 0/2 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had volume node affinity conflict.
```
一方面,由于`samples/co-locality/app.yaml`中对于`PodAntiAffinity`的配置使得两个Nginx Pod无法被调度到同一节点。**另一方面由于目前满足Dataset资源对象亲和性要求的结点仅有一个因此仅有一个Nginx Pod被成功调度**
如上所示,一方面,为了满足`PodAntiAffinity`属性的要求使得两个Nginx Pod无法被调度到同一节点。另一方面由于目前满足Dataset资源对象亲和性要求的结点仅有一个因此仅有一个Nginx Pod被成功调度
**为结点添加标签**
```shell script
kubectl label node cn-beijing.192.168.1.147 hbase-cache=true
**为另一个结点添加标签**
```shell
$ kubectl label node cn-beijing.192.168.1.147 hbase-cache=true
```
现在两个结点都具有相同的标签了,此时重新检查各个组件的运行状态
```shell script
现在全部两个结点都具有相同的标签了,此时重新检查各个组件的运行状态
```shell
$ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
hbase-fuse-42csf 1/1 Running 0 44m 192.168.1.146 cn-beijing.192.168.1.146 <none> <none>
@ -194,22 +257,24 @@ hbase-master-0 2/2 Running 0 46m 192.168.1.147 cn-beiji
hbase-worker-l62m4 2/2 Running 0 44m 192.168.1.146 cn-beijing.192.168.1.146 <none> <none>
hbase-worker-rvncl 2/2 Running 0 10m 192.168.1.147 cn-beijing.192.168.1.147 <none> <none>
```
两个Alluxio Worker和Alluxio Fuse都成功启动,并且分别运行在两个结点上
两个Alluxio Worker都成功启动并且分别运行在两个结点上
```shell script
```shell
$ kubectl get pod -l app=nginx -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-0 1/1 Running 0 21m 192.168.1.146 cn-beijing.192.168.1.146 <none> <none>
nginx-1 1/1 Running 0 21m 192.168.1.147 cn-beijing.192.168.1.147 <none> <none>
```
两个Nginx Pod均成功启动并且分别运行在两个结点上
另一个nginx Pod不再处于`Pending`状态,已经成功启动并运行在另一个结点上
可见可调度的数据缓存以及对应用的数据缓存亲和性调度都是被Fluid所支持的特性。在绝大多数情况下这两个特性协同工作为用户提供了一种更灵活更便捷的方式在Kubernetes集群中管理数据。
可见Fluid支持数据缓存的调度策略这些调度策略为用户提供了更加灵活的数据缓存管理能力
## 环境清理
```shell script
kubectl delete -f samples/co-locality
```shell
$ kubectl delete -f .
kubectl label node cn-beijing.192.168.1.146 hbase-cache-
kubectl label node cn-beijing.192.168.1.147 hbase-cache-
$ kubectl label node cn-beijing.192.168.1.146 hbase-cache-
$ kubectl label node cn-beijing.192.168.1.147 hbase-cache-
```

View File

@ -15,7 +15,7 @@ arena是一个方便数据科学家运行和监视机器学习任务的CLI
### 部署fluid
请参照[fluid部署教程](../installation_cn/README.md)在kubernetes集群上安装fluid。
请参照[fluid部署教程](../userguide/install.md)在kubernetes集群上安装fluid。
### 创建dataset

View File

@ -0,0 +1,296 @@
# 用Fluid加速机器学习训练
本文介绍如何使用Fluid部署[阿里云OSS](https://cn.aliyun.com/product/oss)云端[ImageNet](http://www.image-net.org/)数据集到kubernetes集群并使用[arena](https://github.com/kubeflow/arena)在此数据集上训练ResNet-50模型。本文以四机八卡测试环境为例。
## 前提条件
- [Fluid](https://github.com/fluid-cloudnative/fluid) (version >= 0.1.0)
- [arena](https://github.com/kubeflow/arena)version >= 0.4.0
> **注意**
>
> 1. 本文要求在Kubernetes集群中已安装好Fluid如果您还没部署Fluid请参考[Fluid安装手册](../userguide/install.md)在您的Kubernetes集群上安装Fluid。
>
> 2. `arena`是一个方便数据科学家运行和监视机器学习任务的CLI, 本文使用`arena`提交机器学习任务,安装教程可参考[arena安装教程](https://github.com/kubeflow/arena/blob/master/docs/installation/INSTALL_FROM_BINARY.md)。
## 用Fluid部署云端数据集
### 创建Dataset和Runtime
如下的`dataset.yaml`文件中定义了一个`Dataset`和一个`Runtime`,并用`---`符号将它们的定义分隔。
数据集存储在[阿里云OSS](https://cn.aliyun.com/product/oss)为保证Alluxio能够成功挂载OSS上的数据集请确保`dataset.yaml`文件中设置了正确的`mountPoint`、`fs.oss.accessKeyId`、`fs.oss.accessKeySecret`和`fs.oss.endpoint`。
> 你可以参考Alluxio的官方文档示例[Aliyun Object Storage Service](https://docs.alluxio.io/os/user/stable/en/ufs/OSS.html)了解更多在Alluxio中使用OSS的例子。
本文档以四机八卡为例,所以在`dataset.yaml`中设置`spec.replicas=4`。此外,`dataset.yaml`文件还根据我们的测试经验设置了许多参数以优化Alluxio的IO性能包括Alluxio、Fuse和JVM等层次您可以自行根据机器配置和任务需求调整参数。
```shell
$ cat << EOF >> dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
name: imagenet
spec:
mounts:
- mountPoint: oss://<OSS_BUCKET>/<OSS_DIRECTORY>/
name: imagenet
options:
fs.oss.accessKeyId: <OSS_ACCESS_KEY_ID>
fs.oss.accessKeySecret: <OSS_ACCESS_KEY_SECRET>
fs.oss.endpoint: <OSS_ENDPOINT>
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
name: imagenet
spec:
replicas: 4
data:
replicas: 1
# alluxioVersion:
# image: registry.cn-huhehaote.aliyuncs.com/alluxio/alluxio
# imageTag: "2.3.0-SNAPSHOT-bbce37a"
# imagePullPolicy: Always
tieredstore:
levels:
- mediumtype: SSD
path: /var/lib/docker/alluxio
quota: 50Gi
high: "0.99"
low: "0.8"
properties:
# alluxio fuse
alluxio.fuse.jnifuse.enabled: "true"
alluxio.fuse.debug.enabled: "false"
alluxio.fuse.cached.paths.max: "1000000"
alluxio.fuse.logging.threshold: 1000ms
# alluxio master
alluxio.master.metastore: ROCKS
alluxio.master.journal.folder: /journal
alluxio.master.journal.type: UFS
alluxio.master.metastore.inode.cache.max.size: "10000000"
alluxio.master.journal.log.size.bytes.max: 500MB
alluxio.master.metadata.sync.concurrency.level: "128"
alluxio.master.metadata.sync.executor.pool.size: "128"
alluxio.master.metadata.sync.ufs.prefetch.pool.size: "128"
alluxio.master.rpc.executor.max.pool.size: "1024"
alluxio.master.rpc.executor.core.pool.size: "128"
# alluxio worker
alluxio.worker.allocator.class: alluxio.worker.block.allocator.GreedyAllocator
alluxio.worker.network.reader.buffer.size: 32MB
alluxio.worker.file.buffer.size: 320MB
alluxio.worker.block.master.client.pool.size: "1024"
# alluxio user
alluxio.user.block.worker.client.pool.min: "512"
alluxio.user.file.writetype.default: MUST_CACHE
alluxio.user.ufs.block.read.location.policy: alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy
alluxio.user.block.write.location.policy.class: alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy
alluxio.user.block.size.bytes.default: 16MB
alluxio.user.streaming.reader.chunk.size.bytes: 32MB
alluxio.user.local.reader.chunk.size.bytes: 32MB
alluxio.user.metrics.collection.enabled: "false"
alluxio.user.update.file.accesstime.disabled: "true"
alluxio.user.file.passive.cache.enabled: "false"
alluxio.user.block.avoid.eviction.policy.reserved.size.bytes: 2GB
alluxio.user.block.master.client.pool.gc.threshold: 2day
alluxio.user.file.master.client.threads: "1024"
alluxio.user.block.master.client.threads: "1024"
alluxio.user.file.readtype.default: CACHE
alluxio.user.metadata.cache.enabled: "true"
alluxio.user.metadata.cache.expiration.time: 2day
alluxio.user.metadata.cache.max.size: "1000000"
alluxio.user.direct.memory.io.enabled: "true"
alluxio.user.worker.list.refresh.interval: 2min
alluxio.user.logging.threshold: 1000ms
# other alluxio configurations
alluxio.web.ui.enabled: "false"
alluxio.security.stale.channel.purge.interval: 365d
alluxio.job.worker.threadpool.size: "164"
master:
jvmOptions:
- "-Xmx6G"
- "-XX:+UnlockExperimentalVMOptions"
- "-XX:ActiveProcessorCount=8"
worker:
jvmOptions:
- "-Xmx12G"
- "-XX:+UnlockExperimentalVMOptions"
- "-XX:MaxDirectMemorySize=32g"
- "-XX:ActiveProcessorCount=8"
resources:
limits:
cpu: 8
fuse:
# image: registry.cn-huhehaote.aliyuncs.com/alluxio/alluxio-fuse
# imageTag: "2.3.0-SNAPSHOT-bbce37a"
# imagePullPolicy: Always
env:
MAX_IDLE_THREADS: "32"
jvmOptions:
- "-Xmx16G"
- "-Xms16G"
- "-XX:+UseG1GC"
- "-XX:MaxDirectMemorySize=32g"
- "-XX:+UnlockExperimentalVMOptions"
- "-XX:ActiveProcessorCount=24"
resources:
limits:
cpu: 16
shortCircuitPolicy: local
args:
- fuse
- --fuse-opts=kernel_cache,ro,max_read=131072,attr_timeout=7200,entry_timeout=7200,nonempty
EOF
```
创建Dataset和Runtime
```shell
$ kubectl create -f dataset.yaml
```
检查Alluxio Runtime可以看到`1`个Master`4`个Worker和`4`个Fuse已成功部署
```shell
$ kubectl describe alluxioruntime imagenet
Name: imagenet
Namespace: default
Labels: <none>
Annotations: <none>
API Version: data.fluid.io/v1alpha1
Kind: AlluxioRuntime
Metadata:
# more metadata
Spec:
# more spec
Status:
Cache States:
Cache Capacity: 200GiB
Cached: 0B
Cached Percentage: 0%
Conditions:
# more conditions
Current Fuse Number Scheduled: 4
Current Master Number Scheduled: 1
Current Worker Number Scheduled: 4
Desired Fuse Number Scheduled: 4
Desired Master Number Scheduled: 1
Desired Worker Number Scheduled: 4
Fuse Number Available: 4
Fuse Number Ready: 4
Fuse Phase: Ready
Phase: Bound
Runtimes:
Category: Accelerate
Name: imagenet
Namespace: default
Type: alluxio
Ufs Total: 143.7GiB
Events: <none>
```
同时检查到Dataset也绑定到Alluxio Runtime
```shell
$ kubectl describe dataset
Name: imagenet
Namespace: default
Labels: <none>
Annotations: <none>
API Version: data.fluid.io/v1alpha1
Kind: Dataset
Metadata:
# more metadata
Spec:
# more spec
Status:
Cache States:
Cache Capacity: 200GiB
Cached: 0B
Cached Percentage: 0%
Conditions:
Last Transition Time: 2020-08-18T11:01:09Z
Last Update Time: 2020-08-18T11:02:48Z
Message: The ddc runtime is ready.
Reason: DatasetReady
Status: True
Type: Ready
Phase: Bound
Runtimes:
Category: Accelerate
Name: imagenet
Namespace: default
Type: alluxio
Ufs Total: 143.7GiB
Events: <none>
```
检查pv和pvc名为imagenet的pv和pvc被成功创建
```shell
$ kubectl get pv,pvc
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
persistentvolume/imagenet 100Gi RWX Retain Bound default/imagenet 7m11s
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/imagenet Bound imagenet 100Gi RWX 7m11s
```
至此OSS云端数据集已成功部署到kubernetes集群中。
## 示例使用arena提交深度学习任务
`arena`提供了便捷的方式帮助用户提交和监控机器学习任务。在本文中,我们使用`arena`简化机器学习任务的部署流程。
如果您已经安装`arena`并且云端数据集已成功部署到本地集群中只需要简单执行以下命令便能提交ResNet50四机八卡训练任务
```shell
arena submit mpi \
--name horovod-resnet50-v2-4x8-fluid \
--gpus=8 \
--workers=4 \
--working-dir=/horovod-demo/tensorflow-demo/ \
--data imagenet:/data \
-e DATA_DIR=/data/imagenet \
-e num_batch=1000 \
-e datasets_num_private_threads=8 \
--image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod-benchmark-dawnbench-v2:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 \
./launch-example.sh 4 8
```
arena参数说明
- `--name`指定job的名字
- `--workers`指定参与训练的节点worker
- `--gpus`指定每个worker使用的GPU数
- `--working-dir`:指定工作路径
- `--data`挂载Volume `imagenet`到worker的`/data`目录
- `-e DATA_DIR`:指定数据集位置
- `./launch-example.sh 4 8`:运行脚本启动四机八卡测试
检查任务是否正常执行:
```shell
$ arena get horovod-resnet50-v2-4x8-fluid -e
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 16s
NAME STATUS TRAINER AGE INSTANCE NODE
horovod-resnet50-v2-4x8-fluid RUNNING MPIJOB 16s horovod-resnet50-v2-4x8-fluid-launcher-czlfn 192.168.1.21
horovod-resnet50-v2-4x8-fluid RUNNING MPIJOB 16s horovod-resnet50-v2-4x8-fluid-worker-0 192.168.1.16
horovod-resnet50-v2-4x8-fluid RUNNING MPIJOB 16s horovod-resnet50-v2-4x8-fluid-worker-1 192.168.1.21
horovod-resnet50-v2-4x8-fluid RUNNING MPIJOB 16s horovod-resnet50-v2-4x8-fluid-worker-2 192.168.1.25
horovod-resnet50-v2-4x8-fluid RUNNING MPIJOB 16s horovod-resnet50-v2-4x8-fluid-worker-3 192.168.3.29
```
如果您看到`4`个处于`RUNNING`状态的worker说明您已经成功启动训练。
如果您想知道训练进行到哪一步了请检查arena日志
```shell
$ arena logs --tail 100 -f horovod-resnet50-v2-4x8-fluid
```
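训练结束后,可以参考下面的命令清理本次提交的训练任务(示例命令,具体用法以你安装的 arena 版本为准):
```shell
# 删除本文提交的训练任务
arena delete horovod-resnet50-v2-4x8-fluid
```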

View File

@ -0,0 +1 @@
# warm up

View File

@ -2,7 +2,7 @@
## 脚本介绍
fluid提供了shell脚本[diagnose-fluid.sh](../../tools/diagnose-fluid.sh)帮助用户快速收集fluid系统和Runtime容器的日志信息。
fluid提供了shell脚本[diagnose-fluid.sh](../../../tools/diagnose-fluid.sh)帮助用户快速收集fluid系统和Runtime容器的日志信息。
## 如何使用

View File

@ -0,0 +1,170 @@
# fluid 快速上手
本文档介绍了如何创建 Kubernetes 集群环境,通过 Helm 完成 fluid 安装部署,并使用 fluid 创建数据集。
## 创建 Kubernetes 集群
fluid 需要 Kubernetes 环境,根据你的使用经历选择最适合你的方案:
- 你已经有了一个 Kubernetes 环境,并满足 Kubernetes :版本>=1.14,可以直接[部署fluid](#部署fluid)
- 你之前没有使用过 Kubernetes可以使用 Minikube 创建 Kubernetes 集群.
[Minikube](https://kubernetes.io/docs/setup/minikube/)可以在虚拟机中创建一个 Kubernetes 集群,可在 macOS, Linux 和 Windows 上运行。
请确保满足以下要求:
- [Minikube](https://kubernetes.io/docs/tasks/tools/install-minikube/) :版本 1.0.0+
- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl) : 版本 1.14+
安装好Minikube之后:
```shell
minikube start
```
如果安装成功的话,会出现类似的提示信息:
```shell
Darwin 10.14.5 上的 minikube v1.12.1
```
使用 `kubectl`访问新创建的 Kubernetes 集群
```shell
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-deployment-558fc78868-kvjnf 1/1 Running 1 4d12h
nginx-deployment-558fc78868-kx9gt 1/1 Running 1 4d12h
```
## 部署fluid
开始之前,确保已满足以下要求:
- 使用 `kubectl` 可以成功访问到 Kubernetes 集群
- [Helm](https://helm.sh/docs/intro/install/) : Helm 3 已安装
1. 获取 fluid
```shell
git clone https://github.com/fluid-cloudnative/fluid.git
```
2. 使用 Helm 安装 fluid
```shell
helm install fluid fluid
NAME: fluid
LAST DEPLOYED: Tue Jul 7 11:22:07 2020
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
```
3. 查看安装结果
```shell
kubectl get pod -n fluid-system
NAME READY STATUS RESTARTS AGE
controller-manager-6b864dfd4f-995gm 1/1 Running 0 32h
csi-nodeplugin-fluid-c6pzj 2/2 Running 0 32h
csi-nodeplugin-fluid-wczmq 2/2 Running 0 32h
```
## 创建数据集
fluid提供了云原生的数据加速和管理能力并抽象出了`数据集`概念方便用户管理,接下来将演示如何用 fluid 创建一个数据集。
1. 通过CRD文件创建一个Dataset对象其中描述了数据集的来源。
```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
name: demo
spec:
mounts:
- mountPoint: https://mirror.bit.edu.cn/apache/spark/spark-3.0.0/
name: spark
```
执行安装
```
kubectl create -f dataset.yaml
```
dataset 创建以后处于 `NotBound` 状态,需要绑定 runtime 才能使用。
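可以通过下面的命令确认这一状态(示例命令):
```shell
# 查看 Dataset 的当前状态,此时 phase 应为 NotBound
kubectl get dataset demo -o yaml
```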
2. 同样根据 alluxioRuntime的CRD文件创建一个 Alluxio Runtime 对象,用来描述支持这个数据集的 runtime。
```yaml
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
name: demo
spec:
replicas: 1
tieredstore:
levels:
- mediumtype: MEM
path: /dev/shm
quota: 2Gi
high: "0.95"
low: "0.7"
storageType: Memory
properties:
alluxio.user.file.writetype.default: MUST_CACHE
alluxio.master.journal.folder: /journal
alluxio.master.journal.type: UFS
alluxio.user.block.size.bytes.default: 256MB
alluxio.user.streaming.reader.chunk.size.bytes: 256MB
alluxio.user.local.reader.chunk.size.bytes: 256MB
alluxio.worker.network.reader.buffer.size: 256MB
alluxio.user.streaming.data.timeout: 300sec
master:
jvmOptions:
- "-Xmx4G"
worker:
jvmOptions:
- "-Xmx4G"
fuse:
jvmOptions:
- "-Xmx4G "
- "-Xms4G "
# For now, only support local
shortCircuitPolicy: local
args:
- fuse
- --fuse-opts=direct_io,ro,max_read=131072
```
使用kubectl完成创建
```shell
kubectl create -f runtime.yaml
```
3. 接下来,我们创建一个应用容器来使用该数据集,我们将多次访问同一数据,并比较访问时间来展示 fluid 的加速效果。
```yaml
apiVersion: v1
kind: Pod
metadata:
name: demo-app
spec:
containers:
- name: demo
image: nginx
volumeMounts:
- mountPath: /data
name: demo
volumes:
- name: demo
persistentVolumeClaim:
claimName: demo
```
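将上述内容保存为 `app.yaml`(后面的步骤 5 还会用到该文件),并创建应用:
```shell
kubectl create -f app.yaml
```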
4. 登录到应用容器中访问数据,初次访问会花费更长时间。
```shell
kubectl exec -it demo-app -- bash
# du -sh /data/spark/spark-3.0.0-bin-without-hadoop.tgz
150M /data/spark/spark-3.0.0-bin-without-hadoop.tgz
# time cp /data/spark/spark-3.0.0-bin-without-hadoop.tgz /dev/null
real 0m13.171s
user 0m0.002s
sys 0m0.028s
```
5. 为了避免其他因素(比如 page cache )对结果造成影响,我们将删除之前的容器,新建相同的应用,尝试访问同样的文件。由于此时文件已经被 alluxio 缓存,可以看到第二次访问所需时间远小于第一次。
```shell
kubectl delete -f app.yaml && kubectl create -f app.yaml
...
# time cp /data/spark/spark-3.0.0-bin-without-hadoop.tgz /dev/null
real 0m0.344s
user 0m0.002s
sys 0m0.020s
```
到这里,我们已经成功创建了一个数据集并完成了加速,关于数据集的进一步使用和管理,可以参考[accelerate](../samples/accelerate_data_accessing.md)和[co-locality](../samples/data_co_locality.md)这两个示例。

View File

@ -0,0 +1,87 @@
# 在Kubernetes集群上部署Fluid
## 前提条件
- git
- Kubernetes集群version >= 1.14, 并且支持CSI功能
- kubectlversion >= 1.14
- Helmversion >= 3.0
接下来的文档假设您已经配置好上述所有环境。
对于kubectl的安装和配置请参考[此处](https://kubernetes.io/docs/tasks/tools/install-kubectl/)。
对于Helm 3的安装和配置请参考[此处](https://v3.helm.sh/docs/intro/install/)。
## Fluid安装步骤
### 获取Fluid Chart
您可以在任意文件夹,执行以下命令,从[fluid代码仓库](https://github.com/fluid-cloudnative/fluid)拷贝源代码:
```shell
$ git clone https://github.com/fluid-cloudnative/fluid.git
```
fluid源代码中包含了部署fluid所需的[helm charts](https://github.com/fluid-cloudnative/fluid/tree/master/charts)。
### 使用Helm安装Fluid
进入刚才克隆的本地代码仓库:
```shell
$ cd fluid
```
创建命名空间:
```shell
$ kubectl create ns fluid-system
```
安装fluid
```shell
$ helm install fluid charts/fluid/fluid
NAME: fluid
LAST DEPLOYED: Fri Jul 24 16:10:18 2020
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
```
> `helm install`命令的一般格式是`helm install <RELEASE_NAME> <SOURCE>`,在上面的命令中,`fluid`指定了安装的release名字这可以自行更改`charts/fluid/fluid`指定了helm chart的所在路径。
### 检查各组件状态
**查看Fluid使用的CRD:**
```shell
$ kubectl get crd | grep data.fluid.io
alluxiodataloads.data.fluid.io 2020-07-24T06:54:50Z
alluxioruntimes.data.fluid.io 2020-07-24T06:54:50Z
datasets.data.fluid.io 2020-07-24T06:54:50Z
```
**查看各Pod的状态:**
```shell
$ kubectl get pod -n fluid-system
NAME READY STATUS RESTARTS AGE
controller-manager-7f99c884dd-894g9 1/1 Running 0 5m28s
csi-nodeplugin-fluid-dm9b8 2/2 Running 0 5m28s
csi-nodeplugin-fluid-hwtvh 2/2 Running 0 5m28s
```
如果Pod状态如上所示那么Fluid就可以正常使用了
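你也可以通过 Helm 确认名为 fluid 的 release 的部署状态(示例命令):
```shell
# 查看 release 的部署状态
helm status fluid
```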
### 卸载Fluid
```shell
$ helm delete fluid
$ kubectl delete -f charts/fluid/fluid/crds
```
> 这里的`fluid`对应安装时指定的`<RELEASE_NAME>`

View File

@ -0,0 +1,14 @@
# Overview
[Fluid](https://github.com/fluid-cloudnative/fluid) is an open source, Kubernetes-native distributed dataset orchestrator and accelerator for data analysis and machine learning. It provides a full management life-cycle for the data orchestration system (Alluxio), including deployment, scaling and configuration changes. With Fluid, the end user can manage the data without touching the underlying data caching system.
> **Note:**
>
> You can only deploy Fluid in a Kubernetes cluster.
The corresponding relationship between Fluid and Alluxio versions is as follows:
| Fluid version | Compatible Alluxio versions |
|:---|:---|
| v0.1 | [Alluxio JNI Fuse 2.3](https://github.com/Alluxio/alluxio/tree/branch-2.3-fuse)|