forked from mindspore-Ecosystem/mindspore

autodis_can_been_used_on_ModelArts

This commit is contained in:
parent 63d853cf35
commit 924e8f3aa6
@@ -57,7 +57,7 @@ After installing MindSpore via the official website, you can start training and
 ```python
 # run training example
 python train.py \
-  --dataset_path='dataset/train' \
+  --train_data_dir='dataset/train' \
   --ckpt_path='./checkpoint' \
   --eval_file_name='auc.log' \
   --loss_file_name='loss.log' \
@@ -66,11 +66,11 @@ After installing MindSpore via the official website, you can start training and
 # run evaluation example
 python eval.py \
-  --dataset_path='dataset/test' \
+  --test_data_dir='dataset/test' \
   --checkpoint_path='./checkpoint/autodis.ckpt' \
   --device_target='Ascend' > ms_log/eval_output.log 2>&1 &
 OR
-sh scripts/run_eval.sh 0 Ascend /dataset_path /checkpoint_path/autodis.ckpt
+sh scripts/run_eval.sh 0 Ascend /test_data_dir /checkpoint_path/autodis.ckpt
 ```
 
 For distributed training, an hccl configuration file in JSON format needs to be created in advance.
@@ -79,6 +79,50 @@ After installing MindSpore via the official website, you can start training and
 <https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools>.
 
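For reference, the rank table that hccl_tools generates for a single device typically looks like the sketch below. This example is editorial, not part of the commit; every id and address in it is a placeholder rather than a value taken from this repository:

```json
{
    "version": "1.0",
    "server_count": "1",
    "server_list": [
        {
            "server_id": "10.0.0.1",
            "device": [
                {"device_id": "0", "device_ip": "192.168.100.101", "rank_id": "0"}
            ],
            "host_nic_ip": "reserve"
        }
    ],
    "status": "completed"
}
```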
+- running on ModelArts
+
+  If you want to run on ModelArts, please check the official documentation of [ModelArts](https://support.huaweicloud.com/modelarts/), and then you can start training as follows.
+
+  - Training with a single card on ModelArts
+
+    ```python
+    # (1) Upload the code folder to an S3 bucket.
+    # (2) Click "create training task" on the website UI interface.
+    # (3) Set the code directory to "/{path}/autodis" on the website UI interface.
+    # (4) Set the startup file to "/{path}/autodis/train.py" on the website UI interface.
+    # (5) Perform a or b.
+    #     a. setting parameters in /{path}/autodis/default_config.yaml.
+    #         1. Set "enable_modelarts: True"
+    #     b. adding on the website UI interface.
+    #         1. Add "enable_modelarts=True"
+    # (6) Upload the dataset to the S3 bucket.
+    # (7) Check the "data storage location" on the website UI interface and set the "Dataset path" path.
+    # (8) Set the "Output file path" and "Job log path" to your path on the website UI interface.
+    # (9) Under the item "resource pool selection", select the specification of a single card.
+    # (10) Create your job.
+    ```
+
+  - Evaluating with a single card on ModelArts
+
+    ```python
+    # (1) Upload the code folder to an S3 bucket.
+    # (2) Click "create training task" on the website UI interface.
+    # (3) Set the code directory to "/{path}/autodis" on the website UI interface.
+    # (4) Set the startup file to "/{path}/autodis/eval.py" on the website UI interface.
+    # (5) Perform a or b.
+    #     a. setting parameters in /{path}/autodis/default_config.yaml.
+    #         1. Set "enable_modelarts: True"
+    #         2. Set "checkpoint_path: ./{path}/*.ckpt" ('checkpoint_path' indicates the path of the weight file to be evaluated, relative to the file `eval.py`; the weight file must be included in the code directory.)
+    #     b. adding on the website UI interface.
+    #         1. Add "enable_modelarts=True"
+    #         2. Add "checkpoint_path=./{path}/*.ckpt" ('checkpoint_path' indicates the path of the weight file to be evaluated, relative to the file `eval.py`; the weight file must be included in the code directory.)
+    # (6) Upload the dataset to the S3 bucket.
+    # (7) Check the "data storage location" on the website UI interface and set the "Dataset path" path (there is only data or a zip package under this path).
+    # (8) Set the "Output file path" and "Job log path" to your path on the website UI interface.
+    # (9) Under the item "resource pool selection", select the specification of a single card.
+    # (10) Create your job.
+    ```
 
 # [Script Description](#contents)
 
 ## [Script and Sample Code](#contents)
@@ -86,54 +130,52 @@ After installing MindSpore via the official website, you can start training and
 ```bash
 .
 └─autodis
-  ├─README.md
-  ├─mindspore_hub_conf.md             # config for mindspore hub
+  ├─README.md                         # descriptions of AutoDis
+  ├─ascend310_infer                   # application for 310 inference
   ├─scripts
     ├─run_standalone_train.sh         # launch standalone training(1p) in Ascend or GPU
+    ├─run_infer_310.sh                # launch 310infer
     └─run_eval.sh                     # launch evaluating in Ascend or GPU
   ├─src
+    ├─model_utils
+      ├─config.py                     # parsing parameter configuration file of "*.yaml"
+      ├─device_adapter.py             # local or ModelArts training
+      ├─local_adapter.py              # get related environment variables in local training
+      └─moxing_adapter.py             # get related environment variables in ModelArts training
     ├─__init__.py                     # python init file
-    ├─config.py                       # parameter configuration
     ├─callback.py                     # define callback function
     ├─autodis.py                      # AutoDis network
-    ├─dataset.py                      # create dataset for AutoDis
-  ├─eval.py                           # eval net
-  └─train.py                          # train net
+    └─dataset.py                      # create dataset for AutoDis
+  ├─default_config.yaml               # parameter configuration
+  ├─eval.py                           # eval script
+  ├─export.py                         # export checkpoint file into air/mindir
+  ├─mindspore_hub_conf.py             # mindspore hub interface
+  ├─postprocess.py                    # 310infer postprocess script
+  ├─preprocess.py                     # 310infer preprocess script
+  └─train.py                          # train script
 ```
 
 ## [Script Parameters](#contents)
 
-Parameters for both training and evaluation can be set in config.py
+Parameters for both training and evaluation can be set in `default_config.yaml`
 
-- train parameters
-
-  ```python
-  optional arguments:
-    -h, --help            show this help message and exit
-    --dataset_path DATASET_PATH
-                          Dataset path
-    --ckpt_path CKPT_PATH
-                          Checkpoint path
-    --eval_file_name EVAL_FILE_NAME
-                          Auc log file path. Default: "./auc.log"
-    --loss_file_name LOSS_FILE_NAME
-                          Loss log file path. Default: "./loss.log"
-    --do_eval DO_EVAL     Do evaluation or not. Default: True
-    --device_target DEVICE_TARGET
-                          Ascend or GPU. Default: Ascend
-  ```
-
-- eval parameters
+- Parameters that can be modified at the terminal
 
   ```bash
-  optional arguments:
-    -h, --help            show this help message and exit
-    --checkpoint_path CHECKPOINT_PATH
-                          Checkpoint file path
-    --dataset_path DATASET_PATH
-                          Dataset path
-    --device_target DEVICE_TARGET
-                          Ascend or GPU. Default: Ascend
+  # Train
+  train_data_dir: ''                  # train dataset path
+  ckpt_path: 'ckpts'                  # the folder path to save '*.ckpt' files. Relative path.
+  eval_file_name: "./auc.log"         # file path to record accuracy
+  loss_file_name: "./loss.log"        # file path to record loss
+  do_eval: "True"                     # whether to do eval while training, default is 'True'.
+  # Test
+  test_data_dir: ''                   # test dataset path
+  checkpoint_path: ''                 # the path of the weight file to be evaluated, relative to `eval.py`; the weight file must be included in the code directory.
+  # Export
+  batch_size: 16000                   # batch_size for the exported model.
+  ckpt_file: ''                       # the path of the weight file to be exported, relative to `export.py`; the weight file must be included in the code directory.
+  file_name: "autodis"                # output file name.
+  file_format: "AIR"                  # output file format, you can choose from AIR or MINDIR, default is AIR
   ```
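Every scalar key in `default_config.yaml` is also registered as a command-line flag by `src/model_utils/config.py` (the new file further below), so the values above can be overridden at the terminal. For example, with hypothetical paths:

```bash
python train.py --train_data_dir=dataset/train --ckpt_path=./checkpoint --do_eval=True
```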
 
 ## [Training Process](#contents)
 
@@ -144,7 +186,7 @@ Parameters for both training and evaluation can be set in config.py
 ```python
 python train.py \
-  --dataset_path='dataset/train' \
+  --train_data_dir='dataset/train' \
   --ckpt_path='./checkpoint' \
   --eval_file_name='auc.log' \
   --loss_file_name='loss.log' \
@@ -174,11 +216,11 @@ Parameters for both training and evaluation can be set in config.py
 ```python
 python eval.py \
-  --dataset_path='dataset/test' \
+  --test_data_dir='dataset/test' \
   --checkpoint_path='./checkpoint/autodis.ckpt' \
   --device_target='Ascend' > ms_log/eval_output.log 2>&1 &
 OR
-sh scripts/run_eval.sh 0 Ascend /dataset_path /checkpoint_path/autodis.ckpt
+sh scripts/run_eval.sh 0 Ascend /test_data_dir /checkpoint_path/autodis.ckpt
 ```
 
 The above python command will run in the background. You can view the results through the file "eval_output.log". The accuracy is saved in the auc.log file.
@@ -191,12 +233,37 @@ Parameters for both training and evaluation can be set in config.py
 ### [Export MindIR](#contents)
 
-```shell
-python export.py --ckpt_file [CKPT_PATH] --file_name [FILE_NAME] --file_format [FILE_FORMAT]
-```
+- Export on local
 
-The ckpt_file parameter is required,
-`file_format` should be in ["AIR", "MINDIR"]
+  ```shell
+  # The ckpt_file parameter is required, `FILE_FORMAT` should be in ["AIR", "MINDIR"]
+  python export.py --ckpt_file [CKPT_PATH] --file_name [FILE_NAME] --file_format [FILE_FORMAT]
+  ```
+
+- Export on ModelArts (If you want to run on ModelArts, please check the official documentation of [ModelArts](https://support.huaweicloud.com/modelarts/), and then you can start as follows)
+
+  ```python
+  # (1) Upload the code folder to an S3 bucket.
+  # (2) Click "create training task" on the website UI interface.
+  # (3) Set the code directory to "/{path}/autodis" on the website UI interface.
+  # (4) Set the startup file to "/{path}/autodis/export.py" on the website UI interface.
+  # (5) Perform a or b.
+  #     a. setting parameters in /{path}/autodis/default_config.yaml.
+  #         1. Set "enable_modelarts: True"
+  #         2. Set "ckpt_file: ./{path}/*.ckpt" ('ckpt_file' indicates the path of the weight file to be exported, relative to the file `export.py`; the weight file must be included in the code directory.)
+  #         3. Set "file_name: autodis"
+  #         4. Set "file_format: AIR" (you can choose from AIR or MINDIR)
+  #     b. adding on the website UI interface.
+  #         1. Add "enable_modelarts=True"
+  #         2. Add "ckpt_file=./{path}/*.ckpt" ('ckpt_file' indicates the path of the weight file to be exported, relative to the file `export.py`; the weight file must be included in the code directory.)
+  #         3. Add "file_name=autodis"
+  #         4. Add "file_format=AIR" (you can choose from AIR or MINDIR)
+  # (7) Check the "data storage location" on the website UI interface and set the "Dataset path" path (this step has no effect on export, but the UI requires it).
+  # (8) Set the "Output file path" and "Job log path" to your path on the website UI interface.
+  # (9) Under the item "resource pool selection", select the specification of a single card.
+  # (10) Create your job.
+  # You will see autodis.air under "Output file path".
+  ```
 
 ### Infer on Ascend310
@@ -0,0 +1,91 @@
# Builtin Configurations (DO NOT CHANGE THESE CONFIGURATIONS unless you know exactly what you are doing)
enable_modelarts: False
# Url for modelarts
data_url: ""
train_url: ""
checkpoint_url: ""
# Path for local
data_path: "/cache/data"
output_path: "/cache/train"
load_path: "/cache/checkpoint_path"
device_target: "Ascend"
enable_profiling: False
# ==============================================================================
# Parameters that can be modified at the terminal
# Train
train_data_dir: ''
ckpt_path: 'ckpts'
eval_file_name: "./auc.log"
loss_file_name: "./loss.log"
do_eval: "True"
# Test
test_data_dir: ''
checkpoint_path: ''
# Export
batch_size: 16000
ckpt_file: ''
file_name: "autodis"
file_format: "AIR"
# Dataset related
DataConfig:
  data_vocab_size: 184965
  train_num_of_parts: 21
  test_num_of_parts: 3
  batch_size: 1000
  data_field_size: 39
  # dataset format, 1: mindrecord, 2: tfrecord, 3: h5
  data_format: 2
# Model related
ModelConfig:
  batch_size: DataConfig.batch_size
  data_field_size: DataConfig.data_field_size
  data_vocab_size: DataConfig.data_vocab_size
  data_emb_dim: 80
  deep_layer_args: [[400, 400, 512], "relu"]
  init_args: [-0.01, 0.01]
  weight_bias_init: ['normal', 'normal']
  keep_prob: 0.9
  split_index: 13
  hash_size: 20
  temperature: 0.00001 # 1e-5
# Training related
TrainConfig:
  batch_size: DataConfig.batch_size
  l2_coef: 0.000001 # 1e-6
  learning_rate: 0.00001 # 1e-5
  epsilon: 0.00000001 # 1e-8
  loss_scale: 1024.0
  train_epochs: 15
  save_checkpoint: True
  ckpt_file_name_prefix: "autodis"
  save_checkpoint_steps: 1
  keep_checkpoint_max: 15
  eval_callback: True
  loss_callback: True

---
# Help description for each configuration
# Parameters that are used on ModelArts
enable_modelarts: "Whether training on modelarts, default: False"
data_url: "Url for modelarts"
train_url: "Url for modelarts"
data_path: "The location of the input data."
output_path: "The location of the output file."
device_target: "Running platform, choose from Ascend, and default is Ascend."
enable_profiling: 'Whether enable profiling while training, default: False'
# Parameters that can be modified at the terminal
train_data_dir: "Train dataset dir, default is None"
ckpt_path: "ckpt dir to save, default is None"
eval_file_name: "Auc log file path. Default: './auc.log'"
loss_file_name: "Loss log file path. Default: './loss.log'"
do_eval: 'Do evaluation or not, only support "True" or "False". Default: "True"'
test_data_dir: "Test dataset dir, default is None"
checkpoint_path: "Path of the '*.ckpt' file to be evaluated, relative to eval.py"
ckpt_file: "Checkpoint file path."
file_name: "Output file name."
file_format: "Output file format, you can choose from AIR or MINDIR, default is AIR"

---
# Choices
device_target: ["Ascend"]
do_eval: ["True", "False"]
file_format: ["AIR", "MINDIR"]
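Note that entries such as `batch_size: DataConfig.batch_size` under `ModelConfig` and `TrainConfig` are read by the YAML parser as literal strings; they only act as placeholders, and `extra_operations` in the new `src/model_utils/config.py` (further below) overwrites them with the concrete values from `DataConfig`, roughly as follows:

```python
# sketch of what extra_operations() in src/model_utils/config.py does with the placeholders above
cfg.ModelConfig.batch_size = cfg.DataConfig.batch_size            # 1000
cfg.ModelConfig.data_field_size = cfg.DataConfig.data_field_size  # 39
cfg.ModelConfig.data_vocab_size = cfg.DataConfig.data_vocab_size  # 184965
cfg.TrainConfig.batch_size = cfg.DataConfig.batch_size            # 1000
```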
@@ -16,47 +16,43 @@
 import os
-import sys
 import time
-import argparse
 
 from mindspore import context
 from mindspore.train.model import Model
 from mindspore.train.serialization import load_checkpoint, load_param_into_net
 
 from src.autodis import ModelBuilder, AUCMetric
-from src.config import DataConfig, ModelConfig, TrainConfig
+from src.model_utils.config import config, data_config, model_config, train_config
 from src.dataset import create_dataset, DataType
+from src.model_utils.moxing_adapter import moxing_wrapper
+from src.model_utils.device_adapter import get_device_id
 
-sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
-parser = argparse.ArgumentParser(description='CTR Prediction')
-parser.add_argument('--checkpoint_path', type=str, default=None, help='Checkpoint file path')
-parser.add_argument('--dataset_path', type=str, default=None, help='Dataset path')
-parser.add_argument('--device_target', type=str, default="Ascend", choices=["Ascend"],
-                    help='Default: Ascend')
-args_opt, _ = parser.parse_known_args()
-device_id = int(os.getenv('DEVICE_ID'))
-context.set_context(mode=context.GRAPH_MODE, device_target=args_opt.device_target, device_id=device_id)
 
 
 def add_write(file_path, print_str):
     with open(file_path, 'a+', encoding='utf-8') as file_out:
         file_out.write(print_str + '\n')
 
+def modelarts_pre_process():
+    '''modelarts pre process function.'''
+    config.test_data_dir = config.data_path
+    config.checkpoint_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), config.checkpoint_path)
 
-if __name__ == '__main__':
-    data_config = DataConfig()
-    model_config = ModelConfig()
-    train_config = TrainConfig()
-
-    ds_eval = create_dataset(args_opt.dataset_path, train_mode=False,
+@moxing_wrapper(pre_process=modelarts_pre_process)
+def run_eval():
+    '''eval function'''
+    device_id = get_device_id()
+    context.set_context(mode=context.GRAPH_MODE, device_target=config.device_target, device_id=device_id)
+    ds_eval = create_dataset(config.test_data_dir, train_mode=False,
                              epochs=1, batch_size=train_config.batch_size,
                              data_type=DataType(data_config.data_format))
-    model_builder = ModelBuilder(ModelConfig, TrainConfig)
+    model_builder = ModelBuilder(model_config, train_config)
     train_net, eval_net = model_builder.get_train_eval_net()
     train_net.set_train()
     eval_net.set_train(False)
     auc_metric = AUCMetric()
     model = Model(train_net, eval_network=eval_net, metrics={"auc": auc_metric})
-    param_dict = load_checkpoint(args_opt.checkpoint_path)
+    param_dict = load_checkpoint(config.checkpoint_path)
     load_param_into_net(eval_net, param_dict)
 
     start = time.time()
@@ -66,3 +62,6 @@ if __name__ == '__main__':
     out_str = f'{time_str} AUC: {list(res.values())[0]}, eval time: {eval_time}s.'
     print(out_str)
     add_write('./auc.log', str(out_str))
+
+if __name__ == '__main__':
+    run_eval()
@@ -13,38 +13,42 @@
 # limitations under the License.
 # ============================================================================
 """export ckpt to model"""
-import argparse
+import os
 import numpy as np
 
 from mindspore import context, Tensor
 from mindspore.train.serialization import export, load_checkpoint
 
 from src.autodis import ModelBuilder
-from src.config import DataConfig, ModelConfig, TrainConfig
+from src.model_utils.config import config, data_config, model_config, train_config
+from src.model_utils.device_adapter import get_device_id
+from src.model_utils.moxing_adapter import moxing_wrapper
 
-parser = argparse.ArgumentParser(description="autodis export")
-parser.add_argument("--device_id", type=int, default=0, help="Device id")
-parser.add_argument("--ckpt_file", type=str, required=True, help="Checkpoint file path.")
-parser.add_argument("--file_name", type=str, default="autodis", help="output file name.")
-parser.add_argument("--file_format", type=str, choices=["AIR", "ONNX", "MINDIR"], default="AIR", help="file format")
-parser.add_argument("--device_target", type=str, choices=["Ascend"], default="Ascend",
-                    help="device target")
-args = parser.parse_args()
-
-context.set_context(mode=context.GRAPH_MODE, device_target=args.device_target, device_id=args.device_id)
+def modelarts_pre_process():
+    '''modelarts pre process function.'''
+    config.file_name = os.path.join(config.output_path, config.file_name)
+    config.ckpt_file = os.path.join(os.path.dirname(os.path.abspath(__file__)), config.ckpt_file)
 
-if __name__ == "__main__":
-    data_config = DataConfig()
-
-    model_builder = ModelBuilder(ModelConfig, TrainConfig)
+@moxing_wrapper(pre_process=modelarts_pre_process)
+def run_export():
+    '''export checkpoint file into air/mindir'''
+    context.set_context(mode=context.GRAPH_MODE, device_target=config.device_target, device_id=get_device_id())
+
+    model_builder = ModelBuilder(model_config, train_config)
     _, network = model_builder.get_train_eval_net()
     network.set_train(False)
 
-    load_checkpoint(args.ckpt_file, net=network)
+    load_checkpoint(config.ckpt_file, net=network)
 
     batch_ids = Tensor(np.zeros([data_config.batch_size, data_config.data_field_size]).astype(np.int32))
     batch_wts = Tensor(np.zeros([data_config.batch_size, data_config.data_field_size]).astype(np.float32))
     labels = Tensor(np.zeros([data_config.batch_size, 1]).astype(np.float32))
 
     input_data = [batch_ids, batch_wts, labels]
-    export(network, *input_data, file_name=args.file_name, file_format=args.file_format)
+    export(network, *input_data, file_name=config.file_name, file_format=config.file_format)
+
+
+if __name__ == "__main__":
+    run_export()
@@ -14,12 +14,10 @@
 # ============================================================================
 """hub config."""
 from src.autodis import ModelBuilder
-from src.config import ModelConfig, TrainConfig
+from src.model_utils.config import model_config, train_config
 
 def create_network(name, *args, **kwargs):
     if name == 'autodis':
-        model_config = ModelConfig()
-        train_config = TrainConfig()
         model_builder = ModelBuilder(model_config, train_config)
         _, autodis_eval_net = model_builder.get_train_eval_net()
         return autodis_eval_net
@@ -18,7 +18,7 @@ import argparse
 import numpy as np
 from mindspore import Tensor
 from src.autodis import AUCMetric
-from src.config import TrainConfig
+from src.model_utils.config import train_config
 
 parser = argparse.ArgumentParser(description='postprocess')
 parser.add_argument('--result_path', type=str, default="./result_Files", help='result path')
@@ -28,7 +28,6 @@ args_opt, _ = parser.parse_known_args()
 def get_acc():
     ''' get accuracy '''
     auc_metric = AUCMetric()
-    train_config = TrainConfig()
     files = os.listdir(args_opt.label_path)
     batch_size = train_config.batch_size
@@ -16,7 +16,7 @@
 import os
 import argparse
 
-from src.config import DataConfig, TrainConfig
+from src.model_utils.config import data_config, train_config
 from src.dataset import create_dataset, DataType
 
 parser = argparse.ArgumentParser(description='preprocess.')
@@ -24,11 +24,9 @@ parser.add_argument('--dataset_path', type=str, default=None, help='Dataset path
 parser.add_argument('--result_path', type=str, default='./preprocess_Result', help='Result path')
 args_opt, _ = parser.parse_known_args()
 
 
 def generate_bin():
     '''generate bin files'''
-    data_config = DataConfig()
-    train_config = TrainConfig()
-
     ds = create_dataset(args_opt.dataset_path, train_mode=False,
                         epochs=1, batch_size=train_config.batch_size,
                         data_type=DataType(data_config.data_format))
@@ -53,5 +51,6 @@ def generate_bin():
     print("=" * 20, "export bin files finished", "=" * 20)
 
+
 if __name__ == '__main__':
     generate_bin()
 
@@ -14,14 +14,34 @@
 # limitations under the License.
 # ============================================================================
 echo "Please run the script as: "
-echo "sh scripts/run_eval.sh DEVICE_ID DEVICE_TARGET DATASET_PATH CHECKPOINT_PATH"
+echo "sh scripts/run_eval.sh [DEVICE_ID] [DEVICE_TARGET] [TEST_DATA_DIR] [CHECKPOINT_PATH]"
 echo "for example: sh scripts/run_eval.sh 0 GPU /dataset_path /checkpoint_path"
 echo "After running the script, the network runs in the background. The log will be generated in ms_log/eval_output.log"
 
-export DEVICE_ID=$1
-DEVICE_TARGET=$2
-DATA_URL=$3
-CHECKPOINT_PATH=$4
+DATA_URL=$(readlink -f "$3")
+CHECKPOINT_PATH=$(readlink -f "$4")
+
+DEVICE_TARGET=$2
+if [ "$DEVICE_TARGET" = "GPU" ]; then
+    export CUDA_VISIBLE_DEVICES=$1
+elif [ "$DEVICE_TARGET" = "Ascend" ]; then
+    export DEVICE_ID=$1
+else
+    echo "Unsupported platform:$DEVICE_TARGET"
+    exit 1
+fi
+
+abs_path=$(readlink -f "$0")
+cur_path=$(dirname $abs_path)
+cd $cur_path
+
+rm -rf ./eval_$DEVICE_TARGET
+mkdir ./eval_$DEVICE_TARGET
+cp ../eval.py ./eval_$DEVICE_TARGET
+cp ../*.yaml ./eval_$DEVICE_TARGET
+cp -r ../src ./eval_$DEVICE_TARGET
+cd ./eval_$DEVICE_TARGET || exit
 
 mkdir -p ms_log
 CUR_DIR=`pwd`
@@ -29,6 +49,6 @@ export GLOG_log_dir=${CUR_DIR}/ms_log
 export GLOG_logtostderr=0
 
 python -u eval.py \
-    --dataset_path=$DATA_URL \
+    --test_data_dir=$DATA_URL \
     --checkpoint_path=$CHECKPOINT_PATH \
     --device_target=$DEVICE_TARGET > ms_log/eval_output.log 2>&1 &
@@ -14,31 +14,41 @@
 # limitations under the License.
 # ============================================================================
 echo "Please run the script as: "
-echo "sh scripts/run_standalone_train.sh DEVICE_ID/CUDA_VISIBLE_DEVICES DEVICE_TARGET DATASET_PATH"
+echo "sh scripts/run_standalone_train.sh [DEVICE_ID/CUDA_VISIBLE_DEVICES] [DEVICE_TARGET] [TRAIN_DATA_DIR]"
 echo "for example: sh scripts/run_standalone_train.sh 0 GPU /dataset_path"
 echo "After running the script, the network runs in the background. The log will be generated in ms_log/output.log"
 
 DEVICE_TARGET=$2
 
-if [ "$DEVICE_TARGET" = "GPU" ]
-then
+if [ "$DEVICE_TARGET" = "GPU" ]; then
     export CUDA_VISIBLE_DEVICES=$1
-fi
-
-if [ "$DEVICE_TARGET" = "Ascend" ]
-then
+elif [ "$DEVICE_TARGET" = "Ascend" ]; then
     export DEVICE_ID=$1
+else
+    echo "Unsupported platform:$DEVICE_TARGET"
+    exit 1
 fi
 
-DATA_URL=$3
+DATA_URL=$(readlink -f "$3")
+
+abs_path=$(readlink -f "$0")
+cur_path=$(dirname $abs_path)
+cd $cur_path
+
+rm -rf ./train_single_$DEVICE_TARGET
+mkdir ./train_single_$DEVICE_TARGET
+cp ../train.py ./train_single_$DEVICE_TARGET
+cp ../*.yaml ./train_single_$DEVICE_TARGET
+cp -r ../src ./train_single_$DEVICE_TARGET
+cd ./train_single_$DEVICE_TARGET || exit
 
 mkdir -p ms_log
 CUR_DIR=`pwd`
 export GLOG_log_dir=${CUR_DIR}/ms_log
 export GLOG_logtostderr=0
 
+echo "Start train at platform:$DEVICE_TARGET, device_id:$DEVICE_ID"
 python -u train.py \
-    --dataset_path=$DATA_URL \
+    --train_data_dir=$DATA_URL \
     --ckpt_path="checkpoint" \
     --eval_file_name='auc.log' \
     --loss_file_name='loss.log' \
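For example, to launch standalone training on Ascend device 0 (the dataset directory below is a hypothetical placeholder):

```bash
sh scripts/run_standalone_train.sh 0 Ascend /data/criteo/train
```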
@@ -1,64 +0,0 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
network config setting, will be used in train.py and eval.py
"""


class DataConfig:
    """
    Define parameters of dataset.
    """
    data_vocab_size = 184965
    train_num_of_parts = 21
    test_num_of_parts = 3
    batch_size = 1000
    data_field_size = 39
    # dataset format, 1: mindrecord, 2: tfrecord, 3: h5
    data_format = 2


class ModelConfig:
    """
    Define parameters of model.
    """
    batch_size = DataConfig.batch_size
    data_field_size = DataConfig.data_field_size
    data_vocab_size = DataConfig.data_vocab_size
    data_emb_dim = 80
    deep_layer_args = [[400, 400, 512], "relu"]
    init_args = [-0.01, 0.01]
    weight_bias_init = ['normal', 'normal']
    keep_prob = 0.9
    split_index = 13
    hash_size = 20
    temperature = 1e-5


class TrainConfig:
    """
    Define parameters of training.
    """
    batch_size = DataConfig.batch_size
    l2_coef = 1e-6
    learning_rate = 1e-5
    epsilon = 1e-8
    loss_scale = 1024.0
    train_epochs = 15
    save_checkpoint = True
    ckpt_file_name_prefix = "autodis"
    save_checkpoint_steps = 1
    keep_checkpoint_max = 15
    eval_callback = True
    loss_callback = True
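The deletion above is the heart of this commit: the hard-coded config classes give way to the YAML-backed namespaces. As a sketch of the substitution pattern applied across the scripts (the old form is shown commented out):

```python
# old style, removed in this commit:
# from src.config import DataConfig
# data_config = DataConfig()

# new style, added in this commit:
from src.model_utils.config import data_config

print(data_config.batch_size)  # 1000, taken from the DataConfig section of default_config.yaml
```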
@@ -24,7 +24,7 @@ import pandas as pd
 import mindspore.dataset as ds
 import mindspore.common.dtype as mstype
 
-from .config import DataConfig
+from src.model_utils.config import data_config as DataConfig
 
 
 class DataType(Enum):
@@ -0,0 +1,152 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

"""Parse arguments"""

import os
import ast
import argparse
from pprint import pprint, pformat
import yaml

_config = "default_config.yaml"


class Config:
    """
    Configuration namespace. Convert dictionary to members.
    """
    def __init__(self, cfg_dict):
        for k, v in cfg_dict.items():
            if isinstance(v, (list, tuple)):
                setattr(self, k, [Config(x) if isinstance(x, dict) else x for x in v])
            else:
                setattr(self, k, Config(v) if isinstance(v, dict) else v)

    def __str__(self):
        return pformat(self.__dict__)

    def __repr__(self):
        return self.__str__()


def parse_cli_to_yaml(parser, cfg, helper=None, choices=None, cfg_path=_config):
    """
    Parse command line arguments to the configuration according to the default yaml.

    Args:
        parser: Parent parser.
        cfg: Base configuration.
        helper: Helper description.
        choices: Choices of each configuration.
        cfg_path: Path to the default yaml config.
    """
    parser = argparse.ArgumentParser(description="[REPLACE THIS at config.py]",
                                     parents=[parser])
    helper = {} if helper is None else helper
    choices = {} if choices is None else choices
    for item in cfg:
        if not isinstance(cfg[item], list) and not isinstance(cfg[item], dict):
            help_description = helper[item] if item in helper else "Please reference to {}".format(cfg_path)
            choice = choices[item] if item in choices else None
            if isinstance(cfg[item], bool):
                parser.add_argument("--" + item, type=ast.literal_eval, default=cfg[item], choices=choice,
                                    help=help_description)
            else:
                parser.add_argument("--" + item, type=type(cfg[item]), default=cfg[item], choices=choice,
                                    help=help_description)
    args = parser.parse_args()
    return args


def parse_yaml(yaml_path):
    """
    Parse the yaml config file.

    Args:
        yaml_path: Path to the yaml config.
    """
    with open(yaml_path, 'r') as fin:
        try:
            cfgs = yaml.load_all(fin.read(), Loader=yaml.FullLoader)
            cfgs = [x for x in cfgs]
            if len(cfgs) == 1:
                cfg_helper = {}
                cfg = cfgs[0]
                cfg_choices = {}
            elif len(cfgs) == 2:
                cfg, cfg_helper = cfgs
                cfg_choices = {}
            elif len(cfgs) == 3:
                cfg, cfg_helper, cfg_choices = cfgs
            else:
                raise ValueError("At most 3 docs (config, description for help, choices) are supported in config yaml")
            print(cfg_helper)
        except:
            raise ValueError("Failed to parse yaml")
    return cfg, cfg_helper, cfg_choices


def merge(args, cfg):
    """
    Merge the base config from yaml file and command line arguments.

    Args:
        args: Command line arguments.
        cfg: Base configuration.
    """
    args_var = vars(args)
    for item in args_var:
        cfg[item] = args_var[item]
    return cfg


def extra_operations(cfg):
    """
    Do extra work on Config object.

    Args:
        cfg: Object after instantiation of class 'Config'.
    """
    cfg.ModelConfig.batch_size = cfg.DataConfig.batch_size
    cfg.ModelConfig.data_field_size = cfg.DataConfig.data_field_size
    cfg.ModelConfig.data_vocab_size = cfg.DataConfig.data_vocab_size
    cfg.TrainConfig.batch_size = cfg.DataConfig.batch_size


def get_config():
    """
    Get Config according to the yaml file and cli arguments.
    """
    parser = argparse.ArgumentParser(description="default name", add_help=False)
    current_dir = os.path.dirname(os.path.abspath(__file__))
    parser.add_argument("--config_path", type=str, default=os.path.join(current_dir, "../../{}".format(_config)),
                        help="Config file path")
    path_args, _ = parser.parse_known_args()
    default, helper, choices = parse_yaml(path_args.config_path)
    pprint(default)
    args = parse_cli_to_yaml(parser=parser, cfg=default, helper=helper, choices=choices, cfg_path=path_args.config_path)
    final_config = merge(args, default)
    config_obj = Config(final_config)
    extra_operations(config_obj)
    return config_obj


config = get_config()
data_config = config.DataConfig
model_config = config.ModelConfig
train_config = config.TrainConfig

if __name__ == '__main__':
    print(config)
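As an editorial illustration of the three-document layout handled by `parse_yaml`: the function returns the YAML documents of `default_config.yaml` as `(cfg, cfg_helper, cfg_choices)`. A hypothetical stand-alone use (the normal entry point is the module-level `config = get_config()` above):

```python
from src.model_utils.config import parse_yaml

cfg, helper, choices = parse_yaml("default_config.yaml")
print(cfg["device_target"])        # "Ascend"
print(helper["enable_modelarts"])  # "Whether training on modelarts, default: False"
print(choices["file_format"])      # ["AIR", "MINDIR"]
```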
@@ -0,0 +1,27 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

"""Device adapter for ModelArts"""

from src.model_utils.config import config

if config.enable_modelarts:
    from src.model_utils.moxing_adapter import get_device_id, get_device_num, get_rank_id, get_job_id
else:
    from src.model_utils.local_adapter import get_device_id, get_device_num, get_rank_id, get_job_id

__all__ = [
    "get_device_id", "get_device_num", "get_rank_id", "get_job_id"
]
@@ -0,0 +1,36 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

"""Local adapter"""

import os

def get_device_id():
    device_id = os.getenv('DEVICE_ID', '0')
    return int(device_id)


def get_device_num():
    device_num = os.getenv('RANK_SIZE', '1')
    return int(device_num)


def get_rank_id():
    global_rank_id = os.getenv('RANK_ID', '0')
    return int(global_rank_id)


def get_job_id():
    return "Local Job"
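For illustration (editorial, not part of the commit), these helpers simply read the environment variables that the launch scripts export; the values below are hypothetical:

```python
import os
os.environ['DEVICE_ID'] = '3'   # run_standalone_train.sh exports this before calling train.py
os.environ['RANK_SIZE'] = '1'

from src.model_utils.local_adapter import get_device_id, get_device_num
print(get_device_id())   # 3
print(get_device_num())  # 1
```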
@@ -0,0 +1,123 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

"""Moxing adapter for ModelArts"""

import os
import functools
from mindspore import context
from mindspore.profiler import Profiler
from src.model_utils.config import config

_global_sync_count = 0

def get_device_id():
    device_id = os.getenv('DEVICE_ID', '0')
    return int(device_id)


def get_device_num():
    device_num = os.getenv('RANK_SIZE', '1')
    return int(device_num)


def get_rank_id():
    global_rank_id = os.getenv('RANK_ID', '0')
    return int(global_rank_id)


def get_job_id():
    job_id = os.getenv('JOB_ID')
    job_id = job_id if job_id != "" else "default"
    return job_id

def sync_data(from_path, to_path):
    """
    Download data from remote obs to local directory if the first url is remote url and the second one is local path
    Upload data from local directory to remote obs in contrast.
    """
    import moxing as mox
    import time
    global _global_sync_count
    sync_lock = "/tmp/copy_sync.lock" + str(_global_sync_count)
    _global_sync_count += 1

    # Each server contains 8 devices as most.
    if get_device_id() % min(get_device_num(), 8) == 0 and not os.path.exists(sync_lock):
        print("from path: ", from_path)
        print("to path: ", to_path)
        mox.file.copy_parallel(from_path, to_path)
        print("===finish data synchronization===")
        try:
            os.mknod(sync_lock)
        except IOError:
            pass
        print("===save flag===")

    while True:
        if os.path.exists(sync_lock):
            break
        time.sleep(1)

    print("Finish sync data from {} to {}.".format(from_path, to_path))


def moxing_wrapper(pre_process=None, post_process=None):
    """
    Moxing wrapper to download dataset and upload outputs.
    """
    def wrapper(run_func):
        @functools.wraps(run_func)
        def wrapped_func(*args, **kwargs):
            # Download data from data_url
            if config.enable_modelarts:
                if config.data_url:
                    sync_data(config.data_url, config.data_path)
                    print("Dataset downloaded: ", os.listdir(config.data_path))
                if config.checkpoint_url:
                    sync_data(config.checkpoint_url, config.load_path)
                    print("Preload downloaded: ", os.listdir(config.load_path))
                if config.train_url:
                    sync_data(config.train_url, config.output_path)
                    print("Workspace downloaded: ", os.listdir(config.output_path))

                context.set_context(save_graphs_path=os.path.join(config.output_path, str(get_rank_id())))
                config.device_num = get_device_num()
                config.device_id = get_device_id()
                if not os.path.exists(config.output_path):
                    os.makedirs(config.output_path)

                if pre_process:
                    pre_process()

            if config.enable_profiling:
                profiler = Profiler()

            run_func(*args, **kwargs)

            if config.enable_profiling:
                profiler.analyse()

            # Upload data to train_url
            if config.enable_modelarts:
                if post_process:
                    post_process()

                if config.train_url:
                    print("Start to copy output directory")
                    sync_data(config.output_path, config.train_url)
        return wrapped_func
    return wrapper
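Putting the pieces together, an entry script wrapped with this decorator follows the pattern below. This is an editorial sketch of how `train.py` and `eval.py` in this commit use it; `my_pre_process` and `main` are hypothetical names:

```python
from src.model_utils.config import config
from src.model_utils.moxing_adapter import moxing_wrapper

def my_pre_process():
    # remap the yaml paths to the ModelArts cache directories before the job starts
    config.train_data_dir = config.data_path

@moxing_wrapper(pre_process=my_pre_process)
def main():
    # runs after the dataset has been synced from OBS (data_url) to /cache/data
    print("train with", config.train_data_dir)

if __name__ == '__main__':
    main()
```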
@@ -15,7 +15,6 @@
 """train_criteo."""
 import os
 import sys
-import argparse
 
 from mindspore import context
 from mindspore.context import ParallelMode
@@ -25,57 +24,50 @@ from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, TimeMoni
 from mindspore.common import set_seed
 
 from src.autodis import ModelBuilder, AUCMetric
-from src.config import DataConfig, ModelConfig, TrainConfig
 from src.dataset import create_dataset, DataType
 from src.callback import EvalCallBack, LossCallBack
+from src.model_utils.moxing_adapter import moxing_wrapper
+from src.model_utils.config import config, train_config, data_config, model_config
+from src.model_utils.device_adapter import get_device_id, get_device_num, get_rank_id
 
-sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
-parser = argparse.ArgumentParser(description='CTR Prediction')
-parser.add_argument('--dataset_path', type=str, default=None, help='Dataset path')
-parser.add_argument('--ckpt_path', type=str, default=None, help='Checkpoint path')
-parser.add_argument('--eval_file_name', type=str, default="./auc.log",
-                    help='Auc log file path. Default: "./auc.log"')
-parser.add_argument('--loss_file_name', type=str, default="./loss.log",
-                    help='Loss log file path. Default: "./loss.log"')
-parser.add_argument('--do_eval', type=str, default='True', choices=["True", "False"],
-                    help='Do evaluation or not, only support "True" or "False". Default: "True"')
-parser.add_argument('--device_target', type=str, default="Ascend", choices=["Ascend"],
-                    help='Default: Ascend')
-args_opt, _ = parser.parse_known_args()
-args_opt.do_eval = args_opt.do_eval == 'True'
-rank_size = int(os.environ.get("RANK_SIZE", 1))
 
 set_seed(1)
 
-if __name__ == '__main__':
-    data_config = DataConfig()
-    model_config = ModelConfig()
-    train_config = TrainConfig()
+def modelarts_pre_process():
+    '''modelarts pre process function.'''
+    config.train_data_dir = config.data_path
+    config.ckpt_path = os.path.join(config.output_path, config.ckpt_path)
 
+@moxing_wrapper(pre_process=modelarts_pre_process)
+def run_train():
+    '''train function'''
+    config.do_eval = config.do_eval == 'True'
+    rank_size = get_device_num()
     if rank_size > 1:
-        if args_opt.device_target == "Ascend":
-            device_id = int(os.getenv('DEVICE_ID'))
-            context.set_context(mode=context.GRAPH_MODE, device_target=args_opt.device_target, device_id=device_id)
+        if config.device_target == "Ascend":
+            device_id = get_device_id()
+            context.set_context(mode=context.GRAPH_MODE, device_target=config.device_target, device_id=device_id)
             context.reset_auto_parallel_context()
             context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL, gradients_mean=True)
            init()
-            rank_id = int(os.environ.get('RANK_ID'))
+            rank_id = get_rank_id()
         else:
-            print("Unsupported device_target ", args_opt.device_target)
+            print("Unsupported device_target ", config.device_target)
             exit()
     else:
-        if args_opt.device_target == "Ascend":
-            device_id = int(os.getenv('DEVICE_ID'))
-            context.set_context(mode=context.GRAPH_MODE, device_target=args_opt.device_target, device_id=device_id)
+        if config.device_target == "Ascend":
+            device_id = get_device_id()
+            context.set_context(mode=context.GRAPH_MODE, device_target=config.device_target, device_id=device_id)
         else:
-            print("Unsupported device_target ", args_opt.device_target)
+            print("Unsupported device_target ", config.device_target)
             exit()
         rank_size = None
         rank_id = None
 
     # Init Profiler
 
-    ds_train = create_dataset(args_opt.dataset_path,
+    ds_train = create_dataset(config.train_data_dir,
                               train_mode=True,
                               epochs=1,
                               batch_size=train_config.batch_size,
@@ -84,34 +76,36 @@ if __name__ == '__main__':
                               rank_id=rank_id)
     print("ds_train.size: {}".format(ds_train.get_dataset_size()))
 
-    steps_size = ds_train.get_dataset_size()
+    # steps_size = ds_train.get_dataset_size()
 
-    model_builder = ModelBuilder(ModelConfig, TrainConfig)
+    model_builder = ModelBuilder(model_config, train_config)
     train_net, eval_net = model_builder.get_train_eval_net()
     auc_metric = AUCMetric()
     model = Model(train_net, eval_network=eval_net, metrics={"auc": auc_metric})
 
     time_callback = TimeMonitor(data_size=ds_train.get_dataset_size())
-    loss_callback = LossCallBack(loss_file_path=args_opt.loss_file_name)
+    loss_callback = LossCallBack(loss_file_path=config.loss_file_name)
     callback_list = [time_callback, loss_callback]
 
     if train_config.save_checkpoint:
         if rank_size:
             train_config.ckpt_file_name_prefix = train_config.ckpt_file_name_prefix + str(get_rank())
-            args_opt.ckpt_path = os.path.join(args_opt.ckpt_path, 'ckpt_' + str(get_rank()) + '/')
+            config.ckpt_path = os.path.join(config.ckpt_path, 'ckpt_' + str(get_rank()) + '/')
         config_ck = CheckpointConfig(save_checkpoint_steps=train_config.save_checkpoint_steps,
                                      keep_checkpoint_max=train_config.keep_checkpoint_max)
         ckpt_cb = ModelCheckpoint(prefix=train_config.ckpt_file_name_prefix,
-                                  directory=args_opt.ckpt_path,
+                                  directory=config.ckpt_path,
                                   config=config_ck)
         callback_list.append(ckpt_cb)
 
-    if args_opt.do_eval:
-        ds_eval = create_dataset(args_opt.dataset_path, train_mode=False,
+    if config.do_eval:
+        ds_eval = create_dataset(config.train_data_dir, train_mode=False,
                                  epochs=1,
                                  batch_size=train_config.batch_size,
                                  data_type=DataType(data_config.data_format))
         eval_callback = EvalCallBack(model, ds_eval, auc_metric,
-                                     eval_file_path=args_opt.eval_file_name)
+                                     eval_file_path=config.eval_file_name)
         callback_list.append(eval_callback)
     model.train(train_config.train_epochs, ds_train, callbacks=callback_list)
+
+
+if __name__ == '__main__':
+    run_train()