forked from mindspore-Ecosystem/mindspore
Add cache demo for modelzoo resnet
This commit is contained in:
parent fcac556d58
commit 357ae7833c

@@ -155,7 +155,8 @@ python eval.py --net=[resnet50|resnet101] --dataset=[cifar10|imagenet2012] --dat
├── run_eval_gpu.sh # launch gpu evaluation
├── run_standalone_train_gpu.sh # launch gpu standalone training(1 pcs)
├── run_gpu_resnet_benchmark.sh # launch gpu benchmark train for resnet50 with imagenet2012
└── run_eval_gpu_resnet_benckmark.sh # launch gpu benchmark eval for resnet50 with imagenet2012
├── run_eval_gpu_resnet_benckmark.sh # launch gpu benchmark eval for resnet50 with imagenet2012
└── cache_util.sh # a collection of helper functions to manage cache
├── src
  ├── config.py # parameter configuration
  ├── dataset.py # data preprocessing

@@ -330,7 +331,26 @@ bash run_parameter_server_train_gpu.sh [resnet50|resnet101] [cifar10|imagenet201
#### Evaluation while training

You can add `run_eval` to the launch shell script and set it to True if you want to evaluate while training. When `run_eval` is True, you can also set the arguments `eval_dataset_path`, `save_best_ckpt`, `eval_start_epoch`, and `eval_interval`.

```bash
# evaluation while distributed training Ascend example:
bash run_distribute_train.sh [resnet18|resnet50|resnet101|se-resnet50] [cifar10|imagenet2012] [RANK_TABLE_FILE] [DATASET_PATH] [RUN_EVAL](optional) [EVAL_DATASET_PATH](optional)

# evaluation while standalone training Ascend example:
bash run_standalone_train.sh [resnet18|resnet50|resnet101|se-resnet50] [cifar10|imagenet2012] [DATASET_PATH] [RUN_EVAL](optional) [EVAL_DATASET_PATH](optional)

# evaluation while distributed training GPU example:
bash run_distribute_train_gpu.sh [resnet50|resnet101] [cifar10|imagenet2012] [DATASET_PATH] [RUN_EVAL](optional) [EVAL_DATASET_PATH](optional)

# evaluation while standalone training GPU example:
bash run_standalone_train_gpu.sh [resnet50|resnet101] [cifar10|imagenet2012] [DATASET_PATH] [RUN_EVAL](optional) [EVAL_DATASET_PATH](optional)
```

`RUN_EVAL` and `EVAL_DATASET_PATH` are optional arguments; setting `RUN_EVAL`=True enables evaluation while training. When `RUN_EVAL` is set, `EVAL_DATASET_PATH` must also be set.
When `RUN_EVAL` is True, you can also set these optional Python script arguments: `save_best_ckpt`, `eval_start_epoch`, `eval_interval`.

By default, a standalone cache server will be started to cache all eval images in tensor format in memory, improving evaluation performance. Please make sure the dataset fits in memory (around 30GB of memory is required for the ImageNet2012 eval dataset, 6GB for the CIFAR-10 eval dataset).

Users can choose to shut down the cache server after training or leave it running for future use.
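
For reference, the cache server can also be managed by hand with MindSpore's `cache_admin` tool. A minimal sketch follows; `--start`, `--stop`, and `-g` appear in the scripts of this commit, while `--destroy_session` is assumed from the MindSpore cache documentation, so verify it against your version:

```bash
# start the cache server (tolerated if it is already running)
cache_admin --start

# create a cache session and capture its id for --cache_session_id
session_id=$(cache_admin -g | awk 'END {print $NF}')

# ... run training with --enable_cache=True --cache_session_id=$session_id ...

# drop the session and stop the server when finished
cache_admin --destroy_session "$session_id"   # assumed flag, check your MindSpore version
cache_admin --stop
```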

### Result

@@ -67,7 +67,7 @@ The overall ResNet network architecture is as follows:

Dataset used: [ImageNet2012](http://www.image-net.org/)

- Dataset size: 1,000 classes, 224*224 color images
- Training set: 1,281,167 images
- Test set: 50,000 images
- Data format: JPEG
- Note: data is processed in dataset.py.

@@ -143,7 +143,8 @@ bash run_eval_gpu.sh [resnet50|resnet101] [cifar10|imagenet2012] [DATASET_PATH]
├── run_distribute_train_gpu.sh # launch GPU distributed training (8 pcs)
├── run_parameter_server_train_gpu.sh # launch GPU parameter server training (8 pcs)
├── run_eval_gpu.sh # launch GPU evaluation
└── run_standalone_train_gpu.sh # launch GPU standalone training (1 pcs)
├── run_standalone_train_gpu.sh # launch GPU standalone training (1 pcs)
└── cache_util.sh # helper functions for managing the cache
├── src
  ├── config.py # parameter configuration
  ├── dataset.py # data preprocessing

@@ -304,7 +305,25 @@ bash run_parameter_server_train_gpu.sh [resnet50|resnet101] [cifar10|imagenet201

#### Evaluation while training

To evaluate while training, add `run_eval` to the launch script and set it to True. You also need to set `eval_dataset_path`, `save_best_ckpt`, `eval_start_epoch`, and `eval_interval`.

```bash
# example: evaluation while distributed training on Ascend
bash run_distribute_train.sh [resnet18|resnet50|resnet101|se-resnet50] [cifar10|imagenet2012] [RANK_TABLE_FILE] [DATASET_PATH] [RUN_EVAL](optional) [EVAL_DATASET_PATH](optional)

# example: evaluation while standalone training on Ascend
bash run_standalone_train.sh [resnet18|resnet50|resnet101|se-resnet50] [cifar10|imagenet2012] [DATASET_PATH] [RUN_EVAL](optional) [EVAL_DATASET_PATH](optional)

# example: evaluation while distributed training on GPU
bash run_distribute_train_gpu.sh [resnet50|resnet101] [cifar10|imagenet2012] [DATASET_PATH] [RUN_EVAL](optional) [EVAL_DATASET_PATH](optional)

# example: evaluation while standalone training on GPU
bash run_standalone_train_gpu.sh [resnet50|resnet101] [cifar10|imagenet2012] [DATASET_PATH] [RUN_EVAL](optional) [EVAL_DATASET_PATH](optional)
```

Setting `RUN_EVAL` to True enables evaluation while training; `EVAL_DATASET_PATH` must then also be set. In addition, when `RUN_EVAL` is True you can set the Python script arguments `save_best_ckpt`, `eval_start_epoch`, and `eval_interval`.

By default, a standalone cache server is started to hold the eval images in memory in tensor format, improving evaluation performance. Before using the cache, make sure there is enough memory for the eval dataset (about 30GB for the ImageNet2012 eval set, about 6GB for the CIFAR-10 eval set).

After training, you can shut down the cache server or leave it running to serve future evaluations.

### Result

@@ -0,0 +1,49 @@
#!/usr/bin/env bash
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

bootup_cache_server()
{
  echo "Booting up cache server..."
  result=$(cache_admin --start 2>&1)
  rc=$?
  echo "${result}"
  if [ "${rc}" -ne 0 ] && [[ ! ${result} =~ "Cache server is already up and running" ]]; then
    echo "cache_admin command failure!" "${result}"
    exit 1
  fi
}

generate_cache_session()
{
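  # cache_admin -g prints a message ending with the new session id; awk keeps that last field.
  # The id is emitted on stdout, so callers capture it with $(generate_cache_session).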
  result=$(cache_admin -g | awk 'END {print $NF}')
  rc=$?
  echo "${result}"
  if [ "${rc}" -ne 0 ]; then
    echo "cache_admin command failure!" "${result}"
    exit 1
  fi
}

shutdown_cache_server()
{
  echo "Shutting down cache server..."
  result=$(cache_admin --stop 2>&1)
  rc=$?
  echo "${result}"
  if [ "${rc}" -ne 0 ] && [[ ! ${result} =~ "Server on port 50052 is not up or has been shutdown already" ]]; then
    echo "cache_admin command failure!" "${result}"
    exit 1
  fi
}
@@ -14,9 +14,12 @@
# limitations under the License.
# ============================================================================

if [ $# != 4 ] && [ $# != 5 ]
. cache_util.sh

if [ $# != 4 ] && [ $# != 5 ] && [ $# != 6 ]
then
    echo "Usage: bash run_distribute_train.sh [resnet18|resnet50|resnet101|se-resnet50] [cifar10|imagenet2012] [RANK_TABLE_FILE] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)"
    echo "       bash run_distribute_train.sh [resnet18|resnet50|resnet101|se-resnet50] [cifar10|imagenet2012] [RANK_TABLE_FILE] [DATASET_PATH] [RUN_EVAL](optional) [EVAL_DATASET_PATH](optional)"
    exit 1
fi

@@ -60,6 +63,12 @@ then
    PATH3=$(get_real_path $5)
fi

if [ $# == 6 ]
then
    RUN_EVAL=$5
    EVAL_DATASET_PATH=$(get_real_path $6)
fi

if [ ! -f $PATH1 ]
then
    echo "error: RANK_TABLE_FILE=$PATH1 is not a file"
@@ -78,6 +87,18 @@ then
    exit 1
fi

if [ "x${RUN_EVAL}" == "xTrue" ] && [ ! -d $EVAL_DATASET_PATH ]
then
    echo "error: EVAL_DATASET_PATH=$EVAL_DATASET_PATH is not a directory"
    exit 1
fi

if [ "x${RUN_EVAL}" == "xTrue" ]
then
    bootup_cache_server
    CACHE_SESSION_ID=$(generate_cache_session)
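    # the generated session id is handed to train.py below via --cache_session_id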
fi

ulimit -u unlimited
export DEVICE_NUM=8
export RANK_SIZE=8

@@ -108,5 +129,10 @@ do
        python train.py --net=$1 --dataset=$2 --run_distribute=True --device_num=$DEVICE_NUM --dataset_path=$PATH2 --pre_trained=$PATH3 &> log &
    fi

    if [ $# == 6 ]
    then
        python train.py --net=$1 --dataset=$2 --run_distribute=True --device_num=$DEVICE_NUM --dataset_path=$PATH2 \
        --run_eval=$RUN_EVAL --eval_dataset_path=$EVAL_DATASET_PATH --enable_cache=True --cache_session_id=$CACHE_SESSION_ID &> log &
    fi
    cd ..
done
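
For illustration, a concrete launch might look like the following; the rank table file and dataset paths are hypothetical placeholders:

```bash
# 8-device Ascend training on ImageNet2012 with evaluation and cache enabled (paths are placeholders)
bash run_distribute_train.sh resnet50 imagenet2012 ./rank_table_8pcs.json /data/imagenet/train True /data/imagenet/val
```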

@@ -14,9 +14,12 @@
# limitations under the License.
# ============================================================================

if [ $# != 3 ] && [ $# != 4 ]
. cache_util.sh

if [ $# != 3 ] && [ $# != 4 ] && [ $# != 5 ]
then
    echo "Usage: bash run_distribute_train_gpu.sh [resnet50|resnet101] [cifar10|imagenet2012] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)"
    echo "       bash run_distribute_train_gpu.sh [resnet50|resnet101] [cifar10|imagenet2012] [DATASET_PATH] [RUN_EVAL](optional) [EVAL_DATASET_PATH](optional)"
    exit 1
fi

@@ -54,6 +57,12 @@ then
    PATH2=$(get_real_path $4)
fi

if [ $# == 5 ]
then
    RUN_EVAL=$4
    EVAL_DATASET_PATH=$(get_real_path $5)
fi

if [ ! -d $PATH1 ]
then

@@ -67,6 +76,18 @@ then
    exit 1
fi

if [ "x${RUN_EVAL}" == "xTrue" ] && [ ! -d $EVAL_DATASET_PATH ]
then
    echo "error: EVAL_DATASET_PATH=$EVAL_DATASET_PATH is not a directory"
    exit 1
fi

if [ "x${RUN_EVAL}" == "xTrue" ]
then
    bootup_cache_server
    CACHE_SESSION_ID=$(generate_cache_session)
fi

ulimit -u unlimited
export DEVICE_NUM=8
export RANK_SIZE=8

@@ -91,3 +112,11 @@ then
    python train.py --net=$1 --dataset=$2 --run_distribute=True \
    --device_num=$DEVICE_NUM --device_target="GPU" --dataset_path=$PATH1 --pre_trained=$PATH2 &> log &
fi

if [ $# == 5 ]
then
    mpirun --allow-run-as-root -n $RANK_SIZE --output-filename log_output --merge-stderr-to-stdout \
    python train.py --net=$1 --dataset=$2 --run_distribute=True \
    --device_num=$DEVICE_NUM --device_target="GPU" --dataset_path=$PATH1 --run_eval=$RUN_EVAL \
    --eval_dataset_path=$EVAL_DATASET_PATH --enable_cache=True --cache_session_id=$CACHE_SESSION_ID &> log &
fi

@@ -14,9 +14,12 @@
# limitations under the License.
# ============================================================================

if [ $# != 3 ] && [ $# != 4 ]
. cache_util.sh

if [ $# != 3 ] && [ $# != 4 ] && [ $# != 5 ]
then
    echo "Usage: bash run_standalone_train.sh [resnet18|resnet50|resnet101|se-resnet50] [cifar10|imagenet2012] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)"
    echo "       bash run_standalone_train.sh [resnet18|resnet50|resnet101|se-resnet50] [cifar10|imagenet2012] [DATASET_PATH] [RUN_EVAL](optional) [EVAL_DATASET_PATH](optional)"
    exit 1
fi

@@ -59,6 +62,12 @@ then
    PATH2=$(get_real_path $4)
fi

if [ $# == 5 ]
then
    RUN_EVAL=$4
    EVAL_DATASET_PATH=$(get_real_path $5)
fi

if [ ! -d $PATH1 ]
then
    echo "error: DATASET_PATH=$PATH1 is not a directory"

@@ -71,6 +80,18 @@ then
    exit 1
fi

if [ "x${RUN_EVAL}" == "xTrue" ] && [ ! -d $EVAL_DATASET_PATH ]
then
    echo "error: EVAL_DATASET_PATH=$EVAL_DATASET_PATH is not a directory"
    exit 1
fi

if [ "x${RUN_EVAL}" == "xTrue" ]
then
    bootup_cache_server
    CACHE_SESSION_ID=$(generate_cache_session)
fi

ulimit -u unlimited
export DEVICE_NUM=1
export RANK_ID=0

@@ -96,4 +117,10 @@ if [ $# == 4 ]
then
    python train.py --net=$1 --dataset=$2 --dataset_path=$PATH1 --pre_trained=$PATH2 &> log &
fi

if [ $# == 5 ]
then
    python train.py --net=$1 --dataset=$2 --dataset_path=$PATH1 --run_eval=$RUN_EVAL \
    --eval_dataset_path=$EVAL_DATASET_PATH --enable_cache=True --cache_session_id=$CACHE_SESSION_ID &> log &
fi
cd ..

@@ -14,9 +14,12 @@
# limitations under the License.
# ============================================================================

. cache_util.sh

if [ $# != 3 ] && [ $# != 4 ] && [ $# != 5 ]
then
    echo "Usage: bash run_standalone_train_gpu.sh [resnet50|resnet101] [cifar10|imagenet2012] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)"
    echo "       bash run_standalone_train_gpu.sh [resnet50|resnet101] [cifar10|imagenet2012] [DATASET_PATH] [RUN_EVAL](optional) [EVAL_DATASET_PATH](optional)"
    exit 1
fi

@@ -54,6 +57,12 @@ then
    PATH2=$(get_real_path $4)
fi

if [ $# == 5 ]
then
    RUN_EVAL=$4
    EVAL_DATASET_PATH=$(get_real_path $5)
fi

if [ ! -d $PATH1 ]
then
    echo "error: DATASET_PATH=$PATH1 is not a directory"

@@ -66,6 +75,19 @@ then
    exit 1
fi

if [ "x${RUN_EVAL}" == "xTrue" ] && [ ! -d $EVAL_DATASET_PATH ]
then
    echo "error: EVAL_DATASET_PATH=$EVAL_DATASET_PATH is not a directory"
    exit 1
fi

if [ "x${RUN_EVAL}" == "xTrue" ]
then
    bootup_cache_server
    CACHE_SESSION_ID=$(generate_cache_session)
fi

ulimit -u unlimited
export DEVICE_NUM=1
export DEVICE_ID=0

@@ -92,4 +114,10 @@ if [ $# == 4 ]
then
    python train.py --net=$1 --dataset=$2 --device_target="GPU" --dataset_path=$PATH1 --pre_trained=$PATH2 &> log &
fi

if [ $# == 5 ]
then
    python train.py --net=$1 --dataset=$2 --device_target="GPU" --dataset_path=$PATH1 --run_eval=$RUN_EVAL \
    --eval_dataset_path=$EVAL_DATASET_PATH --enable_cache=True --cache_session_id=$CACHE_SESSION_ID &> log &
fi
cd ..
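
For illustration, a concrete standalone GPU launch might look like this; the dataset paths are hypothetical placeholders:

```bash
# single-GPU training on CIFAR-10 with evaluation and cache enabled (paths are placeholders)
bash run_standalone_train_gpu.sh resnet50 cifar10 /data/cifar10/train True /data/cifar10/eval
```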

@@ -23,7 +23,8 @@ import mindspore.dataset.transforms.c_transforms as C2
from mindspore.communication.management import init, get_rank, get_group_size


def create_dataset1(dataset_path, do_train, repeat_num=1, batch_size=32, target="Ascend", distribute=False):
def create_dataset1(dataset_path, do_train, repeat_num=1, batch_size=32, target="Ascend", distribute=False,
                    enable_cache=False, cache_session_id=None):
    """
    create a train or evaluate cifar10 dataset for resnet50
    Args:

@@ -33,6 +34,8 @@ def create_dataset1(dataset_path, do_train, repeat_num=1, batch_size=32, target=
        batch_size(int): the batch size of dataset. Default: 32
        target(str): the device target. Default: Ascend
        distribute(bool): data for distribute or not. Default: False
        enable_cache(bool): whether the tensor caching service is used for eval. Default: False
        cache_session_id(int): if enable_cache is True, a cache session_id must be provided. Default: None

    Returns:
        dataset

@@ -70,7 +73,16 @@ def create_dataset1(dataset_path, do_train, repeat_num=1, batch_size=32, target=
    type_cast_op = C2.TypeCast(mstype.int32)

    data_set = data_set.map(operations=type_cast_op, input_columns="label", num_parallel_workers=8)
    data_set = data_set.map(operations=trans, input_columns="image", num_parallel_workers=8)
    # only enable cache for eval
    if do_train:
        enable_cache = False
    if enable_cache:
        if not cache_session_id:
            raise ValueError("A cache session_id must be provided to use cache.")
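        # size=0 is assumed to mean "no explicit memory cap", per MindSpore's DatasetCache docs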
        eval_cache = ds.DatasetCache(session_id=int(cache_session_id), size=0)
        data_set = data_set.map(operations=trans, input_columns="image", num_parallel_workers=8, cache=eval_cache)
    else:
        data_set = data_set.map(operations=trans, input_columns="image", num_parallel_workers=8)

    # apply batch operations
    data_set = data_set.batch(batch_size, drop_remainder=True)

@@ -80,7 +92,8 @@ def create_dataset1(dataset_path, do_train, repeat_num=1, batch_size=32, target=
    return data_set


def create_dataset2(dataset_path, do_train, repeat_num=1, batch_size=32, target="Ascend", distribute=False):
def create_dataset2(dataset_path, do_train, repeat_num=1, batch_size=32, target="Ascend", distribute=False,
                    enable_cache=False, cache_session_id=None):
    """
    create a train or eval imagenet2012 dataset for resnet50

@@ -91,6 +104,8 @@ def create_dataset2(dataset_path, do_train, repeat_num=1, batch_size=32, target=
        batch_size(int): the batch size of dataset. Default: 32
        target(str): the device target. Default: Ascend
        distribute(bool): data for distribute or not. Default: False
        enable_cache(bool): whether the tensor caching service is used for eval. Default: False
        cache_session_id(int): if enable_cache is True, a cache session_id must be provided. Default: None

    Returns:
        dataset

@@ -135,7 +150,17 @@ def create_dataset2(dataset_path, do_train, repeat_num=1, batch_size=32, target=
    type_cast_op = C2.TypeCast(mstype.int32)

    data_set = data_set.map(operations=trans, input_columns="image", num_parallel_workers=8)
    data_set = data_set.map(operations=type_cast_op, input_columns="label", num_parallel_workers=8)
    # only enable cache for eval
    if do_train:
        enable_cache = False
    if enable_cache:
        if not cache_session_id:
            raise ValueError("A cache session_id must be provided to use cache.")
        eval_cache = ds.DatasetCache(session_id=int(cache_session_id), size=0)
        data_set = data_set.map(operations=type_cast_op, input_columns="label", num_parallel_workers=8,
                                cache=eval_cache)
    else:
        data_set = data_set.map(operations=type_cast_op, input_columns="label", num_parallel_workers=8)

    # apply batch operations
    data_set = data_set.batch(batch_size, drop_remainder=True)

@@ -146,7 +171,8 @@ def create_dataset2(dataset_path, do_train, repeat_num=1, batch_size=32, target=
    return data_set


def create_dataset3(dataset_path, do_train, repeat_num=1, batch_size=32, target="Ascend", distribute=False):
def create_dataset3(dataset_path, do_train, repeat_num=1, batch_size=32, target="Ascend", distribute=False,
                    enable_cache=False, cache_session_id=None):
    """
    create a train or eval imagenet2012 dataset for resnet101
    Args:

@@ -156,6 +182,8 @@ def create_dataset3(dataset_path, do_train, repeat_num=1, batch_size=32, target=
        batch_size(int): the batch size of dataset. Default: 32
        target(str): the device target. Default: Ascend
        distribute(bool): data for distribute or not. Default: False
        enable_cache(bool): whether the tensor caching service is used for eval. Default: False
        cache_session_id(int): if enable_cache is True, a cache session_id must be provided. Default: None

    Returns:
        dataset

@@ -199,7 +227,17 @@ def create_dataset3(dataset_path, do_train, repeat_num=1, batch_size=32, target=
    type_cast_op = C2.TypeCast(mstype.int32)

    data_set = data_set.map(operations=trans, input_columns="image", num_parallel_workers=8)
    data_set = data_set.map(operations=type_cast_op, input_columns="label", num_parallel_workers=8)
    # only enable cache for eval
    if do_train:
        enable_cache = False
    if enable_cache:
        if not cache_session_id:
            raise ValueError("A cache session_id must be provided to use cache.")
        eval_cache = ds.DatasetCache(session_id=int(cache_session_id), size=0)
        data_set = data_set.map(operations=type_cast_op, input_columns="label", num_parallel_workers=8,
                                cache=eval_cache)
    else:
        data_set = data_set.map(operations=type_cast_op, input_columns="label", num_parallel_workers=8)

    # apply batch operations
    data_set = data_set.batch(batch_size, drop_remainder=True)

@@ -209,7 +247,8 @@ def create_dataset3(dataset_path, do_train, repeat_num=1, batch_size=32, target=
    return data_set


def create_dataset4(dataset_path, do_train, repeat_num=1, batch_size=32, target="Ascend", distribute=False):
def create_dataset4(dataset_path, do_train, repeat_num=1, batch_size=32, target="Ascend", distribute=False,
                    enable_cache=False, cache_session_id=None):
    """
    create a train or eval imagenet2012 dataset for se-resnet50

@@ -220,6 +259,8 @@ def create_dataset4(dataset_path, do_train, repeat_num=1, batch_size=32, target=
        batch_size(int): the batch size of dataset. Default: 32
        target(str): the device target. Default: Ascend
        distribute(bool): data for distribute or not. Default: False
        enable_cache(bool): whether the tensor caching service is used for eval. Default: False
        cache_session_id(int): if enable_cache is True, a cache session_id must be provided. Default: None

    Returns:
        dataset

@@ -261,7 +302,17 @@ def create_dataset4(dataset_path, do_train, repeat_num=1, batch_size=32, target=

    type_cast_op = C2.TypeCast(mstype.int32)
    data_set = data_set.map(operations=trans, input_columns="image", num_parallel_workers=12)
    data_set = data_set.map(operations=type_cast_op, input_columns="label", num_parallel_workers=12)
    # only enable cache for eval
    if do_train:
        enable_cache = False
    if enable_cache:
        if not cache_session_id:
            raise ValueError("A cache session_id must be provided to use cache.")
        eval_cache = ds.DatasetCache(session_id=int(cache_session_id), size=0)
        data_set = data_set.map(operations=type_cast_op, input_columns="label", num_parallel_workers=12,
                                cache=eval_cache)
    else:
        data_set = data_set.map(operations=type_cast_op, input_columns="label", num_parallel_workers=12)

    # apply batch operations
    data_set = data_set.batch(batch_size, drop_remainder=True)

@@ -16,10 +16,12 @@

import os
import stat
import time
from mindspore import save_checkpoint
from mindspore import log as logger
from mindspore.train.callback import Callback


class EvalCallBack(Callback):
    """
    Evaluation callback when training.

@@ -72,8 +74,11 @@ class EvalCallBack(Callback):
        cb_params = run_context.original_args()
        cur_epoch = cb_params.cur_epoch_num
        if cur_epoch >= self.eval_start_epoch and (cur_epoch - self.eval_start_epoch) % self.interval == 0:
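            # run eval every `interval` epochs once `eval_start_epoch` has been reached, timing each pass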
            eval_start = time.time()
            res = self.eval_function(self.eval_param_dict)
            print("epoch: {}, {}: {}".format(cur_epoch, self.metrics_name, res), flush=True)
            eval_cost = time.time() - eval_start
            print("epoch: {}, {}: {}, eval_cost:{:.2f}".format(cur_epoch, self.metrics_name, res, eval_cost),
                  flush=True)
            if res >= self.best_res:
                self.best_res = res
                self.best_epoch = cur_epoch

@@ -60,6 +60,9 @@ parser.add_argument("--eval_start_epoch", type=int, default=40,
                    help="Evaluation start epoch when run_eval is True, default is 40.")
parser.add_argument("--eval_interval", type=int, default=1,
                    help="Evaluation interval when run_eval is True, default is 1.")
parser.add_argument('--enable_cache', type=ast.literal_eval, default=False,
                    help='Caching the eval dataset in memory to speedup evaluation, default is False.')
parser.add_argument('--cache_session_id', type=str, default="", help='The session id for cache service.')
args_opt = parser.parse_args()

set_seed(1)

@@ -239,10 +242,11 @@ if __name__ == '__main__':
        if args_opt.eval_dataset_path is None or (not os.path.isdir(args_opt.eval_dataset_path)):
            raise ValueError("{} is not an existing path.".format(args_opt.eval_dataset_path))
        eval_dataset = create_dataset(dataset_path=args_opt.eval_dataset_path, do_train=False,
                                      batch_size=config.batch_size, target=target)
                                      batch_size=config.batch_size, target=target, enable_cache=args_opt.enable_cache,
                                      cache_session_id=args_opt.cache_session_id)
        eval_param_dict = {"model": model, "dataset": eval_dataset, "metrics_name": "acc"}
        eval_cb = EvalCallBack(apply_eval, eval_param_dict, interval=args_opt.eval_interval,
                               eval_start_epoch=args_opt.eval_start_epoch, save_best_ckpt=True,
                               eval_start_epoch=args_opt.eval_start_epoch, save_best_ckpt=args_opt.save_best_ckpt,
                               ckpt_directory=ckpt_save_dir, besk_ckpt_name="best_acc.ckpt",
                               metrics_name="acc")
        cb += [eval_cb]
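
Taken together, a direct invocation of train.py with caching enabled would look roughly like the sketch below; the dataset paths are hypothetical placeholders:

```bash
# create a cache session and pass its id straight to train.py (paths are placeholders)
session_id=$(cache_admin -g | awk 'END {print $NF}')
python train.py --net=resnet50 --dataset=imagenet2012 --dataset_path=/data/imagenet/train \
    --run_eval=True --eval_dataset_path=/data/imagenet/val \
    --enable_cache=True --cache_session_id=$session_id
```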