PR to master #8
README.md

@@ -69,6 +69,8 @@ MindSpore offers build options across multiple backends:
 | | Ubuntu-aarch64 | ✔️ |
 | | EulerOS-x86 | ✔️ |
 | | EulerOS-aarch64 | ✔️ |
+| | CentOS-x86 | ✔️ |
+| | CentOS-aarch64 | ✔️ |
 | GPU CUDA 10.1 | Ubuntu-x86 | ✔️ |
 | CPU | Ubuntu-x86 | ✔️ |
 | | Ubuntu-aarch64 | ✔️ |

@@ -79,7 +81,7 @@ For installation using `pip`, take `CPU` and `Ubuntu-x86` build version as an example:
 1. Download whl from [MindSpore download page](https://www.mindspore.cn/versions/en), and install the package.

    ```
-   pip install https://ms-release.obs.cn-north-4.myhuaweicloud.com/0.7.0-beta/MindSpore/cpu/ubuntu_x86/mindspore-0.7.0-cp37-cp37m-linux_x86_64.whl
+   pip install https://ms-release.obs.cn-north-4.myhuaweicloud.com/1.0.0/MindSpore/cpu/ubuntu_x86/mindspore-1.0.0-cp37-cp37m-linux_x86_64.whl
    ```

 2. Run the following command to verify the install.
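The verification snippet itself falls outside this hunk. As a minimal smoke test of the new 1.0.0 wheel (an illustrative sketch, not the README's exact code; `TensorAdd` is the 1.0-era elementwise add op):

```python
# Hypothetical post-install check for the CPU wheel; not taken from this diff.
import numpy as np

import mindspore
import mindspore.context as context
import mindspore.ops.operations as P
from mindspore import Tensor

context.set_context(device_target="CPU")
print(mindspore.__version__)  # expect 1.0.0 after this PR

x = Tensor(np.ones([1, 3, 3, 4]).astype(np.float32))
y = Tensor(np.ones([1, 3, 3, 4]).astype(np.float32))
print(P.TensorAdd()(x, y))    # a 1x3x3x4 tensor of 2.0s means the install works
```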
@@ -136,8 +138,8 @@ currently the containerized build options are supported as follows:

 For `CPU` backend, you can directly pull and run the latest stable image using the below command:
    ```
-   docker pull mindspore/mindspore-cpu:0.7.0-beta
-   docker run -it mindspore/mindspore-cpu:0.7.0-beta /bin/bash
+   docker pull mindspore/mindspore-cpu:1.0.0
+   docker run -it mindspore/mindspore-cpu:1.0.0 /bin/bash
    ```

 * GPU

@@ -170,8 +172,8 @@ currently the containerized build options are supported as follows:
    ```
 Then you can pull and run the latest stable image using the below command:
    ```
-   docker pull mindspore/mindspore-gpu:0.7.0-beta
-   docker run -it --runtime=nvidia --privileged=true mindspore/mindspore-gpu:0.7.0-beta /bin/bash
+   docker pull mindspore/mindspore-gpu:1.0.0
+   docker run -it --runtime=nvidia --privileged=true mindspore/mindspore-gpu:1.0.0 /bin/bash
    ```

 To test if the docker image works, please execute the python code below and check the output:
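The test code is likewise outside this hunk; the same kind of smoke test as above, pointed at the GPU, is what the README intends here (an illustrative sketch to run inside the container):

```python
# Hypothetical check inside the mindspore/mindspore-gpu:1.0.0 container.
import numpy as np

import mindspore.context as context
import mindspore.ops.operations as P
from mindspore import Tensor

context.set_context(device_target="GPU")
x = Tensor(np.ones([1, 3, 3, 4]).astype(np.float32))
y = Tensor(np.ones([1, 3, 3, 4]).astype(np.float32))
print(P.TensorAdd()(x, y))  # expect a 1x3x3x4 tensor of 2.0s
```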
README_CN.md

@@ -66,6 +66,8 @@ MindSpore offers build options across multiple backends:
 | | Ubuntu-aarch64 | ✔️ |
 | | EulerOS-x86 | ✔️ |
 | | EulerOS-aarch64 | ✔️ |
+| | CentOS-x86 | ✔️ |
+| | CentOS-aarch64 | ✔️ |
 | GPU CUDA 10.1 | Ubuntu-x86 | ✔️ |
 | CPU | Ubuntu-x86 | ✔️ |
 | | Ubuntu-aarch64 | ✔️ |

@@ -76,7 +78,7 @@ MindSpore offers build options across multiple backends:
 1. Download the whl package from the [MindSpore download page](https://www.mindspore.cn/versions) and install it.

    ```
-   pip install https://ms-release.obs.cn-north-4.myhuaweicloud.com/0.7.0-beta/MindSpore/cpu/ubuntu_x86/mindspore-0.7.0-cp37-cp37m-linux_x86_64.whl
+   pip install https://ms-release.obs.cn-north-4.myhuaweicloud.com/1.0.0/MindSpore/cpu/ubuntu_x86/mindspore-1.0.0-cp37-cp37m-linux_x86_64.whl
    ```

 2. Run the following command to verify the installation.

@@ -132,8 +134,8 @@ MindSpore Docker images are hosted on [Docker Hub](https://hub.docker.com/r/mindspore

 For the `CPU` backend, you can directly pull and run the latest stable image with the following command:
    ```
-   docker pull mindspore/mindspore-cpu:0.7.0-beta
-   docker run -it mindspore/mindspore-cpu:0.7.0-beta /bin/bash
+   docker pull mindspore/mindspore-cpu:1.0.0
+   docker run -it mindspore/mindspore-cpu:1.0.0 /bin/bash
    ```

 * GPU

@@ -166,8 +168,8 @@ MindSpore Docker images are hosted on [Docker Hub](https://hub.docker.com/r/mindspore
    ```
 Pull and run the latest stable image with the following command:
    ```
-   docker pull mindspore/mindspore-gpu:0.7.0-beta
-   docker run -it --runtime=nvidia --privileged=true mindspore/mindspore-gpu:0.7.0-beta /bin/bash
+   docker pull mindspore/mindspore-gpu:1.0.0
+   docker run -it --runtime=nvidia --privileged=true mindspore/mindspore-gpu:1.0.0 /bin/bash
    ```

 To test whether Docker works correctly, run the Python code below and check the output:
RELEASE.md

@@ -1,3 +1,112 @@
+# Release 1.0.0
+
+## Major Features and Improvements
+
+### MindSpore Training and Inference Framework
+
+#### Ascend 910
+
+* New models
+    * DenseNet121: a dense convolutional neural network that connects each layer to every other layer in a feed-forward fashion, for object recognition on the ImageNet dataset.
+    * UNet2D-Medical: U-Net model for 2D medical image segmentation (Convolutional Networks for Biomedical Image Segmentation), on the ISBI Challenge database.
+* Frontend and user interface
+    * Second-order optimization
+        * Enable second-order optimization for BERT on Ascend 910, which achieves a masked-LM accuracy of 71.3% in 800 seconds using 8 Ascend 910 devices (BERT-Large @MLPerf v0.7 dataset).
+    * New GNN model BGCF
+        * Bayesian Graph Convolutional Filtering network, which naturally incorporates the uncertainty in the user-item interaction graph and shows excellent recommendation performance on the Amazon-Beauty dataset.
+    * Add an append interface for SequentialCell.
+    * Add an `auto` level for AMP.
+* Executor and performance optimization
+    * Support quantized networks (ResNet50, YOLOv3, and MobileNetV2).
+    * Ease-of-use improvements: faster project compilation, regularized CMakeLists, and compilation/installation made independent of cuDNN and CUDA.
+* Data processing, augmentation, and save format
+    * Support GeneratorDataset returning string types (a short sketch follows this subsection).
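A rough sketch of the new string support (the generator and column name below are illustrative, not from this diff):

```python
# Illustrative: GeneratorDataset can now yield string columns.
import mindspore.dataset as ds

def gen():
    for i in range(3):
        yield ("sample-{}".format(i),)  # a Python string per row, previously unsupported

data = ds.GeneratorDataset(gen, column_names=["text"])
for row in data.create_dict_iterator(output_numpy=True):
    print(row["text"])
```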
+#### Other Hardware Support
+
+* GPU platform
+    * Enable second-order optimization for ResNet50 on GPU, achieving a 30% improvement in training time compared to SGD with momentum (ResNet50 @ImageNet).
+* CPU platform
+    * ...
+
+#### User interface change log
+
+* Remove global object GradOperation in Autodiff ([!5011](https://gitee.com/mindspore/mindspore/pulls/5011))
+* Remove useless attribute 'name' in Autodiff ([!5172](https://gitee.com/mindspore/mindspore/pulls/5172))
+* Rectify distributed init ([!5350](https://gitee.com/mindspore/mindspore/pulls/5350))
+* Move the setting of ParallelMode from train.parallel_utils to context ([!5351](https://gitee.com/mindspore/mindspore/pulls/5351))
+* Modify save_checkpoint ([!5482](https://gitee.com/mindspore/mindspore/pulls/5482))
+* Wrap the numpy random seed into an API ([!5634](https://gitee.com/mindspore/mindspore/pulls/5634))
+* Delete enable_fused_layernorm in some model zoo scripts ([!5665](https://gitee.com/mindspore/mindspore/pulls/5665))
+* Move the 'multi-subgraphs' interface to internal ([!5696](https://gitee.com/mindspore/mindspore/pulls/5696))
+* Rename mirror_mean to gradient_mean ([!5700](https://gitee.com/mindspore/mindspore/pulls/5700))
+* Remove the default value of 'group' of DepthWiseConv2d ([!5865](https://gitee.com/mindspore/mindspore/pulls/5865))
+* Modify the interface for function and remove the duplicated def ([!5958](https://gitee.com/mindspore/mindspore/pulls/5958))
+* Unify Conv2d and DepthwiseConv2d ([!5916](https://gitee.com/mindspore/mindspore/pulls/5916))
+* Modify SoftmaxCrossEntropyWithLogits ([!5502](https://gitee.com/mindspore/mindspore/pulls/5502))
+* Change API set_strategy() to shard() ([!5991](https://gitee.com/mindspore/mindspore/pulls/5991)) (see the migration sketch after this list)
+* Move batch_size from bert_net_cfg to cfg ([!6233](https://gitee.com/mindspore/mindspore/pulls/6233))
+* Remove unused parameters from SummaryRecord __init__ ([!5548](https://gitee.com/mindspore/mindspore/pulls/5548))
+* Remove the sens parameter of TrainOneStepWithLossScaleCell ([!5753](https://gitee.com/mindspore/mindspore/pulls/5753))
+* Optimize TrainOneStepCell for user-defined extension ([!6159](https://gitee.com/mindspore/mindspore/pulls/6159))
+* Delete seed0 and seed1 of nn.Dropout ([!5735](https://gitee.com/mindspore/mindspore/pulls/5735))
+* Delete DataWrapper ([!6101](https://gitee.com/mindspore/mindspore/pulls/6101))
+* LSTM API optimization ([!6374](https://gitee.com/mindspore/mindspore/pulls/6374))
+* Merge P/C/F of ops ([!5645](https://gitee.com/mindspore/mindspore/pulls/5645))
+* Delete the SoftmaxCrossEntropyExpand interface ([!6607](https://gitee.com/mindspore/mindspore/pulls/6607))
+* Adjust the GroupNorm interface ([!6329](https://gitee.com/mindspore/mindspore/pulls/6329))
+* Modify the init interface to an internal interface ([!6651](https://gitee.com/mindspore/mindspore/pulls/6651))
+* Log optimization ([!5842](https://gitee.com/mindspore/mindspore/pulls/5842))
+* Remove the useless API dataset.set_dataset_size ([!5806](https://gitee.com/mindspore/mindspore/pulls/5806))
+* Add a usage parameter to some Dataset APIs ([!5605](https://gitee.com/mindspore/mindspore/pulls/5605))
+* Change the import path, e.g. from mindspore.dataset.transforms.vision to mindspore.dataset.vision.transforms ([!5384](https://gitee.com/mindspore/mindspore/pulls/5384))
+* Rename ImageFolderDatasetV2 to ImageFolderDataset ([!5384](https://gitee.com/mindspore/mindspore/pulls/5384))
+* Dataset.map parameter optimization ([!5384](https://gitee.com/mindspore/mindspore/pulls/5384))
+* Add new API dataset.get_col_names ([!5384](https://gitee.com/mindspore/mindspore/pulls/5384))
+* Remove the useless API MindRecord finish ([!5580](https://gitee.com/mindspore/mindspore/pulls/5580))
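Several of these renames hit user scripts directly. A hedged migration sketch (the API names come from the items above; the paths and the parallel strategy are illustrative):

```python
# Illustrative 0.7 -> 1.0 migration; placeholder values throughout.
import mindspore.nn as nn
import mindspore.ops.operations as P

# Import-path change (!5384): mindspore.dataset.transforms.vision -> mindspore.dataset.vision
import mindspore.dataset.vision.c_transforms as vision  # was mindspore.dataset.transforms.vision.c_transforms
import mindspore.dataset as ds

# ImageFolderDatasetV2 -> ImageFolderDataset (!5384); the path is a placeholder.
# dataset = ds.ImageFolderDataset("/path/to/images")

# nn.Dropout loses seed0/seed1 (!5735).
drop = nn.Dropout(keep_prob=0.9)  # was nn.Dropout(keep_prob=0.9, seed0=1, seed1=1)

# set_strategy() -> shard() (!5991); the strategy tuple here is illustrative.
matmul = P.MatMul().shard(((1, 1), (1, 1)))  # was P.MatMul().set_strategy(...)
```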
+### MindSpore Lite
+
+* Converter
+    * Add 6 TFLite ops, 7 Caffe ops, and 1 ONNX op.
+    * Add support for Windows.
+    * Support parallel inference of multiple sessions to adapt to more scenarios.
+    * Support 8-bit weight-only quantization; most mainstream models show a small accuracy loss (less than 0.5%) compared to the non-quantized FP32 model.
+
+* CPU & GPU
+    * Add 20 CPU ops, including FP32, int8/uint8, FP16, and int32 ops.
+    * Add FP16 support for GPU, and add 14 GPU ops covering FP32/FP16.
+    * Add Buffer/Image2D transform ops for GPU.
+    * Performance optimization of CPU ops, focused on ARM32.
+    * Performance optimization of GPU convolution using Winograd.
+
+* Tool & example
+    * Add an object detection Android demo.
+
+## Bugfixes
+
+* Models
+    * Fix the constant folding problem in multiply. ([!6092](https://gitee.com/mindspore/mindspore/pulls/6092))
+    * Move batch_size from bert_net_cfg to cfg in BERT scripts. ([!6233](https://gitee.com/mindspore/mindspore/pulls/6233))
+    * Modify the checkpoint file path. ([!6137](https://gitee.com/mindspore/mindspore/pulls/6137))
+* Python API
+    * Fix semi-auto-parallel handling when the parameter of reshape has another user. ([!5722](https://gitee.com/mindspore/mindspore/pulls/5722))
+    * Raise ValueError when a hook function is called in graph mode. ([!5831](https://gitee.com/mindspore/mindspore/pulls/5831))
+* Executor
+    * Fix PyNative mode building of temporary nn objects. ([!6189](https://gitee.com/mindspore/mindspore/pulls/6189))
+    * Fix the accuracy problem with multiple inputs to the multi-card communication operator broadcast. ([!6522](https://gitee.com/mindspore/mindspore/pulls/5622))
+    * Fix the sampling distribution interface categorical not supporting graph mode. ([!5772](https://gitee.com/mindspore/mindspore/pulls/5772))
+    * Fix the random seed failure of the polynomial downsampling distribution operator. ([!5948](https://gitee.com/mindspore/mindspore/pulls/5948))
+    * Fix unnecessary address binding issues in GPU heterogeneous scenarios. ([!6232](https://gitee.com/mindspore/mindspore/pulls/6232))
+* GPU platform
+    * Fix a kernel resource leak. ([!5315](https://gitee.com/mindspore/mindspore/pulls/5315))
+    * Fix insufficient memory during continuous unit test runs. ([!5617](https://gitee.com/mindspore/mindspore/pulls/5617))
+    * Fix a memory leak in the sparse slicer. ([!5578](https://gitee.com/mindspore/mindspore/pulls/5578))
+* Data processing
+    * Fix a hang when using pyfunc. ([!6346](https://gitee.com/mindspore/mindspore/pulls/6346))
+    * Fix the GPU device queue not releasing the GIL during resource cleanup. ([!5964](https://gitee.com/mindspore/mindspore/pulls/5964))
+    * Fix a hang if the script exits abnormally. ([!6441](https://gitee.com/mindspore/mindspore/pulls/6441))
+
+## Contributors
+
+Thanks go to these wonderful people:
+
+Adel, AGroupofProbiotocs, anthonyaje, anzhengqi, askmiao, baihuawei, baiyangfan, bai-yangfan, bingyaweng, BowenK, buxue, caifubi, CaoJian, caojian05, caozhou, Cathy, changzherui, chenfei, chengxianbin, chenhaozhe, chenjianping, chenzomi, chenzupeng, chujinjin, cj, cjh9368, Corleone, danish, Danish, dayschan, eric, Eric, fary86, fuzhiye, Gaoxiong, gengdongjie, gongdaguo, gukecai, guoqi, gzhcv, hangq, hanhuifeng2020, Harshvardhan, He, heleiwang, hexia, Hoai, HuangBingjian, huangdongrun, huanghui, huangxinjing, huzhifeng, hwjiaorui, Jesse, jianghui58, jiangzhiwen, Jiaqi, jin-xiulang, jinyaohui, jjfeing, John, Jonathan, jonyguo, jzg, kai00, kingfo, kingxian, kpy, kswang, laiyongqiang, leonwanghui, Li, liangchenghui, liangzelang, lichen_101010, lichenever, lihongkang, lilei, limingqi107, ling, linqingke, liubuyu, liuwenhao4, liuxiao78, liuxiao93, liuyang_655, liuzhongkai, Lixia, lixian, liyanliu, liyong, lizhenyu, luoyang, lvchangquan, lvliang, lz, mahdi, Mahdi, maning202007, Margaret_wangrui, mayang, mengyuanli, nhussain, ougongchang, panfengfeng, panyifeng, Payne, Peilin, peixu_ren, Pengyongrong, qianlong, r1chardf1d0, riemann_penn, root, Sheng, shenwei41, simson, Simson, Su, sunsuodong, tao_yunhao, tinazhang, VectorSL, Wan, wandongdong, wangdongxu, wangmin, wangnan39@huawei.com, wangyue01, wangzhe, wanyiming, Wei, wenchunjiang, wilfChen, WilliamLian, wsc, wukesong, wuweikang, wuxuejian, Xiaoda, xiefangqi, xuanyue, xulei2020, Xun, xuyongfei, yanghaitao, yanghaitao1, yanghaoran, YangLuo, yangruoqi713, yankai, yanzhenxiang2020, yao_yf, yepei6, yeyunpeng, Yi, yoni, yoonlee666, yuchaojie, yujianfeng, yuximiao, zengzitao, Zhang, zhanghaibo5@huawei.com, zhanghuiyao, zhangyihui, zhangz0911gm, zhanke, zhanyuan, zhaodezan, zhaojichen, zhaoting, zhaozhenlong, zhengjun10, zhoufeng, zhousiyi, zhouyaqiang, Zichun, Zirui, Ziyan, zjun, ZPaC
+
+Contributions of any kind are welcome!
+
 # Release 0.7.0-beta

 ## Major Features and Improvements
akg

@@ -1 +1 @@
-Subproject commit 4d897c23fc41c5f7013efd0c517796233671518a
+Subproject commit 6c492a8c9d9730ad11ffc5481cc532ae500b0da5
build.bat

@@ -16,12 +16,13 @@
 @title mindspore_build

 SET BASEPATH=%CD%
-IF NOT EXIST "%BASEPATH%/build" (
+SET BUILD_PATH=%BASEPATH%/build
+
+IF NOT EXIST "%BUILD_PATH%" (
     md "build"
 )

-cd %BASEPATH%/build
-set BUILD_PATH=%CD%
+cd %BUILD_PATH%

 IF NOT EXIST "%BUILD_PATH%/mindspore" (
     md "mindspore"

@@ -38,7 +39,7 @@ IF "%1%" == "lite" (
     call :run_cmake
     IF errorlevel 1 (
         echo "cmake fail."
-        call :run_fail
+        goto run_fail
     )
 ) ELSE (
     call :gene_protobuf

@@ -53,7 +54,7 @@ IF "%1%" == "lite" (
     )
     IF errorlevel 1 (
         echo "build fail."
-        call :run_fail
+        goto run_fail
     ) ELSE (
         cd %BASEPATH%/output
         rd /s /q _CPack_Packages

@@ -63,7 +64,7 @@ IF "%1%" == "lite" (
         -G "CodeBlocks - MinGW Makefiles" ../..
     IF NOT %errorlevel% == 0 (
         echo "cmake fail."
-        call :run_fail
+        goto run_fail
     )

     IF "%1%" == "" (

@@ -73,7 +74,7 @@ IF "%1%" == "lite" (
     )
     IF NOT %errorlevel% == 0 (
         echo "build fail."
-        call :run_fail
+        goto run_fail
     )
 )

@@ -82,10 +83,14 @@ cd %BASEPATH%
 goto run_eof

 :run_cmake
+    set VERSION_MAJOR=1
+    set VERSION_MINOR=0
+    set VERSION_REVISION=0
+    echo "======Start building MindSpore Lite %VERSION_MAJOR%.%VERSION_MINOR%.%VERSION_REVISION%======"
     cd %BUILD_PATH%/mindspore
     cmake -DBUILD_DEVICE=on -DBUILD_CONVERTER=on -DPLATFORM_ARM64=off -DSUPPORT_TRAIN=off ^
     -DCMAKE_BUILD_TYPE=Release -DSUPPORT_GPU=off -DBUILD_MINDDATA=off -DOFFLINE_COMPILE=off ^
-    -DMS_VERSION_MAJOR=0 -DMS_VERSION_MINOR=7 -DMS_VERSION_REVISION=0 ^
+    -DMS_VERSION_MAJOR=%VERSION_MAJOR% -DMS_VERSION_MINOR=%VERSION_MINOR% -DMS_VERSION_REVISION=%VERSION_REVISION% ^
     -G "CodeBlocks - MinGW Makefiles" "%BASEPATH%/mindspore/lite"
 GOTO:EOF

@@ -123,6 +128,5 @@ GOTO:EOF
 :run_fail
     cd %BASEPATH%
     set errorlevel=1
-    EXIT

 :run_eof
build.sh

@@ -25,7 +25,7 @@ usage()
     echo "bash build.sh [-d] [-r] [-v] [-c on|off] [-t on|off] [-g on|off] [-h] [-b ge] [-m infer|train] \\"
     echo "              [-a on|off] [-p on|off] [-i] [-L] [-R] [-D on|off] [-j[n]] [-e gpu|d|cpu] \\"
     echo "              [-P on|off] [-z [on|off]] [-M on|off] [-V 9.2|10.1] [-I arm64|arm32|x86_64] [-K] \\"
-    echo "              [-B on|off] [-w on|off] [-E] [-l on|off] [-n full|lite|off] [-T on|off]"
+    echo "              [-B on|off] [-w on|off] [-E] [-l on|off] [-n full|lite|off]"
     echo ""
     echo "Options:"
     echo "    -d Debug mode"

@@ -58,7 +58,6 @@ usage()
     echo "    -B Enable debugger, default on"
     echo "    -E Enable IBVERBS for parameter server, default off"
     echo "    -l Compile with python dependency, default on"
-    echo "    -T Enable on-device training, default off"
 }

 # check value of input is 'on' or 'off'

@@ -237,7 +236,7 @@ checkopts()
             ;;
         z)
             eval ARG=\$\{$OPTIND\}
-            if [[ -n $ARG && $ARG != -* ]]; then
+            if [[ -n "$ARG" && "$ARG" != -* ]]; then
                 OPTARG="$ARG"
                 check_on_off $OPTARG z
                 OPTIND=$((OPTIND + 1))
@@ -10,10 +10,10 @@ if (CMAKE_SYSTEM_NAME MATCHES "Windows")
         MD5 17757c84f49edd42d34ae8c9288110a1)
 else()
     mindspore_add_pkg(onednn
-        VER 1.5
+        VER 1.7
         LIBS dnnl mkldnn
-        URL https://github.com/oneapi-src/oneDNN/archive/v1.5.tar.gz
-        MD5 5d97e0e8f4c0b37da5f524533b7a644b
+        URL https://github.com/oneapi-src/oneDNN/archive/v1.7-rc.tar.gz
+        MD5 f5e3ce761f6521e235817299f882b965
         CMAKE_OPTION -DDNNL_ARCH_OPT_FLAGS='' -DDNNL_CPU_RUNTIME='SEQ' -DDNNL_BUILD_EXAMPLES=OFF -DDNNL_BUILD_TESTS=OFF)
 endif()
@@ -56,7 +56,7 @@ if (PLATFORM_ARM64)
     install(FILES ${TOP_DIR}/mindspore/lite/build/src/libmindspore-lite.so DESTINATION ${LIB_DIR} COMPONENT ${COMPONENT_NAME})
     install(FILES ${TOP_DIR}/mindspore/lite/build/src/libmindspore-lite.a DESTINATION ${LIB_DIR} COMPONENT ${COMPONENT_NAME})
     install(FILES ${TOP_DIR}/mindspore/core/ir/dtype/type_id.h DESTINATION ${INC_DIR}/ir/dtype COMPONENT ${COMPONENT_NAME})
-    install(DIRECTORY ${TOP_DIR}/mindspore/lite/include/ DESTINATION ${INC_DIR} COMPONENT ${COMPONENT_NAME} FILES_MATCHING PATTERN "*.h")
+    install(DIRECTORY ${TOP_DIR}/mindspore/lite/include/ DESTINATION ${INC_DIR} COMPONENT ${COMPONENT_NAME} FILES_MATCHING PATTERN "*.h" PATTERN "train_model.h" EXCLUDE PATTERN "train_session.h" EXCLUDE)
     install(DIRECTORY ${TOP_DIR}/mindspore/lite/schema/ DESTINATION ${INC_DIR}/schema COMPONENT ${COMPONENT_NAME} FILES_MATCHING PATTERN "*.h" PATTERN "inner" EXCLUDE)
     install(FILES ${TOP_DIR}/mindspore/lite/build/src/libmindspore-lite-optimize.so DESTINATION ${LIB_DIR} COMPONENT ${COMPONENT_NAME})
     install(FILES ${TOP_DIR}/mindspore/lite/build/src/libmindspore-lite-fp16.so DESTINATION ${LIB_DIR} COMPONENT ${COMPONENT_NAME})

@@ -65,7 +65,7 @@ elseif (PLATFORM_ARM32)
     install(FILES ${TOP_DIR}/mindspore/lite/build/src/libmindspore-lite.so DESTINATION ${LIB_DIR} COMPONENT ${COMPONENT_NAME})
     install(FILES ${TOP_DIR}/mindspore/lite/build/src/libmindspore-lite.a DESTINATION ${LIB_DIR} COMPONENT ${COMPONENT_NAME})
     install(FILES ${TOP_DIR}/mindspore/core/ir/dtype/type_id.h DESTINATION ${INC_DIR}/ir/dtype COMPONENT ${COMPONENT_NAME})
-    install(DIRECTORY ${TOP_DIR}/mindspore/lite/include/ DESTINATION ${INC_DIR} COMPONENT ${COMPONENT_NAME} FILES_MATCHING PATTERN "*.h")
+    install(DIRECTORY ${TOP_DIR}/mindspore/lite/include/ DESTINATION ${INC_DIR} COMPONENT ${COMPONENT_NAME} FILES_MATCHING PATTERN "*.h" PATTERN "train_model.h" EXCLUDE PATTERN "train_session.h" EXCLUDE)
     install(DIRECTORY ${TOP_DIR}/mindspore/lite/schema/ DESTINATION ${INC_DIR}/schema COMPONENT ${COMPONENT_NAME} FILES_MATCHING PATTERN "*.h" PATTERN "inner" EXCLUDE)
     install(DIRECTORY ${TOP_DIR}/third_party/flatbuffers/include DESTINATION ${FLATBF_DIR} COMPONENT ${COMPONENT_NAME})
 elseif (CMAKE_SYSTEM_NAME MATCHES "Windows")

@@ -75,7 +75,7 @@ elseif (CMAKE_SYSTEM_NAME MATCHES "Windows")
     install(FILES ${LIB_LIST} DESTINATION ${TOP_DIR}/build/mindspore/package COMPONENT ${COMPONENT_NAME})
     install(FILES ${TOP_DIR}/build/mindspore/tools/converter/libconverter_parser.a DESTINATION ${TOP_DIR}/build/mindspore/package COMPONENT ${PARSER_NAME})
 else ()
-    install(DIRECTORY ${TOP_DIR}/mindspore/lite/include/ DESTINATION ${INC_DIR_RUN_X86} COMPONENT ${RUN_X86_COMPONENT_NAME} FILES_MATCHING PATTERN "*.h")
+    install(DIRECTORY ${TOP_DIR}/mindspore/lite/include/ DESTINATION ${INC_DIR_RUN_X86} COMPONENT ${RUN_X86_COMPONENT_NAME} FILES_MATCHING PATTERN "*.h" PATTERN "train_model.h" EXCLUDE PATTERN "train_session.h" EXCLUDE)
     install(DIRECTORY ${TOP_DIR}/mindspore/lite/schema/ DESTINATION ${INC_DIR_RUN_X86}/schema COMPONENT ${RUN_X86_COMPONENT_NAME} FILES_MATCHING PATTERN "*.h" PATTERN "inner" EXCLUDE)
     install(FILES ${TOP_DIR}/mindspore/core/ir/dtype/type_id.h DESTINATION ${INC_DIR_RUN_X86}/ir/dtype COMPONENT ${RUN_X86_COMPONENT_NAME})
     install(DIRECTORY ${TOP_DIR}/third_party/flatbuffers/include DESTINATION ${FLATBF_DIR_RUN_X86} COMPONENT ${RUN_X86_COMPONENT_NAME})
@@ -0,0 +1,67 @@
+FROM ubuntu:18.04
+
+MAINTAINER leonwanghui <leon.wanghui@huawei.com>
+
+# Set env
+ENV PYTHON_ROOT_PATH /usr/local/python-3.7.5
+ENV PATH /usr/local/bin:$PATH
+
+# Install base tools
+RUN apt update \
+    && DEBIAN_FRONTEND=noninteractive apt install -y \
+    vim \
+    wget \
+    curl \
+    xz-utils \
+    net-tools \
+    openssh-client \
+    git \
+    ntpdate \
+    tzdata \
+    tcl \
+    sudo \
+    bash-completion
+
+# Install compile tools
+RUN DEBIAN_FRONTEND=noninteractive apt install -y \
+    gcc \
+    g++ \
+    zlibc \
+    make \
+    libgmp-dev \
+    patch \
+    autoconf \
+    libtool \
+    automake \
+    flex
+
+# Set bash
+RUN echo "dash dash/sh boolean false" | debconf-set-selections
+RUN DEBIAN_FRONTEND=noninteractive dpkg-reconfigure dash
+
+# Install python (v3.7.5)
+RUN apt install -y libffi-dev libssl-dev zlib1g-dev libbz2-dev libncurses5-dev \
+    libgdbm-dev libgdbm-compat-dev liblzma-dev libreadline-dev libsqlite3-dev \
+    && cd /tmp \
+    && wget https://github.com/python/cpython/archive/v3.7.5.tar.gz \
+    && tar -xvf v3.7.5.tar.gz \
+    && cd /tmp/cpython-3.7.5 \
+    && mkdir -p ${PYTHON_ROOT_PATH} \
+    && ./configure --prefix=${PYTHON_ROOT_PATH} \
+    && make -j4 \
+    && make install -j4 \
+    && rm -f /usr/local/bin/python \
+    && rm -f /usr/local/bin/pip \
+    && ln -s ${PYTHON_ROOT_PATH}/bin/python3.7 /usr/local/bin/python \
+    && ln -s ${PYTHON_ROOT_PATH}/bin/pip3.7 /usr/local/bin/pip \
+    && rm -rf /tmp/cpython-3.7.5 \
+    && rm -f /tmp/v3.7.5.tar.gz
+
+# Set pip source
+RUN mkdir -pv /root/.pip \
+    && echo "[global]" > /root/.pip/pip.conf \
+    && echo "trusted-host=mirrors.aliyun.com" >> /root/.pip/pip.conf \
+    && echo "index-url=http://mirrors.aliyun.com/pypi/simple/" >> /root/.pip/pip.conf
+
+# Install MindSpore cpu whl package
+RUN pip install --no-cache-dir https://ms-release.obs.cn-north-4.myhuaweicloud.com/1.0.0/MindSpore/cpu/ubuntu_x86/mindspore-1.0.0-cp37-cp37m-linux_x86_64.whl
@@ -0,0 +1,81 @@
+FROM nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
+
+MAINTAINER leonwanghui <leon.wanghui@huawei.com>
+
+# Set env
+ENV PYTHON_ROOT_PATH /usr/local/python-3.7.5
+ENV OMPI_ROOT_PATH /usr/local/openmpi-3.1.5
+ENV PATH ${OMPI_ROOT_PATH}/bin:/usr/local/bin:$PATH
+ENV LD_LIBRARY_PATH ${OMPI_ROOT_PATH}/lib:$LD_LIBRARY_PATH
+
+# Install base tools
+RUN apt update \
+    && DEBIAN_FRONTEND=noninteractive apt install -y \
+    vim \
+    wget \
+    curl \
+    xz-utils \
+    net-tools \
+    openssh-client \
+    git \
+    ntpdate \
+    tzdata \
+    tcl \
+    sudo \
+    bash-completion
+
+# Install compile tools
+RUN DEBIAN_FRONTEND=noninteractive apt install -y \
+    gcc \
+    g++ \
+    zlibc \
+    make \
+    libgmp-dev \
+    patch \
+    autoconf \
+    libtool \
+    automake \
+    flex
+
+# Set bash
+RUN echo "dash dash/sh boolean false" | debconf-set-selections
+RUN DEBIAN_FRONTEND=noninteractive dpkg-reconfigure dash
+
+# Install python (v3.7.5)
+RUN apt install -y libffi-dev libssl-dev zlib1g-dev libbz2-dev libncurses5-dev \
+    libgdbm-dev libgdbm-compat-dev liblzma-dev libreadline-dev libsqlite3-dev \
+    && cd /tmp \
+    && wget https://github.com/python/cpython/archive/v3.7.5.tar.gz \
+    && tar -xvf v3.7.5.tar.gz \
+    && cd /tmp/cpython-3.7.5 \
+    && mkdir -p ${PYTHON_ROOT_PATH} \
+    && ./configure --prefix=${PYTHON_ROOT_PATH} \
+    && make -j4 \
+    && make install -j4 \
+    && rm -f /usr/local/bin/python \
+    && rm -f /usr/local/bin/pip \
+    && ln -s ${PYTHON_ROOT_PATH}/bin/python3.7 /usr/local/bin/python \
+    && ln -s ${PYTHON_ROOT_PATH}/bin/pip3.7 /usr/local/bin/pip \
+    && rm -rf /tmp/cpython-3.7.5 \
+    && rm -f /tmp/v3.7.5.tar.gz
+
+# Set pip source
+RUN mkdir -pv /root/.pip \
+    && echo "[global]" > /root/.pip/pip.conf \
+    && echo "trusted-host=mirrors.aliyun.com" >> /root/.pip/pip.conf \
+    && echo "index-url=http://mirrors.aliyun.com/pypi/simple/" >> /root/.pip/pip.conf
+
+# Install openmpi (v3.1.5)
+RUN cd /tmp \
+    && wget https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.5.tar.gz \
+    && tar -xvf openmpi-3.1.5.tar.gz \
+    && cd /tmp/openmpi-3.1.5 \
+    && mkdir -p ${OMPI_ROOT_PATH} \
+    && ./configure --prefix=${OMPI_ROOT_PATH} \
+    && make -j4 \
+    && make install -j4 \
+    && rm -rf /tmp/openmpi-3.1.5 \
+    && rm -f /tmp/openmpi-3.1.5.tar.gz
+
+# Install MindSpore cuda-10.1 whl package
+RUN pip install --no-cache-dir https://ms-release.obs.cn-north-4.myhuaweicloud.com/1.0.0/MindSpore/gpu/ubuntu_x86/cuda-10.1/mindspore_gpu-1.0.0-cp37-cp37m-linux_x86_64.whl
@@ -1 +1 @@
-Subproject commit 4f6ebe0f92da8c758b1b71a2bc7ff33f9344e18a
+Subproject commit 14db109491bc81473905a5eb9e82f6234aca419b
@@ -74,7 +74,11 @@ class InferTensorBase {
     if (data_len == 0) {
       return true;
     }
-    memcpy_s(mutable_data(), data_size(), data, data_len);
+    auto ret = memcpy_s(mutable_data(), data_size(), data, data_len);
+    if (ret != 0) {
+      MSI_LOG_ERROR << "Set data memcpy_s failed";
+      return false;
+    }
     return true;
   }
@@ -81,7 +81,6 @@ class MS_API InferSession {
   }
   static std::shared_ptr<InferSession> CreateSession(const std::string &device, uint32_t device_id);
 };

 }  // namespace inference
 }  // namespace mindspore
 #endif  // MINDSPORE_INCLUDE_MS_SESSION_H
@@ -13,7 +13,6 @@
 # limitations under the License.
 # ============================================================================
 """tbe common"""
-import json
 import os

 class TBEException(Exception):

@@ -27,23 +26,6 @@ class TBEException(Exception):
         return self.__error_msg


-def get_ddk_version():
-    """get ddk version"""
-    ddk_version = os.environ.get("DDK_VERSION")
-    if ddk_version is None:
-        default_ddk_info_file = '/usr/local/HiAI/runtime/ddk_info'
-        backup_ddk_info_file = '/usr/local/Ascend/fwkacllib/ddk_info'
-        if os.path.exists(default_ddk_info_file):
-            with open(default_ddk_info_file, "r") as fp:
-                ddk_version = json.load(fp)["VERSION"]
-        elif os.path.exists(backup_ddk_info_file):
-            with open(backup_ddk_info_file, "r") as fp:
-                ddk_version = json.load(fp)["VERSION"]
-        else:
-            ddk_version = "Ascend910"
-    return ddk_version
-
-
 def get_build_in_impl_path():
     """get build-in tbe implement path"""
     tbe_impl_path = os.environ.get("TBE_IMPL_PATH")
@@ -18,9 +18,8 @@ import os
 import sys
 from te.platform.cce_conf import te_set_version
 from te.platform.fusion_util import fusion_op
-from common import check_kernel_info, get_args, get_build_in_impl_path, get_ddk_version
+from common import check_kernel_info, get_args, get_build_in_impl_path

-ddk_version = get_ddk_version()
 build_in_impl_path = get_build_in_impl_path()

 # op function list

@@ -30,7 +29,6 @@ fusion_pattern_end_flag = "fusion_pattern_end"

 def _initialize(impl_path):
     """Initialize"""
-    te_set_version(ddk_version)
     if impl_path == "":
         op_module_name = build_in_impl_path
     else:

@@ -53,7 +51,7 @@ def build_op(build_type, json_str):
     """
     kernel_info = json.loads(json_str)
     check_kernel_info(kernel_info)
+    te_set_version(kernel_info["op_info"]["socVersion"])
     op_name = kernel_info['op_info']['name']

     try:

@@ -111,7 +109,7 @@ def compile_fusion_op(json_str):
         Exception: If specific keyword is not found.
     """
     args = json.loads(json_str)
-    te_set_version(ddk_version)
+    te_set_version(args['fusion_op']["socVersion"])
     if 'fusion_op' not in args or not args['fusion_op']:
         raise ValueError("Json string Errors, key:fusion_op not found.")
     fusion_op_arg = args['fusion_op']
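The net effect of these hunks: the TBE compiler no longer pins a process-wide DDK version at import time; each compile request carries its own soc version. A hedged sketch of the flow (payload values below are illustrative, not from this diff):

```python
# Illustrative only: the shape of the per-request soc version lookup above.
import json

request = json.dumps({"op_info": {"socVersion": "Ascend910", "name": "Add"}})

kernel_info = json.loads(request)
soc_version = kernel_info["op_info"]["socVersion"]  # read per request...
# te_set_version(soc_version)                       # ...then applied before compiling,
# replacing the removed module-level te_set_version(get_ddk_version()).
print(soc_version)
```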
@@ -104,7 +104,7 @@ class TbeProcess:
     def __init__(self):
         self.__processe_num = multiprocessing.cpu_count()
         # max_processes_num: Set the maximum number of concurrent processes for the compiler
-        max_processes_num = 16
+        max_processes_num = 24
         if self.__processe_num > max_processes_num:
             self.__processe_num = max_processes_num
         self.__pool = None
@@ -122,15 +122,16 @@ class AscendEnvChecker(EnvChecker):

     def __init__(self):
         self.version = ["1.75.22.0.220"]
-        atlas_fwk_version = "/usr/local/Ascend/nnae/latest/fwkacllib/version.info"
+        atlas_fwk_version = "/usr/local/Ascend/ascend-toolkit/latest/fwkacllib/version.info"
         hisi_fwk_version = "/usr/local/Ascend/fwkacllib/version.info"
         if os.path.exists(atlas_fwk_version):
             # atlas default path
-            self.fwk_path = "/usr/local/Ascend/nnae/latest/fwkacllib"
-            self.op_impl_path = "/usr/local/Ascend/nnae/latest/opp/op_impl/built-in/ai_core/tbe"
+            self.fwk_path = "/usr/local/Ascend/ascend-toolkit/latest/fwkacllib"
+            self.op_impl_path = "/usr/local/Ascend/ascend-toolkit/latest/opp/op_impl/built-in/ai_core/tbe"
             self.tbe_path = self.fwk_path + "/lib64"
             self.cce_path = self.fwk_path + "/ccec_compiler/bin"
             self.fwk_version = atlas_fwk_version
+            self.op_path = "/usr/local/Ascend/ascend-toolkit/latest/opp"
         elif os.path.exists(hisi_fwk_version):
             # hisi default path
             self.fwk_path = "/usr/local/Ascend/fwkacllib"

@@ -138,6 +139,7 @@ class AscendEnvChecker(EnvChecker):
             self.tbe_path = self.fwk_path + "/lib64"
             self.cce_path = self.fwk_path + "/ccec_compiler/bin"
             self.fwk_version = hisi_fwk_version
+            self.op_path = ""
         else:
             # custom or unknown environment
             self.fwk_path = ""

@@ -145,17 +147,20 @@ class AscendEnvChecker(EnvChecker):
             self.tbe_path = ""
             self.cce_path = ""
             self.fwk_version = ""
+            self.op_path = ""

         # env
         self.path = os.getenv("PATH")
         self.python_path = os.getenv("PYTHONPATH")
         self.ld_lib_path = os.getenv("LD_LIBRARY_PATH")
+        self.ascend_opp_path = os.getenv("ASCEND_OPP_PATH")

         # check content
         self.path_check = "/fwkacllib/ccec_compiler/bin/"
         self.python_path_check = "opp/op_impl/built_in/ai_core/tbe/"
         self.ld_lib_path_check_fwk = "/fwkacllib/lib64/"
         self.ld_lib_path_check_addons = "/add-ons/"
+        self.ascend_opp_path_check = "/op"
         self.v = ""

     def check_env(self, e):

@@ -205,6 +210,15 @@ class AscendEnvChecker(EnvChecker):
                 f"No such directory: {self.cce_path}, Please check if Ascend 910 AI software package is "
                 "installed correctly.")

+        if self.op_path is None:
+            pass
+        elif Path(self.op_path).is_dir():
+            os.environ['ASCEND_OPP_PATH'] = self.op_path
+        else:
+            raise EnvironmentError(
+                f"No such directory: {self.op_path}, Please check if Ascend 910 AI software package is "
+                "installed correctly.")
+
     def _check_env(self):
         """ascend dependence path check"""
         if self.path is None or self.path_check not in self.path:

@@ -223,6 +237,11 @@ class AscendEnvChecker(EnvChecker):
                 "LD_LIBRARY_PATH, you can reference to the installation guidelines "
                 "https://www.mindspore.cn/install")

+        if self.ascend_opp_path is None or self.ascend_opp_path_check not in self.ascend_opp_path:
+            logger.warning(
+                "Can not find opp path (need by mindspore-ascend), please check if you have set env ASCEND_OPP_PATH, "
+                "you can reference to the installation guidelines https://www.mindspore.cn/install")
+
     def _read_version(self, file_path):
         """get ascend version info"""
         with open(file_path, 'r') as f:
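For users who hit the new warning, the fix is to expose the opp directory before MindSpore is imported; the path below is the Atlas default from the hunk above (adjust to the actual install):

```python
# Illustrative: satisfy the new ASCEND_OPP_PATH check.
import os

os.environ["ASCEND_OPP_PATH"] = "/usr/local/Ascend/ascend-toolkit/latest/opp"

import mindspore  # AscendEnvChecker now warns if ASCEND_OPP_PATH is unset or wrong
```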
@@ -258,7 +258,7 @@ else ()
 endif ()
 set(MINDSPORE_RPATH ${ORIGIN_PATH}/lib)
 if (ENABLE_D)
-    set(MINDSPORE_RPATH ${MINDSPORE_RPATH}:/usr/local/Ascend/nnae/latest/fwkacllib/lib64)
+    set(MINDSPORE_RPATH ${MINDSPORE_RPATH}:/usr/local/Ascend/ascend-toolkit/latest/fwkacllib/lib64)
     set(MINDSPORE_RPATH ${MINDSPORE_RPATH}:/usr/local/Ascend/fwkacllib/lib64)
     set(MINDSPORE_RPATH ${MINDSPORE_RPATH}:/usr/local/Ascend/add-ons)
 elseif (ENABLE_GPU)
@@ -0,0 +1,139 @@
+/**
+ * Copyright 2020 Huawei Technologies Co., Ltd
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <string>
+#include "backend/kernel_compiler/cpu/mkldnn/fused_batch_norm_grad_cpu_kernel.h"
+#include "utils/ms_utils.h"
+#include "backend/kernel_compiler/cpu/mkldnn/mkl_kernel_engine.h"
+#include "runtime/device/cpu/cpu_device_address.h"
+
+namespace mindspore {
+namespace kernel {
+void FusedBatchNormGradCPUKernel::InitInputOutputSize(const CNodePtr &kernel_node) {
+  CPUKernel::InitInputOutputSize(kernel_node);
+  MS_EXCEPTION_IF_NULL(kernel_node);
+  size_t type_size = sizeof(float);
+  std::vector<size_t> shape = AnfAlgo::GetInputDeviceShape(kernel_node, 0);
+
+  // Workspace of shape [2, C] stores the diff scale and diff bias.
+  size_t tensor_size = shape[1] * 2 * type_size;
+  workspace_size_list_.emplace_back(tensor_size);
+}
+
+void FusedBatchNormGradCPUKernel::InitKernel(const CNodePtr &kernel_node) {
+  MS_EXCEPTION_IF_NULL(kernel_node);
+  MS_LOG(INFO) << "start init FusedBatchNormGradCPUKernel";
+  auto node_name = AnfAlgo::GetCNodeName(kernel_node);
+
+  // Create descriptors for dy and x; both must be NCHW.
+  std::vector<size_t> dy_shape = AnfAlgo::GetInputDeviceShape(kernel_node, 0);
+  std::vector<size_t> x_shape = AnfAlgo::GetInputDeviceShape(kernel_node, 1);
+  if (dy_shape.size() != 4 || x_shape.size() != 4) {
+    MS_LOG(EXCEPTION) << "Fused batchnorm grad only support nchw input!";
+  }
+
+  batch_size = x_shape[0];
+  channel = x_shape[1];
+  hw_size = x_shape[2] * x_shape[3];
+  nhw_size = x_shape[0] * hw_size;
+  dnnl::memory::desc x_desc = GetDefaultMemDesc(x_shape);
+
+  // Scale/bias descriptor.
+  dnnl::memory::desc scale_desc = GetDefaultMemDesc({1, channel});
+
+  auto epsilon = AnfAlgo::GetNodeAttr<float>(kernel_node, "epsilon");
+
+  // Forward hint required by the backward primitive.
+  auto f_prop_kind = dnnl::prop_kind::forward_training;
+  auto f_normalization_flags = dnnl::normalization_flags::use_scale_shift;
+  dnnl::batch_normalization_forward::desc fhint_desc =
+    dnnl::batch_normalization_forward::desc(f_prop_kind, x_desc, epsilon, f_normalization_flags);
+  auto fhint_prim_desc = dnnl::batch_normalization_forward::primitive_desc(fhint_desc, MKLKernelEngine::Get().engine());
+
+  // Backward descriptor and primitive.
+  auto bprop_kind = dnnl::prop_kind::backward;
+  auto bnormalization_flags = dnnl::normalization_flags::use_scale_shift;
+  dnnl::batch_normalization_backward::desc b_desc =
+    dnnl::batch_normalization_backward::desc(bprop_kind, fhint_prim_desc.dst_desc(), x_desc, epsilon, bnormalization_flags);
+  auto bprim_desc =
+    dnnl::batch_normalization_backward::primitive_desc(b_desc, MKLKernelEngine::Get().engine(), fhint_prim_desc);
+
+  primitive_ = std::make_shared<dnnl::batch_normalization_backward>(bprim_desc);
+
+  // Arguments for the backward primitive.
+  AddArgument(DNNL_ARG_SRC, bprim_desc.src_desc());            // x
+  AddArgument(DNNL_ARG_DIFF_DST, bprim_desc.diff_dst_desc());  // dy
+  AddArgument(DNNL_ARG_SCALE_SHIFT, scale_desc);               // only scale is needed
+  AddArgument(DNNL_ARG_MEAN, bprim_desc.mean_desc());          // mean
+  AddArgument(DNNL_ARG_VARIANCE, bprim_desc.variance_desc());  // variance
+  AddArgument(DNNL_ARG_DIFF_SRC, bprim_desc.diff_src_desc());  // dx
+  AddArgument(DNNL_ARG_DIFF_SCALE_SHIFT, bprim_desc.diff_weights_desc());  // diff scale and diff bias
+  AddArgument(DNNL_ARG_WORKSPACE, bprim_desc.workspace_desc());  // workspace
+}
+
+bool FusedBatchNormGradCPUKernel::Launch(const std::vector<kernel::AddressPtr> &inputs,
+                                         const std::vector<kernel::AddressPtr> &workspace,
+                                         const std::vector<kernel::AddressPtr> &outputs) {
+  // Inputs: dy, x, scale, save_mean, save_inv_variance (5 in total).
+  MS_LOG(INFO) << "start FusedBatchNormGradCPUKernel";
+
+  if (inputs.size() < 5 || outputs.empty()) {
+    MS_LOG(EXCEPTION) << "Error input output size!";
+  }
+
+  SetArgumentHandle(DNNL_ARG_SRC, inputs[1]->addr);          // x
+  SetArgumentHandle(DNNL_ARG_DIFF_DST, inputs[0]->addr);     // dy
+  SetArgumentHandle(DNNL_ARG_SCALE_SHIFT, inputs[2]->addr);  // only scale is needed
+  SetArgumentHandle(DNNL_ARG_MEAN, inputs[3]->addr);         // mean
+  SetArgumentHandle(DNNL_ARG_VARIANCE, inputs[4]->addr);     // variance
+  SetArgumentHandle(DNNL_ARG_DIFF_SRC, outputs[0]->addr);    // dx
+  SetArgumentHandle(DNNL_ARG_DIFF_SCALE_SHIFT, workspace[0]->addr);  // diff scale and diff bias
+  ExecutePrimitive();
+
+  // Split the packed [2, C] workspace into the diff-scale and diff-bias outputs.
+  auto scale = reinterpret_cast<float *>(outputs[1]->addr);
+  auto bias = reinterpret_cast<float *>(outputs[2]->addr);
+  auto wksp = reinterpret_cast<float *>(workspace[0]->addr);
+
+  auto scale_ret = memcpy_s(scale, outputs[1]->size, workspace[0]->addr, outputs[1]->size);
+  auto bias_ret = memcpy_s(bias, outputs[2]->size, wksp + (outputs[1]->size / sizeof(float)), outputs[2]->size);
+  if (scale_ret != 0 || bias_ret != 0) {
+    MS_LOG(EXCEPTION) << "Memcpy_s error.";
+    return false;
+  }
+
+  MS_LOG(INFO) << "done batchnorm grad";
+  return true;
+}
+}  // namespace kernel
+}  // namespace mindspore
@@ -0,0 +1,63 @@
+/**
+ * Copyright 2020 Huawei Technologies Co., Ltd
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+#ifndef MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_CPU_FUSED_BATCH_NORM_GRAD_CPU_KERNEL_H_
+#define MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_CPU_FUSED_BATCH_NORM_GRAD_CPU_KERNEL_H_
+
+#include <vector>
+#include <memory>
+#include "backend/kernel_compiler/cpu/mkldnn/mkl_cpu_kernel.h"
+
+namespace mindspore {
+namespace kernel {
+class FusedBatchNormGradCPUKernel : public MKLCPUKernel {
+ public:
+  FusedBatchNormGradCPUKernel() = default;
+  ~FusedBatchNormGradCPUKernel() override = default;
+
+  void InitKernel(const CNodePtr &kernel_node) override;
+
+  bool Launch(const std::vector<AddressPtr> &inputs, const std::vector<AddressPtr> &workspace,
+              const std::vector<AddressPtr> &outputs) override;
+
+ protected:
+  void InitInputOutputSize(const CNodePtr &kernel_node) override;
+
+ private:
+  size_t batch_size{0};
+  size_t channel{0};
+  size_t hw_size{0};
+  size_t nhw_size{0};
+};
+
+// Register FusedBatchNormGrad: five float32 inputs, three float32 outputs.
+MS_REG_CPU_KERNEL(FusedBatchNormGrad,
+                  KernelAttr()
+                    .AddInputAttr(kNumberTypeFloat32)
+                    .AddInputAttr(kNumberTypeFloat32)
+                    .AddInputAttr(kNumberTypeFloat32)
+                    .AddInputAttr(kNumberTypeFloat32)
+                    .AddInputAttr(kNumberTypeFloat32)
+                    .AddOutputAttr(kNumberTypeFloat32)
+                    .AddOutputAttr(kNumberTypeFloat32)
+                    .AddOutputAttr(kNumberTypeFloat32),
+                  FusedBatchNormGradCPUKernel)
+}  // namespace kernel
+}  // namespace mindspore
+
+#endif  // MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_CPU_FUSED_BATCH_NORM_GRAD_CPU_KERNEL_H_
@@ -0,0 +1,67 @@
/**
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#include "backend/kernel_compiler/cpu/mkldnn/realdiv_cpu_kernel.h"
#include "backend/kernel_compiler/cpu/mkldnn/mkl_kernel_engine.h"
#include "runtime/device/cpu/cpu_device_address.h"
#include "utils/ms_utils.h"

namespace mindspore {
namespace kernel {
void RealDivCPUKernel::InitKernel(const CNodePtr &kernel_node) {
  MS_EXCEPTION_IF_NULL(kernel_node);
  MS_LOG(INFO) << "Init RealDiv";
  std::vector<size_t> src0_shape = AnfAlgo::GetInputDeviceShape(kernel_node, 0);
  std::vector<size_t> src1_shape = AnfAlgo::GetInputDeviceShape(kernel_node, 1);
  std::vector<size_t> dst_shape = AnfAlgo::GetOutputDeviceShape(kernel_node, 0);
  if (src0_shape.size() != src1_shape.size() && src1_shape.size() > 1) {
    MS_LOG(EXCEPTION) << "RealDivCPUKernel: RealDiv only supports inputs of the same rank, or a tensor divided by a "
                      << "scalar, but got ranks " << src0_shape.size() << " vs " << src1_shape.size();
  }

  // Pad the divisor shape with trailing 1s so oneDNN can broadcast it against the dividend.
  if (src1_shape.size() < src0_shape.size()) {
    for (size_t i = src1_shape.size(); i < src0_shape.size(); ++i) {
      src1_shape.emplace_back(1);
    }
  }
  dnnl::memory::desc src0_desc = GetDefaultMemDesc(src0_shape);
  dnnl::memory::desc src1_desc = GetDefaultMemDesc(src1_shape);
  dnnl::memory::desc dst_desc = GetDefaultMemDesc(dst_shape);

  dnnl::binary::desc desc = dnnl::binary::desc(dnnl::algorithm::binary_div, src0_desc, src1_desc, dst_desc);
  auto prim_desc = dnnl::binary::primitive_desc(desc, MKLKernelEngine::Get().engine());
  primitive_ = std::make_shared<dnnl::binary>(prim_desc);
  AddArgument(DNNL_ARG_SRC_0, src0_desc);
  AddArgument(DNNL_ARG_SRC_1, src1_desc);
  AddArgument(DNNL_ARG_DST, dst_desc);
}

bool RealDivCPUKernel::Launch(const std::vector<kernel::AddressPtr> &inputs,
                              const std::vector<kernel::AddressPtr> & /*workspace*/,
                              const std::vector<kernel::AddressPtr> &outputs) {
  MS_LOG(INFO) << "RealDivCPUKernel: start Launch RealDiv";
  if (inputs.size() < 2 || outputs.empty()) {
    MS_LOG(EXCEPTION) << "RealDivCPUKernel: invalid input or output size!";
  }
  SetArgumentHandle(DNNL_ARG_SRC_0, inputs[0]->addr);
  SetArgumentHandle(DNNL_ARG_SRC_1, inputs[1]->addr);
  SetArgumentHandle(DNNL_ARG_DST, outputs[0]->addr);
  ExecutePrimitive();
  return true;
}
}  // namespace kernel
}  // namespace mindspore
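The shape-alignment step in `InitKernel` above is easiest to see on a concrete case. The sketch below is a standalone approximation of the same padding-and-broadcast rule — plain loops instead of the oneDNN `binary_div` primitive, and all data hypothetical — dividing a `[2, 3]` tensor by a scalar:

```
#include <cstdio>
#include <vector>

// Standalone sketch of RealDiv's tensor/scalar broadcast: pad the divisor
// shape with 1s so ranks match, then divide every element by the one value.
int main() {
  std::vector<float> src0 = {2.0f, 4.0f, 6.0f, 8.0f, 10.0f, 12.0f};  // shape [2, 3]
  std::vector<size_t> src1_shape;                                    // scalar divisor: shape []
  float src1 = 2.0f;

  // Same alignment the kernel performs: [] -> [1, 1], matching rank of [2, 3].
  while (src1_shape.size() < 2) src1_shape.emplace_back(1);

  std::vector<float> dst(src0.size());
  for (size_t i = 0; i < src0.size(); ++i) dst[i] = src0[i] / src1;  // broadcast divide
  for (float v : dst) printf("%g ", v);                              // prints: 1 2 3 4 5 6
  return 0;
}
```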
@@ -0,0 +1,42 @@
/**
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#ifndef MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_CPU_REALDIV_CPU_KERNEL_H_
#define MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_CPU_REALDIV_CPU_KERNEL_H_
#include <vector>
#include <memory>
#include "backend/kernel_compiler/cpu/mkldnn/mkl_cpu_kernel.h"

namespace mindspore {
namespace kernel {
class RealDivCPUKernel : public MKLCPUKernel {
 public:
  RealDivCPUKernel() = default;
  ~RealDivCPUKernel() override = default;

  void InitKernel(const CNodePtr &kernel_node) override;

  bool Launch(const std::vector<AddressPtr> &inputs, const std::vector<AddressPtr> &workspace,
              const std::vector<AddressPtr> &outputs) override;
};

MS_REG_CPU_KERNEL(
  RealDiv,
  KernelAttr().AddInputAttr(kNumberTypeFloat32).AddInputAttr(kNumberTypeFloat32).AddOutputAttr(kNumberTypeFloat32),
  RealDivCPUKernel);
}  // namespace kernel
}  // namespace mindspore

#endif  // MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_CPU_REALDIV_CPU_KERNEL_H_
@@ -39,6 +39,7 @@ void ReduceCPUKernel::InitKernel(const CNodePtr &kernel_node) {
   }
   shape_ = AnfAlgo::GetInputDeviceShape(kernel_node, 0);
   auto axis_addr = AnfAlgo::GetCNodePrimitive(kernel_node)->GetAttr(AXIS);
+
   if (axis_addr->isa<ValueTuple>()) {
     auto attr_axis = AnfAlgo::GetNodeAttr<std::vector<int>>(kernel_node, AXIS);
     if (attr_axis.size() > shape_.size()) {
@@ -47,18 +48,24 @@ void ReduceCPUKernel::InitKernel(const CNodePtr &kernel_node) {
       axis_.push_back(shape_.size() - 1);
     } else {
       for (auto axis : attr_axis) {
+        while (axis < 0) {
+          axis += SizeToInt(shape_.size());
+        }
         if (IntToSize(axis) >= (shape_.size())) {
           MS_LOG(EXCEPTION) << "axis value is oversize.";
         }
-        axis < 0 ? axis_.push_back(axis + shape_.size()) : axis_.push_back(axis);
+        axis_.push_back(IntToSize(axis));
       }
     }
   } else if (axis_addr->isa<Int32Imm>()) {
     int axis = AnfAlgo::GetNodeAttr<int>(kernel_node, AXIS);
-    if (axis >= 0 && IntToSize(axis) >= shape_.size()) {
+    while (axis < 0) {
+      axis += SizeToInt(shape_.size());
+    }
+    if (IntToSize(axis) >= shape_.size()) {
       MS_LOG(EXCEPTION) << "axis value is oversize.";
     }
-    axis < 0 ? axis_.push_back(axis + shape_.size()) : axis_.push_back(axis);
+    axis_.push_back(IntToSize(axis));
   } else {
     MS_LOG(EXCEPTION) << "Attribute axis type is invalid.";
   }
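The change above replaces the one-shot `axis < 0 ? ... : ...` adjustment with an explicit normalization loop, so any negative axis is wrapped into `[0, rank)` before the bounds check rather than after it. A minimal sketch of the same rule outside the kernel (`NormalizeAxis` is a hypothetical helper, not part of the PR):

```
#include <cstdio>
#include <stdexcept>

// Normalize a possibly-negative reduction axis into [0, rank), the same way
// the patched ReduceCPUKernel::InitKernel does before pushing it into axis_.
size_t NormalizeAxis(int axis, size_t rank) {
  while (axis < 0) axis += static_cast<int>(rank);
  if (static_cast<size_t>(axis) >= rank) throw std::out_of_range("axis value is oversize.");
  return static_cast<size_t>(axis);
}

int main() {
  printf("%zu\n", NormalizeAxis(-1, 4));  // 3: last axis of a rank-4 tensor
  printf("%zu\n", NormalizeAxis(2, 4));   // 2: non-negative axes pass through
  return 0;
}
```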
@@ -0,0 +1,173 @@
/**
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#include "backend/kernel_compiler/cpu/tile_cpu_kernel.h"
#include "runtime/device/cpu/cpu_device_address.h"

#include <iostream>
#include <string>

namespace mindspore {
namespace kernel {
size_t TileCPUKernel::GetOuterSize(std::vector<size_t> shape, int index) {
  size_t size = 1;
  for (int i = 0; i < index; i++) {
    size *= shape.at(i);
  }
  return size;
}

size_t TileCPUKernel::GetInnerSize(std::vector<size_t> shape, int index) {
  size_t size = 1;
  int shape_size = shape.size();
  for (int i = index; i < shape_size; i++) {
    size *= shape.at(i);
  }
  return size;
}

size_t TileCPUKernel::GetTypeSize() {
  size_t type_size = 1;
  if (InputTensorType == kNumberTypeInt32) {
    type_size = sizeof(int);
  } else if (InputTensorType == kNumberTypeInt64) {
    type_size = sizeof(int64_t);
  } else if (InputTensorType == kNumberTypeInt16) {
    type_size = sizeof(int16_t);
  } else if (InputTensorType == kNumberTypeFloat32) {
    type_size = sizeof(float);
  }
  return type_size;
}

void TileCPUKernel::InitKernel(const CNodePtr &kernel_node) {
  MS_LOG(INFO) << "start Init Tile kernel";
  MS_EXCEPTION_IF_NULL(kernel_node);
  src0_shape = AnfAlgo::GetInputDeviceShape(kernel_node, 0);
  multiples = AnfAlgo::GetNodeAttr<std::vector<int>>(kernel_node, "multiples");
  dst_shape = AnfAlgo::GetOutputDeviceShape(kernel_node, 0);

  if (src0_shape.size() > multiples.size()) {
    MS_LOG(EXCEPTION) << "The rank of input is greater than multiples.";
  }

  // Align src0_shape to multiples by prepending leading 1s.
  if (src0_shape.size() < multiples.size()) {
    for (size_t i = src0_shape.size(); i < multiples.size(); ++i) {
      src0_shape.emplace(src0_shape.begin(), 1);
    }
  }
  if (src0_shape.size() != multiples.size()) {
    MS_LOG(EXCEPTION) << "Aligning src0_shape to multiples failed.";
  }

  // Reserve one workspace buffer as large as the output tensor for intermediate results.
  size_t output_tensor_size = 1;
  InputTensorType = AnfAlgo::GetInputDeviceDataType(kernel_node, 0);
  for (auto each : dst_shape) {
    output_tensor_size *= each;
  }
  output_tensor_size *= GetTypeSize();
  workspace_size_list_.emplace_back(output_tensor_size);
}

bool TileCPUKernel::Launch(const std::vector<kernel::AddressPtr> &inputs,
                           const std::vector<kernel::AddressPtr> &workspace,
                           const std::vector<kernel::AddressPtr> &outputs) {
  MS_LOG(INFO) << "start Launch Tile kernel";
  if (inputs.empty() || outputs.empty()) {
    MS_LOG(EXCEPTION) << "input or output empty!";
  }
  if (inputs[0]->size > outputs[0]->size) {
    return false;
  }

  if (InputTensorType == kNumberTypeInt32) {
    LaunchKernel<int>(inputs, workspace, outputs);
  } else if (InputTensorType == kNumberTypeInt64) {
    LaunchKernel<int64_t>(inputs, workspace, outputs);
  } else if (InputTensorType == kNumberTypeInt16) {
    LaunchKernel<int16_t>(inputs, workspace, outputs);
  } else if (InputTensorType == kNumberTypeFloat32) {
    LaunchKernel<float>(inputs, workspace, outputs);
  }
  return true;
}

/**
 * Launch kernel based on input type.
 */
template <typename T>
void TileCPUKernel::LaunchKernel(const std::vector<kernel::AddressPtr> &inputs,
                                 const std::vector<kernel::AddressPtr> &workspace,
                                 const std::vector<kernel::AddressPtr> &outputs) {
  size_t shape_size = src0_shape.size();
  size_t item_size = sizeof(T);

  auto X_ptr = reinterpret_cast<T *>(inputs[0]->addr);
  auto Y_ptr = reinterpret_cast<T *>(outputs[0]->addr);
  auto wksp_temp = reinterpret_cast<T *>(workspace[0]->addr);  // workspace holding the intermediate result

  auto start_of_wksp = reinterpret_cast<T *>(workspace[0]->addr);
  auto start_of_Yptr = reinterpret_cast<T *>(outputs[0]->addr);

  memcpy_s(Y_ptr, inputs[0]->size, X_ptr, inputs[0]->size);

  int data_count = 1;
  auto temp_dst_shape = src0_shape;  // shape of the intermediate result

  // Tile one axis at a time, from the innermost axis outwards.
  for (int axis_t = shape_size - 1; axis_t >= 0; axis_t--) {
    auto temp_dst_size = GetInnerSize(temp_dst_shape, 0) * item_size;  // byte size of the intermediate result
    memcpy_s(wksp_temp, temp_dst_size, Y_ptr, temp_dst_size);

    int outer_size = GetOuterSize(temp_dst_shape, axis_t);
    int inner_size = GetInnerSize(temp_dst_shape, axis_t);
    int tile = multiples[axis_t];

    int item_count = inner_size;
    data_count = item_count * item_size;

    for (int i = 0; i < outer_size; i++) {
      for (int t = 0; t < tile; t++) {
        memcpy_s(Y_ptr, data_count, wksp_temp, data_count);  // copy one inner block into Y, `tile` times
        Y_ptr += item_count;
      }
      wksp_temp += item_count;
    }
    // Reset the cursors and record the grown axis before the next round.
    Y_ptr = start_of_Yptr;
    wksp_temp = start_of_wksp;
    temp_dst_shape[axis_t] = dst_shape[axis_t];
  }
}
}  // namespace kernel
}  // namespace mindspore
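The axis-by-axis loop in `LaunchKernel` is easiest to follow on concrete numbers. A minimal standalone sketch of the same strategy (vectors instead of raw pointers and `memcpy_s`; the data is hypothetical), tiling shape `[1, 2]` by `multiples = {2, 2}`:

```
#include <cstdio>
#include <vector>

// Standalone sketch of the Tile strategy above: starting from the innermost
// axis, replicate each inner block multiples[axis] times, using a scratch
// buffer as the previous round's source. Input [a b] with multiples {2, 2}:
//   after axis 1: [a b a b]          (shape [1, 4])
//   after axis 0: [a b a b a b a b]  (shape [2, 4])
int main() {
  std::vector<float> y = {1.0f, 2.0f};   // current result, grows each round
  std::vector<int> multiples = {2, 2};
  std::vector<size_t> shape = {1, 2};    // logical shape of y
  for (int axis = 1; axis >= 0; --axis) {
    std::vector<float> scratch = y;      // plays the role of wksp_temp
    size_t inner = 1;
    for (size_t i = axis; i < shape.size(); ++i) inner *= shape[i];
    size_t outer = scratch.size() / inner;
    y.clear();
    for (size_t o = 0; o < outer; ++o)
      for (int t = 0; t < multiples[axis]; ++t)
        y.insert(y.end(), scratch.begin() + o * inner, scratch.begin() + (o + 1) * inner);
    shape[axis] *= multiples[axis];
  }
  for (float v : y) printf("%g ", v);    // prints: 1 2 1 2 1 2 1 2
  return 0;
}
```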
@@ -0,0 +1,64 @@
/**
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#ifndef MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_CPU_TILE_CPU_KERNEL_H_
#define MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_CPU_TILE_CPU_KERNEL_H_
#include <vector>
#include <memory>
#include "backend/kernel_compiler/cpu/cpu_kernel.h"
#include "backend/kernel_compiler/cpu/cpu_kernel_factory.h"

namespace mindspore {
namespace kernel {
class TileCPUKernel : public CPUKernel {
 public:
  TileCPUKernel() = default;
  ~TileCPUKernel() override = default;

  void InitKernel(const CNodePtr &kernel_node) override;

  bool Launch(const std::vector<AddressPtr> &inputs, const std::vector<AddressPtr> &workspace,
              const std::vector<AddressPtr> &outputs) override;

 private:
  // launch kernel based on input type
  template <typename T>
  void LaunchKernel(const std::vector<kernel::AddressPtr> &inputs, const std::vector<kernel::AddressPtr> &workspace,
                    const std::vector<kernel::AddressPtr> &outputs);

  size_t GetOuterSize(std::vector<size_t> shape, int index);
  size_t GetInnerSize(std::vector<size_t> shape, int index);

  TypeId InputTensorType{kNumberTypeInt32};
  size_t GetTypeSize();

  std::vector<size_t> src0_shape;
  std::vector<int> multiples;
  std::vector<size_t> dst_shape;
};

MS_REG_CPU_KERNEL(Tile, KernelAttr().AddInputAttr(kNumberTypeFloat32).AddOutputAttr(kNumberTypeFloat32),
                  TileCPUKernel);
MS_REG_CPU_KERNEL(Tile, KernelAttr().AddInputAttr(kNumberTypeInt32).AddOutputAttr(kNumberTypeInt32),
                  TileCPUKernel);
MS_REG_CPU_KERNEL(Tile, KernelAttr().AddInputAttr(kNumberTypeInt64).AddOutputAttr(kNumberTypeInt64),
                  TileCPUKernel);
MS_REG_CPU_KERNEL(Tile, KernelAttr().AddInputAttr(kNumberTypeInt16).AddOutputAttr(kNumberTypeInt16),
                  TileCPUKernel);
}  // namespace kernel
}  // namespace mindspore

#endif  // MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_CPU_TILE_CPU_KERNEL_H_
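Because the same `TileCPUKernel` class is registered once per supported dtype, the type-erased `Launch` has to re-dispatch on `InputTensorType` to a templated worker at run time. A minimal sketch of that idiom, with hypothetical names outside MindSpore:

```
#include <cstdint>
#include <cstdio>

// Sketch of the dispatch idiom used above: one non-template entry point
// switches on a runtime type tag and forwards to a templated worker.
enum class TypeTag { kInt32, kInt64, kFloat32 };

template <typename T>
void Work(const void *data) {  // templated worker, like LaunchKernel<T>
  (void)data;
  printf("element size: %zu\n", sizeof(T));
}

void Launch(TypeTag tag, const void *data) {  // type-erased entry, like Launch
  if (tag == TypeTag::kInt32) {
    Work<int32_t>(data);
  } else if (tag == TypeTag::kInt64) {
    Work<int64_t>(data);
  } else if (tag == TypeTag::kFloat32) {
    Work<float>(data);
  }
}

int main() {
  int32_t x = 7;
  Launch(TypeTag::kInt32, &x);  // prints "element size: 4"
  return 0;
}
```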
@@ -0,0 +1,446 @@
/**
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#include <limits>
#include "ctcloss_impl.cuh"
#include "runtime/device/gpu/cuda_common.h"

// Numerically stable log(exp(a) + exp(b)), computed in log space.
template <typename T>
__device__ T LogSumExp(const T logprob1, const T logprob2) {
  if (logprob1 == logprob2 && logprob1 == -std::numeric_limits<T>::infinity()) {
    return logprob1;
  } else {
    return (logprob1 > logprob2) ? logprob1 + log1pf(expf(logprob2 - logprob1))
                                 : logprob2 + log1pf(expf(logprob1 - logprob2));
  }
}

// Forward variables (alpha) of the CTC dynamic program; one batch element per thread.
template <typename T>
__global__ void CalculateFwdVarKernel(T *log_alpha_b, int *label_value_with_blank, T *softmax_probs,
                                      const int *sequence_length, bool ctc_merge_repeated, int batch, int SOffSet,
                                      int maxtime, int blank, int *label_squence_length, int *cum_labels_length,
                                      bool ignore_longer_outputs_than_inputs) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < batch; i += blockDim.x * gridDim.x) {
    if (sequence_length[i] == 0 ||
        (ignore_longer_outputs_than_inputs && label_squence_length[i] > sequence_length[i])) {
    } else {
      T *log_alpha_b_cur = &log_alpha_b[i * SOffSet * maxtime];
      int *label_value_with_blank_cur = &label_value_with_blank[0];
      if (i > 0) {
        label_value_with_blank_cur = &label_value_with_blank[2 * cum_labels_length[i - 1] + i];
      }
      int numclass = blank + 1;
      int U = 2 * label_squence_length[i] + 1;
      int Ti = sequence_length[i];
      int low = 0;
      int high = 0;
      log_alpha_b_cur[0] = log(softmax_probs[i * numclass + blank]);
      int label0 = blank;
      if (U > 1) {
        label0 = label_value_with_blank_cur[1];
        log_alpha_b_cur[maxtime] = log(softmax_probs[i * numclass + label0]);
      }
      for (int t = 1; t < Ti; ++t) {
        low = 0;
        high = U;
        int low_limit = U - (2 * (Ti - t));
        int high_limit = 2 * (t + 1);
        if (low_limit > low) {
          low = low_limit;
        }
        if (high_limit < U) {
          high = high_limit;
        }
        for (int u = low; u < high; ++u) {
          T sum_log_alpha = -std::numeric_limits<T>::infinity();
          if (ctc_merge_repeated || label_value_with_blank_cur[u] == blank) {
            sum_log_alpha = log_alpha_b_cur[u * maxtime + t - 1];
          }
          if (u > 0) {
            sum_log_alpha = LogSumExp(sum_log_alpha, log_alpha_b_cur[(u - 1) * maxtime + t - 1]);
          }
          if (u > 1) {
            const bool matching_labels_merge =
              ctc_merge_repeated && (label_value_with_blank_cur[u] == label_value_with_blank_cur[u - 2]);
            if (label_value_with_blank_cur[u] != blank && !matching_labels_merge) {
              sum_log_alpha = LogSumExp(sum_log_alpha, log_alpha_b_cur[(u - 2) * maxtime + t - 1]);
            }
          }
          log_alpha_b_cur[u * maxtime + t] =
            log(softmax_probs[i * numclass + label_value_with_blank_cur[u] + t * numclass * batch]) + sum_log_alpha;
        }
      }
    }
  }
}

// Backward variables (beta), filled from the last valid frame backwards.
template <typename T>
__global__ void CalculateBwdVarKernel(T *log_beta_b, int *label_value_with_blank, T *softmax_probs,
                                      const int *sequence_length, bool ctc_merge_repeated, int batch, int SOffSet,
                                      int maxtime, int blank, int *label_squence_length, int *cum_labels_length,
                                      bool ignore_longer_outputs_than_inputs) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < batch; i += blockDim.x * gridDim.x) {
    if (sequence_length[i] == 0 ||
        (ignore_longer_outputs_than_inputs && label_squence_length[i] > sequence_length[i])) {
    } else {
      T *log_beta_b_cur = &log_beta_b[i * SOffSet * maxtime];
      int *label_value_with_blank_cur = &label_value_with_blank[0];
      if (i > 0) {
        label_value_with_blank_cur = &label_value_with_blank[2 * cum_labels_length[i - 1] + i];
      }
      int numclass = blank + 1;
      int U = 2 * label_squence_length[i] + 1;
      int Ti = sequence_length[i];
      int low = 0;
      int high = 0;
      if (U > 1) {
        for (int u = U - 2; u < U; ++u) {
          log_beta_b_cur[u * maxtime + Ti - 1] = 0;
        }
      } else {
        log_beta_b_cur[Ti - 1] = 0;
        log_beta_b_cur[Ti - 2] = 0;
      }
      for (int t = Ti - 2; t >= 0; --t) {
        low = 0;
        high = U;
        int low_limit = U - (2 * (Ti - t));
        int high_limit = 2 * (t + 1);
        if (low_limit > low) {
          low = low_limit;
        }
        if (high_limit < U) {
          high = high_limit;
        }
        for (int u = low; u < high; ++u) {
          if (ctc_merge_repeated || label_value_with_blank_cur[u] == blank) {
            log_beta_b_cur[u * maxtime + t] = LogSumExp(
              log_beta_b_cur[u * maxtime + t],
              log_beta_b_cur[u * maxtime + t + 1] +
                log(softmax_probs[i * numclass + label_value_with_blank_cur[u] + (t + 1) * numclass * batch]));
          }
          if (u + 1 < U) {
            log_beta_b_cur[u * maxtime + t] = LogSumExp(
              log_beta_b_cur[u * maxtime + t],
              log_beta_b_cur[(u + 1) * maxtime + t + 1] +
                log(softmax_probs[i * numclass + label_value_with_blank_cur[u + 1] + (t + 1) * numclass * batch]));
          }
          if (u + 2 < U) {
            const bool matching_labels_merge =
              ctc_merge_repeated && (label_value_with_blank_cur[u] == label_value_with_blank_cur[u + 2]);
            if (label_value_with_blank_cur[u] != blank && !matching_labels_merge) {
              log_beta_b_cur[u * maxtime + t] = LogSumExp(
                log_beta_b_cur[u * maxtime + t],
                log_beta_b_cur[(u + 2) * maxtime + t + 1] +
                  log(softmax_probs[i * numclass + label_value_with_blank_cur[u + 2] + (t + 1) * numclass * batch]));
            }
          }
        }
      }
    }
  }
}

template <typename T>
__global__ void ProbInitKernel(T *prob_num, int size) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < size; i += blockDim.x * gridDim.x) {
    prob_num[i] = -std::numeric_limits<T>::infinity();
  }
}
template <typename T>
__global__ void LogBInitKernel(T *log_b, int log_prob_size) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < log_prob_size; i += blockDim.x * gridDim.x) {
    log_b[i] = -std::numeric_limits<T>::infinity();
  }
}

// Combines alpha and beta into the loss -log p(l|x) and the gradient w.r.t. the logits.
template <typename T>
__global__ void CTCLossKernel(T *log_alpha_b, T *log_beta_b, T *softmax_probs, int *label_value_with_blank, int batch,
                              int SOffSet, int maxtime, int numclass, const int *sequence_length,
                              int *label_squence_length, int *cum_labels_length, T *cost, T *grads, T *prob_num,
                              bool ignore_longer_outputs_than_inputs) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < batch; i += blockDim.x * gridDim.x) {
    if (sequence_length[i] == 0 ||
        (ignore_longer_outputs_than_inputs && label_squence_length[i] > sequence_length[i])) {
    } else {
      T *grad_cur = &grads[i * numclass];
      const T *softmax_probs_cur = &softmax_probs[i * numclass];
      T *prob_num_cur = &prob_num[i * numclass];
      int U = 2 * label_squence_length[i] + 1;
      T log_pzx = -std::numeric_limits<T>::infinity();
      const T *log_alpha_b_cur = &log_alpha_b[i * SOffSet * maxtime];
      const T *log_beta_b_cur = &log_beta_b[i * SOffSet * maxtime];
      int *label_value_with_blank_cur = &label_value_with_blank[0];
      if (i > 0) {
        label_value_with_blank_cur = &label_value_with_blank[2 * cum_labels_length[i - 1] + i];
      }
      for (int u = 0; u < U; ++u) {
        log_pzx = LogSumExp(log_pzx, log_alpha_b_cur[u * maxtime] + log_beta_b_cur[u * maxtime]);
      }
      cost[i] = -log_pzx;
      // grad
      int L = numclass;
      int Ti = sequence_length[i];
      if (log_pzx == -std::numeric_limits<T>::infinity()) {
        for (int t = 0; t < Ti; ++t) {
          for (int l = 0; l < L; ++l) {
            grad_cur[t * numclass * batch + l] = softmax_probs_cur[t * numclass * batch + l];
          }
        }
      } else {
        for (int t = 0; t < Ti; ++t) {
          for (int u = 0; u < U; ++u) {
            int l = label_value_with_blank_cur[u];
            prob_num_cur[t * batch * numclass + l] =
              LogSumExp(prob_num_cur[t * batch * numclass + l],
                        log_alpha_b_cur[u * maxtime + t] + log_beta_b_cur[u * maxtime + t]);
          }
          for (int l = 0; l < L; ++l) {
            grad_cur[t * numclass * batch + l] =
              softmax_probs_cur[t * numclass * batch + l] - expf(prob_num_cur[t * batch * numclass + l] - log_pzx);
          }
        }
      }
    }
  }
}

// Max-shifted softmax over the class dimension, applied only to valid frames.
template <typename T>
__global__ void InnerSoftMaxKernel(const T *probs, T *softmax_probs, const int *sequence_length, int max_time,
                                   int batch, int numclass) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < batch * max_time; i += blockDim.x * gridDim.x) {
    int k = i / batch;
    int m = i % batch;
    if (k < sequence_length[m]) {
      T maxCoeff = 0.;
      T sumCoeff = 0.;
      for (int j = i * numclass; j < (i + 1) * numclass; ++j) {
        if (probs[j] > maxCoeff) {
          maxCoeff = probs[j];
        }
      }
      for (int j = i * numclass; j < (i + 1) * numclass; ++j) {
        sumCoeff += exp(probs[j] - maxCoeff);
        softmax_probs[j] = exp(probs[j] - maxCoeff);
      }
      for (int j = i * numclass; j < (i + 1) * numclass; ++j) {
        softmax_probs[j] /= sumCoeff;
      }
    }
  }
}

// Collapses repeated labels when preprocess_collapse_repeated is set.
__global__ void GenLabelValuePCRKernel(int *label_value_sp, int *label_value_pcr, int *label_squence_length,
                                       int *cum_labels_length, int batch) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < batch; i += blockDim.x * gridDim.x) {
    int L = label_squence_length[i];
    label_squence_length[i] = 0;
    int offset = 0;
    if (i > 0) {
      offset = cum_labels_length[i - 1];
    }
    for (int l = offset; l < L; ++l) {
      if (l == offset || label_value_sp[l] != label_value_sp[l - 1]) {
        label_value_pcr[offset + label_squence_length[i]++] = label_value_sp[l];
      }
    }
  }
}

// Recomputes cumulative label lengths and the max length (runs on a single thread).
__global__ void UpdateLengthKernel(int *label_squence_length, int *cum_labels_length, int *max_labels_length,
                                   int batch) {
  max_labels_length[0] = 0;
  for (int i = 0; i < batch; ++i) {
    if (label_squence_length[i] > max_labels_length[0]) {
      max_labels_length[0] = label_squence_length[i];
    }
    if (i == 0) {
      cum_labels_length[i] = label_squence_length[i];
    } else {
      cum_labels_length[i] = label_squence_length[i] + cum_labels_length[i - 1];
    }
  }
}

template <typename T>
void CalculateBwdVar(T *log_beta_b, int *label_value_with_blank, T *softmax_probs, const int *sequence_length,
                     bool ctc_merge_repeated, int batch, int SOffSet, int maxtime, int blank, int *label_squence_length,
                     int *cum_labels_length, bool ignore_longer_outputs_than_inputs, cudaStream_t stream) {
  int log_prob_size = SOffSet * batch * maxtime;
  LogBInitKernel<<<GET_BLOCKS(log_prob_size), GET_THREADS, 0, stream>>>(log_beta_b, log_prob_size);
  CalculateBwdVarKernel<<<GET_BLOCKS(batch), GET_THREADS, 0, stream>>>(
    log_beta_b, label_value_with_blank, softmax_probs, sequence_length, ctc_merge_repeated, batch, SOffSet, maxtime,
    blank, label_squence_length, cum_labels_length, ignore_longer_outputs_than_inputs);
}

template <typename T>
void CalculateFwdVar(T *log_alpha_b, int *label_value_with_blank, T *softmax_probs, const int *sequence_length,
                     bool ctc_merge_repeated, int batch, int SOffSet, int maxtime, int blank, int *label_squence_length,
                     int *cum_labels_length, bool ignore_longer_outputs_than_inputs, cudaStream_t stream) {
  int log_prob_size = SOffSet * batch * maxtime;
  LogBInitKernel<<<GET_BLOCKS(log_prob_size), GET_THREADS, 0, stream>>>(log_alpha_b, log_prob_size);
  CalculateFwdVarKernel<<<GET_BLOCKS(batch), GET_THREADS, 0, stream>>>(
    log_alpha_b, label_value_with_blank, softmax_probs, sequence_length, ctc_merge_repeated, batch, SOffSet, maxtime,
    blank, label_squence_length, cum_labels_length, ignore_longer_outputs_than_inputs);
}

template <typename T>
void InnerSoftMax(const T *probs, T *softmax_probs, const int *sequence_length, int max_time, int batch, int numclass,
                  cudaStream_t stream) {
  InnerSoftMaxKernel<<<GET_BLOCKS(batch * max_time), GET_THREADS, 0, stream>>>(probs, softmax_probs, sequence_length,
                                                                               max_time, batch, numclass);
}

// Interleaves blanks: l -> [blank, l1, blank, l2, ..., blank].
__global__ void GenLabelWithBlankKernel(int *label_value, int *label_value_with_blank, int *label_squence_length,
                                        int *precum_labels_length, int *cum_labels_length, int batch, int blank) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < batch; i += blockDim.x * gridDim.x) {
    int offset = 0;
    int offset1 = 0;
    if (i > 0) {
      offset = 2 * cum_labels_length[i - 1] + i;
      offset1 = precum_labels_length[i - 1];
    }
    for (int j = 0; j < label_squence_length[i]; ++j) {
      label_value_with_blank[offset + 2 * j] = blank;
      label_value_with_blank[offset + 2 * j + 1] = label_value[offset1 + j];
    }
    label_value_with_blank[offset + 2 * label_squence_length[i]] = blank;
  }
}

void GenLabelWithBlank(int *label_value, int *label_value_with_blank, int *label_squence_length,
                       int *precum_labels_length, int *cum_labels_length, int batch, int blank, cudaStream_t stream) {
  GenLabelWithBlankKernel<<<GET_BLOCKS(batch), GET_THREADS, 0, stream>>>(
    label_value, label_value_with_blank, label_squence_length, precum_labels_length, cum_labels_length, batch, blank);
}

void GenLabelValuePCR(int *label_value_sp, int *label_value_pcr, int *label_squence_length, int *cum_labels_length,
                      int *max_labels_length, int batch, cudaStream_t stream) {
  GenLabelValuePCRKernel<<<GET_BLOCKS(batch), GET_THREADS, 0, stream>>>(label_value_sp, label_value_pcr,
                                                                        label_squence_length, cum_labels_length, batch);
  UpdateLengthKernel<<<1, 1, 0, stream>>>(label_squence_length, cum_labels_length, max_labels_length, batch);
}

// Scatters sparse label values into a flat, per-batch buffer.
__global__ void GenLabelValueKernel(int *label_value_sp, const int64_t *label_indices, const int *label_values,
                                    int *label_squence_length, int *cum_labels_length, int size) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < size; i += blockDim.x * gridDim.x) {
    int64_t b = label_indices[i * 2];
    int offset = 0;
    if (b > 0) {
      offset = cum_labels_length[b - 1];
    }
    int64_t index = offset + label_indices[i * 2 + 1];
    label_value_sp[index] = label_values[i];
  }
}
__global__ void LabelValueInitKernel(int *label_value_sp, int size, int blank) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < size; i += blockDim.x * gridDim.x) {
    label_value_sp[i] = blank;
  }
}
// Trims each label sequence at the first out-of-range (>= blank) value.
__global__ void RecalculateLengthKernel(int *label_value_sp, int *label_squence_length, int *cum_labels_length,
                                        int batch, int blank) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < batch; i += blockDim.x * gridDim.x) {
    int offset = 0;
    if (i > 0) {
      offset = cum_labels_length[i - 1];
    }
    int L = label_squence_length[i];
    label_squence_length[i] = 0;
    for (int j = offset; j < offset + L; ++j) {
      if (label_value_sp[j] >= blank) {
        break;
      } else {
        label_squence_length[i]++;
      }
    }
  }
}
void GenLabelValue(int *label_value_sp, const int64_t *label_indices, const int *label_values,
                   int *label_squence_length, int *cum_labels_length, int *max_labels_length, int size, int blank,
                   int batch, cudaStream_t stream) {
  LabelValueInitKernel<<<GET_BLOCKS(size), GET_THREADS, 0, stream>>>(label_value_sp, size, blank);
  GenLabelValueKernel<<<GET_BLOCKS(size), GET_THREADS, 0, stream>>>(label_value_sp, label_indices, label_values,
                                                                    label_squence_length, cum_labels_length, size);
  RecalculateLengthKernel<<<GET_BLOCKS(batch), GET_THREADS, 0, stream>>>(label_value_sp, label_squence_length,
                                                                         cum_labels_length, batch, blank);
  UpdateLengthKernel<<<1, 1, 0, stream>>>(label_squence_length, cum_labels_length, max_labels_length, batch);
}

// Derives per-batch label lengths from the sparse indices and records the largest batch index (single thread).
__global__ void CalculatePreLengthKernel(int *label_squence_length, int *precum_labels_length, int *cum_labels_length,
                                         int *max_labels_length, const int64_t *label_indices, int batch, int size) {
  max_labels_length[0] = 0;
  for (int i = 0; i < size; ++i) {
    label_squence_length[label_indices[i * 2]]++;
    if (max_labels_length[0] < label_indices[i * 2]) {
      max_labels_length[0] = label_indices[i * 2];
    }
  }
  precum_labels_length[0] = label_squence_length[0];
  cum_labels_length[0] = label_squence_length[0];
  for (int i = 1; i < batch; ++i) {
    cum_labels_length[i] = cum_labels_length[i - 1] + label_squence_length[i];
    precum_labels_length[i] = precum_labels_length[i - 1] + label_squence_length[i];
  }
}

__global__ void CalculateMaxSequenceKernel(const int *sequence_length, int *max_labels_length, int batch) {
  max_labels_length[0] = 0;
  for (int i = 0; i < batch; ++i) {
    if (sequence_length[i] > max_labels_length[0]) {
      max_labels_length[0] = sequence_length[i];
    }
  }
}

void CalculateMaxSequence(const int *sequence_length, int *max_labels_length, int batch, cudaStream_t stream) {
  CalculateMaxSequenceKernel<<<1, 1, 0, stream>>>(sequence_length, max_labels_length, batch);
}

void CalculatePreLength(int *label_squence_length, int *precum_labels_length, int *cum_labels_length,
                        int *max_labels_length, const int64_t *label_indices, int batch, int size,
                        cudaStream_t stream) {
  CalculatePreLengthKernel<<<1, 1, 0, stream>>>(label_squence_length, precum_labels_length, cum_labels_length,
                                                max_labels_length, label_indices, batch, size);
}

template <typename T>
void CTCLoss(T *log_alpha_b, T *log_beta_b, T *softmax_probs, int *label_value_with_blank, int batch, int SOffSet,
             int maxtime, int numclass, const int *sequence_length, int *label_squence_length, int *cum_labels_length,
             T *cost, T *grads, T *prob_num, bool ignore_longer_outputs_than_inputs, cudaStream_t stream) {
  ProbInitKernel<<<GET_BLOCKS(maxtime * batch * numclass), GET_THREADS, 0, stream>>>(prob_num,
                                                                                     maxtime * batch * numclass);
  CTCLossKernel<<<GET_BLOCKS(batch), GET_THREADS, 0, stream>>>(
    log_alpha_b, log_beta_b, softmax_probs, label_value_with_blank, batch, SOffSet, maxtime, numclass, sequence_length,
    label_squence_length, cum_labels_length, cost, grads, prob_num, ignore_longer_outputs_than_inputs);
}

template void CalculateFwdVar<float>(float *log_alpha_b, int *label_value_with_blank, float *softmax_probs,
                                     const int *sequence_length, bool ctc_merge_repeated, int batch, int SOffSet,
                                     int maxtime, int blank, int *label_squence_length, int *cum_labels_length,
                                     bool ignore_longer_outputs_than_inputs, cudaStream_t stream);

template void CalculateBwdVar<float>(float *log_beta_b, int *label_value_with_blank, float *softmax_probs,
                                     const int *sequence_length, bool ctc_merge_repeated, int batch, int SOffSet,
                                     int maxtime, int blank, int *label_squence_length, int *cum_labels_length,
                                     bool ignore_longer_outputs_than_inputs, cudaStream_t stream);

template void InnerSoftMax<float>(const float *probs, float *softmax_probs, const int *sequence_length, int max_time,
                                  int batch, int numclass, cudaStream_t stream);

template void CTCLoss<float>(float *log_alpha_b, float *log_beta_b, float *softmax_probs, int *label_value_with_blank,
                             int batch, int SOffSet, int maxtime, int numclass, const int *sequence_length,
                             int *label_squence_length, int *cum_labels_length, float *cost, float *grads,
                             float *prob_num, bool ignore_longer_outputs_than_inputs, cudaStream_t stream);
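Two of the device helpers above rely on standard log-space identities. The host-side sketch below (plain C++ mirroring `LogSumExp` and the max-shifted softmax in `InnerSoftMaxKernel`; names and data are mine, not from the PR) shows why the shifted forms stay finite where the naive ones overflow:

```
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// log(exp(a) + exp(b)) computed as max + log1p(exp(min - max)), as in LogSumExp.
double LogSumExpHost(double a, double b) {
  if (std::isinf(a) && std::isinf(b) && a < 0 && b < 0) return a;  // both -inf
  return (a > b) ? a + std::log1p(std::exp(b - a)) : b + std::log1p(std::exp(a - b));
}

int main() {
  // Naive exp(1000) overflows a double; the shifted form is fine.
  printf("%f\n", LogSumExpHost(1000.0, 999.0));  // ~1000.313262

  // Max-shifted softmax, as in InnerSoftMaxKernel: subtract the row max first.
  std::vector<double> logits = {1000.0, 999.0, 998.0};
  double mx = logits[0];
  for (double v : logits) mx = std::max(mx, v);
  double sum = 0.0;
  for (double v : logits) sum += std::exp(v - mx);
  for (double v : logits) printf("%f ", std::exp(v - mx) / sum);  // 0.665241 0.244728 0.090031
  return 0;
}
```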
@@ -0,0 +1,51 @@
/**
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#ifndef MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_GPU_IMPL_CUH
#define MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_GPU_IMPL_CUH

template <typename T>
void CalculateFwdVar(T *log_alpha_b, int *label_value_with_blank, T *softmax_probs, const int *sequence_length,
                     bool ctc_merge_repeated, int batch, int SOffSet, int maxtime, int blank, int *label_squence_length,
                     int *cum_labels_length, bool ignore_longer_outputs_than_inputs, cudaStream_t stream);

template <typename T>
void CalculateBwdVar(T *log_beta_b, int *label_value_with_blank, T *softmax_probs, const int *sequence_length,
                     bool ctc_merge_repeated, int batch, int SOffSet, int maxtime, int blank, int *label_squence_length,
                     int *cum_labels_length, bool ignore_longer_outputs_than_inputs, cudaStream_t stream);

template <typename T>
void InnerSoftMax(const T *probs, T *softmax_probs, const int *sequence_length, int max_time, int batch, int numclass,
                  cudaStream_t stream);

void GenLabelValuePCR(int *label_value_sp, int *label_value_pcr, int *label_squence_length, int *cum_labels_length,
                      int *max_labels_length, int batch, cudaStream_t stream);

void GenLabelWithBlank(int *label_value, int *label_value_with_blank, int *label_squence_length,
                       int *precum_labels_length, int *cum_labels_length, int batch, int blank, cudaStream_t stream);

void GenLabelValue(int *label_value_sp, const int64_t *label_indices, const int *label_values,
                   int *label_squence_length, int *cum_labels_length, int *max_labels_length, int size, int blank,
                   int batch, cudaStream_t stream);

void CalculatePreLength(int *label_squence_length, int *precum_labels_length, int *cum_labels_length,
                        int *max_labels_length, const int64_t *label_indices, int batch, int size, cudaStream_t stream);
void CalculateMaxSequence(const int *sequence_length, int *max_labels_length, int batch, cudaStream_t stream);
template <typename T>
void CTCLoss(T *log_alpha_b, T *log_beta_b, T *softmax_probs, int *label_value_with_blank, int batch, int SOffSet,
             int maxtime, int numclass, const int *sequence_length, int *label_squence_length, int *cum_labels_length,
             T *cost, T *grads, T *prob_num, bool ignore_longer_outputs_than_inputs, cudaStream_t stream);
#endif  // MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_GPU_IMPL_CUH
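For orientation, these declarations implement the standard CTC forward-backward recursion (Graves et al., 2006); the notation below is mine, not from the PR. With the blank-augmented label sequence $l'$ of length $U = 2|l| + 1$ and per-frame softmax outputs $y_t(k)$:

$$
\alpha_t(u) = y_t(l'_u) \sum_{v \in \{u,\, u-1,\, u-2\}} \alpha_{t-1}(v),
\qquad
\beta_t(u) = \sum_{v \in \{u,\, u+1,\, u+2\}} \beta_{t+1}(v)\, y_{t+1}(l'_v),
$$

where the $v = u \pm 2$ term is skipped when $l'_u$ is blank, or when $l'_u = l'_{u \pm 2}$ and `ctc_merge_repeated` is set; both recursions run in log space via `LogSumExp`. The loss and gradient the kernels then compute are

$$
-\ln p(l \mid x) = -\ln \sum_u \alpha_t(u)\, \beta_t(u) \quad (\text{any fixed } t),
\qquad
\frac{\partial (-\ln p)}{\partial a_t(k)} = y_t(k) - \frac{1}{p(l \mid x)} \sum_{u:\, l'_u = k} \alpha_t(u)\, \beta_t(u),
$$

with $a_t(k)$ the pre-softmax logits — exactly the `softmax_probs - expf(prob_num - log_pzx)` expression in `CTCLossKernel`.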
@@ -44,8 +44,12 @@ MS_REG_GPU_KERNEL_ONE(Square, KernelAttr().AddInputAttr(kNumberTypeFloat16).AddO
                       UnaryOpGpuKernel, half)
 MS_REG_GPU_KERNEL_ONE(Sqrt, KernelAttr().AddInputAttr(kNumberTypeFloat32).AddOutputAttr(kNumberTypeFloat32),
                       UnaryOpGpuKernel, float)
+MS_REG_GPU_KERNEL_ONE(Sqrt, KernelAttr().AddInputAttr(kNumberTypeFloat16).AddOutputAttr(kNumberTypeFloat16),
+                      UnaryOpGpuKernel, half)
 MS_REG_GPU_KERNEL_ONE(Rsqrt, KernelAttr().AddInputAttr(kNumberTypeFloat32).AddOutputAttr(kNumberTypeFloat32),
                       UnaryOpGpuKernel, float)
+MS_REG_GPU_KERNEL_ONE(Rsqrt, KernelAttr().AddInputAttr(kNumberTypeFloat16).AddOutputAttr(kNumberTypeFloat16),
+                      UnaryOpGpuKernel, half)
 MS_REG_GPU_KERNEL_ONE(Sin, KernelAttr().AddInputAttr(kNumberTypeFloat32).AddOutputAttr(kNumberTypeFloat32),
                       UnaryOpGpuKernel, float)
 MS_REG_GPU_KERNEL_ONE(Sin, KernelAttr().AddInputAttr(kNumberTypeFloat16).AddOutputAttr(kNumberTypeFloat16),
@@ -18,10 +18,10 @@

 namespace mindspore {
 namespace kernel {
-MS_REG_GPU_KERNEL_ONE(CTCLossV2,
+MS_REG_GPU_KERNEL_ONE(CTCLoss,
                       KernelAttr()
                         .AddInputAttr(kNumberTypeFloat32)
-                        .AddInputAttr(kNumberTypeInt32)
+                        .AddInputAttr(kNumberTypeInt64)
                         .AddInputAttr(kNumberTypeInt32)
                         .AddInputAttr(kNumberTypeInt32)
                         .AddOutputAttr(kNumberTypeFloat32)
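The registration change above switches the second input — the label indices of a sparse label tensor — to int64. Under the usual sparse-label convention, `label_indices` is an `[n, 2]` array of `(batch, position)` pairs and `label_values` holds the class ids. A minimal sketch of how a ragged batch of labels flattens into that form (hypothetical data, not from the PR):

```
#include <cstdint>
#include <cstdio>
#include <vector>

// Flatten a ragged batch of label sequences into the sparse layout the
// CTCLoss kernel consumes: int64 (batch, position) index pairs plus int32 values.
int main() {
  std::vector<std::vector<int>> labels = {{3, 1}, {2}};  // batch 0: [3, 1]; batch 1: [2]
  std::vector<int64_t> label_indices;                    // shape [n, 2], row-major
  std::vector<int> label_values;                         // shape [n]
  for (int64_t b = 0; b < static_cast<int64_t>(labels.size()); ++b) {
    for (int64_t pos = 0; pos < static_cast<int64_t>(labels[b].size()); ++pos) {
      label_indices.push_back(b);
      label_indices.push_back(pos);
      label_values.push_back(labels[b][pos]);
    }
  }
  for (size_t i = 0; i < label_values.size(); ++i) {
    printf("(%lld, %lld) -> %d\n", static_cast<long long>(label_indices[2 * i]),
           static_cast<long long>(label_indices[2 * i + 1]), label_values[i]);
  }
  return 0;  // prints: (0, 0) -> 3, (0, 1) -> 1, (1, 0) -> 2
}
```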
@@ -19,64 +19,111 @@

 #include <cuda_runtime_api.h>
 #include <vector>
+#include <limits>
 #include "backend/kernel_compiler/gpu/gpu_kernel.h"
 #include "backend/kernel_compiler/gpu/gpu_kernel_factory.h"
 #include "runtime/device/gpu/gpu_memory_allocator.h"
+#include "backend/kernel_compiler/gpu/cuda_impl/ctcloss_impl.cuh"
 namespace mindspore {
 namespace kernel {
 template <typename T>
 class CtcLossGpuKernel : public GpuKernel {
  public:
   CtcLossGpuKernel()
-      : cudnn_handle_(nullptr),
-        probs_desc_(nullptr),
-        ctcloss_desc_(nullptr),
+      : label_indice_size_(0),
         label_size_(0),
-        input_lengths_size_(0),
-        label_lengths_size_(0) {}
-  ~CtcLossGpuKernel() override { DestroyResource(); }
+        squence_lengths_size_(0),
+        preprocess_collapse_repeated_(false),
+        ctc_merge_repeated_(true),
+        ignore_longer_outputs_than_inputs_(false) {}
+  ~CtcLossGpuKernel() override = default;

   const std::vector<size_t> &GetInputSizeList() const override { return input_size_list_; }
   const std::vector<size_t> &GetOutputSizeList() const override { return output_size_list_; }
   const std::vector<size_t> &GetWorkspaceSizeList() const override { return workspace_size_list_; }

-  bool Launch(const std::vector<AddressPtr> &inputs, const std::vector<AddressPtr> &,
+  bool Launch(const std::vector<AddressPtr> &inputs, const std::vector<AddressPtr> &workspace,
               const std::vector<AddressPtr> &outputs, void *stream_ptr) override {
-    float *probs = GetDeviceAddress<float>(inputs, 0);
-    float *costs = GetDeviceAddress<float>(outputs, 0);
-    float *grads = GetDeviceAddress<float>(outputs, 1);
-
-    // Copy labels/input_lengths/label_length to host as cudnn7.x.x requires
-    int *labels_host = nullptr;
-    int *no_blank_labels_host = nullptr;
-    void *input_lengths_host = nullptr;
-    void *label_lengths_host = nullptr;
     cudaStream_t stream = reinterpret_cast<cudaStream_t>(stream_ptr);
-    AllocHostMem(&labels_host, &no_blank_labels_host, &input_lengths_host, &label_lengths_host, inputs);
-    CopyToHostSync(labels_host, no_blank_labels_host, input_lengths_host, label_lengths_host, inputs, stream);
+    const T *probs = GetDeviceAddress<T>(inputs, 0);
+    const int64_t *label_indices = GetDeviceAddress<int64_t>(inputs, 1);
+    const int *label_values = GetDeviceAddress<int>(inputs, 2);
+    const int *sequence_length = GetDeviceAddress<int>(inputs, 3);
+    T *costs = GetDeviceAddress<T>(outputs, 0);
+    T *grads = GetDeviceAddress<T>(outputs, 1);
+    T *softmax_probs = GetDeviceAddress<T>(workspace, 0);
+    int *cum_labels_length = GetDeviceAddress<int>(workspace, 1);
+    int *label_squence_length = GetDeviceAddress<int>(workspace, 2);
+    int *label_value_sp = GetDeviceAddress<int>(workspace, 3);
+    int *label_value_pcr = GetDeviceAddress<int>(workspace, 4);
+    T *prob_num = GetDeviceAddress<T>(workspace, 5);
+    int *precum_labels_length = GetDeviceAddress<int>(workspace, 6);
+    int *max_labels_length = GetDeviceAddress<int>(workspace, 7);
+    int numclass = SizeToInt(probs_dims_[2]);
+    int batch = SizeToInt(probs_dims_[1]);
+    int max_time = SizeToInt(probs_dims_[0]);
+    int max_sequence = 0;
+    CalculateMaxSequence(sequence_length, max_labels_length, batch, stream);
+    CHECK_CUDA_RET_WITH_EXCEPT(
+      cudaMemcpyAsync(&max_sequence, max_labels_length, sizeof(int), cudaMemcpyDeviceToHost, stream),
+      "cudaMemcpyAsync failed.");
+    CHECK_CUDA_RET_WITH_EXCEPT(cudaStreamSynchronize(stream), "cudaStreamSynchronize failed.");
+    if (max_time < max_sequence) {
+      MS_LOG(EXCEPTION) << "max_time should be greater than sequence length.";
+    }
+    InnerSoftMax(probs, softmax_probs, sequence_length, max_time, batch, numclass, stream);
+    MemsetForWS(label_value_pcr, cum_labels_length, label_squence_length, costs, grads, stream);
+    int max_labels_length_host = 0;
+    int batch_label = 0;
+    int *label_value_with_blank = nullptr;
+    T *log_alpha_b = nullptr;
+    T *log_beta_b = nullptr;
+    CalculatePreLength(label_squence_length, precum_labels_length, cum_labels_length, max_labels_length, label_indices,
+                       batch, label_size_ / sizeof(int), stream);
+    CHECK_CUDA_RET_WITH_EXCEPT(
+      cudaMemcpyAsync(&batch_label, max_labels_length, sizeof(int), cudaMemcpyDeviceToHost, stream),
+      "cudaMemcpyAsync failed.");
+    CHECK_CUDA_RET_WITH_EXCEPT(cudaStreamSynchronize(stream), "cudaStreamSynchronize failed.");
+    if (batch != batch_label + 1) {
+      MS_LOG(EXCEPTION) << "label batch should be equal to input batch.";
+    }
+    GenLabelValue(label_value_sp, label_indices, label_values, label_squence_length, cum_labels_length,
+                  max_labels_length, label_size_ / sizeof(int), numclass - 1, batch, stream);
+    if (preprocess_collapse_repeated_) {
+      GenLabelValuePCR(label_value_sp, label_value_pcr, label_squence_length, cum_labels_length, max_labels_length,
+                       batch, stream);
+    }
+    CHECK_CUDA_RET_WITH_EXCEPT(
+      cudaMemcpyAsync(&max_labels_length_host, max_labels_length, sizeof(int), cudaMemcpyDeviceToHost, stream),
+      "cudaMemcpyAsync failed.");
+    CHECK_CUDA_RET_WITH_EXCEPT(cudaStreamSynchronize(stream), "cudaStreamSynchronize failed.");
+    int SOffSet = 2 * max_labels_length_host + 1;
+    int log_prob_size = batch * SOffSet * max_time;
+    if (!ignore_longer_outputs_than_inputs_ && max_labels_length_host > max_time) {
+      MS_LOG(EXCEPTION) << "output size is greater than input size.";
+    }
+    MemManageForCus(&log_alpha_b, &log_beta_b, &label_value_with_blank, cum_labels_length, log_prob_size, batch,
+                    stream);

-    size_t workspace_size = 0;
-    CHECK_CUDNN_RET_WITH_EXCEPT(
-      cudnnGetCTCLossWorkspaceSize(
-        cudnn_handle_, probs_desc_, probs_desc_, reinterpret_cast<int *>(no_blank_labels_host),
-        reinterpret_cast<int *>(label_lengths_host), reinterpret_cast<int *>(input_lengths_host),
-        CUDNN_CTC_LOSS_ALGO_DETERMINISTIC, ctcloss_desc_, &workspace_size),
-      "cudnnGetCTCLossWorkspaceSize failed.");
-    void *workspace = device::gpu::GPUMemoryAllocator::GetInstance().AllocTensorMem(workspace_size);
-    if (workspace == nullptr) {
-      MS_LOG(EXCEPTION) << "Failed to alloc workspace, size: " << workspace_size;
+    if (preprocess_collapse_repeated_) {
+      GenLabelWithBlank(label_value_pcr, label_value_with_blank, label_squence_length, precum_labels_length,
+                        cum_labels_length, batch, numclass - 1, stream);
+    } else {
+      GenLabelWithBlank(label_value_sp, label_value_with_blank, label_squence_length, precum_labels_length,
+                        cum_labels_length, batch, numclass - 1, stream);
     }

-    CHECK_CUDNN_RET_WITH_EXCEPT(
-      cudnnCTCLoss(cudnn_handle_, probs_desc_, probs, reinterpret_cast<int *>(no_blank_labels_host),
-                   reinterpret_cast<int *>(label_lengths_host), reinterpret_cast<int *>(input_lengths_host), costs,
-                   probs_desc_, grads, CUDNN_CTC_LOSS_ALGO_DETERMINISTIC, ctcloss_desc_, workspace, workspace_size),
-      "cudnnCtcLoss failed.");
+    CalculateFwdVar(log_alpha_b, label_value_with_blank, softmax_probs, sequence_length, ctc_merge_repeated_, batch,
+                    SOffSet, max_time, numclass - 1, label_squence_length, cum_labels_length,
+                    ignore_longer_outputs_than_inputs_, stream);
+    CalculateBwdVar(log_beta_b, label_value_with_blank, softmax_probs, sequence_length, ctc_merge_repeated_, batch,
+                    SOffSet, max_time, numclass - 1, label_squence_length, cum_labels_length,
+                    ignore_longer_outputs_than_inputs_, stream);
+    CTCLoss(log_alpha_b, log_beta_b, softmax_probs, label_value_with_blank, batch, SOffSet, max_time, numclass,
+            sequence_length, label_squence_length, cum_labels_length, costs, grads, prob_num,
+            ignore_longer_outputs_than_inputs_, stream);
     CHECK_CUDA_RET_WITH_EXCEPT(cudaStreamSynchronize(stream), "cudaStreamSynchronize failed.");
-    device::gpu::GPUMemoryAllocator::GetInstance().FreeTensorMem(workspace);
-    FreeHostMem(labels_host, no_blank_labels_host, input_lengths_host, label_lengths_host);
+    FreeMem(label_value_with_blank, log_alpha_b, log_beta_b);
     return true;
   }
   bool Init(const CNodePtr &kernel_node) override {
@@ -88,104 +135,98 @@ class CtcLossGpuKernel : public GpuKernel {
     probs_dims_[0] = probs_shape[0];
     probs_dims_[1] = probs_shape[1];
     probs_dims_[2] = probs_shape[2];
+    auto indice_dims = AnfAlgo::GetPrevNodeOutputInferShape(kernel_node, 1);
-    auto labels_dims = AnfAlgo::GetPrevNodeOutputInferShape(kernel_node, 1);
-    if (labels_dims.size() != 1 && labels_dims.size() != 2) {
+    auto labels_dims = AnfAlgo::GetPrevNodeOutputInferShape(kernel_node, 2);
+    if (labels_dims.size() != 1) {
       MS_LOG(EXCEPTION) << "labels dims: " << labels_dims.size() << " not support.";
     }
+    if (indice_dims.size() != 2) {
+      MS_LOG(EXCEPTION) << "labels indice dims: " << indice_dims.size() << " not support.";
+    }
     label_size_ = sizeof(int);
     for (auto i : labels_dims) {
       label_size_ *= i;
     }
+    label_indice_size_ = sizeof(int64_t);
-    auto input_length_dims = AnfAlgo::GetPrevNodeOutputInferShape(kernel_node, 2);
-    input_lengths_size_ = input_length_dims[0] * sizeof(int);
-    auto label_length_dims = AnfAlgo::GetPrevNodeOutputInferShape(kernel_node, 3);
-    label_lengths_size_ = label_length_dims[0] * sizeof(int);
-    CHECK_CUDNN_RET_WITH_EXCEPT(
-      cudnnSetTensorNdDescriptorEx(probs_desc_, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 3, probs_dims_),
-      "cudnnSetTensorNdDescriptorEx failed.");
-    CHECK_CUDNN_RET_WITH_EXCEPT(cudnnSetCTCLossDescriptorEx(ctcloss_desc_, CUDNN_DATA_FLOAT,
-                                                            CUDNN_LOSS_NORMALIZATION_SOFTMAX, CUDNN_PROPAGATE_NAN),
-                                "cudnnSetCTCLossDescriptorEx failed.");
+    for (auto i : indice_dims) {
+      label_indice_size_ *= i;
+    }
+    auto squence_length_dims = AnfAlgo::GetPrevNodeOutputInferShape(kernel_node, 3);
+    squence_lengths_size_ = squence_length_dims[0] * sizeof(int);
+    preprocess_collapse_repeated_ = GetAttr<bool>(kernel_node, "preprocess_collapse_repeated");
+    ctc_merge_repeated_ = GetAttr<bool>(kernel_node, "ctc_merge_repeated");
+    ignore_longer_outputs_than_inputs_ = GetAttr<bool>(kernel_node, "ignore_longer_outputs_than_inputs");
     InitSizeLists();
     return true;
   }

  protected:
-  void InitResource() override {
-    cudnn_handle_ = device::gpu::GPUDeviceManager::GetInstance().GetCudnnHandle();
-    CHECK_CUDNN_RET_WITH_EXCEPT(cudnnCreateTensorDescriptor(&probs_desc_), "cudnnCreateTensorDescriptor failed.");
-    CHECK_CUDNN_RET_WITH_EXCEPT(cudnnCreateCTCLossDescriptor(&ctcloss_desc_), "cudnnCreateCTCLossDescriptor failed.");
-  }
-
   void InitSizeLists() override {
-    input_size_list_.push_back(probs_dims_[0] * probs_dims_[1] * probs_dims_[2] * sizeof(float));
+    input_size_list_.push_back(probs_dims_[0] * probs_dims_[1] * probs_dims_[2] * sizeof(T));
+    input_size_list_.push_back(label_indice_size_);
     input_size_list_.push_back(label_size_);
-    input_size_list_.push_back(input_lengths_size_);
-    input_size_list_.push_back(label_lengths_size_);
-    output_size_list_.push_back(probs_dims_[1] * sizeof(float));
-    output_size_list_.push_back(probs_dims_[0] * probs_dims_[1] * probs_dims_[2] * sizeof(float));
+    input_size_list_.push_back(squence_lengths_size_);
+    workspace_size_list_.push_back(probs_dims_[0] * probs_dims_[1] * probs_dims_[2] * sizeof(T));
+    workspace_size_list_.push_back(squence_lengths_size_);
+    workspace_size_list_.push_back(squence_lengths_size_);
+    workspace_size_list_.push_back(label_size_);
+    workspace_size_list_.push_back(label_size_);
+    workspace_size_list_.push_back(probs_dims_[0] * probs_dims_[1] * probs_dims_[2] * sizeof(T));
+    workspace_size_list_.push_back(squence_lengths_size_);
+    workspace_size_list_.push_back(sizeof(int));
+    output_size_list_.push_back(probs_dims_[1] * sizeof(T));
+    output_size_list_.push_back(probs_dims_[0] * probs_dims_[1] * probs_dims_[2] * sizeof(T));
   }

- private:
-  void DestroyResource() noexcept {
-    CHECK_CUDNN_RET_WITH_ERROR(cudnnDestroyCTCLossDescriptor(ctcloss_desc_), "cudnnDestroyCTCLossDescriptor failed.");
-    CHECK_CUDNN_RET_WITH_ERROR(cudnnDestroyTensorDescriptor(probs_desc_), "cudnnDestroyTensorDescriptor failed.");
+  void MemsetForWS(int *label_value_pcr, int *cum_labels_length, int *label_squence_length, T *costs, T *grads,
+                   cudaStream_t stream) {
+    CHECK_CUDA_RET_WITH_EXCEPT(cudaMemsetAsync(label_value_pcr, static_cast<int>(0), label_size_, stream),
+                               "cudaMemSet failed in CtcLossGpuKernel::Launch.");
+    CHECK_CUDA_RET_WITH_EXCEPT(cudaMemsetAsync(cum_labels_length, static_cast<int>(0), squence_lengths_size_, stream),
+                               "cudaMemSet failed in CtcLossGpuKernel::Launch.");
+    CHECK_CUDA_RET_WITH_EXCEPT(
+      cudaMemsetAsync(label_squence_length, static_cast<int>(0), squence_lengths_size_, stream),
+      "cudaMemSet failed in CtcLossGpuKernel::Launch.");
+    CHECK_CUDA_RET_WITH_EXCEPT(cudaMemsetAsync(costs, static_cast<T>(0), probs_dims_[1] * sizeof(T), stream),
+                               "cudaMemSet failed in CtcLossGpuKernel::Launch.");
+    CHECK_CUDA_RET_WITH_EXCEPT(
+      cudaMemsetAsync(grads, static_cast<T>(0), probs_dims_[0] * probs_dims_[1] * probs_dims_[2] * sizeof(T), stream),
+      "cudaMemSet failed in CtcLossGpuKernel::Launch.");
   }

-  void AllocHostMem(int **labels_host, int **no_blank_labels_host, void **input_lengths_host, void **label_lengths_host,
-                    const std::vector<AddressPtr> &inputs) {
-    CHECK_CUDA_RET_WITH_EXCEPT(cudaMallocHost(labels_host, inputs[1]->size), "cudaMallocHost failed.");
-    CHECK_CUDA_RET_WITH_EXCEPT(cudaMallocHost(no_blank_labels_host, inputs[1]->size), "cudaMallocHost failed.");
+  void MemManageForCus(T **log_alpha_b, T **log_beta_b, int **label_value_with_blank, int *cum_labels_length,
+                       int log_prob_size, int batch, cudaStream_t stream) {
+    int total_labels_size_host = 0;
+    CHECK_CUDA_RET_WITH_EXCEPT(cudaMalloc(reinterpret_cast<void **>(log_alpha_b), sizeof(T) * log_prob_size),
|
"cudaMalloc failed.");
|
||||||
CHECK_CUDA_RET_WITH_EXCEPT(cudaMallocHost(input_lengths_host, inputs[2]->size), "cudaMallocHost failed.");
|
CHECK_CUDA_RET_WITH_EXCEPT(cudaMalloc(reinterpret_cast<void **>(log_beta_b), sizeof(T) * log_prob_size),
|
||||||
CHECK_CUDA_RET_WITH_EXCEPT(cudaMallocHost(label_lengths_host, inputs[3]->size), "cudaMallocHost failed.");
|
"cudaMalloc failed.");
|
||||||
}
|
CHECK_CUDA_RET_WITH_EXCEPT(cudaMemcpyAsync(&total_labels_size_host, cum_labels_length + batch - 1, sizeof(int),
|
||||||
|
cudaMemcpyDeviceToHost, stream),
|
||||||
void FreeHostMem(int *labels_host, int *no_blank_labels_host, void *input_lengths_host, void *label_lengths_host) {
|
"cudaMemcpyAsync failed.");
|
||||||
CHECK_CUDA_RET_WITH_EXCEPT(cudaFreeHost(label_lengths_host), "cudaFreeHost failed.");
|
|
||||||
CHECK_CUDA_RET_WITH_EXCEPT(cudaFreeHost(input_lengths_host), "cudaFreeHost failed.");
|
|
||||||
CHECK_CUDA_RET_WITH_EXCEPT(cudaFreeHost(labels_host), "cudaFreeHost failed.");
|
|
||||||
CHECK_CUDA_RET_WITH_EXCEPT(cudaFreeHost(no_blank_labels_host), "cudaFreeHost failed.");
|
|
||||||
}
|
|
||||||
|
|
||||||
void CopyToHostSync(int *labels_host, int *no_blank_labels_host, void *input_lengths_host, void *label_lengths_host,
|
|
||||||
const std::vector<AddressPtr> &inputs, cudaStream_t stream) {
|
|
||||||
CHECK_CUDA_RET_WITH_EXCEPT(cudaStreamSynchronize(stream), "cudaStreamSynchronize failed.");
|
CHECK_CUDA_RET_WITH_EXCEPT(cudaStreamSynchronize(stream), "cudaStreamSynchronize failed.");
|
||||||
CHECK_CUDA_RET_WITH_EXCEPT(
|
CHECK_CUDA_RET_WITH_EXCEPT(
|
||||||
cudaMemcpyAsync(labels_host, inputs[1]->addr, inputs[1]->size, cudaMemcpyDeviceToHost, stream),
|
cudaMalloc(reinterpret_cast<void **>(label_value_with_blank), sizeof(int) * (2 * total_labels_size_host + batch)),
|
||||||
"cudaMemcpyAsync failed.");
|
"cudaMalloc failed.");
|
||||||
CHECK_CUDA_RET_WITH_EXCEPT(
|
}
|
||||||
cudaMemcpyAsync(input_lengths_host, inputs[2]->addr, inputs[2]->size, cudaMemcpyDeviceToHost, stream),
|
|
||||||
"cudaMemcpyAsync failed.");
|
|
||||||
CHECK_CUDA_RET_WITH_EXCEPT(
|
|
||||||
cudaMemcpyAsync(label_lengths_host, inputs[3]->addr, inputs[3]->size, cudaMemcpyDeviceToHost, stream),
|
|
||||||
"cudaMemcpyAsync failed.");
|
|
||||||
CHECK_CUDA_RET_WITH_EXCEPT(cudaStreamSynchronize(stream), "cudaStreamSynchronize failed.");
|
|
||||||
|
|
||||||
// remove blank element
|
void FreeMem(int *label_value_with_blank, T *log_alpha_b, T *log_beta_b) {
|
||||||
size_t j = 0;
|
CHECK_CUDA_RET_WITH_EXCEPT(cudaFree(label_value_with_blank), "cudaFree failed.");
|
||||||
for (size_t i = 0; i < inputs[1]->size / sizeof(int); i++) {
|
CHECK_CUDA_RET_WITH_EXCEPT(cudaFree(log_alpha_b), "cudaFree failed.");
|
||||||
if (labels_host[i] != 0) {
|
CHECK_CUDA_RET_WITH_EXCEPT(cudaFree(log_beta_b), "cudaFree failed.");
|
||||||
no_blank_labels_host[j] = labels_host[i];
|
|
||||||
j++;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
}
|
||||||
|
|
||||||
std::vector<size_t> input_size_list_;
|
std::vector<size_t> input_size_list_;
|
||||||
std::vector<size_t> output_size_list_;
|
std::vector<size_t> output_size_list_;
|
||||||
std::vector<size_t> workspace_size_list_;
|
std::vector<size_t> workspace_size_list_;
|
||||||
|
|
||||||
cudnnHandle_t cudnn_handle_;
|
size_t probs_dims_[3] = {0};
|
||||||
cudnnTensorDescriptor_t probs_desc_;
|
int label_indice_size_;
|
||||||
cudnnCTCLossDescriptor_t ctcloss_desc_;
|
|
||||||
int probs_dims_[3] = {0};
|
|
||||||
int label_size_;
|
int label_size_;
|
||||||
int input_lengths_size_;
|
int squence_lengths_size_;
|
||||||
int label_lengths_size_;
|
bool preprocess_collapse_repeated_;
|
||||||
};
|
bool ctc_merge_repeated_;
|
||||||
|
bool ignore_longer_outputs_than_inputs_;
|
||||||
|
T kLogZero_ = -std::numeric_limits<T>::infinity();
|
||||||
|
}; // namespace kernel
|
||||||
} // namespace kernel
|
} // namespace kernel
|
||||||
} // namespace mindspore
|
} // namespace mindspore
|
||||||
|
|
||||||
|
|
|
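The CtcLoss rewrite above drops the cuDNN descriptors and manages raw CUDA buffers itself: workspaces are zeroed with cudaMemsetAsync on the launch stream, and a size read back from the device is synchronized before use. A minimal standalone sketch of that async-memset-then-sync pattern follows; it needs a CUDA toolchain, and the buffer names are illustrative rather than the PR's:

```
// Sketch of the pattern behind MemsetForWS / MemManageForCus above.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CHECK_CUDA(expr)                                     \
  do {                                                       \
    cudaError_t err = (expr);                                \
    if (err != cudaSuccess) {                                \
      std::fprintf(stderr, "%s\n", cudaGetErrorString(err)); \
      std::exit(1);                                          \
    }                                                        \
  } while (0)

int main() {
  cudaStream_t stream;
  CHECK_CUDA(cudaStreamCreate(&stream));
  int *cum_lengths = nullptr;
  size_t bytes = 8 * sizeof(int);
  CHECK_CUDA(cudaMalloc(reinterpret_cast<void **>(&cum_lengths), bytes));
  // Zero the workspace asynchronously on the kernel's stream ...
  CHECK_CUDA(cudaMemsetAsync(cum_lengths, 0, bytes, stream));
  int last = -1;
  // ... then copy the last element back and synchronize before using it,
  // mirroring how total_labels_size_host is obtained in MemManageForCus.
  CHECK_CUDA(cudaMemcpyAsync(&last, cum_lengths + 7, sizeof(int), cudaMemcpyDeviceToHost, stream));
  CHECK_CUDA(cudaStreamSynchronize(stream));
  std::printf("last = %d\n", last);  // prints 0
  CHECK_CUDA(cudaFree(cum_lengths));
  CHECK_CUDA(cudaStreamDestroy(stream));
  return 0;
}
```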
@@ -66,7 +66,6 @@ class MirrorPadGpuFwdKernel : public GpuKernel {
    }
-
    string mode = GetValue<string>(AnfAlgo::GetCNodePrimitive(kernel_node)->GetAttr("mode"));

    if (mode == "REFLECT") {
      mode_ = 0;  // reflected mirroring
    } else {
@@ -66,7 +66,6 @@ class MirrorPadGpuBackKernel : public GpuKernel {
    }
-
    string mode = GetValue<string>(AnfAlgo::GetCNodePrimitive(kernel_node)->GetAttr("mode"));

    if (mode == "REFLECT") {
      mode_ = 0;  // reflected mirroring
    } else {
@@ -27,6 +27,5 @@ MS_REG_GPU_KERNEL_ONE(
  ROIAlign,
  KernelAttr().AddInputAttr(kNumberTypeFloat16).AddInputAttr(kNumberTypeFloat16).AddOutputAttr(kNumberTypeFloat16),
  ROIAlignGpuFwdKernel, half)
-
 }  // namespace kernel
 }  // namespace mindspore
@@ -27,6 +27,5 @@ MS_REG_GPU_KERNEL_ONE(
  ROIAlignGrad,
  KernelAttr().AddInputAttr(kNumberTypeFloat16).AddInputAttr(kNumberTypeFloat16).AddOutputAttr(kNumberTypeFloat16),
  ROIAlignGradGpuFwdKernel, half)
-
 }  // namespace kernel
 }  // namespace mindspore
@@ -87,6 +87,7 @@ static std::map<string, string> tbe_func_adapter_map = {
  {"apply_centered_rms_prop", "apply_centered_rms_prop_d"},
  {"transpose", "transpose_d"},
  {"fill", "fill_d"},
+  {"i_fmr", "ifmr"},
  {"unsorted_segment_sum", "unsorted_segment_sum_d"},
  {"unsorted_segment_prod", "unsorted_segment_prod_d"},
  {"concat", "concat_d"},
@@ -139,7 +140,8 @@ static std::map<string, string> tbe_func_adapter_map = {
  {"inplace_update", "inplace_update_d"},
  {"matrix_diag", "matrix_diag_d"},
  {"matrix_diag_part", "matrix_diag_part_d"},
-  {"matrix_set_diag", "matrix_set_diag_d"}};
+  {"matrix_set_diag", "matrix_set_diag_d"},
+  {"l_stm_input_grad", "lstm_input_grad"}};

 void TbeAdapter::NormalizeFuncName(std::string *func_name) {
  if (func_name == nullptr) {
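These entries map normalized MindSpore op names onto the names the TBE compiler expects. A small sketch of how such an adapter map is typically consulted, falling back to the original name when no alias exists (the helper function here is illustrative, not the PR's code):

```
#include <iostream>
#include <map>
#include <string>

static const std::map<std::string, std::string> adapter_map = {
  {"matrix_set_diag", "matrix_set_diag_d"},
  {"l_stm_input_grad", "lstm_input_grad"},
};

// Return the TBE alias if one is registered, otherwise keep the name as-is.
std::string AdaptFuncName(const std::string &name) {
  auto it = adapter_map.find(name);
  return it != adapter_map.end() ? it->second : name;
}

int main() {
  std::cout << AdaptFuncName("l_stm_input_grad") << "\n";  // lstm_input_grad
  std::cout << AdaptFuncName("relu") << "\n";              // relu (unmapped)
  return 0;
}
```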
@@ -25,6 +25,7 @@
 #include "backend/kernel_compiler/tbe/tbe_convert_utils.h"
 #include "backend/kernel_compiler/tbe/tbe_utils.h"
 #include "utils/ms_context.h"
+#include "runtime/dev.h"

 namespace mindspore {
 namespace kernel {
@@ -86,6 +87,8 @@ constexpr auto kJPyModulePath = "py_module_path";
 constexpr auto kJPreBuildOutsAttrs = "prebuild_outs_attrs";
 constexpr auto kJKwdArgs = "kwds_args";
 constexpr auto kJListArgs = "list_args";
+constexpr auto kJSocVersion = "socVersion";
+constexpr auto kSOC_VERSION = "SOC_VERSION";

 bool TbeKernelJsonCreator::GenTbeSingleKernelJson(const std::shared_ptr<mindspore::AnfNode> &anf_node,
                                                   nlohmann::json *kernel_json) {
@@ -122,6 +125,8 @@ bool TbeKernelJsonCreator::GenTbeSingleKernelJson(const std::shared_ptr<mindspor
  nlohmann::json attrs_json;
  (void)GenTbeAttrJson(anf_node, op_info_ptr, &attrs_json);
  op_info_json[kJAttrs] = attrs_json;
+  auto soc_version = TbeKernelJsonCreator::GetSocVersion();
+  op_info_json[kJSocVersion] = soc_version;
  std::string json_str = op_info_json.dump();
  size_t hash_id = std::hash<std::string>()(json_str);
  auto context_ptr = MsContext::GetInstance();
@@ -150,7 +155,13 @@ bool TbeKernelJsonCreator::GenInputDescJson(const std::shared_ptr<AnfNode> &anf_
  MS_EXCEPTION_IF_NULL(input_ptr);
  MS_EXCEPTION_IF_NULL(input_list);
  std::string op_name = AnfAlgo::GetCNodeName(anf_node);
-  if (input_ptr->name() == "input_indices" && op_name == kTopKOpName) {
+  if (op_name == kDynamicRNNOpName && input_ptr->name() == "seq_length") {
+    nlohmann::json input_desc_json;
+    auto in_name = input_ptr->name();
+    input_desc_json[kJName] = in_name + std::to_string(input_i);
+    input_desc_json[kJValid] = false;
+    input_list->emplace_back(input_desc_json);
+  } else if (input_ptr->name() == "input_indices" && op_name == kTopKOpName) {
    TbeAdapter::GenTopKV2IndicesTensorInfo(anf_node, real_input_index, input_list, creater_type_);
  } else {
    auto dtype = GetDeviceInputType(anf_node, real_input_index);
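The new branch emits a stub descriptor for DynamicRNN's optional seq_length input: only a name and valid=false, so TBE treats the slot as absent. A standalone sketch of that descriptor using nlohmann::json (the "name"/"valid" key strings stand in for the kJName/kJValid constants):

```
#include <iostream>
#include <nlohmann/json.hpp>

int main() {
  nlohmann::json input_desc;
  // Name the slot after the input plus its index, then mark it invalid
  // so the compiler skips it, as GenInputDescJson does above.
  input_desc["name"] = std::string("seq_length") + std::to_string(3);
  input_desc["valid"] = false;
  std::cout << input_desc.dump(2) << std::endl;
  return 0;
}
```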
@@ -408,6 +419,30 @@ bool TbeKernelJsonCreator::GenTbeAttrJson(const std::shared_ptr<AnfNode> &anf_no
  return true;
 }

+string TbeKernelJsonCreator::GetSocVersion() {
+  // Get default soc version.
+  const int kSocVersionLen = 50;
+  char soc_version[kSocVersionLen] = {0};
+  auto ret = rtGetSocVersion(soc_version, kSocVersionLen);
+  if (ret != RT_ERROR_NONE) {
+    MS_LOG(EXCEPTION) << "GetSocVersion failed.";
+  }
+  MS_LOG(INFO) << "Default SocVersion is " << soc_version;
+  // Get soc version from env value.
+  const char *soc_version_env = getenv(kSOC_VERSION);
+  if (soc_version_env != nullptr) {
+    if (std::strcmp(soc_version, soc_version_env) != 0) {
+      MS_LOG(WARNING) << "SocVerison change to " << soc_version_env;
+      ret = rtSetSocVersion(soc_version_env);
+      if (ret != RT_ERROR_NONE) {
+        MS_LOG(EXCEPTION) << "SetSocVersion to " << soc_version_env << " failed, errorno: " << ret;
+      }
+      return soc_version_env;
+    }
+  }
+  return soc_version;
+}
+
 void TbeKernelJsonCreator::ParseAttrValue(const std::string &type, const mindspore::ValuePtr &value,
                                           nlohmann::json *attr_obj) {
  MS_EXCEPTION_IF_NULL(value);
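GetSocVersion queries the runtime for a default and only overrides it when the SOC_VERSION environment variable is set to something different. The same pattern, stripped of the Ascend-only rtGetSocVersion/rtSetSocVersion calls (the hard-coded default below is a stand-in for the runtime query):

```
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <string>

std::string GetSocVersion() {
  std::string soc_version = "Ascend910";  // stand-in for rtGetSocVersion()
  const char *env = std::getenv("SOC_VERSION");
  if (env != nullptr && std::strcmp(soc_version.c_str(), env) != 0) {
    // In the PR this is where rtSetSocVersion(env) is called and checked.
    return env;
  }
  return soc_version;
}

int main() {
  std::cout << GetSocVersion() << std::endl;
  return 0;
}
```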
@@ -624,6 +659,8 @@ bool TbeKernelBuild::GenFusionScopeJson(const std::vector<mindspore::AnfNodePtr>
  index = 0;
  data_list.insert(data_list.end(), compute_list.begin(), compute_list.end());
  (*fusion_json)[kFusionOpList] = data_list;
+  auto soc_version = TbeKernelJsonCreator::GetSocVersion();
+  (*fusion_json)[kJSocVersion] = soc_version;
  return true;
 }

@@ -853,6 +890,7 @@ bool TbeKernelBuild::GenFusionDataInputJson(const std::shared_ptr<mindspore::Anf
  (*data_str)[kJName] = name;
  nlohmann::json output_desc;
  output_desc[kJName] = name;
+  output_desc[kJDataType] = 0;
  output_desc[kJShape] = "NULL";
  output_desc_list.push_back(output_desc);
  (*index)++;
@@ -985,6 +1023,7 @@ bool TbeKernelBuild::GenFusionComputeInputJson(const mindspore::CNodePtr &cnode,
  for (size_t i = 0; i < optional_num; ++i) {
    nlohmann::json optional_input_desc;
    optional_input_desc[kJName] = std::string(kOptional) + std::to_string(*index);
+    optional_input_desc[kJShape] = "NULL";
    (*index)++;
    (*layer_iter)->emplace_back(nullptr);
    input_desc_list_tmp.emplace_back(optional_input_desc);
@@ -92,6 +92,7 @@ class TbeKernelJsonCreator {
  std::string json_name() { return json_name_; }
  bool GenTbeAttrJson(const std::shared_ptr<AnfNode> &anf_node, const std::shared_ptr<OpInfo> &op_info,
                      nlohmann::json *attrs_json);
+  static string GetSocVersion();

 private:
  bool GenTbeInputsJson(const std::shared_ptr<AnfNode> &anf_node, const std::shared_ptr<OpInfo> &op_info,
@@ -19,6 +19,7 @@
 #include <memory>
 #include <string>
 #include "backend/optimizer/common/optimizer.h"
+#include "backend/optimizer/ascend/ir_fission/dynamic_rnn_grad_fission.h"
 #include "backend/optimizer/ascend/ir_fission/bn_split.h"
 #include "backend/optimizer/ascend/ir_fission/bn_grad_split.h"
 #include "backend/optimizer/ascend/ir_fission/batch_norm_grad_split.h"
@@ -64,6 +65,7 @@
 #include "backend/optimizer/ascend/format_type/rectify_do_mask_kernel_info.h"
 #include "backend/optimizer/ascend/format_type/chang_axis_of_reduce_kernel.h"
 #include "backend/optimizer/ascend/format_type/split_unsupported_transdata.h"
+#include "backend/optimizer/ascend/format_type/insert_reshape_for_extract_image_patches_op.h"
 #include "backend/optimizer/pass/getitem_tuple.h"
 #include "backend/optimizer/pass/optimize_dependence.h"
 #include "backend/optimizer/pass/erase_visit_attr.h"
@@ -106,6 +108,7 @@
 #include "backend/optimizer/ascend/ir_fission/concat_fission.h"
 #include "backend/optimizer/ascend/ir_fission/pack_fission.h"
 #include "backend/optimizer/ascend/enhancer/concat_outputs_for_all_gather.h"
+#include "backend/optimizer/ascend/enhancer/add_placeholder_for_dynamic_rnn.h"
 #include "utils/ms_context.h"
 #include "backend/optimizer/graph_kernel/composite_ops_fusion.h"
 #include "backend/optimizer/graph_kernel/basic_ops_fusion.h"
@@ -231,6 +234,7 @@ void AscendMixPrecision(const std::shared_ptr<session::KernelGraph> &kernel_grap
  auto optimizer = std::make_shared<GraphOptimizer>();
  auto mixed_precision_pm = std::make_shared<PassManager>("cast_pm");
  mixed_precision_pm->AddPass(std::make_shared<InsertCast>());
+  mixed_precision_pm->AddPass(std::make_shared<InsertReshapeForExtractImagePatchesOp>());
  mixed_precision_pm->AddPass(std::make_shared<GetitemTuple>());
  mixed_precision_pm->AddPass(std::make_shared<CommonSubexpressionElimination>());
  mixed_precision_pm->AddPass(std::make_shared<EliminateRedundantOp>());
@@ -276,6 +280,8 @@ void AscendBackendIRFusionOptimization(const std::shared_ptr<session::KernelGrap
  }
  ir_fusion_pm->AddPass(std::make_shared<LayerNormGradSplit>());
  ir_fusion_pm->AddPass(std::make_shared<InsertPadForNMSWithMask>());
+  ir_fusion_pm->AddPass(std::make_shared<InsertPlaceholderForDynamicRNN>());
+  ir_fusion_pm->AddPass(std::make_shared<DynamicRNNGradFission>());
  AddAscendIRFusionRulesPass(ir_fusion_pm.get());
  AddAscendIRFusionPass(ir_fusion_pm.get());
@@ -0,0 +1,77 @@
+/**
+ * Copyright 2020 Huawei Technologies Co., Ltd
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "backend/optimizer/ascend/enhancer/add_placeholder_for_dynamic_rnn.h"
+#include <vector>
+#include <memory>
+#include "backend/optimizer/common/helper.h"
+#include "backend/session/anf_runtime_algorithm.h"
+#include "utils/utils.h"
+#include "abstract/abstract_value.h"
+#include "base/core_ops.h"
+
+namespace mindspore {
+namespace opt {
+const BaseRef InsertPlaceholderForDynamicRNN::DefinePattern() const {
+  std::shared_ptr<Var> V = std::make_shared<CondVar>(UnVisited);
+  std::shared_ptr<Var> Xs = std::make_shared<SeqVar>();
+  return VectorRef({V, Xs});
+}
+
+const AnfNodePtr InsertPlaceholderForDynamicRNN::Process(const FuncGraphPtr &func_graph, const AnfNodePtr &node,
+                                                         const EquivPtr &) const {
+  MS_EXCEPTION_IF_NULL(func_graph);
+  MS_EXCEPTION_IF_NULL(node);
+  auto cnode = node->cast<CNodePtr>();
+  MS_EXCEPTION_IF_NULL(cnode);
+  auto op_name = AnfAlgo::GetCNodeName(cnode);
+  if (op_name != kDynamicRNNOpName) {
+    return nullptr;
+  }
+  AnfAlgo::SetNodeAttr(kAttrVisited, MakeValue(true), node);
+  auto kernel_graph = func_graph->cast<std::shared_ptr<session::KernelGraph>>();
+  MS_EXCEPTION_IF_NULL(kernel_graph);
+  size_t input_num = AnfAlgo::GetInputTensorNum(node);
+  if (input_num == 0) {
+    return nullptr;
+  }
+
+  std::vector<AnfNodePtr> new_inputs = {AnfAlgo::GetCNodePrimitiveNode(cnode)};
+  for (size_t in_idx = 0; in_idx < input_num; in_idx++) {
+    auto input_node = AnfAlgo::GetInputNode(cnode, in_idx);
+    if (in_idx == 3) {
+      auto value = std::make_shared<None>();
+      auto value_node = NewValueNode(value);
+      value_node->set_abstract(std::make_shared<abstract::AbstractNone>());
+      auto new_node = kernel_graph->NewValueNode(value_node);
+      kernel_graph->AddValueNodeToGraph(new_node);
+      new_inputs.push_back(new_node);
+    }
+    new_inputs.push_back(input_node);
+  }
+
+  CNodePtr new_node = nullptr;
+  if (kernel_graph == nullptr) {
+    new_node = std::make_shared<CNode>(*cnode);
+  } else {
+    new_node = kernel_graph->NewCNode(cnode);
+  }
+  MS_EXCEPTION_IF_NULL(new_node);
+  new_node->set_inputs(new_inputs);
+  return new_node;
+}
+}  // namespace opt
+}  // namespace mindspore
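The effect of this pass on DynamicRNN's argument list is simply a positional splice: a None value node goes in at slot 3 so every later operand keeps its expected position. A sketch with strings standing in for AnfNodePtr inputs (the operand names are illustrative):

```
#include <iostream>
#include <string>
#include <vector>

int main() {
  std::vector<std::string> inputs = {"x", "w", "b", "init_h", "init_c"};
  std::vector<std::string> new_inputs;
  for (size_t i = 0; i < inputs.size(); ++i) {
    if (i == 3) {
      // Placeholder for the omitted seq_length operand, as in Process() above.
      new_inputs.push_back("None");
    }
    new_inputs.push_back(inputs[i]);
  }
  for (const auto &in : new_inputs) {
    std::cout << in << " ";  // x w b None init_h init_c
  }
  std::cout << std::endl;
  return 0;
}
```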
@@ -0,0 +1,37 @@
+/**
+ * Copyright 2020 Huawei Technologies Co., Ltd
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+#ifndef MINDSPORE_CCSRC_PRE_ACTIVATE_ASCEND_ENHANCER_ADD_PLACEHOLDER_FOR_DYNAMIC_RNN_H
+#define MINDSPORE_CCSRC_PRE_ACTIVATE_ASCEND_ENHANCER_ADD_PLACEHOLDER_FOR_DYNAMIC_RNN_H
+
+#include <memory>
+#include <vector>
+#include "backend/optimizer/common/optimizer.h"
+#include "backend/optimizer/ascend/ascend_helper.h"
+
+namespace mindspore {
+namespace opt {
+class InsertPlaceholderForDynamicRNN : public PatternProcessPass {
+ public:
+  explicit InsertPlaceholderForDynamicRNN(bool multigraph = true)
+      : PatternProcessPass("add_placeholder_for_dynamic_rnn", multigraph) {}
+  ~InsertPlaceholderForDynamicRNN() override = default;
+  const BaseRef DefinePattern() const override;
+  const AnfNodePtr Process(const FuncGraphPtr &, const AnfNodePtr &, const EquivPtr &) const override;
+};
+}  // namespace opt
+}  // namespace mindspore
+
+#endif  // MINDSPORE_CCSRC_PRE_ACTIVATE_ASCEND_ENHANCER_ADD_PLACEHOLDER_FOR_DYNAMIC_RNN_H
@@ -0,0 +1,65 @@
+/**
+ * Copyright 2020 Huawei Technologies Co., Ltd
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "backend/optimizer/ascend/format_type/insert_reshape_for_extract_image_patches_op.h"
+#include <memory>
+#include "backend/optimizer/ascend/ascend_helper.h"
+#include "backend/session/anf_runtime_algorithm.h"
+#include "utils/utils.h"
+#include "base/core_ops.h"
+
+namespace mindspore {
+namespace opt {
+const BaseRef InsertReshapeForExtractImagePatchesOp::DefinePattern() const {
+  VarPtr Xs = std::make_shared<SeqVar>();
+  return VectorRef({prim::kPrimExtractImagePatches, Xs});
+}
+
+const AnfNodePtr InsertReshapeForExtractImagePatchesOp::Process(const FuncGraphPtr &func_graph, const AnfNodePtr &node,
+                                                                const EquivPtr &equiv) const {
+  MS_EXCEPTION_IF_NULL(func_graph);
+  MS_EXCEPTION_IF_NULL(equiv);
+  auto extract = CheckAnfNodeIfCNodeAndInputSize(node, 2);
+  MS_EXCEPTION_IF_NULL(extract);
+  auto in_node = extract->input(1);
+  MS_EXCEPTION_IF_NULL(in_node);
+  auto extract_kernel_build_info = AnfAlgo::GetSelectKernelBuildInfo(extract);
+  auto in_node_kernel_build_info = AnfAlgo::GetSelectKernelBuildInfo(in_node);
+  MS_EXCEPTION_IF_NULL(extract_kernel_build_info);
+  MS_EXCEPTION_IF_NULL(in_node_kernel_build_info);
+  std::vector<AnfNodePtr> reshape_inputs = {NewValueNode(std::make_shared<Primitive>(prim::kPrimReshape->name())),
+                                            in_node};
+  auto reshape_builder = std::make_shared<kernel::KernelBuildInfo::KernelBuildInfoBuilder>();
+  reshape_builder->SetInputsFormat({kOpFormat_NC1HWC0});
+  reshape_builder->SetOutputsFormat({kOpFormat_NC1HWC0});
+  reshape_builder->SetInputsDeviceType({AnfAlgo::GetOutputDeviceDataType(in_node, 0)});
+  reshape_builder->SetOutputsDeviceType({AnfAlgo::GetOutputDeviceDataType(in_node, 0)});
+  reshape_builder->SetKernelType(in_node_kernel_build_info->kernel_type());
+  reshape_builder->SetFusionType(in_node_kernel_build_info->fusion_type());
+  reshape_builder->SetProcessor(in_node_kernel_build_info->processor());
+
+  auto reshape = func_graph->NewCNode(reshape_inputs);
+  reshape->set_scope(in_node->scope());
+  auto shape_tmp = AnfAlgo::GetOutputInferShape(in_node, 0);
+  AnfAlgo::SetOutputInferTypeAndShape({AnfAlgo::GetOutputDeviceDataType(in_node, 0)},
+                                      {{shape_tmp[0], shape_tmp[2], shape_tmp[3], shape_tmp[1]}}, reshape.get());
+  AnfAlgo::SetSelectKernelBuildInfo(reshape_builder->Build(), reshape.get());
+  AnfAlgo::SetNodeAttr("nop_op", MakeValue(true), reshape);
+  AnfAlgo::SetNodeInput(extract, reshape, 0);
+  return extract;
+}
+}  // namespace opt
+}  // namespace mindspore
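The inserted Reshape does no data movement (it is tagged "nop_op"); it only re-labels the inferred NCHW shape as {N, H, W, C}, i.e. indices 0, 2, 3, 1, matching the `{shape_tmp[0], shape_tmp[2], shape_tmp[3], shape_tmp[1]}` line above. A sketch of just that permutation (the helper name is illustrative):

```
#include <cstddef>
#include <iostream>
#include <vector>

// Reorder an NCHW shape into NHWC.
std::vector<size_t> NchwToNhwc(const std::vector<size_t> &s) {
  return {s[0], s[2], s[3], s[1]};
}

int main() {
  std::vector<size_t> nchw = {32, 16, 224, 224};
  for (size_t d : NchwToNhwc(nchw)) {
    std::cout << d << " ";  // 32 224 224 16
  }
  std::cout << std::endl;
  return 0;
}
```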
@@ -0,0 +1,41 @@
+/**
+ * Copyright 2020 Huawei Technologies Co., Ltd
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef MINDSPORE_CCSRC_BACKEND_OPTIMIZER_ASCEND_FORMAT_TYPE_INSERT_RESHAPE_FOR_EXTRACT_IMAGE_PATCHES_OP_H
+#define MINDSPORE_CCSRC_BACKEND_OPTIMIZER_ASCEND_FORMAT_TYPE_INSERT_RESHAPE_FOR_EXTRACT_IMAGE_PATCHES_OP_H
+#include <vector>
+#include <string>
+#include <utility>
+#include <memory>
+#include "ir/anf.h"
+#include "backend/optimizer/common/pattern_engine.h"
+#include "backend/optimizer/common/helper.h"
+#include "backend/optimizer/common/optimizer.h"
+
+namespace mindspore {
+namespace opt {
+class InsertReshapeForExtractImagePatchesOp : public PatternProcessPass {
+ public:
+  explicit InsertReshapeForExtractImagePatchesOp(bool multigraph = true)
+      : PatternProcessPass("insert_reshape_for_extract_image_patches_op", multigraph) {}
+  ~InsertReshapeForExtractImagePatchesOp() override = default;
+  const BaseRef DefinePattern() const override;
+  const AnfNodePtr Process(const FuncGraphPtr &, const AnfNodePtr &, const EquivPtr &) const override;
+};
+}  // namespace opt
+}  // namespace mindspore
+
+#endif  // MINDSPORE_CCSRC_BACKEND_OPTIMIZER_ASCEND_FORMAT_TYPE_INSERT_RESHAPE_FOR_EXTRACT_IMAGE_PATCHES_OP_H
@@ -0,0 +1,250 @@
+/**
+ * Copyright 2020 Huawei Technologies Co., Ltd
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+#include "backend/optimizer/ascend/ir_fission/dynamic_rnn_grad_fission.h"
+#include <vector>
+#include <memory>
+#include <algorithm>
+#include "backend/session/anf_runtime_algorithm.h"
+#include "backend/optimizer/common/helper.h"
+
+namespace mindspore {
+namespace opt {
+constexpr size_t kDynamicRNNGradInputNum = 16;
+constexpr size_t kLSTMInputGradOutputNum = 4;
+const BaseRef DynamicRNNGradFission::DefinePattern() const {
+  VarPtr Xs = std::make_shared<SeqVar>();
+  return VectorRef({prim::kPrimDynamicRNNGrad, Xs});
+}
+
+AnfNodePtr CreateSplitVD(const FuncGraphPtr &graph, const AnfNodePtr &node) {
+  MS_EXCEPTION_IF_NULL(graph);
+  MS_EXCEPTION_IF_NULL(node);
+  // SplitV
+  std::vector<AnfNodePtr> splitvd_input = {NewValueNode(std::make_shared<Primitive>(prim::kPrimSplitV->name())), node};
+  auto split_vd = graph->NewCNode(splitvd_input);
+  MS_EXCEPTION_IF_NULL(split_vd);
+  auto dtypes = {AnfAlgo::GetOutputInferDataType(node, 0), AnfAlgo::GetOutputInferDataType(node, 0)};
+  std::vector<size_t> shape = {AnfAlgo::GetOutputInferShape(node, 0)[0] - 1, AnfAlgo::GetOutputInferShape(node, 0)[1],
+                               AnfAlgo::GetOutputInferShape(node, 0)[2]};
+  auto shape2 = {IntToSize(1), AnfAlgo::GetOutputInferShape(node, 0)[1], AnfAlgo::GetOutputInferShape(node, 0)[2]};
+  std::vector<std::vector<size_t>> shapes = {shape, shape2};
+  AnfAlgo::SetOutputInferTypeAndShape(dtypes, shapes, split_vd.get());
+  AnfAlgo::SetNodeAttr("split_dim", MakeValue(0), split_vd);
+  AnfAlgo::SetNodeAttr("num_split", MakeValue(2), split_vd);
+  int tmp = SizeToInt(AnfAlgo::GetOutputInferShape(node, 0)[0]) - 1;
+  AnfAlgo::SetNodeAttr("size_splits", MakeValue(std::vector<int>{tmp, 1}), split_vd);
+  AnfAlgo::SetNodeAttr("is_backend_insert", MakeValue(true), split_vd);
+  return split_vd;
+}
+
+AnfNodePtr CreateLSTMInputGrad(const FuncGraphPtr &graph, const AnfNodePtr &node) {
+  MS_EXCEPTION_IF_NULL(graph);
+  MS_EXCEPTION_IF_NULL(node);
+  auto cnode = node->cast<CNodePtr>();
+  MS_EXCEPTION_IF_NULL(cnode);
+  const auto &dynamic_rnn_grad_inputs = cnode->inputs();
+  std::vector<AnfNodePtr> lstm_input_grad_inputs = {NewValueNode(std::make_shared<Primitive>(kLSTMInputGradOpName)),
+                                                    dynamic_rnn_grad_inputs[2],
+                                                    dynamic_rnn_grad_inputs[6],
+                                                    dynamic_rnn_grad_inputs[8],
+                                                    dynamic_rnn_grad_inputs[9],
+                                                    dynamic_rnn_grad_inputs[10],
+                                                    dynamic_rnn_grad_inputs[11],
+                                                    dynamic_rnn_grad_inputs[12],
+                                                    dynamic_rnn_grad_inputs[13],
+                                                    dynamic_rnn_grad_inputs[14],
+                                                    dynamic_rnn_grad_inputs[15],
+                                                    dynamic_rnn_grad_inputs[16]};
+  std::vector<AnfNodePtr> ori_outputs;
+  CreateMultipleOutputsOfAnfNode(graph, node, 5, &ori_outputs);
+  auto lstm_op = graph->NewCNode(lstm_input_grad_inputs);
+  MS_EXCEPTION_IF_NULL(lstm_op);
+  auto ori_type = AnfAlgo::GetOutputInferDataType(dynamic_rnn_grad_inputs[8], 0);
+  auto types = {AnfAlgo::GetOutputInferDataType(ori_outputs[2], 0), AnfAlgo::GetOutputInferDataType(ori_outputs[3], 0),
+                AnfAlgo::GetOutputInferDataType(ori_outputs[4], 0), ori_type};
+  std::vector<size_t> ori_shape = {AnfAlgo::GetOutputInferShape(dynamic_rnn_grad_inputs[8], 0)[0],
+                                   AnfAlgo::GetOutputInferShape(dynamic_rnn_grad_inputs[8], 0)[1],
+                                   4 * AnfAlgo::GetOutputInferShape(dynamic_rnn_grad_inputs[8], 0)[2]};
+  auto shapes = {AnfAlgo::GetOutputInferShape(ori_outputs[2], 0), AnfAlgo::GetOutputInferShape(ori_outputs[3], 0),
+                 AnfAlgo::GetOutputInferShape(ori_outputs[4], 0), ori_shape};
+  AnfAlgo::SetOutputInferTypeAndShape(types, shapes, lstm_op.get());
+  return lstm_op;
+}
+
+AnfNodePtr CreateBatchMatMul(const FuncGraphPtr &graph, const AnfNodePtr &node1, const AnfNodePtr &node2) {
+  MS_EXCEPTION_IF_NULL(graph);
+  MS_EXCEPTION_IF_NULL(node1);
+  MS_EXCEPTION_IF_NULL(node2);
+  // BatchMatMul
+  std::vector<AnfNodePtr> matmul_inputs = {NewValueNode(std::make_shared<Primitive>(prim::kPrimBatchMatMul->name())),
+                                           node2, node1};
+  auto batch_matmul = graph->NewCNode(matmul_inputs);
+  MS_EXCEPTION_IF_NULL(batch_matmul);
+  auto types = {AnfAlgo::GetOutputInferDataType(node1, 0)};
+  std::vector<size_t> shape = {AnfAlgo::GetOutputInferShape(node2, 0)[0], AnfAlgo::GetOutputInferShape(node2, 0)[2],
+                               AnfAlgo::GetOutputInferShape(node1, 0)[2]};
+  auto shapes = {shape};
+  AnfAlgo::SetNodeAttr("is_backend_insert", MakeValue(true), batch_matmul);
+  AnfAlgo::SetNodeAttr("transpose_x1", MakeValue(true), batch_matmul);
+  AnfAlgo::SetNodeAttr("transpose_x2", MakeValue(false), batch_matmul);
+  AnfAlgo::SetOutputInferTypeAndShape(types, shapes, batch_matmul.get());
+  return batch_matmul;
+}
+
+AnfNodePtr AddHConcatD(const FuncGraphPtr &graph, const AnfNodePtr &node1, const AnfNodePtr &node2) {
+  MS_EXCEPTION_IF_NULL(graph);
+  MS_EXCEPTION_IF_NULL(node1);
+  MS_EXCEPTION_IF_NULL(node2);
+  std::vector<AnfNodePtr> ori_outputs;
+  CreateMultipleOutputsOfAnfNode(graph, node2, 2, &ori_outputs);
+  auto ori_shape = AnfAlgo::GetOutputInferShape(node1, 0);
+  std::vector<std::vector<size_t>> shape_tmp;
+  if (ori_shape.size() == 3) {
+    shape_tmp = {ori_shape};
+  } else {
+    shape_tmp = {{IntToSize(1), ori_shape[0], ori_shape[1]}};
+  }
+  auto ori_dtype = {AnfAlgo::GetOutputInferDataType(node1, 0)};
+  // reshape
+  std::vector<AnfNodePtr> reshape_input = {NewValueNode(std::make_shared<Primitive>(prim::kPrimReshape->name())),
+                                           node1};
+  auto reshape = graph->NewCNode(reshape_input);
+  AnfAlgo::SetNodeAttr(kAttrVisited, MakeValue(true), reshape);
+  AnfAlgo::SetOutputInferTypeAndShape(ori_dtype, shape_tmp, reshape.get());
+
+  // concatd --> concat
+  std::vector<AnfNodePtr> concat_inputs = {NewValueNode(std::make_shared<Primitive>(prim::kPrimConcat->name())),
+                                           reshape, ori_outputs[0]};
+  auto concat_op = graph->NewCNode(concat_inputs);
+  MS_EXCEPTION_IF_NULL(concat_op);
+  std::vector<size_t> input = {AnfAlgo::GetOutputInferShape(node2, 0)[0] + 1, AnfAlgo::GetOutputInferShape(node2, 0)[1],
+                               AnfAlgo::GetOutputInferShape(node2, 0)[2]};
+  auto types = {AnfAlgo::GetOutputInferDataType(node1, 0)};
+  auto shapes = {input};
+  AnfAlgo::SetOutputInferTypeAndShape(types, shapes, concat_op.get());
+  AnfAlgo::SetNodeAttr(kAttrN, MakeValue(2), concat_op);
+  AnfAlgo::SetNodeAttr(kAttrDynInputSizes, MakeValue(std::vector<int>{2}), concat_op);
+  AnfAlgo::SetNodeAttr("axis", MakeValue(0), concat_op);
+  AnfAlgo::SetNodeAttr("is_backend_insert", MakeValue(true), concat_op);
+  return concat_op;
+}
+
+AnfNodePtr AddConcatD(const FuncGraphPtr &graph, const AnfNodePtr &node1, const AnfNodePtr &node2) {
+  MS_EXCEPTION_IF_NULL(graph);
+  MS_EXCEPTION_IF_NULL(node1);
+  MS_EXCEPTION_IF_NULL(node2);
+  // concatd --> concat
+  std::vector<AnfNodePtr> concat_inputs = {NewValueNode(std::make_shared<Primitive>(prim::kPrimConcat->name())), node1,
+                                           node2};
+  auto concat_op = graph->NewCNode(concat_inputs);
+  MS_EXCEPTION_IF_NULL(concat_op);
+  std::vector<size_t> input = {AnfAlgo::GetOutputInferShape(node1, 0)[0], AnfAlgo::GetOutputInferShape(node1, 0)[1],
+                               AnfAlgo::GetOutputInferShape(node1, 0)[2] + AnfAlgo::GetOutputInferShape(node2, 0)[2]};
+  auto types = {AnfAlgo::GetOutputInferDataType(node1, 0)};
+  auto shapes = {input};
+  AnfAlgo::SetOutputInferTypeAndShape(types, shapes, concat_op.get());
+  AnfAlgo::SetNodeAttr(kAttrN, MakeValue(2), concat_op);
+  AnfAlgo::SetNodeAttr(kAttrDynInputSizes, MakeValue(std::vector<int>{2}), concat_op);
+  AnfAlgo::SetNodeAttr("axis", MakeValue(2), concat_op);
+  AnfAlgo::SetNodeAttr("is_backend_insert", MakeValue(true), concat_op);
+  return concat_op;
+}
+
+AnfNodePtr AddDwReduceSum(const FuncGraphPtr &graph, const AnfNodePtr &node1, const AnfNodePtr &node2) {
+  // node1 : dynamic output
+  // node2 : matmul
+  MS_EXCEPTION_IF_NULL(graph);
+  MS_EXCEPTION_IF_NULL(node1);
+  MS_EXCEPTION_IF_NULL(node2);
+  std::vector<AnfNodePtr> ori_outputs;
+  CreateMultipleOutputsOfAnfNode(graph, node1, 5, &ori_outputs);
+  // ReduceSumd
+  std::vector<AnfNodePtr> reducesum_inputs = {NewValueNode(std::make_shared<Primitive>(prim::kPrimReduceSum->name())),
+                                              node2};
+  auto reduce_sumd = graph->NewCNode(reducesum_inputs);
+  MS_EXCEPTION_IF_NULL(reduce_sumd);
+  auto types = {AnfAlgo::GetOutputInferDataType(ori_outputs[0], 0)};
+  auto shapes = {AnfAlgo::GetOutputInferShape(ori_outputs[0], 0)};
+  AnfAlgo::SetOutputInferTypeAndShape(types, shapes, reduce_sumd.get());
+  AnfAlgo::SetNodeAttr(kAttrAxis, MakeValue(std::vector<int>{0}), reduce_sumd);
+  AnfAlgo::SetNodeAttr("keep_dims", MakeValue(false), reduce_sumd);
+  AnfAlgo::SetNodeAttr("is_backend_insert", MakeValue(true), reduce_sumd);
+  return reduce_sumd;
+}
+
+AnfNodePtr AddDbReduceSum(const FuncGraphPtr &graph, const AnfNodePtr &node1, const AnfNodePtr &node2) {
+  // node1 lstm output
+  // node2 // dynamic output
+  MS_EXCEPTION_IF_NULL(graph);
+  MS_EXCEPTION_IF_NULL(node1);
+  MS_EXCEPTION_IF_NULL(node2);
+  std::vector<AnfNodePtr> ori_outputs;
+  CreateMultipleOutputsOfAnfNode(graph, node2, 5, &ori_outputs);
+  // ReduceSumd --> ReduceSum
+  std::vector<AnfNodePtr> reducerum_inputs = {NewValueNode(std::make_shared<Primitive>(prim::kPrimReduceSum->name())),
+                                              node1};
+  auto reduce_sumd = graph->NewCNode(reducerum_inputs);
+  MS_EXCEPTION_IF_NULL(reduce_sumd);
+  auto types = {AnfAlgo::GetOutputInferDataType(ori_outputs[1], 0)};
+  auto shapes = {AnfAlgo::GetOutputInferShape(ori_outputs[1], 0)};
+  AnfAlgo::SetOutputInferTypeAndShape(types, shapes, reduce_sumd.get());
+  AnfAlgo::SetNodeAttr(kAttrAxis, MakeValue(std::vector<int>{0, 1}), reduce_sumd);
+  AnfAlgo::SetNodeAttr("keep_dims", MakeValue(false), reduce_sumd);
+  AnfAlgo::SetNodeAttr("is_backend_insert", MakeValue(true), reduce_sumd);
+  return reduce_sumd;
+}
+
+const AnfNodePtr DynamicRNNGradFission::Process(const FuncGraphPtr &func_graph, const AnfNodePtr &node,
+                                                const EquivPtr &) const {
+  MS_EXCEPTION_IF_NULL(func_graph);
+  MS_EXCEPTION_IF_NULL(node);
+  auto cnode = node->cast<CNodePtr>();
+  MS_EXCEPTION_IF_NULL(cnode);
+  if (cnode->size() < kDynamicRNNGradInputNum + 1) {
+    MS_LOG(INFO) << "The input num of DynamicRNNGrad less than" << kDynamicRNNGradInputNum
+                 << ". The node should not be changed";
+    return nullptr;
+  }
+  // input_list of dynamic_rnn_grad
+  const auto &ori_inputs = cnode->inputs();
+  // create split_vd
+  auto split_vd = CreateSplitVD(func_graph, ori_inputs[7]);
+  // create concat_1
+  auto h_concat = AddHConcatD(func_graph, ori_inputs[5], split_vd);
+  // create concat_2
+  auto concat = AddConcatD(func_graph, ori_inputs[1], h_concat);
+  // create lsym_input_grad
+  auto lstm_input_grad = CreateLSTMInputGrad(func_graph, cnode);
+  std::vector<AnfNodePtr> lstm_outputs;
+  CreateMultipleOutputsOfAnfNode(func_graph, lstm_input_grad, kLSTMInputGradOutputNum, &lstm_outputs);
+  // create matmul
+  auto batch_matmul = CreateBatchMatMul(func_graph, lstm_outputs[3], concat);
+  // create reduce_sum_1
+  auto dw_reduce_sum = AddDwReduceSum(func_graph, node, batch_matmul);
+  // create reduce_sum_2
+  auto db_reduce_sum = AddDbReduceSum(func_graph, lstm_outputs[3], node);
+  std::vector<AnfNodePtr> make_tuple_inputs = {NewValueNode(prim::kPrimMakeTuple),
+                                               dw_reduce_sum,
+                                               db_reduce_sum,
+                                               lstm_outputs[0],
+                                               lstm_outputs[1],
+                                               lstm_outputs[2]};
+  auto make_tuple = func_graph->NewCNode(make_tuple_inputs);
+  MS_EXCEPTION_IF_NULL(make_tuple);
+  return make_tuple;
+}
+}  // namespace opt
+}  // namespace mindspore
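CreateSplitVD carves the recurrent hidden-state tensor along axis 0 into size_splits of [T - 1, 1], i.e. all time steps but the last versus the final step. A sketch of just that sizing arithmetic, with an assumed {T, batch, hidden} shape:

```
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
  std::vector<size_t> in_shape = {8, 32, 64};  // {T, batch, hidden}
  // First split keeps T - 1 steps, second split keeps the last step,
  // mirroring the size_splits = {tmp, 1} attribute set in CreateSplitVD.
  std::vector<size_t> first = {in_shape[0] - 1, in_shape[1], in_shape[2]};
  std::vector<size_t> second = {1, in_shape[1], in_shape[2]};
  std::cout << first[0] << "x" << first[1] << "x" << first[2] << " and "
            << second[0] << "x" << second[1] << "x" << second[2] << std::endl;
  return 0;
}
```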
@@ -0,0 +1,33 @@
+/**
+ * Copyright 2020 Huawei Technologies Co., Ltd
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+#ifndef MINDSPORE_CCSRC_BACKEND_OPTIMIZER_ASCEND_IR_FISSION_DYNAMIC_RNN_GRAD_FISSION_H_
+#define MINDSPORE_CCSRC_BACKEND_OPTIMIZER_ASCEND_IR_FISSION_DYNAMIC_RNN_GRAD_FISSION_H_
+
+#include "backend/optimizer/common/optimizer.h"
+
+namespace mindspore {
+namespace opt {
+class DynamicRNNGradFission : public PatternProcessPass {
+ public:
+  explicit DynamicRNNGradFission(bool multigraph = true) : PatternProcessPass("dynamic_rnn_grad_fission", multigraph) {}
+  ~DynamicRNNGradFission() override = default;
+  const BaseRef DefinePattern() const override;
+  const AnfNodePtr Process(const FuncGraphPtr &, const AnfNodePtr &, const EquivPtr &) const override;
+};
+}  // namespace opt
+}  // namespace mindspore
+
+#endif  // MINDSPORE_CCSRC_BACKEND_OPTIMIZER_ASCEND_IR_FISSION_DYNAMIC_RNN_GRAD_FISSION_H_
@@ -21,11 +21,10 @@

 namespace mindspore {
 namespace opt {
-const std::set<std::pair<string, string>> invalid_formats_pair = {{kOpFormat_C1HWNCoC0, kOpFormat_NCHW},
-                                                                  {kOpFormat_NCHW, kOpFormat_C1HWNCoC0},
-                                                                  {kOpFormat_C1HWNCoC0, kOpFormat_DEFAULT},
-                                                                  {kOpFormat_DEFAULT, kOpFormat_FRACTAL_ZN_LSTM},
-                                                                  {kOpFormat_DEFAULT, kOpFormat_C1HWNCoC0}};
+const std::set<std::pair<string, string>> invalid_formats_pair = {
+  {kOpFormat_C1HWNCoC0, kOpFormat_NCHW},          {kOpFormat_NCHW, kOpFormat_C1HWNCoC0},
+  {kOpFormat_C1HWNCoC0, kOpFormat_DEFAULT},       {kOpFormat_DEFAULT, kOpFormat_FRACTAL_ZN_LSTM},
+  {kOpFormat_FRACTAL_ZN_LSTM, kOpFormat_DEFAULT}, {kOpFormat_DEFAULT, kOpFormat_C1HWNCoC0}};

 bool TransDataSplit::Run(const FuncGraphPtr &func_graph) {
  MS_EXCEPTION_IF_NULL(func_graph);
@@ -83,6 +82,9 @@ bool TransDataSplit::DoSplit(const FuncGraphPtr &func_graph, const AnfNodePtr &n
  new_transpose_node = NewTransOpNode(func_graph, AnfAlgo::GetInputNode(node->cast<CNodePtr>(), 0), kernel_select_,
                                      false, prim::kPrimTranspose->name());
  AnfAlgo::SetNodeAttr(kAttrPerm, MakeValue(std::vector<int>{2, 3, 1, 0}), new_transpose_node);
+  if (output_format == kOpFormat_FRACTAL_ZN_LSTM) {
+    AnfAlgo::SetNodeAttr("nop_op", MakeValue(true), new_transpose_node);
+  }
  RefreshKernelBuildInfo(input_format, kOpFormat_HWCN, new_transpose_node);

  // trans hwcn to output_format
@@ -404,7 +404,11 @@ bool IsNopNode(const AnfNodePtr &node) {
  }
  CNodePtr cnode = node->cast<CNodePtr>();
  MS_EXCEPTION_IF_NULL(cnode);
-  if (nop_nodes.find(AnfAlgo::GetCNodeName(cnode)) == nop_nodes.end()) {
+  bool is_nop_node = false;
+  if (AnfAlgo::HasNodeAttr("nop_op", cnode)) {
+    is_nop_node = AnfAlgo::GetNodeAttr<bool>(cnode, "nop_op");
+  }
+  if (nop_nodes.find(AnfAlgo::GetCNodeName(cnode)) == nop_nodes.end() && !is_nop_node) {
    return false;
  }
  return true;
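After this change a node counts as a no-op either because its op name is in the static nop set or because a pass tagged it with a "nop_op" attribute, as TransDataSplit and the ExtractImagePatches reshape now do. A self-contained sketch of that two-path test, with a plain struct standing in for CNodePtr/AnfAlgo:

```
#include <iostream>
#include <map>
#include <set>
#include <string>

struct Node {
  std::string name;
  std::map<std::string, bool> attrs;
};

bool IsNopNode(const Node &node) {
  static const std::set<std::string> nop_nodes = {"Reshape", "ExpandDims", "Squeeze"};
  bool is_nop_attr = false;
  auto it = node.attrs.find("nop_op");
  if (it != node.attrs.end()) {
    is_nop_attr = it->second;  // attribute set by an optimizer pass
  }
  // Equivalent to the patched condition: name in set OR tagged as nop.
  return nop_nodes.count(node.name) > 0 || is_nop_attr;
}

int main() {
  Node transpose{"Transpose", {{"nop_op", true}}};
  std::cout << std::boolalpha << IsNopNode(transpose) << std::endl;  // true
  return 0;
}
```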
@@ -26,7 +26,7 @@ namespace mindspore {
 namespace memreuse {
 enum RefCountType { kDynamicRefCount, kStaticRefCount };
 enum NodeType { kCommonNode, kCommunicationNode };
-enum KernelRefType { kCommon, kRefNodeInput, kRefNodeOutput, kCommNotReuse, kCommReuse, kSummary };
+enum KernelRefType { kCommon, kRefNodeInput, kRefNodeOutput, kCommNotReuse, kCommReuse, kSummary, kAllocated };
 static constexpr int kInitIndex = -1;
 class KernelRefCount {
 public:
@@ -76,6 +76,9 @@ bool MemReuseUtil::InitDynamicOutputKernelRef() {
      }
    } else {
      kernel_ref->type_ = kCommon;
+      if (AnfAlgo::OutputAddrExist(kernel_cnode, output_index)) {
+        kernel_ref->type_ = kAllocated;
+      }
    }
  }
  kernel_refs.push_back(kernel_ref);
@@ -16,6 +16,7 @@
 #include "backend/optimizer/mem_reuse/mem_reuse_allocator.h"
 #include "backend/optimizer/mem_reuse/mem_reuse.h"
 #include "backend/optimizer/mem_reuse/mem_reuse_checker.h"
+#include "utils/ms_context.h"
 #ifdef ENABLE_D
 #include "runtime/device/ascend/ascend_stream_assign.h"
 #endif
@@ -32,12 +33,22 @@ void BestFitMemReuse::InitMemReuseInfo(const MemReuseUtil *mem_reuse_util_ptr) {
  set_workspace_ptr_list(mem_reuse_util_ptr->total_wk_ref_list());
  set_op_ptr_list(mem_reuse_util_ptr->kernel_def_ptr_list());
  // check info Correctness
-  for (auto &tensor : tensor_ptr_list_) {
-    tensor->size_ = AlignCommonMemorySize(tensor->size_);
+  auto context_ptr = MsContext::GetInstance();
+  MS_EXCEPTION_IF_NULL(context_ptr);
+  std::string device_target = context_ptr->get_param<std::string>(MS_CTX_DEVICE_TARGET);
+  // CPU device does not need to be aligned
+  if (device_target != kCPUDevice) {
+    MS_LOG(DEBUG) << "Align Memory Size";
+    for (auto &tensor : tensor_ptr_list_) {
+      tensor->size_ = AlignCommonMemorySize(tensor->size_);
+    }
+    // align wk size to 512
+    for (auto &wk : wk_tensor_list_) {
+      wk->size_ = AlignCommonMemorySize(wk->size_);
+    }
  }
-  // align wk size to 512 && refcount == 1
+  // set wk refcount == 1
  for (auto &wk : wk_tensor_list_) {
-    wk->size_ = AlignCommonMemorySize(wk->size_);
    wk->ref_count_ = 1;
  }
 #ifdef ENABLE_D
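On non-CPU targets every tensor and workspace size is rounded up before reuse planning. AlignCommonMemorySize's definition is not part of this diff; the sketch below assumes the conventional round-up-to-512-bytes formula, which is what the "align wk size to 512" comment suggests:

```
#include <cstddef>
#include <iostream>

constexpr size_t kAlign = 512;

// Round size up to the next multiple of the 512-byte allocation granularity.
size_t AlignMemorySize(size_t size) {
  return (size + kAlign - 1) / kAlign * kAlign;
}

int main() {
  std::cout << AlignMemorySize(1) << " " << AlignMemorySize(512) << " "
            << AlignMemorySize(513) << std::endl;  // 512 512 1024
  return 0;
}
```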
@@ -155,6 +166,9 @@ void BestFitMemReuse::AssignCommonNodeOutputOffset() {
      // get align size for communication op's single input
      tensor_desc->size_ = AlignCommunicationMemorySize(tensor_desc->size_);
      total_comm_reuse_size += tensor_desc->size_;
+    } else if (tensor_desc->type_ == kAllocated) {
+      // no need to alloc allocated memory
+      continue;
    }

    auto reusable_membuf_map = GetReusableMembufMap(tensor_desc->size_);
@@ -516,6 +516,10 @@ std::vector<size_t> AnfRuntimeAlgorithm::GetInputDeviceShape(const AnfNodePtr &n
   if (trans::IsNeedPadding(format, infer_shape.size())) {
     infer_shape = trans::PaddingShapeTo4d(infer_shape, GetInputReshapeType(node, input_idx));
   }
+  if (node->isa<CNode>() && GetCNodeName(node) == kExtractImagePatchesOpName) {
+    auto shape_tmp = {infer_shape[0], infer_shape[3], infer_shape[1], infer_shape[2]};
+    return trans::TransShapeToDevice(shape_tmp, format);
+  }
   return trans::TransShapeToDevice(infer_shape, format);
 }
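The `{infer_shape[0], infer_shape[3], infer_shape[1], infer_shape[2]}` ordering above permutes an NHWC shape into NCHW before the device-shape translation. A toy illustration of the same permutation (standalone names, not MindSpore APIs):

```
#include <array>
#include <cstddef>
#include <iostream>

// Reorder N,H,W,C dimensions into N,C,H,W.
std::array<std::size_t, 4> NhwcToNchw(const std::array<std::size_t, 4> &s) {
  return {s[0], s[3], s[1], s[2]};
}

int main() {
  auto nchw = NhwcToNchw({8, 224, 224, 3});  // N,H,W,C input
  std::cout << nchw[0] << " " << nchw[1] << " " << nchw[2] << " " << nchw[3] << "\n";  // 8 3 224 224
}
```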
@@ -179,6 +179,7 @@ GraphId AscendSession::CompileGraph(NotNull<FuncGraphPtr> func_graph) {
     debugger_->PreExecute(root_graph);
   }
 #endif
+  SetSummaryNodes(root_graph.get());
   // alloc mem
   MemoryAlloc(root_graph.get());
   // generate and load task into device
@@ -75,6 +75,16 @@ GraphId CPUSession::CompileGraph(const AnfNodePtrList &lst, const AnfNodePtrList
 #endif
   MS_LOG(INFO) << "Build kernel";
   BuildKernel(graph.get());
+
+  // Set graph execution order before memory alloc, ensure that memory alloc is according to the reorder graph
+  auto execution_order = graph->execution_order();
+  Reorder(&execution_order);
+  graph->set_execution_order(execution_order);
+
+  MS_LOG(INFO) << "Start kernel runtime";
+  if (!runtime_.Init()) {
+    MS_LOG(EXCEPTION) << "CPU start kernel runtime failed";
+  }
   MS_LOG(INFO) << "Assign kernel address";
   runtime_.AssignKernelAddress(graph.get());
   return graph_id;
@@ -98,11 +108,8 @@ void CPUSession::RunGraph(const GraphId &graph_id, const std::vector<tensor::Ten
 #endif

   MS_LOG(INFO) << "Run graph start";
-  auto execution_order = kernel_graph->execution_order();
-  Reorder(&execution_order);

   bool enable_summary = summary_callback_ != nullptr;
-  kernel_graph->set_execution_order(execution_order);
   NamedSummaryOutputs summary_outputs;
   if (enable_summary) {
     SetSummaryNodes(kernel_graph.get());
@@ -125,7 +125,7 @@ class CreateCommGroupTask : public Task {
   void Run() override;
   std::string group_name_;
   std::vector<uint32_t> ranks_;
-  bool result_;
+  bool result_{false};
 };

 class DestroyCommGroupTask : public Task {
|
@ -134,7 +134,7 @@ class DestroyCommGroupTask : public Task {
|
||||||
~DestroyCommGroupTask() override = default;
|
~DestroyCommGroupTask() override = default;
|
||||||
void Run() override;
|
void Run() override;
|
||||||
std::string group_name_;
|
std::string group_name_;
|
||||||
bool result_;
|
bool result_{false};
|
||||||
};
|
};
|
||||||
|
|
||||||
class ExitTask : public Task {
|
class ExitTask : public Task {
|
||||||
|
|
|
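Both hunks replace an uninitialized `bool result_;` with a default member initializer. A small self-contained illustration of why this matters — a member with no initializer holds an indeterminate value until assigned, and reading it first is undefined behavior (toy `Task` type, not the MindSpore classes):

```
#include <iostream>

struct Task {
  bool result_{false};  // well-defined even if Run() is never called
};

int main() {
  Task t;
  std::cout << std::boolalpha << t.result_ << "\n";  // false, deterministically
}
```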
@@ -64,8 +64,11 @@ void E2eDumpUtil::DumpGPUMemToFile(const std::string &file_path, const std::stri
                                    const ShapeVector &int_shapes, const TypeId &type, size_t slot, Debugger *debugger) {
 #ifdef ENABLE_DEBUGGER
   auto format = kOpFormat_DEFAULT;
+  MS_EXCEPTION_IF_NULL(debugger);
   DebugServices *debug_services = debugger->debug_services();
+  MS_EXCEPTION_IF_NULL(debug_services);
   TensorLoader *tensor_loader = debug_services->tensor_loader();
+  MS_EXCEPTION_IF_NULL(tensor_loader);
   auto ret = tensor_loader->DumpTensorToFile(original_kernel_name, trans_flag, file_path, format, int_shapes, type,
                                              addr->type_id(), addr->format(), slot);
   if (!ret) {
@@ -309,7 +309,7 @@ void Debugger::PostDebugOp() {
   }
 }

-std::map<std::pair<uint32_t, uint32_t>, std::string> &Debugger::GetStreamTaskToOpnameMap() {
+std::map<std::pair<uint32_t, uint32_t>, std::string> Debugger::GetStreamTaskToOpnameMap() const {
   return stream_task_to_opname_;
 }

@@ -754,7 +754,8 @@ uint64_t BytestoInt64(const std::vector<char> &buffer) {
   uint64_t ret;

   ret = ((uint64_t)buffer[7] << 56) | ((uint64_t)buffer[6] << 48) | ((uint64_t)buffer[5] << 40) |
-        ((uint64_t)buffer[4] << 32) | (buffer[3] << 24) | (buffer[2] << 16) | (buffer[1] << 8) | buffer[0];
+        ((uint64_t)buffer[4] << 32) | ((uint64_t)buffer[3] << 24) | ((uint64_t)buffer[2] << 16) |
+        ((uint64_t)buffer[1] << 8) | ((uint64_t)buffer[0]);

   return ret;
 }
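The old expression shifted plain `char` values, which promote to `int`; a byte with the high bit set then sign-extends, and shifting a 1 into the sign bit of `int` is undefined behavior. The fix casts each byte to `uint64_t` before shifting. A standalone sketch of the pitfall with an illustrative byte (not the MindSpore buffer); note that with signed `char` elements, masking with `& 0xFF` is the fully robust pattern:

```
#include <cstdint>
#include <iostream>

int main() {
  // A byte with the high bit set; char is signed on most common ABIs.
  char b = static_cast<char>(0x80);

  // Converting the negative char straight to uint64_t sign-extends it,
  // so the shifted value carries 1-bits far above the intended byte.
  uint64_t smeared = static_cast<uint64_t>(b) << 8;          // 0xffffffffffff8000

  // Masking down to the byte first gives the intended result.
  uint64_t clean = (static_cast<uint64_t>(b) & 0xFF) << 8;   // 0x8000

  std::cout << std::hex << smeared << "\n" << clean << "\n";
}
```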
@@ -94,7 +94,7 @@ class Debugger : public std::enable_shared_from_this<Debugger> {

   int32_t step_num() const;

-  std::map<std::pair<uint32_t, uint32_t>, std::string> &GetStreamTaskToOpnameMap();
+  std::map<std::pair<uint32_t, uint32_t>, std::string> GetStreamTaskToOpnameMap() const;

   // check if any feature that uses the debugger backend is enabled
   bool DebuggerBackendEnabled();
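Returning the map by value rather than by non-const reference means callers — such as the `DataDumper::LoadDumpInfo` hunk further down, which inserts extra entries — work on a private copy and cannot mutate the debugger's member through the return value. A toy demonstration of the distinction (not the actual Debugger class):

```
#include <iostream>
#include <map>
#include <string>

class Owner {
 public:
  std::map<int, std::string> GetMapByValue() const { return map_; }  // copy out
 private:
  std::map<int, std::string> map_{{1, "op1"}};
};

int main() {
  Owner o;
  auto copy = o.GetMapByValue();
  copy[2] = "op2";  // only the caller's copy grows
  std::cout << o.GetMapByValue().size() << " " << copy.size() << "\n";  // 1 2
}
```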
@@ -79,9 +79,9 @@ std::vector<std::string> ChunkString(std::string str, int graph_size) {
     }
     std::string buffer;
     buffer.resize(chunk_size);
-    errno_t err = memcpy_s(reinterpret_cast<char *>(buffer.data()), chunk_size, str.data() + size_iter, chunk_size);
+    auto err = memcpy_s(reinterpret_cast<char *>(buffer.data()), chunk_size, str.data() + size_iter, chunk_size);
     if (err != 0) {
-      MS_LOG(ERROR) << "memcpy_s failed. err code is: " << err;
+      MS_LOG(EXCEPTION) << "memcpy_s failed. errorno is: " << err;
     }
     buf.push_back(buffer);
     size_iter += CHUNK_SIZE;
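For reference, a self-contained sketch of the chunking pattern this function implements, with `std::memcpy` standing in for `memcpy_s` (the bounds-checked variant comes from a separate safe-string runtime, so treating it as available is an assumption outside standard C++):

```
#include <algorithm>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>

// Split a string into fixed-size pieces, copying each chunk into its own buffer.
std::vector<std::string> ChunkString(const std::string &str, std::size_t chunk_size) {
  std::vector<std::string> buf;
  for (std::size_t pos = 0; pos < str.size(); pos += chunk_size) {
    std::size_t len = std::min(chunk_size, str.size() - pos);
    std::string piece(len, '\0');
    std::memcpy(piece.data(), str.data() + pos, len);
    buf.push_back(std::move(piece));
  }
  return buf;
}

int main() {
  for (const auto &p : ChunkString("abcdefgh", 3)) std::cout << p << "\n";  // abc def gh
}
```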
|
@ -255,7 +255,7 @@ class PynativeEliminater : public OptimizerCaller {
|
||||||
MS_LOG(DEBUG) << "Start FillZero";
|
MS_LOG(DEBUG) << "Start FillZero";
|
||||||
ValuePtr out = nullptr;
|
ValuePtr out = nullptr;
|
||||||
if (value->isa<Int32Imm>()) {
|
if (value->isa<Int32Imm>()) {
|
||||||
return value;
|
return MakeValue(value->cast<Int32ImmPtr>()->value());
|
||||||
}
|
}
|
||||||
|
|
||||||
if (value->isa<tensor::Tensor>()) {
|
if (value->isa<tensor::Tensor>()) {
|
||||||
|
@@ -298,9 +298,10 @@ class PynativeEliminater : public OptimizerCaller {
     if (rep != nullptr) {
       if (rep->isa<ValueNode>()) {
         auto value_node = rep->cast<ValueNodePtr>();
-        value_node->set_value(FillZero(value_node->value()));
+        auto new_value_node = NewValueNode(FillZero(value_node->value()));
+        new_value_node->set_has_new_value(value_node->has_new_value());
         MS_LOG(DEBUG) << "Zeros_like replace ok " << rep->DebugString(4);
-        return rep;
+        return new_value_node;
       }
     }
   }
@@ -315,9 +316,10 @@ class PynativeEliminater : public OptimizerCaller {
     if (rep != nullptr) {
       if (rep->isa<ValueNode>()) {
         auto value_node = rep->cast<ValueNodePtr>();
-        value_node->set_value(FillZero(value_node->value()));
+        auto new_value_node = NewValueNode(FillZero(value_node->value()));
+        new_value_node->set_has_new_value(value_node->has_new_value());
         MS_LOG(DEBUG) << "Zeros_like replace ok 2 " << rep->DebugString(4);
-        return rep;
+        return new_value_node;
       }
     }
   }
@@ -69,7 +69,7 @@ PYBIND_REGISTER(Tensor, 0, ([](const py::module *m) {
                   auto &tensor = py::cast<Tensor &>(t);
                   if (tensor.type() == DataType::DE_STRING) {
                     py::array res;
-                    tensor.GetDataAsNumpyStrings(&res);
+                    THROW_IF_ERROR(tensor.GetDataAsNumpyStrings(&res));
                     return res;
                   }
                   py::buffer_info info;
@@ -30,6 +30,8 @@ namespace mindspore {
 namespace dataset {
 std::unique_ptr<Services> Services::instance_ = nullptr;
 std::once_flag Services::init_instance_flag_;
+std::set<std::string> Services::unique_id_list_ = {};
+std::mutex Services::unique_id_mutex_;

 #if !defined(_WIN32) && !defined(_WIN64) && !defined(__ANDROID__) && !defined(ANDROID)
 std::string Services::GetUserName() {
@@ -52,8 +54,23 @@ std::string Services::GetUniqueID() {
   std::mt19937 gen = GetRandomDevice();
   std::uniform_int_distribution<uint32_t> dist(0, kStr.size() - 1);
   char buffer[UNIQUEID_LEN];
-  for (int i = 0; i < UNIQUEID_LEN; i++) {
-    buffer[i] = kStr[dist(gen)];
+  {
+    std::unique_lock<std::mutex> lock(unique_id_mutex_);
+    while (true) {
+      auto ret = memset_s(buffer, UNIQUEID_LEN, 0, UNIQUEID_LEN);
+      if (ret != 0) {
+        MS_LOG(ERROR) << "memset_s error, errorno(" << ret << ")";
+        return std::string("");
+      }
+      for (int i = 0; i < UNIQUEID_LEN; i++) {
+        buffer[i] = kStr[dist(gen)];
+      }
+      if (unique_id_list_.find(std::string(buffer, UNIQUEID_LEN)) != unique_id_list_.end()) {
+        continue;
+      }
+      unique_id_list_.insert(std::string(buffer, UNIQUEID_LEN));
+      break;
+    }
   }
   return std::string(buffer, UNIQUEID_LEN);
 }
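The rewritten `GetUniqueID` serializes generation behind a mutex and retries until the candidate is absent from a process-wide set. A minimal standalone sketch of the same de-duplication loop (names, alphabet, and length are illustrative, not the MindSpore constants); `std::set::insert` reports uniqueness itself, which collapses the separate find/insert pair:

```
#include <iostream>
#include <mutex>
#include <random>
#include <set>
#include <string>

std::string GetUniqueId(std::set<std::string> *seen, std::mutex *mu, std::size_t len = 8) {
  static const std::string kStr = "0123456789abcdef";
  std::mt19937 gen(std::random_device{}());
  std::uniform_int_distribution<std::size_t> dist(0, kStr.size() - 1);
  std::lock_guard<std::mutex> lock(*mu);
  while (true) {
    std::string id;
    for (std::size_t i = 0; i < len; i++) id += kStr[dist(gen)];
    if (seen->insert(id).second) return id;  // .second is false on a duplicate
  }
}

int main() {
  std::set<std::string> seen;
  std::mutex mu;
  std::cout << GetUniqueId(&seen, &mu) << "\n" << GetUniqueId(&seen, &mu) << "\n";
}
```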
@@ -19,6 +19,7 @@
 #include <algorithm>
 #include <memory>
 #include <mutex>
+#include <set>
 #include <string>
 #include <vector>
 #include "minddata/dataset/util/memory_pool.h"
@@ -97,6 +98,8 @@ class Services {
  private:
   static std::once_flag init_instance_flag_;
   static std::unique_ptr<Services> instance_;
+  static std::set<std::string> unique_id_list_;
+  static std::mutex unique_id_mutex_;
   // A small pool used for small objects that last until the
   // Services Manager shuts down. Used by all sub-services.
   std::shared_ptr<MemoryPool> pool_;
@@ -1790,38 +1790,92 @@ bool UpdateFuncGraphFlags(py::object obj, const FuncGraphPtr &func_graph) {
   return true;
 }

+// Generate and copy a ValueNode, or a CNode with its child nodes
+static AnfNodePtr CopyNodesFromParamDefaultValue(const FuncGraphPtr func_graph, const AnfNodePtr &param_node) {
+  MS_EXCEPTION_IF_NULL(param_node);
+  if (param_node->isa<ValueNode>()) {
+    return std::make_shared<ValueNode>(param_node->cast<ValueNodePtr>()->value());
+  }
+
+  // Parameter default value is CNode.
+  std::size_t index = 0;
+  std::vector<AnfNodePtr> old_cnodes;
+  old_cnodes.emplace_back(param_node);
+  auto res = func_graph->NewCNode({});
+  std::vector<CNodePtr> new_cnodes;
+  new_cnodes.emplace_back(res);
+  while (index < old_cnodes.size()) {
+    auto current = old_cnodes[index];
+    auto current_new_cnode = new_cnodes[index];
+    index++;
+    MS_EXCEPTION_IF_NULL(current);
+    if (current->isa<CNode>()) {
+      auto &inputs = current->cast<CNodePtr>()->inputs();
+      for (auto it = inputs.begin(); it != inputs.end(); it++) {
+        AnfNodePtr input = *it;
+        if (input != nullptr && input->isa<CNode>()) {
+          old_cnodes.emplace_back(input);
+          auto new_cnode = func_graph->NewCNode({});
+          new_cnodes.emplace_back(new_cnode);
+          current_new_cnode->add_input(new_cnode);
+        } else if (input->isa<ValueNode>()) {
+          current_new_cnode->add_input(std::make_shared<ValueNode>(input->cast<ValueNodePtr>()->value()));
+        } else {
+          MS_LOG(EXCEPTION) << "Wrong type item in default parameters: " << input->ToString();
+        }
+      }
+    }
+  }
+  return res;
+}
+
 FuncGraphPtr MakeTopGraph(const py::object &cell, const ValuePtr &cell_ptr) {
+  auto current_graph = dyn_cast<FuncGraph>(cell_ptr);
+  if (current_graph == nullptr) {
+    MS_LOG(EXCEPTION) << "Current graph cast failed from " << cell_ptr->ToString();
+  }
+
   auto func_graph = std::make_shared<FuncGraph>();
-  func_graph->debug_info()->set_name("top");
+  func_graph->debug_info()->set_name(current_graph->debug_info()->name() + "_wrapper");

-  // def top(*arg, *kwargs):
-  auto param_vargs = func_graph->add_parameter();
-  auto args_name = "args";
-  param_vargs->set_name(args_name);
-  param_vargs->debug_info()->set_name(args_name);
-
-  auto param_vkwargs = func_graph->add_parameter();
-  args_name = "kwargs";
-  param_vkwargs->set_name(args_name);
-  param_vkwargs->debug_info()->set_name(args_name);
-
-  func_graph->set_has_vararg(true);
-  func_graph->set_has_kwarg(true);
-  func_graph->set_kwonlyargs_count(0);
+  // Copy all parameters information
+  for (auto &para : current_graph->parameters()) {
+    auto param = func_graph->add_parameter();
+    auto orig_param = para->cast<ParameterPtr>();
+    auto name = orig_param->name();
+    param->set_name(name);
+    param->debug_info()->set_name(name);
+  }
+  func_graph->set_has_vararg(current_graph->has_vararg());
+  func_graph->set_has_kwarg(current_graph->has_kwarg());
+  func_graph->set_kwonlyargs_count(current_graph->kwonlyargs_count());
+  // Copy all default values
+  for (auto &d : current_graph->parameter_default_value()) {
+    func_graph->set_param_default_value(d.first, CopyNodesFromParamDefaultValue(func_graph, d.second));
+  }

   // cell_obj
+  MS_LOG(DEBUG) << "add Flag for " << std::string(py::str(cell));
   parse::UpdateFuncGraphFlags(cell, func_graph);
   // top graph's construct flag
   if (py::hasattr(cell, "construct")) {
     parse::UpdateFuncGraphFlags(cell.attr("construct"), func_graph);
   }

-  // ret = cell_obj(*arg, *kwargs)
-  auto call_fn = MakeUnpackCall(func_graph, NewValueNode(cell_ptr), {param_vargs, param_vkwargs});
-
-  // return ret
-  func_graph->set_output(call_fn);
-  MS_LOG(DEBUG) << "add Flag for " << std::string(py::str(cell));
+  auto unpacking = func_graph->has_vararg() || func_graph->has_kwarg();
+  if (!unpacking) {
+    std::vector<AnfNodePtr> inputs;
+    inputs.emplace_back(NewValueNode(cell_ptr));
+    auto &params = func_graph->parameters();
+    (void)std::transform(params.begin(), params.end(), std::back_inserter(inputs),
+                         [](AnfNodePtr node) -> AnfNodePtr { return node; });
+    func_graph->set_output(func_graph->NewCNode(inputs));
+  } else {
+    // ret = cell_obj(*arg, *kwargs)
+    auto call_fn = MakeUnpackCall(func_graph, NewValueNode(cell_ptr), func_graph->parameters());
+    // return ret
+    func_graph->set_output(call_fn);
+  }
   return func_graph;
 }
 }  // namespace parse
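`CopyNodesFromParamDefaultValue` copies a default-value node tree breadth-first by walking an index through a worklist that grows as children are discovered, building the copy in lockstep. A standalone sketch of that traversal over a toy tree type (not AnfNode):

```
#include <iostream>
#include <memory>
#include <vector>

struct Node {
  int value;
  std::vector<std::shared_ptr<Node>> children;
};

// Copy a tree without recursion: old_nodes acts as a FIFO worklist indexed by
// a cursor, and new_nodes holds the corresponding copies at the same indices.
std::shared_ptr<Node> CopyTree(const std::shared_ptr<Node> &root) {
  std::vector<std::shared_ptr<Node>> old_nodes{root};
  std::vector<std::shared_ptr<Node>> new_nodes{std::make_shared<Node>(Node{root->value, {}})};
  for (std::size_t index = 0; index < old_nodes.size(); ++index) {
    for (const auto &child : old_nodes[index]->children) {
      old_nodes.push_back(child);
      auto copy = std::make_shared<Node>(Node{child->value, {}});
      new_nodes.push_back(copy);
      new_nodes[index]->children.push_back(copy);  // link copy under copied parent
    }
  }
  return new_nodes.front();
}

int main() {
  auto root = std::make_shared<Node>(Node{1, {std::make_shared<Node>(Node{2, {}})}});
  std::cout << CopyTree(root)->children[0]->value << "\n";  // 2
}
```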
@@ -260,7 +260,7 @@ py::object DoAutoCast(const py::object &arg, const TypeId &type_id) {
 py::object DoParamMixPrecisionCast(bool *is_cast, const py::object obj) {
   auto tensor = py::cast<tensor::TensorPtr>(obj);
   auto cast_type = tensor->cast_dtype();
-  py::object cast_output;
+  py::object cast_output = obj;
   if (cast_type != nullptr) {
     auto source_element = tensor->Dtype();
     if (source_element != nullptr && IsSubType(source_element, kFloat) && *source_element != *cast_type) {
@@ -282,12 +282,15 @@ py::object DoParamMixPrecisionCastTuple(bool *is_cast, const py::tuple tuple) {
       result[i] = DoParamMixPrecisionCast(is_cast, tuple[i]);
     } else if (py::isinstance<py::tuple>(tuple[i])) {
       result[i] = DoParamMixPrecisionCastTuple(is_cast, tuple[i]);
+    } else {
+      result[i] = tuple[i];
     }
   }
   return result;
 }

 bool GetSignatureType(const PrimitivePyPtr &prim, std::vector<SignatureEnumDType> *dtypes) {
+  MS_EXCEPTION_IF_NULL(dtypes);
   auto signature = prim->signatures();
   bool has_sig_dtype = false;
   (void)std::transform(signature.begin(), signature.end(), std::back_inserter(*dtypes),
@@ -733,20 +736,29 @@ ValuePtr PynativeExecutor::GetForwardValue(const OpExecInfoPtr &op_exec_info) {

 AnfNodePtr PynativeExecutor::MakeCNode(const OpExecInfoPtr &op_exec_info, std::vector<bool> *op_masks,
                                        abstract::AbstractBasePtrList *args_spec_list) {
+  MS_EXCEPTION_IF_NULL(op_masks);
+  MS_EXCEPTION_IF_NULL(args_spec_list);
   CNodePtr cnode = nullptr;
   std::vector<AnfNodePtr> inputs;

   auto prim = op_exec_info->py_primitive;
+  const auto &signature = prim->signatures();

   inputs.push_back(NewValueNode(prim));

   size_t size = op_exec_info->op_inputs.size();
+  auto sig_size = signature.size();
   // ignore signature for cast op
+  if (sig_size > 0 && sig_size != size) {
+    MS_EXCEPTION(ValueError) << op_exec_info->op_name << " inputs size " << size << " does not match the requires "
+                             << "inputs size " << sig_size;
+  }
   bool is_cast_op = (op_exec_info->op_name == "Cast");
   if (!is_cast_op) {
-    const auto &signature = prim->signatures();
     for (size_t i = 0; i < size; i++) {
       auto obj = op_exec_info->op_inputs[i];
       auto sig = SignatureEnumRW::kRWDefault;
-      if (signature.size() > 0) {
+      if (sig_size > 0) {
         sig = signature[i].rw;
       }
       MS_LOG(DEBUG) << "check mix precision " << op_exec_info->op_name << " input " << i << " "
@@ -1173,10 +1185,6 @@ py::tuple RunOp(const py::args &args) {
     std::ostringstream oss;
     trace::TraceGraphEval();
     trace::GetEvalStackInfo(oss);
-    // call py::print to output function call stack to STDOUT, in case of output the log to file, the user can see
-    // these info from screen, no need to open log file to find these info
-    py::print(oss.str());
-    MS_LOG(ERROR) << oss.str();
     PynativeExecutor::GetInstance()->Clean();
     // re-throw this exception to Python interpreter to handle it
     throw(py::error_already_set(ex));
@@ -406,8 +406,10 @@ void GPUProfiler::OpDataProducerBegin(const std::string op_name, void *stream) {
     CHECK_CUDA_RET_WITH_ERROR(cudaEventRecord(op_event_start_, (CUstream)stream_),
                               "cudaEventRecord op event start failed");
     op_host_time_start_ = GetHostTimeStamp();
+    op_cupti_time_start_ = GetCUPTITimeStamp();
   } else {
     op_host_time_start_ = GetHostTimeStamp();
+    op_cupti_time_start_ = GetCUPTITimeStamp();
   }
   SetRunTimeData(op_name, stream);
 }
@@ -431,7 +433,7 @@ void GPUProfiler::OpDataProducerEnd() {
   }
   MS_LOG(DEBUG) << "Host Time Elapsed(us)," << op_name_ << "," << op_time_elapsed;
   SetRunTimeData(op_name_, op_time_elapsed);
-  SetRunTimeData(op_name_, op_host_time_start_, op_time_elapsed);
+  SetRunTimeData(op_name_, op_cupti_time_start_, op_time_elapsed);
 }

 void GPUProfiler::StopCUPTI() {
@@ -170,6 +170,7 @@ class GPUProfiler {
   cudaEvent_t op_event_stop_;
   uint64_t op_host_time_start_;
   uint64_t op_host_time_stop_;
+  uint64_t op_cupti_time_start_;
   std::string profile_data_path_;
 };
 }  // namespace gpu
@@ -455,7 +455,7 @@ REGISTER_PYBIND_DEFINE(Tensor, ([](const py::module *m) {
                          >>> data.set_dtype(mindspore.int32)
                          mindspore.int32
                          )mydelimiter")
-                  .def("set_cast_dtype", &Tensor::set_cast_dtype)
+                  .def("set_cast_dtype", &Tensor::set_cast_dtype, py::arg("dtype") = nullptr)
                   .def("__str__", &Tensor::ToString)
                   .def("__repr__", &Tensor::ToStringRepr)
                   .def(py::pickle(
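The binding change gives `set_cast_dtype` a Python-visible default argument through `py::arg(...) = value`. A minimal pybind11 module showing the same mechanism (toy function and module name, buildable as a pybind11 extension):

```
#include <pybind11/pybind11.h>
namespace py = pybind11;

// py::arg(...) = value makes the parameter optional on the Python side.
int scale(int x, int factor) { return x * factor; }

PYBIND11_MODULE(example, m) {
  m.def("scale", &scale, py::arg("x"), py::arg("factor") = 2);
  // Python: example.scale(3) == 6; example.scale(3, factor=5) == 15
}
```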
@@ -14,12 +14,12 @@ endif ()

 if (ENABLE_CPU)
     file(GLOB_RECURSE CPU_SRC_LIST RELATIVE ${CMAKE_CURRENT_SOURCE_DIR} "cpu/*.cc")
-    list(REMOVE_ITEM CPU_SRC_LIST "cpu/mpi/mpi_adapter.cc", "cpu/mpi/mpi_export.cc")
+    list(REMOVE_ITEM CPU_SRC_LIST "cpu/mpi/mpi_adapter.cc" "cpu/mpi/mpi_export.cc")
 endif ()

 if (ENABLE_MPI)
     if (ENABLE_CPU)
-        file(GLOB_RECURSE MPI_SRC_LIST "cpu/mpi/mpi_adapter.cc", "cpu/mpi/mpi_export.cc")
+        file(GLOB_RECURSE MPI_SRC_LIST "cpu/mpi/mpi_adapter.cc" "cpu/mpi/mpi_export.cc")
         set_property(SOURCE ${MPI_SRC_LIST}
                      PROPERTY COMPILE_DEFINITIONS SUBMODULE_ID=mindspore::SubModuleId::SM_DEVICE)
         add_library(mpi_adapter SHARED ${MPI_SRC_LIST})
@@ -57,7 +57,6 @@ constexpr const char *kOpTypeOpDebug = "Opdebug";
 namespace mindspore {
 namespace device {
 namespace ascend {
-
 DataDumper::~DataDumper() {
   ReleaseDevMem(&dev_load_mem_);
   ReleaseDevMem(&dev_unload_mem_);
@@ -116,7 +115,7 @@ void DataDumper::LoadDumpInfo() {
   auto debugger = mindspore::Debugger::GetInstance();
   MS_EXCEPTION_IF_NULL(debugger);
   if (debugger->DebuggerBackendEnabled()) {
-    std::map<std::pair<uint32_t, uint32_t>, std::string> &stream_task_to_opname = debugger->GetStreamTaskToOpnameMap();
+    std::map<std::pair<uint32_t, uint32_t>, std::string> stream_task_to_opname = debugger->GetStreamTaskToOpnameMap();
     // extract stream id, task id and opname from runtime_info_map for overflow detection
     std::transform(runtime_info_map_.begin(), runtime_info_map_.end(),
                    std::inserter(stream_task_to_opname, stream_task_to_opname.end()),
@@ -52,8 +52,12 @@ const int kUnSupportMixedDataTypeIndex = -1;
 bool MatchInferOutputDataType(const CNodePtr &cnode, const kernel::KernelBuildInfo &kernel_build_info) {
   MS_EXCEPTION_IF_NULL(cnode);
   // Check input data type
+  auto name = AnfAlgo::GetCNodeName(cnode);
   for (size_t input_index = 0; input_index < kernel_build_info.GetInputNum(); ++input_index) {
     TypeId input_origin_type = AnfAlgo::GetPrevNodeOutputInferDataType(cnode, input_index);
+    if (name == kDynamicRNNOpName && input_origin_type == kMetaTypeNone) {
+      continue;
+    }
     if (kernel_build_info.GetInputDeviceType(input_index) != input_origin_type) {
       return false;
     }
@@ -478,6 +482,9 @@ void SetTensorDeviceInfo(const kernel::KernelBuildInfo &selected_kernel_info, co
       AnfAlgo::SetSelectKernelBuildInfo(builder->Build(), input_kernel_node.get());
       continue;
     }
+    if (selected_kernel_info.GetInputFormat(input_index) == kOpFormat_FRACTAL_ZN_LSTM) {
+      continue;
+    }
     // we set special device info of a input tensor.
     bool is_ref = false;
     auto op_info = kernel::OpLib::FindOp(AnfAlgo::GetCNodeName(kernel_node), kernel::kTBE);
@@ -127,8 +127,12 @@ bool TaskGenerator::LaunchKernel(const CNodePtr &anf_node_ptr, uint32_t stream_i
   auto kernel_mod = AnfAlgo::GetKernelMod(anf_node_ptr);
   MS_EXCEPTION_IF_NULL(kernel_mod);
   kernel_mod->set_kernel_name(anf_node_ptr->fullname_with_scope());
+  auto op_name = AnfAlgo::GetCNodeName(anf_node_ptr);
   if (AnfAlgo::GetCNodeName(anf_node_ptr) != kAtomicAddrCleanOpName) {
     for (size_t i = 0; i < AnfAlgo::GetInputTensorNum(anf_node_ptr); ++i) {
+      if (op_name == kDynamicRNNOpName && i == 3) {
+        continue;
+      }
       auto real_input_index = AnfAlgo::GetRealInputIndex(anf_node_ptr, i);
       auto device_address = AnfAlgo::GetPrevNodeOutputAddr(anf_node_ptr, real_input_index);
       AddressPtr input = std::make_shared<Address>();
@@ -22,22 +22,68 @@
 #include <functional>
 #include "backend/kernel_compiler/kernel.h"
 #include "runtime/device/cpu/cpu_device_address.h"
+#include "runtime/device/cpu/cpu_memory_manager.h"
 #include "utils/ms_context.h"
 #include "backend/session/anf_runtime_algorithm.h"
 #include "backend/session/session_basic.h"
 #include "frontend/operator/ops.h"
 #include "utils/shape_utils.h"
 #include "utils/profile.h"
+#ifdef MEM_REUSE_DEBUG
+#include "backend/optimizer/mem_reuse/mem_reuse_checker.h"
+#endif

 namespace mindspore {
 namespace device {
 namespace cpu {

+bool CPUKernelRuntime::Init() {
+  mem_manager_ = std::make_shared<CPUMemoryManager>();
+  MS_EXCEPTION_IF_NULL(mem_manager_);
+  return true;
+}
+
 const size_t INIT_NODE_REF = 1;
 void CPUKernelRuntime::AssignKernelAddress(session::KernelGraph *kernel_graph) {
   AssignValueNodeAddress(kernel_graph);
   AssignInputNodeAddress(kernel_graph);
-  AssignKernelOutputAddress(kernel_graph);
-  resource_manager_.AssignMemory(kernel_graph);
+  auto context_ptr = MsContext::GetInstance();
+  MS_EXCEPTION_IF_NULL(context_ptr);
+  bool is_enable_mem_reuse = context_ptr->get_param<bool>(MS_CTX_ENABLE_MEM_REUSE);
+  if (is_enable_mem_reuse) {
+    AssignStaticMemoryOutput(kernel_graph);
+    AssignDynamicMemory(kernel_graph);
+#ifdef MEM_REUSE_DEBUG
+    mindspore::memreuse::MemReuseChecker::GetInstance().CheckNormalIR(kernel_graph);
+#endif
+  } else {
+    AssignKernelOutputAddress(kernel_graph);
+    static_cast<CPUMemoryManager *>(mem_manager_.get())->AssignMemory(kernel_graph);
+  }
+}
+
+void CPUKernelRuntime::AssignStaticMemoryOutput(const session::KernelGraph *graph) {
+  MS_EXCEPTION_IF_NULL(graph);
+  auto nodes = AnfAlgo::GetAllOutput(graph->output(), {prim::kPrimTupleGetItem});
+  std::vector<session::KernelWithIndex> non_communication_op;
+  // Assign Communicate Op Memory firstly.
+  for (const auto &node : nodes) {
+    auto item_with_index = AnfAlgo::VisitKernelWithReturnType(node, 0, true);
+    MS_EXCEPTION_IF_NULL(item_with_index.first);
+    if (!item_with_index.first->isa<CNode>() || !AnfAlgo::IsRealKernel(item_with_index.first)) {
+      continue;
+    }
+    if (AnfAlgo::IsCommunicationOp(item_with_index.first)) {
+      AssignCommunicationNodeMem(kStaticMem, item_with_index.first);
+    } else {
+      non_communication_op.emplace_back(item_with_index);
+    }
+  }
+
+  for (const auto &item_with_index : non_communication_op) {
+    AssignNodeOutputMem(kStaticMem, item_with_index.first, SizeToInt(item_with_index.second));
+  }
 }

 void CPUKernelRuntime::AssignValueNodeAddress(session::KernelGraph *kernel_graph) {
@@ -73,7 +119,8 @@ void CPUKernelRuntime::AssignValueNodeAddress(session::KernelGraph *kernel_graph
           tensor->data_type() == kNumberTypeInt64) {
         address->ptr_ = tensor->data_c();
       } else {
-        address->ptr_ = resource_manager_.MemMalloc(tensor_size);
+        MS_EXCEPTION_IF_NULL(mem_manager_);
+        address->ptr_ = static_cast<CPUMemoryManager *>(mem_manager_.get())->MemMalloc(tensor_size);
         if (!address->SyncHostToDevice(data_shape, LongToSize(tensor->data().nbytes()), tensor->data_type(),
                                        tensor->data_c())) {
           MS_LOG(EXCEPTION) << "Value node sync host to device failed!";
@@ -166,10 +213,11 @@ tensor::TensorPtr CPUKernelRuntime::CreatTensorForOutput(session::KernelGraph *k
     tensor->set_sync_status(kNeedSyncDeviceToHostImmediately);
   } else {
     if (infer_type_id != device_type_id) {
+      MS_EXCEPTION_IF_NULL(mem_manager_);
       size_t type_size = GetTypeByte(TypeIdToType(device_type_id));
       ShapeVector data_shape = tensor->shape();
       size_t tensor_size = std::accumulate(data_shape.begin(), data_shape.end(), type_size, std::multiplies<size_t>());
-      address->ptr_ = resource_manager_.MemMalloc(tensor_size);
+      address->ptr_ = static_cast<CPUMemoryManager *>(mem_manager_.get())->MemMalloc(tensor_size);
       tensor->set_device_address(address);
       tensor->set_sync_status(kNeedSyncDeviceToHostImmediately);
     } else {
@@ -241,10 +289,11 @@ void CPUKernelRuntime::BindInputOutput(session::KernelGraph *kernel_graph, const
         tensor->data_type() == kNumberTypeInt32 || tensor->data_type() == kNumberTypeInt64) {
       address->ptr_ = tensor->data_c();
     } else {
+      MS_EXCEPTION_IF_NULL(mem_manager_);
       ShapeVector data_shape = tensor->shape();
       size_t tensor_size =
           std::accumulate(data_shape.begin(), data_shape.end(), sizeof(float), std::multiplies<size_t>());
-      address->ptr_ = resource_manager_.MemMalloc(tensor_size);
+      address->ptr_ = static_cast<CPUMemoryManager *>(mem_manager_.get())->MemMalloc(tensor_size);
       if (!address->SyncHostToDevice(data_shape, LongToSize(tensor->data().nbytes()), tensor->data_type(),
                                      tensor->data_c())) {
         MS_LOG(EXCEPTION) << "Parameter node sync host to device failed!";
@@ -271,7 +320,8 @@ void CPUKernelRuntime::AddRuntimeAddress(DeviceAddress *address, std::vector<ker
   kernel::AddressPtr input = std::make_shared<kernel::Address>();
   MS_EXCEPTION_IF_NULL(input);
   if (address->ptr_ == nullptr) {
-    address->ptr_ = resource_manager_.MemMalloc(address->size_);
+    MS_EXCEPTION_IF_NULL(mem_manager_);
+    address->ptr_ = static_cast<CPUMemoryManager *>(mem_manager_.get())->MemMalloc(address->size_);
   }
   MS_EXCEPTION_IF_NULL(address->ptr_);
   input->addr = address->ptr_;
@@ -280,16 +330,19 @@ void CPUKernelRuntime::AddRuntimeAddress(DeviceAddress *address, std::vector<ker
 }

 void CPUKernelRuntime::IncreaseSummaryRefCount(const session::NamedSummaryOutputs &summary_outputs) {
-  resource_manager_.IncreaseSummaryRefCount(summary_outputs);
+  MS_EXCEPTION_IF_NULL(mem_manager_);
+  static_cast<CPUMemoryManager *>(mem_manager_.get())->IncreaseSummaryRefCount(summary_outputs);
 }

 void CPUKernelRuntime::DecreaseSummaryRefCount(const session::NamedSummaryOutputs &summary_outputs) {
-  resource_manager_.DecreaseSummaryRefCount(summary_outputs);
+  MS_EXCEPTION_IF_NULL(mem_manager_);
+  static_cast<CPUMemoryManager *>(mem_manager_.get())->DecreaseSummaryRefCount(summary_outputs);
 }

 bool CPUKernelRuntime::Run(session::KernelGraph *kernel_graph, bool is_task_sink, Debugger *debugger) {
   MS_EXCEPTION_IF_NULL(kernel_graph);
-  resource_manager_.IncreaseAddressRefCount(kernel_graph);
+  MS_EXCEPTION_IF_NULL(mem_manager_);
+  static_cast<CPUMemoryManager *>(mem_manager_.get())->IncreaseAddressRefCount(kernel_graph);

   auto kernels = kernel_graph->execution_order();
   for (const auto &kernel : kernels) {
@@ -319,7 +372,7 @@ bool CPUKernelRuntime::Run(session::KernelGraph *kernel_graph, bool is_task_sink
       AddRuntimeAddress(device_address, &kernel_workspaces);
     }
     auto ret = kernel_mod->Launch(kernel_inputs, kernel_workspaces, kernel_outputs, 0);
-    resource_manager_.DecreaseAddressRefCount(kernel);
+    static_cast<CPUMemoryManager *>(mem_manager_.get())->DecreaseAddressRefCount(kernel);
     if (!ret) {
      MS_LOG(EXCEPTION) << "Launch kernel failed.";
     }
@@ -24,7 +24,6 @@
 #include "runtime/device/kernel_runtime.h"
 #include "backend/session/kernel_graph.h"
 #include "backend/session/session_basic.h"
-#include "runtime/device/cpu/cpu_resource_manager.h"
 #include "backend/session/anf_runtime_algorithm.h"
 #include "utils/any.h"
 namespace mindspore {
@@ -35,7 +34,7 @@ class CPUKernelRuntime : public KernelRuntime {
   CPUKernelRuntime() = default;
   ~CPUKernelRuntime() override = default;

-  bool Init() override { return true; }
+  bool Init() override;
   bool Run(session::KernelGraph *graph, bool is_task_sink, Debugger *debugger = nullptr) override;
   void AssignKernelAddress(session::KernelGraph *kernel_graph);
   void BindInputOutput(session::KernelGraph *kernel_graph, const std::vector<tensor::TensorPtr> &inputs,
@@ -49,6 +48,7 @@ class CPUKernelRuntime : public KernelRuntime {
                        TypeId type_id) override;

  private:
+  void AssignStaticMemoryOutput(const session::KernelGraph *graph);
   tensor::TensorPtr CreatTensorForOutput(session::KernelGraph *kernel_graph, const CNodePtr &node, size_t index);

   BaseRef CreatTensorForOutput(session::KernelGraph *kernel_graph, const session::KernelWithIndex &kernel_with_index);
@@ -56,7 +56,6 @@ class CPUKernelRuntime : public KernelRuntime {
   void AssignInputNodeAddress(const session::KernelGraph *kernel_graph);
   void AssignKernelOutputAddress(const session::KernelGraph *kernel_graph);
   void AddRuntimeAddress(DeviceAddress *address, std::vector<kernel::AddressPtr> *input_list);
-  CPUResourceManager resource_manager_;
   std::set<DeviceAddressPtr> bound_addresses_;
   std::map<AnfNodePtr, tensor::TensorPtr> input_param_tensor_map_;
 };
@@ -1,5 +1,5 @@
 /**
- * Copyright 2019 Huawei Technologies Co., Ltd
+ * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
@@ -13,15 +13,34 @@
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
-#include "runtime/device/cpu/cpu_resource_manager.h"
-#include "backend/session/anf_runtime_algorithm.h"
+#include "runtime/device/cpu/cpu_memory_manager.h"
+#include "backend/session/anf_runtime_algorithm.h"
+#include "utils/ms_context.h"
+#include "utils/convert_utils.h"
 namespace mindspore {
 namespace device {
 namespace cpu {
-CPUResourceManager::~CPUResourceManager() { MemFree(); }
-
-void CPUResourceManager::MemFree() {
+uint8_t *CPUMemoryManager::MallocStaticMem(size_t size, bool communication_mem) {
+  // For cpu, static memory is the same as dynamic memory
+  return MallocDynamicMem(size, communication_mem);
+}
+
+uint8_t *CPUMemoryManager::MallocDynamicMem(size_t size, bool) {
+  return (uint8_t *)MemMalloc(size);
+}
+
+void CPUMemoryManager::ResetDynamicMemory() {
+  for (auto &&iter : dynamic_mem_) {
+    free(iter.first);
+  }
+  dynamic_mem_.clear();
+}
+
+CPUMemoryManager::~CPUMemoryManager() { MemFree(); }
+
+void CPUMemoryManager::MemFree() {
   if (mem_ptr_ != nullptr) {
     free(mem_ptr_);
     mem_ptr_ = nullptr;
@@ -34,7 +53,7 @@ void CPUMemoryManager::MemFree() {
   dynamic_mem_.clear();
 }

-void CPUResourceManager::AssignMemory(const session::KernelGraph *graph) {
+void CPUMemoryManager::AssignMemory(const session::KernelGraph *graph) {
   size_t graph_mem_size = mem_plan_.MemPlan(graph);
   if (graph_mem_size > mem_size_) {
     if (mem_size_ > 0) {
@@ -43,6 +62,7 @@ void CPUMemoryManager::AssignMemory(const session::KernelGraph *graph) {
   }
   mem_ptr_ = reinterpret_cast<uint8_t *>(malloc(graph_mem_size));
   if (mem_ptr_ != nullptr) {
+    MS_LOG(INFO) << "Simple MemPlan GraphMemSize [" << graph_mem_size << "]";
     mem_size_ = graph_mem_size;
     dynamic_malloc_ = false;
   } else {
|
||||||
mem_plan_.MemAssign(graph, mem_ptr_);
|
mem_plan_.MemAssign(graph, mem_ptr_);
|
||||||
}
|
}
|
||||||
|
|
||||||
void *CPUResourceManager::MemMalloc(size_t mem_size) {
|
void *CPUMemoryManager::MemMalloc(size_t mem_size) {
|
||||||
void *ptr = malloc(mem_size);
|
void *ptr = malloc(mem_size);
|
||||||
if (ptr != nullptr) {
|
if (ptr != nullptr) {
|
||||||
memset_s(ptr, mem_size, 0, mem_size);
|
memset_s(ptr, mem_size, 0, mem_size);
|
||||||
|
@@ -67,7 +87,7 @@ void *CPUMemoryManager::MemMalloc(size_t mem_size) {
   }
 }

-void CPUResourceManager::MemFree(void *ptr) {
+void CPUMemoryManager::MemFree(void *ptr) {
   auto iter = dynamic_mem_.find(ptr);
   if (iter != dynamic_mem_.end()) {
     (void)dynamic_mem_.erase(iter);
@@ -75,7 +95,7 @@ void CPUMemoryManager::MemFree(void *ptr) {
   }
 }

-void CPUResourceManager::IncreaseSummaryRefCount(const session::NamedSummaryOutputs &summary_outputs) {
+void CPUMemoryManager::IncreaseSummaryRefCount(const session::NamedSummaryOutputs &summary_outputs) {
   if (!dynamic_malloc_) {
     return;
   }
@@ -93,7 +113,7 @@ void CPUMemoryManager::IncreaseSummaryRefCount(const session::NamedSummaryOutp
   }
 }

-void CPUResourceManager::DecreaseSummaryRefCount(const session::NamedSummaryOutputs &summary_outputs) {
+void CPUMemoryManager::DecreaseSummaryRefCount(const session::NamedSummaryOutputs &summary_outputs) {
   if (!dynamic_malloc_) {
     return;
   }
@@ -115,7 +135,7 @@ void CPUMemoryManager::DecreaseSummaryRefCount(const session::NamedSummaryOutp
   }
 }

-void CPUResourceManager::IncreaseAddressRefCount(const session::KernelGraph *graph) {
+void CPUMemoryManager::IncreaseAddressRefCount(const session::KernelGraph *graph) {
   if (!dynamic_malloc_) {
     return;
   }
@@ -140,7 +160,7 @@ void CPUMemoryManager::IncreaseAddressRefCount(const session::KernelGraph *gra
   }
 }

-void CPUResourceManager::DecreaseAddressRefCount(const AnfNodePtr &kernel) {
+void CPUMemoryManager::DecreaseAddressRefCount(const AnfNodePtr &kernel) {
   if (!dynamic_malloc_) {
     return;
   }
@@ -1,5 +1,5 @@
 /**
- * Copyright 2019 Huawei Technologies Co., Ltd
+ * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
@@ -13,22 +13,27 @@
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
-#ifndef MINDSPORE_CCSRC_RUNTIME_DEVICE_CPU_CPU_RESOURCE_MANAGER_H_
-#define MINDSPORE_CCSRC_RUNTIME_DEVICE_CPU_CPU_RESOURCE_MANAGER_H_
+
+#ifndef MINDSPORE_CCSRC_RUNTIME_DEVICE_CPU_CPU_MEMORY_MANAGER_H_
+#define MINDSPORE_CCSRC_RUNTIME_DEVICE_CPU_CPU_MEMORY_MANAGER_H_
 #include <vector>
 #include <map>
 #include "backend/session/kernel_graph.h"
 #include "backend/session/session_basic.h"
 #include "runtime/device/device_address.h"
+#include "runtime/device/memory_manager.h"
 #include "runtime/device/cpu/cpu_simple_mem_plan.h"
 namespace mindspore {
 namespace device {
 namespace cpu {
-class CPUResourceManager {
+class CPUMemoryManager : public MemoryManager {
  public:
-  CPUResourceManager() = default;
-  ~CPUResourceManager();
+  CPUMemoryManager() = default;
+  virtual ~CPUMemoryManager();
+
+  void MallocDeviceMemory() override {}
+  void FreeDeviceMemory() override {}
+  void ResetDynamicMemory() override;

   void AssignMemory(const session::KernelGraph *graph);
   void IncreaseAddressRefCount(const session::KernelGraph *graph);
|
@ -38,6 +43,10 @@ class CPUResourceManager {
|
||||||
void IncreaseSummaryRefCount(const session::NamedSummaryOutputs &summary_outputs);
|
void IncreaseSummaryRefCount(const session::NamedSummaryOutputs &summary_outputs);
|
||||||
void DecreaseSummaryRefCount(const session::NamedSummaryOutputs &summary_outputs);
|
void DecreaseSummaryRefCount(const session::NamedSummaryOutputs &summary_outputs);
|
||||||
|
|
||||||
|
protected:
|
||||||
|
uint8_t *MallocStaticMem(size_t size, bool communication_mem) override;
|
||||||
|
uint8_t *MallocDynamicMem(size_t size, bool communication_mem) override;
|
||||||
|
|
||||||
private:
|
private:
|
||||||
void MemFree();
|
void MemFree();
|
||||||
CPUSimpleMemPlan mem_plan_;
|
CPUSimpleMemPlan mem_plan_;
|
||||||
|
@@ -50,5 +59,4 @@ class CPUMemoryManager : public MemoryManager {
 }  // namespace cpu
 }  // namespace device
 }  // namespace mindspore
-#endif  // MINDSPORE_CCSRC_RUNTIME_DEVICE_CPU_CPU_RESOURCE_MANAGER_H_
+#endif  // MINDSPORE_CCSRC_RUNTIME_DEVICE_CPU_CPU_MEMORY_MANAGER_H_
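The rename from `CPUResourceManager` to `CPUMemoryManager : public MemoryManager` hangs the CPU allocator off the common `MemoryManager` interface, with static allocation delegating to the dynamic path. A rough sketch of that hook structure, assuming only the two `Malloc*` virtuals visible in the hunks (the base-class body below is invented for illustration, not MindSpore's `MemoryManager`):

```
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <iostream>

class MemoryManagerBase {
 public:
  virtual ~MemoryManagerBase() = default;
  uint8_t *Alloc(std::size_t size) { return MallocStaticMem(size, false); }

 protected:
  virtual uint8_t *MallocStaticMem(std::size_t size, bool communication_mem) = 0;
  virtual uint8_t *MallocDynamicMem(std::size_t size, bool communication_mem) = 0;
};

class CpuMemoryManager : public MemoryManagerBase {
 protected:
  // For CPU, static memory is the same as dynamic memory (as in the hunk).
  uint8_t *MallocStaticMem(std::size_t size, bool communication_mem) override {
    return MallocDynamicMem(size, communication_mem);
  }
  // calloc yields zeroed memory, mirroring the malloc + memset_s pairing.
  uint8_t *MallocDynamicMem(std::size_t size, bool /*communication_mem*/) override {
    return static_cast<uint8_t *>(std::calloc(1, size));
  }
};

int main() {
  CpuMemoryManager mm;
  uint8_t *p = mm.Alloc(64);
  std::cout << (p != nullptr ? "allocated" : "failed") << "\n";
  std::free(p);
}
```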
@@ -28,7 +28,7 @@ namespace mindspore {
 namespace device {
 namespace cpu {
 class CPUSimpleMemPlan;
-class CPUResourceManager;
+class CPUMemoryManager;
 class CPUKernelRuntime;
 }  // namespace cpu
 namespace ascend {
@@ -87,7 +87,7 @@ class DeviceAddress : public mindspore::DeviceSync {
   friend class MemoryManager;
   friend class mindspore::device::ascend::tasksink::TaskGenerator;
   friend class mindspore::device::cpu::CPUSimpleMemPlan;
-  friend class mindspore::device::cpu::CPUResourceManager;
+  friend class mindspore::device::cpu::CPUMemoryManager;
   friend class mindspore::device::cpu::CPUKernelRuntime;
   friend class mindspore::device::gpu::GPUKernelRuntime;
   friend class mindspore::device::gpu::GPUMemoryManager;
@@ -141,7 +141,6 @@ inline bool CheckNullInput(std::vector<size_t> input_shape) {
     MS_LOG(EXCEPTION) << "CUAD curand Error: " << message << " | curandStatus: " << status; \
   }                                                                                        \
 }
-
 }  // namespace gpu
 }  // namespace device
 }  // namespace mindspore
@@ -25,7 +25,6 @@
 namespace mindspore {
 namespace device {
 namespace gpu {
-
 MPIInitializer &MPIInitializer::GetInstance() {
   static MPIInitializer instance;
   return instance;
@@ -58,7 +58,23 @@ int AnfConverter::ValidateFileStr(const std::string &modelFile, std::string file

 bool AnfConverter::ReadOnnxFromBinary(const std::string &modelFile, google::protobuf::Message *onnx_model) {
   std::unique_ptr<char> onnx_file(new (std::nothrow) char[PATH_MAX]{0});
-  int fd = open(modelFile.c_str(), O_RDONLY);
+  if (modelFile.size() > PATH_MAX) {
+    MS_LOG(DEBUG) << "file path " << modelFile << " is too long.";
+    return false;
+  }
+  char real_path[PATH_MAX + 1] = {0};
+#if defined(_WIN32) || defined(_WIN64)
+  if (nullptr == _fullpath(real_path, modelFile.c_str(), PATH_MAX)) {
+    MS_LOG(DEBUG) << modelFile << " does not exit.";
+    return false;
+  }
+#else
+  if (nullptr == realpath(modelFile.c_str(), real_path)) {
+    MS_LOG(DEBUG) << modelFile << " does not exit.";
+    return false;
+  }
+#endif
+  int fd = open(real_path, O_RDONLY);
   if (fd < 0) {
     MS_LOG(EXCEPTION) << "failed to open file";
   }
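The new code validates the length of the user-supplied model path and canonicalizes it before opening. A minimal Python sketch of the same guard; the `PATH_MAX` value and the helper name are assumed for illustration, not taken from the source:

```python
import os

PATH_MAX = 4096  # assumed limit, standing in for the C macro


def open_model_file(model_file):
    """Hypothetical helper mirroring the canonicalize-then-open pattern above."""
    if len(model_file) > PATH_MAX:
        raise ValueError("file path %s is too long." % model_file)
    # os.path.realpath resolves symlinks and relative segments, much like
    # realpath()/_fullpath() in the C++ hunk; a missing file is then caught
    # by open() rather than by the canonicalization step itself.
    real_path = os.path.realpath(model_file)
    return open(real_path, "rb")
```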
@@ -100,28 +100,27 @@ bool ProcessInner(const std::string &message, const int &bias, uint32_t *digest,
     w[i] = sigma3(w[i - 2]) + w[i - 7] + sigma2(w[i - 15]) + w[i - 16];
   }

-  std::shared_ptr<uint32_t> hash(new uint32_t[digest_size], std::default_delete<uint32_t[]>());
+  std::shared_ptr<uint32_t[]> hash(new uint32_t[digest_size]());
   size_t mem_size = digest_size * sizeof(uint32_t);
   auto ret = memcpy_s(hash.get(), mem_size, digest, mem_size);
   if (ret != EOK) {
     return false;
   }
   for (int i = 0; i < kIterationNumber; ++i) {
-    uint32_t t1 =
-      w[i] + constant[i] + hash.get()[7] + sigma1(hash.get()[4]) + ch(hash.get()[4], hash.get()[5], hash.get()[6]);
-    uint32_t t2 = sigma0(hash.get()[0]) + ma(hash.get()[0], hash.get()[1], hash.get()[2]);
+    uint32_t t1 = w[i] + constant[i] + hash[7] + sigma1(hash[4]) + ch(hash[4], hash[5], hash[6]);
+    uint32_t t2 = sigma0(hash[0]) + ma(hash[0], hash[1], hash[2]);
     for (int j = digest_size - 1; j >= 0; --j) {
       if (j == 4) {
-        hash.get()[j] = hash.get()[j - 1] + t1;
+        hash[j] = hash[j - 1] + t1;
       } else if (j == 0) {
-        hash.get()[j] = t1 + t2;
+        hash[j] = t1 + t2;
       } else {
-        hash.get()[j] = hash.get()[j - 1];
+        hash[j] = hash[j - 1];
       }
     }
   }
   for (int i = 0; i < digest_size; ++i) {
-    digest[i] += hash.get()[i];
+    digest[i] += hash[i];
   }
   return true;
 }
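For orientation, the j-loop this hunk cleans up is the usual SHA-256 register rotation. A small illustrative Python version; the function name and the 32-bit masking are mine, and `t1`/`t2` are assumed to come from the round functions shown above:

```python
def rotate_state(hash_state, t1, t2, digest_size=8):
    # Shift the eight working registers down by one, folding t1 into
    # register 4 and t1 + t2 into register 0, as in the C++ loop above.
    for j in range(digest_size - 1, -1, -1):
        if j == 4:
            hash_state[j] = (hash_state[j - 1] + t1) & 0xFFFFFFFF
        elif j == 0:
            hash_state[j] = (t1 + t2) & 0xFFFFFFFF
        else:
            hash_state[j] = hash_state[j - 1]
    return hash_state


print(rotate_state(list(range(8)), t1=10, t2=20))  # [30, 0, 1, 2, 13, 4, 5, 6]
```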
@@ -219,6 +219,8 @@ constexpr auto kFusedScaleApplyMomentum = "FusedScaleApplyMomentum";
 constexpr auto kBasicLSTMCellWeightGradOpName = "BasicLSTMCellWeightGrad";
 constexpr auto kBasicLSTMCellInputGradOpName = "BasicLSTMCellInputGrad";
 constexpr auto kBasicLSTMCellOpName = "BasicLSTMCell";
+constexpr auto kDynamicRNNOpName = "DynamicRNN";
+constexpr auto kLSTMInputGradOpName = "LSTMInputGrad";

 // attr key name
 constexpr auto kAttrInputNames = "input_names";
@@ -292,7 +292,6 @@ class _PynativeExecutor:
     def __init__(self):
         self._executor = PynativeExecutor_.get_instance()

-    #TODO(kpy):add a type arg
     def new_graph(self, obj, *args, **kwargs):
         self._executor.new_graph(obj, *args, *(kwargs.values()))
@@ -30,12 +30,14 @@ def set_seed(seed):

     Note:
         The global seed is used by numpy.random, mindspore.common.Initializer, mindspore.ops.composite.random_ops and
         mindspore.nn.probability.distribution.

         If global seed is not set, these packages will use their own default seed independently, numpy.random and
         mindspore.common.Initializer will choose a random seed, mindspore.ops.composite.random_ops and
         mindspore.nn.probability.distribution will use zero.

         Seed set by numpy.random.seed() only used by numpy.random, while seed set by this API will also used by
         numpy.random, so just set all seed by this API is recommended.

     Args:
         seed (int): The seed to be set.
@@ -97,14 +99,13 @@ def _get_op_seed(op_seed, kernel_name):
         seed (int): The op-seed to be updated.
         kernel_name (string): The random op kernel.
     """
-    if ((kernel_name, op_seed) not in _KERNEL_SEED) or (_KERNEL_SEED[(kernel_name, op_seed)] == -1):
+    if (kernel_name, op_seed) not in _KERNEL_SEED:
         _KERNEL_SEED[(kernel_name, op_seed)] = op_seed
-    _KERNEL_SEED[(kernel_name, op_seed)] = 0
     return _KERNEL_SEED[(kernel_name, op_seed)]


 def _reset_op_seed():
     """
     Reset op seeds in the kernel's dictionary.
     """
-    for key in _KERNEL_SEED:
-        _KERNEL_SEED[key] = -1
+    for (kernel_name, op_seed) in _KERNEL_SEED:
+        _KERNEL_SEED[(kernel_name, op_seed)] = op_seed
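The net effect is that the cache now keeps one seed per `(kernel_name, op_seed)` pair, and reset restores the original op seeds instead of writing `-1` sentinels. A standalone sketch of that behavior; the module state is copied here only for illustration and is private in the real module:

```python
_KERNEL_SEED = {}  # illustrative stand-in for the module-level cache


def _get_op_seed(op_seed, kernel_name):
    # First lookup seeds the cache; later lookups return the cached value.
    if (kernel_name, op_seed) not in _KERNEL_SEED:
        _KERNEL_SEED[(kernel_name, op_seed)] = op_seed
    return _KERNEL_SEED[(kernel_name, op_seed)]


def _reset_op_seed():
    # Restore every cached entry to its original op seed.
    for (kernel_name, op_seed) in _KERNEL_SEED:
        _KERNEL_SEED[(kernel_name, op_seed)] = op_seed


print(_get_op_seed(5, "StandardNormal"))  # 5: the first lookup fills the cache
```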
@@ -104,6 +104,9 @@ inline const PrimitivePtr kPrimPad = std::make_shared<Primitive>("Pad");
 inline const PrimitivePtr kPrimArgMaxWithValue = std::make_shared<Primitive>("ArgMaxWithValue");
 inline const PrimitivePtr kPrimUnique = std::make_shared<Primitive>("Unique");
 inline const PrimitivePtr kPrimUniqueGrad = std::make_shared<Primitive>("UniqueGrad");
+inline const PrimitivePtr kPrimExtractImagePatches = std::make_shared<Primitive>("ExtractImagePatches");
+inline const PrimitivePtr kPrimDynamicRNN = std::make_shared<Primitive>("DynamicRNN");
+inline const PrimitivePtr kPrimDynamicRNNGrad = std::make_shared<Primitive>("DynamicRNNGrad");

 // NN
 inline const PrimitivePtr kPrimFlatten = std::make_shared<Primitive>("Flatten");
@@ -213,6 +216,7 @@ inline const PrimitivePtr kPrimRound = std::make_shared<Primitive>("Round");
 inline const PrimitivePtr kPrimExp = std::make_shared<Primitive>("Exp");
 inline const PrimitivePtr kPrimLog = std::make_shared<Primitive>("Log");
 inline const PrimitivePtr kPrimRsqrt = std::make_shared<Primitive>("Rsqrt");
+inline const PrimitivePtr kPrimSplitV = std::make_shared<Primitive>("SplitV");

 // Statements
 inline const PrimitivePtr kPrimReturn = std::make_shared<Primitive>("return");
@@ -28,7 +28,6 @@
 #include <algorithm>
 #include <type_traits>
 #include <typeinfo>
-#include <regex>

 #include "abstract/abstract_value.h"

@@ -373,25 +372,17 @@ class TensorDataImpl : public TensorData {
                   std::is_same<T, double>::value) {
       return str;
     }
-    // Use regular expressions to replace placeholders.
-    std::regex pattern("#+");
-    std::smatch result;
-    int bias = 0;
-    std::string::const_iterator start = str.begin();
-    std::string::const_iterator end = str.end();
-    while (std::regex_search(start, end, result, pattern)) {
-      const int len = result.str(0).length();
-      const int pos = result.position();
-      bias += pos;
-      if (bias > static_cast<int32_t>(str.length())) {
-        return "";
-      }
-      // Replace # with placeholder.
-      std::string space(max_width - len, ' ');
-      str = str.replace(bias, len, space);
-      // Update the starting position of the search.
-      start = str.begin() + bias;
-      end = str.end();
-    }
+    // Replace # with placeholder.
+    size_t index = str.find('#');
+    while (index != str.npos) {
+      size_t pos = index;
+      while (str[pos] == '#') {
+        pos++;
+      }
+      int len = pos - index;
+      std::string space(max_width - len, ' ');
+      str = str.replace(index, len, space);
+      index = str.find('#', index);
+    }
     return str;
   }
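The regex machinery above is replaced by a plain linear scan over runs of `#`. A Python sketch of the new logic; the function name and the example width are illustrative:

```python
def pad_placeholders(s, max_width):
    # Replace each run of '#' with enough spaces to pad it to max_width,
    # mirroring the find/replace scan in the C++ hunk above.
    index = s.find('#')
    while index != -1:
        pos = index
        while pos < len(s) and s[pos] == '#':
            pos += 1
        run = pos - index
        s = s[:index] + ' ' * (max_width - run) + s[pos:]
        index = s.find('#', index)  # spaces were inserted, so this finds the next run
    return s


print(repr(pad_placeholders("1.5 ## 2.5", 4)))  # '1.5    2.5'
```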
@@ -269,7 +269,7 @@ class Tensor : public MetaTensor {

   std::string id() const { return id_; }
   TypePtr cast_dtype() { return cast_dtype_; }
-  void set_cast_dtype(TypePtr dtype) { cast_dtype_ = dtype; }
+  void set_cast_dtype(TypePtr dtype = nullptr) { cast_dtype_ = dtype; }

   void SetNeedWait(bool need_wait) {
     if (event_ != nullptr) {
@@ -266,7 +266,7 @@ std::string KeywordArg::ToString() const {
   buffer << "KeywordArg[";
   buffer << "key : " << key_;
   MS_EXCEPTION_IF_NULL(value_);
-  buffer << "value : " << value_->ToString();
+  buffer << ", value : " << value_->ToString();
   buffer << "]";
   return buffer.str();
 }
@@ -2474,7 +2474,7 @@ class ConcatDataset(DatasetOp):
             raise ValueError("The parameter num_shards of DistributedSampler must be positive int!")

         if sampler.get_num_samples() is not None:
-            raise ValueError("The parameter num_samples of DistributedSampler must be set!")
+            raise ValueError("The parameter num_samples of DistributedSampler is not support to be set!")

         self._sampler = _select_sampler(None, sampler, None, None, None)
         cumulative_samples_nums = 0
@@ -3256,14 +3256,13 @@ class SamplerFn:
         # Event for end of epoch
         if multi_process is True:
             self.eoe = multiprocessing.Event()
-            self.eof = multiprocessing.Event()
         else:
             self.eoe = threading.Event()
             self.eof = threading.Event()
         # Create workers
         for _ in range(num_worker):
             if multi_process is True:
-                worker = _GeneratorWorkerMp(dataset, self.eoe, self.eof)
+                worker = _GeneratorWorkerMp(dataset, self.eoe)
             else:
                 worker = _GeneratorWorkerMt(dataset, self.eoe, self.eof)
             worker.daemon = True
@@ -3304,15 +3303,40 @@ class SamplerFn:

     def __del__(self):
         self.eoe.set()
-        self.eof.set()
         if self.multi_process is False:
+            self.eof.set()
         for w in self.workers:
             w.join()


-def _generator_worker_loop(dataset, idx_queue, result_queue, eoe, eof):
+def _generator_worker_loop_mp(dataset, idx_queue, result_queue, eoe):
     """
-    Multiprocessing or multithread generator worker process loop.
+    Multiprocessing generator worker process loop
+    """
+    while True:
+        # Fetch index, block
+        try:
+            idx = idx_queue.get()
+        except KeyboardInterrupt:
+            raise Exception("Generator worker receives KeyboardInterrupt")
+        if idx is None:
+            # When the queue is out of scope from master process, a None item can be fetched from the queue.
+            # Upon receiving None, worker process should check if EOE is set.
+            assert eoe.is_set(), ""
+            return
+        # Fetch data, any exception from __getitem__ will terminate worker and timeout master process
+        result = dataset[idx]
+        # Send data, block
+        try:
+            result_queue.put(result)
+        except KeyboardInterrupt:
+            raise Exception("Generator worker receives KeyboardInterrupt")
+        del result, idx
+
+
+def _generator_worker_loop_mt(dataset, idx_queue, result_queue, eoe, eof):
+    """
+    Multithread generator worker process loop.
     """
     while True:
         # Fetch index, block
@@ -3360,7 +3384,7 @@ class _GeneratorWorkerMt(threading.Thread):
     def __init__(self, dataset, eoe, eof):
         self.idx_queue = queue.Queue(16)
         self.res_queue = queue.Queue(16)
-        super().__init__(target=_generator_worker_loop, args=(dataset, self.idx_queue, self.res_queue, eoe, eof))
+        super().__init__(target=_generator_worker_loop_mt, args=(dataset, self.idx_queue, self.res_queue, eoe, eof))

     def put(self, item):
         """
@@ -3372,7 +3396,7 @@ class _GeneratorWorkerMt(threading.Thread):
         """
         Get function for worker result queue. Block with timeout.
         """
-        return self.res_queue.get(timeout=10)
+        return self.res_queue.get(timeout=30)


 class _GeneratorWorkerMp(multiprocessing.Process):
@@ -3380,10 +3404,10 @@ class _GeneratorWorkerMp(multiprocessing.Process):
     Worker process for multiprocess Generator.
     """

-    def __init__(self, dataset, eoe, eof):
+    def __init__(self, dataset, eoe):
         self.idx_queue = multiprocessing.Queue(16)
         self.res_queue = multiprocessing.Queue(16)
-        super().__init__(target=_generator_worker_loop, args=(dataset, self.idx_queue, self.res_queue, eoe, eof))
+        super().__init__(target=_generator_worker_loop_mp, args=(dataset, self.idx_queue, self.res_queue, eoe))

     def put(self, item):
         """
@@ -3395,17 +3419,18 @@ class _GeneratorWorkerMp(multiprocessing.Process):
         """
         Get function for worker result queue. Block with timeout.
         """
-        while check_iterator_cleanup() is False:
-            try:
-                return self.res_queue.get(timeout=10)
-            except multiprocessing.TimeoutError:
-                continue
-        raise Exception("Generator worker process timeout")
+        # Relax 10s to 30s, since it sometimes will cause "Generator worker process timeout"
+        # when we run too many iterators with infinite epoch(num_epoch=-1)
+        return self.res_queue.get(timeout=30)

     def __del__(self):
-        self.terminate()
+        # Try to destruct here, sometimes the class itself will be destructed in advance,
+        # so "self" will be a NoneType
+        try:
+            self.terminate()
+        except AttributeError:
+            pass


 class GeneratorDataset(MappableDataset):
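To make the multiprocessing split concrete, here is a self-contained toy version of the index/result queue protocol the new `_generator_worker_loop_mp` follows. It is simplified: no `KeyboardInterrupt` handling, and a plain list stands in for the dataset:

```python
import multiprocessing


def _worker_loop(dataset, idx_queue, result_queue, eoe):
    # Exit when a None index arrives after the end-of-epoch event is set,
    # as the real worker loop does above.
    while True:
        idx = idx_queue.get()
        if idx is None:
            assert eoe.is_set()
            return
        result_queue.put(dataset[idx])


if __name__ == "__main__":
    dataset = [x * x for x in range(8)]  # toy stand-in for a mappable dataset
    idx_q, res_q = multiprocessing.Queue(16), multiprocessing.Queue(16)
    eoe = multiprocessing.Event()
    w = multiprocessing.Process(target=_worker_loop, args=(dataset, idx_q, res_q, eoe))
    w.start()
    for i in range(8):
        idx_q.put(i)
    print([res_q.get(timeout=30) for _ in range(8)])  # [0, 1, 4, ..., 49]
    eoe.set()
    idx_q.put(None)  # wake the worker so it can observe EOE and exit
    w.join()
```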
@@ -80,12 +80,44 @@ class Compose:
         >>> dataset = ds.ImageFolderDataset(dataset_dir, num_parallel_workers=8)
         >>> # create a list of transformations to be applied to the image data
         >>> transform = py_transforms.Compose([py_vision.Decode(),
         >>>                                    py_vision.RandomHorizontalFlip(0.5),
         >>>                                    py_vision.ToTensor(),
         >>>                                    py_vision.Normalize((0.491, 0.482, 0.447), (0.247, 0.243, 0.262)),
         >>>                                    py_vision.RandomErasing()])
         >>> # apply the transform to the dataset through dataset.map()
         >>> dataset = dataset.map(operations=transform, input_columns="image")
+        >>>
+        >>> # Compose is also be invoked implicitly, by just passing in a list of ops
+        >>> # the above example then becomes:
+        >>> transform_list = [py_vision.Decode(),
+        >>>                   py_vision.RandomHorizontalFlip(0.5),
+        >>>                   py_vision.ToTensor(),
+        >>>                   py_vision.Normalize((0.491, 0.482, 0.447), (0.247, 0.243, 0.262)),
+        >>>                   py_vision.RandomErasing()]
+        >>>
+        >>> # apply the transform to the dataset through dataset.map()
+        >>> dataset = dataset.map(operations=transform_list, input_columns="image")
+        >>>
+        >>> # Certain C++ and Python ops can be combined, but not all of them
+        >>> # An example of combined operations
+        >>> import mindspore.dataset as ds
+        >>> import mindspore.dataset.transforms.c_transforms as c_transforms
+        >>> import mindspore.dataset.vision.c_transforms as c_vision
+        >>>
+        >>> data = ds.NumpySlicesDataset(arr, column_names=["cols"], shuffle=False)
+        >>> transformed_list = [py_transforms.OneHotOp(2), c_transforms.Mask(c_transforms.Relational.EQ, 1)]
+        >>> data = data.map(operations=transformed_list, input_columns=["cols"])
+        >>>
+        >>> # Here is an example of mixing vision ops
+        >>> data_dir = "/path/to/imagefolder_directory"
+        >>> data1 = ds.ImageFolderDataset(dataset_dir=data_dir, shuffle=False)
+        >>> input_columns = ["column_names"]
+        >>> op_list=[c_vision.Decode(),
+        >>>          c_vision.Resize((224, 244)),
+        >>>          py_vision.ToPIL(),
+        >>>          np.array, # need to convert PIL image to a NumPy array to pass it to C++ operation
+        >>>          c_vision.Resize((24, 24))]
+        >>> data1 = data1.map(operations=op_list, input_columns=input_columns)
         """

     @check_compose_list
@@ -28,10 +28,10 @@ if (SUPPORT_TRAIN)
     elseif (PLATFORM_ARM32)
         set(COMPONENT_NAME runtime-arm32-${PROCESS_UNIT}-train)
     elseif (WIN32)
-        set(PARSER_NAME libconverter-parser-win-${PROCESS_UNIT})
-        set(COMPONENT_NAME converter-win-${PROCESS_UNIT})
+        set(PARSER_NAME libconverter-parser-win-${PROCESS_UNIT}-train)
+        set(COMPONENT_NAME converter-win-${PROCESS_UNIT}-train)
     else ()
-        set(COMPONENT_NAME converter-ubuntu)
+        set(COMPONENT_NAME converter-ubuntu-train)
     endif()
     set(RUN_X86_COMPONENT_NAME runtime-x86-${PROCESS_UNIT}-train)
 else ()
@@ -0,0 +1,113 @@
+#ifdef __arm__
+#ifndef __aarch64__
+
+    .text
+    .align 5
+    .global ConvDwInt8PostAlign4PerChannel
+#ifndef __APPLE__
+    .type ConvDwInt8PostAlign4PerChannel, %function
+#endif
+
+// void ConvDwInt8PostAlign4PerChannel(int8_t *dst, int32_t *buffer, int channel4, int32_t output_zp, int32_t *out_multiplier,
+//                                     int32_t *left_shift, int32_t *right_shift, int32_t acc_min, int32_t acc_max);
+// r0: dst, r1: buffer, r2: num_pixels, r3: output_zp, r4: out_multiplier,
+// r5: left_shift, r6: right_shift, r7: acc_min, r8: acc_max
+
+ConvDwInt8PostAlign4PerChannel:
+    // at return, clang generates "push {lr}, pop {pc}"" while gcc will generate "bx lr"
+    // according to https://stackoverflow.com/questions/53625807
+    // even if we jump to link register instead of saving it, we still have to save it in subroutine calls anyway
+    // clang's rule seems more simple, though there are no subroutine calls here
+    // r4-r8 and q4-q7 must be saved according to https://static.docs.arm.com/ihi0042/i/aapcs32.pdf
+    push {r4-r8, r10}
+    vpush {q4-q7}
+    add sp, sp, #88
+
+    vdup.32 q15, r3    // output_zp
+
+    ldr r4, [sp]       // out_multiplier
+    ldr r5, [sp, #4]   // left_shift
+    ldr r6, [sp, #8]   // right_shift
+
+    ldr r7, [sp, #12]  // acc_min
+    vdup.32 q11, r7
+
+    ldr r8, [sp, #16]  // acc_max
+    vdup.32 q10, r8
+
+    mov r10, r0
+
+LoopDepth8:
+    cmp r2, #8
+    blt End
+    vld1.32 {q0}, [r1]!
+    vld1.32 {q13}, [r5]!
+    vshl.s32 q0, q0, q13
+    vld1.32 {q14}, [r4]!
+    vqrdmulh.s32 q0, q0, q14
+    vld1.32 {q12}, [r6]!
+    vand q4, q0, q12
+    vshr.s32 q4, q4, #31
+    vqadd.s32 q0, q0, q4
+    vrshl.s32 q0, q0, q12
+    vadd.i32 q0, q0, q15
+    vmax.s32 q0, q0, q11
+    vmin.s32 q0, q0, q10
+    vqmovn.s32 d4, q0
+
+    vld1.32 {q1}, [r1]!
+    vld1.32 {q13}, [r5]!
+    vshl.s32 q1, q1, q13
+    vld1.32 {q14}, [r4]!
+    vqrdmulh.s32 q1, q1, q14
+    vld1.32 {q12}, [r6]!
+    vand q4, q1, q12
+    vshr.s32 q4, q4, #31
+    vqadd.s32 q1, q1, q4
+    vrshl.s32 q1, q1, q12
+    vadd.i32 q1, q1, q15
+    vmax.s32 q1, q1, q11
+    vmin.s32 q1, q1, q10
+    vqmovn.s32 d5, q1
+    vqmovn.s16 d4, q2
+
+    vst1.8 {d4}, [r10]!
+
+    sub r2, r2, #8
+    b LoopDepth8
+
+LoopDepth4:
+    cmp r2, #4
+    blt End
+    vld1.32 {q0}, [r1]!
+    vld1.32 {q13}, [r5]!
+    vshl.s32 q0, q0, q13
+    vld1.32 {q14}, [r4]!
+    vqrdmulh.s32 q0, q0, q14
+    vld1.32 {q12}, [r6]!
+    vand q4, q0, q12
+    vshr.s32 q4, q4, #31
+    vqadd.s32 q0, q0, q4
+    vrshl.s32 q0, q0, q12
+    vadd.i32 q0, q0, q15
+    vmax.s32 q0, q0, q11
+    vmin.s32 q0, q0, q10
+
+    vqmovn.s32 d0, q0
+    vqmovn.s16 d0, q0
+
+    vst1.8 {d0[0]}, [r10]!
+    vst1.8 {d0[1]}, [r10]!
+    vst1.8 {d0[2]}, [r10]!
+    vst1.8 {d0[3]}, [r10]!
+
+    sub r2, r2, #4
+    b LoopDepth4
+
+End:
+    sub sp, sp, #88
+    vpop {q4-q7}
+    pop {r4-r8, r10}
+    bx lr
+
+#endif
+#endif
@@ -18,7 +18,7 @@
 #include <string.h>
 #include "nnacl/errorcode.h"

-int ExpandDims(float *input_ptr, float *output_ptr, size_t data_size) {
+int ExpandDims(void *input_ptr, void *output_ptr, size_t data_size) {
   memcpy(output_ptr, input_ptr, data_size);
   return NNACL_OK;
 }
@@ -27,7 +27,7 @@ typedef struct ExpandDimsParameter {
 #ifdef __cplusplus
 extern "C" {
 #endif
-int ExpandDims(float *input_ptr, float *output_ptr, size_t data_size);
+int ExpandDims(void *input_ptr, void *output_ptr, size_t data_size);
 #ifdef __cplusplus
 }
 #endif
@@ -1,26 +0,0 @@
-/**
- * Copyright 2020 Huawei Technologies Co., Ltd
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-#include "nnacl/fp32/leaky_relu.h"
-
-void DoLeakyRelu(float *input, float *output, LeakyReluParameter *param, int task_id) {
-  for (int i = task_id; i < param->input_num_; i += param->op_parameter_.thread_num_) {
-    if (input[i] <= 0) {
-      output[i] = input[i] * param->slope_[0];
-    } else {
-      output[i] = input[i];
-    }
-  }
-}
@@ -1,30 +0,0 @@
-/**
- * Copyright 2020 Huawei Technologies Co., Ltd
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-#ifndef MINDSPORE_LITE_NNACL_PRELU_H_
-#define MINDSPORE_LITE_NNACL_PRELU_H_
-
-#include "nnacl/op_base.h"
-#include "nnacl/leaky_relu_parameter.h"
-
-#ifdef __cplusplus
-extern "C" {
-#endif
-void DoLeakyRelu(float *input, float *output, LeakyReluParameter *prelu_param_, int task_id);
-#ifdef __cplusplus
-}
-#endif
-
-#endif  // MINDSPORE_LITE_NNACL_PRELU_H_
@@ -19,7 +19,7 @@
 #include "nnacl/errorcode.h"

 int AvgPooling(const float *input_ptr, float *output_ptr, PoolingParameter *pooling_param, int task_id, float minf,
                float maxf) {
   int stride_w = pooling_param->stride_w_;
   int stride_h = pooling_param->stride_h_;
   int pad_w = pooling_param->pad_l_;
@@ -137,7 +137,7 @@ int AvgPooling(const float *input_ptr, float *output_ptr, PoolingParameter *pool
         real_count = window;
       }
       if (real_count == 0) {
         return NNACL_ERR;
       }
       tmp_avg = tmp_avg / (float)real_count;
       tmp_avg = fmax(tmp_avg, minf);
@@ -28,7 +28,7 @@
 extern "C" {
 #endif
 int AvgPooling(const float *input_ptr, float *output_ptr, PoolingParameter *pooling_param, int task_id, float minf,
               float maxf);
 void MaxPooling(const float *input_ptr, float *output_ptr, PoolingParameter *pooling_param, int task_id, float minf,
                float maxf);
 #ifdef __cplusplus
@@ -35,6 +35,9 @@ void PostFuncInt8C4(const int32_t *in, const int32_t *bias, int8_t *out, size_t
 #ifdef ENABLE_ARM
 void ConvDwInt8Row(int32_t *output_ptr, const int8_t *input_ptr, const int16_t *weight_ptr, int num_pixels,
                    int output_channel, int input_step, int8_t input_zp);
+void ConvDwInt8PostAlign4PerChannel(int8_t *dst, int32_t *buffer, int channel4, int32_t output_zp,
+                                    int32_t *out_multiplier, int32_t *left_shift, int32_t *right_shift, int32_t acc_min,
+                                    int32_t acc_max);
 void ConvDwInt8PostAlign4(int8_t *dst, int32_t *buffer, int num_pixels, int32_t output_zp, int32_t out_multiplier,
                           int32_t left_shift, int32_t right_shift, int32_t acc_min, int32_t acc_max);
 void IndirectGemmInt16to32_8x4(int32_t *dst, const int16_t *src, const int16_t *weight, size_t ksize, size_t ic8,
@@ -64,9 +67,6 @@ void IndirectGemmInt8_4x4(int8_t *output, const int8_t *input, const int8_t *wei
 void DeconvDwInt8Center(int32_t *dst, const int16_t *src, const int16_t *weight, size_t height, size_t width,
                         size_t kernel_h, size_t kernel_w, size_t out_h_step, size_t block_channel, size_t in_sh_step,
                         size_t in_sw_step, size_t in_kh_step, size_t in_kw_step);
-void ConvDwInt8PostAlign4PerChannel(int8_t *dst, int32_t *buffer, int channel4, int32_t output_zp,
-                                    int32_t *out_multiplier, int32_t *left_shift, int32_t *right_shift, int32_t acc_min,
-                                    int32_t acc_max);
 #endif

 #ifdef __cplusplus
@@ -39,7 +39,7 @@ void ConvDwInt8Post(int8_t *dst, int32_t *buffer, int output_w, int channel, int
   // support perchannel
   for (int w = 0; w < output_w; w++) {
     int channel4 = 0;
-#ifdef ENABLE_ARM64
+#ifdef ENABLE_ARM
     channel4 = channel / 4 * 4;
     ConvDwInt8PostAlign4PerChannel(dst, buffer, channel4, output_zp, out_multiplier, left_shift, right_shift, acc_min,
                                    acc_max);
@@ -17,8 +17,7 @@
 #include "nnacl/int8/leaky_relu_int8.h"
 #include "nnacl/errorcode.h"

-int DoLeakReluInt8(int8_t *inputs, int8_t *output_ptr, LeakyReluQuantArg *quant_prelu_parm, QuantArg *input_quant,
-                   int task_id) {
+int DoLeakReluInt8(int8_t *inputs, int8_t *output_ptr, LeakyReluQuantArg *quant_prelu_parm, int task_id) {
   if (quant_prelu_parm == NULL) {
     return NNACL_NULL_PTR;
   }
@@ -27,17 +26,12 @@ int DoLeakReluInt8(int8_t *inputs, int8_t *output_ptr, LeakyReluQuantArg *quant_
   const float output_inverse_scale = 1.f / output_scale;
   int output_dim = quant_prelu_parm->input_dim_;

+  float scale = quant_prelu_parm->quant_arg.in_args_.scale_ * output_inverse_scale;
+  float bias = -quant_prelu_parm->quant_arg.in_args_.zp_ * scale;
   for (int i = 0; i < output_dim; i++) {
-    input_quant[i].scale_ = quant_prelu_parm->quant_arg.in_args_.scale_;
-    input_quant[i].zp_ = quant_prelu_parm->quant_arg.in_args_.zp_;
-  }
-
-  for (int i = 0; i < output_dim; i++) {
-    float scale = input_quant[i].scale_ * output_inverse_scale;
-    float bias = -input_quant[i].zp_ * scale;
     for (int j = task_id; j < quant_prelu_parm->element_num; j += quant_prelu_parm->op_parameter_.thread_num_) {
       if (inputs[j] <= 0) {
-        int32_t output_tmp = round(inputs[j] * quant_prelu_parm->slope_[0] * scale + bias) + output_zp;
+        int32_t output_tmp = round(inputs[j] * quant_prelu_parm->slope_ * scale + bias) + output_zp;
         if (output_tmp > 127) {
           output_ptr[j] = 127;
         } else if (output_tmp < -128) {
@@ -57,6 +51,5 @@ int DoLeakReluInt8(int8_t *inputs, int8_t *output_ptr, LeakyReluQuantArg *quant_
       }
     }
   }
-  free(input_quant);
   return NNACL_OK;
 }
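With the per-channel `QuantArg` array gone, the kernel works from a single per-tensor scale and bias. A NumPy sketch of that requantization; the names are illustrative, and the positive branch is assumed to requantize without the slope (the usual LeakyReLU behavior), since the hunk only shows the negative branch:

```python
import numpy as np


def leaky_relu_int8(inputs, slope, in_scale, in_zp, out_scale, out_zp):
    # Per-tensor requantization as in the simplified DoLeakReluInt8 above:
    # one scale/bias pair instead of a per-channel QuantArg array.
    scale = in_scale / out_scale          # in_scale * output_inverse_scale
    bias = -in_zp * scale
    out = np.where(inputs <= 0,
                   np.round(inputs * slope * scale + bias) + out_zp,
                   np.round(inputs * scale + bias) + out_zp)
    return np.clip(out, -128, 127).astype(np.int8)  # saturate to int8 range


x = np.array([-50, -1, 0, 40], dtype=np.int8)
print(leaky_relu_int8(x, slope=0.1, in_scale=0.5, in_zp=0,
                      out_scale=0.5, out_zp=0))  # -> [-5  0  0 40]
```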
@@ -23,8 +23,7 @@
 #ifdef __cplusplus
 extern "C" {
 #endif
-int DoLeakReluInt8(int8_t *inputs, int8_t *output_ptr, LeakyReluQuantArg *quant_Prelu_parm, QuantArg *input_quant,
-                   int task_id);
+int DoLeakReluInt8(int8_t *inputs, int8_t *output_ptr, LeakyReluQuantArg *quant_Prelu_parm, int task_id);
 #ifdef __cplusplus
 }
 #endif
Some files were not shown because too many files have changed in this diff.