PR to master #8
README.md
@@ -69,6 +69,8 @@ MindSpore offers build options across multiple backends:
 | | Ubuntu-aarch64 | ✔️ |
+| | EulerOS-x86 | ✔️ |
 | | EulerOS-aarch64 | ✔️ |
+| | CentOS-x86 | ✔️ |
 | | CentOS-aarch64 | ✔️ |
 | GPU CUDA 10.1 | Ubuntu-x86 | ✔️ |
 | CPU | Ubuntu-x86 | ✔️ |
 | | Ubuntu-aarch64 | ✔️ |
@@ -79,7 +81,7 @@ For installation using `pip`, take `CPU` and `Ubuntu-x86` build version as an ex
 1. Download whl from [MindSpore download page](https://www.mindspore.cn/versions/en), and install the package.

 ```
-pip install https://ms-release.obs.cn-north-4.myhuaweicloud.com/0.7.0-beta/MindSpore/cpu/ubuntu_x86/mindspore-0.7.0-cp37-cp37m-linux_x86_64.whl
+pip install https://ms-release.obs.cn-north-4.myhuaweicloud.com/1.0.0/MindSpore/cpu/ubuntu_x86/mindspore-1.0.0-cp37-cp37m-linux_x86_64.whl
 ```

 2. Run the following command to verify the install.
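The verification command itself falls outside this hunk. A minimal smoke test for the CPU wheel might look like the following sketch (it assumes the 1.0-era `TensorAdd` operator name, which later releases renamed to `Add`):

```python
import numpy as np
from mindspore import Tensor, context
from mindspore.ops import operations as P

# target the CPU backend that the wheel above was built for
context.set_context(mode=context.GRAPH_MODE, device_target="CPU")

x = Tensor(np.ones([1, 3, 3, 4]).astype(np.float32))
y = Tensor(np.ones([1, 3, 3, 4]).astype(np.float32))
print(P.TensorAdd()(x, y))  # a tensor filled with 2.0 means the install works
```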
@@ -136,8 +138,8 @@ currently the containerized build options are supported as follows:

 For `CPU` backend, you can directly pull and run the latest stable image using the below command:
 ```
-docker pull mindspore/mindspore-cpu:0.7.0-beta
-docker run -it mindspore/mindspore-cpu:0.7.0-beta /bin/bash
+docker pull mindspore/mindspore-cpu:1.0.0
+docker run -it mindspore/mindspore-cpu:1.0.0 /bin/bash
 ```

 * GPU
@@ -170,8 +172,8 @@ currently the containerized build options are supported as follows:
 ```
 Then you can pull and run the latest stable image using the below command:
 ```
-docker pull mindspore/mindspore-gpu:0.7.0-beta
-docker run -it --runtime=nvidia --privileged=true mindspore/mindspore-gpu:0.7.0-beta /bin/bash
+docker pull mindspore/mindspore-gpu:1.0.0
+docker run -it --runtime=nvidia --privileged=true mindspore/mindspore-gpu:1.0.0 /bin/bash
 ```

 To test if the docker image works, please execute the python code below and check the output:
README_CN.md
@@ -66,6 +66,8 @@ MindSpore offers build options across multiple backends:
 | | Ubuntu-aarch64 | ✔️ |
+| | EulerOS-x86 | ✔️ |
 | | EulerOS-aarch64 | ✔️ |
+| | CentOS-x86 | ✔️ |
 | | CentOS-aarch64 | ✔️ |
 | GPU CUDA 10.1 | Ubuntu-x86 | ✔️ |
 | CPU | Ubuntu-x86 | ✔️ |
 | | Ubuntu-aarch64 | ✔️ |
@@ -76,7 +78,7 @@ MindSpore offers build options across multiple backends:
 1. Download and install the whl package from the [MindSpore download page](https://www.mindspore.cn/versions).

 ```
-pip install https://ms-release.obs.cn-north-4.myhuaweicloud.com/0.7.0-beta/MindSpore/cpu/ubuntu_x86/mindspore-0.7.0-cp37-cp37m-linux_x86_64.whl
+pip install https://ms-release.obs.cn-north-4.myhuaweicloud.com/1.0.0/MindSpore/cpu/ubuntu_x86/mindspore-1.0.0-cp37-cp37m-linux_x86_64.whl
 ```

 2. Run the following command to verify the installation.
@@ -132,8 +134,8 @@ MindSpore Docker images are hosted on [Docker Hub](https://hub.docker.com/r/mindspore

 For the `CPU` backend, you can directly pull and run the latest stable image with the following command:
 ```
-docker pull mindspore/mindspore-cpu:0.7.0-beta
-docker run -it mindspore/mindspore-cpu:0.7.0-beta /bin/bash
+docker pull mindspore/mindspore-cpu:1.0.0
+docker run -it mindspore/mindspore-cpu:1.0.0 /bin/bash
 ```

 * GPU
@@ -166,8 +168,8 @@ MindSpore Docker images are hosted on [Docker Hub](https://hub.docker.com/r/mindspore
 ```
 Pull and run the latest stable image with the following command:
 ```
-docker pull mindspore/mindspore-gpu:0.7.0-beta
-docker run -it --runtime=nvidia --privileged=true mindspore/mindspore-gpu:0.7.0-beta /bin/bash
+docker pull mindspore/mindspore-gpu:1.0.0
+docker run -it --runtime=nvidia --privileged=true mindspore/mindspore-gpu:1.0.0 /bin/bash
 ```

 To test whether Docker works properly, run the Python code below and check the output:
RELEASE.md
@@ -1,3 +1,112 @@
# Release 1.0.0

## Major Features and Improvements

### MindSpore Training and Inference Framework

#### Ascend 910

* New models
    * DenseNet121: a dense convolutional neural network, which connects each layer to every other layer in a feed-forward fashion, for object recognition on the ImageNet dataset.
    * UNet2D-Medical: UNet medical model for 2D image segmentation (Convolutional Networks for Biomedical Image Segmentation), on the ISBI Challenge database.
* Frontend and user interface
    * Second-Order Optimization
        * Enable second-order optimization for BERT on Ascend 910, which can achieve a masked LM accuracy of 71.3% in 800 seconds using 8 Ascend 910 devices (BERT-Large @MLPerf v0.7 dataset).
    * New GNN model BGCF
        * Bayesian Graph Convolutional Filtering network, which naturally incorporates the uncertainty in the user-item interaction graph and shows excellent recommendation performance on the Amazon-Beauty dataset.
    * Add append interface for SequentialCell.
    * Add a level `auto` for AMP.
* Executor and performance optimization
    * Support quantized networks (ResNet50, YOLOv3 & MobileNetV2).
    * Ease-of-use optimization: project compilation time optimization, CMakeLists regularization, and independent compilation and installation of cuDNN and CUDA.
* Data processing, augmentation, and save format
    * Support GeneratorDataset returning string type

#### Other Hardware Support

* GPU platform
    * Enable second-order optimization for ResNet50 on GPU, which achieves a 30% improvement in training time compared to SGD with Momentum (ResNet50 @ImageNet).
* CPU platform
    * ...

#### User interface change log

* Remove global object GradOperation in Autodiff([!5011](https://gitee.com/mindspore/mindspore/pulls/5011))
* Remove useless attribute 'name' in Autodiff([!5172](https://gitee.com/mindspore/mindspore/pulls/5172))
* Rectify distributed init([!5350](https://gitee.com/mindspore/mindspore/pulls/5350))
* Move the setting of ParallelMode from train.parallel_utils to context([!5351](https://gitee.com/mindspore/mindspore/pulls/5351))
* Modify save_checkpoint([!5482](https://gitee.com/mindspore/mindspore/pulls/5482))
* Wrap numpy random seed into an API([!5634](https://gitee.com/mindspore/mindspore/pulls/5634))
* Delete enable_fused_layernorm in some model zoo scripts([!5665](https://gitee.com/mindspore/mindspore/pulls/5665))
* Move 'multi-subgraphs' interface to internal([!5696](https://gitee.com/mindspore/mindspore/pulls/5696))
* Rename mirror_mean to gradient_mean([!5700](https://gitee.com/mindspore/mindspore/pulls/5700))
* Remove default value of 'group' of DepthWiseConv2d([!5865](https://gitee.com/mindspore/mindspore/pulls/5865))
* Modify interface for function and remove duplicated def([!5958](https://gitee.com/mindspore/mindspore/pulls/5958))
* Unify Conv2d and DepthwiseConv2d([!5916](https://gitee.com/mindspore/mindspore/pulls/5916))
* Modify SoftmaxCrossEntropyWithLogits([!5502](https://gitee.com/mindspore/mindspore/pulls/5502))
* Change API set_strategy() to shard()([!5991](https://gitee.com/mindspore/mindspore/pulls/5991))
* Move batch_size from bert_cfg_cfg to cfg([!6233](https://gitee.com/mindspore/mindspore/pulls/6233))
* Remove unused parameters from SummaryRecord __init__([!5548](https://gitee.com/mindspore/mindspore/pulls/5548))
* Remove sens parameter of TrainOneStepWithLossScaleCell([!5753](https://gitee.com/mindspore/mindspore/pulls/5753))
* Optimize the TrainOneStepCell for user definition([!6159](https://gitee.com/mindspore/mindspore/pulls/6159))
* Delete seed0 and seed1 of nn.Dropout([!5735](https://gitee.com/mindspore/mindspore/pulls/5735))
* Delete DataWrapper([!6101](https://gitee.com/mindspore/mindspore/pulls/6101))
* LSTM API optimization([!6374](https://gitee.com/mindspore/mindspore/pulls/6374))
* Merge P\C\F of ops([!5645](https://gitee.com/mindspore/mindspore/pulls/5645))
* Delete SoftmaxCrossEntropyExpand interface([!6607](https://gitee.com/mindspore/mindspore/pulls/6607))
* Adjust GroupNorm interface([!6329](https://gitee.com/mindspore/mindspore/pulls/6329))
* Modify init interface to internal interface([!6651](https://gitee.com/mindspore/mindspore/pulls/6651))
* Log optimization([!5842](https://gitee.com/mindspore/mindspore/pulls/5842))
* Remove useless API dataset.set_dataset_size([!5806](https://gitee.com/mindspore/mindspore/pulls/5806))
* Add usage parameter to some Dataset APIs([!5605](https://gitee.com/mindspore/mindspore/pulls/5605))
* Change the import path, such as from mindspore.dataset.transforms.vision to mindspore.dataset.vision.transforms([!5384](https://gitee.com/mindspore/mindspore/pulls/5384))
* Rename ImageFolderDatasetV2 to ImageFolderDataset([!5384](https://gitee.com/mindspore/mindspore/pulls/5384))
* Dataset.map parameter optimization([!5384](https://gitee.com/mindspore/mindspore/pulls/5384))
* Add new API dataset.get_col_names([!5384](https://gitee.com/mindspore/mindspore/pulls/5384))
* Remove useless API MindRecord finish([!5580](https://gitee.com/mindspore/mindspore/pulls/5580))

### MindSpore Lite

* Converter
    * Add 6 TFLite ops, 7 Caffe ops, and 1 ONNX op.
    * Add support for Windows.
    * Support parallel inference of multiple sessions to adapt to more scenarios.
    * Support 8-bit weight-only quantization; most mainstream models have a small accuracy loss (less than 0.5%) compared to the non-quantized FP32 model.

* CPU & GPU
    * Add 20 CPU ops, including FP32, int8/uint8, FP16, and int32 ops.
    * Add FP16 support for GPU; add 14 GPU ops including FP32/FP16.
    * Add Buffer/Image2D transform op for GPU.
    * Performance optimization for CPU ops, focused on ARM32.
    * Performance optimization for GPU convolution using Winograd.

* Tool & example
    * Add object detection Android demo.

## Bugfixes

* Models
    * Fix the constant folding problem in multiply.([!6092](https://gitee.com/mindspore/mindspore/pulls/6092))
    * Move batch_size from bert_net_cfg to cfg in BERT scripts.([!6233](https://gitee.com/mindspore/mindspore/pulls/6233))
    * Modify the checkpoint file path.([!6137](https://gitee.com/mindspore/mindspore/pulls/6137))
* Python API
    * Fix semi-auto parallel when the parameter of reshape has another user([!5722](https://gitee.com/mindspore/mindspore/pulls/5722))
    * Raise ValueError when calling a hook function in graph mode([!5831](https://gitee.com/mindspore/mindspore/pulls/5831))
* Executor
    * Fix PyNative mode to build temporary nn objects.([!6189](https://gitee.com/mindspore/mindspore/pulls/6189))
    * Fix the accuracy problem of multiple inputs of the multi-card communication operator broadcast.([!6522](https://gitee.com/mindspore/mindspore/pulls/5622))
    * Fix the problem that the sample distribution interface categorical does not support graph mode.([!5772](https://gitee.com/mindspore/mindspore/pulls/5772))
    * Fix the random seed failure problem of the polynomial downsampling distribution operator.([!5948](https://gitee.com/mindspore/mindspore/pulls/5948))
    * Fix unnecessary address binding issues in GPU heterogeneous scenarios.([!6232](https://gitee.com/mindspore/mindspore/pulls/6232))
* GPU platform
    * Fix a kernel resource leak([!5315](https://gitee.com/mindspore/mindspore/pulls/5315))
    * Fix insufficient memory for continuous unit test running([!5617](https://gitee.com/mindspore/mindspore/pulls/5617))
    * Fix the memory leak in the sparse slicer([!5578](https://gitee.com/mindspore/mindspore/pulls/5578))
* Data processing
    * Fix hang when using pyfunc([!6346](https://gitee.com/mindspore/mindspore/pulls/6346))
    * Fix GPU device queue not releasing the GIL during resource cleanup([!5964](https://gitee.com/mindspore/mindspore/pulls/5964))
    * Fix hang if a script exits abnormally([!6441](https://gitee.com/mindspore/mindspore/pulls/6441))
## Contributors

Thanks goes to these wonderful people:

Adel, AGroupofProbiotocs, anthonyaje, anzhengqi, askmiao, baihuawei, baiyangfan, bai-yangfan, bingyaweng, BowenK, buxue, caifubi, CaoJian, caojian05, caozhou, Cathy, changzherui, chenfei, chengxianbin, chenhaozhe, chenjianping, chenzomi, chenzupeng, chujinjin, cj, cjh9368, Corleone, danish, Danish, dayschan, eric, Eric, fary86, fuzhiye, Gaoxiong, gengdongjie, gongdaguo, gukecai, guoqi, gzhcv, hangq, hanhuifeng2020, Harshvardhan, He, heleiwang, hexia, Hoai, HuangBingjian, huangdongrun, huanghui, huangxinjing, huzhifeng, hwjiaorui, Jesse, jianghui58, jiangzhiwen, Jiaqi, jin-xiulang, jinyaohui, jjfeing, John, Jonathan, jonyguo, jzg, kai00, kingfo, kingxian, kpy, kswang, laiyongqiang, leonwanghui, Li, liangchenghui, liangzelang, lichen_101010, lichenever, lihongkang, lilei, limingqi107, ling, linqingke, liubuyu, liuwenhao4, liuxiao78, liuxiao93, liuyang_655, liuzhongkai, Lixia, lixian, liyanliu, liyong, lizhenyu, luoyang, lvchangquan, lvliang, lz, mahdi, Mahdi, maning202007, Margaret_wangrui, mayang, mengyuanli, nhussain, ougongchang, panfengfeng, panyifeng, Payne, Peilin, peixu_ren, Pengyongrong, qianlong, r1chardf1d0, riemann_penn, root, Sheng, shenwei41, simson, Simson, Su, sunsuodong, tao_yunhao, tinazhang, VectorSL, Wan, wandongdong, wangdongxu, wangmin, wangnan39@huawei.com, wangyue01, wangzhe, wanyiming, Wei, wenchunjiang, wilfChen, WilliamLian, wsc, wukesong, wuweikang, wuxuejian, Xiaoda, xiefangqi, xuanyue, xulei2020, Xun, xuyongfei, yanghaitao, yanghaitao1, yanghaoran, YangLuo, yangruoqi713, yankai, yanzhenxiang2020, yao_yf, yepei6, yeyunpeng, Yi, yoni, yoonlee666, yuchaojie, yujianfeng, yuximiao, zengzitao, Zhang, zhanghaibo5@huawei.com, zhanghuiyao, zhangyihui, zhangz0911gm, zhanke, zhanyuan, zhaodezan, zhaojichen, zhaoting, zhaozhenlong, zhengjun10, zhoufeng, zhousiyi, zhouyaqiang, Zichun, Zirui, Ziyan, zjun, ZPaC

Contributions of any kind are welcome!

# Release 0.7.0-beta

## Major Features and Improvements
akg

@@ -1 +1 @@
-Subproject commit 4d897c23fc41c5f7013efd0c517796233671518a
+Subproject commit 6c492a8c9d9730ad11ffc5481cc532ae500b0da5
build.bat
@@ -16,12 +16,13 @@
 @title mindspore_build

 SET BASEPATH=%CD%
-IF NOT EXIST "%BASEPATH%/build" (
+SET BUILD_PATH=%BASEPATH%/build
+
+IF NOT EXIST "%BUILD_PATH%" (
 md "build"
 )

-cd %BASEPATH%/build
-set BUILD_PATH=%CD%
+cd %BUILD_PATH%

 IF NOT EXIST "%BUILD_PATH%/mindspore" (
 md "mindspore"

@@ -38,7 +39,7 @@ IF "%1%" == "lite" (
 call :run_cmake
 IF errorlevel 1 (
 echo "cmake fail."
-call :run_fail
+goto run_fail
 )
 ) ELSE (
 call :gene_protobuf

@@ -53,7 +54,7 @@ IF "%1%" == "lite" (
 )
 IF errorlevel 1 (
 echo "build fail."
-call :run_fail
+goto run_fail
 ) ELSE (
 cd %BASEPATH%/output
 rd /s /q _CPack_Packages

@@ -63,7 +64,7 @@ IF "%1%" == "lite" (
 -G "CodeBlocks - MinGW Makefiles" ../..
 IF NOT %errorlevel% == 0 (
 echo "cmake fail."
-call :run_fail
+goto run_fail
 )

 IF "%1%" == "" (

@@ -73,7 +74,7 @@ IF "%1%" == "lite" (
 )
 IF NOT %errorlevel% == 0 (
 echo "build fail."
-call :run_fail
+goto run_fail
 )
 )

@@ -82,10 +83,14 @@ cd %BASEPATH%
 goto run_eof

 :run_cmake
+set VERSION_MAJOR=1
+set VERSION_MINOR=0
+set VERSION_REVISION=0
+echo "======Start building MindSpore Lite %VERSION_MAJOR%.%VERSION_MINOR%.%VERSION_REVISION%======"
 cd %BUILD_PATH%/mindspore
 cmake -DBUILD_DEVICE=on -DBUILD_CONVERTER=on -DPLATFORM_ARM64=off -DSUPPORT_TRAIN=off ^
 -DCMAKE_BUILD_TYPE=Release -DSUPPORT_GPU=off -DBUILD_MINDDATA=off -DOFFLINE_COMPILE=off ^
--DMS_VERSION_MAJOR=0 -DMS_VERSION_MINOR=7 -DMS_VERSION_REVISION=0 ^
+-DMS_VERSION_MAJOR=%VERSION_MAJOR% -DMS_VERSION_MINOR=%VERSION_MINOR% -DMS_VERSION_REVISION=%VERSION_REVISION% ^
 -G "CodeBlocks - MinGW Makefiles" "%BASEPATH%/mindspore/lite"
 GOTO:EOF

@@ -123,6 +128,5 @@ GOTO:EOF
 :run_fail
 cd %BASEPATH%
-set errorlevel=1
 EXIT

 :run_eof
build.sh
@@ -25,7 +25,7 @@ usage()
 echo "bash build.sh [-d] [-r] [-v] [-c on|off] [-t on|off] [-g on|off] [-h] [-b ge] [-m infer|train] \\"
 echo " [-a on|off] [-p on|off] [-i] [-L] [-R] [-D on|off] [-j[n]] [-e gpu|d|cpu] \\"
 echo " [-P on|off] [-z [on|off]] [-M on|off] [-V 9.2|10.1] [-I arm64|arm32|x86_64] [-K] \\"
-echo " [-B on|off] [-w on|off] [-E] [-l on|off] [-n full|lite|off] [-T on|off]"
+echo " [-B on|off] [-w on|off] [-E] [-l on|off] [-n full|lite|off]"
 echo ""
 echo "Options:"
 echo " -d Debug mode"

@@ -58,7 +58,6 @@ usage()
 echo " -B Enable debugger, default on"
 echo " -E Enable IBVERBS for parameter server, default off"
 echo " -l Compile with python dependency, default on"
-echo " -T Enable on-device training, default off"
 }

 # check value of input is 'on' or 'off'

@@ -237,7 +236,7 @@ checkopts()
 ;;
 z)
 eval ARG=\$\{$OPTIND\}
-if [[ -n $ARG && $ARG != -* ]]; then
+if [[ -n "$ARG" && "$ARG" != -* ]]; then
 OPTARG="$ARG"
 check_on_off $OPTARG z
 OPTIND=$((OPTIND + 1))
@@ -10,10 +10,10 @@ if (CMAKE_SYSTEM_NAME MATCHES "Windows")
 MD5 17757c84f49edd42d34ae8c9288110a1)
 else()
 mindspore_add_pkg(onednn
-VER 1.5
+VER 1.7
 LIBS dnnl mkldnn
-URL https://github.com/oneapi-src/oneDNN/archive/v1.5.tar.gz
-MD5 5d97e0e8f4c0b37da5f524533b7a644b
+URL https://github.com/oneapi-src/oneDNN/archive/v1.7-rc.tar.gz
+MD5 f5e3ce761f6521e235817299f882b965
 CMAKE_OPTION -DDNNL_ARCH_OPT_FLAGS='' -DDNNL_CPU_RUNTIME='SEQ' -DDNNL_BUILD_EXAMPLES=OFF -DDNNL_BUILD_TESTS=OFF)
 endif()
@@ -56,7 +56,7 @@ if (PLATFORM_ARM64)
 install(FILES ${TOP_DIR}/mindspore/lite/build/src/libmindspore-lite.so DESTINATION ${LIB_DIR} COMPONENT ${COMPONENT_NAME})
 install(FILES ${TOP_DIR}/mindspore/lite/build/src/libmindspore-lite.a DESTINATION ${LIB_DIR} COMPONENT ${COMPONENT_NAME})
 install(FILES ${TOP_DIR}/mindspore/core/ir/dtype/type_id.h DESTINATION ${INC_DIR}/ir/dtype COMPONENT ${COMPONENT_NAME})
-install(DIRECTORY ${TOP_DIR}/mindspore/lite/include/ DESTINATION ${INC_DIR} COMPONENT ${COMPONENT_NAME} FILES_MATCHING PATTERN "*.h")
+install(DIRECTORY ${TOP_DIR}/mindspore/lite/include/ DESTINATION ${INC_DIR} COMPONENT ${COMPONENT_NAME} FILES_MATCHING PATTERN "*.h" PATTERN "train_model.h" EXCLUDE PATTERN "train_session.h" EXCLUDE)
 install(DIRECTORY ${TOP_DIR}/mindspore/lite/schema/ DESTINATION ${INC_DIR}/schema COMPONENT ${COMPONENT_NAME} FILES_MATCHING PATTERN "*.h" PATTERN "inner" EXCLUDE)
 install(FILES ${TOP_DIR}/mindspore/lite/build/src/libmindspore-lite-optimize.so DESTINATION ${LIB_DIR} COMPONENT ${COMPONENT_NAME})
 install(FILES ${TOP_DIR}/mindspore/lite/build/src/libmindspore-lite-fp16.so DESTINATION ${LIB_DIR} COMPONENT ${COMPONENT_NAME})

@@ -65,7 +65,7 @@ elseif (PLATFORM_ARM32)
 install(FILES ${TOP_DIR}/mindspore/lite/build/src/libmindspore-lite.so DESTINATION ${LIB_DIR} COMPONENT ${COMPONENT_NAME})
 install(FILES ${TOP_DIR}/mindspore/lite/build/src/libmindspore-lite.a DESTINATION ${LIB_DIR} COMPONENT ${COMPONENT_NAME})
 install(FILES ${TOP_DIR}/mindspore/core/ir/dtype/type_id.h DESTINATION ${INC_DIR}/ir/dtype COMPONENT ${COMPONENT_NAME})
-install(DIRECTORY ${TOP_DIR}/mindspore/lite/include/ DESTINATION ${INC_DIR} COMPONENT ${COMPONENT_NAME} FILES_MATCHING PATTERN "*.h")
+install(DIRECTORY ${TOP_DIR}/mindspore/lite/include/ DESTINATION ${INC_DIR} COMPONENT ${COMPONENT_NAME} FILES_MATCHING PATTERN "*.h" PATTERN "train_model.h" EXCLUDE PATTERN "train_session.h" EXCLUDE)
 install(DIRECTORY ${TOP_DIR}/mindspore/lite/schema/ DESTINATION ${INC_DIR}/schema COMPONENT ${COMPONENT_NAME} FILES_MATCHING PATTERN "*.h" PATTERN "inner" EXCLUDE)
 install(DIRECTORY ${TOP_DIR}/third_party/flatbuffers/include DESTINATION ${FLATBF_DIR} COMPONENT ${COMPONENT_NAME})
 elseif (CMAKE_SYSTEM_NAME MATCHES "Windows")

@@ -75,7 +75,7 @@ elseif (CMAKE_SYSTEM_NAME MATCHES "Windows")
 install(FILES ${LIB_LIST} DESTINATION ${TOP_DIR}/build/mindspore/package COMPONENT ${COMPONENT_NAME})
 install(FILES ${TOP_DIR}/build/mindspore/tools/converter/libconverter_parser.a DESTINATION ${TOP_DIR}/build/mindspore/package COMPONENT ${PARSER_NAME})
 else ()
-install(DIRECTORY ${TOP_DIR}/mindspore/lite/include/ DESTINATION ${INC_DIR_RUN_X86} COMPONENT ${RUN_X86_COMPONENT_NAME} FILES_MATCHING PATTERN "*.h")
+install(DIRECTORY ${TOP_DIR}/mindspore/lite/include/ DESTINATION ${INC_DIR_RUN_X86} COMPONENT ${RUN_X86_COMPONENT_NAME} FILES_MATCHING PATTERN "*.h" PATTERN "train_model.h" EXCLUDE PATTERN "train_session.h" EXCLUDE)
 install(DIRECTORY ${TOP_DIR}/mindspore/lite/schema/ DESTINATION ${INC_DIR_RUN_X86}/schema COMPONENT ${RUN_X86_COMPONENT_NAME} FILES_MATCHING PATTERN "*.h" PATTERN "inner" EXCLUDE)
 install(FILES ${TOP_DIR}/mindspore/core/ir/dtype/type_id.h DESTINATION ${INC_DIR_RUN_X86}/ir/dtype COMPONENT ${RUN_X86_COMPONENT_NAME})
 install(DIRECTORY ${TOP_DIR}/third_party/flatbuffers/include DESTINATION ${FLATBF_DIR_RUN_X86} COMPONENT ${RUN_X86_COMPONENT_NAME})
@@ -0,0 +1,67 @@
FROM ubuntu:18.04

MAINTAINER leonwanghui <leon.wanghui@huawei.com>

# Set env
ENV PYTHON_ROOT_PATH /usr/local/python-3.7.5
ENV PATH /usr/local/bin:$PATH

# Install base tools
RUN apt update \
    && DEBIAN_FRONTEND=noninteractive apt install -y \
    vim \
    wget \
    curl \
    xz-utils \
    net-tools \
    openssh-client \
    git \
    ntpdate \
    tzdata \
    tcl \
    sudo \
    bash-completion

# Install compile tools
RUN DEBIAN_FRONTEND=noninteractive apt install -y \
    gcc \
    g++ \
    zlibc \
    make \
    libgmp-dev \
    patch \
    autoconf \
    libtool \
    automake \
    flex

# Set bash
RUN echo "dash dash/sh boolean false" | debconf-set-selections
RUN DEBIAN_FRONTEND=noninteractive dpkg-reconfigure dash

# Install python (v3.7.5)
RUN apt install -y libffi-dev libssl-dev zlib1g-dev libbz2-dev libncurses5-dev \
    libgdbm-dev libgdbm-compat-dev liblzma-dev libreadline-dev libsqlite3-dev \
    && cd /tmp \
    && wget https://github.com/python/cpython/archive/v3.7.5.tar.gz \
    && tar -xvf v3.7.5.tar.gz \
    && cd /tmp/cpython-3.7.5 \
    && mkdir -p ${PYTHON_ROOT_PATH} \
    && ./configure --prefix=${PYTHON_ROOT_PATH} \
    && make -j4 \
    && make install -j4 \
    && rm -f /usr/local/bin/python \
    && rm -f /usr/local/bin/pip \
    && ln -s ${PYTHON_ROOT_PATH}/bin/python3.7 /usr/local/bin/python \
    && ln -s ${PYTHON_ROOT_PATH}/bin/pip3.7 /usr/local/bin/pip \
    && rm -rf /tmp/cpython-3.7.5 \
    && rm -f /tmp/v3.7.5.tar.gz

# Set pip source
RUN mkdir -pv /root/.pip \
    && echo "[global]" > /root/.pip/pip.conf \
    && echo "trusted-host=mirrors.aliyun.com" >> /root/.pip/pip.conf \
    && echo "index-url=http://mirrors.aliyun.com/pypi/simple/" >> /root/.pip/pip.conf

# Install MindSpore cpu whl package
RUN pip install --no-cache-dir https://ms-release.obs.cn-north-4.myhuaweicloud.com/1.0.0/MindSpore/cpu/ubuntu_x86/mindspore-1.0.0-cp37-cp37m-linux_x86_64.whl
@@ -0,0 +1,81 @@
FROM nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04

MAINTAINER leonwanghui <leon.wanghui@huawei.com>

# Set env
ENV PYTHON_ROOT_PATH /usr/local/python-3.7.5
ENV OMPI_ROOT_PATH /usr/local/openmpi-3.1.5
ENV PATH ${OMPI_ROOT_PATH}/bin:/usr/local/bin:$PATH
ENV LD_LIBRARY_PATH ${OMPI_ROOT_PATH}/lib:$LD_LIBRARY_PATH

# Install base tools
RUN apt update \
    && DEBIAN_FRONTEND=noninteractive apt install -y \
    vim \
    wget \
    curl \
    xz-utils \
    net-tools \
    openssh-client \
    git \
    ntpdate \
    tzdata \
    tcl \
    sudo \
    bash-completion

# Install compile tools
RUN DEBIAN_FRONTEND=noninteractive apt install -y \
    gcc \
    g++ \
    zlibc \
    make \
    libgmp-dev \
    patch \
    autoconf \
    libtool \
    automake \
    flex

# Set bash
RUN echo "dash dash/sh boolean false" | debconf-set-selections
RUN DEBIAN_FRONTEND=noninteractive dpkg-reconfigure dash

# Install python (v3.7.5)
RUN apt install -y libffi-dev libssl-dev zlib1g-dev libbz2-dev libncurses5-dev \
    libgdbm-dev libgdbm-compat-dev liblzma-dev libreadline-dev libsqlite3-dev \
    && cd /tmp \
    && wget https://github.com/python/cpython/archive/v3.7.5.tar.gz \
    && tar -xvf v3.7.5.tar.gz \
    && cd /tmp/cpython-3.7.5 \
    && mkdir -p ${PYTHON_ROOT_PATH} \
    && ./configure --prefix=${PYTHON_ROOT_PATH} \
    && make -j4 \
    && make install -j4 \
    && rm -f /usr/local/bin/python \
    && rm -f /usr/local/bin/pip \
    && ln -s ${PYTHON_ROOT_PATH}/bin/python3.7 /usr/local/bin/python \
    && ln -s ${PYTHON_ROOT_PATH}/bin/pip3.7 /usr/local/bin/pip \
    && rm -rf /tmp/cpython-3.7.5 \
    && rm -f /tmp/v3.7.5.tar.gz

# Set pip source
RUN mkdir -pv /root/.pip \
    && echo "[global]" > /root/.pip/pip.conf \
    && echo "trusted-host=mirrors.aliyun.com" >> /root/.pip/pip.conf \
    && echo "index-url=http://mirrors.aliyun.com/pypi/simple/" >> /root/.pip/pip.conf

# Install openmpi (v3.1.5)
RUN cd /tmp \
    && wget https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.5.tar.gz \
    && tar -xvf openmpi-3.1.5.tar.gz \
    && cd /tmp/openmpi-3.1.5 \
    && mkdir -p ${OMPI_ROOT_PATH} \
    && ./configure --prefix=${OMPI_ROOT_PATH} \
    && make -j4 \
    && make install -j4 \
    && rm -rf /tmp/openmpi-3.1.5 \
    && rm -f /tmp/openmpi-3.1.5.tar.gz

# Install MindSpore cuda-10.1 whl package
RUN pip install --no-cache-dir https://ms-release.obs.cn-north-4.myhuaweicloud.com/1.0.0/MindSpore/gpu/ubuntu_x86/cuda-10.1/mindspore_gpu-1.0.0-cp37-cp37m-linux_x86_64.whl
@@ -1 +1 @@
-Subproject commit 4f6ebe0f92da8c758b1b71a2bc7ff33f9344e18a
+Subproject commit 14db109491bc81473905a5eb9e82f6234aca419b
@@ -74,7 +74,11 @@ class InferTensorBase {
 if (data_len == 0) {
 return true;
 }
-memcpy_s(mutable_data(), data_size(), data, data_len);
+auto ret = memcpy_s(mutable_data(), data_size(), data, data_len);
+if (ret != 0) {
+  MSI_LOG_ERROR << "Set data memcpy_s failed";
+  return false;
+}
 return true;
 }
@@ -81,7 +81,6 @@ class MS_API InferSession {
 }
 static std::shared_ptr<InferSession> CreateSession(const std::string &device, uint32_t device_id);
 };

 } // namespace inference
 } // namespace mindspore
 #endif // MINDSPORE_INCLUDE_MS_SESSION_H
@@ -13,7 +13,6 @@
 # limitations under the License.
 # ============================================================================
 """tbe common"""
-import json
 import os

 class TBEException(Exception):

@@ -27,23 +26,6 @@ class TBEException(Exception):
 return self.__error_msg


-def get_ddk_version():
-    """get ddk version"""
-    ddk_version = os.environ.get("DDK_VERSION")
-    if ddk_version is None:
-        default_ddk_info_file = '/usr/local/HiAI/runtime/ddk_info'
-        backup_ddk_info_file = '/usr/local/Ascend/fwkacllib/ddk_info'
-        if os.path.exists(default_ddk_info_file):
-            with open(default_ddk_info_file, "r") as fp:
-                ddk_version = json.load(fp)["VERSION"]
-        elif os.path.exists(backup_ddk_info_file):
-            with open(backup_ddk_info_file, "r") as fp:
-                ddk_version = json.load(fp)["VERSION"]
-        else:
-            ddk_version = "Ascend910"
-    return ddk_version
-
-
 def get_build_in_impl_path():
     """get build-in tbe implement path"""
     tbe_impl_path = os.environ.get("TBE_IMPL_PATH")
@@ -18,9 +18,8 @@ import os
 import sys
 from te.platform.cce_conf import te_set_version
 from te.platform.fusion_util import fusion_op
-from common import check_kernel_info, get_args, get_build_in_impl_path, get_ddk_version
+from common import check_kernel_info, get_args, get_build_in_impl_path

-ddk_version = get_ddk_version()
 build_in_impl_path = get_build_in_impl_path()

 # op function list

@@ -30,7 +29,6 @@ fusion_pattern_end_flag = "fusion_pattern_end"

 def _initialize(impl_path):
     """Initialize"""
-    te_set_version(ddk_version)
     if impl_path == "":
         op_module_name = build_in_impl_path
     else:

@@ -53,7 +51,7 @@ def build_op(build_type, json_str):
     """
     kernel_info = json.loads(json_str)
     check_kernel_info(kernel_info)
-
+    te_set_version(kernel_info["op_info"]["socVersion"])
     op_name = kernel_info['op_info']['name']

     try:

@@ -111,7 +109,7 @@ def compile_fusion_op(json_str):
         Exception: If specific keyword is not found.
     """
     args = json.loads(json_str)
-    te_set_version(ddk_version)
+    te_set_version(args['fusion_op']["socVersion"])
     if 'fusion_op' not in args or not args['fusion_op']:
         raise ValueError("Json string Errors, key:fusion_op not found.")
     fusion_op_arg = args['fusion_op']
@@ -104,7 +104,7 @@ class TbeProcess:
     def __init__(self):
         self.__processe_num = multiprocessing.cpu_count()
-        max_processes_num = 16
+        # max_processes_num: Set the maximum number of concurrent processes for compiler
+        max_processes_num = 24
         if self.__processe_num > max_processes_num:
             self.__processe_num = max_processes_num
         self.__pool = None
@@ -122,15 +122,16 @@ class AscendEnvChecker(EnvChecker):

     def __init__(self):
         self.version = ["1.75.22.0.220"]
-        atlas_fwk_version = "/usr/local/Ascend/nnae/latest/fwkacllib/version.info"
+        atlas_fwk_version = "/usr/local/Ascend/ascend-toolkit/latest/fwkacllib/version.info"
         hisi_fwk_version = "/usr/local/Ascend/fwkacllib/version.info"
         if os.path.exists(atlas_fwk_version):
             # atlas default path
-            self.fwk_path = "/usr/local/Ascend/nnae/latest/fwkacllib"
-            self.op_impl_path = "/usr/local/Ascend/nnae/latest/opp/op_impl/built-in/ai_core/tbe"
+            self.fwk_path = "/usr/local/Ascend/ascend-toolkit/latest/fwkacllib"
+            self.op_impl_path = "/usr/local/Ascend/ascend-toolkit/latest/opp/op_impl/built-in/ai_core/tbe"
             self.tbe_path = self.fwk_path + "/lib64"
             self.cce_path = self.fwk_path + "/ccec_compiler/bin"
             self.fwk_version = atlas_fwk_version
+            self.op_path = "/usr/local/Ascend/ascend-toolkit/latest/opp"
         elif os.path.exists(hisi_fwk_version):
             # hisi default path
             self.fwk_path = "/usr/local/Ascend/fwkacllib"

@@ -138,6 +139,7 @@ class AscendEnvChecker(EnvChecker):
             self.tbe_path = self.fwk_path + "/lib64"
             self.cce_path = self.fwk_path + "/ccec_compiler/bin"
             self.fwk_version = hisi_fwk_version
+            self.op_path = ""
         else:
             # custom or unknown environment
             self.fwk_path = ""

@@ -145,17 +147,20 @@ class AscendEnvChecker(EnvChecker):
             self.tbe_path = ""
             self.cce_path = ""
             self.fwk_version = ""
+            self.op_path = ""

         # env
         self.path = os.getenv("PATH")
         self.python_path = os.getenv("PYTHONPATH")
         self.ld_lib_path = os.getenv("LD_LIBRARY_PATH")
+        self.ascend_opp_path = os.getenv("ASCEND_OPP_PATH")

         # check content
         self.path_check = "/fwkacllib/ccec_compiler/bin/"
         self.python_path_check = "opp/op_impl/built_in/ai_core/tbe/"
         self.ld_lib_path_check_fwk = "/fwkacllib/lib64/"
         self.ld_lib_path_check_addons = "/add-ons/"
+        self.ascend_opp_path_check = "/op"
         self.v = ""

     def check_env(self, e):

@@ -205,6 +210,15 @@ class AscendEnvChecker(EnvChecker):
                 f"No such directory: {self.cce_path}, Please check if Ascend 910 AI software package is "
                 "installed correctly.")

+        if self.op_path is None:
+            pass
+        elif Path(self.op_path).is_dir():
+            os.environ['ASCEND_OPP_PATH'] = self.op_path
+        else:
+            raise EnvironmentError(
+                f"No such directory: {self.op_path}, Please check if Ascend 910 AI software package is "
+                "installed correctly.")
+
     def _check_env(self):
         """ascend dependence path check"""
         if self.path is None or self.path_check not in self.path:

@@ -223,6 +237,11 @@ class AscendEnvChecker(EnvChecker):
                 "LD_LIBRARY_PATH, you can reference to the installation guidelines "
                 "https://www.mindspore.cn/install")

+        if self.ascend_opp_path is None or self.ascend_opp_path_check not in self.ascend_opp_path:
+            logger.warning(
+                "Can not find opp path (need by mindspore-ascend), please check if you have set env ASCEND_OPP_PATH, "
+                "you can reference to the installation guidelines https://www.mindspore.cn/install")
+
     def _read_version(self, file_path):
         """get ascend version info"""
         with open(file_path, 'r') as f:
@@ -258,7 +258,7 @@ else ()
 endif ()
 set(MINDSPORE_RPATH ${ORIGIN_PATH}/lib)
 if (ENABLE_D)
-    set(MINDSPORE_RPATH ${MINDSPORE_RPATH}:/usr/local/Ascend/nnae/latest/fwkacllib/lib64)
+    set(MINDSPORE_RPATH ${MINDSPORE_RPATH}:/usr/local/Ascend/ascend-toolkit/latest/fwkacllib/lib64)
     set(MINDSPORE_RPATH ${MINDSPORE_RPATH}:/usr/local/Ascend/fwkacllib/lib64)
     set(MINDSPORE_RPATH ${MINDSPORE_RPATH}:/usr/local/Ascend/add-ons)
 elseif (ENABLE_GPU)
fused_batch_norm_grad_cpu_kernel.cc

@@ -0,0 +1,139 @@
/**
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#include <string>
#include "backend/kernel_compiler/cpu/mkldnn/fused_batch_norm_grad_cpu_kernel.h"
#include "utils/ms_utils.h"
#include "backend/kernel_compiler/cpu/mkldnn/mkl_kernel_engine.h"
#include "runtime/device/cpu/cpu_device_address.h"

namespace mindspore {
namespace kernel {
void FusedBatchNormGradCPUKernel::InitInputOutputSize(const CNodePtr &kernel_node) {
  CPUKernel::InitInputOutputSize(kernel_node);
  MS_EXCEPTION_IF_NULL(kernel_node);
  size_t type_size = sizeof(float);
  std::vector<size_t> shape = AnfAlgo::GetInputDeviceShape(kernel_node, 0);

  // workspace of shape [2, C] stores diff scale and diff bias
  size_t tensor_size = shape[1] * 2 * type_size;
  workspace_size_list_.emplace_back(tensor_size);
}

void FusedBatchNormGradCPUKernel::InitKernel(const CNodePtr &kernel_node) {
  MS_EXCEPTION_IF_NULL(kernel_node);
  MS_LOG(INFO) << "start init FusedBatchNormGradCPUKernel";
  auto node_name = AnfAlgo::GetCNodeName(kernel_node);

  // create desc for dy and x
  std::vector<size_t> dy_shape = AnfAlgo::GetInputDeviceShape(kernel_node, 0);
  std::vector<size_t> x_shape = AnfAlgo::GetInputDeviceShape(kernel_node, 1);
  // dy and x must be NCHW, so check them
  if (dy_shape.size() != 4 || x_shape.size() != 4) {
    MS_LOG(EXCEPTION) << "Fused batchnorm grad only support nchw input!";
  }

  // desc for x, y, dy, dx
  batch_size = x_shape[0];
  channel = x_shape[1];
  hw_size = x_shape[2] * x_shape[3];
  nhw_size = x_shape[0] * hw_size;
  dnnl::memory::desc x_desc = GetDefaultMemDesc(x_shape);

  // create scale_bias_desc
  dnnl::memory::desc scale_desc = GetDefaultMemDesc({1, channel});

  auto epsilon = AnfAlgo::GetNodeAttr<float>(kernel_node, "epsilon");

  // define forward hint
  auto f_prop_kind = dnnl::prop_kind::forward_training;                    // forward prop kind
  auto f_normalization_flags = dnnl::normalization_flags::use_scale_shift; // forward normalization flags
  dnnl::batch_normalization_forward::desc fhint_desc =
    dnnl::batch_normalization_forward::desc(f_prop_kind, x_desc, epsilon, f_normalization_flags);
  auto fhint_prim_desc = dnnl::batch_normalization_forward::primitive_desc(fhint_desc, MKLKernelEngine::Get().engine());

  // define backward
  auto bprop_kind = dnnl::prop_kind::backward;
  auto bnormalization_flags = dnnl::normalization_flags::use_scale_shift;
  // create desc for backward
  dnnl::batch_normalization_backward::desc b_desc =
    dnnl::batch_normalization_backward::desc(bprop_kind, fhint_prim_desc.dst_desc(), x_desc, epsilon,
                                             bnormalization_flags);
  // create prim_desc for backward
  auto bprim_desc =
    dnnl::batch_normalization_backward::primitive_desc(b_desc, MKLKernelEngine::Get().engine(), fhint_prim_desc);

  primitive_ = std::make_shared<dnnl::batch_normalization_backward>(bprim_desc);

  // add arguments for backward
  AddArgument(DNNL_ARG_SRC, bprim_desc.src_desc());                        // x
  AddArgument(DNNL_ARG_DIFF_DST, bprim_desc.diff_dst_desc());              // dy
  AddArgument(DNNL_ARG_SCALE_SHIFT, scale_desc);                           // only need scale
  AddArgument(DNNL_ARG_MEAN, bprim_desc.mean_desc());                      // mean
  AddArgument(DNNL_ARG_VARIANCE, bprim_desc.variance_desc());              // var

  AddArgument(DNNL_ARG_DIFF_SRC, bprim_desc.diff_src_desc());              // dx
  AddArgument(DNNL_ARG_DIFF_SCALE_SHIFT, bprim_desc.diff_weights_desc());  // diff scale and diff bias
  AddArgument(DNNL_ARG_WORKSPACE, bprim_desc.workspace_desc());            // workspace
}

bool FusedBatchNormGradCPUKernel::Launch(const std::vector<kernel::AddressPtr> &inputs,
                                         const std::vector<kernel::AddressPtr> &workspace,
                                         const std::vector<kernel::AddressPtr> &outputs) {
  // inputs: dy, x, scale, save_mean, save_inv_variance (5 in total)
  MS_LOG(INFO) << "start FusedBatchNormGradCPUKernel";

  if (inputs.size() < 5 || outputs.empty()) {
    MS_LOG(EXCEPTION) << "Error input output size!";
  }

  SetArgumentHandle(DNNL_ARG_SRC, inputs[1]->addr);          // x
  SetArgumentHandle(DNNL_ARG_DIFF_DST, inputs[0]->addr);     // dy
  SetArgumentHandle(DNNL_ARG_SCALE_SHIFT, inputs[2]->addr);  // only need scale
  SetArgumentHandle(DNNL_ARG_MEAN, inputs[3]->addr);         // mean
  SetArgumentHandle(DNNL_ARG_VARIANCE, inputs[4]->addr);     // var

  SetArgumentHandle(DNNL_ARG_DIFF_SRC, outputs[0]->addr);            // dx
  SetArgumentHandle(DNNL_ARG_DIFF_SCALE_SHIFT, workspace[0]->addr);  // diff scale and diff bias
  ExecutePrimitive();

  // split the packed [2, C] workspace into the diff-scale and diff-bias outputs
  auto scale = reinterpret_cast<float *>(outputs[1]->addr);
  auto bias = reinterpret_cast<float *>(outputs[2]->addr);
  auto wksp = reinterpret_cast<float *>(workspace[0]->addr);

  auto scale_ret = memcpy_s(scale, outputs[1]->size, workspace[0]->addr, outputs[1]->size);
  auto bias_ret = memcpy_s(bias, outputs[2]->size, wksp + (outputs[1]->size / sizeof(float)), outputs[2]->size);

  if (scale_ret != 0 || bias_ret != 0) {
    MS_LOG(EXCEPTION) << "Memcpy_s error.";
    return false;
  }

  MS_LOG(INFO) << "done batchnorm grad";
  return true;
}
} // namespace kernel
} // namespace mindspore
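For reference, the two rows of the `[2, C]` DIFF_SCALE_SHIFT buffer that `Launch` copies into the scale and bias outputs are the standard batch-norm parameter gradients (the usual textbook form of what the `use_scale_shift` backward computes; the notation below is ours, not the patch's):

```latex
\hat{x}_{n,c,h,w} = \frac{x_{n,c,h,w} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}}, \qquad
\frac{\partial L}{\partial \gamma_c} = \sum_{n,h,w} \frac{\partial L}{\partial y_{n,c,h,w}} \, \hat{x}_{n,c,h,w}, \qquad
\frac{\partial L}{\partial \beta_c} = \sum_{n,h,w} \frac{\partial L}{\partial y_{n,c,h,w}}
```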
fused_batch_norm_grad_cpu_kernel.h

@@ -0,0 +1,63 @@
/**
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#ifndef MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_CPU_FUSED_BATCH_NORM_GRAD_CPU_KERNEL_H_
#define MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_CPU_FUSED_BATCH_NORM_GRAD_CPU_KERNEL_H_

#include <vector>
#include <memory>
#include "backend/kernel_compiler/cpu/mkldnn/mkl_cpu_kernel.h"

namespace mindspore {
namespace kernel {
class FusedBatchNormGradCPUKernel : public MKLCPUKernel {
 public:
  FusedBatchNormGradCPUKernel() = default;
  ~FusedBatchNormGradCPUKernel() override = default;

  void InitKernel(const CNodePtr &kernel_node) override;

  bool Launch(const std::vector<AddressPtr> &inputs, const std::vector<AddressPtr> &workspace,
              const std::vector<AddressPtr> &outputs) override;

 protected:
  void InitInputOutputSize(const CNodePtr &kernel_node) override;

 private:
  size_t batch_size{0};
  size_t channel{0};
  size_t hw_size{0};
  size_t nhw_size{0};
};

// register FusedBatchNormGrad
MS_REG_CPU_KERNEL(FusedBatchNormGrad,
                  KernelAttr()
                    .AddInputAttr(kNumberTypeFloat32)
                    .AddInputAttr(kNumberTypeFloat32)
                    .AddInputAttr(kNumberTypeFloat32)
                    .AddInputAttr(kNumberTypeFloat32)
                    .AddInputAttr(kNumberTypeFloat32)
                    .AddOutputAttr(kNumberTypeFloat32)
                    .AddOutputAttr(kNumberTypeFloat32)
                    .AddOutputAttr(kNumberTypeFloat32),
                  FusedBatchNormGradCPUKernel)

} // namespace kernel
} // namespace mindspore

#endif // MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_CPU_FUSED_BATCH_NORM_GRAD_CPU_KERNEL_H_
realdiv_cpu_kernel.cc

@@ -0,0 +1,67 @@
/**
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#include "backend/kernel_compiler/cpu/mkldnn/realdiv_cpu_kernel.h"
#include "backend/kernel_compiler/cpu/mkldnn/mkl_kernel_engine.h"
#include "runtime/device/cpu/cpu_device_address.h"
#include "utils/ms_utils.h"

namespace mindspore {
namespace kernel {
void RealDivCPUKernel::InitKernel(const CNodePtr &kernel_node) {
  MS_EXCEPTION_IF_NULL(kernel_node);
  MS_LOG(INFO) << "Init RealDiv";
  std::vector<size_t> src0_shape = AnfAlgo::GetInputDeviceShape(kernel_node, 0);
  std::vector<size_t> src1_shape = AnfAlgo::GetInputDeviceShape(kernel_node, 1);
  std::vector<size_t> dst_shape = AnfAlgo::GetOutputDeviceShape(kernel_node, 0);
  if (src0_shape.size() != src1_shape.size() && src1_shape.size() > 1) {
    MS_LOG(EXCEPTION) << "RealDivCPUKernel: RealDiv only supports same-rank inputs or tensor / scalar, got "
                      << src0_shape.size() << " vs " << src1_shape.size();
  }

  // pad the divisor's shape with trailing 1s so it broadcasts against src0
  if (src1_shape.size() < src0_shape.size()) {
    for (size_t i = src1_shape.size(); i < src0_shape.size(); ++i) {
      src1_shape.emplace_back(1);
    }
  }
  dnnl::memory::desc src0_desc = GetDefaultMemDesc(src0_shape);
  dnnl::memory::desc src1_desc = GetDefaultMemDesc(src1_shape);
  dnnl::memory::desc dst_desc = GetDefaultMemDesc(dst_shape);

  dnnl::binary::desc desc = dnnl::binary::desc(dnnl::algorithm::binary_div, src0_desc, src1_desc, dst_desc);
  auto prim_desc = dnnl::binary::primitive_desc(desc, MKLKernelEngine::Get().engine());
  primitive_ = std::make_shared<dnnl::binary>(prim_desc);
  AddArgument(DNNL_ARG_SRC_0, src0_desc);
  AddArgument(DNNL_ARG_SRC_1, src1_desc);
  AddArgument(DNNL_ARG_DST, dst_desc);
}

bool RealDivCPUKernel::Launch(const std::vector<kernel::AddressPtr> &inputs,
                              const std::vector<kernel::AddressPtr> & /*workspace*/,
                              const std::vector<kernel::AddressPtr> &outputs) {
  MS_LOG(INFO) << "RealDivCPUKernel: start Launch RealDiv";
  if (inputs.size() < 2 || outputs.empty()) {
    MS_LOG(EXCEPTION) << "RealDivCPUKernel: RealDiv error input output size!";
  }
  SetArgumentHandle(DNNL_ARG_SRC_0, inputs[0]->addr);
  SetArgumentHandle(DNNL_ARG_SRC_1, inputs[1]->addr);
  SetArgumentHandle(DNNL_ARG_DST, outputs[0]->addr);
  ExecutePrimitive();
  return true;
}
} // namespace kernel
} // namespace mindspore
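A small numpy sketch of the alignment rule above (hypothetical helper name, not part of the patch): appending trailing 1s to the divisor's shape is exactly what makes the tensor-divided-by-scalar case broadcast, since numpy's own broadcasting pads on the left instead:

```python
import numpy as np

def realdiv_reference(x, y):
    # mirror the kernel: append trailing 1s to the divisor's shape
    y = y.reshape(y.shape + (1,) * (x.ndim - y.ndim))
    return x / y

x = np.arange(1, 13, dtype=np.float32).reshape(3, 4)
s = np.array([2.0], dtype=np.float32)  # the tensor / scalar case
print(realdiv_reference(x, s))         # every element halved
```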
realdiv_cpu_kernel.h

@@ -0,0 +1,42 @@
/**
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#ifndef MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_CPU_REALDIV_CPU_KERNEL_H_
#define MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_CPU_REALDIV_CPU_KERNEL_H_
#include <vector>
#include <memory>
#include "backend/kernel_compiler/cpu/mkldnn/mkl_cpu_kernel.h"

namespace mindspore {
namespace kernel {
class RealDivCPUKernel : public MKLCPUKernel {
 public:
  RealDivCPUKernel() = default;
  ~RealDivCPUKernel() override = default;

  void InitKernel(const CNodePtr &kernel_node) override;

  bool Launch(const std::vector<AddressPtr> &inputs, const std::vector<AddressPtr> &workspace,
              const std::vector<AddressPtr> &outputs) override;
};

MS_REG_CPU_KERNEL(
  RealDiv,
  KernelAttr().AddInputAttr(kNumberTypeFloat32).AddInputAttr(kNumberTypeFloat32).AddOutputAttr(kNumberTypeFloat32),
  RealDivCPUKernel);
} // namespace kernel
} // namespace mindspore

#endif // MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_CPU_REALDIV_CPU_KERNEL_H_
@@ -39,6 +39,7 @@ void ReduceCPUKernel::InitKernel(const CNodePtr &kernel_node) {
   }
   shape_ = AnfAlgo::GetInputDeviceShape(kernel_node, 0);
+  auto axis_addr = AnfAlgo::GetCNodePrimitive(kernel_node)->GetAttr(AXIS);

   if (axis_addr->isa<ValueTuple>()) {
     auto attr_axis = AnfAlgo::GetNodeAttr<std::vector<int>>(kernel_node, AXIS);
     if (attr_axis.size() > shape_.size()) {

@@ -47,18 +48,24 @@ void ReduceCPUKernel::InitKernel(const CNodePtr &kernel_node) {
       axis_.push_back(shape_.size() - 1);
     } else {
       for (auto axis : attr_axis) {
+        while (axis < 0) {
+          axis += SizeToInt(shape_.size());
+        }
         if (IntToSize(axis) >= (shape_.size())) {
           MS_LOG(EXCEPTION) << "axis value is oversize.";
         }
-        axis < 0 ? axis_.push_back(axis + shape_.size()) : axis_.push_back(axis);
+        axis_.push_back(IntToSize(axis));
       }
     }
   } else if (axis_addr->isa<Int32Imm>()) {
     int axis = AnfAlgo::GetNodeAttr<int>(kernel_node, AXIS);
-    if (axis >= 0 && IntToSize(axis) >= shape_.size()) {
+    while (axis < 0) {
+      axis += SizeToInt(shape_.size());
+    }
+    if (IntToSize(axis) >= shape_.size()) {
       MS_LOG(EXCEPTION) << "axis value is oversize.";
     }
-    axis < 0 ? axis_.push_back(axis + shape_.size()) : axis_.push_back(axis);
+    axis_.push_back(IntToSize(axis));
   } else {
     MS_LOG(EXCEPTION) << "Attribute axis type is invalid.";
   }
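The while-loop added above maps a negative axis into [0, rank) before the bounds check; an illustrative Python equivalent (hypothetical helper name, not part of the patch):

```python
def normalize_axis(axis, rank):
    # wrap negative axes the way the kernel's while-loop does, then bounds-check
    while axis < 0:
        axis += rank
    if axis >= rank:
        raise ValueError("axis value is oversize.")
    return axis

assert normalize_axis(-1, 4) == 3  # last axis
assert normalize_axis(2, 4) == 2   # already non-negative
```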
tile_cpu_kernel.cc

@@ -0,0 +1,173 @@
/**
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#include "backend/kernel_compiler/cpu/tile_cpu_kernel.h"
#include "runtime/device/cpu/cpu_device_address.h"

#include <iostream>
#include <string>

namespace mindspore {
namespace kernel {
// product of the dimensions before `index`
size_t TileCPUKernel::GetOuterSize(std::vector<size_t> shape, int index) {
  size_t size = 1;
  for (int i = 0; i < index; i++) {
    size *= shape.at(i);
  }
  return size;
}

// product of the dimensions from `index` to the end
size_t TileCPUKernel::GetInnerSize(std::vector<size_t> shape, int index) {
  size_t size = 1;
  int shape_size = shape.size();
  for (int i = index; i < shape_size; i++) {
    size *= shape.at(i);
  }
  return size;
}

// byte size of one element of the input tensor type
size_t TileCPUKernel::GetTypeSize() {
  size_t type_size = 1;
  if (InputTensorType == kNumberTypeInt32) {
    type_size = sizeof(int);
  } else if (InputTensorType == kNumberTypeInt64) {
    type_size = sizeof(int64_t);
  } else if (InputTensorType == kNumberTypeInt16) {
    type_size = sizeof(int16_t);
  } else if (InputTensorType == kNumberTypeFloat32) {
    type_size = sizeof(float);
  }
  return type_size;
}

void TileCPUKernel::InitKernel(const CNodePtr &kernel_node) {
  MS_LOG(INFO) << "start Init Tile kernel";
  MS_EXCEPTION_IF_NULL(kernel_node);
  src0_shape = AnfAlgo::GetInputDeviceShape(kernel_node, 0);
  multiples = AnfAlgo::GetNodeAttr<std::vector<int>>(kernel_node, "multiples");
  dst_shape = AnfAlgo::GetOutputDeviceShape(kernel_node, 0);

  if (src0_shape.size() > multiples.size()) {
    MS_LOG(EXCEPTION) << "The rank of input is greater than multiples.";
  }

  // make src0_shape align to multiples by prepending leading 1s
  if (src0_shape.size() < multiples.size()) {
    for (size_t i = src0_shape.size(); i < multiples.size(); ++i) {
      src0_shape.emplace(src0_shape.begin(), 1);
    }
  }
  if (src0_shape.size() != multiples.size()) {
    MS_LOG(EXCEPTION) << "make src0_shape align to multiples failed.";
  }

  // workspace holds one output-sized buffer for the intermediate results
  size_t output_tensor_size = 1;
  InputTensorType = AnfAlgo::GetInputDeviceDataType(kernel_node, 0);
  for (auto each : dst_shape) {
    output_tensor_size *= each;
  }
  output_tensor_size *= GetTypeSize();
  workspace_size_list_.emplace_back(output_tensor_size);
}

bool TileCPUKernel::Launch(const std::vector<kernel::AddressPtr> &inputs,
                           const std::vector<kernel::AddressPtr> &workspace,
                           const std::vector<kernel::AddressPtr> &outputs) {
  MS_LOG(INFO) << "start Launch Tile kernel";
  if (inputs.empty() || outputs.empty()) {
    MS_LOG(EXCEPTION) << "input or output empty!";
  }
  if (inputs[0]->size > outputs[0]->size) {
    return false;
  }

  if (InputTensorType == kNumberTypeInt32) {
    LaunchKernel<int>(inputs, workspace, outputs);
  } else if (InputTensorType == kNumberTypeInt64) {
    LaunchKernel<int64_t>(inputs, workspace, outputs);
  } else if (InputTensorType == kNumberTypeInt16) {
    LaunchKernel<int16_t>(inputs, workspace, outputs);
  } else if (InputTensorType == kNumberTypeFloat32) {
    LaunchKernel<float>(inputs, workspace, outputs);
  }

  return true;
}

// Launch kernel based on input type: expand the input along one axis at a
// time, from the innermost axis outwards, using the workspace to hold the
// intermediate result of each pass.
template <typename T>
void TileCPUKernel::LaunchKernel(const std::vector<kernel::AddressPtr> &inputs,
                                 const std::vector<kernel::AddressPtr> &workspace,
                                 const std::vector<kernel::AddressPtr> &outputs) {
  size_t shape_size = src0_shape.size();
  size_t item_size = sizeof(T);

  auto X_ptr = reinterpret_cast<T *>(inputs[0]->addr);
  auto Y_ptr = reinterpret_cast<T *>(outputs[0]->addr);
  auto wksp_temp = reinterpret_cast<T *>(workspace[0]->addr);  // workspace to save the temp result

  auto start_of_wksp = reinterpret_cast<T *>(workspace[0]->addr);
  auto start_of_Yptr = reinterpret_cast<T *>(outputs[0]->addr);

  memcpy_s(Y_ptr, inputs[0]->size, X_ptr, inputs[0]->size);

  int data_count = 1;
  auto temp_dst_shape = src0_shape;  // shape of the temp result so far

  for (int axis_t = shape_size - 1; axis_t >= 0; axis_t--) {
    // stash the current result, then write it back `tile` times per outer block
    auto temp_dst_size = GetInnerSize(temp_dst_shape, 0) * item_size;
    memcpy_s(wksp_temp, temp_dst_size, Y_ptr, temp_dst_size);

    int outer_size = GetOuterSize(temp_dst_shape, axis_t);
    int inner_size = GetInnerSize(temp_dst_shape, axis_t);
    int tile = multiples[axis_t];

    int item_count = inner_size;
    data_count = item_count * item_size;

    for (int i = 0; i < outer_size; i++) {
      for (int t = 0; t < tile; t++) {
        memcpy_s(Y_ptr, data_count, wksp_temp, data_count);  // copy one inner block to Y
        Y_ptr += item_count;
      }
      wksp_temp += item_count;
    }
    // reset to start
    Y_ptr = start_of_Yptr;
    wksp_temp = start_of_wksp;
    temp_dst_shape[axis_t] = dst_shape[axis_t];  // this axis is now fully tiled
  }
}
} // namespace kernel
} // namespace mindspore
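The memcpy loop in `LaunchKernel` repeats whole inner blocks, one axis at a time from the innermost outwards, staging each pass in the workspace; a numpy reference model of that pass structure (hypothetical helper name, not part of the patch) reproduces `np.tile`:

```python
import numpy as np

def tile_reference(x, multiples):
    # align ranks by prepending 1s, as InitKernel does
    x = x.reshape((1,) * (len(multiples) - x.ndim) + x.shape)
    y = x
    for axis in range(len(multiples) - 1, -1, -1):
        outer = int(np.prod(y.shape[:axis]))
        inner = int(np.prod(y.shape[axis:]))  # an inner block spans axis..end
        t = multiples[axis]
        # repeat every inner block t times, mirroring the per-outer memcpy loop
        blocks = y.reshape(outer, inner)
        expanded = np.repeat(blocks[:, None, :], t, axis=1).reshape(outer, t * inner)
        y = expanded.reshape(y.shape[:axis] + (y.shape[axis] * t,) + y.shape[axis + 1:])
    return y

x = np.arange(6).reshape(2, 3)
assert (tile_reference(x, [2, 3]) == np.tile(x, [2, 3])).all()
```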
tile_cpu_kernel.h

@@ -0,0 +1,64 @@
/**
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#ifndef MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_CPU_TILE_CPU_KERNEL_H_
#define MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_CPU_TILE_CPU_KERNEL_H_
#include <vector>
#include <memory>
#include "backend/kernel_compiler/cpu/cpu_kernel.h"
#include "backend/kernel_compiler/cpu/cpu_kernel_factory.h"

namespace mindspore {
namespace kernel {
class TileCPUKernel : public CPUKernel {
 public:
  TileCPUKernel() = default;
  ~TileCPUKernel() override = default;

  void InitKernel(const CNodePtr &kernel_node) override;

  bool Launch(const std::vector<AddressPtr> &inputs, const std::vector<AddressPtr> &workspace,
              const std::vector<AddressPtr> &outputs) override;

 private:
  // launch kernel based on input type
  template <typename T>
  void LaunchKernel(const std::vector<kernel::AddressPtr> &inputs,
                    const std::vector<kernel::AddressPtr> &workspace,
                    const std::vector<kernel::AddressPtr> &outputs);

  size_t GetOuterSize(std::vector<size_t>, int);
  size_t GetInnerSize(std::vector<size_t>, int);

  TypeId InputTensorType{kNumberTypeInt32};
  size_t GetTypeSize();

  std::vector<size_t> src0_shape;
  std::vector<int> multiples;
  std::vector<size_t> dst_shape;
};

MS_REG_CPU_KERNEL(Tile, KernelAttr().AddInputAttr(kNumberTypeFloat32).AddOutputAttr(kNumberTypeFloat32),
                  TileCPUKernel);
MS_REG_CPU_KERNEL(Tile, KernelAttr().AddInputAttr(kNumberTypeInt32).AddOutputAttr(kNumberTypeInt32),
                  TileCPUKernel);
MS_REG_CPU_KERNEL(Tile, KernelAttr().AddInputAttr(kNumberTypeInt64).AddOutputAttr(kNumberTypeInt64),
                  TileCPUKernel);
MS_REG_CPU_KERNEL(Tile, KernelAttr().AddInputAttr(kNumberTypeInt16).AddOutputAttr(kNumberTypeInt16),
                  TileCPUKernel);
} // namespace kernel
} // namespace mindspore

#endif // MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_CPU_TILE_CPU_KERNEL_H_
@ -0,0 +1,446 @@
/**
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#include <limits>
#include "ctcloss_impl.cuh"
#include "runtime/device/gpu/cuda_common.h"
template <typename T>
__device__ T LogSumExp(const T logprob1, const T logprob2) {
  if (logprob1 == logprob2 && logprob1 == -std::numeric_limits<T>::infinity()) {
    return logprob1;
  } else {
    return (logprob1 > logprob2) ? logprob1 + log1pf(expf(logprob2 - logprob1))
                                 : logprob2 + log1pf(expf(logprob1 - logprob2));
  }
}

template <typename T>
__global__ void CalculateFwdVarKernel(T *log_alpha_b, int *label_value_with_blank, T *softmax_probs,
                                      const int *sequence_length, bool ctc_merge_repeated, int batch, int SOffSet,
                                      int maxtime, int blank, int *label_squence_length, int *cum_labels_length,
                                      bool ignore_longer_outputs_than_inputs) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < batch; i += blockDim.x * gridDim.x) {
    if (sequence_length[i] == 0 ||
        (ignore_longer_outputs_than_inputs && label_squence_length[i] > sequence_length[i])) {
      // empty input or ignorable over-long label: leave this batch element untouched
    } else {
      T *log_alpha_b_cur = &log_alpha_b[i * SOffSet * maxtime];
      int *label_value_with_blank_cur = &label_value_with_blank[0];
      if (i > 0) {
        label_value_with_blank_cur = &label_value_with_blank[2 * cum_labels_length[i - 1] + i];
      }
      int numclass = blank + 1;
      int U = 2 * label_squence_length[i] + 1;
      int Ti = sequence_length[i];
      int low = 0;
      int high = 0;
      log_alpha_b_cur[0] = log(softmax_probs[i * numclass + blank]);
      int label0 = blank;
      if (U > 1) {
        label0 = label_value_with_blank_cur[1];
        log_alpha_b_cur[maxtime] = log(softmax_probs[i * numclass + label0]);
      }
      for (int t = 1; t < Ti; ++t) {
        low = 0;
        high = U;
        int low_limit = U - (2 * (Ti - t));
        int high_limit = 2 * (t + 1);
        if (low_limit > low) {
          low = low_limit;
        }
        if (high_limit < U) {
          high = high_limit;
        }
        for (int u = low; u < high; ++u) {
          T sum_log_alpha = -std::numeric_limits<T>::infinity();
          if (ctc_merge_repeated || label_value_with_blank_cur[u] == blank) {
            sum_log_alpha = log_alpha_b_cur[u * maxtime + t - 1];
          }
          if (u > 0) {
            sum_log_alpha = LogSumExp(sum_log_alpha, log_alpha_b_cur[(u - 1) * maxtime + t - 1]);
          }
          if (u > 1) {
            const bool matching_labels_merge =
              ctc_merge_repeated && (label_value_with_blank_cur[u] == label_value_with_blank_cur[u - 2]);
            if (label_value_with_blank_cur[u] != blank && !matching_labels_merge) {
              sum_log_alpha = LogSumExp(sum_log_alpha, log_alpha_b_cur[(u - 2) * maxtime + t - 1]);
            }
          }
          log_alpha_b_cur[u * maxtime + t] =
            log(softmax_probs[i * numclass + label_value_with_blank_cur[u] + t * numclass * batch]) + sum_log_alpha;
        }
      }
    }
  }
}

template <typename T>
__global__ void CalculateBwdVarKernel(T *log_beta_b, int *label_value_with_blank, T *softmax_probs,
                                      const int *sequence_length, bool ctc_merge_repeated, int batch, int SOffSet,
                                      int maxtime, int blank, int *label_squence_length, int *cum_labels_length,
                                      bool ignore_longer_outputs_than_inputs) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < batch; i += blockDim.x * gridDim.x) {
    if (sequence_length[i] == 0 ||
        (ignore_longer_outputs_than_inputs && label_squence_length[i] > sequence_length[i])) {
      // empty input or ignorable over-long label: leave this batch element untouched
    } else {
      T *log_beta_b_cur = &log_beta_b[i * SOffSet * maxtime];
      int *label_value_with_blank_cur = &label_value_with_blank[0];
      if (i > 0) {
        label_value_with_blank_cur = &label_value_with_blank[2 * cum_labels_length[i - 1] + i];
      }
      int numclass = blank + 1;
      int U = 2 * label_squence_length[i] + 1;
      int Ti = sequence_length[i];
      int low = 0;
      int high = 0;
      if (U > 1) {
        for (int u = U - 2; u < U; ++u) {
          log_beta_b_cur[u * maxtime + Ti - 1] = 0;
        }
      } else {
        log_beta_b_cur[Ti - 1] = 0;
        log_beta_b_cur[Ti - 2] = 0;
      }
      for (int t = Ti - 2; t >= 0; --t) {
        low = 0;
        high = U;
        int low_limit = U - (2 * (Ti - t));
        int high_limit = 2 * (t + 1);
        if (low_limit > low) {
          low = low_limit;
        }
        if (high_limit < U) {
          high = high_limit;
        }
        for (int u = low; u < high; ++u) {
          if (ctc_merge_repeated || label_value_with_blank_cur[u] == blank) {
            log_beta_b_cur[u * maxtime + t] = LogSumExp(
              log_beta_b_cur[u * maxtime + t],
              log_beta_b_cur[u * maxtime + t + 1] +
                log(softmax_probs[i * numclass + label_value_with_blank_cur[u] + (t + 1) * numclass * batch]));
          }
          if (u + 1 < U) {
            log_beta_b_cur[u * maxtime + t] = LogSumExp(
              log_beta_b_cur[u * maxtime + t],
              log_beta_b_cur[(u + 1) * maxtime + t + 1] +
                log(softmax_probs[i * numclass + label_value_with_blank_cur[u + 1] + (t + 1) * numclass * batch]));
          }
          if (u + 2 < U) {
            const bool matching_labels_merge =
              ctc_merge_repeated && (label_value_with_blank_cur[u] == label_value_with_blank_cur[u + 2]);
            if (label_value_with_blank_cur[u] != blank && !matching_labels_merge) {
              log_beta_b_cur[u * maxtime + t] = LogSumExp(
                log_beta_b_cur[u * maxtime + t],
                log_beta_b_cur[(u + 2) * maxtime + t + 1] +
                  log(softmax_probs[i * numclass + label_value_with_blank_cur[u + 2] + (t + 1) * numclass * batch]));
            }
          }
        }
      }
    }
  }
}

template <typename T>
__global__ void ProbInitKernel(T *prob_num, int size) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < size; i += blockDim.x * gridDim.x) {
    prob_num[i] = -std::numeric_limits<T>::infinity();
  }
}
template <typename T>
__global__ void LogBInitKernel(T *log_b, int log_prob_size) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < log_prob_size; i += blockDim.x * gridDim.x) {
    log_b[i] = -std::numeric_limits<T>::infinity();
  }
}

template <typename T>
__global__ void CTCLossKernel(T *log_alpha_b, T *log_beta_b, T *softmax_probs, int *label_value_with_blank, int batch,
                              int SOffSet, int maxtime, int numclass, const int *sequence_length,
                              int *label_squence_length, int *cum_labels_length, T *cost, T *grads, T *prob_num,
                              bool ignore_longer_outputs_than_inputs) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < batch; i += blockDim.x * gridDim.x) {
    if (sequence_length[i] == 0 ||
        (ignore_longer_outputs_than_inputs && label_squence_length[i] > sequence_length[i])) {
      // empty input or ignorable over-long label: leave this batch element untouched
    } else {
      T *grad_cur = &grads[i * numclass];
      const T *softmax_probs_cur = &softmax_probs[i * numclass];
      T *prob_num_cur = &prob_num[i * numclass];
      int U = 2 * label_squence_length[i] + 1;
      T log_pzx = -std::numeric_limits<T>::infinity();
      const T *log_alpha_b_cur = &log_alpha_b[i * SOffSet * maxtime];
      const T *log_beta_b_cur = &log_beta_b[i * SOffSet * maxtime];
      int *label_value_with_blank_cur = &label_value_with_blank[0];
      if (i > 0) {
        label_value_with_blank_cur = &label_value_with_blank[2 * cum_labels_length[i - 1] + i];
      }
      for (int u = 0; u < U; ++u) {
        log_pzx = LogSumExp(log_pzx, log_alpha_b_cur[u * maxtime] + log_beta_b_cur[u * maxtime]);
      }
      cost[i] = -log_pzx;
      // grad
      int L = numclass;
      int Ti = sequence_length[i];
      if (log_pzx == -std::numeric_limits<T>::infinity()) {
        for (int t = 0; t < Ti; ++t) {
          for (int l = 0; l < L; ++l) {
            grad_cur[t * numclass * batch + l] = softmax_probs_cur[t * numclass * batch + l];
          }
        }
      } else {
        for (int t = 0; t < Ti; ++t) {
          for (int u = 0; u < U; ++u) {
            int l = label_value_with_blank_cur[u];
            prob_num_cur[t * batch * numclass + l] =
              LogSumExp(prob_num_cur[t * batch * numclass + l],
                        log_alpha_b_cur[u * maxtime + t] + log_beta_b_cur[u * maxtime + t]);
          }
          for (int l = 0; l < L; ++l) {
            grad_cur[t * numclass * batch + l] =
              softmax_probs_cur[t * numclass * batch + l] - expf(prob_num_cur[t * batch * numclass + l] - log_pzx);
          }
        }
      }
    }
  }
}

template <typename T>
__global__ void InnerSoftMaxKernel(const T *probs, T *softmax_probs, const int *sequence_length, int max_time,
                                   int batch, int numclass) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < batch * max_time; i += blockDim.x * gridDim.x) {
    int k = i / batch;
    int m = i % batch;
    if (k < sequence_length[m]) {
      T maxCoeff = 0.;
      T sumCoeff = 0.;
      for (int j = i * numclass; j < (i + 1) * numclass; ++j) {
        if (probs[j] > maxCoeff) {
          maxCoeff = probs[j];
        }
      }
      for (int j = i * numclass; j < (i + 1) * numclass; ++j) {
        sumCoeff += exp(probs[j] - maxCoeff);
        softmax_probs[j] = exp(probs[j] - maxCoeff);
      }
      for (int j = i * numclass; j < (i + 1) * numclass; ++j) {
        softmax_probs[j] /= sumCoeff;
      }
    }
  }
}

__global__ void GenLabelValuePCRKernel(int *label_value_sp, int *label_value_pcr, int *label_squence_length,
                                       int *cum_labels_length, int batch) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < batch; i += blockDim.x * gridDim.x) {
    int L = label_squence_length[i];
    label_squence_length[i] = 0;
    int offset = 0;
    if (i > 0) {
      offset = cum_labels_length[i - 1];
    }
    for (int l = offset; l < L; ++l) {
      if (l == offset || label_value_sp[l] != label_value_sp[l - 1]) {
        label_value_pcr[offset + label_squence_length[i]++] = label_value_sp[l];
      }
    }
  }
}

__global__ void UpdateLengthKernel(int *label_squence_length, int *cum_labels_length, int *max_labels_length,
                                   int batch) {
  max_labels_length[0] = 0;
  for (int i = 0; i < batch; ++i) {
    if (label_squence_length[i] > max_labels_length[0]) {
      max_labels_length[0] = label_squence_length[i];
    }
    if (i == 0) {
      cum_labels_length[i] = label_squence_length[i];
    } else {
      cum_labels_length[i] = label_squence_length[i] + cum_labels_length[i - 1];
    }
  }
}

template <typename T>
void CalculateBwdVar(T *log_beta_b, int *label_value_with_blank, T *softmax_probs, const int *sequence_length,
                     bool ctc_merge_repeated, int batch, int SOffSet, int maxtime, int blank,
                     int *label_squence_length, int *cum_labels_length, bool ignore_longer_outputs_than_inputs,
                     cudaStream_t stream) {
  int log_prob_size = SOffSet * batch * maxtime;
  LogBInitKernel<<<GET_BLOCKS(log_prob_size), GET_THREADS, 0, stream>>>(log_beta_b, log_prob_size);
  CalculateBwdVarKernel<<<GET_BLOCKS(batch), GET_THREADS, 0, stream>>>(
    log_beta_b, label_value_with_blank, softmax_probs, sequence_length, ctc_merge_repeated, batch, SOffSet, maxtime,
    blank, label_squence_length, cum_labels_length, ignore_longer_outputs_than_inputs);
}

template <typename T>
void CalculateFwdVar(T *log_alpha_b, int *label_value_with_blank, T *softmax_probs, const int *sequence_length,
                     bool ctc_merge_repeated, int batch, int SOffSet, int maxtime, int blank,
                     int *label_squence_length, int *cum_labels_length, bool ignore_longer_outputs_than_inputs,
                     cudaStream_t stream) {
  int log_prob_size = SOffSet * batch * maxtime;
  LogBInitKernel<<<GET_BLOCKS(log_prob_size), GET_THREADS, 0, stream>>>(log_alpha_b, log_prob_size);
  CalculateFwdVarKernel<<<GET_BLOCKS(batch), GET_THREADS, 0, stream>>>(
    log_alpha_b, label_value_with_blank, softmax_probs, sequence_length, ctc_merge_repeated, batch, SOffSet, maxtime,
    blank, label_squence_length, cum_labels_length, ignore_longer_outputs_than_inputs);
}

template <typename T>
void InnerSoftMax(const T *probs, T *softmax_probs, const int *sequence_length, int max_time, int batch, int numclass,
                  cudaStream_t stream) {
  InnerSoftMaxKernel<<<GET_BLOCKS(batch * max_time), GET_THREADS, 0, stream>>>(probs, softmax_probs, sequence_length,
                                                                               max_time, batch, numclass);
}

__global__ void GenLabelWithBlankKernel(int *label_value, int *label_value_with_blank, int *label_squence_length,
                                        int *precum_labels_length, int *cum_labels_length, int batch, int blank) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < batch; i += blockDim.x * gridDim.x) {
    int offset = 0;
    int offset1 = 0;
    if (i > 0) {
      offset = 2 * cum_labels_length[i - 1] + i;
      offset1 = precum_labels_length[i - 1];
    }
    for (int j = 0; j < label_squence_length[i]; ++j) {
      label_value_with_blank[offset + 2 * j] = blank;
      label_value_with_blank[offset + 2 * j + 1] = label_value[offset1 + j];
    }
    label_value_with_blank[offset + 2 * label_squence_length[i]] = blank;
  }
}

void GenLabelWithBlank(int *label_value, int *label_value_with_blank, int *label_squence_length,
                       int *precum_labels_length, int *cum_labels_length, int batch, int blank, cudaStream_t stream) {
  GenLabelWithBlankKernel<<<GET_BLOCKS(batch), GET_THREADS, 0, stream>>>(
    label_value, label_value_with_blank, label_squence_length, precum_labels_length, cum_labels_length, batch, blank);
}

void GenLabelValuePCR(int *label_value_sp, int *label_value_pcr, int *label_squence_length, int *cum_labels_length,
                      int *max_labels_length, int batch, cudaStream_t stream) {
  GenLabelValuePCRKernel<<<GET_BLOCKS(batch), GET_THREADS, 0, stream>>>(
    label_value_sp, label_value_pcr, label_squence_length, cum_labels_length, batch);
  UpdateLengthKernel<<<1, 1, 0, stream>>>(label_squence_length, cum_labels_length, max_labels_length, batch);
}

__global__ void GenLabelValueKernel(int *label_value_sp, const int64_t *label_indices, const int *label_values,
                                    int *label_squence_length, int *cum_labels_length, int size) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < size; i += blockDim.x * gridDim.x) {
    int64_t b = label_indices[i * 2];
    int offset = 0;
    if (b > 0) {
      offset = cum_labels_length[b - 1];
    }
    int64_t index = offset + label_indices[i * 2 + 1];
    label_value_sp[index] = label_values[i];
  }
}
__global__ void LabelValueInitKernel(int *label_value_sp, int size, int blank) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < size; i += blockDim.x * gridDim.x) {
    label_value_sp[i] = blank;
  }
}
__global__ void RecalculateLengthKernel(int *label_value_sp, int *label_squence_length, int *cum_labels_length,
                                        int batch, int blank) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < batch; i += blockDim.x * gridDim.x) {
    int offset = 0;
    if (i > 0) {
      offset = cum_labels_length[i - 1];
    }
    int L = label_squence_length[i];
    label_squence_length[i] = 0;
    for (int j = offset; j < offset + L; ++j) {
      if (label_value_sp[j] >= blank) {
        break;
      } else {
        label_squence_length[i]++;
      }
    }
  }
}
void GenLabelValue(int *label_value_sp, const int64_t *label_indices, const int *label_values,
                   int *label_squence_length, int *cum_labels_length, int *max_labels_length, int size, int blank,
                   int batch, cudaStream_t stream) {
  LabelValueInitKernel<<<GET_BLOCKS(size), GET_THREADS, 0, stream>>>(label_value_sp, size, blank);
  GenLabelValueKernel<<<GET_BLOCKS(size), GET_THREADS, 0, stream>>>(label_value_sp, label_indices, label_values,
                                                                    label_squence_length, cum_labels_length, size);
  RecalculateLengthKernel<<<GET_BLOCKS(batch), GET_THREADS, 0, stream>>>(label_value_sp, label_squence_length,
                                                                         cum_labels_length, batch, blank);
  UpdateLengthKernel<<<1, 1, 0, stream>>>(label_squence_length, cum_labels_length, max_labels_length, batch);
}

__global__ void CalculatePreLengthKernel(int *label_squence_length, int *precum_labels_length, int *cum_labels_length,
                                         int *max_labels_length, const int64_t *label_indices, int batch, int size) {
  max_labels_length[0] = 0;
  for (int i = 0; i < size; ++i) {
    label_squence_length[label_indices[i * 2]]++;
    if (max_labels_length[0] < label_indices[i * 2]) {
      max_labels_length[0] = label_indices[i * 2];
    }
  }
  precum_labels_length[0] = label_squence_length[0];
  cum_labels_length[0] = label_squence_length[0];
  for (int i = 1; i < batch; ++i) {
    cum_labels_length[i] = cum_labels_length[i - 1] + label_squence_length[i];
    precum_labels_length[i] = precum_labels_length[i - 1] + label_squence_length[i];
  }
}

__global__ void CalculateMaxSequenceKernel(const int *sequence_length, int *max_labels_length, int batch) {
  max_labels_length[0] = 0;
  for (int i = 0; i < batch; ++i) {
    if (sequence_length[i] > max_labels_length[0]) {
      max_labels_length[0] = sequence_length[i];
    }
  }
}

void CalculateMaxSequence(const int *sequence_length, int *max_labels_length, int batch, cudaStream_t stream) {
  CalculateMaxSequenceKernel<<<1, 1, 0, stream>>>(sequence_length, max_labels_length, batch);
}

void CalculatePreLength(int *label_squence_length, int *precum_labels_length, int *cum_labels_length,
                        int *max_labels_length, const int64_t *label_indices, int batch, int size,
                        cudaStream_t stream) {
  CalculatePreLengthKernel<<<1, 1, 0, stream>>>(label_squence_length, precum_labels_length, cum_labels_length,
                                                max_labels_length, label_indices, batch, size);
}

template <typename T>
void CTCLoss(T *log_alpha_b, T *log_beta_b, T *softmax_probs, int *label_value_with_blank, int batch, int SOffSet,
             int maxtime, int numclass, const int *sequence_length, int *label_squence_length, int *cum_labels_length,
             T *cost, T *grads, T *prob_num, bool ignore_longer_outputs_than_inputs, cudaStream_t stream) {
  ProbInitKernel<<<GET_BLOCKS(maxtime * batch * numclass), GET_THREADS, 0, stream>>>(prob_num,
                                                                                     maxtime * batch * numclass);
  CTCLossKernel<<<GET_BLOCKS(batch), GET_THREADS, 0, stream>>>(
    log_alpha_b, log_beta_b, softmax_probs, label_value_with_blank, batch, SOffSet, maxtime, numclass, sequence_length,
    label_squence_length, cum_labels_length, cost, grads, prob_num, ignore_longer_outputs_than_inputs);
}

template void CalculateFwdVar<float>(float *log_alpha_b, int *label_value_with_blank, float *softmax_probs,
                                     const int *sequence_length, bool ctc_merge_repeated, int batch, int SOffSet,
                                     int maxtime, int blank, int *label_squence_length, int *cum_labels_length,
                                     bool ignore_longer_outputs_than_inputs, cudaStream_t stream);

template void CalculateBwdVar<float>(float *log_beta_b, int *label_value_with_blank, float *softmax_probs,
                                     const int *sequence_length, bool ctc_merge_repeated, int batch, int SOffSet,
                                     int maxtime, int blank, int *label_squence_length, int *cum_labels_length,
                                     bool ignore_longer_outputs_than_inputs, cudaStream_t stream);

template void InnerSoftMax<float>(const float *probs, float *softmax_probs, const int *sequence_length, int max_time,
                                  int batch, int numclass, cudaStream_t stream);

template void CTCLoss<float>(float *log_alpha_b, float *log_beta_b, float *softmax_probs, int *label_value_with_blank,
                             int batch, int SOffSet, int maxtime, int numclass, const int *sequence_length,
                             int *label_squence_length, int *cum_labels_length, float *cost, float *grads,
                             float *prob_num, bool ignore_longer_outputs_than_inputs, cudaStream_t stream);
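Reviewer note: the kernels above are a log-space implementation of the standard CTC recursions (Graves et al., 2006) over the blank-augmented label sequence l' of length U = 2|l| + 1. LSE below stands for the LogSumExp device function, and the u-2 term enters only when l'_u is not blank and, under ctc_merge_repeated, differs from l'_{u-2}:

```
\alpha_t(u) = \log y_t(l'_u) + \mathrm{LSE}\big(\alpha_{t-1}(u),\ \alpha_{t-1}(u-1),\ [\alpha_{t-1}(u-2)]\big)

\mathrm{cost}_i = -\log p(l \mid x) = -\mathrm{LSE}_u\big(\alpha_0(u) + \beta_0(u)\big)

\mathrm{grad}_t(k) = y_t(k) - \frac{1}{p(l \mid x)} \sum_{u\,:\,l'_u = k} e^{\alpha_t(u) + \beta_t(u)}
```

The second line is what CTCLossKernel accumulates into log_pzx, and the third is the gradient it writes via prob_num; when log_pzx is -inf (no feasible alignment) the gradient falls back to the raw softmax probabilities, as in the first branch of that kernel.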
@ -0,0 +1,51 @@
/**
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#ifndef MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_GPU_IMPL_CUH
#define MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_GPU_IMPL_CUH

template <typename T>
void CalculateFwdVar(T *log_alpha_b, int *label_value_with_blank, T *softmax_probs, const int *sequence_length,
                     bool ctc_merge_repeated, int batch, int SOffSet, int maxtime, int blank,
                     int *label_squence_length, int *cum_labels_length, bool ignore_longer_outputs_than_inputs,
                     cudaStream_t stream);

template <typename T>
void CalculateBwdVar(T *log_beta_b, int *label_value_with_blank, T *softmax_probs, const int *sequence_length,
                     bool ctc_merge_repeated, int batch, int SOffSet, int maxtime, int blank,
                     int *label_squence_length, int *cum_labels_length, bool ignore_longer_outputs_than_inputs,
                     cudaStream_t stream);

template <typename T>
void InnerSoftMax(const T *probs, T *softmax_cost, const int *sequence_length, int max_time, int batch, int numclass,
                  cudaStream_t stream);

void GenLabelValuePCR(int *label_value_sp, int *label_value_pcr, int *label_squence_length, int *cum_labels_length,
                      int *max_labels_length, int batch, cudaStream_t stream);

void GenLabelWithBlank(int *label_value, int *label_value_with_blank, int *label_squence_length,
                       int *precum_labels_length, int *cum_labels_length, int batch, int blank, cudaStream_t stream);

void GenLabelValue(int *label_value_sp, const int64_t *label_indices, const int *label_values,
                   int *label_squence_length, int *cum_labels_length, int *max_labels_length, int size, int blank,
                   int batch, cudaStream_t stream);

void CalculatePreLength(int *label_squence_length, int *precum_labels_length, int *cum_labels_length,
                        int *max_labels_length, const int64_t *label_indices, int batch, int size,
                        cudaStream_t stream);
void CalculateMaxSequence(const int *sequence_length, int *max_labels_length, int batch, cudaStream_t stream);
template <typename T>
void CTCLoss(T *log_alpha_b, T *log_beta_b, T *softmax_probs, int *label_value_with_blank, int batch, int SOffSet,
             int maxtime, int numclass, const int *sequence_length, int *label_squence_length, int *cum_labels_length,
             T *cost, T *grads, T *prob_num, bool ignore_longer_outputs_than_inputs, cudaStream_t stream);
#endif  // MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_GPU_IMPL_CUH
@ -44,8 +44,12 @@ MS_REG_GPU_KERNEL_ONE(Square, KernelAttr().AddInputAttr(kNumberTypeFloat16).AddO
                      UnaryOpGpuKernel, half)
MS_REG_GPU_KERNEL_ONE(Sqrt, KernelAttr().AddInputAttr(kNumberTypeFloat32).AddOutputAttr(kNumberTypeFloat32),
                      UnaryOpGpuKernel, float)
MS_REG_GPU_KERNEL_ONE(Sqrt, KernelAttr().AddInputAttr(kNumberTypeFloat16).AddOutputAttr(kNumberTypeFloat16),
                      UnaryOpGpuKernel, half)
MS_REG_GPU_KERNEL_ONE(Rsqrt, KernelAttr().AddInputAttr(kNumberTypeFloat32).AddOutputAttr(kNumberTypeFloat32),
                      UnaryOpGpuKernel, float)
MS_REG_GPU_KERNEL_ONE(Rsqrt, KernelAttr().AddInputAttr(kNumberTypeFloat16).AddOutputAttr(kNumberTypeFloat16),
                      UnaryOpGpuKernel, half)
MS_REG_GPU_KERNEL_ONE(Sin, KernelAttr().AddInputAttr(kNumberTypeFloat32).AddOutputAttr(kNumberTypeFloat32),
                      UnaryOpGpuKernel, float)
MS_REG_GPU_KERNEL_ONE(Sin, KernelAttr().AddInputAttr(kNumberTypeFloat16).AddOutputAttr(kNumberTypeFloat16),
@ -18,10 +18,10 @@
namespace mindspore {
namespace kernel {
MS_REG_GPU_KERNEL_ONE(CTCLossV2,
MS_REG_GPU_KERNEL_ONE(CTCLoss,
                      KernelAttr()
                        .AddInputAttr(kNumberTypeFloat32)
                        .AddInputAttr(kNumberTypeInt32)
                        .AddInputAttr(kNumberTypeInt64)
                        .AddInputAttr(kNumberTypeInt32)
                        .AddInputAttr(kNumberTypeInt32)
                        .AddOutputAttr(kNumberTypeFloat32)
@ -19,64 +19,111 @@
#include <cuda_runtime_api.h>
#include <vector>
#include <limits>
#include "backend/kernel_compiler/gpu/gpu_kernel.h"
#include "backend/kernel_compiler/gpu/gpu_kernel_factory.h"
#include "runtime/device/gpu/gpu_memory_allocator.h"

#include "backend/kernel_compiler/gpu/cuda_impl/ctcloss_impl.cuh"
namespace mindspore {
namespace kernel {
template <typename T>
class CtcLossGpuKernel : public GpuKernel {
 public:
  CtcLossGpuKernel()
    : cudnn_handle_(nullptr),
      probs_desc_(nullptr),
      ctcloss_desc_(nullptr),
    : label_indice_size_(0),
      label_size_(0),
      input_lengths_size_(0),
      label_lengths_size_(0) {}
  ~CtcLossGpuKernel() override { DestroyResource(); }
      squence_lengths_size_(0),
      preprocess_collapse_repeated_(false),
      ctc_merge_repeated_(true),
      ignore_longer_outputs_than_inputs_(false) {}
  ~CtcLossGpuKernel() override = default;

  const std::vector<size_t> &GetInputSizeList() const override { return input_size_list_; }
  const std::vector<size_t> &GetOutputSizeList() const override { return output_size_list_; }
  const std::vector<size_t> &GetWorkspaceSizeList() const override { return workspace_size_list_; }

  bool Launch(const std::vector<AddressPtr> &inputs, const std::vector<AddressPtr> &,
  bool Launch(const std::vector<AddressPtr> &inputs, const std::vector<AddressPtr> &workspace,
              const std::vector<AddressPtr> &outputs, void *stream_ptr) override {
    float *probs = GetDeviceAddress<float>(inputs, 0);
    float *costs = GetDeviceAddress<float>(outputs, 0);
    float *grads = GetDeviceAddress<float>(outputs, 1);

    // Copy labels/input_lengths/label_length to host as cudnn7.x.x requires
    int *labels_host = nullptr;
    int *no_blank_labels_host = nullptr;
    void *input_lengths_host = nullptr;
    void *label_lengths_host = nullptr;
    cudaStream_t stream = reinterpret_cast<cudaStream_t>(stream_ptr);
    AllocHostMem(&labels_host, &no_blank_labels_host, &input_lengths_host, &label_lengths_host, inputs);
    CopyToHostSync(labels_host, no_blank_labels_host, input_lengths_host, label_lengths_host, inputs, stream);
    const T *probs = GetDeviceAddress<T>(inputs, 0);
    const int64_t *label_indices = GetDeviceAddress<int64_t>(inputs, 1);
    const int *label_values = GetDeviceAddress<int>(inputs, 2);
    const int *sequence_length = GetDeviceAddress<int>(inputs, 3);
    T *costs = GetDeviceAddress<T>(outputs, 0);
    T *grads = GetDeviceAddress<T>(outputs, 1);
    T *softmax_probs = GetDeviceAddress<T>(workspace, 0);
    int *cum_labels_length = GetDeviceAddress<int>(workspace, 1);
    int *label_squence_length = GetDeviceAddress<int>(workspace, 2);
    int *label_value_sp = GetDeviceAddress<int>(workspace, 3);
    int *label_value_pcr = GetDeviceAddress<int>(workspace, 4);
    T *prob_num = GetDeviceAddress<T>(workspace, 5);
    int *precum_labels_length = GetDeviceAddress<int>(workspace, 6);
    int *max_labels_length = GetDeviceAddress<int>(workspace, 7);
    int numclass = SizeToInt(probs_dims_[2]);
    int batch = SizeToInt(probs_dims_[1]);
    int max_time = SizeToInt(probs_dims_[0]);
    int max_sequence = 0;
    CalculateMaxSequence(sequence_length, max_labels_length, batch, stream);
    CHECK_CUDA_RET_WITH_EXCEPT(
      cudaMemcpyAsync(&max_sequence, max_labels_length, sizeof(int), cudaMemcpyDeviceToHost, stream),
      "cudaMemcpyAsync failed.");
    CHECK_CUDA_RET_WITH_EXCEPT(cudaStreamSynchronize(stream), "cudaStreamSynchronize failed.");
    if (max_time < max_sequence) {
      MS_LOG(EXCEPTION) << "max_time should be greater than sequence length.";
    }
    InnerSoftMax(probs, softmax_probs, sequence_length, max_time, batch, numclass, stream);
    MemsetForWS(label_value_pcr, cum_labels_length, label_squence_length, costs, grads, stream);
    int max_labels_length_host = 0;
    int batch_label = 0;
    int *label_value_with_blank = nullptr;
    T *log_alpha_b = nullptr;
    T *log_beta_b = nullptr;
    CalculatePreLength(label_squence_length, precum_labels_length, cum_labels_length, max_labels_length, label_indices,
                       batch, label_size_ / sizeof(int), stream);
    CHECK_CUDA_RET_WITH_EXCEPT(
      cudaMemcpyAsync(&batch_label, max_labels_length, sizeof(int), cudaMemcpyDeviceToHost, stream),
      "cudaMemcpyAsync failed.");
    CHECK_CUDA_RET_WITH_EXCEPT(cudaStreamSynchronize(stream), "cudaStreamSynchronize failed.");
    if (batch != batch_label + 1) {
      MS_LOG(EXCEPTION) << "label batch should be equal to input batch.";
    }
    GenLabelValue(label_value_sp, label_indices, label_values, label_squence_length, cum_labels_length,
                  max_labels_length, label_size_ / sizeof(int), numclass - 1, batch, stream);
    if (preprocess_collapse_repeated_) {
      GenLabelValuePCR(label_value_sp, label_value_pcr, label_squence_length, cum_labels_length, max_labels_length,
                       batch, stream);
    }
    CHECK_CUDA_RET_WITH_EXCEPT(
      cudaMemcpyAsync(&max_labels_length_host, max_labels_length, sizeof(int), cudaMemcpyDeviceToHost, stream),
      "cudaMemcpyAsync failed.");
    CHECK_CUDA_RET_WITH_EXCEPT(cudaStreamSynchronize(stream), "cudaStreamSynchronize failed.");
    int SOffSet = 2 * max_labels_length_host + 1;
    int log_prob_size = batch * SOffSet * max_time;
    if (!ignore_longer_outputs_than_inputs_ && max_labels_length_host > max_time) {
      MS_LOG(EXCEPTION) << "output size is greater than input size.";
    }
    MemManageForCus(&log_alpha_b, &log_beta_b, &label_value_with_blank, cum_labels_length, log_prob_size, batch,
                    stream);

    size_t workspace_size = 0;
    CHECK_CUDNN_RET_WITH_EXCEPT(
      cudnnGetCTCLossWorkspaceSize(
        cudnn_handle_, probs_desc_, probs_desc_, reinterpret_cast<int *>(no_blank_labels_host),
        reinterpret_cast<int *>(label_lengths_host), reinterpret_cast<int *>(input_lengths_host),
        CUDNN_CTC_LOSS_ALGO_DETERMINISTIC, ctcloss_desc_, &workspace_size),
      "cudnnGetCTCLossWorkspaceSize failed.");
    void *workspace = device::gpu::GPUMemoryAllocator::GetInstance().AllocTensorMem(workspace_size);
    if (workspace == nullptr) {
      MS_LOG(EXCEPTION) << "Failed to alloc workspace, size: " << workspace_size;
    if (preprocess_collapse_repeated_) {
      GenLabelWithBlank(label_value_pcr, label_value_with_blank, label_squence_length, precum_labels_length,
                        cum_labels_length, batch, numclass - 1, stream);
    } else {
      GenLabelWithBlank(label_value_sp, label_value_with_blank, label_squence_length, precum_labels_length,
                        cum_labels_length, batch, numclass - 1, stream);
    }

    CHECK_CUDNN_RET_WITH_EXCEPT(
      cudnnCTCLoss(cudnn_handle_, probs_desc_, probs, reinterpret_cast<int *>(no_blank_labels_host),
                   reinterpret_cast<int *>(label_lengths_host), reinterpret_cast<int *>(input_lengths_host), costs,
                   probs_desc_, grads, CUDNN_CTC_LOSS_ALGO_DETERMINISTIC, ctcloss_desc_, workspace, workspace_size),
      "cudnnCtcLoss failed.");
    CalculateFwdVar(log_alpha_b, label_value_with_blank, softmax_probs, sequence_length, ctc_merge_repeated_, batch,
                    SOffSet, max_time, numclass - 1, label_squence_length, cum_labels_length,
                    ignore_longer_outputs_than_inputs_, stream);
    CalculateBwdVar(log_beta_b, label_value_with_blank, softmax_probs, sequence_length, ctc_merge_repeated_, batch,
                    SOffSet, max_time, numclass - 1, label_squence_length, cum_labels_length,
                    ignore_longer_outputs_than_inputs_, stream);
    CTCLoss(log_alpha_b, log_beta_b, softmax_probs, label_value_with_blank, batch, SOffSet, max_time, numclass,
            sequence_length, label_squence_length, cum_labels_length, costs, grads, prob_num,
            ignore_longer_outputs_than_inputs_, stream);
    CHECK_CUDA_RET_WITH_EXCEPT(cudaStreamSynchronize(stream), "cudaStreamSynchronize failed.");

    device::gpu::GPUMemoryAllocator::GetInstance().FreeTensorMem(workspace);
    FreeHostMem(labels_host, no_blank_labels_host, input_lengths_host, label_lengths_host);
    FreeMem(label_value_with_blank, log_alpha_b, log_beta_b);
    return true;
  }
  bool Init(const CNodePtr &kernel_node) override {
@ -88,104 +135,98 @@ class CtcLossGpuKernel : public GpuKernel {
    probs_dims_[0] = probs_shape[0];
    probs_dims_[1] = probs_shape[1];
    probs_dims_[2] = probs_shape[2];

    auto labels_dims = AnfAlgo::GetPrevNodeOutputInferShape(kernel_node, 1);
    if (labels_dims.size() != 1 && labels_dims.size() != 2) {
    auto indice_dims = AnfAlgo::GetPrevNodeOutputInferShape(kernel_node, 1);
    auto labels_dims = AnfAlgo::GetPrevNodeOutputInferShape(kernel_node, 2);
    if (labels_dims.size() != 1) {
      MS_LOG(EXCEPTION) << "labels dims: " << labels_dims.size() << " not support.";
    }
    if (indice_dims.size() != 2) {
      MS_LOG(EXCEPTION) << "labels indice dims: " << indice_dims.size() << " not support.";
    }
    label_size_ = sizeof(int);
    for (auto i : labels_dims) {
      label_size_ *= i;
    }

    auto input_length_dims = AnfAlgo::GetPrevNodeOutputInferShape(kernel_node, 2);
    input_lengths_size_ = input_length_dims[0] * sizeof(int);
    auto label_length_dims = AnfAlgo::GetPrevNodeOutputInferShape(kernel_node, 3);
    label_lengths_size_ = label_length_dims[0] * sizeof(int);
    CHECK_CUDNN_RET_WITH_EXCEPT(
      cudnnSetTensorNdDescriptorEx(probs_desc_, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 3, probs_dims_),
      "cudnnSetTensorNdDescriptorEx failed.");
    CHECK_CUDNN_RET_WITH_EXCEPT(cudnnSetCTCLossDescriptorEx(ctcloss_desc_, CUDNN_DATA_FLOAT,
                                                            CUDNN_LOSS_NORMALIZATION_SOFTMAX, CUDNN_PROPAGATE_NAN),
                                "cudnnSetCTCLossDescriptorEx failed.");
    label_indice_size_ = sizeof(int64_t);
    for (auto i : indice_dims) {
      label_indice_size_ *= i;
    }
    auto squence_length_dims = AnfAlgo::GetPrevNodeOutputInferShape(kernel_node, 3);
    squence_lengths_size_ = squence_length_dims[0] * sizeof(int);
    preprocess_collapse_repeated_ = GetAttr<bool>(kernel_node, "preprocess_collapse_repeated");
    ctc_merge_repeated_ = GetAttr<bool>(kernel_node, "ctc_merge_repeated");
    ignore_longer_outputs_than_inputs_ = GetAttr<bool>(kernel_node, "ignore_longer_outputs_than_inputs");
    InitSizeLists();
    return true;
  }

 protected:
  void InitResource() override {
    cudnn_handle_ = device::gpu::GPUDeviceManager::GetInstance().GetCudnnHandle();
    CHECK_CUDNN_RET_WITH_EXCEPT(cudnnCreateTensorDescriptor(&probs_desc_), "cudnnCreateTensorDescriptor failed.");
    CHECK_CUDNN_RET_WITH_EXCEPT(cudnnCreateCTCLossDescriptor(&ctcloss_desc_), "cudnnCreateCTCLossDescriptor failed.");
  }

  void InitSizeLists() override {
    input_size_list_.push_back(probs_dims_[0] * probs_dims_[1] * probs_dims_[2] * sizeof(float));
    input_size_list_.push_back(probs_dims_[0] * probs_dims_[1] * probs_dims_[2] * sizeof(T));
    input_size_list_.push_back(label_indice_size_);
    input_size_list_.push_back(label_size_);
    input_size_list_.push_back(input_lengths_size_);
    input_size_list_.push_back(label_lengths_size_);

    output_size_list_.push_back(probs_dims_[1] * sizeof(float));
    output_size_list_.push_back(probs_dims_[0] * probs_dims_[1] * probs_dims_[2] * sizeof(float));
    input_size_list_.push_back(squence_lengths_size_);
    workspace_size_list_.push_back(probs_dims_[0] * probs_dims_[1] * probs_dims_[2] * sizeof(T));
    workspace_size_list_.push_back(squence_lengths_size_);
    workspace_size_list_.push_back(squence_lengths_size_);
    workspace_size_list_.push_back(label_size_);
    workspace_size_list_.push_back(label_size_);
    workspace_size_list_.push_back(probs_dims_[0] * probs_dims_[1] * probs_dims_[2] * sizeof(T));
    workspace_size_list_.push_back(squence_lengths_size_);
    workspace_size_list_.push_back(sizeof(int));
    output_size_list_.push_back(probs_dims_[1] * sizeof(T));
    output_size_list_.push_back(probs_dims_[0] * probs_dims_[1] * probs_dims_[2] * sizeof(T));
  }

 private:
  void DestroyResource() noexcept {
    CHECK_CUDNN_RET_WITH_ERROR(cudnnDestroyCTCLossDescriptor(ctcloss_desc_), "cudnnDestroyCTCLossDescriptor failed.");
    CHECK_CUDNN_RET_WITH_ERROR(cudnnDestroyTensorDescriptor(probs_desc_), "cudnnDestroyTensorDescriptor failed.");
  }

  void AllocHostMem(int **labels_host, int **no_blank_labels_host, void **input_lengths_host,
                    void **label_lengths_host, const std::vector<AddressPtr> &inputs) {
    CHECK_CUDA_RET_WITH_EXCEPT(cudaMallocHost(labels_host, inputs[1]->size), "cudaMallocHost failed.");
    CHECK_CUDA_RET_WITH_EXCEPT(cudaMallocHost(no_blank_labels_host, inputs[1]->size), "cudaMallocHost failed.");
    CHECK_CUDA_RET_WITH_EXCEPT(cudaMallocHost(input_lengths_host, inputs[2]->size), "cudaMallocHost failed.");
    CHECK_CUDA_RET_WITH_EXCEPT(cudaMallocHost(label_lengths_host, inputs[3]->size), "cudaMallocHost failed.");
  }

  void FreeHostMem(int *labels_host, int *no_blank_labels_host, void *input_lengths_host, void *label_lengths_host) {
    CHECK_CUDA_RET_WITH_EXCEPT(cudaFreeHost(label_lengths_host), "cudaFreeHost failed.");
    CHECK_CUDA_RET_WITH_EXCEPT(cudaFreeHost(input_lengths_host), "cudaFreeHost failed.");
    CHECK_CUDA_RET_WITH_EXCEPT(cudaFreeHost(labels_host), "cudaFreeHost failed.");
    CHECK_CUDA_RET_WITH_EXCEPT(cudaFreeHost(no_blank_labels_host), "cudaFreeHost failed.");
  }

  void CopyToHostSync(int *labels_host, int *no_blank_labels_host, void *input_lengths_host, void *label_lengths_host,
                      const std::vector<AddressPtr> &inputs, cudaStream_t stream) {
    CHECK_CUDA_RET_WITH_EXCEPT(cudaStreamSynchronize(stream), "cudaStreamSynchronize failed.");
  void MemsetForWS(int *label_value_pcr, int *cum_labels_length, int *label_squence_length, T *costs, T *grads,
                   cudaStream_t stream) {
    CHECK_CUDA_RET_WITH_EXCEPT(cudaMemsetAsync(label_value_pcr, static_cast<int>(0), label_size_, stream),
                               "cudaMemSet failed in CtcLossGpuKernel::Launch.");
    CHECK_CUDA_RET_WITH_EXCEPT(cudaMemsetAsync(cum_labels_length, static_cast<int>(0), squence_lengths_size_, stream),
                               "cudaMemSet failed in CtcLossGpuKernel::Launch.");
    CHECK_CUDA_RET_WITH_EXCEPT(
      cudaMemcpyAsync(labels_host, inputs[1]->addr, inputs[1]->size, cudaMemcpyDeviceToHost, stream),
      "cudaMemcpyAsync failed.");
      cudaMemsetAsync(label_squence_length, static_cast<int>(0), squence_lengths_size_, stream),
      "cudaMemSet failed in CtcLossGpuKernel::Launch.");
    CHECK_CUDA_RET_WITH_EXCEPT(cudaMemsetAsync(costs, static_cast<T>(0), probs_dims_[1] * sizeof(T), stream),
                               "cudaMemSet failed in CtcLossGpuKernel::Launch.");
    CHECK_CUDA_RET_WITH_EXCEPT(
      cudaMemcpyAsync(input_lengths_host, inputs[2]->addr, inputs[2]->size, cudaMemcpyDeviceToHost, stream),
      "cudaMemcpyAsync failed.");
    CHECK_CUDA_RET_WITH_EXCEPT(
      cudaMemcpyAsync(label_lengths_host, inputs[3]->addr, inputs[3]->size, cudaMemcpyDeviceToHost, stream),
      cudaMemsetAsync(grads, static_cast<T>(0), probs_dims_[0] * probs_dims_[1] * probs_dims_[2] * sizeof(T), stream),
      "cudaMemSet failed in CtcLossGpuKernel::Launch.");
  }
  void MemManageForCus(T **log_alpha_b, T **log_beta_b, int **label_value_with_blank, int *cum_labels_length,
                       int log_prob_size, int batch, cudaStream_t stream) {
    int total_labels_size_host = 0;
    CHECK_CUDA_RET_WITH_EXCEPT(cudaMalloc(reinterpret_cast<void **>(log_alpha_b), sizeof(T) * log_prob_size),
                               "cudaMalloc failed.");
    CHECK_CUDA_RET_WITH_EXCEPT(cudaMalloc(reinterpret_cast<void **>(log_beta_b), sizeof(T) * log_prob_size),
                               "cudaMalloc failed.");
    CHECK_CUDA_RET_WITH_EXCEPT(cudaMemcpyAsync(&total_labels_size_host, cum_labels_length + batch - 1, sizeof(int),
                                               cudaMemcpyDeviceToHost, stream),
                               "cudaMemcpyAsync failed.");
    CHECK_CUDA_RET_WITH_EXCEPT(cudaStreamSynchronize(stream), "cudaStreamSynchronize failed.");
    CHECK_CUDA_RET_WITH_EXCEPT(
      cudaMalloc(reinterpret_cast<void **>(label_value_with_blank), sizeof(int) * (2 * total_labels_size_host + batch)),
      "cudaMalloc failed.");
  }

    // remove blank element
    size_t j = 0;
    for (size_t i = 0; i < inputs[1]->size / sizeof(int); i++) {
      if (labels_host[i] != 0) {
        no_blank_labels_host[j] = labels_host[i];
        j++;
      }
    }
  void FreeMem(int *label_value_with_blank, T *log_alpha_b, T *log_beta_b) {
    CHECK_CUDA_RET_WITH_EXCEPT(cudaFree(label_value_with_blank), "cudaFree failed.");
    CHECK_CUDA_RET_WITH_EXCEPT(cudaFree(log_alpha_b), "cudaFree failed.");
    CHECK_CUDA_RET_WITH_EXCEPT(cudaFree(log_beta_b), "cudaFree failed.");
  }

  std::vector<size_t> input_size_list_;
  std::vector<size_t> output_size_list_;
  std::vector<size_t> workspace_size_list_;

  cudnnHandle_t cudnn_handle_;
  cudnnTensorDescriptor_t probs_desc_;
  cudnnCTCLossDescriptor_t ctcloss_desc_;
  int probs_dims_[3] = {0};
  size_t probs_dims_[3] = {0};
  int label_indice_size_;
  int label_size_;
  int input_lengths_size_;
  int label_lengths_size_;
};
  int squence_lengths_size_;
  bool preprocess_collapse_repeated_;
  bool ctc_merge_repeated_;
  bool ignore_longer_outputs_than_inputs_;
  T kLogZero_ = -std::numeric_limits<T>::infinity();
};
}  // namespace kernel
}  // namespace mindspore
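Reviewer note: the eight workspace buffers that Launch fetches with GetDeviceAddress(workspace, 0..7) line up, in order, with the eight workspace_size_list_ entries pushed in InitSizeLists above:

```
// index  buffer                  size pushed in InitSizeLists
// 0      softmax_probs           max_time * batch * numclass * sizeof(T)
// 1      cum_labels_length       squence_lengths_size_ (batch * sizeof(int))
// 2      label_squence_length    squence_lengths_size_
// 3      label_value_sp          label_size_
// 4      label_value_pcr         label_size_
// 5      prob_num                max_time * batch * numclass * sizeof(T)
// 6      precum_labels_length    squence_lengths_size_
// 7      max_labels_length       sizeof(int)
```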
@ -66,7 +66,6 @@ class MirrorPadGpuFwdKernel : public GpuKernel {
    }

    string mode = GetValue<string>(AnfAlgo::GetCNodePrimitive(kernel_node)->GetAttr("mode"));

    if (mode == "REFLECT") {
      mode_ = 0;  // reflected mirroring
    } else {
@ -66,7 +66,6 @@ class MirrorPadGpuBackKernel : public GpuKernel {
    }

    string mode = GetValue<string>(AnfAlgo::GetCNodePrimitive(kernel_node)->GetAttr("mode"));

    if (mode == "REFLECT") {
      mode_ = 0;  // reflected mirroring
    } else {
@ -27,6 +27,5 @@ MS_REG_GPU_KERNEL_ONE(
  ROIAlign,
  KernelAttr().AddInputAttr(kNumberTypeFloat16).AddInputAttr(kNumberTypeFloat16).AddOutputAttr(kNumberTypeFloat16),
  ROIAlignGpuFwdKernel, half)

}  // namespace kernel
}  // namespace mindspore
@ -27,6 +27,5 @@ MS_REG_GPU_KERNEL_ONE(
  ROIAlignGrad,
  KernelAttr().AddInputAttr(kNumberTypeFloat16).AddInputAttr(kNumberTypeFloat16).AddOutputAttr(kNumberTypeFloat16),
  ROIAlignGradGpuFwdKernel, half)

}  // namespace kernel
}  // namespace mindspore
@ -87,6 +87,7 @@ static std::map<string, string> tbe_func_adapter_map = {
  {"apply_centered_rms_prop", "apply_centered_rms_prop_d"},
  {"transpose", "transpose_d"},
  {"fill", "fill_d"},
  {"i_fmr", "ifmr"},
  {"unsorted_segment_sum", "unsorted_segment_sum_d"},
  {"unsorted_segment_prod", "unsorted_segment_prod_d"},
  {"concat", "concat_d"},
@ -139,7 +140,8 @@ static std::map<string, string> tbe_func_adapter_map = {
  {"inplace_update", "inplace_update_d"},
  {"matrix_diag", "matrix_diag_d"},
  {"matrix_diag_part", "matrix_diag_part_d"},
  {"matrix_set_diag", "matrix_set_diag_d"}};
  {"matrix_set_diag", "matrix_set_diag_d"},
  {"l_stm_input_grad", "lstm_input_grad"}};

void TbeAdapter::NormalizeFuncName(std::string *func_name) {
  if (func_name == nullptr) {
@ -25,6 +25,7 @@
#include "backend/kernel_compiler/tbe/tbe_convert_utils.h"
#include "backend/kernel_compiler/tbe/tbe_utils.h"
#include "utils/ms_context.h"
#include "runtime/dev.h"

namespace mindspore {
namespace kernel {
@ -86,6 +87,8 @@ constexpr auto kJPyModulePath = "py_module_path";
constexpr auto kJPreBuildOutsAttrs = "prebuild_outs_attrs";
constexpr auto kJKwdArgs = "kwds_args";
constexpr auto kJListArgs = "list_args";
constexpr auto kJSocVersion = "socVersion";
constexpr auto kSOC_VERSION = "SOC_VERSION";

bool TbeKernelJsonCreator::GenTbeSingleKernelJson(const std::shared_ptr<mindspore::AnfNode> &anf_node,
                                                  nlohmann::json *kernel_json) {
@ -122,6 +125,8 @@ bool TbeKernelJsonCreator::GenTbeSingleKernelJson(const std::shared_ptr<mindspor
  nlohmann::json attrs_json;
  (void)GenTbeAttrJson(anf_node, op_info_ptr, &attrs_json);
  op_info_json[kJAttrs] = attrs_json;
  auto soc_version = TbeKernelJsonCreator::GetSocVersion();
  op_info_json[kJSocVersion] = soc_version;
  std::string json_str = op_info_json.dump();
  size_t hash_id = std::hash<std::string>()(json_str);
  auto context_ptr = MsContext::GetInstance();
@ -150,7 +155,13 @@ bool TbeKernelJsonCreator::GenInputDescJson(const std::shared_ptr<AnfNode> &anf_
  MS_EXCEPTION_IF_NULL(input_ptr);
  MS_EXCEPTION_IF_NULL(input_list);
  std::string op_name = AnfAlgo::GetCNodeName(anf_node);
  if (input_ptr->name() == "input_indices" && op_name == kTopKOpName) {
  if (op_name == kDynamicRNNOpName && input_ptr->name() == "seq_length") {
    nlohmann::json input_desc_json;
    auto in_name = input_ptr->name();
    input_desc_json[kJName] = in_name + std::to_string(input_i);
    input_desc_json[kJValid] = false;
    input_list->emplace_back(input_desc_json);
  } else if (input_ptr->name() == "input_indices" && op_name == kTopKOpName) {
    TbeAdapter::GenTopKV2IndicesTensorInfo(anf_node, real_input_index, input_list, creater_type_);
  } else {
    auto dtype = GetDeviceInputType(anf_node, real_input_index);
@ -408,6 +419,30 @@ bool TbeKernelJsonCreator::GenTbeAttrJson(const std::shared_ptr<AnfNode> &anf_no
  return true;
}

string TbeKernelJsonCreator::GetSocVersion() {
  // Get default soc version.
  const int kSocVersionLen = 50;
  char soc_version[kSocVersionLen] = {0};
  auto ret = rtGetSocVersion(soc_version, kSocVersionLen);
  if (ret != RT_ERROR_NONE) {
    MS_LOG(EXCEPTION) << "GetSocVersion failed.";
  }
  MS_LOG(INFO) << "Default SocVersion is " << soc_version;
  // Get soc version from env value.
  const char *soc_version_env = getenv(kSOC_VERSION);
  if (soc_version_env != nullptr) {
    if (std::strcmp(soc_version, soc_version_env) != 0) {
      MS_LOG(WARNING) << "SocVersion changed to " << soc_version_env;
      ret = rtSetSocVersion(soc_version_env);
      if (ret != RT_ERROR_NONE) {
        MS_LOG(EXCEPTION) << "SetSocVersion to " << soc_version_env << " failed, errorno: " << ret;
      }
      return soc_version_env;
    }
  }
  return soc_version;
}

void TbeKernelJsonCreator::ParseAttrValue(const std::string &type, const mindspore::ValuePtr &value,
                                          nlohmann::json *attr_obj) {
  MS_EXCEPTION_IF_NULL(value);
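Reviewer note: the override order implemented by GetSocVersion (an explicit SOC_VERSION environment value beats the default reported by the runtime) can be exercised in isolation with a sketch like the following; ResolveSocVersion and the Ascend version strings are illustrative, not MindSpore APIs:

```
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <string>

// Mirrors the getenv(kSOC_VERSION) branch above: the environment value wins
// whenever it is set and differs from the runtime default.
std::string ResolveSocVersion(const std::string &default_version) {
  const char *env = std::getenv("SOC_VERSION");
  if (env != nullptr && std::strcmp(default_version.c_str(), env) != 0) {
    return env;
  }
  return default_version;
}

int main() {
  // e.g. running with SOC_VERSION=Ascend910 prints Ascend910
  std::cout << ResolveSocVersion("Ascend310") << std::endl;
  return 0;
}
```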
@ -624,6 +659,8 @@ bool TbeKernelBuild::GenFusionScopeJson(const std::vector<mindspore::AnfNodePtr>
  index = 0;
  data_list.insert(data_list.end(), compute_list.begin(), compute_list.end());
  (*fusion_json)[kFusionOpList] = data_list;
  auto soc_version = TbeKernelJsonCreator::GetSocVersion();
  (*fusion_json)[kJSocVersion] = soc_version;
  return true;
}

@ -853,6 +890,7 @@ bool TbeKernelBuild::GenFusionDataInputJson(const std::shared_ptr<mindspore::Anf
  (*data_str)[kJName] = name;
  nlohmann::json output_desc;
  output_desc[kJName] = name;
  output_desc[kJDataType] = 0;
  output_desc[kJShape] = "NULL";
  output_desc_list.push_back(output_desc);
  (*index)++;
@ -985,6 +1023,7 @@ bool TbeKernelBuild::GenFusionComputeInputJson(const mindspore::CNodePtr &cnode,
  for (size_t i = 0; i < optional_num; ++i) {
    nlohmann::json optional_input_desc;
    optional_input_desc[kJName] = std::string(kOptional) + std::to_string(*index);
    optional_input_desc[kJShape] = "NULL";
    (*index)++;
    (*layer_iter)->emplace_back(nullptr);
    input_desc_list_tmp.emplace_back(optional_input_desc);
@ -92,6 +92,7 @@ class TbeKernelJsonCreator {
  std::string json_name() { return json_name_; }
  bool GenTbeAttrJson(const std::shared_ptr<AnfNode> &anf_node, const std::shared_ptr<OpInfo> &op_info,
                      nlohmann::json *attrs_json);
  static string GetSocVersion();

 private:
  bool GenTbeInputsJson(const std::shared_ptr<AnfNode> &anf_node, const std::shared_ptr<OpInfo> &op_info,
@ -19,6 +19,7 @@
#include <memory>
#include <string>
#include "backend/optimizer/common/optimizer.h"
#include "backend/optimizer/ascend/ir_fission/dynamic_rnn_grad_fission.h"
#include "backend/optimizer/ascend/ir_fission/bn_split.h"
#include "backend/optimizer/ascend/ir_fission/bn_grad_split.h"
#include "backend/optimizer/ascend/ir_fission/batch_norm_grad_split.h"
@ -64,6 +65,7 @@
#include "backend/optimizer/ascend/format_type/rectify_do_mask_kernel_info.h"
#include "backend/optimizer/ascend/format_type/chang_axis_of_reduce_kernel.h"
#include "backend/optimizer/ascend/format_type/split_unsupported_transdata.h"
#include "backend/optimizer/ascend/format_type/insert_reshape_for_extract_image_patches_op.h"
#include "backend/optimizer/pass/getitem_tuple.h"
#include "backend/optimizer/pass/optimize_dependence.h"
#include "backend/optimizer/pass/erase_visit_attr.h"
@ -106,6 +108,7 @@
#include "backend/optimizer/ascend/ir_fission/concat_fission.h"
#include "backend/optimizer/ascend/ir_fission/pack_fission.h"
#include "backend/optimizer/ascend/enhancer/concat_outputs_for_all_gather.h"
#include "backend/optimizer/ascend/enhancer/add_placeholder_for_dynamic_rnn.h"
#include "utils/ms_context.h"
#include "backend/optimizer/graph_kernel/composite_ops_fusion.h"
#include "backend/optimizer/graph_kernel/basic_ops_fusion.h"
@ -231,6 +234,7 @@ void AscendMixPrecision(const std::shared_ptr<session::KernelGraph> &kernel_grap
  auto optimizer = std::make_shared<GraphOptimizer>();
  auto mixed_precision_pm = std::make_shared<PassManager>("cast_pm");
  mixed_precision_pm->AddPass(std::make_shared<InsertCast>());
  mixed_precision_pm->AddPass(std::make_shared<InsertReshapeForExtractImagePatchesOp>());
  mixed_precision_pm->AddPass(std::make_shared<GetitemTuple>());
  mixed_precision_pm->AddPass(std::make_shared<CommonSubexpressionElimination>());
  mixed_precision_pm->AddPass(std::make_shared<EliminateRedundantOp>());
@ -276,6 +280,8 @@ void AscendBackendIRFusionOptimization(const std::shared_ptr<session::KernelGrap
  }
  ir_fusion_pm->AddPass(std::make_shared<LayerNormGradSplit>());
  ir_fusion_pm->AddPass(std::make_shared<InsertPadForNMSWithMask>());
  ir_fusion_pm->AddPass(std::make_shared<InsertPlaceholderForDynamicRNN>());
  ir_fusion_pm->AddPass(std::make_shared<DynamicRNNGradFission>());
  AddAscendIRFusionRulesPass(ir_fusion_pm.get());
  AddAscendIRFusionPass(ir_fusion_pm.get());

@ -0,0 +1,77 @@
/**
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#include "backend/optimizer/ascend/enhancer/add_placeholder_for_dynamic_rnn.h"
#include <vector>
#include <memory>
#include "backend/optimizer/common/helper.h"
#include "backend/session/anf_runtime_algorithm.h"
#include "utils/utils.h"
#include "abstract/abstract_value.h"
#include "base/core_ops.h"

namespace mindspore {
namespace opt {
const BaseRef InsertPlaceholderForDynamicRNN::DefinePattern() const {
  std::shared_ptr<Var> V = std::make_shared<CondVar>(UnVisited);
  std::shared_ptr<Var> Xs = std::make_shared<SeqVar>();
  return VectorRef({V, Xs});
}

const AnfNodePtr InsertPlaceholderForDynamicRNN::Process(const FuncGraphPtr &func_graph, const AnfNodePtr &node,
                                                         const EquivPtr &) const {
  MS_EXCEPTION_IF_NULL(func_graph);
  MS_EXCEPTION_IF_NULL(node);
  auto cnode = node->cast<CNodePtr>();
  MS_EXCEPTION_IF_NULL(cnode);
  auto op_name = AnfAlgo::GetCNodeName(cnode);
  if (op_name != kDynamicRNNOpName) {
    return nullptr;
  }
  AnfAlgo::SetNodeAttr(kAttrVisited, MakeValue(true), node);
  auto kernel_graph = func_graph->cast<std::shared_ptr<session::KernelGraph>>();
  MS_EXCEPTION_IF_NULL(kernel_graph);
  size_t input_num = AnfAlgo::GetInputTensorNum(node);
  if (input_num == 0) {
    return nullptr;
  }

  std::vector<AnfNodePtr> new_inputs = {AnfAlgo::GetCNodePrimitiveNode(cnode)};
  for (size_t in_idx = 0; in_idx < input_num; in_idx++) {
    auto input_node = AnfAlgo::GetInputNode(cnode, in_idx);
    if (in_idx == 3) {
      auto value = std::make_shared<None>();
      auto value_node = NewValueNode(value);
      value_node->set_abstract(std::make_shared<abstract::AbstractNone>());
      auto new_node = kernel_graph->NewValueNode(value_node);
      kernel_graph->AddValueNodeToGraph(new_node);
      new_inputs.push_back(new_node);
    }
    new_inputs.push_back(input_node);
  }

  CNodePtr new_node = nullptr;
  if (kernel_graph == nullptr) {
    new_node = std::make_shared<CNode>(*cnode);
  } else {
    new_node = kernel_graph->NewCNode(cnode);
  }
  MS_EXCEPTION_IF_NULL(new_node);
  new_node->set_inputs(new_inputs);
  return new_node;
}
}  // namespace opt
}  // namespace mindspore

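The pass above rewrites DynamicRNN's input list so that the optional seq_length slot (input index 3) is occupied by an explicit None value node. A minimal sketch of that rewriting loop, using strings in place of the AnfNode machinery (the names here are illustrative stand-ins, not MindSpore API):

```cpp
// Placeholder insertion sketch: "none" is inserted *before* the original
// input at index 3, shifting the remaining inputs right, exactly what the
// pass does for DynamicRNN's unset seq_length slot. Illustrative only.
#include <iostream>
#include <string>
#include <vector>

int main() {
  const std::vector<std::string> inputs = {"x", "w", "b", "init_h", "init_c"};
  std::vector<std::string> new_inputs = {"prim:DynamicRNN"};  // primitive node comes first
  for (size_t in_idx = 0; in_idx < inputs.size(); in_idx++) {
    if (in_idx == 3) {
      new_inputs.push_back("none");  // placeholder for the missing seq_length input
    }
    new_inputs.push_back(inputs[in_idx]);
  }
  for (const auto &s : new_inputs) {
    std::cout << s << ' ';
  }
  std::cout << '\n';  // prim:DynamicRNN x w b none init_h init_c
}
```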
@ -0,0 +1,37 @@
/**
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#ifndef MINDSPORE_CCSRC_PRE_ACTIVATE_ASCEND_ENHANCER_ADD_PLACEHOLDER_FOR_DYNAMIC_RNN_H
#define MINDSPORE_CCSRC_PRE_ACTIVATE_ASCEND_ENHANCER_ADD_PLACEHOLDER_FOR_DYNAMIC_RNN_H

#include <memory>
#include <vector>
#include "backend/optimizer/common/optimizer.h"
#include "backend/optimizer/ascend/ascend_helper.h"

namespace mindspore {
namespace opt {
class InsertPlaceholderForDynamicRNN : public PatternProcessPass {
 public:
  explicit InsertPlaceholderForDynamicRNN(bool multigraph = true)
      : PatternProcessPass("add_placeholder_for_dynamic_rnn", multigraph) {}
  ~InsertPlaceholderForDynamicRNN() override = default;
  const BaseRef DefinePattern() const override;
  const AnfNodePtr Process(const FuncGraphPtr &, const AnfNodePtr &, const EquivPtr &) const override;
};
}  // namespace opt
}  // namespace mindspore

#endif  // MINDSPORE_CCSRC_PRE_ACTIVATE_ASCEND_ENHANCER_ADD_PLACEHOLDER_FOR_DYNAMIC_RNN_H

@ -0,0 +1,65 @@
/**
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#include "backend/optimizer/ascend/format_type/insert_reshape_for_extract_image_patches_op.h"
#include <memory>
#include "backend/optimizer/ascend/ascend_helper.h"
#include "backend/session/anf_runtime_algorithm.h"
#include "utils/utils.h"
#include "base/core_ops.h"

namespace mindspore {
namespace opt {
const BaseRef InsertReshapeForExtractImagePatchesOp::DefinePattern() const {
  VarPtr Xs = std::make_shared<SeqVar>();
  return VectorRef({prim::kPrimExtractImagePatches, Xs});
}

const AnfNodePtr InsertReshapeForExtractImagePatchesOp::Process(const FuncGraphPtr &func_graph, const AnfNodePtr &node,
                                                                const EquivPtr &equiv) const {
  MS_EXCEPTION_IF_NULL(func_graph);
  MS_EXCEPTION_IF_NULL(equiv);
  auto extract = CheckAnfNodeIfCNodeAndInputSize(node, 2);
  MS_EXCEPTION_IF_NULL(extract);
  auto in_node = extract->input(1);
  MS_EXCEPTION_IF_NULL(in_node);
  auto extract_kernel_build_info = AnfAlgo::GetSelectKernelBuildInfo(extract);
  auto in_node_kernel_build_info = AnfAlgo::GetSelectKernelBuildInfo(in_node);
  MS_EXCEPTION_IF_NULL(extract_kernel_build_info);
  MS_EXCEPTION_IF_NULL(in_node_kernel_build_info);
  std::vector<AnfNodePtr> reshape_inputs = {NewValueNode(std::make_shared<Primitive>(prim::kPrimReshape->name())),
                                            in_node};
  auto reshape_builder = std::make_shared<kernel::KernelBuildInfo::KernelBuildInfoBuilder>();
  reshape_builder->SetInputsFormat({kOpFormat_NC1HWC0});
  reshape_builder->SetOutputsFormat({kOpFormat_NC1HWC0});
  reshape_builder->SetInputsDeviceType({AnfAlgo::GetOutputDeviceDataType(in_node, 0)});
  reshape_builder->SetOutputsDeviceType({AnfAlgo::GetOutputDeviceDataType(in_node, 0)});
  reshape_builder->SetKernelType(in_node_kernel_build_info->kernel_type());
  reshape_builder->SetFusionType(in_node_kernel_build_info->fusion_type());
  reshape_builder->SetProcessor(in_node_kernel_build_info->processor());

  auto reshape = func_graph->NewCNode(reshape_inputs);
  reshape->set_scope(in_node->scope());
  auto shape_tmp = AnfAlgo::GetOutputInferShape(in_node, 0);
  AnfAlgo::SetOutputInferTypeAndShape({AnfAlgo::GetOutputDeviceDataType(in_node, 0)},
                                      {{shape_tmp[0], shape_tmp[2], shape_tmp[3], shape_tmp[1]}}, reshape.get());
  AnfAlgo::SetSelectKernelBuildInfo(reshape_builder->Build(), reshape.get());
  AnfAlgo::SetNodeAttr("nop_op", MakeValue(true), reshape);
  AnfAlgo::SetNodeInput(extract, reshape, 0);
  return extract;
}
}  // namespace opt
}  // namespace mindspore

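The Reshape node built above carries an inferred shape that reorders the producer's NCHW dimensions into NHWC, which is what the {shape_tmp[0], shape_tmp[2], shape_tmp[3], shape_tmp[1]} expression computes. A self-contained sketch of that permutation, with plain std::vector standing in for the real shape type:

```cpp
// NCHW -> NHWC shape permutation via indices {0, 2, 3, 1}, matching the
// expression in the pass above. Illustrative only.
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<std::size_t> PermuteNchwToNhwc(const std::vector<std::size_t> &shape) {
  assert(shape.size() == 4);
  return {shape[0], shape[2], shape[3], shape[1]};
}

int main() {
  std::vector<std::size_t> nchw = {32, 16, 224, 224};  // N, C, H, W
  auto nhwc = PermuteNchwToNhwc(nchw);                 // {32, 224, 224, 16}
  assert(nhwc[1] == 224 && nhwc[3] == 16);
  return 0;
}
```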
@ -0,0 +1,41 @@
/**
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#ifndef MINDSPORE_CCSRC_BACKEND_OPTIMIZER_ASCEND_FORMAT_TYPE_INSERT_RESHAPE_FOR_EXTRACT_IMAGE_PATCHES_OP_H
#define MINDSPORE_CCSRC_BACKEND_OPTIMIZER_ASCEND_FORMAT_TYPE_INSERT_RESHAPE_FOR_EXTRACT_IMAGE_PATCHES_OP_H
#include <vector>
#include <string>
#include <utility>
#include <memory>
#include "ir/anf.h"
#include "backend/optimizer/common/pattern_engine.h"
#include "backend/optimizer/common/helper.h"
#include "backend/optimizer/common/optimizer.h"

namespace mindspore {
namespace opt {
class InsertReshapeForExtractImagePatchesOp : public PatternProcessPass {
 public:
  explicit InsertReshapeForExtractImagePatchesOp(bool multigraph = true)
      : PatternProcessPass("insert_reshape_for_extract_image_patches_op", multigraph) {}
  ~InsertReshapeForExtractImagePatchesOp() override = default;
  const BaseRef DefinePattern() const override;
  const AnfNodePtr Process(const FuncGraphPtr &, const AnfNodePtr &, const EquivPtr &) const override;
};
}  // namespace opt
}  // namespace mindspore

#endif  // MINDSPORE_CCSRC_BACKEND_OPTIMIZER_ASCEND_FORMAT_TYPE_INSERT_RESHAPE_FOR_EXTRACT_IMAGE_PATCHES_OP_H

@ -0,0 +1,250 @@
/**
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#include "backend/optimizer/ascend/ir_fission/dynamic_rnn_grad_fission.h"
#include <vector>
#include <memory>
#include <algorithm>
#include "backend/session/anf_runtime_algorithm.h"
#include "backend/optimizer/common/helper.h"

namespace mindspore {
namespace opt {
constexpr size_t kDynamicRNNGradInputNum = 16;
constexpr size_t kLSTMInputGradOutputNum = 4;
const BaseRef DynamicRNNGradFission::DefinePattern() const {
  VarPtr Xs = std::make_shared<SeqVar>();
  return VectorRef({prim::kPrimDynamicRNNGrad, Xs});
}

AnfNodePtr CreateSplitVD(const FuncGraphPtr &graph, const AnfNodePtr &node) {
  MS_EXCEPTION_IF_NULL(graph);
  MS_EXCEPTION_IF_NULL(node);
  // SplitV
  std::vector<AnfNodePtr> splitvd_input = {NewValueNode(std::make_shared<Primitive>(prim::kPrimSplitV->name())), node};
  auto split_vd = graph->NewCNode(splitvd_input);
  MS_EXCEPTION_IF_NULL(split_vd);
  auto dtypes = {AnfAlgo::GetOutputInferDataType(node, 0), AnfAlgo::GetOutputInferDataType(node, 0)};
  std::vector<size_t> shape = {AnfAlgo::GetOutputInferShape(node, 0)[0] - 1, AnfAlgo::GetOutputInferShape(node, 0)[1],
                               AnfAlgo::GetOutputInferShape(node, 0)[2]};
  auto shape2 = {IntToSize(1), AnfAlgo::GetOutputInferShape(node, 0)[1], AnfAlgo::GetOutputInferShape(node, 0)[2]};
  std::vector<std::vector<size_t>> shapes = {shape, shape2};
  AnfAlgo::SetOutputInferTypeAndShape(dtypes, shapes, split_vd.get());
  AnfAlgo::SetNodeAttr("split_dim", MakeValue(0), split_vd);
  AnfAlgo::SetNodeAttr("num_split", MakeValue(2), split_vd);
  int tmp = SizeToInt(AnfAlgo::GetOutputInferShape(node, 0)[0]) - 1;
  AnfAlgo::SetNodeAttr("size_splits", MakeValue(std::vector<int>{tmp, 1}), split_vd);
  AnfAlgo::SetNodeAttr("is_backend_insert", MakeValue(true), split_vd);
  return split_vd;
}

AnfNodePtr CreateLSTMInputGrad(const FuncGraphPtr &graph, const AnfNodePtr &node) {
  MS_EXCEPTION_IF_NULL(graph);
  MS_EXCEPTION_IF_NULL(node);
  auto cnode = node->cast<CNodePtr>();
  MS_EXCEPTION_IF_NULL(cnode);
  const auto &dynamic_rnn_grad_inputs = cnode->inputs();
  std::vector<AnfNodePtr> lstm_input_grad_inputs = {NewValueNode(std::make_shared<Primitive>(kLSTMInputGradOpName)),
                                                    dynamic_rnn_grad_inputs[2],
                                                    dynamic_rnn_grad_inputs[6],
                                                    dynamic_rnn_grad_inputs[8],
                                                    dynamic_rnn_grad_inputs[9],
                                                    dynamic_rnn_grad_inputs[10],
                                                    dynamic_rnn_grad_inputs[11],
                                                    dynamic_rnn_grad_inputs[12],
                                                    dynamic_rnn_grad_inputs[13],
                                                    dynamic_rnn_grad_inputs[14],
                                                    dynamic_rnn_grad_inputs[15],
                                                    dynamic_rnn_grad_inputs[16]};
  std::vector<AnfNodePtr> ori_outputs;
  CreateMultipleOutputsOfAnfNode(graph, node, 5, &ori_outputs);
  auto lstm_op = graph->NewCNode(lstm_input_grad_inputs);
  MS_EXCEPTION_IF_NULL(lstm_op);
  auto ori_type = AnfAlgo::GetOutputInferDataType(dynamic_rnn_grad_inputs[8], 0);
  auto types = {AnfAlgo::GetOutputInferDataType(ori_outputs[2], 0), AnfAlgo::GetOutputInferDataType(ori_outputs[3], 0),
                AnfAlgo::GetOutputInferDataType(ori_outputs[4], 0), ori_type};
  std::vector<size_t> ori_shape = {AnfAlgo::GetOutputInferShape(dynamic_rnn_grad_inputs[8], 0)[0],
                                   AnfAlgo::GetOutputInferShape(dynamic_rnn_grad_inputs[8], 0)[1],
                                   4 * AnfAlgo::GetOutputInferShape(dynamic_rnn_grad_inputs[8], 0)[2]};
  auto shapes = {AnfAlgo::GetOutputInferShape(ori_outputs[2], 0), AnfAlgo::GetOutputInferShape(ori_outputs[3], 0),
                 AnfAlgo::GetOutputInferShape(ori_outputs[4], 0), ori_shape};
  AnfAlgo::SetOutputInferTypeAndShape(types, shapes, lstm_op.get());
  return lstm_op;
}

AnfNodePtr CreateBatchMatMul(const FuncGraphPtr &graph, const AnfNodePtr &node1, const AnfNodePtr &node2) {
  MS_EXCEPTION_IF_NULL(graph);
  MS_EXCEPTION_IF_NULL(node1);
  MS_EXCEPTION_IF_NULL(node2);
  // BatchMatMul
  std::vector<AnfNodePtr> matmul_inputs = {NewValueNode(std::make_shared<Primitive>(prim::kPrimBatchMatMul->name())),
                                           node2, node1};
  auto batch_matmul = graph->NewCNode(matmul_inputs);
  MS_EXCEPTION_IF_NULL(batch_matmul);
  auto types = {AnfAlgo::GetOutputInferDataType(node1, 0)};
  std::vector<size_t> shape = {AnfAlgo::GetOutputInferShape(node2, 0)[0], AnfAlgo::GetOutputInferShape(node2, 0)[2],
                               AnfAlgo::GetOutputInferShape(node1, 0)[2]};
  auto shapes = {shape};
  AnfAlgo::SetNodeAttr("is_backend_insert", MakeValue(true), batch_matmul);
  AnfAlgo::SetNodeAttr("transpose_x1", MakeValue(true), batch_matmul);
  AnfAlgo::SetNodeAttr("transpose_x2", MakeValue(false), batch_matmul);
  AnfAlgo::SetOutputInferTypeAndShape(types, shapes, batch_matmul.get());
  return batch_matmul;
}

AnfNodePtr AddHConcatD(const FuncGraphPtr &graph, const AnfNodePtr &node1, const AnfNodePtr &node2) {
  MS_EXCEPTION_IF_NULL(graph);
  MS_EXCEPTION_IF_NULL(node1);
  MS_EXCEPTION_IF_NULL(node2);
  std::vector<AnfNodePtr> ori_outputs;
  CreateMultipleOutputsOfAnfNode(graph, node2, 2, &ori_outputs);
  auto ori_shape = AnfAlgo::GetOutputInferShape(node1, 0);
  std::vector<std::vector<size_t>> shape_tmp;
  if (ori_shape.size() == 3) {
    shape_tmp = {ori_shape};
  } else {
    shape_tmp = {{IntToSize(1), ori_shape[0], ori_shape[1]}};
  }
  auto ori_dtype = {AnfAlgo::GetOutputInferDataType(node1, 0)};
  // reshape
  std::vector<AnfNodePtr> reshape_input = {NewValueNode(std::make_shared<Primitive>(prim::kPrimReshape->name())),
                                           node1};
  auto reshape = graph->NewCNode(reshape_input);
  AnfAlgo::SetNodeAttr(kAttrVisited, MakeValue(true), reshape);
  AnfAlgo::SetOutputInferTypeAndShape(ori_dtype, shape_tmp, reshape.get());

  // concatd --> concat
  std::vector<AnfNodePtr> concat_inputs = {NewValueNode(std::make_shared<Primitive>(prim::kPrimConcat->name())),
                                           reshape, ori_outputs[0]};
  auto concat_op = graph->NewCNode(concat_inputs);
  MS_EXCEPTION_IF_NULL(concat_op);
  std::vector<size_t> input = {AnfAlgo::GetOutputInferShape(node2, 0)[0] + 1, AnfAlgo::GetOutputInferShape(node2, 0)[1],
                               AnfAlgo::GetOutputInferShape(node2, 0)[2]};
  auto types = {AnfAlgo::GetOutputInferDataType(node1, 0)};
  auto shapes = {input};
  AnfAlgo::SetOutputInferTypeAndShape(types, shapes, concat_op.get());
  AnfAlgo::SetNodeAttr(kAttrN, MakeValue(2), concat_op);
  AnfAlgo::SetNodeAttr(kAttrDynInputSizes, MakeValue(std::vector<int>{2}), concat_op);
  AnfAlgo::SetNodeAttr("axis", MakeValue(0), concat_op);
  AnfAlgo::SetNodeAttr("is_backend_insert", MakeValue(true), concat_op);
  return concat_op;
}

AnfNodePtr AddConcatD(const FuncGraphPtr &graph, const AnfNodePtr &node1, const AnfNodePtr &node2) {
  MS_EXCEPTION_IF_NULL(graph);
  MS_EXCEPTION_IF_NULL(node1);
  MS_EXCEPTION_IF_NULL(node2);
  // concatd --> concat
  std::vector<AnfNodePtr> concat_inputs = {NewValueNode(std::make_shared<Primitive>(prim::kPrimConcat->name())), node1,
                                           node2};
  auto concat_op = graph->NewCNode(concat_inputs);
  MS_EXCEPTION_IF_NULL(concat_op);
  std::vector<size_t> input = {AnfAlgo::GetOutputInferShape(node1, 0)[0], AnfAlgo::GetOutputInferShape(node1, 0)[1],
                               AnfAlgo::GetOutputInferShape(node1, 0)[2] + AnfAlgo::GetOutputInferShape(node2, 0)[2]};
  auto types = {AnfAlgo::GetOutputInferDataType(node1, 0)};
  auto shapes = {input};
  AnfAlgo::SetOutputInferTypeAndShape(types, shapes, concat_op.get());
  AnfAlgo::SetNodeAttr(kAttrN, MakeValue(2), concat_op);
  AnfAlgo::SetNodeAttr(kAttrDynInputSizes, MakeValue(std::vector<int>{2}), concat_op);
  AnfAlgo::SetNodeAttr("axis", MakeValue(2), concat_op);
  AnfAlgo::SetNodeAttr("is_backend_insert", MakeValue(true), concat_op);
  return concat_op;
}

AnfNodePtr AddDwReduceSum(const FuncGraphPtr &graph, const AnfNodePtr &node1, const AnfNodePtr &node2) {
  // node1 : dynamic output
  // node2 : matmul
  MS_EXCEPTION_IF_NULL(graph);
  MS_EXCEPTION_IF_NULL(node1);
  MS_EXCEPTION_IF_NULL(node2);
  std::vector<AnfNodePtr> ori_outputs;
  CreateMultipleOutputsOfAnfNode(graph, node1, 5, &ori_outputs);
  // ReduceSumd
  std::vector<AnfNodePtr> reducesum_inputs = {NewValueNode(std::make_shared<Primitive>(prim::kPrimReduceSum->name())),
                                              node2};
  auto reduce_sumd = graph->NewCNode(reducesum_inputs);
  MS_EXCEPTION_IF_NULL(reduce_sumd);
  auto types = {AnfAlgo::GetOutputInferDataType(ori_outputs[0], 0)};
  auto shapes = {AnfAlgo::GetOutputInferShape(ori_outputs[0], 0)};
  AnfAlgo::SetOutputInferTypeAndShape(types, shapes, reduce_sumd.get());
  AnfAlgo::SetNodeAttr(kAttrAxis, MakeValue(std::vector<int>{0}), reduce_sumd);
  AnfAlgo::SetNodeAttr("keep_dims", MakeValue(false), reduce_sumd);
  AnfAlgo::SetNodeAttr("is_backend_insert", MakeValue(true), reduce_sumd);
  return reduce_sumd;
}

AnfNodePtr AddDbReduceSum(const FuncGraphPtr &graph, const AnfNodePtr &node1, const AnfNodePtr &node2) {
  // node1 : lstm output
  // node2 : dynamic output
  MS_EXCEPTION_IF_NULL(graph);
  MS_EXCEPTION_IF_NULL(node1);
  MS_EXCEPTION_IF_NULL(node2);
  std::vector<AnfNodePtr> ori_outputs;
  CreateMultipleOutputsOfAnfNode(graph, node2, 5, &ori_outputs);
  // ReduceSumd --> ReduceSum
  std::vector<AnfNodePtr> reducesum_inputs = {NewValueNode(std::make_shared<Primitive>(prim::kPrimReduceSum->name())),
                                              node1};
  auto reduce_sumd = graph->NewCNode(reducesum_inputs);
  MS_EXCEPTION_IF_NULL(reduce_sumd);
  auto types = {AnfAlgo::GetOutputInferDataType(ori_outputs[1], 0)};
  auto shapes = {AnfAlgo::GetOutputInferShape(ori_outputs[1], 0)};
  AnfAlgo::SetOutputInferTypeAndShape(types, shapes, reduce_sumd.get());
  AnfAlgo::SetNodeAttr(kAttrAxis, MakeValue(std::vector<int>{0, 1}), reduce_sumd);
  AnfAlgo::SetNodeAttr("keep_dims", MakeValue(false), reduce_sumd);
  AnfAlgo::SetNodeAttr("is_backend_insert", MakeValue(true), reduce_sumd);
  return reduce_sumd;
}

const AnfNodePtr DynamicRNNGradFission::Process(const FuncGraphPtr &func_graph, const AnfNodePtr &node,
                                                const EquivPtr &) const {
  MS_EXCEPTION_IF_NULL(func_graph);
  MS_EXCEPTION_IF_NULL(node);
  auto cnode = node->cast<CNodePtr>();
  MS_EXCEPTION_IF_NULL(cnode);
  if (cnode->size() < kDynamicRNNGradInputNum + 1) {
    MS_LOG(INFO) << "The input num of DynamicRNNGrad is less than " << kDynamicRNNGradInputNum
                 << ". The node should not be changed";
    return nullptr;
  }
  // input_list of dynamic_rnn_grad
  const auto &ori_inputs = cnode->inputs();
  // create split_vd
  auto split_vd = CreateSplitVD(func_graph, ori_inputs[7]);
  // create concat_1
  auto h_concat = AddHConcatD(func_graph, ori_inputs[5], split_vd);
  // create concat_2
  auto concat = AddConcatD(func_graph, ori_inputs[1], h_concat);
  // create lstm_input_grad
  auto lstm_input_grad = CreateLSTMInputGrad(func_graph, cnode);
  std::vector<AnfNodePtr> lstm_outputs;
  CreateMultipleOutputsOfAnfNode(func_graph, lstm_input_grad, kLSTMInputGradOutputNum, &lstm_outputs);
  // create matmul
  auto batch_matmul = CreateBatchMatMul(func_graph, lstm_outputs[3], concat);
  // create reduce_sum_1
  auto dw_reduce_sum = AddDwReduceSum(func_graph, node, batch_matmul);
  // create reduce_sum_2
  auto db_reduce_sum = AddDbReduceSum(func_graph, lstm_outputs[3], node);
  std::vector<AnfNodePtr> make_tuple_inputs = {NewValueNode(prim::kPrimMakeTuple),
                                               dw_reduce_sum,
                                               db_reduce_sum,
                                               lstm_outputs[0],
                                               lstm_outputs[1],
                                               lstm_outputs[2]};
  auto make_tuple = func_graph->NewCNode(make_tuple_inputs);
  MS_EXCEPTION_IF_NULL(make_tuple);
  return make_tuple;
}
}  // namespace opt
}  // namespace mindspore

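To make the fission above easier to follow, here is the shape arithmetic it encodes, under assumed dimensions t (time steps), b (batch), i (input size), and h (hidden size). The numbers mirror the SetOutputInferTypeAndShape calls in the pass rather than any MindSpore API:

```cpp
// Shape walk-through of DynamicRNNGradFission: SplitV cuts the h states into
// {t-1, b, h} and {1, b, h}; AddHConcatD rebuilds {t, b, h}; AddConcatD joins
// x with the h states along the last axis into {t, b, i+h}; CreateBatchMatMul
// (transpose_x1 = true) computes concat^T x dgate, giving {t, i+h, 4h}.
#include <cstdio>

int main() {
  const int t = 8, b = 32, i = 64, h = 128;
  int split0[3] = {t - 1, b, h};                     // SplitV, first output
  int split1[3] = {1, b, h};                         // SplitV, second output
  int h_concat[3] = {split1[0] + split0[0], b, h};   // == {t, b, h}
  int concat[3] = {t, b, i + h};                     // x joined with h states
  int dgate[3] = {t, b, 4 * h};                      // LSTMInputGrad's reshaped output
  int matmul[3] = {concat[0], concat[2], dgate[2]};  // {t, i+h, 4h}
  std::printf("h_concat {%d,%d,%d}, batch_matmul {%d,%d,%d}\n", h_concat[0], h_concat[1], h_concat[2], matmul[0],
              matmul[1], matmul[2]);
  return 0;
}
```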
@ -0,0 +1,33 @@
/**
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#ifndef MINDSPORE_CCSRC_BACKEND_OPTIMIZER_ASCEND_IR_FISSION_DYNAMIC_RNN_GRAD_FISSION_H_
#define MINDSPORE_CCSRC_BACKEND_OPTIMIZER_ASCEND_IR_FISSION_DYNAMIC_RNN_GRAD_FISSION_H_

#include "backend/optimizer/common/optimizer.h"

namespace mindspore {
namespace opt {
class DynamicRNNGradFission : public PatternProcessPass {
 public:
  explicit DynamicRNNGradFission(bool multigraph = true) : PatternProcessPass("dynamic_rnn_grad_fission", multigraph) {}
  ~DynamicRNNGradFission() override = default;
  const BaseRef DefinePattern() const override;
  const AnfNodePtr Process(const FuncGraphPtr &, const AnfNodePtr &, const EquivPtr &) const override;
};
}  // namespace opt
}  // namespace mindspore

#endif  // MINDSPORE_CCSRC_BACKEND_OPTIMIZER_ASCEND_IR_FISSION_DYNAMIC_RNN_GRAD_FISSION_H_

@ -21,11 +21,10 @@

namespace mindspore {
namespace opt {
const std::set<std::pair<string, string>> invalid_formats_pair = {{kOpFormat_C1HWNCoC0, kOpFormat_NCHW},
                                                                  {kOpFormat_NCHW, kOpFormat_C1HWNCoC0},
                                                                  {kOpFormat_C1HWNCoC0, kOpFormat_DEFAULT},
                                                                  {kOpFormat_DEFAULT, kOpFormat_FRACTAL_ZN_LSTM},
                                                                  {kOpFormat_DEFAULT, kOpFormat_C1HWNCoC0}};
const std::set<std::pair<string, string>> invalid_formats_pair = {
  {kOpFormat_C1HWNCoC0, kOpFormat_NCHW},          {kOpFormat_NCHW, kOpFormat_C1HWNCoC0},
  {kOpFormat_C1HWNCoC0, kOpFormat_DEFAULT},       {kOpFormat_DEFAULT, kOpFormat_FRACTAL_ZN_LSTM},
  {kOpFormat_FRACTAL_ZN_LSTM, kOpFormat_DEFAULT}, {kOpFormat_DEFAULT, kOpFormat_C1HWNCoC0}};

bool TransDataSplit::Run(const FuncGraphPtr &func_graph) {
  MS_EXCEPTION_IF_NULL(func_graph);

@ -83,6 +82,9 @@ bool TransDataSplit::DoSplit(const FuncGraphPtr &func_graph, const AnfNodePtr &n
    new_transpose_node = NewTransOpNode(func_graph, AnfAlgo::GetInputNode(node->cast<CNodePtr>(), 0), kernel_select_,
                                        false, prim::kPrimTranspose->name());
    AnfAlgo::SetNodeAttr(kAttrPerm, MakeValue(std::vector<int>{2, 3, 1, 0}), new_transpose_node);
    if (output_format == kOpFormat_FRACTAL_ZN_LSTM) {
      AnfAlgo::SetNodeAttr("nop_op", MakeValue(true), new_transpose_node);
    }
    RefreshKernelBuildInfo(input_format, kOpFormat_HWCN, new_transpose_node);

    // trans hwcn to output_format

@ -404,7 +404,11 @@ bool IsNopNode(const AnfNodePtr &node) {
  }
  CNodePtr cnode = node->cast<CNodePtr>();
  MS_EXCEPTION_IF_NULL(cnode);
  if (nop_nodes.find(AnfAlgo::GetCNodeName(cnode)) == nop_nodes.end()) {
  bool is_nop_node = false;
  if (AnfAlgo::HasNodeAttr("nop_op", cnode)) {
    is_nop_node = AnfAlgo::GetNodeAttr<bool>(cnode, "nop_op");
  }
  if (nop_nodes.find(AnfAlgo::GetCNodeName(cnode)) == nop_nodes.end() && !is_nop_node) {
    return false;
  }
  return true;

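A condensed restatement of the updated IsNopNode logic: a node counts as a nop if its name is in the static nop set or it carries a "nop_op" attribute set to true (the attribute the passes earlier in this change stamp onto their inserted Reshape and Transpose nodes). The nop-set contents below are assumed for illustration only:

```cpp
// Attribute-or-name nop check, mirroring the two-condition test above.
#include <iostream>
#include <map>
#include <set>
#include <string>

bool IsNopNode(const std::string &name, const std::map<std::string, bool> &attrs) {
  static const std::set<std::string> nop_nodes = {"Reshape", "ExpandDims", "Squeeze", "Flatten"};  // assumed set
  bool is_nop_node = false;
  auto it = attrs.find("nop_op");
  if (it != attrs.end()) {
    is_nop_node = it->second;  // attribute value wins when present
  }
  return nop_nodes.count(name) != 0 || is_nop_node;
}

int main() {
  std::cout << IsNopNode("Transpose", {{"nop_op", true}}) << '\n';  // 1: attribute marks it a nop
  std::cout << IsNopNode("Transpose", {}) << '\n';                  // 0: name alone is not enough
}
```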
@ -26,7 +26,7 @@ namespace mindspore {
namespace memreuse {
enum RefCountType { kDynamicRefCount, kStaticRefCount };
enum NodeType { kCommonNode, kCommunicationNode };
enum KernelRefType { kCommon, kRefNodeInput, kRefNodeOutput, kCommNotReuse, kCommReuse, kSummary };
enum KernelRefType { kCommon, kRefNodeInput, kRefNodeOutput, kCommNotReuse, kCommReuse, kSummary, kAllocated };
static constexpr int kInitIndex = -1;
class KernelRefCount {
 public:

@ -76,6 +76,9 @@ bool MemReuseUtil::InitDynamicOutputKernelRef() {
      }
    } else {
      kernel_ref->type_ = kCommon;
      if (AnfAlgo::OutputAddrExist(kernel_cnode, output_index)) {
        kernel_ref->type_ = kAllocated;
      }
    }
  }
  kernel_refs.push_back(kernel_ref);

@ -16,6 +16,7 @@
#include "backend/optimizer/mem_reuse/mem_reuse_allocator.h"
#include "backend/optimizer/mem_reuse/mem_reuse.h"
#include "backend/optimizer/mem_reuse/mem_reuse_checker.h"
#include "utils/ms_context.h"
#ifdef ENABLE_D
#include "runtime/device/ascend/ascend_stream_assign.h"
#endif

@ -32,12 +33,22 @@ void BestFitMemReuse::InitMemReuseInfo(const MemReuseUtil *mem_reuse_util_ptr) {
  set_workspace_ptr_list(mem_reuse_util_ptr->total_wk_ref_list());
  set_op_ptr_list(mem_reuse_util_ptr->kernel_def_ptr_list());
  // check info correctness
  auto context_ptr = MsContext::GetInstance();
  MS_EXCEPTION_IF_NULL(context_ptr);
  std::string device_target = context_ptr->get_param<std::string>(MS_CTX_DEVICE_TARGET);
  // CPU device does not need to be aligned
  if (device_target != kCPUDevice) {
    MS_LOG(DEBUG) << "Align Memory Size";
    for (auto &tensor : tensor_ptr_list_) {
      tensor->size_ = AlignCommonMemorySize(tensor->size_);
    }
  // align wk size to 512 && refcount == 1
    // align wk size to 512
    for (auto &wk : wk_tensor_list_) {
      wk->size_ = AlignCommonMemorySize(wk->size_);
    }
  }
  // set wk refcount == 1
  for (auto &wk : wk_tensor_list_) {
    wk->ref_count_ = 1;
  }
#ifdef ENABLE_D

@ -155,6 +166,9 @@ void BestFitMemReuse::AssignCommonNodeOutputOffset() {
      // get align size for communication op's single input
      tensor_desc->size_ = AlignCommunicationMemorySize(tensor_desc->size_);
      total_comm_reuse_size += tensor_desc->size_;
    } else if (tensor_desc->type_ == kAllocated) {
      // no need to allocate memory that has already been allocated
      continue;
    }

    auto reusable_membuf_map = GetReusableMembufMap(tensor_desc->size_);

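The alignment applied above rounds every tensor and workspace size up to a fixed unit before reuse planning, so that reused buffers always satisfy the device's allocation granularity. The real unit lives in AlignCommonMemorySize; 512 bytes is assumed in this sketch:

```cpp
// Round-up alignment sketch; kAlign = 512 is an assumption for illustration.
#include <cstddef>
#include <cstdio>

constexpr std::size_t kAlign = 512;  // assumed alignment unit

std::size_t AlignCommonMemorySize(std::size_t size) {
  return ((size + kAlign - 1) / kAlign) * kAlign;  // round up to a multiple of kAlign
}

int main() {
  std::printf("%zu -> %zu\n", static_cast<std::size_t>(100), AlignCommonMemorySize(100));    // 100 -> 512
  std::printf("%zu -> %zu\n", static_cast<std::size_t>(1024), AlignCommonMemorySize(1024));  // 1024 -> 1024
}
```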
@ -516,6 +516,10 @@ std::vector<size_t> AnfRuntimeAlgorithm::GetInputDeviceShape(const AnfNodePtr &n
  if (trans::IsNeedPadding(format, infer_shape.size())) {
    infer_shape = trans::PaddingShapeTo4d(infer_shape, GetInputReshapeType(node, input_idx));
  }
  if (node->isa<CNode>() && GetCNodeName(node) == kExtractImagePatchesOpName) {
    auto shape_tmp = {infer_shape[0], infer_shape[3], infer_shape[1], infer_shape[2]};
    return trans::TransShapeToDevice(shape_tmp, format);
  }
  return trans::TransShapeToDevice(infer_shape, format);
}

@ -179,6 +179,7 @@ GraphId AscendSession::CompileGraph(NotNull<FuncGraphPtr> func_graph) {
    debugger_->PreExecute(root_graph);
  }
#endif
  SetSummaryNodes(root_graph.get());
  // alloc mem
  MemoryAlloc(root_graph.get());
  // generate and load task into device

@ -75,6 +75,16 @@ GraphId CPUSession::CompileGraph(const AnfNodePtrList &lst, const AnfNodePtrList
#endif
  MS_LOG(INFO) << "Build kernel";
  BuildKernel(graph.get());

  // Set graph execution order before memory alloc, to ensure memory alloc follows the reordered graph
  auto execution_order = graph->execution_order();
  Reorder(&execution_order);
  graph->set_execution_order(execution_order);

  MS_LOG(INFO) << "Start kernel runtime";
  if (!runtime_.Init()) {
    MS_LOG(EXCEPTION) << "CPU start kernel runtime failed";
  }
  MS_LOG(INFO) << "Assign kernel address";
  runtime_.AssignKernelAddress(graph.get());
  return graph_id;

@ -98,11 +108,8 @@ void CPUSession::RunGraph(const GraphId &graph_id, const std::vector<tensor::Ten
#endif

  MS_LOG(INFO) << "Run graph start";
  auto execution_order = kernel_graph->execution_order();
  Reorder(&execution_order);

  bool enable_summary = summary_callback_ != nullptr;
  kernel_graph->set_execution_order(execution_order);
  NamedSummaryOutputs summary_outputs;
  if (enable_summary) {
    SetSummaryNodes(kernel_graph.get());

@ -125,7 +125,7 @@ class CreateCommGroupTask : public Task {
  void Run() override;
  std::string group_name_;
  std::vector<uint32_t> ranks_;
  bool result_;
  bool result_{false};
};

class DestroyCommGroupTask : public Task {

@ -134,7 +134,7 @@ class DestroyCommGroupTask : public Task {
  ~DestroyCommGroupTask() override = default;
  void Run() override;
  std::string group_name_;
  bool result_;
  bool result_{false};
};

class ExitTask : public Task {

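Why result_{false} matters: a plain bool member of a task object like the ones above is indeterminate until Run() assigns it, so reading it before the task executes is undefined behavior. A default member initializer closes that window; a minimal repro:

```cpp
// In-class default member initializer: result_ is deterministic even if the
// task's result is queried before Run() ever executes.
#include <iostream>

struct Task {
  bool result_{false};  // without {false}, this read-before-Run() is UB
  void Run() { result_ = true; }
};

int main() {
  Task t;
  std::cout << std::boolalpha << t.result_ << '\n';  // false, never garbage
  t.Run();
  std::cout << t.result_ << '\n';  // true
}
```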
@ -64,8 +64,11 @@ void E2eDumpUtil::DumpGPUMemToFile(const std::string &file_path, const std::stri
                                   const ShapeVector &int_shapes, const TypeId &type, size_t slot, Debugger *debugger) {
#ifdef ENABLE_DEBUGGER
  auto format = kOpFormat_DEFAULT;
  MS_EXCEPTION_IF_NULL(debugger);
  DebugServices *debug_services = debugger->debug_services();
  MS_EXCEPTION_IF_NULL(debug_services);
  TensorLoader *tensor_loader = debug_services->tensor_loader();
  MS_EXCEPTION_IF_NULL(tensor_loader);
  auto ret = tensor_loader->DumpTensorToFile(original_kernel_name, trans_flag, file_path, format, int_shapes, type,
                                             addr->type_id(), addr->format(), slot);
  if (!ret) {

@ -309,7 +309,7 @@ void Debugger::PostDebugOp() {
  }
}

std::map<std::pair<uint32_t, uint32_t>, std::string> &Debugger::GetStreamTaskToOpnameMap() {
std::map<std::pair<uint32_t, uint32_t>, std::string> Debugger::GetStreamTaskToOpnameMap() const {
  return stream_task_to_opname_;
}

@ -754,7 +754,8 @@ uint64_t BytestoInt64(const std::vector<char> &buffer) {
  uint64_t ret;

  ret = ((uint64_t)buffer[7] << 56) | ((uint64_t)buffer[6] << 48) | ((uint64_t)buffer[5] << 40) |
        ((uint64_t)buffer[4] << 32) | (buffer[3] << 24) | (buffer[2] << 16) | (buffer[1] << 8) | buffer[0];
        ((uint64_t)buffer[4] << 32) | ((uint64_t)buffer[3] << 24) | ((uint64_t)buffer[2] << 16) |
        ((uint64_t)buffer[1] << 8) | ((uint64_t)buffer[0]);

  return ret;
}

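The corrected expression promotes every byte to uint64_t before shifting; without the cast, buffer[3] << 24 shifts a char promoted to int, which can overflow (and sign-extend) for byte values of 0x80 and above. The sketch below checks the same little-endian reassembly; it additionally routes each byte through unsigned char, a stricter precaution against sign extension of negative char values than the patch itself takes:

```cpp
// Little-endian 8-byte reassembly with explicit width control.
#include <cstdint>
#include <iostream>
#include <vector>

uint64_t BytestoInt64(const std::vector<char> &buffer) {
  uint64_t ret = 0;
  for (int i = 7; i >= 0; --i) {
    // unsigned char first, so a byte like 0x80 contributes 0x80, not 0xFF...80
    ret = (ret << 8) | static_cast<uint64_t>(static_cast<unsigned char>(buffer[i]));
  }
  return ret;
}

int main() {
  std::vector<char> bytes = {'\x01', '\x00', '\x00', '\x80', '\x00', '\x00', '\x00', '\x00'};
  std::cout << std::hex << BytestoInt64(bytes) << '\n';  // 80000001
}
```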
@ -94,7 +94,7 @@ class Debugger : public std::enable_shared_from_this<Debugger> {

  int32_t step_num() const;

  std::map<std::pair<uint32_t, uint32_t>, std::string> &GetStreamTaskToOpnameMap();
  std::map<std::pair<uint32_t, uint32_t>, std::string> GetStreamTaskToOpnameMap() const;

  // check if any feature that uses the debugger backend is enabled
  bool DebuggerBackendEnabled();

@ -79,9 +79,9 @@ std::vector<std::string> ChunkString(std::string str, int graph_size) {
    }
    std::string buffer;
    buffer.resize(chunk_size);
    errno_t err = memcpy_s(reinterpret_cast<char *>(buffer.data()), chunk_size, str.data() + size_iter, chunk_size);
    auto err = memcpy_s(reinterpret_cast<char *>(buffer.data()), chunk_size, str.data() + size_iter, chunk_size);
    if (err != 0) {
      MS_LOG(ERROR) << "memcpy_s failed. err code is: " << err;
      MS_LOG(EXCEPTION) << "memcpy_s failed. errorno is: " << err;
    }
    buf.push_back(buffer);
    size_iter += CHUNK_SIZE;

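A behavior-level sketch of the chunking loop above, using std::string::substr in place of memcpy_s (substr cannot fail the way memcpy_s can, which is why the patched code escalates a nonzero error code from MS_LOG(ERROR) to MS_LOG(EXCEPTION)). The chunk size is a plain parameter here; the real code uses the CHUNK_SIZE constant defined elsewhere in the file:

```cpp
// String chunking without the manual buffer management of the original.
#include <iostream>
#include <string>
#include <vector>

std::vector<std::string> ChunkString(const std::string &str, std::size_t chunk_size) {
  std::vector<std::string> buf;
  for (std::size_t pos = 0; pos < str.size(); pos += chunk_size) {
    buf.push_back(str.substr(pos, chunk_size));  // last chunk may be shorter
  }
  return buf;
}

int main() {
  for (const auto &c : ChunkString("abcdefgh", 3)) {
    std::cout << c << '\n';  // abc / def / gh
  }
}
```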
@ -255,7 +255,7 @@ class PynativeEliminater : public OptimizerCaller {
    MS_LOG(DEBUG) << "Start FillZero";
    ValuePtr out = nullptr;
    if (value->isa<Int32Imm>()) {
      return value;
      return MakeValue(value->cast<Int32ImmPtr>()->value());
    }

    if (value->isa<tensor::Tensor>()) {

@ -298,9 +298,10 @@ class PynativeEliminater : public OptimizerCaller {
    if (rep != nullptr) {
      if (rep->isa<ValueNode>()) {
        auto value_node = rep->cast<ValueNodePtr>();
        value_node->set_value(FillZero(value_node->value()));
        auto new_value_node = NewValueNode(FillZero(value_node->value()));
        new_value_node->set_has_new_value(value_node->has_new_value());
        MS_LOG(DEBUG) << "Zeros_like replace ok " << rep->DebugString(4);
        return rep;
        return new_value_node;
      }
    }
  }

@ -315,9 +316,10 @@ class PynativeEliminater : public OptimizerCaller {
    if (rep != nullptr) {
      if (rep->isa<ValueNode>()) {
        auto value_node = rep->cast<ValueNodePtr>();
        value_node->set_value(FillZero(value_node->value()));
        auto new_value_node = NewValueNode(FillZero(value_node->value()));
        new_value_node->set_has_new_value(value_node->has_new_value());
        MS_LOG(DEBUG) << "Zeros_like replace ok 2 " << rep->DebugString(4);
        return rep;
        return new_value_node;
      }
    }
  }

@ -69,7 +69,7 @@ PYBIND_REGISTER(Tensor, 0, ([](const py::module *m) {
                   auto &tensor = py::cast<Tensor &>(t);
                   if (tensor.type() == DataType::DE_STRING) {
                     py::array res;
                     tensor.GetDataAsNumpyStrings(&res);
                     THROW_IF_ERROR(tensor.GetDataAsNumpyStrings(&res));
                     return res;
                   }
                   py::buffer_info info;

@ -30,6 +30,8 @@ namespace mindspore {
namespace dataset {
std::unique_ptr<Services> Services::instance_ = nullptr;
std::once_flag Services::init_instance_flag_;
std::set<std::string> Services::unique_id_list_ = {};
std::mutex Services::unique_id_mutex_;

#if !defined(_WIN32) && !defined(_WIN64) && !defined(__ANDROID__) && !defined(ANDROID)
std::string Services::GetUserName() {

@ -52,9 +54,24 @@ std::string Services::GetUniqueID() {
  std::mt19937 gen = GetRandomDevice();
  std::uniform_int_distribution<uint32_t> dist(0, kStr.size() - 1);
  char buffer[UNIQUEID_LEN];
  {
    std::unique_lock<std::mutex> lock(unique_id_mutex_);
    while (true) {
      auto ret = memset_s(buffer, UNIQUEID_LEN, 0, UNIQUEID_LEN);
      if (ret != 0) {
        MS_LOG(ERROR) << "memset_s error, errorno(" << ret << ")";
        return std::string("");
      }
      for (int i = 0; i < UNIQUEID_LEN; i++) {
        buffer[i] = kStr[dist(gen)];
      }
      if (unique_id_list_.find(std::string(buffer, UNIQUEID_LEN)) != unique_id_list_.end()) {
        continue;
      }
      unique_id_list_.insert(std::string(buffer, UNIQUEID_LEN));
      break;
    }
  }
  return std::string(buffer, UNIQUEID_LEN);
}

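A standalone sketch of the de-duplicated ID generation above: draw random characters, retry when the candidate already exists, and keep the whole check-and-insert under one lock so concurrent callers cannot race between the lookup and the insertion. The alphabet and length below are assumptions, not the real kStr and UNIQUEID_LEN:

```cpp
// Thread-safe unique ID generator with a retry-on-collision loop.
#include <cstddef>
#include <iostream>
#include <mutex>
#include <random>
#include <set>
#include <string>

std::string GetUniqueID() {
  static const std::string kStr = "0123456789abcdef";  // assumed alphabet
  static const std::size_t kLen = 8;                   // assumed ID length
  static std::set<std::string> unique_id_list;
  static std::mutex unique_id_mutex;
  static std::mt19937 gen{std::random_device{}()};
  std::uniform_int_distribution<std::size_t> dist(0, kStr.size() - 1);
  std::lock_guard<std::mutex> lock(unique_id_mutex);  // guards the set and the generator
  while (true) {
    std::string candidate;
    for (std::size_t i = 0; i < kLen; i++) {
      candidate += kStr[dist(gen)];
    }
    if (unique_id_list.insert(candidate).second) {  // insert() fails on a collision
      return candidate;
    }
  }
}

int main() {
  std::cout << GetUniqueID() << '\n' << GetUniqueID() << '\n';  // two distinct IDs
}
```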
@ -19,6 +19,7 @@
#include <algorithm>
#include <memory>
#include <mutex>
#include <set>
#include <string>
#include <vector>
#include "minddata/dataset/util/memory_pool.h"

@ -97,6 +98,8 @@ class Services {
 private:
  static std::once_flag init_instance_flag_;
  static std::unique_ptr<Services> instance_;
  static std::set<std::string> unique_id_list_;
  static std::mutex unique_id_mutex_;
  // A small pool used for small objects that last until the
  // Services Manager shuts down. Used by all sub-services.
  std::shared_ptr<MemoryPool> pool_;

@ -1790,38 +1790,92 @@ bool UpdateFuncGraphFlags(py::object obj, const FuncGraphPtr &func_graph) {
  return true;
}

// Generate and copy a ValueNode, or a CNode with its child nodes
static AnfNodePtr CopyNodesFromParamDefaultValue(const FuncGraphPtr func_graph, const AnfNodePtr &param_node) {
  MS_EXCEPTION_IF_NULL(param_node);
  if (param_node->isa<ValueNode>()) {
    return std::make_shared<ValueNode>(param_node->cast<ValueNodePtr>()->value());
  }

  // Parameter default value is a CNode.
  std::size_t index = 0;
  std::vector<AnfNodePtr> old_cnodes;
  old_cnodes.emplace_back(param_node);
  auto res = func_graph->NewCNode({});
  std::vector<CNodePtr> new_cnodes;
  new_cnodes.emplace_back(res);
  while (index < old_cnodes.size()) {
    auto current = old_cnodes[index];
    auto current_new_cnode = new_cnodes[index];
    index++;
    MS_EXCEPTION_IF_NULL(current);
    if (current->isa<CNode>()) {
      auto &inputs = current->cast<CNodePtr>()->inputs();
      for (auto it = inputs.begin(); it != inputs.end(); it++) {
        AnfNodePtr input = *it;
        if (input != nullptr && input->isa<CNode>()) {
          old_cnodes.emplace_back(input);
          auto new_cnode = func_graph->NewCNode({});
          new_cnodes.emplace_back(new_cnode);
          current_new_cnode->add_input(new_cnode);
        } else if (input->isa<ValueNode>()) {
          current_new_cnode->add_input(std::make_shared<ValueNode>(input->cast<ValueNodePtr>()->value()));
        } else {
          MS_LOG(EXCEPTION) << "Wrong type item in default parameters: " << input->ToString();
        }
      }
    }
  }
  return res;
}

FuncGraphPtr MakeTopGraph(const py::object &cell, const ValuePtr &cell_ptr) {
  auto current_graph = dyn_cast<FuncGraph>(cell_ptr);
  if (current_graph == nullptr) {
    MS_LOG(EXCEPTION) << "Current graph cast failed from " << cell_ptr->ToString();
  }

  auto func_graph = std::make_shared<FuncGraph>();
  func_graph->debug_info()->set_name("top");
  func_graph->debug_info()->set_name(current_graph->debug_info()->name() + "_wrapper");

  // def top(*arg, *kwargs):
  auto param_vargs = func_graph->add_parameter();
  auto args_name = "args";
  param_vargs->set_name(args_name);
  param_vargs->debug_info()->set_name(args_name);

  auto param_vkwargs = func_graph->add_parameter();
  args_name = "kwargs";
  param_vkwargs->set_name(args_name);
  param_vkwargs->debug_info()->set_name(args_name);

  func_graph->set_has_vararg(true);
  func_graph->set_has_kwarg(true);
  func_graph->set_kwonlyargs_count(0);
  // Copy all parameters information
  for (auto &para : current_graph->parameters()) {
    auto param = func_graph->add_parameter();
    auto orig_param = para->cast<ParameterPtr>();
    auto name = orig_param->name();
    param->set_name(name);
    param->debug_info()->set_name(name);
  }
  func_graph->set_has_vararg(current_graph->has_vararg());
  func_graph->set_has_kwarg(current_graph->has_kwarg());
  func_graph->set_kwonlyargs_count(current_graph->kwonlyargs_count());
  // Copy all default values
  for (auto &d : current_graph->parameter_default_value()) {
    func_graph->set_param_default_value(d.first, CopyNodesFromParamDefaultValue(func_graph, d.second));
  }

  // cell_obj
  MS_LOG(DEBUG) << "add Flag for " << std::string(py::str(cell));
  parse::UpdateFuncGraphFlags(cell, func_graph);
  // top graph's construct flag
  if (py::hasattr(cell, "construct")) {
    parse::UpdateFuncGraphFlags(cell.attr("construct"), func_graph);
  }

  auto unpacking = func_graph->has_vararg() || func_graph->has_kwarg();
  if (!unpacking) {
    std::vector<AnfNodePtr> inputs;
    inputs.emplace_back(NewValueNode(cell_ptr));
    auto &params = func_graph->parameters();
    (void)std::transform(params.begin(), params.end(), std::back_inserter(inputs),
                         [](AnfNodePtr node) -> AnfNodePtr { return node; });
    func_graph->set_output(func_graph->NewCNode(inputs));
  } else {
    // ret = cell_obj(*arg, *kwargs)
    auto call_fn = MakeUnpackCall(func_graph, NewValueNode(cell_ptr), {param_vargs, param_vkwargs});

    auto call_fn = MakeUnpackCall(func_graph, NewValueNode(cell_ptr), func_graph->parameters());
    // return ret
    func_graph->set_output(call_fn);
    MS_LOG(DEBUG) << "add Flag for " << std::string(py::str(cell));
  }
  return func_graph;
}
}  // namespace parse

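CopyNodesFromParamDefaultValue walks the original node graph and its copy in lockstep: two parallel worklists indexed by the same cursor, so old_cnodes[k] always corresponds to new_cnodes[k], and children are enqueued in both lists at the same time. The same technique on a toy node type (an illustrative stand-in, not MindSpore API):

```cpp
// Iterative lockstep copy of a node tree using parallel worklists.
#include <iostream>
#include <memory>
#include <vector>

struct Node {
  int value = 0;
  std::vector<std::shared_ptr<Node>> inputs;
};

std::shared_ptr<Node> CopyTree(const std::shared_ptr<Node> &root) {
  std::vector<std::shared_ptr<Node>> old_nodes{root};
  auto res = std::make_shared<Node>();
  res->value = root->value;
  std::vector<std::shared_ptr<Node>> new_nodes{res};
  std::size_t index = 0;
  while (index < old_nodes.size()) {
    auto current = old_nodes[index];
    auto current_new = new_nodes[index];  // same index, same position in the copy
    index++;
    for (const auto &input : current->inputs) {
      old_nodes.push_back(input);            // enqueue the original child...
      auto copy = std::make_shared<Node>();  // ...and its fresh counterpart
      copy->value = input->value;
      new_nodes.push_back(copy);
      current_new->inputs.push_back(copy);
    }
  }
  return res;
}

int main() {
  auto leaf = std::make_shared<Node>();
  leaf->value = 2;
  auto root = std::make_shared<Node>();
  root->value = 1;
  root->inputs.push_back(leaf);
  auto copy = CopyTree(root);
  std::cout << copy->inputs[0]->value << '\n';  // 2, reached via the copied child
}
```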
@ -260,7 +260,7 @@ py::object DoAutoCast(const py::object &arg, const TypeId &type_id) {
py::object DoParamMixPrecisionCast(bool *is_cast, const py::object obj) {
  auto tensor = py::cast<tensor::TensorPtr>(obj);
  auto cast_type = tensor->cast_dtype();
  py::object cast_output;
  py::object cast_output = obj;
  if (cast_type != nullptr) {
    auto source_element = tensor->Dtype();
    if (source_element != nullptr && IsSubType(source_element, kFloat) && *source_element != *cast_type) {

@ -282,12 +282,15 @@ py::object DoParamMixPrecisionCastTuple(bool *is_cast, const py::tuple tuple) {
      result[i] = DoParamMixPrecisionCast(is_cast, tuple[i]);
    } else if (py::isinstance<py::tuple>(tuple[i])) {
      result[i] = DoParamMixPrecisionCastTuple(is_cast, tuple[i]);
    } else {
      result[i] = tuple[i];
    }
  }
  return result;
}

bool GetSignatureType(const PrimitivePyPtr &prim, std::vector<SignatureEnumDType> *dtypes) {
  MS_EXCEPTION_IF_NULL(dtypes);
  auto signature = prim->signatures();
  bool has_sig_dtype = false;
  (void)std::transform(signature.begin(), signature.end(), std::back_inserter(*dtypes),

@ -733,20 +736,29 @@ ValuePtr PynativeExecutor::GetForwardValue(const OpExecInfoPtr &op_exec_info) {

AnfNodePtr PynativeExecutor::MakeCNode(const OpExecInfoPtr &op_exec_info, std::vector<bool> *op_masks,
                                       abstract::AbstractBasePtrList *args_spec_list) {
  MS_EXCEPTION_IF_NULL(op_masks);
  MS_EXCEPTION_IF_NULL(args_spec_list);
  CNodePtr cnode = nullptr;
  std::vector<AnfNodePtr> inputs;

  auto prim = op_exec_info->py_primitive;
  const auto &signature = prim->signatures();

  inputs.push_back(NewValueNode(prim));

  size_t size = op_exec_info->op_inputs.size();
  auto sig_size = signature.size();
  // ignore signature for cast op
  if (sig_size > 0 && sig_size != size) {
    MS_EXCEPTION(ValueError) << op_exec_info->op_name << " inputs size " << size << " does not match the required "
                             << "inputs size " << sig_size;
  }
  bool is_cast_op = (op_exec_info->op_name == "Cast");
  if (!is_cast_op) {
    const auto &signature = prim->signatures();
    for (size_t i = 0; i < size; i++) {
      auto obj = op_exec_info->op_inputs[i];
      auto sig = SignatureEnumRW::kRWDefault;
      if (signature.size() > 0) {
      if (sig_size > 0) {
        sig = signature[i].rw;
      }
      MS_LOG(DEBUG) << "check mix precision " << op_exec_info->op_name << " input " << i << " "

@ -1173,10 +1185,6 @@ py::tuple RunOp(const py::args &args) {
    std::ostringstream oss;
    trace::TraceGraphEval();
    trace::GetEvalStackInfo(oss);
    // call py::print to output function call stack to STDOUT, in case of output the log to file, the user can see
    // these info from screen, no need to open log file to find these info
    py::print(oss.str());
    MS_LOG(ERROR) << oss.str();
    PynativeExecutor::GetInstance()->Clean();
    // re-throw this exception to Python interpreter to handle it
    throw(py::error_already_set(ex));

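The DoParamMixPrecisionCast change follows a simple rule: initialize the output to the input and overwrite it only when a cast actually fires, so every return path yields a valid object instead of a default-constructed one. Distilled to its core, with strings standing in for py::object:

```cpp
// Initialize-then-maybe-overwrite: no return path can yield an unset value.
#include <iostream>
#include <string>

std::string MaybeCast(const std::string &obj, bool needs_cast) {
  std::string cast_output = obj;  // safe default: pass the input through unchanged
  if (needs_cast) {
    cast_output = obj + ":casted";  // overwrite only on the cast path
  }
  return cast_output;
}

int main() {
  std::cout << MaybeCast("tensor", false) << '\n';  // tensor
  std::cout << MaybeCast("tensor", true) << '\n';   // tensor:casted
}
```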
@ -406,8 +406,10 @@ void GPUProfiler::OpDataProducerBegin(const std::string op_name, void *stream) {
    CHECK_CUDA_RET_WITH_ERROR(cudaEventRecord(op_event_start_, (CUstream)stream_),
                              "cudaEventRecord op event start failed");
    op_host_time_start_ = GetHostTimeStamp();
    op_cupti_time_start_ = GetCUPTITimeStamp();
  } else {
    op_host_time_start_ = GetHostTimeStamp();
    op_cupti_time_start_ = GetCUPTITimeStamp();
  }
  SetRunTimeData(op_name, stream);
}

@ -431,7 +433,7 @@ void GPUProfiler::OpDataProducerEnd() {
  }
  MS_LOG(DEBUG) << "Host Time Elapsed(us)," << op_name_ << "," << op_time_elapsed;
  SetRunTimeData(op_name_, op_time_elapsed);
  SetRunTimeData(op_name_, op_host_time_start_, op_time_elapsed);
  SetRunTimeData(op_name_, op_cupti_time_start_, op_time_elapsed);
}

void GPUProfiler::StopCUPTI() {

@ -170,6 +170,7 @@ class GPUProfiler {
  cudaEvent_t op_event_stop_;
  uint64_t op_host_time_start_;
  uint64_t op_host_time_stop_;
  uint64_t op_cupti_time_start_;
  std::string profile_data_path_;
};
}  // namespace gpu

@ -455,7 +455,7 @@ REGISTER_PYBIND_DEFINE(Tensor, ([](const py::module *m) {
                             >>> data.set_dtype(mindspore.int32)
                             mindspore.int32
                             )mydelimiter")
                           .def("set_cast_dtype", &Tensor::set_cast_dtype)
                           .def("set_cast_dtype", &Tensor::set_cast_dtype, py::arg("dtype") = nullptr)
                           .def("__str__", &Tensor::ToString)
                           .def("__repr__", &Tensor::ToStringRepr)
                           .def(py::pickle(

@ -14,12 +14,12 @@ endif ()

if (ENABLE_CPU)
    file(GLOB_RECURSE CPU_SRC_LIST RELATIVE ${CMAKE_CURRENT_SOURCE_DIR} "cpu/*.cc")
    list(REMOVE_ITEM CPU_SRC_LIST "cpu/mpi/mpi_adapter.cc", "cpu/mpi/mpi_export.cc")
    list(REMOVE_ITEM CPU_SRC_LIST "cpu/mpi/mpi_adapter.cc" "cpu/mpi/mpi_export.cc")
endif ()

if (ENABLE_MPI)
    if (ENABLE_CPU)
        file(GLOB_RECURSE MPI_SRC_LIST "cpu/mpi/mpi_adapter.cc", "cpu/mpi/mpi_export.cc")
        file(GLOB_RECURSE MPI_SRC_LIST "cpu/mpi/mpi_adapter.cc" "cpu/mpi/mpi_export.cc")
        set_property(SOURCE ${MPI_SRC_LIST}
                     PROPERTY COMPILE_DEFINITIONS SUBMODULE_ID=mindspore::SubModuleId::SM_DEVICE)
        add_library(mpi_adapter SHARED ${MPI_SRC_LIST})

@ -57,7 +57,6 @@ constexpr const char *kOpTypeOpDebug = "Opdebug";
|
|||
namespace mindspore {
|
||||
namespace device {
|
||||
namespace ascend {
|
||||
|
||||
DataDumper::~DataDumper() {
|
||||
ReleaseDevMem(&dev_load_mem_);
|
||||
ReleaseDevMem(&dev_unload_mem_);
|
||||
|
@ -116,7 +115,7 @@ void DataDumper::LoadDumpInfo() {
|
|||
auto debugger = mindspore::Debugger::GetInstance();
|
||||
MS_EXCEPTION_IF_NULL(debugger);
|
||||
if (debugger->DebuggerBackendEnabled()) {
|
||||
std::map<std::pair<uint32_t, uint32_t>, std::string> &stream_task_to_opname = debugger->GetStreamTaskToOpnameMap();
|
||||
std::map<std::pair<uint32_t, uint32_t>, std::string> stream_task_to_opname = debugger->GetStreamTaskToOpnameMap();
|
||||
// extract stream id, task id and opname from runtime_info_map for overflow detection
|
||||
std::transform(runtime_info_map_.begin(), runtime_info_map_.end(),
|
||||
std::inserter(stream_task_to_opname, stream_task_to_opname.end()),
|
||||
|
|
|
@ -52,8 +52,12 @@ const int kUnSupportMixedDataTypeIndex = -1;
bool MatchInferOutputDataType(const CNodePtr &cnode, const kernel::KernelBuildInfo &kernel_build_info) {
  MS_EXCEPTION_IF_NULL(cnode);
  // Check input data type
  auto name = AnfAlgo::GetCNodeName(cnode);
  for (size_t input_index = 0; input_index < kernel_build_info.GetInputNum(); ++input_index) {
    TypeId input_origin_type = AnfAlgo::GetPrevNodeOutputInferDataType(cnode, input_index);
    if (name == kDynamicRNNOpName && input_origin_type == kMetaTypeNone) {
      continue;
    }
    if (kernel_build_info.GetInputDeviceType(input_index) != input_origin_type) {
      return false;
    }

@ -478,6 +482,9 @@ void SetTensorDeviceInfo(const kernel::KernelBuildInfo &selected_kernel_info, co
      AnfAlgo::SetSelectKernelBuildInfo(builder->Build(), input_kernel_node.get());
      continue;
    }
    if (selected_kernel_info.GetInputFormat(input_index) == kOpFormat_FRACTAL_ZN_LSTM) {
      continue;
    }
    // we set special device info of an input tensor.
    bool is_ref = false;
    auto op_info = kernel::OpLib::FindOp(AnfAlgo::GetCNodeName(kernel_node), kernel::kTBE);

@ -127,8 +127,12 @@ bool TaskGenerator::LaunchKernel(const CNodePtr &anf_node_ptr, uint32_t stream_i
  auto kernel_mod = AnfAlgo::GetKernelMod(anf_node_ptr);
  MS_EXCEPTION_IF_NULL(kernel_mod);
  kernel_mod->set_kernel_name(anf_node_ptr->fullname_with_scope());
  auto op_name = AnfAlgo::GetCNodeName(anf_node_ptr);
  if (AnfAlgo::GetCNodeName(anf_node_ptr) != kAtomicAddrCleanOpName) {
    for (size_t i = 0; i < AnfAlgo::GetInputTensorNum(anf_node_ptr); ++i) {
      if (op_name == kDynamicRNNOpName && i == 3) {
        continue;
      }
      auto real_input_index = AnfAlgo::GetRealInputIndex(anf_node_ptr, i);
      auto device_address = AnfAlgo::GetPrevNodeOutputAddr(anf_node_ptr, real_input_index);
      AddressPtr input = std::make_shared<Address>();

@ -22,22 +22,68 @@
|
|||
#include <functional>
|
||||
#include "backend/kernel_compiler/kernel.h"
|
||||
#include "runtime/device/cpu/cpu_device_address.h"
|
||||
#include "runtime/device/cpu/cpu_memory_manager.h"
|
||||
#include "utils/ms_context.h"
|
||||
#include "backend/session/anf_runtime_algorithm.h"
|
||||
#include "backend/session/session_basic.h"
|
||||
#include "frontend/operator/ops.h"
|
||||
#include "utils/shape_utils.h"
|
||||
#include "utils/profile.h"
|
||||
#ifdef MEM_REUSE_DEBUG
|
||||
#include "backend/optimizer/mem_reuse/mem_reuse_checker.h"
|
||||
#endif
|
||||
|
||||
namespace mindspore {
|
||||
namespace device {
|
||||
namespace cpu {
|
||||
|
||||
bool CPUKernelRuntime::Init()
|
||||
{
|
||||
mem_manager_ = std::make_shared<CPUMemoryManager>();
|
||||
MS_EXCEPTION_IF_NULL(mem_manager_);
|
||||
return true;
|
||||
}
|
||||
|
||||
const size_t INIT_NODE_REF = 1;
|
||||
void CPUKernelRuntime::AssignKernelAddress(session::KernelGraph *kernel_graph) {
|
||||
AssignValueNodeAddress(kernel_graph);
|
||||
AssignInputNodeAddress(kernel_graph);
|
||||
auto context_ptr = MsContext::GetInstance();
|
||||
MS_EXCEPTION_IF_NULL(context_ptr);
|
||||
bool is_enable_mem_reuse = context_ptr->get_param<bool>(MS_CTX_ENABLE_MEM_REUSE);
|
||||
if (is_enable_mem_reuse) {
|
||||
AssignStaticMemoryOutput(kernel_graph);
|
||||
AssignDynamicMemory(kernel_graph);
|
||||
#ifdef MEM_REUSE_DEBUG
|
||||
mindspore::memreuse::MemReuseChecker::GetInstance().CheckNormalIR(kernel_graph);
|
||||
#endif
|
||||
} else {
|
||||
AssignKernelOutputAddress(kernel_graph);
|
||||
resource_manager_.AssignMemory(kernel_graph);
|
||||
static_cast<CPUMemoryManager *>(mem_manager_.get())->AssignMemory(kernel_graph);
|
||||
}
|
||||
}
|
||||
|
||||
void CPUKernelRuntime::AssignStaticMemoryOutput(const session::KernelGraph *graph) {
|
||||
MS_EXCEPTION_IF_NULL(graph);
|
||||
auto nodes = AnfAlgo::GetAllOutput(graph->output(), {prim::kPrimTupleGetItem});
|
||||
std::vector<session::KernelWithIndex> non_communication_op;
|
||||
// Assign Communicate Op Memory firstly.
|
||||
for (const auto &node : nodes) {
|
||||
auto item_with_index = AnfAlgo::VisitKernelWithReturnType(node, 0, true);
|
||||
MS_EXCEPTION_IF_NULL(item_with_index.first);
|
||||
if (!item_with_index.first->isa<CNode>() || !AnfAlgo::IsRealKernel(item_with_index.first)) {
|
||||
continue;
|
||||
}
|
||||
if (AnfAlgo::IsCommunicationOp(item_with_index.first)) {
|
||||
AssignCommunicationNodeMem(kStaticMem, item_with_index.first);
|
||||
} else {
|
||||
non_communication_op.emplace_back(item_with_index);
|
||||
}
|
||||
}
|
||||
|
||||
for (const auto &item_with_index : non_communication_op) {
|
||||
AssignNodeOutputMem(kStaticMem, item_with_index.first, SizeToInt(item_with_index.second));
|
||||
}
|
||||
}
|
||||
|
||||
void CPUKernelRuntime::AssignValueNodeAddress(session::KernelGraph *kernel_graph) {
|
||||
|
@ -73,7 +119,8 @@ void CPUKernelRuntime::AssignValueNodeAddress(session::KernelGraph *kernel_graph
|
|||
tensor->data_type() == kNumberTypeInt64) {
|
||||
address->ptr_ = tensor->data_c();
|
||||
} else {
|
||||
address->ptr_ = resource_manager_.MemMalloc(tensor_size);
|
||||
MS_EXCEPTION_IF_NULL(mem_manager_);
|
||||
address->ptr_ = static_cast<CPUMemoryManager *>(mem_manager_.get())->MemMalloc(tensor_size);
|
||||
if (!address->SyncHostToDevice(data_shape, LongToSize(tensor->data().nbytes()), tensor->data_type(),
|
||||
tensor->data_c())) {
|
||||
MS_LOG(EXCEPTION) << "Value node sync host to device failed!";
|
||||
|
@ -166,10 +213,11 @@ tensor::TensorPtr CPUKernelRuntime::CreatTensorForOutput(session::KernelGraph *k
|
|||
tensor->set_sync_status(kNeedSyncDeviceToHostImmediately);
|
||||
} else {
|
||||
if (infer_type_id != device_type_id) {
|
||||
MS_EXCEPTION_IF_NULL(mem_manager_);
|
||||
size_t type_size = GetTypeByte(TypeIdToType(device_type_id));
|
||||
ShapeVector data_shape = tensor->shape();
|
||||
size_t tensor_size = std::accumulate(data_shape.begin(), data_shape.end(), type_size, std::multiplies<size_t>());
|
||||
address->ptr_ = resource_manager_.MemMalloc(tensor_size);
|
||||
address->ptr_ = static_cast<CPUMemoryManager *>(mem_manager_.get())->MemMalloc(tensor_size);
|
||||
tensor->set_device_address(address);
|
||||
tensor->set_sync_status(kNeedSyncDeviceToHostImmediately);
|
||||
} else {
|
||||
|
@ -241,10 +289,11 @@ void CPUKernelRuntime::BindInputOutput(session::KernelGraph *kernel_graph, const
|
|||
tensor->data_type() == kNumberTypeInt32 || tensor->data_type() == kNumberTypeInt64) {
|
||||
address->ptr_ = tensor->data_c();
|
||||
} else {
|
||||
MS_EXCEPTION_IF_NULL(mem_manager_);
|
||||
ShapeVector data_shape = tensor->shape();
|
||||
size_t tensor_size =
|
||||
std::accumulate(data_shape.begin(), data_shape.end(), sizeof(float), std::multiplies<size_t>());
|
||||
address->ptr_ = resource_manager_.MemMalloc(tensor_size);
|
||||
address->ptr_ = static_cast<CPUMemoryManager *>(mem_manager_.get())->MemMalloc(tensor_size);
|
||||
if (!address->SyncHostToDevice(data_shape, LongToSize(tensor->data().nbytes()), tensor->data_type(),
|
||||
tensor->data_c())) {
|
||||
MS_LOG(EXCEPTION) << "Parameter node sync host to device failed!";
|
||||
|
@@ -271,7 +320,8 @@ void CPUKernelRuntime::AddRuntimeAddress(DeviceAddress *address, std::vector<ker
  kernel::AddressPtr input = std::make_shared<kernel::Address>();
  MS_EXCEPTION_IF_NULL(input);
  if (address->ptr_ == nullptr) {
    address->ptr_ = resource_manager_.MemMalloc(address->size_);
    MS_EXCEPTION_IF_NULL(mem_manager_);
    address->ptr_ = static_cast<CPUMemoryManager *>(mem_manager_.get())->MemMalloc(address->size_);
  }
  MS_EXCEPTION_IF_NULL(address->ptr_);
  input->addr = address->ptr_;
@@ -280,16 +330,19 @@ void CPUKernelRuntime::AddRuntimeAddress(DeviceAddress *address, std::vector<ker
}

void CPUKernelRuntime::IncreaseSummaryRefCount(const session::NamedSummaryOutputs &summary_outputs) {
  resource_manager_.IncreaseSummaryRefCount(summary_outputs);
  MS_EXCEPTION_IF_NULL(mem_manager_);
  static_cast<CPUMemoryManager *>(mem_manager_.get())->IncreaseSummaryRefCount(summary_outputs);
}

void CPUKernelRuntime::DecreaseSummaryRefCount(const session::NamedSummaryOutputs &summary_outputs) {
  resource_manager_.DecreaseSummaryRefCount(summary_outputs);
  MS_EXCEPTION_IF_NULL(mem_manager_);
  static_cast<CPUMemoryManager *>(mem_manager_.get())->DecreaseSummaryRefCount(summary_outputs);
}

bool CPUKernelRuntime::Run(session::KernelGraph *kernel_graph, bool is_task_sink, Debugger *debugger) {
  MS_EXCEPTION_IF_NULL(kernel_graph);
  resource_manager_.IncreaseAddressRefCount(kernel_graph);
  MS_EXCEPTION_IF_NULL(mem_manager_);
  static_cast<CPUMemoryManager *>(mem_manager_.get())->IncreaseAddressRefCount(kernel_graph);

  auto kernels = kernel_graph->execution_order();
  for (const auto &kernel : kernels) {

@@ -319,7 +372,7 @@ bool CPUKernelRuntime::Run(session::KernelGraph *kernel_graph, bool is_task_sink
      AddRuntimeAddress(device_address, &kernel_workspaces);
    }
    auto ret = kernel_mod->Launch(kernel_inputs, kernel_workspaces, kernel_outputs, 0);
    resource_manager_.DecreaseAddressRefCount(kernel);
    static_cast<CPUMemoryManager *>(mem_manager_.get())->DecreaseAddressRefCount(kernel);
    if (!ret) {
      MS_LOG(EXCEPTION) << "Launch kernel failed.";
    }
@@ -24,7 +24,6 @@
#include "runtime/device/kernel_runtime.h"
#include "backend/session/kernel_graph.h"
#include "backend/session/session_basic.h"
#include "runtime/device/cpu/cpu_resource_manager.h"
#include "backend/session/anf_runtime_algorithm.h"
#include "utils/any.h"
namespace mindspore {

@@ -35,7 +34,7 @@ class CPUKernelRuntime : public KernelRuntime {
  CPUKernelRuntime() = default;
  ~CPUKernelRuntime() override = default;

  bool Init() override { return true; }
  bool Init() override;
  bool Run(session::KernelGraph *graph, bool is_task_sink, Debugger *debugger = nullptr) override;
  void AssignKernelAddress(session::KernelGraph *kernel_graph);
  void BindInputOutput(session::KernelGraph *kernel_graph, const std::vector<tensor::TensorPtr> &inputs,

@@ -49,6 +48,7 @@ class CPUKernelRuntime : public KernelRuntime {
                       TypeId type_id) override;

 private:
  void AssignStaticMemoryOutput(const session::KernelGraph *graph);
  tensor::TensorPtr CreatTensorForOutput(session::KernelGraph *kernel_graph, const CNodePtr &node, size_t index);

  BaseRef CreatTensorForOutput(session::KernelGraph *kernel_graph, const session::KernelWithIndex &kernel_with_index);

@@ -56,7 +56,6 @@ class CPUKernelRuntime : public KernelRuntime {
  void AssignInputNodeAddress(const session::KernelGraph *kernel_graph);
  void AssignKernelOutputAddress(const session::KernelGraph *kernel_graph);
  void AddRuntimeAddress(DeviceAddress *address, std::vector<kernel::AddressPtr> *input_list);
  CPUResourceManager resource_manager_;
  std::set<DeviceAddressPtr> bound_addresses_;
  std::map<AnfNodePtr, tensor::TensorPtr> input_param_tensor_map_;
};
@@ -1,5 +1,5 @@
/**
 * Copyright 2019 Huawei Technologies Co., Ltd
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.

@@ -13,15 +13,34 @@
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#include "runtime/device/cpu/cpu_resource_manager.h"
#include "backend/session/anf_runtime_algorithm.h"

#include "runtime/device/cpu/cpu_memory_manager.h"
#include "backend/session/anf_runtime_algorithm.h"
#include "utils/ms_context.h"
#include "utils/convert_utils.h"
namespace mindspore {
namespace device {
namespace cpu {
CPUResourceManager::~CPUResourceManager() { MemFree(); }

void CPUResourceManager::MemFree() {
uint8_t *CPUMemoryManager::MallocStaticMem(size_t size, bool communication_mem) {
  // For cpu, static memory is the same as dynamic memory
  return MallocDynamicMem(size, communication_mem);
}

uint8_t *CPUMemoryManager::MallocDynamicMem(size_t size, bool) {
  return (uint8_t *)MemMalloc(size);
}

void CPUMemoryManager::ResetDynamicMemory() {
  for (auto &&iter : dynamic_mem_) {
    free(iter.first);
  }
  dynamic_mem_.clear();
}

CPUMemoryManager::~CPUMemoryManager() { MemFree(); }

void CPUMemoryManager::MemFree() {
  if (mem_ptr_ != nullptr) {
    free(mem_ptr_);
    mem_ptr_ = nullptr;
@@ -34,7 +53,7 @@ void CPUResourceManager::MemFree() {
  dynamic_mem_.clear();
}

void CPUResourceManager::AssignMemory(const session::KernelGraph *graph) {
void CPUMemoryManager::AssignMemory(const session::KernelGraph *graph) {
  size_t graph_mem_size = mem_plan_.MemPlan(graph);
  if (graph_mem_size > mem_size_) {
    if (mem_size_ > 0) {

@@ -43,6 +62,7 @@ void CPUResourceManager::AssignMemory(const session::KernelGraph *graph) {
    }
    mem_ptr_ = reinterpret_cast<uint8_t *>(malloc(graph_mem_size));
    if (mem_ptr_ != nullptr) {
      MS_LOG(INFO) << "Simple MemPlan GraphMemSize [" << graph_mem_size << "]";
      mem_size_ = graph_mem_size;
      dynamic_malloc_ = false;
    } else {

@@ -56,7 +76,7 @@ void CPUResourceManager::AssignMemory(const session::KernelGraph *graph) {
  mem_plan_.MemAssign(graph, mem_ptr_);
}

void *CPUResourceManager::MemMalloc(size_t mem_size) {
void *CPUMemoryManager::MemMalloc(size_t mem_size) {
  void *ptr = malloc(mem_size);
  if (ptr != nullptr) {
    memset_s(ptr, mem_size, 0, mem_size);
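AssignMemory above implements a plan-then-fallback strategy: a simple planner computes the whole graph's footprint, the manager tries to grab it as one arena, and only if that fails does it fall back to per-block dynamic allocation. A stripped-down sketch of that strategy (hypothetical `ArenaAllocator` class, not the MindSpore API):

```
class ArenaAllocator:
    def __init__(self):
        self.arena = None
        self.dynamic_malloc = True

    def assign(self, planned_size):
        try:
            self.arena = bytearray(planned_size)  # one big block for the whole graph
            self.dynamic_malloc = False
        except MemoryError:
            self.dynamic_malloc = True  # fall back: allocate per request instead
```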
@@ -67,7 +87,7 @@ void *CPUResourceManager::MemMalloc(size_t mem_size) {
  }
}

void CPUResourceManager::MemFree(void *ptr) {
void CPUMemoryManager::MemFree(void *ptr) {
  auto iter = dynamic_mem_.find(ptr);
  if (iter != dynamic_mem_.end()) {
    (void)dynamic_mem_.erase(iter);

@@ -75,7 +95,7 @@ void CPUResourceManager::MemFree(void *ptr) {
  }
}

void CPUResourceManager::IncreaseSummaryRefCount(const session::NamedSummaryOutputs &summary_outputs) {
void CPUMemoryManager::IncreaseSummaryRefCount(const session::NamedSummaryOutputs &summary_outputs) {
  if (!dynamic_malloc_) {
    return;
  }

@@ -93,7 +113,7 @@ void CPUResourceManager::IncreaseSummaryRefCount(const session::NamedSummaryOutp
  }
}

void CPUResourceManager::DecreaseSummaryRefCount(const session::NamedSummaryOutputs &summary_outputs) {
void CPUMemoryManager::DecreaseSummaryRefCount(const session::NamedSummaryOutputs &summary_outputs) {
  if (!dynamic_malloc_) {
    return;
  }

@@ -115,7 +135,7 @@ void CPUResourceManager::DecreaseSummaryRefCount(const session::NamedSummaryOutp
  }
}

void CPUResourceManager::IncreaseAddressRefCount(const session::KernelGraph *graph) {
void CPUMemoryManager::IncreaseAddressRefCount(const session::KernelGraph *graph) {
  if (!dynamic_malloc_) {
    return;
  }

@@ -140,7 +160,7 @@ void CPUResourceManager::IncreaseAddressRefCount(const session::KernelGraph *gra
  }
}

void CPUResourceManager::DecreaseAddressRefCount(const AnfNodePtr &kernel) {
void CPUMemoryManager::DecreaseAddressRefCount(const AnfNodePtr &kernel) {
  if (!dynamic_malloc_) {
    return;
  }
@@ -1,5 +1,5 @@
/**
 * Copyright 2019 Huawei Technologies Co., Ltd
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.

@@ -13,22 +13,27 @@
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#ifndef MINDSPORE_CCSRC_RUNTIME_DEVICE_CPU_CPU_RESOURCE_MANAGER_H_
#define MINDSPORE_CCSRC_RUNTIME_DEVICE_CPU_CPU_RESOURCE_MANAGER_H_

#ifndef MINDSPORE_CCSRC_RUNTIME_DEVICE_CPU_CPU_MEMORY_MANAGER_H_
#define MINDSPORE_CCSRC_RUNTIME_DEVICE_CPU_CPU_MEMORY_MANAGER_H_
#include <vector>
#include <map>
#include "backend/session/kernel_graph.h"
#include "backend/session/session_basic.h"
#include "runtime/device/device_address.h"
#include "runtime/device/memory_manager.h"
#include "runtime/device/cpu/cpu_simple_mem_plan.h"
namespace mindspore {
namespace device {
namespace cpu {
class CPUResourceManager {
class CPUMemoryManager : public MemoryManager {
 public:
  CPUResourceManager() = default;
  ~CPUResourceManager();
  CPUMemoryManager() = default;
  virtual ~CPUMemoryManager();

  void MallocDeviceMemory() override {}
  void FreeDeviceMemory() override {}
  void ResetDynamicMemory() override;

  void AssignMemory(const session::KernelGraph *graph);
  void IncreaseAddressRefCount(const session::KernelGraph *graph);

@@ -38,6 +43,10 @@ class CPUResourceManager {
  void IncreaseSummaryRefCount(const session::NamedSummaryOutputs &summary_outputs);
  void DecreaseSummaryRefCount(const session::NamedSummaryOutputs &summary_outputs);

 protected:
  uint8_t *MallocStaticMem(size_t size, bool communication_mem) override;
  uint8_t *MallocDynamicMem(size_t size, bool communication_mem) override;

 private:
  void MemFree();
  CPUSimpleMemPlan mem_plan_;

@@ -50,5 +59,4 @@ class CPUResourceManager {
}  // namespace cpu
}  // namespace device
}  // namespace mindspore

#endif  // MINDSPORE_CCSRC_RUNTIME_DEVICE_CPU_CPU_RESOURCE_MANAGER_H_
#endif  // MINDSPORE_CCSRC_RUNTIME_DEVICE_CPU_CPU_MEMORY_MANAGER_H_
@@ -28,7 +28,7 @@ namespace mindspore {
namespace device {
namespace cpu {
class CPUSimpleMemPlan;
class CPUResourceManager;
class CPUMemoryManager;
class CPUKernelRuntime;
}  // namespace cpu
namespace ascend {

@@ -87,7 +87,7 @@ class DeviceAddress : public mindspore::DeviceSync {
  friend class MemoryManager;
  friend class mindspore::device::ascend::tasksink::TaskGenerator;
  friend class mindspore::device::cpu::CPUSimpleMemPlan;
  friend class mindspore::device::cpu::CPUResourceManager;
  friend class mindspore::device::cpu::CPUMemoryManager;
  friend class mindspore::device::cpu::CPUKernelRuntime;
  friend class mindspore::device::gpu::GPUKernelRuntime;
  friend class mindspore::device::gpu::GPUMemoryManager;
@@ -141,7 +141,6 @@ inline bool CheckNullInput(std::vector<size_t> input_shape) {
    MS_LOG(EXCEPTION) << "CUDA curand Error: " << message << " | curandStatus: " << status; \
  }                                                                                         \
}

}  // namespace gpu
}  // namespace device
}  // namespace mindspore
@@ -25,7 +25,6 @@
namespace mindspore {
namespace device {
namespace gpu {

MPIInitializer &MPIInitializer::GetInstance() {
  static MPIInitializer instance;
  return instance;
@@ -58,7 +58,23 @@ int AnfConverter::ValidateFileStr(const std::string &modelFile, std::string file

bool AnfConverter::ReadOnnxFromBinary(const std::string &modelFile, google::protobuf::Message *onnx_model) {
  std::unique_ptr<char> onnx_file(new (std::nothrow) char[PATH_MAX]{0});
  int fd = open(modelFile.c_str(), O_RDONLY);
  if (modelFile.size() > PATH_MAX) {
    MS_LOG(DEBUG) << "file path " << modelFile << " is too long.";
    return false;
  }
  char real_path[PATH_MAX + 1] = {0};
#if defined(_WIN32) || defined(_WIN64)
  if (nullptr == _fullpath(real_path, modelFile.c_str(), PATH_MAX)) {
    MS_LOG(DEBUG) << modelFile << " does not exist.";
    return false;
  }
#else
  if (nullptr == realpath(modelFile.c_str(), real_path)) {
    MS_LOG(DEBUG) << modelFile << " does not exist.";
    return false;
  }
#endif
  int fd = open(real_path, O_RDONLY);
  if (fd < 0) {
    MS_LOG(EXCEPTION) << "failed to open file";
  }
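The new version canonicalizes the user-supplied path before opening it, which both rejects over-long paths and fails fast on dangling ones. A minimal sketch of the same pattern in Python (hypothetical helper, not the converter's API):

```
import os

def read_model(model_file: str) -> bytes:
    # Resolve to a canonical absolute path first, mirroring realpath()/_fullpath().
    real_path = os.path.realpath(model_file)
    if not os.path.exists(real_path):
        raise FileNotFoundError(f"{model_file} does not exist")
    with open(real_path, "rb") as f:
        return f.read()
```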
@@ -100,28 +100,27 @@ bool ProcessInner(const std::string &message, const int &bias, uint32_t *digest,
    w[i] = sigma3(w[i - 2]) + w[i - 7] + sigma2(w[i - 15]) + w[i - 16];
  }

  std::shared_ptr<uint32_t> hash(new uint32_t[digest_size], std::default_delete<uint32_t[]>());
  std::shared_ptr<uint32_t[]> hash(new uint32_t[digest_size]());
  size_t mem_size = digest_size * sizeof(uint32_t);
  auto ret = memcpy_s(hash.get(), mem_size, digest, mem_size);
  if (ret != EOK) {
    return false;
  }
  for (int i = 0; i < kIterationNumber; ++i) {
    uint32_t t1 =
      w[i] + constant[i] + hash.get()[7] + sigma1(hash.get()[4]) + ch(hash.get()[4], hash.get()[5], hash.get()[6]);
    uint32_t t2 = sigma0(hash.get()[0]) + ma(hash.get()[0], hash.get()[1], hash.get()[2]);
    uint32_t t1 = w[i] + constant[i] + hash[7] + sigma1(hash[4]) + ch(hash[4], hash[5], hash[6]);
    uint32_t t2 = sigma0(hash[0]) + ma(hash[0], hash[1], hash[2]);
    for (int j = digest_size - 1; j >= 0; --j) {
      if (j == 4) {
        hash.get()[j] = hash.get()[j - 1] + t1;
        hash[j] = hash[j - 1] + t1;
      } else if (j == 0) {
        hash.get()[j] = t1 + t2;
        hash[j] = t1 + t2;
      } else {
        hash.get()[j] = hash.get()[j - 1];
        hash[j] = hash[j - 1];
      }
    }
  }
  for (int i = 0; i < digest_size; ++i) {
    digest[i] += hash.get()[i];
    digest[i] += hash[i];
  }
  return true;
}
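For readers who don't recognize the loop: this is the SHA-256 compression schedule, where two temporaries update an eight-word working state each round (`ma` is the majority function). A minimal Python sketch of one round's register rotation, assuming helper functions equivalent to the `sigma0`/`sigma1`/`ch`/`ma` used above:

```
def sha256_round(state, w_i, k_i, sigma0, sigma1, ch, ma):
    a, b, c, d, e, f, g, h = state
    t1 = (w_i + k_i + h + sigma1(e) + ch(e, f, g)) & 0xFFFFFFFF
    t2 = (sigma0(a) + ma(a, b, c)) & 0xFFFFFFFF
    # Rotate the working registers: e picks up d + t1, a picks up t1 + t2.
    return [(t1 + t2) & 0xFFFFFFFF, a, b, c, (d + t1) & 0xFFFFFFFF, e, f, g]
```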
@@ -219,6 +219,8 @@ constexpr auto kFusedScaleApplyMomentum = "FusedScaleApplyMomentum";
constexpr auto kBasicLSTMCellWeightGradOpName = "BasicLSTMCellWeightGrad";
constexpr auto kBasicLSTMCellInputGradOpName = "BasicLSTMCellInputGrad";
constexpr auto kBasicLSTMCellOpName = "BasicLSTMCell";
constexpr auto kDynamicRNNOpName = "DynamicRNN";
constexpr auto kLSTMInputGradOpName = "LSTMInputGrad";

// attr key name
constexpr auto kAttrInputNames = "input_names";
@@ -292,7 +292,6 @@ class _PynativeExecutor:
    def __init__(self):
        self._executor = PynativeExecutor_.get_instance()

    # TODO(kpy): add a type arg
    def new_graph(self, obj, *args, **kwargs):
        self._executor.new_graph(obj, *args, *(kwargs.values()))
@@ -31,9 +31,11 @@ def set_seed(seed):
    Note:
        The global seed is used by numpy.random, mindspore.common.Initializer, mindspore.ops.composite.random_ops and
        mindspore.nn.probability.distribution.

        If the global seed is not set, these packages will use their own default seeds independently: numpy.random and
        mindspore.common.Initializer will choose a random seed, while mindspore.ops.composite.random_ops and
        mindspore.nn.probability.distribution will use zero.

        A seed set by numpy.random.seed() is only used by numpy.random, while a seed set by this API will also be
        used by numpy.random, so setting all seeds through this API is recommended.
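A short usage sketch of the behavior the note describes (assuming the `mindspore.common.set_seed` import path this file exports; values are illustrative):

```
import numpy as np
from mindspore.common import set_seed

set_seed(2)             # also seeds numpy.random, per the note above
a = np.random.rand(3)   # reproducible across runs

np.random.seed(2)       # seeds numpy.random only; MindSpore random ops are unaffected
```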
@@ -97,14 +99,13 @@ def _get_op_seed(op_seed, kernel_name):
        op_seed (int): The op seed to be updated.
        kernel_name (string): The random op kernel.
    """
    if ((kernel_name, op_seed) not in _KERNEL_SEED) or (_KERNEL_SEED[(kernel_name, op_seed)] == -1):
    if (kernel_name, op_seed) not in _KERNEL_SEED:
        _KERNEL_SEED[(kernel_name, op_seed)] = op_seed
        _KERNEL_SEED[(kernel_name, op_seed)] = 0
    return _KERNEL_SEED[(kernel_name, op_seed)]


def _reset_op_seed():
    """
    Reset op seeds in the kernel's dictionary.
    """
    for key in _KERNEL_SEED:
        _KERNEL_SEED[key] = -1
    for (kernel_name, op_seed) in _KERNEL_SEED:
        _KERNEL_SEED[(kernel_name, op_seed)] = op_seed
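Either way the hunk is read, `_KERNEL_SEED` acts as a per-`(kernel_name, op_seed)` cache: first use records a seed, later calls reuse it, and a reset restores the recorded values. A standalone sketch of that caching pattern (simplified; not the exact MindSpore semantics):

```
_KERNEL_SEED = {}

def get_op_seed(op_seed, kernel_name):
    key = (kernel_name, op_seed)
    if key not in _KERNEL_SEED:
        _KERNEL_SEED[key] = op_seed  # first use: remember the requested seed
    return _KERNEL_SEED[key]

def reset_op_seed():
    for (kernel_name, op_seed) in _KERNEL_SEED:
        _KERNEL_SEED[(kernel_name, op_seed)] = op_seed  # restore initial seeds
```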
@@ -104,6 +104,9 @@ inline const PrimitivePtr kPrimPad = std::make_shared<Primitive>("Pad");
inline const PrimitivePtr kPrimArgMaxWithValue = std::make_shared<Primitive>("ArgMaxWithValue");
inline const PrimitivePtr kPrimUnique = std::make_shared<Primitive>("Unique");
inline const PrimitivePtr kPrimUniqueGrad = std::make_shared<Primitive>("UniqueGrad");
inline const PrimitivePtr kPrimExtractImagePatches = std::make_shared<Primitive>("ExtractImagePatches");
inline const PrimitivePtr kPrimDynamicRNN = std::make_shared<Primitive>("DynamicRNN");
inline const PrimitivePtr kPrimDynamicRNNGrad = std::make_shared<Primitive>("DynamicRNNGrad");

// NN
inline const PrimitivePtr kPrimFlatten = std::make_shared<Primitive>("Flatten");

@@ -213,6 +216,7 @@ inline const PrimitivePtr kPrimRound = std::make_shared<Primitive>("Round");
inline const PrimitivePtr kPrimExp = std::make_shared<Primitive>("Exp");
inline const PrimitivePtr kPrimLog = std::make_shared<Primitive>("Log");
inline const PrimitivePtr kPrimRsqrt = std::make_shared<Primitive>("Rsqrt");
inline const PrimitivePtr kPrimSplitV = std::make_shared<Primitive>("SplitV");

// Statements
inline const PrimitivePtr kPrimReturn = std::make_shared<Primitive>("return");
@@ -28,7 +28,6 @@
#include <algorithm>
#include <type_traits>
#include <typeinfo>
#include <regex>

#include "abstract/abstract_value.h"
@@ -373,25 +372,17 @@ class TensorDataImpl : public TensorData {
                  std::is_same<T, double>::value) {
      return str;
    }
    // Use regular expressions to replace placeholders.
    std::regex pattern("#+");
    std::smatch result;
    int bias = 0;
    std::string::const_iterator start = str.begin();
    std::string::const_iterator end = str.end();
    while (std::regex_search(start, end, result, pattern)) {
      const int len = result.str(0).length();
      const int pos = result.position();
      bias += pos;
      if (bias > static_cast<int32_t>(str.length())) {
        return "";
      }
    // Replace # with placeholder.
    size_t index = str.find('#');
    while (index != str.npos) {
      size_t pos = index;
      while (str[pos] == '#') {
        pos++;
      }
      int len = pos - index;
      std::string space(max_width - len, ' ');
      str = str.replace(bias, len, space);
      // Update the starting position of the search.
      start = str.begin() + bias;
      end = str.end();
      str = str.replace(index, len, space);
      index = str.find('#', index);
    }
    return str;
  }
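The rewrite drops `<regex>` in favor of a plain scan: find each run of '#' and overwrite it with enough spaces to pad the field to `max_width`. A compact Python equivalent of the new loop (it assumes, as the C++ code does, that `max_width` is at least the run length):

```
def pad_placeholders(s: str, max_width: int) -> str:
    # Replace each run of '#' with spaces so columns line up at max_width.
    i = s.find('#')
    while i != -1:
        j = i
        while j < len(s) and s[j] == '#':
            j += 1
        s = s[:i] + ' ' * (max_width - (j - i)) + s[j:]
        i = s.find('#', i)
    return s

print(repr(pad_placeholders("ab###cd", 5)))  # 'ab  cd'
```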
@@ -269,7 +269,7 @@ class Tensor : public MetaTensor {

  std::string id() const { return id_; }
  TypePtr cast_dtype() { return cast_dtype_; }
  void set_cast_dtype(TypePtr dtype) { cast_dtype_ = dtype; }
  void set_cast_dtype(TypePtr dtype = nullptr) { cast_dtype_ = dtype; }

  void SetNeedWait(bool need_wait) {
    if (event_ != nullptr) {
@@ -266,7 +266,7 @@ std::string KeywordArg::ToString() const {
  buffer << "KeywordArg[";
  buffer << "key : " << key_;
  MS_EXCEPTION_IF_NULL(value_);
  buffer << "value : " << value_->ToString();
  buffer << ", value : " << value_->ToString();
  buffer << "]";
  return buffer.str();
}
@@ -2474,7 +2474,7 @@ class ConcatDataset(DatasetOp):
            raise ValueError("The parameter num_shards of DistributedSampler must be positive int!")

        if sampler.get_num_samples() is not None:
            raise ValueError("The parameter num_samples of DistributedSampler must be set!")
            raise ValueError("The parameter num_samples of DistributedSampler is not supported to be set!")

        self._sampler = _select_sampler(None, sampler, None, None, None)
        cumulative_samples_nums = 0
@@ -3256,14 +3256,13 @@ class SamplerFn:
        # Event for end of epoch
        if multi_process is True:
            self.eoe = multiprocessing.Event()
            self.eof = multiprocessing.Event()
        else:
            self.eoe = threading.Event()
            self.eof = threading.Event()
        # Create workers
        for _ in range(num_worker):
            if multi_process is True:
                worker = _GeneratorWorkerMp(dataset, self.eoe, self.eof)
                worker = _GeneratorWorkerMp(dataset, self.eoe)
            else:
                worker = _GeneratorWorkerMt(dataset, self.eoe, self.eof)
            worker.daemon = True
@@ -3304,15 +3303,40 @@ class SamplerFn:

    def __del__(self):
        self.eoe.set()
        self.eof.set()
        if self.multi_process is False:
            self.eof.set()
        for w in self.workers:
            w.join()


def _generator_worker_loop(dataset, idx_queue, result_queue, eoe, eof):
def _generator_worker_loop_mp(dataset, idx_queue, result_queue, eoe):
    """
    Multiprocessing or multithread generator worker process loop.
    Multiprocessing generator worker process loop.
    """
    while True:
        # Fetch index, block
        try:
            idx = idx_queue.get()
        except KeyboardInterrupt:
            raise Exception("Generator worker receives KeyboardInterrupt")
        if idx is None:
            # When the queue goes out of scope in the master process, a None item can be fetched from the queue.
            # Upon receiving None, the worker process should check whether EOE is set.
            assert eoe.is_set(), ""
            return
        # Fetch data; any exception from __getitem__ will terminate the worker and time out the master process
        result = dataset[idx]
        # Send data, block
        try:
            result_queue.put(result)
        except KeyboardInterrupt:
            raise Exception("Generator worker receives KeyboardInterrupt")
        del result, idx


def _generator_worker_loop_mt(dataset, idx_queue, result_queue, eoe, eof):
    """
    Multithread generator worker loop.
    """
    while True:
        # Fetch index, block
@@ -3360,7 +3384,7 @@ class _GeneratorWorkerMt(threading.Thread):
    def __init__(self, dataset, eoe, eof):
        self.idx_queue = queue.Queue(16)
        self.res_queue = queue.Queue(16)
        super().__init__(target=_generator_worker_loop, args=(dataset, self.idx_queue, self.res_queue, eoe, eof))
        super().__init__(target=_generator_worker_loop_mt, args=(dataset, self.idx_queue, self.res_queue, eoe, eof))

    def put(self, item):
        """

@@ -3372,7 +3396,7 @@ class _GeneratorWorkerMt(threading.Thread):
        """
        Get function for worker result queue. Block with timeout.
        """
        return self.res_queue.get(timeout=10)
        return self.res_queue.get(timeout=30)


class _GeneratorWorkerMp(multiprocessing.Process):
@@ -3380,10 +3404,10 @@ class _GeneratorWorkerMp(multiprocessing.Process):
    Worker process for multiprocess Generator.
    """

    def __init__(self, dataset, eoe, eof):
    def __init__(self, dataset, eoe):
        self.idx_queue = multiprocessing.Queue(16)
        self.res_queue = multiprocessing.Queue(16)
        super().__init__(target=_generator_worker_loop, args=(dataset, self.idx_queue, self.res_queue, eoe, eof))
        super().__init__(target=_generator_worker_loop_mp, args=(dataset, self.idx_queue, self.res_queue, eoe))

    def put(self, item):
        """

@@ -3395,17 +3419,18 @@ class _GeneratorWorkerMp(multiprocessing.Process):
        """
        Get function for worker result queue. Block with timeout.
        """
        while check_iterator_cleanup() is False:
            try:
                return self.res_queue.get(timeout=10)
            except multiprocessing.TimeoutError:
                continue

        raise Exception("Generator worker process timeout")
        # Relax 10s to 30s, since it sometimes causes a "Generator worker process timeout"
        # when we run too many iterators with an infinite epoch (num_epoch=-1)
        return self.res_queue.get(timeout=30)


    def __del__(self):
        # Try to destruct here; sometimes the class itself is destructed in advance,
        # so "self" will be a NoneType
        try:
            self.terminate()
        except AttributeError:
            pass


class GeneratorDataset(MappableDataset):
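Both worker classes wrap the same producer/consumer shape: the master feeds indices into `idx_queue`, the worker pushes `dataset[idx]` back through `res_queue`, and a timeout on `get` guards against hung workers. A stripped-down, runnable sketch of that pattern (toy dataset and sentinel shutdown, not the MindSpore classes):

```
import multiprocessing

def _worker(dataset, idx_q, res_q):
    while True:
        idx = idx_q.get()
        if idx is None:  # sentinel: no more work
            return
        res_q.put(dataset[idx])

if __name__ == "__main__":
    dataset = [x * x for x in range(10)]
    idx_q, res_q = multiprocessing.Queue(16), multiprocessing.Queue(16)
    p = multiprocessing.Process(target=_worker, args=(dataset, idx_q, res_q), daemon=True)
    p.start()
    for i in range(10):
        idx_q.put(i)
    print([res_q.get(timeout=30) for _ in range(10)])
    idx_q.put(None)
    p.join()
```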
@@ -86,6 +86,38 @@ class Compose:
        >>>                                py_vision.RandomErasing()])
        >>> # apply the transform to the dataset through dataset.map()
        >>> dataset = dataset.map(operations=transform, input_columns="image")
        >>>
        >>> # Compose is also invoked implicitly, by just passing in a list of ops
        >>> # the above example then becomes:
        >>> transform_list = [py_vision.Decode(),
        >>>                   py_vision.RandomHorizontalFlip(0.5),
        >>>                   py_vision.ToTensor(),
        >>>                   py_vision.Normalize((0.491, 0.482, 0.447), (0.247, 0.243, 0.262)),
        >>>                   py_vision.RandomErasing()]
        >>>
        >>> # apply the transform to the dataset through dataset.map()
        >>> dataset = dataset.map(operations=transform_list, input_columns="image")
        >>>
        >>> # Certain C++ and Python ops can be combined, but not all of them
        >>> # An example of combined operations
        >>> import mindspore.dataset as ds
        >>> import mindspore.dataset.transforms.c_transforms as c_transforms
        >>> import mindspore.dataset.vision.c_transforms as c_vision
        >>>
        >>> data = ds.NumpySlicesDataset(arr, column_names=["cols"], shuffle=False)
        >>> transformed_list = [py_transforms.OneHotOp(2), c_transforms.Mask(c_transforms.Relational.EQ, 1)]
        >>> data = data.map(operations=transformed_list, input_columns=["cols"])
        >>>
        >>> # Here is an example of mixing vision ops
        >>> data_dir = "/path/to/imagefolder_directory"
        >>> data1 = ds.ImageFolderDataset(dataset_dir=data_dir, shuffle=False)
        >>> input_columns = ["column_names"]
        >>> op_list = [c_vision.Decode(),
        >>>            c_vision.Resize((224, 244)),
        >>>            py_vision.ToPIL(),
        >>>            np.array,  # need to convert PIL image to a NumPy array to pass it to C++ operation
        >>>            c_vision.Resize((24, 24))]
        >>> data1 = data1.map(operations=op_list, input_columns=input_columns)
    """

    @check_compose_list
@@ -28,10 +28,10 @@ if (SUPPORT_TRAIN)
    elseif (PLATFORM_ARM32)
        set(COMPONENT_NAME runtime-arm32-${PROCESS_UNIT}-train)
    elseif (WIN32)
        set(PARSER_NAME libconverter-parser-win-${PROCESS_UNIT})
        set(COMPONENT_NAME converter-win-${PROCESS_UNIT})
        set(PARSER_NAME libconverter-parser-win-${PROCESS_UNIT}-train)
        set(COMPONENT_NAME converter-win-${PROCESS_UNIT}-train)
    else ()
        set(COMPONENT_NAME converter-ubuntu)
        set(COMPONENT_NAME converter-ubuntu-train)
    endif()
    set(RUN_X86_COMPONENT_NAME runtime-x86-${PROCESS_UNIT}-train)
else ()
@@ -0,0 +1,113 @@
#ifdef __arm__
#ifndef __aarch64__

.text
.align 5
.global ConvDwInt8PostAlign4PerChannel
#ifndef __APPLE__
.type ConvDwInt8PostAlign4PerChannel, %function
#endif

// void ConvDwInt8PostAlign4PerChannel(int8_t *dst, int32_t *buffer, int channel4, int32_t output_zp,
//                                     int32_t *out_multiplier, int32_t *left_shift, int32_t *right_shift,
//                                     int32_t acc_min, int32_t acc_max);
// r0: dst, r1: buffer, r2: num_pixels, r3: output_zp, r4: out_multiplier,
// r5: left_shift, r6: right_shift, r7: acc_min, r8: acc_max

ConvDwInt8PostAlign4PerChannel:
    // at return, clang generates "push {lr}, pop {pc}" while gcc will generate "bx lr"
    // according to https://stackoverflow.com/questions/53625807
    // even if we jump to the link register instead of saving it, we still have to save it in subroutine calls anyway
    // clang's rule seems simpler, though there are no subroutine calls here
    // r4-r8 and q4-q7 must be saved according to https://static.docs.arm.com/ihi0042/i/aapcs32.pdf
    push {r4-r8, r10}
    vpush {q4-q7}
    add sp, sp, #88

    vdup.32 q15, r3          // output_zp

    ldr r4, [sp]             // out_multiplier
    ldr r5, [sp, #4]         // left_shift
    ldr r6, [sp, #8]         // right_shift

    ldr r7, [sp, #12]        // acc_min
    vdup.32 q11, r7

    ldr r8, [sp, #16]        // acc_max
    vdup.32 q10, r8

    mov r10, r0

LoopDepth8:
    cmp r2, #8
    blt End
    vld1.32 {q0}, [r1]!
    vld1.32 {q13}, [r5]!
    vshl.s32 q0, q0, q13
    vld1.32 {q14}, [r4]!
    vqrdmulh.s32 q0, q0, q14
    vld1.32 {q12}, [r6]!
    vand q4, q0, q12
    vshr.s32 q4, q4, #31
    vqadd.s32 q0, q0, q4
    vrshl.s32 q0, q0, q12
    vadd.i32 q0, q0, q15
    vmax.s32 q0, q0, q11
    vmin.s32 q0, q0, q10
    vqmovn.s32 d4, q0

    vld1.32 {q1}, [r1]!
    vld1.32 {q13}, [r5]!
    vshl.s32 q1, q1, q13
    vld1.32 {q14}, [r4]!
    vqrdmulh.s32 q1, q1, q14
    vld1.32 {q12}, [r6]!
    vand q4, q1, q12
    vshr.s32 q4, q4, #31
    vqadd.s32 q1, q1, q4
    vrshl.s32 q1, q1, q12
    vadd.i32 q1, q1, q15
    vmax.s32 q1, q1, q11
    vmin.s32 q1, q1, q10
    vqmovn.s32 d5, q1
    vqmovn.s16 d4, q2

    vst1.8 {d4}, [r10]!

    sub r2, r2, #8
    b LoopDepth8

LoopDepth4:
    cmp r2, #4
    blt End
    vld1.32 {q0}, [r1]!
    vld1.32 {q13}, [r5]!
    vshl.s32 q0, q0, q13
    vld1.32 {q14}, [r4]!
    vqrdmulh.s32 q0, q0, q14
    vld1.32 {q12}, [r6]!
    vand q4, q0, q12
    vshr.s32 q4, q4, #31
    vqadd.s32 q0, q0, q4
    vrshl.s32 q0, q0, q12
    vadd.i32 q0, q0, q15
    vmax.s32 q0, q0, q11
    vmin.s32 q0, q0, q10

    vqmovn.s32 d0, q0
    vqmovn.s16 d0, q0

    vst1.8 {d0[0]}, [r10]!
    vst1.8 {d0[1]}, [r10]!
    vst1.8 {d0[2]}, [r10]!
    vst1.8 {d0[3]}, [r10]!

    sub r2, r2, #4
    b LoopDepth4
End:
    sub sp, sp, #88
    vpop {q4-q7}
    pop {r4-r8, r10}
    bx lr

#endif
#endif
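The `vqrdmulh`/`vrshl` pair above is the usual fixed-point requantization: a saturating rounded doubling multiply followed by a rounding right shift, then zero-point addition and clamping. A simplified scalar model in Python (the negative-value rounding correction done by `vand`/`vshr`/`vqadd` is folded into a plain rounding shift here, so this is an approximation, not a bit-exact reimplementation):

```
def requantize(acc, left_shift, multiplier, right_shift, output_zp,
               acc_min=-128, acc_max=127):
    # Pre-shift, then take the rounded high half of the doubled product (vqrdmulh).
    x = acc << left_shift
    x = (x * multiplier + (1 << 30)) >> 31
    # Rounding right shift (vrshl by a negative amount in the NEON code).
    if right_shift > 0:
        x = (x + (1 << (right_shift - 1))) >> right_shift
    # Add the output zero point and clamp (vadd / vmax / vmin).
    return max(acc_min, min(acc_max, x + output_zp))
```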
@@ -18,7 +18,7 @@
#include <string.h>
#include "nnacl/errorcode.h"

int ExpandDims(float *input_ptr, float *output_ptr, size_t data_size) {
int ExpandDims(void *input_ptr, void *output_ptr, size_t data_size) {
  memcpy(output_ptr, input_ptr, data_size);
  return NNACL_OK;
}
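Widening `ExpandDims` to `void *` works because inserting a size-1 axis never rearranges element data; the kernel is a pure `memcpy` regardless of element type, and only shape metadata changes. The same fact, stated in NumPy terms:

```
import numpy as np

a = np.ones((2, 3), dtype=np.float32)
b = np.expand_dims(a, axis=0)      # shape (1, 2, 3)
assert b.base is a                 # a view, not a copy: same buffer
assert a.tobytes() == b.tobytes()  # identical byte layout
```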
@@ -27,7 +27,7 @@ typedef struct ExpandDimsParameter {
#ifdef __cplusplus
extern "C" {
#endif
int ExpandDims(float *input_ptr, float *output_ptr, size_t data_size);
int ExpandDims(void *input_ptr, void *output_ptr, size_t data_size);
#ifdef __cplusplus
}
#endif
@@ -1,26 +0,0 @@
/**
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#include "nnacl/fp32/leaky_relu.h"

void DoLeakyRelu(float *input, float *output, LeakyReluParameter *param, int task_id) {
  for (int i = task_id; i < param->input_num_; i += param->op_parameter_.thread_num_) {
    if (input[i] <= 0) {
      output[i] = input[i] * param->slope_[0];
    } else {
      output[i] = input[i];
    }
  }
}
@@ -1,30 +0,0 @@
/**
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#ifndef MINDSPORE_LITE_NNACL_PRELU_H_
#define MINDSPORE_LITE_NNACL_PRELU_H_

#include "nnacl/op_base.h"
#include "nnacl/leaky_relu_parameter.h"

#ifdef __cplusplus
extern "C" {
#endif
void DoLeakyRelu(float *input, float *output, LeakyReluParameter *prelu_param_, int task_id);
#ifdef __cplusplus
}
#endif

#endif  // MINDSPORE_LITE_NNACL_PRELU_H_
@@ -35,6 +35,9 @@ void PostFuncInt8C4(const int32_t *in, const int32_t *bias, int8_t *out, size_t
#ifdef ENABLE_ARM
void ConvDwInt8Row(int32_t *output_ptr, const int8_t *input_ptr, const int16_t *weight_ptr, int num_pixels,
                   int output_channel, int input_step, int8_t input_zp);
void ConvDwInt8PostAlign4PerChannel(int8_t *dst, int32_t *buffer, int channel4, int32_t output_zp,
                                    int32_t *out_multiplier, int32_t *left_shift, int32_t *right_shift,
                                    int32_t acc_min, int32_t acc_max);
void ConvDwInt8PostAlign4(int8_t *dst, int32_t *buffer, int num_pixels, int32_t output_zp, int32_t out_multiplier,
                          int32_t left_shift, int32_t right_shift, int32_t acc_min, int32_t acc_max);
void IndirectGemmInt16to32_8x4(int32_t *dst, const int16_t *src, const int16_t *weight, size_t ksize, size_t ic8,

@@ -64,9 +67,6 @@ void IndirectGemmInt8_4x4(int8_t *output, const int8_t *input, const int8_t *wei
void DeconvDwInt8Center(int32_t *dst, const int16_t *src, const int16_t *weight, size_t height, size_t width,
                        size_t kernel_h, size_t kernel_w, size_t out_h_step, size_t block_channel, size_t in_sh_step,
                        size_t in_sw_step, size_t in_kh_step, size_t in_kw_step);
void ConvDwInt8PostAlign4PerChannel(int8_t *dst, int32_t *buffer, int channel4, int32_t output_zp,
                                    int32_t *out_multiplier, int32_t *left_shift, int32_t *right_shift,
                                    int32_t acc_min, int32_t acc_max);
#endif

#ifdef __cplusplus
@@ -39,7 +39,7 @@ void ConvDwInt8Post(int8_t *dst, int32_t *buffer, int output_w, int channel, int
  // support per-channel
  for (int w = 0; w < output_w; w++) {
    int channel4 = 0;
#ifdef ENABLE_ARM64
#ifdef ENABLE_ARM
    channel4 = channel / 4 * 4;
    ConvDwInt8PostAlign4PerChannel(dst, buffer, channel4, output_zp, out_multiplier, left_shift, right_shift, acc_min,
                                   acc_max);
@@ -17,8 +17,7 @@
#include "nnacl/int8/leaky_relu_int8.h"
#include "nnacl/errorcode.h"

int DoLeakReluInt8(int8_t *inputs, int8_t *output_ptr, LeakyReluQuantArg *quant_prelu_parm, QuantArg *input_quant,
                   int task_id) {
int DoLeakReluInt8(int8_t *inputs, int8_t *output_ptr, LeakyReluQuantArg *quant_prelu_parm, int task_id) {
  if (quant_prelu_parm == NULL) {
    return NNACL_NULL_PTR;
  }

@@ -27,17 +26,12 @@ int DoLeakReluInt8(int8_t *inputs, int8_t *output_ptr, LeakyReluQuantArg *quant_
  const float output_inverse_scale = 1.f / output_scale;
  int output_dim = quant_prelu_parm->input_dim_;

  float scale = quant_prelu_parm->quant_arg.in_args_.scale_ * output_inverse_scale;
  float bias = -quant_prelu_parm->quant_arg.in_args_.zp_ * scale;
  for (int i = 0; i < output_dim; i++) {
    input_quant[i].scale_ = quant_prelu_parm->quant_arg.in_args_.scale_;
    input_quant[i].zp_ = quant_prelu_parm->quant_arg.in_args_.zp_;
  }

  for (int i = 0; i < output_dim; i++) {
    float scale = input_quant[i].scale_ * output_inverse_scale;
    float bias = -input_quant[i].zp_ * scale;
    for (int j = task_id; j < quant_prelu_parm->element_num; j += quant_prelu_parm->op_parameter_.thread_num_) {
      if (inputs[j] <= 0) {
        int32_t output_tmp = round(inputs[j] * quant_prelu_parm->slope_[0] * scale + bias) + output_zp;
        int32_t output_tmp = round(inputs[j] * quant_prelu_parm->slope_ * scale + bias) + output_zp;
        if (output_tmp > 127) {
          output_ptr[j] = 127;
        } else if (output_tmp < -128) {

@@ -57,6 +51,5 @@ int DoLeakReluInt8(int8_t *inputs, int8_t *output_ptr, LeakyReluQuantArg *quant_
      }
    }
  }
  free(input_quant);
  return NNACL_OK;
}
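In the quantized leaky ReLU above, `scale = in_scale / out_scale` and `bias = -in_zp * scale` map int8 inputs into the output's quantized domain; negative inputs are additionally multiplied by the float slope before rounding, adding the output zero point, and clamping to [-128, 127]. A scalar Python sketch of that formula (parameter values illustrative, positive branch assumed symmetric to the hunk's negative branch):

```
def leaky_relu_int8(x, slope, in_scale, in_zp, out_scale, out_zp):
    scale = in_scale / out_scale
    bias = -in_zp * scale
    if x <= 0:
        y = round(x * slope * scale + bias) + out_zp
    else:
        y = round(x * scale + bias) + out_zp
    return max(-128, min(127, y))

print(leaky_relu_int8(-40, 0.1, 0.02, 0, 0.02, 0))  # -4
```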
@@ -23,8 +23,7 @@
#ifdef __cplusplus
extern "C" {
#endif
int DoLeakReluInt8(int8_t *inputs, int8_t *output_ptr, LeakyReluQuantArg *quant_Prelu_parm, QuantArg *input_quant,
                   int task_id);
int DoLeakReluInt8(int8_t *inputs, int8_t *output_ptr, LeakyReluQuantArg *quant_Prelu_parm, int task_id);
#ifdef __cplusplus
}
#endif
@@ -1,29 +0,0 @@
/**
 * Copyright 2020 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#ifndef MINDSPORE_LITE_NNACL_LEAKY_RELU_PARAMETER_H_
#define MINDSPORE_LITE_NNACL_LEAKY_RELU_PARAMETER_H_

#include "nnacl/op_base.h"

typedef struct LeakyReluParameter {
  OpParameter op_parameter_;
  float *slope_;
  size_t slope_num_;
  int input_num_;
} LeakyReluParameter;

#endif  // MINDSPORE_LITE_NNACL_LEAKY_RELU_PARAMETER_H_
@@ -252,7 +252,7 @@ typedef struct PowerQuantArg {
typedef struct LeakyReluQuantArg {
  OpParameter op_parameter_;
  PreluQuantArg quant_arg;
  float *slope_;
  float slope_;
  int64_t axis_;
  const int *in_shape_;
  const int *out_shape_;