forked from mindspore-Ecosystem/mindspore
!27141 add chinese api comments
Merge pull request !27141 from wangnan39/code_docs_add_chinese_docs
This commit is contained in:

commit 204f6528ae

@ -38,7 +38,7 @@ mindspore.DynamicLossScaleManager

.. py:method:: get_update_cell()

返回用于在 :class:`mindspore.TrainOneStepWithLossScaleCell` 中更新梯度放大系数的 `Cell` 实例。
返回用于更新梯度放大系数的 `Cell` 实例,:class:`mindspore.TrainOneStepWithLossScaleCell` 会调用该实例。

**返回:**
@ -44,7 +44,7 @@ mindspore.FixedLossScaleManager
.. py:method:: get_update_cell()

返回用于更新 `loss_scale` 值的 `Cell` 实例,该实例将在 :class:`mindspore.TrainOneStepWithLossScaleCell` 中执行。
返回用于更新 `loss_scale` 值的 `Cell` 实例,:class:`mindspore.TrainOneStepWithLossScaleCell` 会调用该实例。该类使用固定的梯度放大系数,因此该实例不执行任何操作。

**返回:**
|
|
@ -5,7 +5,8 @@ mindspore.LossScaleManager
混合精度梯度放大系数(loss scale)管理器的抽象类。

派生类需要该类的所有方法。 `get_loss_scale` 用于获取当前的梯度放大系数。`update_loss_scale` 用于更新梯度放大系数,该方法将在训练过程中被调用。`get_update_cell` 用于获取更新梯度放大系数的 `Cell` 实例,该实例在将训练过程中被调用。下沉模式下仅 `get_update_cell` 方式生效,非下沉模式下两种更新梯度放大系数的方式均生效。
派生类需要实现该类的所有方法。 `get_loss_scale` 用于获取当前的梯度放大系数。 `update_loss_scale` 用于更新梯度放大系数,该方法将在训练过程中被调用。 `get_update_cell` 用于获取更新梯度放大系数的 `Cell` 实例,该实例将在训练过程中被调用。当前多使用 `get_update_cell` 方式。
例如::class:`mindspore.FixedLossScaleManager` 和 :class:`mindspore.DynamicLossScaleManager` 。
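
下面给出一个按上述三个方法约定编写的最小自定义管理器草图,仅用于说明接口形态;其中 `from mindspore import LossScaleManager` 的导入方式以及 `get_update_cell` 返回 None 的做法为本示例的假设:

.. code-block:: python

    from mindspore import LossScaleManager  # 假设:LossScaleManager 可从 mindspore 顶层导入

    class ConstantLossScaleManager(LossScaleManager):
        """始终返回固定梯度放大系数的示意实现(草图)。"""
        def __init__(self, scale=128.0):
            super().__init__()
            self._scale = scale

        def get_loss_scale(self):
            # 返回当前的梯度放大系数
            return self._scale

        def update_loss_scale(self, overflow):
            # 固定系数,不随溢出状态更新(非下沉模式下会被调用)
            pass

        def get_update_cell(self):
            # 本示意不提供用于下沉模式的更新Cell,返回None为假设行为
            return None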
.. py:method:: get_loss_scale()

@ -14,7 +15,7 @@ mindspore.LossScaleManager

.. py:method:: get_update_cell()

获取用于更新梯度放大系数的 :class:`mindspore.nn.Cell` 实例。
获取用于更新梯度放大系数的Cell实例。

.. py:method:: update_loss_scale(overflow)
|
@ -1,87 +0,0 @@
|
|||
mindspore.nn.Adagrad
|
||||
=====================
|
||||
|
||||
.. py:class:: mindspore.nn.Adagrad(*args, **kwargs)
|
||||
|
||||
使用ApplyAdagrad算子实现Adagrad算法。
|
||||
|
||||
Adagrad用于在线学习和随机优化。
|
||||
请参阅论文 `Efficient Learning using Forward-Backward Splitting <https://proceedings.neurips.cc/paper/2009/file/621bf66ddb7c962aa0d22ac97d69b793-Paper.pdf>`_。
|
||||
公式如下:
|
||||
|
||||
.. math::
|
||||
\begin{array}{ll} \\
|
||||
h_{t+1} = h_{t} + g\\
|
||||
w_{t+1} = w_{t} - lr*\frac{1}{\sqrt{h_{t+1}}}*g
|
||||
\end{array}
|
||||
|
||||
:math:`h` 表示梯度平方的累积和, :math:`g` 表示 `grads` 。
|
||||
:math:`lr` 代表 `learning_rate`, :math:`w` 代表 `params` 。
|
||||
|
||||
.. note::
|
||||
在参数未分组时,优化器配置的 `weight_decay` 应用于名称含有"beta"或"gamma"的网络参数,通过网络参数分组可调整权重衰减策略。分组时,每组网络参数均可配置 `weight_decay` ,若未配置,则该组网络参数使用优化器中配置的 `weight_decay` 。
|
||||
|
||||
**参数:**
|
||||
|
||||
- **params** (Union[list[Parameter], list[dict]]) - 必须是 `Parameter` 组成的列表或字典组成的列表。当列表元素是字典时,字典的键可以是"params"、"lr"、"weight_decay"、"grad_centralization"和"order_params":
|
||||
|
||||
- **params** - 必填。当前组别的权重,该值必须是 `Parameter` 列表。
|
||||
- **lr** - 可选。如果键中存在"lr",则使用对应的值作为学习率。如果没有,则使用优化器中配置的 `learning_rate` 作为学习率。
|
||||
- **weight_decay** - 可选。如果键中存在"weight_decay",则使用对应的值作为权重衰减值。如果没有,则使用优化器中配置的 `weight_decay` 作为权重衰减值。
|
||||
- **grad_centralization** - 可选。如果键中存在"grad_centralization",则使用对应的值,该值必须为布尔类型。如果没有,则认为 `grad_centralization` 为False。该参数仅适用于卷积层。
|
||||
- **order_params** - 可选。对应值是预期的参数更新顺序。当使用参数分组功能时,通常使用该配置项保持 `parameters` 的顺序以提升性能。如果键中存在"order_params",则会忽略该组配置中的其他键。"order_params"中的参数必须在某一组 `params` 参数中。
|
||||
|
||||
- **accum** (float) - 累加器 :math:`h` 的初始值,必须大于等于零。默认值:0.1。
|
||||
- **learning_rate** (Union[float, Tensor, Iterable, LearningRateSchedule]) - 默认值:0.001。
|
||||
|
||||
- **float** - 固定的学习率。必须大于等于零。
|
||||
- **int** - 固定的学习率。必须大于等于零。整数类型会被转换为浮点数。
|
||||
- **Tensor** - 可以是标量或一维向量。标量是固定的学习率。一维向量是动态的学习率,第i步将取向量中第i个值作为学习率。
|
||||
- **Iterable** - 动态的学习率。第i步将取迭代器第i个值作为学习率。
|
||||
- **LearningRateSchedule** - 动态的学习率。在训练过程中,优化器将使用步数(step)作为输入,调用 `LearningRateSchedule` 实例来计算当前学习率。
|
||||
|
||||
- **update_slots** (bool) - 如果为True,则更新累加器 :math:`h` 。默认值:True。
|
||||
- **loss_scale** (float) - 梯度缩放系数,必须大于0。如果 `loss_scale` 是整数,它将被转换为浮点数。通常使用默认值,仅当训练时使用了 `FixedLossScaleManager` ,且 `FixedLossScaleManager` 的 `drop_overflow_update` 属性配置为False时,此值需要与 `FixedLossScaleManager` 中的 `loss_scale` 相同。有关更多详细信息,请参阅 :class:`mindspore.FixedLossScaleManager` 。默认值:1.0。
|
||||
- **weight_decay** (Union[float, int]) - 要乘以权重的权重衰减值,必须大于等于0.0。默认值:0.0。
|
||||
|
||||
**输入:**
|
||||
|
||||
**grads** (tuple[Tensor]) - 优化器中 `params` 的梯度,形状(shape)与 `params` 相同。
|
||||
|
||||
**输出:**
|
||||
|
||||
Tensor[bool],值为True。
|
||||
|
||||
**异常:**
|
||||
|
||||
- **TypeError** - `learning_rate` 不是int、float、Tensor、Iterable或 `LearningRateSchedule` 。
|
||||
- **TypeError** - `parameters` 的元素是 `Parameter` 或字典。
|
||||
- **TypeError** - `accum` 或 `loss_scale` 不是float。
|
||||
- **TypeError** - `update_slots` 不是bool。
|
||||
- **TypeError** - `weight_decay` 不是float或int。
|
||||
- **ValueError** - `loss_scale` 小于或等于0。
|
||||
- **ValueError** - `accum` 或 `weight_decay` 小于0。
|
||||
|
||||
**支持平台:**
|
||||
|
||||
``Ascend`` ``CPU`` ``GPU``
|
||||
|
||||
**样例:**
|
||||
|
||||
>>> net = Net()
|
||||
>>> #1) 所有参数使用相同的学习率和权重衰减
|
||||
>>> optim = nn.Adagrad(params=net.trainable_params())
|
||||
>>>
|
||||
>>> #2) 使用参数组并设置不同的值
|
||||
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01, 'grad_centralization':True},
|
||||
... {'params': no_conv_params, 'lr': 0.01},
|
||||
... {'order_params': net.trainable_params()}]
|
||||
>>> optim = nn.Adagrad(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||
>>> # conv_params参数组将使用优化器中的学习率0.1、该组的权重衰减0.01、该组的梯度中心化配置True。
|
||||
>>> # no_conv_params参数组将使用该组的学习率0.01、优化器中的权重衰减0.0、梯度中心化使用默认值False。
|
||||
>>> # 优化器按照"order_params"配置的参数顺序更新参数。
|
||||
>>>
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> model = Model(net, loss_fn=loss, optimizer=optim)
|
|
@ -1,101 +0,0 @@
|
|||
mindspore.nn.Adam
|
||||
==================
|
||||
|
||||
.. py:class:: mindspore.nn.Adam(*args, **kwargs)
|
||||
|
||||
通过Adaptive Moment Estimation (Adam)算法更新梯度。
|
||||
|
||||
请参阅论文 `Adam: A Method for Stochastic Optimization <https://arxiv.org/abs/1412.6980>`_。
|
||||
|
||||
公式如下:
|
||||
|
||||
.. math::
|
||||
\begin{array}{ll} \\
|
||||
m_{t+1} = \beta_1 * m_{t} + (1 - \beta_1) * g \\
|
||||
v_{t+1} = \beta_2 * v_{t} + (1 - \beta_2) * g * g \\
|
||||
l = \alpha * \frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t} \\
|
||||
w_{t+1} = w_{t} - l * \frac{m_{t+1}}{\sqrt{v_{t+1}} + \epsilon}
|
||||
\end{array}
|
||||
|
||||
:math:`m` 代表第一个动量矩阵 `moment1` ,:math:`v` 代表第二个动量矩阵 `moment2` ,:math:`g` 代表 `gradients` ,:math:`l` 代表缩放因子,:math:`\beta_1,\beta_2` 代表 `beta1` 和 `beta2` ,:math:`t` 代表更新步骤,:math:`beta_1^t` 和 :math:`beta_2^t` 代表 `beta1_power` 和 `beta2_power` ,:math:`\alpha` 代表 `learning_rate` , :math:`w` 代表 `params` , :math:`\epsilon` 代表 `eps` 。
|
||||
|
||||
.. note::
|
||||
如果前向网络使用了SparseGatherV2等算子,优化器会执行稀疏运算,通过设置 `target` 为CPU,可在主机(host)上进行稀疏运算。
|
||||
稀疏特性在持续开发中。
|
||||
|
||||
在参数未分组时,优化器配置的 `weight_decay` 应用于名称含有"beta"或"gamma"的网络参数,通过网络参数分组可调整权重衰减策略。分组时,每组网络参数均可配置 `weight_decay` ,若未配置,则该组网络参数使用优化器中配置的 `weight_decay` 。
|
||||
|
||||
**参数:**
|
||||
|
||||
- **params** (Union[list[Parameter], list[dict]]) - 必须是 `Parameter` 组成的列表或字典组成的列表。当列表元素是字典时,字典的键可以是"params"、"lr"、"weight_decay"、"grad_centralization"和"order_params":
|
||||
|
||||
- **params** - 必填。当前组别的权重,该值必须是 `Parameter` 列表。
|
||||
- **lr** - 可选。如果键中存在"lr",则使用对应的值作为学习率。如果没有,则使用优化器中配置的 `learning_rate` 作为学习率。
|
||||
- **weight_decay** - 可选。如果键中存在"weight_decay”,则使用对应的值作为权重衰减值。如果没有,则使用优化器中配置的 `weight_decay` 作为权重衰减值。
|
||||
- **grad_centralization** - 可选。如果键中存在"grad_centralization",则使用对应的值,该值必须为布尔类型。如果没有,则认为 `grad_centralization` 为False。该参数仅适用于卷积层。
|
||||
- **order_params** - 可选。对应值是预期的参数更新顺序。当使用参数分组功能时,通常使用该配置项保持 `parameters` 的顺序以提升性能。如果键中存在"order_params",则会忽略该组配置中的其他键。"order_params"中的参数必须在某一组 `params` 参数中。
|
||||
|
||||
- **learning_rate** (Union[float, Tensor, Iterable, LearningRateSchedule]): 默认值:1e-3。
|
||||
|
||||
- **float** - 固定的学习率。必须大于等于零。
|
||||
- **int** - 固定的学习率。必须大于等于零。整数类型会被转换为浮点数。
|
||||
- **Tensor** - 可以是标量或一维向量。标量是固定的学习率。一维向量是动态的学习率,第i步将取向量中第i个值作为学习率。
|
||||
- **Iterable** - 动态的学习率。第i步将取迭代器第i个值作为学习率。
|
||||
- **LearningRateSchedule** - 动态的学习率。在训练过程中,优化器将使用步数(step)作为输入,调用 `LearningRateSchedule` 实例来计算当前学习率。
|
||||
|
||||
- **beta1** (float) - `moment1` 的指数衰减率。参数范围(0.0,1.0)。默认值:0.9。
|
||||
- **beta2** (float) - `moment2` 的指数衰减率。参数范围(0.0,1.0)。默认值:0.999。
|
||||
- **eps** (float) - 将添加到分母中,以提高数值稳定性。必须大于0。默认值:1e-8。
|
||||
- **use_locking** (bool) - 是否对参数更新加锁保护。如果为True,则 `w` 、`m` 和 `v` 的tensor更新将受到锁的保护。如果为False,则结果不可预测。默认值:False。
|
||||
- **use_nesterov** (bool) - 是否使用Nesterov Accelerated Gradient (NAG)算法更新梯度。如果为True,使用NAG更新梯度。如果为False,则在不使用NAG的情况下更新梯度。默认值:False。
|
||||
- **weight_decay** (float) - 权重衰减(L2 penalty)。必须大于等于0。默认值:0.0。
|
||||
- **loss_scale** (float) - 梯度缩放系数,必须大于0。如果 `loss_scale` 是整数,它将被转换为浮点数。通常使用默认值,仅当训练时使用了 `FixedLossScaleManager` ,且 `FixedLossScaleManager` 的 `drop_overflow_update` 属性配置为False时,此值需要与 `FixedLossScaleManager` 中的 `loss_scale` 相同。有关更多详细信息,请参阅 :class:`mindspore.FixedLossScaleManager` 。默认值:1.0。
|
||||
|
||||
**输入:**
|
||||
|
||||
**gradients** (tuple[Tensor]) - `params` 的梯度,形状(shape)与 `params` 相同。
|
||||
|
||||
**输出:**
|
||||
|
||||
Tensor[bool],值为True。
|
||||
|
||||
**异常:**
|
||||
|
||||
- **TypeError** - `learning_rate` 不是int、float、Tensor、Iterable或LearningRateSchedule。
|
||||
- **TypeError** - `parameters` 的元素不是Parameter或字典。
|
||||
- **TypeError** - `beta1` 、`beta2` 、 `eps` 或 `loss_scale` 不是float。
|
||||
- **TypeError** - `weight_decay` 不是float或int。
|
||||
- **TypeError** - `use_locking` 或 `use_nesterov` 不是bool。
|
||||
- **ValueError** - `loss_scale` 或 `eps` 小于或等于0。
|
||||
- **ValueError** - `beta1` 、`beta2` 不在(0.0,1.0)范围内。
|
||||
- **ValueError** - `weight_decay` 小于0。
|
||||
|
||||
**支持平台:**
|
||||
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
**样例:**
|
||||
|
||||
>>> net = Net()
|
||||
>>> #1) 所有参数使用相同的学习率和权重衰减
|
||||
>>> optim = nn.Adam(params=net.trainable_params())
|
||||
>>>
|
||||
>>> #2) 使用参数组并设置不同的值
|
||||
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01, 'grad_centralization':True},
|
||||
... {'params': no_conv_params, 'lr': 0.01},
|
||||
... {'order_params': net.trainable_params()}]
|
||||
>>> optim = nn.Adam(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||
>>> # conv_params参数组将使用优化器中的学习率0.1、该组的权重衰减0.01、该组的梯度中心化配置True。
|
||||
>>> # no_conv_params参数组将使用该组的学习率0.01、优化器中的权重衰减0.0、梯度中心化使用默认值False。
|
||||
>>> # 优化器按照"order_params"配置的参数顺序更新参数。
|
||||
>>>
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> model = Model(net, loss_fn=loss, optimizer=optim)
|
||||
|
||||
|
||||
.. py:method:: target
|
||||
:property:
|
||||
|
||||
该属性用于指定在主机(host)上还是设备(device)上更新参数。输入类型为str,只能是'CPU','Ascend'或'GPU'。
|
|
@ -0,0 +1,93 @@
|
|||
Class mindspore.nn.AdamOffload(params, learning_rate=1e-3, beta1=0.9, beta2=0.999, eps=1e-08, use_locking=False, use_nesterov=False, weight_decay=0.0, loss_scale=1.0)
|
||||
|
||||
此优化器在主机CPU上运行Adam优化算法,设备上仅执行网络参数的更新,最大限度地降低内存成本。
|
||||
虽然会增加性能开销,但优化器可以运行更大的模型。
|
||||
|
||||
|
||||
Adam算法参见`Adam: A Method for Stochastic Optimization <https://arxiv.org/abs/1412.6980>`_。
|
||||
|
||||
更新公式如下:
|
||||
|
||||
.. math::
|
||||
\begin{array}{ll} \\
|
||||
m_{t+1} = \beta_1 * m_{t} + (1 - \beta_1) * g \\
|
||||
v_{t+1} = \beta_2 * v_{t} + (1 - \beta_2) * g * g \\
|
||||
l = \alpha * \frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t} \\
|
||||
w_{t+1} = w_{t} - l * \frac{m_{t+1}}{\sqrt{v_{t+1}} + \epsilon}
|
||||
\end{array}
|
||||
|
||||
:math:`m`代表第一个矩向量`moment1`,:math:`v`代表第二个矩向量`moment2`,:math:`g`代表`gradients`,:math:`l`代表缩放因子,:math:`\beta_1,\beta_2`代表`beta1`和`beta2`,:math:`t`代表当前step,:math:`beta_1^t`和:math:`beta_2^t`代表`beta1_power`和`beta2_power`,:math:`\alpha`代表`learning_rate`,:math:`w`代表`params`,:math:`\epsilon`代表`eps`。
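
为便于对照上式,下面给出该更新公式的NumPy数值示意(与框架实现无关的草图,变量名沿用上文符号):

.. code-block:: python

    import numpy as np

    # Adam更新公式的NumPy示意:m、v为矩向量,g为梯度,t为当前step(从1开始)
    def adam_step(w, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        l = lr * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)   # 缩放因子
        w = w - l * m / (np.sqrt(v) + eps)
        return w, m, v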
|
||||
|
||||
.. note::
|
||||
此优化器目前仅支持图模式。
|
||||
|
||||
.. include:: mindspore.nn.optim_note_weight_decay.rst
|
||||
|
||||
参数:
|
||||
- **params** (Union[list[Parameter], list[dict]]) - 必须是 `Parameter` 组成的列表或字典组成的列表。当列表元素是字典时,字典的键可以是"params"、"lr"、"weight_decay"、和"order_params":
|
||||
|
||||
.. include:: mindspore.nn.optim_group_param.rst
|
||||
|
||||
.. include:: mindspore.nn.optim_group_lr.rst
|
||||
|
||||
.. include:: mindspore.nn.optim_group_weight_decay.rst
|
||||
|
||||
.. include:: mindspore.nn.optim_group_order.rst
|
||||
|
||||
|
||||
- **learning_rate** (Union[float, Tensor, Iterable, LearningRateSchedule]) - 默认值:1e-3。
|
||||
.. include:: mindspore.nn.optim_arg_dynamic_lr.rst
|
||||
|
||||
- **beta1** (float) - `moment1` 的指数衰减率。参数范围(0.0,1.0)。默认值:0.9。
|
||||
|
||||
- **beta2** (float) - `moment2` 的指数衰减率。参数范围(0.0,1.0)。默认值:0.999。
|
||||
|
||||
- **eps** (float) - 将添加到分母中,以提高数值稳定性。必须大于0。默认值:1e-8。
|
||||
|
||||
- **use_locking** (bool) - 是否对参数更新加锁保护。如果为True,则 `w` 、`m` 和 `v` 的更新将受到锁保护。如果为False,则结果不可预测。默认值:False。
|
||||
|
||||
- **use_nesterov** (bool) - 是否使用Nesterov Accelerated Gradient (NAG)算法更新梯度。如果为True,使用NAG更新梯度。如果为False,则在不使用NAG的情况下更新梯度。默认值:False。
|
||||
|
||||
- **weight_decay** (float) - 权重衰减(L2 penalty)。必须大于等于0。默认值:0.0。
|
||||
|
||||
.. include:: mindspore.nn.optim_arg_loss_scale.rst
|
||||
|
||||
|
||||
输入:
|
||||
- **gradients** (tuple[Tensor]):`params`的梯度,shape与`params`相同。
|
||||
|
||||
输出:
|
||||
Tensor[bool],值为True。
|
||||
|
||||
异常:
|
||||
TypeError:`learning_rate`不是int、float、Tensor、Iterable或LearningRateSchedule。
|
||||
TypeError:`parameters`的元素不是Parameter或字典。
|
||||
TypeError:`beta1`、`beta2`、`eps`或`loss_scale`不是float。
|
||||
TypeError:`weight_decay`不是float或int。
|
||||
TypeError:`use_locking`或`use_nesterov`不是bool。
|
||||
ValueError:`loss_scale`或`eps`不大于0。
|
||||
ValueError:`beta1`、`beta2`不在(0.0,1.0)范围内。
|
||||
ValueError:`weight_decay`小于0。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
示例:
|
||||
>>> net = Net()
|
||||
>>> #1) 所有参数使用相同的学习率和权重衰减
|
||||
>>> optim = nn.AdamOffload(params=net.trainable_params())
|
||||
>>>
|
||||
>>> #2) 使用参数分组并设置不同的值
|
||||
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01},
|
||||
... {'params': no_conv_params, 'lr': 0.01},
|
||||
... {'order_params': net.trainable_params()}]
|
||||
>>> optim = nn.AdamOffload(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||
>>> # conv_params参数组将使用优化器中的学习率0.1、该组的权重衰减0.01。
|
||||
>>> # no_conv_params参数组将使用该组的学习率0.01、优化器中的权重衰减0.0。
|
||||
>>> # 优化器按照"order_params"配置的参数顺序更新参数。
|
||||
>>>
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> model = Model(net, loss_fn=loss, optimizer=optim)
|
||||
|
|
@ -0,0 +1,87 @@
|
|||
Class mindspore.nn.AdamWeightDecay(params, learning_rate=1e-3, beta1=0.9, beta2=0.999, eps=1e-06, weight_decay=0.0)
|
||||
|
||||
实现权重衰减Adam算法。
|
||||
|
||||
.. math::
|
||||
\begin{array}{ll} \\
|
||||
m_{t+1} = \beta_1 * m_{t} + (1 - \beta_1) * g \\
|
||||
v_{t+1} = \beta_2 * v_{t} + (1 - \beta_2) * g * g \\
|
||||
update = \frac{m_{t+1}}{\sqrt{v_{t+1}} + eps} \\
|
||||
update =
|
||||
\begin{cases}
|
||||
update + weight\_decay * w_{t}
|
||||
& \text{ if } weight\_decay > 0 \\
|
||||
update
|
||||
& \text{ otherwise }
|
||||
\end{cases} \\
|
||||
w_{t+1} = w_{t} - lr * update
|
||||
\end{array}
|
||||
|
||||
:math:`m`表示第1矩向量`moment1`,:math:`v`表示第2矩向量`moment2`,:math:`g`表示`gradients`,:math:`lr`表示`learning_rate`,:math:`\beta_1, \beta_2`表示`beta1`和`beta2`,:math:`t`表示当前step,:math:`w`表示`params`。
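
下面是上式的NumPy数值示意(草图,不含框架细节;与Adam不同,权重衰减直接加在update上,且不做偏差修正):

.. code-block:: python

    import numpy as np

    def adamw_step(w, m, v, g, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-6, weight_decay=0.0):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        update = m / (np.sqrt(v) + eps)
        if weight_decay > 0:
            update = update + weight_decay * w   # weight_decay > 0 时附加权重衰减项
        w = w - lr * update
        return w, m, v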
|
||||
|
||||
|
||||
.. note::
|
||||
.. include:: mindspore.nn.optim_note_loss_scale.rst
|
||||
.. include:: mindspore.nn.optim_note_weight_decay.rst
|
||||
|
||||
参数:
|
||||
params (Union[list[Parameter], list[dict]]) - 必须是 `Parameter` 组成的列表或字典组成的列表。当列表元素是字典时,字典的键可以是"params"、"lr"、"weight_decay"、和"order_params":
|
||||
|
||||
.. include:: mindspore.nn.optim_group_param.rst
|
||||
|
||||
.. include:: mindspore.nn.optim_group_lr.rst
|
||||
|
||||
.. include:: mindspore.nn.optim_group_weight_decay.rst
|
||||
|
||||
.. include:: mindspore.nn.optim_group_order.rst
|
||||
|
||||
|
||||
learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): 默认值:1e-3。
|
||||
|
||||
.. include:: mindspore.nn.optim_arg_dynamic_lr.rst
|
||||
|
||||
beta1 (float):`moment1` 的指数衰减率。参数范围(0.0,1.0)。默认值:0.9。
|
||||
|
||||
beta2 (float):`moment2` 的指数衰减率。参数范围(0.0,1.0)。默认值:0.999。
|
||||
|
||||
eps (float):将添加到分母中,以提高数值稳定性。必须大于0。默认值:1e-6。
|
||||
|
||||
weight_decay (float):权重衰减(L2 penalty)。必须大于等于0。默认值:0.0。
|
||||
|
||||
输入:
|
||||
- **gradients** (tuple[Tensor]):`params`的梯度,shape与`params`相同。
|
||||
|
||||
输出:
|
||||
tuple[bool],所有元素都为True。
|
||||
|
||||
异常:
|
||||
TypeError:`learning_rate`不是int、float、Tensor、Iterable或LearningRateSchedule。
|
||||
TypeError:`parameters`的元素不是Parameter或字典。
|
||||
TypeError:`beta1`、`beta2`或`eps`不是float。
|
||||
TypeError:`weight_decay`不是float或int。
|
||||
ValueError:`eps`小于等于0。
|
||||
ValueError:`beta1`、`beta2`不在(0.0,1.0)范围内。
|
||||
ValueError:`weight_decay`小于0。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
示例:
|
||||
>>> net = Net()
|
||||
>>> #1) 所有参数使用相同的学习率和权重衰减
|
||||
>>> optim = nn.AdamWeightDecay(params=net.trainable_params())
|
||||
>>>
|
||||
>>> #2) 使用参数分组并设置不同的值
|
||||
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01},
|
||||
... {'params': no_conv_params, 'lr': 0.01},
|
||||
... {'order_params': net.trainable_params()}]
|
||||
>>> optim = nn.AdamWeightDecay(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||
>>> # conv_params参数组将使用优化器中的学习率0.1、该组的权重衰减0.01。
|
||||
>>> # no_conv_params参数组将使用该组的学习率0.01、优化器中的权重衰减0.0。
|
||||
>>> # 优化器按照"order_params"配置的参数顺序更新参数。
|
||||
>>>
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> model = Model(net, loss_fn=loss, optimizer=optim)
|
||||
|
|
@ -0,0 +1,58 @@
|
|||
Class mindspore.nn.DynamicLossScaleUpdateCell(loss_scale_value, scale_factor, scale_window)
|
||||
|
||||
用于动态地更新梯度放大系数(loss scale)的神经元。
|
||||
|
||||
使用梯度放大功能进行训练时,初始梯度放大系数值为`loss_scale_value`。
|
||||
在每个训练步骤中,当出现溢出时,通过计算公式`loss_scale`/`scale_factor`减小梯度放大系数。
|
||||
如果连续`scale_window`步(step)未溢出,则将通过`loss_scale` * `scale_factor`增大梯度放大系数。
|
||||
|
||||
该类是:class:`mindspore.nn.DynamicLossScaleManager`的`get_update_cell`方法的返回值。
|
||||
训练过程中,类:class:`mindspore.TrainOneStepWithLossScaleCell`会调用该Cell来更新梯度放大系数。
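
上述增减逻辑可用如下纯Python草图表示(仅为理解用的示意,其中连续未溢出步数的计数变量与放大系数下限1.0为本示例的假设):

.. code-block:: python

    def update_loss_scale(loss_scale, overflow, scale_factor, scale_window, good_steps):
        # 溢出时按 loss_scale / scale_factor 缩小
        if overflow:
            loss_scale = max(loss_scale / scale_factor, 1.0)
            good_steps = 0
        else:
            # 连续 scale_window 步未溢出时按 loss_scale * scale_factor 增大
            good_steps += 1
            if good_steps >= scale_window:
                loss_scale = loss_scale * scale_factor
                good_steps = 0
        return loss_scale, good_steps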
|
||||
|
||||
参数:
|
||||
loss_scale_value (float):初始的梯度放大系数。
|
||||
scale_factor (int):增减系数。
|
||||
scale_window (int):未溢出时,增大梯度放大系数的最大连续训练步数。
|
||||
|
||||
输入:
|
||||
- **loss_scale** (Tensor):训练期间的梯度放大系数,shape为:math:`()`。
|
||||
- **overflow** (bool):是否发生溢出。
|
||||
|
||||
输出:
|
||||
Bool,即输入`overflow`。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``GPU``
|
||||
|
||||
示例:
|
||||
>>> import numpy as np
|
||||
>>> import mindspore
>>> from mindspore import Tensor, Parameter, nn
|
||||
>>> import mindspore.ops as ops
|
||||
>>>
|
||||
>>> class Net(nn.Cell):
|
||||
... def __init__(self, in_features, out_features):
|
||||
... super(Net, self).__init__()
|
||||
... self.weight = Parameter(Tensor(np.ones([in_features, out_features]).astype(np.float32)),
|
||||
... name='weight')
|
||||
... self.matmul = ops.MatMul()
|
||||
...
|
||||
... def construct(self, x):
|
||||
... output = self.matmul(x, self.weight)
|
||||
... return output
|
||||
...
|
||||
>>> in_features, out_features = 16, 10
|
||||
>>> net = Net(in_features, out_features)
|
||||
>>> loss = nn.MSELoss()
|
||||
>>> optimizer = nn.Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9)
|
||||
>>> net_with_loss = nn.WithLossCell(net, loss)
|
||||
>>> manager = nn.DynamicLossScaleUpdateCell(loss_scale_value=2**12, scale_factor=2, scale_window=1000)
|
||||
>>> train_network = nn.TrainOneStepWithLossScaleCell(net_with_loss, optimizer, scale_sense=manager)
|
||||
>>> input = Tensor(np.ones([out_features, in_features]), mindspore.float32)
|
||||
>>> labels = Tensor(np.ones([out_features,]), mindspore.float32)
|
||||
>>> output = train_network(input, labels)
|
||||
|
||||
|
||||
get_loss_scale()
|
||||
|
||||
获取当前梯度放大系数。
|
||||
|
|
@ -0,0 +1,103 @@
|
|||
Class mindspore.nn.FTRL(*args, **kwargs)
|
||||
|
||||
使用ApplyFtrl算子实现FTRL算法。
|
||||
|
||||
FTRL是一种在线凸优化算法,根据损失函数自适应地选择正则化函数。
|
||||
详见论文`Adaptive Bound Optimization for Online Convex Optimization <https://arxiv.org/abs/1002.4908>`_。
|
||||
工程文档参阅`Ad Click Prediction: a View from the Trenches <https://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf>`_。
|
||||
|
||||
|
||||
更新公式如下:
|
||||
|
||||
.. math::
|
||||
|
||||
\begin{array}{ll} \\
|
||||
m_{t+1} = m_{t} + g^2 \\
|
||||
u_{t+1} = u_{t} + g - \frac{m_{t+1}^\text{-p} - m_{t}^\text{-p}}{\alpha } * \omega_{t} \\
|
||||
\omega_{t+1} =
|
||||
\begin{cases}
|
||||
\frac{(sign(u_{t+1}) * l1 - u_{t+1})}{\frac{m_{t+1}^\text{-p}}{\alpha } + 2 * l2 }
|
||||
& \text{ if } |u_{t+1}| > l1 \\
|
||||
0.0
|
||||
& \text{ otherwise }
|
||||
\end{cases}\\
|
||||
\end{array}
|
||||
|
||||
:math:`m`表示累加器,:math:`g`表示`grads`,:math:`t`表示当前step,:math:`u`表示需要更新的线性系数,:math:`p`表示`lr_power`,:math:`\alpha`表示`learning_rate`,:math:`\omega`表示`params`。
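
下面给出上式的NumPy数值示意(草图,变量名沿用上文符号,`lr_power`即公式中的p;累加器初始值应为正数,以避免对0取负次幂):

.. code-block:: python

    import numpy as np

    def ftrl_step(w, m, u, g, alpha=0.001, lr_power=-0.5, l1=0.0, l2=0.0):
        m_new = m + g * g
        sigma = (np.power(m_new, -lr_power) - np.power(m, -lr_power)) / alpha
        u = u + g - sigma * w
        new_m_pow = np.power(m_new, -lr_power)
        w = np.where(np.abs(u) > l1,
                     (np.sign(u) * l1 - u) / (new_m_pow / alpha + 2 * l2),
                     0.0)
        return w, m_new, u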
|
||||
|
||||
.. note::
|
||||
.. include:: mindspore.nn.optim_note_sparse.rst
|
||||
|
||||
.. include:: mindspore.nn.optim_note_weight_decay.rst
|
||||
|
||||
参数:
|
||||
params (Union[list[Parameter], list[dict]]) - 必须是 `Parameter` 组成的列表或字典组成的列表。当列表元素是字典时,字典的键可以是"params"、"lr"、"weight_decay"、"grad_centralization"和"order_params":
|
||||
|
||||
.. include:: mindspore.nn.optim_group_param.rst
|
||||
|
||||
- **lr** - 学习率。FTRL当前不支持通过参数分组为各组配置不同的学习率。
|
||||
|
||||
.. include:: mindspore.nn.optim_group_weight_decay.rst
|
||||
|
||||
.. include:: mindspore.nn.optim_group_gc.rst
|
||||
|
||||
.. include:: mindspore.nn.optim_group_order.rst
|
||||
|
||||
|
||||
initial_accum (float):累加器`m`的初始值,必须大于等于零。默认值:0.1。
|
||||
|
||||
learning_rate (float):学习速率值必须为零或正数,当前不支持动态学习率。默认值:0.001。
|
||||
|
||||
lr_power (float):学习率的幂值,控制训练期间学习率的下降方式,必须小于或等于零。如果lr_power为零,则使用固定的学习率。默认值:-0.5。
|
||||
|
||||
l1 (float):l1正则化强度,必须大于等于零。默认值:0.0。
|
||||
|
||||
l2 (float):l2正则化强度,必须大于等于零。默认值:0.0。
|
||||
|
||||
use_locking (bool):如果为True,则更新操作使用锁保护。默认值:False。
|
||||
|
||||
.. include:: mindspore.nn.optim_arg_loss_scale.rst
|
||||
|
||||
weight_decay (Union[float, int]):要乘以权重的权重衰减值,必须为零或正值。默认值:0.0。
|
||||
|
||||
输入:
|
||||
- **grads** (tuple[Tensor]):优化器中`params`的梯度,shape与优化器中的`params`相同。
|
||||
|
||||
|
||||
输出:
|
||||
tuple[Parameter],更新的参数,shape与`params`相同。
|
||||
|
||||
异常:
|
||||
TypeError:`initial_accum`、`learning_rate`、`lr_power`、`l1`、`l2`或`loss_scale`不是float。
|
||||
TypeError:`parameters`的元素不是Parameter或dict。
|
||||
TypeError:`weight_decay`不是float或int。
|
||||
TypeError:`use_nesterov`不是bool。
|
||||
ValueError:`lr_power`大于0。
|
||||
ValueError:`loss_scale`小于等于0。
|
||||
ValueError:`initial_accum`、`l1`或`l2`小于0。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
示例:
|
||||
>>> net = Net()
|
||||
>>> #1) 所有参数使用相同的学习率和权重衰减
|
||||
>>> optim = nn.FTRL(params=net.trainable_params())
|
||||
>>>
|
||||
>>> #2) 使用参数分组并设置不同的值
|
||||
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01, 'grad_centralization':True},
|
||||
... {'params': no_conv_params},
|
||||
... {'order_params': net.trainable_params()}]
|
||||
>>> optim = nn.FTRL(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||
>>> # conv_params参数组将使用优化器中的学习率0.1、该组的权重衰减0.01、该组的梯度中心化配置True。
|
||||
>>> # no_conv_params参数组使用优化器中的学习率0.1、优化器中的权重衰减0.0、梯度中心化使用默认值False。
|
||||
>>> # 优化器按照"order_params"配置的参数顺序更新参数。
|
||||
>>>
|
||||
>>>
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> model = Model(net, loss_fn=loss, optimizer=optim)
|
||||
|
||||
|
||||
.. include:: mindspore.nn.optim_target_unique_for_sparse.rst
|
|
@ -0,0 +1,51 @@
|
|||
Class mindspore.nn.FixedLossScaleUpdateCell(loss_scale_value)
|
||||
|
||||
固定梯度放大系数的神经元。
|
||||
|
||||
该类是:class:`mindspore.nn.FixedLossScaleManager`的`get_update_cell`方法的返回值。
|
||||
训练过程中,类:class:`mindspore.TrainOneStepWithLossScaleCell`会调用该Cell。
|
||||
|
||||
参数:
|
||||
loss_scale_value (float):初始梯度放大系数。
|
||||
|
||||
输入:
|
||||
- **loss_scale** (Tensor):训练期间的梯度放大系数,shape为:math:`()`,在当前类中,该值被忽略。
|
||||
- **overflow** (bool):是否发生溢出。
|
||||
|
||||
输出:
|
||||
Bool,即输入`overflow`。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``GPU``
|
||||
|
||||
示例:
|
||||
>>> import numpy as np
|
||||
>>> import mindspore
>>> from mindspore import Tensor, Parameter, nn, ops
|
||||
>>>
|
||||
>>> class Net(nn.Cell):
|
||||
... def __init__(self, in_features, out_features):
|
||||
... super(Net, self).__init__()
|
||||
... self.weight = Parameter(Tensor(np.ones([in_features, out_features]).astype(np.float32)),
|
||||
... name='weight')
|
||||
... self.matmul = ops.MatMul()
|
||||
...
|
||||
... def construct(self, x):
|
||||
... output = self.matmul(x, self.weight)
|
||||
... return output
|
||||
...
|
||||
>>> in_features, out_features = 16, 10
|
||||
>>> net = Net(in_features, out_features)
|
||||
>>> loss = nn.MSELoss()
|
||||
>>> optimizer = nn.Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9)
|
||||
>>> net_with_loss = nn.WithLossCell(net, loss)
|
||||
>>> manager = nn.FixedLossScaleUpdateCell(loss_scale_value=2**12)
|
||||
>>> train_network = nn.TrainOneStepWithLossScaleCell(net_with_loss, optimizer, scale_sense=manager)
|
||||
>>> input = Tensor(np.ones([out_features, in_features]), mindspore.float32)
|
||||
>>> labels = Tensor(np.ones([out_features,]), mindspore.float32)
|
||||
>>> output = train_network(input, labels)
|
||||
|
||||
|
||||
get_loss_scale()
|
||||
|
||||
获取当前梯度放大系数。
|
||||
|
|
@ -0,0 +1,50 @@
|
|||
Class mindspore.nn.LARS(*args, **kwargs)
|
||||
|
||||
使用LARSUpdate算子实现LARS算法。
|
||||
|
||||
LARS算法采用大量的优化技术。详见论文`LARGE BATCH TRAINING OF CONVOLUTIONAL NETWORKS <https://arxiv.org/abs/1708.03888>`_。
|
||||
|
||||
更新公式如下:
|
||||
|
||||
.. math::
|
||||
|
||||
\begin{array}{ll} \\
|
||||
\lambda = \frac{\theta \text{ * } || \omega || }{|| g_{t} || \text{ + } \delta \text{ * } || \omega || } \\
|
||||
\lambda =
|
||||
\begin{cases}
|
||||
\min(\frac{\lambda}{\alpha }, 1)
|
||||
& \text{ if } clip = True \\
|
||||
\lambda
|
||||
& \text{ otherwise }
|
||||
\end{cases}\\
|
||||
g_{t+1} = \lambda * (g_{t} + \delta * \omega)
|
||||
\end{array}
|
||||
|
||||
:math:`\theta`表示`coefficient`,:math:`\omega`表示网络参数,:math:`g`表示`gradients`,:math:`t`表示当前step,:math:`\delta`表示`optimizer`配置的`weight_decay`,:math:`\alpha`表示`optimizer`配置的`learning_rate`,:math:`clip`表示`use_clip`。
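
按上式,单个参数块的梯度缩放可写成如下NumPy示意(草图,`epsilon`即下文参数中的平滑项,用于避免除零):

.. code-block:: python

    import numpy as np

    def lars_scale(w, g, coefficient=0.001, weight_decay=0.0, lr=0.1,
                   use_clip=False, epsilon=1e-05):
        w_norm, g_norm = np.linalg.norm(w), np.linalg.norm(g)
        lam = coefficient * w_norm / (g_norm + weight_decay * w_norm + epsilon)
        if use_clip:
            lam = min(lam / lr, 1.0)
        return lam * (g + weight_decay * w)   # 缩放后的梯度,交由被封装的优化器使用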
|
||||
|
||||
|
||||
参数:
|
||||
optimizer (Optimizer):待封装和修改梯度的MindSpore优化器。
|
||||
epsilon (float):将添加到分母中,提高数值稳定性。默认值:1e-05。
|
||||
coefficient (float):计算局部学习速率的信任系数。默认值:0.001。
|
||||
use_clip (bool):计算局部学习速率时是否裁剪。默认值:False。
|
||||
lars_filter (Function):用于指定使用LARS算法的网络参数。默认值:lambda x: 'LayerNorm' not in x.name and 'bias' not in x.name。
|
||||
|
||||
输入:
|
||||
- **gradients** (tuple[Tensor]):优化器中`params`的梯度,shape与优化器中的`params`相同。
|
||||
|
||||
|
||||
输出:
|
||||
Union[Tensor[bool], tuple[Parameter]],取决于`optimizer`的输出。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``CPU``
|
||||
|
||||
示例:
|
||||
>>> net = Net()
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> opt = nn.Momentum(net.trainable_params(), 0.1, 0.9)
|
||||
>>> opt_lars = nn.LARS(opt, epsilon=1e-08, coefficient=0.02)
|
||||
>>> model = Model(net, loss_fn=loss, optimizer=opt_lars, metrics=None)
|
||||
|
|
@ -0,0 +1,89 @@
|
|||
Class mindspore.nn.Lamb(*args, **kwargs)
|
||||
|
||||
LAMB(Layer-wise Adaptive Moments optimizer for Batching training,用于批训练的分层自适应矩优化器)算法优化器。
|
||||
|
||||
LAMB是一种采用分层自适应批优化技术的优化算法。
|
||||
详见论文`LARGE BATCH OPTIMIZATION FOR DEEP LEARNING: TRAINING BERT IN 76 MINUTES <https://arxiv.org/abs/1904.00962>`_。
|
||||
|
||||
LAMB优化器旨在不降低精度的情况下增加训练batch size,支持自适应逐元素更新和精确的分层校正。
|
||||
|
||||
|
||||
参数更新如下:
|
||||
|
||||
.. math::
|
||||
\begin{gather*}
|
||||
m_t = \beta_1 m_{t - 1}+ (1 - \beta_1)g_t\\
|
||||
v_t = \beta_2 v_{t - 1} + (1 - \beta_2)g_t^2\\
|
||||
m_t = \frac{m_t}{\beta_1^t}\\
|
||||
v_t = \frac{v_t}{\beta_2^t}\\
|
||||
r_t = \frac{m_t}{\sqrt{v_t}+\epsilon}\\
|
||||
w_t = w_{t-1} -\eta_t \frac{\| w_{t-1} \|}{\| r_t + \lambda w_{t-1} \|} (r_t + \lambda w_{t-1})
|
||||
\end{gather*}
|
||||
|
||||
其中,:math:`m`代表第一个矩向量,:math:`v`代表第二个矩向量,:math:`\eta`表示学习率,:math:`\lambda`表示LAMB权重衰减率。
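
对应的NumPy数值示意如下(草图;其中偏差修正按LAMB论文常见写法使用 :math:`1-\beta^t`,信任比率分母加极小值防止除零,均为本示例的处理方式):

.. code-block:: python

    import numpy as np

    def lamb_step(w, m, v, g, t, lr, beta1=0.9, beta2=0.999, eps=1e-6, weight_decay=0.0):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)          # 偏差修正(按论文写法,为本示例假设)
        v_hat = v / (1 - beta2 ** t)
        r = m_hat / (np.sqrt(v_hat) + eps)
        update = r + weight_decay * w
        trust = np.linalg.norm(w) / (np.linalg.norm(update) + 1e-12)
        return w - lr * trust * update, m, v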
|
||||
|
||||
|
||||
.. note::
|
||||
.. include:: mindspore.nn.optim_note_weight_decay.rst
|
||||
|
||||
.. include:: mindspore.nn.optim_note_loss_scale.rst
|
||||
|
||||
参数:
|
||||
params (Union[list[Parameter], list[dict]]): 必须是 `Parameter` 组成的列表或字典组成的列表。当列表元素是字典时,字典的键可以是"params"、"lr"、"weight_decay"、"grad_centralization"和"order_params":
|
||||
|
||||
.. include:: mindspore.nn.optim_group_param.rst
|
||||
.. include:: mindspore.nn.optim_group_lr.rst
|
||||
.. include:: mindspore.nn.optim_group_weight_decay.rst
|
||||
.. include:: mindspore.nn.optim_group_gc.rst
|
||||
.. include:: mindspore.nn.optim_group_order.rst
|
||||
|
||||
learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]):
|
||||
.. include:: mindspore.nn.optim_arg_dynamic_lr.rst
|
||||
|
||||
beta1 (float):第一矩的指数衰减率。参数范围(0.0,1.0)。默认值:0.9。
|
||||
|
||||
beta2 (float):第二矩的指数衰减率。参数范围(0.0,1.0)。默认值:0.999。
|
||||
|
||||
eps (float):将添加到分母中,以提高数值稳定性。必须大于0。默认值:1e-6。
|
||||
|
||||
weight_decay (float):权重衰减(L2 penalty)。必须大于等于0。默认值:0.0。
|
||||
|
||||
输入:
|
||||
- **gradients** (tuple[Tensor]):`params`的梯度,shape与`params`相同。
|
||||
|
||||
输出:
|
||||
tuple[bool],所有元素都为True。
|
||||
|
||||
异常:
|
||||
TypeError:`learning_rate`不是int、float、Tensor、Iterable或LearningRateSchedule。
|
||||
TypeError:`parameters`的元素不是Parameter或dict。
|
||||
TypeError:`beta1`、`beta2`或`eps`不是float。
|
||||
TypeError:`weight_decay`不是float或int。
|
||||
ValueError:`eps`小于等于0。
|
||||
ValueError:`beta1`、`beta2`不在(0.0,1.0)范围内。
|
||||
ValueError:`weight_decay`小于0。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
示例:
|
||||
>>> net = Net()
|
||||
>>> #1) 所有参数使用相同的学习率和权重衰减
|
||||
>>> optim = nn.Lamb(params=net.trainable_params(), learning_rate=0.1)
|
||||
>>>
|
||||
>>> #2) 使用参数分组并设置不同的值
|
||||
>>> poly_decay_lr = learning_rate_schedule.PolynomialDecayLR(learning_rate=0.1, end_learning_rate=0.01,
|
||||
... decay_steps=4, power = 0.5)
|
||||
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01, 'grad_centralization':True},
|
||||
... {'params': no_conv_params, 'lr': poly_decay_lr},
|
||||
...                 {'order_params': net.trainable_params()}]
|
||||
>>> optim = nn.Lamb(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||
>>> # conv_params参数组将使用优化器中的学习率0.1、该组的权重衰减0.01、该组的梯度中心化配置True。
|
||||
>>> # no_conv_params参数组将使用该组的衰减学习率、优化器中的权重衰减0.0、梯度中心化使用默认值False。
|
||||
>>> # 优化器按照"order_params"配置的参数顺序更新参数。
|
||||
>>>
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> model = Model(net, loss_fn=loss, optimizer=optim)
|
||||
|
|
@ -0,0 +1,93 @@
|
|||
Class mindspore.nn.LazyAdam(*args, **kwargs)
|
||||
|
||||
通过Adaptive Moment Estimation (Adam)算法更新梯度。请参阅论文`Adam: A Method for Stochastic Optimization <https://arxiv.org/abs/1412.6980>`_。
|
||||
|
||||
当梯度稀疏时,此优化器将使用Lazy Adam算法。
|
||||
|
||||
更新公式如下:
|
||||
|
||||
.. math::
|
||||
\begin{array}{ll} \\
|
||||
m_{t+1} = \beta_1 * m_{t} + (1 - \beta_1) * g \\
|
||||
v_{t+1} = \beta_2 * v_{t} + (1 - \beta_2) * g * g \\
|
||||
l = \alpha * \frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t} \\
|
||||
w_{t+1} = w_{t} - l * \frac{m_{t+1}}{\sqrt{v_{t+1}} + \epsilon}
|
||||
\end{array}
|
||||
|
||||
:math:`m`代表第一个矩向量`moment1`,:math:`v`代表第二个矩向量`moment2`,:math:`g`代表`gradients`,:math:`l`代表缩放因子,:math:`\beta_1,\beta_2`代表`beta1`和`beta2`,:math:`t`代表当前step,:math:`beta_1^t`和:math:`beta_2^t`代表`beta1_power`和`beta2_power`,:math:`\alpha`代表`learning_rate`,:math:`w`代表`params`,:math:`\epsilon`代表`eps`。
|
||||
|
||||
|
||||
.. note::
|
||||
.. include:: mindspore.nn.optim_note_sparse.rst
|
||||
需要注意的是,梯度稀疏时该优化器只更新梯度中当前索引位置对应的网络参数,稀疏行为不等同于Adam算法(行为示意见下方代码)。
|
||||
|
||||
.. include:: mindspore.nn.optim_note_weight_decay.rst
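
下面用NumPy给出稀疏梯度下lazy更新行为的示意(草图,仅更新`indices`中出现的行,其余行的m、v与参数保持不变;函数名与稀疏梯度的表示方式为本示例的假设):

.. code-block:: python

    import numpy as np

    def lazy_adam_sparse_step(w, m, v, indices, grad_values, lr=1e-3,
                              beta1=0.9, beta2=0.999, eps=1e-8):
        for i, g in zip(indices, grad_values):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            w[i] = w[i] - lr * m[i] / (np.sqrt(v[i]) + eps)   # 其余行保持不变
        return w, m, v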
|
||||
|
||||
参数:
|
||||
params (Union[list[Parameter], list[dict]]) - 必须是 `Parameter` 组成的列表或字典组成的列表。当列表元素是字典时,字典的键可以是"params"、"lr"、"weight_decay"、"grad_centralization"和"order_params":
|
||||
|
||||
.. include:: mindspore.nn.optim_group_param.rst
|
||||
.. include:: mindspore.nn.optim_group_lr.rst
|
||||
.. include:: mindspore.nn.optim_group_weight_decay.rst
|
||||
.. include:: mindspore.nn.optim_group_gc.rst
|
||||
.. include:: mindspore.nn.optim_group_order.rst
|
||||
|
||||
learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): 默认值:1e-3。
|
||||
.. include:: mindspore.nn.optim_arg_dynamic_lr.rst
|
||||
|
||||
beta1 (float):`moment1` 的指数衰减率。参数范围(0.0,1.0)。默认值:0.9。
|
||||
|
||||
beta2 (float):`moment2` 的指数衰减率。参数范围(0.0,1.0)。默认值:0.999。
|
||||
|
||||
eps (float):将添加到分母中,以提高数值稳定性。必须大于0。默认值:1e-8。
|
||||
|
||||
use_locking (bool):是否对参数更新加锁保护。如果为True,则 `w` 、`m` 和 `v` 的Tensor更新将受到锁的保护。如果为False,则结果不可预测。默认值:False。
|
||||
|
||||
use_nesterov (bool):是否使用Nesterov Accelerated Gradient (NAG)算法更新梯度。如果为True,使用NAG更新梯度。如果为False,则在不使用NAG的情况下更新梯度。默认值:False。
|
||||
|
||||
weight_decay (Union[float, int]):权重衰减(L2 penalty)。必须大于等于0。默认值:0.0。
|
||||
|
||||
.. include:: mindspore.nn.optim_arg_loss_scale.rst
|
||||
|
||||
输入:
|
||||
- **gradients** (tuple[Tensor]):`params`的梯度,shape与`params`相同。
|
||||
|
||||
输出:
|
||||
Tensor[bool],值为True。
|
||||
|
||||
异常:
|
||||
TypeError:`learning_rate`不是int、float、Tensor、Iterable或LearningRateSchedule。
|
||||
TypeError:`parameters`的元素不是Parameter或字典。
|
||||
TypeError:`beta1`、`beta2`、`eps`或`loss_scale`不是float。
|
||||
TypeError:`weight_decay`不是float或int。
|
||||
TypeError:`use_locking`或`use_nesterov`不是bool。
|
||||
ValueError:`loss_scale`或`eps`小于或等于0。
|
||||
ValueError:`beta1`、`beta2`不在(0.0,1.0)范围内。
|
||||
ValueError:`weight_decay`小于0。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``GPU``
|
||||
|
||||
示例:
|
||||
>>> net = Net()
|
||||
>>> #1) 所有参数使用相同的学习率和权重衰减
|
||||
>>> optim = nn.LazyAdam(params=net.trainable_params())
|
||||
>>>
|
||||
>>> #2) 使用参数分组并设置不同的值
|
||||
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01, 'grad_centralization':True},
|
||||
... {'params': no_conv_params, 'lr': 0.01},
|
||||
... {'order_params': net.trainable_params()}]
|
||||
>>> optim = nn.LazyAdam(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||
>>> # conv_params参数组将使用优化器中的学习率0.1、该组的权重衰减0.01、该组的梯度中心化配置True。
|
||||
>>> # no_conv_params参数组将使用该组的学习率0.01、优化器中的权重衰减0.0、梯度中心化使用默认值False。
|
||||
>>> # 优化器按照"order_params"配置的参数顺序更新参数。
|
||||
>>>
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> model = Model(net, loss_fn=loss, optimizer=optim)
|
||||
|
||||
|
||||
.. include:: mindspore.nn.optim_target_unique_for_sparse.rst
|
||||
|
||||
|
|
@ -1,76 +0,0 @@
|
|||
mindspore.nn.Metric
|
||||
====================
|
||||
|
||||
.. py:class:: mindspore.nn.Metric
|
||||
|
||||
用于计算评估指标的基类。
|
||||
|
||||
在计算评估指标时需要调用 `clear` 、 `update` 和 `eval` 三个方法,在继承该类自定义评估指标时,也需要实现这三个方法。其中,`update` 用于计算中间过程的内部结果,`eval` 用于计算最终评估结果,`clear` 用于重置中间结果。
|
||||
请勿直接使用该类,需使用子类如 :class:`mindspore.nn.MAE` 、 :class:`mindspore.nn.Recall` 等。
|
||||
|
||||
.. py:method:: clear()
|
||||
:abstract:
|
||||
|
||||
描述了清除内部评估结果的行为。
|
||||
|
||||
.. note::
|
||||
所有子类都必须重写此接口。
|
||||
|
||||
.. py:method:: eval()
|
||||
:abstract:
|
||||
|
||||
描述了计算最终评估结果的行为。
|
||||
|
||||
.. note::
|
||||
所有子类都必须重写此接口。
|
||||
|
||||
.. py:method:: indexes
|
||||
:property:
|
||||
|
||||
获取当前的 `indexes` 值。默认为None,调用 `set_indexes` 可修改 `indexes` 值。
|
||||
|
||||
.. py:method:: set_indexes(indexes)
|
||||
|
||||
该接口用于重排 `update` 的输入。
|
||||
|
||||
给定(label0, label1, logits)作为 `update` 的输入,将 `indexes` 设置为[2, 1],则最终使用(logits, label1)作为 `update` 的真实输入。
|
||||
|
||||
.. note::
|
||||
在继承该类自定义评估函数时,需要用装饰器 `mindspore.nn.rearrange_inputs` 修饰 `update` 方法,否则配置的 `indexes` 值不生效。
|
||||
|
||||
|
||||
**参数:**
|
||||
|
||||
**indexes** (List(int)) - logits和标签的目标顺序。
|
||||
|
||||
**输出:**
|
||||
|
||||
:class:`Metric` ,类实例本身。
|
||||
|
||||
**样例:**
|
||||
|
||||
>>> import numpy as np
|
||||
>>> from mindspore import nn, Tensor
|
||||
>>>
|
||||
>>> x = Tensor(np.array([[0.2, 0.5], [0.3, 0.1], [0.9, 0.6]]))
|
||||
>>> y = Tensor(np.array([1, 0, 1]))
|
||||
>>> y2 = Tensor(np.array([0, 0, 1]))
|
||||
>>> metric = nn.Accuracy('classification').set_indexes([0, 2])
|
||||
>>> metric.clear()
|
||||
>>> # indexes为[0, 2],使用x作为预测值,y2作为真实标签
|
||||
>>> metric.update(x, y, y2)
|
||||
>>> accuracy = metric.eval()
|
||||
>>> print(accuracy)
|
||||
0.3333333333333333
|
||||
|
||||
.. py:method:: update(*inputs)
|
||||
:abstract:
|
||||
|
||||
描述了更新内部评估结果的行为。
|
||||
|
||||
.. note::
|
||||
所有子类都必须重写此接口。
|
||||
|
||||
**参数:**
|
||||
|
||||
**inputs** - 可变长度输入参数列表。通常是预测值和对应的真实标签。
|
|
@ -1,91 +0,0 @@
|
|||
mindspore.nn.Momentum
|
||||
======================
|
||||
|
||||
.. py:class:: mindspore.nn.Momentum(*args, **kwargs)
|
||||
|
||||
Momentum算法优化器。
|
||||
|
||||
有关更多详细信息,请参阅论文 `On the importance of initialization and momentum in deep learning <https://dl.acm.org/doi/10.5555/3042817.3043064>`_。
|
||||
|
||||
.. math::
|
||||
v_{t+1} = v_{t} \ast u + grad
|
||||
|
||||
如果 `use_nesterov` 为True:
|
||||
|
||||
.. math::
|
||||
p_{t+1} = p_{t} - (grad \ast lr + v_{t+1} \ast u \ast lr)
|
||||
|
||||
如果 `use_nesterov` 为False:
|
||||
|
||||
.. math::
|
||||
p_{t+1} = p_{t} - lr \ast v_{t+1}
|
||||
|
||||
其中,:math:`grad` 、:math:`lr` 、:math:`p` 、:math:`v` 和 :math:`u` 分别表示梯度、学习率、参数、矩(Moment)和动量(Momentum)。
|
||||
|
||||
.. note::
|
||||
在参数未分组时,优化器配置的 `weight_decay` 应用于名称含有"beta"或"gamma"的网络参数,通过网络参数分组可调整权重衰减策略。分组时,每组网络参数均可配置 `weight_decay` ,若未配置,则该组网络参数使用优化器中配置的 `weight_decay` 。
|
||||
|
||||
**参数:**
|
||||
|
||||
- **params** (Union[list[Parameter], list[dict]]): 必须是 `Parameter` 组成的列表或字典组成的列表。当列表元素是字典时,字典的键可以是"params"、"lr"、"weight_decay"、"grad_centralization"和"order_params":
|
||||
|
||||
- ** params** - 必填。当前组别的权重,该值必须是 `Parameter` 列表。
|
||||
- ** lr** - 可选。如果键中存在"lr",则使用对应的值作为学习率。如果没有,则使用优化器中配置的 `learning_rate` 作为学习率。
|
||||
- ** weight_decay** - 可选。如果键中存在"weight_decay”,则使用对应的值作为权重衰减值。如果没有,则使用优化器中配置的 `weight_decay` 作为权重衰减值。
|
||||
- ** grad_centralization** - 可选。如果键中存在"grad_centralization",则使用对应的值,该值必须为布尔类型。如果没有,则认为 `grad_centralization` 为False。该参数仅适用于卷积层。
|
||||
- ** order_params** - 可选。对应值是预期的参数更新顺序。当使用参数分组功能时,通常使用该配置项保持 `parameters` 的顺序以提升性能。如果键中存在"order_params",则会忽略该组配置中的其他键。"order_params"中的参数必须在某一组 `params` 参数中。
|
||||
|
||||
- **learning_rate** (Union[float, int, Tensor, Iterable, LearningRateSchedule]):
|
||||
|
||||
- **float** - 固定的学习率。必须大于等于零。
|
||||
- **int** - 固定的学习率。必须大于等于零。整数类型会被转换为浮点数。
|
||||
- **Tensor** - 可以是标量或一维向量。标量是固定的学习率。一维向量是动态的学习率,第i步将取向量中第i个值作为学习率。
|
||||
- **Iterable** - 动态的学习率。第i步将取迭代器第i个值作为学习率。
|
||||
- **LearningRateSchedule** - 动态的学习率。在训练过程中,优化器将使用步数(step)作为输入,调用 `LearningRateSchedule` 实例来计算当前学习率。
|
||||
|
||||
- **momentum** (float) - 浮点数类型的超参,表示移动平均的动量。必须等于或大于0.0。
|
||||
- **weight_decay** (int, float) - 权重衰减(L2 penalty)值。必须大于等于0.0。默认值:0.0。
|
||||
- **loss_scale** (float) - 梯度缩放系数,必须大于0。如果 `loss_scale` 是整数,它将被转换为浮点数。通常使用默认值,仅当训练时使用了 `FixedLossScaleManager`,且 `FixedLossScaleManager` 的 `drop_overflow_update` 属性配置为False时,此值需要与 `FixedLossScaleManager` 中的 `loss_scale` 相同。有关更多详细信息,请参阅 :class:`mindspore.FixedLossScaleManager` 。默认值:1.0。
|
||||
- **use_nesterov** (bool) - 是否使用Nesterov Accelerated Gradient (NAG)算法更新梯度。默认值:False。
|
||||
|
||||
**输入:**
|
||||
|
||||
**gradients** (tuple[Tensor]) - `params` 的梯度,形状(shape)与 `params` 相同。
|
||||
|
||||
**输出:**
|
||||
|
||||
tuple[bool],所有元素都为True。
|
||||
|
||||
**异常:**
|
||||
|
||||
- **TypeError** - `learning_rate` 不是int、float、Tensor、Iterable或LearningRateSchedule。
|
||||
- **TypeError** - `parameters` 的元素不是 `Parameter` 或字典。
|
||||
- **TypeError** - `loss_scale` 或 `momentum` 不是float。
|
||||
- **TypeError** - `weight_decay` 不是float或int。
|
||||
- **TypeError** - `use_nesterov` 不是bool。
|
||||
- **ValueError** - `loss_scale` 小于或等于0。
|
||||
- **ValueError** - `weight_decay` 或 `momentum` 小于0。
|
||||
|
||||
**支持平台:**
|
||||
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
**样例:**
|
||||
|
||||
>>> net = Net()
|
||||
>>> #1) 所有参数使用相同的学习率和权重衰减
|
||||
>>> optim = nn.Momentum(params=net.trainable_params(), learning_rate=0.1, momentum=0.9)
|
||||
>>>
|
||||
>>> #2) 使用参数分组并设置不同的值
|
||||
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01, 'grad_centralization':True},
|
||||
... {'params': no_conv_params, 'lr': 0.01},
|
||||
... {'order_params': net.trainable_params()}]
|
||||
>>> optim = nn.Momentum(group_params, learning_rate=0.1, momentum=0.9, weight_decay=0.0)
|
||||
>>> # conv_params参数组将使用优化器中的学习率0.1、该组的权重衰减0.01、该组的梯度中心化配置True。
|
||||
>>> # no_conv_params参数组将使用该组的学习率0.01、优化器中的权重衰减0.0、梯度中心化使用默认值False。
|
||||
>>> # 优化器按照"order_params"配置的参数顺序更新参数。
|
||||
>>>
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> model = Model(net, loss_fn=loss, optimizer=optim, metrics=None)
|
|
@ -1,143 +0,0 @@
|
|||
mindspore.nn.Optimizer
|
||||
======================
|
||||
|
||||
.. py:class:: mindspore.nn.Optimizer(learning_rate, parameters, weight_decay=0.0, loss_scale=1.0)
|
||||
|
||||
用于参数更新的优化器基类。不要直接使用这个类,请实例化它的一个子类。
|
||||
|
||||
优化器支持参数分组。当参数分组时,每组参数均可配置不同的学习率(`lr` )、权重衰减(`weight_decay`)和梯度中心化(`grad_centralization`)策略。
|
||||
|
||||
.. note::
|
||||
在参数未分组时,优化器配置的 `weight_decay` 应用于名称含有"beta"或"gamma"的网络参数,通过网络参数分组可调整权重衰减策略。分组时,每组网络参数均可配置 `weight_decay` ,若未配置,则该组网络参数使用优化器中配置的 `weight_decay`。
|
||||
|
||||
**参数:**
|
||||
|
||||
- **learning_rate** (Union[float, int, Tensor, Iterable, LearningRateSchedule]):
|
||||
|
||||
- **float** - 固定的学习率。必须大于等于零。
|
||||
- **int** - 固定的学习率。必须大于等于零。整数类型会被转换为浮点数。
|
||||
- **Tensor** - 可以是标量或一维向量。标量是固定的学习率。一维向量是动态的学习率,第i步将取向量中第i个值作为学习率。
|
||||
- **Iterable** - 动态的学习率。第i步将取迭代器第i个值作为学习率。
|
||||
- **LearningRateSchedule** - 动态的学习率。在训练过程中,优化器将使用步数(step)作为输入,调用 `LearningRateSchedule` 实例来计算当前学习率。
|
||||
|
||||
- **parameters (Union[list[Parameter], list[dict]])** - 必须是 `Parameter` 组成的列表或字典组成的列表。当列表元素是字典时,字典的键可以是"params"、"lr"、"weight_decay"、"grad_centralization"和"order_params":
|
||||
|
||||
- **params** - 必填。当前组别的权重,该值必须是 `Parameter` 列表。
|
||||
- **lr** - 可选。如果键中存在"lr",则使用对应的值作为学习率。如果没有,则使用优化器中配置的 `learning_rate` 作为学习率。
|
||||
- **weight_decay** - 可选。如果键中存在"weight_decay”,则使用对应的值作为权重衰减值。如果没有,则使用优化器中配置的 `weight_decay` 作为权重衰减值。
|
||||
- **grad_centralization** - 可选。如果键中存在"grad_centralization",则使用对应的值,该值必须为布尔类型。如果没有,则认为 `grad_centralization` 为False。该参数仅适用于卷积层。
|
||||
- **order_params** - 可选。对应值是预期的参数更新顺序。当使用参数分组功能时,通常使用该配置项保持 `parameters` 的顺序以提升性能。如果键中存在"order_params",则会忽略该组配置中的其他键。"order_params"中的参数必须在某一组 `params` 参数中。
|
||||
|
||||
- **weight_decay** (Union[float, int]) - 权重衰减的整数或浮点值。必须等于或大于0。如果 `weight_decay` 是整数,它将被转换为浮点数。默认值:0.0。
|
||||
- **loss_scale** (float) - 梯度缩放系数,必须大于0。如果 `loss_scale` 是整数,它将被转换为浮点数。通常使用默认值,仅当训练时使用了 `FixedLossScaleManager` ,且 `FixedLossScaleManager` 的 `drop_overflow_update` 属性配置为False时,此值需要与 `FixedLossScaleManager` 中的 `loss_scale` 相同。有关更多详细信息,请参阅 :class:`mindspore.FixedLossScaleManager`。默认值:1.0。
|
||||
|
||||
**异常:**
|
||||
|
||||
- **TypeError** - `learning_rate` 不是int、float、Tensor、Iterable或LearningRateSchedule。
|
||||
- **TypeError** - `parameters` 的元素不是Parameter或字典。
|
||||
- **TypeError** - `loss_scale` 不是float。
|
||||
- **TypeError** - `weight_decay` 不是float或int。
|
||||
- **ValueError** - `loss_scale` 小于或等于0。
|
||||
- **ValueError** - `weight_decay` 小于0。
|
||||
- **ValueError** - `learning_rate` 是一个Tensor,但是Tensor的维度大于1。
|
||||
|
||||
**支持平台:**
|
||||
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
.. py:method:: broadcast_params(optim_result)
|
||||
|
||||
按参数组的顺序进行参数广播。
|
||||
|
||||
**参数:**
|
||||
|
||||
**optim_result** (bool) - 参数更新结果。该输入用来保证参数更新完成后才执行参数广播。
|
||||
|
||||
**返回:**
|
||||
|
||||
bool,状态标志。
|
||||
|
||||
.. py:method:: decay_weight(gradients)
|
||||
|
||||
衰减权重。
|
||||
|
||||
一种减少深度学习神经网络模型过拟合的方法。继承 :class:`mindspore.nn.Optimizer` 自定义优化器时,可调用该接口进行权重衰减。
|
||||
|
||||
**参数:**
|
||||
|
||||
**gradients** (tuple[Tensor]) - 网络参数的梯度,形状(shape)与网络参数相同。
|
||||
|
||||
**返回:**
|
||||
|
||||
tuple[Tensor],衰减权重后的梯度。
|
||||
|
||||
.. py:method:: get_lr()
|
||||
|
||||
优化器调用该接口获取当前步骤(step)的学习率。继承 :class:`mindspore.nn.Optimizer` 自定义优化器时,可在参数更新前调用该接口获取学习率。
|
||||
|
||||
**返回:**
|
||||
|
||||
float,当前步骤的学习率。
|
||||
|
||||
.. py:method:: get_lr_parameter(param)
|
||||
|
||||
用于在使用网络参数分组功能,且为不同组别配置不同的学习率时,获取指定参数的学习率。
|
||||
|
||||
**参数:**
|
||||
|
||||
**param** (Union[Parameter, list[Parameter]]) - `Parameter` 或 `Parameter` 列表。
|
||||
|
||||
**返回:**
|
||||
|
||||
Parameter,单个 `Parameter` 或 `Parameter` 列表。如果使用了动态学习率,返回用于计算学习率的 `LearningRateSchedule` 或 `LearningRateSchedule` 列表。
|
||||
|
||||
**样例:**
|
||||
|
||||
>>> from mindspore import nn
|
||||
>>> net = Net()
|
||||
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||
>>> group_params = [{'params': conv_params, 'lr': 0.05},
|
||||
... {'params': no_conv_params, 'lr': 0.01}]
|
||||
>>> optim = nn.Momentum(group_params, learning_rate=0.1, momentum=0.9, weight_decay=0.0)
|
||||
>>> conv_lr = optim.get_lr_parameter(conv_params)
|
||||
>>> print(conv_lr[0].asnumpy())
|
||||
0.05
|
||||
|
||||
.. py:method:: gradients_centralization(gradients)
|
||||
|
||||
梯度中心化。
|
||||
|
||||
一种优化卷积层参数以提高深度学习神经网络模型训练速度的方法。继承 :class:`mindspore.nn.Optimizer` 自定义优化器时,可调用该接口进行梯度中心化。
|
||||
|
||||
**参数:**
|
||||
|
||||
**gradients** (tuple[Tensor]) - 网络参数的梯度,形状(shape)与网络参数相同。
|
||||
|
||||
**返回:**
|
||||
|
||||
tuple[Tensor],梯度中心化后的梯度。
|
||||
|
||||
.. py:method:: scale_grad(gradients)
|
||||
|
||||
用于在混合精度场景还原梯度。
|
||||
|
||||
继承 :class:`mindspore.nn.Optimizer` 自定义优化器时,可调用该接口还原梯度。
|
||||
|
||||
**参数:**
|
||||
|
||||
**gradients** (tuple[Tensor]) - 网络参数的梯度,形状(shape)与网络参数相同。
|
||||
|
||||
**返回:**
|
||||
|
||||
tuple[Tensor],还原后的梯度。
|
||||
|
||||
.. py:method:: target
|
||||
:property:
|
||||
|
||||
该属性用于指定在主机(host)上还是设备(device)上更新参数。输入类型为str,只能是'CPU','Ascend'或'GPU'。
|
||||
|
||||
.. py:method:: unique
|
||||
:property:
|
||||
|
||||
该属性表示是否在优化器中进行梯度去重,通常用于稀疏网络。如果梯度是稀疏的则设置为True。如果前向稀疏网络已对权重去重,即梯度是稠密的,则设置为False。未设置时默认值为True。
|
|
@ -0,0 +1,82 @@
|
|||
Class mindspore.nn.ProximalAdagrad(*args, **kwargs)
|
||||
|
||||
使用ApplyProximalAdagrad算子实现ProximalAdagrad算法。
|
||||
|
||||
ProximalAdagrad用于在线学习和随机优化。
|
||||
请参阅论文`Efficient Learning using Forward-Backward Splitting <http://papers.nips.cc//paper/3793-efficient-learning-using-forward-backward-splitting.pdf>`_。
|
||||
|
||||
.. math::
|
||||
accum_{t+1} = accum_{t} + grad * grad
|
||||
|
||||
.. math::
|
||||
\text{prox_v} = var_{t} - lr * grad * \frac{1}{\sqrt{accum_{t+1}}}
|
||||
|
||||
.. math::
|
||||
var_{t+1} = \frac{sign(\text{prox_v})}{1 + lr * l2} * \max(\left| \text{prox_v} \right| - lr * l1, 0)
|
||||
|
||||
其中,grad、lr、var、accum和t分别表示`grads`, `learning_rate`, `params`、累加器和当前step。
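
上式对应的NumPy数值示意如下(草图,变量名沿用上文符号):

.. code-block:: python

    import numpy as np

    def proximal_adagrad_step(var, accum, grad, lr=1e-3, l1=0.0, l2=0.0):
        accum = accum + grad * grad
        prox_v = var - lr * grad / np.sqrt(accum)
        var = np.sign(prox_v) / (1 + lr * l2) * np.maximum(np.abs(prox_v) - lr * l1, 0)
        return var, accum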
|
||||
|
||||
.. note::
|
||||
.. include:: mindspore.nn.optim_note_sparse.rst
|
||||
|
||||
.. include:: mindspore.nn.optim_note_weight_decay.rst
|
||||
|
||||
参数:
|
||||
params (Union[list[Parameter], list[dict]]) - 必须是 `Parameter` 组成的列表或字典组成的列表。当列表元素是字典时,字典的键可以是"params"、"lr"、"weight_decay"、"grad_centralization"和"order_params":
|
||||
|
||||
.. include:: mindspore.nn.optim_group_param.rst
|
||||
.. include:: mindspore.nn.optim_group_lr.rst
|
||||
.. include:: mindspore.nn.optim_group_weight_decay.rst
|
||||
.. include:: mindspore.nn.optim_group_gc.rst
|
||||
.. include:: mindspore.nn.optim_group_order.rst
|
||||
|
||||
accum (float):累加器`accum`的初始值,必须为零或正值。默认值:0.1。
|
||||
|
||||
learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): 默认值:1e-3。
|
||||
.. include:: mindspore.nn.optim_arg_dynamic_lr.rst
|
||||
|
||||
l1 (float):l1正则化强度,必须大于或等于零。默认值:0.0。
|
||||
l2 (float):l2正则化强度,必须大于或等于零。默认值:0.0。
|
||||
use_locking (bool):如果为True,则更新操作使用锁保护。默认值:False。
|
||||
.. include:: mindspore.nn.optim_arg_loss_scale.rst
|
||||
weight_decay (Union[float, int]):要乘以权重的权重衰减值,必须为零或正值。默认值:0.0。
|
||||
|
||||
输入:
|
||||
- **grads** (tuple[Tensor]) - 优化器中`params`的梯度,shape与优化器中的`params`相同。
|
||||
|
||||
输出:
|
||||
Tensor[bool],值为True。
|
||||
|
||||
异常:
|
||||
TypeError:`learning_rate`不是int、float、Tensor、Iterable或LearningRateSchedule。
|
||||
TypeError:`parameters`的元素不是Parameter或字典。
|
||||
TypeError:`accum`、`l1`、`l2`或`loss_scale`不是float。
|
||||
TypeError:`weight_decay`不是float或int。
|
||||
ValueError:`loss_scale`小于或等于0。
|
||||
ValueError:`accum`、`l1`、`l2`或`weight_decay`小于0。
|
||||
|
||||
支持平台:
|
||||
``Ascend``
|
||||
|
||||
示例:
|
||||
>>> net = Net()
|
||||
>>> #1) 所有参数使用相同的学习率和权重衰减
|
||||
>>> optim = nn.ProximalAdagrad(params=net.trainable_params())
|
||||
>>>
|
||||
>>> #2) 使用参数组并设置不同的值
|
||||
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01, 'grad_centralization':True},
|
||||
... {'params': no_conv_params, 'lr': 0.01},
|
||||
... {'order_params': net.trainable_params()}]
|
||||
>>> optim = nn.ProximalAdagrad(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||
>>> # conv_params参数组将使用优化器中的学习率0.1、该组的权重衰减0.01、该组的梯度中心化配置True。
|
||||
>>> # no_conv_params参数组将使用该组的学习率0.01、优化器中的权重衰减0.0、梯度中心化使用默认值False。
|
||||
>>> # 优化器按照"order_params"配置的参数顺序更新参数。
|
||||
>>>
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> model = Model(net, loss_fn=loss, optimizer=optim)
|
||||
|
||||
|
||||
.. include:: mindspore.nn.optim_target_unique_for_sparse.rst
|
||||
|
|
@ -0,0 +1,101 @@
|
|||
Class mindspore.nn.RMSProp(*args, **kwargs)
|
||||
|
||||
实现均方根传播(RMSProp)算法。
|
||||
|
||||
根据RMSProp算法更新`params`,算法详见 http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf 第29页。
|
||||
|
||||
公式如下:
|
||||
|
||||
.. math::
|
||||
s_{t+1} = \rho s_{t} + (1 - \rho)(\nabla Q_{i}(w))^2
|
||||
|
||||
.. math::
|
||||
m_{t+1} = \beta m_{t} + \frac{\eta} {\sqrt{s_{t+1} + \epsilon}} \nabla Q_{i}(w)
|
||||
|
||||
.. math::
|
||||
w = w - m_{t+1}
|
||||
|
||||
第一个方程计算每个权重的平方梯度的移动平均。然后将梯度除以:math:`\sqrt{s_{t+1} + \epsilon}`。
|
||||
|
||||
如果centered为True:
|
||||
|
||||
.. math::
|
||||
g_{t+1} = \rho g_{t} + (1 - \rho)\nabla Q_{i}(w)
|
||||
|
||||
.. math::
|
||||
s_{t+1} = \rho s_{t} + (1 - \rho)(\nabla Q_{i}(w))^2
|
||||
|
||||
.. math::
|
||||
m_{t+1} = \beta m_{t} + \frac{\eta} {\sqrt{s_{t+1} - g_{t+1}^2 + \epsilon}} \nabla Q_{i}(w)
|
||||
|
||||
.. math::
|
||||
w = w - m_{t+1}
|
||||
|
||||
其中:math:`w`代表待更新的网络参数`params`。
|
||||
:math:`g_{t+1}`是平均梯度。
|
||||
:math:`s_{t+1}`是均方梯度。
|
||||
:math:`m_{t+1}`是moment,`w`的delta。
|
||||
:math:`\rho`代表`decay`。:math:`\beta`是动量项,表示`momentum`。
|
||||
:math:`\epsilon`是平滑项,可以避免除以零,表示`epsilon`。
|
||||
:math:`\eta`是学习率,表示`learning_rate`。:math:`\nabla Q_{i}(w)`是梯度,表示`gradients`。
|
||||
:math:`t`表示当前step。
|
||||
|
||||
.. note::
|
||||
.. include:: mindspore.nn.optim_note_weight_decay.rst
|
||||
|
||||
参数:
|
||||
params (Union[list[Parameter], list[dict]]):必须是 `Parameter` 组成的列表或字典组成的列表。当列表元素是字典时,字典的键可以是"params"、"lr"、"weight_decay"、"grad_centralization"和"order_params":
|
||||
|
||||
.. include:: mindspore.nn.optim_group_param.rst
|
||||
.. include:: mindspore.nn.optim_group_lr.rst
|
||||
.. include:: mindspore.nn.optim_group_weight_decay.rst
|
||||
.. include:: mindspore.nn.optim_group_gc.rst
|
||||
.. include:: mindspore.nn.optim_group_order.rst
|
||||
|
||||
learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]):默认值:0.1。
|
||||
.. include:: mindspore.nn.optim_arg_dynamic_lr.rst
|
||||
decay (float):衰减率。必须大于等于0。默认值:0.9。
|
||||
momentum (float):Float类型的超参数,表示移动平均的动量(momentum)。必须大于等于0。默认值:0.0。
|
||||
epsilon (float):将添加到分母中,以提高数值稳定性。取值大于0。默认值:1e-10。
|
||||
use_locking (bool):是否对参数更新加锁保护。默认值:False。
|
||||
centered (bool):如果为True,则梯度将通过梯度的估计方差进行归一。默认值:False。
|
||||
.. include:: mindspore.nn.optim_arg_loss_scale.rst
|
||||
weight_decay (Union[float, int]):权重衰减(L2 penalty)。必须大于等于0。默认值:0.0。
|
||||
|
||||
输入:
|
||||
- **gradients** (tuple[Tensor]) - `params`的梯度,shape与`params`相同。
|
||||
|
||||
输出:
|
||||
Tensor[bool],值为True。
|
||||
|
||||
异常:
|
||||
TypeError:`learning_rate`不是int、float、Tensor、Iterable或LearningRateSchedule。
|
||||
TypeError:`decay`、`momentum`、`epsilon`或`loss_scale`不是float。
|
||||
TypeError:`parameters`的元素不是Parameter或字典。
|
||||
TypeError:`weight_decay`不是float或int。
|
||||
TypeError:`use_locking`或`centered`不是bool。
|
||||
ValueError:`epsilon`小于或等于0。
|
||||
ValueError:`decay`或`momentum`小于0。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
示例:
|
||||
>>> net = Net()
|
||||
>>> #1) 所有参数使用相同的学习率和权重衰减
|
||||
>>> optim = nn.RMSProp(params=net.trainable_params(), learning_rate=0.1)
|
||||
>>>
|
||||
>>> #2) 使用参数分组并设置不同的值
|
||||
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01, 'grad_centralization':True},
|
||||
... {'params': no_conv_params, 'lr': 0.01},
|
||||
... {'order_params': net.trainable_params()}]
|
||||
>>> optim = nn.RMSProp(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||
>>> # conv_params参数组将使用优化器中的学习率0.1、该组的权重衰减0.01、该组的梯度中心化配置True。
|
||||
>>> # no_conv_params参数组将使用该组的学习率0.01、优化器中的权重衰减0.0、梯度中心化使用默认值False。
|
||||
>>> # 优化器按照"order_params"配置的参数顺序更新参数。
|
||||
>>>
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> model = Model(net, loss_fn=loss, optimizer=optim)
|
||||
|
|
@ -0,0 +1,88 @@
|
|||
mindspore.nn.SGD
|
||||
================
|
||||
|
||||
.. py:class:: mindspore.nn.SGD(*args, **kwargs)
|
||||
|
||||
实现随机梯度下降。动量可选。
|
||||
|
||||
SGD相关介绍参见 `SGD <https://en.wikipedia.org/wiki/Stochastic_gradient_descent>`_ 。
|
||||
|
||||
Nesterov动量公式参见论文 `On the importance of initialization and momentum in deep learning <http://proceedings.mlr.press/v28/sutskever13.html>`_ 。
|
||||
|
||||
.. math::
|
||||
v_{t+1} = u \ast v_{t} + gradient \ast (1-dampening)
|
||||
|
||||
如果nesterov为True:
|
||||
|
||||
.. math::
|
||||
p_{t+1} = p_{t} - lr \ast (gradient + u \ast v_{t+1})
|
||||
|
||||
如果nesterov为False:
|
||||
|
||||
.. math::
|
||||
p_{t+1} = p_{t} - lr \ast v_{t+1}
|
||||
|
||||
需要注意的是,对于训练的第一步 :math:`v_{t+1} = gradient`。其中,p、v和u分别表示 `parameters`、`accum` 和 `momentum`。
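
上式对应的数值示意如下(草图,`first_step`用于体现上文"训练第一步 :math:`v_{t+1} = gradient`"的约定):

.. code-block:: python

    def sgd_step(p, v, grad, lr=0.1, momentum=0.0, dampening=0.0,
                 nesterov=False, first_step=False):
        v = grad if first_step else momentum * v + grad * (1 - dampening)
        if nesterov:
            p = p - lr * (grad + momentum * v)
        else:
            p = p - lr * v
        return p, v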
|
||||
|
||||
.. note::
|
||||
|
||||
.. include:: mindspore.nn.optim_note_weight_decay.rst
|
||||
|
||||
**参数:**
|
||||
|
||||
- **params** (Union[list[Parameter], list[dict]]): 当 `params` 为会更新的 `Parameter` 列表时,`params` 中的元素必须为类 `Parameter`。当 `params` 为 `dict` 列表时,"params"、"lr"、"weight_decay"、"grad_centralization"和"order_params"为可以解析的键。
|
||||
.. include:: mindspore.nn.optim_group_param.rst
|
||||
.. include:: mindspore.nn.optim_group_lr.rst
|
||||
.. include:: mindspore.nn.optim_group_weight_decay.rst
|
||||
.. include:: mindspore.nn.optim_group_gc.rst
|
||||
.. include:: mindspore.nn.optim_group_order.rst
|
||||
|
||||
- **learning_rate** (Union[float, Tensor, Iterable, LearningRateSchedule]): 默认值:0.1。
|
||||
.. include:: mindspore.nn.optim_arg_dynamic_lr.rst
|
||||
|
||||
- **momentum** (float): 浮点动量,必须大于等于0.0。默认值:0.0。
|
||||
- **dampening** (float): 浮点动量阻尼值,必须大于等于0.0。默认值:0.0。
|
||||
- **weight_decay** (float): 权重衰减(L2 penalty),必须大于等于0。默认值:0.0。
|
||||
- **nesterov** (bool): 启用Nesterov动量。如果使用Nesterov,动量必须为正,阻尼必须等于0.0。默认值:False。
|
||||
.. include:: mindspore.nn.optim_arg_loss_scale.rst
|
||||
|
||||
**输入:**
|
||||
|
||||
**gradients** (tuple[Tensor]):`params` 的梯度,shape与 `params` 相同。
|
||||
|
||||
**输出:**
|
||||
|
||||
Tensor[bool],值为True。
|
||||
|
||||
**异常:**
|
||||
|
||||
**ValueError:** 动量、阻尼或权重衰减值小于0.0。
|
||||
|
||||
**支持平台:**
|
||||
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
**样例:**
|
||||
|
||||
.. code-block::
|
||||
|
||||
>>> net = Net()
|
||||
>>> # 1) 所有参数使用相同的学习率和权重衰减
|
||||
>>> optim = nn.SGD(params=net.trainable_params())
|
||||
>>>
|
||||
>>> # 2) 使用参数组并设置不同的值
|
||||
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||
>>> group_params = [{'params': conv_params,'grad_centralization':True},
|
||||
... {'params': no_conv_params, 'lr': 0.01},
|
||||
... {'order_params': net.trainable_params()}]
|
||||
>>> optim = nn.SGD(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||
>>> # conv_params参数组将使用优化器中的学习率0.1、优化器中的权重衰减0.0、该组的梯度中心化配置True。
|
||||
>>> #
|
||||
>>> # no_conv_params参数组将使用该组的学习率0.01、优化器中的权重衰减0.0、梯度中心化使用默认值False。
|
||||
>>> #
|
||||
>>> # 优化器的最终参数顺序采用'order_params'的值。
|
||||
>>>
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> model = Model(net, loss_fn=loss, optimizer=optim)
|
||||
|
|
@ -0,0 +1,50 @@
|
|||
Class mindspore.nn.TrainOneStepCell(network, optimizer, sens=1.0)
|
||||
|
||||
训练网络封装类。
|
||||
|
||||
封装`network`和`optimizer`,构建一个输入'\*inputs'的用于训练的Cell。
|
||||
执行函数`construct`中会构建反向图以更新网络参数。支持不同的并行训练模式。
|
||||
|
||||
参数:
|
||||
network (Cell):训练网络。只支持单输出网络。
|
||||
optimizer (Union[Cell]):用于更新网络参数的优化器。
|
||||
sens (numbers.Number):反向传播的输入,缩放系数。默认值为1.0。
|
||||
|
||||
输入:
|
||||
- **(\*inputs)** (Tuple(Tensor)) - shape为:math:`(N, \ldots)`的Tensor组成的元组。
|
||||
|
||||
输出:
|
||||
Tensor,损失函数值,其shape通常为:math:`()`。
|
||||
|
||||
异常:
|
||||
TypeError:`sens`不是numbers.Number。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
示例:
|
||||
>>> net = Net()
|
||||
>>> loss_fn = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> optim = nn.Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9)
|
||||
>>> # 1)使用MindSpore提供的WithLossCell
|
||||
>>> loss_net = nn.WithLossCell(net, loss_fn)
|
||||
>>> train_net = nn.TrainOneStepCell(loss_net, optim)
|
||||
>>>
|
||||
>>> # 2)用户自定义的WithLossCell
|
||||
>>> class MyWithLossCell(Cell):
|
||||
... def __init__(self, backbone, loss_fn):
|
||||
... super(MyWithLossCell, self).__init__(auto_prefix=False)
|
||||
... self._backbone = backbone
|
||||
... self._loss_fn = loss_fn
|
||||
...
|
||||
... def construct(self, x, y, label):
|
||||
... out = self._backbone(x, y)
|
||||
... return self._loss_fn(out, label)
|
||||
...
|
||||
... @property
|
||||
... def backbone_network(self):
|
||||
... return self._backbone
|
||||
...
|
||||
>>> loss_net = MyWithLossCell(net, loss_fn)
|
||||
>>> train_net = nn.TrainOneStepCell(loss_net, optim)
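补充示意:构建完成后可直接调用训练网络执行单步训练(其中 `x`、`y`、`label` 为假设的输入Tensor,需与自定义的 `MyWithLossCell` 的输入匹配,仅作演示):

>>> train_net.set_train()
>>> loss = train_net(x, y, label)  # 执行一次前向计算、反向传播和参数更新,返回loss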
|
||||
|
|
@ -0,0 +1,116 @@
|
|||
Class mindspore.nn.TrainOneStepWithLossScaleCell(network, optimizer, scale_sense)
|
||||
|
||||
使用梯度放大功能(loss scale)的训练网络。
|
||||
|
||||
实现了包含梯度放大功能的单次训练。它使用网络、优化器和用于更新梯度放大系数的Cell(或一个Tensor)作为参数。可在host侧或device侧更新梯度放大系数。
|
||||
如果需要在host侧更新,使用Tensor作为`scale_sense`,否则,使用可更新梯度放大系数的Cell实例作为`scale_sense`。
|
||||
|
||||
参数:
|
||||
network (Cell):训练网络。仅支持单输出网络。
|
||||
optimizer (Cell):用于更新网络参数的优化器。
|
||||
scale_sense (Union[Tensor, Cell]):如果此值为Cell类型,`TrainOneStepWithLossScaleCell`会调用它来更新梯度放大系数。如果此值为Tensor类型,可调用`set_sense_scale`来更新梯度放大系数,shape为:math:`()`或:math:`(1,)`。
|
||||
|
||||
输入:
|
||||
- **(\*inputs)** (Tuple(Tensor)) - shape为:math:`(N, \ldots)`的Tensor组成的元组。
|
||||
|
||||
输出:
|
||||
Tuple,包含三个Tensor,分别为损失函数值、溢出状态和当前梯度放大系数。
|
||||
|
||||
- **loss** (Tensor) - shape为:math:`()`的Tensor。
|
||||
- **overflow** (Tensor) - shape为:math:`()`的Tensor,类型为bool。
|
||||
- **loss scale** (Tensor) - shape为:math:`()`的Tensor。
|
||||
|
||||
异常:
|
||||
TypeError:`scale_sense`既不是Cell,也不是Tensor。
|
||||
ValueError:`scale_sense`的shape既不是(1,)也不是()。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``GPU``
|
||||
|
||||
示例:
|
||||
>>> import numpy as np
|
||||
>>> from mindspore import Tensor, Parameter, nn, ops
|
||||
>>> from mindspore import dtype as mstype
|
||||
>>>
|
||||
>>> class Net(nn.Cell):
|
||||
... def __init__(self, in_features, out_features):
|
||||
... super(Net, self).__init__()
|
||||
... self.weight = Parameter(Tensor(np.ones([in_features, out_features]).astype(np.float32)),
|
||||
... name='weight')
|
||||
... self.matmul = ops.MatMul()
|
||||
...
|
||||
... def construct(self, x):
|
||||
... output = self.matmul(x, self.weight)
|
||||
... return output
|
||||
...
|
||||
>>> size, in_features, out_features = 16, 16, 10
|
||||
>>> # 1)当scale_sense类型为Cell时:
|
||||
>>> net = Net(in_features, out_features)
|
||||
>>> loss = nn.MSELoss()
|
||||
>>> optimizer = nn.Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9)
|
||||
>>> net_with_loss = nn.WithLossCell(net, loss)
|
||||
>>> manager = nn.DynamicLossScaleUpdateCell(loss_scale_value=2**12, scale_factor=2, scale_window=1000)
|
||||
>>> train_network = nn.TrainOneStepWithLossScaleCell(net_with_loss, optimizer, scale_sense=manager)
|
||||
>>> input = Tensor(np.ones([out_features, in_features]), mstype.float32)
|
||||
>>> labels = Tensor(np.ones([out_features,]), mstype.float32)
|
||||
>>> output = train_network(input, labels)
|
||||
>>>
|
||||
>>> # 2)当scale_sense类型为Tensor时:
|
||||
>>> net = Net(in_features, out_features)
|
||||
>>> loss = nn.MSELoss()
|
||||
>>> optimizer = nn.Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9)
|
||||
>>> net_with_loss = nn.WithLossCell(net, loss)
|
||||
>>> inputs = Tensor(np.ones([size, in_features]).astype(np.float32))
|
||||
>>> label = Tensor(np.zeros([size, out_features]).astype(np.float32))
|
||||
>>> scaling_sens = Tensor(np.full((1), np.finfo(np.float32).max), dtype=mstype.float32)
|
||||
>>> train_network = nn.TrainOneStepWithLossScaleCell(net_with_loss, optimizer, scale_sense=scaling_sens)
|
||||
>>> output = train_network(inputs, label)
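补充示意:当 `scale_sense` 为Tensor类型时,后续可通过 `set_sense_scale` 在host侧更新梯度放大系数(以下数值仅作演示,沿用上文已定义的名称):

>>> new_sens = Tensor(np.full((1), 2 ** 10), dtype=mstype.float32)
>>> train_network.set_sense_scale(new_sens)
>>> output = train_network(inputs, label)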
|
||||
|
||||
|
||||
get_overflow_status(status, compute_output)
|
||||
|
||||
获取浮点溢出状态。
|
||||
|
||||
溢出检测的目标过程执行完成后,获取溢出结果。继承该类自定义训练网络时,可复用该接口。
|
||||
|
||||
输入:
|
||||
- **status** (object) - 用于检测溢出的状态实例。
|
||||
- **compute_output** - 对特定计算过程进行溢出检测时,将`compute_output`设置为该计算过程的输出,以确保在执行计算之前获取了`status`。
|
||||
|
||||
输出:
|
||||
bool,是否发生溢出。
|
||||
|
||||
|
||||
process_loss_scale(overflow)
|
||||
|
||||
根据溢出状态计算梯度放大系数。继承该类自定义训练网络时,可复用该接口。
|
||||
|
||||
输入:
|
||||
- **overflow** (bool) - 是否发生溢出。
|
||||
|
||||
输出:
|
||||
bool,溢出状态,即输入。
|
||||
|
||||
|
||||
set_sense_scale(sens)
|
||||
|
||||
如果使用了Tensor类型的`scale_sense`,可调用此函数修改它的值。
|
||||
|
||||
输入:
|
||||
- **sens** (Tensor)- 新的梯度放大系数,其shape和类型需要与原始`scale_sense`相同。
|
||||
|
||||
|
||||
start_overflow_check(pre_cond, compute_input)
|
||||
|
||||
启动浮点溢出检测。创建并清除溢出检测状态。
|
||||
|
||||
指定参数'pre_cond'和'compute_input',以确保在正确的时间清除溢出状态。
|
||||
以当前接口为例,我们需要在损失函数计算后进行清除状态,在梯度计算过程中检测溢出。在这种情况下,pre_cond应为损失函数的输出,而compute_input应为梯度计算函数的输入。继承该类自定义训练网络时,可复用该接口。
|
||||
输入:
|
||||
- **pre_cond** (Tensor) - 启动溢出检测的先决条件。它决定溢出状态清除和先前处理的执行顺序。它确保函数 `start_overflow_check` 在执行完先决条件后清除状态。
|
||||
- **compute_input** (object) - 后续运算的输入。需要对特定的计算过程进行溢出检测时,将 `compute_input` 设置为这一计算过程的输入,以确保在执行该计算之前清除了溢出状态。
|
||||
|
||||
输出:
|
||||
Tuple[object, object],GPU后端的第一个值为False,而其他后端的第一个值是NPUAllocFloatStatus的实例。该值用于在`get_overflow_status`期间检测溢出。
|
||||
第二个值与输入的 `compute_input` 相同,用于控制执行顺序。
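以下给出一个继承该类并复用上述接口的极简示意。其中 `CustomTrainOneStepCell`、`tensor_grad_scale` 等名称仅作演示;`self.weights`、`self.grad`、`self.hyper_map`、`self.grad_reducer` 假定为基类提供的内部属性,实际行为请以源码为准:

>>> from mindspore import nn
>>> from mindspore.ops import composite as C
>>> from mindspore.ops import functional as F
>>> from mindspore.ops import operations as P
>>>
>>> _grad_scale = C.MultitypeFuncGraph("grad_scale")
>>> reciprocal = P.Reciprocal()
>>> @_grad_scale.register("Tensor", "Tensor")
... def tensor_grad_scale(scale, grad):
...     # 将梯度除以梯度放大系数,恢复真实梯度
...     return grad * F.cast(reciprocal(scale), F.dtype(grad))
...
>>> class CustomTrainOneStepCell(nn.TrainOneStepWithLossScaleCell):
...     def construct(self, *inputs):
...         weights = self.weights
...         loss = self.network(*inputs)
...         scaling_sens = self.scale_sense
...         # 在损失计算之后、梯度计算之前清除溢出状态
...         status, scaling_sens = self.start_overflow_check(loss, scaling_sens)
...         scaling_sens_filled = C.ones_like(loss) * F.cast(scaling_sens, F.dtype(loss))
...         grads = self.grad(self.network, weights)(*inputs, scaling_sens_filled)
...         grads = self.hyper_map(F.partial(_grad_scale, scaling_sens), grads)
...         grads = self.grad_reducer(grads)
...         # 获取溢出状态,并据此更新梯度放大系数
...         cond = self.get_overflow_status(status, grads)
...         overflow = self.process_loss_scale(cond)
...         # 未发生溢出时才执行参数更新
...         if not overflow:
...             loss = F.depend(loss, self.optimizer(grads))
...         return loss, cond, scaling_sens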
|
||||
|
|
@ -0,0 +1,29 @@
|
|||
Class mindspore.nn.WithEvalCell(network, loss_fn, add_cast_fp32=False)
|
||||
|
||||
封装前向网络和损失函数,返回用于计算评估指标的损失函数值、前向输出和标签。
|
||||
|
||||
|
||||
参数:
|
||||
network (Cell):前向网络。
|
||||
loss_fn (Cell):损失函数。
|
||||
add_cast_fp32 (bool):是否将数据类型调整为float32。默认值:False。
|
||||
|
||||
输入:
|
||||
- **data** (Tensor) - shape为:math:`(N, \ldots)`的Tensor。
|
||||
- **label** (Tensor) - shape为:math:`(N, \ldots)`的Tensor。
|
||||
|
||||
输出:
|
||||
Tuple(Tensor),包括标量损失值、shape为:math:`(N, \ldots)`的网络输出和shape为:math:`(N, \ldots)`的标签。
|
||||
|
||||
异常:
|
||||
TypeError:`add_cast_fp32`不是bool。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
示例:
|
||||
>>> # 未包含损失函数的前向网络
|
||||
>>> net = Net()
|
||||
>>> loss_fn = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> eval_net = nn.WithEvalCell(net, loss_fn)
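以下为调用示意(`data`、`label` 为假设的输入Tensor,shape需与网络和损失函数匹配):

>>> eval_net.set_train(False)
>>> loss, outputs, label = eval_net(data, label)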
|
||||
|
|
@ -0,0 +1,42 @@
|
|||
Class mindspore.nn.WithLossCell(backbone, loss_fn)
|
||||
|
||||
包含损失函数的Cell。
|
||||
|
||||
封装`backbone`和`loss_fn`。此Cell接受数据和标签作为输入,并将计算得到的损失值作为返回结果。
|
||||
|
||||
参数:
|
||||
backbone (Cell):要封装的目标网络。
|
||||
loss_fn (Cell):损失函数。
|
||||
|
||||
输入:
|
||||
- **data** (Tensor) - shape为:math:`(N, \ldots)`的Tensor。
|
||||
- **label** (Tensor) - shape为:math:`(N, \ldots)`的Tensor。
|
||||
|
||||
输出:
|
||||
Tensor,loss值,其shape通常为:math:`()`。
|
||||
|
||||
异常:
|
||||
TypeError:`data`或`label`的数据类型既不是float16也不是float32。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
示例:
|
||||
>>> net = Net()
|
||||
>>> loss_fn = nn.SoftmaxCrossEntropyWithLogits(sparse=False)
|
||||
>>> net_with_criterion = nn.WithLossCell(net, loss_fn)
|
||||
>>>
|
||||
>>> batch_size = 2
|
||||
>>> data = Tensor(np.ones([batch_size, 1, 32, 32]).astype(np.float32) * 0.01)
|
||||
>>> label = Tensor(np.ones([batch_size, 10]).astype(np.float32))
|
||||
>>>
|
||||
>>> output_data = net_with_criterion(data, label)
|
||||
|
||||
|
||||
backbone_network
|
||||
|
||||
获取骨干网络。
|
||||
|
||||
返回:
|
||||
Cell,骨干网络。
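使用示意(沿用上文示例中的 `net_with_criterion`):

>>> backbone = net_with_criterion.backbone_network  # 返回被封装的骨干网络,即上文的net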
|
||||
|
|
@ -0,0 +1,5 @@
|
|||
- **float** - 固定的学习率。必须大于等于零。
|
||||
- **int** - 固定的学习率。必须大于等于零。整数类型会被转换为浮点数。
|
||||
- **Tensor** - 可以是标量或一维向量。标量是固定的学习率。一维向量是动态的学习率,第i步将取向量中第i个值作为学习率。
|
||||
- **Iterable** - 动态的学习率。第i步将取迭代器第i个值作为学习率。
|
||||
- **LearningRateSchedule** - 动态的学习率。在训练过程中,优化器将使用步数(step)作为输入,调用 `LearningRateSchedule` 实例来计算当前学习率。
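下面给出一个简单示意(以 `nn.Momentum` 为例,`Net` 为假设的已定义网络,仅演示几种学习率的传入方式):

>>> from mindspore import nn
>>> net = Net()  # 假设已定义的网络
>>> # 固定学习率(float)
>>> optim = nn.Momentum(net.trainable_params(), learning_rate=0.01, momentum=0.9)
>>> # 列表形式的动态学习率,第i步取列表中第i个值
>>> optim = nn.Momentum(net.trainable_params(), learning_rate=[0.1, 0.05, 0.01], momentum=0.9)
>>> # LearningRateSchedule实例,根据当前step计算学习率
>>> lr_schedule = nn.ExponentialDecayLR(learning_rate=0.1, decay_rate=0.9, decay_steps=100)
>>> optim = nn.Momentum(net.trainable_params(), learning_rate=lr_schedule, momentum=0.9)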
|
|
@ -0,0 +1 @@
|
|||
- **loss_scale** (float) - 梯度放大系数,必须大于0。如果 `loss_scale` 是整数,它将被转换为浮点数。通常使用默认值,仅当训练时使用了 `FixedLossScaleManager`,且 `FixedLossScaleManager` 的 `drop_overflow_update` 属性配置为False时,此值需要与 `FixedLossScaleManager` 中的 `loss_scale` 相同。有关更多详细信息,请参阅 :class:`mindspore.FixedLossScaleManager`。默认值:1.0。
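下面是一个最简示意(假设 `net`、`loss_fn` 已定义,以 `nn.Momentum` 和 `Model` 为例,仅用于说明两处取值需保持一致):

>>> from mindspore import Model, FixedLossScaleManager, nn
>>> loss_scale = 1024.0
>>> loss_scale_manager = FixedLossScaleManager(loss_scale, drop_overflow_update=False)
>>> # drop_overflow_update为False时,优化器的loss_scale需与FixedLossScaleManager中的loss_scale一致
>>> optim = nn.Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9, loss_scale=loss_scale)
>>> model = Model(net, loss_fn=loss_fn, optimizer=optim, loss_scale_manager=loss_scale_manager)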
|
|
@ -0,0 +1 @@
|
|||
- **grad_centralization** - 可选。如果键中存在"grad_centralization",则使用对应的值,该值必须为布尔类型。如果没有,则认为 `grad_centralization` 为False。该参数仅适用于卷积层。
|
|
@ -0,0 +1 @@
|
|||
- **lr** - 可选。如果键中存在"lr",则使用对应的值作为学习率。如果没有,则使用优化器中配置的 `learning_rate` 作为学习率。
|
|
@ -0,0 +1 @@
|
|||
- **order_params** - 可选。对应值是预期的参数更新顺序。当使用参数分组功能时,通常使用该配置项保持 `parameters` 的顺序以提升性能。如果键中存在"order_params",则会忽略该组配置中的其他键。"order_params"中的参数必须在某一组 `params` 参数中。
|
|
@ -0,0 +1 @@
|
|||
- **params** - 必填。当前组别的权重,该值必须是 `Parameter` 列表。
|
|
@ -0,0 +1 @@
|
|||
- **weight_decay** - 可选。如果键中存在"weight_decay",则使用对应的值作为权重衰减值。如果没有,则使用优化器中配置的 `weight_decay` 作为权重衰减值。
|
|
@ -0,0 +1 @@
|
|||
优化器和混合精度之间通常没有联系。但是,当使用 `FixedLossScaleManager` 且 `FixedLossScaleManager` 中的 `drop_overflow_update` 设置为False时,优化器需要设置 `loss_scale` 。由于此优化器没有 `loss_scale` 参数,因此需要通过其他方式处理 `loss_scale` ,如何正确处理 `loss_scale` 详见 `LossScale <https://www.mindspore.cn/docs/programming_guide/zh-CN/master/lossscale.html>`_ 。
|
|
@ -0,0 +1,2 @@
|
|||
如果前向网络使用了SparseGatherV2等算子,优化器会执行稀疏运算,通过设置 `target` 为CPU,可在主机(host)上进行稀疏运算。
|
||||
稀疏特性在持续开发中。
|
|
@ -0,0 +1,2 @@
|
|||
在参数未分组时,优化器配置的 `weight_decay` 应用于名称含有"beta"或"gamma"的网络参数,通过网络参数分组可调整权重衰减策略。分组时,每组网络参数均可配置 `weight_decay` ,若未配置,则该组网络参数使用优化器中配置的 `weight_decay` 。
|
||||
|
|
@ -0,0 +1,9 @@
|
|||
.. py:method:: target
|
||||
:property:
|
||||
|
||||
该属性用于指定在主机(host)上还是设备(device)上更新参数。输入类型为str,只能是'CPU','Ascend'或'GPU'。
|
||||
|
||||
.. py:method:: unique
|
||||
:property:
|
||||
|
||||
该属性表示是否在优化器中进行梯度去重,通常用于稀疏网络。如果梯度是稀疏的则设置为True。如果前向稀疏网络已对权重去重,即梯度是稠密的,则设置为False。未设置时默认值为True。
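以下为设置示意(以支持稀疏特性的 `nn.LazyAdam` 为例,`net` 为假设的已定义网络):

>>> optim = nn.LazyAdam(net.trainable_params(), learning_rate=0.1)
>>> optim.target = "CPU"  # 稀疏运算在主机(host)侧执行
>>> optim.unique = True   # 梯度为稀疏时在优化器中去重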
|
|
@ -206,7 +206,7 @@ class Adam(Optimizer):
|
|||
|
||||
:math:`m` represents the 1st moment vector `moment1`, :math:`v` represents the 2nd moment vector `moment2`,
|
||||
:math:`g` represents `gradients`, :math:`l` represents scaling factor, :math:`\beta_1, \beta_2` represent
|
||||
`beta1` and `beta2`, :math:`t` represents updating step while :math:`beta_1^t` and :math:`beta_2^t` represent
|
||||
`beta1` and `beta2`, :math:`t` represents the current step while :math:`beta_1^t` and :math:`beta_2^t` represent
|
||||
`beta1_power` and `beta2_power`, :math:`\alpha` represents `learning_rate`, :math:`w` represents `params`,
|
||||
:math:`\epsilon` represents `eps`.
|
||||
|
||||
|
@ -263,7 +263,7 @@ class Adam(Optimizer):
|
|||
Default: 0.999.
|
||||
eps (float): Term added to the denominator to improve numerical stability. Should be greater than 0. Default:
|
||||
1e-8.
|
||||
use_locking (bool): Whether to enable a lock to protect variable tensors from being updated.
|
||||
use_locking (bool): Whether to enable a lock to protect the updating process of variable tensors.
|
||||
If true, updates of the `w`, `m`, and `v` tensors will be protected by a lock.
|
||||
If false, the result is unpredictable. Default: False.
|
||||
use_nesterov (bool): Whether to use Nesterov Accelerated Gradient (NAG) algorithm to update the gradients.
|
||||
|
@ -380,7 +380,7 @@ class Adam(Optimizer):
|
|||
|
||||
class AdamWeightDecay(Optimizer):
|
||||
r"""
|
||||
Implements the Adam algorithm to fix the weight decay.
|
||||
Implements the Adam algorithm with weight decay.
|
||||
|
||||
.. math::
|
||||
\begin{array}{ll} \\
|
||||
|
@ -399,7 +399,7 @@ class AdamWeightDecay(Optimizer):
|
|||
|
||||
:math:`m` represents the 1st moment vector `moment1`, :math:`v` represents the 2nd moment vector `moment2`,
|
||||
:math:`g` represents `gradients`, :math:`lr` represents `learning_rate`,
|
||||
:math:`\beta_1, \beta_2` represent `beta1` and `beta2`, :math:`t` represents updating step while
|
||||
:math:`\beta_1, \beta_2` represent `beta1` and `beta2`, :math:`t` represents the current step,
|
||||
:math:`w` represents `params`.
|
||||
|
||||
Note:
|
||||
|
@ -542,7 +542,7 @@ class AdamOffload(Optimizer):
|
|||
|
||||
:math:`m` represents the 1st moment vector `moment1`, :math:`v` represents the 2nd moment vector `moment2`,
|
||||
:math:`g` represents `gradients`, :math:`l` represents scaling factor, :math:`\beta_1, \beta_2` represent
|
||||
`beta1` and `beta2`, :math:`t` represents updating step while :math:`beta_1^t` and :math:`beta_2^t` represent
|
||||
`beta1` and `beta2`, :math:`t` represents the current step while :math:`beta_1^t` and :math:`beta_2^t` represent
|
||||
`beta1_power` and `beta2_power`, :math:`\alpha` represents `learning_rate`, :math:`w` represents `params`,
|
||||
:math:`\epsilon` represents `eps`.
|
||||
|
||||
|
@ -593,7 +593,7 @@ class AdamOffload(Optimizer):
|
|||
Default: 0.999.
|
||||
eps (float): Term added to the denominator to improve numerical stability. Should be greater than 0. Default:
|
||||
1e-8.
|
||||
use_locking (bool): Whether to enable a lock to protect variable tensors from being updated.
|
||||
use_locking (bool): Whether to enable a lock to protect the updating process of variable tensors.
|
||||
If true, updates of the `w`, `m`, and `v` tensors will be protected by a lock.
|
||||
If false, the result is unpredictable. Default: False.
|
||||
use_nesterov (bool): Whether to use Nesterov Accelerated Gradient (NAG) algorithm to update the gradients.
|
||||
|
|
|
@ -98,9 +98,9 @@ class FTRL(Optimizer):
|
|||
\end{cases}\\
|
||||
\end{array}
|
||||
|
||||
:math:`m` represents `accum`, :math:`g` represents `grads`, :math:`t` represents updating step,
|
||||
:math:`u` represents `linear`, :math:`p` represents `lr_power`, :math:`\alpha` represents `learning_rate`,
|
||||
:math:`\omega` represents `params`.
|
||||
:math:`m` represents accumulators, :math:`g` represents `grads`, :math:`t` represents the current step,
|
||||
:math:`u` represents the linear coefficient to be updated, :math:`p` represents `lr_power`, :math:`\alpha`
|
||||
represents `learning_rate`, :math:`\omega` represents `params`.
|
||||
|
||||
Note:
|
||||
The sparse strategy is applied while the SparseGatherV2 operator is used for forward network. If the sparse
|
||||
|
@ -134,7 +134,7 @@ class FTRL(Optimizer):
|
|||
If `order_params` in the keys, other keys will be ignored and the element of 'order_params' must be in
|
||||
one group of `params`.
|
||||
|
||||
initial_accum (float): The starting value for accumulators, must be zero or positive values. Default: 0.1.
|
||||
initial_accum (float): The starting value for accumulators `m`, must be zero or positive values. Default: 0.1.
|
||||
learning_rate (float): The learning rate value, must be zero or positive, dynamic learning rate is currently
|
||||
not supported. Default: 0.001.
|
||||
lr_power (float): Learning rate power controls how the learning rate decreases during training, must be less
|
||||
|
@ -183,7 +183,8 @@ class FTRL(Optimizer):
|
|||
>>> optim = nn.FTRL(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||
>>> # The conv_params's parameters will use default learning rate of 0.1 and weight decay of 0.01 and grad
|
||||
>>> # centralization of True.
|
||||
>>> # The no_conv_params's parameters will use default weight decay of 0.0 and grad centralization of False.
|
||||
>>> # The no_conv_params's parameters will use default learning rate of 0.1, default weight decay
|
||||
>>> # of 0.0 and grad centralization of False.
|
||||
>>> # The final parameters order in which the optimizer will be followed is the value of 'order_params'.
|
||||
>>>
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
|
|
|
@ -172,7 +172,7 @@ def _check_param_value(beta1, beta2, eps, prim_name):
|
|||
|
||||
class Lamb(Optimizer):
|
||||
r"""
|
||||
Lamb(Layer-wise Adaptive Moments optimizer for Batching training) Dynamic Learning Rate.
|
||||
An optimizer that implements the Lamb(Layer-wise Adaptive Moments optimizer for Batching training) algorithm.
|
||||
|
||||
LAMB is an optimization algorithm employing a layerwise adaptive large batch optimization technique.
|
||||
Refer to the paper `LARGE BATCH OPTIMIZATION FOR DEEP LEARNING: TRAINING BERT IN 76
|
||||
|
|
|
@ -71,16 +71,16 @@ class LARS(Optimizer):
|
|||
g_{t+1} = \lambda * (g_{t} + \delta * \omega)
|
||||
\end{array}
|
||||
|
||||
:math:`\theta` represents `coefficient`, :math:`\omega` represents `parameters`, :math:`g` represents `gradients`,
|
||||
:math:`t` represents updating step, :math:`\delta` represents `weight_decay`,
|
||||
:math:`\alpha` represents `learning_rate`, :math:`clip` represents `use_clip`.
|
||||
:math:`\theta` represents `coefficient`, :math:`\omega` represents the network parameters, :math:`g` represents
|
||||
`gradients`, :math:`t` represents the current step, :math:`\delta` represents `weight_decay` in `optimizer`,
|
||||
:math:`\alpha` represents `learning_rate` in `optimizer`, :math:`clip` represents `use_clip`.
|
||||
|
||||
Args:
|
||||
optimizer (Optimizer): MindSpore optimizer for which to wrap and modify gradients.
|
||||
epsilon (float): Term added to the denominator to improve numerical stability. Default: 1e-05.
|
||||
coefficient (float): Trust coefficient for calculating the local learning rate. Default: 0.001.
|
||||
use_clip (bool): Whether to use clip operation for calculating the local learning rate. Default: False.
|
||||
lars_filter (Function): A function to determine whether apply the LARS algorithm. Default:
|
||||
lars_filter (Function): A function to determine which network parameters the LARS algorithm is applied to. Default:
|
||||
lambda x: 'LayerNorm' not in x.name and 'bias' not in x.name.
|
||||
|
||||
Inputs:
|
||||
|
|
|
@ -106,10 +106,10 @@ def _check_param_value(beta1, beta2, eps, weight_decay, prim_name):
|
|||
|
||||
class LazyAdam(Optimizer):
|
||||
r"""
|
||||
This optimizer will apply a lazy adam algorithm when gradient is sparse.
|
||||
Updates gradients by the Adaptive Moment Estimation (Adam) algorithm. The Adam algorithm is proposed
|
||||
in `Adam: A Method for Stochastic Optimization <https://arxiv.org/abs/1412.6980>`_.
|
||||
|
||||
The original adam algorithm is proposed in
|
||||
`Adam: A Method for Stochastic Optimization <https://arxiv.org/abs/1412.6980>`_.
|
||||
This optimizer will apply a lazy adam algorithm when gradient is sparse.
|
||||
|
||||
The updating formulas are as follows,
|
||||
|
||||
|
@ -123,7 +123,7 @@ class LazyAdam(Optimizer):
|
|||
|
||||
:math:`m` represents the 1st moment vector `moment1`, :math:`v` represents the 2nd moment vector `moment2`,
|
||||
:math:`g` represents `gradients`, :math:`l` represents scaling factor, :math:`\beta_1, \beta_2` represent
|
||||
`beta1` and `beta2`, :math:`t` represents updating step while :math:`beta_1^t` and :math:`beta_2^t` represent
|
||||
`beta1` and `beta2`, :math:`t` represents the current step while :math:`beta_1^t` and :math:`beta_2^t` represent
|
||||
`beta1_power` and `beta2_power`, :math:`\alpha` represents `learning_rate`, :math:`w` represents `params`,
|
||||
:math:`\epsilon` represents `eps`.
|
||||
|
||||
|
@ -182,7 +182,7 @@ class LazyAdam(Optimizer):
|
|||
Default: 0.999.
|
||||
eps (float): Term added to the denominator to improve numerical stability. Should be greater than 0. Default:
|
||||
1e-8.
|
||||
use_locking (bool): Whether to enable a lock to protect variable tensors from being updated.
|
||||
use_locking (bool): Whether to enable a lock to protect the updating process of variable tensors.
|
||||
If true, updates of the `w`, `m`, and `v` tensors will be protected by a lock.
|
||||
If false, the result is unpredictable. Default: False.
|
||||
use_nesterov (bool): Whether to use Nesterov Accelerated Gradient (NAG) algorithm to update the gradients.
|
||||
|
|
|
@ -69,7 +69,7 @@ class ProximalAdagrad(Optimizer):
|
|||
.. math::
|
||||
var_{t+1} = \frac{sign(\text{prox_v})}{1 + lr * l2} * \max(\left| \text{prox_v} \right| - lr * l1, 0)
|
||||
|
||||
Here : where grad, lr, var, accum and t denote the gradients, learning_rate, params and accumulation and current
|
||||
Here, grad, lr, var, accum and t denote the `grads`, `learning_rate`, `params`, accumulation and the current
|
||||
step respectively.
|
||||
|
||||
Note:
|
||||
|
@ -105,7 +105,7 @@ class ProximalAdagrad(Optimizer):
|
|||
If `order_params` in the keys, other keys will be ignored and the element of 'order_params' must be in
|
||||
one group of `params`.
|
||||
|
||||
accum (float): The starting value for accumulators, must be zero or positive values. Default: 0.1.
|
||||
accum (float): The starting value for accumulators `accum`, must be zero or positive values. Default: 0.1.
|
||||
learning_rate (Union[float, int, Tensor, Iterable, LearningRateSchedule]): Default: 0.001.
|
||||
|
||||
- float: The fixed learning rate value. Must be equal to or greater than 0.
|
||||
|
|
|
@ -75,13 +75,14 @@ class RMSProp(Optimizer):
|
|||
w = w - m_{t+1}
|
||||
|
||||
where :math:`w` represents `params`, which will be updated.
|
||||
:math:`g_{t+1}` is mean gradients, :math:`g_{t}` is the last moment of :math:`g_{t+1}`.
|
||||
:math:`s_{t+1}` is the mean square gradients, :math:`s_{t}` is the last moment of :math:`s_{t+1}`,
|
||||
:math:`m_{t+1}` is moment, the delta of `w`, :math:`m_{t}` is the last moment of :math:`m_{t+1}`.
|
||||
:math:`g_{t+1}` is mean gradients.
|
||||
:math:`s_{t+1}` is the mean square gradients.
|
||||
:math:`m_{t+1}` is moment, the delta of `w`.
|
||||
:math:`\\rho` represents `decay`. :math:`\\beta` is the momentum term, represents `momentum`.
|
||||
:math:`\\epsilon` is a smoothing term to avoid division by zero, represents `epsilon`.
|
||||
:math:`\\eta` is learning rate, represents `learning_rate`. :math:`\\nabla Q_{i}(w)` is gradients,
|
||||
represents `gradients`.
|
||||
:math:`t` represents the current step.
|
||||
|
||||
Note:
|
||||
If parameters are not grouped, the `weight_decay` in optimizer will be applied on the network parameters without
|
||||
|
@ -131,9 +132,9 @@ class RMSProp(Optimizer):
|
|||
greater than 0. Default: 0.0.
|
||||
epsilon (float): Term added to the denominator to improve numerical stability. Should be greater than
|
||||
0. Default: 1e-10.
|
||||
use_locking (bool): Whether to enable a lock to protect the variable and accumulation tensors from being
|
||||
updated. Default: False.
|
||||
centered (bool): If true, gradients are normalized by the estimated variance of the gradient. Default: False.
|
||||
use_locking (bool): Whether to enable a lock to protect the updating process of variable tensors.
|
||||
Default: False.
|
||||
centered (bool): If True, gradients are normalized by the estimated variance of the gradient. Default: False.
|
||||
loss_scale (float): A floating point value for the loss scale. Should be greater than 0. In general, use the
|
||||
default value. Only when `FixedLossScaleManager` is used for training and the `drop_overflow_update` in
|
||||
`FixedLossScaleManager` is set to False, then this value needs to be the same as the `loss_scale` in
|
||||
|
|
|
@ -128,8 +128,8 @@ class SGD(Optimizer):
|
|||
... {'params': no_conv_params, 'lr': 0.01},
|
||||
... {'order_params': net.trainable_params()}]
|
||||
>>> optim = nn.SGD(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||
>>> # The conv_params's parameters will use default learning rate of 0.1 default weight decay of 0.0 and grad
|
||||
>>> # centralization of True.
|
||||
>>> # The conv_params's parameters will use default learning rate of 0.1 and default weight decay of 0.0
|
||||
>>> # and grad centralization of True.
|
||||
>>> # The no_conv_params's parameters will use learning rate of 0.01 and default weight decay of 0.0 and grad
|
||||
>>> # centralization of False.
|
||||
>>> # The final parameters order in which the optimizer will be followed is the value of 'order_params'.
|
||||
|
|
|
@ -287,13 +287,13 @@ class TrainOneStepCell(Cell):
|
|||
r"""
|
||||
Network training package class.
|
||||
|
||||
Wraps the network with an optimizer. The resulting Cell is trained with input '\*inputs'.
|
||||
Wraps the `network` with the `optimizer`. The resulting Cell is trained with input '\*inputs'.
|
||||
The backward graph will be created in the construct function to update the parameter. Different
|
||||
parallel modes are available for training.
|
||||
|
||||
Args:
|
||||
network (Cell): The training network. The network only supports single output.
|
||||
optimizer (Union[Cell]): Optimizer for updating the weights.
|
||||
optimizer (Union[Cell]): Optimizer for updating the network parameters.
|
||||
sens (numbers.Number): The scaling number to be filled as the input of backpropagation. Default value is 1.0.
|
||||
|
||||
Inputs:
|
||||
|
@ -303,7 +303,7 @@ class TrainOneStepCell(Cell):
|
|||
Tensor, a tensor means the loss value, the shape of which is usually :math:`()`.
|
||||
|
||||
Raises:
|
||||
TypeError: If `sens` is not a number.
|
||||
TypeError: If `sens` is not a numbers.Number.
|
||||
|
||||
Supported Platforms:
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
@ -312,7 +312,7 @@ class TrainOneStepCell(Cell):
|
|||
>>> net = Net()
|
||||
>>> loss_fn = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> optim = nn.Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9)
|
||||
>>> #1) Using the WithLossCell existing provide
|
||||
>>> #1) Using the WithLossCell provided by MindSpore
|
||||
>>> loss_net = nn.WithLossCell(net, loss_fn)
|
||||
>>> train_net = nn.TrainOneStepCell(loss_net, optim)
|
||||
>>>
|
||||
|
@ -596,22 +596,21 @@ class VirtualDatasetCellTriple(Cell):
|
|||
|
||||
class WithEvalCell(Cell):
|
||||
r"""
|
||||
Cell that returns loss, output and label for evaluation.
|
||||
Wraps the forward network with the loss function.
|
||||
|
||||
This Cell accepts a network and loss function as arguments and computes loss for model.
|
||||
It returns loss, output and label to calculate the metrics.
|
||||
It returns loss, forward output and label to calculate the metrics.
|
||||
|
||||
Args:
|
||||
network (Cell): The network Cell.
|
||||
loss_fn (Cell): The loss Cell.
|
||||
add_cast_fp32 (bool): Adjust the data type to float32. Default: False.
|
||||
network (Cell): The forward network.
|
||||
loss_fn (Cell): The loss function.
|
||||
add_cast_fp32 (bool): Whether to adjust the data type to float32. Default: False.
|
||||
|
||||
Inputs:
|
||||
- **data** (Tensor) - Tensor of shape :math:`(N, \ldots)`.
|
||||
- **label** (Tensor) - Tensor of shape :math:`(N, \ldots)`.
|
||||
|
||||
Outputs:
|
||||
Tuple, containing a scalar loss Tensor, a network output Tensor of shape :math:`(N, \ldots)`
|
||||
Tuple(Tensor), containing a scalar loss Tensor, a network output Tensor of shape :math:`(N, \ldots)`
|
||||
and a label Tensor of shape :math:`(N, \ldots)`.
|
||||
|
||||
Raises:
|
||||
|
@ -621,7 +620,7 @@ class WithEvalCell(Cell):
|
|||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
Examples:
|
||||
>>> # For a defined network Net without loss function
|
||||
>>> # Forward network without loss function
|
||||
>>> net = Net()
|
||||
>>> loss_fn = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> eval_net = nn.WithEvalCell(net, loss_fn)
|
||||
|
|
|
@ -59,16 +59,17 @@ class DynamicLossScaleUpdateCell(Cell):
|
|||
Dynamic Loss scale update cell.
|
||||
|
||||
For loss scaling training, the initial loss scaling value will be set to be `loss_scale_value`.
|
||||
In each training step, the loss scaling value will be updated by loss scaling value/`scale_factor`
|
||||
when there is an overflow. And it will be increased by loss scaling value * `scale_factor` if there is no
|
||||
overflow for a continuous `scale_window` steps. This cell is used for Graph mode training in which all
|
||||
logic will be executed on device side(Another training mode is normal(non-sink) mode in which some logic will be
|
||||
executed on host).
|
||||
In each training step, the loss scaling value will be decreased to `loss_scale`/`scale_factor`
|
||||
when there is an overflow, and it will be increased to `loss_scale` * `scale_factor` if there is no
|
||||
overflow for a continuous `scale_window` steps.
|
||||
|
||||
`get_update_cell` method of :class:`mindspore.DynamicLossScaleManager` will return this class. It will be called
|
||||
by :class:`mindspore.TrainOneStepWithLossScaleCell` during training to update the loss scale.
|
||||
|
||||
Args:
|
||||
loss_scale_value (float): Initializes loss scale.
|
||||
scale_factor (int): Coefficient of increase and decrease.
|
||||
scale_window (int): Maximum continuous training steps that do not have overflow.
|
||||
scale_window (int): Maximum continuous training steps that do not have overflow to increase the loss scale.
|
||||
|
||||
Inputs:
|
||||
- **loss_scale** (Tensor) - The loss scale value during training with shape :math:`()`.
|
||||
|
@ -77,9 +78,6 @@ class DynamicLossScaleUpdateCell(Cell):
|
|||
Outputs:
|
||||
bool, the input `overflow`.
|
||||
|
||||
Raises:
|
||||
TypeError: If dtype of `inputs` or `label` is neither float16 nor float32.
|
||||
|
||||
Supported Platforms:
|
||||
``Ascend`` ``GPU``
|
||||
|
||||
|
@ -162,15 +160,17 @@ class DynamicLossScaleUpdateCell(Cell):
|
|||
|
||||
class FixedLossScaleUpdateCell(Cell):
|
||||
"""
|
||||
Static scale update cell, the loss scaling value will not be updated.
|
||||
Update cell with fixed loss scaling value.
|
||||
|
||||
For usage, refer to `DynamicLossScaleUpdateCell`.
|
||||
`get_update_cell` method of :class:`mindspore.FixedLossScaleManager` will return this class. It will be called
|
||||
by :class:`mindspore.TrainOneStepWithLossScaleCell` during training.
|
||||
|
||||
Args:
|
||||
loss_scale_value (float): Initializes loss scale.
|
||||
|
||||
Inputs:
|
||||
- **loss_scale** (Tensor) - The loss scale value during training with shape :math:`()`, that will be ignored.
|
||||
- **loss_scale** (Tensor) - The loss scale value during training with shape :math:`()`; it is ignored in this
|
||||
class.
|
||||
- **overflow** (bool) - Whether the overflow occurs or not.
|
||||
|
||||
Outputs:
|
||||
|
@ -227,28 +227,27 @@ class TrainOneStepWithLossScaleCell(TrainOneStepCell):
|
|||
r"""
|
||||
Network training with loss scaling.
|
||||
|
||||
This is a training step with loss scaling. It takes a network, an optimizer and possibly a scale update
|
||||
Cell as args. The loss scale value can be updated in both host side or device side. The
|
||||
TrainOneStepWithLossScaleCell will be compiled to be graph which takes `*inputs` as input data.
|
||||
The Tensor type of `scale_sense` is acting as loss scaling value. If you want to update it on host side,
|
||||
the value must be provided. If the Tensor type of `scale_sense` is not given, the loss scale update logic
|
||||
must be provide by Cell type of `scale_sense`.
|
||||
This is a training step with loss scaling. It takes a network, an optimizer and a scale update Cell (or a Tensor) as
|
||||
args. The loss scale value can be updated on either the host side or the device side. If you want to update it on
|
||||
the host side, use a value of Tensor type as `scale_sense`; otherwise, use a Cell instance for updating the loss
|
||||
scale as `scale_sense`.
|
||||
|
||||
Args:
|
||||
network (Cell): The training network. The network only supports single output.
|
||||
optimizer (Cell): Optimizer for updating the weights.
|
||||
scale_sense (Union[Tensor, Cell]): If this value is Cell type, the loss scaling update logic cell.If this value
|
||||
is Tensor type, Tensor with shape :math:`()` or :math:`(1,)`.
|
||||
optimizer (Cell): Optimizer for updating the network parameters.
|
||||
scale_sense (Union[Tensor, Cell]): If this value is a Cell, it will be called by `TrainOneStepWithLossScaleCell`
|
||||
to update loss scale. If this value is a Tensor, the loss scale can be modified by `set_sense_scale`,
|
||||
the shape should be :math:`()` or :math:`(1,)`.
|
||||
|
||||
Inputs:
|
||||
- **(*inputs)** (Tuple(Tensor)) - Tuple of input tensors with shape :math:`(N, \ldots)`.
|
||||
|
||||
Outputs:
|
||||
Tuple of 3 Tensor, the loss, overflow flag and current loss scaling value.
|
||||
Tuple of 3 Tensor, the loss, overflow flag and current loss scale value.
|
||||
|
||||
- **loss** (Tensor) - Tensor with shape :math:`()`.
|
||||
- **overflow** (Tensor) - Tensor with shape :math:`()`, type is bool.
|
||||
- **loss scaling value** (Tensor) - Tensor with shape :math:`()`
|
||||
- **loss scale** (Tensor) - Tensor with shape :math:`()`
|
||||
|
||||
Raises:
|
||||
TypeError: If `scale_sense` is neither Cell nor Tensor.
|
||||
|
@ -350,8 +349,7 @@ class TrainOneStepWithLossScaleCell(TrainOneStepCell):
|
|||
|
||||
def set_sense_scale(self, sens):
|
||||
"""
|
||||
If the user has set the sens in the training process and wants to reassign the value, he can call
|
||||
this function again to make modification, and sens needs to be of type Tensor.
|
||||
If the `scale_sense` is of Tensor type, this function can be called to reassign its value.
|
||||
|
||||
Args:
|
||||
sens(Tensor): The new sense whose shape and type are the same with original `scale_sense`.
|
||||
|
@ -382,7 +380,7 @@ class TrainOneStepWithLossScaleCell(TrainOneStepCell):
|
|||
|
||||
Returns:
|
||||
Tuple[object, object], the first value is False for GPU backend, while it is an instance of
|
||||
NPUAllocFloatStatus for other backend. The status is used to detect overflow during overflow detection.
|
||||
NPUAllocFloatStatus for other backend. The status is used to detect overflow during `get_overflow_status`.
|
||||
The second value is the same as the input of `compute_input`, but contains some information about the
|
||||
execution order.
|
||||
"""
|
||||
|
@ -406,7 +404,7 @@ class TrainOneStepWithLossScaleCell(TrainOneStepCell):
|
|||
Args:
|
||||
status (object): A status instance used to detect the overflow.
|
||||
compute_output: Overflow detection should be performed on a certain computation. Set `compute_output`
|
||||
as the output of the computation, to ensure overflow status is acquired before executing the
|
||||
as the output of the computation, to ensure overflow `status` is acquired before executing the
|
||||
computation.
|
||||
|
||||
Returns:
|
||||
|
@ -442,7 +440,7 @@ class TrainOneStepWithLossScaleCell(TrainOneStepCell):
|
|||
overflow(bool): Whether the overflow occurs or not.
|
||||
|
||||
Returns:
|
||||
bool, overflow value.
|
||||
bool, the input overflow value.
|
||||
"""
|
||||
if self.loss_scaling_manager is not None:
|
||||
return self.loss_scaling_manager(self.scale_sense, overflow)
|
||||
|
|
|
@ -25,8 +25,8 @@ class LossScaleManager:
|
|||
Derived class needs to implement all of its methods. `get_loss_scale` is used to get current loss scale value.
|
||||
`update_loss_scale` is used to update loss scale value, `update_loss_scale` will be called during the training.
|
||||
`get_update_cell` is used to get the instance of :class:`mindspore.nn.Cell` that is used to update the loss scale,
|
||||
the instance will be called during the training. When using sink mode, only the `get_update_cell` works, otherwise
|
||||
both `update_loss_scale` and `get_update_cell` works.
|
||||
the instance will be called during training. Currently, `get_update_cell` is mostly used.
|
||||
|
||||
For example, :class:`mindspore.FixedLossScaleManager` and :class:`mindspore.DynamicLossScaleManager`.
|
||||
"""
|
||||
def get_loss_scale(self):
|
||||
|
@ -105,7 +105,8 @@ class FixedLossScaleManager(LossScaleManager):
|
|||
def get_update_cell(self):
|
||||
"""
|
||||
Returns the instance of :class:`mindspore.nn.Cell` that is used to update the loss scale which will be called by
|
||||
:class:`mindspore.nn.TrainOneStepWithLossScaleCell`.
|
||||
:class:`mindspore.nn.TrainOneStepWithLossScaleCell`. As the loss scale is fixed in this class, the instance
|
||||
will do nothing.
|
||||
|
||||
Returns:
|
||||
None or :class:`mindspore.FixedLossScaleUpdateCell`. Instance of
|
||||
|
|