!31843 Fix Chinese API documentation content

Merge pull request !31843 from 王程浩/master
2022-03-29 03:09:16 +00:00 · 2022-03-29 03:09:16 +00:00 · 78d77897f2
parent ff4974cca4 ce0ce12a08
commit 78d77897f2
4 changed files with 21 additions and 18 deletions
--- a/docs/api/api_python/nn/mindspore.nn.thor.rst
+++ b/docs/api/api_python/nn/mindspore.nn.thor.rst
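@@ -28,12 +28,12 @@ mindspore.nn.thor
    :math:`\otimes` denotes the Kronecker product, and :math:`\gamma` denotes the learning rate.

    .. note::
-        When parameter groups are separated, the weight decay of each group is applied to its parameters if the weight decay is positive. When parameter groups are not separated, the `weight_decay` in the API is applied to parameters whose names contain neither 'beta' nor 'gamma' if `weight_decay` is positive.
+        When parameter groups are separated, the `weight_decay` of each group is applied to the corresponding parameters. When parameter groups are not separated, the `weight_decay` in the optimizer is applied to parameters whose names contain neither 'beta' nor 'gamma'.

-        When parameter groups are separated, set grad_centralization to True to centralize gradients; however, gradient centralization can only be applied to parameters of convolution layers.
+        When parameter groups are separated, set grad_centralization to True to centralize gradients; however, centralizing gradients can only be applied to parameters of convolution layers.
        If it is set to True for parameters of a non-convolution layer, an error is reported.

-        To improve the performance of parameter groups, a custom order of parameters is supported.
+        To improve the performance of parameter groups, customizing the order of parameters is supported.

    **Parameters:**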
        
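@@ -42,13 +42,13 @@ mindspore.nn.thor
    - **damping** (Tensor) - The damping value.
    - **momentum** (float) - A float hyperparameter giving the momentum of the moving average. Must be at least 0.0.
    - **weight_decay** (int, float) - Weight decay (L2 penalty). Must be greater than or equal to 0.0. Default: 0.0.
-    - **loss_scale** (float) - The loss scaling value. Must be greater than 0.0. In general, use the default value. Default: 1.0.
+    - **loss_scale** (float) - The loss scaling factor. Must be greater than 0.0. In general, use the default value. Default: 1.0.
    - **batch_size** (int) - The batch size. Default: 32.
    - **use_nesterov** (bool) - Enables Nesterov momentum. Default: False.
    - **decay_filter** (function) - A function that determines which layers weight decay is applied to; effective only when weight_decay > 0. Default: lambda x: x.name not in [].
    - **split_indices** (list) - Sets the allreduce fusion strategy by A/G layer indices (see the formula above for the meaning of A/G). Effective only in distributed computing. Taking ResNet50 as an example, A and G each have 54 layers; when split_indices is set to [26,53], A/G is split into two allreduce groups, one covering layers 0~26 and the other layers 27~53. Default: None.
    - **enable_clip_grad** (bool) - Whether to clip gradients. Default: False.
-    - **frequency** (int) - The update interval of A/G and $A^{-1}/G^{-1}$. When frequency equals N (N greater than 1), A/G and $A^{-1}/G^{-1}$ are updated every N steps, and the other steps use the stale A/G and $A^{-1}/G^{-1}$ to update the weights. Default: 100.
+    - **frequency** (int) - The update interval of A/G and $A^{-1}/G^{-1}$. A/G and $A^{-1}/G^{-1}$ are updated once every frequency steps. Must be greater than 1. Default: 100.

    **Inputs:**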

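Reviewer sketch for the `split_indices` and `frequency` descriptions above. It is plain Python with no MindSpore dependency and is not the optimizer's implementation. The helper names `allreduce_groups` and `is_refresh_step` are hypothetical, each split index is assumed to be the last layer of its group (which matches the ResNet50 example), and counting step 0 as a refresh step is also an assumption.

```python
def allreduce_groups(split_indices, num_layers):
    """Split layer indices 0..num_layers-1 into consecutive groups.

    Assumption: each entry of split_indices is the last layer index of a
    group, as in the ResNet50 example ([26, 53] for 54 layers).
    """
    groups, start = [], 0
    for end in split_indices:
        if not start <= end < num_layers:
            raise ValueError(f"invalid split index {end}")
        groups.append(range(start, end + 1))
        start = end + 1
    return groups


def is_refresh_step(step, frequency):
    """True on steps where A/G and their inverses would be recomputed.

    Assumption: the schedule starts with a refresh at step 0.
    """
    if frequency <= 1:
        raise ValueError("frequency must be greater than 1")
    return step % frequency == 0


groups = allreduce_groups([26, 53], num_layers=54)
group_bounds = [(g[0], g[-1]) for g in groups]
refresh_steps = [s for s in range(350) if is_refresh_step(s, 100)]
```

With the documented values, `group_bounds` describes the two groups 0~26 and 27~53, and with `frequency=100` the matrices are refreshed on every hundredth step while the steps in between reuse the previous ones.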
--- a/docs/api/api_python/train/mindspore.train.train_thor.ConvertModelUtils.rst
+++ b/docs/api/api_python/train/mindspore.train.train_thor.ConvertModelUtils.rst
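@@ -9,7 +9,7 @@

        **Parameters:**
        
-        - **model** (Object) - High-level API for training. `Model` groups layers into an object with training features.
+        - **model** (Object) - High-level API for training.
        - **network** (Cell) - The training network.
        - **loss_fn** (Cell) - The objective function. Default: None.
        - **optimizer** (Cell) - The optimizer used to update the weights. Default: None.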
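@@ -19,11 +19,11 @@
          - **O0** - No change.
          - **O2** - Casts the network to float16, keeps BN running in float32, and uses dynamic loss scale.
          - **O3** - Casts the network to float16 with the additional property `keep_batchnorm_fp32=False` .
-          - **auto** - Sets the level to the recommended level for each device. O2 is recommended on GPU and O3 on Ascend. The recommended level is based on expert experience and does not hold for every case. Users should specify the level for special networks.
+          - **auto** - Sets the level to the recommended level for each device. O2 is recommended on GPU and O3 on Ascend. The recommended level is based on expert experience and does not hold for every case. For special networks, users need to specify the corresponding mixed-precision training level.

-        - **loss_scale_manager** (Union[None, LossScaleManager]) - If None, the loss is not scaled. Otherwise, the loss is scaled by LossScaleManager, and the optimizer cannot be None. This is a key argument. For example, pass `loss_scale_manager=None` to set the value.
+        - **loss_scale_manager** (Union[None, LossScaleManager]) - If None, the loss is not scaled. Otherwise, a LossScaleManager must be set, and the optimizer's loss_scale argument must not be None. This is a key argument. For example, pass `loss_scale_manager=None` to set the value.
        - **keep_batchnorm_fp32** (bool) - Keeps BN running in `float32` . If True, overrides the previous level setting. Default: False.

        **Returns:**

-        model (Object) - High-level API for training. `Model` groups layers into an object with training features.
+        model (Object) - High-level API for training.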
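Reviewer sketch for the `auto` level described above. It only restates the documented recommendation (O2 on GPU, O3 on Ascend) as a lookup; it is not how MindSpore resolves the level internally. The helper `resolve_amp_level` and the argument names `amp_level` and `device_target` are assumptions made for illustration.

```python
# Recommended levels for "auto", taken from the parameter description.
RECOMMENDED_LEVEL = {"GPU": "O2", "Ascend": "O3"}
VALID_LEVELS = ("O0", "O2", "O3", "auto")


def resolve_amp_level(amp_level, device_target):
    """Hypothetical helper: turn "auto" into a concrete level for a device."""
    if amp_level not in VALID_LEVELS:
        raise ValueError(f"unsupported amp_level: {amp_level!r}")
    if amp_level != "auto":
        return amp_level  # an explicitly chosen level is returned unchanged
    if device_target not in RECOMMENDED_LEVEL:
        raise ValueError(f"no recommended level for {device_target!r}")
    return RECOMMENDED_LEVEL[device_target]
```

Because the recommendation is only a rule of thumb, a special network would pass an explicit level such as `"O0"`, which the sketch returns as-is.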
--- a/mindspore/python/mindspore/nn/optim/thor.py
+++ b/mindspore/python/mindspore/nn/optim/thor.py
@@ -31,7 +31,6 @@ from mindspore.nn.wrap import DistributedGradReducer
 from mindspore.train.train_thor.convert_utils import ConvertNetUtils
 from mindspore.parallel._auto_parallel_context import auto_parallel_context

-
 # Enumerates types of Layer
 Other = -1
 Conv = 1
@@ -60,6 +59,7 @@ def _tensor_run_opt_ext(opt, momentum, learning_rate, gradient, weight, moment):
    success = F.depend(success, opt(weight, moment, learning_rate, gradient, momentum))
    return success

+
 IS_ENABLE_GLOBAL_NORM = False
 GRADIENT_CLIP_TYPE = 1
 GRADIENT_CLIP_VALUE = 1.0
@@ -100,6 +100,7 @@ def clip_gradient(enable_clip_grad, gradients):
            gradients = hyper_map_op(F.partial(clip_grad, GRADIENT_CLIP_TYPE, GRADIENT_CLIP_VALUE), gradients)
    return gradients

+
 C0 = 16


@@ -272,6 +273,15 @@ def thor(net, learning_rate, damping, momentum, weight_decay=0.0, loss_scale=1.0
    :math:`\lambda` represents :math:`damping`, :math:`g_i` represents gradients of the i-th layer,
    :math:`\otimes` represents Kronecker product, :math:`\gamma` represents 'learning rate'

+     Note:
+        When parameter groups are separated, the 'weight_decay' of each group is applied to the corresponding
+        parameters. When parameter groups are not separated, the 'weight_decay' in the optimizer is applied to
+        parameters that do not have 'beta' or 'gamma' in their names.
+        When separating parameter groups, set grad_centralization to True if you want to centralize gradients,
+        but gradient centralization can only be applied to parameters of convolution layers.
+        If it is set to True for parameters of a non-convolution layer, an error will be reported.
+        To improve the performance of parameter groups, you can customize the order of parameters.
+
    Args:
        net (Cell): The training network.

@@ -361,6 +371,7 @@ class ThorGpu(Optimizer):
    """
    ThorGpu
    """
+
    def __init__(self, net, learning_rate, damping, momentum, weight_decay=0.0, loss_scale=1.0, batch_size=32,
                 use_nesterov=False, decay_filter=lambda x: x.name not in [], split_indices=None,
                 enable_clip_grad=False, frequency=100):
@@ -432,7 +443,6 @@ class ThorGpu(Optimizer):
        self.square = P.Square()
        self.expand = P.ExpandDims()

-
    def _define_gpu_reducer(self, split_indices):
        """define gpu reducer"""
        self.parallel_mode = context.get_auto_parallel_context("parallel_mode")
@@ -449,7 +459,6 @@ class ThorGpu(Optimizer):
            self.grad_reducer_a = DistributedGradReducer(self.matrix_a_cov, mean, degree, fusion_type=6)
            self.grad_reducer_g = DistributedGradReducer(self.matrix_a_cov, mean, degree, fusion_type=8)

-
    def _process_matrix_init_and_weight_idx_map(self, net):
        """for GPU, process matrix init shape, and get weight idx map"""
        layer_type_map = get_net_layertype_mask(net)
@@ -704,7 +713,6 @@ class ThorAscend(Optimizer):
        self.frequency = frequency
        self._define_ascend_reducer(split_indices)

-
    def get_frequency(self):
        """get thor frequency"""
        return self.frequency
@@ -891,7 +899,6 @@ class ThorAscend(Optimizer):
            input_matrix = self.concat((input_matrix, matrix_sup))
        return input_matrix

-
    def _get_abs_max(self, matrix_inv, origin_dim):
        """get matrix abs max"""
        cholesky_shape = self.shape(matrix_inv)
@@ -905,7 +912,6 @@ class ThorAscend(Optimizer):
            matrix_max = P.ReduceMax(keep_dims=False)(matrix_abs)
        return matrix_max, matrix_inv

-
    def _get_fc_ainv_ginv(self, index, damping_step, gradients, matrix_a_allreduce, matrix_g_allreduce,
                          matrix_a_max_allreduce, matrix_g_max_allreduce):
        """get fc layer ainv and ginv"""
@@ -984,7 +990,6 @@ class ThorAscend(Optimizer):
                                  (0, self.C0 - in_channels)))(matrix_a_inv)
        return matrix_a_inv

-
    def _get_ainv_ginv_amax_gmax_list(self, gradients, damping_step, matrix_a_allreduce, matrix_g_allreduce,
                                      matrix_a_max_allreduce, matrix_g_max_allreduce):
        """get matrixA inverse list, matrixG inverse list, matrixA_max list, matrixG_max list"""
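Reviewer sketch for the Note added to the `thor` docstring in this file. It shows the name-based rule for the case where parameter groups are not separated: weight decay is skipped for parameters with 'beta' or 'gamma' in their name. It is plain Python and does not reproduce the optimizer's code; `Param` and `decay_flags` are hypothetical stand-ins, and the parameter names are made up.

```python
from collections import namedtuple

# Stand-in for a network parameter; only the name matters for this rule.
Param = namedtuple("Param", ["name"])


def decay_flags(params, weight_decay):
    """Return one bool per parameter: should weight decay be applied?"""
    if weight_decay < 0.0:
        raise ValueError("weight_decay must be >= 0.0")
    return [
        weight_decay > 0.0 and "beta" not in p.name and "gamma" not in p.name
        for p in params
    ]


params = [Param("conv1.weight"), Param("bn1.gamma"), Param("bn1.beta"), Param("fc.weight")]
flags = decay_flags(params, weight_decay=1e-4)
```

Here the two BatchNorm parameters are excluded and the convolution and fully connected weights are decayed; with `weight_decay=0.0` no parameter is decayed.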
--- a/mindspore/python/mindspore/train/train_thor/convert_utils.py
+++ b/mindspore/python/mindspore/train/train_thor/convert_utils.py
@@ -185,7 +185,6 @@ class ConvertModelUtils:

        Args:
            model (Object): High-Level API for Training.
-                            `Model` groups layers into an object with training features.
            network (Cell): A training network.
            loss_fn (Cell): Objective function. Default: None.
            optimizer (Cell): Optimizer used to update the weights. Default: None.
@@ -208,7 +207,6 @@ class ConvertModelUtils:

        Returns:
             model (Object): High-Level API for Training.
-                            `Model` groups layers into an object with training features.

        Supported Platforms:
            ``Ascend`` ``GPU``