forked from mindspore-Ecosystem/mindspore
!27141 add chinese api comments
Merge pull request !27141 from wangnan39/code_docs_add_chinese_docs
This commit is contained in:

commit 204f6528ae

@ -38,7 +38,7 @@ mindspore.DynamicLossScaleManager

.. py:method:: get_update_cell()

返回用于在 :class:`mindspore.TrainOneStepWithLossScaleCell` 中更新梯度放大系数的 `Cell` 实例。
返回用于更新梯度放大系数的 `Cell` 实例,:class:`mindspore.TrainOneStepWithLossScaleCell` 会调用该实例。

**返回:**
@ -44,7 +44,7 @@ mindspore.FixedLossScaleManager
.. py:method:: get_update_cell()

返回用于更新 `loss_scale` 值的 `Cell` 实例,该实例将在 :class:`mindspore.TrainOneStepWithLossScaleCell` 中执行。
返回用于更新 `loss_scale` 值的 `Cell` 实例,:class:`mindspore.TrainOneStepWithLossScaleCell` 会调用该实例。该类使用固定的梯度放大系数,因此该实例不执行任何操作。

**返回:**
|
|
@ -5,7 +5,8 @@ mindspore.LossScaleManager
混合精度梯度放大系数(loss scale)管理器的抽象类。

派生类需要该类的所有方法。 `get_loss_scale` 用于获取当前的梯度放大系数。`update_loss_scale` 用于更新梯度放大系数,该方法将在训练过程中被调用。`get_update_cell` 用于获取更新梯度放大系数的 `Cell` 实例,该实例在将训练过程中被调用。下沉模式下仅 `get_update_cell` 方式生效,非下沉模式下两种更新梯度放大系数的方式均生效。
派生类需要实现该类的所有方法。 `get_loss_scale` 用于获取当前的梯度放大系数。 `update_loss_scale` 用于更新梯度放大系数,该方法将在训练过程中被调用。 `get_update_cell` 用于获取更新梯度放大系数的 `Cell` 实例,该实例将在训练过程中被调用。当前多使用 `get_update_cell` 方式。
例如::class:`mindspore.FixedLossScaleManager` 和 :class:`mindspore.DynamicLossScaleManager` 。
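
下面给出一个按上述三个方法约定编写的最小自定义管理器草图,仅用于说明接口形态;其中 `from mindspore import LossScaleManager` 的导入方式以及 `get_update_cell` 返回 None 的做法为本示例的假设:

.. code-block:: python

    from mindspore import LossScaleManager  # 假设:LossScaleManager 可从 mindspore 顶层导入

    class ConstantLossScaleManager(LossScaleManager):
        """始终返回固定梯度放大系数的示意实现(草图)。"""
        def __init__(self, scale=128.0):
            super().__init__()
            self._scale = scale

        def get_loss_scale(self):
            # 返回当前的梯度放大系数
            return self._scale

        def update_loss_scale(self, overflow):
            # 固定系数,不随溢出状态更新(非下沉模式下会被调用)
            pass

        def get_update_cell(self):
            # 本示意不提供用于下沉模式的更新Cell,返回None为假设行为
            return None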
.. py:method:: get_loss_scale()

@ -14,7 +15,7 @@ mindspore.LossScaleManager

.. py:method:: get_update_cell()

获取用于更新梯度放大系数的 :class:`mindspore.nn.Cell` 实例。
获取用于更新梯度放大系数的Cell实例。

.. py:method:: update_loss_scale(overflow)
|
@ -1,87 +0,0 @@
|
|||
mindspore.nn.Adagrad
|
||||
=====================
|
||||
|
||||
.. py:class:: mindspore.nn.Adagrad(*args, **kwargs)
|
||||
|
||||
使用ApplyAdagrad算子实现Adagrad算法。
|
||||
|
||||
Adagrad用于在线学习和随机优化。
|
||||
请参阅论文 `Efficient Learning using Forward-Backward Splitting <https://proceedings.neurips.cc/paper/2009/file/621bf66ddb7c962aa0d22ac97d69b793-Paper.pdf>`_。
|
||||
公式如下:
|
||||
|
||||
.. math::
|
||||
\begin{array}{ll} \\
|
||||
h_{t+1} = h_{t} + g\\
|
||||
w_{t+1} = w_{t} - lr*\frac{1}{\sqrt{h_{t+1}}}*g
|
||||
\end{array}
|
||||
|
||||
:math:`h` 表示梯度平方的累积和, :math:`g` 表示 `grads` 。
|
||||
:math:`lr` 代表 `learning_rate`, :math:`w` 代表 `params` 。
|
||||
|
||||
.. note::
|
||||
在参数未分组时,优化器配置的 `weight_decay` 应用于名称含有"beta"或"gamma"的网络参数,通过网络参数分组可调整权重衰减策略。分组时,每组网络参数均可配置 `weight_decay` ,若未配置,则该组网络参数使用优化器中配置的 `weight_decay` 。
|
||||
|
||||
**参数:**
|
||||
|
||||
- **params** (Union[list[Parameter], list[dict]]) - 必须是 `Parameter` 组成的列表或字典组成的列表。当列表元素是字典时,字典的键可以是"params"、"lr"、"weight_decay"、"grad_centralization"和"order_params":
|
||||
|
||||
- **params** - 必填。当前组别的权重,该值必须是 `Parameter` 列表。
|
||||
- **lr** - 可选。如果键中存在"lr",则使用对应的值作为学习率。如果没有,则使用优化器中配置的 `learning_rate` 作为学习率。
|
||||
- **weight_decay** - 可选。如果键中存在"weight_decay",则使用对应的值作为权重衰减值。如果没有,则使用优化器中配置的 `weight_decay` 作为权重衰减值。
|
||||
- **grad_centralization** - 可选。如果键中存在"grad_centralization",则使用对应的值,该值必须为布尔类型。如果没有,则认为 `grad_centralization` 为False。该参数仅适用于卷积层。
|
||||
- **order_params** - 可选。对应值是预期的参数更新顺序。当使用参数分组功能时,通常使用该配置项保持 `parameters` 的顺序以提升性能。如果键中存在"order_params",则会忽略该组配置中的其他键。"order_params"中的参数必须在某一组 `params` 参数中。
|
||||
|
||||
- **accum** (float) - 累加器 :math:`h` 的初始值,必须大于等于零。默认值:0.1。
|
||||
- **learning_rate** (Union[float, Tensor, Iterable, LearningRateSchedule]) - 默认值:0.001。
|
||||
|
||||
- **float** - 固定的学习率。必须大于等于零。
|
||||
- **int** - 固定的学习率。必须大于等于零。整数类型会被转换为浮点数。
|
||||
- **Tensor** - 可以是标量或一维向量。标量是固定的学习率。一维向量是动态的学习率,第i步将取向量中第i个值作为学习率。
|
||||
- **Iterable** - 动态的学习率。第i步将取迭代器第i个值作为学习率。
|
||||
- **LearningRateSchedule** - 动态的学习率。在训练过程中,优化器将使用步数(step)作为输入,调用 `LearningRateSchedule` 实例来计算当前学习率。
|
||||
|
||||
- **update_slots** (bool) - 如果为True,则更新累加器 :math:`h` 。默认值:True。
|
||||
- **loss_scale** (float) - 梯度缩放系数,必须大于0。如果 `loss_scale` 是整数,它将被转换为浮点数。通常使用默认值,仅当训练时使用了 `FixedLossScaleManager` ,且 `FixedLossScaleManager` 的 `drop_overflow_update` 属性配置为False时,此值需要与 `FixedLossScaleManager` 中的 `loss_scale` 相同。有关更多详细信息,请参阅 :class:`mindspore.FixedLossScaleManager` 。默认值:1.0。
|
||||
- **weight_decay** (Union[float, int]) - 要乘以权重的权重衰减值,必须大于等于0.0。默认值:0.0。
|
||||
|
||||
**输入:**
|
||||
|
||||
**grads** (tuple[Tensor]) - 优化器中 `params` 的梯度,形状(shape)与 `params` 相同。
|
||||
|
||||
**输出:**
|
||||
|
||||
Tensor[bool],值为True。
|
||||
|
||||
**异常:**
|
||||
|
||||
- **TypeError** - `learning_rate` 不是int、float、Tensor、Iterable或 `LearningRateSchedule` 。
|
||||
- **TypeError** - `parameters` 的元素是 `Parameter` 或字典。
|
||||
- **TypeError** - `accum` 或 `loss_scale` 不是float。
|
||||
- **TypeError** - `update_slots` 不是bool。
|
||||
- **TypeError** - `weight_decay` 不是float或int。
|
||||
- **ValueError** - `loss_scale` 小于或等于0。
|
||||
- **ValueError** - `accum` 或 `weight_decay` 小于0。
|
||||
|
||||
**支持平台:**
|
||||
|
||||
``Ascend`` ``CPU`` ``GPU``
|
||||
|
||||
**样例:**
|
||||
|
||||
>>> net = Net()
|
||||
>>> #1) 所有参数使用相同的学习率和权重衰减
|
||||
>>> optim = nn.Adagrad(params=net.trainable_params())
|
||||
>>>
|
||||
>>> #2) 使用参数组并设置不同的值
|
||||
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01, 'grad_centralization':True},
|
||||
... {'params': no_conv_params, 'lr': 0.01},
|
||||
... {'order_params': net.trainable_params()}]
|
||||
>>> optim = nn.Adagrad(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||
>>> # conv_params参数组将使用优化器中的学习率0.1、该组的权重衰减0.01、该组的梯度中心化配置True。
|
||||
>>> # no_conv_params参数组将使用该组的学习率0.01、优化器中的权重衰减0.0、梯度中心化使用默认值False。
|
||||
>>> # 优化器按照"order_params"配置的参数顺序更新参数。
|
||||
>>>
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> model = Model(net, loss_fn=loss, optimizer=optim)
|
|
@ -1,101 +0,0 @@
|
|||
mindspore.nn.Adam
|
||||
==================
|
||||
|
||||
.. py:class:: mindspore.nn.Adam(*args, **kwargs)
|
||||
|
||||
通过Adaptive Moment Estimation (Adam)算法更新梯度。
|
||||
|
||||
请参阅论文 `Adam: A Method for Stochastic Optimization <https://arxiv.org/abs/1412.6980>`_。
|
||||
|
||||
公式如下:
|
||||
|
||||
.. math::
|
||||
\begin{array}{ll} \\
|
||||
m_{t+1} = \beta_1 * m_{t} + (1 - \beta_1) * g \\
|
||||
v_{t+1} = \beta_2 * v_{t} + (1 - \beta_2) * g * g \\
|
||||
l = \alpha * \frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t} \\
|
||||
w_{t+1} = w_{t} - l * \frac{m_{t+1}}{\sqrt{v_{t+1}} + \epsilon}
|
||||
\end{array}
|
||||
|
||||
:math:`m` 代表第一个动量矩阵 `moment1` ,:math:`v` 代表第二个动量矩阵 `moment2` ,:math:`g` 代表 `gradients` ,:math:`l` 代表缩放因子,:math:`\beta_1,\beta_2` 代表 `beta1` 和 `beta2` ,:math:`t` 代表更新步骤,:math:`beta_1^t` 和 :math:`beta_2^t` 代表 `beta1_power` 和 `beta2_power` ,:math:`\alpha` 代表 `learning_rate` , :math:`w` 代表 `params` , :math:`\epsilon` 代表 `eps` 。
|
||||
|
||||
.. note::
|
||||
如果前向网络使用了SparseGatherV2等算子,优化器会执行稀疏运算,通过设置 `target` 为CPU,可在主机(host)上进行稀疏运算。
|
||||
稀疏特性在持续开发中。
|
||||
|
||||
在参数未分组时,优化器配置的 `weight_decay` 应用于名称含有"beta"或"gamma"的网络参数,通过网络参数分组可调整权重衰减策略。分组时,每组网络参数均可配置 `weight_decay` ,若未配置,则该组网络参数使用优化器中配置的 `weight_decay` 。
|
||||
|
||||
**参数:**
|
||||
|
||||
- **params** (Union[list[Parameter], list[dict]]) - 必须是 `Parameter` 组成的列表或字典组成的列表。当列表元素是字典时,字典的键可以是"params"、"lr"、"weight_decay"、"grad_centralization"和"order_params":
|
||||
|
||||
- **params** - 必填。当前组别的权重,该值必须是 `Parameter` 列表。
|
||||
- **lr** - 可选。如果键中存在"lr",则使用对应的值作为学习率。如果没有,则使用优化器中配置的 `learning_rate` 作为学习率。
|
||||
- **weight_decay** - 可选。如果键中存在"weight_decay”,则使用对应的值作为权重衰减值。如果没有,则使用优化器中配置的 `weight_decay` 作为权重衰减值。
|
||||
- **grad_centralization** - 可选。如果键中存在"grad_centralization",则使用对应的值,该值必须为布尔类型。如果没有,则认为 `grad_centralization` 为False。该参数仅适用于卷积层。
|
||||
- **order_params** - 可选。对应值是预期的参数更新顺序。当使用参数分组功能时,通常使用该配置项保持 `parameters` 的顺序以提升性能。如果键中存在"order_params",则会忽略该组配置中的其他键。"order_params"中的参数必须在某一组 `params` 参数中。
|
||||
|
||||
- **learning_rate** (Union[float, Tensor, Iterable, LearningRateSchedule]): 默认值:1e-3。
|
||||
|
||||
- **float** - 固定的学习率。必须大于等于零。
|
||||
- **int** - 固定的学习率。必须大于等于零。整数类型会被转换为浮点数。
|
||||
- **Tensor** - 可以是标量或一维向量。标量是固定的学习率。一维向量是动态的学习率,第i步将取向量中第i个值作为学习率。
|
||||
- **Iterable** - 动态的学习率。第i步将取迭代器第i个值作为学习率。
|
||||
- **LearningRateSchedule** - 动态的学习率。在训练过程中,优化器将使用步数(step)作为输入,调用 `LearningRateSchedule` 实例来计算当前学习率。
|
||||
|
||||
- **beta1** (float) - `moment1` 的指数衰减率。参数范围(0.0,1.0)。默认值:0.9。
|
||||
- **beta2** (float) - `moment2` 的指数衰减率。参数范围(0.0,1.0)。默认值:0.999。
|
||||
- **eps** (float) - 将添加到分母中,以提高数值稳定性。必须大于0。默认值:1e-8。
|
||||
- **use_locking** (bool) - 是否对参数更新加锁保护。如果为True,则 `w` 、`m` 和 `v` 的tensor更新将受到锁的保护。如果为False,则结果不可预测。默认值:False。
|
||||
- **use_nesterov** (bool) - 是否使用Nesterov Accelerated Gradient (NAG)算法更新梯度。如果为True,使用NAG更新梯度。如果为False,则在不使用NAG的情况下更新梯度。默认值:False。
|
||||
- **weight_decay** (float) - 权重衰减(L2 penalty)。必须大于等于0。默认值:0.0。
|
||||
- **loss_scale** (float) - 梯度缩放系数,必须大于0。如果 `loss_scale` 是整数,它将被转换为浮点数。通常使用默认值,仅当训练时使用了 `FixedLossScaleManager` ,且 `FixedLossScaleManager` 的 `drop_overflow_update` 属性配置为False时,此值需要与 `FixedLossScaleManager` 中的 `loss_scale` 相同。有关更多详细信息,请参阅 :class:`mindspore.FixedLossScaleManager` 。默认值:1.0。
|
||||
|
||||
**输入:**
|
||||
|
||||
**gradients** (tuple[Tensor]) - `params` 的梯度,形状(shape)与 `params` 相同。
|
||||
|
||||
**输出:**
|
||||
|
||||
Tensor[bool],值为True。
|
||||
|
||||
**异常:**
|
||||
|
||||
- **TypeError** - `learning_rate` 不是int、float、Tensor、Iterable或LearningRateSchedule。
|
||||
- **TypeError** - `parameters` 的元素不是Parameter或字典。
|
||||
- **TypeError** - `beta1` 、`beta2` 、 `eps` 或 `loss_scale` 不是float。
|
||||
- **TypeError** - `weight_decay` 不是float或int。
|
||||
- **TypeError** - `use_locking` 或 `use_nesterov` 不是bool。
|
||||
- **ValueError** - `loss_scale` 或 `eps` 小于或等于0。
|
||||
- **ValueError** - `beta1` 、`beta2` 不在(0.0,1.0)范围内。
|
||||
- **ValueError** - `weight_decay` 小于0。
|
||||
|
||||
**支持平台:**
|
||||
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
**样例:**
|
||||
|
||||
>>> net = Net()
|
||||
>>> #1) 所有参数使用相同的学习率和权重衰减
|
||||
>>> optim = nn.Adam(params=net.trainable_params())
|
||||
>>>
|
||||
>>> #2) 使用参数组并设置不同的值
|
||||
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01, 'grad_centralization':True},
|
||||
... {'params': no_conv_params, 'lr': 0.01},
|
||||
... {'order_params': net.trainable_params()}]
|
||||
>>> optim = nn.Adam(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||
>>> # conv_params参数组将使用优化器中的学习率0.1、该组的权重衰减0.01、该组的梯度中心化配置True。
|
||||
>>> # no_conv_params参数组将使用该组的学习率0.01、优化器中的权重衰减0.0、梯度中心化使用默认值False。
|
||||
>>> # 优化器按照"order_params"配置的参数顺序更新参数。
|
||||
>>>
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> model = Model(net, loss_fn=loss, optimizer=optim)
|
||||
|
||||
|
||||
.. py:method:: target
|
||||
:property:
|
||||
|
||||
该属性用于指定在主机(host)上还是设备(device)上更新参数。输入类型为str,只能是'CPU','Ascend'或'GPU'。
|
|
@ -0,0 +1,93 @@
|
|||
Class mindspore.nn.AdamOffload(params, learning_rate=1e-3, beta1=0.9, beta2=0.999, eps=1e-08, use_locking=False, use_nesterov=False, weight_decay=0.0, loss_scale=1.0)
|
||||
|
||||
此优化器在主机CPU上运行Adam优化算法,设备上仅执行网络参数的更新,最大限度地降低内存成本。
|
||||
虽然会增加性能开销,但优化器可以运行更大的模型。
|
||||
|
||||
|
||||
Adam算法参见`Adam: A Method for Stochastic Optimization <https://arxiv.org/abs/1412.6980>`_。
|
||||
|
||||
更新公式如下:
|
||||
|
||||
.. math::
|
||||
\begin{array}{ll} \\
|
||||
m_{t+1} = \beta_1 * m_{t} + (1 - \beta_1) * g \\
|
||||
v_{t+1} = \beta_2 * v_{t} + (1 - \beta_2) * g * g \\
|
||||
l = \alpha * \frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t} \\
|
||||
w_{t+1} = w_{t} - l * \frac{m_{t+1}}{\sqrt{v_{t+1}} + \epsilon}
|
||||
\end{array}
|
||||
|
||||
:math:`m`代表第一个矩向量`moment1`,:math:`v`代表第二个矩向量`moment2`,:math:`g`代表`gradients`,:math:`l`代表缩放因子,:math:`\beta_1,\beta_2`代表`beta1`和`beta2`,:math:`t`代表当前step,:math:`beta_1^t`和:math:`beta_2^t`代表`beta1_power`和`beta2_power`,:math:`\alpha`代表`learning_rate`,:math:`w`代表`params`,:math:`\epsilon`代表`eps`。
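
为便于对照上式,下面给出该更新公式的NumPy数值示意(与框架实现无关的草图,变量名沿用上文符号):

.. code-block:: python

    import numpy as np

    # Adam更新公式的NumPy示意:m、v为矩向量,g为梯度,t为当前step(从1开始)
    def adam_step(w, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        l = lr * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)   # 缩放因子
        w = w - l * m / (np.sqrt(v) + eps)
        return w, m, v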
|
||||
|
||||
.. note::
|
||||
此优化器目前仅支持图模式。
|
||||
|
||||
.. include:: mindspore.nn.optim_note_weight_decay.rst
|
||||
|
||||
参数:
|
||||
- **params** (Union[list[Parameter], list[dict]]) - 必须是 `Parameter` 组成的列表或字典组成的列表。当列表元素是字典时,字典的键可以是"params"、"lr"、"weight_decay"、和"order_params":
|
||||
|
||||
.. include:: mindspore.nn.optim_group_param.rst
|
||||
|
||||
.. include:: mindspore.nn.optim_group_lr.rst
|
||||
|
||||
.. include:: mindspore.nn.optim_group_weight_decay.rst
|
||||
|
||||
.. include:: mindspore.nn.optim_group_order.rst
|
||||
|
||||
|
||||
- **learning_rate** (Union[float, Tensor, Iterable, LearningRateSchedule]) - 默认值:1e-3。
|
||||
.. include:: mindspore.nn.optim_arg_dynamic_lr.rst
|
||||
|
||||
- **beta1** (float) - `moment1` 的指数衰减率。参数范围(0.0,1.0)。默认值:0.9。
|
||||
|
||||
- **beta2** (float) - `moment2` 的指数衰减率。参数范围(0.0,1.0)。默认值:0.999。
|
||||
|
||||
- **eps** (float) - 将添加到分母中,以提高数值稳定性。必须大于0。默认值:1e-8。
|
||||
|
||||
- **use_locking** (bool) - 是否对参数更新加锁保护。如果为True,则 `w` 、`m` 和 `v` 的更新将受到锁保护。如果为False,则结果不可预测。默认值:False。
|
||||
|
||||
- **use_nesterov** (bool) - 是否使用Nesterov Accelerated Gradient (NAG)算法更新梯度。如果为True,使用NAG更新梯度。如果为False,则在不使用NAG的情况下更新梯度。默认值:False。
|
||||
|
||||
- **weight_decay** (float) - 权重衰减(L2 penalty)。必须大于等于0。默认值:0.0。
|
||||
|
||||
.. include:: mindspore.nn.optim_arg_loss_scale.rst
|
||||
|
||||
|
||||
输入:
|
||||
- **gradients** (tuple[Tensor]):`params`的梯度,shape与`params`相同。
|
||||
|
||||
输出:
|
||||
Tensor[bool],值为True。
|
||||
|
||||
异常:
|
||||
TypeError:`learning_rate`不是int、float、Tensor、Iterable或LearningRateSchedule。
|
||||
TypeError:`parameters`的元素不是Parameter或字典。
|
||||
TypeError:`beta1`、`beta2`、`eps`或`loss_scale`不是float。
|
||||
TypeError:`weight_decay`不是float或int。
|
||||
TypeError:`use_locking`或`use_nesterov`不是bool。
|
||||
ValueError:`loss_scale`或`eps`不大于0。
|
||||
ValueError:`beta1`、`beta2`不在(0.0,1.0)范围内。
|
||||
ValueError:`weight_decay`小于0。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
示例:
|
||||
>>> net = Net()
|
||||
>>> #1) 所有参数使用相同的学习率和权重衰减
|
||||
>>> optim = nn.AdamOffload(params=net.trainable_params())
|
||||
>>>
|
||||
>>> #2) 使用参数分组并设置不同的值
|
||||
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01},
|
||||
... {'params': no_conv_params, 'lr': 0.01},
|
||||
... {'order_params': net.trainable_params()}]
|
||||
>>> optim = nn.AdamOffload(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||
>>> # conv_params参数组将使用优化器中的学习率0.1、该组的权重衰减0.01。
|
||||
>>> # no_conv_params参数组将使用该组的学习率0.01、优化器中的权重衰减0.0。
|
||||
>>> # 优化器按照"order_params"配置的参数顺序更新参数。
|
||||
>>>
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> model = Model(net, loss_fn=loss, optimizer=optim)
|
||||
|
|
@ -0,0 +1,87 @@
|
|||
Class mindspore.nn.AdamWeightDecay(params, learning_rate=1e-3, beta1=0.9, beta2=0.999, eps=1e-06, weight_decay=0.0)
|
||||
|
||||
实现权重衰减Adam算法。
|
||||
|
||||
.. math::
|
||||
\begin{array}{ll} \\
|
||||
m_{t+1} = \beta_1 * m_{t} + (1 - \beta_1) * g \\
|
||||
v_{t+1} = \beta_2 * v_{t} + (1 - \beta_2) * g * g \\
|
||||
update = \frac{m_{t+1}}{\sqrt{v_{t+1}} + eps} \\
|
||||
update =
|
||||
\begin{cases}
|
||||
update + weight\_decay * w_{t}
|
||||
& \text{ if } weight\_decay > 0 \\
|
||||
update
|
||||
& \text{ otherwise }
|
||||
\end{cases} \\
|
||||
w_{t+1} = w_{t} - lr * update
|
||||
\end{array}
|
||||
|
||||
:math:`m`表示第1矩向量`moment1`,:math:`v`表示第2矩向量`moment2`,:math:`g`表示`gradients`,:math:`lr`表示`learning_rate`,:math:`\beta_1, \beta_2`表示`beta1`和`beta2`,:math:`t`表示当前step,:math:`w`表示`params`。
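
下面是上式的NumPy数值示意(草图,不含框架细节;与Adam不同,权重衰减直接加在update上,且不做偏差修正):

.. code-block:: python

    import numpy as np

    def adamw_step(w, m, v, g, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-6, weight_decay=0.0):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        update = m / (np.sqrt(v) + eps)
        if weight_decay > 0:
            update = update + weight_decay * w   # weight_decay > 0 时附加权重衰减项
        w = w - lr * update
        return w, m, v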
|
||||
|
||||
|
||||
.. note::
|
||||
.. include:: mindspore.nn.optim_note_loss_scale.rst
|
||||
.. include:: mindspore.nn.optim_note_weight_decay.rst
|
||||
|
||||
参数:
|
||||
params (Union[list[Parameter], list[dict]]) - 必须是 `Parameter` 组成的列表或字典组成的列表。当列表元素是字典时,字典的键可以是"params"、"lr"、"weight_decay"、和"order_params":
|
||||
|
||||
.. include:: mindspore.nn.optim_group_param.rst
|
||||
|
||||
.. include:: mindspore.nn.optim_group_lr.rst
|
||||
|
||||
.. include:: mindspore.nn.optim_group_weight_decay.rst
|
||||
|
||||
.. include:: mindspore.nn.optim_group_order.rst
|
||||
|
||||
|
||||
learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): 默认值:1e-3。
|
||||
|
||||
.. include:: mindspore.nn.optim_arg_dynamic_lr.rst
|
||||
|
||||
beta1 (float):`moment1` 的指数衰减率。参数范围(0.0,1.0)。默认值:0.9。
|
||||
|
||||
beta2 (float):`moment2` 的指数衰减率。参数范围(0.0,1.0)。默认值:0.999。
|
||||
|
||||
eps (float):将添加到分母中,以提高数值稳定性。必须大于0。默认值:1e-6。
|
||||
|
||||
weight_decay (float):权重衰减(L2 penalty)。必须大于等于0。默认值:0.0。
|
||||
|
||||
输入:
|
||||
- **gradients** (tuple[Tensor]):`params`的梯度,shape与`params`相同。
|
||||
|
||||
输出:
|
||||
tuple[bool],所有元素都为True。
|
||||
|
||||
异常:
|
||||
TypeError:`learning_rate`不是int、float、Tensor、Iterable或LearningRateSchedule。
|
||||
TypeError:`parameters`的元素不是Parameter或字典。
|
||||
TypeError:`beta1`、`beta2`或`eps`不是float。
|
||||
TypeError:`weight_decay`不是float或int。
|
||||
ValueError:`eps`小于等于0。
|
||||
ValueError:`beta1`、`beta2`不在(0.0,1.0)范围内。
|
||||
ValueError:`weight_decay`小于0。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
示例:
|
||||
>>> net = Net()
|
||||
>>> #1) 所有参数使用相同的学习率和权重衰减
|
||||
>>> optim = nn.AdamWeightDecay(params=net.trainable_params())
|
||||
>>>
|
||||
>>> #2) 使用参数分组并设置不同的值
|
||||
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01},
|
||||
... {'params': no_conv_params, 'lr': 0.01},
|
||||
... {'order_params': net.trainable_params()}]
|
||||
>>> optim = nn.AdamWeightDecay(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||
>>> # conv_params参数组将使用优化器中的学习率0.1、该组的权重衰减0.01。
|
||||
>>> # no_conv_params参数组将使用该组的学习率0.01、优化器中的权重衰减0.0。
|
||||
>>> # 优化器按照"order_params"配置的参数顺序更新参数。
|
||||
>>>
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> model = Model(net, loss_fn=loss, optimizer=optim)
|
||||
|
|
@ -0,0 +1,58 @@
|
|||
Class mindspore.nn.DynamicLossScaleUpdateCell(loss_scale_value, scale_factor, scale_window)
|
||||
|
||||
用于动态地更新梯度放大系数(loss scale)的神经元。
|
||||
|
||||
使用梯度放大功能进行训练时,初始梯度放大系数值为`loss_scale_value`。
|
||||
在每个训练步骤中,当出现溢出时,通过计算公式`loss_scale`/`scale_factor`减小梯度放大系数。
|
||||
如果连续`scale_window`步(step)未溢出,则将通过`loss_scale` * `scale_factor`增大梯度放大系数。
|
||||
|
||||
该类是:class:`mindspore.nn.DynamicLossScaleManager`的`get_update_cell`方法的返回值。
|
||||
训练过程中,类:class:`mindspore.TrainOneStepWithLossScaleCell`会调用该Cell来更新梯度放大系数。
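
上述增减逻辑可用如下纯Python草图表示(仅为理解用的示意,其中连续未溢出步数的计数变量与放大系数下限1.0为本示例的假设):

.. code-block:: python

    def update_loss_scale(loss_scale, overflow, scale_factor, scale_window, good_steps):
        # 溢出时按 loss_scale / scale_factor 缩小
        if overflow:
            loss_scale = max(loss_scale / scale_factor, 1.0)
            good_steps = 0
        else:
            # 连续 scale_window 步未溢出时按 loss_scale * scale_factor 增大
            good_steps += 1
            if good_steps >= scale_window:
                loss_scale = loss_scale * scale_factor
                good_steps = 0
        return loss_scale, good_steps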
|
||||
|
||||
参数:
|
||||
loss_scale_value (float):初始的梯度放大系数。
|
||||
scale_factor (int):增减系数。
|
||||
scale_window (int):未溢出时,增大梯度放大系数的最大连续训练步数。
|
||||
|
||||
输入:
|
||||
- **loss_scale** (Tensor):训练期间的梯度放大系数,shape为:math:`()`。
|
||||
- **overflow** (bool):是否发生溢出。
|
||||
|
||||
输出:
|
||||
Bool,即输入`overflow`。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``GPU``
|
||||
|
||||
示例:
|
||||
>>> import numpy as np
|
||||
>>> import mindspore
>>> from mindspore import Tensor, Parameter, nn
|
||||
>>> import mindspore.ops as ops
|
||||
>>>
|
||||
>>> class Net(nn.Cell):
|
||||
... def __init__(self, in_features, out_features):
|
||||
... super(Net, self).__init__()
|
||||
... self.weight = Parameter(Tensor(np.ones([in_features, out_features]).astype(np.float32)),
|
||||
... name='weight')
|
||||
... self.matmul = ops.MatMul()
|
||||
...
|
||||
... def construct(self, x):
|
||||
... output = self.matmul(x, self.weight)
|
||||
... return output
|
||||
...
|
||||
>>> in_features, out_features = 16, 10
|
||||
>>> net = Net(in_features, out_features)
|
||||
>>> loss = nn.MSELoss()
|
||||
>>> optimizer = nn.Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9)
|
||||
>>> net_with_loss = nn.WithLossCell(net, loss)
|
||||
>>> manager = nn.DynamicLossScaleUpdateCell(loss_scale_value=2**12, scale_factor=2, scale_window=1000)
|
||||
>>> train_network = nn.TrainOneStepWithLossScaleCell(net_with_loss, optimizer, scale_sense=manager)
|
||||
>>> input = Tensor(np.ones([out_features, in_features]), mindspore.float32)
|
||||
>>> labels = Tensor(np.ones([out_features,]), mindspore.float32)
|
||||
>>> output = train_network(input, labels)
|
||||
|
||||
|
||||
get_loss_scale()
|
||||
|
||||
获取当前梯度放大系数。
|
||||
|
|
@ -0,0 +1,103 @@
|
|||
Class mindspore.nn.FTRL(*args, **kwargs)
|
||||
|
||||
使用ApplyFtrl算子实现FTRL算法。
|
||||
|
||||
FTRL是一种在线凸优化算法,根据损失函数自适应地选择正则化函数。
|
||||
详见论文`Adaptive Bound Optimization for Online Convex Optimization <https://arxiv.org/abs/1002.4908>`_。
|
||||
工程文档参阅`Ad Click Prediction: a View from the Trenches <https://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf>`_。
|
||||
|
||||
|
||||
更新公式如下:
|
||||
|
||||
.. math::
|
||||
|
||||
\begin{array}{ll} \\
|
||||
m_{t+1} = m_{t} + g^2 \\
|
||||
u_{t+1} = u_{t} + g - \frac{m_{t+1}^\text{-p} - m_{t}^\text{-p}}{\alpha } * \omega_{t} \\
|
||||
\omega_{t+1} =
|
||||
\begin{cases}
|
||||
\frac{(sign(u_{t+1}) * l1 - u_{t+1})}{\frac{m_{t+1}^\text{-p}}{\alpha } + 2 * l2 }
|
||||
& \text{ if } |u_{t+1}| > l1 \\
|
||||
0.0
|
||||
& \text{ otherwise }
|
||||
\end{cases}\\
|
||||
\end{array}
|
||||
|
||||
:math:`m`表示累加器,:math:`g`表示`grads`,:math:`t`表示当前step,:math:`u`表示需要更新的线性系数,:math:`p`表示`lr_power`,:math:`\alpha`表示`learning_rate`,:math:`\omega`表示`params`。
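
下面给出上式的NumPy数值示意(草图,变量名沿用上文符号,`lr_power`即公式中的p;累加器初始值应为正数,以避免对0取负次幂):

.. code-block:: python

    import numpy as np

    def ftrl_step(w, m, u, g, alpha=0.001, lr_power=-0.5, l1=0.0, l2=0.0):
        m_new = m + g * g
        sigma = (np.power(m_new, -lr_power) - np.power(m, -lr_power)) / alpha
        u = u + g - sigma * w
        new_m_pow = np.power(m_new, -lr_power)
        w = np.where(np.abs(u) > l1,
                     (np.sign(u) * l1 - u) / (new_m_pow / alpha + 2 * l2),
                     0.0)
        return w, m_new, u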
|
||||
|
||||
.. note::
|
||||
.. include:: mindspore.nn.optim_note_sparse.rst
|
||||
|
||||
.. include:: mindspore.nn.optim_note_weight_decay.rst
|
||||
|
||||
参数:
|
||||
params (Union[list[Parameter], list[dict]]) - 必须是 `Parameter` 组成的列表或字典组成的列表。当列表元素是字典时,字典的键可以是"params"、"lr"、"weight_decay"、"grad_centralization"和"order_params":
|
||||
|
||||
.. include:: mindspore.nn.optim_group_param.rst
|
||||
|
||||
- **lr** - 学习率。FTRL当前不支持通过参数分组为各组配置不同的学习率。
|
||||
|
||||
.. include:: mindspore.nn.optim_group_weight_decay.rst
|
||||
|
||||
.. include:: mindspore.nn.optim_group_gc.rst
|
||||
|
||||
.. include:: mindspore.nn.optim_group_order.rst
|
||||
|
||||
|
||||
initial_accum (float):累加器`m`的初始值,必须大于等于零。默认值:0.1。
|
||||
|
||||
learning_rate (float):学习速率值必须为零或正数,当前不支持动态学习率。默认值:0.001。
|
||||
|
||||
lr_power (float):学习率的幂值,控制训练期间学习率的下降方式,必须小于或等于零。如果lr_power为零,则使用固定的学习率。默认值:-0.5。
|
||||
|
||||
l1 (float):l1正则化强度,必须大于等于零。默认值:0.0。
|
||||
|
||||
l2 (float):l2正则化强度,必须大于等于零。默认值:0.0。
|
||||
|
||||
use_locking (bool):如果为True,则更新操作使用锁保护。默认值:False。
|
||||
|
||||
.. include:: mindspore.nn.optim_arg_loss_scale.rst
|
||||
|
||||
weight_decay (Union[float, int]):要乘以权重的权重衰减值,必须为零或正值。默认值:0.0。
|
||||
|
||||
输入:
|
||||
- **grads** (tuple[Tensor]):优化器中`params`的梯度,shape与优化器中的`params`相同。
|
||||
|
||||
|
||||
输出:
|
||||
tuple[Parameter],更新的参数,shape与`params`相同。
|
||||
|
||||
异常:
|
||||
TypeError:`initial_accum`、`learning_rate`、`lr_power`、`l1`、`l2`或`loss_scale`不是float。
|
||||
TypeError:`parameters`的元素不是Parameter或dict。
|
||||
TypeError:`weight_decay`不是float或int。
|
||||
TypeError:`use_nesterov`不是bool。
|
||||
ValueError:`lr_power`大于0。
|
||||
ValueError:`loss_scale`小于等于0。
|
||||
ValueError:`initial_accum`、`l1`或`l2`小于0。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
示例:
|
||||
>>> net = Net()
|
||||
>>> #1) 所有参数使用相同的学习率和权重衰减
|
||||
>>> optim = nn.FTRL(params=net.trainable_params())
|
||||
>>>
|
||||
>>> #2) 使用参数分组并设置不同的值
|
||||
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01, 'grad_centralization':True},
|
||||
... {'params': no_conv_params},
|
||||
... {'order_params': net.trainable_params()}]
|
||||
>>> optim = nn.FTRL(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||
>>> # conv_params参数组将使用优化器中的学习率0.1、该组的权重衰减0.01、该组的梯度中心化配置True。
|
||||
>>> # no_conv_params参数组使用优化器中的学习率0.1、优化器中的权重衰减0.0、梯度中心化使用默认值False。
|
||||
>>> # 优化器按照"order_params"配置的参数顺序更新参数。
|
||||
>>>
|
||||
>>>
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> model = Model(net, loss_fn=loss, optimizer=optim)
|
||||
|
||||
|
||||
.. include:: mindspore.nn.optim_target_unique_for_sparse.rst
|
|
@ -0,0 +1,51 @@
|
|||
Class mindspore.nn.FixedLossScaleUpdateCell(loss_scale_value)
|
||||
|
||||
固定梯度放大系数的神经元。
|
||||
|
||||
该类是:class:`mindspore.nn.FixedLossScaleManager`的`get_update_cell`方法的返回值。
|
||||
训练过程中,类:class:`mindspore.TrainOneStepWithLossScaleCell`会调用该Cell。
|
||||
|
||||
参数:
|
||||
loss_scale_value (float):初始梯度放大系数。
|
||||
|
||||
输入:
|
||||
- **loss_scale** (Tensor):训练期间的梯度放大系数,shape为:math:`()`,在当前类中,该值被忽略。
|
||||
- **overflow** (bool):是否发生溢出。
|
||||
|
||||
输出:
|
||||
Bool,即输入`overflow`。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``GPU``
|
||||
|
||||
示例:
|
||||
>>> import numpy as np
|
||||
>>> import mindspore
>>> from mindspore import Tensor, Parameter, nn, ops
|
||||
>>>
|
||||
>>> class Net(nn.Cell):
|
||||
... def __init__(self, in_features, out_features):
|
||||
... super(Net, self).__init__()
|
||||
... self.weight = Parameter(Tensor(np.ones([in_features, out_features]).astype(np.float32)),
|
||||
... name='weight')
|
||||
... self.matmul = ops.MatMul()
|
||||
...
|
||||
... def construct(self, x):
|
||||
... output = self.matmul(x, self.weight)
|
||||
... return output
|
||||
...
|
||||
>>> in_features, out_features = 16, 10
|
||||
>>> net = Net(in_features, out_features)
|
||||
>>> loss = nn.MSELoss()
|
||||
>>> optimizer = nn.Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9)
|
||||
>>> net_with_loss = nn.WithLossCell(net, loss)
|
||||
>>> manager = nn.FixedLossScaleUpdateCell(loss_scale_value=2**12)
|
||||
>>> train_network = nn.TrainOneStepWithLossScaleCell(net_with_loss, optimizer, scale_sense=manager)
|
||||
>>> input = Tensor(np.ones([out_features, in_features]), mindspore.float32)
|
||||
>>> labels = Tensor(np.ones([out_features,]), mindspore.float32)
|
||||
>>> output = train_network(input, labels)
|
||||
|
||||
|
||||
get_loss_scale()
|
||||
|
||||
获取当前梯度放大系数。
|
||||
|
|
@ -0,0 +1,50 @@
|
|||
Class mindspore.nn.LARS(*args, **kwargs)
|
||||
|
||||
使用LARSUpdate算子实现LARS算法。
|
||||
|
||||
LARS算法采用大量的优化技术。详见论文`LARGE BATCH TRAINING OF CONVOLUTIONAL NETWORKS <https://arxiv.org/abs/1708.03888>`_。
|
||||
|
||||
更新公式如下:
|
||||
|
||||
.. math::
|
||||
|
||||
\begin{array}{ll} \\
|
||||
\lambda = \frac{\theta \text{ * } || \omega || }{|| g_{t} || \text{ + } \delta \text{ * } || \omega || } \\
|
||||
\lambda =
|
||||
\begin{cases}
|
||||
\min(\frac{\lambda}{\alpha }, 1)
|
||||
& \text{ if } clip = True \\
|
||||
\lambda
|
||||
& \text{ otherwise }
|
||||
\end{cases}\\
|
||||
g_{t+1} = \lambda * (g_{t} + \delta * \omega)
|
||||
\end{array}
|
||||
|
||||
:math:`\theta`表示`coefficient`,:math:`\omega`表示网络参数,:math:`g`表示`gradients`,:math:`t`表示当前step,:math:`\delta`表示`optimizer`配置的`weight_decay`,:math:`\alpha`表示`optimizer`配置的`learning_rate`,:math:`clip`表示`use_clip`。
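
按上式,单个参数块的梯度缩放可写成如下NumPy示意(草图,`epsilon`即下文参数中的平滑项,用于避免除零):

.. code-block:: python

    import numpy as np

    def lars_scale(w, g, coefficient=0.001, weight_decay=0.0, lr=0.1,
                   use_clip=False, epsilon=1e-05):
        w_norm, g_norm = np.linalg.norm(w), np.linalg.norm(g)
        lam = coefficient * w_norm / (g_norm + weight_decay * w_norm + epsilon)
        if use_clip:
            lam = min(lam / lr, 1.0)
        return lam * (g + weight_decay * w)   # 缩放后的梯度,交由被封装的优化器使用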
|
||||
|
||||
|
||||
参数:
|
||||
optimizer (Optimizer):待封装和修改梯度的MindSpore优化器。
|
||||
epsilon (float):将添加到分母中,提高数值稳定性。默认值:1e-05。
|
||||
coefficient (float):计算局部学习速率的信任系数。默认值:0.001。
|
||||
use_clip (bool):计算局部学习速率时是否裁剪。默认值:False。
|
||||
lars_filter (Function):用于指定使用LARS算法的网络参数。默认值:lambda x: 'LayerNorm' not in x.name and 'bias' not in x.name。
|
||||
|
||||
输入:
|
||||
- **gradients** (tuple[Tensor]):优化器中`params`的梯度,shape与优化器中的`params`相同。
|
||||
|
||||
|
||||
输出:
|
||||
Union[Tensor[bool], tuple[Parameter]],取决于`optimizer`的输出。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``CPU``
|
||||
|
||||
示例:
|
||||
>>> net = Net()
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> opt = nn.Momentum(net.trainable_params(), 0.1, 0.9)
|
||||
>>> opt_lars = nn.LARS(opt, epsilon=1e-08, coefficient=0.02)
|
||||
>>> model = Model(net, loss_fn=loss, optimizer=opt_lars, metrics=None)
|
||||
|
|
@ -0,0 +1,89 @@
|
|||
Class mindspore.nn.Lamb(*args, **kwargs)
|
||||
|
||||
LAMB(Layer-wise Adaptive Moments optimizer for Batching training,用于批训练的分层自适应矩优化器)算法优化器。
|
||||
|
||||
LAMB是一种采用分层自适应批优化技术的优化算法。
|
||||
详见论文`LARGE BATCH OPTIMIZATION FOR DEEP LEARNING: TRAINING BERT IN 76 MINUTES <https://arxiv.org/abs/1904.00962>`_。
|
||||
|
||||
LAMB优化器旨在不降低精度的情况下增加训练batch size,支持自适应逐元素更新和精确的分层校正。
|
||||
|
||||
|
||||
参数更新如下:
|
||||
|
||||
.. math::
|
||||
\begin{gather*}
|
||||
m_t = \beta_1 m_{t - 1}+ (1 - \beta_1)g_t\\
|
||||
v_t = \beta_2 v_{t - 1} + (1 - \beta_2)g_t^2\\
|
||||
m_t = \frac{m_t}{\beta_1^t}\\
|
||||
v_t = \frac{v_t}{\beta_2^t}\\
|
||||
r_t = \frac{m_t}{\sqrt{v_t}+\epsilon}\\
|
||||
w_t = w_{t-1} -\eta_t \frac{\| w_{t-1} \|}{\| r_t + \lambda w_{t-1} \|} (r_t + \lambda w_{t-1})
|
||||
\end{gather*}
|
||||
|
||||
其中,:math:`m`代表第一个矩向量,:math:`v`代表第二个矩向量,:math:`\eta`表示学习率,:math:`\lambda`表示LAMB权重衰减率。
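
对应的NumPy数值示意如下(草图;其中偏差修正按LAMB论文常见写法使用 :math:`1-\beta^t`,信任比率分母加极小值防止除零,均为本示例的处理方式):

.. code-block:: python

    import numpy as np

    def lamb_step(w, m, v, g, t, lr, beta1=0.9, beta2=0.999, eps=1e-6, weight_decay=0.0):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)          # 偏差修正(按论文写法,为本示例假设)
        v_hat = v / (1 - beta2 ** t)
        r = m_hat / (np.sqrt(v_hat) + eps)
        update = r + weight_decay * w
        trust = np.linalg.norm(w) / (np.linalg.norm(update) + 1e-12)
        return w - lr * trust * update, m, v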
|
||||
|
||||
|
||||
.. note::
|
||||
.. include:: mindspore.nn.optim_note_weight_decay.rst
|
||||
|
||||
.. include:: mindspore.nn.optim_note_loss_scale.rst
|
||||
|
||||
参数:
|
||||
params (Union[list[Parameter], list[dict]]): 必须是 `Parameter` 组成的列表或字典组成的列表。当列表元素是字典时,字典的键可以是"params"、"lr"、"weight_decay"、"grad_centralization"和"order_params":
|
||||
|
||||
.. include:: mindspore.nn.optim_group_param.rst
|
||||
.. include:: mindspore.nn.optim_group_lr.rst
|
||||
.. include:: mindspore.nn.optim_group_weight_decay.rst
|
||||
.. include:: mindspore.nn.optim_group_gc.rst
|
||||
.. include:: mindspore.nn.optim_group_order.rst
|
||||
|
||||
learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]):
|
||||
.. include:: mindspore.nn.optim_arg_dynamic_lr.rst
|
||||
|
||||
beta1 (float):第一矩的指数衰减率。参数范围(0.0,1.0)。默认值:0.9。
|
||||
|
||||
beta2 (float):第二矩的指数衰减率。参数范围(0.0,1.0)。默认值:0.999。
|
||||
|
||||
eps (float):将添加到分母中,以提高数值稳定性。必须大于0。默认值:1e-6。
|
||||
|
||||
weight_decay (float):权重衰减(L2 penalty)。必须大于等于0。默认值:0.0。
|
||||
|
||||
输入:
|
||||
- **gradients** (tuple[Tensor]):`params`的梯度,shape与`params`相同。
|
||||
|
||||
输出:
|
||||
tuple[bool],所有元素都为True。
|
||||
|
||||
异常:
|
||||
TypeError:`learning_rate`不是int、float、Tensor、Iterable或LearningRateSchedule。
|
||||
TypeError:`parameters`的元素不是Parameter或dict。
|
||||
TypeError:`beta1`、`beta2`或`eps`不是float。
|
||||
TypeError:`weight_decay`不是float或int。
|
||||
ValueError:`eps`小于等于0。
|
||||
ValueError:`beta1`、`beta2`不在(0.0,1.0)范围内。
|
||||
ValueError:`weight_decay`小于0。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
示例:
|
||||
>>> net = Net()
|
||||
>>> #1) 所有参数使用相同的学习率和权重衰减
|
||||
>>> optim = nn.Lamb(params=net.trainable_params(), learning_rate=0.1)
|
||||
>>>
|
||||
>>> #2) 使用参数分组并设置不同的值
|
||||
>>> poly_decay_lr = learning_rate_schedule.PolynomialDecayLR(learning_rate=0.1, end_learning_rate=0.01,
|
||||
... decay_steps=4, power = 0.5)
|
||||
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01, 'grad_centralization':True},
|
||||
... {'params': no_conv_params, 'lr': poly_decay_lr},
|
||||
...                 {'order_params': net.trainable_params()}]
|
||||
>>> optim = nn.Lamb(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||
>>> # conv_params参数组将使用优化器中的学习率0.1、该组的权重衰减0.01、该组的梯度中心化配置True。
|
||||
>>> # no_conv_params参数组将使用该组的衰减学习率、优化器中的权重衰减0.0、梯度中心化使用默认值False。
|
||||
>>> # 优化器按照"order_params"配置的参数顺序更新参数。
|
||||
>>>
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> model = Model(net, loss_fn=loss, optimizer=optim)
|
||||
|
|
@ -0,0 +1,93 @@
|
|||
Class mindspore.nn.LazyAdam(*args, **kwargs)
|
||||
|
||||
通过Adaptive Moment Estimation (Adam)算法更新梯度。请参阅论文`Adam: A Method for Stochastic Optimization <https://arxiv.org/abs/1412.6980>`_。
|
||||
|
||||
当梯度稀疏时,此优化器将使用Lazy Adam算法。
|
||||
|
||||
更新公式如下:
|
||||
|
||||
.. math::
|
||||
\begin{array}{ll} \\
|
||||
m_{t+1} = \beta_1 * m_{t} + (1 - \beta_1) * g \\
|
||||
v_{t+1} = \beta_2 * v_{t} + (1 - \beta_2) * g * g \\
|
||||
l = \alpha * \frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t} \\
|
||||
w_{t+1} = w_{t} - l * \frac{m_{t+1}}{\sqrt{v_{t+1}} + \epsilon}
|
||||
\end{array}
|
||||
|
||||
:math:`m`代表第一个矩向量`moment1`,:math:`v`代表第二个矩向量`moment2`,:math:`g`代表`gradients`,:math:`l`代表缩放因子,:math:`\beta_1,\beta_2`代表`beta1`和`beta2`,:math:`t`代表当前step,:math:`beta_1^t`和:math:`beta_2^t`代表`beta1_power`和`beta2_power`,:math:`\alpha`代表`learning_rate`,:math:`w`代表`params`,:math:`\epsilon`代表`eps`。
|
||||
|
||||
|
||||
.. note::
|
||||
.. include:: mindspore.nn.optim_note_sparse.rst
|
||||
需要注意的是,梯度稀疏时该优化器只更新梯度中当前索引位置对应的网络参数,稀疏行为不等同于Adam算法(行为示意见下方代码)。
|
||||
|
||||
.. include:: mindspore.nn.optim_note_weight_decay.rst
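
下面用NumPy给出稀疏梯度下lazy更新行为的示意(草图,仅更新`indices`中出现的行,其余行的m、v与参数保持不变;函数名与稀疏梯度的表示方式为本示例的假设):

.. code-block:: python

    import numpy as np

    def lazy_adam_sparse_step(w, m, v, indices, grad_values, lr=1e-3,
                              beta1=0.9, beta2=0.999, eps=1e-8):
        for i, g in zip(indices, grad_values):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            w[i] = w[i] - lr * m[i] / (np.sqrt(v[i]) + eps)   # 其余行保持不变
        return w, m, v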
|
||||
|
||||
参数:
|
||||
params (Union[list[Parameter], list[dict]]) - 必须是 `Parameter` 组成的列表或字典组成的列表。当列表元素是字典时,字典的键可以是"params"、"lr"、"weight_decay"、"grad_centralization"和"order_params":
|
||||
|
||||
.. include:: mindspore.nn.optim_group_param.rst
|
||||
.. include:: mindspore.nn.optim_group_lr.rst
|
||||
.. include:: mindspore.nn.optim_group_weight_decay.rst
|
||||
.. include:: mindspore.nn.optim_group_gc.rst
|
||||
.. include:: mindspore.nn.optim_group_order.rst
|
||||
|
||||
learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): 默认值:1e-3。
|
||||
.. include:: mindspore.nn.optim_arg_dynamic_lr.rst
|
||||
|
||||
beta1 (float):`moment1` 的指数衰减率。参数范围(0.0,1.0)。默认值:0.9。
|
||||
|
||||
beta2 (float):`moment2` 的指数衰减率。参数范围(0.0,1.0)。默认值:0.999。
|
||||
|
||||
eps (float):将添加到分母中,以提高数值稳定性。必须大于0。默认值:1e-8。
|
||||
|
||||
use_locking (bool):是否对参数更新加锁保护。如果为True,则 `w` 、`m` 和 `v` 的Tensor更新将受到锁的保护。如果为False,则结果不可预测。默认值:False。
|
||||
|
||||
use_nesterov (bool):是否使用Nesterov Accelerated Gradient (NAG)算法更新梯度。如果为True,使用NAG更新梯度。如果为False,则在不使用NAG的情况下更新梯度。默认值:False。
|
||||
|
||||
weight_decay (Union[float, int]):权重衰减(L2 penalty)。必须大于等于0。默认值:0.0。
|
||||
|
||||
.. include:: mindspore.nn.optim_arg_loss_scale.rst
|
||||
|
||||
输入:
|
||||
- **gradients** (tuple[Tensor]):`params`的梯度,shape与`params`相同。
|
||||
|
||||
输出:
|
||||
Tensor[bool],值为True。
|
||||
|
||||
异常:
|
||||
TypeError:`learning_rate`不是int、float、Tensor、Iterable或LearningRateSchedule。
|
||||
TypeError:`parameters`的元素不是Parameter或字典。
|
||||
TypeError:`beta1`、`beta2`、`eps`或`loss_scale`不是float。
|
||||
TypeError:`weight_decay`不是float或int。
|
||||
TypeError:`use_locking`或`use_nesterov`不是bool。
|
||||
ValueError:`loss_scale`或`eps`小于或等于0。
|
||||
ValueError:`beta1`、`beta2`不在(0.0,1.0)范围内。
|
||||
ValueError:`weight_decay`小于0。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``GPU``
|
||||
|
||||
示例:
|
||||
>>> net = Net()
|
||||
>>> #1) 所有参数使用相同的学习率和权重衰减
|
||||
>>> optim = nn.LazyAdam(params=net.trainable_params())
|
||||
>>>
|
||||
>>> #2) 使用参数分组并设置不同的值
|
||||
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01, 'grad_centralization':True},
|
||||
... {'params': no_conv_params, 'lr': 0.01},
|
||||
... {'order_params': net.trainable_params()}]
|
||||
>>> optim = nn.LazyAdam(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||
>>> # conv_params参数组将使用优化器中的学习率0.1、该组的权重衰减0.01、该组的梯度中心化配置True。
|
||||
>>> # no_conv_params参数组将使用该组的学习率0.01、优化器中的权重衰减0.0、梯度中心化使用默认值False。
|
||||
>>> # 优化器按照"order_params"配置的参数顺序更新参数。
|
||||
>>>
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> model = Model(net, loss_fn=loss, optimizer=optim)
|
||||
|
||||
|
||||
.. include:: mindspore.nn.optim_target_unique_for_sparse.rst
|
||||
|
||||
|
|
@ -1,76 +0,0 @@
|
|||
mindspore.nn.Metric
|
||||
====================
|
||||
|
||||
.. py:class:: mindspore.nn.Metric
|
||||
|
||||
用于计算评估指标的基类。
|
||||
|
||||
在计算评估指标时需要调用 `clear` 、 `update` 和 `eval` 三个方法,在继承该类自定义评估指标时,也需要实现这三个方法。其中,`update` 用于计算中间过程的内部结果,`eval` 用于计算最终评估结果,`clear` 用于重置中间结果。
|
||||
请勿直接使用该类,需使用子类如 :class:`mindspore.nn.MAE` 、 :class:`mindspore.nn.Recall` 等。
|
||||
|
||||
.. py:method:: clear()
|
||||
:abstract:
|
||||
|
||||
描述了清除内部评估结果的行为。
|
||||
|
||||
.. note::
|
||||
所有子类都必须重写此接口。
|
||||
|
||||
.. py:method:: eval()
|
||||
:abstract:
|
||||
|
||||
描述了计算最终评估结果的行为。
|
||||
|
||||
.. note::
|
||||
所有子类都必须重写此接口。
|
||||
|
||||
.. py:method:: indexes
|
||||
:property:
|
||||
|
||||
获取当前的 `indexes` 值。默认为None,调用 `set_indexes` 可修改 `indexes` 值。
|
||||
|
||||
.. py:method:: set_indexes(indexes)
|
||||
|
||||
该接口用于重排 `update` 的输入。
|
||||
|
||||
给定(label0, label1, logits)作为 `update` 的输入,将 `indexes` 设置为[2, 1],则最终使用(logits, label1)作为 `update` 的真实输入。
|
||||
|
||||
.. note::
|
||||
在继承该类自定义评估函数时,需要用装饰器 `mindspore.nn.rearrange_inputs` 修饰 `update` 方法,否则配置的 `indexes` 值不生效。
|
||||
|
||||
|
||||
**参数:**
|
||||
|
||||
**indexes** (List(int)) - logits和标签的目标顺序。
|
||||
|
||||
**输出:**
|
||||
|
||||
:class:`Metric` ,类实例本身。
|
||||
|
||||
**样例:**
|
||||
|
||||
>>> import numpy as np
|
||||
>>> from mindspore import nn, Tensor
|
||||
>>>
|
||||
>>> x = Tensor(np.array([[0.2, 0.5], [0.3, 0.1], [0.9, 0.6]]))
|
||||
>>> y = Tensor(np.array([1, 0, 1]))
|
||||
>>> y2 = Tensor(np.array([0, 0, 1]))
|
||||
>>> metric = nn.Accuracy('classification').set_indexes([0, 2])
|
||||
>>> metric.clear()
|
||||
>>> # indexes为[0, 2],使用x作为预测值,y2作为真实标签
|
||||
>>> metric.update(x, y, y2)
|
||||
>>> accuracy = metric.eval()
|
||||
>>> print(accuracy)
|
||||
0.3333333333333333
|
||||
|
||||
.. py:method:: update(*inputs)
|
||||
:abstract:
|
||||
|
||||
描述了更新内部评估结果的行为。
|
||||
|
||||
.. note::
|
||||
所有子类都必须重写此接口。
|
||||
|
||||
**参数:**
|
||||
|
||||
**inputs** - 可变长度输入参数列表。通常是预测值和对应的真实标签。
|
|
@ -1,91 +0,0 @@
|
|||
mindspore.nn.Momentum
|
||||
======================
|
||||
|
||||
.. py:class:: mindspore.nn.Momentum(*args, **kwargs)
|
||||
|
||||
Momentum算法优化器。
|
||||
|
||||
有关更多详细信息,请参阅论文 `On the importance of initialization and momentum in deep learning <https://dl.acm.org/doi/10.5555/3042817.3043064>`_。
|
||||
|
||||
.. math::
|
||||
v_{t+1} = v_{t} \ast u + grad
|
||||
|
||||
如果 `use_nesterov` 为True:
|
||||
|
||||
.. math::
|
||||
p_{t+1} = p_{t} - (grad \ast lr + v_{t+1} \ast u \ast lr)
|
||||
|
||||
如果 `use_nesterov` 为False:
|
||||
|
||||
.. math::
|
||||
p_{t+1} = p_{t} - lr \ast v_{t+1}
|
||||
|
||||
其中,:math:`grad` 、:math:`lr` 、:math:`p` 、:math:`v` 和 :math:`u` 分别表示梯度、学习率、参数、矩(Moment)和动量(Momentum)。
|
||||
|
||||
.. note::
|
||||
在参数未分组时,优化器配置的 `weight_decay` 应用于名称含有"beta"或"gamma"的网络参数,通过网络参数分组可调整权重衰减策略。分组时,每组网络参数均可配置 `weight_decay` ,若未配置,则该组网络参数使用优化器中配置的 `weight_decay` 。
|
||||
|
||||
**参数:**
|
||||
|
||||
- **params** (Union[list[Parameter], list[dict]]): 必须是 `Parameter` 组成的列表或字典组成的列表。当列表元素是字典时,字典的键可以是"params"、"lr"、"weight_decay"、"grad_centralization"和"order_params":
|
||||
|
||||
- ** params** - 必填。当前组别的权重,该值必须是 `Parameter` 列表。
|
||||
- ** lr** - 可选。如果键中存在"lr",则使用对应的值作为学习率。如果没有,则使用优化器中配置的 `learning_rate` 作为学习率。
|
||||
- ** weight_decay** - 可选。如果键中存在"weight_decay”,则使用对应的值作为权重衰减值。如果没有,则使用优化器中配置的 `weight_decay` 作为权重衰减值。
|
||||
- ** grad_centralization** - 可选。如果键中存在"grad_centralization",则使用对应的值,该值必须为布尔类型。如果没有,则认为 `grad_centralization` 为False。该参数仅适用于卷积层。
|
||||
- ** order_params** - 可选。对应值是预期的参数更新顺序。当使用参数分组功能时,通常使用该配置项保持 `parameters` 的顺序以提升性能。如果键中存在"order_params",则会忽略该组配置中的其他键。"order_params"中的参数必须在某一组 `params` 参数中。
|
||||
|
||||
- **learning_rate** (Union[float, int, Tensor, Iterable, LearningRateSchedule]):
|
||||
|
||||
- **float** - 固定的学习率。必须大于等于零。
|
||||
- **int** - 固定的学习率。必须大于等于零。整数类型会被转换为浮点数。
|
||||
- **Tensor** - 可以是标量或一维向量。标量是固定的学习率。一维向量是动态的学习率,第i步将取向量中第i个值作为学习率。
|
||||
- **Iterable** - 动态的学习率。第i步将取迭代器第i个值作为学习率。
|
||||
- **LearningRateSchedule** - 动态的学习率。在训练过程中,优化器将使用步数(step)作为输入,调用 `LearningRateSchedule` 实例来计算当前学习率。
|
||||
|
||||
- **momentum** (float) - 浮点数类型的超参,表示移动平均的动量。必须等于或大于0.0。
|
||||
- **weight_decay** (int, float) - 权重衰减(L2 penalty)值。必须大于等于0.0。默认值:0.0。
|
||||
- **loss_scale** (float) - 梯度缩放系数,必须大于0。如果 `loss_scale` 是整数,它将被转换为浮点数。通常使用默认值,仅当训练时使用了 `FixedLossScaleManager`,且 `FixedLossScaleManager` 的 `drop_overflow_update` 属性配置为False时,此值需要与 `FixedLossScaleManager` 中的 `loss_scale` 相同。有关更多详细信息,请参阅 :class:`mindspore.FixedLossScaleManager` 。默认值:1.0。
|
||||
- **use_nesterov** (bool) - 是否使用Nesterov Accelerated Gradient (NAG)算法更新梯度。默认值:False。
|
||||
|
||||
**输入:**
|
||||
|
||||
**gradients** (tuple[Tensor]) - `params` 的梯度,形状(shape)与 `params` 相同。
|
||||
|
||||
**输出:**
|
||||
|
||||
tuple[bool],所有元素都为True。
|
||||
|
||||
**异常:**
|
||||
|
||||
- **TypeError** - `learning_rate` 不是int、float、Tensor、Iterable或LearningRateSchedule。
|
||||
- **TypeError** - `parameters` 的元素不是 `Parameter` 或字典。
|
||||
- **TypeError** - `loss_scale` 或 `momentum` 不是float。
|
||||
- **TypeError** - `weight_decay` 不是float或int。
|
||||
- **TypeError** - `use_nesterov` 不是bool。
|
||||
- **ValueError** - `loss_scale` 小于或等于0。
|
||||
- **ValueError** - `weight_decay` 或 `momentum` 小于0。
|
||||
|
||||
**支持平台:**
|
||||
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
**样例:**
|
||||
|
||||
>>> net = Net()
|
||||
>>> #1) 所有参数使用相同的学习率和权重衰减
|
||||
>>> optim = nn.Momentum(params=net.trainable_params(), learning_rate=0.1, momentum=0.9)
|
||||
>>>
|
||||
>>> #2) 使用参数分组并设置不同的值
|
||||
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01, 'grad_centralization':True},
|
||||
... {'params': no_conv_params, 'lr': 0.01},
|
||||
... {'order_params': net.trainable_params()}]
|
||||
>>> optim = nn.Momentum(group_params, learning_rate=0.1, momentum=0.9, weight_decay=0.0)
|
||||
>>> # conv_params参数组将使用优化器中的学习率0.1、该组的权重衰减0.01、该组的梯度中心化配置True。
|
||||
>>> # no_conv_params参数组将使用该组的学习率0.01、优化器中的权重衰减0.0、梯度中心化使用默认值False。
|
||||
>>> # 优化器按照"order_params"配置的参数顺序更新参数。
|
||||
>>>
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> model = Model(net, loss_fn=loss, optimizer=optim, metrics=None)
|
|
@ -1,143 +0,0 @@
|
|||
mindspore.nn.Optimizer
|
||||
======================
|
||||
|
||||
.. py:class:: mindspore.nn.Optimizer(learning_rate, parameters, weight_decay=0.0, loss_scale=1.0)
|
||||
|
||||
用于参数更新的优化器基类。不要直接使用这个类,请实例化它的一个子类。
|
||||
|
||||
优化器支持参数分组。当参数分组时,每组参数均可配置不同的学习率(`lr` )、权重衰减(`weight_decay`)和梯度中心化(`grad_centralization`)策略。
|
||||
|
||||
.. note::
|
||||
在参数未分组时,优化器配置的 `weight_decay` 应用于名称含有"beta"或"gamma"的网络参数,通过网络参数分组可调整权重衰减策略。分组时,每组网络参数均可配置 `weight_decay` ,若未配置,则该组网络参数使用优化器中配置的 `weight_decay`。
|
||||
|
||||
**参数:**
|
||||
|
||||
- **learning_rate** (Union[float, int, Tensor, Iterable, LearningRateSchedule]):
|
||||
|
||||
- **float** - 固定的学习率。必须大于等于零。
|
||||
- **int** - 固定的学习率。必须大于等于零。整数类型会被转换为浮点数。
|
||||
- **Tensor** - 可以是标量或一维向量。标量是固定的学习率。一维向量是动态的学习率,第i步将取向量中第i个值作为学习率。
|
||||
- **Iterable** - 动态的学习率。第i步将取迭代器第i个值作为学习率。
|
||||
- **LearningRateSchedule** - 动态的学习率。在训练过程中,优化器将使用步数(step)作为输入,调用 `LearningRateSchedule` 实例来计算当前学习率。
|
||||
|
||||
- **parameters (Union[list[Parameter], list[dict]])** - 必须是 `Parameter` 组成的列表或字典组成的列表。当列表元素是字典时,字典的键可以是"params"、"lr"、"weight_decay"、"grad_centralization"和"order_params":
|
||||
|
||||
- **params** - 必填。当前组别的权重,该值必须是 `Parameter` 列表。
|
||||
- **lr** - 可选。如果键中存在"lr",则使用对应的值作为学习率。如果没有,则使用优化器中配置的 `learning_rate` 作为学习率。
|
||||
- **weight_decay** - 可选。如果键中存在"weight_decay”,则使用对应的值作为权重衰减值。如果没有,则使用优化器中配置的 `weight_decay` 作为权重衰减值。
|
||||
- **grad_centralization** - 可选。如果键中存在"grad_centralization",则使用对应的值,该值必须为布尔类型。如果没有,则认为 `grad_centralization` 为False。该参数仅适用于卷积层。
|
||||
- **order_params** - 可选。对应值是预期的参数更新顺序。当使用参数分组功能时,通常使用该配置项保持 `parameters` 的顺序以提升性能。如果键中存在"order_params",则会忽略该组配置中的其他键。"order_params"中的参数必须在某一组 `params` 参数中。
|
||||
|
||||
- **weight_decay** (Union[float, int]) - 权重衰减的整数或浮点值。必须等于或大于0。如果 `weight_decay` 是整数,它将被转换为浮点数。默认值:0.0。
|
||||
- **loss_scale** (float) - 梯度缩放系数,必须大于0。如果 `loss_scale` 是整数,它将被转换为浮点数。通常使用默认值,仅当训练时使用了 `FixedLossScaleManager` ,且 `FixedLossScaleManager` 的 `drop_overflow_update` 属性配置为False时,此值需要与 `FixedLossScaleManager` 中的 `loss_scale` 相同。有关更多详细信息,请参阅 :class:`mindspore.FixedLossScaleManager`。默认值:1.0。
|
||||
|
||||
**异常:**
|
||||
|
||||
- **TypeError** - `learning_rate` 不是int、float、Tensor、Iterable或LearningRateSchedule。
|
||||
- **TypeError** - `parameters` 的元素不是Parameter或字典。
|
||||
- **TypeError** - `loss_scale` 不是float。
|
||||
- **TypeError** - `weight_decay` 不是float或int。
|
||||
- **ValueError** - `loss_scale` 小于或等于0。
|
||||
- **ValueError** - `weight_decay` 小于0。
|
||||
- **ValueError** - `learning_rate` 是一个Tensor,但是Tensor的维度大于1。
|
||||
|
||||
**支持平台:**
|
||||
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
.. py:method:: broadcast_params(optim_result)
|
||||
|
||||
按参数组的顺序进行参数广播。
|
||||
|
||||
**参数:**
|
||||
|
||||
**optim_result** (bool) - 参数更新结果。该输入用来保证参数更新完成后才执行参数广播。
|
||||
|
||||
**返回:**
|
||||
|
||||
bool,状态标志。
|
||||
|
||||
.. py:method:: decay_weight(gradients)
|
||||
|
||||
衰减权重。
|
||||
|
||||
一种减少深度学习神经网络模型过拟合的方法。继承 :class:`mindspore.nn.Optimizer` 自定义优化器时,可调用该接口进行权重衰减。
|
||||
|
||||
**参数:**
|
||||
|
||||
**gradients** (tuple[Tensor]) - 网络参数的梯度,形状(shape)与网络参数相同。
|
||||
|
||||
**返回:**
|
||||
|
||||
tuple[Tensor],衰减权重后的梯度。
|
||||
|
||||
.. py:method:: get_lr()
|
||||
|
||||
优化器调用该接口获取当前步骤(step)的学习率。继承 :class:`mindspore.nn.Optimizer` 自定义优化器时,可在参数更新前调用该接口获取学习率。
|
||||
|
||||
**返回:**
|
||||
|
||||
float,当前步骤的学习率。
|
||||
|
||||
.. py:method:: get_lr_parameter(param)
|
||||
|
||||
用于在使用网络参数分组功能,且为不同组别配置不同的学习率时,获取指定参数的学习率。
|
||||
|
||||
**参数:**
|
||||
|
||||
**param** (Union[Parameter, list[Parameter]]) - `Parameter` 或 `Parameter` 列表。
|
||||
|
||||
**返回:**
|
||||
|
||||
Parameter,单个 `Parameter` 或 `Parameter` 列表。如果使用了动态学习率,返回用于计算学习率的 `LearningRateSchedule` 或 `LearningRateSchedule` 列表。
|
||||
|
||||
**样例:**
|
||||
|
||||
>>> from mindspore import nn
|
||||
>>> net = Net()
|
||||
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||
>>> group_params = [{'params': conv_params, 'lr': 0.05},
|
||||
... {'params': no_conv_params, 'lr': 0.01}]
|
||||
>>> optim = nn.Momentum(group_params, learning_rate=0.1, momentum=0.9, weight_decay=0.0)
|
||||
>>> conv_lr = optim.get_lr_parameter(conv_params)
|
||||
>>> print(conv_lr[0].asnumpy())
|
||||
0.05
|
||||
|
||||
.. py:method:: gradients_centralization(gradients)
|
||||
|
||||
梯度中心化。
|
||||
|
||||
一种优化卷积层参数以提高深度学习神经网络模型训练速度的方法。继承 :class:`mindspore.nn.Optimizer` 自定义优化器时,可调用该接口进行梯度中心化。
|
||||
|
||||
**参数:**
|
||||
|
||||
**gradients** (tuple[Tensor]) - 网络参数的梯度,形状(shape)与网络参数相同。
|
||||
|
||||
**返回:**
|
||||
|
||||
tuple[Tensor],梯度中心化后的梯度。
|
||||
|
||||
.. py:method:: scale_grad(gradients)
|
||||
|
||||
用于在混合精度场景还原梯度。
|
||||
|
||||
继承 :class:`mindspore.nn.Optimizer` 自定义优化器时,可调用该接口还原梯度。
|
||||
|
||||
**参数:**
|
||||
|
||||
**gradients** (tuple[Tensor]) - 网络参数的梯度,形状(shape)与网络参数相同。
|
||||
|
||||
**返回:**
|
||||
|
||||
tuple[Tensor],还原后的梯度。
|
||||
|
||||
.. py:method:: target
|
||||
:property:
|
||||
|
||||
该属性用于指定在主机(host)上还是设备(device)上更新参数。输入类型为str,只能是'CPU','Ascend'或'GPU'。
|
||||
|
||||
.. py:method:: unique
|
||||
:property:
|
||||
|
||||
该属性表示是否在优化器中进行梯度去重,通常用于稀疏网络。如果梯度是稀疏的则设置为True。如果前向稀疏网络已对权重去重,即梯度是稠密的,则设置为False。未设置时默认值为True。
|
|
@ -0,0 +1,82 @@
|
|||
Class mindspore.nn.ProximalAdagrad(*args, **kwargs)
|
||||
|
||||
使用ApplyProximalAdagrad算子实现ProximalAdagrad算法。
|
||||
|
||||
ProximalAdagrad用于在线学习和随机优化。
|
||||
请参阅论文`Efficient Learning using Forward-Backward Splitting <http://papers.nips.cc//paper/3793-efficient-learning-using-forward-backward-splitting.pdf>`_。
|
||||
|
||||
.. math::
|
||||
accum_{t+1} = accum_{t} + grad * grad
|
||||
|
||||
.. math::
|
||||
\text{prox_v} = var_{t} - lr * grad * \frac{1}{\sqrt{accum_{t+1}}}
|
||||
|
||||
.. math::
|
||||
var_{t+1} = \frac{sign(\text{prox_v})}{1 + lr * l2} * \max(\left| \text{prox_v} \right| - lr * l1, 0)
|
||||
|
||||
其中,grad、lr、var、accum和t分别表示`grads`, `learning_rate`, `params`、累加器和当前step。
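
上式对应的NumPy数值示意如下(草图,变量名沿用上文符号):

.. code-block:: python

    import numpy as np

    def proximal_adagrad_step(var, accum, grad, lr=1e-3, l1=0.0, l2=0.0):
        accum = accum + grad * grad
        prox_v = var - lr * grad / np.sqrt(accum)
        var = np.sign(prox_v) / (1 + lr * l2) * np.maximum(np.abs(prox_v) - lr * l1, 0)
        return var, accum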
|
||||
|
||||
.. note::
|
||||
.. include:: mindspore.nn.optim_note_sparse.rst
|
||||
|
||||
.. include:: mindspore.nn.optim_note_weight_decay.rst
|
||||
|
||||
参数:
|
||||
params (Union[list[Parameter], list[dict]]) - 必须是 `Parameter` 组成的列表或字典组成的列表。当列表元素是字典时,字典的键可以是"params"、"lr"、"weight_decay"、"grad_centralization"和"order_params":
|
||||
|
||||
.. include:: mindspore.nn.optim_group_param.rst
|
||||
.. include:: mindspore.nn.optim_group_lr.rst
|
||||
.. include:: mindspore.nn.optim_group_weight_decay.rst
|
||||
.. include:: mindspore.nn.optim_group_gc.rst
|
||||
.. include:: mindspore.nn.optim_group_order.rst
|
||||
|
||||
accum (float):累加器`accum`的初始值,必须为零或正值。默认值:0.1。
|
||||
|
||||
learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): 默认值:1e-3。
|
||||
.. include:: mindspore.nn.optim_arg_dynamic_lr.rst
|
||||
|
||||
l1 (float):l1正则化强度,必须大于或等于零。默认值:0.0。
|
||||
l2 (float):l2正则化强度,必须大于或等于零。默认值:0.0。
|
||||
use_locking (bool):如果为True,则更新操作使用锁保护。默认值:False。
|
||||
.. include:: mindspore.nn.optim_arg_loss_scale.rst
|
||||
weight_decay (Union[float, int]):要乘以权重的权重衰减值,必须为零或正值。默认值:0.0。
|
||||
|
||||
输入:
|
||||
- **grads** (tuple[Tensor]) - 优化器中`params`的梯度,shape与优化器中的`params`相同。
|
||||
|
||||
输出:
|
||||
Tensor[bool],值为True。
|
||||
|
||||
异常:
|
||||
TypeError:`learning_rate`不是int、float、Tensor、Iterable或LearningRateSchedule。
|
||||
TypeError:`parameters`的元素不是Parameter或字典。
|
||||
TypeError:`accum`、`l1`、`l2`或`loss_scale`不是float。
|
||||
TypeError:`weight_decay`不是float或int。
|
||||
ValueError:`loss_scale`小于或等于0。
|
||||
ValueError:`accum`、`l1`、`l2`或`weight_decay`小于0。
|
||||
|
||||
支持平台:
|
||||
``Ascend``
|
||||
|
||||
示例:
|
||||
>>> net = Net()
|
||||
>>> #1) 所有参数使用相同的学习率和权重衰减
|
||||
>>> optim = nn.ProximalAdagrad(params=net.trainable_params())
|
||||
>>>
|
||||
>>> #2) 使用参数组并设置不同的值
|
||||
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01, 'grad_centralization':True},
|
||||
... {'params': no_conv_params, 'lr': 0.01},
|
||||
... {'order_params': net.trainable_params()}]
|
||||
>>> optim = nn.ProximalAdagrad(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||
>>> # conv_params参数组将使用优化器中的学习率0.1、该组的权重衰减0.01、该组的梯度中心化配置True。
|
||||
>>> # no_conv_params参数组将使用该组的学习率0.01、优化器中的权重衰减0.0、梯度中心化使用默认值False。
|
||||
>>> # 优化器按照"order_params"配置的参数顺序更新参数。
|
||||
>>>
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> model = Model(net, loss_fn=loss, optimizer=optim)
|
||||
|
||||
|
||||
.. include:: mindspore.nn.optim_target_unique_for_sparse.rst
|
||||
|
|
@ -0,0 +1,101 @@
|
|||
Class mindspore.nn.RMSProp(*args, **kwargs)
|
||||
|
||||
实现均方根传播(RMSProp)算法。
|
||||
|
||||
根据RMSProp算法更新`params`,算法详见 http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf 第29页。
|
||||
|
||||
公式如下:
|
||||
|
||||
.. math::
|
||||
s_{t+1} = \rho s_{t} + (1 - \rho)(\nabla Q_{i}(w))^2
|
||||
|
||||
.. math::
|
||||
m_{t+1} = \beta m_{t} + \frac{\eta} {\sqrt{s_{t+1} + \epsilon}} \nabla Q_{i}(w)
|
||||
|
||||
.. math::
|
||||
w = w - m_{t+1}
|
||||
|
||||
第一个方程计算每个权重的平方梯度的移动平均。然后将梯度除以:math:`\sqrt{s_{t+1} + \epsilon}`。
|
||||
|
||||
如果centered为True:
|
||||
|
||||
.. math::
|
||||
g_{t+1} = \rho g_{t} + (1 - \rho)\nabla Q_{i}(w)
|
||||
|
||||
.. math::
|
||||
s_{t+1} = \rho s_{t} + (1 - \rho)(\nabla Q_{i}(w))^2
|
||||
|
||||
.. math::
|
||||
m_{t+1} = \beta m_{t} + \frac{\eta} {\sqrt{s_{t+1} - g_{t+1}^2 + \epsilon}} \nabla Q_{i}(w)
|
||||
|
||||
.. math::
|
||||
w = w - m_{t+1}
|
||||
|
||||
其中:math:`w`代表待更新的网络参数`params`。
|
||||
:math:`g_{t+1}`是平均梯度。
|
||||
:math:`s_{t+1}`是均方梯度。
|
||||
:math:`m_{t+1}`是moment,`w`的delta。
|
||||
:math:`\rho`代表`decay`。:math:`\beta`是动量项,表示`momentum`。
|
||||
:math:`\epsilon`是平滑项,可以避免除以零,表示`epsilon`。
|
||||
:math:`\eta`是学习率,表示`learning_rate`。:math:`\nabla Q_{i}(w)`是梯度,表示`gradients`。
|
||||
:math:`t`表示当前step。
|
||||
|
||||
.. note::
|
||||
.. include:: mindspore.nn.optim_note_weight_decay.rst
|
||||
|
||||
参数:
|
||||
params (Union[list[Parameter], list[dict]]):必须是 `Parameter` 组成的列表或字典组成的列表。当列表元素是字典时,字典的键可以是"params"、"lr"、"weight_decay"、"grad_centralization"和"order_params":
|
||||
|
||||
.. include:: mindspore.nn.optim_group_param.rst
|
||||
.. include:: mindspore.nn.optim_group_lr.rst
|
||||
.. include:: mindspore.nn.optim_group_weight_decay.rst
|
||||
.. include:: mindspore.nn.optim_group_gc.rst
|
||||
.. include:: mindspore.nn.optim_group_order.rst
|
||||
|
||||
learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]):默认值:0.1。
|
||||
.. include:: mindspore.nn.optim_arg_dynamic_lr.rst
|
||||
decay (float):衰减率。必须大于等于0。默认值:0.9。
|
||||
momentum (float):Float类型的超参数,表示移动平均的动量(momentum)。必须大于等于0。默认值:0.0。
|
||||
epsilon (float):将添加到分母中,以提高数值稳定性。取值大于0。默认值:1e-10。
|
||||
use_locking (bool):是否对参数更新加锁保护。默认值:False。
|
||||
centered (bool):如果为True,则梯度将通过梯度的估计方差进行归一。默认值:False。
|
||||
.. include:: mindspore.nn.optim_arg_loss_scale.rst
|
||||
weight_decay (Union[float, int]):权重衰减(L2 penalty)。必须大于等于0。默认值:0.0。
|
||||
|
||||
输入:
|
||||
- **gradients** (tuple[Tensor]) - `params`的梯度,shape与`params`相同。
|
||||
|
||||
输出:
|
||||
Tensor[bool],值为True。
|
||||
|
||||
异常:
|
||||
TypeError:`learning_rate`不是int、float、Tensor、Iterable或LearningRateSchedule。
|
||||
TypeError:`decay`、`momentum`、`epsilon`或`loss_scale`不是float。
|
||||
TypeError:`parameters`的元素不是Parameter或字典。
|
||||
TypeError:`weight_decay`不是float或int。
|
||||
TypeError:`use_locking`或`centered`不是bool。
|
||||
ValueError:`epsilon`小于或等于0。
|
||||
ValueError:`decay`或`momentum`小于0。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
示例:
|
||||
>>> net = Net()
|
||||
>>> #1) 所有参数使用相同的学习率和权重衰减
|
||||
>>> optim = nn.RMSProp(params=net.trainable_params(), learning_rate=0.1)
|
||||
>>>
|
||||
>>> #2) 使用参数分组并设置不同的值
|
||||
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01, 'grad_centralization':True},
|
||||
... {'params': no_conv_params, 'lr': 0.01},
|
||||
... {'order_params': net.trainable_params()}]
|
||||
>>> optim = nn.RMSProp(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||
>>> # conv_params参数组将使用优化器中的学习率0.1、该组的权重衰减0.01、该组的梯度中心化配置True。
|
||||
>>> # no_conv_params参数组将使用该组的学习率0.01、优化器中的权重衰减0.0、梯度中心化使用默认值False。
|
||||
>>> # 优化器按照"order_params"配置的参数顺序更新参数。
|
||||
>>>
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> model = Model(net, loss_fn=loss, optimizer=optim)
|
||||
|
|
@ -0,0 +1,88 @@
|
|||
mindspore.nn.SGD
|
||||
================
|
||||
|
||||
.. py:class:: mindspore.nn.SGD(*args, **kwargs)
|
||||
|
||||
实现随机梯度下降。动量可选。
|
||||
|
||||
SGD相关介绍参见 `SGD <https://en.wikipedia.org/wiki/Stochastic_gradient_descent>`_ 。
|
||||
|
||||
Nesterov动量公式参见论文 `On the importance of initialization and momentum in deep learning <http://proceedings.mlr.press/v28/sutskever13.html>`_ 。
|
||||
|
||||
.. math::
|
||||
v_{t+1} = u \ast v_{t} + gradient \ast (1-dampening)
|
||||
|
||||
如果nesterov为True:
|
||||
|
||||
.. math::
|
||||
p_{t+1} = p_{t} - lr \ast (gradient + u \ast v_{t+1})
|
||||
|
||||
如果nesterov为False:
|
||||
|
||||
.. math::
|
||||
p_{t+1} = p_{t} - lr \ast v_{t+1}
|
||||
|
||||
需要注意的是,对于训练的第一步 :math:`v_{t+1} = gradient`。其中,p、v和u分别表示 `parameters`、`accum` 和 `momentum`。
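
上式对应的数值示意如下(草图,`first_step`用于体现上文"训练第一步 :math:`v_{t+1} = gradient`"的约定):

.. code-block:: python

    def sgd_step(p, v, grad, lr=0.1, momentum=0.0, dampening=0.0,
                 nesterov=False, first_step=False):
        v = grad if first_step else momentum * v + grad * (1 - dampening)
        if nesterov:
            p = p - lr * (grad + momentum * v)
        else:
            p = p - lr * v
        return p, v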
|
||||
|
||||
.. note::
|
||||
|
||||
.. include:: mindspore.nn.optim_note_weight_decay.rst
|
||||
|
||||
**参数:**
|
||||
|
||||
- **params** (Union[list[Parameter], list[dict]]): 当 `params` 为会更新的 `Parameter` 列表时,`params` 中的元素必须为类 `Parameter`。当 `params` 为 `dict` 列表时,"params"、"lr"、"weight_decay"、"grad_centralization"和"order_params"为可以解析的键。
|
||||
.. include:: mindspore.nn.optim_group_param.rst
|
||||
.. include:: mindspore.nn.optim_group_lr.rst
|
||||
.. include:: mindspore.nn.optim_group_weight_decay.rst
|
||||
.. include:: mindspore.nn.optim_group_gc.rst
|
||||
.. include:: mindspore.nn.optim_group_order.rst
|
||||
|
||||
- **learning_rate** (Union[float, Tensor, Iterable, LearningRateSchedule]): 默认值:0.1。
|
||||
.. include:: mindspore.nn.optim_arg_dynamic_lr.rst
|
||||
|
||||
- **momentum** (float): 浮点动量,必须大于等于0.0。默认值:0.0。
|
||||
- **dampening** (float): 浮点动量阻尼值,必须大于等于0.0。默认值:0.0。
|
||||
- **weight_decay** (float): 权重衰减(L2 penalty),必须大于等于0。默认值:0.0。
|
||||
- **nesterov** (bool): 启用Nesterov动量。如果使用Nesterov,动量必须为正,阻尼必须等于0.0。默认值:False。
|
||||
.. include:: mindspore.nn.optim_arg_loss_scale.rst
|
||||
|
||||
**输入:**
|
||||
|
||||
**gradients** (tuple[Tensor]):`params` 的梯度,shape与 `params` 相同。
|
||||
|
||||
**输出:**
|
||||
|
||||
Tensor[bool],值为True。
|
||||
|
||||
**异常:**
|
||||
|
||||
**ValueError:** 动量、阻尼或权重衰减值小于0.0。
|
||||
|
||||
**支持平台:**
|
||||
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
**样例:**
|
||||
|
||||
.. code-block::
|
||||
|
||||
>>> net = Net()
|
||||
>>> # 1) 所有参数使用相同的学习率和权重衰减
|
||||
>>> optim = nn.SGD(params=net.trainable_params())
|
||||
>>>
|
||||
>>> # 2) 使用参数组并设置不同的值
|
||||
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
|
||||
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
|
||||
>>> group_params = [{'params': conv_params,'grad_centralization':True},
|
||||
... {'params': no_conv_params, 'lr': 0.01},
|
||||
... {'order_params': net.trainable_params()}]
|
||||
>>> optim = nn.SGD(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||
>>> # conv_params参数组将使用优化器中的学习率0.1、优化器中的权重衰减0.0、该组的梯度中心化配置True。
|
||||
>>> #
|
||||
>>> # no_conv_params参数组将使用该组的学习率0.01、优化器中的权重衰减0.0、梯度中心化使用默认值False。
|
||||
>>> #
|
||||
>>> # 优化器的最终参数顺序采用'order_params'的值。
|
||||
>>>
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> model = Model(net, loss_fn=loss, optimizer=optim)
|
||||
|
|
@ -0,0 +1,50 @@
|
|||
Class mindspore.nn.TrainOneStepCell(network, optimizer, sens=1.0)
|
||||
|
||||
训练网络封装类。
|
||||
|
||||
封装`network`和`optimizer`,构建一个输入'\*inputs'的用于训练的Cell。
|
||||
执行函数`construct`中会构建反向图以更新网络参数。支持不同的并行训练模式。
|
||||
|
||||
参数:
|
||||
network (Cell):训练网络。只支持单输出网络。
|
||||
optimizer (Union[Cell]):用于更新网络参数的优化器。
|
||||
sens (numbers.Number):反向传播的输入,缩放系数。默认值为1.0。
|
||||
|
||||
输入:
|
||||
- **(\*inputs)** (Tuple(Tensor)) - shape为:math:`(N, \ldots)`的Tensor组成的元组。
|
||||
|
||||
输出:
|
||||
Tensor,损失函数值,其shape通常为:math:`()`。
|
||||
|
||||
异常:
|
||||
TypeError:`sens`不是numbers.Number。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
示例:
|
||||
>>> net = Net()
|
||||
>>> loss_fn = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> optim = nn.Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9)
|
||||
>>> # 1)使用MindSpore提供的WithLossCell
|
||||
>>> loss_net = nn.WithLossCell(net, loss_fn)
|
||||
>>> train_net = nn.TrainOneStepCell(loss_net, optim)
|
||||
>>>
|
||||
>>> # 2)用户自定义的WithLossCell
|
||||
>>> class MyWithLossCell(Cell):
|
||||
... def __init__(self, backbone, loss_fn):
|
||||
... super(MyWithLossCell, self).__init__(auto_prefix=False)
|
||||
... self._backbone = backbone
|
||||
... self._loss_fn = loss_fn
|
||||
...
|
||||
... def construct(self, x, y, label):
|
||||
... out = self._backbone(x, y)
|
||||
... return self._loss_fn(out, label)
|
||||
...
|
||||
... @property
|
||||
... def backbone_network(self):
|
||||
... return self._backbone
|
||||
...
|
||||
>>> loss_net = MyWithLossCell(net, loss_fn)
|
||||
>>> train_net = nn.TrainOneStepCell(loss_net, optim)
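补充示意:构建完成后可直接调用训练网络执行单步训练(其中 `x`、`y`、`label` 为假设的输入Tensor,需与自定义的 `MyWithLossCell` 的输入匹配,仅作演示):

>>> train_net.set_train()
>>> loss = train_net(x, y, label)  # 执行一次前向计算、反向传播和参数更新,返回loss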
|
||||
|
|
@ -0,0 +1,116 @@
|
|||
Class mindspore.nn.TrainOneStepWithLossScaleCell(network, optimizer, scale_sense)
|
||||
|
||||
使用梯度放大功能(loss scale)的训练网络。
|
||||
|
||||
实现了包含梯度放大功能的单次训练。它使用网络、优化器和用于更新梯度放大系数的Cell(或一个Tensor)作为参数。可在host侧或device侧更新梯度放大系数。
|
||||
如果需要在host侧更新,使用Tensor作为`scale_sense`,否则,使用可更新梯度放大系数的Cell实例作为`scale_sense`。
|
||||
|
||||
参数:
|
||||
network (Cell):训练网络。仅支持单输出网络。
|
||||
optimizer (Cell):用于更新网络参数的优化器。
|
||||
scale_sense (Union[Tensor, Cell]):如果此值为Cell类型,`TrainOneStepWithLossScaleCell`会调用它来更新梯度放大系数。如果此值为Tensor类型,可调用`set_sense_scale`来更新梯度放大系数,shape为:math:`()`或:math:`(1,)`。
|
||||
|
||||
输入:
|
||||
- **(\*inputs)** (Tuple(Tensor)) - shape为:math:`(N, \ldots)`的Tensor组成的元组。
|
||||
|
||||
输出:
|
||||
Tuple,包含三个Tensor,分别为损失函数值、溢出状态和当前梯度放大系数。
|
||||
|
||||
- **loss** (Tensor) - shape为:math:`()`的Tensor。
|
||||
- **overflow** (Tensor) - shape为:math:`()`的Tensor,类型为bool。
|
||||
- **loss scale** (Tensor) - shape为:math:`()`的Tensor。
|
||||
|
||||
异常:
|
||||
TypeError:`scale_sense`既不是Cell,也不是Tensor。
|
||||
ValueError:`scale_sense`的shape既不是(1,)也不是()。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``GPU``
|
||||
|
||||
示例:
|
||||
>>> import numpy as np
|
||||
>>> from mindspore import Tensor, Parameter, nn, ops
|
||||
>>> from mindspore import dtype as mstype
|
||||
>>>
|
||||
>>> class Net(nn.Cell):
|
||||
... def __init__(self, in_features, out_features):
|
||||
... super(Net, self).__init__()
|
||||
... self.weight = Parameter(Tensor(np.ones([in_features, out_features]).astype(np.float32)),
|
||||
... name='weight')
|
||||
... self.matmul = ops.MatMul()
|
||||
...
|
||||
... def construct(self, x):
|
||||
... output = self.matmul(x, self.weight)
|
||||
... return output
|
||||
...
|
||||
>>> size, in_features, out_features = 16, 16, 10
|
||||
>>> # 1)当scale_sense类型为Cell时:
|
||||
>>> net = Net(in_features, out_features)
|
||||
>>> loss = nn.MSELoss()
|
||||
>>> optimizer = nn.Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9)
|
||||
>>> net_with_loss = nn.WithLossCell(net, loss)
|
||||
>>> manager = nn.DynamicLossScaleUpdateCell(loss_scale_value=2**12, scale_factor=2, scale_window=1000)
|
||||
>>> train_network = nn.TrainOneStepWithLossScaleCell(net_with_loss, optimizer, scale_sense=manager)
|
||||
>>> input = Tensor(np.ones([out_features, in_features]), mstype.float32)
|
||||
>>> labels = Tensor(np.ones([out_features,]), mstype.float32)
|
||||
>>> output = train_network(input, labels)
|
||||
>>>
|
||||
>>> # 2)当scale_sense类型为Tensor时:
|
||||
>>> net = Net(in_features, out_features)
|
||||
>>> loss = nn.MSELoss()
|
||||
>>> optimizer = nn.Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9)
|
||||
>>> net_with_loss = nn.WithLossCell(net, loss)
|
||||
>>> inputs = Tensor(np.ones([size, in_features]).astype(np.float32))
|
||||
>>> label = Tensor(np.zeros([size, out_features]).astype(np.float32))
|
||||
>>> scaling_sens = Tensor(np.full((1), np.finfo(np.float32).max), dtype=mstype.float32)
|
||||
>>> train_network = nn.TrainOneStepWithLossScaleCell(net_with_loss, optimizer, scale_sense=scaling_sens)
|
||||
>>> output = train_network(inputs, label)
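补充示意:当 `scale_sense` 为Tensor类型时,后续可通过 `set_sense_scale` 在host侧更新梯度放大系数(以下数值仅作演示,沿用上文已定义的名称):

>>> new_sens = Tensor(np.full((1), 2 ** 10), dtype=mstype.float32)
>>> train_network.set_sense_scale(new_sens)
>>> output = train_network(inputs, label)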
|
||||
|
||||
|
||||
get_overflow_status(status, compute_output)
|
||||
|
||||
获取浮点溢出状态。
|
||||
|
||||
溢出检测的目标过程执行完成后,获取溢出结果。继承该类自定义训练网络时,可复用该接口。
|
||||
|
||||
输入:
|
||||
- **status** (object) - 用于检测溢出的状态实例。
|
||||
- **compute_output** - 对特定计算过程进行溢出检测时,将`compute_output`设置为该计算过程的输出,以确保在执行计算之前获取了`status`。
|
||||
|
||||
输出:
|
||||
bool,是否发生溢出。
|
||||
|
||||
|
||||
process_loss_scale(overflow)
|
||||
|
||||
根据溢出状态计算梯度放大系数。继承该类自定义训练网络时,可复用该接口。
|
||||
|
||||
输入:
|
||||
- **overflow** (bool) - 是否发生溢出。
|
||||
|
||||
输出:
|
||||
bool,溢出状态,即输入。
|
||||
|
||||
|
||||
set_sense_scale(sens)
|
||||
|
||||
如果使用了Tensor类型的`scale_sense`,可调用此函数修改它的值。
|
||||
|
||||
输入:
|
||||
- **sens** (Tensor)- 新的梯度放大系数,其shape和类型需要与原始`scale_sense`相同。
|
||||
|
||||
|
||||
start_overflow_check(pre_cond, compute_input)
|
||||
|
||||
启动浮点溢出检测。创建并清除溢出检测状态。
|
||||
|
||||
指定参数'pre_cond'和'compute_input',以确保在正确的时间清除溢出状态。
|
||||
以当前接口为例,我们需要在损失函数计算后进行清除状态,在梯度计算过程中检测溢出。在这种情况下,pre_cond应为损失函数的输出,而compute_input应为梯度计算函数的输入。继承该类自定义训练网络时,可复用该接口。
|
||||
输入:
|
||||
- **pre_cond** (Tensor) - 启动溢出检测的先决条件。它决定溢出状态清除和先前处理的执行顺序。它确保函数 `start_overflow_check` 在执行完先决条件后清除状态。
|
||||
- **compute_input** (object) - 后续运算的输入。需要对特定的计算过程进行溢出检测时,将 `compute_input` 设置为这一计算过程的输入,以确保在执行该计算之前清除了溢出状态。
|
||||
|
||||
输出:
|
||||
Tuple[object, object],GPU后端的第一个值为False,而其他后端的第一个值是NPUAllocFloatStatus的实例。该值用于在`get_overflow_status`期间检测溢出。
|
||||
第二个值与输入的 `compute_input` 相同,用于控制执行顺序。
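以下给出一个继承该类并复用上述接口的极简示意。其中 `CustomTrainOneStepCell`、`tensor_grad_scale` 等名称仅作演示;`self.weights`、`self.grad`、`self.hyper_map`、`self.grad_reducer` 假定为基类提供的内部属性,实际行为请以源码为准:

>>> from mindspore import nn
>>> from mindspore.ops import composite as C
>>> from mindspore.ops import functional as F
>>> from mindspore.ops import operations as P
>>>
>>> _grad_scale = C.MultitypeFuncGraph("grad_scale")
>>> reciprocal = P.Reciprocal()
>>> @_grad_scale.register("Tensor", "Tensor")
... def tensor_grad_scale(scale, grad):
...     # 将梯度除以梯度放大系数,恢复真实梯度
...     return grad * F.cast(reciprocal(scale), F.dtype(grad))
...
>>> class CustomTrainOneStepCell(nn.TrainOneStepWithLossScaleCell):
...     def construct(self, *inputs):
...         weights = self.weights
...         loss = self.network(*inputs)
...         scaling_sens = self.scale_sense
...         # 在损失计算之后、梯度计算之前清除溢出状态
...         status, scaling_sens = self.start_overflow_check(loss, scaling_sens)
...         scaling_sens_filled = C.ones_like(loss) * F.cast(scaling_sens, F.dtype(loss))
...         grads = self.grad(self.network, weights)(*inputs, scaling_sens_filled)
...         grads = self.hyper_map(F.partial(_grad_scale, scaling_sens), grads)
...         grads = self.grad_reducer(grads)
...         # 获取溢出状态,并据此更新梯度放大系数
...         cond = self.get_overflow_status(status, grads)
...         overflow = self.process_loss_scale(cond)
...         # 未发生溢出时才执行参数更新
...         if not overflow:
...             loss = F.depend(loss, self.optimizer(grads))
...         return loss, cond, scaling_sens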
|
||||
|
|
@ -0,0 +1,29 @@
|
|||
Class mindspore.nn.WithEvalCell(network, loss_fn, add_cast_fp32=False)
|
||||
|
||||
封装前向网络和损失函数,返回用于计算评估指标的损失函数值、前向输出和标签。
|
||||
|
||||
|
||||
参数:
|
||||
network (Cell):前向网络。
|
||||
loss_fn (Cell):损失函数。
|
||||
add_cast_fp32 (bool):是否将数据类型调整为float32。默认值:False。
|
||||
|
||||
输入:
|
||||
- **data** (Tensor) - shape为:math:`(N, \ldots)`的Tensor。
|
||||
- **label** (Tensor) - shape为:math:`(N, \ldots)`的Tensor。
|
||||
|
||||
输出:
|
||||
Tuple(Tensor),包括标量损失值、shape为:math:`(N, \ldots)`的网络输出和shape为:math:`(N, \ldots)`的标签。
|
||||
|
||||
异常:
|
||||
TypeError:`add_cast_fp32`不是bool。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
示例:
|
||||
>>> # 未包含损失函数的前向网络
|
||||
>>> net = Net()
|
||||
>>> loss_fn = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> eval_net = nn.WithEvalCell(net, loss_fn)
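以下为调用示意(`data`、`label` 为假设的输入Tensor,shape需与网络和损失函数匹配):

>>> eval_net.set_train(False)
>>> loss, outputs, label = eval_net(data, label)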
|
||||
|
|
@ -0,0 +1,42 @@
|
|||
Class mindspore.nn.WithLossCell(backbone, loss_fn)
|
||||
|
||||
包含损失函数的Cell。
|
||||
|
||||
封装`backbone`和`loss_fn`。此Cell接受数据和标签作为输入,并将计算得到的损失值作为返回结果。
|
||||
|
||||
参数:
|
||||
backbone (Cell):要封装的目标网络。
|
||||
loss_fn (Cell):损失函数。
|
||||
|
||||
输入:
|
||||
- **data** (Tensor) - shape为:math:`(N, \ldots)`的Tensor。
|
||||
- **label** (Tensor) - shape为:math:`(N, \ldots)`的Tensor。
|
||||
|
||||
输出:
|
||||
Tensor,loss值,其shape通常为:math:`()`。
|
||||
|
||||
异常:
|
||||
TypeError:`data`或`label`的数据类型既不是float16也不是float32。
|
||||
|
||||
支持平台:
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
示例:
|
||||
>>> net = Net()
|
||||
>>> loss_fn = nn.SoftmaxCrossEntropyWithLogits(sparse=False)
|
||||
>>> net_with_criterion = nn.WithLossCell(net, loss_fn)
|
||||
>>>
|
||||
>>> batch_size = 2
|
||||
>>> data = Tensor(np.ones([batch_size, 1, 32, 32]).astype(np.float32) * 0.01)
|
||||
>>> label = Tensor(np.ones([batch_size, 10]).astype(np.float32))
|
||||
>>>
|
||||
>>> output_data = net_with_criterion(data, label)
|
||||
|
||||
|
||||
backbone_network
|
||||
|
||||
获取骨干网络。
|
||||
|
||||
返回:
|
||||
Cell,骨干网络。
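使用示意(沿用上文示例中的 `net_with_criterion`):

>>> backbone = net_with_criterion.backbone_network  # 返回被封装的骨干网络,即上文的net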
|
||||
|
|
@ -0,0 +1,5 @@
|
|||
- **float** - 固定的学习率。必须大于等于零。
|
||||
- **int** - 固定的学习率。必须大于等于零。整数类型会被转换为浮点数。
|
||||
- **Tensor** - 可以是标量或一维向量。标量是固定的学习率。一维向量是动态的学习率,第i步将取向量中第i个值作为学习率。
|
||||
- **Iterable** - 动态的学习率。第i步将取迭代器第i个值作为学习率。
|
||||
- **LearningRateSchedule** - 动态的学习率。在训练过程中,优化器将使用步数(step)作为输入,调用 `LearningRateSchedule` 实例来计算当前学习率。
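下面给出一个简单示意(以 `nn.Momentum` 为例,`Net` 为假设的已定义网络,仅演示几种学习率的传入方式):

>>> from mindspore import nn
>>> net = Net()  # 假设已定义的网络
>>> # 固定学习率(float)
>>> optim = nn.Momentum(net.trainable_params(), learning_rate=0.01, momentum=0.9)
>>> # 列表形式的动态学习率,第i步取列表中第i个值
>>> optim = nn.Momentum(net.trainable_params(), learning_rate=[0.1, 0.05, 0.01], momentum=0.9)
>>> # LearningRateSchedule实例,根据当前step计算学习率
>>> lr_schedule = nn.ExponentialDecayLR(learning_rate=0.1, decay_rate=0.9, decay_steps=100)
>>> optim = nn.Momentum(net.trainable_params(), learning_rate=lr_schedule, momentum=0.9)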
|
|
@ -0,0 +1 @@
|
|||
- **loss_scale** (float) - 梯度放大系数,必须大于0。如果 `loss_scale` 是整数,它将被转换为浮点数。通常使用默认值,仅当训练时使用了 `FixedLossScaleManager`,且 `FixedLossScaleManager` 的 `drop_overflow_update` 属性配置为False时,此值需要与 `FixedLossScaleManager` 中的 `loss_scale` 相同。有关更多详细信息,请参阅 :class:`mindspore.FixedLossScaleManager`。默认值:1.0。
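下面是一个最简示意(假设 `net`、`loss_fn` 已定义,以 `nn.Momentum` 和 `Model` 为例,仅用于说明两处取值需保持一致):

>>> from mindspore import Model, FixedLossScaleManager, nn
>>> loss_scale = 1024.0
>>> loss_scale_manager = FixedLossScaleManager(loss_scale, drop_overflow_update=False)
>>> # drop_overflow_update为False时,优化器的loss_scale需与FixedLossScaleManager中的loss_scale一致
>>> optim = nn.Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9, loss_scale=loss_scale)
>>> model = Model(net, loss_fn=loss_fn, optimizer=optim, loss_scale_manager=loss_scale_manager)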
|
|
@ -0,0 +1 @@
|
|||
- **grad_centralization** - 可选。如果键中存在"grad_centralization",则使用对应的值,该值必须为布尔类型。如果没有,则认为 `grad_centralization` 为False。该参数仅适用于卷积层。
|
|
@ -0,0 +1 @@
|
|||
- **lr** - 可选。如果键中存在"lr",则使用对应的值作为学习率。如果没有,则使用优化器中配置的 `learning_rate` 作为学习率。
|
|
@ -0,0 +1 @@
|
|||
- **order_params** - 可选。对应值是预期的参数更新顺序。当使用参数分组功能时,通常使用该配置项保持 `parameters` 的顺序以提升性能。如果键中存在"order_params",则会忽略该组配置中的其他键。"order_params"中的参数必须在某一组 `params` 参数中。
|
|
@ -0,0 +1 @@
|
|||
- **params** - 必填。当前组别的权重,该值必须是 `Parameter` 列表。
|
|
@ -0,0 +1 @@
|
|||
- **weight_decay** - 可选。如果键中存在"weight_decay",则使用对应的值作为权重衰减值。如果没有,则使用优化器中配置的 `weight_decay` 作为权重衰减值。
|
|
@ -0,0 +1 @@
|
|||
优化器和混合精度之间通常没有联系。但是,当使用 `FixedLossScaleManager` 且 `FixedLossScaleManager` 中的 `drop_overflow_update` 设置为False时,优化器需要设置 `loss_scale` 。由于此优化器没有 `loss_scale` 参数,因此需要通过其他方式处理 `loss_scale` ,如何正确处理 `loss_scale` 详见 `LossScale <https://www.mindspore.cn/docs/programming_guide/zh-CN/master/lossscale.html>`_ 。
|
|
@ -0,0 +1,2 @@
|
|||
如果前向网络使用了SparseGatherV2等算子,优化器会执行稀疏运算,通过设置 `target` 为CPU,可在主机(host)上进行稀疏运算。
|
||||
稀疏特性在持续开发中。
|
|
@ -0,0 +1,2 @@
|
|||
在参数未分组时,优化器配置的 `weight_decay` 应用于名称含有"beta"或"gamma"的网络参数,通过网络参数分组可调整权重衰减策略。分组时,每组网络参数均可配置 `weight_decay` ,若未配置,则该组网络参数使用优化器中配置的 `weight_decay` 。
|
||||
|
|
@ -0,0 +1,9 @@
|
|||
.. py:method:: target
|
||||
:property:
|
||||
|
||||
该属性用于指定在主机(host)上还是设备(device)上更新参数。输入类型为str,只能是'CPU','Ascend'或'GPU'。
|
||||
|
||||
.. py:method:: unique
|
||||
:property:
|
||||
|
||||
该属性表示是否在优化器中进行梯度去重,通常用于稀疏网络。如果梯度是稀疏的则设置为True。如果前向稀疏网络已对权重去重,即梯度是稠密的,则设置为False。未设置时默认值为True。
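以下为设置示意(以支持稀疏特性的 `nn.LazyAdam` 为例,`net` 为假设的已定义网络):

>>> optim = nn.LazyAdam(net.trainable_params(), learning_rate=0.1)
>>> optim.target = "CPU"  # 稀疏运算在主机(host)侧执行
>>> optim.unique = True   # 梯度为稀疏时在优化器中去重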
|
|
@ -206,7 +206,7 @@ class Adam(Optimizer):
|
|||
|
||||
:math:`m` represents the 1st moment vector `moment1`, :math:`v` represents the 2nd moment vector `moment2`,
|
||||
:math:`g` represents `gradients`, :math:`l` represents scaling factor, :math:`\beta_1, \beta_2` represent
|
||||
`beta1` and `beta2`, :math:`t` represents updating step while :math:`beta_1^t` and :math:`beta_2^t` represent
|
||||
`beta1` and `beta2`, :math:`t` represents the current step while :math:`beta_1^t` and :math:`beta_2^t` represent
|
||||
`beta1_power` and `beta2_power`, :math:`\alpha` represents `learning_rate`, :math:`w` represents `params`,
|
||||
:math:`\epsilon` represents `eps`.
|
||||
|
||||
|
@ -263,7 +263,7 @@ class Adam(Optimizer):
|
|||
Default: 0.999.
|
||||
eps (float): Term added to the denominator to improve numerical stability. Should be greater than 0. Default:
|
||||
1e-8.
|
||||
use_locking (bool): Whether to enable a lock to protect variable tensors from being updated.
|
||||
use_locking (bool): Whether to enable a lock to protect the updating process of variable tensors.
|
||||
If true, updates of the `w`, `m`, and `v` tensors will be protected by a lock.
|
||||
If false, the result is unpredictable. Default: False.
|
||||
use_nesterov (bool): Whether to use Nesterov Accelerated Gradient (NAG) algorithm to update the gradients.
|
||||
|
@ -380,7 +380,7 @@ class Adam(Optimizer):
|
|||
|
||||
class AdamWeightDecay(Optimizer):
|
||||
r"""
|
||||
Implements the Adam algorithm to fix the weight decay.
|
||||
Implements the Adam algorithm with weight decay.
|
||||
|
||||
.. math::
|
||||
\begin{array}{ll} \\
|
||||
|
@ -399,7 +399,7 @@ class AdamWeightDecay(Optimizer):
|
|||
|
||||
:math:`m` represents the 1st moment vector `moment1`, :math:`v` represents the 2nd moment vector `moment2`,
|
||||
:math:`g` represents `gradients`, :math:`lr` represents `learning_rate`,
|
||||
:math:`\beta_1, \beta_2` represent `beta1` and `beta2`, :math:`t` represents updating step while
|
||||
:math:`\beta_1, \beta_2` represent `beta1` and `beta2`, :math:`t` represents the current step,
|
||||
:math:`w` represents `params`.
|
||||
|
||||
Note:
|
||||
|
@ -542,7 +542,7 @@ class AdamOffload(Optimizer):
|
|||
|
||||
:math:`m` represents the 1st moment vector `moment1`, :math:`v` represents the 2nd moment vector `moment2`,
|
||||
:math:`g` represents `gradients`, :math:`l` represents scaling factor, :math:`\beta_1, \beta_2` represent
|
||||
`beta1` and `beta2`, :math:`t` represents updating step while :math:`beta_1^t` and :math:`beta_2^t` represent
|
||||
`beta1` and `beta2`, :math:`t` represents the current step while :math:`beta_1^t` and :math:`beta_2^t` represent
|
||||
`beta1_power` and `beta2_power`, :math:`\alpha` represents `learning_rate`, :math:`w` represents `params`,
|
||||
:math:`\epsilon` represents `eps`.
|
||||
|
||||
|
@ -593,7 +593,7 @@ class AdamOffload(Optimizer):
|
|||
Default: 0.999.
|
||||
eps (float): Term added to the denominator to improve numerical stability. Should be greater than 0. Default:
|
||||
1e-8.
|
||||
use_locking (bool): Whether to enable a lock to protect variable tensors from being updated.
|
||||
use_locking (bool): Whether to enable a lock to protect the updating process of variable tensors.
|
||||
If true, updates of the `w`, `m`, and `v` tensors will be protected by a lock.
|
||||
If false, the result is unpredictable. Default: False.
|
||||
use_nesterov (bool): Whether to use Nesterov Accelerated Gradient (NAG) algorithm to update the gradients.
|
||||
|
|
|
@ -98,9 +98,9 @@ class FTRL(Optimizer):
|
|||
\end{cases}\\
|
||||
\end{array}
|
||||
|
||||
:math:`m` represents `accum`, :math:`g` represents `grads`, :math:`t` represents updating step,
|
||||
:math:`u` represents `linear`, :math:`p` represents `lr_power`, :math:`\alpha` represents `learning_rate`,
|
||||
:math:`\omega` represents `params`.
|
||||
:math:`m` represents accumulators, :math:`g` represents `grads`, :math:`t` represents the current step,
|
||||
:math:`u` represents the linear coefficient to be updated, :math:`p` represents `lr_power`, :math:`\alpha`
|
||||
represents `learning_rate`, :math:`\omega` represents `params`.
|
||||
|
||||
Note:
|
||||
The sparse strategy is applied while the SparseGatherV2 operator is used for forward network. If the sparse
|
||||
|
@ -134,7 +134,7 @@ class FTRL(Optimizer):
|
|||
If `order_params` in the keys, other keys will be ignored and the element of 'order_params' must be in
|
||||
one group of `params`.
|
||||
|
||||
initial_accum (float): The starting value for accumulators, must be zero or positive values. Default: 0.1.
|
||||
initial_accum (float): The starting value for accumulators `m`, must be zero or positive values. Default: 0.1.
|
||||
learning_rate (float): The learning rate value, must be zero or positive, dynamic learning rate is currently
|
||||
not supported. Default: 0.001.
|
||||
lr_power (float): Learning rate power controls how the learning rate decreases during training, must be less
|
||||
|
@ -183,7 +183,8 @@ class FTRL(Optimizer):
|
|||
>>> optim = nn.FTRL(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||
>>> # The conv_params's parameters will use default learning rate of 0.1 and weight decay of 0.01 and grad
|
||||
>>> # centralization of True.
|
||||
>>> # The no_conv_params's parameters will use default weight decay of 0.0 and grad centralization of False.
|
||||
>>> # The no_conv_params's parameters will use default learning rate of 0.1, default weight decay
|
||||
>>> # of 0.0 and grad centralization of False.
|
||||
>>> # The final parameters order in which the optimizer will be followed is the value of 'order_params'.
|
||||
>>>
|
||||
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
|
||||
|
|
|
@ -172,7 +172,7 @@ def _check_param_value(beta1, beta2, eps, prim_name):
|
|||
|
||||
class Lamb(Optimizer):
|
||||
r"""
|
||||
Lamb(Layer-wise Adaptive Moments optimizer for Batching training) Dynamic Learning Rate.
|
||||
An optimizer that implements the Lamb(Layer-wise Adaptive Moments optimizer for Batching training) algorithm.
|
||||
|
||||
LAMB is an optimization algorithm employing a layerwise adaptive large batch optimization technique.
|
||||
Refer to the paper `LARGE BATCH OPTIMIZATION FOR DEEP LEARNING: TRAINING BERT IN 76
|
||||
|
|
|
@ -71,16 +71,16 @@ class LARS(Optimizer):
|
|||
g_{t+1} = \lambda * (g_{t} + \delta * \omega)
|
||||
\end{array}
|
||||
|
||||
:math:`\theta` represents `coefficient`, :math:`\omega` represents `parameters`, :math:`g` represents `gradients`,
|
||||
:math:`t` represents updating step, :math:`\delta` represents `weight_decay`,
|
||||
:math:`\alpha` represents `learning_rate`, :math:`clip` represents `use_clip`.
|
||||
:math:`\theta` represents `coefficient`, :math:`\omega` represents the network parameters, :math:`g` represents
|
||||
`gradients`, :math:`t` represents the current step, :math:`\delta` represents `weight_decay` in `optimizer`,
|
||||
:math:`\alpha` represents `learning_rate` in `optimizer`, :math:`clip` represents `use_clip`.
|
||||
|
||||
Args:
|
||||
optimizer (Optimizer): MindSpore optimizer for which to wrap and modify gradients.
|
||||
epsilon (float): Term added to the denominator to improve numerical stability. Default: 1e-05.
|
||||
coefficient (float): Trust coefficient for calculating the local learning rate. Default: 0.001.
|
||||
use_clip (bool): Whether to use clip operation for calculating the local learning rate. Default: False.
|
||||
lars_filter (Function): A function to determine whether apply the LARS algorithm. Default:
|
||||
lars_filter (Function): A function to determine which network parameters the LARS algorithm is applied to. Default:
|
||||
lambda x: 'LayerNorm' not in x.name and 'bias' not in x.name.
|
||||
|
||||
Inputs:
|
||||
|
|
|
@ -106,10 +106,10 @@ def _check_param_value(beta1, beta2, eps, weight_decay, prim_name):
|
|||
|
||||
class LazyAdam(Optimizer):
|
||||
r"""
|
||||
This optimizer will apply a lazy adam algorithm when gradient is sparse.
|
||||
Updates gradients by the Adaptive Moment Estimation (Adam) algorithm. The Adam algorithm is proposed
|
||||
in `Adam: A Method for Stochastic Optimization <https://arxiv.org/abs/1412.6980>`_.
|
||||
|
||||
The original adam algorithm is proposed in
|
||||
`Adam: A Method for Stochastic Optimization <https://arxiv.org/abs/1412.6980>`_.
|
||||
This optimizer will apply a lazy adam algorithm when gradient is sparse.
|
||||
|
||||
The updating formulas are as follows,
|
||||
|
||||
|
@ -123,7 +123,7 @@ class LazyAdam(Optimizer):
|
|||
|
||||
:math:`m` represents the 1st moment vector `moment1`, :math:`v` represents the 2nd moment vector `moment2`,
|
||||
:math:`g` represents `gradients`, :math:`l` represents scaling factor, :math:`\beta_1, \beta_2` represent
|
||||
`beta1` and `beta2`, :math:`t` represents updating step while :math:`beta_1^t` and :math:`beta_2^t` represent
|
||||
`beta1` and `beta2`, :math:`t` represents the current step while :math:`beta_1^t` and :math:`beta_2^t` represent
|
||||
`beta1_power` and `beta2_power`, :math:`\alpha` represents `learning_rate`, :math:`w` represents `params`,
|
||||
:math:`\epsilon` represents `eps`.
|
||||
|
||||
|
@ -182,7 +182,7 @@ class LazyAdam(Optimizer):
|
|||
Default: 0.999.
|
||||
eps (float): Term added to the denominator to improve numerical stability. Should be greater than 0. Default:
|
||||
1e-8.
|
||||
use_locking (bool): Whether to enable a lock to protect variable tensors from being updated.
|
||||
use_locking (bool): Whether to enable a lock to protect the updating process of variable tensors.
|
||||
If true, updates of the `w`, `m`, and `v` tensors will be protected by a lock.
|
||||
If false, the result is unpredictable. Default: False.
|
||||
use_nesterov (bool): Whether to use Nesterov Accelerated Gradient (NAG) algorithm to update the gradients.
|
||||
|
|
|
@ -69,7 +69,7 @@ class ProximalAdagrad(Optimizer):
|
|||
.. math::
|
||||
var_{t+1} = \frac{sign(\text{prox_v})}{1 + lr * l2} * \max(\left| \text{prox_v} \right| - lr * l1, 0)
|
||||
|
||||
Here : where grad, lr, var, accum and t denote the gradients, learning_rate, params and accumulation and current
|
||||
Here, grad, lr, var, accum and t denote the `grads`, `learning_rate`, `params`, accumulation and the current
|
||||
step respectively.
|
||||
|
||||
Note:
|
||||
|
@ -105,7 +105,7 @@ class ProximalAdagrad(Optimizer):
|
|||
If `order_params` in the keys, other keys will be ignored and the element of 'order_params' must be in
|
||||
one group of `params`.
|
||||
|
||||
accum (float): The starting value for accumulators, must be zero or positive values. Default: 0.1.
|
||||
accum (float): The starting value for accumulators `accum`, must be zero or positive values. Default: 0.1.
|
||||
learning_rate (Union[float, int, Tensor, Iterable, LearningRateSchedule]): Default: 0.001.
|
||||
|
||||
- float: The fixed learning rate value. Must be equal to or greater than 0.
|
||||
|
|
|
@ -75,13 +75,14 @@ class RMSProp(Optimizer):
|
|||
w = w - m_{t+1}
|
||||
|
||||
where :math:`w` represents `params`, which will be updated.
|
||||
:math:`g_{t+1}` is mean gradients, :math:`g_{t}` is the last moment of :math:`g_{t+1}`.
|
||||
:math:`s_{t+1}` is the mean square gradients, :math:`s_{t}` is the last moment of :math:`s_{t+1}`,
|
||||
:math:`m_{t+1}` is moment, the delta of `w`, :math:`m_{t}` is the last moment of :math:`m_{t+1}`.
|
||||
:math:`g_{t+1}` is mean gradients.
|
||||
:math:`s_{t+1}` is the mean square gradients.
|
||||
:math:`m_{t+1}` is moment, the delta of `w`.
|
||||
:math:`\\rho` represents `decay`. :math:`\\beta` is the momentum term, represents `momentum`.
|
||||
:math:`\\epsilon` is a smoothing term to avoid division by zero, represents `epsilon`.
|
||||
:math:`\\eta` is learning rate, represents `learning_rate`. :math:`\\nabla Q_{i}(w)` is gradients,
|
||||
represents `gradients`.
|
||||
:math:`t` represents the current step.
|
||||
|
||||
Note:
|
||||
If parameters are not grouped, the `weight_decay` in optimizer will be applied on the network parameters without
|
||||
|
@ -131,9 +132,9 @@ class RMSProp(Optimizer):
|
|||
greater than 0. Default: 0.0.
|
||||
epsilon (float): Term added to the denominator to improve numerical stability. Should be greater than
|
||||
0. Default: 1e-10.
|
||||
use_locking (bool): Whether to enable a lock to protect the variable and accumulation tensors from being
|
||||
updated. Default: False.
|
||||
centered (bool): If true, gradients are normalized by the estimated variance of the gradient. Default: False.
|
||||
use_locking (bool): Whether to enable a lock to protect the updating process of variable tensors.
|
||||
Default: False.
|
||||
centered (bool): If True, gradients are normalized by the estimated variance of the gradient. Default: False.
|
||||
loss_scale (float): A floating point value for the loss scale. Should be greater than 0. In general, use the
|
||||
default value. Only when `FixedLossScaleManager` is used for training and the `drop_overflow_update` in
|
||||
`FixedLossScaleManager` is set to False, then this value needs to be the same as the `loss_scale` in
|
||||
|
|
|
@ -128,8 +128,8 @@ class SGD(Optimizer):
|
|||
... {'params': no_conv_params, 'lr': 0.01},
|
||||
... {'order_params': net.trainable_params()}]
|
||||
>>> optim = nn.SGD(group_params, learning_rate=0.1, weight_decay=0.0)
|
||||
>>> # The conv_params's parameters will use default learning rate of 0.1 default weight decay of 0.0 and grad
|
||||
>>> # centralization of True.
|
||||
>>> # The conv_params's parameters will use default learning rate of 0.1 and default weight decay of 0.0
|
||||
>>> # and grad centralization of True.
|
||||
>>> # The no_conv_params's parameters will use learning rate of 0.01 and default weight decay of 0.0 and grad
|
||||
>>> # centralization of False.
|
||||
>>> # The final parameters order in which the optimizer will be followed is the value of 'order_params'.
|
||||
|
|
|
@ -287,13 +287,13 @@ class TrainOneStepCell(Cell):
|
|||
r"""
|
||||
Network training package class.
|
||||
|
||||
Wraps the network with an optimizer. The resulting Cell is trained with input '\*inputs'.
|
||||
Wraps the `network` with the `optimizer`. The resulting Cell is trained with input '\*inputs'.
|
||||
The backward graph will be created in the construct function to update the parameter. Different
|
||||
parallel modes are available for training.
|
||||
|
||||
Args:
|
||||
network (Cell): The training network. The network only supports single output.
|
||||
optimizer (Union[Cell]): Optimizer for updating the weights.
|
||||
optimizer (Union[Cell]): Optimizer for updating the network parameters.
|
||||
sens (numbers.Number): The scaling number to be filled as the input of backpropagation. Default value is 1.0.
|
||||
|
||||
Inputs:
|
||||
|
@ -303,7 +303,7 @@ class TrainOneStepCell(Cell):
|
|||
Tensor, a tensor means the loss value, the shape of which is usually :math:`()`.
|
||||
|
||||
Raises:
|
||||
TypeError: If `sens` is not a number.
|
||||
TypeError: If `sens` is not a numbers.Number.
|
||||
|
||||
Supported Platforms:
|
||||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
@ -312,7 +312,7 @@ class TrainOneStepCell(Cell):
|
|||
>>> net = Net()
|
||||
>>> loss_fn = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> optim = nn.Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9)
|
||||
>>> #1) Using the WithLossCell existing provide
|
||||
>>> #1) Using the WithLossCell provided by MindSpore
|
||||
>>> loss_net = nn.WithLossCell(net, loss_fn)
|
||||
>>> train_net = nn.TrainOneStepCell(loss_net, optim)
|
||||
>>>
|
||||
|
@ -596,22 +596,21 @@ class VirtualDatasetCellTriple(Cell):
|
|||
|
||||
class WithEvalCell(Cell):
|
||||
r"""
|
||||
Cell that returns loss, output and label for evaluation.
|
||||
Wraps the forward network with the loss function.
|
||||
|
||||
This Cell accepts a network and loss function as arguments and computes loss for model.
|
||||
It returns loss, output and label to calculate the metrics.
|
||||
It returns loss, forward output and label to calculate the metrics.
|
||||
|
||||
Args:
|
||||
network (Cell): The network Cell.
|
||||
loss_fn (Cell): The loss Cell.
|
||||
add_cast_fp32 (bool): Adjust the data type to float32. Default: False.
|
||||
network (Cell): The forward network.
|
||||
loss_fn (Cell): The loss function.
|
||||
add_cast_fp32 (bool): Whether to adjust the data type to float32. Default: False.
|
||||
|
||||
Inputs:
|
||||
- **data** (Tensor) - Tensor of shape :math:`(N, \ldots)`.
|
||||
- **label** (Tensor) - Tensor of shape :math:`(N, \ldots)`.
|
||||
|
||||
Outputs:
|
||||
Tuple, containing a scalar loss Tensor, a network output Tensor of shape :math:`(N, \ldots)`
|
||||
Tuple(Tensor), containing a scalar loss Tensor, a network output Tensor of shape :math:`(N, \ldots)`
|
||||
and a label Tensor of shape :math:`(N, \ldots)`.
|
||||
|
||||
Raises:
|
||||
|
@ -621,7 +620,7 @@ class WithEvalCell(Cell):
|
|||
``Ascend`` ``GPU`` ``CPU``
|
||||
|
||||
Examples:
|
||||
>>> # For a defined network Net without loss function
|
||||
>>> # Forward network without loss function
|
||||
>>> net = Net()
|
||||
>>> loss_fn = nn.SoftmaxCrossEntropyWithLogits()
|
||||
>>> eval_net = nn.WithEvalCell(net, loss_fn)
|
||||
|
|
|
@ -59,16 +59,17 @@ class DynamicLossScaleUpdateCell(Cell):
|
|||
Dynamic Loss scale update cell.
|
||||
|
||||
For loss scaling training, the initial loss scaling value will be set to be `loss_scale_value`.
|
||||
In each training step, the loss scaling value will be updated by loss scaling value/`scale_factor`
|
||||
when there is an overflow. And it will be increased by loss scaling value * `scale_factor` if there is no
|
||||
overflow for a continuous `scale_window` steps. This cell is used for Graph mode training in which all
|
||||
logic will be executed on device side(Another training mode is normal(non-sink) mode in which some logic will be
|
||||
executed on host).
|
||||
In each training step, the loss scaling value will be decreased to `loss_scale`/`scale_factor`
|
||||
when there is an overflow, and it will be increased to `loss_scale` * `scale_factor` if there is no
|
||||
overflow for a continuous `scale_window` steps.
|
||||
|
||||
`get_update_cell` method of :class:`mindspore.DynamicLossScaleManager` will return this class. It will be called
|
||||
by :class:`mindspore.TrainOneStepWithLossScaleCell` during training to update the loss scale.
|
||||
|
||||
Args:
|
||||
loss_scale_value (float): Initializes loss scale.
|
||||
scale_factor (int): Coefficient of increase and decrease.
|
||||
scale_window (int): Maximum continuous training steps that do not have overflow.
|
||||
scale_window (int): Maximum continuous training steps that do not have overflow to increase the loss scale.
|
||||
|
||||
Inputs:
|
||||
- **loss_scale** (Tensor) - The loss scale value during training with shape :math:`()`.
|
||||
|
@ -77,9 +78,6 @@ class DynamicLossScaleUpdateCell(Cell):
|
|||
Outputs:
|
||||
bool, the input `overflow`.
|
||||
|
||||
Raises:
|
||||
TypeError: If dtype of `inputs` or `label` is neither float16 nor float32.
|
||||
|
||||
Supported Platforms:
|
||||
``Ascend`` ``GPU``
|
||||
|
||||
|
@ -162,15 +160,17 @@ class DynamicLossScaleUpdateCell(Cell):
|
|||
|
||||
class FixedLossScaleUpdateCell(Cell):
|
||||
"""
|
||||
Static scale update cell, the loss scaling value will not be updated.
|
||||
Update cell with fixed loss scaling value.
|
||||
|
||||
For usage, refer to `DynamicLossScaleUpdateCell`.
|
||||
`get_update_cell` method of :class:`mindspore.FixedLossScaleManager` will return this class. It will be called
|
||||
by :class:`mindspore.TrainOneStepWithLossScaleCell` during training.
|
||||
|
||||
Args:
|
||||
loss_scale_value (float): Initializes loss scale.
|
||||
|
||||
Inputs:
|
||||
- **loss_scale** (Tensor) - The loss scale value during training with shape :math:`()`, that will be ignored.
|
||||
- **loss_scale** (Tensor) - The loss scale value during training with shape :math:`()`; it is ignored in this
|
||||
class.
|
||||
- **overflow** (bool) - Whether the overflow occurs or not.
|
||||
|
||||
Outputs:
|
||||
|
@ -227,28 +227,27 @@ class TrainOneStepWithLossScaleCell(TrainOneStepCell):
|
|||
r"""
|
||||
Network training with loss scaling.
|
||||
|
||||
This is a training step with loss scaling. It takes a network, an optimizer and possibly a scale update
|
||||
Cell as args. The loss scale value can be updated in both host side or device side. The
|
||||
TrainOneStepWithLossScaleCell will be compiled to be graph which takes `*inputs` as input data.
|
||||
The Tensor type of `scale_sense` is acting as loss scaling value. If you want to update it on host side,
|
||||
the value must be provided. If the Tensor type of `scale_sense` is not given, the loss scale update logic
|
||||
must be provide by Cell type of `scale_sense`.
|
||||
This is a training step with loss scaling. It takes a network, an optimizer and a scale update Cell (or a Tensor) as
|
||||
args. The loss scale value can be updated on either the host side or the device side. If you want to update it on
|
||||
the host side, use a value of Tensor type as `scale_sense`; otherwise, use a Cell instance for updating the loss
|
||||
scale as `scale_sense`.
|
||||
|
||||
Args:
|
||||
network (Cell): The training network. The network only supports single output.
|
||||
optimizer (Cell): Optimizer for updating the weights.
|
||||
scale_sense (Union[Tensor, Cell]): If this value is Cell type, the loss scaling update logic cell.If this value
|
||||
is Tensor type, Tensor with shape :math:`()` or :math:`(1,)`.
|
||||
optimizer (Cell): Optimizer for updating the network parameters.
|
||||
scale_sense (Union[Tensor, Cell]): If this value is a Cell, it will be called by `TrainOneStepWithLossScaleCell`
|
||||
to update loss scale. If this value is a Tensor, the loss scale can be modified by `set_sense_scale`,
|
||||
the shape should be :math:`()` or :math:`(1,)`.
|
||||
|
||||
Inputs:
|
||||
- **(*inputs)** (Tuple(Tensor)) - Tuple of input tensors with shape :math:`(N, \ldots)`.
|
||||
|
||||
Outputs:
|
||||
Tuple of 3 Tensor, the loss, overflow flag and current loss scaling value.
|
||||
Tuple of 3 Tensor, the loss, overflow flag and current loss scale value.
|
||||
|
||||
- **loss** (Tensor) - Tensor with shape :math:`()`.
|
||||
- **overflow** (Tensor) - Tensor with shape :math:`()`, type is bool.
|
||||
- **loss scaling value** (Tensor) - Tensor with shape :math:`()`
|
||||
- **loss scale** (Tensor) - Tensor with shape :math:`()`
|
||||
|
||||
Raises:
|
||||
TypeError: If `scale_sense` is neither Cell nor Tensor.
|
||||
|
@ -350,8 +349,7 @@ class TrainOneStepWithLossScaleCell(TrainOneStepCell):
|
|||
|
||||
def set_sense_scale(self, sens):
|
||||
"""
|
||||
If the user has set the sens in the training process and wants to reassign the value, he can call
|
||||
this function again to make modification, and sens needs to be of type Tensor.
|
||||
If the `scale_sense` is of Tensor type, this function can be called to reassign its value.
|
||||
|
||||
Args:
|
||||
sens(Tensor): The new sense whose shape and type are the same with original `scale_sense`.
|
||||
|
@ -382,7 +380,7 @@ class TrainOneStepWithLossScaleCell(TrainOneStepCell):
|
|||
|
||||
Returns:
|
||||
Tuple[object, object], the first value is False for GPU backend, while it is an instance of
|
||||
NPUAllocFloatStatus for other backend. The status is used to detect overflow during overflow detection.
|
||||
NPUAllocFloatStatus for other backend. The status is used to detect overflow during `get_overflow_status`.
|
||||
The second value is the same as the input of `compute_input`, but contains some information about the
|
||||
execution order.
|
||||
"""
|
||||
|
@ -406,7 +404,7 @@ class TrainOneStepWithLossScaleCell(TrainOneStepCell):
|
|||
Args:
|
||||
status (object): A status instance used to detect the overflow.
|
||||
compute_output: Overflow detection should be performed on a certain computation. Set `compute_output`
|
||||
as the output of the computation, to ensure overflow status is acquired before executing the
|
||||
as the output of the computation, to ensure overflow `status` is acquired before executing the
|
||||
computation.
|
||||
|
||||
Returns:
|
||||
|
@ -442,7 +440,7 @@ class TrainOneStepWithLossScaleCell(TrainOneStepCell):
|
|||
overflow(bool): Whether the overflow occurs or not.
|
||||
|
||||
Returns:
|
||||
bool, overflow value.
|
||||
bool, the input overflow value.
|
||||
"""
|
||||
if self.loss_scaling_manager is not None:
|
||||
return self.loss_scaling_manager(self.scale_sense, overflow)
|
||||
|
|
|
@ -25,8 +25,8 @@ class LossScaleManager:
|
|||
Derived class needs to implement all of its methods. `get_loss_scale` is used to get current loss scale value.
|
||||
`update_loss_scale` is used to update loss scale value, `update_loss_scale` will be called during the training.
|
||||
`get_update_cell` is used to get the instance of :class:`mindspore.nn.Cell` that is used to update the loss scale,
|
||||
the instance will be called during the training. When using sink mode, only the `get_update_cell` works, otherwise
|
||||
both `update_loss_scale` and `get_update_cell` works.
|
||||
the instance will be called during training. Currently, `get_update_cell` is mostly used.
|
||||
|
||||
For example, :class:`mindspore.FixedLossScaleManager` and :class:`mindspore.DynamicLossScaleManager`.
|
||||
"""
|
||||
def get_loss_scale(self):
|
||||
|
@ -105,7 +105,8 @@ class FixedLossScaleManager(LossScaleManager):
|
|||
def get_update_cell(self):
|
||||
"""
|
||||
Returns the instance of :class:`mindspore.nn.Cell` that is used to update the loss scale which will be called by
|
||||
:class:`mindspore.nn.TrainOneStepWithLossScaleCell`.
|
||||
:class:`mindspore.nn.TrainOneStepWithLossScaleCell`. As the loss scale is fixed in this class, the instance
|
||||
will do nothing.
|
||||
|
||||
Returns:
|
||||
None or :class:`mindspore.FixedLossScaleUpdateCell`. Instance of
|
||||
|
|