!41615 reconstruct map parameters

Merge pull request !41615 from guozhijian/reconstruct_map
Commit 8a983e1a06 by i-robot, 2022-09-15 08:21:44 +00:00, committed by Gitee
7 changed files with 62 additions and 19 deletions

[Binary files not shown: four new images added (2.0 KiB, 2.3 KiB, 2.9 KiB, 3.2 KiB), the map_parameter*.png diagrams referenced in the diffs below.]

View File

@@ -1,4 +1,4 @@
.. py:method:: map(operations, input_columns=None, output_columns=None, column_order=None, num_parallel_workers=None, python_multiprocessing=False, cache=None, callbacks=None, max_rowsize=16, offload=None)
.. py:method:: map(operations, input_columns=None, output_columns=None, column_order=None, num_parallel_workers=None, **kwargs)
Given a list of data transformation operations, apply the transformations to the dataset object in order.
@@ -8,17 +8,30 @@
The output column names of the last transformation are specified by `output_columns`; if `output_columns` is not given, the output column names are the same as `input_columns`.
- If you use the transformations provided by `mindspore` `dataset` (
  `vision transforms <https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore.dataset.vision.html>`_,
  `text transforms <https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore.dataset.text.html>`_,
  `audio transforms <https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore.dataset.audio.html>`_), use the following parameters:
  .. image:: map_parameter_cn.png
- If you use a user-defined PyFunc transformation, use the following parameters:
  .. image:: map_parameter_pyfunc_cn.png
Parameters:
    - **operations** (Union[list[TensorOperation], list[functions]]) - List of data transformation operations to apply; both dataset transform operators and user-defined Python callable objects are supported. The map operation applies the transformations to the dataset object in order.
    - **input_columns** (Union[str, list[str]], optional) - Input data columns of the first transformation. The length of this list must match the number of input columns expected by the first transformation in `operations`. Default: None, all data columns are passed to the first transformation.
    - **output_columns** (Union[str, list[str]], optional) - Output data columns of the last transformation. This parameter is required if the length of `input_columns` is not equal to the length of `output_columns`. The length of this list must match the number of output columns of the last transformation. Default: None, the output columns have the same names as the input columns.
    - **column_order** (Union[str, list[str]], optional) - Order of the data columns passed to the next dataset operation. This parameter is required if the length of `input_columns` is not equal to the length of `output_columns`. Note: the column names here are not limited to those specified in `input_columns` and `output_columns`; unprocessed data columns output by the previous operation may also be included. Default: None, keep the original input order.
    - **num_parallel_workers** (int, optional) - Number of parallel processes/threads used by the map operation to speed up processing. Default: None, use the parallelism set by `set_num_parallel_workers`.
    - **python_multiprocessing** (bool, optional) - Enable Python multiprocessing mode to speed up the map operation. Enabling this option can pay off when the given `operations` are computationally heavy. Default: False.
    - **cache** (DatasetCache, optional) - Single-node data caching service used to speed up dataset processing; see `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ for details. Default: None, no cache is used.
    - **callbacks** (DSCallback, list[DSCallback], optional) - List of Dataset callbacks to be called. Default: None.
    - **max_rowsize** (int, optional) - Maximum size, in MB, of the shared memory allocated to copy data between processes. Only effective when `python_multiprocessing` is True. Default: 16.
    - **offload** (bool, optional) - Whether to use heterogeneous hardware acceleration; see `Dataset Offload <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/dataset_offload.html>`_ for details. Default: None.
    - **\*\*kwargs** - Other parameters.
        - python_multiprocessing (bool, optional) - Enable Python multiprocessing mode to speed up the map operation. Enabling this option can pay off when the given `operations` are computationally heavy. Default: False.
        - max_rowsize (int, optional) - Maximum size, in MB, of the shared memory allocated to copy data between processes. Only effective when `python_multiprocessing` is True. Default: 16.
        - cache (DatasetCache, optional) - Single-node data caching service used to speed up dataset processing; see `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ for details. Default: None, no cache is used.
        - callbacks (DSCallback, list[DSCallback], optional) - List of Dataset callbacks to be called. Default: None.
        - offload (bool, optional) - Whether to use heterogeneous hardware acceleration; see `Dataset Offload <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/dataset_offload.html>`_ for details. Default: None.
.. note::
    - The `operations` parameter accepts data processing operations of type `TensorOperation` as well as user-defined Python functions (PyFuncs).
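
To make the reworked signature concrete, here is a minimal usage sketch; the dataset source, the "/path/to/images" path, and the "image" column are illustrative and not part of this change:

import mindspore.dataset as ds
import mindspore.dataset.vision as vision

def rescale(img):
    # Placeholder PyFunc: scale pixel values into [0, 1].
    return img / 255.0

dataset = ds.ImageFolderDataset("/path/to/images")

# Built-in transforms: the core arguments stay explicit, while the
# tuning options now travel through **kwargs.
dataset = dataset.map(
    operations=[vision.Decode(), vision.Resize((224, 224))],
    input_columns=["image"],
    num_parallel_workers=4)

# A user-defined PyFunc: the multiprocessing knobs are **kwargs as well.
dataset = dataset.map(
    operations=[rescale],
    input_columns=["image"],
    python_multiprocessing=True,
    max_rowsize=16)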

View File

@@ -767,8 +767,7 @@ class Dataset:
    @check_map
    def map(self, operations, input_columns=None, output_columns=None, column_order=None,
            num_parallel_workers=None, python_multiprocessing=False, cache=None, callbacks=None,
            max_rowsize=16, offload=None):
            num_parallel_workers=None, **kwargs):
        """
        Apply each operation in operations to this dataset.
@@ -780,6 +779,18 @@ class Dataset:
        The columns outputted by the very last operation will be assigned names specified by
        `output_columns`, and if not specified, the output columns will have the same names as `input_columns`.
        - If you use transformations (
          `vision transform <https://www.mindspore.cn/docs/en/master/api_python/mindspore.dataset.vision.html>`_,
          `nlp transform <https://www.mindspore.cn/docs/en/master/api_python/mindspore.dataset.text.html>`_,
          `audio transform <https://www.mindspore.cn/docs/en/master/api_python/mindspore.dataset.audio.html>`_)
          provided by mindspore dataset, please use the following parameters:
          .. image:: map_parameter_en.png
        - If you use a user-defined transform as PyFunc (Python Func), please use the following parameters:
          .. image:: map_parameter_pyfunc_en.png
        Args:
            operations (Union[list[TensorOperation], list[functions]]): List of operations to be
                applied on the dataset. Operations are applied in the order they appear in this list.
@@ -798,14 +809,21 @@ class Dataset:
                Caution: the list here is not limited to the columns specified in input_columns and output_columns.
            num_parallel_workers (int, optional): Number of threads used to process the dataset in
                parallel (default=None, the value from the configuration will be used).
            python_multiprocessing (bool, optional): Parallelize Python operations with multiple worker processes.
                This option could be beneficial if the Python operation is computationally heavy (default=False).
            cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing
                (default=None, which means no cache is used).
            callbacks (DSCallback, list[DSCallback], optional): List of Dataset callbacks to be called (default=None).
            max_rowsize (int, optional): Maximum size of row in MB that is used for shared memory allocation to copy
                data between processes. This is only used if python_multiprocessing is set to True (default=16).
            offload (bool, optional): Flag to indicate whether offload is used (default=None).
            **kwargs:
                - python_multiprocessing (bool, optional): Parallelize Python operations with multiple worker
                  processes. This option could be beneficial if the Python operation is computationally heavy
                  (default=False).
                - max_rowsize (int, optional): Maximum size of row in MB that is used for shared memory allocation
                  to copy data between processes. This is only used if python_multiprocessing is set to True
                  (default=16).
                - cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing
                  (default=None, which means no cache is used).
                - callbacks (DSCallback, list[DSCallback], optional): List of Dataset callbacks to be called
                  (default=None).
                - offload (bool, optional): Flag to indicate whether offload is used (default=None).
        Note:
            - Input `operations` accepts TensorOperations defined in mindspore.dataset part, plus user-defined
@@ -914,7 +932,7 @@ class Dataset:
                             "Please use '.project' operation instead.")
        return MapDataset(self, operations, input_columns, output_columns, column_order, num_parallel_workers,
                          python_multiprocessing, cache, callbacks, max_rowsize, offload)
                          **kwargs)
    @check_filter
    def filter(self, predicate, input_columns=None, num_parallel_workers=None):
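
The change above follows a common API pattern: keep the frequently used arguments explicit and collect the tuning knobs in **kwargs, forwarding them untouched to the node that consumes them. A self-contained sketch of the pattern, with illustrative names rather than the MindSpore implementation:

class FakeMapNode:
    """Stand-in consumer of the tuning knobs."""
    def __init__(self, operations, input_columns=None, python_multiprocessing=False,
                 max_rowsize=16, cache=None, callbacks=None, offload=None):
        self.operations = operations
        self.input_columns = input_columns
        self.python_multiprocessing = python_multiprocessing
        self.max_rowsize = max_rowsize
        self.cache = cache
        self.callbacks = callbacks
        self.offload = offload

def map_like(operations, input_columns=None, **kwargs):
    # The wrapper's signature stays stable when new tuning options are
    # added: they flow through **kwargs without touching this function.
    return FakeMapNode(operations, input_columns, **kwargs)

node = map_like([str.upper], input_columns=["text"], max_rowsize=32)
assert node.max_rowsize == 32 and node.cache is None

In this sketch a misspelled key surfaces as a TypeError from the consumer's signature; in the MindSpore code the kwargs are instead unpacked by the check_map validator shown below.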

View File

@@ -1316,16 +1316,28 @@ def check_shuffle(method):
    return new_method
def get_map_kwargs_from_dict(param_dict):
    """Get the map operation's kwargs parameters, falling back to the documented defaults."""
    if param_dict is None:
        param_dict = {}
    python_multiprocessing = param_dict.get("python_multiprocessing", False)
    max_rowsize = param_dict.get("max_rowsize", 16)
    cache = param_dict.get("cache", None)
    callbacks = param_dict.get("callbacks", None)
    offload = param_dict.get("offload", None)
    return python_multiprocessing, max_rowsize, cache, callbacks, offload
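
For instance, a kwargs dict that sets only some of the keys unpacks as follows (illustrative call):

params = {"python_multiprocessing": True, "max_rowsize": 32}
python_multiprocessing, max_rowsize, cache, callbacks, offload = get_map_kwargs_from_dict(params)
assert python_multiprocessing is True and max_rowsize == 32
assert cache is None and callbacks is None and offload is None

Note that dict.get with a default silently ignores misspelled keys, so a typo such as "max_rowsise" would fall back to 16 rather than raise.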
def check_map(method):
    """check the input arguments of map."""

    @wraps(method)
    def new_method(self, *args, **kwargs):
        from mindspore.dataset.callback import DSCallback
        [operations, input_columns, output_columns, column_order, num_parallel_workers, python_multiprocessing, cache,
         callbacks, max_rowsize, offload], _ = \
        [operations, input_columns, output_columns, column_order, num_parallel_workers, param_dict], _ = \
            parse_user_args(method, *args, **kwargs)
        (python_multiprocessing, max_rowsize, cache, callbacks, offload) = get_map_kwargs_from_dict(param_dict)
        # check whether network computing operators exist in the input operations (Python functions)
        # check whether the used variables and function docstrings contain computing operators
        from types import FunctionType