forked from mindspore-Ecosystem/mindspore
!41615 reconstruct map parameters
Merge pull request !41615 from guozhijian/reconstruct_map
This commit is contained in:
commit 8a983e1a06
Four binary image files added (2.0 KiB, 2.3 KiB, 2.9 KiB, and 3.2 KiB); content not shown.
@@ -1,4 +1,4 @@
-.. py:method:: map(operations, input_columns=None, output_columns=None, column_order=None, num_parallel_workers=None, python_multiprocessing=False, cache=None, callbacks=None, max_rowsize=16, offload=None)
+.. py:method:: map(operations, input_columns=None, output_columns=None, column_order=None, num_parallel_workers=None, **kwargs)
 
     Apply a list of data transforms to the dataset object in order.
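The signature change above folds five optional parameters into `**kwargs` while the common parameters stay explicit. A minimal pure-Python sketch (the `map_old`/`map_new` names are illustrative, not the MindSpore implementation) of why existing keyword calls keep working:

```python
# Hypothetical sketch: folding optional keyword parameters into **kwargs
# while staying call-compatible with the old explicit signature.
def map_old(operations, num_parallel_workers=None, python_multiprocessing=False,
            max_rowsize=16, offload=None):
    return dict(operations=operations, num_parallel_workers=num_parallel_workers,
                python_multiprocessing=python_multiprocessing,
                max_rowsize=max_rowsize, offload=offload)


def map_new(operations, num_parallel_workers=None, **kwargs):
    # Defaults for the folded parameters now live behind the signature.
    return dict(operations=operations, num_parallel_workers=num_parallel_workers,
                python_multiprocessing=kwargs.get("python_multiprocessing", False),
                max_rowsize=kwargs.get("max_rowsize", 16),
                offload=kwargs.get("offload", None))


# The same keyword call produces the same result against both signatures.
old = map_old(["op"], python_multiprocessing=True)
new = map_new(["op"], python_multiprocessing=True)
```

Callers never passed these options positionally in practice, which is what makes the fold backward-compatible for keyword use.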
@@ -8,17 +8,30 @@
 
     The output columns of the last transform are named by `output_columns`; if `output_columns` is not specified, the output column names are the same as those of `input_columns`.
 
+    - If you use the transforms provided by `mindspore` `dataset` (
+      `vision transforms <https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore.dataset.vision.html>`_,
+      `nlp transforms <https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore.dataset.text.html>`_,
+      `audio transforms <https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore.dataset.audio.html>`_), please use the following parameters:
+
+      .. image:: map_parameter_cn.png
+
+    - If you use a user-defined PyFunc transform, please use the following parameters:
+
+      .. image:: map_parameter_pyfunc_cn.png
+
     Parameters:
         - **operations** (Union[list[TensorOperation], list[functions]]) - A list of transform operations to apply; supports dataset transform operators and user-defined Python callable objects. The map operation applies the transforms to the dataset object in order.
         - **input_columns** (Union[str, list[str]], optional) - Input columns of the first transform. The length of this list must match the number of input columns expected by the first transform in `operations`. Default: None, all data columns are passed to the first transform.
         - **output_columns** (Union[str, list[str]], optional) - Output columns of the last transform. This parameter is required when the length of `input_columns` differs from that of `output_columns`. The length of this list must match the number of output columns of the last transform. Default: None, output columns keep the same names as the input columns.
         - **column_order** (Union[str, list[str]], optional) - Order of the data columns passed to the next dataset operation. This parameter is required when the length of `input_columns` differs from that of `output_columns`. Note: the columns here are not limited to those named in `input_columns` and `output_columns`; they may also be unprocessed columns output by the previous operation. Default: None, keep the original column order.
         - **num_parallel_workers** (int, optional) - Number of parallel workers (processes/threads) for the map operation, used to speed up processing. Default: None, use the value set by `set_num_parallel_workers`.
-        - **python_multiprocessing** (bool, optional) - Use Python multiprocessing mode to speed up the map operation. This may help when the given `operations` are computationally heavy. Default: False.
-        - **cache** (DatasetCache, optional) - Single-node data cache service used to speed up dataset processing; see `single-node data cache <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ . Default: None, no cache is used.
-        - **callbacks** (DSCallback, list[DSCallback], optional) - List of Dataset callbacks to be called. Default: None.
-        - **max_rowsize** (int, optional) - Maximum amount of shared memory, in MB, allocated for copying data between processes; only effective when `python_multiprocessing` is True. Default: 16.
-        - **offload** (bool, optional) - Whether to use heterogeneous hardware acceleration; see `heterogeneous acceleration for data preparation <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/dataset_offload.html>`_ . Default: None.
+        - **\*\*kwargs** - Other parameters.
+
+          - python_multiprocessing (bool, optional) - Use Python multiprocessing mode to speed up the map operation. This may help when the given `operations` are computationally heavy. Default: False.
+          - max_rowsize (int, optional) - Maximum amount of shared memory, in MB, allocated for copying data between processes; only effective when `python_multiprocessing` is True. Default: 16.
+          - cache (DatasetCache, optional) - Single-node data cache service used to speed up dataset processing; see `single-node data cache <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ . Default: None, no cache is used.
+          - callbacks (DSCallback, list[DSCallback], optional) - List of Dataset callbacks to be called. Default: None.
+          - offload (bool, optional) - Whether to use heterogeneous hardware acceleration; see `heterogeneous acceleration for data preparation <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/dataset_offload.html>`_ . Default: None.
 
     .. note::
         - The `operations` parameter accepts data processing operations of type `TensorOperation`, as well as user-defined Python functions (PyFuncs).
@@ -767,8 +767,7 @@ class Dataset:
 
     @check_map
     def map(self, operations, input_columns=None, output_columns=None, column_order=None,
-            num_parallel_workers=None, python_multiprocessing=False, cache=None, callbacks=None,
-            max_rowsize=16, offload=None):
+            num_parallel_workers=None, **kwargs):
         """
         Apply each operation in operations to this dataset.
@@ -780,6 +779,18 @@ class Dataset:
         The columns outputted by the very last operation will be assigned names specified by
         `output_columns`, and if not specified, the column name of output column is same as that of `input_columns`.
 
+        - If you use transformations (
+          `vision transform <https://www.mindspore.cn/docs/en/master/api_python/mindspore.dataset.vision.html>`_,
+          `nlp transform <https://www.mindspore.cn/docs/en/master/api_python/mindspore.dataset.text.html>`_,
+          `audio transform <https://www.mindspore.cn/docs/en/master/api_python/mindspore.dataset.audio.html>`_)
+          provided by mindspore dataset, please use the following parameters:
+
+          .. image:: map_parameter_en.png
+
+        - If you use user-defined transform as PyFunc (Python Func), please use the following parameters:
+
+          .. image:: map_parameter_pyfunc_en.png
+
         Args:
             operations (Union[list[TensorOperation], list[functions]]): List of operations to be
                 applied on the dataset. Operations are applied in the order they appear in this list.
@@ -798,14 +809,21 @@ class Dataset:
                 Caution: the list here is not just the columns specified in parameter input_columns and output_columns.
             num_parallel_workers (int, optional): Number of threads used to process the dataset in
                 parallel (default=None, the value from the configuration will be used).
-            python_multiprocessing (bool, optional): Parallelize Python operations with multiple worker processes. This
-                option could be beneficial if the Python operation is computational heavy (default=False).
-            cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing.
-                (default=None, which means no cache is used).
-            callbacks (DSCallback, list[DSCallback], optional): List of Dataset callbacks to be called (Default=None).
-            max_rowsize (int, optional): Maximum size of row in MB that is used for shared memory allocation to copy
-                data between processes. This is only used if python_multiprocessing is set to True (Default=16).
-            offload (bool, optional): Flag to indicate whether offload is used (Default=None).
+            **kwargs:
+
+                - python_multiprocessing (bool, optional): Parallelize Python operations with multiple worker processes.
+                  This option could be beneficial if the Python operation is computational heavy (default=False).
+
+                - max_rowsize (int, optional): Maximum size of row in MB that is used for shared memory allocation to
+                  copy data between processes. This is only used if python_multiprocessing is set to True (Default=16).
+
+                - cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing.
+                  (default=None, which means no cache is used).
+
+                - callbacks (DSCallback, list[DSCallback], optional): List of Dataset callbacks to be called
+                  (Default=None).
+
+                - offload (bool, optional): Flag to indicate whether offload is used (Default=None).
 
         Note:
             - Input `operations` accepts TensorOperations defined in mindspore.dataset part, plus user-defined
@@ -914,7 +932,7 @@ class Dataset:
                                 "Please use '.project' operation instead.")
 
         return MapDataset(self, operations, input_columns, output_columns, column_order, num_parallel_workers,
-                          python_multiprocessing, cache, callbacks, max_rowsize, offload)
+                          **kwargs)
 
     @check_filter
     def filter(self, predicate, input_columns=None, num_parallel_workers=None):
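The hunk above hands `**kwargs` to the `MapDataset` constructor unchanged. A standalone sketch of that forwarding pattern (the `fake_map` and `FakeMapDataset` names are illustrative only):

```python
# Hypothetical sketch of forwarding keyword arguments through a wrapper
# into a consumer class, mirroring the map -> MapDataset pass-through.
class FakeMapDataset:
    def __init__(self, operations, python_multiprocessing=False, max_rowsize=16):
        self.operations = operations
        self.python_multiprocessing = python_multiprocessing
        self.max_rowsize = max_rowsize


def fake_map(operations, **kwargs):
    # Unrecognized keys raise TypeError inside the constructor,
    # so argument validation stays in one place.
    return FakeMapDataset(operations, **kwargs)


ds = fake_map(["op"], max_rowsize=32)
```

One design consequence: the wrapper no longer needs updating when the consumer grows a new optional parameter.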
@@ -1316,16 +1316,28 @@ def check_shuffle(method):
     return new_method
 
 
+def get_map_kwargs_from_dict(param_dict):
+    """Get map operation kwargs parameters."""
+    param_dict = param_dict if param_dict is not None else {}
+    python_multiprocessing = param_dict.get("python_multiprocessing", False)
+    max_rowsize = param_dict.get("max_rowsize", 16)
+    cache = param_dict.get("cache", None)
+    callbacks = param_dict.get("callbacks", None)
+    offload = param_dict.get("offload", None)
+    return python_multiprocessing, max_rowsize, cache, callbacks, offload
+
+
 def check_map(method):
     """Check the input arguments of map."""
 
     @wraps(method)
     def new_method(self, *args, **kwargs):
         from mindspore.dataset.callback import DSCallback
-        [operations, input_columns, output_columns, column_order, num_parallel_workers, python_multiprocessing, cache,
-         callbacks, max_rowsize, offload], _ = \
+        [operations, input_columns, output_columns, column_order, num_parallel_workers, param_dict], _ = \
             parse_user_args(method, *args, **kwargs)
 
+        (python_multiprocessing, max_rowsize, cache, callbacks, offload) = get_map_kwargs_from_dict(param_dict)
+
         # check whether a network computing operator exists in the input operations (python function)
         # check whether the used variables and function docstrings contain a computing operator
         from types import FunctionType
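`get_map_kwargs_from_dict` above restores the old defaults when pulling the optional parameters out of the kwargs dict. A standalone sketch of the same extraction pattern (the `extract_map_kwargs` name is illustrative and independent of MindSpore):

```python
def extract_map_kwargs(param_dict):
    """Pull map's optional parameters out of a kwargs dict, applying the
    defaults the old explicit signature used; tolerate a None dict."""
    param_dict = param_dict if param_dict is not None else {}
    python_multiprocessing = param_dict.get("python_multiprocessing", False)
    max_rowsize = param_dict.get("max_rowsize", 16)
    cache = param_dict.get("cache", None)
    callbacks = param_dict.get("callbacks", None)
    offload = param_dict.get("offload", None)
    return python_multiprocessing, max_rowsize, cache, callbacks, offload


# No kwargs supplied yields every default; supplied keys override them.
defaults = extract_map_kwargs(None)
override = extract_map_kwargs({"max_rowsize": 64, "offload": True})
```

Guarding against a None dict matters because `dict.get` on an unassigned variable would otherwise raise before any default is applied.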