forked from mindspore-Ecosystem/mindspore
!45318 Fix problems in the Chinese API documentation
Merge pull request !45318 from 刘勇琪/code_docs_modify_chinese_api
Commit c10184ad88
@@ -1,7 +1,7 @@
 mindspore.dataset.ArgoverseDataset
 ====================================
 
-.. py:class:: mindspore.dataset.ArgoverseDataset(data_dir, column_names="graph", shuffle=None, num_parallel_workers=1, python_multiprocessing=True, perf_mode=True)
+.. py:class:: mindspore.dataset.ArgoverseDataset(data_dir, column_names="graph", num_parallel_workers=1, shuffle=None, python_multiprocessing=True, perf_mode=True)
 
     加载argoverse数据集并进行图(Graph)初始化。
 
@@ -16,6 +16,45 @@
     - **python_multiprocessing** (bool,可选) - 启用Python多进程模式加速运算。默认值:True。当传入 `source` 的Python对象的计算量很大时,开启此选项可能会有较好效果。
     - **perf_mode** (bool,可选) - 遍历创建的dataset对象时获得更高性能的模式(在此过程中将调用 `__getitem__` 方法)。默认值:True,将Graph的所有数据(如边的索引、节点特征和图的特征)都作为图特征进行存储。
 
+异常:
+    - **TypeError** - 如果 `data_dir` 不是str类型。
+    - **TypeError** - 如果 `num_parallel_workers` 不是int类型。
+    - **TypeError** - 如果 `shuffle` 不是bool类型。
+    - **TypeError** - 如果 `python_multiprocessing` 不是bool类型。
+    - **TypeError** - 如果 `perf_mode` 不是bool类型。
+    - **RuntimeError** - 如果 `data_dir` 无效或不存在。
+    - **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
+
+**关于Argoverse数据集:**
+
+Argoverse是第一个包含高精地图的数据集,它包含了290KM的带有几何形状和语义信息的高精度地图数据。
+
+可以将数据集文件解压缩到以下结构中,并通过MindSpore的API读取:
+
+.. code-block::
+
+    .
+    └── argoversedataset_dir
+        ├── train
+        │    ├──...
+        ├── val
+        │    └──...
+        ├── test
+        │    └──...
+
+**引用:**
+
+.. code-block::
+
+    @inproceedings{Argoverse,
+        author    = {Ming-Fang Chang and John W Lambert and Patsorn Sangkloy and Jagjeet Singh
+                     and Slawomir Bak and Andrew Hartnett and De Wang and Peter Carr
+                     and Simon Lucey and Deva Ramanan and James Hays},
+        title     = {Argoverse: 3D Tracking and Forecasting with Rich Maps},
+        booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
+        year      = {2019}
+    }
+
 .. py:method:: load()
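For a quick sanity check of the reordered signature, a minimal usage sketch (the data path is a placeholder; the column names follow the recommendation given in the English docstring further below):

```python
from mindspore import dataset as ds

# Placeholder path to one split of an unpacked Argoverse directory (train/val/test).
data_dir = "/path/to/argoversedataset_dir/train"

# Keyword order mirrors the corrected signature:
# data_dir, column_names, num_parallel_workers, shuffle, python_multiprocessing, perf_mode.
graph_dataset = ds.ArgoverseDataset(
    data_dir,
    column_names=["edge_index", "x", "y", "cluster", "valid_len", "time_step_len"],
    num_parallel_workers=1,
    shuffle=None,
    python_multiprocessing=True,
    perf_mode=True,
)

for item in graph_dataset.create_dict_iterator(output_numpy=True, num_epochs=1):
    pass  # each item is a dict keyed by the column names above
```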
@@ -3,15 +3,13 @@ mindspore.dataset.EnWik9Dataset
 
 .. py:class:: mindspore.dataset.EnWik9Dataset(dataset_dir, num_samples=None, num_parallel_workers=None, shuffle=True, num_shards=None, shard_id=None, cache=None)
 
-读取和解析EnWik9 Full和EnWik9 Polarity数据集。
+读取和解析EnWik9数据集。
 
 生成的数据集有一列 `[text]` ,数据类型为string。
 
 参数:
     - **dataset_dir** (str) - 包含数据集文件的根目录路径。
-    - **num_samples** (int, 可选) - 指定从数据集中读取的样本数。
-      对于Polarity数据集, 'train'将读取360万个训练样本, 'test'将读取40万个测试样本, 'all'将读取所有400万个样本。
-      对于Full数据集, 'train'将读取300万个训练样本, 'test'将读取65万个测试样本, 'all'将读取所有365万个样本。默认值:None,读取所有样本。
+    - **num_samples** (int, 可选) - 指定从数据集中读取的样本数。默认值:None,读取所有样本。
     - **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:None,使用mindspore.dataset.config中配置的线程数。
     - **shuffle** (Union[bool, Shuffle], 可选) - 每个epoch中数据混洗的模式,支持传入bool类型与枚举类型进行指定。默认值:True。
       如果 `shuffle` 为False,则不混洗,如果 `shuffle` 为True,等同于将 `shuffle` 设置为mindspore.dataset.Shuffle.GLOBAL。
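A minimal sketch of the corrected defaults in use (the directory is a placeholder; per the parameter description, `shuffle=True` is equivalent to `Shuffle.GLOBAL`):

```python
from mindspore import dataset as ds

# Placeholder root directory that contains the EnWik9 file.
enwik9_dir = "/path/to/enwik9_dataset_dir"

# num_samples=None (the documented default) reads every sample;
# passing the enum form makes the global shuffle explicit.
text_ds = ds.EnWik9Dataset(enwik9_dir, num_samples=None, shuffle=ds.Shuffle.GLOBAL)

for row in text_ds.create_dict_iterator(output_numpy=True, num_epochs=1):
    _ = row["text"]  # single 'text' column of type string
    break
```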
@@ -10,9 +10,7 @@ mindspore.dataset.IMDBDataset
 参数:
     - **dataset_dir** (str) - 包含数据集文件的根目录路径。
     - **usage** (str, 可选) - 指定数据集的子集,可取值为 'train', 'test'或 'all'。默认值:None,读取全部样本。
-    - **num_samples** (int, 可选) - 指定从数据集中读取的样本数。
-      对于Polarity数据集, 'train'将读取360万个训练样本, 'test'将读取40万个测试样本, 'all'将读取所有400万个样本。
-      对于Full数据集, 'train'将读取300万个训练样本, 'test'将读取65万个测试样本, 'all'将读取所有365万个样本。默认值:None,读取所有样本。
+    - **num_samples** (int, 可选) - 指定从数据集中读取的样本数。默认值:None,读取所有样本。
     - **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:None,使用mindspore.dataset.config中配置的线程数。
     - **shuffle** (bool, 可选) - 是否混洗数据集。默认值:None。下表中会展示不同参数配置的预期行为。
     - **sampler** (Sampler, 可选) - 指定从数据集中选取样本的采样器。默认值:None。下表中会展示不同配置的预期行为。
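For reference, a minimal sketch matching the corrected parameter docs (the directory is a placeholder):

```python
from mindspore import dataset as ds

# Placeholder root directory of the IMDB dataset.
imdb_dir = "/path/to/imdb_dataset_dir"

# usage selects the subset ('train', 'test' or 'all'); num_samples=None keeps every sample.
imdb_ds = ds.IMDBDataset(imdb_dir, usage="train", num_samples=None)
print(imdb_ds.get_dataset_size())
```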
@@ -33,7 +33,7 @@ mindspore.dataset.IWSLT2017Dataset
     - **RuntimeError** - 指定了 `shard_id` 参数,但是未指定 `num_shards` 参数。
     - **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
 
-**关于IWSLT2016数据集:**
+**关于IWSLT2017数据集:**
 
 IWSLT是一个专门讨论口译各个方面的重要年度科学会议。IWSLT评估活动中的MT任务被构成一个数据集,该数据集可通过 `wit3 <https://wit3.fbk.eu>`_ 公开获取。
 IWSLT2017数据集中有德语、英语、意大利语、荷兰语和罗马尼亚语,数据集包括其中任何两种语言的翻译。
@@ -21,6 +21,18 @@
     - **python_multiprocessing** (bool,可选) - 启用Python多进程模式加速运算。默认值:True。当传入 `source` 的Python对象的计算量很大时,开启此选项可能会有较好效果。
     - **max_rowsize** (int, 可选) - 指定在多进程之间复制数据时,共享内存分配的最大空间。默认值:6,单位为MB。仅当参数 `python_multiprocessing` 设为True时,此参数才会生效。
 
+异常:
+    - **TypeError** - 如果 `data_dir` 不是str类型。
+    - **TypeError** - 如果 `save_dir` 不是str类型。
+    - **TypeError** - 如果 `num_parallel_workers` 不是int类型。
+    - **TypeError** - 如果 `shuffle` 不是bool类型。
+    - **TypeError** - 如果 `python_multiprocessing` 不是bool类型。
+    - **TypeError** - 如果 `perf_mode` 不是bool类型。
+    - **RuntimeError** - 如果 `data_dir` 无效或不存在。
+    - **RuntimeError** - 指定了 `num_shards` 参数,但是未指定 `shard_id` 参数。
+    - **RuntimeError** - 指定了 `shard_id` 参数,但是未指定 `num_shards` 参数。
+    - **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
+
 .. py:method:: load()
 
     从给定(处理好的)路径加载数据,也可以在自己实现的Dataset类中实现这个方法。
@@ -60,5 +60,25 @@
       - False
       - 不允许
 
+**关于Manifest数据集:**
+
+Manifest文件包含数据集中包含的文件列表,包括文件名和文件ID等基本文件信息,以及扩展文件元数据。
+Manifest是华为ModelArts支持的数据格式文件,详细说明请参见`Manifest文档 <https://support.huaweicloud.com/engineers-modelarts/modelarts_23_0009.html>`_ 。
+
+以下是原始Manifest数据集结构。可以将数据集文件解压缩到此目录结构中,并由MindSpore的API读取。
+
+.. code-block::
+
+    .
+    └── manifest_dataset_directory
+        ├── train
+        │    ├── 1.JPEG
+        │    ├── 2.JPEG
+        │    ├── ...
+        ├── eval
+        │    ├── 1.JPEG
+        │    ├── 2.JPEG
+        │    ├── ...
+
 .. include:: mindspore.dataset.api_list_vision.rst
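To make the new section concrete, a minimal usage sketch (the manifest path is a placeholder; the sharded call mirrors the example already present in the English docstring further down):

```python
from mindspore import dataset as ds

# Placeholder path to a .manifest file exported from ModelArts.
manifest_file = "/path/to/manifest_file.manifest"

# Plain read of the samples listed in the manifest.
dataset = ds.ManifestDataset(dataset_file=manifest_file)

# Shard 0 of a 2-way distributed read, as in the docstring example below.
dataset_shard0 = ds.ManifestDataset(dataset_file=manifest_file, num_shards=2, shard_id=0)
```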
@@ -26,7 +26,7 @@
     - **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号。默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
     - **shard_equal_rows** (bool, 可选) - 分布式训练时,为所有分片获取等量的数据行数。默认值:True。
       如果 `shard_equal_rows` 为False,则可能会使得每个分片的数据条目不相等,从而导致分布式训练失败。
-      因此当每个TFRecord文件的数据数量不相等时,建议将此参数设置为True。注意,只有当指定了 `num_shards` 时才能指定此参数。
+      因此当每个MindRecord文件的数据数量不相等时,建议将此参数设置为True。注意,只有当指定了 `num_shards` 时才能指定此参数。
 
 异常:
     - **RuntimeError** - `sync_obs_path` 参数指定的目录不存在。
@@ -11,7 +11,7 @@ mindspore.dataset.SVHNDataset
     - **dataset_dir** (str) - 包含数据集文件的根目录路径。
     - **usage** (str, 可选) - 指定数据集的子集,可取值为 'train'、 'test'、 'extra'或 'all'。默认值:None,读取全部样本图片。
     - **num_samples** (int, 可选) - 指定从数据集中读取的样本数,可以小于数据集总数。默认值:None,读取全部样本图片。
-    - **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:1,使用mindspore.dataset.config中配置的线程数。
+    - **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:1。
     - **shuffle** (bool, 可选) - 是否混洗数据集。默认值:None。下表中会展示不同参数配置的预期行为。
     - **sampler** (Sampler, 可选) - 指定从数据集中选取样本的采样器。默认值:None。下表中会展示不同配置的预期行为。
     - **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数。默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
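A minimal sketch under the corrected default (the directory is a placeholder):

```python
from mindspore import dataset as ds

# Placeholder root directory holding the SVHN files.
svhn_dir = "/path/to/svhn_dataset_dir"

# usage picks 'train', 'test', 'extra' or 'all'; num_parallel_workers now documents a plain default of 1.
svhn_ds = ds.SVHNDataset(svhn_dir, usage="train", num_parallel_workers=1)
print(svhn_ds.get_dataset_size())
```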
@@ -450,6 +450,9 @@ class Dataset:
 
         Returns:
             str, JSON string of the pipeline.
 
+        Examples:
+            >>> dataset_json = dataset.to_json("/path/to/mnist_dataset_pipeline.json")
         """
         ir_tree, _ = self.create_ir_tree()
         return json.loads(ir_tree.to_json(filename))
@@ -1316,6 +1319,14 @@ class Dataset:
 
         Returns:
             Dataset, dataset for transferring.
 
+        Examples:
+            >>> data = ds.TFRecordDataset('/path/to/TF_FILES', '/path/to/TF_SCHEMA_FILE', shuffle=ds.Shuffle.FILES)
+            >>>
+            >>> data = data.device_que()
+            >>> data.send()
+            >>> time.sleep(0.1)
+            >>> data.stop_send()
         """
         return TransferDataset(self, send_epoch_end, create_data_info_queue)
@@ -1389,6 +1400,17 @@ class Dataset:
             num_files (int, optional): Number of dataset files. Default: 1.
             file_type (str, optional): Dataset format. Default: 'mindrecord'.
 
+        Examples:
+            >>> import numpy as np
+            >>>
+            >>> def generator_1d():
+            ...     for i in range(10):
+            ...         yield (np.array([i]),)
+            >>>
+            >>> # apply dataset operations
+            >>> d1 = ds.GeneratorDataset(generator_1d, ["data"], shuffle=False)
+            >>> d1.save('/path/to/save_file')
         """
         ir_tree, api_tree = self.create_ir_tree()
@@ -1689,6 +1711,39 @@ class Dataset:
                 When num_batch is None, it will default to the number specified by the
                 sync_wait operation. Default: None.
             data (Any): The data passed to the callback, user defined. Default: None.
 
+        Examples:
+            >>> import numpy as np
+            >>>
+            >>> def gen():
+            ...     for i in range(100):
+            ...         yield (np.array(i),)
+            >>>
+            >>> class Augment:
+            ...     def __init__(self, loss):
+            ...         self.loss = loss
+            ...
+            ...     def preprocess(self, input_):
+            ...         return input_
+            ...
+            ...     def update(self, data):
+            ...         self.loss = data["loss"]
+            >>>
+            >>> batch_size = 10
+            >>> dataset = ds.GeneratorDataset(gen, column_names=["input"])
+            >>> aug = Augment(0)
+            >>> dataset = dataset.sync_wait(condition_name='', num_batch=1)
+            >>> dataset = dataset.map(input_columns=["input"], operations=[aug.preprocess])
+            >>> dataset = dataset.batch(batch_size)
+            >>>
+            >>> count = 0
+            >>> for data in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
+            ...     count += 1
+            ...     data = {"loss": count}
+            ...     dataset.sync_update(condition_name="", data=data)
         """
         if (not isinstance(num_batch, int) and num_batch is not None) or \
                 (isinstance(num_batch, int) and num_batch <= 0):
@@ -1761,7 +1816,18 @@ class Dataset:
         return {}
 
     def reset(self):
-        """Reset the dataset for next epoch."""
+        """
+        Reset the dataset for next epoch.
+
+        Examples:
+            >>> mind_dataset_dir = ["/path/to/mind_dataset_file"]
+            >>> data_set = ds.MindDataset(dataset_files=mind_dataset_dir)
+            >>> for _ in range(5):
+            ...     num_iter = 0
+            ...     for data in data_set.create_tuple_iterator(num_epochs=1, output_numpy=True):
+            ...         num_iter += 1
+            ...     data_set.reset()
+        """
 
     def is_shuffled(self):
         """Returns True if the dataset or its children is shuffled."""
@@ -3797,6 +3863,12 @@ class Schema:
 
         Raises:
             ValueError: If column type is unknown.
 
+        Examples:
+            >>> from mindspore import dtype as mstype
+            >>>
+            >>> schema = ds.Schema()
+            >>> schema.add_column('col_1d', de_type=mstype.int64, shape=[2])
         """
         if isinstance(de_type, typing.Type):
             de_type = mstype_to_detype(de_type)
@@ -3841,6 +3913,12 @@ class Schema:
 
         Returns:
             str, JSON string of the schema.
 
+        Examples:
+            >>> from mindspore.dataset import Schema
+            >>>
+            >>> schema1 = ds.Schema()
+            >>> schema2 = schema1.to_json()
         """
         return self.cpp_schema.to_json()
@@ -3855,6 +3933,16 @@ class Schema:
             RuntimeError: if there is unknown item in the object.
             RuntimeError: if dataset type is missing in the object.
             RuntimeError: if columns are missing in the object.
 
+        Examples:
+            >>> import json
+            >>>
+            >>> from mindspore.dataset import Schema
+            >>>
+            >>> with open("/path/to/schema_file") as file:
+            ...     json_obj = json.load(file)
+            ...     schema = ds.Schema()
+            ...     schema.from_json(json_obj)
         """
         self.cpp_schema.from_string(json.dumps(json_obj, indent=2))
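Taken together, the three new Schema examples describe a round trip; a small consolidated sketch (the column name and shape are illustrative):

```python
import json
from mindspore import dataset as ds
from mindspore import dtype as mstype

# Build a schema with one int64 column of shape [2], as in the add_column example.
schema = ds.Schema()
schema.add_column('col_1d', de_type=mstype.int64, shape=[2])

# Serialize to a JSON string, then feed the parsed object back through from_json.
schema_json = schema.to_json()
restored = ds.Schema()
restored.from_json(json.loads(schema_json))
```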
@@ -647,17 +647,14 @@ class DBpediaDataset(SourceDataset, TextBaseDataset):
 
 class EnWik9Dataset(SourceDataset, TextBaseDataset):
     """
-    A source dataset that reads and parses EnWik9 Polarity and EnWik9 Full datasets.
+    A source dataset that reads and parses EnWik9 datasets.
 
     The generated dataset has one column :py:obj:`[text]` with type string.
 
     Args:
         dataset_dir (str): Path to the root directory that contains the dataset.
         num_samples (int, optional): The number of samples to be included in the dataset.
-            For Polarity dataset, 'train' will read from 3,600,000 train samples, 'test' will read from 400,000 test
-            samples, 'all' will read from all 4,000,000 samples.
-            For Full dataset, 'train' will read from 3,000,000 train samples, 'test' will read from 650,000 test
-            samples, 'all' will read from all 3,650,000 samples. Default: None, will include all samples.
+            Default: None, will include all samples.
         num_parallel_workers (int, optional): Number of workers to read the data.
             Default: None, number set in the mindspore.dataset.config.
         shuffle (Union[bool, Shuffle], optional): Perform reshuffling of the data every epoch.
@@ -744,9 +741,6 @@ class IMDBDataset(MappableDataset, TextBaseDataset):
         usage (str, optional): Usage of this dataset, can be 'train', 'test' or 'all'.
             Default: None, will read all samples.
         num_samples (int, optional): The number of images to be included in the dataset.
-            For Polarity dataset, 'train' will read from 3,600,000 train samples, 'test' will read from 400,000 test
-            samples, 'all' will read from all 4,000,000 samples. For Full dataset, 'train' will read from 3,000,000
-            train samples, 'test' will read from 650,000 test samples, 'all' will read from all 3,650,000 samples.
             Default: None, will include all samples.
         num_parallel_workers (int, optional): Number of workers to read the data.
             Default: None, number set in the mindspore.dataset.config.
@@ -3114,6 +3114,26 @@ class ManifestDataset(MappableDataset, VisionBaseDataset):
         >>>
         >>> # 2) Read samples (specified in manifest_file.manifest) for shard 0 in a 2-way distributed training setup
         >>> dataset = ds.ManifestDataset(dataset_file=manifest_dataset_dir, num_shards=2, shard_id=0)
 
+    About Manifest dataset:
+
+    Manifest file contains a list of files included in a dataset, including basic file info such as File name and File
+    ID, along with extended file metadata. Manifest is a data format file supported by Huawei Modelarts. For details,
+    see `Specifications for Importing the Manifest File <https://support.huaweicloud.com/engineers-modelarts/
+    modelarts_23_0009.html>`_ .
+
+    .. code-block::
+
+        .
+        └── manifest_dataset_directory
+            ├── train
+            │    ├── 1.JPEG
+            │    ├── 2.JPEG
+            │    ├── ...
+            ├── eval
+            │    ├── 1.JPEG
+            │    ├── 2.JPEG
+            │    ├── ...
     """
 
     @check_manifestdataset
@@ -198,6 +198,13 @@ class GraphData:
         Returns:
             numpy.ndarray, array of nodes.
 
+        Examples:
+            >>> from mindspore.dataset import GraphData
+            >>>
+            >>> g = ds.GraphData("/path/to/testdata", 1)
+            >>> edges = g.get_all_edges(0)
+            >>> nodes = g.get_nodes_from_edges(edges)
+
         Raises:
             TypeError: If `edge_list` is not list or ndarray.
         """
@@ -488,6 +495,12 @@ class GraphData:
         Returns:
             dict, meta information of the graph. The key is node_type, edge_type, node_num, edge_num,
             node_feature_type and edge_feature_type.
 
+        Examples:
+            >>> from mindspore.dataset import GraphData
+            >>>
+            >>> g = ds.GraphData("/path/to/testdata", 2)
+            >>> graph_info = g.graph_info()
         """
         if self._working_mode == 'server':
             raise Exception("This method is not supported when working mode is server.")
@@ -1282,17 +1295,29 @@ class InMemoryGraphDataset(GeneratorDataset):
             Default: 'graph'.
         num_samples (int, optional): The number of samples to be included in the dataset. Default: None, all samples.
         num_parallel_workers (int, optional): Number of subprocesses used to fetch the dataset in parallel. Default: 1.
-        shuffle (bool, optional): Whether or not to perform shuffle on the dataset.
-            Default: None, expected order behavior shown in the table below.
+        shuffle (bool, optional): Whether or not to perform shuffle on the dataset. This parameter can only be
+            specified when the implemented dataset has a random access attribute ( `__getitem__` ). Default: None.
         num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
             When this argument is specified, `num_samples` reflects the max
             sample number of per shard.
         shard_id (int, optional): The shard ID within `num_shards` . Default: None. This argument must be specified only
-            when num_shards is also specified.
+            when `num_shards` is also specified.
         python_multiprocessing (bool, optional): Parallelize Python operations with multiple worker process. This
             option could be beneficial if the Python operation is computational heavy. Default: True.
         max_rowsize(int, optional): Maximum size of row in MB that is used for shared memory allocation to copy
             data between processes. This is only used if python_multiprocessing is set to True. Default: 6 MB.
 
+    Raises:
+        TypeError: If `data_dir` is not of type str.
+        TypeError: If `save_dir` is not of type str.
+        TypeError: If `num_parallel_workers` is not of type int.
+        TypeError: If `shuffle` is not of type bool.
+        TypeError: If `python_multiprocessing` is not of type bool.
+        TypeError: If `perf_mode` is not of type bool.
+        RuntimeError: If `data_dir` is not valid or does not exit.
+        RuntimeError: If `num_shards` is specified but `shard_id` is None.
+        RuntimeError: If `shard_id` is specified but `num_shards` is None.
+        ValueError: If `num_parallel_workers` exceeds the max thread numbers.
+
     Examples:
         >>> from mindspore.dataset import InMemoryGraphDataset, Graph
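InMemoryGraphDataset is meant to be subclassed; a sketch of the documented extension points, assuming a toy dataset (the class name, parsed content, and returned column are illustrative, not part of this change):

```python
import numpy as np
from mindspore.dataset import InMemoryGraphDataset

class ToyGraphDataset(InMemoryGraphDataset):
    """Hypothetical subclass: `process` parses the raw files under `data_dir`,
    while `__getitem__`/`__len__` give the random access that makes `shuffle` meaningful."""

    def __init__(self, data_dir):
        super().__init__(data_dir, column_names=["graph"], num_parallel_workers=1, shuffle=None)

    def process(self):
        # A real subclass would parse the raw files in `data_dir` here (the base class is
        # expected to invoke this hook); this toy version keeps a few dummy node counts.
        self.node_counts = [3, 5, 7]

    def __getitem__(self, index):
        return (np.array(self.node_counts[index]),)

    def __len__(self):
        return len(self.node_counts)
```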
@@ -1381,19 +1406,28 @@ class ArgoverseDataset(InMemoryGraphDataset):
     Args:
         data_dir (str): directory for loading dataset, here contains origin format data and will be loaded in
             `process` method.
-        column_names (Union[str, list[str]], optional): single column name or list of column names of the dataset,
-            num of column name should be equal to num of item in return data when implement method like `__getitem__` ,
-            recommend to specify it with
+        column_names (Union[str, list[str]], optional): single column name or list of column names of the dataset.
+            Default: "graph". Num of column name should be equal to num of item in return data when implement method
+            like `__getitem__`, recommend to specify it with
             `column_names=["edge_index", "x", "y", "cluster", "valid_len", "time_step_len"]` like the following example.
         num_parallel_workers (int, optional): Number of subprocesses used to fetch the dataset in parallel. Default: 1.
-        shuffle (bool, optional): Whether or not to perform shuffle on the dataset.
-            Default: None, expected order behavior shown in the table below.
+        shuffle (bool, optional): Whether or not to perform shuffle on the dataset. This parameter can only be
+            specified when the implemented dataset has a random access attribute ( `__getitem__` ). Default: None.
         python_multiprocessing (bool, optional): Parallelize Python operations with multiple worker process. This
             option could be beneficial if the Python operation is computational heavy. Default: True.
         perf_mode(bool, optional): mode for obtaining higher performance when iterate created dataset(will call
             `__getitem__` method in this process). Default True, will save all the data in graph
             (like edge index, node feature and graph feature) into graph feature.
 
+    Raises:
+        TypeError: If `data_dir` is not of type str.
+        TypeError: If `num_parallel_workers` is not of type int.
+        TypeError: If `shuffle` is not of type bool.
+        TypeError: If `python_multiprocessing` is not of type bool.
+        TypeError: If `perf_mode` is not of type bool.
+        RuntimeError: If `data_dir` is not valid or does not exit.
+        ValueError: If `num_parallel_workers` exceeds the max thread numbers.
+
     Examples:
         >>> from mindspore.dataset import ArgoverseDataset
         >>>
@@ -1403,6 +1437,37 @@ class ArgoverseDataset(InMemoryGraphDataset):
         ...                                  "time_step_len"])
         >>> for item in graph_dataset.create_dict_iterator(output_numpy=True, num_epochs=1):
         ...     pass
 
+    About Argoverse Dataset:
+
+    Argverse is the first dataset containing high-precision maps, which contains 290KM high-precision map data with
+    geometric shape and semantic information.
+
+    You can unzip the dataset files into the following structure and read by MindSpore's API:
+
+    .. code-block::
+
+        .
+        └── argoverse_dataset_dir
+            ├── train
+            │    ├──...
+            ├── val
+            │    └──...
+            ├── test
+            │    └──...
+
+    Citation:
+
+    .. code-block::
+
+        @inproceedings{Argoverse,
+            author    = {Ming-Fang Chang and John W Lambert and Patsorn Sangkloy and Jagjeet Singh
+                         and Slawomir Bak and Andrew Hartnett and De Wang and Peter Carr
+                         and Simon Lucey and Deva Ramanan and James Hays},
+            title     = {Argoverse: 3D Tracking and Forecasting with Rich Maps},
+            booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
+            year      = {2019}
+        }
     """

    def __init__(self, data_dir, column_names="graph", num_parallel_workers=1, shuffle=None,