!45318 Fix the problem of Chinese API document

Merge pull request !45318 from 刘勇琪/code_docs_modify_chinese_api
This commit is contained in:
i-robot 2022-12-09 02:59:32 +00:00 committed by Gitee
commit c10184ad88
No known key found for this signature in database
GPG Key ID: 173E9B9CA92EEF8F
12 changed files with 263 additions and 29 deletions

View File

@ -1,7 +1,7 @@
mindspore.dataset.ArgoverseDataset
====================================
.. py:class:: mindspore.dataset.ArgoverseDataset(data_dir, column_names="graph", shuffle=None, num_parallel_workers=1, python_multiprocessing=True, perf_mode=True)
.. py:class:: mindspore.dataset.ArgoverseDataset(data_dir, column_names="graph", num_parallel_workers=1, shuffle=None, python_multiprocessing=True, perf_mode=True)
Load the Argoverse dataset and perform graph (Graph) initialization.
@ -16,6 +16,45 @@
- **python_multiprocessing** (bool, optional) - Parallelize Python operations with multiple worker processes. Default: True. Enabling this option may be beneficial when the Python objects passed to `source` are computationally heavy.
- **perf_mode** (bool, optional) - Mode for obtaining higher performance when iterating over the created dataset (the `__getitem__` method is called during this process). Default: True, stores all data of the Graph (such as edge indices, node features and graph features) as graph features.
Raises:
- **TypeError** - If `data_dir` is not of type str.
- **TypeError** - If `num_parallel_workers` is not of type int.
- **TypeError** - If `shuffle` is not of type bool.
- **TypeError** - If `python_multiprocessing` is not of type bool.
- **TypeError** - If `perf_mode` is not of type bool.
- **RuntimeError** - If `data_dir` is invalid or does not exist.
- **ValueError** - If `num_parallel_workers` exceeds the maximum number of threads on the system.
**About the Argoverse dataset:**
Argoverse is the first dataset containing high-precision maps; it contains 290 km of high-precision map data with geometric shape and semantic information.
You can unzip the dataset files into the following directory structure and read them with MindSpore's API:
.. code-block::
.
└── argoversedataset_dir
├── train
│ ├──...
├── val
│ └──...
├── test
│ └──...
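A minimal usage sketch (the path below is a placeholder; the column names mirror the English docstring example later in this commit):

.. code-block:: python

    import mindspore.dataset as ds

    # Load the training split and name the columns returned for each graph sample.
    graph_dataset = ds.ArgoverseDataset(data_dir="/path/to/argoversedataset_dir/train",
                                        column_names=["edge_index", "x", "y", "cluster",
                                                      "valid_len", "time_step_len"])
    # Iterate over the processed graphs as NumPy data.
    for item in graph_dataset.create_dict_iterator(output_numpy=True, num_epochs=1):
        pass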
**Citation:**
.. code-block::
@inproceedings{Argoverse,
author = {Ming-Fang Chang and John W Lambert and Patsorn Sangkloy and Jagjeet Singh
and Slawomir Bak and Andrew Hartnett and De Wang and Peter Carr
and Simon Lucey and Deva Ramanan and James Hays},
title = {Argoverse: 3D Tracking and Forecasting with Rich Maps},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2019}
}
.. py:method:: load()

View File

@ -3,15 +3,13 @@ mindspore.dataset.EnWik9Dataset
.. py:class:: mindspore.dataset.EnWik9Dataset(dataset_dir, num_samples=None, num_parallel_workers=None, shuffle=True, num_shards=None, shard_id=None, cache=None)
Read and parse the EnWik9 Full and EnWik9 Polarity datasets.
Read and parse the EnWik9 dataset.
The generated dataset has one column `[text]` of type string.
Parameters:
- **dataset_dir** (str) - Path to the root directory that contains the dataset files.
- **num_samples** (int, optional) - The number of samples to be read from the dataset.
For the Polarity dataset, 'train' reads 3,600,000 training samples, 'test' reads 400,000 test samples, and 'all' reads all 4,000,000 samples.
For the Full dataset, 'train' reads 3,000,000 training samples, 'test' reads 650,000 test samples, and 'all' reads all 3,650,000 samples. Default: None, read all samples.
- **num_samples** (int, optional) - The number of samples to be read from the dataset. Default: None, read all samples.
- **num_parallel_workers** (int, optional) - Number of worker threads used to read the data. Default: None, use the number of threads configured in mindspore.dataset.config.
- **shuffle** (Union[bool, Shuffle], optional) - Mode for shuffling the data within each epoch; both bool values and Shuffle enum values are supported. Default: True.
If `shuffle` is False, no shuffling is performed; if `shuffle` is True, it is equivalent to setting `shuffle` to mindspore.dataset.Shuffle.GLOBAL.
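A minimal reading sketch (the directory path below is a placeholder):

.. code-block:: python

    import mindspore.dataset as ds

    # Read every sample of the text column with the default global shuffle.
    dataset = ds.EnWik9Dataset(dataset_dir="/path/to/en_wik9_dataset_dir")
    for item in dataset.create_dict_iterator(output_numpy=True, num_epochs=1):
        pass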

View File

@ -10,9 +10,7 @@ mindspore.dataset.IMDBDataset
Parameters:
- **dataset_dir** (str) - Path to the root directory that contains the dataset files.
- **usage** (str, optional) - Subset of the dataset to be used; can be 'train', 'test' or 'all'. Default: None, read all samples.
- **num_samples** (int, optional) - The number of samples to be read from the dataset.
For the Polarity dataset, 'train' reads 3,600,000 training samples, 'test' reads 400,000 test samples, and 'all' reads all 4,000,000 samples.
For the Full dataset, 'train' reads 3,000,000 training samples, 'test' reads 650,000 test samples, and 'all' reads all 3,650,000 samples. Default: None, read all samples.
- **num_samples** (int, optional) - The number of samples to be read from the dataset. Default: None, read all samples.
- **num_parallel_workers** (int, optional) - Number of worker threads used to read the data. Default: None, use the number of threads configured in mindspore.dataset.config.
- **shuffle** (bool, optional) - Whether to shuffle the dataset. Default: None. The table below shows the expected behavior for different configurations.
- **sampler** (Sampler, optional) - Sampler used to select samples from the dataset. Default: None. The table below shows the expected behavior for different configurations.
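A minimal reading sketch (the directory path below is a placeholder):

.. code-block:: python

    import mindspore.dataset as ds

    # Read only the training subset of the IMDB reviews.
    dataset = ds.IMDBDataset(dataset_dir="/path/to/imdb_dataset_dir", usage="train")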

View File

@ -33,7 +33,7 @@ mindspore.dataset.IWSLT2017Dataset
- **RuntimeError** - If `shard_id` is specified but `num_shards` is not.
- **ValueError** - If `num_parallel_workers` exceeds the maximum number of threads on the system.
**About the IWSLT2016 dataset:**
**About the IWSLT2017 dataset:**
IWSLT is a major annual scientific conference dedicated to all aspects of spoken language translation. The MT task of the IWSLT evaluation campaign constitutes a dataset that is publicly available through `wit3 <https://wit3.fbk.eu>`_ .
The IWSLT2017 dataset covers German, English, Italian, Dutch and Romanian, and includes translations between any two of these languages.
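A minimal reading sketch (the directory path is a placeholder, and the `language_pair` argument is assumed from the full API rather than from this excerpt):

.. code-block:: python

    import mindspore.dataset as ds

    # Read the German-to-English training split.
    dataset = ds.IWSLT2017Dataset(dataset_dir="/path/to/iwslt2017_dataset_dir",
                                  usage="train", language_pair=["de", "en"])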

View File

@ -21,6 +21,18 @@
- **python_multiprocessing** (bool, optional) - Parallelize Python operations with multiple worker processes. Default: True. Enabling this option may be beneficial when the Python objects passed to `source` are computationally heavy.
- **max_rowsize** (int, optional) - Maximum space in MB allocated in shared memory for copying data between processes. Default: 6. This parameter takes effect only when `python_multiprocessing` is set to True.
Raises:
- **TypeError** - If `data_dir` is not of type str.
- **TypeError** - If `save_dir` is not of type str.
- **TypeError** - If `num_parallel_workers` is not of type int.
- **TypeError** - If `shuffle` is not of type bool.
- **TypeError** - If `python_multiprocessing` is not of type bool.
- **TypeError** - If `perf_mode` is not of type bool.
- **RuntimeError** - If `data_dir` is invalid or does not exist.
- **RuntimeError** - If `num_shards` is specified but `shard_id` is not.
- **RuntimeError** - If `shard_id` is specified but `num_shards` is not.
- **ValueError** - If `num_parallel_workers` exceeds the maximum number of threads on the system.
.. py:method:: load()
Load the data from the given processed path; this method can also be implemented in a user-defined Dataset class.
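A rough sketch of a subclass that overrides `process` and `load` (the `Graph` construction and the `graphs` attribute are assumed from the upstream InMemoryGraphDataset examples; the parsing logic is only illustrative):

.. code-block:: python

    import numpy as np
    from mindspore.dataset import InMemoryGraphDataset, Graph

    class MyGraphDataset(InMemoryGraphDataset):
        def __init__(self, data_dir):
            super().__init__(data_dir)

        def process(self):
            # Build Graph objects from the raw files under `data_dir` (illustrative only).
            edges = np.array([[0, 1], [1, 2]], dtype=np.int32)
            self.graphs.append(Graph(edges=edges))

        def load(self):
            # Reload the graphs previously produced by `process` from the processed path.
            ...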

View File

@ -60,5 +60,25 @@
- False
- Not allowed
**About the Manifest dataset:**
A Manifest file contains a list of the files included in a dataset, along with basic file information such as the file name and file ID, as well as extended file metadata.
Manifest is a data format file supported by Huawei ModelArts. For details, see the `Manifest documentation <https://support.huaweicloud.com/engineers-modelarts/modelarts_23_0009.html>`_ .
The following is the original structure of a Manifest dataset. You can unzip the dataset files into this directory structure and read them with MindSpore's API.
.. code-block::
.
└── manifest_dataset_directory
├── train
│ ├── 1.JPEG
│ ├── 2.JPEG
│ ├── ...
├── eval
│ ├── 1.JPEG
│ ├── 2.JPEG
│ ├── ...
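A minimal reading sketch (the manifest file path is a placeholder; the sharding arguments mirror the English docstring example later in this commit):

.. code-block:: python

    import mindspore.dataset as ds

    # Read shard 0 of 2 from the samples listed in the manifest file.
    dataset = ds.ManifestDataset(dataset_file="/path/to/manifest_file.manifest",
                                 num_shards=2, shard_id=0)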
.. include:: mindspore.dataset.api_list_vision.rst

View File

@ -26,7 +26,7 @@
- **shard_id** (int, optional) - The shard ID used in distributed training. Default: None. This parameter can only be specified when `num_shards` is also specified.
- **shard_equal_rows** (bool, optional) - Get an equal number of data rows for each shard in distributed training. Default: True.
If `shard_equal_rows` is False, the number of data entries in each shard may differ, which may cause distributed training to fail.
Therefore, when the amount of data in each TFRecord file differs, it is recommended to set this parameter to True. Note that it can only be specified when `num_shards` is also specified.
Therefore, when the amount of data in each MindRecord file differs, it is recommended to set this parameter to True. Note that it can only be specified when `num_shards` is also specified.
Raises:
- **RuntimeError** - If the directory specified by `sync_obs_path` does not exist.

View File

@ -11,7 +11,7 @@ mindspore.dataset.SVHNDataset
- **dataset_dir** (str) - Path to the root directory that contains the dataset files.
- **usage** (str, optional) - Subset of the dataset to be used; can be 'train', 'test', 'extra' or 'all'. Default: None, read all sample images.
- **num_samples** (int, optional) - The number of samples to be read from the dataset; can be smaller than the total number of samples. Default: None, read all sample images.
- **num_parallel_workers** (int, optional) - Number of worker threads used to read the data. Default: 1, use the number of threads configured in mindspore.dataset.config
- **num_parallel_workers** (int, optional) - Number of worker threads used to read the data. Default: 1.
- **shuffle** (bool, optional) - Whether to shuffle the dataset. Default: None. The table below shows the expected behavior for different configurations.
- **sampler** (Sampler, optional) - Sampler used to select samples from the dataset. Default: None. The table below shows the expected behavior for different configurations.
- **num_shards** (int, optional) - Number of shards into which the dataset will be divided for distributed training. Default: None. When this parameter is specified, `num_samples` indicates the maximum number of samples per shard.
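A minimal reading sketch (the directory path below is a placeholder):

.. code-block:: python

    import mindspore.dataset as ds

    # Read the 'train' subset with the default single worker thread.
    dataset = ds.SVHNDataset(dataset_dir="/path/to/svhn_dataset_dir", usage="train")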

View File

@ -450,6 +450,9 @@ class Dataset:
Returns:
str, JSON string of the pipeline.
Examples:
>>> dataset_json = dataset.to_json("/path/to/mnist_dataset_pipeline.json")
"""
ir_tree, _ = self.create_ir_tree()
return json.loads(ir_tree.to_json(filename))
@ -1316,6 +1319,14 @@ class Dataset:
Returns:
Dataset, dataset for transferring.
Examples:
>>> import time
>>>
>>> data = ds.TFRecordDataset('/path/to/TF_FILES', '/path/to/TF_SCHEMA_FILE', shuffle=ds.Shuffle.FILES)
>>>
>>> data = data.device_que()
>>> data.send()
>>> time.sleep(0.1)
>>> data.stop_send()
"""
return TransferDataset(self, send_epoch_end, create_data_info_queue)
@ -1389,6 +1400,17 @@ class Dataset:
num_files (int, optional): Number of dataset files. Default: 1.
file_type (str, optional): Dataset format. Default: 'mindrecord'.
Examples:
>>> import numpy as np
>>>
>>> def generator_1d():
... for i in range(10):
... yield (np.array([i]),)
>>>
>>>
>>> # apply dataset operations
>>> d1 = ds.GeneratorDataset(generator_1d, ["data"], shuffle=False)
>>> d1.save('/path/to/save_file')
"""
ir_tree, api_tree = self.create_ir_tree()
@ -1689,6 +1711,39 @@ class Dataset:
When num_batch is None, it will default to the number specified by the
sync_wait operation. Default: None.
data (Any): The data passed to the callback, user defined. Default: None.
Examples:
>>> import numpy as np
>>>
>>>
>>> def gen():
... for i in range(100):
... yield (np.array(i),)
>>>
>>>
>>> class Augment:
... def __init__(self, loss):
... self.loss = loss
...
... def preprocess(self, input_):
... return input_
...
... def update(self, data):
... self.loss = data["loss"]
>>>
>>>
>>> batch_size = 10
>>> dataset = ds.GeneratorDataset(gen, column_names=["input"])
>>> aug = Augment(0)
>>> dataset = dataset.sync_wait(condition_name='', num_batch=1)
>>> dataset = dataset.map(input_columns=["input"], operations=[aug.preprocess])
>>> dataset = dataset.batch(batch_size)
>>>
>>> count = 0
>>> for data in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
... count += 1
... data = {"loss": count}
... dataset.sync_update(condition_name="", data=data)
"""
if (not isinstance(num_batch, int) and num_batch is not None) or \
(isinstance(num_batch, int) and num_batch <= 0):
@ -1761,7 +1816,18 @@ class Dataset:
return {}
def reset(self):
"""Reset the dataset for next epoch."""
"""
Reset the dataset for next epoch.
Examples:
>>> mind_dataset_dir = ["/path/to/mind_dataset_file"]
>>> data_set = ds.MindDataset(dataset_files=mind_dataset_dir)
>>> for _ in range(5):
... num_iter = 0
...     for data in data_set.create_tuple_iterator(num_epochs=1, output_numpy=True):
... num_iter += 1
... data_set.reset()
"""
def is_shuffled(self):
"""Returns True if the dataset or its children is shuffled."""
@ -3797,6 +3863,12 @@ class Schema:
Raises:
ValueError: If column type is unknown.
Examples:
>>> from mindspore import dtype as mstype
>>>
>>> schema = ds.Schema()
>>> schema.add_column('col_1d', de_type=mstype.int64, shape=[2])
"""
if isinstance(de_type, typing.Type):
de_type = mstype_to_detype(de_type)
@ -3841,6 +3913,12 @@ class Schema:
Returns:
str, JSON string of the schema.
Examples:
>>> from mindspore.dataset import Schema
>>>
>>> schema1 = Schema()
>>> schema2 = schema1.to_json()
"""
return self.cpp_schema.to_json()
@ -3855,6 +3933,16 @@ class Schema:
RuntimeError: if there is unknown item in the object.
RuntimeError: if dataset type is missing in the object.
RuntimeError: if columns are missing in the object.
Examples:
>>> import json
>>>
>>> from mindspore.dataset import Schema
>>>
>>> with open("/path/to/schema_file") as file:
... json_obj = json.load(file)
...     schema = Schema()
... schema.from_json(json_obj)
"""
self.cpp_schema.from_string(json.dumps(json_obj, indent=2))

View File

@ -647,17 +647,14 @@ class DBpediaDataset(SourceDataset, TextBaseDataset):
class EnWik9Dataset(SourceDataset, TextBaseDataset):
"""
A source dataset that reads and parses EnWik9 Polarity and EnWik9 Full datasets.
A source dataset that reads and parses EnWik9 datasets.
The generated dataset has one column :py:obj:`[text]` with type string.
Args:
dataset_dir (str): Path to the root directory that contains the dataset.
num_samples (int, optional): The number of samples to be included in the dataset.
For Polarity dataset, 'train' will read from 3,600,000 train samples, 'test' will read from 400,000 test
samples, 'all' will read from all 4,000,000 samples.
For Full dataset, 'train' will read from 3,000,000 train samples, 'test' will read from 650,000 test
samples, 'all' will read from all 3,650,000 samples. Default: None, will include all samples.
Default: None, will include all samples.
num_parallel_workers (int, optional): Number of workers to read the data.
Default: None, number set in the mindspore.dataset.config.
shuffle (Union[bool, Shuffle], optional): Perform reshuffling of the data every epoch.
@ -744,9 +741,6 @@ class IMDBDataset(MappableDataset, TextBaseDataset):
usage (str, optional): Usage of this dataset, can be 'train', 'test' or 'all'.
Default: None, will read all samples.
num_samples (int, optional): The number of images to be included in the dataset.
For Polarity dataset, 'train' will read from 3,600,000 train samples, 'test' will read from 400,000 test
samples, 'all' will read from all 4,000,000 samples. For Full dataset, 'train' will read from 3,000,000
train samples, 'test' will read from 650,000 test samples, 'all' will read from all 3,650,000 samples.
Default: None, will include all samples.
num_parallel_workers (int, optional): Number of workers to read the data.
Default: None, number set in the mindspore.dataset.config.

View File

@ -3114,6 +3114,26 @@ class ManifestDataset(MappableDataset, VisionBaseDataset):
>>>
>>> # 2) Read samples (specified in manifest_file.manifest) for shard 0 in a 2-way distributed training setup
>>> dataset = ds.ManifestDataset(dataset_file=manifest_dataset_dir, num_shards=2, shard_id=0)
About Manifest dataset:
Manifest file contains a list of files included in a dataset, including basic file info such as File name and File
ID, along with extended file metadata. Manifest is a data format file supported by Huawei ModelArts. For details,
see `Specifications for Importing the Manifest File <https://support.huaweicloud.com/engineers-modelarts/
modelarts_23_0009.html>`_ .
.. code-block::
.
└── manifest_dataset_directory
    ├── train
    │    ├── 1.JPEG
    │    ├── 2.JPEG
    │    ├── ...
    ├── eval
    │    ├── 1.JPEG
    │    ├── 2.JPEG
    │    ├── ...
"""
@check_manifestdataset

View File

@ -198,6 +198,13 @@ class GraphData:
Returns:
numpy.ndarray, array of nodes.
Examples:
>>> from mindspore.dataset import GraphData
>>>
>>> g = GraphData("/path/to/testdata", 1)
>>> edges = g.get_all_edges(0)
>>> nodes = g.get_nodes_from_edges(edges)
Raises:
TypeError: If `edge_list` is not list or ndarray.
"""
@ -488,6 +495,12 @@ class GraphData:
Returns:
dict, meta information of the graph. The key is node_type, edge_type, node_num, edge_num,
node_feature_type and edge_feature_type.
Examples:
>>> from mindspore.dataset import GraphData
>>>
>>> g = GraphData("/path/to/testdata", 2)
>>> graph_info = g.graph_info()
"""
if self._working_mode == 'server':
raise Exception("This method is not supported when working mode is server.")
@ -1282,17 +1295,29 @@ class InMemoryGraphDataset(GeneratorDataset):
Default: 'graph'.
num_samples (int, optional): The number of samples to be included in the dataset. Default: None, all samples.
num_parallel_workers (int, optional): Number of subprocesses used to fetch the dataset in parallel. Default: 1.
shuffle (bool, optional): Whether or not to perform shuffle on the dataset.
Default: None, expected order behavior shown in the table below.
shuffle (bool, optional): Whether or not to perform shuffle on the dataset. This parameter can only be
specified when the implemented dataset has a random access attribute ( `__getitem__` ). Default: None.
num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
When this argument is specified, `num_samples` reflects the max
sample number of per shard.
shard_id (int, optional): The shard ID within `num_shards` . Default: None. This argument must be specified only
when num_shards is also specified.
when `num_shards` is also specified.
python_multiprocessing (bool, optional): Parallelize Python operations with multiple worker process. This
option could be beneficial if the Python operation is computational heavy. Default: True.
max_rowsize(int, optional): Maximum size of row in MB that is used for shared memory allocation to copy
data between processes. This is only used if python_multiprocessing is set to True. Default: 6 MB.
data between processes. This is only used if python_multiprocessing is set to True. Default: 6 MB.
Raises:
TypeError: If `data_dir` is not of type str.
TypeError: If `save_dir` is not of type str.
TypeError: If `num_parallel_workers` is not of type int.
TypeError: If `shuffle` is not of type bool.
TypeError: If `python_multiprocessing` is not of type bool.
TypeError: If `perf_mode` is not of type bool.
RuntimeError: If `data_dir` is not valid or does not exist.
RuntimeError: If `num_shards` is specified but `shard_id` is None.
RuntimeError: If `shard_id` is specified but `num_shards` is None.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.
Examples:
>>> from mindspore.dataset import InMemoryGraphDataset, Graph
@ -1381,19 +1406,28 @@ class ArgoverseDataset(InMemoryGraphDataset):
Args:
data_dir (str): directory for loading the dataset; it contains data in the original format, which will be
loaded in the `process` method.
column_names (Union[str, list[str]], optional): single column name or list of column names of the dataset,
num of column name should be equal to num of item in return data when implement method like `__getitem__` ,
recommend to specify it with
column_names (Union[str, list[str]], optional): single column name or list of column names of the dataset.
Default: "graph". The number of column names should equal the number of items returned by methods such as
`__getitem__`; it is recommended to specify it with
`column_names=["edge_index", "x", "y", "cluster", "valid_len", "time_step_len"]` as in the following example.
num_parallel_workers (int, optional): Number of subprocesses used to fetch the dataset in parallel. Default: 1.
shuffle (bool, optional): Whether or not to perform shuffle on the dataset.
Default: None, expected order behavior shown in the table below.
shuffle (bool, optional): Whether or not to perform shuffle on the dataset. This parameter can only be
specified when the implemented dataset has a random access attribute ( `__getitem__` ). Default: None.
python_multiprocessing (bool, optional): Parallelize Python operations with multiple worker process. This
option could be beneficial if the Python operation is computational heavy. Default: True.
perf_mode (bool, optional): mode for obtaining higher performance when iterating over the created dataset (the
`__getitem__` method is called in this process). Default: True, will save all the data in the graph
(like edge index, node feature and graph feature) into graph feature.
Raises:
TypeError: If `data_dir` is not of type str.
TypeError: If `num_parallel_workers` is not of type int.
TypeError: If `shuffle` is not of type bool.
TypeError: If `python_multiprocessing` is not of type bool.
TypeError: If `perf_mode` is not of type bool.
RuntimeError: If `data_dir` is not valid or does not exist.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.
Examples:
>>> from mindspore.dataset import ArgoverseDataset
>>>
@ -1403,6 +1437,37 @@ class ArgoverseDataset(InMemoryGraphDataset):
... "time_step_len"])
>>> for item in graph_dataset.create_dict_iterator(output_numpy=True, num_epochs=1):
... pass
About Argoverse Dataset:
Argoverse is the first dataset containing high-precision maps; it contains 290 km of high-precision map data with
geometric shape and semantic information.
You can unzip the dataset files into the following structure and read by MindSpore's API:
.. code-block::
.
└── argoverse_dataset_dir
    ├── train
    │    ├── ...
    ├── val
    │    └── ...
    ├── test
    │    └── ...
Citation:
.. code-block::
@inproceedings{Argoverse,
author = {Ming-Fang Chang and John W Lambert and Patsorn Sangkloy and Jagjeet Singh
and Slawomir Bak and Andrew Hartnett and De Wang and Peter Carr
and Simon Lucey and Deva Ramanan and James Hays},
title = {Argoverse: 3D Tracking and Forecasting with Rich Maps},
booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2019}
}
"""
def __init__(self, data_dir, column_names="graph", num_parallel_workers=1, shuffle=None,