diff --git a/docs/api/api_python/dataset/mindspore.dataset.ArgoverseDataset.rst b/docs/api/api_python/dataset/mindspore.dataset.ArgoverseDataset.rst
index ab6ff014707..4558e734f1c 100644
--- a/docs/api/api_python/dataset/mindspore.dataset.ArgoverseDataset.rst
+++ b/docs/api/api_python/dataset/mindspore.dataset.ArgoverseDataset.rst
@@ -1,7 +1,7 @@
 mindspore.dataset.ArgoverseDataset
 ====================================
 
-.. py:class:: mindspore.dataset.ArgoverseDataset(data_dir, column_names="graph", shuffle=None, num_parallel_workers=1, python_multiprocessing=True, perf_mode=True)
+.. py:class:: mindspore.dataset.ArgoverseDataset(data_dir, column_names="graph", num_parallel_workers=1, shuffle=None, python_multiprocessing=True, perf_mode=True)
 
     加载argoverse数据集并进行图(Graph)初始化。
 
@@ -16,6 +16,45 @@
         - **python_multiprocessing** (bool,可选) - 启用Python多进程模式加速运算。默认值:True。当传入 `source` 的Python对象的计算量很大时,开启此选项可能会有较好效果。
         - **perf_mode** (bool,可选) - 遍历创建的dataset对象时获得更高性能的模式(在此过程中将调用 `__getitem__` 方法)。默认值:True,将Graph的所有数据(如边的索引、节点特征和图的特征)都作为图特征进行存储。
 
+    异常:
+        - **TypeError** - 如果 `data_dir` 不是str类型。
+        - **TypeError** - 如果 `num_parallel_workers` 不是int类型。
+        - **TypeError** - 如果 `shuffle` 不是bool类型。
+        - **TypeError** - 如果 `python_multiprocessing` 不是bool类型。
+        - **TypeError** - 如果 `perf_mode` 不是bool类型。
+        - **RuntimeError** - 如果 `data_dir` 无效或不存在。
+        - **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
+
+    **关于Argoverse数据集:**
+
+    Argoverse是第一个包含高精地图的数据集,它包含了290KM的带有几何形状和语义信息的高精度地图数据。
+
+    可以将数据集文件解压缩到以下结构中,并通过MindSpore的API读取:
+
+    .. code-block::
+
+        .
+        └── argoversedataset_dir
+            ├── train
+            │    ├──...
+            ├── val
+            │    └──...
+            ├── test
+            │    └──...
+
+    **引用:**
+
+    .. code-block::
+
+        @inproceedings{Argoverse,
+        author = {Ming-Fang Chang and John W Lambert and Patsorn Sangkloy and Jagjeet Singh
+                  and Slawomir Bak and Andrew Hartnett and De Wang and Peter Carr
+                  and Simon Lucey and Deva Ramanan and James Hays},
+        title = {Argoverse: 3D Tracking and Forecasting with Rich Maps},
+        booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
+        year = {2019}
+        }
+
     .. py:method:: load()
 
diff --git a/docs/api/api_python/dataset/mindspore.dataset.EnWik9Dataset.rst b/docs/api/api_python/dataset/mindspore.dataset.EnWik9Dataset.rst
index ddb12833156..5b27f7849b9 100644
--- a/docs/api/api_python/dataset/mindspore.dataset.EnWik9Dataset.rst
+++ b/docs/api/api_python/dataset/mindspore.dataset.EnWik9Dataset.rst
@@ -3,15 +3,13 @@ mindspore.dataset.EnWik9Dataset
 
 .. py:class:: mindspore.dataset.EnWik9Dataset(dataset_dir, num_samples=None, num_parallel_workers=None, shuffle=True, num_shards=None, shard_id=None, cache=None)
 
-    读取和解析EnWik9 Full和EnWik9 Polarity数据集。
+    读取和解析EnWik9数据集。
 
     生成的数据集有一列 `[text]` ,数据类型为string。
 
     参数:
         - **dataset_dir** (str) - 包含数据集文件的根目录路径。
-        - **num_samples** (int, 可选) - 指定从数据集中读取的样本数。
-          对于Polarity数据集, 'train'将读取360万个训练样本, 'test'将读取40万个测试样本, 'all'将读取所有400万个样本。
-          对于Full数据集, 'train'将读取300万个训练样本, 'test'将读取65万个测试样本, 'all'将读取所有365万个样本。默认值:None,读取所有样本。
+        - **num_samples** (int, 可选) - 指定从数据集中读取的样本数。默认值:None,读取所有样本。
         - **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:None,使用mindspore.dataset.config中配置的线程数。
         - **shuffle** (Union[bool, Shuffle], 可选) - 每个epoch中数据混洗的模式,支持传入bool类型与枚举类型进行指定。默认值:True。
           如果 `shuffle` 为False,则不混洗,如果 `shuffle` 为True,等同于将 `shuffle` 设置为mindspore.dataset.Shuffle.GLOBAL。
 
diff --git a/docs/api/api_python/dataset/mindspore.dataset.IMDBDataset.rst b/docs/api/api_python/dataset/mindspore.dataset.IMDBDataset.rst
index b22b56b3337..07d84c86534 100644
--- a/docs/api/api_python/dataset/mindspore.dataset.IMDBDataset.rst
+++ b/docs/api/api_python/dataset/mindspore.dataset.IMDBDataset.rst
@@ -10,9 +10,7 @@ mindspore.dataset.IMDBDataset
     参数:
         - **dataset_dir** (str) - 包含数据集文件的根目录路径。
         - **usage** (str, 可选) - 指定数据集的子集,可取值为 'train', 'test'或 'all'。默认值:None,读取全部样本。
-        - **num_samples** (int, 可选) - 指定从数据集中读取的样本数。
-          对于Polarity数据集, 'train'将读取360万个训练样本, 'test'将读取40万个测试样本, 'all'将读取所有400万个样本。
-          对于Full数据集, 'train'将读取300万个训练样本, 'test'将读取65万个测试样本, 'all'将读取所有365万个样本。默认值:None,读取所有样本。
+        - **num_samples** (int, 可选) - 指定从数据集中读取的样本数。默认值:None,读取所有样本。
         - **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:None,使用mindspore.dataset.config中配置的线程数。
         - **shuffle** (bool, 可选) - 是否混洗数据集。默认值:None。下表中会展示不同参数配置的预期行为。
         - **sampler** (Sampler, 可选) - 指定从数据集中选取样本的采样器。默认值:None。下表中会展示不同配置的预期行为。
 
diff --git a/docs/api/api_python/dataset/mindspore.dataset.IWSLT2017Dataset.rst b/docs/api/api_python/dataset/mindspore.dataset.IWSLT2017Dataset.rst
index 3b89ecb79ce..a360241b476 100644
--- a/docs/api/api_python/dataset/mindspore.dataset.IWSLT2017Dataset.rst
+++ b/docs/api/api_python/dataset/mindspore.dataset.IWSLT2017Dataset.rst
@@ -33,7 +33,7 @@ mindspore.dataset.IWSLT2017Dataset
         - **RuntimeError** - 指定了 `shard_id` 参数,但是未指定 `num_shards` 参数。
         - **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
 
-    **关于IWSLT2016数据集:**
+    **关于IWSLT2017数据集:**
 
     IWSLT是一个专门讨论口译各个方面的重要年度科学会议。IWSLT评估活动中的MT任务被构成一个数据集,该数据集可通过 `wit3 `_ 公开获取。
     IWSLT2017数据集中有德语、英语、意大利语、荷兰语和罗马尼亚语,数据集包括其中任何两种语言的翻译。
 
diff --git a/docs/api/api_python/dataset/mindspore.dataset.InMemoryGraphDataset.rst b/docs/api/api_python/dataset/mindspore.dataset.InMemoryGraphDataset.rst
index b0638c01ac9..4c68773cb1c 100644
--- a/docs/api/api_python/dataset/mindspore.dataset.InMemoryGraphDataset.rst
+++ b/docs/api/api_python/dataset/mindspore.dataset.InMemoryGraphDataset.rst
@@ -21,6 +21,18 @@
         - **python_multiprocessing** (bool,可选) - 启用Python多进程模式加速运算。默认值:True。当传入 `source` 的Python对象的计算量很大时,开启此选项可能会有较好效果。
         - **max_rowsize** (int, 可选) - 指定在多进程之间复制数据时,共享内存分配的最大空间。默认值:6,单位为MB。仅当参数 `python_multiprocessing` 设为True时,此参数才会生效。
 
+    异常:
+        - **TypeError** - 如果 `data_dir` 不是str类型。
+        - **TypeError** - 如果 `save_dir` 不是str类型。
+        - **TypeError** - 如果 `num_parallel_workers` 不是int类型。
+        - **TypeError** - 如果 `shuffle` 不是bool类型。
+        - **TypeError** - 如果 `python_multiprocessing` 不是bool类型。
+        - **RuntimeError** - 如果 `data_dir` 无效或不存在。
+        - **RuntimeError** - 指定了 `num_shards` 参数,但是未指定 `shard_id` 参数。
+        - **RuntimeError** - 指定了 `shard_id` 参数,但是未指定 `num_shards` 参数。
+        - **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
+
     .. py:method:: load()
 
         从给定(处理好的)路径加载数据,也可以在自己实现的Dataset类中实现这个方法。
 
diff --git a/docs/api/api_python/dataset/mindspore.dataset.ManifestDataset.rst b/docs/api/api_python/dataset/mindspore.dataset.ManifestDataset.rst
index 2e0b94b7531..05f9a20f5de 100644
--- a/docs/api/api_python/dataset/mindspore.dataset.ManifestDataset.rst
+++ b/docs/api/api_python/dataset/mindspore.dataset.ManifestDataset.rst
@@ -60,5 +60,25 @@
       - False
       - 不允许
 
+    **关于Manifest数据集:**
+
+    Manifest文件列出了数据集中包含的文件,记录了文件名和文件ID等基本文件信息,以及扩展的文件元数据。
+    Manifest是华为ModelArts支持的数据格式文件,详细说明请参见 `Manifest文档 `_ 。
+
+    以下是原始Manifest数据集结构。可以将数据集文件解压缩到此目录结构中,并由MindSpore的API读取。
+
+    .. code-block::
+
+        .
+        └── manifest_dataset_directory
+            ├── train
+            │    ├── 1.JPEG
+            │    ├── 2.JPEG
+            │    ├── ...
+            ├── eval
+            │    ├── 1.JPEG
+            │    ├── 2.JPEG
+            │    ├── ...
+
 .. include:: mindspore.dataset.api_list_vision.rst
 
diff --git a/docs/api/api_python/dataset/mindspore.dataset.OBSMindDataset.rst b/docs/api/api_python/dataset/mindspore.dataset.OBSMindDataset.rst
index c20bc885a95..5bf8eefce1a 100644
--- a/docs/api/api_python/dataset/mindspore.dataset.OBSMindDataset.rst
+++ b/docs/api/api_python/dataset/mindspore.dataset.OBSMindDataset.rst
@@ -26,7 +26,7 @@
     - **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号。默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
     - **shard_equal_rows** (bool, 可选) - 分布式训练时,为所有分片获取等量的数据行数。默认值:True。
       如果 `shard_equal_rows` 为False,则可能会使得每个分片的数据条目不相等,从而导致分布式训练失败。
-      因此当每个TFRecord文件的数据数量不相等时,建议将此参数设置为True。注意,只有当指定了 `num_shards` 时才能指定此参数。
+      因此当每个MindRecord文件的数据数量不相等时,建议将此参数设置为True。注意,只有当指定了 `num_shards` 时才能指定此参数。
 
     异常:
         - **RuntimeError** - `sync_obs_path` 参数指定的目录不存在。
 
diff --git a/docs/api/api_python/dataset/mindspore.dataset.SVHNDataset.rst b/docs/api/api_python/dataset/mindspore.dataset.SVHNDataset.rst
index ab20ec34a10..7ea2c06af0f 100644
--- a/docs/api/api_python/dataset/mindspore.dataset.SVHNDataset.rst
+++ b/docs/api/api_python/dataset/mindspore.dataset.SVHNDataset.rst
@@ -11,7 +11,7 @@ mindspore.dataset.SVHNDataset
     - **dataset_dir** (str) - 包含数据集文件的根目录路径。
     - **usage** (str, 可选) - 指定数据集的子集,可取值为 'train'、 'test'、 'extra'或 'all'。默认值:None,读取全部样本图片。
     - **num_samples** (int, 可选) - 指定从数据集中读取的样本数,可以小于数据集总数。默认值:None,读取全部样本图片。
-    - **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:1,使用mindspore.dataset.config中配置的线程数。
+    - **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:1。
    - **shuffle** (bool, 可选) - 是否混洗数据集。默认值:None。下表中会展示不同参数配置的预期行为。
    - **sampler** (Sampler, 可选) - 指定从数据集中选取样本的采样器。默认值:None。下表中会展示不同配置的预期行为。
    - **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数。默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
 
diff --git a/mindspore/python/mindspore/dataset/engine/datasets.py b/mindspore/python/mindspore/dataset/engine/datasets.py
index b82c27c4a65..094a4b4dbee 100644
--- a/mindspore/python/mindspore/dataset/engine/datasets.py
+++ b/mindspore/python/mindspore/dataset/engine/datasets.py
@@ -450,6 +450,9 @@ class Dataset:
 
         Returns:
             str, JSON string of the pipeline.
+
+        Examples:
+            >>> dataset_json = dataset.to_json("/path/to/mnist_dataset_pipeline.json")
         """
         ir_tree, _ = self.create_ir_tree()
         return json.loads(ir_tree.to_json(filename))
@@ -1316,6 +1319,14 @@ class Dataset:
 
         Returns:
             Dataset, dataset for transferring.
+
+        Examples:
+            >>> import time
+            >>>
+            >>> data = ds.TFRecordDataset('/path/to/TF_FILES', '/path/to/TF_SCHEMA_FILE', shuffle=ds.Shuffle.FILES)
+            >>>
+            >>> data = data.device_que()
+            >>> data.send()
+            >>> time.sleep(0.1)
+            >>> data.stop_send()
         """
         return TransferDataset(self, send_epoch_end, create_data_info_queue)
 
@@ -1389,6 +1400,17 @@
             num_files (int, optional): Number of dataset files. Default: 1.
             file_type (str, optional): Dataset format. Default: 'mindrecord'.
 
+        Examples:
+            >>> import numpy as np
+            >>>
+            >>> def generator_1d():
+            ...     for i in range(10):
+            ...         yield (np.array([i]),)
+            >>>
+            >>> # apply dataset operations
+            >>> d1 = ds.GeneratorDataset(generator_1d, ["data"], shuffle=False)
+            >>> d1.save('/path/to/save_file')
         """
 
         ir_tree, api_tree = self.create_ir_tree()
 
@@ -1689,6 +1711,39 @@
                 When num_batch is None, it will default to the number specified by the sync_wait operation.
                 Default: None.
             data (Any): The data passed to the callback, user defined. Default: None.
+
+        Examples:
+            >>> import numpy as np
+            >>>
+            >>> def gen():
+            ...     for i in range(100):
+            ...         yield (np.array(i),)
+            >>>
+            >>> class Augment:
+            ...     def __init__(self, loss):
+            ...         self.loss = loss
+            ...
+            ...     def preprocess(self, input_):
+            ...         return input_
+            ...
+            ...     def update(self, data):
+            ...         self.loss = data["loss"]
+            >>>
+            >>> batch_size = 10
+            >>> dataset = ds.GeneratorDataset(gen, column_names=["input"])
+            >>> aug = Augment(0)
+            >>> dataset = dataset.sync_wait(condition_name="policy", num_batch=1, callback=aug.update)
+            >>> dataset = dataset.map(input_columns=["input"], operations=[aug.preprocess])
+            >>> dataset = dataset.batch(batch_size)
+            >>>
+            >>> count = 0
+            >>> for data in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
+            ...     count += 1
+            ...     data = {"loss": count}
+            ...     dataset.sync_update(condition_name="policy", data=data)
         """
         if (not isinstance(num_batch, int) and num_batch is not None) or \
                 (isinstance(num_batch, int) and num_batch <= 0):
 
@@ -1761,7 +1816,18 @@ class Dataset:
         return {}
 
     def reset(self):
-        """Reset the dataset for next epoch."""
+        """
+        Reset the dataset for the next epoch.
+
+        Examples:
+            >>> mind_dataset_dir = ["/path/to/mind_dataset_file"]
+            >>> data_set = ds.MindDataset(dataset_files=mind_dataset_dir)
+            >>> for _ in range(5):
+            ...     num_iter = 0
+            ...     for data in data_set.create_tuple_iterator(num_epochs=1, output_numpy=True):
+            ...         num_iter += 1
+            ...     data_set.reset()
+        """
 
     def is_shuffled(self):
         """Returns True if the dataset or its children is shuffled."""
 
@@ -3797,6 +3863,12 @@ class Schema:
 
         Raises:
             ValueError: If column type is unknown.
+
+        Examples:
+            >>> from mindspore import dtype as mstype
+            >>>
+            >>> schema = ds.Schema()
+            >>> schema.add_column('col_1d', de_type=mstype.int64, shape=[2])
         """
         if isinstance(de_type, typing.Type):
             de_type = mstype_to_detype(de_type)
 
@@ -3841,6 +3913,12 @@ class Schema:
 
         Returns:
             str, JSON string of the schema.
+
+        Examples:
+            >>> from mindspore.dataset import Schema
+            >>>
+            >>> schema = Schema()
+            >>> schema_json = schema.to_json()
         """
         return self.cpp_schema.to_json()
 
@@ -3855,6 +3933,16 @@ class Schema:
             RuntimeError: if there is unknown item in the object.
             RuntimeError: if dataset type is missing in the object.
            RuntimeError: if columns are missing in the object.
+
+        Examples:
+            >>> import json
+            >>>
+            >>> from mindspore.dataset import Schema
+            >>>
+            >>> with open("/path/to/schema_file") as file:
+            ...     json_obj = json.load(file)
+            ...     schema = Schema()
+            ...     schema.from_json(json_obj)
         """
         self.cpp_schema.from_string(json.dumps(json_obj, indent=2))
 
diff --git a/mindspore/python/mindspore/dataset/engine/datasets_text.py b/mindspore/python/mindspore/dataset/engine/datasets_text.py
index 6f10e359234..2dc50b1a03e 100644
--- a/mindspore/python/mindspore/dataset/engine/datasets_text.py
+++ b/mindspore/python/mindspore/dataset/engine/datasets_text.py
@@ -647,17 +647,14 @@ class DBpediaDataset(SourceDataset, TextBaseDataset):
 
 class EnWik9Dataset(SourceDataset, TextBaseDataset):
     """
-    A source dataset that reads and parses EnWik9 Polarity and EnWik9 Full datasets.
+    A source dataset that reads and parses the EnWik9 dataset.
 
     The generated dataset has one column :py:obj:`[text]` with type string.
 
     Args:
         dataset_dir (str): Path to the root directory that contains the dataset.
         num_samples (int, optional): The number of samples to be included in the dataset.
-            For Polarity dataset, 'train' will read from 3,600,000 train samples, 'test' will read from 400,000 test
-            samples, 'all' will read from all 4,000,000 samples.
-            For Full dataset, 'train' will read from 3,000,000 train samples, 'test' will read from 650,000 test
-            samples, 'all' will read from all 3,650,000 samples. Default: None, will include all samples.
+            Default: None, will include all samples.
         num_parallel_workers (int, optional): Number of workers to read the data.
             Default: None, number set in the mindspore.dataset.config.
         shuffle (Union[bool, Shuffle], optional): Perform reshuffling of the data every epoch.
 
@@ -744,9 +741,6 @@ class IMDBDataset(MappableDataset, TextBaseDataset):
         usage (str, optional): Usage of this dataset, can be 'train', 'test' or 'all'.
             Default: None, will read all samples.
         num_samples (int, optional): The number of images to be included in the dataset.
-            For Polarity dataset, 'train' will read from 3,600,000 train samples, 'test' will read from 400,000 test
-            samples, 'all' will read from all 4,000,000 samples. For Full dataset, 'train' will read from 3,000,000
-            train samples, 'test' will read from 650,000 test samples, 'all' will read from all 3,650,000 samples.
+            Default: None, will include all samples.
         num_parallel_workers (int, optional): Number of workers to read the data.
             Default: None, number set in the mindspore.dataset.config.
 
diff --git a/mindspore/python/mindspore/dataset/engine/datasets_vision.py b/mindspore/python/mindspore/dataset/engine/datasets_vision.py
index 6af23476a84..6a1d881b188 100644
--- a/mindspore/python/mindspore/dataset/engine/datasets_vision.py
+++ b/mindspore/python/mindspore/dataset/engine/datasets_vision.py
@@ -3114,6 +3114,26 @@ class ManifestDataset(MappableDataset, VisionBaseDataset):
         >>>
         >>> # 2) Read samples (specified in manifest_file.manifest) for shard 0 in a 2-way distributed training setup
         >>> dataset = ds.ManifestDataset(dataset_file=manifest_dataset_dir, num_shards=2, shard_id=0)
+
+    About Manifest dataset:
+
+    Manifest file contains a list of files included in a dataset, including basic file info such as file name and
+    file ID, along with extended file metadata. Manifest is a data format file supported by Huawei ModelArts. For
+    details, see `Specifications for Importing the Manifest File `_ .
+
+    The following is the original Manifest dataset structure. You can unzip the dataset files into this directory
+    structure and read them with MindSpore's API.
+
+    .. code-block::
+
+        .
+        └── manifest_dataset_directory
+            ├── train
+            │    ├── 1.JPEG
+            │    ├── 2.JPEG
+            │    ├── ...
+            ├── eval
+            │    ├── 1.JPEG
+            │    ├── 2.JPEG
+            │    ├── ...
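+
+    A manifest file itself is plain text in which each line is a JSON object describing one sample. The exact
+    schema is defined by the ModelArts specification linked above; the entry below is only an illustrative
+    sketch (its field names and values are assumptions, not drawn from a real dataset):
+
+    .. code-block::
+
+        {"source": "s3://bucket/manifest_dataset_directory/train/1.JPEG", "usage": "TRAIN",
+         "annotation": [{"type": "modelarts/image_classification", "name": "cat"}]}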
""" @check_manifestdataset diff --git a/mindspore/python/mindspore/dataset/engine/graphdata.py b/mindspore/python/mindspore/dataset/engine/graphdata.py index 26c6ae9c6dc..642717845ff 100644 --- a/mindspore/python/mindspore/dataset/engine/graphdata.py +++ b/mindspore/python/mindspore/dataset/engine/graphdata.py @@ -198,6 +198,13 @@ class GraphData: Returns: numpy.ndarray, array of nodes. + Examples: + >>> from mindspore.dataset import GraphData + >>> + >>> g = ds.GraphData("/path/to/testdata", 1) + >>> edges = g.get_all_edges(0) + >>> nodes = g.get_nodes_from_edges(edges) + Raises: TypeError: If `edge_list` is not list or ndarray. """ @@ -488,6 +495,12 @@ class GraphData: Returns: dict, meta information of the graph. The key is node_type, edge_type, node_num, edge_num, node_feature_type and edge_feature_type. + + Examples: + >>> from mindspore.dataset import GraphData + >>> + >>> g = ds.GraphData("/path/to/testdata", 2) + >>> graph_info = g.graph_info() """ if self._working_mode == 'server': raise Exception("This method is not supported when working mode is server.") @@ -1282,17 +1295,29 @@ class InMemoryGraphDataset(GeneratorDataset): Default: 'graph'. num_samples (int, optional): The number of samples to be included in the dataset. Default: None, all samples. num_parallel_workers (int, optional): Number of subprocesses used to fetch the dataset in parallel. Default: 1. - shuffle (bool, optional): Whether or not to perform shuffle on the dataset. - Default: None, expected order behavior shown in the table below. + shuffle (bool, optional): Whether or not to perform shuffle on the dataset. This parameter can only be + specified when the implemented dataset has a random access attribute ( `__getitem__` ). Default: None. num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None. When this argument is specified, `num_samples` reflects the max sample number of per shard. shard_id (int, optional): The shard ID within `num_shards` . Default: None. This argument must be specified only - when num_shards is also specified. + when `num_shards` is also specified. python_multiprocessing (bool, optional): Parallelize Python operations with multiple worker process. This option could be beneficial if the Python operation is computational heavy. Default: True. max_rowsize(int, optional): Maximum size of row in MB that is used for shared memory allocation to copy - data between processes. This is only used if python_multiprocessing is set to True. Default: 6 MB. + data between processes. This is only used if python_multiprocessing is set to True. Default: 6 MB. + + Raises: + TypeError: If `data_dir` is not of type str. + TypeError: If `save_dir` is not of type str. + TypeError: If `num_parallel_workers` is not of type int. + TypeError: If `shuffle` is not of type bool. + TypeError: If `python_multiprocessing` is not of type bool. + TypeError: If `perf_mode` is not of type bool. + RuntimeError: If `data_dir` is not valid or does not exit. + RuntimeError: If `num_shards` is specified but `shard_id` is None. + RuntimeError: If `shard_id` is specified but `num_shards` is None. + ValueError: If `num_parallel_workers` exceeds the max thread numbers. Examples: >>> from mindspore.dataset import InMemoryGraphDataset, Graph @@ -1381,19 +1406,28 @@ class ArgoverseDataset(InMemoryGraphDataset): Args: data_dir (str): directory for loading dataset, here contains origin format data and will be loaded in `process` method. 
-        column_names (Union[str, list[str]], optional): single column name or list of column names of the dataset,
-            num of column name should be equal to num of item in return data when implement method like `__getitem__` ,
-            recommend to specify it with
+        column_names (Union[str, list[str]], optional): single column name or list of column names of the dataset.
+            Default: "graph". The number of column names should be equal to the number of items returned by
+            methods like `__getitem__` ; it is recommended to specify them with
             `column_names=["edge_index", "x", "y", "cluster", "valid_len", "time_step_len"]` like the following example.
         num_parallel_workers (int, optional): Number of subprocesses used to fetch the dataset in parallel. Default: 1.
-        shuffle (bool, optional): Whether or not to perform shuffle on the dataset.
-            Default: None, expected order behavior shown in the table below.
+        shuffle (bool, optional): Whether or not to perform shuffle on the dataset. This parameter can only be
+            specified when the implemented dataset has a random access attribute ( `__getitem__` ). Default: None.
         python_multiprocessing (bool, optional): Parallelize Python operations with multiple worker process. This
             option could be beneficial if the Python operation is computational heavy. Default: True.
         perf_mode(bool, optional): mode for obtaining higher performance when iterate created dataset(will call
            `__getitem__` method in this process). Default True, will save all the data in graph
            (like edge index, node feature and graph feature) into graph feature.
 
+    Raises:
+        TypeError: If `data_dir` is not of type str.
+        TypeError: If `num_parallel_workers` is not of type int.
+        TypeError: If `shuffle` is not of type bool.
+        TypeError: If `python_multiprocessing` is not of type bool.
+        TypeError: If `perf_mode` is not of type bool.
+        RuntimeError: If `data_dir` is not valid or does not exist.
+        ValueError: If `num_parallel_workers` exceeds the max thread numbers.
+
     Examples:
         >>> from mindspore.dataset import ArgoverseDataset
         >>>
 
@@ -1403,6 +1437,37 @@ class ArgoverseDataset(InMemoryGraphDataset):
         ...                                 "time_step_len"])
         >>> for item in graph_dataset.create_dict_iterator(output_numpy=True, num_epochs=1):
         ...     pass
+
+    About Argoverse Dataset:
+
+    Argoverse is the first dataset containing high-precision maps. It contains 290 km of high-precision map data
+    with geometric shapes and semantic information.
+
+    You can unzip the dataset files into the following structure and read them with MindSpore's API:
+
+    .. code-block::
+
+        .
+        └── argoverse_dataset_dir
+            ├── train
+            │    ├──...
+            ├── val
+            │    └──...
+            ├── test
+            │    └──...
+
+    Citation:
+
+    .. code-block::
+
+        @inproceedings{Argoverse,
+        author = {Ming-Fang Chang and John W Lambert and Patsorn Sangkloy and Jagjeet Singh
+                  and Slawomir Bak and Andrew Hartnett and De Wang and Peter Carr
+                  and Simon Lucey and Deva Ramanan and James Hays},
+        title = {Argoverse: 3D Tracking and Forecasting with Rich Maps},
+        booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
+        year = {2019}
+        }
     """
 
     def __init__(self, data_dir, column_names="graph", num_parallel_workers=1, shuffle=None,