!44822 Fix problems with some Chinese API reviews

Merge pull request !44822 from 刘勇琪/code_docs_modify_chinese_api
i-robot 2022-11-02 06:36:55 +00:00 committed by Gitee
commit d4db709ee0
22 changed files with 454 additions and 415 deletions


@@ -3,7 +3,7 @@ mindspore.dataset.Dataset.shuffle
.. py:method:: mindspore.dataset.Dataset.shuffle(buffer_size)
Shuffle the rows of this dataset using the following strategy:
Shuffle this dataset by creating a buffer of `buffer_size` rows.
1. Generate a shuffle buffer containing `buffer_size` rows.
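The buffer-based strategy described above can be sketched in plain Python as follows. This is a minimal illustration of the idea, not MindSpore's actual implementation; the function name and seed handling are assumptions for the example.

```python
import random

def buffered_shuffle(rows, buffer_size, seed=None):
    """Yield rows in shuffled order using a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    buffer = []
    for row in rows:
        if len(buffer) < buffer_size:
            buffer.append(row)        # 1. fill the buffer with buffer_size rows
        else:
            idx = rng.randrange(len(buffer))
            yield buffer[idx]         # 2. emit a random buffered row ...
            buffer[idx] = row         #    ... and replace it with the next input row
    rng.shuffle(buffer)               # 3. drain the remaining buffered rows
    yield from buffer
```

With `buffer_size` equal to the dataset size this degenerates to a full global shuffle; with `buffer_size=1` the original order is preserved.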


@@ -12,17 +12,23 @@ mindspore.dataset.AGNewsDataset
- **usage** (str, optional) - Subset of the dataset to load, one of 'train', 'test' or 'all'. Default: None, read all samples.
- **num_samples** (int, optional) - Number of samples to read from the dataset. Default: None, read all samples.
- **num_parallel_workers** (int, optional) - Number of worker threads used to read the data. Default: None, use the number of threads configured in mindspore.dataset.config.
- **shuffle** (Union[bool, Shuffle], optional) - Per-epoch shuffle mode, specified either as a bool or as an enum value, default: mindspore.dataset.Shuffle.GLOBAL.
- **shuffle** (Union[bool, Shuffle], optional) - Per-epoch shuffle mode, specified either as a bool or as an enum value. Default: `Shuffle.GLOBAL` .
If `shuffle` is False, the data is not shuffled; if `shuffle` is True, it is equivalent to setting `shuffle` to mindspore.dataset.Shuffle.GLOBAL.
The shuffle mode can be set by passing an enum value:
- **Shuffle.GLOBAL**: shuffle both the files and the samples.
- **Shuffle.FILES**: shuffle the files only.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training, default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training, default: None. Can only be specified when `num_shards` is also specified.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training. Default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training. Default: None. Can only be specified when `num_shards` is also specified.
- **cache** (DatasetCache, optional) - Single-node data cache service used to speed up dataset processing; see `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ for details. Default: None, no cache is used.
Raises:
- **RuntimeError** - The directory pointed to by `dataset_dir` does not exist or is missing dataset files.
- **RuntimeError** - `num_shards` is specified but `shard_id` is not.
- **RuntimeError** - `shard_id` is specified but `num_shards` is not.
- **ValueError** - `num_parallel_workers` exceeds the maximum number of system threads.
**About the AGNews dataset**
AG is a large collection of more than 1 million news articles, gathered from over 2,000 news sources by ComeToMyHead over more than a year of activity. ComeToMyHead is an academic news search engine that has been operating since July 2004.
@@ -52,5 +58,5 @@ mindspore.dataset.AGNewsDataset
archivePrefix={arXiv},
primaryClass={cs.LG}
}
.. include:: mindspore.dataset.api_list_nlp.rst
.. include:: mindspore.dataset.api_list_nlp.rst
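The `num_shards` / `shard_id` contract documented above can be modeled in a few lines of plain Python. This is an illustrative sketch (the round-robin assignment is an assumption for the example, not MindSpore's internal sampler):

```python
def shard_indices(total, num_shards=None, shard_id=None):
    """Return the sample indices one worker reads under the
    num_shards/shard_id contract described in the parameter list above."""
    if (num_shards is None) != (shard_id is None):
        raise RuntimeError("num_shards and shard_id must be specified together")
    if num_shards is None:
        return list(range(total))           # no sharding: read everything
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in [0, num_shards)")
    # round-robin assignment: shard k reads samples k, k + num_shards, ...
    return list(range(shard_id, total, num_shards))
```

Note how specifying only one of the two parameters raises, matching the RuntimeError conditions listed above.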


@@ -14,22 +14,22 @@ mindspore.dataset.AmazonReviewDataset
For the Full dataset, 'train' reads 3,000,000 training samples, 'test' reads 650,000 test samples, and 'all' reads all 3,650,000 samples. Default: None, read all samples.
- **num_samples** (int, optional) - Number of samples to read from the dataset.
- **num_parallel_workers** (int, optional) - Number of worker threads used to read the data. Default: None, use the number of threads configured in mindspore.dataset.config.
- **shuffle** (Union[bool, Shuffle], optional) - Per-epoch shuffle mode, specified either as a bool or as an enum value, default: mindspore.dataset.Shuffle.GLOBAL.
- **shuffle** (Union[bool, Shuffle], optional) - Per-epoch shuffle mode, specified either as a bool or as an enum value. Default: `Shuffle.GLOBAL` .
If `shuffle` is False, the data is not shuffled; if `shuffle` is True, it is equivalent to setting `shuffle` to mindspore.dataset.Shuffle.GLOBAL.
The shuffle mode can be set by passing an enum value:
- **Shuffle.GLOBAL**: shuffle both the files and the samples.
- **Shuffle.FILES**: shuffle the files only.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training, default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training, default: None. Can only be specified when `num_shards` is also specified.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training. Default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training. Default: None. Can only be specified when `num_shards` is also specified.
- **cache** (DatasetCache, optional) - Single-node data cache service used to speed up dataset processing; see `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ for details. Default: None, no cache is used.
Raises:
- **RuntimeError** - The directory pointed to by `dataset_dir` does not exist or is missing dataset files.
- **ValueError** - `num_parallel_workers` exceeds the maximum number of system threads.
- **RuntimeError** - `num_shards` is specified but `shard_id` is not.
- **RuntimeError** - `shard_id` is specified but `num_shards` is not.
- **ValueError** - `num_parallel_workers` exceeds the maximum number of system threads.
**About the AmazonReview dataset**
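The difference between the two enum modes listed above can be made concrete with a toy model of a multi-file text dataset. This is illustrative only; the function and data layout are assumptions for the example, not MindSpore code:

```python
import random

def load(files, mode="global", seed=0):
    """Toy model of Shuffle.FILES vs Shuffle.GLOBAL for a multi-file dataset.

    files: mapping of file name -> list of samples.
    """
    rng = random.Random(seed)
    names = list(files)
    rng.shuffle(names)                    # both modes shuffle the file order
    samples = [s for n in names for s in files[n]]
    if mode == "global":                  # Shuffle.GLOBAL also shuffles samples
        rng.shuffle(samples)
    return samples
```

Under `mode="files"` samples from the same file stay contiguous; under `mode="global"` they are fully mixed across files.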


@@ -10,26 +10,26 @@ mindspore.dataset.DBpediaDataset
Parameters:
- **dataset_dir** (str) - Path to the root directory containing the dataset files.
- **usage** (str, optional) - Subset of the dataset to load, one of 'train', 'test' or 'all'.
'train' reads 560,000 training samples, 'test' reads 70,000 test samples, and 'all' reads all 63 samples. Default: None, read all samples.
'train' reads 560,000 training samples, 'test' reads 70,000 test samples, and 'all' reads all 630,000 samples. Default: None, read all samples.
- **num_samples** (int, optional) - Number of samples to read from the dataset. Default: None, read all samples.
- **num_parallel_workers** (int, optional) - Number of worker threads used to read the data. Default: None, use the number of threads configured in mindspore.dataset.config.
- **shuffle** (Union[bool, Shuffle], optional) - Per-epoch shuffle mode, specified either as a bool or as an enum value, default: mindspore.dataset.Shuffle.GLOBAL.
- **shuffle** (Union[bool, Shuffle], optional) - Per-epoch shuffle mode, specified either as a bool or as an enum value. Default: `Shuffle.GLOBAL` .
If `shuffle` is False, the data is not shuffled; if `shuffle` is True, it is equivalent to setting `shuffle` to mindspore.dataset.Shuffle.GLOBAL.
The shuffle mode can be set by passing an enum value:
- **Shuffle.GLOBAL**: shuffle both the files and the samples.
- **Shuffle.FILES**: shuffle the files only.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training, default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training, default: None. Can only be specified when `num_shards` is also specified.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training. Default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training. Default: None. Can only be specified when `num_shards` is also specified.
- **cache** (DatasetCache, optional) - Single-node data cache service used to speed up dataset processing; see `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ for details. Default: None, no cache is used.
Raises:
- **RuntimeError** - The directory pointed to by `dataset_dir` does not exist or is missing dataset files.
- **ValueError** - `num_parallel_workers` exceeds the maximum number of system threads.
- **RuntimeError** - `num_shards` is specified but `shard_id` is not.
- **RuntimeError** - `shard_id` is specified but `num_shards` is not.
- **ValueError** - `shard_id` is invalid (less than 0 or greater than or equal to `num_shards` ).
- **ValueError** - `num_parallel_workers` exceeds the maximum number of system threads.
- **ValueError** - `shard_id` is invalid (less than 0 or greater than or equal to `num_shards`).
**About the DBpedia dataset**
@@ -61,5 +61,5 @@ mindspore.dataset.DBpediaDataset
howpublished = {http://dbpedia.org}
}
.. include:: mindspore.dataset.api_list_nlp.rst
.. include:: mindspore.dataset.api_list_nlp.rst
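The corrected subset sizes above (the PR fixes "63" to "630,000") are internally consistent: 'all' is the sum of 'train' and 'test'. A tiny helper can resolve a `usage` value to its expected sample count; this is an illustrative sketch, not MindSpore code:

```python
DBPEDIA_SIZES = {"train": 560_000, "test": 70_000}  # counts from the docs above

def subset_size(usage, sizes=DBPEDIA_SIZES):
    """Resolve a `usage` value to its expected sample count ('all' = sum)."""
    if usage in (None, "all"):
        return sum(sizes.values())
    if usage not in sizes:
        raise ValueError(f"unknown usage: {usage!r}")
    return sizes[usage]
```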


@@ -5,19 +5,19 @@ mindspore.dataset.EMnistDataset
Read and parse the source files of the EMNIST dataset to build a dataset.
The generated dataset has two columns: `[image, label]`. The `image` column is of type uint8, and the `label` column is of type uint32.
The generated dataset has two columns: `[image, label]` . The `image` column is of type uint8, and the `label` column is of type uint32.
Parameters:
- **dataset_dir** (str) - Path to the root directory containing the dataset files.
- **name** (str) - Split of the dataset, one of 'byclass', 'bymerge', 'balanced', 'letters', 'digits' or 'mnist'.
- **usage** (str, optional) - Subset of the dataset to load, one of 'train', 'test' or 'all'.
'train' reads 60,000 training samples, 'test' reads 10,000 test samples, and 'all' reads all 70,000 samples. Default: None, read all sample images.
- **num_samples** (int, optional) - Number of samples to read from the dataset, which can be smaller than the dataset size. Default: None, read all sample images.
- **num_samples** (int, optional) - Number of samples to read from the dataset. Default: None, read all sample images.
- **num_parallel_workers** (int, optional) - Number of worker threads used to read the data. Default: None, use the number of threads configured in mindspore.dataset.config.
- **shuffle** (bool, optional) - Whether to shuffle the dataset. Default: None; the table below shows the expected behavior for different configurations.
- **sampler** (Sampler, optional) - Sampler used to pick samples from the dataset, default: None; the table below shows the expected behavior for different configurations.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training, default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training, default: None. Can only be specified when `num_shards` is also specified.
- **sampler** (Sampler, optional) - Sampler used to pick samples from the dataset. Default: None; the table below shows the expected behavior for different configurations.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training. Default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training. Default: None. Can only be specified when `num_shards` is also specified.
- **cache** (DatasetCache, optional) - Single-node data cache service used to speed up dataset processing; see `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ for details. Default: None, no cache is used.
Raises:
@@ -25,7 +25,7 @@ mindspore.dataset.EMnistDataset
- **RuntimeError** - Both `sampler` and `num_shards` are specified, or both `sampler` and `shard_id` are specified.
- **RuntimeError** - `num_shards` is specified but `shard_id` is not.
- **RuntimeError** - `shard_id` is specified but `num_shards` is not.
- **ValueError** - `shard_id` is invalid (less than 0 or greater than or equal to `num_shards` ).
- **ValueError** - `shard_id` is invalid (less than 0 or greater than or equal to `num_shards`).
.. note:: This dataset can accept a `sampler`, but `sampler` and `shuffle` are mutually exclusive. The table below shows several valid combinations of the input parameters and the expected behavior.
@@ -96,5 +96,5 @@ mindspore.dataset.EMnistDataset
publication_support_materials/emnist}
}
.. include:: mindspore.dataset.api_list_vision.rst
.. include:: mindspore.dataset.api_list_vision.rst
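The mutual-exclusion rules in the note above amount to a parameter check like the following sketch. The helper name is hypothetical; this only mirrors the documented RuntimeError conditions, it is not MindSpore's validation code:

```python
def check_sampler_args(shuffle=None, sampler=None, num_shards=None, shard_id=None):
    """Reject the illegal parameter combinations listed in the docs above."""
    if sampler is not None:
        if shuffle is not None:
            raise RuntimeError("sampler and shuffle cannot be specified together")
        if num_shards is not None or shard_id is not None:
            raise RuntimeError("sampler cannot be combined with num_shards/shard_id")
    if (num_shards is None) != (shard_id is None):
        raise RuntimeError("num_shards and shard_id must be specified together")
```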


@@ -3,7 +3,7 @@ mindspore.dataset.EnWik9Dataset
.. py:class:: mindspore.dataset.EnWik9Dataset(dataset_dir, num_samples=None, num_parallel_workers=None, shuffle=True, num_shards=None, shard_id=None, cache=None)
Read and parse the source files of the EnWik9 dataset.
Read and parse the EnWik9 Full and EnWik9 Polarity datasets.
The generated dataset has one column `[text]` , of type string.
@@ -13,17 +13,23 @@ mindspore.dataset.EnWik9Dataset
For the Polarity dataset, 'train' reads 3,600,000 training samples, 'test' reads 400,000 test samples, and 'all' reads all 4,000,000 samples.
For the Full dataset, 'train' reads 3,000,000 training samples, 'test' reads 650,000 test samples, and 'all' reads all 3,650,000 samples. Default: None, read all samples.
- **num_parallel_workers** (int, optional) - Number of worker threads used to read the data. Default: None, use the number of threads configured in mindspore.dataset.config.
- **shuffle** (Union[bool, Shuffle], optional) - Per-epoch shuffle mode, specified either as a bool or as an enum value, default: True.
- **shuffle** (Union[bool, Shuffle], optional) - Per-epoch shuffle mode, specified either as a bool or as an enum value. Default: True.
If `shuffle` is False, the data is not shuffled; if `shuffle` is True, it is equivalent to setting `shuffle` to mindspore.dataset.Shuffle.GLOBAL.
The shuffle mode can be set by passing an enum value:
- **Shuffle.GLOBAL**: shuffle both the files and the samples.
- **Shuffle.FILES**: shuffle the files only.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training, default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training, default: None. Can only be specified when `num_shards` is also specified.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training. Default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training. Default: None. Can only be specified when `num_shards` is also specified.
- **cache** (DatasetCache, optional) - Single-node data cache service used to speed up dataset processing; see `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ for details. Default: None, no cache is used.
Raises:
- **RuntimeError** - The directory pointed to by `dataset_dir` does not exist or is missing dataset files.
- **RuntimeError** - `num_shards` is specified but `shard_id` is not.
- **RuntimeError** - `shard_id` is specified but `num_shards` is not.
- **ValueError** - `num_parallel_workers` exceeds the maximum number of system threads.
**About the EnWik9 dataset**
The EnWik9 data is a sequence of UTF-8 encoded XML consisting mainly of English text. The dataset contains 243,426 article titles, of which 85,560 are redirects to fix broken web links; the rest are regular articles.
@@ -50,5 +56,5 @@ mindspore.dataset.EnWik9Dataset
year = {2006}
}
.. include:: mindspore.dataset.api_list_nlp.rst
.. include:: mindspore.dataset.api_list_nlp.rst


@@ -5,28 +5,28 @@ mindspore.dataset.FakeImageDataset
Generate fake images to build a dataset.
The generated dataset has two columns: `[image, label]`. The `image` column is of type uint8, and the `label` column is of type uint32.
The generated dataset has two columns: `[image, label]` . The `image` column is of type uint8, and the `label` column is of type uint32.
Parameters:
- **num_images** (int, optional) - Number of fake images to generate, default: 1000.
- **image_size** (tuple, optional) - Size of the fake images, default: (224, 224, 3).
- **num_classes** (int, optional) - Number of classes in the dataset, default: 10.
- **base_seed** (int, optional) - Random seed used to generate the random images, default: 0.
- **num_images** (int, optional) - Number of fake images to generate. Default: 1000.
- **image_size** (tuple, optional) - Size of the fake images. Default: (224, 224, 3).
- **num_classes** (int, optional) - Number of classes in the dataset. Default: 10.
- **base_seed** (int, optional) - Random seed used to generate the random images. Default: 0.
- **num_samples** (int, optional) - Number of samples to read from the dataset, which can be smaller than the dataset size. Default: None, read all sample images.
- **num_parallel_workers** (int, optional) - Number of worker threads used to read the data. Default: None, use the number of threads configured in mindspore.dataset.config.
- **shuffle** (bool, optional) - Whether to shuffle the dataset. Default: None; the table below shows the expected behavior for different configurations.
- **sampler** (Sampler, optional) - Sampler used to pick samples from the dataset, default: None; the table below shows the expected behavior for different configurations.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training, default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training, default: None. Can only be specified when `num_shards` is also specified.
- **sampler** (Sampler, optional) - Sampler used to pick samples from the dataset. Default: None; the table below shows the expected behavior for different configurations.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training. Default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training. Default: None. Can only be specified when `num_shards` is also specified.
- **cache** (DatasetCache, optional) - Single-node data cache service used to speed up dataset processing; see `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ for details. Default: None, no cache is used.
Raises:
- **ValueError** - `num_parallel_workers` exceeds the maximum number of system threads.
- **RuntimeError** - Both `sampler` and `shuffle` are specified.
- **RuntimeError** - Both `sampler` and `num_shards` are specified, or both `sampler` and `shard_id` are specified.
- **RuntimeError** - `num_shards` is specified but `shard_id` is not.
- **RuntimeError** - `shard_id` is specified but `num_shards` is not.
- **ValueError** - `shard_id` is invalid (less than 0 or greater than or equal to `num_shards` ).
- **ValueError** - `num_parallel_workers` exceeds the maximum number of system threads.
- **ValueError** - `shard_id` is invalid (less than 0 or greater than or equal to `num_shards`).
.. note:: This dataset can accept a `sampler`, but `sampler` and `shuffle` are mutually exclusive. The table below shows several valid combinations of the input parameters and the expected behavior.
@@ -56,5 +56,5 @@ mindspore.dataset.FakeImageDataset
- False
- Not allowed
.. include:: mindspore.dataset.api_list_vision.rst
.. include:: mindspore.dataset.api_list_vision.rst
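Deterministic fake-image generation from a `base_seed` can be sketched as follows. The per-image seed scheme (`base_seed + index`) is an assumption made for this illustration; it is not necessarily how FakeImageDataset derives its seeds:

```python
import random

def fake_image(index, image_size=(4, 4, 3), base_seed=0):
    """Generate one reproducible fake image as nested lists of uint8 pixels."""
    rng = random.Random(base_seed + index)   # per-image seed -> reproducible
    h, w, c = image_size
    return [[[rng.randrange(256) for _ in range(c)]
             for _ in range(w)] for _ in range(h)]
```

Calling `fake_image` twice with the same `index` and `base_seed` yields identical pixels, which is what makes the dataset repeatable across epochs and workers.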


@@ -5,7 +5,7 @@ mindspore.dataset.FashionMnistDataset
Read and parse the source files of the Fashion-MNIST dataset to build a dataset.
The generated dataset has two columns: `[image, label]`. The `image` column is of type uint8, and the `label` column is of type uint32.
The generated dataset has two columns: `[image, label]` . The `image` column is of type uint8, and the `label` column is of type uint32.
Parameters:
- **dataset_dir** (str) - Path to the root directory containing the dataset files.
@@ -14,19 +14,19 @@ mindspore.dataset.FashionMnistDataset
- **num_samples** (int, optional) - Number of samples to read from the dataset, which can be smaller than the dataset size. Default: None, read all sample images.
- **num_parallel_workers** (int, optional) - Number of worker threads used to read the data. Default: None, use the number of threads configured in mindspore.dataset.config.
- **shuffle** (bool, optional) - Whether to shuffle the dataset. Default: None; the table below shows the expected behavior for different configurations.
- **sampler** (Sampler, optional) - Sampler used to pick samples from the dataset, default: None; the table below shows the expected behavior for different configurations.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training, default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training, default: None. Can only be specified when `num_shards` is also specified.
- **sampler** (Sampler, optional) - Sampler used to pick samples from the dataset. Default: None; the table below shows the expected behavior for different configurations.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training. Default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training. Default: None. Can only be specified when `num_shards` is also specified.
- **cache** (DatasetCache, optional) - Single-node data cache service used to speed up dataset processing; see `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ for details. Default: None, no cache is used.
Raises:
- **RuntimeError** - `dataset_dir` does not contain data files.
- **ValueError** - `num_parallel_workers` exceeds the maximum number of system threads.
- **RuntimeError** - Both `sampler` and `shuffle` are specified.
- **RuntimeError** - Both `sampler` and `num_shards` are specified, or both `sampler` and `shard_id` are specified.
- **RuntimeError** - `num_shards` is specified but `shard_id` is not.
- **RuntimeError** - `shard_id` is specified but `num_shards` is not.
- **ValueError** - `shard_id` is invalid (less than 0 or greater than or equal to `num_shards` ).
- **ValueError** - `num_parallel_workers` exceeds the maximum number of system threads.
- **ValueError** - `shard_id` is invalid (less than 0 or greater than or equal to `num_shards`).
.. note:: This dataset can accept a `sampler`, but `sampler` and `shuffle` are mutually exclusive. The table below shows several valid combinations of the input parameters and the expected behavior.


@@ -7,29 +7,29 @@ mindspore.dataset.Flowers102Dataset
According to the given `task` configuration, the generated dataset has different output columns:
- `task` = 'Classification', output columns: `[image, dtype=uint8]` , `[label, dtype=uint32]` .
- `task` = 'Segmentation', output columns: `[image, dtype=uint8]` , `[segmentation, dtype=uint8]` , `[label, dtype=uint32]` .
- `task` = 'Classification', output columns: `[image, dtype=uint8]` and `[label, dtype=uint32]` .
- `task` = 'Segmentation', output columns: `[image, dtype=uint8]` , `[segmentation, dtype=uint8]` and `[label, dtype=uint32]` .
Parameters:
- **dataset_dir** (str) - Path to the root directory containing the dataset files.
- **task** (str, optional) - Task type to read, one of 'Classification' and 'Segmentation'. Default: 'Classification'.
- **usage** (str, optional) - Subset of the dataset to load, one of 'train', 'valid', 'test' or 'all'. Default: 'all', read all samples.
- **num_samples** (int, optional) - Number of samples to read from the dataset. Default: None, all image samples.
- **num_parallel_workers** (int, optional) - Number of worker threads used to read the data. Default: None, use the number of threads configured in mindspore.dataset.config.
- **num_parallel_workers** (int, optional) - Number of worker threads used to read the data. Default: 1.
- **shuffle** (bool, optional) - Whether to shuffle the dataset. Default: None; the table below shows the expected behavior for different configurations.
- **decode** (bool, optional) - Whether to decode the images after reading, default: False, do not decode.
- **sampler** (Union[Sampler, Iterable], optional) - Sampler used to pick samples from the dataset, default: None; the table below shows the expected behavior for different configurations.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training, default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training, default: None. Can only be specified when `num_shards` is also specified.
- **decode** (bool, optional) - Whether to decode the images after reading. Default: False, do not decode.
- **sampler** (Union[Sampler, Iterable], optional) - Sampler used to pick samples from the dataset. Default: None; the table below shows the expected behavior for different configurations.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training. Default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training. Default: None. Can only be specified when `num_shards` is also specified.
Raises:
- **RuntimeError** - `dataset_dir` does not contain any data files.
- **ValueError** - `num_parallel_workers` exceeds the maximum number of system threads.
- **RuntimeError** - Both `sampler` and `shuffle` are specified.
- **RuntimeError** - Both `sampler` and `num_shards` are specified, or both `sampler` and `shard_id` are specified.
- **RuntimeError** - `num_shards` is specified but `shard_id` is not.
- **RuntimeError** - `shard_id` is specified but `num_shards` is not.
- **ValueError** - `shard_id` is invalid (less than 0 or greater than or equal to `num_shards` ).
- **ValueError** - `num_parallel_workers` exceeds the maximum number of system threads.
- **ValueError** - `shard_id` is invalid (less than 0 or greater than or equal to `num_shards`).
.. note:: This dataset can accept a `sampler`, but `sampler` and `shuffle` are mutually exclusive. The table below shows several valid combinations of the input parameters and the expected behavior.
@@ -93,5 +93,5 @@ mindspore.dataset.Flowers102Dataset
year = "2008",
}
.. include:: mindspore.dataset.api_list_vision.rst
.. include:: mindspore.dataset.api_list_vision.rst


@@ -14,26 +14,20 @@ mindspore.dataset.IMDBDataset
For the Polarity dataset, 'train' reads 3,600,000 training samples, 'test' reads 400,000 test samples, and 'all' reads all 4,000,000 samples.
For the Full dataset, 'train' reads 3,000,000 training samples, 'test' reads 650,000 test samples, and 'all' reads all 3,650,000 samples. Default: None, read all samples.
- **num_parallel_workers** (int, optional) - Number of worker threads used to read the data. Default: None, use the number of threads configured in mindspore.dataset.config.
- **shuffle** (bool, optional) - Per-epoch shuffle mode, specified either as a bool or as an enum value, default: mindspore.dataset.Shuffle.GLOBAL.
If `shuffle` is False, the data is not shuffled; if `shuffle` is True, it is equivalent to setting `shuffle` to mindspore.dataset.Shuffle.GLOBAL.
The shuffle mode can be set by passing an enum value:
- **Shuffle.GLOBAL**: shuffle both the files and the samples.
- **Shuffle.FILES**: shuffle the files only.
- **sampler** (Sampler, optional) - Sampler used to pick samples from the dataset, default: None; the table below shows the expected behavior for different configurations.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training, default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training, default: None. Can only be specified when `num_shards` is also specified.
- **shuffle** (bool, optional) - Whether to shuffle the dataset. Default: None; the table below shows the expected behavior for different configurations.
- **sampler** (Sampler, optional) - Sampler used to pick samples from the dataset. Default: None; the table below shows the expected behavior for different configurations.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training. Default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training. Default: None. Can only be specified when `num_shards` is also specified.
- **cache** (DatasetCache, optional) - Single-node data cache service used to speed up dataset processing; see `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ for details. Default: None, no cache is used.
Raises:
- **RuntimeError** - The directory pointed to by `dataset_dir` does not exist or is missing dataset files.
- **ValueError** - `num_parallel_workers` exceeds the maximum number of system threads.
- **RuntimeError** - Both `sampler` and `shuffle` are specified.
- **RuntimeError** - Both `sampler` and `num_shards` are specified, or both `sampler` and `shard_id` are specified.
- **RuntimeError** - `num_shards` is specified but `shard_id` is not.
- **RuntimeError** - `shard_id` is specified but `num_shards` is not.
- **ValueError** - `shard_id` is invalid (less than 0 or greater than or equal to `num_shards` ).
- **ValueError** - `num_parallel_workers` exceeds the maximum number of system threads.
- **ValueError** - `shard_id` is invalid (less than 0 or greater than or equal to `num_shards`).
.. note:: This dataset can accept a `sampler`, but `sampler` and `shuffle` are mutually exclusive. The table below shows several valid combinations of the input parameters and the expected behavior.
@@ -112,5 +106,5 @@ mindspore.dataset.IMDBDataset
url = {http://www.aclweb.org/anthology/P11-1015}
}
.. include:: mindspore.dataset.api_list_nlp.rst
.. include:: mindspore.dataset.api_list_nlp.rst


@@ -10,31 +10,31 @@ mindspore.dataset.IWSLT2016Dataset
Parameters:
- **dataset_dir** (str) - Path to the root directory containing the dataset files.
- **usage** (str, optional) - Subset of the dataset to load, one of 'train', 'valid', 'test' or 'all'. Default: None, read all samples.
- **language_pair** (sequence, optional) - Sequence containing the source and target languages; supported values are ('en', 'fr'), ('en', 'de'), ('en', 'cs'), ('en', 'ar'), ('de', 'en', ('cs', 'en', ('ar', 'en', default: ('de', 'en').
- **valid_set** (str, optional) - String identifying the validation set; supported values are 'dev2010', 'tst2010', 'tst2011', 'tst'2012, 'tst2013' and 'tst2014', default: 'tst2013'.
- **test_set** (str, optional) - String identifying the test set; supported values are 'dev2010', 'tst2010', 'tst2011', 'tst'2012, 'tst2013' and 'tst2014', default: 'tst2014'.
- **num_samples** (int, optional) - Number of samples to read from the dataset.
- **shuffle** (Union[bool, Shuffle], optional) - Per-epoch shuffle mode, specified either as a bool or as an enum value, default: mindspore.dataset.Shuffle.GLOBAL.
- **language_pair** (sequence, optional) - Sequence containing the source and target languages; supported values are ('en', 'fr'), ('en', 'de'), ('en', 'cs'), ('en', 'ar'), ('de', 'en'), ('cs', 'en') and ('ar', 'en'). Default: ('de', 'en').
- **valid_set** (str, optional) - String identifying the validation set; supported values are 'dev2010', 'tst2010', 'tst2011', 'tst2012', 'tst2013' and 'tst2014'. Default: 'tst2013'.
- **test_set** (str, optional) - String identifying the test set; supported values are 'dev2010', 'tst2010', 'tst2011', 'tst2012', 'tst2013' and 'tst2014'. Default: 'tst2014'.
- **num_samples** (int, optional) - Number of samples to read from the dataset. Default: None, read all samples.
- **shuffle** (Union[bool, Shuffle], optional) - Per-epoch shuffle mode, specified either as a bool or as an enum value. Default: `Shuffle.GLOBAL` .
If `shuffle` is False, the data is not shuffled; if `shuffle` is True, it is equivalent to setting `shuffle` to mindspore.dataset.Shuffle.GLOBAL.
The shuffle mode can be set by passing an enum value:
- **Shuffle.GLOBAL**: shuffle both the files and the samples.
- **Shuffle.FILES**: shuffle the files only.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training, default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training, default: None. Can only be specified when `num_shards` is also specified.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training. Default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training. Default: None. Can only be specified when `num_shards` is also specified.
- **num_parallel_workers** (int, optional) - Number of worker threads used to read the data. Default: None, use the number of threads configured in mindspore.dataset.config.
- **cache** (DatasetCache, optional) - Single-node data cache service used to speed up dataset processing; see `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ for details. Default: None, no cache is used.
Raises:
- **RuntimeError** - The directory pointed to by `dataset_dir` does not exist or is missing dataset files.
- **ValueError** - `num_parallel_workers` exceeds the maximum number of system threads.
- **RuntimeError** - `num_shards` is specified but `shard_id` is not.
- **RuntimeError** - `shard_id` is specified but `num_shards` is not.
- **ValueError** - `num_parallel_workers` exceeds the maximum number of system threads.
**About the IWSLT2016 dataset**
IWSLT is a major annual scientific conference dedicated to all aspects of spoken language translation. The MT task of the IWSLT evaluation campaign is organized into a dataset that is publicly available via wit3.fbk.eu.
IWSLT is a major annual scientific conference dedicated to all aspects of spoken language translation. The MT task of the IWSLT evaluation campaign is organized into a dataset that is publicly available via `wit3 <https://wit3.fbk.eu>`_ .
The IWSLT2016 dataset includes translations from English to Arabic, Czech, French and German, and from Arabic, Czech, French and German to English.
The original IWSLT2016 dataset files can be decompressed into this directory structure and read by MindSpore's API. After decompression, the dataset to be read also needs to be decompressed into the specified folder. For example, to read the de-en dataset, decompress the tgz file under the de/en directory; the dataset is located in the decompressed folder.


@@ -5,34 +5,37 @@ mindspore.dataset.IWSLT2017Dataset
Read and parse the source files of the IWSLT2017 dataset.
The generated dataset has two columns `[text, translation]`. The `text` column is of type string. The `translation` column is of type string.
The generated dataset has two columns `[text, translation]`. Both the `text` and `translation` columns are of type string.
Parameters:
- **dataset_dir** (str) - Path to the root directory containing the dataset files.
- **usage** (str, optional) - Subset of the dataset to load, one of 'train', 'valid', 'test' or 'all'. Default: None, read all samples.
- **language_pair** (sequence, optional) - List containing the source and target languages; supported pairs are ('en', 'nl'), ('en', 'de'), ('en', 'it'), ('en', 'ro'), ('nl', 'en', 'de'), ('nl', 'it'), ('nl', 'ro'), ('de', 'en'), ('de', 'nl'), ('de', 'it', ('it', 'en'), ('it', 'nl'), ('it', 'de'), ('it', 'ro'), ('ro', 'en'), ('ro', 'nl'), ('ro', 'de'), ('ro', 'it'), default: ('de', 'en').
- **num_samples** (int, optional) - Number of samples to read from the dataset.
- **shuffle** (Union[bool, Shuffle], optional) - Per-epoch shuffle mode, specified either as a bool or as an enum value, default: mindspore.dataset.Shuffle.GLOBAL.
- **language_pair** (sequence, optional) - List containing the source and target languages; supported pairs are ('en', 'nl'),
('en', 'de'), ('en', 'it'), ('en', 'ro'), ('nl', 'en'), ('nl', 'de'), ('nl', 'it'), ('nl', 'ro'),
('de', 'en'), ('de', 'nl'), ('de', 'it'), ('de', 'ro'), ('it', 'en'), ('it', 'nl'), ('it', 'de'),
('it', 'ro'), ('ro', 'en'), ('ro', 'nl'), ('ro', 'de'), ('ro', 'it'). Default: ('de', 'en').
- **num_samples** (int, optional) - Number of samples to read from the dataset. Default: None, read all samples.
- **shuffle** (Union[bool, Shuffle], optional) - Per-epoch shuffle mode, specified either as a bool or as an enum value. Default: `Shuffle.GLOBAL` .
If `shuffle` is False, the data is not shuffled; if `shuffle` is True, it is equivalent to setting `shuffle` to mindspore.dataset.Shuffle.GLOBAL.
The shuffle mode can be set by passing an enum value:
- **Shuffle.GLOBAL**: shuffle both the files and the samples.
- **Shuffle.FILES**: shuffle the files only.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training, default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training, default: None. Can only be specified when `num_shards` is also specified.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training. Default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training. Default: None. Can only be specified when `num_shards` is also specified.
- **num_parallel_workers** (int, optional) - Number of worker threads used to read the data. Default: None, use the number of threads configured in mindspore.dataset.config.
- **cache** (DatasetCache, optional) - Single-node data cache service used to speed up dataset processing; see `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ for details. Default: None, no cache is used.
Raises:
- **RuntimeError** - The directory pointed to by `dataset_dir` does not exist or is missing dataset files.
- **ValueError** - `num_parallel_workers` exceeds the maximum number of system threads.
- **RuntimeError** - `num_shards` is specified but `shard_id` is not.
- **RuntimeError** - `shard_id` is specified but `num_shards` is not.
- **ValueError** - `num_parallel_workers` exceeds the maximum number of system threads.
**About the IWSLT2017 dataset**
IWSLT is a major annual scientific conference dedicated to all aspects of spoken language translation. The MT task of the IWSLT evaluation campaign is organized into a dataset that is publicly available via wit3.fbk.eu.
IWSLT is a major annual scientific conference dedicated to all aspects of spoken language translation. The MT task of the IWSLT evaluation campaign is organized into a dataset that is publicly available via `wit3 <https://wit3.fbk.eu>`_ .
The IWSLT2017 dataset contains German, English, Italian, Dutch and Romanian data, including translations between any two of these languages.
The original IWSLT2017 dataset files can be decompressed into this directory structure and read by MindSpore's API. After decompression, the dataset to be read also needs to be decompressed into the specified folder. For example, to read the de-en dataset, decompress the tgz file under the de/en directory; the dataset is located in the decompressed folder.
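The 20 supported IWSLT2017 pairs listed above are exactly all ordered pairs of the five languages, so validating a requested `language_pair` can be sketched compactly (the helper name is hypothetical, not a MindSpore API):

```python
from itertools import permutations

LANGS = ("en", "nl", "de", "it", "ro")
SUPPORTED_PAIRS = set(permutations(LANGS, 2))   # the 20 pairs listed above

def check_language_pair(pair):
    """Validate a (source, target) pair against the supported IWSLT2017 pairs."""
    pair = tuple(pair)
    if pair not in SUPPORTED_PAIRS:
        raise ValueError(f"unsupported language_pair: {pair}")
    return pair
```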


@@ -5,7 +5,7 @@ mindspore.dataset.KMnistDataset
Read and parse the source files of the KMNIST dataset to build a dataset.
The generated dataset has two columns: `[image, label]`. The `image` column is of type uint8, and the `label` column is of type uint32.
The generated dataset has two columns: `[image, label]` . The `image` column is of type uint8, and the `label` column is of type uint32.
Parameters:
- **dataset_dir** (str) - Path to the root directory containing the dataset files.
@@ -14,19 +14,19 @@ mindspore.dataset.KMnistDataset
- **num_samples** (int, optional) - Number of samples to read from the dataset, which can be smaller than the dataset size. Default: None, read all sample images.
- **num_parallel_workers** (int, optional) - Number of worker threads used to read the data. Default: None, use the number of threads configured in mindspore.dataset.config.
- **shuffle** (bool, optional) - Whether to shuffle the dataset. Default: None; the table below shows the expected behavior for different configurations.
- **sampler** (Sampler, optional) - Sampler used to pick samples from the dataset, default: None; the table below shows the expected behavior for different configurations.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training, default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training, default: None. Can only be specified when `num_shards` is also specified.
- **sampler** (Sampler, optional) - Sampler used to pick samples from the dataset. Default: None; the table below shows the expected behavior for different configurations.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training. Default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training. Default: None. Can only be specified when `num_shards` is also specified.
- **cache** (DatasetCache, optional) - Single-node data cache service used to speed up dataset processing; see `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ for details. Default: None, no cache is used.
Raises:
- **RuntimeError** - `dataset_dir` does not contain data files.
- **ValueError** - `num_parallel_workers` exceeds the maximum number of system threads.
- **RuntimeError** - Both `sampler` and `shuffle` are specified.
- **RuntimeError** - Both `sampler` and `num_shards` are specified, or both `sampler` and `shard_id` are specified.
- **RuntimeError** - `num_shards` is specified but `shard_id` is not.
- **RuntimeError** - `shard_id` is specified but `num_shards` is not.
- **ValueError** - `shard_id` is invalid (less than 0 or greater than or equal to `num_shards` ).
- **ValueError** - `num_parallel_workers` exceeds the maximum number of system threads.
- **ValueError** - `shard_id` is invalid (less than 0 or greater than or equal to `num_shards`).
.. note:: This dataset can accept a `sampler`, but `sampler` and `shuffle` are mutually exclusive. The table below shows several valid combinations of the input parameters and the expected behavior.


@@ -5,7 +5,7 @@ mindspore.dataset.LJSpeechDataset
Read and parse the source files of the LJSpeech dataset to build a dataset.
The generated dataset has columns: `[waveform, sample_rate, transcription, normalized_transcript]` .
The generated dataset has columns: `[waveform, sample_rate, transcription, normalized_transcript]` .
The `waveform` column is of type float32, the `sample_rate` column is of type int32, the `transcription` column is of type string, and the `normalized_transcript` column is of type string.
Parameters:
@@ -13,19 +13,19 @@ mindspore.dataset.LJSpeechDataset
- **num_samples** (int, optional) - Number of samples to read from the dataset, which can be smaller than the dataset size. Default: None, read all audio samples.
- **num_parallel_workers** (int, optional) - Number of worker threads used to read the data. Default: None, use the number of threads configured in mindspore.dataset.config.
- **shuffle** (bool, optional) - Whether to shuffle the dataset. Default: None; the table below shows the expected behavior for different configurations.
- **sampler** (Sampler, optional) - Sampler used to pick samples from the dataset, default: None; the table below shows the expected behavior for different configurations.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training, default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training, default: None. Can only be specified when `num_shards` is also specified.
- **sampler** (Sampler, optional) - Sampler used to pick samples from the dataset. Default: None; the table below shows the expected behavior for different configurations.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training. Default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training. Default: None. Can only be specified when `num_shards` is also specified.
- **cache** (DatasetCache, optional) - Single-node data cache service used to speed up dataset processing; see `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ for details. Default: None, no cache is used.
Raises:
- **RuntimeError** - `dataset_dir` does not contain data files.
- **ValueError** - `num_parallel_workers` exceeds the maximum number of system threads.
- **RuntimeError** - Both `sampler` and `shuffle` are specified.
- **RuntimeError** - Both `sampler` and `num_shards` are specified, or both `sampler` and `shard_id` are specified.
- **RuntimeError** - `num_shards` is specified but `shard_id` is not.
- **RuntimeError** - `shard_id` is specified but `num_shards` is not.
- **ValueError** - `shard_id` is invalid (less than 0 or greater than or equal to `num_shards` ).
- **ValueError** - `num_parallel_workers` exceeds the maximum number of system threads.
- **ValueError** - `shard_id` is invalid (less than 0 or greater than or equal to `num_shards`).
.. note:: This dataset can accept a `sampler`, but `sampler` and `shuffle` are mutually exclusive. The table below shows several valid combinations of the input parameters and the expected behavior.


@@ -5,7 +5,7 @@ mindspore.dataset.PennTreebankDataset
Read and parse the source files of the PennTreebank dataset.
The generated dataset has one column `[text]` . The data type is string.
The generated dataset has one column `[text]` . The `text` column is of type string.
Parameters:
- **dataset_dir** (str) - Path to the root directory containing the dataset files.
@@ -13,17 +13,23 @@ mindspore.dataset.PennTreebankDataset
'train' reads 42,068 samples, 'valid' reads 3,370 samples, 'test' reads 3,761 samples, and 'all' reads all 49,199 samples. Default: None, read all samples.
- **num_samples** (int, optional) - Number of samples to read from the dataset. Default: None, read all samples.
- **num_parallel_workers** (int, optional) - Number of worker threads used to read the data. Default: None, use the number of threads configured in mindspore.dataset.config.
- **shuffle** (Union[bool, Shuffle], optional) - Per-epoch shuffle mode, specified either as a bool or as an enum value, default: True.
- **shuffle** (Union[bool, Shuffle], optional) - Per-epoch shuffle mode, specified either as a bool or as an enum value. Default: `Shuffle.GLOBAL` .
If `shuffle` is False, the data is not shuffled; if `shuffle` is True, it is equivalent to setting `shuffle` to mindspore.dataset.Shuffle.GLOBAL.
The shuffle mode can be set by passing an enum value:
- **Shuffle.GLOBAL**: shuffle both the files and the samples.
- **Shuffle.FILES**: shuffle the files only.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training, default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training, default: None. Can only be specified when `num_shards` is also specified.
- **num_shards** (int, optional) - Number of shards the dataset is divided into for distributed training. Default: None. When this parameter is specified, `num_samples` denotes the maximum number of samples per shard.
- **shard_id** (int, optional) - Shard ID to use for distributed training. Default: None. Can only be specified when `num_shards` is also specified.
- **cache** (DatasetCache, optional) - Single-node data cache service used to speed up dataset processing; see `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ for details. Default: None, no cache is used.
Raises:
- **RuntimeError** - The directory pointed to by `dataset_dir` does not exist or is missing dataset files.
- **RuntimeError** - `num_shards` is specified but `shard_id` is not.
- **RuntimeError** - `shard_id` is specified but `num_shards` is not.
- **ValueError** - `num_parallel_workers` exceeds the maximum number of system threads.
**About the PennTreebank dataset**
The Penn Treebank (PTB) dataset is widely used in machine learning research for NLP (natural language processing).
@@ -60,5 +66,5 @@ mindspore.dataset.PennTreebankDataset
year = 1990
}
.. include:: mindspore.dataset.api_list_nlp.rst
.. include:: mindspore.dataset.api_list_nlp.rst


@ -5,26 +5,26 @@ mindspore.dataset.PhotoTourDataset
读取和解析PhotoTour数据集的源数据集。
`usage` = 'train',生成的数据集有一列 `[image]` 数据类型为uint8。
`usage` ≠ 'train',生成的数据集有三列: `[image1, image2, matches]``image1``image2` 列的数据类型为uint8。 `matches` 列的数据类型为uint32。
根据给定的 `usage` 配置,生成数据集具有不同的输出列:
- `usage` = 'train',输出列: `[image, dtype=uint8]`
- `usage` ≠ 'train',输出列: `[image1, dtype=uint8]``[image2, dtype=uint8]``[matches, dtype=uint32]`
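The `usage`-dependent output columns above can be expressed as a small helper; this is purely illustrative (the function is hypothetical, not part of the MindSpore API):

```python
def phototour_columns(usage="train"):
    """Return the output column names for a given `usage`, per the docs above."""
    if usage == "train":
        return ["image"]                      # single-image column
    return ["image1", "image2", "matches"]    # patch-pair matching columns
```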
参数:
- **dataset_dir** (str) - 包含数据集文件的根目录路径。
- **name** (str) - 要加载的数据集内容名称,可以取值为'notredame' 'yosemite' 'liberty' 'notredame_harris' 'yosemite_harris' 或 'liberty_harris'。
- **name** (str) - 要加载的数据集内容名称,可以取值为'notredame'、'yosemite'、'liberty'、'notredame_harris'、'yosemite_harris' 或 'liberty_harris'。
- **usage** (str, 可选) - 指定数据集的子集,可取值为'train'或'test'。默认值None将被设置为'train'。
取值为'train'时,每个 `name` 的数据集样本数分别为{'notredame': 468159, 'yosemite': 633587, 'liberty': 450092, 'liberty_harris': 379587, 'yosemite_harris': 450912, 'notredame_harris': 325295}。
取值为'test'时将读取100,000个测试样本。
- **num_samples** (int, 可选) - 指定从数据集中读取的样本数。默认值None读取所有样本。
- **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值None使用mindspore.dataset.config中配置的线程数。
- **shuffle** (bool, 可选) - 是否混洗数据集。默认值None下表中会展示不同参数配置的预期行为。
- **sampler** (Sampler, 可选) - 指定从数据集中选取样本的采样器默认值None下表中会展示不同配置的预期行为。
- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数默认值None。指定此参数后 `num_samples` 表示每个分片的最大样本数。
- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号默认值None。只有当指定了 `num_shards` 时才能指定此参数。
- **sampler** (Sampler, 可选) - 指定从数据集中选取样本的采样器默认值None下表中会展示不同配置的预期行为。
- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数默认值None。指定此参数后 `num_samples` 表示每个分片的最大样本数。
- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号默认值None。只有当指定了 `num_shards` 时才能指定此参数。
- **cache** (DatasetCache, 可选) - 单节点数据缓存服务,用于加快数据集处理,详情请阅读 `单节点数据缓存 <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ 。默认值None不使用缓存。
异常:
- **RuntimeError** - `dataset_dir` 路径下不包含数据文件。
- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
- **RuntimeError** - 同时指定了 `sampler``shuffle` 参数。
- **RuntimeError** - 同时指定了 `sampler``num_shards` 参数或同时指定了 `sampler``shard_id` 参数。
- **RuntimeError** - 指定了 `num_shards` 参数,但是未指定 `shard_id` 参数。
@ -32,7 +32,8 @@ mindspore.dataset.PhotoTourDataset
- **ValueError** - `dataset_dir` 不存在。
- **ValueError** - `usage` 不是["train", "test"]中的任何一个。
- **ValueError** - `name` 不是["notredame", "yosemite", "liberty","notredame_harris", "yosemite_harris", "liberty_harris"]中的任何一个。
- **ValueError** - `shard_id` 参数错误小于0或者大于等于 `num_shards` )。
- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
- **ValueError** - `shard_id` 参数错误小于0或者大于等于 `num_shards`
.. note:: 此数据集可以指定参数 `sampler` ,但参数 `sampler` 和参数 `shuffle` 的行为是互斥的。下表展示了几种合法的输入参数组合及预期的行为。
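As a rough illustration of how the `num_shards` / `shard_id` pair partitions a dataset across distributed workers (a sketch only — the real split is performed inside the dataset engine, and `shard_samples` is a hypothetical helper, not part of the API):

```python
def shard_samples(samples, num_shards, shard_id):
    """Return the subset of samples assigned to one shard.

    Mirrors the documented constraints: shard_id must lie in
    [0, num_shards), and the two arguments are given together.
    """
    if not 0 <= shard_id < num_shards:
        raise ValueError(
            f"shard_id must be in [0, {num_shards}), got {shard_id}")
    # Round-robin assignment: shard k gets samples k, k+num_shards, ...
    return samples[shard_id::num_shards]
```

Under this model, once `num_shards` is set, `num_samples` naturally caps the per-shard count rather than the global one, which is what the docs above state.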
@ -112,5 +113,5 @@ mindspore.dataset.PhotoTourDataset
doi={10.1109/CVPR.2007.382971}
}
.. include:: mindspore.dataset.api_list_vision.rst
@ -303,24 +303,6 @@ API示例所需模块的导入代码如下
返回:
bool表示是否开启watchdog Python线程。
.. py:function:: mindspore.dataset.config.set_fast_recovery(fast_recovery)
在数据集管道故障恢复时,是否开启快速恢复模式(快速恢复模式下,无法保证随机性的数据增强操作得到与故障之前相同的结果)。
@ -340,3 +322,21 @@ API示例所需模块的导入代码如下
.. automodule:: mindspore.dataset.config
:members:
.. py:function:: mindspore.dataset.config.set_multiprocessing_timeout_interval(interval)
设置在多进程/多线程下,主进程/主线程获取数据超时时,告警日志打印的默认时间间隔(秒)。
参数:
- **interval** (int) - 表示多进程/多线程下,主进程/主线程获取数据超时时,告警日志打印的时间间隔(秒)。
异常:
- **TypeError** - `interval` 不是int类型。
- **ValueError** - `interval` 小于等于0或 `interval` 大于 `INT32_MAX(2147483647)` 。
.. py:function:: mindspore.dataset.config.get_multiprocessing_timeout_interval()
获取在多进程/多线程下,主进程/主线程获取数据超时时,告警日志打印的时间间隔的全局配置。
返回:
int表示多进程/多线程下,主进程/主线程获取数据超时时告警日志打印的时间间隔默认300秒
@ -808,7 +808,7 @@ def set_fast_recovery(fast_recovery):
(yet with slightly different random augmentations).
Args:
fast_recovery (bool): Whether the dataset pipeline recovers in fast mode. Default: True.
Raises:
TypeError: If `fast_recovery` is not a boolean data type.
@ -823,10 +823,10 @@ def set_fast_recovery(fast_recovery):
def get_fast_recovery():
"""
Get whether the fast recovery mode is enabled for the current dataset pipeline.
Returns:
bool, whether the dataset recovers fast in failover reset.
Examples:
>>> is_fast_recovery = ds.config.get_fast_recovery()
@ -714,20 +714,20 @@ class Dataset:
@check_shuffle
def shuffle(self, buffer_size):
"""
Shuffle the dataset by creating a cache with the size of `buffer_size` .
1. Make a shuffle buffer that contains the first `buffer_size` rows.
2. Randomly select an element from the shuffle buffer to be the next row
propagated to the child node.
3. Get the next row (if any) from the parent node and put it in the shuffle buffer.
4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.
A random seed can be provided to be used on the first epoch via `dataset.config.set_seed` . In every subsequent
epoch, the seed is changed to a new, randomly generated value.
Args:
buffer_size (int): The size of the buffer (must be larger than 1) for
shuffling. Setting `buffer_size` equal to the number of rows in the entire
dataset will result in a global shuffle.
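The four-step buffer policy above can be sketched in plain Python (a simplified model of the technique, not the MindSpore implementation; the row iterator and `seed` handling are stand-ins):

```python
import random

def buffered_shuffle(rows, buffer_size, seed=None):
    """Yield rows in shuffled order using a fixed-size shuffle buffer.

    A buffer_size >= the total number of rows gives a global shuffle;
    smaller buffers trade shuffle quality for memory.
    """
    rng = random.Random(seed)
    it = iter(rows)
    buf = []
    # Step 1: fill the buffer with the first buffer_size rows.
    for row in it:
        buf.append(row)
        if len(buf) >= buffer_size:
            break
    # Steps 2-3: emit a random buffered row, refill from upstream.
    for row in it:
        idx = rng.randrange(len(buf))
        yield buf[idx]
        buf[idx] = row
    # Step 4: drain the remaining buffered rows in random order.
    rng.shuffle(buf)
    yield from buf
```

Every input row is emitted exactly once, so the output is always a permutation; only the degree of mixing depends on `buffer_size`.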
Returns:
@ -456,38 +456,38 @@ class LJSpeechDataset(MappableDataset, AudioBaseDataset):
"""
A source dataset that reads and parses LJSpeech dataset.
The generated dataset has four columns :py:obj:`[waveform, sample_rate, transcription, normalized_transcript]` .
The column :py:obj:`waveform` is a tensor of the float32 type.
The column :py:obj:`sample_rate` is a scalar of the int32 type.
The column :py:obj:`transcription` is a scalar of the string type.
The column :py:obj:`normalized_transcript` is a scalar of the string type.
Args:
dataset_dir (str): Path to the root directory that contains the dataset.
num_samples (int, optional): The number of audios to be included in the dataset.
Default: None, all audios.
num_parallel_workers (int, optional): Number of workers to read the data.
Default: None, number set in the mindspore.dataset.config.
shuffle (bool, optional): Whether to perform shuffle on the dataset. Default: None, expected
order behavior shown in the table below.
sampler (Sampler, optional): Object used to choose samples from the dataset.
Default: None, expected order behavior shown in the table below.
num_shards (int, optional): Number of shards that the dataset will be divided into.
Default: None. When this argument is specified, `num_samples` reflects
the maximum sample number per shard.
shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
argument can only be specified when `num_shards` is also specified.
cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
Default: None, which means no cache is used.
Raises:
RuntimeError: If `dataset_dir` does not contain data files.
RuntimeError: If `sampler` and `shuffle` are specified at the same time.
RuntimeError: If `sampler` and `num_shards`/`shard_id` are specified at the same time.
RuntimeError: If `num_shards` is specified but `shard_id` is None.
RuntimeError: If `shard_id` is specified but `num_shards` is None.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.
ValueError: If `shard_id` is invalid (< 0 or >= `num_shards`).
Note:
@ -39,32 +39,38 @@ class AGNewsDataset(SourceDataset, TextBaseDataset):
"""
A source dataset that reads and parses AG News datasets.
The generated dataset has three columns: :py:obj:`[index, title, description]` ,
and the data type of three columns is string type.
Args:
dataset_dir (str): Path to the root directory that contains the dataset.
usage (str, optional): Acceptable usages include 'train', 'test' and 'all'. Default: None, all samples.
num_samples (int, optional): Number of samples (rows) to read. Default: None, reads the full dataset.
num_parallel_workers (int, optional): Number of workers to read the data.
Default: None, number set in the mindspore.dataset.config.
shuffle (Union[bool, Shuffle], optional): Perform reshuffling of the data every epoch.
Bool type and Shuffle enum are both supported to pass in. Default: `Shuffle.GLOBAL` .
If `shuffle` is False, no shuffling will be performed.
If `shuffle` is True, it is equivalent to setting `shuffle` to mindspore.dataset.Shuffle.GLOBAL.
Set the mode of data shuffling by passing in enumeration variables:
- Shuffle.GLOBAL: Shuffle both the files and samples.
- Shuffle.FILES: Shuffle files only.
num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
When this argument is specified, `num_samples` reflects the max sample number per shard.
shard_id (int, optional): The shard ID within `num_shards` . This
argument can only be specified when `num_shards` is also specified. Default: None.
cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
Default: None, which means no cache is used.
Raises:
RuntimeError: If `dataset_dir` does not contain data files.
RuntimeError: If `num_shards` is specified but `shard_id` is None.
RuntimeError: If `shard_id` is specified but `num_shards` is None.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.
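The difference between the two shuffle levels above can be illustrated with a small sketch (plain Python; `load` and its mode strings are hypothetical stand-ins for what `Shuffle.FILES` and `Shuffle.GLOBAL` do to a file-based dataset):

```python
import random

def load(files, mode, seed=0):
    """Return rows from per-file row lists under a given shuffle mode.

    mode: 'none' (keep order), 'files' (shuffle file order only,
    rows stay grouped per file), or 'global' (also mix every row).
    """
    rng = random.Random(seed)
    files = list(files)
    if mode in ("files", "global"):
        rng.shuffle(files)      # the Shuffle.FILES step
    rows = [row for f in files for row in f]
    if mode == "global":
        rng.shuffle(rows)       # Shuffle.GLOBAL also mixes samples
    return rows
```

`Shuffle.FILES` keeps each file's rows contiguous, which suits formats that are read sequentially; `Shuffle.GLOBAL` (the default, and what `shuffle=True` maps to) mixes individual samples across files.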
Examples:
>>> ag_news_dataset_dir = "/path/to/ag_news_dataset_file"
@ -125,45 +131,45 @@ class AmazonReviewDataset(SourceDataset, TextBaseDataset):
"""
A source dataset that reads and parses Amazon Review Polarity and Amazon Review Full datasets.
The generated dataset has three columns: :py:obj:`[label, title, content]` ,
and the data type of three columns is string.
Args:
dataset_dir (str): Path to the root directory that contains the Amazon Review Polarity dataset
or the Amazon Review Full dataset.
usage (str, optional): Usage of this dataset, can be 'train', 'test' or 'all'.
For Polarity dataset, 'train' will read from 3,600,000 train samples,
'test' will read from 400,000 test samples,
'all' will read from all 4,000,000 samples.
For Full dataset, 'train' will read from 3,000,000 train samples,
'test' will read from 650,000 test samples,
'all' will read from all 3,650,000 samples. Default: None, all samples.
num_samples (int, optional): Number of samples (rows) to be read. Default: None, reads the full dataset.
num_parallel_workers (int, optional): Number of workers to read the data.
Default: None, number set in the mindspore.dataset.config.
shuffle (Union[bool, Shuffle], optional): Perform reshuffling of the data every epoch.
Bool type and Shuffle enum are both supported to pass in. Default: `Shuffle.GLOBAL` .
If `shuffle` is False, no shuffling will be performed.
If `shuffle` is True, it is equivalent to setting `shuffle` to mindspore.dataset.Shuffle.GLOBAL.
Set the mode of data shuffling by passing in enumeration variables:
- Shuffle.GLOBAL: Shuffle both the files and samples.
- Shuffle.FILES: Shuffle files only.
num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
When this argument is specified, `num_samples` reflects the max sample number per shard.
shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
argument can only be specified when `num_shards` is also specified.
cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
Default: None, which means no cache is used.
Raises:
RuntimeError: If `dataset_dir` does not contain data files.
RuntimeError: If `num_shards` is specified but `shard_id` is None.
RuntimeError: If `shard_id` is specified but `num_shards` is None.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.
Examples:
>>> amazon_review_dataset_dir = "/path/to/amazon_review_dataset_dir"
@ -545,7 +551,7 @@ class DBpediaDataset(SourceDataset, TextBaseDataset):
"""
A source dataset that reads and parses the DBpedia dataset.
The generated dataset has three columns :py:obj:`[class, title, content]` ,
and the data type of three columns is string.
Args:
@ -553,34 +559,34 @@ class DBpediaDataset(SourceDataset, TextBaseDataset):
usage (str, optional): Usage of this dataset, can be 'train', 'test' or 'all'.
'train' will read from 560,000 train samples,
'test' will read from 70,000 test samples,
'all' will read from all 630,000 samples. Default: None, all samples.
num_samples (int, optional): The number of samples to be included in the dataset.
Default: None, will include all text.
num_parallel_workers (int, optional): Number of workers to read the data.
Default: None, number set in the mindspore.dataset.config.
shuffle (Union[bool, Shuffle], optional): Perform reshuffling of the data every epoch.
Bool type and Shuffle enum are both supported to pass in. Default: `Shuffle.GLOBAL` .
If shuffle is False, no shuffling will be performed.
If shuffle is True, it is equivalent to setting `shuffle` to mindspore.dataset.Shuffle.GLOBAL.
Set the mode of data shuffling by passing in enumeration variables:
- Shuffle.GLOBAL: Shuffle both the files and samples.
- Shuffle.FILES: Shuffle files only.
num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
When this argument is specified, `num_samples` reflects the maximum sample number per shard.
shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
argument can only be specified when `num_shards` is also specified.
cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
Default: None, which means no cache is used.
Raises:
RuntimeError: If `dataset_dir` does not contain data files.
RuntimeError: If `num_shards` is specified but `shard_id` is None.
RuntimeError: If `shard_id` is specified but `num_shards` is None.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.
ValueError: If `shard_id` is invalid (< 0 or >= `num_shards`).
Examples:
@ -640,33 +646,42 @@ class DBpediaDataset(SourceDataset, TextBaseDataset):
class EnWik9Dataset(SourceDataset, TextBaseDataset):
"""
A source dataset that reads and parses EnWik9 dataset.
The generated dataset has one column :py:obj:`[text]` with type string.
Args:
dataset_dir (str): Path to the root directory that contains the dataset.
num_samples (int, optional): The number of samples to be included in the dataset.
Default: None, will include all samples.
num_parallel_workers (int, optional): Number of workers to read the data.
Default: None, number set in the mindspore.dataset.config.
shuffle (Union[bool, Shuffle], optional): Perform reshuffling of the data every epoch.
Bool type and Shuffle enum are both supported to pass in. Default: True.
If shuffle is False, no shuffling will be performed.
If shuffle is True, it is equivalent to setting `shuffle` to mindspore.dataset.Shuffle.GLOBAL.
Set the mode of data shuffling by passing in enumeration variables:
- Shuffle.GLOBAL: Shuffle both the files and samples.
- Shuffle.FILES: Shuffle files only.
num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
When this argument is specified, `num_samples` reflects the maximum sample number per shard.
shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
argument can only be specified when `num_shards` is also specified.
cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
Default: None, which means no cache is used.
Raises:
RuntimeError: If `dataset_dir` does not contain data files.
RuntimeError: If `num_shards` is specified but `shard_id` is None.
RuntimeError: If `shard_id` is specified but `num_shards` is None.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.
Examples:
>>> en_wik9_dataset_dir = "/path/to/en_wik9_dataset"
@ -719,38 +734,41 @@ class IMDBDataset(MappableDataset, TextBaseDataset):
"""
A source dataset that reads and parses Internet Movie Database (IMDb).
The generated dataset has two columns: :py:obj:`[text, label]` .
The tensor of column :py:obj:`text` is of the string type.
The column :py:obj:`label` is a scalar of the uint32 type.
Args:
dataset_dir (str): Path to the root directory that contains the dataset.
usage (str, optional): Usage of this dataset, can be 'train', 'test' or 'all'.
Default: None, will read all samples.
num_samples (int, optional): The number of images to be included in the dataset.
Default: None, will read all samples.
num_parallel_workers (int, optional): Number of workers to read the data.
Default: None, number set in the mindspore.dataset.config.
shuffle (bool, optional): Whether or not to perform shuffle on the dataset.
Default: None, expected order behavior shown in the table below.
sampler (Sampler, optional): Object used to choose samples from the dataset.
Default: None, expected order behavior shown in the table below.
num_shards (int, optional): Number of shards that the dataset will be divided
into. Default: None. When this argument is specified, `num_samples` reflects
the maximum sample number per shard.
shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
argument can only be specified when `num_shards` is also specified.
cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
Default: None, which means no cache is used.
Raises:
RuntimeError: If `dataset_dir` does not contain data files.
RuntimeError: If `sampler` and `shuffle` are specified at the same time.
RuntimeError: If `sampler` and `num_shards`/`shard_id` are specified at the same time.
RuntimeError: If `num_shards` is specified but `shard_id` is None.
RuntimeError: If `shard_id` is specified but `num_shards` is None.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.
ValueError: If `shard_id` is invalid (< 0 or >= `num_shards`).
Note:
@ -861,47 +879,48 @@ class IWSLT2016Dataset(SourceDataset, TextBaseDataset):
"""
A source dataset that reads and parses IWSLT2016 datasets.
The generated dataset has two columns: :py:obj:`[text, translation]` .
The tensor of column :py:obj:`text` is of the string type.
The column :py:obj:`translation` is of the string type.
Args:
dataset_dir (str): Path to the root directory that contains the dataset.
usage (str, optional): Acceptable usages include 'train', 'valid', 'test' and 'all'. Default: None, all samples.
language_pair (sequence, optional): Sequence containing source and target language, supported values are
('en', 'fr'), ('en', 'de'), ('en', 'cs'), ('en', 'ar'), ('fr', 'en'), ('de', 'en'), ('cs', 'en'),
('ar', 'en'). Default: ('de', 'en').
valid_set (str, optional): A string to identify validation set, when usage is valid or all, the validation set
of `valid_set` type will be read, supported values are 'dev2010', 'tst2010', 'tst2011', 'tst2012', 'tst2013'
and 'tst2014'. Default: 'tst2013'.
test_set (str, optional): A string to identify test set, when usage is test or all, the test set of `test_set`
type will be read, supported values are 'dev2010', 'tst2010', 'tst2011', 'tst2012', 'tst2013' and 'tst2014'.
Default: 'tst2014'.
num_samples (int, optional): Number of samples (rows) to read. Default: None, reads the full dataset.
shuffle (Union[bool, Shuffle], optional): Perform reshuffling of the data every epoch.
Bool type and Shuffle enum are both supported to pass in. Default: `Shuffle.GLOBAL` .
If `shuffle` is False, no shuffling will be performed.
If `shuffle` is True, it is equivalent to setting `shuffle` to mindspore.dataset.Shuffle.GLOBAL.
Set the mode of data shuffling by passing in enumeration variables:
- Shuffle.GLOBAL: Shuffle both the files and samples.
- Shuffle.FILES: Shuffle files only.
num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
When this argument is specified, `num_samples` reflects the max sample number per shard.
shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
argument can only be specified when `num_shards` is also specified.
num_parallel_workers (int, optional): Number of workers to read the data.
Default: None, number set in the mindspore.dataset.config.
cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
Default: None, which means no cache is used.
Raises:
RuntimeError: If `dataset_dir` does not contain data files.
RuntimeError: If `num_shards` is specified but `shard_id` is None.
RuntimeError: If `shard_id` is specified but `num_shards` is None.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.
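Since `language_pair` only accepts the eight listed combinations, a caller-side check can be sketched as follows (`check_language_pair` is a hypothetical helper for illustration, not part of the API):

```python
# The eight pairs listed in the Args section above.
SUPPORTED_PAIRS = {
    ("en", "fr"), ("en", "de"), ("en", "cs"), ("en", "ar"),
    ("fr", "en"), ("de", "en"), ("cs", "en"), ("ar", "en"),
}

def check_language_pair(language_pair=("de", "en")):
    """Validate an IWSLT2016 language_pair before building the dataset."""
    pair = tuple(language_pair)
    if pair not in SUPPORTED_PAIRS:
        raise ValueError(f"unsupported language_pair: {pair}")
    return pair
```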
Examples:
>>> iwslt2016_dataset_dir = "/path/to/iwslt2016_dataset_dir"
@ -912,8 +931,8 @@ class IWSLT2016Dataset(SourceDataset, TextBaseDataset):
IWSLT is an international oral translation conference, a major annual scientific conference dedicated to all aspects
of oral translation. The MT task of the IWSLT evaluation activity constitutes a dataset, which can be publicly
obtained through the WIT3 website `wit3 <https://wit3.fbk.eu>`_ . The IWSLT2016 dataset includes translations from
English to Arabic, Czech, French, and German, and translations from Arabic, Czech, French, and German to English.
You can unzip the original IWSLT2016 dataset files into this directory structure and read by MindSpore's API. After
decompression, you also need to decompress the dataset to be read in the specified folder. For example, if you want
@ -988,42 +1007,42 @@ class IWSLT2017Dataset(SourceDataset, TextBaseDataset):
"""
A source dataset that reads and parses IWSLT2017 datasets.
The generated dataset has two columns: :py:obj:`[text, translation]` .
The tensor of column :py:obj:`text` and :py:obj:`translation` are of the string type.
Args:
dataset_dir (str): Path to the root directory that contains the dataset.
usage (str, optional): Acceptable usages include 'train', 'valid', 'test' and 'all'. Default: None, all samples.
language_pair (sequence, optional): List containing src and tgt language, supported values are ('en', 'nl'),
('en', 'de'), ('en', 'it'), ('en', 'ro'), ('nl', 'en'), ('nl', 'de'), ('nl', 'it'), ('nl', 'ro'),
('de', 'en'), ('de', 'nl'), ('de', 'it'), ('de', 'ro'), ('it', 'en'), ('it', 'nl'), ('it', 'de'),
('it', 'ro'), ('ro', 'en'), ('ro', 'nl'), ('ro', 'de'), ('ro', 'it'). Default: ('de', 'en').
num_samples (int, optional): Number of samples (rows) to read. Default: None, reads the full dataset.
shuffle (Union[bool, Shuffle], optional): Perform reshuffling of the data every epoch.
Bool type and Shuffle enum are both supported to pass in. Default: `Shuffle.GLOBAL` .
If shuffle is False, no shuffling will be performed.
If shuffle is True, it is equivalent to setting `shuffle` to mindspore.dataset.Shuffle.GLOBAL.
Set the mode of data shuffling by passing in enumeration variables:
- Shuffle.GLOBAL: Shuffle both the files and samples.
- Shuffle.FILES: Shuffle files only.
num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
When this argument is specified, `num_samples` reflects the max sample number of per shard.
shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
argument can only be specified when `num_shards` is also specified.
num_parallel_workers (int, optional): Number of workers to read the data.
Default: None, number set in the mindspore.dataset.config.
cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
Default: None, which means no cache is used.
Raises:
RuntimeError: If `dataset_dir` does not contain data files.
RuntimeError: If `num_shards` is specified but `shard_id` is None.
RuntimeError: If `shard_id` is specified but `num_shards` is None.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.
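The difference between the two shuffle levels can be sketched in plain Python. This is a toy model of the documented semantics (Shuffle.FILES reorders only the file order; Shuffle.GLOBAL additionally mixes rows across files), not MindSpore's actual implementation; `demo_shuffle` and its arguments are hypothetical names for illustration:

```python
import random

def demo_shuffle(files, samples_per_file, mode, seed=0):
    """Toy model of the two shuffle levels on a small corpus.

    mode='files'  -> shuffle the file order only (like Shuffle.FILES);
    mode='global' -> shuffle the file order and then all rows (like Shuffle.GLOBAL).
    Returns a list of (file, row_index) pairs in read order.
    """
    rng = random.Random(seed)
    files = list(files)
    rng.shuffle(files)  # both modes reorder the files
    rows = [(f, i) for f in files for i in range(samples_per_file[f])]
    if mode == "global":
        rng.shuffle(rows)  # GLOBAL additionally mixes rows across files
    return rows
```

Note that in 'files' mode the rows belonging to one file stay contiguous, while 'global' mode yields the same multiset of rows in a fully mixed order.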
Examples:
>>> iwslt2017_dataset_dir = "/path/to/iwslt2017_dataset_dir"
@@ -1033,8 +1052,8 @@ class IWSLT2017Dataset(SourceDataset, TextBaseDataset):
IWSLT is an international oral translation conference, a major annual scientific conference dedicated to all aspects
of oral translation. The MT task of the IWSLT evaluation activity constitutes a dataset, which can be publicly
obtained through the WIT3 website `wit3 <https://wit3.fbk.eu>`_ . The IWSLT2017 dataset involves German, English,
Italian, Dutch, and Romanian. The dataset includes translations in any two different languages.
You can unzip the original IWSLT2017 dataset files into this directory structure and read them with MindSpore's API. You
need to decompress the dataset package in texts/DeEnItNlRo/DeEnItNlRo directory to get the DeEnItNlRo-DeEnItNlRo
@@ -1186,7 +1205,7 @@ class PennTreebankDataset(SourceDataset, TextBaseDataset):
"""
A source dataset that reads and parses PennTreebank datasets.
The generated dataset has one column :py:obj:`[text]` .
The tensor of column :py:obj:`text` is of the string type.
Args:
@@ -1195,27 +1214,33 @@ class PennTreebankDataset(SourceDataset, TextBaseDataset):
'train' will read from 42,068 train samples of string type,
'test' will read from 3,370 test samples of string type,
'valid' will read from 3,761 valid samples of string type,
'all' will read from all 49,199 samples of string type. Default: None, all samples.
num_samples (int, optional): Number of samples (rows) to read. Default: None, reads the full dataset.
num_parallel_workers (int, optional): Number of workers to read the data.
Default: None, number set in the mindspore.dataset.config.
shuffle (Union[bool, Shuffle], optional): Perform reshuffling of the data every epoch.
Bool type and Shuffle enum are both supported to pass in. Default: `Shuffle.GLOBAL` .
If shuffle is False, no shuffling will be performed.
If shuffle is True, it is equivalent to setting `shuffle` to mindspore.dataset.Shuffle.GLOBAL.
Set the mode of data shuffling by passing in enumeration variables:
- Shuffle.GLOBAL: Shuffle both the files and samples.
- Shuffle.FILES: Shuffle files only.
num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
When this argument is specified, `num_samples` reflects the max sample number of per shard.
shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
argument can only be specified when `num_shards` is also specified.
cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
Default: None, which means no cache is used.
Raises:
RuntimeError: If `dataset_dir` does not contain data files.
RuntimeError: If `num_shards` is specified but `shard_id` is None.
RuntimeError: If `shard_id` is specified but `num_shards` is None.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.
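The sharding arguments behave the same way across these dataset classes; a minimal pure-Python sketch of the documented contract (`num_shards` and `shard_id` must be given together, `shard_id` must be in range, and `num_samples` then caps the rows of each shard) might look like the following. The round-robin row assignment is an assumption for illustration only, not MindSpore's internal partitioning scheme:

```python
def shard_rows(rows, num_shards=None, shard_id=None, num_samples=None):
    """Toy model of `num_shards`/`shard_id`/`num_samples` semantics.

    Assumed round-robin split: shard `shard_id` takes every num_shards-th
    row starting at offset shard_id; `num_samples` caps rows *per shard*.
    """
    if (num_shards is None) != (shard_id is None):
        raise RuntimeError("`num_shards` and `shard_id` must be specified together.")
    if num_shards is not None:
        if not 0 <= shard_id < num_shards:
            raise ValueError("`shard_id` is invalid (< 0 or >= `num_shards`).")
        rows = rows[shard_id::num_shards]
    return rows if num_samples is None else rows[:num_samples]
```

For example, with 10 rows, `num_shards=2` and `shard_id=1`, the shard sees 5 rows, and `num_samples=3` would then keep only the first 3 of those.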
Examples:
>>> penn_treebank_dataset_dir = "/path/to/penn_treebank_dataset_directory"


@@ -1444,7 +1444,7 @@ class EMnistDataset(MappableDataset, VisionBaseDataset):
"""
A source dataset that reads and parses the EMNIST dataset.
The generated dataset has two columns :py:obj:`[image, label]` .
The tensor of column :py:obj:`image` is of the uint8 type.
The tensor of column :py:obj:`label` is a scalar of the uint32 type.
@@ -1452,23 +1452,24 @@ class EMnistDataset(MappableDataset, VisionBaseDataset):
dataset_dir (str): Path to the root directory that contains the dataset.
name (str): Name of splits for this dataset, can be 'byclass', 'bymerge', 'balanced', 'letters', 'digits'
or 'mnist'.
usage (str, optional): Usage of this dataset, can be 'train', 'test' or 'all'. 'train' will read the
train samples, 'test' will read the test samples and 'all' will read all samples; the number of samples
in each split depends on `name`. Default: None, will read all samples.
num_samples (int, optional): The number of images to be included in the dataset.
Default: None, will read all images.
num_parallel_workers (int, optional): Number of workers to read the data.
Default: None, number set in the mindspore.dataset.config.
shuffle (bool, optional): Whether or not to perform shuffle on the dataset.
Default: None, expected order behavior shown in the table below.
sampler (Sampler, optional): Object used to choose samples from the
dataset. Default: None, expected order behavior shown in the table below.
num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
When this argument is specified, `num_samples` reflects the max sample number of per shard.
shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
argument can only be specified when `num_shards` is also specified.
cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
Default: None, which means no cache is used.
Raises:
RuntimeError: If `sampler` and `shuffle` are specified at the same time.
@@ -1577,44 +1578,44 @@ class FakeImageDataset(MappableDataset, VisionBaseDataset):
"""
A source dataset for generating fake images.
The generated dataset has two columns :py:obj:`[image, label]` .
The tensor of column :py:obj:`image` is of the uint8 type.
The column :py:obj:`label` is a scalar of the uint32 type.
Args:
num_images (int, optional): Number of images to generate in the dataset. Default: 1000.
image_size (tuple, optional): Size of the fake image. Default: (224, 224, 3).
num_classes (int, optional): Number of classes in the dataset. Default: 10.
base_seed (int, optional): Offsets the index-based random seed used to generate each image. Default: 0.
num_samples (int, optional): The number of images to be included in the dataset.
Default: None, will read all images.
num_parallel_workers (int, optional): Number of workers to read the data.
Default: None, number set in the mindspore.dataset.config.
shuffle (bool, optional): Whether or not to perform shuffle on the dataset.
Default: None, expected order behavior shown in the table below.
sampler (Sampler, optional): Object used to choose samples from the
dataset. Default: None, expected order behavior shown in the table below.
num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
When this argument is specified, `num_samples` reflects the max sample number of per shard.
shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
argument can only be specified when `num_shards` is also specified.
cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
Default: None, which means no cache is used.
Raises:
RuntimeError: If `sampler` and `shuffle` are specified at the same time.
RuntimeError: If `sampler` and `num_shards`/`shard_id` are specified at the same time.
RuntimeError: If `num_shards` is specified but `shard_id` is None.
RuntimeError: If `shard_id` is specified but `num_shards` is None.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.
ValueError: If `shard_id` is invalid (< 0 or >= `num_shards`).
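How `base_seed` offsets the index-based seed can be sketched in plain Python. This toy version only mirrors the documented determinism (the same `base_seed` and image index always yield the same image/label pair); `fake_image` and its small default size are illustrative, not the actual generator:

```python
import random

def fake_image(index, image_size=(8, 8, 3), num_classes=10, base_seed=0):
    """Toy sketch of index-based fake image generation.

    The per-image seed is base_seed + index, so generation is deterministic
    and reproducible for a given (base_seed, index) pair.
    """
    rng = random.Random(base_seed + index)
    h, w, c = image_size
    # nested lists of uint8-range values stand in for the real image tensor
    image = [[[rng.randrange(256) for _ in range(c)] for _ in range(w)]
             for _ in range(h)]
    label = rng.randrange(num_classes)  # a uint32 scalar in the real dataset
    return image, label
```

Calling it twice with the same arguments returns identical data, which is why a fixed `base_seed` makes the whole fake dataset reproducible across epochs.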
Note:
- This dataset can take in a `sampler` . `sampler` and `shuffle` are mutually exclusive.
The table below shows what input arguments are allowed and their expected behavior.
.. list-table:: Expected Order Behavior of Using `sampler` and `shuffle`
:widths: 25 25 50
:header-rows: 1
@@ -1665,40 +1666,40 @@ class FakeImageDataset(MappableDataset, VisionBaseDataset):
class FashionMnistDataset(MappableDataset, VisionBaseDataset):
"""
A source dataset that reads and parses the Fashion-MNIST dataset.
The generated dataset has two columns :py:obj:`[image, label]` .
The tensor of column :py:obj:`image` is of the uint8 type.
The column :py:obj:`label` is a scalar of the uint32 type.
Args:
dataset_dir (str): Path to the root directory that contains the dataset.
usage (str, optional): Usage of this dataset, can be 'train', 'test' or 'all'. 'train' will read from 60,000
train samples, 'test' will read from 10,000 test samples, 'all' will read from all 70,000 samples.
Default: None, will read all samples.
num_samples (int, optional): The number of images to be included in the dataset.
Default: None, will read all images.
num_parallel_workers (int, optional): Number of workers to read the data.
Default: None, number set in the mindspore.dataset.config.
shuffle (bool, optional): Whether or not to perform shuffle on the dataset.
Default: None, expected order behavior shown in the table below.
sampler (Sampler, optional): Object used to choose samples from the dataset.
Default: None, expected order behavior shown in the table below.
num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
When this argument is specified, `num_samples` reflects the maximum sample number of per shard.
shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
argument can only be specified when `num_shards` is also specified.
cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
Default: None, which means no cache is used.
Raises:
RuntimeError: If `dataset_dir` does not contain data files.
RuntimeError: If `sampler` and `shuffle` are specified at the same time.
RuntimeError: If `sampler` and `num_shards`/`shard_id` are specified at the same time.
RuntimeError: If `num_shards` is specified but `shard_id` is None.
RuntimeError: If `shard_id` is specified but `num_shards` is None.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.
ValueError: If `shard_id` is invalid (< 0 or >= `num_shards`).
Note:
@@ -2033,43 +2034,42 @@ class Flowers102Dataset(GeneratorDataset):
"""
A source dataset that reads and parses Flowers102 dataset.
According to the given `task` configuration, the generated dataset has different output columns:
- `task` = 'Classification', output columns: `[image, dtype=uint8]` , `[label, dtype=uint32]` .
- `task` = 'Segmentation',
output columns: `[image, dtype=uint8]` , `[segmentation, dtype=uint8]` , `[label, dtype=uint32]` .
Args:
dataset_dir (str): Path to the root directory that contains the dataset.
task (str, optional): Specify the 'Classification' or 'Segmentation' task. Default: 'Classification'.
usage (str, optional): Specify the 'train', 'valid', 'test' part or 'all' parts of dataset.
Default: 'all', will read all samples.
num_samples (int, optional): The number of samples to be included in the dataset. Default: None, all images.
num_parallel_workers (int, optional): Number of subprocesses used to fetch the dataset in parallel. Default: 1.
shuffle (bool, optional): Whether or not to perform shuffle on the dataset.
Default: None, expected order behavior shown in the table below.
decode (bool, optional): Whether or not to decode the images and segmentations after reading. Default: False.
sampler (Union[Sampler, Iterable], optional): Object used to choose samples from the dataset.
Default: None, expected order behavior shown in the table below.
num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
When this argument is specified, `num_samples` reflects the max sample number of per shard.
shard_id (int, optional): The shard ID within `num_shards` . Default: None. This argument must be specified only
when `num_shards` is also specified.
Raises:
RuntimeError: If `dataset_dir` does not contain data files.
RuntimeError: If `sampler` and `shuffle` are specified at the same time.
RuntimeError: If `sampler` and `num_shards`/`shard_id` are specified at the same time.
RuntimeError: If `num_shards` is specified but `shard_id` is None.
RuntimeError: If `shard_id` is specified but `num_shards` is None.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.
ValueError: If `shard_id` is invalid (< 0 or >= `num_shards`).
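The Raises entries above encode a consistency contract between `sampler`, `shuffle`, `num_shards` and `shard_id` that is shared by these dataset classes. A pure-Python sketch of those checks follows; the function name is illustrative and this is not MindSpore's internal validation code:

```python
def validate_args(shuffle=None, sampler=None, num_shards=None, shard_id=None):
    """Toy sketch of the argument checks implied by the Raises section."""
    if sampler is not None and shuffle is not None:
        # sampler and shuffle are mutually exclusive
        raise RuntimeError("sampler and shuffle cannot be specified at the same time.")
    if sampler is not None and (num_shards is not None or shard_id is not None):
        # sampler and sharding are mutually exclusive
        raise RuntimeError("sampler and num_shards/shard_id cannot be specified "
                           "at the same time.")
    if (num_shards is None) != (shard_id is None):
        # both or neither must be given
        raise RuntimeError("num_shards and shard_id must be specified together.")
    if num_shards is not None and not 0 <= shard_id < num_shards:
        raise ValueError("shard_id is invalid (< 0 or >= num_shards).")
```

Passing only `shuffle`, only `sampler`, or a valid `num_shards`/`shard_id` pair is accepted; every combination listed in the Raises section is rejected with the corresponding exception type.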
Note:
- This dataset can take in a `sampler` . `sampler` and `shuffle` are mutually exclusive.
The table below shows what input arguments are allowed and their expected behavior.
.. list-table:: Expected Order Behavior of Using `sampler` and `shuffle`
:widths: 25 25 50
:header-rows: 1
@@ -2479,39 +2479,39 @@ class KMnistDataset(MappableDataset, VisionBaseDataset):
"""
A source dataset that reads and parses the KMNIST dataset.
The generated dataset has two columns :py:obj:`[image, label]` .
The tensor of column :py:obj:`image` is of the uint8 type.
The column :py:obj:`label` is a scalar of the uint32 type.
Args:
dataset_dir (str): Path to the root directory that contains the dataset.
usage (str, optional): Usage of this dataset, can be 'train', 'test' or 'all' . 'train' will read from 60,000
train samples, 'test' will read from 10,000 test samples, 'all' will read from all 70,000 samples.
Default: None, will read all samples.
num_samples (int, optional): The number of images to be included in the dataset.
Default: None, will read all images.
num_parallel_workers (int, optional): Number of workers to read the data.
Default: None, number set in the mindspore.dataset.config.
shuffle (bool, optional): Whether or not to perform shuffle on the dataset.
Default: None, expected order behavior shown in the table below.
sampler (Sampler, optional): Object used to choose samples from the dataset.
Default: None, expected order behavior shown in the table below.
num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
When this argument is specified, `num_samples` reflects the maximum sample number of per shard.
shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
argument can only be specified when `num_shards` is also specified.
cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
Default: None, which means no cache is used.
Raises:
RuntimeError: If `dataset_dir` does not contain data files.
RuntimeError: If `sampler` and `shuffle` are specified at the same time.
RuntimeError: If `sampler` and sharding are specified at the same time.
RuntimeError: If `num_shards` is specified but `shard_id` is None.
RuntimeError: If `shard_id` is specified but `num_shards` is None.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.
ValueError: If `shard_id` is invalid (< 0 or >= `num_shards`).
Note:
- This dataset can take in a `sampler`. `sampler` and `shuffle` are mutually exclusive.
@@ -3259,41 +3259,38 @@ class PhotoTourDataset(MappableDataset, VisionBaseDataset):
"""
A source dataset that reads and parses the PhotoTour dataset.
According to the given `usage` configuration, the generated dataset has different output columns:
- `usage` = 'train', output columns: `[image, dtype=uint8]` .
- `usage` != 'train', output columns: `[image1, dtype=uint8]` , `[image2, dtype=uint8]` , `[matches, dtype=uint32]` .
Args:
dataset_dir (str): Path to the root directory that contains the dataset.
name (str): Name of the dataset to load,
should be one of 'notredame', 'yosemite', 'liberty', 'notredame_harris',
'yosemite_harris' or 'liberty_harris'.
usage (str, optional): Usage of the dataset, can be 'train' or 'test'. Default: None, will be set to 'train'.
When usage is 'train', number of samples for each `name` is
{'notredame': 468159, 'yosemite': 633587, 'liberty': 450092, 'liberty_harris': 379587,
'yosemite_harris': 450912, 'notredame_harris': 325295}.
When usage is 'test', will read 100,000 samples for testing.
num_samples (int, optional): The number of images to be included in the dataset.
Default: None, will read all images.
num_parallel_workers (int, optional): Number of workers to read the data.
Default: None, number set in the mindspore.dataset.config.
shuffle (bool, optional): Whether or not to perform shuffle on the dataset.
Default: None, expected order behavior shown in the table below.
sampler (Sampler, optional): Object used to choose samples from the dataset.
Default: None, expected order behavior shown in the table below.
num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
When this argument is specified, `num_samples` reflects the max sample number of per shard.
shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
argument can only be specified when `num_shards` is also specified.
cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
Default: None, which means no cache is used.
Raises:
RuntimeError: If `dataset_dir` does not contain data files.
RuntimeError: If `sampler` and `shuffle` are specified at the same time.
RuntimeError: If `sampler` and `num_shards`/`shard_id` are specified at the same time.
RuntimeError: If `num_shards` is specified but `shard_id` is None.
@@ -3302,13 +3299,14 @@ class PhotoTourDataset(MappableDataset, VisionBaseDataset):
ValueError: If `usage` is not in ["train", "test"].
ValueError: If `name` is not in ["notredame", "yosemite", "liberty",
"notredame_harris", "yosemite_harris", "liberty_harris"].
ValueError: If `num_parallel_workers` exceeds the max thread numbers.
ValueError: If `shard_id` is invalid (< 0 or >= `num_shards`).
Note:
- This dataset can take in a `sampler` . `sampler` and `shuffle` are mutually exclusive. The table
below shows what input arguments are allowed and their expected behavior.
.. list-table:: Expected Order Behavior of Using `sampler` and `shuffle`
:widths: 64 64 1
:header-rows: 1