forked from mindspore-Ecosystem/mindspore
!44822 Fix problems with some Chinese API reviews
Merge pull request !44822 from 刘勇琪/code_docs_modify_chinese_api
This commit is contained in:
commit d4db709ee0

@@ -3,7 +3,7 @@ mindspore.dataset.Dataset.shuffle
 .. py:method:: mindspore.dataset.Dataset.shuffle(buffer_size)
 
-使用以下策略混洗此数据集的行:
+通过创建 `buffer_size` 大小的缓存来混洗该数据集。
 
 1. 生成一个混洗缓冲区包含 `buffer_size` 条数据行。
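The buffer-based shuffle this hunk describes can be modeled in plain Python. This is a simplified sketch of the general shuffle-buffer technique, not MindSpore's actual implementation; `rows`, `buffer_size` and `seed` are illustrative:

```python
import random

def buffer_shuffle(rows, buffer_size, seed=0):
    """Yield rows in shuffled order using a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    buffer = []
    for row in rows:
        buffer.append(row)
        if len(buffer) == buffer_size:
            # Emit one random buffered row, freeing a slot for the next row.
            yield buffer.pop(rng.randrange(len(buffer)))
    # Drain the remaining buffered rows in random order.
    rng.shuffle(buffer)
    while buffer:
        yield buffer.pop()

shuffled = list(buffer_shuffle(range(10), buffer_size=4))
assert sorted(shuffled) == list(range(10))  # same rows, new order
```

A larger `buffer_size` gives a more thorough shuffle at the cost of memory, which is why the docstring exposes it as a tuning knob.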
@@ -12,17 +12,23 @@ mindspore.dataset.AGNewsDataset
 - **usage** (str, 可选) - 指定数据集的子集,可取值为'train','test'或'all'。默认值:None,读取全部样本。
 - **num_samples** (int, 可选) - 指定从数据集中读取的样本数。默认值:None,读取所有样本。
 - **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:None,使用mindspore.dataset.config中配置的线程数。
-- **shuffle** (Union[bool, Shuffle], 可选) - 每个epoch中数据混洗的模式,支持传入bool类型与枚举类型进行指定,默认值:mindspore.dataset.Shuffle.GLOBAL。
+- **shuffle** (Union[bool, Shuffle], 可选) - 每个epoch中数据混洗的模式,支持传入bool类型与枚举类型进行指定。默认值:`Shuffle.GLOBAL` 。
+  如果 `shuffle` 为False,则不混洗,如果 `shuffle` 为True,等同于将 `shuffle` 设置为mindspore.dataset.Shuffle.GLOBAL。
+  通过传入枚举变量设置数据混洗的模式:
+
+  - **Shuffle.GLOBAL**:混洗文件和样本。
+  - **Shuffle.FILES**:仅混洗文件。
+
-- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数,默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
-- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号,默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
+- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数。默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
+- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号。默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
 - **cache** (DatasetCache, 可选) - 单节点数据缓存服务,用于加快数据集处理,详情请阅读 `单节点数据缓存 <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ 。默认值:None,不使用缓存。
 
 异常:
 - **RuntimeError** - `dataset_dir` 参数所指向的文件目录不存在或缺少数据集文件。
 - **RuntimeError** - 指定了 `num_shards` 参数,但是未指定 `shard_id` 参数。
 - **RuntimeError** - 指定了 `shard_id` 参数,但是未指定 `num_shards` 参数。
 - **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
 
 **关于AGNews数据集:**
 
 AG是一个大型合集,具有超过100万篇新闻文章。这些新闻文章是由ComeToMyHead在持续1年多的活动中,从2000多个新闻来源收集的。ComeToMyHead是一个学术新闻搜索引擎,自2004年7月以来一直在运营。
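The two enum values this hunk documents differ in what gets shuffled. A plain-Python sketch of that difference (not MindSpore's implementation; `files` is a hypothetical mapping from file name to its rows):

```python
import random

def shuffle_files_only(files, seed=0):
    """Shuffle.FILES: shuffle the file order, keep row order within each file."""
    rng = random.Random(seed)
    names = list(files)
    rng.shuffle(names)
    return [row for name in names for row in files[name]]

def shuffle_global(files, seed=0):
    """Shuffle.GLOBAL: shuffle all rows across all files."""
    rng = random.Random(seed)
    rows = [row for name in files for row in files[name]]
    rng.shuffle(rows)
    return rows

files = {"a.txt": [1, 2, 3], "b.txt": [4, 5, 6]}
# With FILES, rows from one file stay contiguous and in order:
assert shuffle_files_only(files) in ([1, 2, 3, 4, 5, 6], [4, 5, 6, 1, 2, 3])
# With GLOBAL, any permutation of all rows is possible:
assert sorted(shuffle_global(files)) == [1, 2, 3, 4, 5, 6]
```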
@@ -52,5 +58,5 @@ mindspore.dataset.AGNewsDataset
 archivePrefix={arXiv},
 primaryClass={cs.LG}
 }
 
-.. include:: mindspore.dataset.api_list_nlp.rst
+.. include:: mindspore.dataset.api_list_nlp.rst
@@ -14,22 +14,22 @@ mindspore.dataset.AmazonReviewDataset
 对于Full数据集,'train'将读取300万个训练样本,'test'将读取65万个测试样本,'all'将读取所有365万个样本。默认值:None,读取所有样本。
 - **num_samples** (int, 可选) - 指定从数据集中读取的样本数。
 - **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:None,使用mindspore.dataset.config中配置的线程数。
-- **shuffle** (Union[bool, Shuffle], 可选) - 每个epoch中数据混洗的模式,支持传入bool类型与枚举类型进行指定,默认值:mindspore.dataset.Shuffle.GLOBAL。
+- **shuffle** (Union[bool, Shuffle], 可选) - 每个epoch中数据混洗的模式,支持传入bool类型与枚举类型进行指定。默认值:`Shuffle.GLOBAL` 。
   如果 `shuffle` 为False,则不混洗,如果 `shuffle` 为True,等同于将 `shuffle` 设置为mindspore.dataset.Shuffle.GLOBAL。
   通过传入枚举变量设置数据混洗的模式:
 
   - **Shuffle.GLOBAL**:混洗文件和样本。
   - **Shuffle.FILES**:仅混洗文件。
 
-- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数,默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
-- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号,默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
+- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数。默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
+- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号。默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
 - **cache** (DatasetCache, 可选) - 单节点数据缓存服务,用于加快数据集处理,详情请阅读 `单节点数据缓存 <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ 。默认值:None,不使用缓存。
 
 异常:
 - **RuntimeError** - `dataset_dir` 参数所指向的文件目录不存在或缺少数据集文件。
+- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
 - **RuntimeError** - 指定了 `num_shards` 参数,但是未指定 `shard_id` 参数。
 - **RuntimeError** - 指定了 `shard_id` 参数,但是未指定 `num_shards` 参数。
-- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
 
 **关于AmazonReview数据集:**
@@ -10,26 +10,26 @@ mindspore.dataset.DBpediaDataset
 参数:
 - **dataset_dir** (str) - 包含数据集文件的根目录路径。
 - **usage** (str, 可选) - 指定数据集的子集,可取值为'train','test'或'all'。
-  'train'将读取560,000个训练样本,'test'将读取70,000个测试样本中,'all'将读取所有63万个样本。默认值:None,读取全部样本。
+  'train'将读取560,000个训练样本,'test'将读取70,000个测试样本中,'all'将读取所有630,000个样本。默认值:None,读取全部样本。
 - **num_samples** (int, 可选) - 指定从数据集中读取的样本数。默认值:None,读取所有样本。
 - **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:None,使用mindspore.dataset.config中配置的线程数。
-- **shuffle** (Union[bool, Shuffle], 可选) - 每个epoch中数据混洗的模式,支持传入bool类型与枚举类型进行指定,默认值:mindspore.dataset.Shuffle.GLOBAL。
+- **shuffle** (Union[bool, Shuffle], 可选) - 每个epoch中数据混洗的模式,支持传入bool类型与枚举类型进行指定。默认值:`Shuffle.GLOBAL` 。
   如果 `shuffle` 为False,则不混洗,如果 `shuffle` 为True,等同于将 `shuffle` 设置为mindspore.dataset.Shuffle.GLOBAL。
   通过传入枚举变量设置数据混洗的模式:
 
   - **Shuffle.GLOBAL**:混洗文件和样本。
   - **Shuffle.FILES**:仅混洗文件。
 
-- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数,默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
-- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号,默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
+- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数。默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
+- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号。默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
 - **cache** (DatasetCache, 可选) - 单节点数据缓存服务,用于加快数据集处理,详情请阅读 `单节点数据缓存 <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ 。默认值:None,不使用缓存。
 
 异常:
 - **RuntimeError** - `dataset_dir` 参数所指向的文件目录不存在或缺少数据集文件。
+- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
 - **RuntimeError** - 指定了 `num_shards` 参数,但是未指定 `shard_id` 参数。
 - **RuntimeError** - 指定了 `shard_id` 参数,但是未指定 `num_shards` 参数。
-- **ValueError** - `shard_id` 参数值错误(小于0或者大于等于 `num_shards` )。
-- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
+- **ValueError** - `shard_id` 参数值错误,小于0或者大于等于 `num_shards` 。
 
 **关于DBpedia数据集:**
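The `num_shards` / `shard_id` contract these hunks repeat (every worker sees a disjoint slice, and `num_samples` then caps each shard) can be sketched as a round-robin split. This is an illustrative model, not necessarily the exact partitioning MindSpore uses:

```python
def shard_rows(rows, num_shards, shard_id, num_samples=None):
    """Keep every num_shards-th row starting at shard_id, capped at num_samples."""
    if not 0 <= shard_id < num_shards:
        # Mirrors the documented ValueError for an out-of-range shard_id.
        raise ValueError("shard_id must be in [0, num_shards)")
    picked = [row for i, row in enumerate(rows) if i % num_shards == shard_id]
    # Once num_shards is set, num_samples is the per-shard maximum.
    return picked if num_samples is None else picked[:num_samples]

assert shard_rows(range(10), num_shards=4, shard_id=1) == [1, 5, 9]
assert shard_rows(range(10), num_shards=4, shard_id=1, num_samples=2) == [1, 5]
```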
@@ -61,5 +61,5 @@ mindspore.dataset.DBpediaDataset
 howpublished = {http://dbpedia.org}
 }
 
-.. include:: mindspore.dataset.api_list_nlp.rst
+.. include:: mindspore.dataset.api_list_nlp.rst
@@ -5,19 +5,19 @@ mindspore.dataset.EMnistDataset
 读取和解析EMNIST数据集的源文件构建数据集。
 
-生成的数据集有两列: `[image, label]`。 `image` 列的数据类型为uint8。 `label` 列的数据类型为uint32。
+生成的数据集有两列: `[image, label]` 。 `image` 列的数据类型为uint8。 `label` 列的数据类型为uint32。
 
 参数:
 - **dataset_dir** (str) - 包含数据集文件的根目录路径。
 - **name** (str) - 按给定规则对数据集进行拆分,可以是'byclass'、'bymerge'、'balanced'、'letters'、'digits'或'mnist'。
 - **usage** (str, 可选) - 指定数据集的子集,可取值为 'train'、'test' 或 'all'。
   取值为'train'时将会读取60,000个训练样本,取值为'test'时将会读取10,000个测试样本,取值为'all'时将会读取全部70,000个样本。默认值:None,读取全部样本图片。
-- **num_samples** (int, 可选) - 指定从数据集中读取的样本数,可以小于数据集总数。默认值:None,读取全部样本图片。
+- **num_samples** (int, 可选) - 指定从数据集中读取的样本数。默认值:None,读取全部样本图片。
 - **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:None,使用mindspore.dataset.config中配置的线程数。
 - **shuffle** (bool, 可选) - 是否混洗数据集。默认值:None,下表中会展示不同参数配置的预期行为。
-- **sampler** (Sampler, 可选) - 指定从数据集中选取样本的采样器,默认值:None,下表中会展示不同配置的预期行为。
-- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数,默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
-- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号,默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
+- **sampler** (Sampler, 可选) - 指定从数据集中选取样本的采样器。默认值:None,下表中会展示不同配置的预期行为。
+- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数。默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
+- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号。默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
 - **cache** (DatasetCache, 可选) - 单节点数据缓存服务,用于加快数据集处理,详情请阅读 `单节点数据缓存 <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ 。默认值:None,不使用缓存。
 
 异常:
@@ -25,7 +25,7 @@ mindspore.dataset.EMnistDataset
 - **RuntimeError** - 同时指定了 `sampler` 和 `num_shards` 参数或同时指定了 `sampler` 和 `shard_id` 参数。
 - **RuntimeError** - 指定了 `num_shards` 参数,但是未指定 `shard_id` 参数。
 - **RuntimeError** - 指定了 `shard_id` 参数,但是未指定 `num_shards` 参数。
-- **ValueError** - `shard_id` 参数错误(小于0或者大于等于 `num_shards` )。
+- **ValueError** - `shard_id` 参数错误,小于0或者大于等于 `num_shards` 。
 
 .. note:: 此数据集可以指定参数 `sampler` ,但参数 `sampler` 和参数 `shuffle` 的行为是互斥的。下表展示了几种合法的输入参数组合及预期的行为。
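The RuntimeError/ValueError rules listed in these exception blocks amount to one small argument-validation routine. A hedged plain-Python sketch of the documented checks (not MindSpore's actual validator):

```python
def validate_args(shuffle=None, sampler=None, num_shards=None, shard_id=None):
    """Reproduce the documented mutual-exclusion and sharding checks."""
    if sampler is not None and shuffle is not None:
        raise RuntimeError("sampler and shuffle cannot both be specified")
    if sampler is not None and (num_shards is not None or shard_id is not None):
        raise RuntimeError("sampler cannot be used with num_shards/shard_id")
    if (num_shards is None) != (shard_id is None):
        # num_shards without shard_id, or shard_id without num_shards.
        raise RuntimeError("num_shards and shard_id must be specified together")
    if num_shards is not None and not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in [0, num_shards)")

validate_args(shuffle=True)              # valid: shuffle alone
validate_args(num_shards=4, shard_id=3)  # valid: both sharding args
```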
@@ -96,5 +96,5 @@ mindspore.dataset.EMnistDataset
 publication_support_materials/emnist}
 }
 
-.. include:: mindspore.dataset.api_list_vision.rst
+.. include:: mindspore.dataset.api_list_vision.rst
@@ -3,7 +3,7 @@ mindspore.dataset.EnWik9Dataset
 .. py:class:: mindspore.dataset.EnWik9Dataset(dataset_dir, num_samples=None, num_parallel_workers=None, shuffle=True, num_shards=None, shard_id=None, cache=None)
 
-读取和解析EnWik9数据集的源数据集。
+读取和解析EnWik9 Full和EnWik9 Polarity数据集。
 
 生成的数据集有一列 `[text]` ,数据类型为string。
@@ -13,17 +13,23 @@ mindspore.dataset.EnWik9Dataset
 对于Polarity数据集,'train'将读取360万个训练样本,'test'将读取40万个测试样本,'all'将读取所有400万个样本。
 对于Full数据集,'train'将读取300万个训练样本,'test'将读取65万个测试样本,'all'将读取所有365万个样本。默认值:None,读取所有样本。
 - **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:None,使用mindspore.dataset.config中配置的线程数。
-- **shuffle** (Union[bool, Shuffle], 可选) - 每个epoch中数据混洗的模式,支持传入bool类型与枚举类型进行指定,默认值:True。
+- **shuffle** (Union[bool, Shuffle], 可选) - 每个epoch中数据混洗的模式,支持传入bool类型与枚举类型进行指定。默认值:True。
+  如果 `shuffle` 为False,则不混洗,如果 `shuffle` 为True,等同于将 `shuffle` 设置为mindspore.dataset.Shuffle.GLOBAL。
+  通过传入枚举变量设置数据混洗的模式:
+
+  - **Shuffle.GLOBAL**:混洗文件和样本。
+  - **Shuffle.FILES**:仅混洗文件。
+
-- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数,默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
-- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号,默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
+- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数。默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
+- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号。默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
 - **cache** (DatasetCache, 可选) - 单节点数据缓存服务,用于加快数据集处理,详情请阅读 `单节点数据缓存 <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ 。默认值:None,不使用缓存。
 
 异常:
 - **RuntimeError** - `dataset_dir` 参数所指向的文件目录不存在或缺少数据集文件。
 - **RuntimeError** - 指定了 `num_shards` 参数,但是未指定 `shard_id` 参数。
 - **RuntimeError** - 指定了 `shard_id` 参数,但是未指定 `num_shards` 参数。
 - **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
 
 **关于EnWik9数据集:**
 
 EnWik9的数据是一系列UTF-8编码的XML,主要由英文文本组成。数据集包含243,426篇文章标题,其中85,560个被重定向以修复丢失的网页链接,其余是常规文章。
@@ -50,5 +56,5 @@ mindspore.dataset.EnWik9Dataset
 year = {2006}
 }
 
-.. include:: mindspore.dataset.api_list_nlp.rst
+.. include:: mindspore.dataset.api_list_nlp.rst
@@ -5,28 +5,28 @@ mindspore.dataset.FakeImageDataset
 生成虚假图像构建数据集。
 
-生成的数据集有两列: `[image, label]`。 `image` 列的数据类型为uint8。 `label` 列的数据类型为uint32。
+生成的数据集有两列: `[image, label]` 。 `image` 列的数据类型为uint8。 `label` 列的数据类型为uint32。
 
 参数:
-- **num_images** (int, 可选) - 要生成的虚假图像数,默认值:1000。
-- **image_size** (tuple, 可选) - 虚假图像的尺寸,默认值:(224, 224, 3)。
-- **num_classes** (int, 可选) - 数据集的类别数,默认值:10。
-- **base_seed** (int, 可选) - 生成随机图像的随机种子,默认值:0。
+- **num_images** (int, 可选) - 要生成的虚假图像数。默认值:1000。
+- **image_size** (tuple, 可选) - 虚假图像的尺寸。默认值:(224, 224, 3)。
+- **num_classes** (int, 可选) - 数据集的类别数。默认值:10。
+- **base_seed** (int, 可选) - 生成随机图像的随机种子。默认值:0。
 - **num_samples** (int, 可选) - 指定从数据集中读取的样本数,可以小于数据集总数。默认值:None,读取全部样本图片。
 - **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:None,使用mindspore.dataset.config中配置的线程数。
 - **shuffle** (bool, 可选) - 是否混洗数据集。默认值:None,下表中会展示不同参数配置的预期行为。
-- **sampler** (Sampler, 可选) - 指定从数据集中选取样本的采样器,默认值:None,下表中会展示不同配置的预期行为。
-- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数,默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
-- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号,默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
+- **sampler** (Sampler, 可选) - 指定从数据集中选取样本的采样器。默认值:None,下表中会展示不同配置的预期行为。
+- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数。默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
+- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号。默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
 - **cache** (DatasetCache, 可选) - 单节点数据缓存服务,用于加快数据集处理,详情请阅读 `单节点数据缓存 <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ 。默认值:None,不使用缓存。
 
 异常:
+- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
 - **RuntimeError** - 同时指定了 `sampler` 和 `shuffle` 参数。
 - **RuntimeError** - 同时指定了 `sampler` 和 `num_shards` 参数或同时指定了 `sampler` 和 `shard_id` 参数。
 - **RuntimeError** - 指定了 `num_shards` 参数,但是未指定 `shard_id` 参数。
 - **RuntimeError** - 指定了 `shard_id` 参数,但是未指定 `num_shards` 参数。
-- **ValueError** - `shard_id` 参数错误(小于0或者大于等于 `num_shards` )。
-- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
+- **ValueError** - `shard_id` 参数错误,小于0或者大于等于 `num_shards` 。
 
 .. note:: 此数据集可以指定参数 `sampler` ,但参数 `sampler` 和参数 `shuffle` 的行为是互斥的。下表展示了几种合法的输入参数组合及预期的行为。
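How `num_images`, `image_size`, `num_classes` and `base_seed` plausibly interact in a fake-image source can be sketched without MindSpore. This is an illustrative model only (the real operator fills `image_size` random uint8 pixels; here we record just the flattened size to keep the sketch cheap):

```python
import random

def make_fake_images(num_images=1000, image_size=(224, 224, 3),
                     num_classes=10, base_seed=0):
    """Yield (pixel_count, label) pairs; base_seed makes runs reproducible."""
    h, w, c = image_size
    for i in range(num_images):
        rng = random.Random(base_seed + i)  # one derived seed per image
        label = rng.randrange(num_classes)  # labels drawn uniformly per class
        yield h * w * c, label

data = list(make_fake_images(num_images=5, image_size=(2, 2, 3)))
assert len(data) == 5
assert all(size == 12 and 0 <= label < 10 for size, label in data)
# Same base_seed reproduces the same dataset:
assert data == list(make_fake_images(num_images=5, image_size=(2, 2, 3)))
```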
@@ -56,5 +56,5 @@ mindspore.dataset.FakeImageDataset
 - False
 - 不允许
 
-.. include:: mindspore.dataset.api_list_vision.rst
+.. include:: mindspore.dataset.api_list_vision.rst
@@ -5,7 +5,7 @@ mindspore.dataset.FashionMnistDataset
 读取和解析Fashion-MNIST数据集的源文件构建数据集。
 
-生成的数据集有两列: `[image, label]`。 `image` 列的数据类型为uint8。 `label` 列的数据类型为uint32。
+生成的数据集有两列: `[image, label]` 。 `image` 列的数据类型为uint8。 `label` 列的数据类型为uint32。
 
 参数:
 - **dataset_dir** (str) - 包含数据集文件的根目录路径。
@@ -14,19 +14,19 @@ mindspore.dataset.FashionMnistDataset
 - **num_samples** (int, 可选) - 指定从数据集中读取的样本数,可以小于数据集总数。默认值:None,读取全部样本图片。
 - **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:None,使用mindspore.dataset.config中配置的线程数。
 - **shuffle** (bool, 可选) - 是否混洗数据集。默认值:None,下表中会展示不同参数配置的预期行为。
-- **sampler** (Sampler, 可选) - 指定从数据集中选取样本的采样器,默认值:None,下表中会展示不同配置的预期行为。
-- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数,默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
-- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号,默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
+- **sampler** (Sampler, 可选) - 指定从数据集中选取样本的采样器。默认值:None,下表中会展示不同配置的预期行为。
+- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数。默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
+- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号。默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
 - **cache** (DatasetCache, 可选) - 单节点数据缓存服务,用于加快数据集处理,详情请阅读 `单节点数据缓存 <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ 。默认值:None,不使用缓存。
 
 异常:
 - **RuntimeError** - `dataset_dir` 路径下不包含数据文件。
+- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
 - **RuntimeError** - 同时指定了 `sampler` 和 `shuffle` 参数。
 - **RuntimeError** - 同时指定了 `sampler` 和 `num_shards` 参数或同时指定了 `sampler` 和 `shard_id` 参数。
 - **RuntimeError** - 指定了 `num_shards` 参数,但是未指定 `shard_id` 参数。
 - **RuntimeError** - 指定了 `shard_id` 参数,但是未指定 `num_shards` 参数。
-- **ValueError** - `shard_id` 参数错误(小于0或者大于等于 `num_shards` )。
-- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
+- **ValueError** - `shard_id` 参数错误,小于0或者大于等于 `num_shards` 。
 
 .. note:: 此数据集可以指定参数 `sampler` ,但参数 `sampler` 和参数 `shuffle` 的行为是互斥的。下表展示了几种合法的输入参数组合及预期的行为。
@@ -7,29 +7,29 @@ mindspore.dataset.Flowers102Dataset
 根据给定的 `task` 配置,生成数据集具有不同的输出列:
 
-- `task` = 'Classification',输出列: `[image, dtype=uint8]` , `[label, dtype=uint32]` 。
-- `task` = 'Segmentation',输出列: `[image, dtype=uint8]` , `[segmentation, dtype=uint8]` , `[label, dtype=uint32]`。
+- `task` = 'Classification',输出列: `[image, dtype=uint8]` 、 `[label, dtype=uint32]` 。
+- `task` = 'Segmentation',输出列: `[image, dtype=uint8]` 、 `[segmentation, dtype=uint8]` 、 `[label, dtype=uint32]` 。
 
 参数:
 - **dataset_dir** (str) - 包含数据集文件的根目录的路径。
 - **task** (str, 可选) - 指定读取数据的任务类型,支持'Classification'和'Segmentation'。默认值:'Classification'。
 - **usage** (str, 可选) - 指定数据集的子集,可取值为'train','valid','test'或'all'。默认值:'all',读取全部样本。
 - **num_samples** (int, 可选) - 指定从数据集中读取的样本数。默认值:None,所有图像样本。
-- **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:None,使用mindspore.dataset.config中配置的线程数。
+- **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:1。
 - **shuffle** (bool, 可选) - 是否混洗数据集。默认值:None,下表中会展示不同参数配置的预期行为。
-- **decode** (bool, 可选) - 是否对读取的图片进行解码操作,默认值:False,不解码。
-- **sampler** (Union[Sampler, Iterable], 可选) - 指定从数据集中选取样本的采样器,默认值:None,下表中会展示不同配置的预期行为。
-- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数,默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
-- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号,默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
+- **decode** (bool, 可选) - 是否对读取的图片进行解码操作。默认值:False,不解码。
+- **sampler** (Union[Sampler, Iterable], 可选) - 指定从数据集中选取样本的采样器。默认值:None,下表中会展示不同配置的预期行为。
+- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数。默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
+- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号。默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
 
 异常:
 - **RuntimeError** - `dataset_dir` 路径下不包含任何数据文件。
+- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
 - **RuntimeError** - 同时指定了 `sampler` 和 `shuffle` 参数。
 - **RuntimeError** - 同时指定了 `sampler` 和 `num_shards` 参数或同时指定了 `sampler` 和 `shard_id` 参数。
 - **RuntimeError** - 指定了 `num_shards` 参数,但是未指定 `shard_id` 参数。
 - **RuntimeError** - 指定了 `shard_id` 参数,但是未指定 `num_shards` 参数。
-- **ValueError** - `shard_id` 参数值错误(小于0或者大于等于 `num_shards` )。
-- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
+- **ValueError** - `shard_id` 参数值错误,小于0或者大于等于 `num_shards` 。
 
 .. note:: 此数据集可以指定参数 `sampler` ,但参数 `sampler` 和参数 `shuffle` 的行为是互斥的。下表展示了几种合法的输入参数组合及预期的行为。
@@ -93,5 +93,5 @@ mindspore.dataset.Flowers102Dataset
 year = "2008",
 }
 
-.. include:: mindspore.dataset.api_list_vision.rst
+.. include:: mindspore.dataset.api_list_vision.rst
@@ -14,26 +14,20 @@ mindspore.dataset.IMDBDataset
 对于Polarity数据集,'train'将读取360万个训练样本,'test'将读取40万个测试样本,'all'将读取所有400万个样本。
 对于Full数据集,'train'将读取300万个训练样本,'test'将读取65万个测试样本,'all'将读取所有365万个样本。默认值:None,读取所有样本。
 - **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:None,使用mindspore.dataset.config中配置的线程数。
-- **shuffle** (bool, 可选) - 每个epoch中数据混洗的模式,支持传入bool类型与枚举类型进行指定,默认值:mindspore.dataset.Shuffle.GLOBAL。
-  如果 `shuffle` 为False,则不混洗,如果 `shuffle` 为True,等同于将 `shuffle` 设置为mindspore.dataset.Shuffle.GLOBAL。
-  通过传入枚举变量设置数据混洗的模式:
-
-  - **Shuffle.GLOBAL**:混洗文件和样本。
-  - **Shuffle.FILES**:仅混洗文件。
-
-- **sampler** (Sampler, 可选) - 指定从数据集中选取样本的采样器,默认值:None,下表中会展示不同配置的预期行为。
-- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数,默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
-- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号,默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
+- **shuffle** (bool, 可选) - 是否混洗数据集。默认值:None,下表中会展示不同参数配置的预期行为。
+- **sampler** (Sampler, 可选) - 指定从数据集中选取样本的采样器。默认值:None,下表中会展示不同配置的预期行为。
+- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数。默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
+- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号。默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
 - **cache** (DatasetCache, 可选) - 单节点数据缓存服务,用于加快数据集处理,详情请阅读 `单节点数据缓存 <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ 。默认值:None,不使用缓存。
 
 异常:
 - **RuntimeError** - `dataset_dir` 参数所指向的文件目录不存在或缺少数据集文件。
+- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
 - **RuntimeError** - 同时指定了 `sampler` 和 `shuffle` 参数。
 - **RuntimeError** - 同时指定了 `sampler` 和 `num_shards` 参数或同时指定了 `sampler` 和 `shard_id` 参数。
 - **RuntimeError** - 指定了 `num_shards` 参数,但是未指定 `shard_id` 参数。
 - **RuntimeError** - 指定了 `shard_id` 参数,但是未指定 `num_shards` 参数。
-- **ValueError** - `shard_id` 参数值错误(小于0或者大于等于 `num_shards` )。
-- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
+- **ValueError** - `shard_id` 参数值错误,小于0或者大于等于 `num_shards` 。
 
 .. note:: 此数据集可以指定参数 `sampler` ,但参数 `sampler` 和参数 `shuffle` 的行为是互斥的。下表展示了几种合法的输入参数组合及预期的行为。
@@ -112,5 +106,5 @@ mindspore.dataset.IMDBDataset
 url = {http://www.aclweb.org/anthology/P11-1015}
 }
 
-.. include:: mindspore.dataset.api_list_nlp.rst
+.. include:: mindspore.dataset.api_list_nlp.rst
@@ -10,31 +10,31 @@ mindspore.dataset.IWSLT2016Dataset
 参数:
 - **dataset_dir** (str) - 包含数据集文件的根目录路径。
 - **usage** (str, 可选) - 指定数据集的子集,可取值为'train','valid','test'或'all'。默认值:None,读取全部样本。
-- **language_pair** (sequence, 可选) - 包含源语言和目标语言的序列,支持的值为('en','fr')、('en','de')、('en','cs')、('en','ar')、('de','en'),('cs','en'),('ar','en'),默认值:('de','en')。
-- **valid_set** (str, 可选) - 标识验证集的字符串,支持的值为'dev2010'、'tst2010'、'tst2011'、'tst'2012,'tst2013'和'tst2014',默认值:'tst2013'。
-- **test_set** (str, 可选) - 识测试集的字符串,支持的值为'dev2010'、'tst2010'、'tst2011'、'tst'2012、'tst2013'和'tst2014',默认值:'tst2014'。
-- **num_samples** (int, 可选) - 指定从数据集中读取的样本数。
-- **shuffle** (Union[bool, Shuffle], 可选) - 每个epoch中数据混洗的模式,支持传入bool类型与枚举类型进行指定,默认值:mindspore.dataset.Shuffle.GLOBAL。
+- **language_pair** (sequence, 可选) - 包含源语言和目标语言的序列,支持的值为('en','fr')、('en','de')、('en','cs')、('en','ar')、('de','en')、('cs','en')、('ar','en')。默认值:('de','en')。
+- **valid_set** (str, 可选) - 标识验证集的字符串,支持的值为'dev2010'、'tst2010'、'tst2011'、'tst2012'、'tst2013'和'tst2014'。默认值:'tst2013'。
+- **test_set** (str, 可选) - 识别测试集的字符串,支持的值为'dev2010'、'tst2010'、'tst2011'、'tst'2012、'tst2013'和'tst2014'。默认值:'tst2014'。
+- **num_samples** (int, 可选) - 指定从数据集中读取的样本数。默认值:None,读取所有样本。
+- **shuffle** (Union[bool, Shuffle], 可选) - 每个epoch中数据混洗的模式,支持传入bool类型与枚举类型进行指定。默认值:`Shuffle.GLOBAL` 。
   如果 `shuffle` 为False,则不混洗,如果 `shuffle` 为True,等同于将 `shuffle` 设置为mindspore.dataset.Shuffle.GLOBAL。
   通过传入枚举变量设置数据混洗的模式:
 
   - **Shuffle.GLOBAL**:混洗文件和样本。
   - **Shuffle.FILES**:仅混洗文件。
 
-- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数,默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
-- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号,默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
+- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数。默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
+- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号。默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
 - **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:None,使用mindspore.dataset.config中配置的线程数。
 - **cache** (DatasetCache, 可选) - 单节点数据缓存服务,用于加快数据集处理,详情请阅读 `单节点数据缓存 <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ 。默认值:None,不使用缓存。
 
 异常:
 - **RuntimeError** - `dataset_dir` 参数所指向的文件目录不存在或缺少数据集文件。
+- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
 - **RuntimeError** - 指定了 `num_shards` 参数,但是未指定 `shard_id` 参数。
 - **RuntimeError** - 指定了 `shard_id` 参数,但是未指定 `num_shards` 参数。
-- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
 
 **关于IWSLT2016数据集:**
 
-IWSLT是一个专门讨论口译各个方面的重要年度科学会议。IWSLT评估活动中的MT任务被构成一个数据集,该数据集可通过wit3.fbk.eu公开获取。
+IWSLT是一个专门讨论口译各个方面的重要年度科学会议。IWSLT评估活动中的MT任务被构成一个数据集,该数据集可通过 `wit3 <https://wit3.fbk.eu>`_ 公开获取。
 IWSLT2016数据集包括从英语到阿拉伯语、捷克、法语和德语的翻译,以及从阿拉伯语、捷克、法语和德语到英语的翻译。
 
 可以将原始IWSLT2016数据集文件解压缩到此目录结构中,并由MindSpore的API读取。解压后,还需要将要读取的数据集解压到指定文件夹中。例如,如果要读取de-en的数据集,则需要解压缩de/en目录下的tgz文件,数据集位于解压缩文件夹中。
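The fixed set of `language_pair` values this hunk enumerates suggests a simple membership check. A hedged sketch (the pair list is copied from the IWSLT2016 doc text above; `check_language_pair` is a hypothetical helper, not a MindSpore API):

```python
# Supported (source, target) pairs as listed for IWSLT2016.
SUPPORTED_PAIRS = {
    ("en", "fr"), ("en", "de"), ("en", "cs"), ("en", "ar"),
    ("de", "en"), ("cs", "en"), ("ar", "en"),
}

def check_language_pair(pair):
    """Validate a language_pair value against the documented set."""
    pair = tuple(pair)  # accept any 2-element sequence, as the doc allows
    if pair not in SUPPORTED_PAIRS:
        raise ValueError(f"unsupported language_pair: {pair}")
    return pair

assert check_language_pair(["de", "en"]) == ("de", "en")
```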
@@ -5,34 +5,37 @@ mindspore.dataset.IWSLT2017Dataset
 读取和解析IWSLT2017数据集的源数据集。
 
-生成的数据集有两列 `[text, translation]` 。 `text` 列的数据类型是string。 `translation` 列的数据类型是string。
+生成的数据集有两列 `[text, translation]` 。 `text` 列和 `translation` 列的数据类型均为string。
 
 参数:
 - **dataset_dir** (str) - 包含数据集文件的根目录路径。
 - **usage** (str, 可选) - 指定数据集的子集,可取值为'train','valid','test'或'all'。默认值:None,读取全部样本。
-- **language_pair** (sequence, 可选) - 包含源语和目标语的语言列表,支持的语言对有('en','nl')、('en','de')、('en','it')、('en','ro')、('nl','en','de')、('nl','it')、('nl','ro')、('de','en')、('de','nl')、('de','it','it','en')、('it','nl')、('it','de')、('it','ro')、('ro','en')、('ro','nl')、('ro','de')、('ro','it'),默认值:('de','en')。
-- **num_samples** (int, 可选) - 指定从数据集中读取的样本数。
-- **shuffle** (Union[bool, Shuffle], 可选) - 每个epoch中数据混洗的模式,支持传入bool类型与枚举类型进行指定,默认值:mindspore.dataset.Shuffle.GLOBAL。
+- **language_pair** (sequence, 可选) - 包含源语和目标语的语言列表,支持的语言对有('en', 'nl')、
+  ('en', 'de')、('en', 'it')、('en', 'ro')、('nl', 'en')、('nl', 'de')、('nl', 'it')、('nl', 'ro')、
+  ('de', 'en')、('de', 'nl')、('de', 'it')、('de', 'ro')、('it', 'en')、('it', 'nl')、('it', 'de')、
+  ('it', 'ro')、('ro', 'en')、('ro', 'nl')、('ro', 'de')、('ro', 'it')。默认值:('de','en')。
+- **num_samples** (int, 可选) - 指定从数据集中读取的样本数。默认值:None,读取所有样本。
+- **shuffle** (Union[bool, Shuffle], 可选) - 每个epoch中数据混洗的模式,支持传入bool类型与枚举类型进行指定。默认值:`Shuffle.GLOBAL` 。
   如果 `shuffle` 为False,则不混洗,如果 `shuffle` 为True,等同于将 `shuffle` 设置为mindspore.dataset.Shuffle.GLOBAL。
   通过传入枚举变量设置数据混洗的模式:
 
   - **Shuffle.GLOBAL**:混洗文件和样本。
   - **Shuffle.FILES**:仅混洗文件。
 
-- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数,默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
-- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号,默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
+- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数。默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
+- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号。默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
 - **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:None,使用mindspore.dataset.config中配置的线程数。
 - **cache** (DatasetCache, 可选) - 单节点数据缓存服务,用于加快数据集处理,详情请阅读 `单节点数据缓存 <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ 。默认值:None,不使用缓存。
 
 异常:
 - **RuntimeError** - `dataset_dir` 参数所指向的文件目录不存在或缺少数据集文件。
+- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
 - **RuntimeError** - 指定了 `num_shards` 参数,但是未指定 `shard_id` 参数。
 - **RuntimeError** - 指定了 `shard_id` 参数,但是未指定 `num_shards` 参数。
-- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
 
 **关于IWSLT2016数据集:**
 
-IWSLT是一个专门讨论口译各个方面的重要年度科学会议。IWSLT评估活动中的MT任务被构成一个数据集,该数据集可通过wit3.fbk.eu公开获取。
+IWSLT是一个专门讨论口译各个方面的重要年度科学会议。IWSLT评估活动中的MT任务被构成一个数据集,该数据集可通过 `wit3 <https://wit3.fbk.eu>`_ 公开获取。
 IWSLT2017数据集中有德语、英语、意大利语、荷兰语和罗马尼亚语,数据集包括其中任何两种语言的翻译。
 
 可以将原始IWSLT2017数据集文件解压缩到此目录结构中,并由MindSpore的API读取。解压后,还需要将要读取的数据集解压到指定文件夹中。例如,如果要读取de-en的数据集,则需要解压缩de/en目录下的tgz文件,数据集位于解压缩文件夹中。
@@ -5,7 +5,7 @@ mindspore.dataset.KMnistDataset
 读取和解析KMNIST数据集的源文件构建数据集。
 
-生成的数据集有两列: `[image, label]`。 `image` 列的数据类型为uint8。 `label` 列的数据类型为uint32。
+生成的数据集有两列: `[image, label]` 。 `image` 列的数据类型为uint8。 `label` 列的数据类型为uint32。
 
 参数:
 - **dataset_dir** (str) - 包含数据集文件的根目录路径。
@@ -14,19 +14,19 @@ mindspore.dataset.KMnistDataset
 - **num_samples** (int, 可选) - 指定从数据集中读取的样本数,可以小于数据集总数。默认值:None,读取全部样本图片。
 - **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:None,使用mindspore.dataset.config中配置的线程数。
 - **shuffle** (bool, 可选) - 是否混洗数据集。默认值:None,下表中会展示不同参数配置的预期行为。
-- **sampler** (Sampler, 可选) - 指定从数据集中选取样本的采样器,默认值:None,下表中会展示不同配置的预期行为。
-- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数,默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
-- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号,默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
+- **sampler** (Sampler, 可选) - 指定从数据集中选取样本的采样器。默认值:None,下表中会展示不同配置的预期行为。
+- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数。默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
+- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号。默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
 - **cache** (DatasetCache, 可选) - 单节点数据缓存服务,用于加快数据集处理,详情请阅读 `单节点数据缓存 <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ 。默认值:None,不使用缓存。
 
 异常:
 - **RuntimeError** - `dataset_dir` 路径下不包含数据文件。
+- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
 - **RuntimeError** - 同时指定了 `sampler` 和 `shuffle` 参数。
 - **RuntimeError** - 同时指定了 `sampler` 和 `num_shards` 参数或同时指定了 `sampler` 和 `shard_id` 参数。
 - **RuntimeError** - 指定了 `num_shards` 参数,但是未指定 `shard_id` 参数。
 - **RuntimeError** - 指定了 `shard_id` 参数,但是未指定 `num_shards` 参数。
-- **ValueError** - `shard_id` 参数错误(小于0或者大于等于 `num_shards` )。
-- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
+- **ValueError** - `shard_id` 参数错误,小于0或者大于等于 `num_shards` 。
 
 .. note:: 此数据集可以指定参数 `sampler` ,但参数 `sampler` 和参数 `shuffle` 的行为是互斥的。下表展示了几种合法的输入参数组合及预期的行为。
@@ -5,7 +5,7 @@ mindspore.dataset.LJSpeechDataset
 读取和解析LJSpeech数据集的源文件构建数据集。
 
-生成的数据集有两列: `[waveform, sample_rate, transcription, normalized_transcript]`。
+生成的数据集有四列: `[waveform, sample_rate, transcription, normalized_transcript]` 。
 `waveform` 列的数据类型为float32。 `sample_rate` 列的数据类型为int32。 `transcription` 列的数据类型为string。 `normalized_transcript` 列的数据类型为string。
 
 参数:
@@ -13,19 +13,19 @@ mindspore.dataset.LJSpeechDataset
 - **num_samples** (int, 可选) - 指定从数据集中读取的样本数,可以小于数据集总数。默认值:None,读取全部样本音频。
 - **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:None,使用mindspore.dataset.config中配置的线程数。
 - **shuffle** (bool, 可选) - 是否混洗数据集。默认值:None,下表中会展示不同参数配置的预期行为。
-- **sampler** (Sampler, 可选) - 指定从数据集中选取样本的采样器,默认值:None,下表中会展示不同配置的预期行为。
-- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数,默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
-- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号,默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
+- **sampler** (Sampler, 可选) - 指定从数据集中选取样本的采样器。默认值:None,下表中会展示不同配置的预期行为。
+- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数。默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
+- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号。默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
 - **cache** (DatasetCache, 可选) - 单节点数据缓存服务,用于加快数据集处理,详情请阅读 `单节点数据缓存 <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ 。默认值:None,不使用缓存。
 
 异常:
 - **RuntimeError** - `dataset_dir` 路径下不包含数据文件。
+- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
 - **RuntimeError** - 同时指定了 `sampler` 和 `shuffle` 参数。
 - **RuntimeError** - 同时指定了 `sampler` 和 `num_shards` 参数或同时指定了 `sampler` 和 `shard_id` 参数。
 - **RuntimeError** - 指定了 `num_shards` 参数,但是未指定 `shard_id` 参数。
 - **RuntimeError** - 指定了 `shard_id` 参数,但是未指定 `num_shards` 参数。
-- **ValueError** - `shard_id` 参数错误(小于0或者大于等于 `num_shards` )。
-- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
+- **ValueError** - `shard_id` 参数错误,小于0或者大于等于 `num_shards` 。
 
 .. note:: 此数据集可以指定参数 `sampler` ,但参数 `sampler` 和参数 `shuffle` 的行为是互斥的。下表展示了几种合法的输入参数组合及预期的行为。
@@ -5,7 +5,7 @@ mindspore.dataset.PennTreebankDataset
 读取和解析PennTreebank数据集的源数据集。
 
-生成的数据集有一列 `[text]` 。数据类型为string。
+生成的数据集有一列 `[text]` 。 `text` 列的数据类型为string。
 
 参数:
 - **dataset_dir** (str) - 包含数据集文件的根目录路径。
@@ -13,17 +13,23 @@ mindspore.dataset.PennTreebankDataset

  取值为'train'将读取42,068个样本,'valid'将读取3,370个样本,'test'将读取3,761个样本,'all'将读取所有49,199个样本。默认值:None,读取全部样本。
- **num_samples** (int, 可选) - 指定从数据集中读取的样本数。默认值:None,读取所有样本。
- **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:None,使用mindspore.dataset.config中配置的线程数。
- **shuffle** (Union[bool, Shuffle], 可选) - 每个epoch中数据混洗的模式,支持传入bool类型与枚举类型进行指定,默认值:True。
- **shuffle** (Union[bool, Shuffle], 可选) - 每个epoch中数据混洗的模式,支持传入bool类型与枚举类型进行指定。默认值:`Shuffle.GLOBAL` 。
  如果 `shuffle` 为False,则不混洗,如果 `shuffle` 为True,等同于将 `shuffle` 设置为mindspore.dataset.Shuffle.GLOBAL。
  通过传入枚举变量设置数据混洗的模式:

  - **Shuffle.GLOBAL**:混洗文件和样本。
  - **Shuffle.FILES**:仅混洗文件。

- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数,默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号,默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数。默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号。默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
- **cache** (DatasetCache, 可选) - 单节点数据缓存服务,用于加快数据集处理,详情请阅读 `单节点数据缓存 <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ 。默认值:None,不使用缓存。

异常:
- **RuntimeError** - `dataset_dir` 参数所指向的文件目录不存在或缺少数据集文件。
- **RuntimeError** - 指定了 `num_shards` 参数,但是未指定 `shard_id` 参数。
- **RuntimeError** - 指定了 `shard_id` 参数,但是未指定 `num_shards` 参数。
- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
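上面 `Shuffle.GLOBAL` 与 `Shuffle.FILES` 两种混洗级别的区别,可以用如下纯Python代码示意(其中的文件名与样本内容均为假设数据,仅用于说明概念,并非MindSpore内部实现):

```python
import random

# 假设数据:两个文件,每个文件含两条样本(均为示意数据)
files = {"a.txt": ["a1", "a2"], "b.txt": ["b1", "b2"]}

random.seed(0)

def shuffle_files(files):
    """Shuffle.FILES:仅打乱文件顺序,各文件内部的样本顺序保持不变。"""
    names = list(files)
    random.shuffle(names)
    return [s for n in names for s in files[n]]

def shuffle_global(files):
    """Shuffle.GLOBAL:文件与样本一起打乱(全局混洗)。"""
    samples = [s for n in files for s in files[n]]
    random.shuffle(samples)
    return samples

print(shuffle_files(files))
print(shuffle_global(files))
```

可以看到, `Shuffle.FILES` 的输出中每个文件内部样本仍保持原有先后顺序,而 `Shuffle.GLOBAL` 则完全打乱。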
**关于PennTreebank数据集:**

Penn Treebank (PTB) 数据集,广泛用于 NLP(自然语言处理)的机器学习研究。

@@ -60,5 +66,5 @@ mindspore.dataset.PennTreebankDataset

    year = 1990
}

.. include:: mindspore.dataset.api_list_nlp.rst

.. include:: mindspore.dataset.api_list_nlp.rst
@@ -5,26 +5,26 @@ mindspore.dataset.PhotoTourDataset

读取和解析PhotoTour数据集的源数据集。

当 `usage` = 'train',生成的数据集有一列 `[image]` ,数据类型为uint8。
当 `usage` ≠ 'train',生成的数据集有三列: `[image1, image2, matches]`。 `image1` 、 `image2` 列的数据类型为uint8。 `matches` 列的数据类型为uint32。
根据给定的 `usage` 配置,生成数据集具有不同的输出列:

- `usage` = 'train',输出列: `[image, dtype=uint8]` 。
- `usage` ≠ 'train',输出列: `[image1, dtype=uint8]` 、 `[image2, dtype=uint8]` 、 `[matches, dtype=uint32]` 。

参数:
- **dataset_dir** (str) - 包含数据集文件的根目录路径。
- **name** (str) - 要加载的数据集内容名称,可以取值为'notredame', 'yosemite', 'liberty', 'notredame_harris', 'yosemite_harris' 或 'liberty_harris'。
- **name** (str) - 要加载的数据集内容名称,可以取值为'notredame'、'yosemite'、'liberty'、'notredame_harris'、'yosemite_harris' 或 'liberty_harris'。
- **usage** (str, 可选) - 指定数据集的子集,可取值为'train'或'test'。默认值:None,将被设置为'train'。
  取值为'train'时,每个 `name` 的数据集样本数分别为{'notredame': 468159, 'yosemite': 633587, 'liberty': 450092, 'liberty_harris': 379587, 'yosemite_harris': 450912, 'notredame_harris': 325295}。
  取值为'test'时,将读取100,000个测试样本。
- **num_samples** (int, 可选) - 指定从数据集中读取的样本数。默认值:None,读取所有样本。
- **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:None,使用mindspore.dataset.config中配置的线程数。
- **shuffle** (bool, 可选) - 是否混洗数据集。默认值:None,下表中会展示不同参数配置的预期行为。
- **sampler** (Sampler, 可选) - 指定从数据集中选取样本的采样器,默认值:None,下表中会展示不同配置的预期行为。
- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数,默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号,默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
- **sampler** (Sampler, 可选) - 指定从数据集中选取样本的采样器。默认值:None,下表中会展示不同配置的预期行为。
- **num_shards** (int, 可选) - 指定分布式训练时将数据集进行划分的分片数。默认值:None。指定此参数后, `num_samples` 表示每个分片的最大样本数。
- **shard_id** (int, 可选) - 指定分布式训练时使用的分片ID号。默认值:None。只有当指定了 `num_shards` 时才能指定此参数。
- **cache** (DatasetCache, 可选) - 单节点数据缓存服务,用于加快数据集处理,详情请阅读 `单节点数据缓存 <https://www.mindspore.cn/tutorials/experts/zh-CN/master/dataset/cache.html>`_ 。默认值:None,不使用缓存。

异常:
- **RuntimeError** - `dataset_dir` 路径下不包含数据文件。
- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
- **RuntimeError** - 同时指定了 `sampler` 和 `shuffle` 参数。
- **RuntimeError** - 同时指定了 `sampler` 和 `num_shards` 参数或同时指定了 `sampler` 和 `shard_id` 参数。
- **RuntimeError** - 指定了 `num_shards` 参数,但是未指定 `shard_id` 参数。

@@ -32,7 +32,8 @@ mindspore.dataset.PhotoTourDataset

- **ValueError** - `dataset_dir` 不存在。
- **ValueError** - `usage` 不是["train", "test"]中的任何一个。
- **ValueError** - `name` 不是["notredame", "yosemite", "liberty", "notredame_harris", "yosemite_harris", "liberty_harris"]中的任何一个。
- **ValueError** - `shard_id` 参数错误(小于0或者大于等于 `num_shards` )。
- **ValueError** - `num_parallel_workers` 参数超过系统最大线程数。
- **ValueError** - `shard_id` 参数错误,小于0或者大于等于 `num_shards` 。

.. note:: 此数据集可以指定参数 `sampler` ,但参数 `sampler` 和参数 `shuffle` 的行为是互斥的。下表展示了几种合法的输入参数组合及预期的行为。

@@ -112,5 +113,5 @@ mindspore.dataset.PhotoTourDataset

    doi={10.1109/CVPR.2007.382971}
}

.. include:: mindspore.dataset.api_list_vision.rst

.. include:: mindspore.dataset.api_list_vision.rst
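`num_shards` 与 `shard_id` 的切分语义可以用如下纯Python代码示意(假设按样本索引取模进行切分,仅用于说明"分片互不重叠、合并后覆盖全部样本"这一概念,并非MindSpore内部实现):

```python
def shard_samples(samples, num_shards, shard_id):
    """按 shard_id 取模的方式取出属于本分片的样本(示意实现)。"""
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id 必须满足 0 <= shard_id < num_shards")
    return [s for i, s in enumerate(samples) if i % num_shards == shard_id]

samples = list(range(10))
# 两个分片互不重叠,合并后恰好覆盖全部样本
print(shard_samples(samples, 2, 0))  # [0, 2, 4, 6, 8]
print(shard_samples(samples, 2, 1))  # [1, 3, 5, 7, 9]
```

这也解释了为什么指定 `num_shards` 后, `num_samples` 表示的是"每个分片"的最大样本数。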
@@ -303,24 +303,6 @@ API示例所需模块的导入代码如下:

    返回:
        bool,表示是否开启watchdog Python线程。

.. py:function:: mindspore.dataset.config.set_multiprocessing_timeout_interval(interval)

    设置在多进程/多线程下,主进程/主线程获取数据超时时,告警日志打印的默认时间间隔(秒)。

    参数:
        - **interval** (int) - 表示多进程/多线程下,主进程/主线程获取数据超时时,告警日志打印的时间间隔(秒)。

    异常:
        - **TypeError** - `interval` 不是int类型。
        - **ValueError** - `interval` 小于等于0或 `interval` 大于 `INT32_MAX(2147483647)` 时, `interval` 无效。

.. py:function:: mindspore.dataset.config.get_multiprocessing_timeout_interval()

    获取在多进程/多线程下,主进程/主线程获取数据超时时,告警日志打印的时间间隔的全局配置。

    返回:
        int,表示多进程/多线程下,主进程/主线程获取数据超时时,告警日志打印的时间间隔(默认300秒)。

.. py:function:: mindspore.dataset.config.set_fast_recovery(fast_recovery)

    在数据集管道故障恢复时,是否开启快速恢复模式(快速恢复模式下,无法保证随机性的数据增强操作得到与故障之前相同的结果)。

@@ -340,3 +322,21 @@ API示例所需模块的导入代码如下:

.. automodule:: mindspore.dataset.config
    :members:

.. py:function:: mindspore.dataset.config.set_multiprocessing_timeout_interval(interval)

    设置在多进程/多线程下,主进程/主线程获取数据超时时,告警日志打印的默认时间间隔(秒)。

    参数:
        - **interval** (int) - 表示多进程/多线程下,主进程/主线程获取数据超时时,告警日志打印的时间间隔(秒)。

    异常:
        - **TypeError** - `interval` 不是int类型。
        - **ValueError** - `interval` 小于等于0或 `interval` 大于 `INT32_MAX(2147483647)` 时, `interval` 无效。

.. py:function:: mindspore.dataset.config.get_multiprocessing_timeout_interval()

    获取在多进程/多线程下,主进程/主线程获取数据超时时,告警日志打印的时间间隔的全局配置。

    返回:
        int,表示多进程/多线程下,主进程/主线程获取数据超时时,告警日志打印的时间间隔(默认300秒)。
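上述超时告警间隔的行为可以用如下纯Python代码示意(假设:等待时间每跨过一个 `interval` 就记录一条告警;函数与参数校验均为示意,并非MindSpore内部实现):

```python
def warning_times(wait_seconds, interval=300):
    """返回在 wait_seconds 的等待过程中,按 interval 间隔应打印告警的时间点(示意)。"""
    if not isinstance(interval, int):
        raise TypeError("interval 必须为int类型")
    if interval <= 0 or interval > 2147483647:
        raise ValueError("interval 必须在 (0, INT32_MAX] 范围内")
    return list(range(interval, int(wait_seconds) + 1, interval))

# 主线程等待1000秒、间隔取默认300秒时,会在第300/600/900秒各打印一次告警
print(warning_times(1000, 300))  # [300, 600, 900]
```

校验逻辑对应上文列出的 `TypeError` 与 `ValueError` 两种异常。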
@@ -808,7 +808,7 @@ def set_fast_recovery(fast_recovery):

        (yet with slightly different random augmentations).

    Args:
        fast_recovery (bool): Whether the dataset pipeline recovers in fast mode. Default: True
        fast_recovery (bool): Whether the dataset pipeline recovers in fast mode.

    Raises:
        TypeError: If `fast_recovery` is not a boolean data type.

@@ -823,10 +823,10 @@ def set_fast_recovery(fast_recovery):

def get_fast_recovery():
    """
    Get the fast_recovery flag of the dataset pipeline
    Get whether the fast recovery mode is enabled for the current dataset pipeline.

    Returns:
        bool, whether the dataset recovers fast in failover reset
        bool, whether the dataset recovers fast in failover reset.

    Examples:
        >>> is_fast_recovery = ds.config.get_fast_recovery()
@@ -714,20 +714,20 @@ class Dataset:

    @check_shuffle
    def shuffle(self, buffer_size):
        """
        Randomly shuffles the rows of this dataset using the following policy:
        Shuffle the dataset by creating a cache with the size of `buffer_size` .

        1. Make a shuffle buffer that contains the first buffer_size rows.
        1. Make a shuffle buffer that contains the first `buffer_size` rows.
        2. Randomly select an element from the shuffle buffer to be the next row
           propagated to the child node.
        3. Get the next row (if any) from the parent node and put it in the shuffle buffer.
        4. Repeat steps 2 and 3 until there are no more rows left in the shuffle buffer.

        A random seed can be provided to be used on the first epoch via `dataset.config.set_seed`. In every subsequent
        A random seed can be provided to be used on the first epoch via `dataset.config.set_seed` . In every subsequent
        epoch, the seed is changed to a new one, randomly generated value.

        Args:
            buffer_size (int): The size of the buffer (must be larger than 1) for
                shuffling. Setting buffer_size equal to the number of rows in the entire
                shuffling. Setting `buffer_size` equal to the number of rows in the entire
                dataset will result in a global shuffle.

        Returns:
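The four-step buffer-shuffle policy described in the docstring above can be sketched in plain Python (a conceptual simulation only, not MindSpore's internal implementation):

```python
import random

def buffer_shuffle(rows, buffer_size, seed=0):
    """Simulate the shuffle-buffer policy: fill a buffer of buffer_size rows,
    then repeatedly emit a random buffer element and refill from the source."""
    rng = random.Random(seed)
    it = iter(rows)
    # Step 1: fill the buffer with the first buffer_size rows.
    buffer = []
    for _ in range(buffer_size):
        try:
            buffer.append(next(it))
        except StopIteration:
            break
    out = []
    while buffer:
        # Step 2: randomly pick the next output row from the buffer.
        idx = rng.randrange(len(buffer))
        out.append(buffer.pop(idx))
        # Step 3: refill the buffer from the parent node, if rows remain.
        try:
            buffer.append(next(it))
        except StopIteration:
            pass
        # Step 4: loop until the buffer is empty.
    return out

print(buffer_shuffle(range(10), buffer_size=4))
```

Note the limiting cases: with `buffer_size` equal to the dataset size this degenerates to a full (global) shuffle, while a buffer of size 1 leaves the original order unchanged.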
@@ -456,38 +456,38 @@ class LJSpeechDataset(MappableDataset, AudioBaseDataset):

    """
    A source dataset that reads and parses LJSpeech dataset.

    The generated dataset has four columns :py:obj:`[waveform, sample_rate, transcription, normalized_transcript]`.
    The tensor of column :py:obj:`waveform` is a tensor of the float32 type.
    The tensor of column :py:obj:`sample_rate` is a scalar of the int32 type.
    The tensor of column :py:obj:`transcription` is a scalar of the string type.
    The tensor of column :py:obj:`normalized_transcript` is a scalar of the string type.
    The generated dataset has four columns :py:obj:`[waveform, sample_rate, transcription, normalized_transcript]` .
    The column :py:obj:`waveform` is a tensor of the float32 type.
    The column :py:obj:`sample_rate` is a scalar of the int32 type.
    The column :py:obj:`transcription` is a scalar of the string type.
    The column :py:obj:`normalized_transcript` is a scalar of the string type.

    Args:
        dataset_dir (str): Path to the root directory that contains the dataset.
        num_samples (int, optional): The number of audios to be included in the dataset
            (default=None, all audios).
        num_parallel_workers (int, optional): Number of workers to read the data
            (default=None, number set in the config).
        shuffle (bool, optional): Whether to perform shuffle on the dataset (default=None, expected
            order behavior shown in the table).
        sampler (Sampler, optional): Object used to choose samples from the
            dataset (default=None, expected order behavior shown in the table).
        num_shards (int, optional): Number of shards that the dataset will be divided
            into (default=None). When this argument is specified, `num_samples` reflects
        num_samples (int, optional): The number of audios to be included in the dataset.
            Default: None, all audios.
        num_parallel_workers (int, optional): Number of workers to read the data.
            Default: None, number set in the mindspore.dataset.config.
        shuffle (bool, optional): Whether to perform shuffle on the dataset. Default: None, expected
            order behavior shown in the table below.
        sampler (Sampler, optional): Object used to choose samples from the dataset.
            Default: None, expected order behavior shown in the table below.
        num_shards (int, optional): Number of shards that the dataset will be divided into.
            Default: None. When this argument is specified, `num_samples` reflects
            the maximum sample number of per shard.
        shard_id (int, optional): The shard ID within `num_shards` (default=None). This
        shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
            argument can only be specified when `num_shards` is also specified.
        cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
            `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_
            (default=None, which means no cache is used).
            `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
            Default: None, which means no cache is used.

    Raises:
        RuntimeError: If `dataset_dir` does not contain data files.
        ValueError: If `num_parallel_workers` exceeds the max thread numbers.
        RuntimeError: If `sampler` and `shuffle` are specified at the same time.
        RuntimeError: If `sampler` and `num_shards`/`shard_id` are specified at the same time.
        RuntimeError: If `num_shards` is specified but `shard_id` is None.
        RuntimeError: If `shard_id` is specified but `num_shards` is None.
        ValueError: If `num_parallel_workers` exceeds the max thread numbers.
        ValueError: If `shard_id` is invalid (< 0 or >= `num_shards`).

    Note:
@@ -39,32 +39,38 @@ class AGNewsDataset(SourceDataset, TextBaseDataset):

    """
    A source dataset that reads and parses AG News datasets.

    The generated dataset has three columns: :py:obj:`[index, title, description]`,
    The generated dataset has three columns: :py:obj:`[index, title, description]` ,
    and the data type of three columns is string type.

    Args:
        dataset_dir (str): Path to the root directory that contains the dataset.
        usage (str, optional): Acceptable usages include 'train', 'test' and 'all' (default=None, all samples).
        num_samples (int, optional): Number of samples (rows) to read (default=None, reads the full dataset).
        num_parallel_workers (int, optional): Number of workers to read the data
            (default=None, number set in the config).
        shuffle (Union[bool, Shuffle], optional): Perform reshuffling of the data every epoch
            (default=Shuffle.GLOBAL). Bool type and Shuffle enum are both supported to pass in.
            If shuffle is False, no shuffling will be performed.
            If shuffle is True, performs global shuffle.
            There are three levels of shuffling, desired shuffle enum defined by mindspore.dataset.Shuffle.
        usage (str, optional): Acceptable usages include 'train', 'test' and 'all'. Default: None, all samples.
        num_samples (int, optional): Number of samples (rows) to read. Default: None, reads the full dataset.
        num_parallel_workers (int, optional): Number of workers to read the data.
            Default: None, number set in the mindspore.dataset.config.
        shuffle (Union[bool, Shuffle], optional): Perform reshuffling of the data every epoch.
            Bool type and Shuffle enum are both supported to pass in. Default: `Shuffle.GLOBAL` .
            If `shuffle` is False, no shuffling will be performed.
            If `shuffle` is True, it is equivalent to setting `shuffle` to mindspore.dataset.Shuffle.GLOBAL.
            Set the mode of data shuffling by passing in enumeration variables:

            - Shuffle.GLOBAL: Shuffle both the files and samples, same as setting shuffle to True.
            - Shuffle.GLOBAL: Shuffle both the files and samples.

            - Shuffle.FILES: Shuffle files only.

        num_shards (int, optional): Number of shards that the dataset will be divided into (default=None).
            When this argument is specified, 'num_samples' reflects the max sample number of per shard.
        shard_id (int, optional): The shard ID within `num_shards` (default=None). This
            argument can only be specified when `num_shards` is also specified.
        num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
            When this argument is specified, `num_samples` reflects the max sample number of per shard.
        shard_id (int, optional): The shard ID within `num_shards` . This
            argument can only be specified when `num_shards` is also specified. Default: None.
        cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
            `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_
            (default=None, which means no cache is used).
            `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
            Default: None, which means no cache is used.

    Raises:
        RuntimeError: If `dataset_dir` does not contain data files.
        RuntimeError: If `num_shards` is specified but `shard_id` is None.
        RuntimeError: If `shard_id` is specified but `num_shards` is None.
        ValueError: If `num_parallel_workers` exceeds the max thread numbers.

    Examples:
        >>> ag_news_dataset_dir = "/path/to/ag_news_dataset_file"
@@ -125,45 +131,45 @@ class AmazonReviewDataset(SourceDataset, TextBaseDataset):

    """
    A source dataset that reads and parses Amazon Review Polarity and Amazon Review Full datasets.

    The generated dataset has three columns: :py:obj:`[label, title, content]`,
    The generated dataset has three columns: :py:obj:`[label, title, content]` ,
    and the data type of three columns is string.

    Args:
        dataset_dir (str): Path to the root directory that contains the Amazon Review Polarity dataset
            or the Amazon Review Full dataset.
        usage (str, optional): Usage of this dataset, can be 'train', 'test' or 'all' (default= 'all').
        usage (str, optional): Usage of this dataset, can be 'train', 'test' or 'all'.
            For Polarity dataset, 'train' will read from 3,600,000 train samples,
            'test' will read from 400,000 test samples,
            'all' will read from all 4,000,000 samples.
            For Full dataset, 'train' will read from 3,000,000 train samples,
            'test' will read from 650,000 test samples,
            'all' will read from all 3,650,000 samples (default=None, all samples).
        num_samples (int, optional): Number of samples (rows) to be read (default=None, reads the full dataset).
        shuffle (Union[bool, Shuffle], optional): Perform reshuffling of the data every epoch
            (default=Shuffle.GLOBAL). Bool type and Shuffle enum are both supported to pass in.
            If shuffle is False, no shuffling will be performed.
            If shuffle is True, performs global shuffle.
            There are three levels of shuffling, desired shuffle enum defined by mindspore.dataset.Shuffle.
            'all' will read from all 3,650,000 samples. Default: None, all samples.
        num_samples (int, optional): Number of samples (rows) to be read. Default: None, reads the full dataset.
        num_parallel_workers (int, optional): Number of workers to read the data.
            Default: None, number set in the mindspore.dataset.config.
        shuffle (Union[bool, Shuffle], optional): Perform reshuffling of the data every epoch.
            Bool type and Shuffle enum are both supported to pass in. Default: `Shuffle.GLOBAL` .
            If `shuffle` is False, no shuffling will be performed.
            If `shuffle` is True, it is equivalent to setting `shuffle` to mindspore.dataset.Shuffle.GLOBAL.
            Set the mode of data shuffling by passing in enumeration variables:

            - Shuffle.GLOBAL: Shuffle both the files and samples, same as setting shuffle to True.
            - Shuffle.GLOBAL: Shuffle both the files and samples.

            - Shuffle.FILES: Shuffle files only.

        num_shards (int, optional): Number of shards that the dataset will be divided into (default=None).
        num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
            When this argument is specified, `num_samples` reflects the max sample number of per shard.
        shard_id (int, optional): The shard ID within `num_shards` (default=None). This
        shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
            argument can only be specified when `num_shards` is also specified.
        num_parallel_workers (int, optional): Number of workers to read the data
            (default=None, number set in the mindspore.dataset.config).
        cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
            `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_
            (default=None, which means no cache is used).
            `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
            Default: None, which means no cache is used.

    Raises:
        RuntimeError: If `dataset_dir` does not contain data files.
        ValueError: If `num_parallel_workers` exceeds the max thread numbers.
        RuntimeError: If `num_shards` is specified but `shard_id` is None.
        RuntimeError: If `shard_id` is specified but `num_shards` is None.
        ValueError: If `num_parallel_workers` exceeds the max thread numbers.

    Examples:
        >>> amazon_review_dataset_dir = "/path/to/amazon_review_dataset_dir"
@@ -545,7 +551,7 @@ class DBpediaDataset(SourceDataset, TextBaseDataset):

    """
    A source dataset that reads and parses the DBpedia dataset.

    The generated dataset has three columns :py:obj:`[class, title, content]`,
    The generated dataset has three columns :py:obj:`[class, title, content]` ,
    and the data type of three columns is string.

    Args:
@@ -553,34 +559,34 @@ class DBpediaDataset(SourceDataset, TextBaseDataset):

        usage (str, optional): Usage of this dataset, can be 'train', 'test' or 'all'.
            'train' will read from 560,000 train samples,
            'test' will read from 70,000 test samples,
            'all' will read from all 630,000 samples (default=None, all samples).
        num_samples (int, optional): The number of samples to be included in the dataset
            (default=None, will include all text).
        num_parallel_workers (int, optional): Number of workers to read the data
            (default=None, number set in the config).
        shuffle (Union[bool, Shuffle], optional): Perform reshuffling of the data every epoch
            (default=Shuffle.GLOBAL). Bool type and Shuffle enum are both supported to pass in.
            'all' will read from all 630,000 samples. Default: None, all samples.
        num_samples (int, optional): The number of samples to be included in the dataset.
            Default: None, will include all text.
        num_parallel_workers (int, optional): Number of workers to read the data.
            Default: None, number set in the mindspore.dataset.config.
        shuffle (Union[bool, Shuffle], optional): Perform reshuffling of the data every epoch.
            Bool type and Shuffle enum are both supported to pass in. Default: `Shuffle.GLOBAL` .
            If shuffle is False, no shuffling will be performed.
            If shuffle is True, performs global shuffle.
            There are three levels of shuffling, desired shuffle enum defined by mindspore.dataset.Shuffle.
            If shuffle is True, it is equivalent to setting `shuffle` to mindspore.dataset.Shuffle.GLOBAL.
            Set the mode of data shuffling by passing in enumeration variables:

            - Shuffle.GLOBAL: Shuffle both the files and samples, same as setting shuffle to True.
            - Shuffle.GLOBAL: Shuffle both the files and samples.

            - Shuffle.FILES: Shuffle files only.

        num_shards (int, optional): Number of shards that the dataset will be divided into (default=None).
        num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
            When this argument is specified, `num_samples` reflects the maximum sample number of per shard.
        shard_id (int, optional): The shard ID within `num_shards` (default=None). This
        shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
            argument can only be specified when `num_shards` is also specified.
        cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
            `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_
            (default=None, which means no cache is used).
            `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
            Default: None, which means no cache is used.

    Raises:
        RuntimeError: If `dataset_dir` does not contain data files.
        ValueError: If `num_parallel_workers` exceeds the max thread numbers.
        RuntimeError: If `num_shards` is specified but `shard_id` is None.
        RuntimeError: If `shard_id` is specified but `num_shards` is None.
        ValueError: If `num_parallel_workers` exceeds the max thread numbers.
        ValueError: If `shard_id` is invalid (< 0 or >= `num_shards`).

    Examples:
@@ -640,33 +646,42 @@ class DBpediaDataset(SourceDataset, TextBaseDataset):

class EnWik9Dataset(SourceDataset, TextBaseDataset):
    """
    A source dataset that reads and parses EnWik9 dataset.
    A source dataset that reads and parses the EnWik9 dataset.

    The generated dataset has one column :py:obj:`[text]` with type string.

    Args:
        dataset_dir (str): Path to the root directory that contains the dataset.
        num_samples (int, optional): The number of samples to be included in the dataset
            (default=None, will include all samples).
        num_parallel_workers (int, optional): Number of workers to read the data
            (default=None, number set in the config).
        shuffle (Union[bool, Shuffle], optional): Perform reshuffling of the data every epoch
            (default=True). Bool type and Shuffle enum are both supported to pass in.
        num_samples (int, optional): The number of samples to be included in the dataset.
            Default: None, will include all samples.
        num_parallel_workers (int, optional): Number of workers to read the data.
            Default: None, number set in the mindspore.dataset.config.
        shuffle (Union[bool, Shuffle], optional): Perform reshuffling of the data every epoch.
            Bool type and Shuffle enum are both supported to pass in. Default: True.
            If shuffle is False, no shuffling will be performed.
            If shuffle is True, performs global shuffle.
            There are three levels of shuffling, desired shuffle enum defined by mindspore.dataset.Shuffle.
            If shuffle is True, it is equivalent to setting `shuffle` to mindspore.dataset.Shuffle.GLOBAL.
            Set the mode of data shuffling by passing in enumeration variables:

            - Shuffle.GLOBAL: Shuffle both the files and samples, same as setting shuffle to True.
            - Shuffle.GLOBAL: Shuffle both the files and samples.

            - Shuffle.FILES: Shuffle files only.

        num_shards (int, optional): Number of shards that the dataset will be divided into (default=None).
        num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
            When this argument is specified, `num_samples` reflects the maximum sample number of per shard.
        shard_id (int, optional): The shard ID within `num_shards` (default=None). This
        shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
            argument can only be specified when `num_shards` is also specified.
        cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
            `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_
            (default=None, which means no cache is used).
            `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
            Default: None, which means no cache is used.

    Raises:
        RuntimeError: If `dataset_dir` does not contain data files.
        RuntimeError: If `num_shards` is specified but `shard_id` is None.
        RuntimeError: If `shard_id` is specified but `num_shards` is None.
        ValueError: If `num_parallel_workers` exceeds the max thread numbers.

    Examples:
        >>> en_wik9_dataset_dir = "/path/to/en_wik9_dataset"
@ -719,38 +734,41 @@ class IMDBDataset(MappableDataset, TextBaseDataset):
|
|||
"""
|
||||
A source dataset that reads and parses Internet Movie Database (IMDb).
|
||||
|
||||
The generated dataset has two columns: :py:obj:`[text, label]`.
|
||||
The generated dataset has two columns: :py:obj:`[text, label]` .
|
||||
The tensor of column :py:obj:`text` is of the string type.
|
||||
The tensor of column :py:obj:`label` is of a scalar of uint32 type.
|
||||
The column :py:obj:`label` is of a scalar of uint32 type.
|
||||
|
||||
Args:
|
||||
dataset_dir (str): Path to the root directory that contains the dataset.
|
||||
usage (str, optional): Usage of this dataset, can be 'train', 'test' or 'all'
|
||||
(default=None, will read all samples).
|
||||
num_samples (int, optional): The number of images to be included in the dataset
|
||||
(default=None, will read all samples).
|
||||
num_parallel_workers (int, optional): Number of workers to read the data
|
||||
(default=None, set in the config).
|
||||
shuffle (bool, optional): Whether or not to perform shuffle on the dataset
|
||||
(default=None, expected order behavior shown in the table).
|
||||
sampler (Sampler, optional): Object used to choose samples from the
|
||||
dataset (default=None, expected order behavior shown in the table).
usage (str, optional): Usage of this dataset, can be 'train', 'test' or 'all'.
Default: None, will read all samples.
num_samples (int, optional): The number of images to be included in the dataset.
For Polarity dataset, 'train' will read from 3,600,000 train samples, 'test' will read from 400,000 test
samples, 'all' will read from all 4,000,000 samples. For Full dataset, 'train' will read from 3,000,000
train samples, 'test' will read from 650,000 test samples, 'all' will read from all 3,650,000 samples.
Default: None, will include all samples.
num_parallel_workers (int, optional): Number of workers to read the data.
Default: None, number set in the mindspore.dataset.config.
shuffle (bool, optional): Whether or not to perform shuffle on the dataset.
Default: None, expected order behavior shown in the table below.
sampler (Sampler, optional): Object used to choose samples from the dataset.
Default: None, expected order behavior shown in the table below.
num_shards (int, optional): Number of shards that the dataset will be divided
into (default=None). When this argument is specified, `num_samples` reflects
into. Default: None. When this argument is specified, `num_samples` reflects
the maximum sample number of per shard.
shard_id (int, optional): The shard ID within `num_shards` (default=None). This
shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
argument can only be specified when `num_shards` is also specified.
cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_
(default=None, which means no cache is used).
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
Default: None, which means no cache is used.

Raises:
RuntimeError: If `dataset_dir` does not contain data files.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.
RuntimeError: If `sampler` and `shuffle` are specified at the same time.
RuntimeError: If `sampler` and `num_shards`/`shard_id` are specified at the same time.
RuntimeError: If `num_shards` is specified but `shard_id` is None.
RuntimeError: If `shard_id` is specified but `num_shards` is None.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.
ValueError: If `shard_id` is invalid (< 0 or >= `num_shards`).

Note:
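The `num_shards` / `shard_id` contract described above can be illustrated with a small standalone sketch (plain Python, not the MindSpore implementation): each worker keeps only the rows whose index maps to its shard, and `num_samples` then caps rows per shard rather than globally. The function name and the index-striding policy are assumptions for illustration only.

```python
def shard_rows(rows, num_shards, shard_id, num_samples=None):
    """Illustrative row sharding: keep every num_shards-th row starting at shard_id.

    Mirrors the documented contract: shard_id must be given together with
    num_shards, and num_samples caps the rows of *each shard*, not the total.
    """
    if (num_shards is None) != (shard_id is None):
        raise RuntimeError("num_shards and shard_id must be specified together")
    if num_shards is None:
        picked = list(rows)
    else:
        if not 0 <= shard_id < num_shards:
            raise ValueError("shard_id is invalid (< 0 or >= num_shards)")
        picked = [r for i, r in enumerate(rows) if i % num_shards == shard_id]
    return picked if num_samples is None else picked[:num_samples]

# Two shards over ten rows: each shard sees five rows; num_samples=2 caps per shard.
rows = list(range(10))
print(shard_rows(rows, 2, 0, num_samples=2))  # [0, 2]
```

The real implementation may distribute rows differently; the point is only the pairing and per-shard capping rules the docstring states.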
@@ -861,47 +879,48 @@ class IWSLT2016Dataset(SourceDataset, TextBaseDataset):
"""
A source dataset that reads and parses IWSLT2016 datasets.

The generated dataset has two columns: :py:obj:`[text, translation]`.
The generated dataset has two columns: :py:obj:`[text, translation]` .
The tensor of column :py:obj:`text` is of the string type.
The tensor of column :py:obj:`translation` is of the string type.
The column :py:obj:`translation` is of the string type.

Args:
dataset_dir (str): Path to the root directory that contains the dataset.
usage (str, optional): Acceptable usages include 'train', 'valid', 'test' and 'all' (default=None, all samples).
usage (str, optional): Acceptable usages include 'train', 'valid', 'test' and 'all'. Default: None, all samples.
language_pair (sequence, optional): Sequence containing source and target language, supported values are
('en', 'fr'), ('en', 'de'), ('en', 'cs'), ('en', 'ar'), ('fr', 'en'), ('de', 'en'), ('cs', 'en'),
('ar', 'en') (default=('de', 'en')).
('ar', 'en'). Default: ('de', 'en').
valid_set (str, optional): A string to identify validation set, when usage is valid or all, the validation set
of valid_set type will be read, supported values are 'dev2010', 'tst2010', 'tst2011', 'tst2012', 'tst2013'
and 'tst2014' (default='tst2013').
test_set (str, optional): A string to identify test set, when usage is test or all, the test set of test_set
type will be read, supported values are 'dev2010', 'tst2010', 'tst2011', 'tst2012', 'tst2013' and 'tst2014'
(default='tst2014').
num_samples (int, optional): Number of samples (rows) to read (default=None, reads the full dataset).
shuffle (Union[bool, Shuffle], optional): Perform reshuffling of the data every epoch
(default=Shuffle.GLOBAL). Bool type and Shuffle enum are both supported to pass in.
If shuffle is False, no shuffling will be performed.
If shuffle is True, performs global shuffle.
There are three levels of shuffling, desired shuffle enum defined by mindspore.dataset.Shuffle.
of `valid_set` type will be read, supported values are 'dev2010', 'tst2010', 'tst2011', 'tst2012', 'tst2013'
and 'tst2014'. Default: 'tst2013'.
test_set (str, optional): A string to identify test set, when usage is test or all, the test set of `test_set`
type will be read, supported values are 'dev2010', 'tst2010', 'tst2011', 'tst2012', 'tst2013' and 'tst2014'.
Default: 'tst2014'.
num_samples (int, optional): Number of samples (rows) to read. Default: None, reads the full dataset.
shuffle (Union[bool, Shuffle], optional): Perform reshuffling of the data every epoch.
Bool type and Shuffle enum are both supported to pass in. Default: `Shuffle.GLOBAL` .
If `shuffle` is False, no shuffling will be performed.
If `shuffle` is True, it is equivalent to setting `shuffle` to mindspore.dataset.Shuffle.GLOBAL.
Set the mode of data shuffling by passing in enumeration variables:

- Shuffle.GLOBAL: Shuffle both the files and samples, same as setting shuffle to True.
- Shuffle.GLOBAL: Shuffle both the files and samples.

- Shuffle.FILES: Shuffle files only.
num_shards (int, optional): Number of shards that the dataset will be divided into (default=None).

num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
When this argument is specified, `num_samples` reflects the max sample number of per shard.
shard_id (int, optional): The shard ID within `num_shards` (default=None). This
shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
argument can only be specified when `num_shards` is also specified.
num_parallel_workers (int, optional): Number of workers to read the data
(default=None, number set in the config).
num_parallel_workers (int, optional): Number of workers to read the data.
Default: None, number set in the mindspore.dataset.config.
cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_
(default=None, which means no cache is used).
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
Default: None, which means no cache is used.

Raises:
RuntimeError: If `dataset_dir` does not contain data files.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.
RuntimeError: If `num_shards` is specified but `shard_id` is None.
RuntimeError: If `shard_id` is specified but `num_shards` is None.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.

Examples:
>>> iwslt2016_dataset_dir = "/path/to/iwslt2016_dataset_dir"
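The two shuffle levels described above (Shuffle.GLOBAL mixes everything, Shuffle.FILES reorders files while preserving row order inside each file) can be sketched in plain Python. This is an illustration of the semantics, not MindSpore's implementation; the record layout and function name are assumptions.

```python
import random

def shuffle_records(records, mode, seed=0):
    """Illustrative shuffle levels: records are (file_name, row) pairs.

    'files'  -> shuffle the file order only; rows inside a file keep order.
    'global' -> shuffle files and samples, i.e. all rows are mixed freely.
    """
    rng = random.Random(seed)
    if mode == "files":
        files = sorted({f for f, _ in records})
        rng.shuffle(files)
        # Rows are re-emitted file by file, preserving their in-file order.
        return [(f, r) for f in files for (g, r) in records if g == f]
    if mode == "global":
        out = list(records)
        rng.shuffle(out)
        return out
    raise ValueError("unknown shuffle mode")

recs = [("a.txt", 0), ("a.txt", 1), ("b.txt", 0), ("b.txt", 1)]
files_only = shuffle_records(recs, "files")
# Whatever order the files land in, rows 0, 1 of each file stay in order.
```

Passing `shuffle=True` to the dataset corresponds to the 'global' case here, per the docstring's equivalence with Shuffle.GLOBAL.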
@@ -912,8 +931,8 @@ class IWSLT2016Dataset(SourceDataset, TextBaseDataset):

IWSLT is an international oral translation conference, a major annual scientific conference dedicated to all aspects
of oral translation. The MT task of the IWSLT evaluation activity constitutes a dataset, which can be publicly
obtained through the WIT3 website wit3.fbk.eu. The IWSLT2016 dataset includes translations from English to Arabic,
Czech, French, and German, and translations from Arabic, Czech, French, and German to English.
obtained through the WIT3 website `wit3 <https://wit3.fbk.eu>`_ . The IWSLT2016 dataset includes translations from
English to Arabic, Czech, French, and German, and translations from Arabic, Czech, French, and German to English.

You can unzip the original IWSLT2016 dataset files into this directory structure and read by MindSpore's API. After
decompression, you also need to decompress the dataset to be read in the specified folder. For example, if you want
@@ -988,42 +1007,42 @@ class IWSLT2017Dataset(SourceDataset, TextBaseDataset):
"""
A source dataset that reads and parses IWSLT2017 datasets.

The generated dataset has two columns: :py:obj:`[text, translation]`.
The tensor of column :py:obj:`text` is of the string type.
The tensor of column :py:obj:`translation` is of the string type.
The generated dataset has two columns: :py:obj:`[text, translation]` .
The tensor of column :py:obj:`text` and :py:obj:`translation` are of the string type.

Args:
dataset_dir (str): Path to the root directory that contains the dataset.
usage (str, optional): Acceptable usages include 'train', 'valid', 'test' and 'all' (default=None, all samples).
usage (str, optional): Acceptable usages include 'train', 'valid', 'test' and 'all'. Default: None, all samples.
language_pair (sequence, optional): List containing src and tgt language, supported values are ('en', 'nl'),
('en', 'de'), ('en', 'it'), ('en', 'ro'), ('nl', 'en'), ('nl', 'de'), ('nl', 'it'), ('nl', 'ro'),
('de', 'en'), ('de', 'nl'), ('de', 'it'), ('de', 'ro'), ('it', 'en'), ('it', 'nl'), ('it', 'de'),
('it', 'ro'), ('ro', 'en'), ('ro', 'nl'), ('ro', 'de'), ('ro', 'it') (default=('de', 'en')).
num_samples (int, optional): Number of samples (rows) to read (default=None, reads the full dataset).
shuffle (Union[bool, Shuffle], optional): Perform reshuffling of the data every epoch
(default=Shuffle.GLOBAL). Bool type and Shuffle enum are both supported to pass in.
('it', 'ro'), ('ro', 'en'), ('ro', 'nl'), ('ro', 'de'), ('ro', 'it'). Default: ('de', 'en').
num_samples (int, optional): Number of samples (rows) to read. Default: None, reads the full dataset.
shuffle (Union[bool, Shuffle], optional): Perform reshuffling of the data every epoch.
Bool type and Shuffle enum are both supported to pass in. Default: `Shuffle.GLOBAL` .
If shuffle is False, no shuffling will be performed.
If shuffle is True, performs global shuffle.
There are three levels of shuffling, desired shuffle enum defined by mindspore.dataset.Shuffle.
If shuffle is True, it is equivalent to setting `shuffle` to mindspore.dataset.Shuffle.GLOBAL.
Set the mode of data shuffling by passing in enumeration variables:

- Shuffle.GLOBAL: Shuffle both the files and samples, same as setting shuffle to True.
- Shuffle.GLOBAL: Shuffle both the files and samples.

- Shuffle.FILES: Shuffle files only.
num_shards (int, optional): Number of shards that the dataset will be divided into (default=None).

num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
When this argument is specified, `num_samples` reflects the max sample number of per shard.
shard_id (int, optional): The shard ID within `num_shards` (default=None). This
shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
argument can only be specified when `num_shards` is also specified.
num_parallel_workers (int, optional): Number of workers to read the data
(default=None, number set in the config).
num_parallel_workers (int, optional): Number of workers to read the data.
Default: None, number set in the mindspore.dataset.config.
cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_
(default=None, which means no cache is used).
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
Default: None, which means no cache is used.

Raises:
RuntimeError: If `dataset_dir` does not contain data files.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.
RuntimeError: If `num_shards` is specified but `shard_id` is None.
RuntimeError: If `shard_id` is specified but `num_shards` is None.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.

Examples:
>>> iwslt2017_dataset_dir = "/path/to/iwslt2017_dataset_dir"
@@ -1033,8 +1052,8 @@ class IWSLT2017Dataset(SourceDataset, TextBaseDataset):

IWSLT is an international oral translation conference, a major annual scientific conference dedicated to all aspects
of oral translation. The MT task of the IWSLT evaluation activity constitutes a dataset, which can be publicly
obtained through the WIT3 website wit3.fbk.eu. The IWSLT2017 dataset involves German, English, Italian, Dutch, and
Romanian. The dataset includes translations in any two different languages.
obtained through the WIT3 website `wit3 <https://wit3.fbk.eu>`_ . The IWSLT2017 dataset involves German, English,
Italian, Dutch, and Romanian. The dataset includes translations in any two different languages.

You can unzip the original IWSLT2017 dataset files into this directory structure and read by MindSpore's API. You
need to decompress the dataset package in texts/DeEnItNlRo/DeEnItNlRo directory to get the DeEnItNlRo-DeEnItNlRo
@@ -1186,7 +1205,7 @@ class PennTreebankDataset(SourceDataset, TextBaseDataset):
"""
A source dataset that reads and parses PennTreebank datasets.

The generated dataset has one column :py:obj:`[text]`.
The generated dataset has one column :py:obj:`[text]` .
The tensor of column :py:obj:`text` is of the string type.

Args:
@@ -1195,27 +1214,33 @@ class PennTreebankDataset(SourceDataset, TextBaseDataset):
'train' will read from 42,068 train samples of string type,
'test' will read from 3,370 test samples of string type,
'valid' will read from 3,761 test samples of string type,
'all' will read from all 49,199 samples of string type (default=None, all samples).
num_samples (int, optional): Number of samples (rows) to read (default=None, reads the full dataset).
num_parallel_workers (int, optional): Number of workers to read the data
(default=None, number set in the config).
shuffle (Union[bool, Shuffle], optional): Perform reshuffling of the data every epoch
(default=Shuffle.GLOBAL). Bool type and Shuffle enum are both supported to pass in.
'all' will read from all 49,199 samples of string type. Default: None, all samples.
num_samples (int, optional): Number of samples (rows) to read. Default: None, reads the full dataset.
num_parallel_workers (int, optional): Number of workers to read the data.
Default: None, number set in the mindspore.dataset.config.
shuffle (Union[bool, Shuffle], optional): Perform reshuffling of the data every epoch.
Bool type and Shuffle enum are both supported to pass in. Default: `Shuffle.GLOBAL` .
If shuffle is False, no shuffling will be performed.
If shuffle is True, performs global shuffle.
There are three levels of shuffling, desired shuffle enum defined by mindspore.dataset.Shuffle.
If shuffle is True, it is equivalent to setting `shuffle` to mindspore.dataset.Shuffle.GLOBAL.
Set the mode of data shuffling by passing in enumeration variables:

- Shuffle.GLOBAL: Shuffle both the files and samples, same as setting shuffle to True.
- Shuffle.GLOBAL: Shuffle both the files and samples.

- Shuffle.FILES: Shuffle files only.

num_shards (int, optional): Number of shards that the dataset will be divided into (default=None).
num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
When this argument is specified, `num_samples` reflects the max sample number of per shard.
shard_id (int, optional): The shard ID within `num_shards` (default=None). This
shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
argument can only be specified when `num_shards` is also specified.
cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_
(default=None, which means no cache is used).
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
Default: None, which means no cache is used.

Raises:
RuntimeError: If `dataset_dir` does not contain data files.
RuntimeError: If `num_shards` is specified but `shard_id` is None.
RuntimeError: If `shard_id` is specified but `num_shards` is None.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.

Examples:
>>> penn_treebank_dataset_dir = "/path/to/penn_treebank_dataset_directory"
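The PennTreebank split sizes quoted above are internally consistent: 'all' is simply the union of the three splits. A quick arithmetic check, using only the numbers from the docstring:

```python
# Split sizes quoted in the PennTreebankDataset docstring above.
ptb_splits = {"train": 42068, "test": 3370, "valid": 3761}

# 'all' reads every sample, so it should equal the sum of the three splits.
total = sum(ptb_splits.values())
print(total)  # 49199, matching the 'all' figure in the text
```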
@@ -1444,7 +1444,7 @@ class EMnistDataset(MappableDataset, VisionBaseDataset):
"""
A source dataset that reads and parses the EMNIST dataset.

The generated dataset has two columns :py:obj:`[image, label]`.
The generated dataset has two columns :py:obj:`[image, label]` .
The tensor of column :py:obj:`image` is of the uint8 type.
The tensor of column :py:obj:`label` is a scalar of the uint32 type.

@@ -1452,23 +1452,24 @@ class EMnistDataset(MappableDataset, VisionBaseDataset):
dataset_dir (str): Path to the root directory that contains the dataset.
name (str): Name of splits for this dataset, can be 'byclass', 'bymerge', 'balanced', 'letters', 'digits'
or 'mnist'.
usage (str, optional): Usage of this dataset, can be 'train', 'test' or 'all'.
(default=None, will read all samples).
num_samples (int, optional): The number of images to be included in the dataset
(default=None, will read all images).
num_parallel_workers (int, optional): Number of workers to read the data
(default=None, will use value set in the config).
shuffle (bool, optional): Whether or not to perform shuffle on the dataset
(default=None, expected order behavior shown in the table).
usage (str, optional): Usage of this dataset, can be 'train', 'test' or 'all'. 'train' will read from 60,000
train samples, 'test' will read from 10,000 test samples, 'all' will read from all 70,000 samples.
Default: None, will read all samples.
num_samples (int, optional): The number of images to be included in the dataset.
Default: None, will read all images.
num_parallel_workers (int, optional): Number of workers to read the data.
Default: None, number set in the mindspore.dataset.config.
shuffle (bool, optional): Whether or not to perform shuffle on the dataset.
Default: None, expected order behavior shown in the table below.
sampler (Sampler, optional): Object used to choose samples from the
dataset (default=None, expected order behavior shown in the table).
num_shards (int, optional): Number of shards that the dataset will be divided into (default=None).
dataset. Default: None, expected order behavior shown in the table below.
num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
When this argument is specified, `num_samples` reflects the max sample number of per shard.
shard_id (int, optional): The shard ID within `num_shards` (default=None). This
shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
argument can only be specified when `num_shards` is also specified.
cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_
(default=None, which means no cache is used).
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
Default: None, which means no cache is used.

Raises:
RuntimeError: If `sampler` and `shuffle` are specified at the same time.
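The Raises entries repeated across these datasets encode a single validation policy: `sampler` excludes both `shuffle` and sharding, and `num_shards` / `shard_id` must come as a valid pair. A standalone sketch of that policy (illustrative only, not the MindSpore source; the function name is an assumption):

```python
def validate_dataset_args(sampler=None, shuffle=None, num_shards=None, shard_id=None):
    """Illustrative argument validation matching the documented Raises entries."""
    if sampler is not None and shuffle is not None:
        raise RuntimeError("sampler and shuffle are specified at the same time")
    if sampler is not None and (num_shards is not None or shard_id is not None):
        raise RuntimeError("sampler and num_shards/shard_id are specified at the same time")
    if (num_shards is None) != (shard_id is None):
        raise RuntimeError("num_shards and shard_id must be specified together")
    if num_shards is not None and not 0 <= shard_id < num_shards:
        raise ValueError("shard_id is invalid (< 0 or >= num_shards)")
    return True

validate_dataset_args(shuffle=True)               # ok: shuffle alone
validate_dataset_args(num_shards=4, shard_id=3)   # ok: a valid shard pair
```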
@@ -1577,44 +1578,44 @@ class FakeImageDataset(MappableDataset, VisionBaseDataset):
"""
A source dataset for generating fake images.

The generated dataset has two columns :py:obj:`[image, label]`.
The generated dataset has two columns :py:obj:`[image, label]` .
The tensor of column :py:obj:`image` is of the uint8 type.
The tensor of column :py:obj:`label` is a scalar of the uint32 type.
The column :py:obj:`label` is a scalar of the uint32 type.

Args:
num_images (int, optional): Number of images to generate in the dataset (default=1000).
image_size (tuple, optional): Size of the fake image (default=(224, 224, 3)).
num_classes (int, optional): Number of classes in the dataset (default=10).
base_seed (int, optional): Offsets the index-based random seed used to generate each image (default=0).
num_samples (int, optional): The number of images to be included in the dataset
(default=None, will read all images).
num_parallel_workers (int, optional): Number of workers to read the data
(default=None, will use value set in the config).
shuffle (bool, optional): Whether or not to perform shuffle on the dataset
(default=None, expected order behavior shown in the table).
num_images (int, optional): Number of images to generate in the dataset. Default: 1000.
image_size (tuple, optional): Size of the fake image. Default: (224, 224, 3).
num_classes (int, optional): Number of classes in the dataset. Default: 10.
base_seed (int, optional): Offsets the index-based random seed used to generate each image. Default: 0.
num_samples (int, optional): The number of images to be included in the dataset.
Default: None, will read all images.
num_parallel_workers (int, optional): Number of workers to read the data.
Default: None, number set in the mindspore.dataset.config.
shuffle (bool, optional): Whether or not to perform shuffle on the dataset.
Default: None, expected order behavior shown in the table below.
sampler (Sampler, optional): Object used to choose samples from the
dataset (default=None, expected order behavior shown in the table).
num_shards (int, optional): Number of shards that the dataset will be divided into (default=None).
dataset. Default: None, expected order behavior shown in the table below.
num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
When this argument is specified, `num_samples` reflects the max sample number of per shard.
shard_id (int, optional): The shard ID within `num_shards` (default=None). This
shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
argument can only be specified when `num_shards` is also specified.
cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_
(default=None, which means no cache is used).
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
Default: None, which means no cache is used.

Raises:
ValueError: If `num_parallel_workers` exceeds the max thread numbers.
RuntimeError: If `sampler` and `shuffle` are specified at the same time.
RuntimeError: If `sampler` and `num_shards`/`shard_id` are specified at the same time.
RuntimeError: If `num_shards` is specified but `shard_id` is None.
RuntimeError: If `shard_id` is specified but `num_shards` is None.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.
ValueError: If `shard_id` is invalid (< 0 or >= `num_shards`).

Note:
- This dataset can take in a sampler. 'sampler' and 'shuffle' are mutually exclusive.
- This dataset can take in a `sampler` . `sampler` and `shuffle` are mutually exclusive.
The table below shows what input arguments are allowed and their expected behavior.

.. list-table:: Expected Order Behavior of Using 'sampler' and 'shuffle'
.. list-table:: Expected Order Behavior of Using `sampler` and `shuffle`
:widths: 25 25 50
:header-rows: 1

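FakeImageDataset's `base_seed` offsets an index-based seed, so image `i` is derived from seed `base_seed + i` and is reproducible across runs. A minimal sketch of that seeding idea in plain Python (the pixel logic is illustrative, not MindSpore's generator):

```python
import random

def fake_image(index, base_seed=0, size=(2, 2, 3)):
    """Generate a deterministic pseudo-image for a given index.

    The per-image seed is base_seed + index, so the same index always
    yields the same pixels, and changing base_seed changes every image.
    """
    rng = random.Random(base_seed + index)
    n = size[0] * size[1] * size[2]
    return [rng.randrange(256) for _ in range(n)]  # flat uint8-like values

# Same index and base_seed -> identical image; a new base_seed -> new images.
assert fake_image(3) == fake_image(3)
assert fake_image(3) != fake_image(3, base_seed=1)
```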
@@ -1665,40 +1666,40 @@ class FakeImageDataset(MappableDataset, VisionBaseDataset):

class FashionMnistDataset(MappableDataset, VisionBaseDataset):
"""
A source dataset that reads and parses the FASHION-MNIST dataset.
A source dataset that reads and parses the Fashion-MNIST dataset.

The generated dataset has two columns :py:obj:`[image, label]`.
The generated dataset has two columns :py:obj:`[image, label]` .
The tensor of column :py:obj:`image` is of the uint8 type.
The tensor of column :py:obj:`label` is a scalar of the uint32 type.
The column :py:obj:`label` is a scalar of the uint32 type.

Args:
dataset_dir (str): Path to the root directory that contains the dataset.
usage (str, optional): Usage of this dataset, can be 'train', 'test' or 'all'. 'train' will read from 60,000
train samples, 'test' will read from 10,000 test samples, 'all' will read from all 70,000 samples.
(default=None, will read all samples)
num_samples (int, optional): The number of images to be included in the dataset
(default=None, will read all images).
num_parallel_workers (int, optional): Number of workers to read the data
(default=None, will use value set in the config).
shuffle (bool, optional): Whether or not to perform shuffle on the dataset
(default=None, expected order behavior shown in the table).
sampler (Sampler, optional): Object used to choose samples from the
dataset (default=None, expected order behavior shown in the table).
num_shards (int, optional): Number of shards that the dataset will be divided into (default=None).
Default: None, will read all samples.
num_samples (int, optional): The number of images to be included in the dataset.
Default: None, will read all images.
num_parallel_workers (int, optional): Number of workers to read the data.
Default: None, number set in the mindspore.dataset.config.
shuffle (bool, optional): Whether or not to perform shuffle on the dataset.
Default: None, expected order behavior shown in the table below.
sampler (Sampler, optional): Object used to choose samples from the dataset.
Default: None, expected order behavior shown in the table below.
num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
When this argument is specified, `num_samples` reflects the maximum sample number of per shard.
shard_id (int, optional): The shard ID within `num_shards` (default=None). This
shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
argument can only be specified when `num_shards` is also specified.
cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_
(default=None, which means no cache is used).
`Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
Default: None, which means no cache is used.

Raises:
RuntimeError: If `dataset_dir` does not contain data files.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.
RuntimeError: If `sampler` and `shuffle` are specified at the same time.
RuntimeError: If `sampler` and `num_shards`/`shard_id` are specified at the same time.
RuntimeError: If `num_shards` is specified but `shard_id` is None.
RuntimeError: If `shard_id` is specified but `num_shards` is None.
ValueError: If `num_parallel_workers` exceeds the max thread numbers.
ValueError: If `shard_id` is invalid (< 0 or >= `num_shards`).

Note:
@ -2033,43 +2034,42 @@ class Flowers102Dataset(GeneratorDataset):
|
|||
"""
|
||||
A source dataset that reads and parses Flowers102 dataset.
|
||||
|
||||
The generated dataset has two columns :py:obj:`[image, label]` or three :py:obj:`[image, segmentation, label]`.
|
||||
The tensor of column :py:obj:`image` is of the uint8 type.
|
||||
The tensor of column :py:obj:`segmentation` is of the uint8 type.
|
||||
The tensor of column :py:obj:`label` is a scalar or a tensor of the uint32 type.
|
||||
According to the given `task` configuration, the generated dataset has different output columns:
|
||||
- `task` = 'Classification', output columns: `[image, dtype=uint8]` , `[label, dtype=uint32]` .
|
||||
- `task` = 'Segmentation',
|
||||
output columns: `[image, dtype=uint8]` , `[segmentation, dtype=uint8]` , `[label, dtype=uint32]` .
|
||||
|
||||
Args:
|
||||
dataset_dir (str): Path to the root directory that contains the dataset.
|
||||
task (str, optional): Specify the 'Classification' or 'Segmentation' task (default='Classification').
|
||||
usage (str, optional): Specify the 'train', 'valid', 'test' part or 'all' parts of dataset
|
||||
(default='all', will read all samples).
|
||||
num_samples (int, optional): The number of samples to be included in the dataset (default=None, all images).
|
||||
num_parallel_workers (int, optional): Number of subprocesses used to fetch the dataset in parallel (default=1).
|
||||
shuffle (bool, optional): Whether or not to perform shuffle on the dataset. Random accessible input is required.
|
||||
(default=None, expected order behavior shown in the table).
|
||||
decode (bool, optional): Whether or not to decode the images and segmentations after reading (default=False).
|
||||
sampler (Union[Sampler, Iterable], optional): Object used to choose samples from the dataset. Random accessible
|
||||
input is required (default=None, expected order behavior shown in the table).
|
||||
num_shards (int, optional): Number of shards that the dataset will be divided into (default=None).
|
||||
Random accessible input is required. When this argument is specified, 'num_samples' reflects the max
|
||||
sample number of per shard.
|
||||
        shard_id (int, optional): The shard ID within `num_shards` (default=None). This argument must be specified only
            when num_shards is also specified. Random accessible input is required.
        task (str, optional): Specify the 'Classification' or 'Segmentation' task. Default: 'Classification'.
        usage (str, optional): Specify the 'train', 'valid', 'test' part or 'all' parts of dataset.
            Default: 'all', will read all samples.
        num_samples (int, optional): The number of samples to be included in the dataset. Default: None, all images.
        num_parallel_workers (int, optional): Number of subprocesses used to fetch the dataset in parallel. Default: 1.
        shuffle (bool, optional): Whether or not to perform shuffle on the dataset.
            Default: None, expected order behavior shown in the table below.
        decode (bool, optional): Whether or not to decode the images and segmentations after reading. Default: False.
        sampler (Union[Sampler, Iterable], optional): Object used to choose samples from the dataset.
            Default: None, expected order behavior shown in the table below.
        num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
            When this argument is specified, `num_samples` reflects the maximum number of samples per shard.
        shard_id (int, optional): The shard ID within `num_shards` . Default: None. This argument must be specified only
            when `num_shards` is also specified.

    Raises:
        RuntimeError: If `dataset_dir` does not contain data files.
        ValueError: If `num_parallel_workers` exceeds the max thread numbers.
        RuntimeError: If `sampler` and `shuffle` are specified at the same time.
        RuntimeError: If `sampler` and `num_shards`/`shard_id` are specified at the same time.
        RuntimeError: If `num_shards` is specified but `shard_id` is None.
        RuntimeError: If `shard_id` is specified but `num_shards` is None.
        ValueError: If `num_parallel_workers` exceeds the max thread numbers.
        ValueError: If `shard_id` is invalid (< 0 or >= `num_shards`).

    Note:
        - This dataset can take in a sampler. 'sampler' and 'shuffle' are mutually exclusive.
        - This dataset can take in a `sampler` . `sampler` and `shuffle` are mutually exclusive.
          The table below shows what input arguments are allowed and their expected behavior.

        .. list-table:: Expected Order Behavior of Using 'sampler' and 'shuffle'
        .. list-table:: Expected Order Behavior of Using `sampler` and `shuffle`
           :widths: 25 25 50
           :header-rows: 1
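The `num_shards`/`shard_id` contract documented above can be sketched in plain Python. This is an illustrative model only: the round-robin split and the helper name `shard_indices` are assumptions, not MindSpore's actual implementation. It shows why both arguments must be given together, why `shard_id` must lie in `[0, num_shards)`, and how `num_samples` becomes a per-shard cap once sharding is enabled.

```python
# Illustrative sketch (not MindSpore code) of how `num_shards`/`shard_id`
# partition a dataset for distributed training. The round-robin split is
# an assumption chosen for clarity.

def shard_indices(total_rows, num_shards, shard_id, num_samples=None):
    """Return the row indices one shard would read."""
    if (num_shards is None) != (shard_id is None):
        # Mirrors the documented RuntimeErrors: the two arguments
        # must be specified together.
        raise RuntimeError("num_shards and shard_id must be specified together")
    if num_shards is not None and not 0 <= shard_id < num_shards:
        # Mirrors: ValueError if shard_id is invalid (< 0 or >= num_shards).
        raise ValueError("shard_id is invalid (< 0 or >= num_shards)")
    indices = list(range(total_rows))
    if num_shards is not None:
        # Every num_shards-th row belongs to this shard.
        indices = indices[shard_id::num_shards]
    if num_samples is not None:
        # Once sharding is enabled, num_samples caps each shard,
        # not the whole dataset.
        indices = indices[:num_samples]
    return indices
```

For example, with 10 rows, `num_shards=4` and `shard_id=1`, this sketch yields rows 1, 5 and 9; adding `num_samples=2` trims that shard to rows 1 and 5.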
@@ -2479,39 +2479,39 @@ class KMnistDataset(MappableDataset, VisionBaseDataset):
    """
    A source dataset that reads and parses the KMNIST dataset.

    The generated dataset has two columns :py:obj:`[image, label]`.
    The generated dataset has two columns :py:obj:`[image, label]` .
    The tensor of column :py:obj:`image` is of the uint8 type.
    The tensor of column :py:obj:`label` is a scalar of the uint32 type.
    The column :py:obj:`label` is a scalar of the uint32 type.

    Args:
        dataset_dir (str): Path to the root directory that contains the dataset.
        usage (str, optional): Usage of this dataset, can be 'train', 'test' or 'all' . 'train' will read from 60,000
            train samples, 'test' will read from 10,000 test samples, 'all' will read from all 70,000 samples.
            (default=None, will read all samples)
        num_samples (int, optional): The number of images to be included in the dataset
            (default=None, will read all images).
        num_parallel_workers (int, optional): Number of workers to read the data
            (default=None, will use value set in the config).
        shuffle (bool, optional): Whether or not to perform shuffle on the dataset
            (default=None, expected order behavior shown in the table).
        sampler (Sampler, optional): Object used to choose samples from the
            dataset (default=None, expected order behavior shown in the table).
        num_shards (int, optional): Number of shards that the dataset will be divided into (default=None).
            Default: None, will read all samples.
        num_samples (int, optional): The number of images to be included in the dataset.
            Default: None, will read all images.
        num_parallel_workers (int, optional): Number of workers to read the data.
            Default: None, number set in the mindspore.dataset.config.
        shuffle (bool, optional): Whether or not to perform shuffle on the dataset.
            Default: None, expected order behavior shown in the table below.
        sampler (Sampler, optional): Object used to choose samples from the dataset.
            Default: None, expected order behavior shown in the table below.
        num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
            When this argument is specified, `num_samples` reflects the maximum number of samples per shard.
        shard_id (int, optional): The shard ID within `num_shards` (default=None). This
        shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
            argument can only be specified when `num_shards` is also specified.
        cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
            `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_
            (default=None, which means no cache is used).
            `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
            Default: None, which means no cache is used.

    Raises:
        RuntimeError: If `dataset_dir` does not contain data files.
        ValueError: If `num_parallel_workers` exceeds the max thread numbers.
        RuntimeError: If `sampler` and `shuffle` are specified at the same time.
        RuntimeError: If `sampler` and sharding are specified at the same time.
        RuntimeError: If `num_shards` is specified but `shard_id` is None.
        RuntimeError: If `shard_id` is specified but `num_shards` is None.
        ValueError: If `shard_id` is invalid (out of range [0, `num_shards`]).
        ValueError: If `num_parallel_workers` exceeds the max thread numbers.
        ValueError: If `shard_id` is invalid (< 0 or >= `num_shards`).

    Note:
        - This dataset can take in a `sampler`. `sampler` and `shuffle` are mutually exclusive.
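The mutual-exclusion rules listed in the Raises sections can be modeled by a small validator. This is a hypothetical sketch: the standalone function `validate_mappable_args` is invented for illustration, while MindSpore performs equivalent checks inside its dataset constructors.

```python
# Hypothetical sketch of the argument checks the Raises sections describe.
# Not MindSpore's implementation; it only restates the documented rules.

def validate_mappable_args(sampler=None, shuffle=None,
                           num_shards=None, shard_id=None):
    """Raise on argument combinations the docstrings forbid."""
    if sampler is not None and shuffle is not None:
        # `sampler` and `shuffle` are mutually exclusive.
        raise RuntimeError(
            "sampler and shuffle cannot be specified at the same time")
    if sampler is not None and (num_shards is not None or shard_id is not None):
        # `sampler` and sharding are mutually exclusive.
        raise RuntimeError(
            "sampler and num_shards/shard_id cannot be specified at the same time")
    if (num_shards is None) != (shard_id is None):
        # `num_shards` and `shard_id` must be specified together.
        raise RuntimeError(
            "num_shards and shard_id must be specified together")
    if num_shards is not None and not 0 <= shard_id < num_shards:
        raise ValueError(
            "shard_id is invalid (< 0 or >= num_shards)")
```

Passing `sampler` together with either `shuffle` or `num_shards`/`shard_id` raises, as does giving only one of the two sharding arguments; all-default calls pass silently.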
@@ -3259,41 +3259,38 @@ class PhotoTourDataset(MappableDataset, VisionBaseDataset):
    """
    A source dataset that reads and parses the PhotoTour dataset.

    The generated dataset with different usage has different output columns.
    If train, the generated dataset has one column :py:obj:`[image]`,
    else three columns :py:obj:`[image1, image2, matches]`.
    The tensor of column :py:obj:`image`, :py:obj:`image1` and :py:obj:`image2` is of the uint8 type.
    The tensor of column :py:obj:`matches` is a scalar of the uint32 type.
    According to the given `usage` configuration, the generated dataset has different output columns:

    - `usage` = 'train', output columns: `[image, dtype=uint8]` .
    - `usage` ≠ 'train', output columns: `[image1, dtype=uint8]` , `[image2, dtype=uint8]` , `[matches, dtype=uint32]` .

    Args:
        dataset_dir (str): Path to the root directory that contains the dataset.
        name (str): Name of the dataset to load,
            should be one of 'notredame', 'yosemite', 'liberty', 'notredame_harris',
            'yosemite_harris' or 'liberty_harris'.
        usage (str, optional): Usage of the dataset, can be 'train' or 'test' (Default=None, will be set to 'train').
        usage (str, optional): Usage of the dataset, can be 'train' or 'test'. Default: None, will be set to 'train'.
            When usage is 'train', number of samples for each `name` is
            {'notredame': 468159, 'yosemite': 633587, 'liberty': 450092, 'liberty_harris': 379587,
            'yosemite_harris': 450912, 'notredame_harris': 325295}.
            When usage is 'test', will read 100,000 samples for testing.
        num_samples (int, optional): The number of images to be included in the dataset
            (default=None, will read all images).
        num_parallel_workers (int, optional): Number of workers to read the data
            (default=None, will use value set in the config).
        shuffle (bool, optional): Whether or not to perform shuffle on the dataset
            (default=None, expected order behavior shown in the table).
        sampler (Sampler, optional): Object used to choose samples from the
            dataset (default=None, expected order behavior shown in the table).
        num_shards (int, optional): Number of shards that the dataset will be divided into (default=None).
        num_samples (int, optional): The number of images to be included in the dataset.
            Default: None, will read all images.
        num_parallel_workers (int, optional): Number of workers to read the data.
            Default: None, number set in the mindspore.dataset.config.
        shuffle (bool, optional): Whether or not to perform shuffle on the dataset.
            Default: None, expected order behavior shown in the table below.
        sampler (Sampler, optional): Object used to choose samples from the dataset.
            Default: None, expected order behavior shown in the table below.
        num_shards (int, optional): Number of shards that the dataset will be divided into. Default: None.
            When this argument is specified, `num_samples` reflects the maximum number of samples per shard.
        shard_id (int, optional): The shard ID within `num_shards` (default=None). This
        shard_id (int, optional): The shard ID within `num_shards` . Default: None. This
            argument can only be specified when `num_shards` is also specified.
        cache (DatasetCache, optional): Use tensor caching service to speed up dataset processing. More details:
            `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_
            (default=None, which means no cache is used).
            `Single-Node Data Cache <https://www.mindspore.cn/tutorials/experts/en/master/dataset/cache.html>`_ .
            Default: None, which means no cache is used.

    Raises:
        RuntimeError: If `dataset_dir` does not contain data files.
        ValueError: If `num_parallel_workers` exceeds the max thread numbers.
        RuntimeError: If `sampler` and `shuffle` are specified at the same time.
        RuntimeError: If `sampler` and `num_shards`/`shard_id` are specified at the same time.
        RuntimeError: If `num_shards` is specified but `shard_id` is None.

@@ -3302,13 +3299,14 @@ class PhotoTourDataset(MappableDataset, VisionBaseDataset):
        ValueError: If `usage` is not in ["train", "test"].
        ValueError: If name is not in ["notredame", "yosemite", "liberty",
            "notredame_harris", "yosemite_harris", "liberty_harris"].
        ValueError: If `num_parallel_workers` exceeds the max thread numbers.
        ValueError: If `shard_id` is invalid (< 0 or >= `num_shards`).

    Note:
        - This dataset can take in a sampler. `sampler` and `shuffle` are mutually exclusive. The table
        - This dataset can take in a `sampler` . `sampler` and `shuffle` are mutually exclusive. The table
          below shows what input arguments are allowed and their expected behavior.

        .. list-table:: Expected Order Behavior of Using 'sampler' and 'shuffle'
        .. list-table:: Expected Order Behavior of Using `sampler` and `shuffle`
           :widths: 64 64 1
           :header-rows: 1
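The usage-dependent column layout described in the PhotoTour docstring can be captured in a tiny lookup. This is an illustrative sketch only: the helper `phototour_columns` is hypothetical and merely restates the documented contract that 'train' yields one uint8 `image` column, while any other usage yields `image1` and `image2` (uint8) plus `matches` (uint32).

```python
# Hypothetical helper restating the PhotoTourDataset column contract.
# MindSpore builds these columns internally; this only mirrors the docs.

def phototour_columns(usage):
    """Return {column name: dtype} for a given `usage`."""
    if usage == 'train':
        # usage = 'train': a single uint8 image column.
        return {'image': 'uint8'}
    # usage != 'train': two uint8 image columns plus a uint32 matches column.
    return {'image1': 'uint8', 'image2': 'uint8', 'matches': 'uint32'}
```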