fix the remaining review problems
parent
a3d391ae70
commit
ce156448a6
|
@ -19,7 +19,7 @@ mindspore.dataset.Caltech101Dataset
|
|||
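The Annotations directory stores the image annotations.
|
||||
- **target_type** (str, optional) - Specifies the subset of the dataset. Can be 'category', 'annotation' or 'all'.
|
||||
When 'category', the category annotation of the image is read as the label; when 'annotation', the contour annotation of the image is read as the label;
|
||||
when 'all', both the category annotation and the contour annotation of the image are output. Default: 'category'.
|
||||
when 'all', both the category annotation and the contour annotation of the image are output. Default: None, which means 'category'.
|
||||
- **num_samples** (int, optional) - Number of samples to read from the dataset; may be smaller than the total dataset size. Default: None, all sample images are read.
|
||||
- **num_parallel_workers** (int, optional) - Number of worker threads used to read the data. Default: None, uses the number of threads configured in mindspore.dataset.config.
|
||||
- **shuffle** (bool, optional) - Whether to shuffle the dataset. Default: None. The table below shows the expected behavior for different parameter configurations.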
|
||||
|
|
|
@ -1,7 +1,7 @@
|
|||
.. py:method:: add_sampler(new_sampler)
|
||||
|
||||
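Add a sampler for the current dataset object.
|
||||
Add a child sampler for the current dataset.
|
||||
|
||||
**Parameters:**
|
||||
|
||||
- **new_sampler** (Sampler) : The new sampler to be applied to the current dataset object.
|
||||
- **new_sampler** (Sampler): The child sampler to be added.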
|
||||
|
|
|
@ -1,7 +1,7 @@
|
|||
.. py:method:: use_sampler(new_sampler)
|
||||
|
||||
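Replace the sampler of the current dataset object with a new one.
|
||||
Replace the last child sampler of the current dataset, leaving the parent sampler unchanged.
|
||||
|
||||
**Parameters:**
|
||||
|
||||
- **new_sampler** (Sampler) : The replacement sampler.
|
||||
- **new_sampler** (Sampler) : The new sampler used for the replacement.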
|
||||
|
|
|
@ -3,13 +3,12 @@
|
|||
|
||||
.. py:class:: mindspore.dataset.text.NormalizeForm
|
||||
|
||||
Enumeration values for :class:`mindspore.dataset.text.transforms.NormalizeUTF8` .
|
||||
Enumeration class for `Unicode normalization forms <http://unicode.org/reports/tr15/>`_ .
|
||||
|
||||
`Unicode normalization forms <http://unicode.org/reports/tr15/#Norm_Forms>_` Possible enumeration values are: `NormalizeForm.NONE` , `NormalizeForm.NFC` , `NormalizeForm.NFKC` , `NormalizeForm.NFD` and `NormalizeForm.NFKD` .
|
||||
Possible enumeration values are: NormalizeForm.NONE, NormalizeForm.NFC, NormalizeForm.NFKC, NormalizeForm.NFD and NormalizeForm.NFKD.
|
||||
|
||||
- **NormalizeForm.NONE** - Do nothing to the input string.
|
||||
- **NormalizeForm.NFC** - Normalize the input string with Normalization Form C.
|
||||
- **NormalizeForm.NFKC** - Normalize the input string with Normalization Form KC.
|
||||
- **NormalizeForm.NFD** - Normalize the input string with Normalization Form D.
|
||||
- **NormalizeForm.NFKD** - Normalize the input string with Normalization Form KD.
|
||||
|
||||
- **NormalizeForm.NONE** - No normalization.
|
||||
- **NormalizeForm.NFC** - Canonical Decomposition, followed by Canonical Composition.
|
||||
- **NormalizeForm.NFKC** - Compatibility Decomposition, followed by Canonical Composition.
|
||||
- **NormalizeForm.NFD** - Canonical Decomposition.
|
||||
- **NormalizeForm.NFKD** - Compatibility Decomposition.
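The four normalization forms listed above can be tried out with Python's standard `unicodedata` module. This is only an illustration of the Unicode forms themselves, not of how MindSpore implements `NormalizeUTF8`; the helper name `normalize` and the string form names passed to it are assumptions made for this sketch.

```python
import unicodedata


def normalize(text, form):
    """Apply a Unicode normalization form; "NONE" returns the text unchanged.

    Hypothetical helper for illustration; `form` is a plain string here,
    not a mindspore NormalizeForm member.
    """
    if form == "NONE":
        return text
    return unicodedata.normalize(form, text)


composed = "\u00e9"     # "é" as a single precomposed code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
ligature = "\ufb01"     # "fi" ligature, compatibility-equivalent to "f" + "i"

# Canonical forms convert between the composed and decomposed spellings.
nfc = normalize(decomposed, "NFC")
nfd = normalize(composed, "NFD")

# Compatibility forms additionally replace the ligature with plain letters.
nfkc = normalize(ligature, "NFKC")
nfkd = normalize(ligature, "NFKD")
```

Canonical forms (NFC, NFD) leave the ligature untouched, which is the practical difference from the compatibility forms (NFKC, NFKD).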
|
||||
|
|
|
@ -9,17 +9,17 @@
|
|||
|
||||
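**Parameters:**
|
||||
|
||||
- **lower_case** (bool, optional) - If True, will apply :class:`mindspore.dataset.text.transforms.CaseFold` , NFD mode :class:`mindspore.dataset.text.transforms.NormalizeUTF8` and :class:`mindspore.dataset.text.transforms.RegexReplace` operations on the input to fold the text to lower case and strip accented characters; if False, will only apply the :class:`mindspore.dataset.text.transforms.NormalizeUTF8` operation of mode `normalization_form` . Default: False.
|
||||
- **lower_case** (bool, optional) - Whether to perform lowercase processing on the text. If True, will fold the text to lower case and strip accented characters; if False, will only perform normalization on the text, with mode specified by `normalization_form` . Default: False.
|
||||
- **keep_whitespace** (bool, optional) - Whether to keep whitespace in the tokenized output. Default: False.
|
||||
- **normalization_form** (:class:`mindspore.dataset.text.NormalizeForm`, optional) - `Unicode normalization forms <http://unicode.org/reports/tr15/>`_ , only valid when `lower_case` is False, can be NormalizeForm.NONE, NormalizeForm.NFC, NormalizeForm.NFKC, NormalizeForm.NFD or NormalizeForm.NFKD. Default: NormalizeForm.NONE.
|
||||
|
||||
- NormalizeForm.NONE: do nothing to the input string.
|
||||
- NormalizeForm.NFC: normalize the input string with Normalization Form C.
|
||||
- NormalizeForm.NFKC: normalize the input string with Normalization Form KC.
|
||||
- NormalizeForm.NFD: normalize the input string with Normalization Form D.
|
||||
- NormalizeForm.NFKD: normalize the input string with Normalization Form KD.
|
||||
- NormalizeForm.NONE: no normalization.
|
||||
- NormalizeForm.NFC: Canonical Decomposition, followed by Canonical Composition.
|
||||
- NormalizeForm.NFKC: Compatibility Decomposition, followed by Canonical Composition.
|
||||
- NormalizeForm.NFD: Canonical Decomposition.
|
||||
- NormalizeForm.NFKD: Compatibility Decomposition.
|
||||
|
||||
- **preserve_unused_token** (bool, optional) - If True, will not split special tokens like '[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]'. Default: True.
|
||||
- **preserve_unused_token** (bool, optional) - Whether to preserve special tokens. If True, will not split special tokens like '[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]'. Default: True.
|
||||
- **with_offsets** (bool, optional) - Whether to output the offsets of tokens in the string. Default: False.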
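The `lower_case=True` behaviour described in the parameter list (fold case, decompose with NFD, then remove accent marks) can be approximated in plain Python. This is a rough sketch using only the standard library; the function name `fold_and_strip_accents` is hypothetical, and filtering on the Unicode category `Mn` is an assumed stand-in for the `RegexReplace` step, not the tokenizer's actual implementation.

```python
import unicodedata


def fold_and_strip_accents(text):
    """Approximate the lower_case=True preprocessing with the standard library.

    Hypothetical helper for illustration only.
    """
    # Step 1: case folding (a more aggressive form of lowercasing).
    folded = text.casefold()
    # Step 2: NFD splits accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", folded)
    # Step 3: drop nonspacing marks (category "Mn"), i.e. the accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


example = fold_and_strip_accents("Café Über")
```

NFD has to run before the accents are removed: in the precomposed form, "é" is a single code point with no separate mark to strip.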
|
||||
|
||||
**Raises:**
|
||||
|
|
|
@ -13,10 +13,17 @@ mindspore.dataset.text.transforms.BertTokenizer
|
|||
- **suffix_indicator** (str, optional) - Prefix flag used to indicate a subword suffix. Default: '##'.
|
||||
- **max_bytes_per_token** (int, optional) - Maximum token length for tokenization; words longer than this will not be split. Default: 100.
|
||||
- **unknown_token** (str, optional) - The output for unknown words. When set to an empty string, the corresponding unknown word is returned directly as the output; otherwise, this string is returned as the output. Default: '[UNK]'.
|
||||
- **lower_case** (bool, optional) - If True, will apply :class:`mindspore.dataset.text.transforms.CaseFold` , NFD mode :class:`mindspore.dataset.text.transforms.NormalizeUTF8` and :class:`mindspore.dataset.text.transforms.RegexReplace` operations on the input to fold the text to lower case and strip accented characters; if False, will only apply the :class:`mindspore.dataset.text.transforms.NormalizeUTF8` operation of mode `normalization_form` . Default: False.
|
||||
- **lower_case** (bool, optional) - Whether to perform lowercase processing on the text. If True, will fold the text to lower case and strip accented characters; if False, will only perform normalization on the text, with mode specified by `normalization_form` . Default: False.
|
||||
- **keep_whitespace** (bool, optional) - Whether to keep whitespace in the tokenized output. Default: False.
|
||||
- **normalization_form** (:class:`mindspore.dataset.text.NormalizeForm`, optional) - `Unicode normalization forms <http://unicode.org/reports/tr15/>`_ , only valid when `lower_case` is False. See :class:`mindspore.dataset.text.transforms.NormalizeUTF8` for details. Default: NormalizeForm.NONE.
|
||||
- **preserve_unused_token** (bool, optional) - If True, will not split special tokens like '[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]'. Default: True.
|
||||
- **normalization_form** (:class:`mindspore.dataset.text.NormalizeForm`, optional) - `Unicode normalization forms <http://unicode.org/reports/tr15/>`_ , only valid when `lower_case` is False, can be NormalizeForm.NONE, NormalizeForm.NFC, NormalizeForm.NFKC, NormalizeForm.NFD or NormalizeForm.NFKD. Default: NormalizeForm.NONE.
|
||||
|
||||
- NormalizeForm.NONE: no normalization.
|
||||
- NormalizeForm.NFC: Canonical Decomposition, followed by Canonical Composition.
|
||||
- NormalizeForm.NFKC: Compatibility Decomposition, followed by Canonical Composition.
|
||||
- NormalizeForm.NFD: Canonical Decomposition.
|
||||
- NormalizeForm.NFKD: Compatibility Decomposition.
|
||||
|
||||
- **preserve_unused_token** (bool, optional) - Whether to preserve special tokens. If True, will not split special tokens like '[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]'. Default: True.
|
||||
- **with_offsets** (bool, optional) - Whether to output the offsets of tokens in the string. Default: False.
|
||||
|
||||
**Raises:**
|
||||
|
|
|
@ -2159,18 +2159,16 @@ class MappableDataset(SourceDataset):
|
|||
|
||||
def add_sampler(self, new_sampler):
|
||||
"""
|
||||
Add a sampler for current dataset.
|
||||
Add a child sampler for the current dataset.
|
||||
|
||||
Args:
|
||||
new_sampler (Sampler): The sampler to be added as the parent sampler for current dataset.
|
||||
new_sampler (Sampler): The child sampler to be added.
|
||||
|
||||
Examples:
|
||||
>>> # dataset is an instance object of Dataset
|
||||
>>> # use a DistributedSampler instead
|
||||
>>> new_sampler = ds.DistributedSampler(10, 2)
|
||||
>>> dataset.add_sampler(new_sampler)
|
||||
>>> dataset.add_sampler(new_sampler) # dataset is an instance of Dataset
|
||||
"""
|
||||
# note: By adding a sampler, the sampled IDs will flow to new_sampler
|
||||
# Note: By adding a sampler, the sampled IDs will flow to the new_sampler
|
||||
# after first passing through the current samplers attached to this dataset.
|
||||
self.dataset_size = None
|
||||
new_sampler.add_child(self.sampler)
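The note in this hunk says that sampled IDs pass through the samplers already attached to the dataset before they reach `new_sampler`. A toy model of that chaining is sketched below. `ToySampler` and `ToyDataset` are hypothetical stand-ins, not MindSpore classes, and the assignment `self.sampler = new_sampler` is an assumption about what follows the `add_child` call, since the hunk does not show it.

```python
class ToySampler:
    """Hypothetical sampler: filters a list of IDs, after its child has run."""

    def __init__(self, select):
        self.select = select  # function mapping a list of IDs to a list of IDs
        self.child = None

    def add_child(self, child):
        self.child = child

    def sample(self, ids):
        # The child (the previously attached sampler) sees the IDs first.
        if self.child is not None:
            ids = self.child.sample(ids)
        return self.select(ids)


class ToyDataset:
    """Hypothetical dataset holding a chain of samplers."""

    def __init__(self, size, sampler):
        self.size = size
        self.sampler = sampler

    def add_sampler(self, new_sampler):
        # Mirrors the line shown in the diff: the current sampler becomes
        # the child of the new one.
        new_sampler.add_child(self.sampler)
        # Assumption: the new sampler then becomes the dataset's sampler.
        self.sampler = new_sampler

    def sampled_ids(self):
        return self.sampler.sample(list(range(self.size)))


first_six = ToySampler(lambda ids: ids[:6])      # existing sampler
every_second = ToySampler(lambda ids: ids[::2])  # sampler added later

dataset = ToyDataset(10, first_six)
dataset.add_sampler(every_second)
result = dataset.sampled_ids()
```

In this model the existing sampler narrows the IDs to 0 through 5, and the added sampler then keeps every second one of those.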
|
||||
|
@ -2178,10 +2176,10 @@ class MappableDataset(SourceDataset):
|
|||
|
||||
def use_sampler(self, new_sampler):
|
||||
"""
|
||||
Make the current dataset use the new_sampler provided by other API.
|
||||
Replace the last child sampler of the current dataset, leaving the parent sampler unchanged.
|
||||
|
||||
Args:
|
||||
new_sampler (Sampler): The sampler to use for the current dataset.
|
||||
new_sampler (Sampler): The new sampler to replace with.
|
||||
|
||||
Examples:
|
||||
>>> # dataset is an instance object of Dataset
|
||||
|
|
|
@ -690,24 +690,23 @@ if platform.system().lower() != 'windows':
|
|||
`BasicTokenizer` is not supported on Windows platform yet.
|
||||
|
||||
Args:
|
||||
lower_case (bool, optional): If True, will apply `CaseFold`, NormalizeForm.NFD mode `NormalizeUTF8`,
|
||||
`RegexReplace` operations on the input to fold the text to lower case and strip accented characters.
|
||||
If False, will only apply `NormalizeUTF8` operation of mode `normalization_form` on the input.
|
||||
Default: False.
|
||||
lower_case (bool, optional): Whether to perform lowercase processing on the text. If True, will fold the
|
||||
text to lower case and strip accented characters. If False, will only perform normalization on the
|
||||
text, with mode specified by `normalization_form`. Default: False.
|
||||
keep_whitespace (bool, optional): If True, the whitespace will be kept in the output. Default: False.
|
||||
normalization_form (NormalizeForm, optional):
|
||||
`Unicode normalization forms <http://unicode.org/reports/tr15/>`_, only valid when `lower_case`
|
||||
is False, can be NormalizeForm.NONE, NormalizeForm.NFC, NormalizeForm.NFKC, NormalizeForm.NFD or
|
||||
NormalizeForm.NFKD. Default: NormalizeForm.NONE.
|
||||
|
||||
- NormalizeForm.NONE, do nothing to input string.
|
||||
- NormalizeForm.NFC, normalize with Normalization Form C.
|
||||
- NormalizeForm.NFKC, normalize with Normalization Form KC.
|
||||
- NormalizeForm.NFD, normalize with Normalization Form D.
|
||||
- NormalizeForm.NFKD, normalize with Normalization Form KD.
|
||||
- NormalizeForm.NONE, no normalization.
|
||||
- NormalizeForm.NFC, Canonical Decomposition, followed by Canonical Composition.
|
||||
- NormalizeForm.NFKC, Compatibility Decomposition, followed by Canonical Composition.
|
||||
- NormalizeForm.NFD, Canonical Decomposition.
|
||||
- NormalizeForm.NFKD, Compatibility Decomposition.
|
||||
|
||||
preserve_unused_token (bool, optional): If True, will not split special tokens like
|
||||
'[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]'. Default: True.
|
||||
preserve_unused_token (bool, optional): Whether to preserve special tokens. If True, will not split special
|
||||
tokens like '[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]'. Default: True.
|
||||
with_offsets (bool, optional): Whether to return the offsets of tokens. Default: False.
|
||||
|
||||
Raises:
|
||||
|
@ -778,24 +777,23 @@ if platform.system().lower() != 'windows':
|
|||
unknown_token (str, optional): The output for unknown words. When set to an empty string, the corresponding
|
||||
unknown word will be directly returned as the output. Otherwise, the set string will be returned as the
|
||||
output. Default: '[UNK]'.
|
||||
lower_case (bool, optional): If True, will apply `CaseFold`, NormalizeForm.NFD mode `NormalizeUTF8`,
|
||||
`RegexReplace` operations on the input to fold the text to lower case and strip accented characters.
|
||||
If False, will only apply `NormalizeUTF8` operation of mode `normalization_form` on the input.
|
||||
Default: False.
|
||||
lower_case (bool, optional): Whether to perform lowercase processing on the text. If True, will fold the
|
||||
text to lower case and strip accented characters. If False, will only perform normalization on the
|
||||
text, with mode specified by `normalization_form`. Default: False.
|
||||
keep_whitespace (bool, optional): If True, the whitespace will be kept in the output. Default: False.
|
||||
normalization_form (NormalizeForm, optional):
|
||||
`Unicode normalization forms <http://unicode.org/reports/tr15/>`_, only valid when `lower_case`
|
||||
is False, can be NormalizeForm.NONE, NormalizeForm.NFC, NormalizeForm.NFKC, NormalizeForm.NFD or
|
||||
NormalizeForm.NFKD. Default: NormalizeForm.NONE.
|
||||
|
||||
- NormalizeForm.NONE, do nothing to input string.
|
||||
- NormalizeForm.NFC, normalize with Normalization Form C.
|
||||
- NormalizeForm.NFKC, normalize with Normalization Form KC.
|
||||
- NormalizeForm.NFD, normalize with Normalization Form D.
|
||||
- NormalizeForm.NFKD, normalize with Normalization Form KD.
|
||||
- NormalizeForm.NONE, no normalization.
|
||||
- NormalizeForm.NFC, Canonical Decomposition, followed by Canonical Composition.
|
||||
- NormalizeForm.NFKC, Compatibility Decomposition, followed by Canonical Composition.
|
||||
- NormalizeForm.NFD, Canonical Decomposition.
|
||||
- NormalizeForm.NFKD, Compatibility Decomposition.
|
||||
|
||||
preserve_unused_token (bool, optional): If True, will not split special tokens like
|
||||
'[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]'. Default: True.
|
||||
preserve_unused_token (bool, optional): Whether to preserve special tokens. If True, will not split special
|
||||
tokens like '[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]'. Default: True.
|
||||
with_offsets (bool, optional): Whether to return the offsets of tokens. Default: False.
|
||||
|
||||
Raises:
|
||||
|
|
|
@ -420,16 +420,16 @@ class JiebaMode(IntEnum):
|
|||
|
||||
class NormalizeForm(IntEnum):
|
||||
"""
|
||||
An enumeration for NormalizeUTF8.
|
||||
Enumeration class for `Unicode normalization forms <http://unicode.org/reports/tr15/>`_ .
|
||||
|
||||
Possible enumeration values are: NormalizeForm.NONE, NormalizeForm.NFC, NormalizeForm.NFKC, NormalizeForm.NFD,
|
||||
NormalizeForm.NFKD.
|
||||
Possible enumeration values are: NormalizeForm.NONE, NormalizeForm.NFC, NormalizeForm.NFKC, NormalizeForm.NFD
|
||||
and NormalizeForm.NFKD.
|
||||
|
||||
- NormalizeForm.NONE: do nothing for input string tensor.
|
||||
- NormalizeForm.NFC: normalize with Normalization Form C.
|
||||
- NormalizeForm.NFKC: normalize with Normalization Form KC.
|
||||
- NormalizeForm.NFD: normalize with Normalization Form D.
|
||||
- NormalizeForm.NFKD: normalize with Normalization Form KD.
|
||||
- NormalizeForm.NONE: no normalization.
|
||||
- NormalizeForm.NFC: Canonical Decomposition, followed by Canonical Composition.
|
||||
- NormalizeForm.NFKC: Compatibility Decomposition, followed by Canonical Composition.
|
||||
- NormalizeForm.NFD: Canonical Decomposition.
|
||||
- NormalizeForm.NFKD: Compatibility Decomposition.
|
||||
"""
|
||||
|
||||
NONE = 0
|
||||
|
|