diff --git a/docs/api/api_python/dataset/mindspore.dataset.Caltech101Dataset.rst b/docs/api/api_python/dataset/mindspore.dataset.Caltech101Dataset.rst
index 69c71dd66c5..00876e5fbf0 100644
--- a/docs/api/api_python/dataset/mindspore.dataset.Caltech101Dataset.rst
+++ b/docs/api/api_python/dataset/mindspore.dataset.Caltech101Dataset.rst
@@ -19,7 +19,7 @@ mindspore.dataset.Caltech101Dataset
目录Annotations用于存储图像的标注。
- **target_type** (str, 可选) - 指定数据集的子集,可取值为'category'、'annotation' 或 'all'。
取值为'category'时将读取图像的类别标注作为label,取值为'annotation'时将读取图像的轮廓标注作为label,
- 取值为'all'时将同时输出图像的类别标注和轮廓标注。默认值:'category'。
+ 取值为'all'时将同时输出图像的类别标注和轮廓标注。默认值:None,表示'category'。
- **num_samples** (int, 可选) - 指定从数据集中读取的样本数,可以小于数据集总数。默认值:None,读取全部样本图片。
- **num_parallel_workers** (int, 可选) - 指定读取数据的工作线程数。默认值:None,使用mindspore.dataset.config中配置的线程数。
- **shuffle** (bool, 可选) - 是否混洗数据集。默认值:None,下表中会展示不同参数配置的预期行为。
diff --git a/docs/api/api_python/dataset/mindspore.dataset.Dataset.add_sampler.rst b/docs/api/api_python/dataset/mindspore.dataset.Dataset.add_sampler.rst
index 9d35b51b583..0793b3367c5 100644
--- a/docs/api/api_python/dataset/mindspore.dataset.Dataset.add_sampler.rst
+++ b/docs/api/api_python/dataset/mindspore.dataset.Dataset.add_sampler.rst
@@ -1,7 +1,7 @@
.. py:method:: add_sampler(new_sampler)
- 为当前数据集对象添加采样器。
+ 为当前数据集添加子采样器。
**参数:**
- - **new_sampler** (Sampler) :指定作用于当前数据集对象的新采样器。
+ - **new_sampler** (Sampler):待添加的子采样器。
diff --git a/docs/api/api_python/dataset/mindspore.dataset.Dataset.use_sampler.rst b/docs/api/api_python/dataset/mindspore.dataset.Dataset.use_sampler.rst
index 89c87860fc6..490e27afd43 100644
--- a/docs/api/api_python/dataset/mindspore.dataset.Dataset.use_sampler.rst
+++ b/docs/api/api_python/dataset/mindspore.dataset.Dataset.use_sampler.rst
@@ -1,7 +1,7 @@
.. py:method:: use_sampler(new_sampler)
- 为当前数据集对象更换一个新的采样器。
+ 替换当前数据集的最末子采样器,保持父采样器不变。
**参数:**
- - **new_sampler** (Sampler) :替换的新采样器。
+ - **new_sampler** (Sampler) :用于替换的新采样器。
diff --git a/docs/api/api_python/dataset_text/mindspore.dataset.text.NormalizeForm.rst b/docs/api/api_python/dataset_text/mindspore.dataset.text.NormalizeForm.rst
index 0ffaef8459f..f07d4679b65 100644
--- a/docs/api/api_python/dataset_text/mindspore.dataset.text.NormalizeForm.rst
+++ b/docs/api/api_python/dataset_text/mindspore.dataset.text.NormalizeForm.rst
@@ -3,13 +3,12 @@
.. py:class:: mindspore.dataset.text.NormalizeForm
- :class:`mindspore.dataset.text.transforms.NormalizeUTF8` 的枚举值。
+ `Unicode规范化模式 `_ 枚举类。
- `Unicode规范化模式 _` 可选的枚举值包括: `NormalizeForm.NONE` 、 `NormalizeForm.NFC` 、 `NormalizeForm.NFKC` 、 `NormalizeForm.NFD` 和 `NormalizeForm.NFKD` 。
+ 可选枚举值为:NormalizeForm.NONE、NormalizeForm.NFC、NormalizeForm.NFKC、NormalizeForm.NFD 和 NormalizeForm.NFKD。
- - **NormalizeForm.NONE** - 对输入字符串不做任何处理。
- - **NormalizeForm.NFC** - 对输入字符串进行C形式规范化。
- - **NormalizeForm.NFKC** - 对输入字符串进行KC形式规范化。
- - **NormalizeForm.NFD** - 对输入字符串进行D形式规范化。
- - **NormalizeForm.NFKD** - 对输入字符串进行KD形式规范化。
-
\ No newline at end of file
+ - **NormalizeForm.NONE** - 不进行规范化处理。
+ - **NormalizeForm.NFC** - 先以标准等价方式分解,再以标准等价方式重组。
+ - **NormalizeForm.NFKC** - 先以兼容等价方式分解,再以标准等价方式重组。
+ - **NormalizeForm.NFD** - 以标准等价方式分解。
+ - **NormalizeForm.NFKD** - 以兼容等价方式分解。
diff --git a/docs/api/api_python/dataset_text/mindspore.dataset.text.transforms.BasicTokenizer.rst b/docs/api/api_python/dataset_text/mindspore.dataset.text.transforms.BasicTokenizer.rst
index f5fe063030f..cc90c143df5 100644
--- a/docs/api/api_python/dataset_text/mindspore.dataset.text.transforms.BasicTokenizer.rst
+++ b/docs/api/api_python/dataset_text/mindspore.dataset.text.transforms.BasicTokenizer.rst
@@ -9,17 +9,17 @@
**参数:**
- - **lower_case** (bool,可选) - 若为True,将对输入执行 :class:`mindspore.dataset.text.transforms.CaseFold` 、NFD模式 :class:`mindspore.dataset.text.transforms.NormalizeUTF8` 和 :class:`mindspore.dataset.text.transforms.RegexReplace` 等操作,将文本转换为小写并删除重音字符;若为False,将只执行 `normalization_form` 模式 :class:`mindspore.dataset.text.transforms.NormalizeUTF8` 操作。默认值:False。
+ - **lower_case** (bool,可选) - 是否对字符串进行小写转换处理。若为True,会将字符串转换为小写并删除重音字符;若为False,将只对字符串进行规范化处理,其模式由 `normalization_form` 指定。默认值:False。
- **keep_whitespace** (bool,可选) - 是否在分词输出中保留空格。默认值:False。
- **normalization_form** (:class:`mindspore.dataset.text.NormalizeForm`,可选) - `Unicode规范化模式 `_,仅当 `lower_case` 为False时生效,取值可为NormalizeForm.NONE、NormalizeForm.NFC、NormalizeForm.NFKC、NormalizeForm.NFD或NormalizeForm.NFKD。默认值:NormalizeForm.NONE。
- - NormalizeForm.NONE:对输入字符串不做任何处理。
- - NormalizeForm.NFC:对输入字符串进行C形式规范化。
- - NormalizeForm.NFKC:对输入字符串进行KC形式规范化。
- - NormalizeForm.NFD:对输入字符串进行D形式规范化。
- - NormalizeForm.NFKD:对输入字符串进行KD形式规范化。
+ - NormalizeForm.NONE:不进行规范化处理。
+ - NormalizeForm.NFC:先以标准等价方式分解,再以标准等价方式重组。
+ - NormalizeForm.NFKC:先以兼容等价方式分解,再以标准等价方式重组。
+ - NormalizeForm.NFD:以标准等价方式分解。
+ - NormalizeForm.NFKD:以兼容等价方式分解。
- - **preserve_unused_token** (bool,可选) - 若为True,将不会对特殊词汇进行分词,如 '[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]' 等。默认值:True。
+ - **preserve_unused_token** (bool,可选) - 是否保留特殊词汇。若为True,将不会对特殊词汇进行分词,如 '[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]' 等。默认值:True。
- **with_offsets** (bool,可选) - 是否输出词汇在字符串中的偏移量。默认值:False。
**异常:**
diff --git a/docs/api/api_python/dataset_text/mindspore.dataset.text.transforms.BertTokenizer.rst b/docs/api/api_python/dataset_text/mindspore.dataset.text.transforms.BertTokenizer.rst
index 56cac325fea..2a4f56f3ab0 100644
--- a/docs/api/api_python/dataset_text/mindspore.dataset.text.transforms.BertTokenizer.rst
+++ b/docs/api/api_python/dataset_text/mindspore.dataset.text.transforms.BertTokenizer.rst
@@ -13,10 +13,17 @@ mindspore.dataset.text.transforms.BertTokenizer
- **suffix_indicator** (str,可选) - 用于指示子词后缀的前缀标志。默认值:'##'。
- **max_bytes_per_token** (int,可选) - 分词最大长度,超过此长度的词汇将不会被拆分。默认值:100。
- **unknown_token** (str,可选) - 对未知词汇的分词输出。当设置为空字符串时,直接返回对应未知词汇作为分词输出;否则,返回该字符串作为分词输出。默认值:'[UNK]'。
- - **lower_case** (bool,可选) - 若为True,将对输入执行 :class:`mindspore.dataset.text.transforms.CaseFold` 、NFD模式 :class:`mindspore.dataset.text.transforms.NormalizeUTF8` 和 :class:`mindspore.dataset.text.transforms.RegexReplace` 等操作,将文本转换为小写并删除重音字符;若为False,将只执行 `normalization_form` 模式 :class:`mindspore.dataset.text.transforms.NormalizeUTF8` 操作。默认值:False。
+ - **lower_case** (bool,可选) - 是否对字符串进行小写转换处理。若为True,会将字符串转换为小写并删除重音字符;若为False,将只对字符串进行规范化处理,其模式由 `normalization_form` 指定。默认值:False。
- **keep_whitespace** (bool,可选) - 是否在分词输出中保留空格。默认值:False。
- - **normalization_form** (:class:`mindspore.dataset.text.NormalizeForm`,可选) - `Unicode规范化模式 `_,仅当 `lower_case` 为False时生效。详见 :class:`mindspore.dataset.text.transforms.NormalizeUTF8` 。默认值:NormalizeForm.NONE。
- - **preserve_unused_token** (bool,可选) - 若为True,将不会对特殊词汇进行分词,如 '[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]' 等。默认值:True。
+ - **normalization_form** (:class:`mindspore.dataset.text.NormalizeForm`,可选) - `Unicode规范化模式 `_,仅当 `lower_case` 为False时生效,取值可为NormalizeForm.NONE、NormalizeForm.NFC、NormalizeForm.NFKC、NormalizeForm.NFD或NormalizeForm.NFKD。默认值:NormalizeForm.NONE。
+
+ - NormalizeForm.NONE:不进行规范化处理。
+ - NormalizeForm.NFC:先以标准等价方式分解,再以标准等价方式重组。
+ - NormalizeForm.NFKC:先以兼容等价方式分解,再以标准等价方式重组。
+ - NormalizeForm.NFD:以标准等价方式分解。
+ - NormalizeForm.NFKD:以兼容等价方式分解。
+
+ - **preserve_unused_token** (bool,可选) - 是否保留特殊词汇。若为True,将不会对特殊词汇进行分词,如 '[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]' 等。默认值:True。
- **with_offsets** (bool,可选) - 是否输出词汇在字符串中的偏移量。默认值:False。
**异常:**
diff --git a/mindspore/python/mindspore/dataset/engine/datasets.py b/mindspore/python/mindspore/dataset/engine/datasets.py
index c3da99404a2..8d5d93a7bb3 100644
--- a/mindspore/python/mindspore/dataset/engine/datasets.py
+++ b/mindspore/python/mindspore/dataset/engine/datasets.py
@@ -2159,18 +2159,16 @@ class MappableDataset(SourceDataset):
def add_sampler(self, new_sampler):
"""
- Add a sampler for current dataset.
+ Add a child sampler for the current dataset.
Args:
- new_sampler (Sampler): The sampler to be added as the parent sampler for current dataset.
+ new_sampler (Sampler): The child sampler to be added.
Examples:
- >>> # dataset is an instance object of Dataset
- >>> # use a DistributedSampler instead
>>> new_sampler = ds.DistributedSampler(10, 2)
- >>> dataset.add_sampler(new_sampler)
+ >>> dataset.add_sampler(new_sampler) # dataset is an instance of Dataset
"""
- # note: By adding a sampler, the sampled IDs will flow to new_sampler
+ # Note: By adding a sampler, the sampled IDs will flow to the new_sampler
# after first passing through the current samplers attached to this dataset.
self.dataset_size = None
new_sampler.add_child(self.sampler)
@@ -2178,10 +2176,10 @@ class MappableDataset(SourceDataset):
def use_sampler(self, new_sampler):
"""
- Make the current dataset use the new_sampler provided by other API.
+ Replace the last child sampler of the current dataset, keeping the parent sampler unchanged.
Args:
- new_sampler (Sampler): The sampler to use for the current dataset.
+ new_sampler (Sampler): The new sampler used as the replacement.
Examples:
>>> # dataset is an instance object of Dataset
diff --git a/mindspore/python/mindspore/dataset/text/transforms.py b/mindspore/python/mindspore/dataset/text/transforms.py
index 4e66bcfc33b..65bec0bfcfe 100644
--- a/mindspore/python/mindspore/dataset/text/transforms.py
+++ b/mindspore/python/mindspore/dataset/text/transforms.py
@@ -690,24 +690,23 @@ if platform.system().lower() != 'windows':
`BasicTokenizer` is not supported on Windows platform yet.
Args:
- lower_case (bool, optional): If True, will apply `CaseFold`, NormalizeForm.NFD mode `NormalizeUTF8`,
- `RegexReplace` operations on the input to fold the text to lower case and strip accented characters.
- If False, will only apply `NormalizeUTF8` operation of mode `normalization_form` on the input.
- Default: False.
+ lower_case (bool, optional): Whether to perform lowercase processing on the text. If True, will fold the
+ text to lower case and strip accented characters. If False, will only perform normalization on the
+ text, with mode specified by `normalization_form`. Default: False.
keep_whitespace (bool, optional): If True, the whitespace will be kept in the output. Default: False.
normalization_form (NormalizeForm, optional):
`Unicode normalization forms `_, only valid when `lower_case`
is False, can be NormalizeForm.NONE, NormalizeForm.NFC, NormalizeForm.NFKC, NormalizeForm.NFD or
NormalizeForm.NFKD. Default: NormalizeForm.NONE.
- - NormalizeForm.NONE, do nothing to input string.
- - NormalizeForm.NFC, normalize with Normalization Form C.
- - NormalizeForm.NFKC, normalize with Normalization Form KC.
- - NormalizeForm.NFD, normalize with Normalization Form D.
- - NormalizeForm.NFKD, normalize with Normalization Form KD.
+ - NormalizeForm.NONE, no normalization.
+ - NormalizeForm.NFC, Canonical Decomposition, followed by Canonical Composition.
+ - NormalizeForm.NFKC, Compatibility Decomposition, followed by Canonical Composition.
+ - NormalizeForm.NFD, Canonical Decomposition.
+ - NormalizeForm.NFKD, Compatibility Decomposition.
- preserve_unused_token (bool, optional): If True, will not split special tokens like
- '[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]'. Default: True.
+ preserve_unused_token (bool, optional): Whether to preserve special tokens. If True, will not split special
+ tokens like '[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]'. Default: True.
with_offsets (bool, optional): Whether to return the offsets of tokens. Default: False.
Raises:
@@ -778,24 +777,23 @@ if platform.system().lower() != 'windows':
unknown_token (str, optional): The output for unknown words. When set to an empty string, the corresponding
unknown word will be directly returned as the output. Otherwise, the set string will be returned as the
output. Default: '[UNK]'.
- lower_case (bool, optional): If True, will apply `CaseFold`, NormalizeForm.NFD mode `NormalizeUTF8`,
- `RegexReplace` operations on the input to fold the text to lower case and strip accented characters.
- If False, will only apply `NormalizeUTF8` operation of mode `normalization_form` on the input.
- Default: False.
+ lower_case (bool, optional): Whether to perform lowercase processing on the text. If True, will fold the
+ text to lower case and strip accented characters. If False, will only perform normalization on the
+ text, with mode specified by `normalization_form`. Default: False.
keep_whitespace (bool, optional): If True, the whitespace will be kept in the output. Default: False.
normalization_form (NormalizeForm, optional):
`Unicode normalization forms `_, only valid when `lower_case`
is False, can be NormalizeForm.NONE, NormalizeForm.NFC, NormalizeForm.NFKC, NormalizeForm.NFD or
NormalizeForm.NFKD. Default: NormalizeForm.NONE.
- - NormalizeForm.NONE, do nothing to input string.
- - NormalizeForm.NFC, normalize with Normalization Form C.
- - NormalizeForm.NFKC, normalize with Normalization Form KC.
- - NormalizeForm.NFD, normalize with Normalization Form D.
- - NormalizeForm.NFKD, normalize with Normalization Form KD.
+ - NormalizeForm.NONE, no normalization.
+ - NormalizeForm.NFC, Canonical Decomposition, followed by Canonical Composition.
+ - NormalizeForm.NFKC, Compatibility Decomposition, followed by Canonical Composition.
+ - NormalizeForm.NFD, Canonical Decomposition.
+ - NormalizeForm.NFKD, Compatibility Decomposition.
- preserve_unused_token (bool, optional): If True, will not split special tokens like
- '[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]'. Default: True.
+ preserve_unused_token (bool, optional): Whether to preserve special tokens. If True, will not split special
+ tokens like '[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]'. Default: True.
with_offsets (bool, optional): Whether to return the offsets of tokens. Default: False.
Raises:
diff --git a/mindspore/python/mindspore/dataset/text/utils.py b/mindspore/python/mindspore/dataset/text/utils.py
index 4f4abfff220..e5f55c2bbff 100644
--- a/mindspore/python/mindspore/dataset/text/utils.py
+++ b/mindspore/python/mindspore/dataset/text/utils.py
@@ -420,16 +420,16 @@ class JiebaMode(IntEnum):
class NormalizeForm(IntEnum):
"""
- An enumeration for NormalizeUTF8.
+ Enumeration class for `Unicode normalization forms `_ .
- Possible enumeration values are: NormalizeForm.NONE, NormalizeForm.NFC, NormalizeForm.NFKC, NormalizeForm.NFD,
- NormalizeForm.NFKD.
+ Possible enumeration values are: NormalizeForm.NONE, NormalizeForm.NFC, NormalizeForm.NFKC, NormalizeForm.NFD
+ and NormalizeForm.NFKD.
- - NormalizeForm.NONE: do nothing for input string tensor.
- - NormalizeForm.NFC: normalize with Normalization Form C.
- - NormalizeForm.NFKC: normalize with Normalization Form KC.
- - NormalizeForm.NFD: normalize with Normalization Form D.
- - NormalizeForm.NFKD: normalize with Normalization Form KD.
+ - NormalizeForm.NONE: no normalization.
+ - NormalizeForm.NFC: Canonical Decomposition, followed by Canonical Composition.
+ - NormalizeForm.NFKC: Compatibility Decomposition, followed by Canonical Composition.
+ - NormalizeForm.NFD: Canonical Decomposition.
+ - NormalizeForm.NFKD: Compatibility Decomposition.
"""
NONE = 0