fix the remaining review problems

2022-04-02 16:26:32 +08:00 · ce156448a6
parent a3d391ae70
commit ce156448a6
9 changed files with 63 additions and 61 deletions
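For readers unfamiliar with the Unicode normalization forms whose descriptions this commit rewrites, the following is a minimal sketch using Python's standard `unicodedata` module rather than MindSpore itself. The sample strings, variable names, and the `fold_and_strip_accents` helper are illustrative assumptions; the helper only approximates what `lower_case=True` is described as doing and is not the library's implementation.

```python
import unicodedata

composed = "\u00e9"      # "é" as one precomposed code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

# NFD: canonical decomposition splits the precomposed character.
nfd = unicodedata.normalize("NFD", composed)
# NFC: canonical decomposition, then canonical composition, recombines it.
nfc = unicodedata.normalize("NFC", decomposed)

# The "fi" ligature (U+FB01) has only a compatibility decomposition,
# so the canonical forms leave it alone while the K forms expand it.
ligature = "\ufb01"
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkd_ligature = unicodedata.normalize("NFKD", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)


def fold_and_strip_accents(text):
    """Hypothetical helper: case-fold, decompose (NFD), drop combining marks."""
    decomposed_text = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed_text if not unicodedata.combining(ch))


stripped = fold_and_strip_accents("Caf\u00e9")
```

Comparing `len(nfd)` with `len(nfc)` shows the practical difference between the decomposed and composed forms: the same visible character occupies two code points in one and a single code point in the other.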
--- a/docs/api/api_python/dataset/mindspore.dataset.Caltech101Dataset.rst
+++ b/docs/api/api_python/dataset/mindspore.dataset.Caltech101Dataset.rst
@@ -19,7 +19,7 @@ mindspore.dataset.Caltech101Dataset
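      The Annotations directory stores the image annotations.
    - **target_type** (str, optional) - Specifies the subset of the dataset; can be 'category', 'annotation' or 'all'.
      If 'category', the category annotation of each image is read as the label; if 'annotation', the contour annotation of each image is read as the label;
-      if 'all', both the category annotation and the contour annotation are output. Default: 'category'.
+      if 'all', both the category annotation and the contour annotation are output. Default: None, which means 'category'.
    - **num_samples** (int, optional) - The number of samples to read from the dataset; can be smaller than the dataset size. Default: None, which reads all sample images.
    - **num_parallel_workers** (int, optional) - The number of worker threads used to read the data. Default: None, which uses the number of threads configured in mindspore.dataset.config.
    - **shuffle** (bool, optional) - Whether to shuffle the dataset. Default: None; the table below shows the expected behavior for different parameter configurations.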
--- a/docs/api/api_python/dataset/mindspore.dataset.Dataset.add_sampler.rst
+++ b/docs/api/api_python/dataset/mindspore.dataset.Dataset.add_sampler.rst
@@ -1,7 +1,7 @@
 .. py:method:: add_sampler(new_sampler)

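-    Add a sampler to the current dataset object.
+    Add a child sampler to the current dataset.

    **Parameters:**

-    - **new_sampler** (Sampler) : The new sampler to apply to the current dataset object.
+    - **new_sampler** (Sampler): The child sampler to be added.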
--- a/docs/api/api_python/dataset/mindspore.dataset.Dataset.use_sampler.rst
+++ b/docs/api/api_python/dataset/mindspore.dataset.Dataset.use_sampler.rst
@@ -1,7 +1,7 @@
 .. py:method:: use_sampler(new_sampler)

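-    Replace the sampler of the current dataset object with a new one.
+    Replace the last child sampler of the current dataset, keeping the parent sampler unchanged.

    **Parameters:**

-    - **new_sampler** (Sampler) : The replacement sampler.
+    - **new_sampler** (Sampler) : The new sampler used as the replacement.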
--- a/docs/api/api_python/dataset_text/mindspore.dataset.text.NormalizeForm.rst
+++ b/docs/api/api_python/dataset_text/mindspore.dataset.text.NormalizeForm.rst
@@ -3,13 +3,12 @@

 .. py:class:: mindspore.dataset.text.NormalizeForm

-    Enumeration values for :class:`mindspore.dataset.text.transforms.NormalizeUTF8`.
+    Enumeration class for `Unicode normalization forms <http://unicode.org/reports/tr15/>`_.

-    `Unicode normalization forms <http://unicode.org/reports/tr15/#Norm_Forms>_` Possible enumeration values include: `NormalizeForm.NONE`, `NormalizeForm.NFC`, `NormalizeForm.NFKC`, `NormalizeForm.NFD` and `NormalizeForm.NFKD`.
+    Possible enumeration values are: NormalizeForm.NONE, NormalizeForm.NFC, NormalizeForm.NFKC, NormalizeForm.NFD and NormalizeForm.NFKD.

-    - **NormalizeForm.NONE** - Do nothing to the input string.
-    - **NormalizeForm.NFC** - Normalize the input string with Normalization Form C.
-    - **NormalizeForm.NFKC** - Normalize the input string with Normalization Form KC.
-    - **NormalizeForm.NFD** - Normalize the input string with Normalization Form D.
-    - **NormalizeForm.NFKD** - Normalize the input string with Normalization Form KD.
-    
+    - **NormalizeForm.NONE** - No normalization.
+    - **NormalizeForm.NFC** - Canonical Decomposition, followed by Canonical Composition.
+    - **NormalizeForm.NFKC** - Compatibility Decomposition, followed by Canonical Composition.
+    - **NormalizeForm.NFD** - Canonical Decomposition.
+    - **NormalizeForm.NFKD** - Compatibility Decomposition.
--- a/docs/api/api_python/dataset_text/mindspore.dataset.text.transforms.BasicTokenizer.rst
+++ b/docs/api/api_python/dataset_text/mindspore.dataset.text.transforms.BasicTokenizer.rst
@@ -9,17 +9,17 @@

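    **Parameters:**

-    - **lower_case** (bool, optional) - If True, applies the :class:`mindspore.dataset.text.transforms.CaseFold`, NFD-mode :class:`mindspore.dataset.text.transforms.NormalizeUTF8` and :class:`mindspore.dataset.text.transforms.RegexReplace` operations to the input to fold the text to lower case and strip accented characters; if False, only applies the :class:`mindspore.dataset.text.transforms.NormalizeUTF8` operation in `normalization_form` mode. Default: False.
+    - **lower_case** (bool, optional) - Whether to perform lowercase processing on the string. If True, folds the string to lower case and strips accented characters; if False, only normalizes the string, with the mode specified by `normalization_form`. Default: False.
    - **keep_whitespace** (bool, optional) - Whether to keep whitespace in the tokenized output. Default: False.
    - **normalization_form** (:class:`mindspore.dataset.text.NormalizeForm`, optional) - `Unicode normalization forms <http://unicode.org/reports/tr15/>`_, only valid when `lower_case` is False; can be NormalizeForm.NONE, NormalizeForm.NFC, NormalizeForm.NFKC, NormalizeForm.NFD or NormalizeForm.NFKD. Default: NormalizeForm.NONE.

-      - NormalizeForm.NONE: Do nothing to the input string.
-      - NormalizeForm.NFC: Normalize the input string with Normalization Form C.
-      - NormalizeForm.NFKC: Normalize the input string with Normalization Form KC.
-      - NormalizeForm.NFD: Normalize the input string with Normalization Form D.
-      - NormalizeForm.NFKD: Normalize the input string with Normalization Form KD.
+      - NormalizeForm.NONE: No normalization.
+      - NormalizeForm.NFC: Canonical Decomposition, followed by Canonical Composition.
+      - NormalizeForm.NFKC: Compatibility Decomposition, followed by Canonical Composition.
+      - NormalizeForm.NFD: Canonical Decomposition.
+      - NormalizeForm.NFKD: Compatibility Decomposition.

-    - **preserve_unused_token** (bool, optional) - If True, special tokens such as '[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]' will not be split. Default: True.
+    - **preserve_unused_token** (bool, optional) - Whether to preserve special tokens. If True, special tokens such as '[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]' will not be split. Default: True.
    - **with_offsets** (bool, optional) - Whether to output the offsets of tokens in the string. Default: False.

    **Raises:**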
--- a/docs/api/api_python/dataset_text/mindspore.dataset.text.transforms.BertTokenizer.rst
+++ b/docs/api/api_python/dataset_text/mindspore.dataset.text.transforms.BertTokenizer.rst
@@ -13,10 +13,17 @@ mindspore.dataset.text.transforms.BertTokenizer
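    - **suffix_indicator** (str, optional) - The prefix flag that marks a subword suffix. Default: '##'.
    - **max_bytes_per_token** (int, optional) - The maximum token length; words longer than this will not be split. Default: 100.
    - **unknown_token** (str, optional) - The output for unknown words. When set to an empty string, the unknown word itself is returned as the output; otherwise, this string is returned as the output. Default: '[UNK]'.
-    - **lower_case** (bool, optional) - If True, applies the :class:`mindspore.dataset.text.transforms.CaseFold`, NFD-mode :class:`mindspore.dataset.text.transforms.NormalizeUTF8` and :class:`mindspore.dataset.text.transforms.RegexReplace` operations to the input to fold the text to lower case and strip accented characters; if False, only applies the :class:`mindspore.dataset.text.transforms.NormalizeUTF8` operation in `normalization_form` mode. Default: False.
+    - **lower_case** (bool, optional) - Whether to perform lowercase processing on the string. If True, folds the string to lower case and strips accented characters; if False, only normalizes the string, with the mode specified by `normalization_form`. Default: False.
    - **keep_whitespace** (bool, optional) - Whether to keep whitespace in the tokenized output. Default: False.
-    - **normalization_form** (:class:`mindspore.dataset.text.NormalizeForm`, optional) - `Unicode normalization forms <http://unicode.org/reports/tr15/>`_, only valid when `lower_case` is False. See :class:`mindspore.dataset.text.transforms.NormalizeUTF8` for details. Default: NormalizeForm.NONE.
-    - **preserve_unused_token** (bool, optional) - If True, special tokens such as '[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]' will not be split. Default: True.
+    - **normalization_form** (:class:`mindspore.dataset.text.NormalizeForm`, optional) - `Unicode normalization forms <http://unicode.org/reports/tr15/>`_, only valid when `lower_case` is False; can be NormalizeForm.NONE, NormalizeForm.NFC, NormalizeForm.NFKC, NormalizeForm.NFD or NormalizeForm.NFKD. Default: NormalizeForm.NONE.
+
+      - NormalizeForm.NONE: No normalization.
+      - NormalizeForm.NFC: Canonical Decomposition, followed by Canonical Composition.
+      - NormalizeForm.NFKC: Compatibility Decomposition, followed by Canonical Composition.
+      - NormalizeForm.NFD: Canonical Decomposition.
+      - NormalizeForm.NFKD: Compatibility Decomposition.
+
+    - **preserve_unused_token** (bool, optional) - Whether to preserve special tokens. If True, special tokens such as '[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]' will not be split. Default: True.
    - **with_offsets** (bool, optional) - Whether to output the offsets of tokens in the string. Default: False.

    **Raises:**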
--- a/mindspore/python/mindspore/dataset/engine/datasets.py
+++ b/mindspore/python/mindspore/dataset/engine/datasets.py
@@ -2159,18 +2159,16 @@ class MappableDataset(SourceDataset):

    def add_sampler(self, new_sampler):
        """
-        Add a sampler for current dataset.
+        Add a child sampler for the current dataset.

        Args:
-            new_sampler (Sampler): The sampler to be added as the parent sampler for current dataset.
+            new_sampler (Sampler): The child sampler to be added.

        Examples:
-            >>> # dataset is an instance object of Dataset
-            >>> # use a DistributedSampler instead
            >>> new_sampler = ds.DistributedSampler(10, 2)
-            >>> dataset.add_sampler(new_sampler)
+            >>> dataset.add_sampler(new_sampler)  # dataset is an instance of Dataset
        """
-        # note: By adding a sampler, the sampled IDs will flow to new_sampler
+        # Note: By adding a sampler, the sampled IDs will flow to the new_sampler
        # after first passing through the current samplers attached to this dataset.
        self.dataset_size = None
        new_sampler.add_child(self.sampler)
@@ -2178,10 +2176,10 @@ class MappableDataset(SourceDataset):

    def use_sampler(self, new_sampler):
        """
-        Make the current dataset use the new_sampler provided by other API.
+        Replace the last child sampler of the current dataset, keeping the parent sampler unchanged.

        Args:
-            new_sampler (Sampler): The sampler to use for the current dataset.
+            new_sampler (Sampler): The new sampler to use as the replacement.

        Examples:
            >>> # dataset is an instance object of Dataset
--- a/mindspore/python/mindspore/dataset/text/transforms.py
+++ b/mindspore/python/mindspore/dataset/text/transforms.py
@@ -690,24 +690,23 @@ if platform.system().lower() != 'windows':
            `BasicTokenizer` is not supported on Windows platform yet.

        Args:
-            lower_case (bool, optional): If True, will apply `CaseFold`, NormalizeForm.NFD mode `NormalizeUTF8`,
-                `RegexReplace` operations on the input to fold the text to lower case and strip accented characters.
-                If False, will only apply `NormalizeUTF8` operation of mode `normalization_form` on the input.
-                Default: False.
+            lower_case (bool, optional): Whether to perform lowercase processing on the text. If True, will fold the
+                text to lower case and strip accented characters. If False, will only perform normalization on the
+                text, with mode specified by `normalization_form`. Default: False.
            keep_whitespace (bool, optional): If True, the whitespace will be kept in the output. Default: False.
            normalization_form (NormalizeForm, optional):
                `Unicode normalization forms <http://unicode.org/reports/tr15/>`_, only valid when `lower_case`
                is False, can be NormalizeForm.NONE, NormalizeForm.NFC, NormalizeForm.NFKC, NormalizeForm.NFD or
                NormalizeForm.NFKD. Default: NormalizeForm.NONE.

-                - NormalizeForm.NONE, do nothing to input string.
-                - NormalizeForm.NFC, normalize with Normalization Form C.
-                - NormalizeForm.NFKC, normalize with Normalization Form KC.
-                - NormalizeForm.NFD, normalize with Normalization Form D.
-                - NormalizeForm.NFKD, normalize with Normalization Form KD.
+                - NormalizeForm.NONE, no normalization.
+                - NormalizeForm.NFC, Canonical Decomposition, followed by Canonical Composition.
+                - NormalizeForm.NFKC, Compatibility Decomposition, followed by Canonical Composition.
+                - NormalizeForm.NFD, Canonical Decomposition.
+                - NormalizeForm.NFKD, Compatibility Decomposition.

-            preserve_unused_token (bool, optional): If True, will not split special tokens like
-                '[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]'. Default: True.
+            preserve_unused_token (bool, optional): Whether to preserve special tokens. If True, will not split special
+                tokens like '[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]'. Default: True.
            with_offsets (bool, optional): Whether to return the offsets of tokens. Default: False.

        Raises:
@@ -778,24 +777,23 @@ if platform.system().lower() != 'windows':
            unknown_token (str, optional): The output for unknown words. When set to an empty string, the corresponding
                unknown word will be directly returned as the output. Otherwise, the set string will be returned as the
                output. Default: '[UNK]'.
-            lower_case (bool, optional): If True, will apply `CaseFold`, NormalizeForm.NFD mode `NormalizeUTF8`,
-                `RegexReplace` operations on the input to fold the text to lower case and strip accented characters.
-                If False, will only apply `NormalizeUTF8` operation of mode `normalization_form` on the input.
-                Default: False.
+            lower_case (bool, optional): Whether to perform lowercase processing on the text. If True, will fold the
+                text to lower case and strip accented characters. If False, will only perform normalization on the
+                text, with mode specified by `normalization_form`. Default: False.
            keep_whitespace (bool, optional): If True, the whitespace will be kept in the output. Default: False.
            normalization_form (NormalizeForm, optional):
                `Unicode normalization forms <http://unicode.org/reports/tr15/>`_, only valid when `lower_case`
                is False, can be NormalizeForm.NONE, NormalizeForm.NFC, NormalizeForm.NFKC, NormalizeForm.NFD or
                NormalizeForm.NFKD. Default: NormalizeForm.NONE.

-                - NormalizeForm.NONE, do nothing to input string.
-                - NormalizeForm.NFC, normalize with Normalization Form C.
-                - NormalizeForm.NFKC, normalize with Normalization Form KC.
-                - NormalizeForm.NFD, normalize with Normalization Form D.
-                - NormalizeForm.NFKD, normalize with Normalization Form KD.
+                - NormalizeForm.NONE, no normalization.
+                - NormalizeForm.NFC, Canonical Decomposition, followed by Canonical Composition.
+                - NormalizeForm.NFKC, Compatibility Decomposition, followed by Canonical Composition.
+                - NormalizeForm.NFD, Canonical Decomposition.
+                - NormalizeForm.NFKD, Compatibility Decomposition.

-            preserve_unused_token (bool, optional): If True, will not split special tokens like
-                '[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]'. Default: True.
+            preserve_unused_token (bool, optional): Whether to preserve special tokens. If True, will not split special
+                tokens like '[CLS]', '[SEP]', '[UNK]', '[PAD]', '[MASK]'. Default: True.
            with_offsets (bool, optional): Whether to return the offsets of tokens. Default: False.

        Raises:
--- a/mindspore/python/mindspore/dataset/text/utils.py
+++ b/mindspore/python/mindspore/dataset/text/utils.py
@@ -420,16 +420,16 @@ class JiebaMode(IntEnum):

 class NormalizeForm(IntEnum):
    """
-    An enumeration for NormalizeUTF8.
+    Enumeration class for `Unicode normalization forms <http://unicode.org/reports/tr15/>`_ .

-    Possible enumeration values are: NormalizeForm.NONE, NormalizeForm.NFC, NormalizeForm.NFKC, NormalizeForm.NFD,
-    NormalizeForm.NFKD.
+    Possible enumeration values are: NormalizeForm.NONE, NormalizeForm.NFC, NormalizeForm.NFKC, NormalizeForm.NFD
+    and NormalizeForm.NFKD.

-    - NormalizeForm.NONE: do nothing for input string tensor.
-    - NormalizeForm.NFC: normalize with Normalization Form C.
-    - NormalizeForm.NFKC: normalize with Normalization Form KC.
-    - NormalizeForm.NFD: normalize with Normalization Form D.
-    - NormalizeForm.NFKD: normalize with Normalization Form KD.
+    - NormalizeForm.NONE: no normalization.
+    - NormalizeForm.NFC: Canonical Decomposition, followed by Canonical Composition.
+    - NormalizeForm.NFKC: Compatibility Decomposition, followed by Canonical Composition.
+    - NormalizeForm.NFD: Canonical Decomposition.
+    - NormalizeForm.NFKD: Compatibility Decomposition.
    """

    NONE = 0