!32400 Fix Chinese API

Merge pull request !32400 from shenwei41/code_docs_chinese
2022-03-31 12:48:10 +00:00 · 2022-03-31 12:48:10 +00:00 · d83222f04b
parent ccd3b6c028 e584c644eb
commit d83222f04b
12 changed files with 53 additions and 50 deletions
--- a/docs/api/api_python/dataset_text/mindspore.dataset.text.NormalizeForm.rst
+++ b/docs/api/api_python/dataset_text/mindspore.dataset.text.NormalizeForm.rst
@@ -5,7 +5,7 @@

    :class:`mindspore.dataset.text.transforms.NormalizeUTF8` 的枚举值。

-    可选的枚举值包括： `NormalizeForm.NONE` 、 `NormalizeForm.NFC` 、 `NormalizeForm.NFKC` 、 `NormalizeForm.NFD` 和 `NormalizeForm.NFKD` 。
+    `Unicode规范化模式 <http://unicode.org/reports/tr15/#Norm_Forms>`_ 可选的枚举值包括： `NormalizeForm.NONE` 、 `NormalizeForm.NFC` 、 `NormalizeForm.NFKC` 、 `NormalizeForm.NFD` 和 `NormalizeForm.NFKD` 。

    - **NormalizeForm.NONE** - 对输入字符串不做任何处理。
    - **NormalizeForm.NFC** - 对输入字符串进行C形式规范化。
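
Review note: the four forms named in the Unicode report linked above can be previewed with Python's standard `unicodedata` module. This is only a sketch of the normalization forms themselves, not of how `NormalizeUTF8` is implemented, and the sample string is made up for illustration.

```python
import unicodedata

# Hypothetical sample: the "fi" ligature (U+FB01) followed by "e" plus a
# combining acute accent (U+0301).
sample = "\ufb01" + "e\u0301"

# Apply each standard normalization form to the same input.
forms = {
    name: unicodedata.normalize(name, sample)
    for name in ("NFC", "NFD", "NFKC", "NFKD")
}

for name, value in forms.items():
    # Show the code points so the composed/decomposed difference is visible.
    print(name, [f"U+{ord(ch):04X}" for ch in value])
```

The canonical forms (NFC, NFD) leave the ligature alone and only compose or decompose the accent; the compatibility forms (NFKC, NFKD) also expand the ligature into plain letters.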
--- a/docs/api/api_python/dataset_text/mindspore.dataset.text.SentencePieceModel.rst
+++ b/docs/api/api_python/dataset_text/mindspore.dataset.text.SentencePieceModel.rst
@@ -8,7 +8,7 @@
    可选的枚举值包括： `SentencePieceModel.UNIGRAM` 、 `SentencePieceModel.BPE` 、 `SentencePieceModel.CHAR` 和 `SentencePieceModel.WORD` 。

    - **SentencePieceModel.UNIGRAM** - Unigram语言模型意味着句子中的下一个单词被假定为独立于模型生成的前一个单词。
-    - **SentencePieceModel.BPE** - 指字节对编码算法，它取代了最频繁的对句子中的字节数，其中包含一个未使用的字节。
+    - **SentencePieceModel.BPE** - 指字节对编码算法，它将句子中出现最频繁的字节对替换为一个未使用的字节。
    - **SentencePieceModel.CHAR** - 引用基于字符的SentencePiece模型类型。
    - **SentencePieceModel.WORD** - 引用基于单词的SentencePiece模型类型。
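
Review note: the BPE entry is easier to check against a concrete merge step. The sketch below is a plain-Python illustration of one byte-pair-encoding merge (find the most frequent adjacent pair, replace it with a single new symbol). The function name `bpe_merge_once` is hypothetical, and this is not the SentencePiece implementation.

```python
from collections import Counter


def bpe_merge_once(tokens):
    """Replace the most frequent adjacent pair with one merged symbol."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return list(tokens), None
    (left, right), _ = pairs.most_common(1)[0]

    merged, i = [], 0
    while i < len(tokens):
        # Merge non-overlapping occurrences, scanning left to right.
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged, (left, right)


tokens, pair = bpe_merge_once(list("aaabdaaabac"))
print(pair, tokens)
```

A full BPE trainer would repeat this step until the target vocabulary size is reached.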
    
--- a/docs/api/api_python/dataset_text/mindspore.dataset.text.SentencePieceVocab.rst
+++ b/docs/api/api_python/dataset_text/mindspore.dataset.text.SentencePieceVocab.rst
@@ -14,19 +14,19 @@
        - **dataset** (Dataset) - 表示用于构建SentencePiece对象的数据集。
        - **col_names** (list) - 表示列名称的列表。
        - **vocab_size** (int) - 表示词汇大小。
-        - **character_coverage** (float) - 表示模型涵盖的字符数。推荐的默认值为：0.9995，适用于具有丰富字符集的语言，如日文或中文，1.0适用于具有小字符集的其他语言。
+        - **character_coverage** (float) - 表示模型涵盖的字符数量。推荐的默认值为：0.9995，适用于具有丰富字符集的语言，如日文或中文，1.0适用于具有小字符集的其他语言。
        - **model_type** (SentencePieceModel) - 其值可以是SentencePieceModel.UNIGRAM、SentencePieceModel.BPE、SentencePieceModel.CHAR或SentencePieceModel.WORD，默认值：SentencePieceModel.UNIGRAM。使用SentencePieceModel.WORD类型时，必须预先标记输入句子。

          - SentencePieceModel.UNIGRAM：Unigram语言模型意味着句子中的下一个单词被假定为独立于模型生成的前一个单词。
          - SentencePieceModel.BPE：指字节对编码算法，它将句子中出现最频繁的字节对替换为一个未使用的字节。
          - SentencePieceModel.CHAR：引用基于字符的SentencePiece模型类型。
-          - SentencePieceModel.WORD：引用基于单词的SentencePiece型类型。
+          - SentencePieceModel.WORD：引用基于单词的SentencePiece模型类型。

        - **params** (dict)：表示没有传入参数的字典。

        **返回：**

-        SentencePieceVocab，从数据集构建的vocab。
+        SentencePieceVocab，从数据集构建的Vocab对象。

    .. py:method:: from_file(file_path, vocab_size, character_coverage, model_type, params)

@@ -34,15 +34,15 @@

        **参数：**

-        - **file_path** (list) - 表示包含SentencePiece列表的文件的路径。
+        - **file_path** (list) - 表示包含SentencePiece文件路径的一个列表。
        - **vocab_size** (int) - 表示词汇大小。
-        - **character_coverage** (float) - 表示模型涵盖的字符数。推荐的默认值为：0.9995适用于具有丰富字符集的语言，如日文或中文，1.0适用于具有小字符集的其他语言。
+        - **character_coverage** (float) - 表示模型涵盖的字符数量。推荐的默认值为：0.9995适用于具有丰富字符集的语言，如日文或中文，1.0适用于具有小字符集的其他语言。
        - **model_type** (SentencePieceModel) - 其值可以是SentencePieceModel.UNIGRAM、SentencePieceModel.BPE、SentencePieceModel.CHAR或SentencePieceModel.WORD，默认值为SentencePieceModel.UNIGRAM。使用SentencePieceModel.WORD类型时，必须预先标记输入句子。

          - SentencePieceModel.UNIGRAM：Unigram语言模型意味着句子中的下一个单词被假定为独立于模型生成的前一个单词。
          - SentencePieceModel.BPE：指字节对编码算法，它将句子中出现最频繁的字节对替换为一个未使用的字节。
          - SentencePieceModel.CHAR：引用基于字符的SentencePiece模型类型。
-          - SentencePieceModel.WORD：引用基于单词的SentencePiece型类型。
+          - SentencePieceModel.WORD：引用基于单词的SentencePiece模型类型。

        - **params** (dict)：表示没有传入参数的字典（参数派生自SentencePiece库）。

@@ -53,7 +53,7 @@

        **返回：**

-        SentencePieceVocab，表示从文件中构建的vocab。
+        SentencePieceVocab，表示从文件中构建的Vocab对象。

    .. py:method:: save_model(vocab, path, filename)

--- a/docs/api/api_python/dataset_text/mindspore.dataset.text.Vocab.rst
+++ b/docs/api/api_python/dataset_text/mindspore.dataset.text.Vocab.rst
@@ -11,16 +11,16 @@

        通过数据集构建Vocab对象。

-        这将收集数据集中的所有唯一单词，并在freq_range中用户指定的频率范围内返回一个vocab。如果没有单词在该频率上，用户将收到预警信息。
+        获得数据集中的所有唯一单词，并在 `freq_range` 中用户指定的频率范围内返回一个vocab。如果没有单词在该频率上，用户将收到预警信息。
        vocab中的单词按最高频率到最低频率的顺序进行排列。具有相同频率的单词将按词典顺序进行排列。

        **参数：**

        - **dataset** (Dataset) - 表示要从中构建vocab的数据集。
-        - **columns** (list[str]，可选) - 表示要从中获取单词的列名。它可以是列名的列表，默认值：None。如果没有列是string类型，将返回错误。
+        - **columns** (list[str]，可选) - 表示要从中获取单词的列名。它可以是列名的列表，默认值：None。
        - **freq_range** (tuple，可选) - 表示整数元组（min_frequency，max_frequency）。频率范围内的单词将被保留。0 <= min_frequency <= max_frequency <= total_words。min_frequency=0等同于min_frequency=1。max_frequency > total_words等同于max_frequency = total_words。min_frequency和max_frequency可以为None，分别对应于0和total_words，默认值：None。
-        - **top_k** (int，可选) - `top_k` 大于0。要在vocab中 `top_k` 建立的单词数量表示取用最频繁的单词。 `top_k` 在 `freq_range` 之后取用。如果没有足够的 `top_k` ，所有单词都将被取用,默认值：None。
-        - **special_tokens** (list，可选) - 特殊分词列表，如常用的"[PAD]"、"[UNK]"等。默认值：None，表示不添加特殊分词（token）。
+        - **top_k** (int，可选) - `top_k` 大于0。要在vocab中 `top_k` 建立的单词数量表示取用最频繁的单词。 `top_k` 在 `freq_range` 之后取用。如果没有足够的 `top_k` ，所有单词都将被取用，默认值：None。
+        - **special_tokens** (list，可选) - 特殊分词列表，如常用的"<pad>"、"<unk>"等。默认值：None，表示不添加特殊分词（token）。
        - **special_first** (bool，可选) - 表示是否将 `special_tokens` 中的特殊分词添加到词典的最前面。如果为True则将 `special_tokens` 添加到词典的最前，否则添加到词典的最后。默认值：True。
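
Review note: the interaction of `freq_range`, `top_k`, `special_tokens` and `special_first` described in this parameter list can be sketched in plain Python. The helper `build_vocab` below is hypothetical and only mirrors the documented rules (frequency filter first, then `top_k`, ordering by descending frequency and then lexicographically). It treats `total_words` as the number of input words, which is an assumption, and it is not the MindSpore implementation.

```python
from collections import Counter


def build_vocab(words, freq_range=None, top_k=None,
                special_tokens=None, special_first=True):
    """Map each kept word to an index, following the documented rules."""
    counts = Counter(words)
    min_freq, max_freq = freq_range if freq_range is not None else (None, None)
    min_freq = 1 if not min_freq else min_freq           # None or 0 act like 1
    max_freq = len(words) if max_freq is None else max_freq

    # 1. Keep only words whose frequency lies inside freq_range.
    kept = [(w, c) for w, c in counts.items() if min_freq <= c <= max_freq]
    # 2. Highest frequency first; ties broken in lexicographic order.
    kept.sort(key=lambda wc: (-wc[1], wc[0]))
    # 3. top_k is applied after the frequency filter.
    if top_k is not None:
        kept = kept[:top_k]

    ordered = [w for w, _ in kept]
    specials = list(special_tokens or [])
    ordered = specials + ordered if special_first else ordered + specials
    return {word: index for index, word in enumerate(ordered)}


words = "the cat and the dog and the bird".split()
print(build_vocab(words, top_k=3, special_tokens=["<pad>", "<unk>"]))
```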

        **返回：**
@@ -39,16 +39,16 @@

        Vocab，从字典构建的Vocab对象。

-    .. py:method:: from_file(file_path, delimiter='', vocab_size=None, special_tokens=None, special_first=True)
+    .. py:method:: from_file(file_path, delimiter="", vocab_size=None, special_tokens=None, special_first=True)

        通过文件构建Vocab对象。

        **参数：**

-        - **file_path** (str) - 表示包含vocab列表的文件的路径。
+        - **file_path** (str) - 表示包含vocab列表的文件路径。
        - **delimiter** (str，可选) - 表示用来分隔文件中每一行的分隔符。第一个元素被视为单词，默认值：""。
        - **vocab_size** (int，可选) - 表示要从 `file_path` 读取的字数，默认值：None，表示读取所有的字。
-        - **special_tokens** (list，可选) - 特殊分词列表，如常用的"[PAD]"、"[UNK]"等。默认值：None，表示不添加特殊分词（token）。
+        - **special_tokens** (list，可选) - 特殊分词列表，如常用的"<pad>"、"<unk>"等。默认值：None，表示不添加特殊分词（token）。
        - **special_first** (bool，可选) - 表示是否将 `special_tokens` 中的特殊分词添加到词典的最前面。如果为True则将 `special_tokens` 添加到词典的最前，否则添加到词典的最后。默认值：True。

        **返回：**
@@ -62,7 +62,7 @@
        **参数：**

        - **word_list** (list) - 输入单词列表，每个单词需要为字符串类型。
-        - **special_tokens** (list，可选) - 特殊分词列表，如常用的"[PAD]"、"[UNK]"等。默认值：None，表示不添加特殊分词（token）。
+        - **special_tokens** (list，可选) - 特殊分词列表，如常用的"<pad>"、"<unk>"等。默认值：None，表示不添加特殊分词（token）。
        - **special_first** (bool，可选) - 表示是否将 `special_tokens` 中的特殊分词添加到词典的最前面。如果为True则将 `special_tokens` 添加到词典的最前，否则添加到词典的最后。默认值：True。

        **返回：**
--- a/docs/api/api_python/dataset_text/mindspore.dataset.text.to_bytes.rst
+++ b/docs/api/api_python/dataset_text/mindspore.dataset.text.to_bytes.rst
@@ -8,7 +8,7 @@
    **参数：**

    - **array** (numpy.ndarray) - 表示 `string` 类型的数组，代表字符串。
-    - **encoding** (str) - 表示用于编码的字符集。
+    - **encoding** (str) - 表示用于编码的字符集，默认值：'utf8'。

    **返回：**

--- a/docs/api/api_python/dataset_text/mindspore.dataset.text.to_str.rst
+++ b/docs/api/api_python/dataset_text/mindspore.dataset.text.to_str.rst
@@ -8,7 +8,7 @@
    **参数：**

    - **array** (numpy.ndarray) - 表示 `bytes` 类型的数组，代表字符串。
-    - **encoding** (str) - 表示用于解码的字符集。
+    - **encoding** (str) - 表示用于解码的字符集，默认值：'utf8'。

    **返回：**

--- a/docs/api/api_python/dataset_text/mindspore.dataset.text.transforms.WhitespaceTokenizer.rst
+++ b/docs/api/api_python/dataset_text/mindspore.dataset.text.transforms.WhitespaceTokenizer.rst
@@ -3,7 +3,7 @@ mindspore.dataset.text.transforms.WhitespaceTokenizer

 .. py:class:: mindspore.dataset.text.transforms.WhitespaceTokenizer(with_offsets=False)

-    基于ICU4C定义的空白字符（' ', '\\\\t', '\\\\r', '\\\\n'）对输入字符串进行分词。
+    基于ICU4C定义的空白字符（' ', '\\\\t', '\\\\r', '\\\\n'）对输入的UTF-8字符串进行分词。

    .. note:: Windows平台尚不支持 `WhitespaceTokenizer` 。
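
Review note: a rough picture of what whitespace tokenization with `with_offsets` produces, written with Python's `re` module. The function `whitespace_tokenize` is hypothetical, it splits only on the four characters listed in the description, and it reports character offsets; whether the real operator reports byte or character offsets is not stated in this diff, so treat that as an assumption.

```python
import re


def whitespace_tokenize(text, with_offsets=False):
    """Split on ' ', tab, carriage return and newline; optionally return offsets."""
    matches = list(re.finditer(r"[^ \t\r\n]+", text))
    tokens = [m.group() for m in matches]
    if not with_offsets:
        return tokens
    starts = [m.start() for m in matches]   # offset of each token's first character
    limits = [m.end() for m in matches]     # offset one past each token's last character
    return tokens, starts, limits


print(whitespace_tokenize("Welcome to  Beijing\n!", with_offsets=True))
```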

--- a/docs/api/api_python/mindspore.dataset.config.rst
+++ b/docs/api/api_python/mindspore.dataset.config.rst
@@ -23,7 +23,7 @@ API示例所需模块的导入代码如下：

 .. py:function:: mindspore.dataset.config.load(file)

-    从文件格式中加载项目配置。
+    根据文件内容加载项目配置文件。

    **参数：**

@@ -35,10 +35,10 @@ API示例所需模块的导入代码如下：

 .. py:function:: mindspore.dataset.config.set_seed(seed)

-    如果设置了种子，生成的随机数将被固定，这有助于产生确定性结果。
+    设置种子，固定产生的随机数达到确定性的结果。

    .. note::
-        此函数在Python随机库和numpy.random库中设置种子，以便随机进行确定性Python增强。此函数应与创建的每个迭代器一起调用，以重置随机种子。在管道中，这并不保证 `num_parallel_workers` 大于1。
+        此函数在Python随机库和numpy.random库中设置种子，以便随机进行确定性Python增强。此函数应与创建的每个迭代器一起调用，以重置随机种子。
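
Review note: the note says the function seeds the Python random library and numpy.random so that Python augmentations become deterministic. The sketch below shows the underlying effect using only the standard `random` module; the helper `draw` is hypothetical and stands in for a random Python augmentation.

```python
import random


def draw(seed, n=3):
    """Reset the seed, then draw n pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]


first = draw(1234)
second = draw(1234)   # re-seeding reproduces the same sequence
print(first == second)
```

Re-seeding before each run is what makes the two sequences match, which is why the note asks for the seed to be reset when an iterator is created.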

    **参数：**

@@ -77,6 +77,7 @@ API示例所需模块的导入代码如下：
 .. py:function:: mindspore.dataset.config.get_prefetch_size()

    获取数据处理管道的输出缓存队列长度。
+    如果 `set_prefetch_size` 方法未被调用，那么将会返回默认值16。

    **返回：**

@@ -89,7 +90,7 @@ API示例所需模块的导入代码如下：

    **参数：**

-    - **num** (int) - 表示并行工作线程的数量，用作为每个操作的默认值。
+    - **num** (int) - 表示并行工作线程的数量，用作每个数据集操作的默认值。

    **异常：**

@@ -99,11 +100,11 @@ API示例所需模块的导入代码如下：
 .. py:function:: mindspore.dataset.config.get_num_parallel_workers()

    获取并行工作线程数量的全局配置。
-    这是并行工作线程数量的值，用于每个操作。
+    这是作用于每个操作的并行工作线程数量的值。

    **返回：**

-    int，表示每个操作中默认的并行工作进程的数量。
+    int，表示每个操作中默认的并行工作线程的数量。

 .. py:function:: mindspore.dataset.config.set_numa_enable(numa_enable)

@@ -142,6 +143,7 @@ API示例所需模块的导入代码如下：
 .. py:function:: mindspore.dataset.config.get_monitor_sampling_interval()

    获取性能监控采样时间间隔的全局配置。
+    如果 `set_monitor_sampling_interval` 方法未被调用，那么将会返回默认值1000。

    **返回：**

@@ -149,12 +151,11 @@ API示例所需模块的导入代码如下：

 .. py:function:: mindspore.dataset.config.set_callback_timeout(timeout)

-    为DSWaitedCallback设置的默认超时时间（秒）。
-    如果出现死锁，等待函数将在超时时间结束后退出。
+    为 :class:`mindspore.dataset.WaitedDSCallback` 设置的默认超时时间（秒）。

    **参数：**

-    - **timeout** (int) - 表示在出现死锁情况下，用于结束DSWaitedCallback中等待的超时时间（秒）。
+    - **timeout** (int) - 表示在出现死锁情况下，用于结束 :class:`mindspore.dataset.WaitedDSCallback` 中等待的超时时间（秒）。

    **异常：**

@@ -163,21 +164,21 @@ API示例所需模块的导入代码如下：

 .. py:function:: mindspore.dataset.config.get_callback_timeout()

-    获取DSWaitedCallback的默认超时时间。
+    获取 :class:`mindspore.dataset.WaitedDSCallback` 的默认超时时间。
    如果出现死锁，等待的函数将在超时时间结束后退出。

    **返回：**

-    int，表示在出现死锁情况下，用于结束DSWaitedCallback中的等待函数的超时时间（秒）。
+    int，表示在出现死锁情况下，用于结束 :class:`mindspore.dataset.WaitedDSCallback` 中的等待函数的超时时间（秒）。

 .. py:function:: mindspore.dataset.config.set_auto_num_workers(enable)

    自动为每个数据集操作设置并行线程数量（默认情况下，此功能关闭）。

-    如果启用该功能，将自动调整每个数据集操作中的并行线程数量，这可能会覆盖用户传入的并行线程数量或通过ds.config.set_num_parallel_workers()设置的默认值（如果用户未传递任何内容）。
+    如果启用该功能，将自动调整每个数据集操作中的并行线程数量，这可能会覆盖用户通过脚本定义的并行线程数量或通过ds.config.set_num_parallel_workers()设置的默认值（如果用户未传递任何内容）。

    目前，此函数仅针对具有per_batch_map（batch中的运行映射）的YOLOv3数据集进行了优化。
-    此功能旨在为每个操作的优化线程数量分配提供基线。
+    此功能旨在为每个操作的优化线程数量分配一个基础值。
    并行线程数有所调整的数据集操作将会被记录。

    **参数：**
--- a/mindspore/python/mindspore/dataset/core/config.py
+++ b/mindspore/python/mindspore/dataset/core/config.py
@@ -94,14 +94,12 @@ def _init_device_info():

 def set_seed(seed):
    """
-    If the seed is set, the generated random number will be fixed, this helps to
-    produce deterministic results.
+    Set the seed so that generated random numbers are fixed, which gives deterministic results.

    Note:
        This set_seed function sets the seed in the Python random library and numpy.random library
        for deterministic Python augmentations using randomness. This set_seed function should
-        be called with every iterator created to reset the random seed. In the pipeline, this
-        does not guarantee deterministic results with num_parallel_workers > 1.
+        be called when an iterator is created to reset the random seed.

    Args:
        seed(int): Random number seed. It is used to generate deterministic random numbers.
@@ -172,6 +170,7 @@ def set_prefetch_size(size):
 def get_prefetch_size():
    """
    Get the prefetch size, in number of rows.
+    If `set_prefetch_size` has not been called, the default value 16 is returned.

    Returns:
        int, total number of rows to be prefetched.
@@ -212,8 +211,7 @@ def set_num_parallel_workers(num):
 def get_num_parallel_workers():
    """
    Get the global configuration of number of parallel workers.
-    This is the DEFAULT num_parallel_workers value used for each operation, it is not related
-    to AutoNumWorker feature.
+    This is the DEFAULT num_parallel_workers value used for each operation.

    Returns:
        int, number of parallel workers to be used as a default for each operation.
@@ -286,6 +284,7 @@ def set_monitor_sampling_interval(interval):
 def get_monitor_sampling_interval():
    """
    Get the global configuration of sampling interval of performance monitor.
+    If `set_monitor_sampling_interval` has not been called, the default value 1000 is returned.

    Returns:
        int, interval (in milliseconds) for performance monitor sampling.
@@ -360,7 +359,7 @@ def get_auto_num_workers():

    Examples:
        >>> # Get the global configuration of auto number worker feature.
-        >>> num_workers = ds.config.get_auto_num_workers()
+        >>> flag = ds.config.get_auto_num_workers()
    """
    return _config.get_auto_num_workers()

@@ -368,7 +367,6 @@ def get_auto_num_workers():
 def set_callback_timeout(timeout):
    """
    Set the default timeout (in seconds) for WaitedDSCallback.
-    In case of a deadlock, the wait function will exit after the timeout period.

    Args:
        timeout (int): Timeout (in seconds) to be used to end the wait in WaitedDSCallback in case of a deadlock.
@@ -390,8 +388,7 @@ def set_callback_timeout(timeout):

 def get_callback_timeout():
    """
-    Get the default timeout for DSWaitedCallback.
-    In case of a deadlock, the wait function will exit after the timeout period.
+    Get the default timeout for WaitedDSCallback.

    Returns:
        int, Timeout (in seconds) to be used to end the wait in WaitedDSCallback in case of a deadlock.
@@ -416,7 +413,7 @@ def __str__():

 def load(file):
    """
-    Load the project configuration from the file format.
+    Load the project configuration from the file.

    Args:
        file (str): Path of the configuration file to be loaded.
@@ -577,7 +574,7 @@ def get_enable_shared_mem():
        `get_enable_shared_mem` is not supported on Windows and MacOS platforms yet.

    Returns:
-        bool, the state of shared mem enabled variable (default=True).
+        bool, the state of shared mem enabled variable.

    Examples:
        >>> # Get the flag of shared memory feature.
--- a/mindspore/python/mindspore/dataset/engine/datasets.py
+++ b/mindspore/python/mindspore/dataset/engine/datasets.py
@@ -130,9 +130,9 @@ def _reset_training_dataset(step):
 class Shuffle(str, Enum):
    """Specify the shuffle mode.

-    - GLOBAL: Shuffle both the files and samples.
-    - FILES: Shuffle files only.
-    - INFILE: Shuffle data within each file.
+    - Shuffle.GLOBAL: Shuffle both the files and samples.
+    - Shuffle.FILES: Shuffle files only.
+    - Shuffle.INFILE: Shuffle data within each file.
    """
    GLOBAL: str = "global"
    FILES: str = "files"
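
Review note: one plausible reading of the three shuffle modes in the docstring, sketched in plain Python. The enum mirrors the declaration in this hunk; the `INFILE` value `"infile"` is assumed because the hunk ends before it, and the helper `read_rows` is hypothetical rather than how the dataset engine actually shuffles.

```python
import random
from enum import Enum


class Shuffle(str, Enum):
    GLOBAL = "global"
    FILES = "files"
    INFILE = "infile"   # assumed value; not visible in the hunk


def read_rows(files, mode, seed=0):
    """Return rows from (name, rows) pairs, shuffled according to mode."""
    mode = Shuffle(mode)              # accepts a member or its string value
    rng = random.Random(seed)
    files = list(files)
    if mode in (Shuffle.GLOBAL, Shuffle.FILES):
        rng.shuffle(files)            # reorder the files themselves
    rows = []
    for _, file_rows in files:
        file_rows = list(file_rows)
        if mode is Shuffle.INFILE:
            rng.shuffle(file_rows)    # reorder rows inside one file only
        rows.extend(file_rows)
    if mode is Shuffle.GLOBAL:
        rng.shuffle(rows)             # additionally mix rows across files
    return rows


files = [("a", [1, 2, 3]), ("b", [4, 5, 6])]
print(read_rows(files, Shuffle.FILES))
```

Because the enum also derives from `str`, members compare equal to their string values, so callers can pass either `Shuffle.FILES` or `"files"`.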
--- a/mindspore/python/mindspore/dataset/text/utils.py
+++ b/mindspore/python/mindspore/dataset/text/utils.py
@@ -115,7 +115,7 @@ class Vocab:
        Args:
            dataset (Dataset): dataset to build vocab from.
            columns (list[str], optional): column names to get words from. It can be a list of column names.
-                (default=None, where all columns will be used. If any column isn't string type, will return error).
+                (default=None).
            freq_range (tuple, optional): A tuple of integers (min_frequency, max_frequency). Words within the frequency
                range would be kept. 0 <= min_frequency <= max_frequency <= total_words. min_frequency=0 is the same as
                min_frequency=1. max_frequency > total_words is the same as max_frequency = total_words.
@@ -388,6 +388,12 @@ def to_bytes(array, encoding='utf8'):

    Returns:
        numpy.ndarray, NumPy array of `bytes`.
+
+    Examples:
+        >>> text_file_dataset_dir = ["/path/to/text_file_dataset_file"]
+        >>> dataset = ds.TextFileDataset(dataset_files=text_file_dataset_dir, shuffle=False)
+        >>> for item in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
+        ...     data = text.to_bytes(item["text"])
    """

    if not isinstance(array, np.ndarray):
--- a/mindspore/python/mindspore/dataset/transforms/c_transforms.py
+++ b/mindspore/python/mindspore/dataset/transforms/c_transforms.py
@@ -276,7 +276,6 @@ class Mask(TensorOperation):
        operator (Relational): relational operators, it can be any of [Relational.EQ, Relational.NE, Relational.LT,
            Relational.GT, Relational.LE, Relational.GE], take Relational.EQ as example, EQ refers to equal.
        constant (Union[str, int, float, bool]): Constant to be compared to.
-            Constant will be cast to the type of the input tensor.
        dtype (mindspore.dtype, optional): Type of the generated mask. Default: mindspore.dtype.bool\_.

    Raises: