!55024 dataset gallery zh-cn

Merge pull request !55024 from luoyang/code_docs_gallery
2023-06-12 07:18:43 +00:00 · 2023-06-12 07:18:43 +00:00 · ea8e09e654
parent aee5a78b6d 387c71162d
commit ea8e09e654
11 changed files with 2954 additions and 336 deletions
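Note on the text gallery notebook added in this change: its `WordpieceTokenizer` cell splits `tokenizer` into `token` and `##izer`. The block below is a minimal pure-Python sketch of the greedy longest-match-first idea commonly used for WordPiece-style splitting. It is not MindSpore's implementation; the function name `wordpiece_tokenize`, its parameters, and the toy vocabulary are assumptions made purely for illustration.

```python
def wordpiece_tokenize(words, vocab, unknown_token="[UNK]", suffix_indicator="##"):
    """Greedy longest-match-first subword split (illustrative sketch only)."""
    output = []
    for word in words:
        pieces = []
        start = 0
        while start < len(word):
            end = len(word)
            match = None
            # Try the longest remaining substring first, then shrink it.
            while start < end:
                candidate = word[start:end]
                if start > 0:
                    # Non-initial pieces carry the suffix indicator.
                    candidate = suffix_indicator + candidate
                if candidate in vocab:
                    match = candidate
                    break
                end -= 1
            if match is None:
                # No piece matched: the whole word becomes the unknown token.
                pieces = [unknown_token]
                break
            pieces.append(match)
            start = end
        output.extend(pieces)
    return output


# Hypothetical toy vocabulary, not the BERT vocab file used in the notebook.
toy_vocab = {"token", "##izer", "will", "outputs", "sub", "##words"}
print(wordpiece_tokenize(["tokenizer", "will", "outputs", "subwords"], toy_vocab))
```

With this toy vocabulary the sketch yields the same split as the notebook's recorded output for that cell; a word with no matching pieces maps to `[UNK]`.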
--- a/.jenkins/check/config/filter_linklint.txt
+++ b/.jenkins/check/config/filter_linklint.txt
@@ -59,3 +59,11 @@ https://www.mindspore.cn/docs/en/master/api_python/samples/dataset/dataset_galle
 https://www.mindspore.cn/docs/en/master/api_python/samples/dataset/vision_gallery.html
 https://www.mindspore.cn/docs/en/master/api_python/samples/dataset/audio_gallery.html
 https://www.mindspore.cn/docs/en/master/api_python/samples/dataset/text_gallery.html
+https://www.mindspore.cn/docs/zh-CN/master/api_python/samples/dataset/dataset_gallery.html
+https://www.mindspore.cn/docs/zh-CN/master/api_python/samples/dataset/audio_gallery.html
+https://www.mindspore.cn/docs/zh-CN/master/api_python/samples/dataset/text_gallery.html
+https://www.mindspore.cn/docs/zh-CN/master/api_python/samples/dataset/vision_gallery.html
+https://gitee.com/mindspore/mindspore/blob/master/docs/api/api_python/samples/dataset/audio_gallery.ipynb
+https://gitee.com/mindspore/mindspore/blob/master/docs/api/api_python/samples/dataset/dataset_gallery.ipynb
+https://gitee.com/mindspore/mindspore/blob/master/docs/api/api_python/samples/dataset/text_gallery.ipynb
+https://gitee.com/mindspore/mindspore/blob/master/docs/api/api_python/samples/dataset/vision_gallery.ipynb
--- a/docs/api/api_python/mindspore.dataset.rst
+++ b/docs/api/api_python/mindspore.dataset.rst
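@@ -51,7 +51,7 @@ mindspore.dataset
 Quick Start with the Data Processing Pipeline
 ---------------------------------------------

-To get started with the Dataset Pipeline quickly, download `Load & Process Data With Dataset Pipeline <https://www.mindspore.cn/docs/en/master/api_python/samples/dataset/dataset_gallery.html>`_ to your local machine, run it in order, and observe the output.
+To get started with the Dataset Pipeline quickly, download `Load & Process Data With Dataset Pipeline <https://www.mindspore.cn/docs/zh-CN/master/api_python/samples/dataset/dataset_gallery.html>`_ to your local machine, run it in order, and observe the output.

 Vision
 ------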
--- a/docs/api/api_python/mindspore.dataset.transforms.rst
+++ b/docs/api/api_python/mindspore.dataset.transforms.rst
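@@ -95,8 +95,8 @@ Commonly used imports in the API examples are as follows:
 Gallery
 ^^^^^^^

-For a quick start with the vision transform APIs, see `Illustration of vision transforms <https://www.mindspore.cn/docs/en/master/api_python/samples/dataset/vision_gallery.html>`_.
-This guide shows typical API usage, along with the inputs and outputs.
+For a quick start with the vision transform APIs, see `Vision Transforms Gallery <https://www.mindspore.cn/docs/zh-CN/master/api_python/samples/dataset/vision_gallery.html>`_.
+This guide shows how to use several transform APIs, along with their inputs and outputs.

 Transforms
 ^^^^^^^^^^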
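@@ -241,8 +241,8 @@ Commonly used imports in the API examples are as follows:
 Gallery
 ^^^^^^^

-For a quick start with the text transform APIs, see `Illustration of text transforms <https://www.mindspore.cn/docs/en/master/api_python/samples/dataset/text_gallery.html>`_.
-This guide shows typical API usage, along with the inputs and outputs.
+For a quick start with the text transform APIs, see `Text Transforms Gallery <https://www.mindspore.cn/docs/zh-CN/master/api_python/samples/dataset/text_gallery.html>`_.
+This guide shows how to use several transform APIs, along with their inputs and outputs.

 Transforms
 ^^^^^^^^^^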
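@@ -311,8 +311,8 @@ Commonly used imports in the API examples are as follows:
 Gallery
 ^^^^^^^

-For a quick start with the audio transform APIs, see `Illustration of audio transforms <https://www.mindspore.cn/docs/en/master/api_python/samples/dataset/audio_gallery.html>`_.
-This guide shows typical API usage, along with the inputs and outputs.
+For a quick start with the audio transform APIs, see `Audio Transforms Gallery <https://www.mindspore.cn/docs/zh-CN/master/api_python/samples/dataset/audio_gallery.html>`_.
+This guide shows how to use several transform APIs, along with their inputs and outputs.

 Transforms
 ^^^^^^^^^^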
--- a/docs/api/api_python/samples/dataset/audio_gallery.ipynb
+++ b/docs/api/api_python/samples/dataset/audio_gallery.ipynb
--- a/docs/api/api_python/samples/dataset/dataset_gallery.ipynb
+++ b/docs/api/api_python/samples/dataset/dataset_gallery.ipynb
--- a/docs/api/api_python/samples/dataset/text_gallery.ipynb
+++ b/docs/api/api_python/samples/dataset/text_gallery.ipynb
@@ -0,0 +1,277 @@
+{
+  "cells": [
+    {
+      "attachments": {},
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "# Text Transforms Gallery\n",
+        "\n",
+        "[![Download Notebook](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_notebook.png)](https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/master/docs/api_python/samples/dataset/text_gallery.ipynb)&emsp;\n",
+        "[![View Source](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.png)](https://gitee.com/mindspore/mindspore/blob/master/docs/api/api_python/samples/dataset/text_gallery.ipynb)\n",
+        "\n",
+        "This guide demonstrates how to use the various transforms in the [mindspore.dataset.text](https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore.dataset.transforms.html#%E6%96%87%E6%9C%AC) module."
+      ]
+    },
+    {
+      "attachments": {},
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Environment Setup"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 1,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/bert-base-uncased-vocab.txt (226 kB)\n",
+            "\n",
+            "file_sizes: 100%|████████████████████████████| 232k/232k [00:00<00:00, 2.21MB/s]\n",
+            "Successfully downloaded file to ./bert-base-uncased-vocab.txt\n",
+            "Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/article.txt (9 kB)\n",
+            "\n",
+            "file_sizes: 100%|██████████████████████████| 9.06k/9.06k [00:00<00:00, 1.83MB/s]\n",
+            "Successfully downloaded file to ./article.txt\n",
+            "['text_gallery.ipynb', 'article.txt', 'bert-base-uncased-vocab.txt']\n"
+          ]
+        }
+      ],
+      "source": [
+        "import os\n",
+        "from download import download\n",
+        "\n",
+        "import mindspore.dataset as ds\n",
+        "import mindspore.dataset.text as text\n",
+        "\n",
+        "# Download opensource datasets\n",
+        "# citation: https://www.kaggle.com/datasets/drknope/bertbaseuncasedvocab\n",
+        "url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/bert-base-uncased-vocab.txt\"\n",
+        "download(url, './bert-base-uncased-vocab.txt', replace=True)\n",
+        "\n",
+        "url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/article.txt\"\n",
+        "download(url, './article.txt', replace=True)\n",
+        "\n",
+        "# Show the directory\n",
+        "print(os.listdir())\n",
+        "\n",
+        "def call_op(op, data):\n",
+        "    print(op(data), flush=True)"
+      ]
+    },
+    {
+      "attachments": {},
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Vocab\n",
+        "\n",
+        "[mindspore.dataset.text.Vocab](https://mindspore.cn/docs/zh-CN/master/api_python/dataset_text/mindspore.dataset.text.Vocab.html#mindspore.dataset.text.Vocab) stores pairs of words and IDs. It contains a map from each word (str) to an ID (int)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 2,
+      "metadata": {},
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "ids [18863, 18279]\n",
+            "tokens ['##nology', 'crystalline']\n",
+            "lookup: ids [18863 18279]\n"
+          ]
+        }
+      ],
+      "source": [
+        "# Load bert vocab\n",
+        "with open(\"bert-base-uncased-vocab.txt\") as vocab_file:\n",
+        "    vocab_content = list(set(vocab_file.read().splitlines()))\n",
+        "vocab = text.Vocab.from_list(vocab_content)\n",
+        "\n",
+        "# lookup tokens to ids\n",
+        "ids = vocab.tokens_to_ids([\"good\", \"morning\"])\n",
+        "print(\"ids\", ids)\n",
+        "\n",
+        "# lookup ids to tokens\n",
+        "tokens = vocab.ids_to_tokens([128, 256])\n",
+        "print(\"tokens\", tokens)\n",
+        "\n",
+        "# Use Lookup op to lookup index\n",
+        "op = text.Lookup(vocab)\n",
+        "ids = op([\"good\", \"morning\"])\n",
+        "print(\"lookup: ids\", ids)"
+      ]
+    },
+    {
+      "attachments": {},
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## AddToken\n",
+        "\n",
+        "[mindspore.dataset.text.AddToken](https://mindspore.cn/docs/zh-CN/master/api_python/dataset_text/mindspore.dataset.text.AddToken.html#mindspore.dataset.text.AddToken) adds a token to the beginning or end of a sequence."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "metadata": {},
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "['TOKEN' 'a' 'b' 'c' 'd' 'e']\n",
+            "['a' 'b' 'c' 'd' 'e' 'END']\n"
+          ]
+        }
+      ],
+      "source": [
+        "txt = [\"a\", \"b\", \"c\", \"d\", \"e\"]\n",
+        "add_token_op = text.AddToken(token='TOKEN', begin=True)\n",
+        "call_op(add_token_op, txt)\n",
+        "\n",
+        "add_token_op = text.AddToken(token='END', begin=False)\n",
+        "call_op(add_token_op, txt)"
+      ]
+    },
+    {
+      "attachments": {},
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## SentencePieceTokenizer\n",
+        "\n",
+        "[mindspore.dataset.text.SentencePieceTokenizer](https://mindspore.cn/docs/zh-CN/master/api_python/dataset_text/mindspore.dataset.text.SentencePieceTokenizer.html#mindspore.dataset.text.SentencePieceTokenizer) tokenizes strings with the SentencePiece tokenizer.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 4,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/sentencepiece.bpe.model (4.8 MB)\n",
+            "\n",
+            "file_sizes: 100%|██████████████████████████| 5.07M/5.07M [00:01<00:00, 2.93MB/s]\n",
+            "Successfully downloaded file to ./sentencepiece.bpe.model\n",
+            "['▁Today' '▁is' '▁Tuesday' '.']\n"
+          ]
+        }
+      ],
+      "source": [
+        "# Download a pretrained SentencePiece model\n",
+        "url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/sentencepiece.bpe.model\"\n",
+        "download(url, './sentencepiece.bpe.model', replace=True)\n",
+        "sentence_piece_vocab_file = './sentencepiece.bpe.model'\n",
+        "\n",
+        "# Use the model to tokenize text\n",
+        "tokenizer = text.SentencePieceTokenizer(sentence_piece_vocab_file, out_type=text.SPieceTokenizerOutType.STRING)\n",
+        "txt = \"Today is Tuesday.\"\n",
+        "call_op(tokenizer, txt)"
+      ]
+    },
+    {
+      "attachments": {},
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## WordpieceTokenizer\n",
+        "\n",
+        "[mindspore.dataset.text.WordpieceTokenizer](https://mindspore.cn/docs/zh-CN/master/api_python/dataset_text/mindspore.dataset.text.WordpieceTokenizer.html#mindspore.dataset.text.WordpieceTokenizer) splits the input strings into subwords."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {
+        "collapsed": false
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "['token' '##izer' 'will' 'outputs' 'sub' '##words']\n"
+          ]
+        }
+      ],
+      "source": [
+        "# Reuse the vocab defined above as input vocab\n",
+        "tokenizer = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]')\n",
+        "txt = [\"tokenizer\", \"will\", \"outputs\", \"subwords\"]\n",
+        "call_op(tokenizer, txt)"
+      ]
+    },
+    {
+      "attachments": {},
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Load and Process TXT Files in the Data Pipeline\n",
+        "\n",
+        "Use [mindspore.dataset.TextFileDataset](https://mindspore.cn/docs/zh-CN/master/api_python/dataset/mindspore.dataset.TextFileDataset.html#mindspore.dataset.TextFileDataset) to load the contents of a text file on disk into the data pipeline, then apply a tokenizer to tokenize the contents."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# Load text content into dataset pipeline\n",
+        "text_file = \"article.txt\"\n",
+        "dataset = ds.TextFileDataset(dataset_files=text_file, shuffle=False)\n",
+        "\n",
+        "# check the column names inside the dataset\n",
+        "print(\"column names:\", dataset.get_col_names())\n",
+        "\n",
+        "# tokenize all text content into tokens with bert vocab\n",
+        "dataset = dataset.map(text.BertTokenizer(vocab=vocab), input_columns=[\"text\"])\n",
+        "\n",
+        "for data in dataset:\n",
+        "    print(data)"
+      ]
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": "ly37",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.7.5"
+    },
+    "vscode": {
+      "interpreter": {
+        "hash": "9f0efe8a0d8ccef1406a56130f5ab5480567fb275f7fbf51bbc40aede97503df"
+      }
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
--- a/docs/api/api_python/samples/dataset/vision_gallery.ipynb
+++ b/docs/api/api_python/samples/dataset/vision_gallery.ipynb
--- a/docs/api/api_python_en/samples/dataset/audio_gallery.ipynb
+++ b/docs/api/api_python_en/samples/dataset/audio_gallery.ipynb
--- a/docs/api/api_python_en/samples/dataset/dataset_gallery.ipynb
+++ b/docs/api/api_python_en/samples/dataset/dataset_gallery.ipynb
--- a/docs/api/api_python_en/samples/dataset/text_gallery.ipynb
+++ b/docs/api/api_python_en/samples/dataset/text_gallery.ipynb
@@ -72,7 +72,7 @@
      "source": [
        "## Vocab\n",
        "\n",
-        "The :class:`~.dataset.text.Vocab` is used to save pairs of words and ids.\n",
+        "The [mindspore.dataset.text.Vocab](https://mindspore.cn/docs/en/master/api_python/dataset_text/mindspore.dataset.text.Vocab.html#mindspore.dataset.text.Vocab) is used to save pairs of words and ids.\n",
        "It contains a map that maps each word(str) to an id(int) or reverse."
      ]
    },
@@ -112,12 +112,13 @@
      ]
    },
    {
+      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## AddToken\n",
        "\n",
-        "The :class:`~.dataset.text.AddToken` transform adds token to beginning or end of sequence.\n",
+        "The [mindspore.dataset.text.AddToken](https://mindspore.cn/docs/en/master/api_python/dataset_text/mindspore.dataset.text.AddToken.html#mindspore.dataset.text.AddToken) transform adds a token to the beginning or end of a sequence.\n",
        "\n"
      ]
    },
@@ -151,7 +152,7 @@
      "source": [
        "## SentencePieceTokenizer\n",
        "\n",
-        "The :class:`~.dataset.text.SentencePieceTokenizer` transform tokenizes scalar token or 1-D tokens to tokens by sentencepiece.\n"
+        "The [mindspore.dataset.text.SentencePieceTokenizer](https://mindspore.cn/docs/en/master/api_python/dataset_text/mindspore.dataset.text.SentencePieceTokenizer.html#mindspore.dataset.text.SentencePieceTokenizer) transform tokenizes a scalar string into tokens with SentencePiece.\n"
      ]
    },
    {
@@ -192,7 +193,7 @@
      "source": [
        "## WordpieceTokenizer\n",
        "\n",
-        "The :class:`~.dataset.text.WordpieceTokenizer` transform tokenizes the input text to subword tokens.\n",
+        "The [mindspore.dataset.text.WordpieceTokenizer](https://mindspore.cn/docs/en/master/api_python/dataset_text/mindspore.dataset.text.WordpieceTokenizer.html#mindspore.dataset.text.WordpieceTokenizer) transform tokenizes the input text into subword tokens.\n",
        "\n"
      ]
    },
@@ -223,9 +224,9 @@
      "cell_type": "markdown",
      "metadata": {},
      "source": [
-        "## Process TXT File in pipeline\n",
+        "## Process TXT File In Dataset Pipeline\n",
        "\n",
-        "Use :class:`~mindspore.dataset.TextFileDataset` to read content of text into dataset pipeline and the perform tokenization on text."
+        "Use [mindspore.dataset.TextFileDataset](https://mindspore.cn/docs/en/master/api_python/dataset/mindspore.dataset.TextFileDataset.html#mindspore.dataset.TextFileDataset) to read the content of a text file into the dataset pipeline and then perform tokenization on the text."
      ]
    },
    {
@@ -275,4 +276,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 0
-}
+}
--- a/docs/api/api_python_en/samples/dataset/vision_gallery.ipynb
+++ b/docs/api/api_python_en/samples/dataset/vision_gallery.ipynb