!55024 dataset gallery zh-cn
Merge pull request !55024 from luoyang/code_docs_gallery
This commit is contained in:
commit
ea8e09e654
|
@ -59,3 +59,11 @@ https://www.mindspore.cn/docs/en/master/api_python/samples/dataset/dataset_galle
|
|||
https://www.mindspore.cn/docs/en/master/api_python/samples/dataset/vision_gallery.html
|
||||
https://www.mindspore.cn/docs/en/master/api_python/samples/dataset/audio_gallery.html
|
||||
https://www.mindspore.cn/docs/en/master/api_python/samples/dataset/text_gallery.html
|
||||
https://www.mindspore.cn/docs/zh-CN/master/api_python/samples/dataset/dataset_gallery.html
|
||||
https://www.mindspore.cn/docs/zh-CN/master/api_python/samples/dataset/audio_gallery.html
|
||||
https://www.mindspore.cn/docs/zh-CN/master/api_python/samples/dataset/text_gallery.html
|
||||
https://www.mindspore.cn/docs/zh-CN/master/api_python/samples/dataset/vision_gallery.html
|
||||
https://gitee.com/mindspore/mindspore/blob/master/docs/api/api_python/samples/dataset/audio_gallery.ipynb
|
||||
https://gitee.com/mindspore/mindspore/blob/master/docs/api/api_python/samples/dataset/dataset_gallery.ipynb
|
||||
https://gitee.com/mindspore/mindspore/blob/master/docs/api/api_python/samples/dataset/text_gallery.ipynb
|
||||
https://gitee.com/mindspore/mindspore/blob/master/docs/api/api_python/samples/dataset/vision_gallery.ipynb
|
|
@ -51,7 +51,7 @@ mindspore.dataset
|
|||
数据处理Pipeline快速上手
|
||||
----------------------
|
||||
|
||||
如何快速使用Dataset Pipeline,可以将 `Load & Process Data With Dataset Pipeline <https://www.mindspore.cn/docs/en/master/api_python/samples/dataset/dataset_gallery.html>`_ 下载到本地,按照顺序执行并观察输出结果。
|
||||
如何快速使用Dataset Pipeline,可以将 `使用数据Pipeline加载 & 处理数据集 <https://www.mindspore.cn/docs/zh-CN/master/api_python/samples/dataset/dataset_gallery.html>`_ 下载到本地,按照顺序执行并观察输出结果。
|
||||
|
||||
视觉
|
||||
-----
|
||||
|
|
|
@ -95,8 +95,8 @@ API样例中常用的导入模块如下:
|
|||
样例库
|
||||
^^^^^^
|
||||
|
||||
快速上手使用视觉类变换的API,跳转参考 `Illustration of vision transforms <https://www.mindspore.cn/docs/en/master/api_python/samples/dataset/vision_gallery.html>`_ 。
|
||||
此指南中展示了典型的API用法,以及输入输出结果。
|
||||
快速上手使用视觉类变换的API,跳转参考 `视觉变换样例库 <https://www.mindspore.cn/docs/zh-CN/master/api_python/samples/dataset/vision_gallery.html>`_ 。
|
||||
此指南中展示了多个变换API的用法,以及输入输出结果。
|
||||
|
||||
变换
|
||||
^^^^^
|
||||
|
@ -241,8 +241,8 @@ API样例中常用的导入模块如下:
|
|||
样例库
|
||||
^^^^^^
|
||||
|
||||
快速上手使用文本变换的API,跳转参考 `Illustration of text transforms <https://www.mindspore.cn/docs/en/master/api_python/samples/dataset/text_gallery.html>`_ 。
|
||||
此指南中展示了典型的API用法,以及输入输出结果。
|
||||
快速上手使用文本变换的API,跳转参考 `文本变换样例库 <https://www.mindspore.cn/docs/zh-CN/master/api_python/samples/dataset/text_gallery.html>`_ 。
|
||||
此指南中展示了多个变换API的用法,以及输入输出结果。
|
||||
|
||||
变换
|
||||
^^^^^
|
||||
|
@ -311,8 +311,8 @@ API样例中常用的导入模块如下:
|
|||
样例库
|
||||
^^^^^^
|
||||
|
||||
快速上手使用音频变换的API,跳转参考 `Illustration of audio transforms <https://www.mindspore.cn/docs/en/master/api_python/samples/dataset/audio_gallery.html>`_ 。
|
||||
此指南中展示了典型的API用法,以及输入输出结果。
|
||||
快速上手使用音频变换的API,跳转参考 `音频变换样例库 <https://www.mindspore.cn/docs/zh-CN/master/api_python/samples/dataset/audio_gallery.html>`_ 。
|
||||
此指南中展示了多个变换API的用法,以及输入输出结果。
|
||||
|
||||
变换
|
||||
^^^^^
|
||||
|
|
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
|
@ -0,0 +1,277 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# 文本变换样例库\n",
|
||||
"\n",
|
||||
"[![下载Notebook](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_notebook.png)](https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/master/docs/api_python/samples/dataset/text_gallery.ipynb) \n",
|
||||
"[![查看源文件](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.png)](https://gitee.com/mindspore/mindspore/blob/master/docs/api/api_python/samples/dataset/text_gallery.ipynb)\n",
|
||||
"\n",
|
||||
"此指南展示了[mindpore.dataset.text](https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore.dataset.transforms.html#%E6%96%87%E6%9C%AC)模块中各种变换的用法。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 环境准备"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/bert-base-uncased-vocab.txt (226 kB)\n",
|
||||
"\n",
|
||||
"file_sizes: 100%|████████████████████████████| 232k/232k [00:00<00:00, 2.21MB/s]\n",
|
||||
"Successfully downloaded file to ./bert-base-uncased-vocab.txt\n",
|
||||
"Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/article.txt (9 kB)\n",
|
||||
"\n",
|
||||
"file_sizes: 100%|██████████████████████████| 9.06k/9.06k [00:00<00:00, 1.83MB/s]\n",
|
||||
"Successfully downloaded file to ./article.txt\n",
|
||||
"['text_gallery.ipynb', 'article.txt', 'bert-base-uncased-vocab.txt']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"from download import download\n",
|
||||
"\n",
|
||||
"import mindspore.dataset as ds\n",
|
||||
"import mindspore.dataset.text as text\n",
|
||||
"\n",
|
||||
"# Download opensource datasets\n",
|
||||
"# citation: https://www.kaggle.com/datasets/drknope/bertbaseuncasedvocab\n",
|
||||
"url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/bert-base-uncased-vocab.txt\"\n",
|
||||
"download(url, './bert-base-uncased-vocab.txt', replace=True)\n",
|
||||
"\n",
|
||||
"url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/article.txt\"\n",
|
||||
"download(url, './article.txt', replace=True)\n",
|
||||
"\n",
|
||||
"# Show the directory\n",
|
||||
"print(os.listdir())\n",
|
||||
"\n",
|
||||
"def call_op(op, input):\n",
|
||||
" print(op(input), flush=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Vocab\n",
|
||||
"\n",
|
||||
"[mindspore.dataset.text.Vocab](https://mindspore.cn/docs/zh-CN/master/api_python/dataset_text/mindspore.dataset.text.Vocab.html#mindspore.dataset.text.Vocab) 用于存储多对字符与ID。其包含一个映射,可以将每个单词(str)映射到一个ID(int)。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"ids [18863, 18279]\n",
|
||||
"tokens ['##nology', 'crystalline']\n",
|
||||
"lookup: ids [18863 18279]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Load bert vocab\n",
|
||||
"vocab_file = open(\"bert-base-uncased-vocab.txt\")\n",
|
||||
"vocab_content = list(set(vocab_file.read().splitlines()))\n",
|
||||
"vocab = text.Vocab.from_list(vocab_content)\n",
|
||||
"\n",
|
||||
"# lookup tokens to ids\n",
|
||||
"ids = vocab.tokens_to_ids([\"good\", \"morning\"])\n",
|
||||
"print(\"ids\", ids)\n",
|
||||
"\n",
|
||||
"# lookup ids to tokens\n",
|
||||
"tokens = vocab.ids_to_tokens([128, 256])\n",
|
||||
"print(\"tokens\", tokens)\n",
|
||||
"\n",
|
||||
"# Use Lookup op to lookup index\n",
|
||||
"op = text.Lookup(vocab)\n",
|
||||
"ids = op([\"good\", \"morning\"])\n",
|
||||
"print(\"lookup: ids\", ids)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## AddToken\n",
|
||||
"\n",
|
||||
"[mindspore.dataset.text.AddToken](https://mindspore.cn/docs/zh-CN/master/api_python/dataset_text/mindspore.dataset.text.AddToken.html#mindspore.dataset.text.AddToken) 将分词(token)添加到序列的开头或结尾处。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"['TOKEN' 'a' 'b' 'c' 'd' 'e']\n",
|
||||
"['a' 'b' 'c' 'd' 'e' 'END']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"txt = [\"a\", \"b\", \"c\", \"d\", \"e\"]\n",
|
||||
"add_token_op = text.AddToken(token='TOKEN', begin=True)\n",
|
||||
"call_op(add_token_op, txt)\n",
|
||||
"\n",
|
||||
"add_token_op = text.AddToken(token='END', begin=False)\n",
|
||||
"call_op(add_token_op, txt)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## SentencePieceTokenizer\n",
|
||||
"\n",
|
||||
"[mindspore.dataset.text.SentencePieceTokenizer](https://mindspore.cn/docs/zh-CN/master/api_python/dataset_text/mindspore.dataset.text.SentencePieceTokenizer.html#mindspore.dataset.text.SentencePieceTokenizer) 使用SentencePiece分词器对字符串进行分词。\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/sentencepiece.bpe.model (4.8 MB)\n",
|
||||
"\n",
|
||||
"file_sizes: 100%|██████████████████████████| 5.07M/5.07M [00:01<00:00, 2.93MB/s]\n",
|
||||
"Successfully downloaded file to ./sentencepiece.bpe.model\n",
|
||||
"['▁Today' '▁is' '▁Tuesday' '.']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Construct a SentencePieceVocab model\n",
|
||||
"url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/sentencepiece.bpe.model\"\n",
|
||||
"download(url, './sentencepiece.bpe.model', replace=True)\n",
|
||||
"sentence_piece_vocab_file = './sentencepiece.bpe.model'\n",
|
||||
"\n",
|
||||
"# Use the model to tokenize text\n",
|
||||
"tokenizer = text.SentencePieceTokenizer(sentence_piece_vocab_file, out_type=text.SPieceTokenizerOutType.STRING)\n",
|
||||
"txt = \"Today is Tuesday.\"\n",
|
||||
"call_op(tokenizer, txt)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## WordpieceTokenizer\n",
|
||||
"\n",
|
||||
"[mindspore.dataset.text.WordpieceTokenizer](https://mindspore.cn/docs/zh-CN/master/api_python/dataset_text/mindspore.dataset.text.WordpieceTokenizer.html#mindspore.dataset.text.WordpieceTokenizer) 将输入的字符串切分为子词。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"['token' '##izer' 'will' 'outputs' 'sub' '##words']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Reuse the vocab defined above as input vocab\n",
|
||||
"tokenizer = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]')\n",
|
||||
"txt = [\"tokenizer\", \"will\", \"outputs\", \"subwords\"]\n",
|
||||
"call_op(tokenizer, txt)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 在数据Pipeline中加载和处理TXT文件\n",
|
||||
"\n",
|
||||
"使用 [mindspore.dataset.TextFileDataset](https://mindspore.cn/docs/zh-CN/master/api_python/dataset/mindspore.dataset.TextFileDataset.html#mindspore.dataset.TextFileDataset) 将磁盘中的文本文件内容加载到数据Pipeline中,并应用分词器对其中的内容进行分词。"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Load text content into dataset pipeline\n",
|
||||
"text_file = \"article.txt\"\n",
|
||||
"dataset = ds.TextFileDataset(dataset_files=text_file, shuffle=False)\n",
|
||||
"\n",
|
||||
"# check the column names inside the dataset\n",
|
||||
"print(\"column names:\", dataset.get_col_names())\n",
|
||||
"\n",
|
||||
"# tokenize all text content into tokens with bert vocab\n",
|
||||
"dataset = dataset.map(text.BertTokenizer(vocab=vocab), input_columns=[\"text\"])\n",
|
||||
"\n",
|
||||
"for data in dataset:\n",
|
||||
" print(data)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "ly37",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.5"
|
||||
},
|
||||
"vscode": {
|
||||
"interpreter": {
|
||||
"hash": "9f0efe8a0d8ccef1406a56130f5ab5480567fb275f7fbf51bbc40aede97503df"
|
||||
}
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 0
|
||||
}
|
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
|
@ -72,7 +72,7 @@
|
|||
"source": [
|
||||
"## Vocab\n",
|
||||
"\n",
|
||||
"The :class:`~.dataset.text.Vocab` is used to save pairs of words and ids.\n",
|
||||
"The [mindspore.dataset.text.Vocab](https://mindspore.cn/docs/en/master/api_python/dataset_text/mindspore.dataset.text.Vocab.html#mindspore.dataset.text.Vocab) is used to save pairs of words and ids.\n",
|
||||
"It contains a map that maps each word(str) to an id(int) or reverse."
|
||||
]
|
||||
},
|
||||
|
@ -112,12 +112,13 @@
|
|||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## AddToken\n",
|
||||
"\n",
|
||||
"The :class:`~.dataset.text.AddToken` transform adds token to beginning or end of sequence.\n",
|
||||
"The [mindspore.dataset.text.AddToken](https://mindspore.cn/docs/en/master/api_python/dataset_text/mindspore.dataset.text.AddToken.html#mindspore.dataset.text.AddToken) transform adds token to beginning or end of sequence.\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
|
@ -151,7 +152,7 @@
|
|||
"source": [
|
||||
"## SentencePieceTokenizer\n",
|
||||
"\n",
|
||||
"The :class:`~.dataset.text.SentencePieceTokenizer` transform tokenizes scalar token or 1-D tokens to tokens by sentencepiece.\n"
|
||||
"The [mindspore.dataset.text.SentencePieceTokenizer](https://mindspore.cn/docs/en/master/api_python/dataset_text/mindspore.dataset.text.SentencePieceTokenizer.html#mindspore.dataset.text.SentencePieceTokenizer) transform tokenizes scalar string into tokens by sentencepiece.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -192,7 +193,7 @@
|
|||
"source": [
|
||||
"## WordpieceTokenizer\n",
|
||||
"\n",
|
||||
"The :class:`~.dataset.text.WordpieceTokenizer` transform tokenizes the input text to subword tokens.\n",
|
||||
"The [mindspore.dataset.text.WordpieceTokenizer](https://mindspore.cn/docs/en/master/api_python/dataset_text/mindspore.dataset.text.WordpieceTokenizer.html#mindspore.dataset.text.WordpieceTokenizer) transform tokenizes the input text to subword tokens.\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
|
@ -223,9 +224,9 @@
|
|||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Process TXT File in pipeline\n",
|
||||
"## Process TXT File In Dataset Pipeline\n",
|
||||
"\n",
|
||||
"Use :class:`~mindspore.dataset.TextFileDataset` to read content of text into dataset pipeline and the perform tokenization on text."
|
||||
"Use [mindspore.dataset.TextFileDataset](https://mindspore.cn/docs/en/master/api_python/dataset/mindspore.dataset.TextFileDataset.html#mindspore.dataset.TextFileDataset) to read content of text into dataset pipeline and the perform tokenization on text."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@ -275,4 +276,4 @@
|
|||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 0
|
||||
}
|
||||
}
|
||||
|
|
File diff suppressed because one or more lines are too long
Loading…
Reference in New Issue