!1159 add nlp preprocess example to mindrecord

Merge pull request !1159 from guozhijian/add_nlp_preprocess_example
This commit is contained in:
mindspore-ci-bot 2020-05-14 23:28:48 +08:00 committed by Gitee
commit c68d1567e9
8 changed files with 356 additions and 2 deletions


@@ -16,13 +16,13 @@
This example reads data from the aclImdb dataset and generates mindrecord files. It just transfers the aclImdb dataset to mindrecord without any data preprocessing. You can modify the example or follow it to implement your own.
1. run.sh: generate MindRecord entry script.
- gen_mindrecord.py : read the aclImdb data and transfer it to mindrecord.
2. run_read.py: create MindDataset by MindRecord entry script.
- create_dataset.py: use MindDataset to read MindRecord to generate dataset.
## How to use the example to generate MindRecord
Download the aclImdb dataset, transfer it to mindrecord, and use MindDataset to read the mindrecord files.
### Download aclImdb dataset and unzip


@@ -0,0 +1,131 @@
# Guideline to Preprocess Large Movie Review Dataset - aclImdb to MindRecord
<!-- TOC -->
- [What does the example do](#what-does-the-example-do)
- [How to use the example to generate MindRecord](#how-to-use-the-example-to-generate-mindrecord)
- [Download aclImdb dataset and unzip](#download-aclimdb-dataset-and-unzip)
- [Generate MindRecord](#generate-mindrecord)
- [Create MindDataset By MindRecord](#create-minddataset-by-mindrecord)
<!-- /TOC -->
## What does the example do
This example reads data from the aclImdb dataset, preprocesses it and generates mindrecord files. The preprocessing mainly uses the vocab file to convert the review text into sequences of dictionary indices, which can then be used in the subsequent training process.
1. run.sh: generate MindRecord entry script.
- gen_mindrecord.py : read the aclImdb data, preprocess it and transfer it to mindrecord.
2. run_read.py: create MindDataset by MindRecord entry script.
- create_dataset.py: use MindDataset to read MindRecord to generate dataset.
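The vocab-based conversion described above can be sketched roughly as follows. `load_vocab` and `text_to_ids` here are simplified, hypothetical helpers for illustration only, not the exact functions used in gen_mindrecord.py:

```python
import re
import string


def load_vocab(vocab_file):
    """Map each line of the vocab file to its line index (simplified)."""
    with open(vocab_file, encoding="utf-8") as f:
        return {line.strip(): i for i, line in enumerate(f)}


def text_to_ids(text, vocab, maxlen=50):
    """Tokenize text, map unknown tokens to -1, pad ids and mask to maxlen."""
    # split into words and individual punctuation marks
    tokens = re.findall(r"[\w']+|[{}]".format(string.punctuation), text)
    ids = [vocab.get(t, -1) for t in tokens][:maxlen]
    mask = [1] * len(ids) + [0] * (maxlen - len(ids))  # 1 = real token, 0 = padding
    ids = ids + [0] * (maxlen - len(ids))
    return ids, mask
```

For a tiny vocabulary such as `{"good": 3, "movie": 7}`, `text_to_ids("good movie !", vocab, maxlen=5)` yields ids `[3, 7, -1, 0, 0]` and mask `[1, 1, 1, 0, 0]`.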
## How to use the example to generate MindRecord
Download the aclImdb dataset, transfer it to mindrecord, and use MindDataset to read the mindrecord files.
### Download aclImdb dataset and unzip
1. Download the training data zip.
> [aclImdb dataset download address](http://ai.stanford.edu/~amaas/data/sentiment/) **-> Large Movie Review Dataset v1.0**
2. Unzip the training data to the directory example/nlp_to_mindrecord/aclImdb_preprocess/data.
```
tar -zxvf aclImdb_v1.tar.gz -C {your-mindspore}/example/nlp_to_mindrecord/aclImdb_preprocess/data/
```
### Generate MindRecord
1. Run the run.sh script.
```bash
bash run.sh
```
2. The output is similar to the following:
```
...
>> begin generate mindrecord by train data
>> transformed 256 record...
>> transformed 512 record...
>> transformed 768 record...
>> transformed 1024 record...
...
>> transformed 25000 record...
[INFO] MD(6553,python):2020-05-14-16:10:44.947.617 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:227] Commit] Write metadata successfully.
[INFO] MD(6553,python):2020-05-14-16:10:44.948.193 [mindspore/ccsrc/mindrecord/io/shard_index_generator.cc:59] Build] Init header from mindrecord file for index successfully.
[INFO] MD(6553,python):2020-05-14-16:10:44.974.544 [mindspore/ccsrc/mindrecord/io/shard_index_generator.cc:600] DatabaseWriter] Init index db for shard: 0 successfully.
[INFO] MD(6553,python):2020-05-14-16:10:46.110.119 [mindspore/ccsrc/mindrecord/io/shard_index_generator.cc:549] ExecuteTransaction] Insert 25000 rows to index db.
[INFO] MD(6553,python):2020-05-14-16:10:46.128.212 [mindspore/ccsrc/mindrecord/io/shard_index_generator.cc:620] DatabaseWriter] Generate index db for shard: 0 successfully.
[INFO] ME(6553:139716072798016,MainProcess):2020-05-14-16:10:46.130.596 [mindspore/mindrecord/filewriter.py:313] The list of mindrecord files created are: ['output/aclImdb_train.mindrecord'], and the list of index files are: ['output/aclImdb_train.mindrecord.db']
>> begin generate mindrecord by test data
>> transformed 256 record...
>> transformed 512 record...
>> transformed 768 record...
>> transformed 1024 record...
...
[INFO] MD(6553,python):2020-05-14-16:10:55.047.633 [mindspore/ccsrc/mindrecord/io/shard_index_generator.cc:600] DatabaseWriter] Init index db for shard: 0 successfully.
[INFO] MD(6553,python):2020-05-14-16:10:56.092.477 [mindspore/ccsrc/mindrecord/io/shard_index_generator.cc:549] ExecuteTransaction] Insert 25000 rows to index db.
[INFO] MD(6553,python):2020-05-14-16:10:56.107.799 [mindspore/ccsrc/mindrecord/io/shard_index_generator.cc:620] DatabaseWriter] Generate index db for shard: 0 successfully.
[INFO] ME(6553:139716072798016,MainProcess):2020-05-14-16:10:56.111.193 [mindspore/mindrecord/filewriter.py:313] The list of mindrecord files created are: ['output/aclImdb_test.mindrecord'], and the list of index files are: ['output/aclImdb_test.mindrecord.db']
```
3. Check the generated mindrecord files:
```
$ ls output/
aclImdb_test.mindrecord aclImdb_test.mindrecord.db aclImdb_train.mindrecord aclImdb_train.mindrecord.db README.md
```
### Create MindDataset By MindRecord
1. Run the run_read.sh script.
```bash
bash run_read.sh
```
2. The output is similar to the following:
```
example 24992: {'input_ids': array([ -1, -1, 65, 0, 89, 0, 367, 0, -1,
-1, -1, -1, 488, 0, 0, 0, 206, 0,
816, 0, -1, -1, 16, 0, -1, -1, 11998,
0, 0, 0, 852, 0, 1, 0, 111, 0,
-1, -1, -1, -1, 765, 0, 9, 0, 17,
0, 35, 0, 72, 0, -1, -1, -1, -1,
40, 0, 895, 0, 41, 0, 0, 0, 6952,
0, 170, 0, -1, -1, -1, -1, 3, 0,
28, 0, -1, -1, 0, 0, 111, 0, 58,
0, 110, 0, 569, 0, -1, -1, -1, -1,
-1, -1, 0, 0, 24512, 0, 3, 0, 0,
0], dtype=int32), 'id': array(8045, dtype=int32), 'input_mask': array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=int32), 'segment_ids': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32), 'score': array(1, dtype=int32), 'label': array(1, dtype=int32)}
example 24993: {'input_ids': array([ -1, -1, 11, 0, 7400, 0, 189, 0, 4, 0, 1247,
0, 9, 0, 17, 0, 29, 0, 0, 0, -1, -1,
-1, -1, -1, -1, 1, 0, -1, -1, 218, 0, 131,
0, 10, 0, -1, -1, 52, 0, 72, 0, 488, 0,
6, 0, -1, -1, -1, -1, -1, -1, 1749, 0, 0,
0, -1, -1, 42, 0, 21, 0, 65, 0, 6895, 0,
-1, -1, -1, -1, -1, -1, 11, 0, 52, 0, 72,
0, 1498, 0, 10, 0, 21, 0, 65, 0, 19, 0,
-1, -1, -1, -1, 36, 0, 130, 0, 88, 0, 210,
0], dtype=int32), 'id': array(9903, dtype=int32), 'input_mask': array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=int32), 'segment_ids': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32), 'score': array(7, dtype=int32), 'label': array(0, dtype=int32)}
```
- id : the id "3219" comes from the review file name, e.g. **3219**_10.txt.
- label : indicates whether the review is positive or negative; positive: 0, negative: 1.
- score : the score "10" comes from the review file name, e.g. 3219_**10**.txt.
- input_ids : the review doc's content, tokenized and mapped to ids through the imdb.vocab file.
- input_mask : 1 for each valid position in input_ids and 0 for padding.
- segment_ids : the segment markers for input_ids; all 0 here, since each example is a single piece of text.
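To make the id/score mapping above concrete, the two fields can be recovered from a review file name with a small helper. `parse_review_filename` is a hypothetical name used only for illustration, not part of the example's code:

```python
def parse_review_filename(name):
    """Split a review file name such as '3219_10.txt' into (id, score)."""
    stem = name.rsplit(".", 1)[0]         # drop the .txt extension -> '3219_10'
    review_id, score = stem.split("_", 1)  # '3219', '10'
    return int(review_id), int(score)
```

For example, `parse_review_filename("3219_10.txt")` returns `(3219, 10)`.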


@@ -0,0 +1,34 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""create MindDataset by MindRecord"""
import mindspore.dataset as ds


def create_dataset(data_file):
    """create MindDataset"""
    num_readers = 4
    data_set = ds.MindDataset(dataset_file=data_file,
                              num_parallel_workers=num_readers,
                              shuffle=True)
    index = 0
    for item in data_set.create_dict_iterator():
        print("example {}: {}".format(index, item))
        index += 1
        if index % 1000 == 0:
            print(">> read rows: {}".format(index))
    print(">> total rows: {}".format(index))


if __name__ == '__main__':
    create_dataset('output/aclImdb_train.mindrecord')
    create_dataset('output/aclImdb_test.mindrecord')


@@ -0,0 +1 @@
## The input dataset


@@ -0,0 +1,150 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""get data from aclImdb and write the data to mindrecord file"""
import collections
import os
import re
import string
import numpy as np
from mindspore.mindrecord import FileWriter
ACLIMDB_DIR = "data/aclImdb"
MINDRECORD_FILE_NAME_TRAIN = "output/aclImdb_train.mindrecord"
MINDRECORD_FILE_NAME_TEST = "output/aclImdb_test.mindrecord"


def inputs(vectors, maxlen=50):
    """generate input_ids, mask and segment ids"""
    length = len(vectors)
    if length > maxlen:
        return vectors[0:maxlen], [1]*maxlen, [0]*maxlen
    input_ = vectors + [0]*(maxlen-length)
    mask = [1]*length + [0]*(maxlen-length)
    segment = [0]*maxlen
    return input_, mask, segment


def get_nlp_data(data_dir, vocab_dict):
    """get data from a dir like aclImdb/train"""
    dir_list = [os.path.join(data_dir, "pos"),
                os.path.join(data_dir, "neg")]
    for index, exact_dir in enumerate(dir_list):
        if not os.path.exists(exact_dir):
            raise IOError("dir {} does not exist".format(exact_dir))
        for item in os.listdir(exact_dir):
            # file name like 4372_2.txt, from which we get id: 4372, score: 2
            id_score = item.split("_", 1)
            score = id_score[1].split(".", 1)
            with open(os.path.join(exact_dir, item), "r") as review_file:
                review = review_file.read()
            # map each token to its vocab index, unknown tokens to -1
            vectors = [vocab_dict.get(i, -1)
                       for i in re.findall(r"[\w']+|[{}]".format(string.punctuation), review)]
            input_, mask, segment = inputs(vectors)
            # use int32 explicitly so the arrays match the MindRecord schema
            input_ids = np.reshape(np.array(input_, dtype=np.int32), [1, -1])
            input_mask = np.reshape(np.array(mask, dtype=np.int32), [1, -1])
            segment_ids = np.reshape(np.array(segment, dtype=np.int32), [1, -1])
            data = {
                "label": int(index),  # pos: 0, neg: 1
                "id": int(id_score[0]),
                "score": int(score[0]),
                "input_ids": input_ids,
                "input_mask": input_mask,
                "segment_ids": segment_ids
            }
            yield data


def convert_to_uni(text):
    """convert bytes to str"""
    if isinstance(text, str):
        return text
    if isinstance(text, bytes):
        return text.decode('utf-8', 'ignore')
    raise Exception("The type %s cannot be converted!" % type(text))


def load_vocab(vocab_file):
    """load the vocabulary used to translate statements."""
    vocab = collections.OrderedDict()
    vocab.setdefault('blank', 2)
    index = 0
    with open(vocab_file) as reader:
        while True:
            tmp = reader.readline()
            if not tmp:
                break
            token = convert_to_uni(tmp)
            token = token.strip()
            vocab[token] = index
            index += 1
    return vocab


def gen_mindrecord(data_type):
    """generate mindrecord according to the defined schema"""
    if data_type == "train":
        fw = FileWriter(MINDRECORD_FILE_NAME_TRAIN)
    else:
        fw = FileWriter(MINDRECORD_FILE_NAME_TEST)
    schema = {"id": {"type": "int32"},
              "label": {"type": "int32"},
              "score": {"type": "int32"},
              "input_ids": {"type": "int32", "shape": [-1]},
              "input_mask": {"type": "int32", "shape": [-1]},
              "segment_ids": {"type": "int32", "shape": [-1]}}
    fw.add_schema(schema, "aclImdb preprocessed dataset")
    fw.add_index(["id", "label", "score"])
    vocab_dict = load_vocab(os.path.join(ACLIMDB_DIR, "imdb.vocab"))
    get_data_iter = get_nlp_data(os.path.join(ACLIMDB_DIR, data_type), vocab_dict)
    batch_size = 256
    transform_count = 0
    while True:
        data_list = []
        try:
            for _ in range(batch_size):
                data_list.append(next(get_data_iter))
                transform_count += 1
            fw.write_raw_data(data_list)
            print(">> transformed {} record...".format(transform_count))
        except StopIteration:
            if data_list:
                fw.write_raw_data(data_list)
                print(">> transformed {} record...".format(transform_count))
            break
    fw.commit()


def main():
    # generate mindrecord for train
    print(">> begin generate mindrecord by train data")
    gen_mindrecord("train")
    # generate mindrecord for test
    print(">> begin generate mindrecord by test data")
    gen_mindrecord("test")


if __name__ == "__main__":
    main()


@@ -0,0 +1 @@
## Output the mindrecord


@@ -0,0 +1,20 @@
#!/bin/bash
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
rm -f output/aclImdb_train.mindrecord*
rm -f output/aclImdb_test.mindrecord*
python gen_mindrecord.py


@@ -0,0 +1,17 @@
#!/bin/bash
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
python create_dataset.py