add Caltech-UCSD-Birds-200-2011 to mindrecord example
parent
44e3d167f0
commit
4333080b6f
|
@ -0,0 +1,131 @@
|
|||
# Guideline to Convert the Caltech-UCSD Birds-200-2011 Dataset to MindRecord
|
||||
|
||||
<!-- TOC -->
|
||||
|
||||
- [What does the example do](#what-does-the-example-do)
|
||||
- [How to use the example to generate MindRecord](#how-to-use-the-example-to-generate-mindrecord)
|
||||
- [Download Caltech-UCSD Birds-200-2011 dataset and unzip](#download-caltech-ucsd-birds-200-2011-dataset-and-unzip)
|
||||
- [Generate MindRecord](#generate-mindrecord)
|
||||
- [Create MindDataset By MindRecord](#create-minddataset-by-mindrecord)
|
||||
|
||||
|
||||
<!-- /TOC -->
|
||||
|
||||
## What does the example do
|
||||
|
||||
This example reads the Caltech-UCSD Birds-200-2011 dataset and converts it to MindRecord without any data preprocessing. You can modify it, or use it as a template for your own conversion script.
|
||||
|
||||
1. run.sh: entry script that generates the MindRecord file.
|
||||
- gen_mindrecord.py: reads the Caltech-UCSD Birds-200-2011 data and converts it to MindRecord.
|
||||
2. run_read.sh: entry script that creates a MindDataset from the MindRecord file.
|
||||
- create_dataset.py: uses MindDataset to read the MindRecord file and build a dataset.
|
||||
|
||||
## How to use the example to generate MindRecord
|
||||
|
||||
Download the Caltech-UCSD Birds-200-2011 dataset, convert it to MindRecord, then read the result with MindDataset.
|
||||
|
||||
### Download Caltech-UCSD Birds-200-2011 dataset and unzip
|
||||
|
||||
1. Download the dataset archives.
|
||||
> [Caltech-UCSD Birds-200-2011 dataset download address](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html)
|
||||
> **1) -> Download -> All Images and Annotations**
|
||||
> **2) -> Download -> Segmentations**
|
||||
|
||||
2. Extract the archives into example/cv_to_mindrecord/Caltech-UCSD-Birds-200-2011/data.
|
||||
```
|
||||
tar -zxvf CUB_200_2011.tgz -C {your-mindspore}/example/cv_to_mindrecord/Caltech-UCSD-Birds-200-2011/data/
|
||||
tar -zxvf segmentations.tgz -C {your-mindspore}/example/cv_to_mindrecord/Caltech-UCSD-Birds-200-2011/data/
|
||||
```
|
||||
- The extracted directory should look like this:
|
||||
```
|
||||
$ ls {your-mindspore}/example/cv_to_mindrecord/Caltech-UCSD-Birds-200-2011/data/
|
||||
attributes.txt CUB_200_2011 README.md segmentations
|
||||
```
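Before running the conversion, it may help to confirm that the extracted tree contains the files that gen_mindrecord.py opens. The following is a minimal sketch, assuming the layout shown above and a working directory of `Caltech-UCSD-Birds-200-2011`; the helper name `missing_inputs` is hypothetical and not part of the example.

```python
import os

# Paths that gen_mindrecord.py reads, relative to the data directory.
REQUIRED = [
    os.path.join("CUB_200_2011", "images.txt"),
    os.path.join("CUB_200_2011", "bounding_boxes.txt"),
    os.path.join("CUB_200_2011", "image_class_labels.txt"),
    os.path.join("CUB_200_2011", "classes.txt"),
    os.path.join("CUB_200_2011", "images"),
    "segmentations",
]


def missing_inputs(data_dir):
    """Return the required paths that do not exist under data_dir."""
    return [p for p in REQUIRED if not os.path.exists(os.path.join(data_dir, p))]


if __name__ == "__main__":
    missing = missing_inputs("data")
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all expected inputs found")
```

An empty result means both archives were extracted into the expected place; any listed path points to an archive that was skipped or extracted elsewhere.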
|
||||
|
||||
### Generate MindRecord
|
||||
|
||||
1. Run the run.sh script.
|
||||
```bash
|
||||
bash run.sh
|
||||
```
|
||||
|
||||
2. The output looks like this:
|
||||
```
|
||||
...
|
||||
>> begin generate mindrecord
|
||||
>> sample id: 1, filename: data/CUB_200_2011/images/001.Black_footed_Albatross/Black_Footed_Albatross_0046_18.jpg, bbox: [60.0, 27.0, 325.0, 304.0], label: 1, seg_filename: data/segmentations/001.Black_footed_Albatross/Black_Footed_Albatross_0046_18.png, class: 001.Black_footed_Albatross
|
||||
[INFO] MD(11253,python):2020-05-20-16:21:42.462.686 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:106] OpenDataFiles] Open shard file successfully.
|
||||
[INFO] MD(11253,python):2020-05-20-16:21:43.147.496 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:680] WriteRawData] Write 256 records successfully.
|
||||
>> transformed 256 record...
|
||||
[INFO] MD(11253,python):2020-05-20-16:21:43.842.372 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:680] WriteRawData] Write 256 records successfully.
|
||||
>> transformed 512 record...
|
||||
[INFO] MD(11253,python):2020-05-20-16:21:44.748.585 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:680] WriteRawData] Write 256 records successfully.
|
||||
>> transformed 768 record...
|
||||
[INFO] MD(11253,python):2020-05-20-16:21:45.736.179 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:680] WriteRawData] Write 256 records successfully.
|
||||
>> transformed 1024 record...
|
||||
...
|
||||
[INFO] MD(11253,python):2020-05-20-16:22:21.207.820 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:680] WriteRawData] Write 12 records successfully.
|
||||
>> transformed 11788 record...
|
||||
[INFO] MD(11253,python):2020-05-20-16:22:21.210.261 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:227] Commit] Write metadata successfully.
|
||||
[INFO] MD(11253,python):2020-05-20-16:22:21.211.688 [mindspore/ccsrc/mindrecord/io/shard_index_generator.cc:59] Build] Init header from mindrecord file for index successfully.
|
||||
[INFO] MD(11253,python):2020-05-20-16:22:21.236.799 [mindspore/ccsrc/mindrecord/io/shard_index_generator.cc:600] DatabaseWriter] Init index db for shard: 0 successfully.
|
||||
[INFO] MD(11253,python):2020-05-20-16:22:21.964.034 [mindspore/ccsrc/mindrecord/io/shard_index_generator.cc:549] ExecuteTransaction] Insert 11788 rows to index db.
|
||||
[INFO] MD(11253,python):2020-05-20-16:22:21.978.087 [mindspore/ccsrc/mindrecord/io/shard_index_generator.cc:620] DatabaseWriter] Generate index db for shard: 0 successfully.
|
||||
[INFO] ME(11253:139923799271232,MainProcess):2020-05-20-16:22:21.979.634 [mindspore/mindrecord/filewriter.py:313] The list of mindrecord files created are: ['output/CUB_200_2011.mindrecord'], and the list of index files are: ['output/CUB_200_2011.mindrecord.db']
|
||||
```
|
||||
|
||||
3. The generated MindRecord files:
|
||||
```
|
||||
$ ls output/
|
||||
CUB_200_2011.mindrecord CUB_200_2011.mindrecord.db README.md
|
||||
```
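The log in step 2 shows records being written in batches of 256, followed by a final partial batch of 12 (11788 = 46 × 256 + 12). The batching pattern can be sketched independently of MindSpore as follows. This is an illustration rather than the code in gen_mindrecord.py: `batched` and `write_in_batches` are hypothetical helper names, and `write` stands in for a call such as `FileWriter.write_raw_data`.

```python
from itertools import islice


def batched(records, batch_size):
    """Yield lists of up to batch_size items from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def write_in_batches(records, write, batch_size=256):
    """Pass each batch to write() and return the number of records handled.

    `write` is a placeholder for a real sink such as a file writer method.
    """
    count = 0
    for batch in batched(records, batch_size):
        write(batch)
        count += len(batch)
    return count


if __name__ == "__main__":
    sizes = []
    total = write_in_batches(range(11788), lambda b: sizes.append(len(b)))
    print(total, len(sizes), sizes[-1])
```

Using `islice` avoids handling `StopIteration` by hand: the last batch is simply shorter, and an empty batch ends the loop.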
|
||||
|
||||
### Create MindDataset By MindRecord
|
||||
|
||||
1. Run the run_read.sh script.
|
||||
```bash
|
||||
bash run_read.sh
|
||||
```
|
||||
|
||||
2. The output looks like this:
|
||||
```
|
||||
[INFO] MD(12469,python):2020-05-20-16:26:38.308.797 [mindspore/ccsrc/dataset/util/task.cc:31] operator()] Op launched, OperatorId:0 Thread ID 139702598620928 Started.
|
||||
[INFO] MD(12469,python):2020-05-20-16:26:38.322.433 [mindspore/ccsrc/mindrecord/io/shard_reader.cc:343] ReadAllRowsInShard] Get 11788 records from shard 0 index.
|
||||
[INFO] MD(12469,python):2020-05-20-16:26:38.386.904 [mindspore/ccsrc/mindrecord/io/shard_reader.cc:1058] CreateTasks] Total rows is 11788
|
||||
[INFO] MD(12469,python):2020-05-20-16:26:38.387.068 [mindspore/ccsrc/dataset/util/task.cc:31] operator()] Parallel Op Worker Thread ID 139702590228224 Started.
|
||||
[INFO] MD(12469,python):2020-05-20-16:26:38.387.272 [mindspore/ccsrc/dataset/util/task.cc:31] operator()] Parallel Op Worker Thread ID 139702581044992 Started.
|
||||
[INFO] MD(12469,python):2020-05-20-16:26:38.387.465 [mindspore/ccsrc/dataset/util/task.cc:31] operator()] Parallel Op Worker Thread ID 139702572652288 Started.
|
||||
[INFO] MD(12469,python):2020-05-20-16:26:38.387.617 [mindspore/ccsrc/dataset/util/task.cc:31] operator()] Parallel Op Worker Thread ID 139702564259584 Started.
|
||||
example 0: {'image': array([255, 216, 255, ..., 47, 255, 217], dtype=uint8), 'bbox': array([ 70., 120., 168., 150.], dtype=float32), 'label': array(199, dtype=int32), 'image_filename': array([ 87, 105, 110, 116, 101, 114, 95, 87, 114, 101, 110, 95, 48,
|
||||
49, 49, 54, 95, 49, 56, 57, 56, 51, 52, 46, 106, 112,
|
||||
103], dtype=uint8), 'segmentation_mask': array([137, 80, 78, ..., 66, 96, 130], dtype=uint8), 'label_name': array([ 49, 57, 57, 46, 87, 105, 110, 116, 101, 114, 95, 87, 114,
|
||||
101, 110], dtype=uint8)}
|
||||
example 1: {'image': array([255, 216, 255, ..., 3, 255, 217], dtype=uint8), 'bbox': array([ 51., 51., 235., 322.], dtype=float32), 'label': array(170, dtype=int32), 'image_filename': array([ 77, 111, 117, 114, 110, 105, 110, 103, 95, 87, 97, 114, 98,
|
||||
108, 101, 114, 95, 48, 48, 55, 52, 95, 55, 57, 53, 51,
|
||||
54, 55, 46, 106, 112, 103], dtype=uint8), 'segmentation_mask': array([137, 80, 78, ..., 66, 96, 130], dtype=uint8), 'label_name': array([ 49, 55, 48, 46, 77, 111, 117, 114, 110, 105, 110, 103, 95,
|
||||
87, 97, 114, 98, 108, 101, 114], dtype=uint8)}
|
||||
example 2: {'image': array([255, 216, 255, ..., 35, 255, 217], dtype=uint8), 'bbox': array([ 57., 56., 285., 248.], dtype=float32), 'label': array(148, dtype=int32), 'image_filename': array([ 71, 114, 101, 101, 110, 95, 84, 97, 105, 108, 101, 100, 95,
|
||||
84, 111, 119, 104, 101, 101, 95, 48, 48, 53, 52, 95, 49,
|
||||
53, 52, 57, 51, 56, 46, 106, 112, 103], dtype=uint8), 'segmentation_mask': array([137, 80, 78, ..., 66, 96, 130], dtype=uint8), 'label_name': array([ 49, 52, 56, 46, 71, 114, 101, 101, 110, 95, 116, 97, 105,
|
||||
108, 101, 100, 95, 84, 111, 119, 104, 101, 101], dtype=uint8)}
|
||||
example 3: {'image': array([255, 216, 255, ..., 85, 255, 217], dtype=uint8), 'bbox': array([ 95., 61., 333., 323.], dtype=float32), 'label': array(176, dtype=int32), 'image_filename': array([ 80, 114, 97, 105, 114, 105, 101, 95, 87, 97, 114, 98, 108,
|
||||
101, 114, 95, 48, 49, 48, 53, 95, 49, 55, 50, 57, 56,
|
||||
50, 46, 106, 112, 103], dtype=uint8), 'segmentation_mask': array([137, 80, 78, ..., 66, 96, 130], dtype=uint8), 'label_name': array([ 49, 55, 54, 46, 80, 114, 97, 105, 114, 105, 101, 95, 87,
|
||||
97, 114, 98, 108, 101, 114], dtype=uint8)}
|
||||
...
|
||||
example 11786: {'image': array([255, 216, 255, ..., 199, 255, 217], dtype=uint8), 'bbox': array([180., 61., 153., 162.], dtype=float32), 'label': array(75, dtype=int32), 'image_filename': array([ 71, 114, 101, 101, 110, 95, 74, 97, 121, 95, 48, 48, 55,
|
||||
49, 95, 54, 53, 55, 57, 57, 46, 106, 112, 103], dtype=uint8), 'segmentation_mask': array([137, 80, 78, ..., 66, 96, 130], dtype=uint8), 'label_name': array([ 48, 55, 53, 46, 71, 114, 101, 101, 110, 95, 74, 97, 121],
|
||||
dtype=uint8)}
|
||||
example 11787: {'image': array([255, 216, 255, ..., 127, 255, 217], dtype=uint8), 'bbox': array([ 49., 33., 276., 216.], dtype=float32), 'label': array(27, dtype=int32), 'image_filename': array([ 83, 104, 105, 110, 121, 95, 67, 111, 119, 98, 105, 114, 100,
|
||||
95, 48, 48, 51, 49, 95, 55, 57, 54, 56, 53, 49, 46,
|
||||
106, 112, 103], dtype=uint8), 'segmentation_mask': array([137, 80, 78, ..., 66, 96, 130], dtype=uint8), 'label_name': array([ 48, 50, 55, 46, 83, 104, 105, 110, 121, 95, 67, 111, 119,
|
||||
98, 105, 114, 100], dtype=uint8)}
|
||||
>> total rows: 11788
|
||||
[INFO] MD(12469,python):2020-05-20-16:26:49.582.298 [mindspore/ccsrc/dataset/util/task.cc:128] Join] Watchdog Thread ID 139702607013632 Stopped.
|
||||
```
|
||||
- bbox: the bounding box of the bird in the image, read from bounding_boxes.txt as [x, y, width, height].
|
||||
- image: the raw bytes of the image file, e.g. "data/CUB_200_2011/images/001.Black_footed_Albatross/Black_Footed_Albatross_0001_796111.jpg".
|
||||
- image_filename: the image file name, e.g. "Black_Footed_Albatross_0001_796111.jpg".
|
||||
- label: the class label of the image, an integer in [1, 200].
|
||||
- label_name: the class name corresponding to label, e.g. "016.Painted_Bunting".
|
||||
- segmentation_mask: the raw bytes of the segmentation file, e.g. "data/segmentations/001.Black_footed_Albatross/Black_Footed_Albatross_0001_796111.png".
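
In the log above, the string fields (image_filename, label_name) appear as uint8 arrays, and the image and segmentation_mask fields start with the JPEG and PNG magic bytes respectively. The sketch below shows one way to turn such values back into readable form. It assumes UTF-8 text and plain sequences of byte values; `decode_text` and `sniff_format` are hypothetical helper names, not part of the example or of MindSpore.

```python
def decode_text(values):
    """Turn a sequence of uint8 code units into a str (UTF-8 assumed)."""
    return bytes(values).decode("utf-8")


def sniff_format(data):
    """Guess the container format from the leading magic bytes."""
    head = bytes(data[:8])
    if head.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if head.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    return "unknown"


if __name__ == "__main__":
    # Values copied from "example 0" in the log above.
    image_filename = [87, 105, 110, 116, 101, 114, 95, 87, 114, 101, 110, 95, 48,
                      49, 49, 54, 95, 49, 56, 57, 56, 51, 52, 46, 106, 112, 103]
    label_name = [49, 57, 57, 46, 87, 105, 110, 116, 101, 114, 95, 87, 114,
                  101, 110]
    print(decode_text(image_filename))  # Winter_Wren_0116_189834.jpg
    print(decode_text(label_name))      # 199.Winter_Wren
```

The decoded label_name prefix (199) matches the label value printed for the same row, which is a quick consistency check on the conversion.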
|
|
@ -0,0 +1,33 @@
|
|||
# Copyright 2020 Huawei Technologies Co., Ltd
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
# ============================================================================
|
||||
"""create MindDataset by MindRecord"""
import mindspore.dataset as ds


def create_dataset(data_file):
    """create MindDataset"""
    num_readers = 4
    data_set = ds.MindDataset(dataset_file=data_file,
                              num_parallel_workers=num_readers,
                              shuffle=True)
    index = 0
    for item in data_set.create_dict_iterator():
        print("example {}: {}".format(index, item))
        index += 1
        if index % 1000 == 0:
            print(">> read rows: {}".format(index))
    print(">> total rows: {}".format(index))


if __name__ == '__main__':
    create_dataset('output/CUB_200_2011.mindrecord')
|
|
@ -0,0 +1 @@
|
|||
## The input dataset
|
|
@ -0,0 +1,144 @@
|
|||
# Copyright 2020 Huawei Technologies Co., Ltd
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
# ============================================================================
|
||||
"""get data from Caltech-UCSD Birds-200-2011 and write the data to mindrecord file"""
import os
import numpy as np
from mindspore.mindrecord import FileWriter

CUB_200_2011_DIR = "data/CUB_200_2011"
SEGMENTATION_DIR = "data/segmentations"

MINDRECORD_FILE_NAME = "output/CUB_200_2011.mindrecord"


def get_data_as_dict():
    """get data from dataset"""
    # id : filename
    id_and_filename = {}
    with open(os.path.join(CUB_200_2011_DIR, "images.txt")) as images_txt:
        for line in images_txt:
            # images.txt, get id and filename
            single_images_txt = line.split(" ")
            id_and_filename[int(single_images_txt[0])] = os.path.join(os.path.join(CUB_200_2011_DIR, "images"),
                                                                       single_images_txt[1].replace("\n", ""))

    # id : bounding_box
    id_and_bbox = {}
    with open(os.path.join(CUB_200_2011_DIR, "bounding_boxes.txt")) as bounding_boxes_txt:
        for line in bounding_boxes_txt:
            # bounding_boxes.txt, get id and bounding_box
            single_bounding_boxes_txt = line.split(" ")
            id_and_bbox[int(single_bounding_boxes_txt[0])] = [float(single_bounding_boxes_txt[1]),
                                                              float(single_bounding_boxes_txt[2]),
                                                              float(single_bounding_boxes_txt[3]),
                                                              float(single_bounding_boxes_txt[4])]

    # id : label
    id_and_label = {}
    with open(os.path.join(CUB_200_2011_DIR, "image_class_labels.txt")) as image_class_labels_txt:
        for line in image_class_labels_txt:
            # image_class_labels.txt, get id and label
            single_image_class_labels_txt = line.split(" ")
            id_and_label[int(single_image_class_labels_txt[0])] = int(single_image_class_labels_txt[1])

    # id : segmentation filename
    id_and_segmentation_file_name = {}
    for item in id_and_filename:
        segmentation_filename = id_and_filename[item]
        segmentation_filename = segmentation_filename.replace(os.path.join(CUB_200_2011_DIR, "images"),
                                                              SEGMENTATION_DIR)
        segmentation_filename = segmentation_filename.replace(".jpg", ".png")
        id_and_segmentation_file_name[item] = segmentation_filename

    # label: class
    label_and_class = {}
    with open(os.path.join(CUB_200_2011_DIR, "classes.txt")) as classes_txt:
        for line in classes_txt:
            # classes.txt, get label and class
            single_classes_txt = line.split(" ")
            label_and_class[int(single_classes_txt[0])] = str(single_classes_txt[1]).replace("\n", "")

    assert len(id_and_filename) == len(id_and_bbox)
    assert len(id_and_filename) == len(id_and_label)
    assert len(id_and_filename) == len(id_and_segmentation_file_name)

    print(">> sample id: {}, filename: {}, bbox: {}, label: {}, seg_filename: {}, class: {}"
          .format(1, id_and_filename[1], id_and_bbox[1], id_and_label[1], id_and_segmentation_file_name[1],
                  label_and_class[id_and_label[1]]))

    for item in id_and_filename:
        data = {}
        data["bbox"] = np.asarray(id_and_bbox[item], dtype=np.float32)  # e.g. [60.0, 27.0, 325.0, 304.0]

        # store the raw (still encoded) image bytes
        with open(id_and_filename[item], "rb") as image_file:
            data["image"] = image_file.read()

        data["image_filename"] = os.path.basename(id_and_filename[item])  # e.g. Black_Footed_Albatross_0046_18.jpg

        data["label"] = id_and_label[item]  # 1-200
        data["label_name"] = label_and_class[id_and_label[item]]  # e.g. 177.Prothonotary_Warbler

        # store the raw (still encoded) segmentation bytes
        with open(id_and_segmentation_file_name[item], "rb") as segmentation_file:
            data["segmentation_mask"] = segmentation_file.read()

        yield data


def gen_mindrecord():
    """generate mindrecord according to the schema"""
    fw = FileWriter(MINDRECORD_FILE_NAME)

    schema = {"bbox": {"type": "float32", "shape": [-1]},
              "image": {"type": "bytes"},
              "image_filename": {"type": "string"},
              "label": {"type": "int32"},
              "label_name": {"type": "string"},
              "segmentation_mask": {"type": "bytes"}}
    fw.add_schema(schema, "CUB 200 2011 dataset")

    get_data_iter = get_data_as_dict()

    batch_size = 256
    transform_count = 0
    while True:
        data_list = []
        try:
            for _ in range(batch_size):
                data_list.append(next(get_data_iter))
                transform_count += 1
            fw.write_raw_data(data_list)
            print(">> transformed {} record...".format(transform_count))
        except StopIteration:
            # write the final, possibly partial, batch
            if data_list:
                fw.write_raw_data(data_list)
                print(">> transformed {} record...".format(transform_count))
            break

    fw.commit()


def main():
    # generate mindrecord
    print(">> begin generate mindrecord")
    gen_mindrecord()


if __name__ == "__main__":
    main()
|
|
@ -0,0 +1 @@
|
|||
## The output mindrecord
|
|
@ -0,0 +1,19 @@
|
|||
#!/bin/bash
|
||||
# Copyright 2020 Huawei Technologies Co., Ltd
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
# ============================================================================
|
||||
|
||||
rm -f output/CUB_200_2011.mindrecord*
|
||||
|
||||
python gen_mindrecord.py
|
|
@ -0,0 +1,17 @@
|
|||
#!/bin/bash
|
||||
# Copyright 2020 Huawei Technologies Co., Ltd
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
# ============================================================================
|
||||
|
||||
python create_dataset.py
|