add Caltech-UCSD-Birds-200-2011 to mindrecord example

2020-05-20 16:32:56 +08:00 · 2020-05-20 16:32:56 +08:00 · 4333080b6f
parent 44e3d167f0
commit 4333080b6f
7 changed files with 346 additions and 0 deletions
--- a/example/cv_to_mindrecord/Caltech-UCSD-Birds-200-2011/README.md
+++ b/example/cv_to_mindrecord/Caltech-UCSD-Birds-200-2011/README.md
@@ -0,0 +1,131 @@
+# Guideline to Transfer Caltech-UCSD Birds-200-2011 Dataset to MindRecord
+
+<!-- TOC -->
+
+- [What does the example do](#what-does-the-example-do)
+- [How to use the example to generate MindRecord](#how-to-use-the-example-to-generate-mindrecord)
+    - [Download Caltech-UCSD Birds-200-2011 dataset and unzip](#download-caltech-ucsd-birds-200-2011-dataset-and-unzip)
+    - [Generate MindRecord](#generate-mindrecord)
+    - [Create MindDataset By MindRecord](#create-minddataset-by-mindrecord)
+
+
+<!-- /TOC -->
+
+## What does the example do
+
+This example reads the Caltech-UCSD Birds-200-2011 dataset and converts it to MindRecord without any data preprocessing. You can modify the example or use it as a template for your own conversion.
+
+1.  run.sh: entry script that generates MindRecord.
+    - gen_mindrecord.py: reads the Caltech-UCSD Birds-200-2011 data and converts it to MindRecord.
+2.  run_read.sh: entry script that creates a MindDataset from MindRecord.
+    - create_dataset.py: uses MindDataset to read the MindRecord file and build a dataset.
+
+## How to use the example to generate MindRecord
+
+Download the Caltech-UCSD Birds-200-2011 dataset, convert it to MindRecord, then use MindDataset to read the MindRecord file.
+
+### Download Caltech-UCSD Birds-200-2011 dataset and unzip
+
+1. Download the dataset archives (.tgz).
+    > [Caltech-UCSD Birds-200-2011 dataset download address](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html)  
+    > **1) -> Download -> All Images and Annotations**  
+    > **2) -> Download -> Segmentations**  
+
+2. Unzip the data to the directory example/cv_to_mindrecord/Caltech-UCSD-Birds-200-2011/data.
+    ```
+    tar -zxvf CUB_200_2011.tgz -C {your-mindspore}/example/cv_to_mindrecord/Caltech-UCSD-Birds-200-2011/data/
+    tar -zxvf segmentations.tgz -C {your-mindspore}/example/cv_to_mindrecord/Caltech-UCSD-Birds-200-2011/data/
+    ```
+    - The unzipped data should look like this:
+    ```
+    $ ls {your-mindspore}/example/cv_to_mindrecord/Caltech-UCSD-Birds-200-2011/data/
+    attributes.txt  CUB_200_2011  README.md  segmentations
+    ```
+
+### Generate MindRecord
+
+1. Run the run.sh script.
+    ```bash
+    bash run.sh
+    ```
+
+2. The output looks like this:
+    ```
+    ...
+    >> begin generate mindrecord
+    >> sample id: 1, filename: data/CUB_200_2011/images/001.Black_footed_Albatross/Black_Footed_Albatross_0046_18.jpg, bbox: [60.0, 27.0, 325.0, 304.0], label: 1, seg_filename: data/segmentations/001.Black_footed_Albatross/Black_Footed_Albatross_0046_18.png, class: 001.Black_footed_Albatross
+    [INFO] MD(11253,python):2020-05-20-16:21:42.462.686 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:106] OpenDataFiles] Open shard file successfully.
+    [INFO] MD(11253,python):2020-05-20-16:21:43.147.496 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:680] WriteRawData] Write 256 records successfully.
+    >> transformed 256 record...
+    [INFO] MD(11253,python):2020-05-20-16:21:43.842.372 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:680] WriteRawData] Write 256 records successfully.
+    >> transformed 512 record...
+    [INFO] MD(11253,python):2020-05-20-16:21:44.748.585 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:680] WriteRawData] Write 256 records successfully.
+    >> transformed 768 record...
+    [INFO] MD(11253,python):2020-05-20-16:21:45.736.179 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:680] WriteRawData] Write 256 records successfully.
+    >> transformed 1024 record...
+    ...
+    [INFO] MD(11253,python):2020-05-20-16:22:21.207.820 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:680] WriteRawData] Write 12 records successfully.
+    >> transformed 11788 record...
+    [INFO] MD(11253,python):2020-05-20-16:22:21.210.261 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:227] Commit] Write metadata successfully.
+    [INFO] MD(11253,python):2020-05-20-16:22:21.211.688 [mindspore/ccsrc/mindrecord/io/shard_index_generator.cc:59] Build] Init header from mindrecord file for index successfully.
+    [INFO] MD(11253,python):2020-05-20-16:22:21.236.799 [mindspore/ccsrc/mindrecord/io/shard_index_generator.cc:600] DatabaseWriter] Init index db for shard: 0 successfully.
+    [INFO] MD(11253,python):2020-05-20-16:22:21.964.034 [mindspore/ccsrc/mindrecord/io/shard_index_generator.cc:549] ExecuteTransaction] Insert 11788 rows to index db.
+    [INFO] MD(11253,python):2020-05-20-16:22:21.978.087 [mindspore/ccsrc/mindrecord/io/shard_index_generator.cc:620] DatabaseWriter] Generate index db for shard: 0 successfully.
+    [INFO] ME(11253:139923799271232,MainProcess):2020-05-20-16:22:21.979.634 [mindspore/mindrecord/filewriter.py:313] The list of mindrecord files created are: ['output/CUB_200_2011.mindrecord'], and the list of index files are: ['output/CUB_200_2011.mindrecord.db']
+    ```
+
+3. The generated MindRecord files:
+    ```
+    $ ls output/
+    CUB_200_2011.mindrecord  CUB_200_2011.mindrecord.db  README.md
+    ```
+
+### Create MindDataset By MindRecord
+
+1. Run the run_read.sh script.
+    ```bash
+    bash run_read.sh
+    ```
+
+2. The output looks like this:
+    ```
+    [INFO] MD(12469,python):2020-05-20-16:26:38.308.797 [mindspore/ccsrc/dataset/util/task.cc:31] operator()] Op launched, OperatorId:0 Thread ID 139702598620928 Started.
+    [INFO] MD(12469,python):2020-05-20-16:26:38.322.433 [mindspore/ccsrc/mindrecord/io/shard_reader.cc:343] ReadAllRowsInShard] Get 11788 records from shard 0 index.
+    [INFO] MD(12469,python):2020-05-20-16:26:38.386.904 [mindspore/ccsrc/mindrecord/io/shard_reader.cc:1058] CreateTasks] Total rows is 11788
+    [INFO] MD(12469,python):2020-05-20-16:26:38.387.068 [mindspore/ccsrc/dataset/util/task.cc:31] operator()] Parallel Op Worker Thread ID 139702590228224 Started.
+    [INFO] MD(12469,python):2020-05-20-16:26:38.387.272 [mindspore/ccsrc/dataset/util/task.cc:31] operator()] Parallel Op Worker Thread ID 139702581044992 Started.
+    [INFO] MD(12469,python):2020-05-20-16:26:38.387.465 [mindspore/ccsrc/dataset/util/task.cc:31] operator()] Parallel Op Worker Thread ID 139702572652288 Started.
+    [INFO] MD(12469,python):2020-05-20-16:26:38.387.617 [mindspore/ccsrc/dataset/util/task.cc:31] operator()] Parallel Op Worker Thread ID 139702564259584 Started.
+    example 0: {'image': array([255, 216, 255, ...,  47, 255, 217], dtype=uint8), 'bbox': array([ 70., 120., 168., 150.], dtype=float32), 'label': array(199, dtype=int32), 'image_filename': array([ 87, 105, 110, 116, 101, 114,  95,  87, 114, 101, 110,  95,  48,
+        49,  49,  54,  95,  49,  56,  57,  56,  51,  52,  46, 106, 112,
+       103], dtype=uint8), 'segmentation_mask': array([137,  80,  78, ...,  66,  96, 130], dtype=uint8), 'label_name': array([ 49,  57,  57,  46,  87, 105, 110, 116, 101, 114,  95,  87, 114,
+       101, 110], dtype=uint8)}
+    example 1: {'image': array([255, 216, 255, ...,   3, 255, 217], dtype=uint8), 'bbox': array([ 51.,  51., 235., 322.], dtype=float32), 'label': array(170, dtype=int32), 'image_filename': array([ 77, 111, 117, 114, 110, 105, 110, 103,  95,  87,  97, 114,  98,
+       108, 101, 114,  95,  48,  48,  55,  52,  95,  55,  57,  53,  51,
+        54,  55,  46, 106, 112, 103], dtype=uint8), 'segmentation_mask': array([137,  80,  78, ...,  66,  96, 130], dtype=uint8), 'label_name': array([ 49,  55,  48,  46,  77, 111, 117, 114, 110, 105, 110, 103,  95,
+        87,  97, 114,  98, 108, 101, 114], dtype=uint8)}
+    example 2: {'image': array([255, 216, 255, ...,  35, 255, 217], dtype=uint8), 'bbox': array([ 57.,  56., 285., 248.], dtype=float32), 'label': array(148, dtype=int32), 'image_filename': array([ 71, 114, 101, 101, 110,  95,  84,  97, 105, 108, 101, 100,  95,
+        84, 111, 119, 104, 101, 101,  95,  48,  48,  53,  52,  95,  49,
+        53,  52,  57,  51,  56,  46, 106, 112, 103], dtype=uint8), 'segmentation_mask': array([137,  80,  78, ...,  66,  96, 130], dtype=uint8), 'label_name': array([ 49,  52,  56,  46,  71, 114, 101, 101, 110,  95, 116,  97, 105,
+       108, 101, 100,  95,  84, 111, 119, 104, 101, 101], dtype=uint8)}
+    example 3: {'image': array([255, 216, 255, ...,  85, 255, 217], dtype=uint8), 'bbox': array([ 95.,  61., 333., 323.], dtype=float32), 'label': array(176, dtype=int32), 'image_filename': array([ 80, 114,  97, 105, 114, 105, 101,  95,  87,  97, 114,  98, 108,
+       101, 114,  95,  48,  49,  48,  53,  95,  49,  55,  50,  57,  56,
+        50,  46, 106, 112, 103], dtype=uint8), 'segmentation_mask': array([137,  80,  78, ...,  66,  96, 130], dtype=uint8), 'label_name': array([ 49,  55,  54,  46,  80, 114,  97, 105, 114, 105, 101,  95,  87,
+        97, 114,  98, 108, 101, 114], dtype=uint8)}
+    ...
+    example 11786: {'image': array([255, 216, 255, ..., 199, 255, 217], dtype=uint8), 'bbox': array([180.,  61., 153., 162.], dtype=float32), 'label': array(75, dtype=int32), 'image_filename': array([ 71, 114, 101, 101, 110,  95,  74,  97, 121,  95,  48,  48,  55,
+        49,  95,  54,  53,  55,  57,  57,  46, 106, 112, 103], dtype=uint8), 'segmentation_mask': array([137,  80,  78, ...,  66,  96, 130], dtype=uint8), 'label_name': array([ 48,  55,  53,  46,  71, 114, 101, 101, 110,  95,  74,  97, 121],
+      dtype=uint8)}
+    example 11787: {'image': array([255, 216, 255, ..., 127, 255, 217], dtype=uint8), 'bbox': array([ 49.,  33., 276., 216.], dtype=float32), 'label': array(27, dtype=int32), 'image_filename': array([ 83, 104, 105, 110, 121,  95,  67, 111, 119,  98, 105, 114, 100,
+        95,  48,  48,  51,  49,  95,  55,  57,  54,  56,  53,  49,  46,
+       106, 112, 103], dtype=uint8), 'segmentation_mask': array([137,  80,  78, ...,  66,  96, 130], dtype=uint8), 'label_name': array([ 48,  50,  55,  46,  83, 104, 105, 110, 121,  95,  67, 111, 119,
+        98, 105, 114, 100], dtype=uint8)}
+    >> total rows: 11788
+    [INFO] MD(12469,python):2020-05-20-16:26:49.582.298 [mindspore/ccsrc/dataset/util/task.cc:128] Join] Watchdog Thread ID 139702607013632 Stopped.
+    ```
+    - bbox: the bounding box of the bird in the image, as read from bounding_boxes.txt (x, y, width, height).
+    - image: the raw bytes of an image file such as "data/CUB_200_2011/images/001.Black_footed_Albatross/Black_Footed_Albatross_0001_796111.jpg".
+    - image_filename: the image file name, such as "Black_Footed_Albatross_0001_796111.jpg".
+    - label: the class label, an integer in [1, 200].
+    - label_name: the class name corresponding to the label, such as "016.Painted_Bunting".
+    - segmentation_mask: the raw bytes of a segmentation file such as "data/segmentations/001.Black_footed_Albatross/Black_Footed_Albatross_0001_796111.png".
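Note on reading the output above: the string fields (`image_filename`, `label_name`) and the byte fields (`image`, `segmentation_mask`) are printed as uint8 arrays. The following is a minimal sketch of how they could be turned back into readable values. The helper names `decode_text` and `sniff_image_format` are hypothetical and are not part of this commit; the sketch only assumes that each field can be iterated as a sequence of integers in 0..255, which holds for plain lists and 1-D NumPy arrays.

```python
# Hypothetical helpers for inspecting rows read back from the MindRecord file.
# They are illustrative only and are not part of the example scripts.

def decode_text(values):
    """Turn a sequence of uint8 values (for example a 1-D array) into a str."""
    return bytes(bytearray(int(v) for v in values)).decode("utf-8")


def sniff_image_format(values):
    """Guess the image container format from the leading magic bytes."""
    head = bytes(bytearray(int(v) for v in values[:4]))
    if head[:2] == b"\xff\xd8":   # JPEG files start with FF D8
        return "jpeg"
    if head == b"\x89PNG":        # PNG files start with 89 50 4E 47
        return "png"
    return "unknown"


# Values copied from 'label_name' of "example 0" in the README output.
label_name = [49, 57, 57, 46, 87, 105, 110, 116, 101, 114, 95, 87, 114, 101, 110]
print(decode_text(label_name))                    # 199.Winter_Wren

# The README output shows 255, 216, 255 for 'image' and 137, 80, 78 for
# 'segmentation_mask'; the fourth PNG byte (71) is added here by assumption.
print(sniff_image_format([255, 216, 255]))        # jpeg
print(sniff_image_format([137, 80, 78, 71]))      # png
```

The decoded `label_name` agrees with the `label` value of 199 printed for the same row.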
--- a/example/cv_to_mindrecord/Caltech-UCSD-Birds-200-2011/create_dataset.py
+++ b/example/cv_to_mindrecord/Caltech-UCSD-Birds-200-2011/create_dataset.py
@@ -0,0 +1,33 @@
+# Copyright 2020 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+"""create MindDataset by MindRecord"""
+import mindspore.dataset as ds
+
+def create_dataset(data_file):
+    """create MindDataset"""
+    num_readers = 4
+    data_set = ds.MindDataset(dataset_file=data_file,
+                              num_parallel_workers=num_readers,
+                              shuffle=True)
+    index = 0
+    for item in data_set.create_dict_iterator():
+        print("example {}: {}".format(index, item))
+        index += 1
+        if index % 1000 == 0:
+            print(">> read rows: {}".format(index))
+    print(">> total rows: {}".format(index))
+
+if __name__ == '__main__':
+    create_dataset('output/CUB_200_2011.mindrecord')
--- a/example/cv_to_mindrecord/Caltech-UCSD-Birds-200-2011/data/README.md
+++ b/example/cv_to_mindrecord/Caltech-UCSD-Birds-200-2011/data/README.md
@@ -0,0 +1 @@
+## The input dataset
--- a/example/cv_to_mindrecord/Caltech-UCSD-Birds-200-2011/gen_mindrecord.py
+++ b/example/cv_to_mindrecord/Caltech-UCSD-Birds-200-2011/gen_mindrecord.py
@@ -0,0 +1,144 @@
+# Copyright 2020 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+"""get data from Caltech-UCSD Birds-200-2011 and write the data to mindrecord file"""
+import os
+import numpy as np
+from mindspore.mindrecord import FileWriter
+
+CUB_200_2011_DIR = "data/CUB_200_2011"
+SEGMENTATION_DIR = "data/segmentations"
+
+MINDRECORD_FILE_NAME = "output/CUB_200_2011.mindrecord"
+
+def get_data_as_dict():
+    """get data from dataset"""
+    # id : filename
+    id_and_filename = {}
+    images_txt = open(os.path.join(CUB_200_2011_DIR, "images.txt"))
+    for line in images_txt:
+        # images.txt, get id and filename
+        single_images_txt = line.split(" ")
+        id_and_filename[int(single_images_txt[0])] = os.path.join(os.path.join(CUB_200_2011_DIR, "images"),
+                                                                  single_images_txt[1].replace("\n", ""))
+    images_txt.close()
+
+    # id : bounding_box
+    id_and_bbox = {}
+    bounding_boxes_txt = open(os.path.join(CUB_200_2011_DIR, "bounding_boxes.txt"))
+    for line in bounding_boxes_txt:
+        # bounding_boxes.txt, get id and bounding_box
+        single_bounding_boxes_txt = line.split(" ")
+        id_and_bbox[int(single_bounding_boxes_txt[0])] = [float(single_bounding_boxes_txt[1]),
+                                                          float(single_bounding_boxes_txt[2]),
+                                                          float(single_bounding_boxes_txt[3]),
+                                                          float(single_bounding_boxes_txt[4])]
+    bounding_boxes_txt.close()
+
+    # id : label
+    id_and_label = {}
+    image_class_labels_txt = open(os.path.join(CUB_200_2011_DIR, "image_class_labels.txt"))
+    for line in image_class_labels_txt:
+        # image_class_labels.txt, get id and label
+        single_image_class_labels_txt = line.split(" ")
+        id_and_label[int(single_image_class_labels_txt[0])] = int(single_image_class_labels_txt[1])
+    image_class_labels_txt.close()
+
+    # id : segmentation filename
+    id_and_segmentation_file_name = {}
+    for item in id_and_filename:
+        segmentation_filename = id_and_filename[item]
+        segmentation_filename = segmentation_filename.replace(os.path.join(CUB_200_2011_DIR, "images"),
+                                                              SEGMENTATION_DIR)
+        segmentation_filename = segmentation_filename.replace(".jpg", ".png")
+        id_and_segmentation_file_name[item] = segmentation_filename
+
+    # label: class
+    label_and_class = {}
+    classes_txt = open(os.path.join(CUB_200_2011_DIR, "classes.txt"))
+    for line in classes_txt:
+        # classes.txt, get label and class
+        single_classes_txt = line.split(" ")
+        label_and_class[int(single_classes_txt[0])] = str(single_classes_txt[1]).replace("\n", "")
+    classes_txt.close()
+
+    assert len(id_and_filename) == len(id_and_bbox)
+    assert len(id_and_filename) == len(id_and_label)
+    assert len(id_and_filename) == len(id_and_segmentation_file_name)
+
+    print(">> sample id: {}, filename: {}, bbox: {}, label: {}, seg_filename: {}, class: {}"
+          .format(1, id_and_filename[1], id_and_bbox[1], id_and_label[1], id_and_segmentation_file_name[1],
+                  label_and_class[id_and_label[1]]))
+
+    for item in id_and_filename:
+        data = {}
+        data["bbox"] = np.asarray(id_and_bbox[item], dtype=np.float32)  # [60.0, 27.0, 325.0, 304.0]
+
+        image_file = open(id_and_filename[item], "rb")
+        image_bytes = image_file.read()
+        image_file.close()
+        data["image"] = image_bytes
+
+        image_filename = os.path.basename(id_and_filename[item])
+        data["image_filename"] = image_filename  # Black_Footed_Albatross_0046_18.jpg
+
+        data["label"] = id_and_label[item]  # 1-200
+        data["label_name"] = label_and_class[id_and_label[item]]  # 177.Prothonotary_Warbler
+
+        segmentation_file = open(id_and_segmentation_file_name[item], "rb")
+        segmentation_bytes = segmentation_file.read()
+        segmentation_file.close()
+        data["segmentation_mask"] = segmentation_bytes
+
+        yield data
+
+def gen_mindrecord():
+    """generate mindrecord according to the schema"""
+    fw = FileWriter(MINDRECORD_FILE_NAME)
+
+    schema = {"bbox": {"type": "float32", "shape": [-1]},
+              "image": {"type": "bytes"},
+              "image_filename": {"type": "string"},
+              "label": {"type": "int32"},
+              "label_name": {"type": "string"},
+              "segmentation_mask": {"type": "bytes"}}
+    fw.add_schema(schema, "CUB 200 2011 dataset")
+
+    get_data_iter = get_data_as_dict()
+
+    batch_size = 256
+    transform_count = 0
+    while True:
+        data_list = []
+        try:
+            for _ in range(batch_size):
+                data_list.append(next(get_data_iter))
+                transform_count += 1
+            fw.write_raw_data(data_list)
+            print(">> transformed {} record...".format(transform_count))
+        except StopIteration:
+            if data_list:
+                fw.write_raw_data(data_list)
+                print(">> transformed {} record...".format(transform_count))
+            break
+
+    fw.commit()
+
+def main():
+    # generate mindrecord
+    print(">> begin generate mindrecord")
+    gen_mindrecord()
+
+if __name__ == "__main__":
+    main()
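Note on the write loop in `gen_mindrecord()`: it drains the generator in batches of 256 and flushes the final partial batch when `StopIteration` is raised. The same batching could be sketched with `itertools.islice`, as below. This is an illustrative alternative, not the code in this commit; `batched` and `write_in_batches` are hypothetical names, and the `write` callback stands in for `fw.write_raw_data`.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to batch_size items until the iterable is exhausted."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def write_in_batches(records, write, batch_size=256):
    """Pass each batch to write() and return the number of records written."""
    count = 0
    for batch in batched(records, batch_size):
        write(batch)  # stands in for fw.write_raw_data(batch)
        count += len(batch)
    return count


# Stand-in data: 11788 dummy records, the row count reported in the logs.
sizes = []
total = write_in_batches(range(11788), lambda batch: sizes.append(len(batch)))
print(total, len(sizes), sizes[-1])  # 11788 47 12
```

With 11788 records this gives 46 full batches plus a last batch of 12, which matches the "Write 12 records successfully" line in the README log.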
--- a/example/cv_to_mindrecord/Caltech-UCSD-Birds-200-2011/output/README.md
+++ b/example/cv_to_mindrecord/Caltech-UCSD-Birds-200-2011/output/README.md
@@ -0,0 +1 @@
+## Output the mindrecord
--- a/example/cv_to_mindrecord/Caltech-UCSD-Birds-200-2011/run.sh
+++ b/example/cv_to_mindrecord/Caltech-UCSD-Birds-200-2011/run.sh
@@ -0,0 +1,19 @@
+#!/bin/bash
+# Copyright 2020 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+rm -f output/CUB_200_2011.mindrecord*
+
+python gen_mindrecord.py
--- a/example/cv_to_mindrecord/Caltech-UCSD-Birds-200-2011/run_read.sh
+++ b/example/cv_to_mindrecord/Caltech-UCSD-Birds-200-2011/run_read.sh
@@ -0,0 +1,17 @@
+#!/bin/bash
+# Copyright 2020 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+python create_dataset.py