History

heleiwang b111b85828 fix log error		2020-05-25 18:40:53 +08:00
..
citeseer	Support processing GNN data	2020-05-22 14:15:25 +08:00
cora	Support processing GNN data	2020-05-22 14:15:25 +08:00
README.md	Support processing GNN data	2020-05-22 14:15:25 +08:00
read_citeseer.sh	Support processing GNN data	2020-05-22 14:15:25 +08:00
read_cora.sh	Support processing GNN data	2020-05-22 14:15:25 +08:00
reader.py	Support processing GNN data	2020-05-22 14:15:25 +08:00
write_citeseer.sh	fix log error	2020-05-25 18:40:53 +08:00
write_cora.sh	fix log error	2020-05-25 18:40:53 +08:00
writer.py	Support processing GNN data	2020-05-22 14:15:25 +08:00

README.md

Guideline to Efficiently Generating MindRecord

What does the example do
Example test for Cora
How to use the example for other dataset

What does the example do

This example provides an efficient way to generate MindRecord. Users only need to define the parallel granularity of training data reading and the data reading function of a single task. That is, they can efficiently convert the user's training data into MindRecord.

write_cora.sh: entry script, users need to modify parameters according to their own training data.
writer.py: main script, called by write_cora.sh, it mainly reads user training data in parallel and generates MindRecord.
cora/mr_api.py: uers define their own parallel granularity of training data reading and single task reading function through the cora.

Example test for Cora

Download and prepare the Cora dataset as required.

Cora dataset download address

Edit write_cora.sh and modify the parameters

--mindrecord_file: output MindRecord file.
--mindrecord_partitions: the partitions for MindRecord.

Run the bash script
```
bash write_cora.sh
```

How to use the example for other dataset

Create work space

Assume the dataset name is 'xyz'

Create work space from cora

cd ${your_mindspore_home}/example/graph_to_mindrecord
cp -r cora xyz

Implement data generator

Edit dictionary data generator.

Edit file

cd ${your_mindspore_home}/example/graph_to_mindrecord
vi xyz/mr_api.py

Two API, 'mindrecord_task_number' and 'mindrecord_dict_data', must be implemented.

'mindrecord_task_number()' returns number of tasks. Return 1 if data row is generated serially. Return N if generator can be split into N parallel-run tasks.
'mindrecord_dict_data(task_id)' yields dictionary data row by row. 'task_id' is 0..N-1, if N is return value of mindrecord_task_number()

Run data generator

run python script

cd ${your_mindspore_home}/example/graph_to_mindrecord
python writer.py --mindrecord_script xyz [...]

You can put this command in script write_xyz.sh for easy execution