!21889 Add per batch size for the GPU training

Merge pull request !21889 from huangxinjing/code_docs_fix_per_batch
i-robot 2021-08-16 08:38:37 +00:00 committed by Gitee
commit 91a84ce0db
2 changed files with 7 additions and 4 deletions


@@ -179,12 +179,13 @@ https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.
The script launches the GPU training through `mpirun`; the user can run the following command on any machine to start training.
```bash
bash scripts/run_distributed_train_gpu.sh RANK_SIZE HOSTFILE DATASET MOD
bash scripts/run_distributed_train_gpu.sh RANK_SIZE HOSTFILE DATASET PER_BATCH MOD
```
- RANK_SIZE: The total number of devices, for example 8, 16, 32 ...
- HOSTFILE: A text file that describes the host IPs and their device counts. Please see our [tutorial](https://www.mindspore.cn/docs/programming_guide/en/master/distributed_training_gpu.html) or [OpenMPI](https://www.open-mpi.org/) for more details.
- DATASET: The path to the parent directory of the MindRecord files. For example: `/home/work/mindrecord/`.
- PER_BATCH: The batch size for each data-parallel way.
- MODE: Can be `2.6B`, `13B` or `200B`.
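For reference, a hypothetical invocation with the new `PER_BATCH` argument might look like the sketch below; the hostfile contents, host address, and batch size are illustrative values, not part of this change.

```bash
# Illustrative OpenMPI hostfile (hostfile_8p): one host exposing 8 GPU slots
#   10.0.0.1 slots=8
# Launch 8-way GPU training with a per data-parallel-way batch size of 16 (example values)
bash scripts/run_distributed_train_gpu.sh 8 hostfile_8p /home/work/mindrecord/ 16 2.6B
```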
### Incremental Training


@@ -16,8 +16,8 @@
echo "=============================================================================================================="
echo "Please run the script as: "
echo "bash run_distributed_train_gpu.sh RANK_SIZE HOSTFILE DATASET MODE"
echo "for example: bash run_distributed_train_gpu.sh 16 hostfile_16p /mass_dataset/train_data/ 2.6B"
echo "bash run_distributed_train_gpu.sh RANK_SIZE HOSTFILE DATASET PER_BATCH MODE"
echo "for example: bash run_distributed_train_gpu.sh 16 hostfile_16p /mass_dataset/train_data/ 16 2.6B"
echo "It is better to use absolute path."
echo "=============================================================================================================="
@@ -26,7 +26,8 @@ self_path=$(dirname "${script_self}")
RANK_SIZE=$1
HOSTFILE=$2
DATASET=$3
MODE=$4
PER_BATCH=$4
MODE=$5
mpirun --allow-run-as-root -x PATH -x LD_LIBRARY_PATH -x PYTHONPATH -x NCCL_DEBUG -x GLOG_v -n $RANK_SIZE --hostfile $HOSTFILE --output-filename log_output --merge-stderr-to-stdout \
python -s ${self_path}/../train.py \
@@ -35,4 +36,5 @@ mpirun --allow-run-as-root -x PATH -x LD_LIBRARY_PATH -x PYTHONPATH -x NCCL_DEBUG
--device_target="GPU" \
--data_url=$DATASET \
--mode=$MODE \
--per_batch_size=$PER_BATCH \
--run_type=train > train_log.txt 2>&1 &
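As a rough sanity check on the new argument (an assumption about the configuration, not something stated by this patch): when the model-parallel degree is 1, every rank forms its own data-parallel way, so the effective global batch size is approximately RANK_SIZE multiplied by PER_BATCH.

```bash
# Hedged sketch: estimate the effective global batch size for a pure data-parallel run.
# Assumes a model-parallel degree of 1, so every rank is one data-parallel way.
RANK_SIZE=8    # example value
PER_BATCH=16   # example value
echo "Approximate global batch size: $((RANK_SIZE * PER_BATCH))"  # prints 128 for these example values
```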