forked from mindspore-Ecosystem/mindspore
!21889 Add per batch size for the GPU training
Merge pull request !21889 from huangxinjing/code_docs_fix_per_batch
commit 91a84ce0db
@@ -179,12 +179,13 @@ https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.
 The script will launch the GPU training through `mpirun`; the user can run the following command on any machine to start training.
 
 ```bash
-bash scripts/run_distributed_train_gpu.sh RANK_SIZE HOSTFILE DATASET MODE
+bash scripts/run_distributed_train_gpu.sh RANK_SIZE HOSTFILE DATASET PER_BATCH MODE
 ```
 
 - RANK_SIZE: The total number of devices. For example, 8, 16, 32 ...
 - HOSTFILE: A text file that describes the host IPs and their devices. Please see our [tutorial](https://www.mindspore.cn/docs/programming_guide/en/master/distributed_training_gpu.html) or [OpenMPI](https://www.open-mpi.org/) for more details.
 - DATASET: The path to the parent directory of the mindrecord files. For example: `/home/work/mindrecord/`.
+- PER_BATCH: The batch size of each data-parallel way.
 - MODE: Can be `2.6B`, `13B` or `200B`.
 
 ### Incremental Training
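As a minimal sketch of the new calling convention described above (the hostfile contents, host names, and dataset path below are illustrative assumptions, not taken from this commit):

```bash
# Assumed OpenMPI hostfile "hostfile_16p": one entry per host with its GPU slots,
# e.g. two 8-GPU machines (host names are placeholders)
#   node-0 slots=8
#   node-1 slots=8

# New argument order: RANK_SIZE HOSTFILE DATASET PER_BATCH MODE
bash scripts/run_distributed_train_gpu.sh 16 hostfile_16p /home/work/mindrecord/ 16 2.6B
```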
@@ -16,8 +16,8 @@
 
 echo "=============================================================================================================="
 echo "Please run the script as: "
-echo "bash run_distributed_train_gpu.sh RANK_SIZE HOSTFILE DATASET MODE"
-echo "for example: bash run_distributed_train_gpu.sh 16 hostfile_16p /mass_dataset/train_data/ 2.6B"
+echo "bash run_distributed_train_gpu.sh RANK_SIZE HOSTFILE DATASET PER_BATCH MODE"
+echo "for example: bash run_distributed_train_gpu.sh 16 hostfile_16p /mass_dataset/train_data/ 16 2.6B"
 echo "It is better to use absolute path."
 echo "=============================================================================================================="
 
@@ -26,7 +26,8 @@ self_path=$(dirname "${script_self}")
 RANK_SIZE=$1
 HOSTFILE=$2
 DATASET=$3
-MODE=$4
+PER_BATCH=$4
+MODE=$5
 
 mpirun --allow-run-as-root -x PATH -x LD_LIBRARY_PATH -x PYTHONPATH -x NCCL_DEBUG -x GLOG_v -n $RANK_SIZE --hostfile $HOSTFILE --output-filename log_output --merge-stderr-to-stdout \
 python -s ${self_path}/../train.py \
@@ -35,4 +36,5 @@ mpirun --allow-run-as-root -x PATH -x LD_LIBRARY_PATH -x PYTHONPATH -x NCCL_DEBUG -x GLOG_v -n $RANK_SIZE --hostfile $HOSTFILE --output-filename log_output --merge-stderr-to-stdout \
 --device_target="GPU" \
 --data_url=$DATASET \
 --mode=$MODE \
+--per_batch_size=$PER_BATCH \
 --run_type=train > train_log.txt 2>&1 &
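Since PER_BATCH is the batch size of each data-parallel way, the effective global batch size grows with the number of data-parallel ways. A rough sanity check, assuming a purely data-parallel run in which every rank is one data-parallel way (hypothetical values; with model parallelism, divide RANK_SIZE by the model-parallel width first):

```bash
# Illustrative only: 16 ranks x per-batch 16 -> 256 samples per optimizer step
RANK_SIZE=16
PER_BATCH=16
echo "global batch size: $((RANK_SIZE * PER_BATCH))"
```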