modify readme for deepfm

2020-08-26 15:16:09 +08:00
parent b9a2e771c6
commit 4e56132b32
3 changed files with 259 additions and 115 deletions
--- a/model_zoo/official/recommend/deepfm/README.md
+++ b/model_zoo/official/recommend/deepfm/README.md
@@ -1,147 +1,287 @@
-# DeepFM Description
+# Contents

-This is an example of training DeepFM with Criteo dataset in MindSpore.
-
-[Paper](https://arxiv.org/pdf/1703.04247.pdf) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, Xiuqiang He
+- [DeepFM Description](#deepfm-description)
+- [Model Architecture](#model-architecture)
+- [Dataset](#dataset)
+- [Environment Requirements](#environment-requirements)
+- [Quick Start](#quick-start)    
+- [Script Description](#script-description)
+    - [Script and Sample Code](#script-and-sample-code)
+    - [Script Parameters](#script-parameters)
+    - [Training Process](#training-process)
+        - [Training](#training)
+        - [Distributed Training](#distributed-training)  
+    - [Evaluation Process](#evaluation-process)
+        - [Evaluation](#evaluation)
+- [Model Description](#model-description)
+    - [Performance](#performance)  
+        - [Evaluation Performance](#evaluation-performance)
+        - [Inference Performance](#inference-performance)
+- [Description of Random Situation](#description-of-random-situation)
+- [ModelZoo Homepage](#modelzoo-homepage)


-# Model architecture
+# [DeepFM Description](#contents)

-The overall network architecture of DeepFM is show below:
+Learning sophisticated feature interactions behind user behaviors is critical in maximizing CTR for recommender systems. Despite great progress, existing methods seem to have a strong bias towards low- or high-order interactions, or require expert feature engineering. The DeepFM paper shows that it is possible to derive an end-to-end learning model that emphasizes both low- and high-order feature interactions. The proposed model, DeepFM, combines the power of factorization machines for recommendation and deep learning for feature learning in a new neural network architecture.

-[Link](https://arxiv.org/pdf/1703.04247.pdf)
+[Paper](https://arxiv.org/abs/1703.04247): Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, Xiuqiang He. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction.

+# [Model Architecture](#contents)

-# Requirements
- Install [MindSpore](https://www.mindspore.cn/install/en).
- Download the criteo dataset for pre-training. Extract and clean text in the dataset with [WikiExtractor](https://github.com/attardi/wikiextractor). Convert the dataset to TFRecord format and move the files to a specified path.
+DeepFM consists of two components. The FM component is a factorization machine, which was proposed to learn feature interactions for recommendation. The deep component is a feed-forward neural network, which is used to learn high-order feature interactions.
+The FM and deep components share the same raw input feature vector, which enables DeepFM to learn low- and high-order feature interactions simultaneously from the raw input features.
+
+# [Dataset](#contents)
+
+- [1] A dataset used in Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, Xiuqiang He. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction[J]. 2017.
+  
+
+# [Environment Requirements](#contents)
+
+- Hardware (Ascend/GPU)
+  - Prepare a hardware environment with an Ascend or GPU processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get the resources.
+- Framework
+  - [MindSpore](https://www.mindspore.cn/install/en)
 - For more information, please check the resources below：
  - [MindSpore tutorials](https://www.mindspore.cn/tutorial/zh-CN/master/index.html) 
  - [MindSpore API](https://www.mindspore.cn/api/zh-CN/master/index.html)

-# Script description

-## Script and sample code

-```shell
-├── deepfm       
-  ├── README.md                      
-  ├── scripts 
-  │   ├──run_distribute_train.sh                
-  │   ├──run_distribute_train_gpu.sh
-  │   ├──run_standalone_train.sh                    
-  │   ├──run_eval.sh                   
-  ├── src
-  │   ├──__init__.py                     
-  │   ├──config.py                     
-  │   ├──dataset.py
-  │   ├──callback.py                                    
-  │   ├──deepfm.py
-  ├── train.py
-  ├── eval.py
+# [Quick Start](#contents)
+
+After installing MindSpore via the official website, you can start training and evaluation as follows: 
+
+- running on Ascend
+
+  ```
+  # run training example
+  python train.py \
+    --dataset_path='dataset/train' \
+    --ckpt_path='./checkpoint' \
+    --eval_file_name='auc.log' \
+    --loss_file_name='loss.log' \
+    --device_target='Ascend' \
+    --do_eval=True > ms_log/output.log 2>&1 &
+  
+  # run distributed training example
+  sh scripts/run_distribute_train.sh 8 /dataset_path /rank_table_8p.json
+  
+  # run evaluation example
+  python eval.py \
+    --dataset_path='dataset/test' \
+    --checkpoint_path='./checkpoint/deepfm.ckpt' \
+    --device_target='Ascend' > ms_log/eval_output.log 2>&1 &
+  # OR
+  sh scripts/run_eval.sh 0 Ascend /dataset_path /checkpoint_path/deepfm.ckpt
+  ```
+
+  For distributed training, an HCCL configuration file in JSON format needs to be created in advance.
+
+  Please follow the instructions in the link below:
+
+  https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools
+
+- running on GPU
+
+  To run on GPU, please change `device_target` from `Ascend` to `GPU` in the configuration file `src/config.py`.
+
+  ```
+  # run training example
+  python train.py \
+    --dataset_path='dataset/train' \
+    --ckpt_path='./checkpoint' \
+    --eval_file_name='auc.log' \
+    --loss_file_name='loss.log' \
+    --device_target='GPU' \
+    --do_eval=True > ms_log/output.log 2>&1 &
+  
+  # run distributed training example
+  sh scripts/run_distribute_train_gpu.sh 8 /dataset_path
+  
+  # run evaluation example
+  python eval.py \
+    --dataset_path='dataset/test' \
+    --checkpoint_path='./checkpoint/deepfm.ckpt' \
+    --device_target='GPU' > ms_log/eval_output.log 2>&1 &
+  # OR
+  sh scripts/run_eval.sh 0 GPU /dataset_path /checkpoint_path/deepfm.ckpt
+  ```
+
+# [Script Description](#contents)
+
+## [Script and Sample Code](#contents)
+
+```
+.
+└─deepfm
+  ├─README.md
+  ├─scripts
+  │ ├─run_standalone_train.sh         # launch standalone training (1p) on Ascend or GPU
+  │ ├─run_distribute_train.sh         # launch distributed training (8p) on Ascend
+  │ ├─run_distribute_train_gpu.sh     # launch distributed training (8p) on GPU
+  │ └─run_eval.sh                     # launch evaluation on Ascend or GPU
+  ├─src
+  │ ├─__init__.py                     # python init file
+  │ ├─config.py                       # parameter configuration
+  │ ├─callback.py                     # define callback function
+  │ ├─deepfm.py                       # deepfm network
+  │ └─dataset.py                      # create dataset for deepfm
+  ├─eval.py                           # eval net
+  └─train.py                          # train net
 ```

-## Training process
+## [Script Parameters](#contents)

-### Usage
+Parameters for both training and evaluation can be set in `src/config.py`.

- sh run_distribute_train.sh [DEVICE_NUM] [DATASET_PATH] [RANK_TABLE_FILE]
- sh run_distribute_train_gpu.sh [DEVICE_NUM] [DATASET_PATH]
- sh run_standalone_train.sh [DEVICE_ID] [DEVICE_TARGET] [DATASET_PATH]
- python train.py --dataset_path [DATASET_PATH] --device_target [DEVICE_TARGET]
-
-### Launch
-
-``` 
-# distribute training example
-  sh scripts/run_distribute_train.sh 8 /opt/dataset/criteo /opt/mindspore_hccl_file.json
-  sh scripts/run_distribute_train_gpu.sh 8 /opt/dataset/criteo
-# standalone training example
-  sh scripts/run_standalone_train.sh 0 Ascend /opt/dataset/criteo
-  or
-  python train.py --dataset_path /opt/dataset/criteo --device_target Ascend > output.log 2>&1 &
-```
-
-### Result
-
-Training result will be stored in the example path. 
-Checkpoints will be stored at `./checkpoint` by default, 
-and training log  will be redirected to `./output.log` by default,
-and loss log will be redirected to `./loss.log` by default,
-and eval log will be redirected to `./auc.log` by default. 
+- train parameters
+  ```
+  optional arguments:
+  -h, --help            show this help message and exit
+  --dataset_path DATASET_PATH
+                        Dataset path
+  --ckpt_path CKPT_PATH
+                        Checkpoint path
+  --eval_file_name EVAL_FILE_NAME
+                        Auc log file path. Default: "./auc.log"
+  --loss_file_name LOSS_FILE_NAME
+                        Loss log file path. Default: "./loss.log"
+  --do_eval DO_EVAL     Do evaluation or not, only support "True" or "False". Default: "True"
+  --device_target DEVICE_TARGET
+                        Ascend or GPU. Default: Ascend
+  ```
+- eval parameters
+  ```
+  optional arguments:
+  -h, --help            show this help message and exit
+  --checkpoint_path CHECKPOINT_PATH
+                        Checkpoint file path
+  --dataset_path DATASET_PATH
+                        Dataset path
+  --device_target DEVICE_TARGET
+                        Ascend or GPU. Default: Ascend
+  ```


-## Eval process
+## [Training Process](#contents)

-### Usage
+### Training 

- sh run_eval.sh [DEVICE_ID] [DEVICE_TARGET] [DATASET_PATH] [CHECKPOINT_PATH]
+- running on Ascend

-### Launch
+  ```
+  python train.py \
+    --dataset_path='dataset/train' \
+    --ckpt_path='./checkpoint' \
+    --eval_file_name='auc.log' \
+    --loss_file_name='loss.log' \
+    --device_target='Ascend' \
+    --do_eval=True > ms_log/output.log 2>&1 &
+  ```
+  
+  The python command above runs in the background; you can view the results in the file `ms_log/output.log`.
+  
+  After training, you'll get some checkpoint files under the `./checkpoint` folder by default. The loss values are saved in the `loss.log` file.
+  
+  ```
+  2020-05-27 15:26:29 epoch: 1 step: 41257, loss is 0.498953253030777
+  2020-05-27 15:32:32 epoch: 2 step: 41257, loss is 0.45545706152915955
+  ...
+  ```
+  
+  The model checkpoint will be saved in the directory given by `--ckpt_path` (`./checkpoint` in the command above).

-``` 
-# infer example
-    sh scripts/run_eval.sh 0 Ascend ~/criteo/eval/ ~/train/deepfm-15_41257.ckpt
-```
+- running on GPU
+  To do.

-> checkpoint can be produced in training process. 
+### Distributed Training

-### Result
+- running on Ascend

-Inference result will be stored in the example path, you can find result like the followings in `auc.log`. 
+  ```
+  sh scripts/run_distribute_train.sh 8 /dataset_path /rank_table_8p.json
+  ```
+  
+  The above shell script runs distributed training in the background. You can view the results in the file `log[X]/output.log`. The loss values are saved in the `loss.log` file.
+  

-``` 
-2020-05-27 20:51:35 AUC: 0.80577889065281, eval time: 35.55999s.
-```
+- running on GPU
+  To do.

-# Model description

-## Learning Rate
+## [Evaluation Process](#contents)

-| Number of Devices      | Learning Rate      |
-| ---------------------- | ------------------ |
-| 1                      | 1e-5               |
-| 8                      | 1e-4               |
+### Evaluation

-> Change the learning rate at src/config.py accordingly.
+- evaluation on dataset when running on Ascend

-## Performance
+  Before running the command below, please check the checkpoint path used for evaluation.
+  
+  ```
+  python eval.py \
+    --dataset_path='dataset/test' \
+    --checkpoint_path='./checkpoint/deepfm.ckpt' \
+    --device_target='Ascend' > ms_log/eval_output.log 2>&1 &
+  # OR
+  sh scripts/run_eval.sh 0 Ascend /dataset_path /checkpoint_path/deepfm.ckpt
+  ```
+  
+  The above python command runs in the background. You can view the results in the file `ms_log/eval_output.log`. The accuracy (AUC) is saved in the `auc.log` file.
+  
+  ```
+  {'result': {'AUC': 0.8057789065281104, 'eval_time': 35.64779996871948}}
+  ```

-### Training Performance

-| Parameters                 | DeepFM                                                |
-| -------------------------- | ------------------------------------------------------|
-| Model Version              |                                                       |
-| Resource                   | Ascend 910, cpu:2.60GHz 96cores, memory:1.5T          |
-| uploaded Date              | 05/27/2020                                            |
-| MindSpore Version          | 0.2.0                                                 |
-| Dataset                    | Criteo                                                |
-| Training Parameters        | src/config.py                                         |
-| Optimizer                  | Adam                                                  |
-| Loss Function              | SoftmaxCrossEntropyWithLogits                         |
-| outputs                    |                                                       |
-| Loss                       | 0.4234                                                |
-| Accuracy                   | AUC[0.8055]                                           |
-| Total time                 | 91 min                                                |
-| Params (M)                 |                                                       |
-| Checkpoint for Fine tuning |                                                       |
-| Model for inference        |                                                       |
+- evaluation on dataset when running on GPU
+  To do.

-#### Inference Performance

-| Parameters                 |                               |                           |
-| -------------------------- | ----------------------------- | ------------------------- |
-| Model Version              |                               |                           |   
-| Resource                   | Ascend 910                    | Ascend 310                | 
-| uploaded Date              | 05/27/2020                    | 05/27/2020                | 
-| MindSpore Version          | 0.2.0                         | 0.2.0                     |  
-| Dataset                    | Criteo                        |                           |
-| batch_size                 | 1000                          |                           |
-| outputs                    |                               |                           |
-| Accuracy                   | AUC[0.8055]                   |                           |                      
-| Speed                      |                               |                           |                     
-| Total time                 | 35.559s                       |                           |                      
-| Model for inference        |                               |                           |                 
+# [Model Description](#contents)
+## [Performance](#contents)
+
+### Evaluation Performance 
+
+| Parameters                 | Ascend                                                      | GPU                    |
+| -------------------------- | ----------------------------------------------------------- | ---------------------- |
+| Model Version              | DeepFM                                                      | To do                  |
+| Resource                   | Ascend 910; CPU 2.60GHz, 192 cores; Memory 314G             | To do                  |
+| Uploaded Date              | 05/17/2020 (month/day/year)                                 | To do                  |
+| MindSpore Version          | 0.3.0-alpha                                                 | To do                  |
+| Dataset                    | [1]                                                         | To do                  |
+| Training Parameters        | epoch=15, batch_size=1000, lr=1e-5                          | To do                  |
+| Optimizer                  | Adam                                                        | To do                  |
+| Loss Function              | Sigmoid Cross Entropy With Logits                           | To do                  |
+| Outputs                    | Accuracy                                                    | To do                  |
+| Loss                       | 0.45                                                        | To do                  |
+| Speed                      | 1pc: 8.16 ms/step                                           | To do                  |
+| Total time                 | 1pc: 90 mins                                                | To do                  |
+| Parameters (M)             | 16.5                                                        | To do                  |
+| Checkpoint for Fine tuning | 190M (.ckpt file)                                           | To do                  |
+| Scripts                    | [deepfm script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/recommend/deepfm) | To do                  |
+
+
+### Inference Performance
+
+| Parameters          | Ascend                      | GPU                         |
+| ------------------- | --------------------------- | --------------------------- |
+| Model Version       | DeepFM                      | To do                       |
+| Resource            | Ascend 910                  | To do                       |
+| Uploaded Date       | 05/27/2020 (month/day/year) | To do                       |
+| MindSpore Version   | 0.3.0-alpha                 | To do                       |
+| Dataset             | [1]                         | To do                       |
+| batch_size          | 1000                        | To do                       |
+| Outputs             | Accuracy                    | To do                       |
+| Accuracy            | 1pc: AUC 0.8055             | To do                       |
+| Model for inference | 190M (.ckpt file)           | To do                       |
+
+
+# [Description of Random Situation](#contents)
+
+We set the random seed before training in train.py.
+
+# [ModelZoo Homepage](#contents)  
+ Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).  

-# ModelZoo Homepage  
- [Link](https://gitee.com/mindspore/mindspore/tree/master/mindspore/model_zoo)  
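
Note on the README's Model Architecture section: it describes an FM component and a deep component that share the same input and are learned together. The following is a minimal sketch of that idea in plain Python, not the MindSpore network in `src/deepfm.py`. The function names, the toy layer shapes, and the assumption that each field contributes one active feature with value 1 are all illustrative choices made for this note.

```python
import math


def fm_second_order(embeddings):
    """Sum of pairwise dot products between field embeddings.

    Uses the identity 0.5 * sum_k((sum_i v_ik)^2 - sum_i v_ik^2),
    which avoids the explicit double loop over feature pairs.
    """
    dim = len(embeddings[0])
    total = 0.0
    for k in range(dim):
        column = [v[k] for v in embeddings]
        s = sum(column)
        total += 0.5 * (s * s - sum(x * x for x in column))
    return total


def dense_relu(inputs, weights, biases):
    """One fully connected layer followed by ReLU (weights: one row per unit)."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def deepfm_predict(first_order_weights, embeddings, hidden_w, hidden_b, out_w, out_b):
    """Toy DeepFM forward pass: sigmoid(y_FM + y_deep).

    Assumption: every field has exactly one active feature with value 1,
    so first-order terms reduce to a sum of the active weights.
    """
    # FM component: first-order terms plus pairwise (low-order) interactions.
    y_fm = sum(first_order_weights) + fm_second_order(embeddings)
    # Deep component: the same embeddings, flattened, fed to a small MLP.
    flat = [x for v in embeddings for x in v]
    hidden = dense_relu(flat, hidden_w, hidden_b)
    y_deep = sum(w * h for w, h in zip(out_w, hidden)) + out_b
    return 1.0 / (1.0 + math.exp(-(y_fm + y_deep)))
```

Both branches read the same `embeddings`, which is the shared-input property the README text refers to; only the way the two branches combine those embeddings differs.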
--- a/model_zoo/official/recommend/deepfm/eval.py
+++ b/model_zoo/official/recommend/deepfm/eval.py
@@ -30,7 +30,7 @@ sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
 parser = argparse.ArgumentParser(description='CTR Prediction')
 parser.add_argument('--checkpoint_path', type=str, default=None, help='Checkpoint file path')
 parser.add_argument('--dataset_path', type=str, default=None, help='Dataset path')
-parser.add_argument('--device_target', type=str, default="Ascend", help='Ascend, GPU, or CPU')
+parser.add_argument('--device_target', type=str, default="Ascend", help='Ascend or GPU. Default: Ascend')
 args_opt, _ = parser.parse_known_args()
 device_id = int(os.getenv('DEVICE_ID'))
 context.set_context(mode=context.GRAPH_MODE, device_target=args_opt.device_target, device_id=device_id)
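
Note on the unchanged context line `device_id = int(os.getenv('DEVICE_ID'))`: `os.getenv` returns `None` when the variable is unset, and `int(None)` raises a `TypeError`, so the script appears to depend on the launch scripts exporting `DEVICE_ID`. A defensive variant could look like the sketch below. The helper name `read_device_id` and the fallback value `0` are assumptions made for illustration and are not part of this commit.

```python
import os


def read_device_id(env=None, default=0):
    """Return DEVICE_ID as an int, falling back to `default` when it is unset.

    `env` is any mapping of environment variables (defaults to os.environ);
    passing it explicitly keeps the helper easy to test.
    """
    env = os.environ if env is None else env
    raw = env.get("DEVICE_ID")
    if raw is None or raw.strip() == "":
        return default
    try:
        return int(raw)
    except ValueError:
        # Surface a clearer message than the bare int() failure.
        raise ValueError(f"DEVICE_ID must be an integer, got {raw!r}") from None
```

Whether a silent fallback to device 0 is desirable depends on the deployment; failing fast with a clear message may be the better choice on shared multi-device hosts.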
--- a/model_zoo/official/recommend/deepfm/train.py
+++ b/model_zoo/official/recommend/deepfm/train.py
@@ -34,11 +34,15 @@ sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
 parser = argparse.ArgumentParser(description='CTR Prediction')
 parser.add_argument('--dataset_path', type=str, default=None, help='Dataset path')
 parser.add_argument('--ckpt_path', type=str, default=None, help='Checkpoint path')
-parser.add_argument('--eval_file_name', type=str, default="./auc.log", help='eval file path')
-parser.add_argument('--loss_file_name', type=str, default="./loss.log", help='loss file path')
-parser.add_argument('--do_eval', type=bool, default=True, help='Do evaluation or not.')
-parser.add_argument('--device_target', type=str, default="Ascend", help='Ascend, GPU, or CPU')
+parser.add_argument('--eval_file_name', type=str, default="./auc.log",
+                    help='Auc log file path. Default: "./auc.log"')
+parser.add_argument('--loss_file_name', type=str, default="./loss.log",
+                    help='Loss log file path. Default: "./loss.log"')
+parser.add_argument('--do_eval', type=str, default='True',
+                    help='Do evaluation or not, only support "True" or "False". Default: "True"')
+parser.add_argument('--device_target', type=str, default="Ascend", help='Ascend or GPU. Default: Ascend')
 args_opt, _ = parser.parse_known_args()
+args_opt.do_eval = args_opt.do_eval == 'True'
 rank_size = int(os.environ.get("RANK_SIZE", 1))

 random.seed(1)
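
Note on the `--do_eval` change in this hunk: with `type=bool`, argparse calls `bool()` on the raw string, and any non-empty string (including `"False"`) is truthy, so the old flag could not be switched off from the command line. The short sketch below reproduces both behaviours with stand-alone parsers; the variable names are illustrative and the parsers contain only the one argument under discussion.

```python
import argparse

# Old definition: bool("False") is True, so the flag is always on.
old_parser = argparse.ArgumentParser()
old_parser.add_argument('--do_eval', type=bool, default=True)
old_args = old_parser.parse_args(['--do_eval=False'])

# New definition from the diff: keep the string, then compare explicitly.
new_parser = argparse.ArgumentParser()
new_parser.add_argument('--do_eval', type=str, default='True')
new_args = new_parser.parse_args(['--do_eval=False'])
new_do_eval = new_args.do_eval == 'True'

print(old_args.do_eval)  # the old parser ignores the requested value
print(new_do_eval)       # the new comparison honours it
```

The comparison is exact and case-sensitive, so values such as `true` or `1` are treated as `False`; this matches the help text, which states that only `"True"` or `"False"` are supported.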