!10592 deepfm network use mindrecord dataset

From: @shuzigood Reviewed-by: @linqingke,@wuxuejian Signed-off-by: @linqingke
2020-12-27 10:14:28 +08:00 · 2020-12-27 10:14:28 +08:00 · 1cbef74372
parent dd3fd238e0 435a524f06
commit 1cbef74372
3 changed files with 295 additions and 9 deletions
--- a/model_zoo/official/recommend/deepfm/README_CN.md
+++ b/model_zoo/official/recommend/deepfm/README_CN.md
@@ -0,0 +1,286 @@
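+# Contents
+
+<!-- TOC -->
+
+- [Contents](#contents)
+- [DeepFM Overview](#deepfm-overview)
+- [Model Architecture](#model-architecture)
+- [Dataset](#dataset)
+- [Environment Requirements](#environment-requirements)
+- [Quick Start](#quick-start)
+    - [Script Description](#script-description)
+    - [Script and Sample Code](#script-and-sample-code)
+    - [Script Parameters](#script-parameters)
+    - [Training Process](#training-process)
+        - [Training](#training)
+        - [Distributed Training](#distributed-training)
+    - [Evaluation Process](#evaluation-process)
+        - [Evaluation](#evaluation)
+- [Model Description](#model-description)
+    - [Performance](#performance)
+        - [Evaluation Performance](#evaluation-performance)
+        - [Inference Performance](#inference-performance)
+    - [Description of Random Situation](#description-of-random-situation)
+    - [ModelZoo Homepage](#modelzoo-homepage)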
+
+<!-- /TOC -->
+
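+## DeepFM Overview
+
+Learning the sophisticated feature interactions behind user behavior is critical to maximizing click-through rate (CTR) in recommender systems. Despite great progress in this area, existing methods are strongly biased toward either low-order or high-order interactions, or they require expert feature engineering. The paper shows that an end-to-end learning model covering both low-order and high-order feature interactions can be derived. The proposed model, DeepFM, combines factorization machines for recommendation with deep learning for feature learning in a new neural network architecture.
+
+[Paper](https://arxiv.org/abs/1703.04247): Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, Xiuqiang He. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction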
+
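+## Model Architecture
+
+DeepFM consists of two components. The FM component is a factorization machine that learns low-order feature interactions for recommendation; the deep component is a feed-forward neural network that learns high-order feature interactions.
+The FM and deep components share the same raw input feature vector, which lets DeepFM learn low-order and high-order feature interactions simultaneously from the raw input features.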
+
+## Dataset
+
+- [1] A dataset used in Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, Xiuqiang He. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction[J]. 2017.
+
+## Environment Requirements
+
+- Hardware (Ascend or GPU)
+    - Prepare the hardware environment with Ascend or GPU processors. To try Ascend processors, send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once the application is approved, you will be granted access to the resources.
+- Framework
+    - [MindSpore](https://www.mindspore.cn/install)
+- For more information, see the following resources:
+    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/zh-CN/master/index.html)
+    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/zh-CN/master/index.html)
+
+## Quick Start
+
+After installing MindSpore from the official website, you can run training and evaluation as follows:
+
+- Running on Ascend
+
+  ```bash
+  # Run the training example
+  python train.py \
+    --dataset_path='dataset/train' \
+    --ckpt_path='./checkpoint' \
+    --eval_file_name='auc.log' \
+    --loss_file_name='loss.log' \
+    --device_target='Ascend' \
+    --do_eval=True > ms_log/output.log 2>&1 &
+
+  # Run the distributed training example
+  sh scripts/run_distribute_train.sh 8 /dataset_path /rank_table_8p.json
+
+  # Run the evaluation example
+  python eval.py \
+    --dataset_path='dataset/test' \
+    --checkpoint_path='./checkpoint/deepfm.ckpt' \
+    --device_target='Ascend' > ms_log/eval_output.log 2>&1 &
+  # OR
+  sh scripts/run_eval.sh 0 Ascend /dataset_path /checkpoint_path/deepfm.ckpt
+  ```
+
+  Distributed training requires an HCCL configuration file in JSON format, which must be created in advance.
+
+  For details, see:
+
+  <https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools>.
+
+- Running on GPU
+
+  To run on GPU, change `device_target` in `src/config.py` from `Ascend` to `GPU`.
+
+  ```bash
+  # Run the training example
+  python train.py \
+    --dataset_path='dataset/train' \
+    --ckpt_path='./checkpoint' \
+    --eval_file_name='auc.log' \
+    --loss_file_name='loss.log' \
+    --device_target='GPU' \
+    --do_eval=True > ms_log/output.log 2>&1 &
+
+  # Run the distributed training example
+  sh scripts/run_distribute_train_gpu.sh 8 /dataset_path
+
+  # Run the evaluation example
+  python eval.py \
+    --dataset_path='dataset/test' \
+    --checkpoint_path='./checkpoint/deepfm.ckpt' \
+    --device_target='GPU' > ms_log/eval_output.log 2>&1 &
+  # OR
+  sh scripts/run_eval.sh 0 GPU /dataset_path /checkpoint_path/deepfm.ckpt
+  ```
+
+## Script Description
+
+## Script and Sample Code
+
+```text
+.
+└─deepfm
+  ├─README.md
+  ├─mindspore_hub_conf.md             # MindSpore Hub configuration
+  ├─scripts
+    ├─run_standalone_train.sh         # Standalone training (1 device) on Ascend or GPU
+    ├─run_distribute_train.sh         # Distributed training (8 devices) on Ascend
+    ├─run_distribute_train_gpu.sh     # Distributed training (8 devices) on GPU
+    └─run_eval.sh                     # Evaluation on Ascend or GPU
+  ├─src
+    ├─__init__.py                     # Python init file
+    ├─config.py                       # Parameter configuration
+    ├─callback.py                     # Callback definitions
+    ├─deepfm.py                       # DeepFM network
+    ├─dataset.py                      # Dataset creation for DeepFM
+  ├─eval.py                           # Evaluate the network
+  └─train.py                          # Train the network
+```
+
+## Script Parameters
+
+Both training and evaluation parameters can be configured in `config.py`.
+
+- Training parameters.
+
+  ```text
+  optional arguments:
+  -h, --help            show this help message and exit
+  --dataset_path DATASET_PATH
+                        Dataset path
+  --ckpt_path CKPT_PATH
+                        Checkpoint path
+  --eval_file_name EVAL_FILE_NAME
+                        Auc log file path. Default: "./auc.log"
+  --loss_file_name LOSS_FILE_NAME
+                        Loss log file path. Default: "./loss.log"
+  --do_eval DO_EVAL     Do evaluation or not. Default: True
+  --device_target DEVICE_TARGET
+                        Ascend or GPU. Default: Ascend
+  ```
+
+- Evaluation parameters.
+
+  ```text
+  optional arguments:
+  -h, --help            show this help message and exit
+  --checkpoint_path CHECKPOINT_PATH
+                        Checkpoint file path
+  --dataset_path DATASET_PATH
+                        Dataset path
+  --device_target DEVICE_TARGET
+                        Ascend or GPU. Default: Ascend
+  ```
+
+## Training Process
+
+### Training
+
+- Running on Ascend
+
+  ```bash
+  python train.py \
+    --dataset_path='dataset/train' \
+    --ckpt_path='./checkpoint' \
+    --eval_file_name='auc.log' \
+    --loss_file_name='loss.log' \
+    --device_target='Ascend' \
+    --do_eval=True > ms_log/output.log 2>&1 &
+  ```
+
+  The Python command above runs in the background. You can view the results in `ms_log/output.log`.
+
+  After training finishes, the checkpoint files can be found in the default folder `./checkpoint`. Loss values are saved in `loss.log`.
+
+  ```text
+  2020-05-27 15:26:29 epoch: 1 step: 41257, loss is 0.498953253030777
+  2020-05-27 15:32:32 epoch: 2 step: 41257, loss is 0.45545706152915955
+  ...
+  ```
+
+  The model checkpoints are saved under the current directory.
+
+- Running on GPU
+  To do.
+
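+### Distributed Training
+
+- Running on Ascend
+
+  ```bash
+  sh scripts/run_distribute_train.sh 8 /dataset_path /rank_table_8p.json
+  ```
+
+  The shell script above runs distributed training in the background. You can view the results in `log[X]/output.log`. Loss values are saved in `loss.log`.
+
+- Running on GPU
+  To do.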
+
+## Evaluation Process
+
+### Evaluation
+
+- Evaluating the dataset on Ascend
+
+  Before running the command below, check the path of the checkpoint used for evaluation.
+
+  ```bash
+  python eval.py \
+    --dataset_path='dataset/test' \
+    --checkpoint_path='./checkpoint/deepfm.ckpt' \
+    --device_target='Ascend' > ms_log/eval_output.log 2>&1 &
+  # OR
+  sh scripts/run_eval.sh 0 Ascend /dataset_path /checkpoint_path/deepfm.ckpt
+  ```
+
+  The Python command above runs in the background. You can view the results in `ms_log/eval_output.log`. The accuracy (AUC) is saved in `auc.log`.
+
+  ```text
+  {'result': {'AUC': 0.8057789065281104, 'eval_time': 35.64779996871948}}
+  ```
+
+- Evaluating the dataset on GPU
+  To do.
+
+## Model Description
+
+## Performance
+
+### Evaluation Performance
+
+| Parameters                 | Ascend                                                      | GPU                    |
+| -------------------------- | ----------------------------------------------------------- | ---------------------- |
+| Model version              | DeepFM                                                      | To do                  |
+| Resource                   | Ascend 910; CPU 2.60 GHz, 192 cores; memory 755 GB          | To do                  |
+| Upload date                | 2020-05-17                                                  | To do                  |
+| MindSpore version          | 0.3.0-alpha                                                 | To do                  |
+| Dataset                    | [1]                                                         | To do                  |
+| Training parameters        | epoch=15, batch_size=1000, lr=1e-5                          | To do                  |
+| Optimizer                  | Adam                                                        | To do                  |
+| Loss function              | Sigmoid Cross Entropy With Logits                           | To do                  |
+| Output                     | Accuracy                                                    | To do                  |
+| Loss                       | 0.45                                                        | To do                  |
+| Speed                      | 1 device: 8.16 ms/step                                      | To do                  |
+| Total time                 | 1 device: 90 minutes                                        | To do                  |
+| Parameters (M)             | 16.5                                                        | To do                  |
+| Checkpoint for fine-tuning | 190M (.ckpt file)                                           | To do                  |
+| Scripts                    | [DeepFM scripts](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/recommend/deepfm) | To do                  |
+
+### Inference Performance
+
+| Parameters          | Ascend                      | GPU                         |
+| ------------------- | --------------------------- | --------------------------- |
+| Model version       | DeepFM                      | To do                       |
+| Resource            | Ascend 910                  | To do                       |
+| Upload date         | 2020-05-27                  | To do                       |
+| MindSpore version   | 0.3.0-alpha                 | To do                       |
+| Dataset             | [1]                         | To do                       |
+| batch_size          | 1000                        | To do                       |
+| Output              | Accuracy                    | To do                       |
+| Accuracy            | 1 device: 80.55%            | To do                       |
+| Model for inference | 190M (.ckpt file)           | To do                       |
+
+## Description of Random Situation
+
+The random seed is set in `train.py` before training.
+
+## ModelZoo Homepage
+
+Please visit the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).
--- a/tests/st/model_zoo_tests/DeepFM/src/config.py
+++ b/tests/st/model_zoo_tests/DeepFM/src/config.py
@@ -27,7 +27,7 @@ class DataConfig:
    batch_size = 16000
    data_field_size = 39
    # dataset format, 1: mindrecord, 2: tfrecord, 3: h5
-    data_format = 3
+    data_format = 1


 class ModelConfig:
--- a/tests/st/model_zoo_tests/DeepFM/test_deepfm.py
+++ b/tests/st/model_zoo_tests/DeepFM/test_deepfm.py
@@ -14,7 +14,7 @@
 # ============================================================================
 """train_criteo."""
 import os
-# import pytest
+import pytest

 from mindspore import context
 from mindspore.train.model import Model
@@ -27,10 +27,10 @@ from src.callback import EvalCallBack, LossCallBack, TimeMonitor

 set_seed(1)

-# @pytest.mark.level0
-# @pytest.mark.platform_arm_ascend_training
-# @pytest.mark.platform_x86_ascend_training
-# @pytest.mark.env_onecard
+@pytest.mark.level0
+@pytest.mark.platform_arm_ascend_training
+@pytest.mark.platform_x86_ascend_training
+@pytest.mark.env_onecard
 def test_deepfm():
    data_config = DataConfig()
    train_config = TrainConfig()
@@ -39,7 +39,7 @@ def test_deepfm():
    rank_size = None
    rank_id = None

-    dataset_path = "/home/workspace/mindspore_dataset/criteo_data/criteo_h5/"
+    dataset_path = "/home/workspace/mindspore_dataset/criteo_data/mindrecord/"
    print("dataset_path:", dataset_path)
    ds_train = create_dataset(dataset_path,
                              train_mode=True,
@@ -71,10 +71,10 @@ def test_deepfm():
    print("train_config.train_epochs:", train_config.train_epochs)
    model.train(train_config.train_epochs, ds_train, callbacks=callback_list)

-    export_loss_value = 0.51
+    export_loss_value = 0.52
    print("loss_callback.loss:", loss_callback.loss)
    assert loss_callback.loss < export_loss_value
-    export_per_step_time = 40.0
+    export_per_step_time = 30.0
    print("time_callback:", time_callback.per_step_time)
    assert time_callback.per_step_time < export_per_step_time
    print("*******test case pass!********")
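
Note on the README's model-architecture section: the FM component's feature interactions can be illustrated with a small sketch. This is not the code in `src/deepfm.py`; it is a minimal pure-Python illustration that assumes each active field already has a latent (embedding) vector, and the names `fm_second_order`, `fm_second_order_naive`, and `fields` are hypothetical. It computes the standard factorization-machine second-order term, the sum of pairwise inner products, using the usual sum-of-squares identity and checks it against an explicit loop over pairs.

```python
from itertools import combinations


def fm_second_order(embeddings):
    """Sum of inner products <v_i, v_j> over all field pairs i < j.

    Uses the identity 0.5 * ((sum_i v_i)^2 - sum_i v_i^2), accumulated over
    the embedding dimension, so the cost is linear in the number of fields.
    """
    dim = len(embeddings[0])
    total = 0.0
    for d in range(dim):
        column = [vec[d] for vec in embeddings]
        square_of_sum = sum(column) ** 2
        sum_of_squares = sum(x * x for x in column)
        total += 0.5 * (square_of_sum - sum_of_squares)
    return total


def fm_second_order_naive(embeddings):
    """Reference version: explicit loop over every pair of fields."""
    return sum(
        sum(a * b for a, b in zip(vi, vj))
        for vi, vj in combinations(embeddings, 2)
    )


# Hypothetical latent vectors for three active fields, embedding size 2.
fields = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
fast = fm_second_order(fields)
slow = fm_second_order_naive(fields)
print(fast, slow)
```

In DeepFM the same embedding vectors also feed the deep component, which is why the README describes the two parts as sharing one input; the sketch covers only the FM interaction term.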