Compare commits

...

11 Commits

Author SHA1 Message Date
liang.zhao ab7847c050 fix typo 2023-11-17 14:21:17 +08:00
liang.zhao 41689e7ec5 add faq in evaulation 2023-11-17 14:20:06 +08:00
liang.zhao 28c3a31d56 update wise model, add FQA in loss evaluation 2023-11-17 11:59:08 +08:00
liang.zhao f5975efad3 Merge branch 'main' of https://github.com/SkyworkAI/Skywork into fix_typo 2023-11-17 10:47:33 +08:00
liang.zhao c099f8f3c5 fix loss typo 2023-11-06 15:51:39 +08:00
liang.zhao 40ae57e637 Merge branch 'main' of https://github.com/SkyworkAI/Skywork into fix_typo 2023-11-06 10:09:13 +08:00
liang.zhao 60b21cb70c update evaluation data to hugginface and fix some typos 2023-11-02 11:29:19 +08:00
liang.zhao f6ce5344e4 Merge branch 'main' of https://github.com/SkyworkAI/Skywork into upload_evaluation_data_to_huggingface 2023-11-02 11:11:06 +08:00
liang.zhao 7dc72d83c7 Merge branch 'main' of https://github.com/SkyworkAI/Skywork into skywork_tech_repo_arxiv 2023-10-31 14:00:14 +08:00
liang.zhao 9a846f8633 update skywork tech report arxiv url 2023-10-31 13:14:08 +08:00
liang.zhao adff144842 update url 2023-10-31 10:31:54 +08:00
4 changed files with 60 additions and 19 deletions

View File

@@ -6,7 +6,7 @@
<div align="center"><img src="misc/skywork_logo.jpeg" width="550"/></div>
<p align="center">
🤗 <a href="https://huggingface.co/Skywork" target="_blank">Hugging Face</a> • 🤖 <a href="https://modelscope.cn/organization/Skywork" target="_blank">ModelScope</a> • 💬 <a href="https://github.com/SkyworkAI/Skywork/blob/main/misc/wechat.png?raw=true" target="_blank">WeChat</a>• 📜<a href="http://arxiv.org/abs/2310.19341" target="_blank">Tech Report</a>
🤗 <a href="https://huggingface.co/Skywork" target="_blank">Hugging Face</a> • 🤖 <a href="https://modelscope.cn/organization/Skywork" target="_blank">ModelScope</a>👾 <a href="https://wisemodel.cn/organization/Skywork" target="_blank">Wisemodel</a>💬 <a href="https://github.com/SkyworkAI/Skywork/blob/main/misc/wechat.png?raw=true" target="_blank">WeChat</a>• 📜<a href="http://arxiv.org/abs/2310.19341" target="_blank">Tech Report</a>
</p>
@@ -87,12 +87,12 @@
## Model Downloads
| | HuggingFace Base Model | HuggingFace Quantized Model | ModelScope Base Model | ModelScope Quantized Model |
|:-------:|:-----------:|:-----------------------------:|:-----------------------------:|:-----------------------------:|
| **Skywork-13B-Base** | 🤗 [Skywork-13B-Base](https://huggingface.co/Skywork/Skywork-13B-Base) | 🤗 [Skywork-13B-Base-8bits](https://huggingface.co/Skywork/Skywork-13B-Base-8bits) | 🤖[Skywork-13B-Base](https://www.modelscope.cn/models/skywork/Skywork-13B-Base) | 🤖 [Skywork-13B-Base-8bits](https://www.modelscope.cn/models/skywork/Skywork-13B-Base-8bits) |
| **Skywork-13B-Chat** | 🤗 coming soon | 🤗 coming soon | 🤖 coming soon | 🤖 coming soon |
| **Skywork-13B-Math** | 🤗 [Skywork-13B-Math](https://huggingface.co/Skywork/Skywork-13B-Math) | 🤗 [Skywork-13B-Math-8bits](https://huggingface.co/Skywork/Skywork-13B-Math-8bits) | 🤖 [Skywork-13B-Math](https://www.modelscope.cn/models/skywork/Skywork-13B-Math) | 🤖 [Skywork-13B-Math-8bits](https://www.modelscope.cn/models/skywork/Skywork-13B-Math-8bits) |
| **Skywork-13B-MM** | 🤗 coming soon | - | 🤖 coming soon | - |
| | HuggingFace Base Model | HuggingFace Quantized Model | ModelScope Base Model | ModelScope Quantized Model | Wisemodel Base Model | Wisemodel Quantized Model |
|:-------:|:-----------:|:-----------------------------:|:-----------------------------:|:-----------------------------:|:-----------------------------:|:-----------------------------:|
| **Skywork-13B-Base** | 🤗 [Skywork-13B-Base](https://huggingface.co/Skywork/Skywork-13B-Base) | 🤗 [Skywork-13B-Base-8bits](https://huggingface.co/Skywork/Skywork-13B-Base-8bits) | 🤖[Skywork-13B-Base](https://www.modelscope.cn/models/skywork/Skywork-13B-Base) | 🤖 [Skywork-13B-Base-8bits](https://www.modelscope.cn/models/skywork/Skywork-13B-Base-8bits) |👾[Skywork-13B-Base](https://wisemodel.cn/models/Skywork/Skywork-13B-Base) | 👾 [Skywork-13B-Base-8bits](https://wisemodel.cn/models/Skywork/Skywork-13B-Base-8bits) |
| **Skywork-13B-Chat** | 🤗 coming soon | 🤗 coming soon | 🤖 coming soon | 🤖 coming soon | 👾 coming soon | 👾 coming soon |
| **Skywork-13B-Math** | 🤗 [Skywork-13B-Math](https://huggingface.co/Skywork/Skywork-13B-Math) | 🤗 [Skywork-13B-Math-8bits](https://huggingface.co/Skywork/Skywork-13B-Math-8bits) | 🤖 [Skywork-13B-Math](https://www.modelscope.cn/models/skywork/Skywork-13B-Math) | 🤖 [Skywork-13B-Math-8bits](https://www.modelscope.cn/models/skywork/Skywork-13B-Math-8bits) |👾[Skywork-13B-Math](https://wisemodel.cn/models/Skywork/Skywork-13B-Math) | 👾 [Skywork-13B-Math-8bits](https://wisemodel.cn/models/Skywork/Skywork-13B-Math-8bits) |
| **Skywork-13B-MM** | 🤗 coming soon | - | 🤖 coming soon | - | 👾 coming soon | - |
## Data Downloads
@@ -224,6 +224,24 @@ loss = -\sum^{n}_{i=1} log(p_i) / n = -log( \prod_{i=1}^n p_i) / n
```
bash bash_scripts/skywork_eval_loss.sh
```
Suppose we want to compute the normalized loss for Model A and the Skywork model. Run the script above for each model separately; each run writes two values to the result.txt file in its own directory: the first is the loss, and the second is the number of document tokens. Denote Model A's loss and token count as loss_a and token_a, and the Skywork model's as loss_s and token_s. Model A's normalized loss is then loss_a_norm = loss_a * token_a / token_s, and comparing loss_a_norm against loss_s compares the effectiveness of the two models. The same approach extends to any number of models.
### Evaluation FAQ
**Q1**: Why should all models see documents of the same length, rather than the same number of tokens after tokenization?
**A1**: Domain perplexity essentially measures the probability that a model generates high-quality documents; the higher the probability, the better the model. We therefore need every model to see exactly the same documents. Moreover, different models use different tokenizers, so token counts after tokenization vary widely: Llama, for example, falls back to bytes and splits a Chinese character into three byte-level tokens. If we compared on equal token counts, Llama would see a shorter document than the other models, and since per-token loss is higher at the beginning of a document (where there is little context) and lower towards the end, such a comparison would be unfair to models with finer tokenization like Llama.
**Q2**: Why does preprocessing truncate the text to max_position_embedding divided by 3?
**A2**: As noted in A1, Llama typically splits a Chinese character into three tokens. To guarantee that a tokenized document never exceeds the 4096-token limit, we cap documents at 1228 characters (1228 × 3 = 3684 tokens, safely under 4096). Among the models we compare, Llama tokenizes Chinese the most finely, so any document that fits within Llama's limit will also fit in the other models.
**Q3**: Different models have different maximum lengths; is it unfair to use 4096 for all of them?
**A3**: As shown above, documents are capped at 1228 Chinese characters. Taking Qwen as an example, its training length is 2K, extendable to 8K at inference, and Chinese-English bilingual models generally achieve a compression ratio of 2-3x, so 1228 Chinese characters usually amount to only 500-1000 tokens, far below the 2K or even 4K limit.
**Q4**: Why is the Average Ppl inconsistent with the mean of the per-domain Ppl values?
**A4**: We compute Average Ppl by averaging the losses of all documents and then exponentiating the mean to obtain a Ppl. Averaging in loss space prevents domains with extremely large Ppl values from dominating the aggregate. The interpretation is that all documents are treated as a single collection, and Average Ppl is the Ppl computed over that collection as a whole.
## Benchmark Results
We evaluated Skywork-13B-Base on several authoritative benchmarks, including C-Eval, MMLU, CMMLU, and GSM8K. Following the established evaluation protocol, we report 5-shot results for C-Eval, MMLU, and CMMLU, and 8-shot results for GSM8K. Skywork-13B-Base ranks among the top Chinese open-source models and performs best among models of comparable parameter scale.

View File

@@ -6,9 +6,8 @@
</div> -->
<div align="center"><img src="misc/skywork_logo.jpeg" width="550"/></div>
<p align="center">
🤗 <a href="https://huggingface.co/Skywork" target="_blank">Hugging Face</a> • 🤖 <a href="https://modelscope.cn/organization/Skywork" target="_blank">ModelScope</a> • 💬 <a href="https://github.com/SkyworkAI/Skywork/blob/main/misc/wechat.png?raw=true" target="_blank">WeChat</a>• 📜<a href="http://arxiv.org/abs/2310.19341" target="_blank">Tech Report</a>
🤗 <a href="https://huggingface.co/Skywork" target="_blank">Hugging Face</a> • 🤖 <a href="https://modelscope.cn/organization/Skywork" target="_blank">ModelScope</a>👾 <a href="https://wisemodel.cn/organization/Skywork" target="_blank">Wisemodel</a>💬 <a href="https://github.com/SkyworkAI/Skywork/blob/main/misc/wechat.png?raw=true" target="_blank">WeChat</a>• 📜<a href="http://arxiv.org/abs/2310.19341" target="_blank">Tech Report</a>
</p>
<div align="center">
@@ -77,13 +76,12 @@ If you are interested in more training and evaluation details, please refer to o
# Download URL
## Download URL of Skywork Models
| | HuggingFace Base Model | HuggingFace Quantized Model | ModelScope Base Model | ModelScope Quantized Model |
|:-------:|:-----------:|:-----------------------------:|:-----------------------------:|:-----------------------------:|
| **Skywork-13B-Base** | 🤗 [Skywork-13B-Base](https://huggingface.co/Skywork/Skywork-13B-Base) | 🤗 [Skywork-13B-Base-8bits](https://huggingface.co/Skywork/Skywork-13B-Base-8bits) | 🤖[Skywork-13B-Base](https://www.modelscope.cn/models/skywork/Skywork-13B-Base) | 🤖 [Skywork-13B-Base-8bits](https://www.modelscope.cn/models/skywork/Skywork-13B-Base-8bits) |
| **Skywork-13B-Chat** | 🤗coming soon | 🤗coming soon | 🤖coming soon | 🤖coming soon |
| **Skywork-13B-Math** | 🤗 [Skywork-13B-Math](https://huggingface.co/Skywork/Skywork-13B-Math) | 🤗 [Skywork-13B-Math-8bits](https://huggingface.co/Skywork/Skywork-13B-Math-8bits) | 🤖 [Skywork-13B-Math](https://www.modelscope.cn/models/skywork/Skywork-13B-Math) | 🤖 [Skywork-13B-Math-8bits](https://www.modelscope.cn/models/skywork/Skywork-13B-Math-8bits) |
| **Skywork-13B-MM** | 🤗coming soon | - | 🤖coming soon | - |
| | HuggingFace Base Model | HuggingFace Quantized Model | ModelScope Base Model | ModelScope Quantized Model | Wisemodel Base Model | Wisemodel Quantized Model |
|:-------:|:-----------:|:-----------------------------:|:-----------------------------:|:-----------------------------:|:-----------------------------:|:-----------------------------:|
| **Skywork-13B-Base** | 🤗 [Skywork-13B-Base](https://huggingface.co/Skywork/Skywork-13B-Base) | 🤗 [Skywork-13B-Base-8bits](https://huggingface.co/Skywork/Skywork-13B-Base-8bits) | 🤖[Skywork-13B-Base](https://www.modelscope.cn/models/skywork/Skywork-13B-Base) | 🤖 [Skywork-13B-Base-8bits](https://www.modelscope.cn/models/skywork/Skywork-13B-Base-8bits) |👾[Skywork-13B-Base](https://wisemodel.cn/models/Skywork/Skywork-13B-Base) | 👾 [Skywork-13B-Base-8bits](https://wisemodel.cn/models/Skywork/Skywork-13B-Base-8bits) |
| **Skywork-13B-Chat** | 🤗coming soon | 🤗coming soon | 🤖coming soon | 🤖coming soon |👾coming soon | 👾coming soon |
| **Skywork-13B-Math** | 🤗 [Skywork-13B-Math](https://huggingface.co/Skywork/Skywork-13B-Math) | 🤗 [Skywork-13B-Math-8bits](https://huggingface.co/Skywork/Skywork-13B-Math-8bits) | 🤖 [Skywork-13B-Math](https://www.modelscope.cn/models/skywork/Skywork-13B-Math) | 🤖 [Skywork-13B-Math-8bits](https://www.modelscope.cn/models/skywork/Skywork-13B-Math-8bits) |👾[Skywork-13B-Math](https://wisemodel.cn/models/Skywork/Skywork-13B-Math) | 👾 [Skywork-13B-Math-8bits](https://wisemodel.cn/models/Skywork/Skywork-13B-Math-8bits) |
| **Skywork-13B-MM** | 🤗coming soon | - | 🤖coming soon | - |👾coming soon | - |
## Download URL of Skypile
| Data | Download URL |
@@ -216,7 +214,33 @@ We have also open-sourced the data and evaluation scripts. You can reproduce our
bash bash_scripts/skywork_eval_loss.sh
```
If you need to calculate the normalized loss for Model A and the Skywork model, you can follow these steps:
1. Run the above script for Model A and the Skywork model separately. Each run writes its results to the result.txt file in its own directory.
2. Each result.txt contains two values: the first is the loss, and the second is the number of document tokens.
3. Denote Model A's loss and token count as loss_a and token_a, and the Skywork model's as loss_s and token_s.
4. Model A's normalized loss is loss_a_norm = loss_a * token_a / token_s.
5. Comparing loss_a_norm against loss_s compares the effectiveness of the two models.
6. The same approach extends to any number of models.
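The steps above can be sketched in a few lines of Python; the numeric values are hypothetical, standing in for what each model's result.txt would contain:

```python
def normalized_loss(loss_a: float, token_a: int, token_s: int) -> float:
    """Rescale Model A's per-token loss onto Skywork's token count.

    Both models score the same documents, but their tokenizers emit
    different numbers of tokens, so per-token losses are not directly
    comparable; multiplying by token_a / token_s fixes the denominator.
    """
    return loss_a * token_a / token_s

# Hypothetical values read from each model's result.txt
loss_a, token_a = 2.10, 1_500_000   # Model A
loss_s, token_s = 2.05, 1_000_000   # Skywork

loss_a_norm = normalized_loss(loss_a, token_a, token_s)
print(loss_a_norm)   # 3.15, higher than loss_s = 2.05
```

Because the normalization only rescales by a ratio of token counts, the ordering of any number of models can be compared after mapping each onto a common reference token count.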
### FAQ in Evaluation
**Q1**: Why should all models have the same document length instead of having the same number of tokens after tokenization?
**A1**: Essentially, domain perplexity measures the probability that a model generates high-quality documents; the higher the probability, the better the model. We therefore need every model to see exactly the same documents. Additionally, different models use different tokenizers, so token counts after tokenization can differ substantially: Llama, for example, falls back to bytes and splits a Chinese character into three byte-level tokens. If we compared on equal token counts, Llama would see a shorter document than the other models, and since per-token loss is higher at the beginning of a document (where there is little context) and lower towards the end, such a comparison would be unfair to models with finer tokenization like Llama.
**Q2**: Why do we truncate the text to a length of max_position_embedding divided by 3?
**A2**: As mentioned in the answer to Q1, the Llama model generally splits a Chinese character into three tokens. To ensure that a tokenized document never exceeds the 4096-token limit, we cap documents at 1228 characters (1228 × 3 = 3684 tokens, safely under 4096). Among the models we compare, Llama has the finest tokenization for Chinese, so any document that fits within Llama's limit will also fit in the other models.
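A rough sketch of this preprocessing rule; the constant names are illustrative, not taken from the evaluation scripts:

```python
TOKENS_PER_CHAR = 3   # worst case: Llama splits one Chinese char into ~3 byte-level tokens
MAX_DOC_CHARS = 1228  # 1228 * 3 = 3684 tokens, safely under the 4096-token window

def truncate_doc(text: str, max_chars: int = MAX_DOC_CHARS) -> str:
    """Cap a document so even the finest-grained tokenizer stays in-window."""
    return text[:max_chars]

doc = "天" * 5000
print(len(truncate_doc(doc)))                     # 1228
print(len(truncate_doc(doc)) * TOKENS_PER_CHAR)   # 3684, within the 4096 limit
```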
**Q3**: Is it unfair to use a uniform length of 4096 for different models?
**A3**: As explained above, documents are capped at 1228 Chinese characters. Taking Qwen as an example, its training length is 2K, extendable to 8K at inference, and Chinese-English bilingual models generally achieve a compression ratio of 2-3x, so 1228 Chinese characters usually amount to only 500-1000 tokens, far below the 2K or even 4K maximum length limit.
**Q4**: Why is the Average Ppl inconsistent with the average Ppl of each domain?
**A4**: We compute Average Ppl by averaging the per-document losses and then exponentiating the mean to obtain a Ppl. Averaging in loss space prevents domains with excessively high Ppl values from skewing the overall result. The interpretation is that all documents are treated as one collection, and Average Ppl is the Ppl computed over that collection as a whole.
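The difference between the two aggregations can be seen in a small sketch; the per-document losses below are made up:

```python
import math

def average_ppl(doc_losses):
    """Average in loss space first, then exponentiate once."""
    return math.exp(sum(doc_losses) / len(doc_losses))

losses = [2.0, 2.5, 6.0]   # hypothetical per-document losses; 6.0 is an outlier
per_doc_ppl = [math.exp(l) for l in losses]

print(round(average_ppl(losses), 1))                  # 33.1, exp of the mean loss
print(round(sum(per_doc_ppl) / len(per_doc_ppl), 1))  # 141.0, dominated by the outlier
```

Exponentiating once after averaging is equivalent to taking the geometric mean of the per-document perplexities, which is why a single high-Ppl domain cannot dominate the aggregate.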
## Benchmark Results
We evaluated Skywork-13B-Base on several popular benchmarks, including C-Eval, MMLU, CMMLU, and GSM8K. Following the established evaluation protocol, we report 5-shot results for C-Eval, MMLU, and CMMLU, and 8-shot results for GSM8K. Skywork-13B-Base ranks among the top Chinese open-source models and performs best among models of comparable parameter scale.

View File

@@ -5,7 +5,7 @@ do
export DATA=$LOSS_DATA
export BATCH_SIZE=16
mkdir -p prediction/$DATA/$FLAG
python eval/eval_loss_tp.py \
python eval/eval_loss.py \
-m $HF_MODEL_PATH --n-gpus 8 \
-d data/eval_loss/$DATA.jsonl --data-type json -i text -b $BATCH_SIZE --max-tokens 4096 --max-samples 10000 \
-o prediction/$DATA/$FLAG/result.txt

View File

@@ -134,7 +134,7 @@ def set_seed(seed: int):
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model_path", type=str, default="")
parser.add_argument("-m", "--model-path", type=str, default="")
parser.add_argument("-d", "--dataset", type=str, default="emozilla/pg19")
parser.add_argument("-s", "--subset", type=str, default=None)
parser.add_argument("-i", "--input-text-field", type=str, default="text")
@@ -144,7 +144,6 @@ if __name__ == '__main__':
parser.add_argument("--max-samples", type=int, default=None)
parser.add_argument("--data-type", type=str, default=None)
parser.add_argument("--n-gpus", type=int, default=None)
parser.add_argument("--aggressive-memory", action="store_true")
parser.add_argument("--split", type=str, default="train")
args = parser.parse_args()