Compare commits

...

11 Commits

Author SHA1 Message Date
liang.zhao ab7847c050 fix typo 2023-11-17 14:21:17 +08:00
liang.zhao 41689e7ec5 add faq in evaulation 2023-11-17 14:20:06 +08:00
liang.zhao 28c3a31d56 update wise model, add FQA in loss evaluation 2023-11-17 11:59:08 +08:00
liang.zhao f5975efad3 Merge branch 'main' of https://github.com/SkyworkAI/Skywork into fix_typo 2023-11-17 10:47:33 +08:00
liang.zhao c099f8f3c5 fix loss typo 2023-11-06 15:51:39 +08:00
liang.zhao 40ae57e637 Merge branch 'main' of https://github.com/SkyworkAI/Skywork into fix_typo 2023-11-06 10:09:13 +08:00
liang.zhao 60b21cb70c update evaluation data to hugginface and fix some typos 2023-11-02 11:29:19 +08:00
liang.zhao f6ce5344e4 Merge branch 'main' of https://github.com/SkyworkAI/Skywork into upload_evaluation_data_to_huggingface 2023-11-02 11:11:06 +08:00
liang.zhao 7dc72d83c7 Merge branch 'main' of https://github.com/SkyworkAI/Skywork into skywork_tech_repo_arxiv 2023-10-31 14:00:14 +08:00
liang.zhao 9a846f8633 update skywork tech report arxiv url 2023-10-31 13:14:08 +08:00
liang.zhao adff144842 update url 2023-10-31 10:31:54 +08:00
4 changed files with 60 additions and 19 deletions

View File

@@ -6,7 +6,7 @@
<div align="center"><img src="misc/skywork_logo.jpeg" width="550"/></div>
<p align="center">
🤗 <a href="https://huggingface.co/Skywork" target="_blank">Hugging Face</a> • 🤖 <a href="https://modelscope.cn/organization/Skywork" target="_blank">ModelScope</a> • 💬 <a href="https://github.com/SkyworkAI/Skywork/blob/main/misc/wechat.png?raw=true" target="_blank">WeChat</a>• 📜<a href="http://arxiv.org/abs/2310.19341" target="_blank">Tech Report</a>
🤗 <a href="https://huggingface.co/Skywork" target="_blank">Hugging Face</a> • 🤖 <a href="https://modelscope.cn/organization/Skywork" target="_blank">ModelScope</a>👾 <a href="https://wisemodel.cn/organization/Skywork" target="_blank">Wisemodel</a>💬 <a href="https://github.com/SkyworkAI/Skywork/blob/main/misc/wechat.png?raw=true" target="_blank">WeChat</a>• 📜<a href="http://arxiv.org/abs/2310.19341" target="_blank">Tech Report</a>
</p>
@@ -87,12 +87,12 @@
## Model Downloads
| | HuggingFace Base Model | HuggingFace Quantized Model | ModelScope Base Model | ModelScope Quantized Model |
|:-------:|:-----------:|:-----------------------------:|:-----------------------------:|:-----------------------------:|
| **Skywork-13B-Base** | 🤗 [Skywork-13B-Base](https://huggingface.co/Skywork/Skywork-13B-Base) | 🤗 [Skywork-13B-Base-8bits](https://huggingface.co/Skywork/Skywork-13B-Base-8bits) | 🤖[Skywork-13B-Base](https://www.modelscope.cn/models/skywork/Skywork-13B-Base) | 🤖 [Skywork-13B-Base-8bits](https://www.modelscope.cn/models/skywork/Skywork-13B-Base-8bits) |
| **Skywork-13B-Chat** | 🤗 coming soon | 🤗 coming soon | 🤖 coming soon | 🤖 coming soon |
| **Skywork-13B-Math** | 🤗 [Skywork-13B-Math](https://huggingface.co/Skywork/Skywork-13B-Math) | 🤗 [Skywork-13B-Math-8bits](https://huggingface.co/Skywork/Skywork-13B-Math-8bits) | 🤖 [Skywork-13B-Math](https://www.modelscope.cn/models/skywork/Skywork-13B-Math) | 🤖 [Skywork-13B-Math-8bits](https://www.modelscope.cn/models/skywork/Skywork-13B-Math-8bits) |
| **Skywork-13B-MM** | 🤗 coming soon | - | 🤖 coming soon | - |
| | HuggingFace Base Model | HuggingFace Quantized Model | ModelScope Base Model | ModelScope Quantized Model | Wisemodel Base Model | Wisemodel Quantized Model |
|:-------:|:-----------:|:-----------------------------:|:-----------------------------:|:-----------------------------:|:-----------------------------:|:-----------------------------:|
| **Skywork-13B-Base** | 🤗 [Skywork-13B-Base](https://huggingface.co/Skywork/Skywork-13B-Base) | 🤗 [Skywork-13B-Base-8bits](https://huggingface.co/Skywork/Skywork-13B-Base-8bits) | 🤖[Skywork-13B-Base](https://www.modelscope.cn/models/skywork/Skywork-13B-Base) | 🤖 [Skywork-13B-Base-8bits](https://www.modelscope.cn/models/skywork/Skywork-13B-Base-8bits) |👾[Skywork-13B-Base](https://wisemodel.cn/models/Skywork/Skywork-13B-Base) | 👾 [Skywork-13B-Base-8bits](https://wisemodel.cn/models/Skywork/Skywork-13B-Base-8bits) |
| **Skywork-13B-Chat** | 🤗 coming soon | 🤗 coming soon | 🤖 coming soon | 🤖 coming soon | 👾 coming soon | 👾 coming soon |
| **Skywork-13B-Math** | 🤗 [Skywork-13B-Math](https://huggingface.co/Skywork/Skywork-13B-Math) | 🤗 [Skywork-13B-Math-8bits](https://huggingface.co/Skywork/Skywork-13B-Math-8bits) | 🤖 [Skywork-13B-Math](https://www.modelscope.cn/models/skywork/Skywork-13B-Math) | 🤖 [Skywork-13B-Math-8bits](https://www.modelscope.cn/models/skywork/Skywork-13B-Math-8bits) |👾[Skywork-13B-Math](https://wisemodel.cn/models/Skywork/Skywork-13B-Math) | 👾 [Skywork-13B-Math-8bits](https://wisemodel.cn/models/Skywork/Skywork-13B-Math-8bits) |
| **Skywork-13B-MM** | 🤗 coming soon | - | 🤖 coming soon | - | 👾 coming soon | - |
## Data Downloads
@@ -224,6 +224,24 @@ loss = -\sum^{n}_{i=1} log(p_i) / n = -log( \prod_{i=1}^n p_i) / n
```
bash bash_scripts/skywork_eval_loss.sh
```
Suppose we want to compute the normalized loss for Model A and the Skywork model. Run the script above for each model separately; each run writes two values to the result.txt file in its own directory: the first is the loss, and the second is the number of document tokens. Denote Model A's loss and token count as loss_a and token_a, and the Skywork model's as loss_s and token_s. Model A's normalized loss is then loss_a_norm = loss_a * token_a / token_s, and comparing loss_a_norm against loss_s compares the effectiveness of the two models. The same approach extends to any number of models.
### Evaluation FAQ
**Q1**: Why should all models see documents of the same length, rather than the same number of tokens after tokenization?
**A1**: Domain perplexity essentially measures the probability that a model generates high-quality documents; the higher the probability, the better the model. We therefore need every model to see exactly the same documents. Moreover, different models use different tokenizers, so token counts after tokenization vary widely: Llama, for example, falls back to bytes and splits a Chinese character into three byte-level tokens. If we compared on equal token counts, Llama would see a shorter document than the other models, and since per-token loss is higher at the beginning of a document (where there is little context) and lower towards the end, such a comparison would be unfair to models with finer tokenization like Llama.
**Q2**: Why does preprocessing truncate the text to max_position_embedding divided by 3?
**A2**: As noted in A1, Llama typically splits a Chinese character into three tokens. To guarantee that a tokenized document never exceeds the 4096-token limit, we cap documents at 1228 characters (1228 × 3 = 3684 tokens, safely under 4096). Among the models we compare, Llama tokenizes Chinese the most finely, so any document that fits within Llama's limit will also fit in the other models.
**Q3**: Different models have different maximum lengths; is it unfair to use 4096 for all of them?
**A3**: As shown above, documents are capped at 1228 Chinese characters. Taking Qwen as an example, its training length is 2K, extendable to 8K at inference, and Chinese-English bilingual models generally achieve a compression ratio of 2-3x, so 1228 Chinese characters usually amount to only 500-1000 tokens, far below the 2K or even 4K limit.
**Q4**: Why is the Average Ppl inconsistent with the mean of the per-domain Ppl values?
**A4**: We compute Average Ppl by averaging the losses of all documents and then exponentiating the mean to obtain a Ppl. Averaging in loss space prevents domains with extremely large Ppl values from dominating the aggregate. The interpretation is that all documents are treated as a single collection, and Average Ppl is the Ppl computed over that collection as a whole.
## Benchmark Results
We evaluated Skywork-13B-Base on several authoritative benchmarks, including C-Eval, MMLU, CMMLU, and GSM8K. Following the established evaluation protocol, we report 5-shot results for C-Eval, MMLU, and CMMLU, and 8-shot results for GSM8K. Skywork-13B-Base ranks among the top Chinese open-source models and performs best among models of comparable parameter scale.

View File

@@ -6,9 +6,8 @@
</div> -->
<div align="center"><img src="misc/skywork_logo.jpeg" width="550"/></div>
<p align="center">
🤗 <a href="https://huggingface.co/Skywork" target="_blank">Hugging Face</a> • 🤖 <a href="https://modelscope.cn/organization/Skywork" target="_blank">ModelScope</a> • 💬 <a href="https://github.com/SkyworkAI/Skywork/blob/main/misc/wechat.png?raw=true" target="_blank">WeChat</a>• 📜<a href="http://arxiv.org/abs/2310.19341" target="_blank">Tech Report</a>
🤗 <a href="https://huggingface.co/Skywork" target="_blank">Hugging Face</a> • 🤖 <a href="https://modelscope.cn/organization/Skywork" target="_blank">ModelScope</a>👾 <a href="https://wisemodel.cn/organization/Skywork" target="_blank">Wisemodel</a>💬 <a href="https://github.com/SkyworkAI/Skywork/blob/main/misc/wechat.png?raw=true" target="_blank">WeChat</a>• 📜<a href="http://arxiv.org/abs/2310.19341" target="_blank">Tech Report</a>
</p>
<div align="center">
@@ -77,13 +76,12 @@ If you are interested in more training and evaluation details, please refer to o
# Download URL
## Download URL of Skywork Models
| | HuggingFace Base Model | HuggingFace Quantized Model | ModelScope Base Model | ModelScope Quantized Model |
|:-------:|:-----------:|:-----------------------------:|:-----------------------------:|:-----------------------------:|
| **Skywork-13B-Base** | 🤗 [Skywork-13B-Base](https://huggingface.co/Skywork/Skywork-13B-Base) | 🤗 [Skywork-13B-Base-8bits](https://huggingface.co/Skywork/Skywork-13B-Base-8bits) | 🤖[Skywork-13B-Base](https://www.modelscope.cn/models/skywork/Skywork-13B-Base) | 🤖 [Skywork-13B-Base-8bits](https://www.modelscope.cn/models/skywork/Skywork-13B-Base-8bits) |
| **Skywork-13B-Chat** | 🤗coming soon | 🤗coming soon | 🤖coming soon | 🤖coming soon |
| **Skywork-13B-Math** | 🤗 [Skywork-13B-Math](https://huggingface.co/Skywork/Skywork-13B-Math) | 🤗 [Skywork-13B-Math-8bits](https://huggingface.co/Skywork/Skywork-13B-Math-8bits) | 🤖 [Skywork-13B-Math](https://www.modelscope.cn/models/skywork/Skywork-13B-Math) | 🤖 [Skywork-13B-Math-8bits](https://www.modelscope.cn/models/skywork/Skywork-13B-Math-8bits) |
| **Skywork-13B-MM** | 🤗coming soon | - | 🤖coming soon | - |
| | HuggingFace Base Model | HuggingFace Quantized Model | ModelScope Base Model | ModelScope Quantized Model | Wisemodel Base Model | Wisemodel Quantized Model |
|:-------:|:-----------:|:-----------------------------:|:-----------------------------:|:-----------------------------:|:-----------------------------:|:-----------------------------:|
| **Skywork-13B-Base** | 🤗 [Skywork-13B-Base](https://huggingface.co/Skywork/Skywork-13B-Base) | 🤗 [Skywork-13B-Base-8bits](https://huggingface.co/Skywork/Skywork-13B-Base-8bits) | 🤖[Skywork-13B-Base](https://www.modelscope.cn/models/skywork/Skywork-13B-Base) | 🤖 [Skywork-13B-Base-8bits](https://www.modelscope.cn/models/skywork/Skywork-13B-Base-8bits) |👾[Skywork-13B-Base](https://wisemodel.cn/models/Skywork/Skywork-13B-Base) | 👾 [Skywork-13B-Base-8bits](https://wisemodel.cn/models/Skywork/Skywork-13B-Base-8bits) |
| **Skywork-13B-Chat** | 🤗coming soon | 🤗coming soon | 🤖coming soon | 🤖coming soon |👾coming soon | 👾coming soon |
| **Skywork-13B-Math** | 🤗 [Skywork-13B-Math](https://huggingface.co/Skywork/Skywork-13B-Math) | 🤗 [Skywork-13B-Math-8bits](https://huggingface.co/Skywork/Skywork-13B-Math-8bits) | 🤖 [Skywork-13B-Math](https://www.modelscope.cn/models/skywork/Skywork-13B-Math) | 🤖 [Skywork-13B-Math-8bits](https://www.modelscope.cn/models/skywork/Skywork-13B-Math-8bits) |👾[Skywork-13B-Math](https://wisemodel.cn/models/Skywork/Skywork-13B-Math) | 👾 [Skywork-13B-Math-8bits](https://wisemodel.cn/models/Skywork/Skywork-13B-Math-8bits) |
| **Skywork-13B-MM** | 🤗coming soon | - | 🤖coming soon | - |👾coming soon | - |
## Download URL of Skypile
| Data | Download URL |
@@ -216,7 +214,33 @@ We have also open-sourced the data and evaluation scripts. You can reproduce our
bash bash_scripts/skywork_eval_loss.sh
```
If you need to calculate the normalized loss for Model A and the Skywork model, you can follow these steps:
1. Run the above script for Model A and the Skywork model separately. Each run writes its results to the result.txt file in its own directory.
2. Each result.txt contains two values: the first is the loss, and the second is the number of document tokens.
3. Denote Model A's loss and token count as loss_a and token_a, and the Skywork model's as loss_s and token_s.
4. Model A's normalized loss is loss_a_norm = loss_a * token_a / token_s.
5. Comparing loss_a_norm against loss_s compares the effectiveness of the two models.
6. The same approach extends to any number of models.
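The steps above can be sketched in a few lines of Python; the numeric values are hypothetical, standing in for what each model's result.txt would contain:

```python
def normalized_loss(loss_a: float, token_a: int, token_s: int) -> float:
    """Rescale Model A's per-token loss onto Skywork's token count.

    Both models score the same documents, but their tokenizers emit
    different numbers of tokens, so per-token losses are not directly
    comparable; multiplying by token_a / token_s fixes the denominator.
    """
    return loss_a * token_a / token_s

# Hypothetical values read from each model's result.txt
loss_a, token_a = 2.10, 1_500_000   # Model A
loss_s, token_s = 2.05, 1_000_000   # Skywork

loss_a_norm = normalized_loss(loss_a, token_a, token_s)
print(loss_a_norm)   # 3.15, higher than loss_s = 2.05
```

Because the normalization only rescales by a ratio of token counts, the ordering of any number of models can be compared after mapping each onto a common reference token count.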
### FAQ in Evaluation
**Q1**: Why should all models have the same document length instead of having the same number of tokens after tokenization?
**A1**: Essentially, domain perplexity measures the probability that a model generates high-quality documents; the higher the probability, the better the model. We therefore need every model to see exactly the same documents. Additionally, different models use different tokenizers, so token counts after tokenization can differ substantially: Llama, for example, falls back to bytes and splits a Chinese character into three byte-level tokens. If we compared on equal token counts, Llama would see a shorter document than the other models, and since per-token loss is higher at the beginning of a document (where there is little context) and lower towards the end, such a comparison would be unfair to models with finer tokenization like Llama.
**Q2**: Why do we truncate the text to a length of max_position_embedding divided by 3?
**A2**: As mentioned in the answer to Q1, the Llama model generally splits a Chinese character into three tokens. To ensure that a tokenized document never exceeds the 4096-token limit, we cap documents at 1228 characters (1228 × 3 = 3684 tokens, safely under 4096). Among the models we compare, Llama has the finest tokenization for Chinese, so any document that fits within Llama's limit will also fit in the other models.
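A rough sketch of this preprocessing rule; the constant names are illustrative, not taken from the evaluation scripts:

```python
TOKENS_PER_CHAR = 3   # worst case: Llama splits one Chinese char into ~3 byte-level tokens
MAX_DOC_CHARS = 1228  # 1228 * 3 = 3684 tokens, safely under the 4096-token window

def truncate_doc(text: str, max_chars: int = MAX_DOC_CHARS) -> str:
    """Cap a document so even the finest-grained tokenizer stays in-window."""
    return text[:max_chars]

doc = "天" * 5000
print(len(truncate_doc(doc)))                     # 1228
print(len(truncate_doc(doc)) * TOKENS_PER_CHAR)   # 3684, within the 4096 limit
```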
**Q3**: Is it unfair to use a uniform length of 4096 for different models?
**A3**: As explained above, documents are capped at 1228 Chinese characters. Taking Qwen as an example, its training length is 2K, extendable to 8K at inference, and Chinese-English bilingual models generally achieve a compression ratio of 2-3x, so 1228 Chinese characters usually amount to only 500-1000 tokens, far below the 2K or even 4K maximum length limit.
**Q4**: Why is the Average Ppl inconsistent with the average Ppl of each domain?
**A4**: We compute Average Ppl by averaging the per-document losses and then exponentiating the mean to obtain a Ppl. Averaging in loss space prevents domains with excessively high Ppl values from skewing the overall result. The interpretation is that all documents are treated as one collection, and Average Ppl is the Ppl computed over that collection as a whole.
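The difference between the two aggregations can be seen in a small sketch; the per-document losses below are made up:

```python
import math

def average_ppl(doc_losses):
    """Average in loss space first, then exponentiate once."""
    return math.exp(sum(doc_losses) / len(doc_losses))

losses = [2.0, 2.5, 6.0]   # hypothetical per-document losses; 6.0 is an outlier
per_doc_ppl = [math.exp(l) for l in losses]

print(round(average_ppl(losses), 1))                  # 33.1, exp of the mean loss
print(round(sum(per_doc_ppl) / len(per_doc_ppl), 1))  # 141.0, dominated by the outlier
```

Exponentiating once after averaging is equivalent to taking the geometric mean of the per-document perplexities, which is why a single high-Ppl domain cannot dominate the aggregate.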
## Benchmark Results
We evaluated Skywork-13B-Base on several popular benchmarks, including C-Eval, MMLU, CMMLU, and GSM8K. Following the established evaluation protocol, we report 5-shot results for C-Eval, MMLU, and CMMLU, and 8-shot results for GSM8K. Skywork-13B-Base ranks among the top Chinese open-source models and performs best among models of comparable parameter scale.

View File

@@ -5,7 +5,7 @@ do
export DATA=$LOSS_DATA
export BATCH_SIZE=16
mkdir -p prediction/$DATA/$FLAG
python eval/eval_loss_tp.py \
python eval/eval_loss.py \
-m $HF_MODEL_PATH --n-gpus 8 \
-d data/eval_loss/$DATA.jsonl --data-type json -i text -b $BATCH_SIZE --max-tokens 4096 --max-samples 10000 \
-o prediction/$DATA/$FLAG/result.txt

View File

@@ -134,7 +134,7 @@ def set_seed(seed: int):
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model_path", type=str, default="")
parser.add_argument("-m", "--model-path", type=str, default="")
parser.add_argument("-d", "--dataset", type=str, default="emozilla/pg19")
parser.add_argument("-s", "--subset", type=str, default=None)
parser.add_argument("-i", "--input-text-field", type=str, default="text")
@@ -144,7 +144,6 @@ if __name__ == '__main__':
parser.add_argument("--max-samples", type=int, default=None)
parser.add_argument("--data-type", type=str, default=None)
parser.add_argument("--n-gpus", type=int, default=None)
parser.add_argument("--aggressive-memory", action="store_true")
parser.add_argument("--split", type=str, default="train")
args = parser.parse_args()