|
# BERT

**\*\*\*\*\* New March 11th, 2020: Smaller BERT Models \*\*\*\*\***

This is a release of 24 smaller BERT models (English only, uncased, trained with WordPiece masking) referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962).

We have shown that the standard BERT recipe (including model architecture and training objective) is effective on a wide range of model sizes, beyond BERT-Base and BERT-Large. The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
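
The distillation recipe itself is not spelled out in this README, so the following is only a minimal illustrative sketch, assuming a generic soft-label objective in which the student is trained on temperature-softened teacher predictions; the function and variable names are hypothetical and not part of this repository.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with optional temperature scaling."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's softened distribution and the
    student's softened distribution, averaged over the batch."""
    teacher_probs = softmax(teacher_logits, temperature)              # soft targets
    student_log_probs = np.log(softmax(student_logits, temperature) + 1e-12)
    return -np.mean(np.sum(teacher_probs * student_log_probs, axis=-1))

# Example: a batch of 2 examples with 3 classes each.
teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.5, 0.1]])
student = np.array([[2.0, 1.5, 0.5], [0.5, 2.0, 0.3]])
print(distillation_loss(student, teacher))
```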

Our goal is to enable research in institutions with fewer computational resources and encourage the community to seek directions of innovation alternative to increasing model capacity.

You can download all 24 from [here][all], or individually from the table below:

| |H=128|H=256|H=512|H=768|
|---|:---:|:---:|:---:|:---:|
| **L=2** |[**2/128 (BERT-Tiny)**][2_128]|[2/256][2_256]|[2/512][2_512]|[2/768][2_768]|
| **L=4** |[4/128][4_128]|[**4/256 (BERT-Mini)**][4_256]|[**4/512 (BERT-Small)**][4_512]|[4/768][4_768]|
| **L=6** |[6/128][6_128]|[6/256][6_256]|[6/512][6_512]|[6/768][6_768]|
| **L=8** |[8/128][8_128]|[8/256][8_256]|[**8/512 (BERT-Medium)**][8_512]|[8/768][8_768]|
| **L=10** |[10/128][10_128]|[10/256][10_256]|[10/512][10_512]|[10/768][10_768]|
| **L=12** |[12/128][12_128]|[12/256][12_256]|[12/512][12_512]|[**12/768 (BERT-Base)**][12_768]|

Note that the BERT-Base model in this release is included for completeness only; it was re-trained under the same regime as the original model.
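
As a sketch of how one of these checkpoints might be fetched and unpacked programmatically: the download URL below is taken from the link table above, while the destination directory name and the expectation that the archive contains files such as `bert_config.json` and `vocab.txt` are assumptions for illustration.

```python
import io
import urllib.request
import zipfile

# BERT-Tiny (L=2, H=128), one of the checkpoints linked in the table above.
URL = "https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-2_H-128_A-2.zip"
DEST = "bert_tiny"  # hypothetical local directory

with urllib.request.urlopen(URL) as response:
    archive = zipfile.ZipFile(io.BytesIO(response.read()))

archive.extractall(DEST)
print(archive.namelist())  # e.g. bert_config.json, vocab.txt, checkpoint files
```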

Here are the corresponding GLUE scores on the test set:

|Model|Score|CoLA|SST-2|MRPC|STS-B|QQP|MNLI-m|MNLI-mm|QNLI(v2)|RTE|WNLI|AX|
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|BERT-Tiny|64.2|0.0|83.2|81.1/71.1|74.3/73.6|62.2/83.4|70.2|70.3|81.5|57.2|62.3|21.0|
|BERT-Mini|65.8|0.0|85.9|81.1/71.8|75.4/73.3|66.4/86.2|74.8|74.3|84.1|57.9|62.3|26.1|
|BERT-Small|71.2|27.8|89.7|83.4/76.2|78.8/77.0|68.1/87.0|77.6|77.0|86.4|61.8|62.3|28.6|
|BERT-Medium|73.5|38.0|89.6|86.6/81.6|80.4/78.4|69.6/87.9|80.0|79.1|87.7|62.2|62.3|30.5|

For each task, we selected the best fine-tuning hyperparameters from the lists below, and trained for 4 epochs:
- batch sizes: 8, 16, 32, 64, 128
- learning rates: 3e-4, 1e-4, 5e-5, 3e-5
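
The selection procedure itself is not scripted here; below is a minimal sketch of such a sweep, assuming the standard `run_classifier.py` fine-tuning flags used elsewhere in this repository. The paths, task name, and output naming are placeholders.

```python
import itertools

BATCH_SIZES = [8, 16, 32, 64, 128]
LEARNING_RATES = [3e-4, 1e-4, 5e-5, 3e-5]
EPOCHS = 4

# Hypothetical locations of the unpacked checkpoint and task data.
MODEL_DIR = "bert_tiny"
DATA_DIR = "glue/MRPC"

for batch_size, lr in itertools.product(BATCH_SIZES, LEARNING_RATES):
    output_dir = f"output/mrpc_bs{batch_size}_lr{lr}"
    # One fine-tuning run per grid point; the best dev score would be kept.
    print(
        "python run_classifier.py"
        " --task_name=MRPC --do_train=true --do_eval=true"
        f" --data_dir={DATA_DIR}"
        f" --vocab_file={MODEL_DIR}/vocab.txt"
        f" --bert_config_file={MODEL_DIR}/bert_config.json"
        f" --init_checkpoint={MODEL_DIR}/bert_model.ckpt"
        " --max_seq_length=128"
        f" --train_batch_size={batch_size}"
        f" --learning_rate={lr}"
        f" --num_train_epochs={EPOCHS}"
        f" --output_dir={output_dir}"
    )
```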

If you use these models, please cite the following paper:

```
@article{turc2019,
  title={Well-Read Students Learn Better: On the Importance of Pre-training Compact Models},
  author={Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1908.08962v2},
  year={2019}
}
```

[2_128]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-2_H-128_A-2.zip
[2_256]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-2_H-256_A-4.zip
[2_512]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-2_H-512_A-8.zip
[2_768]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-2_H-768_A-12.zip
[4_128]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-4_H-128_A-2.zip
[4_256]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-4_H-256_A-4.zip
[4_512]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-4_H-512_A-8.zip
[4_768]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-4_H-768_A-12.zip
[6_128]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-6_H-128_A-2.zip
[6_256]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-6_H-256_A-4.zip
[6_512]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-6_H-512_A-8.zip
[6_768]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-6_H-768_A-12.zip
[8_128]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-8_H-128_A-2.zip
[8_256]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-8_H-256_A-4.zip
[8_512]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-8_H-512_A-8.zip
[8_768]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-8_H-768_A-12.zip
[10_128]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-10_H-128_A-2.zip
[10_256]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-10_H-256_A-4.zip
[10_512]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-10_H-512_A-8.zip
[10_768]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-10_H-768_A-12.zip
[12_128]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-12_H-128_A-2.zip
[12_256]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-12_H-256_A-4.zip
[12_512]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-12_H-512_A-8.zip
[12_768]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-12_H-768_A-12.zip
[all]: https://storage.googleapis.com/bert_models/2020_02_20/all_bert_models.zip

**\*\*\*\*\* New May 31st, 2019: Whole Word Masking Models \*\*\*\*\***

This is a release of several new models which were the result of an improvement
|
|