dbmdz / bert-base-historic-dutch-cased

Language Model for Historic Dutch

In this repository we open source a language model for Historic Dutch, trained on the Delpher Corpus, which includes digitized texts from Dutch newspapers ranging from 1618 to 1879.
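
The model can be used directly for masked-token prediction. A minimal sketch using the transformers library (the example sentence is an illustrative placeholder, not taken from the corpus):

from transformers import pipeline

# Load the model into a fill-mask pipeline; the identifier is the
# Model Hub entry of this repository.
fill_mask = pipeline("fill-mask", model="dbmdz/bert-base-historic-dutch-cased")

# [MASK] is BERT's mask token; the pipeline ranks likely replacements.
for prediction in fill_mask("De [MASK] van Amsterdam is groot."):
    print(prediction["token_str"], prediction["score"])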

Changelog

  • 13.12.2021: Initial version of this repository.

Model Zoo

The following models for Historic Dutch are available on the Hugging Face Model Hub:

Model identifier                       Model Hub link
dbmdz/bert-base-historic-dutch-cased   https://huggingface.co/dbmdz/bert-base-historic-dutch-cased

Stats

The download URLs for all archives can be found here.

We then used the awesome alto-tools from this repository to extract plain text; a sketch of this extraction step is shown after the table. The following table shows the size overview per year range:

Period      Extracted plain text size
1618-1699   170MB
1700-1709   103MB
1710-1719   65MB
1720-1729   137MB
1730-1739   144MB
1740-1749   188MB
1750-1759   171MB
1760-1769   235MB
1770-1779   271MB
1780-1789   414MB
1790-1799   614MB
1800-1809   734MB
1810-1819   807MB
1820-1829   987MB
1830-1839   1.7GB
1840-1849   2.2GB
1850-1854   1.3GB
1855-1859   1.7GB
1860-1864   2.0GB
1865-1869   2.3GB
1870-1874   1.9GB
1875-1876   867MB
1877-1879   1.9GB

The total training corpus consists of 427,181,269 sentences and 3,509,581,683 tokens (counted via wc), resulting in a total corpus size of 21GB.
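
For illustration, the core of the extraction step mentioned above is collecting the CONTENT attributes of ALTO <String> elements. A minimal Python sketch (the actual pipeline used alto-tools; this simplified version ignores layout and hyphenation handling):

import sys
import xml.etree.ElementTree as ET

def alto_to_text(path):
    # Collect the CONTENT attribute of every <String> element, grouped by
    # <TextLine>. Tag names are matched with endswith() so the sketch works
    # across ALTO namespace versions.
    lines = []
    for elem in ET.parse(path).iter():
        if elem.tag.endswith("TextLine"):
            words = [child.attrib["CONTENT"]
                     for child in elem
                     if child.tag.endswith("String") and "CONTENT" in child.attrib]
            if words:
                lines.append(" ".join(words))
    return "\n".join(lines)

if __name__ == "__main__":
    print(alto_to_text(sys.argv[1]))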

The following figure shows the distribution of the number of characters per year:

[Figure: Delpher Corpus Stats — characters per year]

Language Model Pretraining

We used the official BERT implementation and the following command to train the model:

python3 run_pretraining.py --input_file gs://delpher-bert/tfrecords/*.tfrecord \
--output_dir gs://delpher-bert/bert-base-historic-dutch-cased \
--bert_config_file ./config.json \
--max_seq_length=512 \
--max_predictions_per_seq=75 \
--do_train=True \
--train_batch_size=128 \
--num_train_steps=3000000 \
--learning_rate=1e-4 \
--save_checkpoints_steps=100000 \
--keep_checkpoint_max=20 \
--use_tpu=True \
--tpu_name=electra-2 \
--num_tpu_cores=32

We train the model for 3M steps using a total batch size of 128 on a v3-32 TPU. The pretraining loss curve can be seen in the next figure:

[Figure: Delpher pretraining loss curve]
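
As a back-of-the-envelope check, these hyper-parameters imply roughly 56 passes over the corpus, assuming every training sequence is packed to the full length of 512 tokens (note that wc-counted tokens are only a rough proxy for WordPiece tokens):

# Rough upper bound on tokens processed during pretraining, assuming every
# sequence is packed to the full max_seq_length.
steps = 3_000_000
batch_size = 128
max_seq_length = 512
tokens_seen = steps * batch_size * max_seq_length  # 196,608,000,000

corpus_tokens = 3_509_581_683  # wc-counted tokens from the stats above
print(f"{tokens_seen / corpus_tokens:.0f} passes over the corpus")  # ~56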

Evaluation

We evaluate our model on the preprocessed Europeana NER dataset for Dutch, which was presented in the "Data Centric Domain Adaptation for Historical Text with OCR Errors" paper.

The data is available in their repository. We perform a hyper-parameter search for:

  • Batch sizes: [4, 8]
  • Learning rates: [3e-5, 5e-5]
  • Number of epochs: [5, 10]

and report the F1-score averaged over 5 runs with different seeds. We also include hmBERT as a baseline model.
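
A minimal sketch of that search loop; train_and_eval is a hypothetical helper standing in for the actual NER fine-tuning code, which is not part of this card:

from itertools import product
from random import Random
from statistics import mean

def train_and_eval(batch_size, learning_rate, epochs, seed):
    # Hypothetical stand-in for the real fine-tuning run; returns a dummy
    # score here so the scaffolding is runnable as-is.
    return Random(hash((batch_size, learning_rate, epochs, seed))).uniform(0.80, 0.90)

# Search space reported above; each configuration is averaged over 5 seeds.
results = {}
for bs, lr, epochs in product([4, 8], [3e-5, 5e-5], [5, 10]):
    results[(bs, lr, epochs)] = mean(
        train_and_eval(bs, lr, epochs, seed) for seed in range(5))

best = max(results, key=results.get)
print(best, results[best])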

Results:

Model                F1-Score (Dev / Test)
hmBERT               (82.73) / 81.34
März et al. (2021)   - / 84.2
Ours                 (89.73) / 87.45

License

All models are licensed under the MIT license.

Acknowledgments

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC) program, previously known as TensorFlow Research Cloud (TFRC). Many thanks for providing access to the TRC ❤️

We thank Clemens Neudecker for maintaining the amazing ALTO tools that were used for parsing the Delpher Corpus XML files.

Thanks to the generous support from the Hugging Face team, it is possible to download both cased and uncased models from their S3 storage 🤗
