We provide some statistics using different OCR confidence thresholds, in order to shrink down the corpus size and use less noisy data:
| OCR confidence | Size |
|----------------|------|
| 0.60 | 28GB |
| 0.65 | 18GB |
| 0.70 | 13GB |
For the final corpus we use an OCR confidence of 0.6 (28GB). The following plot shows the tokens-per-year distribution:
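The exact filtering pipeline is not part of this card. As a minimal sketch of the OCR-confidence threshold filtering described above, assuming the Europeana dumps are stored as JSONL with hypothetical `text` and `ocr_confidence` fields:

```python
import json

# Minimal sketch of OCR-confidence filtering. The JSONL layout and the
# field names ("text", "ocr_confidence") are assumptions, not the actual
# preprocessing pipeline used for this model.
def filter_by_ocr_confidence(in_path: str, out_path: str, threshold: float = 0.60) -> None:
    kept = total = 0
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            total += 1
            doc = json.loads(line)
            if doc["ocr_confidence"] >= threshold:
                dst.write(doc["text"].strip() + "\n")
                kept += 1
    print(f"kept {kept}/{total} documents at OCR confidence >= {threshold}")

# Example: the 0.60 threshold chosen for the final German corpus (paths are placeholders).
# filter_by_ocr_confidence("de_europeana.jsonl", "de_europeana_0.60.txt", threshold=0.60)
```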
French Europeana Corpus
As for German, we use different OCR confidence thresholds:
| OCR confidence | Size |
|----------------|------|
| 0.60 | 31GB |
| 0.65 | 27GB |
| 0.70 | 27GB |
| 0.75 | 23GB |
| 0.80 | 11GB |
For the final corpus we use an OCR confidence of 0.7 (27GB). The following plot shows the tokens-per-year distribution:
British Library Corpus
Metadata is taken from here. Stats including year filtering:
| Years | Size |
|-------|------|
| ALL | 24GB |
| >= 1800 && < 1900 | 24GB |
We use the year-filtered variant. The following plot shows the tokens-per-year distribution:
Finnish Europeana Corpus
| OCR confidence | Size |
|----------------|------|
| 0.60 | 1.2GB |
The following plot shows the tokens-per-year distribution:
Swedish Europeana Corpus
| OCR confidence | Size |
|----------------|------|
| 0.60 | 1.1GB |
The following plot shows the tokens-per-year distribution:
All Corpora
The following plot shows the tokens-per-year distribution of the complete training corpus:
Multilingual Vocab generation
For the first attempt, we use the first 10GB of each pretraining corpus. We upsample both Finnish and Swedish to ~10GB.
The following table shows the exact sizes used for generating the 32k and 64k subword vocabs:
| Language | Size |
|----------|------|
| German | 10GB |
| French | 10GB |
| English | 10GB |
| Finnish | 9.5GB |
| Swedish | 9.7GB |
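The exact vocab-building commands are not shown in this card. A minimal sketch using the Hugging Face `tokenizers` library, assuming one plain-text file per sampled corpus (file names are placeholders):

```python
from tokenizers import BertWordPieceTokenizer

# Sketch of building a cased 32k WordPiece vocab over the five sampled corpora.
# File names are placeholders; the exact tooling and options used for the
# released vocab are not documented in this card.
files = [
    "de_10GB.txt",
    "fr_10GB.txt",
    "en_10GB.txt",
    "fi_9.5GB.txt",
    "sv_9.7GB.txt",
]

tokenizer = BertWordPieceTokenizer(lowercase=False, strip_accents=False)
tokenizer.train(
    files=files,
    vocab_size=32_000,  # use 64_000 for the larger vocab
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".", "hmbert-32k")  # writes hmbert-32k-vocab.txt
```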
We then calculate the subword fertility rate and the portion of `[UNK]`s over the following NER corpora:
| Language | NER corpora |
|----------|-------------|
| German | CLEF-HIPE, NewsEye |
| French | CLEF-HIPE, NewsEye |
| English | CLEF-HIPE |
| Finnish | NewsEye |
| Swedish | NewsEye |
Breakdown of subword fertility rate and unknown portion per language for the 32k vocab:
| Language | Subword fertility | Unknown portion |
|----------|-------------------|-----------------|
| German | 1.43 | 0.0004 |
| French | 1.25 | 0.0001 |
| English | 1.25 | 0.0 |
| Finnish | 1.69 | 0.0007 |
| Swedish | 1.43 | 0.0 |
Breakdown of subword fertility rate and unknown portion per language for the 64k vocab:
| Language | Subword fertility | Unknown portion |
|----------|-------------------|-----------------|
| German | 1.31 | 0.0004 |
| French | 1.16 | 0.0001 |
| English | 1.17 | 0.0 |
| Finnish | 1.54 | 0.0007 |
| Swedish | 1.32 | 0.0 |
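The evaluation script itself is not included here. A rough sketch of the two diagnostics reported above, computed over a whitespace-tokenized word list (subword fertility = produced subwords per input word; unknown portion = fraction of `[UNK]` subwords), assuming a CoNLL-style NER file with one token per line:

```python
from tokenizers import BertWordPieceTokenizer

# Sketch of the two vocab diagnostics:
#   subword fertility = produced subwords / input words
#   unknown portion   = [UNK] subwords / produced subwords
def vocab_stats(vocab_file: str, words: list[str]) -> tuple[float, float]:
    tokenizer = BertWordPieceTokenizer(vocab_file, lowercase=False)
    n_subwords = 0
    n_unk = 0
    for word in words:
        subwords = tokenizer.encode(word, add_special_tokens=False).tokens
        n_subwords += len(subwords)
        n_unk += subwords.count("[UNK]")
    return n_subwords / len(words), n_unk / n_subwords

# Example (file name is a placeholder):
# words = [line.split()[0] for line in open("newseye_fi.conll", encoding="utf-8") if line.strip()]
# fertility, unk_portion = vocab_stats("hmbert-32k-vocab.txt", words)
```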
Final pretraining corpora
We upsample Swedish and Finnish to ~27GB. The final stats for all pretraining corpora can be seen here:
| Language | Size |
|----------|------|
| German | 28GB |
| French | 27GB |
| English | 24GB |
| Finnish | 27GB |
| Swedish | 27GB |
Total size is 130GB.
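The card does not describe how the upsampling is performed. One common approach, given here only as an assumption, is to repeat the smaller corpus until it reaches roughly the target size:

```python
import os

# Sketch of one possible upsampling scheme (an assumption, not the documented
# procedure): repeat a small corpus until it reaches roughly the target size.
def upsample(in_path: str, out_path: str, target_bytes: int) -> None:
    written = 0
    with open(out_path, "w", encoding="utf-8") as dst:
        while written < target_bytes:
            with open(in_path, encoding="utf-8") as src:
                for line in src:
                    dst.write(line)
            dst.flush()
            written = os.path.getsize(out_path)

# Example: bring the small Finnish corpus up to ~27GB (file names are placeholders).
# upsample("fi_europeana_0.60.txt", "fi_europeana_upsampled.txt", 27 * 1024**3)
```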
Pretraining
Multilingual model
We train a multilingual BERT model using the 32k vocab with the official BERT implementation on a v3-32 TPU.
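As a rough sketch of how such a run is launched with the official `run_pretraining.py` script, every path and hyperparameter value below is an illustrative placeholder, not the configuration used for this model:

```python
import subprocess

# Illustrative launch of the official BERT pretraining script
# (https://github.com/google-research/bert) on a TPU. All values below are
# placeholders/assumptions, not the actual hyperparameters of this model.
subprocess.run(
    [
        "python", "run_pretraining.py",
        "--input_file=gs://my-bucket/pretraining_data/*.tfrecord",  # placeholder path
        "--output_dir=gs://my-bucket/hmbert-checkpoints",           # placeholder path
        "--bert_config_file=bert_config_32k.json",                  # config with vocab_size=32000
        "--do_train=True",
        "--train_batch_size=512",        # illustrative value
        "--max_seq_length=512",          # illustrative value
        "--max_predictions_per_seq=75",  # illustrative value
        "--num_train_steps=1000000",     # illustrative value
        "--learning_rate=1e-4",          # illustrative value
        "--use_tpu=True",
        "--tpu_name=my-tpu-v3-32",       # placeholder TPU name
    ],
    check=True,
)
```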
The following plot shows the pretraining loss curve:
Acknowledgments
Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC) program, previously known as
TensorFlow Research Cloud (TFRC). Many thanks for providing access to the TRC ❤️
Thanks to the generous support from the Hugging Face team, it is possible to download both cased and uncased models from their S3 storage 🤗
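A minimal usage sketch for loading the cased model with the `transformers` library (the model identifier is taken from this card; requires `transformers` and `torch`):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load the cased historic multilingual BERT from the Hugging Face Hub.
model_name = "dbmdz/bert-base-historic-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Quick smoke test: run a historic-looking sentence through the model.
inputs = tokenizer("Die Zeitung berichtete im Jahre 1877 über das Ereignis.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)
```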