We provide some statistics for different OCR confidence thresholds, which can be used to shrink the corpus size and obtain less noisy data:
| OCR confidence | Size |
| -------------- | ---- |
| 0.60           | 28GB |
| 0.65           | 18GB |
| 0.70           | 13GB |
For the final corpus we use an OCR confidence of 0.6 (28GB). The following plot shows the tokens-per-year distribution:
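In practice, the filtering step boils down to keeping only documents above the chosen threshold. Below is a minimal sketch, assuming a JSONL export in which every document carries its full text and a mean OCR confidence; the field names and file paths are illustrative, not the exact pipeline used:

```python
import json

MIN_OCR_CONFIDENCE = 0.6  # threshold chosen for the final German corpus

with open("german_europeana.jsonl", encoding="utf-8") as src, \
     open("german_europeana_filtered.txt", "w", encoding="utf-8") as dst:
    for line in src:
        doc = json.loads(line)
        # keep a document only if its mean OCR confidence is good enough
        if doc["ocr_confidence"] >= MIN_OCR_CONFIDENCE:
            dst.write(doc["text"] + "\n")
```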
French Europeana Corpus
As for German, we provide statistics for different OCR confidence thresholds:
| OCR confidence | Size |
| -------------- | ---- |
| 0.60           | 31GB |
| 0.65           | 27GB |
| 0.70           | 27GB |
| 0.75           | 23GB |
| 0.80           | 11GB |
For the final corpus we use an OCR confidence of 0.7 (27GB). The following plot shows the tokens-per-year distribution:
British Library Corpus
Metadata is taken from here. Stats including year filtering:
| Years             | Size |
| ----------------- | ---- |
| ALL               | 24GB |
| >= 1800 && < 1900 | 24GB |
We use the year-filtered variant. The following plot shows the tokens-per-year distribution:
Finnish Europeana Corpus
| OCR confidence | Size  |
| -------------- | ----- |
| 0.60           | 1.2GB |
The following plot shows the tokens-per-year distribution:
Swedish Europeana Corpus
| OCR confidence | Size  |
| -------------- | ----- |
| 0.60           | 1.1GB |
The following plot shows the tokens-per-year distribution:
All Corpora
The following plot shows the tokens-per-year distribution of the complete training corpus:
Multilingual Vocab generation
For the first attempt, we use the first 10GB of each pretraining corpus. We upsample both Finnish and Swedish to ~10GB.
The following table shows the exact sizes used for generating the 32k and 64k subword vocabs:
| Language | Size  |
| -------- | ----- |
| German   | 10GB  |
| French   | 10GB  |
| English  | 10GB  |
| Finnish  | 9.5GB |
| Swedish  | 9.7GB |
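The following is a minimal sketch of how such a cased subword vocab can be trained with the Hugging Face tokenizers library (WordPiece, as used by BERT); the file names and special tokens are assumptions, not the exact configuration used:

```python
from tokenizers import BertWordPieceTokenizer

# One text file per language, e.g. the ~10GB slices listed in the table above
# (file names are placeholders).
corpus_files = [
    "german_10GB.txt",
    "french_10GB.txt",
    "english_10GB.txt",
    "finnish_upsampled.txt",
    "swedish_upsampled.txt",
]

# Cased vocab: keep casing and accents, as in the released cased models.
tokenizer = BertWordPieceTokenizer(lowercase=False, strip_accents=False)

# Train the 32k vocab; rerun with vocab_size=64_000 for the 64k variant.
tokenizer.train(
    files=corpus_files,
    vocab_size=32_000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

tokenizer.save_model(".", "hmbert-32k")
```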
We then calculate the subword fertility rate and the portion of [UNK]s over the following NER corpora:
| Language | NER corpora        |
| -------- | ------------------ |
| German   | CLEF-HIPE, NewsEye |
| French   | CLEF-HIPE, NewsEye |
| English  | CLEF-HIPE          |
| Finnish  | NewsEye            |
| Swedish  | NewsEye            |
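Fertility here is the average number of subwords produced per token, and the unknown portion is the share of subwords that end up as [UNK]. A minimal sketch of this computation, assuming the transformers library and a pre-tokenized corpus (one list of tokens per sentence); the tokenizer id and input format are illustrative:

```python
from transformers import AutoTokenizer

# Tokenizer of the released hmBERT mini model, used here for illustration.
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-mini-historic-multilingual-cased")

def fertility_and_unk(sentences):
    """Return (avg. subwords per token, share of [UNK] subwords)."""
    n_tokens = n_subwords = n_unks = 0
    for tokens in sentences:          # each sentence is a list of tokens
        for token in tokens:
            pieces = tokenizer.tokenize(token)
            n_tokens += 1
            n_subwords += len(pieces)
            n_unks += pieces.count(tokenizer.unk_token)
    return n_subwords / n_tokens, n_unks / n_subwords

# Toy example; in practice the token lists come from the NER corpora above.
sentences = [["Dies", "ist", "ein", "historischer", "Satz", "."]]
fertility, unk_portion = fertility_and_unk(sentences)
print(f"fertility={fertility:.2f}  unk_portion={unk_portion:.4f}")
```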
Breakdown of subword fertility rate and unknown portion per language for the 32k vocab:
| Language | Subword fertility | Unknown portion |
| -------- | ----------------- | --------------- |
| German   | 1.43              | 0.0004          |
| French   | 1.25              | 0.0001          |
| English  | 1.25              | 0.0             |
| Finnish  | 1.69              | 0.0007          |
| Swedish  | 1.43              | 0.0             |
Breakdown of subword fertility rate and unknown portion per language for the 64k vocab:
| Language | Subword fertility | Unknown portion |
| -------- | ----------------- | --------------- |
| German   | 1.31              | 0.0004          |
| French   | 1.16              | 0.0001          |
| English  | 1.17              | 0.0             |
| Finnish  | 1.54              | 0.0007          |
| Swedish  | 1.32              | 0.0             |
Final pretraining corpora
We upsample Swedish and Finnish to ~27GB. The final stats for all pretraining corpora can be seen here:
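The upsampling step itself can be implemented as plain duplication of the smaller corpora until they reach the target size; a minimal sketch of this idea (the duplication strategy, file names and exact target size are assumptions, not necessarily the procedure actually used):

```python
import shutil

def upsample(src_path, dst_path, target_bytes):
    """Repeat the source corpus until the output reaches roughly the target size."""
    with open(dst_path, "wb") as dst:
        while dst.tell() < target_bytes:
            with open(src_path, "rb") as src:
                shutil.copyfileobj(src, dst)

# ~27GB targets for the two low-resource languages (file names are placeholders)
upsample("swedish_europeana.txt", "swedish_upsampled.txt", 27 * 1024**3)
upsample("finnish_europeana.txt", "finnish_upsampled.txt", 27 * 1024**3)
```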
The following plot shows the pretraining loss curve:
Smaller multilingual models
We use the same parameters as used for training the base model.
hmBERT Tiny
The following plot shows the pretraining loss curve for the tiny model:
hmBERT Mini
The following plot shows the pretraining loss curve for the mini model:
hmBERT Small
The following plot shows the pretraining loss curve for the small model:
hmBERT Medium
The following plot shows the pretraining loss curve for the medium model:
English model
The English BERT model - with texts from the British Library corpus - was trained with the Hugging Face
JAX/FLAX implementation for 10 epochs (approx. 1M steps) on a v3-8 TPU, using the following command:
The following plot shows the pretraining loss curve:
Finnish model
The BERT model - with texts from the Finnish part of Europeana - was trained with the Hugging Face
JAX/FLAX implementation for 40 epochs (approx. 1M steps) on a v3-8 TPU, using the following command:
The following plot shows the pretraining loss curve:
Swedish model
The BERT model - with texts from the Swedish part of Europeana - was trained with the Hugging Face
JAX/FLAX implementation for 40 epochs (approx. 660K steps) on a v3-8 TPU, using the following command:
The following plot shows the pretraining loss curve:
Acknowledgments
Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC) program, previously known as
TensorFlow Research Cloud (TFRC). Many thanks for providing access to the TRC ❤️
Thanks to the generous support from the Hugging Face team, it is possible to download both cased and uncased models from their S3 storage 🤗
More Information About the bert-mini-historic-multilingual-cased Model
bert-mini-historic-multilingual-cased is an open-source model published by dbmdz on huggingface.co. It can be tried online for free, called via API (e.g. from Node.js, Python, or plain HTTP), and downloaded for local installation and experimentation. The model's license and further details are available on its huggingface.co model page.
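For reference, a minimal usage sketch with the transformers library (the fill-mask pipeline and the example sentence are illustrative):

```python
from transformers import pipeline

# Load the masked-language-modelling head of the mini model from the Hub.
fill_mask = pipeline(
    "fill-mask",
    model="dbmdz/bert-mini-historic-multilingual-cased",
)

print(fill_mask("Paris is the [MASK] of France."))
```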