We provide some statistics using different OCR confidence thresholds, in order to shrink down the corpus size and use less noisy data:
| OCR confidence | Size |
|----------------|------|
| 0.60 | 28GB |
| 0.65 | 18GB |
| 0.70 | 13GB |
For the final corpus we use an OCR confidence of 0.6 (28GB). The following plot shows the tokens-per-year distribution:
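The exact filtering pipeline is not part of this card. As a minimal sketch of the OCR-confidence threshold filtering described above, assuming the Europeana dumps are stored as JSONL with hypothetical `text` and `ocr_confidence` fields:

```python
import json

# Minimal sketch of OCR-confidence filtering. The JSONL layout and the
# field names ("text", "ocr_confidence") are assumptions, not the actual
# preprocessing pipeline used for this model.
def filter_by_ocr_confidence(in_path: str, out_path: str, threshold: float = 0.60) -> None:
    kept = total = 0
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            total += 1
            doc = json.loads(line)
            if doc["ocr_confidence"] >= threshold:
                dst.write(doc["text"].strip() + "\n")
                kept += 1
    print(f"kept {kept}/{total} documents at OCR confidence >= {threshold}")

# Example: the 0.60 threshold chosen for the final German corpus (paths are placeholders).
# filter_by_ocr_confidence("de_europeana.jsonl", "de_europeana_0.60.txt", threshold=0.60)
```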
French Europeana Corpus
As for German, we use different OCR confidence thresholds:
| OCR confidence | Size |
|----------------|------|
| 0.60 | 31GB |
| 0.65 | 27GB |
| 0.70 | 27GB |
| 0.75 | 23GB |
| 0.80 | 11GB |
For the final corpus we use an OCR confidence of 0.7 (27GB). The following plot shows the tokens-per-year distribution:
British Library Corpus
Metadata is taken from here. Stats including year filtering:
| Years | Size |
|-------|------|
| ALL | 24GB |
| >= 1800 && < 1900 | 24GB |
We use the year-filtered variant. The following plot shows the tokens-per-year distribution:
Finnish Europeana Corpus
| OCR confidence | Size |
|----------------|------|
| 0.60 | 1.2GB |
The following plot shows the tokens-per-year distribution:
Swedish Europeana Corpus
| OCR confidence | Size |
|----------------|------|
| 0.60 | 1.1GB |
The following plot shows the tokens-per-year distribution:
All Corpora
The following plot shows the tokens-per-year distribution of the complete training corpus:
Multilingual Vocab generation
For the first attempt, we use the first 10GB of each pretraining corpus. We upsample both Finnish and Swedish to ~10GB.
The following table shows the exact sizes used for generating the 32k and 64k subword vocabs:
| Language | Size |
|----------|------|
| German | 10GB |
| French | 10GB |
| English | 10GB |
| Finnish | 9.5GB |
| Swedish | 9.7GB |
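The exact vocab-building commands are not shown in this card. A minimal sketch using the Hugging Face `tokenizers` library, assuming one plain-text file per sampled corpus (file names are placeholders):

```python
from tokenizers import BertWordPieceTokenizer

# Sketch of building a cased 32k WordPiece vocab over the five sampled corpora.
# File names are placeholders; the exact tooling and options used for the
# released vocab are not documented in this card.
files = [
    "de_10GB.txt",
    "fr_10GB.txt",
    "en_10GB.txt",
    "fi_9.5GB.txt",
    "sv_9.7GB.txt",
]

tokenizer = BertWordPieceTokenizer(lowercase=False, strip_accents=False)
tokenizer.train(
    files=files,
    vocab_size=32_000,  # use 64_000 for the larger vocab
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".", "hmbert-32k")  # writes hmbert-32k-vocab.txt
```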
We then calculate the subword fertility rate and the portion of `[UNK]`s over the following NER corpora:
| Language | NER corpora |
|----------|-------------|
| German | CLEF-HIPE, NewsEye |
| French | CLEF-HIPE, NewsEye |
| English | CLEF-HIPE |
| Finnish | NewsEye |
| Swedish | NewsEye |
Breakdown of subword fertility rate and unknown portion per language for the 32k vocab:
| Language | Subword fertility | Unknown portion |
|----------|-------------------|-----------------|
| German | 1.43 | 0.0004 |
| French | 1.25 | 0.0001 |
| English | 1.25 | 0.0 |
| Finnish | 1.69 | 0.0007 |
| Swedish | 1.43 | 0.0 |
Breakdown of subword fertility rate and unknown portion per language for the 64k vocab:
| Language | Subword fertility | Unknown portion |
|----------|-------------------|-----------------|
| German | 1.31 | 0.0004 |
| French | 1.16 | 0.0001 |
| English | 1.17 | 0.0 |
| Finnish | 1.54 | 0.0007 |
| Swedish | 1.32 | 0.0 |
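The evaluation script itself is not included here. A rough sketch of the two diagnostics reported above, computed over a whitespace-tokenized word list (subword fertility = produced subwords per input word; unknown portion = fraction of `[UNK]` subwords), assuming a CoNLL-style NER file with one token per line:

```python
from tokenizers import BertWordPieceTokenizer

# Sketch of the two vocab diagnostics:
#   subword fertility = produced subwords / input words
#   unknown portion   = [UNK] subwords / produced subwords
def vocab_stats(vocab_file: str, words: list[str]) -> tuple[float, float]:
    tokenizer = BertWordPieceTokenizer(vocab_file, lowercase=False)
    n_subwords = 0
    n_unk = 0
    for word in words:
        subwords = tokenizer.encode(word, add_special_tokens=False).tokens
        n_subwords += len(subwords)
        n_unk += subwords.count("[UNK]")
    return n_subwords / len(words), n_unk / n_subwords

# Example (file name is a placeholder):
# words = [line.split()[0] for line in open("newseye_fi.conll", encoding="utf-8") if line.strip()]
# fertility, unk_portion = vocab_stats("hmbert-32k-vocab.txt", words)
```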
Final pretraining corpora
We upsample Swedish and Finnish to ~27GB. The final stats for all pretraining corpora can be seen here:
| Language | Size |
|----------|------|
| German | 28GB |
| French | 27GB |
| English | 24GB |
| Finnish | 27GB |
| Swedish | 27GB |
Total size is 130GB.
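The card does not describe how the upsampling is performed. One common approach, given here only as an assumption, is to repeat the smaller corpus until it reaches roughly the target size:

```python
import os

# Sketch of one possible upsampling scheme (an assumption, not the documented
# procedure): repeat a small corpus until it reaches roughly the target size.
def upsample(in_path: str, out_path: str, target_bytes: int) -> None:
    written = 0
    with open(out_path, "w", encoding="utf-8") as dst:
        while written < target_bytes:
            with open(in_path, encoding="utf-8") as src:
                for line in src:
                    dst.write(line)
            dst.flush()
            written = os.path.getsize(out_path)

# Example: bring the small Finnish corpus up to ~27GB (file names are placeholders).
# upsample("fi_europeana_0.60.txt", "fi_europeana_upsampled.txt", 27 * 1024**3)
```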
Pretraining
Multilingual model
We train a multilingual BERT model using the 32k vocab with the official BERT implementation on a v3-32 TPU.
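As a rough sketch of how such a run is launched with the official `run_pretraining.py` script, every path and hyperparameter value below is an illustrative placeholder, not the configuration used for this model:

```python
import subprocess

# Illustrative launch of the official BERT pretraining script
# (https://github.com/google-research/bert) on a TPU. All values below are
# placeholders/assumptions, not the actual hyperparameters of this model.
subprocess.run(
    [
        "python", "run_pretraining.py",
        "--input_file=gs://my-bucket/pretraining_data/*.tfrecord",  # placeholder path
        "--output_dir=gs://my-bucket/hmbert-checkpoints",           # placeholder path
        "--bert_config_file=bert_config_32k.json",                  # config with vocab_size=32000
        "--do_train=True",
        "--train_batch_size=512",        # illustrative value
        "--max_seq_length=512",          # illustrative value
        "--max_predictions_per_seq=75",  # illustrative value
        "--num_train_steps=1000000",     # illustrative value
        "--learning_rate=1e-4",          # illustrative value
        "--use_tpu=True",
        "--tpu_name=my-tpu-v3-32",       # placeholder TPU name
    ],
    check=True,
)
```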
The following plot shows the pretraining loss curve:
Acknowledgments
Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC) program, previously known as
TensorFlow Research Cloud (TFRC). Many thanks for providing access to the TRC ❤️
Thanks to the generous support from the Hugging Face team, it is possible to download both cased and uncased models from their S3 storage 🤗
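A minimal usage sketch for loading the cased model with the `transformers` library (the model identifier is taken from this card; requires `transformers` and `torch`):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load the cased historic multilingual BERT from the Hugging Face Hub.
model_name = "dbmdz/bert-base-historic-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Quick smoke test: run a historic-looking sentence through the model.
inputs = tokenizer("Die Zeitung berichtete im Jahre 1877 über das Ereignis.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)
```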