Albertina 100M PTBR is a foundation large language model for American Portuguese from Brazil. It is an encoder of the BERT family, based on the Transformer neural architecture and developed over the DeBERTa model, with highly competitive performance for this language. It is distributed free of charge and under a most permissive license.
Albertina 100M PTBR is developed by a joint team from the University of Lisbon and the University of Porto, Portugal. For further details, check the respective publication:
```bibtex
@misc{albertina-pt-fostering,
  title={Fostering the Ecosystem of Open Neural Encoders for Portuguese with Albertina PT-* family},
  author={Rodrigo Santos and João Rodrigues and Luís Gomes and João Silva and António Branco and Henrique Lopes Cardoso and Tomás Freitas Osório and Bernardo Leite},
  year={2024},
  eprint={2403.01897},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
Please use the above canonical reference when using or citing this model.
Model Description
This model card is for Albertina 100M PTBR, with 100M parameters, 12 layers and a hidden size of 768. Albertina 100M PTBR is distributed under an MIT license.
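As a quick sanity check of these dimensions, one can inspect the model configuration (a minimal sketch, using the model id that appears in the usage example below):

```python
from transformers import AutoConfig

# Load the configuration only; no model weights are downloaded.
config = AutoConfig.from_pretrained("PORTULAN/albertina-ptbr-base")
print(config.num_hidden_layers, config.hidden_size)  # expected: 12 768
```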
Albertina 100M PTBR was trained over a 3.7 billion token curated selection of documents from the OSCAR data set.
The OSCAR data set includes documents in more than one hundred languages, including Portuguese, and is widely used in the literature. It results from a selection performed over the Common Crawl data set, crawled from the Web: it retains only pages whose metadata indicate permission to be crawled, performs deduplication, and removes some boilerplate, among other filters. Given that OSCAR does not discriminate between Portuguese variants, we performed extra filtering, retaining only documents whose metadata indicate Brazil's Internet country-code top-level domain. We used the January 2023 version of OSCAR, which is based on the November/December 2022 version of Common Crawl.
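The kind of filtering described above can be sketched as follows. This is an illustration only: the dataset id `oscar-corpus/OSCAR-2301` and the WARC metadata layout are assumptions to be checked against the Hub, not the exact pipeline used for this model.

```python
from urllib.parse import urlparse

from datasets import load_dataset

# Assumed dataset id and split for the January 2023 OSCAR release.
oscar_pt = load_dataset("oscar-corpus/OSCAR-2301", "pt",
                        split="train", streaming=True)

def from_brazil(doc):
    # Assumed metadata layout: keep only pages whose URL falls under
    # Brazil's country-code top-level domain (.br).
    uri = doc["meta"]["warc_headers"]["warc-target-uri"]
    host = urlparse(uri).hostname or ""
    return host.endswith(".br")

oscar_br = oscar_pt.filter(from_brazil)
```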
Preprocessing
We filtered the PT-BR corpora using the BLOOM pre-processing pipeline. We skipped the default filtering of stopwords, since it would disrupt the syntactic structure, and also the default filtering for language identification, given that the corpus had been pre-selected as Portuguese.
To train Albertina 100M PTBR, the data set was tokenized with the original DeBERTa tokenizer, with 128-token sequence truncation and dynamic padding. The model was trained using the maximum available memory capacity, resulting in a batch size of 3072 samples (192 samples per GPU). We opted for a learning rate of 1e-5 with linear decay and 10k warm-up steps. The model was trained for a total of 150 epochs, resulting in approximately 180k steps. Training took one day on a2-megagpu-16gb Google Cloud A2 VMs with 16 GPUs, 96 vCPUs and 1,360 GB of RAM.
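A minimal sketch of this tokenization and optimization setup with the `transformers` API follows; using the `microsoft/deberta-base` checkpoint to stand in for "the original DeBERTa tokenizer" is our assumption, and the actual training code is not part of this card.

```python
from torch.optim import AdamW
from transformers import (AutoModel, AutoTokenizer, DataCollatorWithPadding,
                          get_linear_schedule_with_warmup)

# Assumption: the "original DeBERTa tokenizer" is the deberta-base one.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")

def tokenize(batch):
    # 128-token sequence truncation, as described above.
    return tokenizer(batch["text"], truncation=True, max_length=128)

# Dynamic padding: each batch is padded only up to its longest sequence.
collator = DataCollatorWithPadding(tokenizer)

# Learning rate of 1e-5 with 10k warm-up steps and linear decay
# over the roughly 180k total training steps reported above.
model = AutoModel.from_pretrained("microsoft/deberta-base")
optimizer = AdamW(model.parameters(), lr=1e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=10_000, num_training_steps=180_000
)
```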
Evaluation
The base model version was evaluated on downstream tasks, namely on translations into PT-BR of the English data sets used for a few of the tasks in the widely used GLUE benchmark.
GLUE tasks translated
We resorted to PLUE (Portuguese Language Understanding Evaluation), a data set obtained by automatically translating GLUE into PT-BR.
We address four of the tasks in PLUE, namely:
- two similarity tasks: MRPC, for detecting whether two sentences are paraphrases of each other, and STS-B, for semantic textual similarity;
- two inference tasks: RTE, for recognizing textual entailment, and WNLI, for coreference and natural language inference.
| Model | RTE (Accuracy) | WNLI (Accuracy) | MRPC (F1) | STS-B (Pearson) |
|---|---|---|---|---|
| Albertina 900M PTBR No-brWaC | 0.7798 | 0.5070 | 0.9167 | 0.8743 |
| Albertina 900M PTBR | 0.7545 | 0.4601 | 0.9071 | 0.8910 |
| Albertina 100M PTBR | 0.6462 | 0.5493 | 0.8779 | 0.8501 |
How to use
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='PORTULAN/albertina-ptbr-base')
>>> unmasker("A culinária brasileira é rica em sabores e [MASK], tornando-se um dos maiores patrimônios do país.")
[{'score': 0.9391396045684814, 'token': 14690, 'token_str': ' costumes', 'sequence': 'A culinária brasileira é rica em sabores e costumes, tornando-se um dos maiores patrimônios do país.'},
 {'score': 0.04568921774625778, 'token': 29829, 'token_str': ' cores', 'sequence': 'A culinária brasileira é rica em sabores e cores, tornando-se um dos maiores patrimônios do país.'},
 {'score': 0.004134135786443949, 'token': 6696, 'token_str': ' drinks', 'sequence': 'A culinária brasileira é rica em sabores e drinks, tornando-se um dos maiores patrimônios do país.'},
 {'score': 0.0009097770671360195, 'token': 33455, 'token_str': ' nuances', 'sequence': 'A culinária brasileira é rica em sabores e nuances, tornando-se um dos maiores patrimônios do país.'},
 {'score': 0.0008549498743377626, 'token': 606, 'token_str': ' comes', 'sequence': 'A culinária brasileira é rica em sabores e comes, tornando-se um dos maiores patrimônios do país.'}]
```
The model can be used by fine-tuning it for a specific task:
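The following is a minimal fine-tuning sketch for one of the PLUE sentence-pair tasks (MRPC) using the `transformers` Trainer; the dataset id `dlb/plue` and its column names are assumptions to be adapted to the PLUE copy you use.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

model_id = "PORTULAN/albertina-ptbr-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Hypothetical Hub copy of PLUE; substitute the one you actually use.
dataset = load_dataset("dlb/plue", "mrpc")

def tokenize(batch):
    # Sentence-pair input for paraphrase detection.
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="albertina-ptbr-mrpc",
                           learning_rate=2e-5, num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```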
When using or citing this model, kindly cite the canonical publication indicated above.
Acknowledgments
The research reported here was partially supported by: PORTULAN CLARIN—Research Infrastructure for the Science and Technology of Language,
funded by Lisboa 2020, Alentejo 2020 and FCT—Fundação para a Ciência e Tecnologia under the
grant PINFRA/22117/2016; research project ALBERTINA - Foundation Encoder Model for Portuguese and AI, funded by FCT—Fundação para a Ciência e Tecnologia under the
grant CPCA-IAC/AV/478394/2022; innovation project ACCELERAT.AI - Multilingual Intelligent Contact Centers, funded by IAPMEI, I.P. - Agência para a Competitividade e Inovação under the grant C625734525-00462629, of Plano de Recuperação e Resiliência, call RE-C05-i01.01 – Agendas/Alianças Mobilizadoras para a Reindustrialização; and LIACC - Laboratory for AI and Computer Science, funded by FCT—Fundação para a Ciência e Tecnologia under the grant FCT/UID/CEC/0027/2020.