castorini / afriteva_v2_large

huggingface.co
Total runs: 35
24-hour runs: 0
7-day runs: -80
30-day runs: -265
Model's Last Updated: June 30, 2024
text2text-generation

Introduction of afriteva_v2_large

Model Details of afriteva_v2_large

AfriTeVa V2 Large

AfriTeVa V2 Large is a multilingual T5 Version 1.1 model pretrained on Wura with a vocabulary size of 150,000. The model has 1B parameters.

Paper: Better Quality Pretraining Data & T5 Models for African Languages

Authors: Akintunde Oladipo, Mofetoluwa Adeyemi, Orevaoghene Ahia, Abraham Toluwalase Owodunni, Odunayo Ogundepo, David Ifeoluwa Adelani, Jimmy Lin

NOTES:

  • Dropout was turned off during pretraining and should be re-enabled for finetuning (a loading sketch follows this list).
  • Other checkpoints are available here.
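A minimal loading sketch in Python, assuming a standard transformers environment; the dropout_rate=0.1 override and the example prompt are illustrative assumptions rather than values prescribed by the model card (the checkpoint is pretrained only, so raw generations may not be meaningful):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "castorini/afriteva_v2_large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# dropout_rate=0.1 (assumed value) re-enables dropout for finetuning,
# since dropout was turned off during pretraining (see the note above).
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, dropout_rate=0.1)

# Quick sanity check with a T5-style text2text prompt (illustrative only).
inputs = tokenizer("Translate English to Yoruba: Good morning.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))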
Abstract

In this study, we highlight the importance of enhancing the quality of pretraining data in multilingual language models. Existing web crawls have demonstrated quality issues, particularly in the context of low-resource languages. Consequently, we introduce a new multilingual pretraining corpus for 16 African languages, designed by carefully auditing existing pretraining corpora to understand and rectify prevalent quality issues. To compile this dataset, we undertake a rigorous examination of current data sources for thirteen languages within one of the most extensive multilingual web crawls, mC4, and extract cleaner data through meticulous auditing and improved web crawling strategies. Subsequently, we pretrain a new T5-based model on this dataset and evaluate its performance on multiple downstream tasks. Our model demonstrates better downstream effectiveness over existing pretrained models across four NLP tasks, underscoring the critical role data quality plays in pretraining language models in low-resource scenarios. Specifically, on cross-lingual QA evaluation, our new model is more than twice as effective as multilingual T5. All code, data and models are publicly available at https://github.com/castorini/AfriTeVa-keji.

Citation Information
@inproceedings{oladipo-etal-2023-better,
    title = "Better Quality Pre-training Data and T5 Models for {A}frican Languages",
    author = "Oladipo, Akintunde  and
      Adeyemi, Mofetoluwa  and
      Ahia, Orevaoghene  and
      Owodunni, Abraham  and
      Ogundepo, Odunayo  and
      Adelani, David  and
      Lin, Jimmy",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.11",
    pages = "158--168",
    abstract = "In this study, we highlight the importance of enhancing the quality of pretraining data in multilingual language models. Existing web crawls have demonstrated quality issues, particularly in the context of low-resource languages. Consequently, we introduce a new multilingual pretraining corpus for 16 African languages, designed by carefully auditing existing pretraining corpora to understand and rectify prevalent quality issues. To compile this dataset, we undertake a rigorous examination of current data sources for thirteen languages within one of the most extensive multilingual web crawls, mC4, and extract cleaner data through meticulous auditing and improved web crawling strategies. Subsequently, we pretrain a new T5-based model on this dataset and evaluate its performance on multiple downstream tasks. Our model demonstrates better downstream effectiveness over existing pretrained models across four NLP tasks, underscoring the critical role data quality plays in pretraining language models in low-resource scenarios. Specifically, on cross-lingual QA evaluation, our new model is more than twice as effective as multilingual T5. All code, data and models are publicly available at https://github.com/castorini/AfriTeVa-keji.",
}

Runs of castorini afriteva_v2_large on huggingface.co

Total runs: 35
24-hour runs: 0
3-day runs: -85
7-day runs: -80
30-day runs: -265

More Information About afriteva_v2_large huggingface.co Model

afriteva_v2_large is released under the Apache 2.0 license. For license details, visit:

https://choosealicense.com/licenses/apache-2.0

afriteva_v2_large huggingface.co

afriteva_v2_large is an AI model hosted on huggingface.co that exposes the castorini afriteva_v2_large model for immediate use. huggingface.co supports a free trial of the afriteva_v2_large model and also offers paid usage. The model can be called through an API from Node.js, Python, or plain HTTP.
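As a rough sketch of the Python/HTTP call path mentioned above: the endpoint pattern and bearer-token header follow the standard Hugging Face Inference API, while the token placeholder and payload contents are assumptions for illustration.

import requests

API_URL = "https://api-inference.huggingface.co/models/castorini/afriteva_v2_large"
headers = {"Authorization": "Bearer <YOUR_HF_TOKEN>"}  # placeholder token, replace with your own

def query(payload):
    # POST a JSON payload to the hosted inference endpoint and return the parsed JSON response.
    response = requests.post(API_URL, headers=headers, json=payload)
    response.raise_for_status()
    return response.json()

print(query({"inputs": "A short input sentence."}))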

afriteva_v2_large huggingface.co Url

https://huggingface.co/castorini/afriteva_v2_large

castorini afriteva_v2_large online free

afriteva_v2_large on huggingface.co is an online trial and API platform that integrates afriteva_v2_large's model capabilities, including API services, and provides a free online trial. You can try afriteva_v2_large online for free by clicking the link below.

castorini afriteva_v2_large online free url in huggingface.co:

https://huggingface.co/castorini/afriteva_v2_large

afriteva_v2_large install

afriteva_v2_large is an open-source model whose code is available on GitHub, where any user can find and install it for free. At the same time, huggingface.co hosts the model, so users can try and debug afriteva_v2_large directly on huggingface.co. Free installation via the API is also supported.
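A minimal install-and-run sketch, assuming a standard pip environment (package names are the usual transformers dependencies; versions are not pinned by the model card):

# Shell: pip install transformers sentencepiece torch
from transformers import pipeline

# "text2text-generation" matches the task tag listed on the model page.
generator = pipeline("text2text-generation", model="castorini/afriteva_v2_large")
print(generator("A short input sentence.", max_new_tokens=32))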

afriteva_v2_large install url in huggingface.co:

https://huggingface.co/castorini/afriteva_v2_large

Url of afriteva_v2_large

afriteva_v2_large huggingface.co Url:

https://huggingface.co/castorini/afriteva_v2_large

Provider of afriteva_v2_large huggingface.co

castorini (organization)

Other API from castorini

huggingface.co

Total runs: 123
Run Growth: 119
Growth Rate: 96.75%
Updated: November 05, 2021
huggingface.co

Total runs: 83
Run Growth: 0
Growth Rate: 0.00%
Updated: November 21, 2024