fastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices. It was introduced in this paper. The official website can be found here.
This LID (Language IDentification) model is used to predict the language of the input text. The hosted version (lid218e) was released as part of the NLLB project and can detect 217 languages. You can find older versions (ones that can identify 157 languages) on the official fastText website.
Model description
fastText is a library for efficient learning of word representations and sentence classification. fastText is designed to be simple to use for developers, domain experts, and students. It's dedicated to text classification and learning word representations, and was designed to allow for quick model iteration and refinement without specialized hardware. fastText models can be trained on more than a billion words on any multicore CPU in less than a few minutes.
It includes pre-trained models learned on Wikipedia for 157 different languages. fastText can be used as a command-line tool, linked to a C++ application, or used as a library, for use cases ranging from experimentation and prototyping to production.
Intended uses & limitations
You can use pre-trained word vectors for text classification or language identification. See the tutorials and resources on its official website to look for tasks that interest you.
How to use
Here is how to use this model to detect the language of a given text:
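A minimal sketch of the usual workflow, assuming the `fasttext` and `huggingface_hub` Python packages are installed. The repository id is taken from the model name; the filename `model.bin` is an assumption and should be checked against the files listed on the model page. The helper names `clean_label`, `load_lid_model` and `detect_language` are illustrative and not part of fastText.

```python
def clean_label(label: str) -> str:
    """Strip fastText's '__label__' prefix, e.g. '__label__eng_Latn' -> 'eng_Latn'."""
    prefix = "__label__"
    return label[len(prefix):] if label.startswith(prefix) else label


def load_lid_model():
    """Download the model file from the Hub (cached locally) and load it."""
    # Imported lazily so clean_label() works without these packages installed.
    import fasttext
    from huggingface_hub import hf_hub_download

    model_path = hf_hub_download(
        repo_id="facebook/fasttext-language-identification",
        filename="model.bin",  # assumed filename, verify on the model page
    )
    return fasttext.load_model(model_path)


def detect_language(model, text: str, k: int = 1):
    """Return the top-k (language code, probability) pairs for `text`."""
    # predict() works on a single line, so newlines are replaced with spaces.
    labels, probs = model.predict(text.replace("\n", " "), k=k)
    return [(clean_label(label), float(prob)) for label, prob in zip(labels, probs)]


if __name__ == "__main__":
    lid_model = load_lid_model()
    print(detect_language(lid_model, "Hello, world!", k=3))
```

Loading the model once and reusing it across calls avoids repeating the download check and the file load for every input.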
Limitations and bias
Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions.
Cosine similarity can be used to measure the similarity between two different word vectors. If two vectors are identical, the cosine similarity will be 1. For two completely unrelated (orthogonal) vectors, the value will be 0. If two vectors point in opposite directions, the value will be -1.
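The three cases above can be checked with a small, dependency-free sketch. The function name `cosine_similarity` and the toy 2-dimensional vectors are illustrative only; real word vectors from this family of models have 300 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


same = cosine_similarity([3.0, 4.0], [3.0, 4.0])          # identical vectors
orthogonal = cosine_similarity([3.0, 4.0], [-4.0, 3.0])   # unrelated (orthogonal)
opposite = cosine_similarity([3.0, 4.0], [-3.0, -4.0])    # opposite direction
print(same, orthogonal, opposite)
```

Because the measure depends only on direction, scaling either vector by a positive constant leaves the result unchanged.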
Training data
Pre-trained word vectors for 157 languages were trained on Common Crawl and Wikipedia using fastText. These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives. We also distribute three new word analogy datasets, for French, Hindi and Polish.
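As a rough illustration, the hyperparameters listed above map onto keyword arguments of `fasttext.train_unsupervised` as sketched below. This is not the authors' training script: position-weights are left out because this sketch assumes no corresponding argument in the Python bindings, and `CBOW_KWARGS`, `train_word_vectors` and the corpus path are hypothetical names.

```python
# Hyperparameters from the description, expressed as train_unsupervised kwargs.
# Position-weights are NOT represented here (assumed unavailable in the bindings).
CBOW_KWARGS = {
    "model": "cbow",  # CBOW rather than the default skipgram
    "dim": 300,       # vector dimension
    "minn": 5,        # shortest character n-gram
    "maxn": 5,        # longest character n-gram (5 and 5: length exactly 5)
    "ws": 5,          # context window size
    "neg": 10,        # number of negative samples
}


def train_word_vectors(corpus_path: str):
    """Train CBOW vectors on a plain-text corpus, one tokenized sentence per line."""
    import fasttext  # imported lazily so the config above is usable on its own

    return fasttext.train_unsupervised(input=corpus_path, **CBOW_KWARGS)


if __name__ == "__main__":
    # "corpus.txt" is a placeholder path for a tokenized text file.
    model = train_word_vectors("corpus.txt")
    print(model.get_dimension())
```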
Training procedure
Tokenization
We used the Stanford word segmenter for Chinese, MeCab for Japanese and UETsegmenter for Vietnamese. For languages using the Latin, Cyrillic, Hebrew or Greek scripts, we used the tokenizer from the Europarl preprocessing tools. For the remaining languages, we used the ICU tokenizer.
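The selection rule described above can be summarized as a small dispatch function. This is only a sketch of the decision logic: it returns the name of the tool rather than calling it, and the function name, language codes and script labels are hypothetical conventions chosen for illustration.

```python
# Languages with a dedicated segmenter (codes are illustrative ISO 639-1 tags).
DEDICATED_SEGMENTERS = {
    "zh": "Stanford word segmenter",
    "ja": "MeCab",
    "vi": "UETsegmenter",
}

# Scripts handled by the tokenizer from the Europarl preprocessing tools.
EUROPARL_SCRIPTS = {"Latin", "Cyrillic", "Hebrew", "Greek"}


def choose_tokenizer(lang: str, script: str) -> str:
    """Return the name of the tokenizer the rule above selects."""
    # Dedicated segmenters take priority: Vietnamese is written in Latin
    # script but still gets its own segmenter.
    if lang in DEDICATED_SEGMENTERS:
        return DEDICATED_SEGMENTERS[lang]
    if script in EUROPARL_SCRIPTS:
        return "Europarl tokenizer"
    return "ICU tokenizer"


print(choose_tokenizer("vi", "Latin"))
print(choose_tokenizer("fr", "Latin"))
print(choose_tokenizer("hi", "Devanagari"))
```

The order of the checks matters, since a script-first rule would send Vietnamese to the Europarl tokenizer.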
@article{bojanowski2016enriching,
title={Enriching Word Vectors with Subword Information},
author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
journal={arXiv preprint arXiv:1607.04606},
year={2016}
}
@article{joulin2016bag,
title={Bag of Tricks for Efficient Text Classification},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
journal={arXiv preprint arXiv:1607.01759},
year={2016}
}
@article{joulin2016fasttext,
title={FastText.zip: Compressing text classification models},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
journal={arXiv preprint arXiv:1612.03651},
year={2016}
}
If you use these word vectors, please cite the following paper:
@inproceedings{grave2018learning,
title={Learning Word Vectors for 157 Languages},
author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)},
year={2018}
}
(* These authors contributed equally.)