shibing624/macbert4csc-base-chinese

Introduction of macbert4csc-base-chinese

Model Details of macbert4csc-base-chinese

MacBERT for Chinese Spelling Correction (macbert4csc) Model

Chinese spelling correction model.

Evaluation of macbert4csc-base-chinese on the SIGHAN2015 test set:

Char Level: precision:0.9372, recall:0.8640, f1:0.8991
Sentence Level: precision:0.8264, recall:0.7366, f1:0.7789
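
Each reported F1 value is the harmonic mean of the precision and recall on the same line. A minimal sketch that recomputes both from the numbers above (the `f1_score` helper is our own name, not part of pycorrector):

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the SIGHAN2015 evaluation lines above.
char_f1 = f1_score(0.9372, 0.8640)
sent_f1 = f1_score(0.8264, 0.7366)

print(f"char-level f1: {char_f1:.4f}")
print(f"sentence-level f1: {sent_f1:.4f}")
```

Rounded to four decimals, both results agree with the f1 values listed above.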

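Because the training data includes the SIGHAN2015 training set (to reproduce the paper), the model reaches SOTA level on the SIGHAN2015 test set.

The model architecture is a heavily modified version of softmaskedbert.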

Usage

This model is open-sourced as part of the Chinese text correction project pycorrector, which supports the macbert4csc model. Call it as follows:

from pycorrector.macbert.macbert_corrector import MacBertCorrector

nlp = MacBertCorrector("shibing624/macbert4csc-base-chinese").macbert_correct

i = nlp('今天新情很好')
print(i)

You can also call the model directly with huggingface/transformers:

Please use 'Bert' related functions to load this model!

import operator
import torch
from transformers import BertTokenizer, BertForMaskedLM
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = BertTokenizer.from_pretrained("shibing624/macbert4csc-base-chinese")
model = BertForMaskedLM.from_pretrained("shibing624/macbert4csc-base-chinese")
model.to(device)

texts = ["今天新情很好", "你找到你最喜欢的工作，我也很高心。"]
with torch.no_grad():
    outputs = model(**tokenizer(texts, padding=True, return_tensors='pt').to(device))

def get_errors(corrected_text, origin_text):
    sub_details = []
    for i, ori_char in enumerate(origin_text):
        if ori_char in [' ', '“', '”', '‘', '’', '琊', '\n', '…', '—', '擤']:
            # Re-insert characters that the tokenizer drops or decodes as unknown tokens
            corrected_text = corrected_text[:i] + ori_char + corrected_text[i:]
            continue
        if i >= len(corrected_text):
            continue
        if ori_char != corrected_text[i]:
            if ori_char.lower() == corrected_text[i]:
                # Keep the original uppercase letter; the decoded model output is lowercased
                corrected_text = corrected_text[:i] + ori_char + corrected_text[i + 1:]
                continue
            sub_details.append((ori_char, corrected_text[i], i, i + 1))
    sub_details = sorted(sub_details, key=operator.itemgetter(2))
    return corrected_text, sub_details

result = []
for ids, text in zip(outputs.logits, texts):
    _text = tokenizer.decode(torch.argmax(ids, dim=-1), skip_special_tokens=True).replace(' ', '')
    corrected_text = _text[:len(text)]
    corrected_text, details = get_errors(corrected_text, text)
    print(text, ' => ', corrected_text, details)
    result.append((corrected_text, details))
print(result)

output:

今天新情很好  =>  今天心情很好 [('新', '心', 2, 3)]
你找到你最喜欢的工作，我也很高心。  =>  你找到你最喜欢的工作，我也很高兴。 [('心', '兴', 15, 16)]
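
Each tuple in the details list reads as (original character, corrected character, start index, end index), with indices counted in characters of the input text. As a sketch of how a caller might consume that output, the hypothetical helper `apply_details` below (our own name, not part of pycorrector or transformers) rebuilds the corrected sentence from the original text and the reported edits. It assumes single-character substitutions, as in the output above.

```python
from typing import List, Tuple

# (wrong character, corrected character, start index, end index)
Detail = Tuple[str, str, int, int]


def apply_details(origin_text: str, details: List[Detail]) -> str:
    """Apply same-length substitutions to origin_text and return the result."""
    chars = list(origin_text)
    for wrong, right, start, end in details:
        # Sanity check: the span must hold the character reported as wrong.
        if origin_text[start:end] != wrong:
            raise ValueError(f"span {start}:{end} does not contain {wrong!r}")
        # Assumption: each replacement has the same length as the span it replaces.
        if len(right) != end - start:
            raise ValueError("only same-length substitutions are supported")
        chars[start:end] = list(right)
    return "".join(chars)


print(apply_details("今天新情很好", [("新", "心", 2, 3)]))
print(apply_details("你找到你最喜欢的工作，我也很高心。", [("心", "兴", 15, 16)]))
```

Because the substitutions keep the text length unchanged, the indices stay valid no matter in which order the edits are applied.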

Model files:

macbert4csc-base-chinese
    ├── config.json
    ├── added_tokens.json
    ├── pytorch_model.bin
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    └── vocab.txt

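Training datasets

SIGHAN+Wang271K Chinese correction dataset

Dataset	Corpus	Download link	Archive size
`SIGHAN+Wang271K Chinese correction dataset`	SIGHAN+Wang271K (270K entries)	Baidu Netdisk (password: 01b9)	106M
`Original SIGHAN dataset`	SIGHAN13 14 15	official csc.html	339K
`Original Wang271K dataset`	Wang271K	Automatic-Corpus-Generation (provided by dimmywang)	93M

Data format of the SIGHAN+Wang271K Chinese correction dataset: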

[
    {
        "id": "B2-4029-3",
        "original_text": "晚间会听到嗓音，白天的时候大家都不会太在意，但是在睡觉的时候这嗓音成为大家的恶梦。",
        "wrong_ids": [
            5,
            31
        ],
        "correct_text": "晚间会听到噪音，白天的时候大家都不会太在意，但是在睡觉的时候这噪音成为大家的恶梦。"
    },
]
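
In the sample record, `wrong_ids` lists the zero-based character positions at which `original_text` and `correct_text` differ. A short sketch that checks this on the record above; the `diff_positions` helper is our own name, and it assumes both texts have the same length, as they do in the sample:

```python
from typing import List

# The sample record shown above, written as a Python dict.
record = {
    "id": "B2-4029-3",
    "original_text": "晚间会听到嗓音，白天的时候大家都不会太在意，但是在睡觉的时候这嗓音成为大家的恶梦。",
    "wrong_ids": [5, 31],
    "correct_text": "晚间会听到噪音，白天的时候大家都不会太在意，但是在睡觉的时候这噪音成为大家的恶梦。",
}


def diff_positions(original: str, corrected: str) -> List[int]:
    """Return the indices at which two equal-length strings differ."""
    if len(original) != len(corrected):
        raise ValueError("texts must have the same length")
    return [i for i, (a, b) in enumerate(zip(original, corrected)) if a != b]


positions = diff_positions(record["original_text"], record["correct_text"])
print(positions, positions == record["wrong_ids"])
```

A check like this can be run over a whole dataset file to catch records whose annotations and texts disagree.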

macbert4csc
    ├── config.json
    ├── pytorch_model.bin
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    └── vocab.txt

To train macbert4csc, see https://github.com/shibing624/pycorrector/tree/master/pycorrector/macbert

About MacBERT

MacBERT is an improved BERT with a novel MLM as correction (Mac) pre-training task, which mitigates the discrepancy between pre-training and fine-tuning.

Here is an example of our pre-training task.

task	Example
Original Sentence	we use a language model to predict the probability of the next word.
MLM	we use a language [M] to [M] ##di ##ct the pro [M] ##bility of the next word .
Whole word masking	we use a language [M] to [M] [M] [M] the [M] [M] [M] of the next word .
N-gram masking	we use a [M] [M] to [M] [M] [M] the [M] [M] [M] [M] [M] next word .
MLM as correction	we use a text system to ca ##lc ##ulate the po ##si ##bility of the next word .
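
The last row of the table differs from the masking rows in that the selected words are replaced by similar words rather than by an artificial [M] token, so the pre-training input looks more like the text seen during fine-tuning. The toy sketch below illustrates only that substitution step, on whole words rather than WordPiece tokens. The `SIMILAR` lookup table, the chosen positions, and the `mac_corrupt` function are illustrative assumptions, not the actual MacBERT pre-training procedure.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical similar-word table, hand-written to mirror the example row above.
SIMILAR: Dict[str, str] = {
    "language": "text",
    "model": "system",
    "predict": "calculate",
    "probability": "possibility",
}


def mac_corrupt(
    words: List[str], positions: Set[int], similar: Dict[str, str]
) -> Tuple[List[str], Dict[int, str]]:
    """Replace the words at the given positions with similar words.

    Returns the corrupted word list and a {position: original word} map,
    which is what the model would be trained to recover.
    """
    corrupted = list(words)
    labels: Dict[int, str] = {}
    for i in sorted(positions):
        replacement = similar.get(words[i])
        if replacement is None:
            continue  # no similar word known: leave this position untouched
        corrupted[i] = replacement
        labels[i] = words[i]
    return corrupted, labels


words = "we use a language model to predict the probability of the next word .".split()
corrupted, labels = mac_corrupt(words, {3, 4, 6, 8}, SIMILAR)
print(" ".join(corrupted))
print(labels)
```

The corrupted sentence contains no mask token, and the training target is the original word at each replaced position.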

In addition to the new pre-training task, we also incorporate the following techniques.

Whole Word Masking (WWM)
N-gram masking
Sentence-Order Prediction (SOP)

Note that MacBERT can directly replace the original BERT, as there are no differences in the main neural architecture.

For more technical details, please check our paper: Revisiting Pre-trained Models for Chinese Natural Language Processing

Citation

@software{pycorrector,
  author = {Xu Ming},
  title = {pycorrector: Text Error Correction Tool},
  year = {2021},
  url = {https://github.com/shibing624/pycorrector},
}

License

Apache-2.0: https://choosealicense.com/licenses/apache-2.0

Model page on huggingface.co

https://huggingface.co/shibing624/macbert4csc-base-chinese
