bigcode/starpii: StarPII model on huggingface.co

Introduction of starpii

Model Details of starpii

StarPII

Model description

This is an NER model trained to detect Personally Identifiable Information (PII) in code datasets. We fine-tuned bigcode-encoder on a PII dataset we annotated, available with gated access at bigcode-pii-dataset (see bigcode-pii-dataset-training for the exact data splits). We added a linear layer as a token classification head on top of the encoder model, with 6 target classes: Names, Emails, Keys, Passwords, IP addresses and Usernames.
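For orientation, a token-classification model of this kind is commonly called through the Hugging Face transformers pipeline and its predictions are then used to mask the detected spans. The sketch below is illustrative only: the helper names detect_pii and mask_pii are hypothetical, the label strings (for example EMAIL) are assumed rather than taken from the model card, and loading the model may require accepting its access terms.

```python
from typing import Dict, List


def detect_pii(text: str) -> List[Dict]:
    """Run the model on `text` and return aggregated entity spans."""
    # Imported lazily so that mask_pii below works without transformers installed.
    from transformers import pipeline

    tagger = pipeline(
        "token-classification",
        model="bigcode/starpii",
        aggregation_strategy="simple",  # merge sub-word tokens into spans
    )
    # Each result is a dict with "entity_group", "score", "word", "start", "end".
    return tagger(text)


def mask_pii(text: str, entities: List[Dict]) -> str:
    """Replace every detected span with a <LABEL> placeholder."""
    # Go from right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"<{ent['entity_group']}>"
        text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text
```

A typical call would be mask_pii(source, detect_pii(source)); mask_pii can also be applied to spans that have already been filtered by post-processing.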

Dataset

Fine-tuning on the annotated dataset

The fine-tuning dataset contains 20,961 secrets across 31 programming languages, whereas the base encoder model was pre-trained on 88 programming languages from The Stack dataset.

Initial training on a pseudo-labelled dataset

To enhance model performance on some rare PII entities like keys, we initially trained on a pseudo-labeled dataset before fine-tuning on the annotated dataset. The method involves training a model on a small set of labeled data and subsequently generating predictions for a larger set of unlabeled data.

Specifically, we annotated 18,000 files, available at bigcode-pii-ppseudo-labeled, using an ensemble of two encoder models, Deberta-v3-large and stanford-deidentifier-base, which were fine-tuned on an internal, previously labeled PII dataset for code with 400 files from this work. To select good-quality pseudo-labels, we computed the average of the two models' probability logits and filtered on a minimum score. After inspection, we observed a high rate of false positives for Keys and Passwords, so we retained only those entities that had a trigger word such as key, auth or pwd in the surrounding context. Training on this synthetic dataset prior to fine-tuning on the annotated one yielded superior results for all PII categories, as demonstrated in the table in the following section.
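The pseudo-label selection described above can be sketched as follows. This is a minimal illustration, not the authors' code: the threshold MIN_SCORE, the context window CONTEXT_CHARS, the label strings, and the span fields score_a and score_b (one score per ensemble model) are all assumptions made for the example.

```python
from typing import Dict

TRIGGER_WORDS = ("key", "auth", "pwd")  # trigger words named in the text
NEEDS_TRIGGER = {"KEY", "PASSWORD"}     # assumed label names
MIN_SCORE = 0.5                         # assumed threshold, not from the source
CONTEXT_CHARS = 50                      # assumed size of the surrounding context


def keep_pseudo_label(text: str, span: Dict) -> bool:
    """Decide whether one ensemble prediction is kept as a pseudo-label."""
    # Average the scores of the two models and apply the minimum score.
    avg_score = (span["score_a"] + span["score_b"]) / 2
    if avg_score < MIN_SCORE:
        return False
    # Keys and passwords additionally need a trigger word near the span.
    if span["label"] in NEEDS_TRIGGER:
        left = text[max(0, span["start"] - CONTEXT_CHARS): span["start"]]
        right = text[span["end"]: span["end"] + CONTEXT_CHARS]
        context = (left + right).lower()
        return any(word in context for word in TRIGGER_WORDS)
    return True
```

The context is taken from both sides of the span and excludes the span itself, so a secret that happens to contain the letters "key" does not count as its own trigger.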

Performance

This model is represented in the last row (NER + pseudo labels).

Emails, IP addresses and Keys

Method	Email address			IP address			Key
	Prec.	Recall	F1	Prec.	Recall	F1	Prec.	Recall	F1
Regex	69.8%	98.8%	81.8%	65.9%	78%	71.7%	2.8%	46.9%	5.3%
NER	94.01%	98.10%	96.01%	88.95%	94.43%	91.61%	60.37%	53.38%	56.66%
+ pseudo labels	97.73%	98.94%	98.15%	90.10%	93.86%	91.94%	62.38%	80.81%	70.41%

Names, Usernames and Passwords

Method	Name			Username			Password
	Prec.	Recall	F1	Prec.	Recall	F1	Prec.	Recall	F1
NER	83.66%	95.52%	89.19%	48.93%	75.55%	59.39%	59.16%	96.62%	73.39%
+ pseudo labels	86.45%	97.38%	91.59%	52.20%	74.81%	61.49%	70.94%	95.96%	81.57%
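As a reading aid for the tables above, F1 is conventionally the harmonic mean of precision and recall. The short sketch below, which assumes the reported scores follow that convention, reproduces the F1 value of the Key column for the "+ pseudo labels" row from its precision and recall; the function name f1_score is chosen for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Key column, "+ pseudo labels" row: precision 62.38%, recall 80.81%.
key_f1 = f1_score(62.38, 80.81)
print(f"{key_f1:.2f}%")  # → 70.41%
```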

We used this model to mask PII in the bigcode large model training. We dropped usernames since they resulted in many false positives and negatives. For the other PII types, we added the following post-processing that we recommend for future uses of the model (the code is also available on GitHub):

Ignore secrets with fewer than 4 characters.
Detect full names only.
Ignore detected keys that have fewer than 9 characters or that are not gibberish according to a gibberish-detector.
Ignore IP addresses that are invalid or private (not internet-facing), using the ipaddress Python package. We also ignore IP addresses of popular DNS servers, using the same list as in this paper.
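The post-processing rules listed above could look roughly like the sketch below. It is an approximation under stated assumptions, not the code published on GitHub: the label strings are assumed, "full name" is interpreted as at least two words, the DNS server set is a small placeholder rather than the list from the paper, and the gibberish check is passed in as a callable because the gibberish-detector API is not reproduced here.

```python
import ipaddress
from typing import Callable

# Placeholder only: the source uses a list from a paper that is not shown here.
POPULAR_DNS_SERVERS = {"8.8.8.8", "8.8.4.4", "1.1.1.1"}


def is_public_ip(value: str) -> bool:
    """True for a valid, non-private address that is not a popular DNS server."""
    try:
        ip = ipaddress.ip_address(value)
    except ValueError:
        return False  # not a valid IPv4/IPv6 address
    return not ip.is_private and value not in POPULAR_DNS_SERVERS


def keep_entity(
    label: str,
    value: str,
    is_gibberish: Callable[[str], bool] = lambda s: True,  # stand-in detector
) -> bool:
    """Apply the recommended filters to one detected entity."""
    if len(value) < 4:
        return False  # secrets shorter than 4 characters are ignored
    if label == "USERNAME":
        return False  # usernames were dropped entirely
    if label == "NAME":
        return len(value.split()) >= 2  # assumed meaning of "full name"
    if label == "KEY":
        return len(value) >= 9 and is_gibberish(value)
    if label == "IP_ADDRESS":
        return is_public_ip(value)
    return True
```

In practice the default is_gibberish stand-in, which accepts every string, would be replaced by a real detector so that readable identifiers are not masked as keys.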

Considerations for Using the Model

While using this model, please be aware that there may be potential risks associated with its application. There is a possibility of false positives and negatives, which could lead to unintended consequences when processing sensitive data. Moreover, the model's performance may vary across different data types and programming languages, necessitating validation and fine-tuning for specific use cases. Researchers and developers are expected to uphold ethical standards and data protection measures when using the model. By making it openly accessible, our aim is to encourage the development of privacy-preserving AI technologies while remaining vigilant of potential risks associated with PII.


More Information About starpii huggingface.co Model

starpii huggingface.co

starpii on huggingface.co is a hosted version of the bigcode starpii model that can be used directly on the site. huggingface.co supports a free trial of the model as well as paid use, and the model can be called through an API from Node.js, Python, or plain HTTP.

starpii huggingface.co Url

https://huggingface.co/bigcode/starpii

