This is an NER model trained to detect Personally Identifiable Information (PII) in code datasets. We fine-tuned bigcode-encoder on a PII dataset we annotated, available with gated access at bigcode-pii-dataset (see bigcode-pii-dataset-training for the exact data splits).
We added a linear layer as a token classification head on top of the encoder model, with 6 target classes: Names, Emails, Keys, Passwords, IP addresses and Usernames.
Dataset
Fine-tuning on the annotated dataset
The fine-tuning dataset contains 20961 secrets across 31 programming languages, whereas the base encoder model was pre-trained on 88 programming languages from The Stack dataset.
Initial training on a pseudo-labelled dataset
To enhance model performance on some rare PII entities like keys, we initially trained on a pseudo-labeled dataset before fine-tuning on the annotated dataset.
The method involves training a model on a small set of labeled data and subsequently generating predictions for a larger set of unlabeled data.
Specifically, we annotated 18,000 files, available at bigcode-pii-ppseudo-labeled, using an ensemble of two encoder models, Deberta-v3-large and stanford-deidentifier-base, which were fine-tuned on an internal, previously labeled PII dataset for code with 400 files from this work.
To select good-quality pseudo-labels, we averaged the probability logits of the two models and kept only predictions above a minimum score.
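The selection step can be sketched as follows. This is a minimal illustration, not the authors' code: the entity dictionaries, the exact-span matching between the two models, the `select_pseudo_labels` name, and the `min_score` value of 0.8 are all assumptions made for the example.

```python
from statistics import mean


def select_pseudo_labels(predictions_a, predictions_b, min_score=0.8):
    """Keep entities found by both models whose averaged score reaches min_score.

    Each prediction is a dict with "start", "end", "label" and "score".
    Matching entities by identical span and label is an assumption of this
    sketch; the threshold value is hypothetical.
    """
    scores_b = {(p["start"], p["end"], p["label"]): p["score"] for p in predictions_b}
    selected = []
    for pred in predictions_a:
        key = (pred["start"], pred["end"], pred["label"])
        if key not in scores_b:
            continue  # the second model did not predict this span; skip it
        avg = mean([pred["score"], scores_b[key]])
        if avg >= min_score:
            selected.append({**pred, "score": avg})
    return selected
```

Raising `min_score` trades pseudo-label volume for precision, so the threshold would normally be tuned by inspecting a sample of the retained entities.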
After inspection, we observed a high rate of false positives for Keys and Passwords, hence we retained only the entities that had a trigger word like "key", "auth" and "pwd" in the surrounding context.
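A trigger-word filter of this kind might look like the sketch below. Only the three trigger words named above come from the text; the label names `KEY` and `PASSWORD`, the character window size, and both function names are hypothetical choices for illustration.

```python
# The three trigger words named in the text; the complete list is not given.
TRIGGER_WORDS = ("key", "auth", "pwd")


def has_trigger_word(text, start, end, window=20):
    """Return True if a trigger word occurs within `window` characters of the span.

    The window size is a hypothetical choice for this sketch.
    """
    before = text[max(0, start - window):start]
    after = text[end:end + window]
    context = (before + " " + after).lower()
    return any(word in context for word in TRIGGER_WORDS)


def filter_by_trigger(text, entities, window=20):
    """Drop KEY and PASSWORD entities that have no trigger word nearby.

    Entities with other labels pass through unchanged.
    """
    kept = []
    for ent in entities:
        needs_trigger = ent["label"] in ("KEY", "PASSWORD")
        if needs_trigger and not has_trigger_word(text, ent["start"], ent["end"], window):
            continue
        kept.append(ent)
    return kept
```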
Training on this pseudo-labeled dataset prior to fine-tuning on the annotated one improved the F1 score for all PII categories, as demonstrated in the tables in the following section.
Performance
This model is represented in the last row of each table (NER + pseudo labels).
Emails, IP addresses and Keys

| Method | Email Prec. | Email Recall | Email F1 | IP Prec. | IP Recall | IP F1 | Key Prec. | Key Recall | Key F1 |
|---|---|---|---|---|---|---|---|---|---|
| Regex | 69.8% | 98.8% | 81.8% | 65.9% | 78% | 71.7% | 2.8% | 46.9% | 5.3% |
| NER | 94.01% | 98.10% | 96.01% | 88.95% | 94.43% | 91.61% | 60.37% | 53.38% | 56.66% |
| NER + pseudo labels | 97.73% | 98.94% | 98.15% | 90.10% | 93.86% | 91.94% | 62.38% | 80.81% | 70.41% |
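Names, Usernames and Passwords

| Method | Name Prec. | Name Recall | Name F1 | Username Prec. | Username Recall | Username F1 | Password Prec. | Password Recall | Password F1 |
|---|---|---|---|---|---|---|---|---|---|
| NER | 83.66% | 95.52% | 89.19% | 48.93% | 75.55% | 59.39% | 59.16% | 96.62% | 73.39% |
| NER + pseudo labels | 86.45% | 97.38% | 91.59% | 52.20% | 74.81% | 61.49% | 70.94% | 95.96% | 81.57% |
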
We used this model to mask PII in the training data of the BigCode large model. We dropped usernames since they resulted in many false positives and negatives.
For the other PII types, we added the following post-processing, which we recommend for future uses of the model (the code is also available on GitHub):
- Ignore secrets with fewer than 4 characters.
- Detect full names only.
- Ignore detected keys that have fewer than 9 characters or that are not gibberish, using a gibberish-detector.
- Ignore IP addresses that aren't valid or are private (non-internet facing), using the ipaddress Python package. We also ignore IP addresses from popular DNS servers, using the same list as in this paper.
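The rules above can be sketched as a single filter function. This is an illustrative approximation and not the code published on GitHub: the label names, the `POPULAR_DNS_SERVERS` entries (a stand-in for the list from the cited paper), the reading of "full name" as at least two whitespace-separated words, and the `is_gibberish` callback (a placeholder for a real gibberish detector) are all assumptions.

```python
import ipaddress

# Hypothetical stand-in; the actual list is the one from the cited paper.
POPULAR_DNS_SERVERS = {"8.8.8.8", "8.8.4.4", "1.1.1.1", "1.0.0.1"}


def keep_ip(value):
    """Keep only valid, public IP addresses that are not well-known DNS servers."""
    try:
        ip = ipaddress.ip_address(value)
    except ValueError:
        return False  # not a valid IPv4 or IPv6 address
    if ip.is_private:
        return False  # private or otherwise non-internet-facing range
    return str(ip) not in POPULAR_DNS_SERVERS


def keep_entity(label, value, is_gibberish=lambda s: True):
    """Apply the post-processing rules to one detected entity.

    `is_gibberish` is a placeholder for a gibberish detector; by default it
    accepts everything. Label names are hypothetical.
    """
    if len(value) < 4:
        return False  # secrets shorter than 4 characters are ignored
    if label == "NAME":
        # Assumption: a "full name" has at least two whitespace-separated words.
        return len(value.split()) >= 2
    if label == "KEY":
        return len(value) >= 9 and is_gibberish(value)
    if label == "IP_ADDRESS":
        return keep_ip(value)
    return True
```

Entities for which `keep_entity` returns `False` would be left unmasked, which reduces false positives at the cost of missing some short or ambiguous secrets.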
Considerations for Using the Model
While using this model, please be aware that there may be potential risks associated with its application.
There is a possibility of false positives and negatives, which could lead to unintended consequences when processing sensitive data.
Moreover, the model's performance may vary across different data types and programming languages, necessitating validation and fine-tuning for specific use cases.
Researchers and developers are expected to uphold ethical standards and data protection measures when using the model. By making it openly accessible,
our aim is to encourage the development of privacy-preserving AI technologies while remaining vigilant of potential risks associated with PII.