Master Llama 2 Tokenizer: Padding, Prompt Format & More

Table of Contents

  1. Introduction
  2. Basics of the Llama 2 Tokenizer
  3. Setting up Padding for the Tokenizer
  4. Adding Special Tokens
  5. Tokenizing with BOS and EOS Tokens
  6. Defining a Padding Token
  7. Using Pad Tokens for Training
  8. Advanced Features: Mask Tokens
  9. The Prompt Format in Llama 2
  10. Conclusion

Introduction

In the world of language models, time moves quickly. It has already been a few months since the release of Llama 2, and there is still plenty of confusion surrounding its tokenizer and padding setup. This article provides a practical guide to the basics of the Llama 2 tokenizer and how to set up padding correctly. Whether you are a seasoned user or new to Llama 2, it will help you navigate the fundamentals with ease.

Basics of the Llama 2 Tokenizer

Llama 2's tokenizer has a vocabulary of 32,000 tokens, representing whole words and sub-word pieces. In addition, there are special tokens that play important roles, such as the beginning-of-sequence (BOS) token, written <s>, and the end-of-sequence (EOS) token, written </s>. These tokens mark the start and end of the sequences passed to the language model. Notably, the tokenizer does not include a masking or padding token by default, which we will address later.
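
You can inspect these defaults yourself with the Hugging Face transformers library. This is a minimal sketch, assuming you have access to the meta-llama/Llama-2-7b-hf checkpoint on the Hugging Face Hub (any Llama 2 checkpoint shares the same tokenizer):

    from transformers import AutoTokenizer

    # Loads the Llama 2 tokenizer; the checkpoint is gated, so access must be granted.
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

    print(tokenizer.vocab_size)  # 32000
    print(tokenizer.bos_token)   # <s>
    print(tokenizer.eos_token)   # </s>
    print(tokenizer.unk_token)   # <unk>
    print(tokenizer.pad_token)   # None -- no padding token is defined by default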

Setting up Padding for the Tokenizer

When fine-tuning Llama 2, one of the first things to sort out is the padding token, since the tokenizer does not ship with one. Padding tokens let you pad sequences to a uniform length, which is essential when working with batched data. To add one, you define a new token and register it with the tokenizer. Keep in mind that the model's embedding matrix must also be resized so that the new token id has an embedding to look up.
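
A rough sketch of those two steps with the transformers library; the <pad> string is an arbitrary choice, and the model name is assumed:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

    # Register a dedicated padding token; this grows the tokenizer by one entry.
    tokenizer.add_special_tokens({"pad_token": "<pad>"})

    # Resize the model's embedding matrix so the new token id has an embedding row,
    # and record the pad token id in the model config.
    model.resize_token_embeddings(len(tokenizer))
    model.config.pad_token_id = tokenizer.pad_token_id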

Adding Special Tokens

In addition to the padding token, the Llama 2 tokenizer includes an unknown token, <unk>, which stands in for anything that cannot be mapped to the vocabulary. Whether the beginning-of-sequence (BOS) token is added automatically is controlled by the add_special_tokens option: when it is set to true, the tokenizer prepends the BOS token to the sequence. The end-of-sequence (EOS) token is never added automatically.

Tokenizing with BOS and EOS Tokens

Tokenizing with BOS (beginning-of-sequence) and EOS (end-of-sequence) tokens gives the language model explicit information about where a sequence starts and ends. With add_special_tokens set to true, the tokenizer automatically prepends the BOS token to the tokenized sequence, marking its start. The EOS token, however, has to be appended manually to mark the end of a sequence.
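
A small sketch of the difference, again assuming the meta-llama/Llama-2-7b-hf tokenizer:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    text = "Hello, world"

    # With add_special_tokens=True the BOS token (id 1) is prepended automatically.
    with_bos = tokenizer(text, add_special_tokens=True)["input_ids"]

    # With add_special_tokens=False only the raw text tokens are returned.
    plain = tokenizer(text, add_special_tokens=False)["input_ids"]

    # The EOS token (id 2) is not appended automatically; add it yourself when the
    # sequence is complete, for example at the end of a training sample.
    with_eos = with_bos + [tokenizer.eos_token_id]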

Defining a Padding Token

In Llama, there is no predefined padding token. However, you can define a new padding token and add it to the tokenizer's vocabulary. By defining a new padding token, you can ensure that sequences are padded to a uniform length, which is crucial for training and fine-tuning the language model.
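
Once a padding token is registered, padding=True brings every sequence in a batch up to the length of the longest one. A short sketch, where the <pad> string is an illustrative choice:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    tokenizer.add_special_tokens({"pad_token": "<pad>"})

    batch = ["A short prompt", "A somewhat longer prompt that produces more tokens"]
    encoded = tokenizer(batch, padding=True, return_tensors="pt")

    print(encoded["input_ids"].shape)  # every row now has the same length
    print(encoded["attention_mask"])   # 0 marks the padded positions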

Using Pad Tokens for Training

When training or fine-tuning a model, pad tokens become even more important. Batches of data typically contain sequences of varying lengths, and padding tokens let you bring them all up to a common length. While it is common to reuse the end-of-sequence token as the pad token, this can be confusing because one token then serves two purposes; for instance, if padded positions are excluded from the loss, the model may never properly learn when to emit the end-of-sequence token. An alternative is to reuse the unknown token (UNK) as the pad token, which keeps the end-of-sequence token's role unambiguous.
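
A minimal sketch of the UNK-as-pad approach; because no new token is added, the model's embedding matrix does not need to be resized:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

    # Reuse the existing <unk> token as the padding token.
    tokenizer.pad_token = tokenizer.unk_token

    encoded = tokenizer(
        ["Short example", "A longer example used for fine-tuning"],
        padding=True,
        return_tensors="pt",
    )
    # The attention mask already tells the model to ignore padded positions,
    # so which token id fills them matters little in practice.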

Advanced Features: Mask Tokens

Masking is a more advanced feature that comes into play during training. Label masking lets you exclude certain tokens from the training loss, so the model is only trained on the tokens you care about; for example, you can mask the first few tokens of a sequence so that the loss is computed only on the tokens that follow them. The attention mask, in turn, tells the model which positions, such as padding, should be ignored when it attends over the sequence.
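
In the transformers library this is usually realized with two tensors: an attention mask (1 for real tokens, 0 for positions to ignore, such as padding) and a label tensor in which ignored positions are set to -100 so the loss skips them. A sketch, with the prompt and answer strings purely illustrative:

    import torch
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

    prompt = "Translate to French: cheese"
    answer = " fromage"

    prompt_ids = tokenizer(prompt, add_special_tokens=True)["input_ids"]
    answer_ids = tokenizer(answer, add_special_tokens=False)["input_ids"] + [tokenizer.eos_token_id]

    input_ids = torch.tensor([prompt_ids + answer_ids])
    labels = input_ids.clone()

    # Set the prompt positions to -100 so the loss ignores them and the model
    # is trained only on the tokens that follow the prompt.
    labels[:, : len(prompt_ids)] = -100

    attention_mask = torch.ones_like(input_ids)  # 1 = attend; 0 would mark padding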

The Prompt Format in Llama 2

Llama 2 uses its own prompt format, which differs from the format used by OpenAI's chat models. It relies on specific markers to delimit instructions and system messages: an instruction is wrapped in [INST] and [/INST], while a system message is wrapped in <<SYS>> and <</SYS>>. These markers are not special tokens in the vocabulary; they are plain text that structures the prompt in the way Llama 2 expects. Understanding this format is crucial for using the model effectively.
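
A single-turn prompt assembled by hand might look like the sketch below; the system and user strings are only examples:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

    system = "You are a helpful assistant."
    user = "Explain padding tokens in one sentence."

    # [INST] and <<SYS>> are ordinary text, not vocabulary entries. The BOS token
    # is added by the tokenizer, so it is not written into the string here.
    prompt = f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

    input_ids = tokenizer(prompt, add_special_tokens=True)["input_ids"]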

Conclusion

In this article, we explored the fundamentals of the Llama 2 tokenizer and learned how to set up padding correctly. We discussed the importance of special tokens like the BOS and EOS tokens, and how to add a padding token to the tokenizer's vocabulary. We also touched on more advanced topics such as masking and the prompt format that Llama 2 expects. Armed with this knowledge, you can confidently navigate the tokenizer and padding setup in Llama 2 to achieve optimal results.

Highlights

  • The Llama 2 tokenizer has a vocabulary of 32,000 tokens, covering whole words and sub-word pieces.
  • Special tokens like BOS and EOS indicate the start and end of a sequence.
  • Adding special tokens and defining a padding token are crucial steps in setting up the tokenizer.
  • Mask tokens offer advanced training capabilities by allowing the model to ignore or focus on specific tokens.
  • Llama 2 uses its own prompt format, with [INST] and <<SYS>> markers structuring instructions and system messages.

FAQ

Q: Can I use the end-of-sequence (EOS) token as a padding token?

A: While it is possible to use the EOS token as a padding token, it can lead to confusion because the token then serves a dual purpose. It is recommended to define a dedicated padding token, or to reuse the UNK token, for clarity and consistency in your training and fine-tuning processes.

Q: How do I handle multi-turn conversations in Llama?

A: In Llama 2, multi-turn conversations are handled by structuring the prompt in a specific way. Each user instruction is wrapped in [INST] and [/INST], the system message is wrapped in <<SYS>> and <</SYS>> inside the first instruction, and each completed exchange is enclosed in <s> and </s>. This lets the model follow the conversation flow and generate appropriate responses, as sketched below.
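
A rough sketch of a two-turn prompt, with all strings purely illustrative:

    system = "You are a concise assistant."
    first_question = "What is a padding token?"
    first_answer = "It fills short sequences so a batch has uniform length."
    new_question = "And why must the model's embeddings be resized?"

    # Each completed exchange is enclosed in <s> ... </s>, and the system prompt
    # is folded into the first user message. In practice the <s> and </s> markers
    # are usually added as token ids rather than written as literal text.
    prompt = (
        f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{first_question} [/INST] "
        f"{first_answer} </s>"
        f"<s>[INST] {new_question} [/INST]"
    )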

Q: Can I ignore certain tokens during training?

A: Yes, you can use mask tokens to ignore specific tokens during training. This is particularly useful when you want to focus on tokens that follow the masked ones or if you want to exclude certain tokens from influencing the next token prediction.

Q: Where can I find more information about Llama 2 tokenizer?

A: You can refer to the public GitHub repository of Trellis Research, which provides a comprehensive guide on setting up and using the Llama 2 tokenizer. Additionally, the original Llama repository on GitHub contains valuable resources and information about the tokenizer.
