Building Your Own Tokenizer with 🤗 Transformers & 🤗 Tokenizers

Table of Contents:

  1. ์ด ๊ธ€์˜ ๋ชฉ์ ๊ณผ ๋‚ด์šฉ
  2. ์ž๊ธฐ์†Œ๊ฐœ: ๋ฃจ์‹œ ์†”๋‹ˆ์—, ๊ธฐ๊ณ„ ํ•™์Šต ์—”์ง€๋‹ˆ์–ด
  3. Tokenizer์˜ ๊ฐœ๋…๊ณผ ์—ญํ• 
  4. Tokenizer์˜ ์ข…๋ฅ˜์™€ ์‚ฌ์šฉ ๋ฐฉ๋ฒ• 4.1 Pre-trained Tokenizer ์‚ฌ์šฉํ•˜๊ธฐ 4.1.1 ์ ํ•ฉํ•œ Pre-trained Tokenizer ์ฐพ๊ธฐ 4.1.2 Pre-trained Tokenizer ํ™œ์šฉ ๋ฐฉ๋ฒ• 4.2 Custom Tokenizer ๋งŒ๋“ค๊ธฐ 4.2.1 Wordpiece ๊ธฐ๋ฐ˜ Tokenizer 4.2.2 ๋‹ค๋ฅธ Tokenizer ์œ ํ˜•
  5. Tokenizer ์„ ํƒ ์‹œ ๊ณ ๋ ค์‚ฌํ•ญ 5.1 ๋ฌธ์žฅ ๊ธธ์ด์™€ ํšจ์œจ์„ฑ 5.2 ์–ดํœ˜ ์‚ฌ์ „๊ณผ ํŠน์ˆ˜ ํ† ํฐ
  6. Tokenizer์˜ ์ค‘์š”์„ฑ๊ณผ ์˜ํ–ฅ๋ ฅ
  7. Big Science ํ”„๋กœ์ ํŠธ์—์„œ์˜ ์—ญํ• 
  8. ๊ฐ Tokenizer ์œ ํ˜•์˜ ์žฅ๋‹จ์ 
  9. Tokenizer ๊ด€๋ จ ์ž์ฃผ ๋ฌป๋Š” ์งˆ๋ฌธ (FAQ) 9.1 Pre-trained Tokenizer์™€ ๋ฐ์ดํ„ฐ ์ ํ•ฉ์„ฑ 9.2 Custom Tokenizer ์ œ์ž‘ ์‹œ ์–ด๋–ค ๋ฐ์ดํ„ฐ ์‚ฌ์šฉ? 9.3 Tokenizer์˜ ์„ฑ๋Šฅ๊ณผ ๋ชจ๋ธ ํ›ˆ๋ จ์˜ ๊ด€๊ณ„
  10. ๋งˆ๋ฌด๋ฆฌ ๋ฐ ์ฐธ๊ณ ์ž๋ฃŒ

2. ์ž๊ธฐ์†Œ๊ฐœ: ๋ฃจ์‹œ ์†”๋‹ˆ์—, ๊ธฐ๊ณ„ ํ•™์Šต ์—”์ง€๋‹ˆ์–ด

Hello, everyone. I'm Lucile Saulnier. I work as a machine learning engineer at Hugging Face, where I use and develop open-source tools and take part in a range of research projects in natural language processing (NLP). In particular, I am working on an interesting project on collaborative training, which means training a transformer model from scratch using many computers spread around the world. You can read more about it on our blog. I am also actively involved in the BigScience project, a one-year research effort that aims to train a large language model similar to GPT-3. The goal is for this model to offer richer insight than GPT-3. Today I will explain how to build your own tokenizer using the Hugging Face Transformers and Tokenizers libraries. I will see you again at the Q&A session.


3. Tokenizer์˜ ๊ฐœ๋…๊ณผ ์—ญํ• 

์ž, ์ด์ œ๋ถ€ํ„ฐ๋Š” tokenizer(ํ† ํฌ๋‚˜์ด์ €)์— ๋Œ€ํ•ด ์•Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. tokenizer๋Š” ์–ด๋–ค ์—ญํ• ์„ ํ•˜๋Š”์ง€ ๊ฐ„๋žตํ•˜๊ฒŒ ์„ค๋ช…ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. tokenizer๋Š” ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ์™€ ์–ธ์–ด ๋ชจ๋ธ ์‚ฌ์ด์— ์œ„์น˜ํ•œ ๊ตฌ์„ฑ์š”์†Œ์ž…๋‹ˆ๋‹ค. ์ด ๊ตฌ์„ฑ์š”์†Œ๋Š” ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ ์–ธ์–ด ๋ชจ๋ธ์ด ์˜ˆ์ƒํ•œ ํ˜•์‹์œผ๋กœ ์ค€๋น„ํ•˜๋Š” ์—ญํ• ์„ ๋‹ด๋‹นํ•ฉ๋‹ˆ๋‹ค. ๊ตฌ์ฒด์ ์œผ๋กœ ๋งํ•˜๋ฉด, tokenizer๋Š” ํ…์ŠคํŠธ๋ฅผ ์ž‘์€ ๋‹จ์œ„์ธ ํ† ํฐ์˜ ์‹œํ€€์Šค๋กœ ๋ถ„ํ• ํ•˜๋Š” ์—ญํ• ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ํ•˜๋‚˜์˜ ํ† ํฐ์€ ๊ณ ์œ ํ•œ ๋ฒˆํ˜ธ์™€ ๋งคํ•‘๋˜๋Š” ์–ดํœ˜ ์‚ฌ์ „(vocabulary)์— ์†ํ•ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ tokenizer๋Š” ํ…์ŠคํŠธ๋ฅผ ์ˆซ์ž์˜ ์‹œํ€€์Šค๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ์—ญํ• ์„ ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ์–ธ์–ด ๋ชจ๋ธ์ด ์ˆซ์ž๋งŒ์„ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›์„ ์ˆ˜ ์žˆ๋Š” ํ•จ์ˆ˜์ด๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. tokenizer๋Š” ์—ฌ๋Ÿฌ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•˜๋Š”๋ฐ, ์ด๋Š” ํ† ํฐํ™”(normalization)๋ถ€ํ„ฐ ํ›„์ฒ˜๋ฆฌ(post-processing)์— ์ด๋ฅด๊ธฐ๊นŒ์ง€ ๋ชจ๋“  ๋ถ„ํ•  ๊ณผ์ •์„ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๋Ÿฌ๋ถ„์ด ์–ธ์–ด ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•  ๋•Œ tokenizer๋ฅผ ์–ด๋–ป๊ฒŒ ๋‹ค๋ฃจ์–ด์•ผ ํ•˜๋Š”์ง€์— ๋Œ€ํ•œ ๊ณ ๋ฏผ์ด ์žˆ์„ ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ฒซ ๋ฒˆ์งธ ๊ฐ€๋Šฅ์„ฑ์€ ์ด๋ฏธ ํ•„์š”ํ•œ ์š”๊ตฌ ์‚ฌํ•ญ์— ๋งž๋Š” ํ›ˆ๋ จ๋œ tokenizer๊ฐ€ ์žˆ๋Š”์ง€ ํ™•์ธํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋งŒ์ผ ์ด๋ฏธ ์ ํ•ฉํ•œ tokenizer๊ฐ€ ์žˆ๋‹ค๋ฉด ๋ฌธ์ œ๊ฐ€ ํ•ด๊ฒฐ๋ฉ๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๊ทธ๋ ‡์ง€ ์•Š์€ ๊ฒฝ์šฐ ์ ์ ˆํ•œ tokenizer๋ฅผ ์ฐพ์•„ ๋ณ€ํ˜•ํ•˜๊ฑฐ๋‚˜ ์™„์ „ํžˆ ์ƒˆ๋กœ์šด tokenizer๋ฅผ ์„ค๊ณ„ํ•˜์—ฌ ํ›ˆ๋ จํ•ด์•ผ ํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค. ๋‹คํ–‰ํžˆ๋„, Hugging Face ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋Š” ์„ธ ๊ฐ€์ง€ ๊ฒฝ์šฐ ๋ชจ๋‘ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋„๋ก ์ค€๋น„๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฐฉ๊ธˆ ์ „์— ์ถœ์‹œ๋œ ๊ฐ•์ขŒ์—๋Š” tokenizer์— ๋Œ€ํ•œ ์ž์„ธํ•œ ๋‚ด์šฉ์ด ์žˆ์œผ๋ฉฐ, ์ง์ ‘ ์ž์‹ ์˜ tokenizer๋ฅผ ์–ด๋–ป๊ฒŒ ์–ป์–ด์•ผ ํ•˜๋Š”์ง€์— ๋Œ€ํ•œ ์ดํ•ด๋ฅผ ๋•์Šต๋‹ˆ๋‹ค. ์ด ๊ฐ•์ขŒ ์ด์ „์— ๋ช‡ ๊ฐ€์ง€ ์˜ˆ์‹œ๋ฅผ ํ†ตํ•ด Hugging Face์˜ ๋…ธํŠธ๋ถ์„ ์‚ฌ์šฉํ•˜์—ฌ ์ƒˆ๋กœ์šด tokenizer๋ฅผ ์‹คํ—˜ํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•ด ๋ณผ ๊ฒƒ์ž…๋‹ˆ๋‹ค. 
Before we begin, you need to install the Hugging Face Transformers and Tokenizers libraries, along with dependencies such as SentencePiece. I ran the downloads ahead of time, so the libraries are already installed here. We will look at the two cases covered in this session: the first is using an already trained tokenizer, and the second is building a new tokenizer yourself. Let's get started.
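As a rough preview of the second case, the sketch below trains a tiny WordPiece tokenizer with the Tokenizers library. It assumes the `tokenizers` package is installed, and the in-memory corpus, the `vocab_size` of 100, the lowercase normalizer, and the list of special tokens are illustrative choices rather than settings taken from the talk.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Hypothetical toy corpus; a real tokenizer needs far more text.
corpus = [
    "A tokenizer sits between the text and the language model.",
    "The tokenizer splits text into tokens.",
    "Each token is mapped to a unique number in the vocabulary.",
]

# Start from an untrained WordPiece model with an unknown-token fallback.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = Lowercase()      # normalization step
tokenizer.pre_tokenizer = Whitespace()  # split on words and punctuation

# The trainer learns the vocabulary; vocab_size is an upper bound.
trainer = WordPieceTrainer(
    vocab_size=100,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]"],
    show_progress=False,
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Encode a sentence and inspect both the tokens and their ids.
encoding = tokenizer.encode("The tokenizer maps text to numbers.")
print(encoding.tokens)
print(encoding.ids)
```

With a corpus this small the learned vocabulary is not meaningful; the point is only to show the order of the steps: pick a model, attach a normalizer and pre-tokenizer, train, then encode.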
