
[Book Study] 3. The Hugging Face Transformers Library

by isdawell 2024. 12. 16.

 

 

๐Ÿ‘€ LLM์„ ํ™œ์šฉํ•œ ์‹ค์ „ AI ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜ ๊ฐœ๋ฐœ ์ฑ… ์Šคํ„ฐ๋”” ์ •๋ฆฌ ์ž๋ฃŒ 

(์ €์ž‘๊ถŒ ๋ฌธ์ œ์‹œ ์ž ๊ธˆํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค~!)

 

 

 

โ–ถ๏ธ ์‹ค์Šต์ฝ”๋“œ (๊นƒํ—™ ๊ณต๊ฐœ์ฝ”๋“œ ํ™œ์šฉ)

โ–ถ๏ธ ๊นƒํ—™์ฝ”๋“œ

 

 

 

 

 

1. ํ—ˆ๊น…ํŽ˜์ด์Šค ํŠธ๋žœ์Šคํฌ๋จธ๋ž€ 


 

โ—ฏ  Transformer ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ 

 

 

•   ๊ณตํ†ต๋œ ์ธํ„ฐํŽ˜์ด์Šค๋กœ ํŠธ๋žœ์Šคํฌ๋จธ ๋ชจ๋ธ์„ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก ์ง€์›ํ•˜์—ฌ ๋”ฅ๋Ÿฌ๋‹ ๋ถ„์•ผ์˜ ํ•ต์‹ฌ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์ค‘ ํ•˜๋‚˜ (์˜คํ”ˆ์†Œ์Šค ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ) 

•   ๋ชจ๋ธ๊ณผ ํ† ํฌ๋‚˜์ด์ € → transformer ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ : AutoTokenizer, AutoModel

•   ๋ฐ์ดํ„ฐ์…‹ → datasets ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ 

 

 

from transformers import AutoTokenizer, AutoModel

text = "What is Huggingface Transformers?"

# Using a BERT model
bert_model = AutoModel.from_pretrained("bert-base-uncased")
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded_input = bert_tokenizer(text, return_tensors="pt")
bert_output = bert_model(**encoded_input)


# Using a GPT-2 model: same interface, only the model ID changes
gpt_model = AutoModel.from_pretrained("gpt2")
gpt_tokenizer = AutoTokenizer.from_pretrained("gpt2")
encoded_input = gpt_tokenizer(text, return_tensors="pt")
gpt_output = gpt_model(**encoded_input)

 

 

•   The output is the input text embedded as numeric vectors; see the shape check below
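
For example, a minimal shape check of the outputs above (my own addition; last_hidden_state is the standard attribute on these output objects):

# Per-token embeddings: shape (batch_size, sequence_length, hidden_size)
print(bert_output.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])
print(gpt_output.last_hidden_state.shape)   # seq_len depends on each tokenizer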

 

 

 

 

 


2. ํ—ˆ๊น…ํŽ˜์ด์Šค ํ—ˆ๋ธŒ ํƒ์ƒ‰ํ•˜๊ธฐ 


 

โ—ฏ  ํ—ˆ๊น…ํŽ˜์ด์Šค ํ—ˆ๋ธŒ 

 

https://huggingface.co/

 


 

 

•   ๋‹ค์–‘ํ•œ ์‚ฌ์ „ ํ•™์Šต ๋ชจ๋ธ๊ณผ ๋ฐ์ดํ„ฐ์…‹์„ ํƒ์ƒ‰ํ•˜๊ณ  ์‰ฝ๊ฒŒ ๋ถˆ๋Ÿฌ์™€ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก ์ œ๊ณตํ•˜๋Š” ์˜จ๋ผ์ธ ํ”Œ๋žซํผ

•   Spaces: a feature for hosting a demo of your own model and trying out other people's models

 


โ—ฏ  ๋ชจ๋ธํ—ˆ๋ธŒ

 

https://huggingface.co/models

 

 

•   ์–ด๋–ค ์ž‘์—…์— ์‚ฌ์šฉํ•˜๋Š”์ง€, ์–ด๋–ค ์–ธ์–ด๋กœ ํ•™์Šต๋œ ๋ชจ๋ธ์ธ์ง€ ๋‹ค์–‘ํ•œ ๊ธฐ์ค€์œผ๋กœ ๋ชจ๋ธ์„ ๋ถ„๋ฅ˜. NLP, CV, Audio, Multimodal ๋“ฑ ๋‹ค์–‘ํ•œ ์ž‘์—… ๋ถ„์•ผ์˜ ๋ชจ๋ธ์„ ์ œ๊ณตํ•œ๋‹ค.

 

 

 

https://huggingface.co/google/gemma-7b

 

 

•   ๋ชจ๋ธ์˜ ์ด๋ฆ„ ๋ฐ ์š”์•ฝ ์ •๋ณด, ๋ชจ๋ธ ์„ค๋ช… ์นด๋“œ, ๋ชจ๋ธ ํŠธ๋ Œ๋“œ, ๋ชจ๋ธ ์ถ”๋ก  ํ…Œ์ŠคํŠธ ๋“ฑ์„ ํ™•์ธํ•ด๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 

 

 

โ—ฏ  ๋ฐ์ดํ„ฐ์…‹ํ—ˆ๋ธŒ 

 

 

•   ๋ฐ์ดํ„ฐ์…‹ํฌ๊ธฐ, ์œ ํ˜• ๋“ฑ์ด ์ถ”๊ฐ€๋˜์–ด์žˆ๋‹ค. 

 

https://huggingface.co/datasets/klue/klue

 

 

•   KLUE: short for Korean Language Understanding Evaluation, a benchmark dataset developed to evaluate model performance on a variety of tasks such as text classification, machine reading comprehension, and sentence similarity. It contains eight datasets, including the MRC and YNAT data, each split into train/validation/test sets.

 

 

 

โ—ฏ  ์ŠคํŽ˜์ด์Šค

 

 

 

•   ๋ชจ๋ธ ๋ฐ๋ชจ๋ฅผ ๊ฐ„ํŽธํ•˜๊ฒŒ ๊ณต๊ฐœํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐ๋Šฅ์ด๋‹ค. ์ŠคํŽ˜์ด์Šค๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ๋ณ„๋„์˜ ๋ณต์žกํ•œ ์›นํŽ˜์ด์ง€ ๊ฐœ๋ฐœ ์—†์ด ๋ชจ๋ธ ๋ฐ๋ชจ๋ฅผ ๋ฐ”๋กœ ๊ณต์œ ํ•  ์ˆ˜ ์žˆ๋‹ค. 

 

 

https://huggingface.co/spaces/kadirnar/Yolov9

 

 

•   ๊ฐ์ฒด์ธ์‹ ๋ชจ๋ธ Yolo v9 ํ™”๋ฉด์œผ๋กœ, ์™ผ์ชฝ์— ๋ชจ๋ธ ์ถ”๋ก ์— ์‚ฌ์šฉํ•  ์ด๋ฏธ์ง€๋ฅผ ์—…๋กœ๋“œํ•  ์ˆ˜ ์žˆ๋Š” ์˜์—ญ๊ณผ ์‚ฌ์šฉํ•  ๋ชจ๋ธ์˜ ์ข…๋ฅ˜, ์ถ”๋ก ์— ์‚ฌ์šฉํ•  ๋ชจ๋ธ ์„ค์ •์„ ์„ ํƒํ•  ์ˆ˜ ์žˆ๋Š” ์˜์—ญ์ด ์žˆ๋‹ค. 

 

 

๋ชจ๋ธ ์ถ”๋ก  ๊ฒฐ๊ณผ ์˜ˆ์‹œ

 

 

 

 

 

โ—ฏ  LLM ๋ฆฌ๋”๋ณด๋“œ

 

•    ์˜์–ด ๋ฐ์ดํ„ฐ ๋ฆฌ๋”๋ณด๋“œ

 

Open LLM Leaderboard - a Hugging Face Space by open-llm-leaderboard

 


 

 

•   ํ•œ๊ตญ์–ด๋ฐ์ดํ„ฐ ๋ฆฌ๋”๋ณด๋“œ 

 

Open Ko-LLM Leaderboard - a Hugging Face Space by upstage

 


 

 

 

•   ํ‘œ์— ๋‚˜ํƒ€๋‚  ๋ฒค์น˜๋งˆํฌ ํ•ญ๋ชฉ ๋ฆฌ์ŠคํŠธ : ko-GPQA, Average ๋“ฑ

•   ๋ชจ๋ธ ํ•™์Šต ๋ฐฉ์‹ : pretrained, fine-tuned ๋“ฑ 

•   ๋ชจ๋ธ ํฌ๊ธฐ, ํŒŒ๋ผ๋ฏธํ„ฐ ๋ฐ์ดํ„ฐ ํ˜•์‹ ๋“ฑ 

 

 

 

 

 

 


3. ํ—ˆ๊น…ํŽ˜์ด์Šค ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์‚ฌ์šฉ๋ฒ• ์ตํžˆ๊ธฐ 


 

โ—ฏ ๋ฐ”๋””์™€ ํ—ค๋“œ

 

•   ํ—ˆ๊น…ํŽ˜์ด์Šค๋Š” ๋ชจ๋ธ์„ Body์™€ Head๋กœ ๊ตฌ๋ถ„ํ•œ๋‹ค. ๊ฐ™์€ Body๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด์„œ, ๋‹ค๋ฅธ ์ž‘์—…์— ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด์„œ์ด๋‹ค. 

•   ๊ฐ€๋ น ๊ฐ™์€ BERT body๋ฅผ ์‚ฌ์šฉํ•˜์ง€๋งŒ, ํ…์ŠคํŠธ ๋ถ„๋ฅ˜ ํ—ค๋“œ, ํ† ํฐ ๋ถ„๋ฅ˜ ํ—ค๋“œ ๋“ฑ ์ž‘์—… ์ข…๋ฅ˜์— ๋”ฐ๋ผ ์„œ๋กœ ๋‹ค๋ฅธ ํ—ค๋“œ๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค. 

 

 

•   ๋ชจ๋ธ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ 

 

from transformers import AutoModel
model_id = 'klue/roberta-base'
model = AutoModel.from_pretrained(model_id)

 

 

•   ๋ถ„๋ฅ˜ ํ—ค๋“œ๊ฐ€ ํฌํ•จ๋œ ๋ชจ๋ธ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ 

 

from transformers import AutoModelForSequenceClassification
model_id = 'SamLowe/roberta-base-go_emotions'
classification_model = AutoModelForSequenceClassification.from_pretrained(model_id)

 

 

•   ๋ถ„๋ฅ˜ ํ—ค๋“œ๊ฐ€ ๋žœ๋ค์œผ๋กœ ์ดˆ๊ธฐํ™”๋œ ๋ชจ๋ธ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ 

 

from transformers import AutoModelForSequenceClassification
model_id = 'klue/roberta-base'
# The checkpoint has no classification head, so the head weights are
# randomly initialized (Transformers prints a warning to that effect).
classification_model = AutoModelForSequenceClassification.from_pretrained(model_id)
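
To make the body/head distinction concrete, here is a minimal sketch (my own) comparing the outputs of the two models loaded above: the body alone returns per-token embeddings, while the model with a head returns one logit per class.

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('klue/roberta-base')
inputs = tokenizer("토크나이저 테스트", return_tensors='pt')

with torch.no_grad():
    # Body only: per-token embeddings, shape (1, seq_len, 768)
    print(model(**inputs).last_hidden_state.shape)
    # Body + classification head: logits, shape (1, num_labels) -- 2 by default here
    print(classification_model(**inputs).logits.shape)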

 

 

 

 

โ—ฏ  ํ† ํฌ๋‚˜์ด์ € ํ™œ์šฉํ•˜๊ธฐ

 

•   ํ…์ŠคํŠธ๋ฅผ ํ† ํฐ ๋‹จ์œ„๋กœ ๋‚˜๋ˆ„๊ณ  ๊ฐ ํ† ํฐ์„ ๋Œ€์‘ํ•˜๋Š” ํ† ํฐ ์•„์ด๋””๋กœ ๋ณ€ํ™˜, ํ•„์š”ํ•œ ๊ฒฝ์šฐ ํŠน์ˆ˜ ํ† ํฐ์„ ์ถ”๊ฐ€ 

•   ํ† ํฌ๋‚˜์ด์ €๋„ ํ•™์Šต ๋ฐ์ดํ„ฐ๋ฅผ ํ†ตํ•ด ์–ดํœ˜ ์‚ฌ์ „์„ ๊ตฌ์ถ•ํ•˜๋ฏ€๋กœ ๋ชจ๋ธ๊ณผ ํ•จ๊ป˜ ์ €์žฅํ•˜๋Š” ๊ฒƒ์ด ์ผ๋ฐ˜์  

•   ๋ชจ๋ธ๊ณผ ํ† ํฌ๋‚˜์ด์ €๋ฅผ ๋ถˆ๋Ÿฌ์˜ค๋Š” ๊ฒฝ์šฐ ๋™์ผํ•œ ๋ชจ๋ธ ์•„์ด๋””๋ฅผ ์„ค์ • 

 

from transformers import AutoTokenizer
model_id = 'klue/roberta-base'
tokenizer = AutoTokenizer.from_pretrained(model_id)

 

tokenized = tokenizer("ํ† ํฌ๋‚˜์ด์ €๋Š” ํ…์ŠคํŠธ๋ฅผ ํ† ํฐ ๋‹จ์œ„๋กœ ๋‚˜๋ˆˆ๋‹ค")
print(tokenized)
# {'input_ids': [0, 9157, 7461, 2190, 2259, 8509, 2138, 1793, 2855, 5385, 2200, 20950, 2],
#  'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

print(tokenizer.convert_ids_to_tokens(tokenized['input_ids']))
# ['[CLS]', 'ํ† ํฌ', '##๋‚˜์ด', '##์ €', '##๋Š”', 'ํ…์ŠคํŠธ', '##๋ฅผ', 'ํ† ', '##ํฐ', '๋‹จ์œ„', '##๋กœ', '๋‚˜๋ˆˆ๋‹ค', '[SEP]']

print(tokenizer.decode(tokenized['input_ids']))
# [CLS] ํ† ํฌ๋‚˜์ด์ €๋Š” ํ…์ŠคํŠธ๋ฅผ ํ† ํฐ ๋‹จ์œ„๋กœ ๋‚˜๋ˆˆ๋‹ค [SEP]

print(tokenizer.decode(tokenized['input_ids'], skip_special_tokens=True))
# ํ† ํฌ๋‚˜์ด์ €๋Š” ํ…์ŠคํŠธ๋ฅผ ํ† ํฐ ๋‹จ์œ„๋กœ ๋‚˜๋ˆˆ๋‹ค

 

 

 

โ—ฏ  ๋ฐ์ดํ„ฐ์…‹ ํ™œ์šฉํ•˜๊ธฐ

 

from datasets import load_dataset
klue_mrc_dataset = load_dataset('klue', 'mrc')
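
load_dataset returns a DatasetDict keyed by split, and each example is a plain Python dict. A quick way to inspect it (my own addition; output abbreviated):

print(klue_mrc_dataset)
# DatasetDict({
#     train: Dataset({features: ['title', 'context', 'question', 'answers', ...], num_rows: ...})
#     validation: Dataset({...})
# })
print(klue_mrc_dataset['train'][0])  # a single example as a dict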

 

 

 

 

 

4. ๋ชจ๋ธ ํ•™์Šตํ•˜๊ธฐ 


 

โ—ฏ  ์—ฐํ•ฉ๋‰ด์Šค ๋ฐ์ดํ„ฐ ์‹ค์Šต 

 

•   ํ•œ๊ตญ์–ด ๊ธฐ์‚ฌ ์ œ๋ชฉ์„ ๋ฐ”ํƒ•์œผ๋กœ ๊ธฐ์‚ฌ์˜ ์นดํ…Œ๊ณ ๋ฆฌ๋ฅผ ๋ถ„๋ฅ˜ํ•˜๋Š” ํ…์ŠคํŠธ ๋ถ„๋ฅ˜ ๋ชจ๋ธ 

•   ํŠธ๋ ˆ์ด๋„ˆ API๋ฅผ ์‚ฌ์šฉํ•ด ํ•™์Šต 

 

 

1)   ๋ฐ์ดํ„ฐ ๊ฐ€์ ธ์˜ค๊ธฐ 

 

from datasets import load_dataset
klue_tc_train = load_dataset('klue', 'ynat', split='train')
klue_tc_eval = load_dataset('klue', 'ynat', split='validation')
klue_tc_train  # inspect the dataset (e.g. in a notebook)
klue_tc_train.features['label'].names
# ['IT과학', '경제', '사회', '생활문화', '세계', '스포츠', '정치']

 

 

 

2)   ๋ฐ์ดํ„ฐ ์ค€๋น„

 

# Subsample 10,000 training examples from the full train split
train_dataset = klue_tc_train.train_test_split(test_size=10000, shuffle=True, seed=42)['test']
# Carve 1,000 test examples out of the validation split,
dataset = klue_tc_eval.train_test_split(test_size=1000, shuffle=True, seed=42)
test_dataset = dataset['test']
# then another 1,000 validation examples from the remainder
valid_dataset = dataset['train'].train_test_split(test_size=1000, shuffle=True, seed=42)['test']

 

 

3)   Training with the Trainer

 

import torch
import numpy as np
from transformers import (
    Trainer,
    TrainingArguments,
    AutoModelForSequenceClassification,
    AutoTokenizer
)

# 1) ์ค€๋น„

def tokenize_function(examples):
    return tokenizer(examples["title"], padding="max_length", truncation=True)

model_id = "klue/roberta-base"
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=len(train_dataset.features['label'].names))
tokenizer = AutoTokenizer.from_pretrained(model_id)

train_dataset = train_dataset.map(tokenize_function, batched=True)
valid_dataset = valid_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)


# 2) ํ•™์Šต ํŒŒ๋ผ๋ฏธํ„ฐ์™€ ํ‰๊ฐ€ํ•จ์ˆ˜ ์ •์˜ 

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
    learning_rate=5e-5,
    push_to_hub=False
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}
    

# 3) ํ•™์Šต ์ง„ํ–‰ 
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()

trainer.evaluate(test_dataset)  # accuracy ≈ 0.84

 

 

 

•   If you don't use the Trainer API, you have to implement the training loop yourself in PyTorch (which does deepen your understanding of how the model code actually runs); a rough sketch follows below.
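
For reference, a bare-bones version of what Trainer does internally might look roughly like this (a sketch under the setup above, reusing the tokenized train_dataset; it omits evaluation, checkpointing, learning-rate scheduling, and everything else Trainer handles for you):

import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)
model.train()

# Keep only the tensor columns the model expects
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

optimizer = AdamW(model.parameters(), lr=5e-5)

for epoch in range(1):
    for batch in train_loader:
        optimizer.zero_grad()
        outputs = model(
            input_ids=batch['input_ids'].to(device),
            attention_mask=batch['attention_mask'].to(device),
            labels=batch['label'].to(device),  # the model computes the loss itself
        )
        outputs.loss.backward()
        optimizer.step()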

 

 

 

 


5. ๋ชจ๋ธ ์ถ”๋ก ํ•˜๊ธฐ 


 

โ—ฏ  ํŒŒ์ดํ”„๋ผ์ธ์„ ํ™œ์šฉํ•œ ์ถ”๋ก  

 

•   ๋ชจ๋ธ์„ ํ™œ์šฉํ•˜๊ธฐ ์‰ฝ๋„๋ก ์ถ”์ƒํ™”ํ•œ ํŒŒ์ดํ”„๋ผ์ธ์„ ํ™œ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•๊ณผ ์ง์ ‘ ๋ชจ๋ธ๊ณผ ํ† ํฌ๋‚˜์ด์ €๋ฅผ ๋ถˆ๋Ÿฌ์™€ ํ™œ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ• ์กด์žฌ 

 

 

•   ํŒŒ์ดํ”„๋ผ์ธ ํ™œ์šฉ ์ถ”๋ก 

 

from transformers import pipeline

# Replace <your-username> with your own Hugging Face Hub username
model_id = "<your-username>/roberta-base-klue-ynat-classification"

model_pipeline = pipeline("text-classification", model=model_id)

# Returns a list of dicts like {'label': ..., 'score': ...}
model_pipeline(test_dataset["title"][:5])

 

 

 

•   You can also define and implement a custom pipeline yourself (using a class and functions), as sketched below
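
A hypothetical sketch of such a custom pipeline, loading the model and tokenizer directly and mapping logits back to label names (the class and method names are my own):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class CustomPipeline:
    def __init__(self, model_id):
        self.model = AutoModelForSequenceClassification.from_pretrained(model_id)
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model.eval()

    def __call__(self, texts):
        inputs = self.tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
        with torch.no_grad():
            logits = self.model(**inputs).logits
        # id2label maps class indices back to readable label names
        return [self.model.config.id2label[int(i)] for i in logits.argmax(dim=-1)]

# custom_pipeline = CustomPipeline(model_id)
# custom_pipeline(test_dataset["title"][:5])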

 

 

 

 


๋Œ“๊ธ€