๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
1๏ธโƒฃ AI•DS/๐ŸŒ LLM

[์ฑ…์Šคํ„ฐ๋””] 10-(2). ์‹ค์Šต : ์˜๋ฏธ๊ฒ€์ƒ‰ ๊ตฌํ˜„ํ•˜๊ธฐ

by isdawell 2025. 9. 19.
โ˜€๏ธ  Summary 

โ—  ํ…์ŠคํŠธ์— ์˜๋ฏธ๋ฅผ ๋‹ด์•„ ์ˆซ์ž๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๋ฐฉ๋ฒ• : ์›ํ•ซ์ธ์ฝ”๋”ฉ, BoW, TF-IDF, Word2Vec(๋ฐ€์ง‘์ž„๋ฒ ๋”ฉ)
โ—  ๋ฌธ์žฅ์ž„๋ฒ ๋”ฉ๋ฐฉ์‹ : ๊ต์ฐจ์ธ์ฝ”๋”, ๋ฐ”์ด์ธ์ฝ”๋”(BERT,ํ’€๋ง์ธต), Sentence-Transformers 
โ—  ์˜๋ฏธ๊ฒ€์ƒ‰, ํ‚ค์›Œ๋“œ๊ฒ€์ƒ‰, ํ•˜์ด๋ธŒ๋ฆฌ๋“œ๊ฒ€์ƒ‰ 

 

 

 


3.   ์˜๋ฏธ๊ฒ€์ƒ‰ ๊ตฌํ˜„ํ•˜๊ธฐ 


 

3.1  ์˜๋ฏธ๊ฒ€์ƒ‰๊ตฌํ˜„ํ•˜๊ธฐ

 

โ  ์˜๋ฏธ๊ฒ€์ƒ‰

 โ†ช๏ธŽ  ๋‹จ์ˆœํžˆ ํ‚ค์›Œ๋“œ ๋งค์นญ์„ ํ†ตํ•œ ๊ฒ€์ƒ‰์ด ์•„๋‹ˆ๋ผ, ๋ฐ€์ง‘ ์ž„๋ฒ ๋”ฉ์„ ์ด์šฉํ•ด ๋ฌธ์žฅ์ด๋‚˜ ๋ฌธ์„œ์˜ ์˜๋ฏธ๋ฅผ ๊ณ ๋ คํ•œ ๊ฒ€์ƒ‰์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ 

 

โ  faiss

 โ†ช๏ธŽ  ๋ฉ”ํƒ€๊ฐ€ ๊ฐœ๋ฐœํ•œ ๋ฒกํ„ฐ ์—ฐ์‚ฐ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋กœ, ์ฝ”์‚ฌ์ธ์œ ์‚ฌ๋„, ์œ ํด๋ฆฌ๋“œ ๊ฑฐ๋ฆฌ ๋“ฑ ๊ธฐ๋ณธ์ ์ธ ๋ฐฉ๋ฒ•์„ ์ง€์›ํ•  ๋ฟ ์•„๋‹ˆ๋ผ, ๋ฒกํ„ฐ ๊ฒ€์ƒ‰ ์†๋„๋ฅผ ํ–ฅ์ƒํ•ด์ฃผ๋Š” ANN์•Œ๊ณ ๋ฆฌ์ฆ˜๋„ ๋‹ค์–‘ํ•˜๊ฒŒ ์ œ๊ณตํ•œ๋‹ค. 

 

 

โ  ์‹ค์Šต

 

โ—‡ 1) ๋ชจ๋ธ๊ณผ ๋ฐ์ดํ„ฐ์…‹ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ 

from datasets import load_dataset
from sentence_transformers import SentenceTransformer

# ๊ธฐ์‚ฌ ๋ณธ๋ฌธ๊ณผ ๊ธฐ์‚ฌ๋ณธ๋ฌธ์— ๊ด€๋ จ๋œ ์งˆ๋ฌธ์„ ๋ชจ์€ ๋ฐ์ดํ„ฐ์…‹ 
klue_mrc_dataset = load_dataset('klue', 'mrc', split='train')

# ํ•œ๊ตญ์–ด ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ
sentence_model = SentenceTransformer('snunlp/KR-SBERT-V40K-klueNLI-augSTS')

 

 

โ—‡ 2) ์‹ค์Šต ๋ฐ์ดํ„ฐ์—์„œ 1000๊ฐœ๋งŒ ์„ ํƒํ•˜๊ณ  ๋ฌธ์žฅ ์ž„๋ฒ ๋”ฉ์œผ๋กœ ๋ณ€ํ™˜ 

klue_mrc_dataset = klue_mrc_dataset.train_test_split(train_size=1000, shuffle=False)['train']

# ๊ธฐ์‚ฌ ๋ณธ๋ฌธ๋ฐ์ดํ„ฐ๋ฅผ ์ €์žฅํ•˜๊ณ  ์žˆ๋Š” context์นผ๋Ÿผ์„ ๋ฌธ์žฅ ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ์˜ ์ž…๋ ฅ์œผ๋กœ ๋„ฃ์–ด ๋ฌธ์žฅ ์ž„๋ฒ ๋”ฉ์œผ๋กœ ๋ณ€ํ™˜ 
embeddings = sentence_model.encode(klue_mrc_dataset['context'])

embeddings.shape
# ์ถœ๋ ฅ ๊ฒฐ๊ณผ
# (1000, 768)

 

 

โ—‡ 3) KNN ๊ฒ€์ƒ‰ ์ธ๋ฑ์Šค๋ฅผ ์ƒ์„ฑํ•˜๊ณ  ๋ฌธ์žฅ ์ž„๋ฒ ๋”ฉ ์ €์žฅ

 โ†ช๏ธŽ ๋ฐ์ดํ„ฐ๋ฅผ ์ €์žฅํ•  ํ…Œ์ด๋ธ”์„ ์ƒ์„ฑํ•œ๋‹ค๊ณ  ์ƒ๊ฐํ•˜๋ฉด ๋œ๋‹ค.

 โ†ช๏ธŽ ์˜๋ฏธ๊ฒ€์ƒ‰์„ ๊ตฌํ˜„ํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š”, ๊ฒ€์ƒ‰ ์ฟผ๋ฆฌ ๋ฌธ์žฅ์„ ๋ฌธ์žฅ์ž„๋ฒ ๋”ฉ์œผ๋กœ ๋ณ€ํ™˜ํ•˜๊ณ , ์ธ๋ฑ์Šค์—์„œ ๊ฒ€์ƒ‰ํ•˜๋ฉด ๋œ๋‹ค. 

import faiss

# ์ธ๋ฑ์Šค ๋งŒ๋“ค๊ธฐ
index = faiss.IndexFlatL2(embeddings.shape[1])

# ์ธ๋ฑ์Šค์— ์ž„๋ฒ ๋”ฉ ์ €์žฅํ•˜๊ธฐ
index.add(embeddings)

 

 

 

โ—‡ 4) ์˜๋ฏธ๊ฒ€์ƒ‰

 โ†ช๏ธŽ ๊ฒ€์ƒ‰์ฟผ๋ฆฌ๋ฌธ์žฅ์„ ๋ฌธ์žฅ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ์˜ encode ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•ด ๋ฌธ์žฅ ์ž„๋ฒ ๋”ฉ์œผ๋กœ ๋ณ€ํ™˜ํ•˜๊ณ  ์ธ๋ฑ์Šค์˜ search ๋ฉ”์„œ๋“œ๋กœ ์ฟผ๋ฆฌ์ž„๋ฒ ๋”ฉ๊ณผ ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด 3๊ฐœ ๋ฌธ์„œ๋ฅผ ๋ฐ˜ํ™˜๋ฐ›๋Š”๋‹ค. 

query = "์ด๋ฒˆ ์—ฐ๋„์—๋Š” ์–ธ์ œ ๋น„๊ฐ€ ๋งŽ์ด ์˜ฌ๊นŒ?"
query_embedding = sentence_model.encode([query])
distances, indices = index.search(query_embedding, 3)

for idx in indices[0]:
  print(klue_mrc_dataset['context'][idx][:50])

# ์ถœ๋ ฅ ๊ฒฐ๊ณผ
# ์˜ฌ์—ฌ๋ฆ„ ์žฅ๋งˆ๊ฐ€ 17์ผ ์ œ์ฃผ๋„์—์„œ ์‹œ์ž‘๋๋‹ค. ์„œ์šธ ๋“ฑ ์ค‘๋ถ€์ง€๋ฐฉ์€ ์˜ˆ๋…„๋ณด๋‹ค ์‚ฌ๋‚˜ํ˜ ์ •๋„ ๋Šฆ์€   (์ •๋‹ต)
# ์—ฐ๊ตฌ ๊ฒฐ๊ณผ์— ๋”ฐ๋ฅด๋ฉด, ์˜ค๋ฆฌ๋„ˆ๊ตฌ๋ฆฌ์˜ ๋ˆˆ์€ ๋Œ€๋ถ€๋ถ„์˜ ํฌ์œ ๋ฅ˜๋ณด๋‹ค๋Š” ์–ด๋ฅ˜์ธ ์น ์„ฑ์žฅ์–ด๋‚˜ ๋จน์žฅ์–ด, ๊ทธ (์˜ค๋‹ต)
# ์—ฐ๊ตฌ ๊ฒฐ๊ณผ์— ๋”ฐ๋ฅด๋ฉด, ์˜ค๋ฆฌ๋„ˆ๊ตฌ๋ฆฌ์˜ ๋ˆˆ์€ ๋Œ€๋ถ€๋ถ„์˜ ํฌ์œ ๋ฅ˜๋ณด๋‹ค๋Š” ์–ด๋ฅ˜์ธ ์น ์„ฑ์žฅ์–ด๋‚˜ ๋จน์žฅ์–ด, ๊ทธ (์˜ค๋‹ต)

 

 

โ—‡ 5) ์˜๋ฏธ๊ฒ€์ƒ‰์˜ ํ•œ๊ณ„ 

 โ†ช๏ธŽ ํ‚ค์›Œ๋“œ๊ฐ€ ๋™์ผํ•˜์ง€ ์•Š์•„๋„, ์˜๋ฏธ๊ฐ€ ์œ ์‚ฌํ•˜๋ฉด ๋‚ด์šฉ์ด ๊ด€๋ จ ์—†๋”๋ผ๋„ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋กœ ๋‚˜์˜ฌ ์ˆ˜ ์žˆ๋‹ค. 

query = klue_mrc_dataset[3]['question'] # What did Robert Henry Dicke develop at a Massachusetts laboratory in 1946?
query_embedding = sentence_model.encode([query])
distances, indices = index.search(query_embedding, 3)

for idx in indices[0]:
  print(klue_mrc_dataset['context'][idx][:50])

# ์ถœ๋ ฅ ๊ฒฐ๊ณผ
# ํƒœํ‰์–‘ ์ „์Ÿ ์ค‘ ๋‰ด๊ธฐ๋‹ˆ ๋ฐฉ๋ฉด์—์„œ ์ง„๊ณต ์ž‘์ „์„ ์‹ค์‹œํ•ด ์˜จ ๋”๊ธ€๋Ÿฌ์Šค ๋งฅ์•„๋” ์žฅ๊ตฐ์„ ์‚ฌ๋ น๊ด€์œผ๋กœ (์˜ค๋‹ต)
# ํƒœํ‰์–‘ ์ „์Ÿ ์ค‘ ๋‰ด๊ธฐ๋‹ˆ ๋ฐฉ๋ฉด์—์„œ ์ง„๊ณต ์ž‘์ „์„ ์‹ค์‹œํ•ด ์˜จ ๋”๊ธ€๋Ÿฌ์Šค ๋งฅ์•„๋” ์žฅ๊ตฐ์„ ์‚ฌ๋ น๊ด€์œผ๋กœ (์˜ค๋‹ต)
# ๋ฏธ๊ตญ ์„ธ์ธํŠธ๋ฃจ์ด์Šค์—์„œ ํƒœ์–ด๋‚ฌ๊ณ , ํ”„๋ฆฐ์Šคํ„ด ๋Œ€ํ•™๊ต์—์„œ ํ•™์‚ฌ ํ•™์œ„๋ฅผ ๋งˆ์น˜๊ณ  1939๋…„์— ๋กœ์ฒด์Šค (์ •๋‹ต)

 

 โ†ช๏ธŽ ๋กœ๋ฒ„ํŠธํ—จ๋ฆฌ๋”•์ด ๊ฐœ๋ฐœํ•œ ๊ฒƒ์— ๋Œ€ํ•œ ์งˆ๋ฌธ ์ฟผ๋ฆฌ์— ๋Œ€ํ•ด, ๊ด€๋ จ์—†๋Š” ๋‹ต๋ณ€์ด ๋จผ์ € ์ƒ์œ„์— ๋‚˜์˜ค๋Š” ๊ฒฐ๊ณผ (ํ‚ค์›Œ๋“œ๋Š” ๋™์ผํ•˜์ง€ ์•Š์ง€๋งŒ, ์˜๋ฏธ์ƒ ์œ ์‚ฌํ•˜๋‹ค๊ณ  ํŒ๋‹จ๋œ ๊ฒฝ์šฐ) โ˜ž ์ด๋Ÿฌํ•œ ํ•œ๊ณ„๋Š” ํ•˜์ด๋ธŒ๋ฆฌ๋“œ๊ฒ€์ƒ‰์œผ๋กœ ๊ฐœ์„ 

 

 

 


3.2 ๋ผ๋งˆ์ธ๋ฑ์Šค์—์„œ Sentence-Transformers ๋ชจ๋ธ ์‚ฌ์šฉํ•˜๊ธฐ 

 

from llama_index.core import VectorStoreIndex, ServiceContext
from llama_index.core import Document
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# ํ—ˆ๊น…ํŽ˜์ด์Šค ๊ด€๋ จ ํด๋ž˜์Šค์— ์ €์žฅํ•ด๋‘” ๋ชจ๋ธ์„ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ 
embed_model = HuggingFaceEmbedding(model_name="snunlp/KR-SBERT-V40K-klueNLI-augSTS")
service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=None)

# ๋กœ์ปฌ ๋ชจ๋ธ ํ™œ์šฉํ•˜๊ธฐ
# service_context = ServiceContext.from_defaults(embed_model="local")


text_list = klue_mrc_dataset[:100]['context']
documents = [Document(text=t) for t in text_list]

index_llama = VectorStoreIndex.from_documents(
    documents,
    service_context=service_context,
)

 

 โ†ช๏ธŽ  ๋ผ๋งˆ์ธ๋ฑ์Šค๋Š” Sentence-Transformers์˜ ์ž„๋ฒ ๋”ฉ ๋ชจ๋ธ์„ ํ†ตํ•ฉํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐ๋Šฅ์„ ์ง€์› 

 

 

 

 

 

 

4.   ๊ฒ€์ƒ‰ ๋ฐฉ์‹์„ ์กฐํ•ฉํ•ด ์„ฑ๋Šฅ ๋†’์ด๊ธฐ


 

4.1  ํ‚ค์›Œ๋“œ ๊ฒ€์ƒ‰๋ฐฉ์‹ : BM25

 

โ  ํ‚ค์›Œ๋“œ๊ฒ€์ƒ‰

 โ†ช๏ธŽ  ์˜๋ฏธ๊ฒ€์ƒ‰๊ณผ ๋‹ฌ๋ฆฌ, ๋™์ผํ•œ ํ‚ค์›Œ๋“œ๊ฐ€ ๋งŽ์ด ํฌํ•จ๋ ์ˆ˜๋ก ์œ ์‚ฌ๋„๋ฅผ ๋†’๊ฒŒ ํ‰๊ฐ€ํ•˜๋Š” ๊ฒ€์ƒ‰๋ฐฉ์‹์„ ๋งํ•œ๋‹ค. 

 โ†ช๏ธŽ  ๊ด€๋ จ์„ฑ์ด ๋–จ์–ด์ง€๋Š” ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜ํƒ€๋‚  ๊ฐ€๋Šฅ์„ฑ์ด ๋‚ฎ๋‹ค๋Š” ์žฅ์ ์ด ์žˆ์ง€๋งŒ, ๋™์ผํ•œ ํ‚ค์›Œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š์œผ๋ฉด ์˜๋ฏธ๊ฐ€ ์œ ์‚ฌํ•˜๋”๋ผ๋„ ๊ฒ€์ƒ‰ํ•˜์ง€ ๋ชปํ•œ๋‹ค๋Š” ๋‹จ์ ์ด ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ ์ฃผ๋กœ ์˜๋ฏธ๊ฒ€์ƒ‰๊ณผ ํ‚ค์›Œ๋“œ๊ฒ€์ƒ‰์„ ์กฐํ•ฉํ•œ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ๊ฒ€์ƒ‰์„ ํ™œ์šฉํ•œ๋‹ค. 

 

 

โ  BM25

 โ†ช๏ธŽ  TF-IDF์™€ ์œ ์‚ฌํ•œ ํ†ต๊ณ„๊ธฐ๋ฐ˜ ์Šค์ฝ”์–ด๋ง ๋ฐฉ๋ฒ•์œผ๋กœ, TF-IDF์— ๋ฌธ์„œ์˜ ๊ธธ์ด์— ๋Œ€ํ•œ ๊ฐ€์ค‘์น˜๋ฅผ ์ถ”๊ฐ€ํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋‹ค. 

 โ†ช๏ธŽ  ๊ฐ„๋‹จํ•˜๊ณ  ๊ณ„์‚ฐ๋Ÿ‰์ด ์ ์œผ๋ฉฐ, ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ ๋Œ€ํ‘œ์ ์ธ ๊ฒ€์ƒ‰์—”์ง„์ธ Elasticsearch์˜ ๊ธฐ๋ณธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ์‚ฌ์šฉ๋˜๊ธฐ๋„ ํ•œ๋‹ค. 

 

โ†ช๏ธŽ  1) ํฌํ™”ํšจ๊ณผ ๊ณ ๋ ค : ํŠน์ • ๋ฌธ์„œ ๋‚ด์— ํ† ํฐ์ด ์ž์ฃผ ๋‚˜์˜ค๋”๋ผ๋„ TF ํ•ญ์ด ์ผ์ • ๊ฐ’ ์ด์ƒ์œผ๋กœ ์ปค์ง€์ง€ ์•Š๋Š”๋‹ค. 

โ†ช๏ธŽ  2) ๋ฌธ์„œ ๊ธธ์ด ๊ณ ๋ ค : ์งง์€ ๋ฌธ์„œ์— ํ† ํฐ q๊ฐ€ ๋“ฑ์žฅํ•œ ๊ฒฝ์šฐ, ๋” ์ค‘์š”๋„๋ฅผ ๋†’๊ฒŒ ํŒ๋‹จํ•œ๋‹ค. 

 

 


4.2 ์ƒํ˜ธ์ˆœ์œ„์กฐํ•ฉ ์ดํ•ดํ•˜๊ธฐ 

 

โ  ์ƒํ˜ธ์ˆœ์œ„์กฐํ•ฉ

 โ†ช๏ธŽ  ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ๊ฒ€์ƒ‰์„ ์œ„ํ•ด์„œ๋Š” ํ†ต๊ณ„๊ธฐ๋ฐ˜ ์ ์ˆ˜์™€ ์ž„๋ฒ ๋”ฉ ์œ ์‚ฌ๋„ ์ ์ˆ˜๋ฅผ ํ•˜๋‚˜๋กœ ํ•ฉ์ณ์•ผ ํ•œ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜, ์ ์ˆ˜๋งˆ๋‹ค ๋ถ„ํฌ๊ฐ€ ๋‹ค๋ฅด๋ฏ€๋กœ ์ด๋ฅผ ๋งž์ถฐ์ฃผ๋Š” ๊ฒƒ์ด ํ•„์š”ํ•˜๋‹ค. 

 โ†ช๏ธŽ  ์ƒํ˜ธ์ˆœ์œ„์กฐํ•ฉ์€ ๊ฐ ์ ์ˆ˜์—์„œ์˜ '์ˆœ์œ„'๋ฅผ ํ™œ์šฉํ•ด ์ ์ˆ˜๋ฅผ ์‚ฐ์ถœํ•œ๋‹ค. ์ˆœ์œ„์— ๋”ฐ๋ผ ์ ์ˆ˜ (1/(k+์ˆœ์œ„)) ๋ฅผ ๋ถ€์—ฌํ•œ๋‹ค. k๋Š” ๊ฐ ๋ชจ๋ธ์—์„œ ๊ณ ๋ คํ•  ์ˆœ์œ„์ด๋‹ค. (๊ต์žฌ 10.17 ๊ทธ๋ฆผ ์ฐธ๊ณ ํ•˜๊ธฐ) 

 

 

 

 

 

 

5.   ํ•˜์ด๋ธŒ๋ฆฌ๋“œ๊ฒ€์ƒ‰๊ตฌํ˜„


 

5.1  BM25๊ตฌํ˜„ํ•˜๊ธฐ 

 

 

โ  Class์ •์˜

 

import math
import numpy as np
from typing import List
from transformers import PreTrainedTokenizer
from collections import defaultdict

class BM25:
  def __init__(self, corpus:List[List[str]], tokenizer:PreTrainedTokenizer):
    self.tokenizer = tokenizer
    self.corpus = corpus
    self.tokenized_corpus = self.tokenizer(corpus, add_special_tokens=False)['input_ids']
    self.n_docs = len(self.tokenized_corpus)
    self.avg_doc_lens = sum(len(lst) for lst in self.tokenized_corpus) / len(self.tokenized_corpus)
    self.idf = self._calculate_idf()
    self.term_freqs = self._calculate_term_freqs()

  def _calculate_idf(self):
    idf = defaultdict(float)
    for doc in self.tokenized_corpus:
      for token_id in set(doc):
        idf[token_id] += 1
    for token_id, doc_frequency in idf.items():
      idf[token_id] = math.log(((self.n_docs - doc_frequency + 0.5) / (doc_frequency + 0.5)) + 1)
    return idf

  def _calculate_term_freqs(self):
    term_freqs = [defaultdict(int) for _ in range(self.n_docs)]
    for i, doc in enumerate(self.tokenized_corpus):
      for token_id in doc:
        term_freqs[i][token_id] += 1
    return term_freqs

  def get_scores(self, query:str, k1:float = 1.2, b:float=0.75):
    query = self.tokenizer([query], add_special_tokens=False)['input_ids'][0]
    scores = np.zeros(self.n_docs)
    for q in query:
      idf = self.idf[q]
      for i, term_freq in enumerate(self.term_freqs):
        q_frequency = term_freq[q]
        doc_len = len(self.tokenized_corpus[i])
        score_q = idf * (q_frequency * (k1 + 1)) / ((q_frequency) + k1 * (1 - b + b * (doc_len / self.avg_doc_lens)))
        scores[i] += score_q
    return scores

  def get_top_k(self, query:str, k:int):
    scores = self.get_scores(query)
    top_k_indices = np.argsort(scores)[-k:][::-1]
    top_k_scores = scores[top_k_indices]
    return top_k_scores, top_k_indices

 

 

 โ†ช๏ธŽ  get_scores : ์ ์ˆ˜ ๊ณ„์‚ฐ 

      โ†ช๏ธŽ idf์™€ term_freqs๋ฅผ ํ†ตํ•ด, ๊ฒ€์ƒ‰ํ•˜๋ ค๋Š” ์ฟผ๋ฆฌ์™€ ๊ฐ ๋ฌธ์„œ ์‚ฌ์ด์˜ ์ ์ˆ˜๋ฅผ ๊ณ„์‚ฐ 

 

 โ†ช๏ธŽ  get_top_k : ์ƒ์œ„ k๊ฐœ์˜ ์ ์ˆ˜์™€ ์ธ๋ฑ์Šค ์ถ”์ถœ 

      โ†ช๏ธŽ ์ฟผ๋ฆฌ์™€ ๋ฌธ์„œ ์‚ฌ์ด์˜ ๊ฒ€์ˆ˜๊ฐ€ ๊ฐ€์žฅ ๋†’์€ k๊ฐœ์˜ ๋ฌธ์„œ์˜ ์ธ๋ฑ์Šค์™€ ์ ์ˆ˜๋ฅผ ๋ฐ˜ํ™˜ 

 

 โ†ช๏ธŽ _calculate_term_freqs : ๊ฐ ํ† ํฐ์ด ๊ฐ ๋ฌธ์„œ ๋‚ด์—์„œ ๋ช‡ ๋ฒˆ ๋“ฑ์žฅํ•˜๋Š”์ง€ ์ง‘๊ณ„ 

      โ†ช๏ธŽ self.n_docs ๋ฌธ์„œ์ˆ˜ ๋งŒํผ์˜ ๋”•์…”๋„ˆ๋ฆฌ๋ฅผ ๋งŒ๋“ค๊ณ  ๊ฐ ๋ฌธ์„œ ๋‚ด์— ์–ด๋–ค ํ† ํฐ์ด ๋ช‡ ๋ฒˆ ๋“ฑ์žฅํ•˜๋Š”์ง€ ์ง‘๊ณ„ํ•œ๋‹ค. 

 

 

 

โ  BM25 ์ ์ˆ˜ ๊ฒฐ๊ณผ ํ™•์ธ

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('klue/roberta-base')

bm25 = BM25(['์•ˆ๋…•ํ•˜์„ธ์š”', '๋ฐ˜๊ฐ‘์Šต๋‹ˆ๋‹ค', '์•ˆ๋…• ์„œ์šธ'], tokenizer)
bm25.get_scores('์•ˆ๋…•')
# array([0.44713859, 0.        , 0.52354835])

 

 โ†ช๏ธŽ  '์•ˆ๋…•' ์ด๋ผ๋Š” ๊ฒ€์ƒ‰ ์ฟผ๋ฆฌ์— ๋Œ€ํ•ด, '๋ฐ˜๊ฐ‘์Šต๋‹ˆ๋‹ค'๋Š” ์ผ์น˜ํ•˜๋Š” ํ† ํฐ์ด ์—†์–ด ์œ ์‚ฌ๋„๊ฐ€ 0์ด ๋œ๋‹ค. 

 

 

 

โ  BM25 ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ ํ•œ๊ณ„ 

 

# BM25 ๊ฒ€์ƒ‰ ์ค€๋น„
bm25 = BM25(klue_mrc_dataset['context'], tokenizer)

query = "์ด๋ฒˆ ์—ฐ๋„์—๋Š” ์–ธ์ œ ๋น„๊ฐ€ ๋งŽ์ด ์˜ฌ๊นŒ?"
_, bm25_search_ranking = bm25.get_top_k(query, 100)

for idx in bm25_search_ranking[:3]:
  print(klue_mrc_dataset['context'][idx][:50])

# ์ถœ๋ ฅ ๊ฒฐ๊ณผ
# ๊ฐค๋Ÿญ์‹œS5 ์–ธ์ œ ๋ฐœ๋งคํ•œ๋‹ค๋Š” ๊ฑด์ง€์–ธ์ œ๋Š” “27์ผ ํŒ๋งคํ•œ๋‹ค”๊ณ  ํ–ˆ๋‹ค๊ฐ€ “์ด๋ฅด๋ฉด 26์ผ ํŒ๋งคํ•œ๋‹ค (์˜ค๋‹ต)
# ์ธ๊ตฌ ๋น„์œจ๋‹น ๋…ธ๋ฒจ์ƒ์„ ์„ธ๊ณ„์—์„œ ๊ฐ€์žฅ ๋งŽ์ด ๋ฐ›์€ ๋‚˜๋ผ, ๊ณผํ•™ ๋…ผ๋ฌธ์„ ๊ฐ€์žฅ ๋งŽ์ด ์“ฐ๊ณ  ์˜๋ฃŒ ํŠน (์˜ค๋‹ต)
# ์˜ฌ์—ฌ๋ฆ„ ์žฅ๋งˆ๊ฐ€ 17์ผ ์ œ์ฃผ๋„์—์„œ ์‹œ์ž‘๋๋‹ค. ์„œ์šธ ๋“ฑ ์ค‘๋ถ€์ง€๋ฐฉ์€ ์˜ˆ๋…„๋ณด๋‹ค ์‚ฌ๋‚˜ํ˜ ์ •๋„ ๋Šฆ์€  (์ •๋‹ต)

 

 โ†ช๏ธŽ  ์งˆ๋ฌธ์— ๋Œ€ํ•ด, ์ •๋‹ต์ด ๋˜๋Š” ๋ฌธ์žฅ์„ ์„ธ๋ฒˆ์งธ๋กœ ์ถœ๋ ฅํ•œ๋‹ค. ๊ฒ€์ƒ‰ ์ฟผ๋ฆฌ ๋ฌธ์žฅ๊ณผ ์ •๋‹ต ๊ธฐ์‚ฌ ์‚ฌ์ด์— ์ผ์น˜ํ•˜๋Š” ํ‚ค์›Œ๋“œ๊ฐ€ ์ ์–ด ๊ฐ€์žฅ ๋จผ์ € ๊ฒ€์ƒ‰๋˜์ง€ ์•Š์€ ๊ฒƒ์ด๋‹ค. 

 

 

 

โ  BM25๊ฒ€์ƒ‰ ์žฅ์ 

 

query = klue_mrc_dataset[3]['question']  # What did Robert Henry Dicke develop at a Massachusetts laboratory in 1946?
_, bm25_search_ranking = bm25.get_top_k(query, 100)

for idx in bm25_search_ranking[:3]:
  print(klue_mrc_dataset['context'][idx][:50])

# ์ถœ๋ ฅ ๊ฒฐ๊ณผ
# ๋ฏธ๊ตญ ์„ธ์ธํŠธ๋ฃจ์ด์Šค์—์„œ ํƒœ์–ด๋‚ฌ๊ณ , ํ”„๋ฆฐ์Šคํ„ด ๋Œ€ํ•™๊ต์—์„œ ํ•™์‚ฌ ํ•™์œ„๋ฅผ ๋งˆ์น˜๊ณ  1939๋…„์— ๋กœ์ฒด์Šค (์ •๋‹ต)
# ;๋ฉ”์นด๋™(ใƒกใ‚ซใƒ‰ใƒณ)                                                      (์˜ค๋‹ต)
# :์„ฑ์šฐ : ๋‚˜๋ผํ•˜์‹œ ๋ฏธํ‚ค(ใชใ‚‰ใฏใ—ใฟใ)
# ๊ธธ๊ฐ€์— ๋ฒ„๋ ค์ ธ ์žˆ๋˜ ๋‚ก์€ ๋Аํ‹ฐ๋‚˜
# ;๋ฉ”์นด๋™(ใƒกใ‚ซใƒ‰ใƒณ)                                                      (์˜ค๋‹ต)
# :์„ฑ์šฐ : ๋‚˜๋ผํ•˜์‹œ ๋ฏธํ‚ค(ใชใ‚‰ใฏใ—ใฟใ)
# ๊ธธ๊ฐ€์— ๋ฒ„๋ ค์ ธ ์žˆ๋˜ ๋‚ก์€ ๋Аํ‹ฐ๋‚˜

 

 โ†ช๏ธŽ  ๋ฐ˜๋ฉด, ์˜๋ฏธ๊ฒ€์ƒ‰์—์„œ ํ•œ๊ณ„ ์˜ˆ์‹œ์—์„œ ๋ณด์˜€๋˜ ์ฟผ๋ฆฌ ๊ฒ€์ƒ‰ ๋ฌธ์žฅ์— ๋Œ€ํ•œ ๊ฒฐ๊ณผ๋Š” ์ •๋‹ต ๋ฌธ์žฅ์„ ์ œ์ผ ์ฒซ๋ฒˆ์งธ๋กœ ๋“ฑ์žฅํ•˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๊ธฐ์‚ฌ ๋ณธ๋ฌธ์„ ์ถœ๋ ฅํ•˜๋ฉด '๋งค์‚ฌ์ถ”์„ธ์ธ  ์—ฐ๊ตฌ์†Œ'๋ผ๋Š” ํ‘œํ˜„์ด ๋งŽ์ด ๋“ฑ์žฅํ•˜๋Š”๋ฐ, BM25์˜ ์ผ์น˜ํ•˜๋Š” ํ‚ค์›Œ๋“œ ๋ฐ”ํƒ•์˜ ๊ด€๋ จ ๊ธฐ์‚ฌ ๊ฒ€์ƒ‰์˜ ์žฅ์ ์„ ์ž˜ ๋ณด์—ฌ์ค€๋‹ค. 

 

 

 

 

 


5.2 ์ƒํ˜ธ์ˆœ์œ„์กฐํ•ฉ ๊ตฌํ˜„ํ•˜๊ธฐ 

 

โ  ์ƒํ˜ธ์ˆœ์œ„์กฐํ•ฉํ•จ์ˆ˜๊ตฌํ˜„

 

from collections import defaultdict

def reciprocal_rank_fusion(rankings:List[List[int]], k=5):
    rrf = defaultdict(float)
    for ranking in rankings:
        for i, doc_id in enumerate(ranking, 1):
            rrf[doc_id] += 1.0 / (k + i)  # add a rank-based score to each document index
    return sorted(rrf.items(), key=lambda x: x[1], reverse=True)  # sort the combined score dict in descending order and return it

 

 โ†ช๏ธŽ  reciprocal_rank_fusion : ๊ฐ ๊ฒ€์ƒ‰ ๋ฐฉ์‹์œผ๋กœ ๊ณ„์‚ฐํ•ด ์ •ํ•ด์ง„ ๋ฌธ์„œ์˜ ์ˆœ์œ„๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›์•„, ์ƒํ˜ธ ์ˆœ์œ„ ์กฐํ•ฉ ์ ์ˆ˜๊ฐ€ ๋†’์€ ์ˆœ๋Œ€๋กœ ์ •๋ ฌํ•ด ๋ฐ˜ํ™˜ 

 โ†ช๏ธŽ  rankings ์ธ์ž : ์—ฌ๋Ÿฌ ๊ฒ€์ƒ‰ ๋ฐฉ์‹์—์„œ ์ •ํ•ด์ง„ ์œ ์‚ฌํ•œ ๋ฌธ์„œ์˜ ์ธ๋ฑ์Šค ๋ฆฌ์ŠคํŠธ๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›๋Š”๋‹ค. 

 โ†ช๏ธŽ ์—ฌ๋Ÿฌ ๊ฒ€์ƒ‰ ๋ฐฉ์‹์˜ ์ ์ˆ˜๋ฅผ ์ข…ํ•ฉํ•˜๊ณ  ๋†’์€ ์ ์ˆ˜๋ฅผ ๋ฐ›์€ ์ˆœ์„œ๋Œ€๋กœ ์ •๋ ฌํ•ด ๊ฒฐ๊ณผ๋ฅผ ๋ฐ˜ํ™˜ํ•œ๋‹ค. 

 

rankings = [[1, 4, 3, 5, 6], [2, 1, 3, 6, 4]]
reciprocal_rank_fusion(rankings)

# [(1, 0.30952380952380953),
#  (3, 0.25),
#  (4, 0.24285714285714285),
#  (6, 0.2111111111111111),
#  (2, 0.16666666666666666),
#  (5, 0.1111111111111111)]

 

โ†ช๏ธŽ  ์˜ˆ์‹œ ๋ฐ์ดํ„ฐ๋กœ ํ•จ์ˆ˜์˜ ๊ตฌํ˜„ ๊ฒฐ๊ณผ ํ™•์ธ 

 

 

 

โ  ํ•˜์ด๋ธŒ๋ฆฌ๋“œ๊ฒ€์ƒ‰ ๊ตฌํ˜„ํ•˜๊ธฐ

 

# ์˜๋ฏธ ๊ฒ€์ƒ‰์—์„œ ๋ฐ˜๋ณต์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•˜๋˜ ๊ฒ€์ƒ‰์ฟผ๋ฆฌ ๋ฌธ์žฅ ์ž„๋ฒ ๋”ฉ ๋ณ€ํ™˜๊ณผ ์ธ๋ฑ์Šค ๊ฒ€์ƒ‰ ๋ถ€๋ถ„์„ ํ•œ๋ฒˆ์— ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋„๋ก ์ •์˜ํ•œ ํ•จ์ˆ˜ 
def dense_vector_search(query:str, k:int):
  query_embedding = sentence_model.encode([query])
  distances, indices = index.search(query_embedding, k)
  return distances[0], indices[0]

# ๊ฒ€์ƒ‰ ์ฟผ๋ฆฌ ๋ฌธ์žฅ๊ณผ ์ƒํ˜ธ ์ˆœ์œ„ ์กฐํ•ฉ์— ์‚ฌ์šฉํ•  ํŒŒ๋ผ๋ฏธํ„ฐ k๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›์Œ 
def hybrid_search(query, k=20):
 # ์˜๋ฏธ๊ฒ€์ƒ‰ ์ˆ˜ํ–‰
  _, dense_search_ranking = dense_vector_search(query, 100)
 # ํ‚ค์›Œ๋“œ๊ฒ€์ƒ‰ ์ˆ˜ํ–‰
  _, bm25_search_ranking = bm25.get_top_k(query, 100)

# ๋‘ ๊ฒ€์ƒ‰ ๋ฐฉ์‹์˜ ์ˆœ์œ„๋ฅผ ์กฐํ•ฉํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ ๋ฐ˜ํ™˜ 
  results = reciprocal_rank_fusion([dense_search_ranking, bm25_search_ranking], k=k)
  return results

 

 

 

โ  ์˜ˆ์‹œ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ ํ™•์ธ 

 

query = "์ด๋ฒˆ ์—ฐ๋„์—๋Š” ์–ธ์ œ ๋น„๊ฐ€ ๋งŽ์ด ์˜ฌ๊นŒ?"
print("๊ฒ€์ƒ‰ ์ฟผ๋ฆฌ ๋ฌธ์žฅ: ", query)
results = hybrid_search(query)
for idx, score in results[:3]:
  print(klue_mrc_dataset['context'][idx][:50])

print("=" * 80)
query = klue_mrc_dataset[3]['question'] # ๋กœ๋ฒ„ํŠธ ํ—จ๋ฆฌ ๋”•์ด 1946๋…„์— ๋งค์‚ฌ์ถ”์„ธ์ธ  ์—ฐ๊ตฌ์†Œ์—์„œ ๊ฐœ๋ฐœํ•œ ๊ฒƒ์€ ๋ฌด์—‡์ธ๊ฐ€?
print("๊ฒ€์ƒ‰ ์ฟผ๋ฆฌ ๋ฌธ์žฅ: ", query)

results = hybrid_search(query)
for idx, score in results[:3]:
  print(klue_mrc_dataset['context'][idx][:50])

# ์ถœ๋ ฅ ๊ฒฐ๊ณผ
# ๊ฒ€์ƒ‰ ์ฟผ๋ฆฌ ๋ฌธ์žฅ:  ์ด๋ฒˆ ์—ฐ๋„์—๋Š” ์–ธ์ œ ๋น„๊ฐ€ ๋งŽ์ด ์˜ฌ๊นŒ?
# ์˜ฌ์—ฌ๋ฆ„ ์žฅ๋งˆ๊ฐ€ 17์ผ ์ œ์ฃผ๋„์—์„œ ์‹œ์ž‘๋๋‹ค. ์„œ์šธ ๋“ฑ ์ค‘๋ถ€์ง€๋ฐฉ์€ ์˜ˆ๋…„๋ณด๋‹ค ์‚ฌ๋‚˜ํ˜ ์ •๋„ ๋Šฆ์€  (์ •๋‹ต)
# ๊ฐค๋Ÿญ์‹œS5 ์–ธ์ œ ๋ฐœ๋งคํ•œ๋‹ค๋Š” ๊ฑด์ง€์–ธ์ œ๋Š” “27์ผ ํŒ๋งคํ•œ๋‹ค”๊ณ  ํ–ˆ๋‹ค๊ฐ€ “์ด๋ฅด๋ฉด 26์ผ ํŒ๋งคํ•œ๋‹ค  (์˜ค๋‹ต)
# ์—ฐ๊ตฌ ๊ฒฐ๊ณผ์— ๋”ฐ๋ฅด๋ฉด, ์˜ค๋ฆฌ๋„ˆ๊ตฌ๋ฆฌ์˜ ๋ˆˆ์€ ๋Œ€๋ถ€๋ถ„์˜ ํฌ์œ ๋ฅ˜๋ณด๋‹ค๋Š” ์–ด๋ฅ˜์ธ ์น ์„ฑ์žฅ์–ด๋‚˜ ๋จน์žฅ์–ด, ๊ทธ (์˜ค๋‹ต)
# ================================================================================
# ๊ฒ€์ƒ‰ ์ฟผ๋ฆฌ ๋ฌธ์žฅ:  ๋กœ๋ฒ„ํŠธ ํ—จ๋ฆฌ ๋”•์ด 1946๋…„์— ๋งค์‚ฌ์ถ”์„ธ์ธ  ์—ฐ๊ตฌ์†Œ์—์„œ ๊ฐœ๋ฐœํ•œ ๊ฒƒ์€ ๋ฌด์—‡์ธ๊ฐ€?
# ๋ฏธ๊ตญ ์„ธ์ธํŠธ๋ฃจ์ด์Šค์—์„œ ํƒœ์–ด๋‚ฌ๊ณ , ํ”„๋ฆฐ์Šคํ„ด ๋Œ€ํ•™๊ต์—์„œ ํ•™์‚ฌ ํ•™์œ„๋ฅผ ๋งˆ์น˜๊ณ  1939๋…„์— ๋กœ์ฒด์Šค (์ •๋‹ต)
# 1950๋…„๋Œ€ ๋ง ๋งค์‚ฌ์ถ”์„ธ์ธ  ๊ณต๊ณผ๋Œ€ํ•™๊ต์˜ ๋™์•„๋ฆฌ ํ…Œํฌ๋ชจ๋ธ์ฒ ๋„ํด๋Ÿฝ์—์„œ ‘ํ•ด์ปค’๋ผ๋Š” ์šฉ์–ด๊ฐ€ ์ฒ˜์Œ (์˜ค๋‹ต)
# 1950๋…„๋Œ€ ๋ง ๋งค์‚ฌ์ถ”์„ธ์ธ  ๊ณต๊ณผ๋Œ€ํ•™๊ต์˜ ๋™์•„๋ฆฌ ํ…Œํฌ๋ชจ๋ธ์ฒ ๋„ํด๋Ÿฝ์—์„œ ‘ํ•ด์ปค’๋ผ๋Š” ์šฉ์–ด๊ฐ€ ์ฒ˜์Œ (์˜ค๋‹ต)

 

 

 โ†ช๏ธŽ  ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ๊ฒ€์ƒ‰์„ ์‚ฌ์šฉํ•˜๋‹ˆ, ๊ฒ€์ƒ‰์ฟผ๋ฆฌ๋ฌธ์žฅ์— ๋Œ€ํ•ด ๋ชจ๋‘ ์ •๋‹ต ๋ฌธ์žฅ์ด ์ฒซ๋ฒˆ์งธ ๊ฒฐ๊ณผ๋กœ ์˜ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. 
