1๏ธโƒฃ AI•DS/๐Ÿ“— NLP

NER Practice

by isdawell 2022. 6. 2.

📌 Code-along notebook link: https://colab.research.google.com/drive/1wsD4VE-GIwn6CASc7RWk3s0PO7FC9LC4?usp=sharing

 


 

 

NER _ Named Entity Recognition

 

✔ Tagging tasks

 

  • Tagging : the task of identifying which category each word belongs to.
  • Representative tagging tasks are named entity recognition and part-of-speech (POS) tagging.
  • POS tagging : the task of identifying whether a word's part of speech is a noun, verb, adjective, etc.

 

 

✔ Named entity recognition

 

  • With named entity recognition, we can find which words in a corpus refer to a person, location, organization, and so on.
  • 'Hobi got accepted to the Kakao internship in 2022' 👉 Hobi - person, 2022 - time, Kakao - organization

 

 

 

 

 

1๏ธโƒฃ NER task by nltk library


https://wikidocs.net/30682

 


 

 

📌 NER practice with nltk

 

👀 NLTK provides a named entity recognizer, the NER chunker.

 

👀 Before ne_chunk can tag named entities, POS tagging with pos_tag must be performed first.

 

import nltk
from nltk import word_tokenize, pos_tag, ne_chunk
nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')  # one-time resource setup
sentence = 'James is working at Disney in London'
tokenized_sentence = pos_tag(word_tokenize(sentence))
print(tokenized_sentence)



[('James', 'NNP'), ('is', 'VBZ'), ('working', 'VBG'), ('at', 'IN'), ('Disney', 'NNP'), ('in', 'IN'), ('London', 'NNP')]

 

# ⭐ named entity recognition (one-time setup: nltk.download('maxent_ne_chunker'); nltk.download('words'))
ner_sentence = ne_chunk(tokenized_sentence)
print(ner_sentence)

 

James is correctly recognized as PERSON, Disney as ORGANIZATION, and London as GPE (a geopolitical entity, i.e., a location).

 

 

📌 BIO representation

 

https://wikidocs.net/24682

 


 

 

 

👀 Named entity recognition is a key preprocessing step for applications such as chatbots, and a challenging task in its own right.

 

๐Ÿ‘€ ๋„๋ฉ”์ธ ๋˜๋Š” ๋ชฉ์ ์— ํŠนํ™”๋˜๋„๋ก NER ์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ๊ธฐ์กด์— ๊ณต๊ฐœ๋œ NER ์ธ์‹๊ธฐ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹Œ, ์ง์ ‘ ๋ชฉ์ ์— ๋งž๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ์ค€๋น„ํ•˜์—ฌ ๋ชจ๋ธ์„ ๋งŒ๋“œ๋Š” ๊ฒƒ์ด๋‹ค. 

 

◾ B : short for Begin, the position where a named entity starts

◾ I : short for Inside, the continuation of a named entity

◾ O : short for Outside, a token that is not part of any named entity

 

 

Example: named entity extraction from a movie title

 

해 B-movie
리 I-movie
포 I-movie
터 I-movie
보 O
러 O
용 B-spot
산 I-spot
에 O
가 O
자 O

 

👉 B and I are used for named entities, while O means a token is not part of any entity.
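As a quick illustration (not from the original notebook), the character-level BIO tagging in the table above can be sketched in a few lines of Python, given entity spans as (start, end, label) with an exclusive end:

```python
def bio_tags(text, spans):
    """Character-level BIO tagging: spans are (start, end, label), end exclusive."""
    tags = ['O'] * len(text)                 # default: not part of any entity
    for start, end, label in spans:
        tags[start] = 'B-' + label           # the entity begins here
        for i in range(start + 1, end):
            tags[i] = 'I-' + label           # the entity continues
    return tags

# '해리포터' (movie) covers characters 0-3, '용산' (spot) covers characters 6-7
text = '해리포터보러용산에가자'
print(bio_tags(text, [(0, 4, 'movie'), (6, 8, 'spot')]))
```

This reproduces the tag column of the table above.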

 

 

 

📌 NER BIO practice with a Bi-LSTM

 

 

👀 Dataset: CoNLL-2003, a classic English NER benchmark.

 

[word] [POS tag] [chunk tag] [NER tag]

 

  • German : B-ORG : the start of a named entity of type organization
  • call is not a named entity, so it is tagged O
  • Peter Blackburn : a person's name; Peter is tagged B-PER because it begins the entity and is a person name, and Blackburn is tagged I-PER because it continues the entity.
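A minimal sketch of reading the four-column format described above into (words, NER tags) pairs. The blank-line sentence boundary and the sample lines are illustrative; this is not the notebook's actual loading code.

```python
def parse_conll(lines):
    """Parse '[word] [POS] [chunk] [NER]' lines; blank lines separate sentences."""
    sentences, words, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line:                       # blank line = sentence boundary
            if words:
                sentences.append((words, tags))
                words, tags = [], []
            continue
        word, pos, chunk, ner = line.split()
        words.append(word)
        tags.append(ner)                   # keep only the NER column here
    if words:
        sentences.append((words, tags))
    return sentences

sample = ['Peter NNP B-NP B-PER', 'Blackburn NNP I-NP I-PER', '']
print(parse_conll(sample))
# → [(['Peter', 'Blackburn'], ['B-PER', 'I-PER'])]
```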

 

👀 Text preprocessing and embedding

 

  • Tokenizer(num_words=vocab_size, oov_token='OOV')

 

👉 num_words : restricts tokenization to roughly the top N most frequent words in the corpus

👉 oov_token : low-frequency words outside the vocabulary are replaced with the OOV token

 

  • .texts_to_sequences(sentences)

 

👉 performs integer encoding

 

  • pad_sequences(X_train, padding='post', maxlen=max_len)

 

👉 performs padding : maxlen specifies the length of each sample
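To see what the three Keras calls above are doing, here is a plain-Python re-implementation of the same pipeline (frequency-ranked vocabulary with index 0 reserved for padding and 1 for OOV, integer encoding, then post-padding). This mimics Keras behavior under those assumptions and is illustrative only:

```python
from collections import Counter

def build_vocab(sentences, num_words):
    """Frequency-ranked vocabulary; 0 is reserved for padding, 1 for the OOV token."""
    freq = Counter(w for s in sentences for w in s)
    vocab = {'OOV': 1}
    for i, (w, _) in enumerate(freq.most_common(num_words - 2)):
        vocab[w] = i + 2
    return vocab

def encode_and_pad(sentences, vocab, maxlen):
    """Integer encoding followed by padding='post' (zeros appended at the end)."""
    encoded = [[vocab.get(w, vocab['OOV']) for w in s] for s in sentences]
    return [(seq + [0] * maxlen)[:maxlen] for seq in encoded]

sents = [['german', 'call'], ['german', 'eu', 'german', 'call']]
vocab = build_vocab(sents, num_words=4)    # keeps only 'german' and 'call'
print(encode_and_pad(sents, vocab, maxlen=4))
# → [[2, 3, 0, 0], [2, 1, 2, 3]]
```

Note how the rare word 'eu' falls outside the vocabulary and is encoded as the OOV index 1, while the shorter sentence is padded with trailing zeros.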

 

 

 

 

👀 Bi-LSTM architecture

 

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Bidirectional, TimeDistributed
from tensorflow.keras.optimizers import Adam

embedding_dim = 128
hidden_units = 128

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_len, mask_zero=True))
model.add(Bidirectional(LSTM(hidden_units, return_sequences=True)))
model.add(TimeDistributed(Dense(tag_size, activation='softmax')))

model.compile(loss='categorical_crossentropy', optimizer=Adam(0.001), metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=128, epochs=8, validation_data=(X_test, y_test))

 

◾ Embedding vector dimension: 128; hidden state size: 128

◾ For a many-to-many LSTM, the return_sequences argument must be set to True.

◾ Because samples of different lengths are padded and therefore contain many zeros, setting mask_zero=True tells the model to exclude the value 0 from computation.

◾ TimeDistributed in the output layer is used when the output layer must be applied at every time step of the LSTM. The model above performs multi-class classification at every time step, choosing one label out of tag_size named-entity labels.
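Because the model emits a softmax distribution over tag_size labels at every time step, predictions are decoded by taking the per-step argmax and mapping indices back to tag strings, skipping padded positions. A minimal decoding sketch with hypothetical probabilities and an index_to_tag mapping (both invented for illustration):

```python
def decode_predictions(probs, index_to_tag, mask):
    """probs: timesteps x tag_size softmax rows; mask: True where the token is real."""
    tags = []
    for row, keep in zip(probs, mask):
        if not keep:
            continue                                        # skip padded positions
        best = max(range(len(row)), key=row.__getitem__)    # per-step argmax
        tags.append(index_to_tag[best])
    return tags

index_to_tag = {0: 'PAD', 1: 'O', 2: 'B-PER', 3: 'I-PER'}
probs = [[0.0, 0.1, 0.8, 0.1],    # argmax -> 2 -> B-PER
         [0.0, 0.2, 0.1, 0.7],    # argmax -> 3 -> I-PER
         [0.9, 0.1, 0.0, 0.0]]    # padded position, masked out
print(decode_predictions(probs, index_to_tag, mask=[True, True, False]))
# → ['B-PER', 'I-PER']
```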

 

 

 

 

 

 

 

2๏ธโƒฃ NER task by spacy library


📌 SpaCy

 

👀 spaCy is an open-source Python library for natural language processing, providing features such as tokenization, POS tagging, dependency parsing, and named entity recognition.

 

 

💨 Basic spaCy usage

 

import spacy

nlp = spacy.load('en_core_web_sm') # load a model for the desired language
doc = nlp('Apple is looking at buyin at U.K startup for $1 billion.') # pass a sentence to the nlp object
print(doc) # prints the original sentence
print(list(doc)) # converting to a list shows the tokenized result



Apple is looking at buyin at U.K startup for $1 billion.
[Apple, is, looking, at, buyin, at, U.K, startup, for, $, 1, billion, .]

 

💨 NER task

 

โญ doc.ents 

โญ ent.label_

 

# NER task 
doc = nlp('Apple is looking at buying U.K. startup for $1 billion') 

for ent in doc.ents : 
  print(ent.text, ent.label_)

 

 

 

 

📌 Kaggle practice

 

https://www.kaggle.com/code/amarsharma768/custom-ner-using-spacy/notebook

 


 

 

👀 Custom NER

 

  • How to train named entities on a dataset the model has not seen before: using resume PDF data
  • The task is to identify entities such as person names and organization names in resumes.
  • The training data consists of 200 manually labelled resumes.
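For custom NER, spaCy's training examples take the form (text, {'entities': [(start, end, label)]}) with character offsets into the text. The resume-style sample below is fabricated for illustration; the text, labels, and offsets are not taken from the actual Kaggle dataset.

```python
# a single spaCy-style training example (illustrative, not real dataset content)
text = 'Abhishek Jha, Application Development Associate at Accenture'
train_example = (text, {'entities': [(0, 12, 'Name'),                    # 'Abhishek Jha'
                                     (51, 60, 'Companies worked at')]})  # 'Accenture'

# sanity check: the offsets must slice out exactly the entity surface forms
for start, end, label in train_example[1]['entities']:
    print(repr(text[start:end]), '->', label)
```

Getting these character offsets exactly right matters: spaCy rejects annotations whose spans do not align with token boundaries.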

 

 

 

 

📌 Korean NER

 

https://github.com/monologg/KoBERT-NER

 


 

  • Performs Korean named entity recognition using a pre-trained KoBERT model.
  • The model is trained on the NER dataset from the 2018 Naver NLP Challenge: http://air.changwon.ac.kr/?page_id=10

 

 

 

 

 

 

 

 

➕ Reference: http://aispiration.com/nlp2/nlp-ner-python.html

 


 

 

 

 

 

 

