Stop words are not removed during natural language processing (edited)

Question

391 views

jupyter-notebook

Hello, I am preprocessing text in Jupyter Notebook and my stop words are not being removed, so I am asking for help.

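import re

from konlpy.tag import Okt

# Create the tagger once instead of rebuilding it on every call
okt = Okt()

def tokenize_korean_text(text):
    # Remove everything except word characters, whitespace, and , . ? !
    text = re.sub(r'[^,.?!\w\s]', '', text)

    # Tag each morpheme with its part of speech
    okt_morphs = okt.pos(text)

    # Keep only verbs and nouns
    words = []
    for word, pos in okt_morphs:
        if pos == 'Verb' or pos == 'Noun':
            words.append(word)

    return words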


tokenized_list = []

for text in df['Keyword']:
    tokenized_list.append(tokenize_korean_text(text))

print(len(tokenized_list))
print(tokenized_list[1800])

This builds tokenized_list. Then

stop_words = ['입니다','있습니다','우리','할','수','하는','합니다','여러분','대한','하는','수','있다','한다']

I defined the stop words as above, and

clean_words = [i for i in tokenized_list if i not in stop_words]

ran this.

Then, to run the text analysis,

# Assumed imports (not shown in the original post)
from gensim import corpora
from gensim.models import LdaModel

dictionary = corpora.Dictionary(clean_words)
dictionary.filter_extremes(no_below=2, no_above=0.05)
corpus = [dictionary.doc2bow(text) for text in clean_words]

ldamodel = LdaModel(corpus, num_topics=8, id2word=dictionary, passes=20, iterations=500)
ldamodel.print_topics(num_words=8)

I ran the code above, but words I had listed as stop words, such as '하는' and '수', still appear in the results.

Is something wrong with my code?

I would really appreciate any help.

Honeybee 10 points

Posted 2022-05-31 13:40:18

Please post tokenized_list as well. 초보자 2022.5.31 13:49
Posted! Honeybee 2022.5.31 14:02

score 1 · Accepted Answer

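# What you need to do
clean_words = []
for i in tokenized_list:      # i is one document (a list of tokens)
    a = 0                     # number of stop words found in this document
    for ii in stop_words:
        if ii in i:
            a += 1
    if a < 1:                 # keep the document only if it has no stop words
        clean_words.append(i)

# What you are doing
clean_words = []
for i in tokenized_list:
    # stop_words (a list) is never an element of i, so this is always true
    if stop_words not in i:
        clean_words.append(i)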
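
To see why the list comprehension in the question leaves everything in place, here is a minimal sketch. It assumes tokenized_list is a list of token lists, one per document, which is what tokenize_korean_text returns. The sample tokens are made up for illustration.

```python
# Made-up sample data shaped like the question's tokenized_list:
# one inner list of tokens per document.
tokenized_list = [['하는', '분석', '수'], ['데이터', '처리']]
stop_words = ['하는', '수']

# Attempt from the question: each i is a whole list of tokens, and a list
# is never equal to any of the strings in stop_words, so every document
# passes the test unchanged.
unchanged = [i for i in tokenized_list if i not in stop_words]

print(unchanged == tokenized_list)  # True: nothing was removed
```

The membership test has to be applied to individual tokens (strings), not to the inner lists.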

초보자 1,785 points

Posted 2022-05-31 20:50:34

Hello 초보자, thank you for the code; it was a big help. However, when I run it, about half of my data disappears. I have posted a new question about this, and I would really appreciate it if you could take a look. Honeybee 2022.6.10 15:55
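
One plausible reason for the lost data, assuming the loop from the accepted answer was used as posted: it discards an entire document as soon as that document contains any stop word. A minimal sketch of filtering token by token instead follows. The sample data and the helper name remove_stop_words are hypothetical, not from the thread.

```python
def remove_stop_words(docs, stop_words):
    """Return a copy of docs with stop words removed from each document."""
    stop_set = set(stop_words)  # set membership avoids rescanning a list
    return [[tok for tok in doc if tok not in stop_set] for doc in docs]


# Made-up sample data shaped like the question's tokenized_list.
tokenized_list = [['하는', '분석', '수'], ['데이터', '처리']]
stop_words = ['하는', '수']

# Document-level filter, equivalent to the accepted answer's loop: the
# first document contains a stop word, so the whole document is dropped.
doc_level = [doc for doc in tokenized_list
             if not any(sw in doc for sw in stop_words)]

# Token-level filter: every document is kept, minus its stop words.
token_level = remove_stop_words(tokenized_list, stop_words)

print(len(doc_level), len(token_level))  # 1 2
print(token_level)                       # [['분석'], ['데이터', '처리']]
```

The token-level result keeps the list-of-lists shape, so it could be passed to corpora.Dictionary in place of clean_words. A document made up entirely of stop words would become an empty list, which may need separate handling.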

score 1

nowp 9,214 points

Posted 2022-05-31 16:52:11
