Edit history

Edit by 편집요청빌런

Date: 2020.01.22

Question about crawling with Python and Selenium

crawling

crawler

python

youtube

youtube-api

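I am running Python code that crawls YouTube comments. I want to scroll down the page so that all comments are collected, but the scrolling does not seem to work properly and only up to 20 comments are collected. Also, the script sometimes succeeds and sometimes does not run at all. Could the cause be in the code? If anyone knows, I would appreciate a little help.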

num_of_end = 4

while num_of_end:
    body.send_keys(Keys.END)
    time.sleep(2)
    num_of_end-= 1

This is the part that scrolls down. I am also posting the full code in case it helps solve the problem.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import pandas as pd
import numpy as np
import nltk
import seaborn as sns
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import string
import time

driver = webdriver.Chrome("C:\\Users\\AppData\\Roaming\\Microsoft\\Windows\\Start Menu\\Programs\\Python 3.8\\chromedriver")

# URL of the page to load
driver.get('https://www.youtube.com/watch?v=jv1b2c3EYb4')
time.sleep(1)

# Set the cursor position: target the body element so the page can be scrolled
body = driver.find_element_by_tag_name("body")
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")

# Get the video title
title = driver.find_element_by_xpath('//*[@id="container"]/h1/yt-formatted-string').text
print("Video Title: " + title + '\n\n')

# Move the scrollbar to the bottom to collect all comments

# Number of times to scroll the page down
num_of_end = 4

while num_of_end:
    body.send_keys(Keys.END)
    time.sleep(2)
    num_of_end-= 1


# Comment text (extracted via HTML tags)
comment = soup.find_all('yt-formatted-string', {'id':'content-text'})

comment_list = []
for c in comment:
    comment_list.append(c.get_text().strip())

# Comment author ID
user_id = soup.find_all('a', {'id':'author-text'})
id_list = []
for u in user_id:
    id_list.append(u.get_text().strip())

# Number of likes on each comment
like = soup.find_all('span', {'id':'vote-count-left'})

like_list_bf = []
like_list = []
for l in like:
    like_list_bf.append(l.get_text().strip())

# Replace empty values with 0
for bf in like_list_bf :
    if bf =='' :
        like_list.append(0)
    else :
        like_list.append(bf)


# Check the collected comments
for c in comment_list:
    print(c + '\n')

# Check that the lists have the same length before building the DataFrame
print(len(comment_list))
print(len(id_list))
print(len(like_list))

# Preprocess to avoid garbled text when saving to CSV (remove emoji, Arabic script, etc.)
s_filter = re.compile("[^"
                        "a-zA-Z"      #English
                        "ㄱ-ㅣ가-힣"  #Korean
                        "0-9"         #Number
                        "\{\}\[\]\/?.,;:|\)*~`!^\-_+<>@\#$%&\\\=\(\'\"" #Special characters
                        "\ " #space
                        "]+")
# Preprocess the comments
comment_result = []
for i in comment_list:
    i = re.sub(s_filter,"",i)
    i = ''.join(i)
    comment_result.append(i)

# IDs may also contain Arabic script, so they need preprocessing too
id_result = []
for i in id_list:
    i = re.sub(s_filter,"",i)
    i = ''.join(i)
    id_result.append(i)
# Combine the collected data into a single DataFrame
DB = pd.DataFrame({'id' : id_result,'comment' : comment_result,'like' : like_list})

# Add the comment length for analysis
DB['text_length'] = DB['comment'].apply(len)

# Inspect the DataFrame
DB.head()

# Export to a CSV file
DB.to_csv("Dataset Raw27.csv",encoding="euc-kr")
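
One possible explanation, offered as a sketch rather than a confirmed diagnosis: in the code above, `driver.page_source` is read before the scroll loop runs, so `soup` can only contain the comments that were loaded at that moment. In addition, four `Keys.END` presses with fixed sleeps may not be enough when the page loads slowly. The sketch below keeps scrolling until the page height stops growing. The helper name `scroll_until_stable` and its parameters are hypothetical, and `FakePage` is only a stand-in so the loop can be tried without a browser. With Selenium, `get_height` might be `lambda: driver.execute_script("return document.documentElement.scrollHeight")` and `scroll_down` might be `lambda: body.send_keys(Keys.END)`; both are assumptions about how the page behaves.

```python
import time
from typing import Callable


def scroll_until_stable(
    get_height: Callable[[], int],
    scroll_down: Callable[[], None],
    pause: float = 2.0,
    max_rounds: int = 50,
) -> int:
    """Scroll until the page height stops growing; return the number of scrolls."""
    last_height = get_height()
    rounds = 0
    while rounds < max_rounds:
        scroll_down()
        time.sleep(pause)  # give lazily loaded comments time to appear
        rounds += 1
        new_height = get_height()
        if new_height == last_height:
            break  # nothing new was loaded, so assume the end was reached
        last_height = new_height
    return rounds


class FakePage:
    """Stand-in for a browser page: grows by 100 per scroll, up to a limit."""

    def __init__(self, limit: int) -> None:
        self.height = 100
        self.limit = limit

    def scroll(self) -> None:
        self.height = min(self.height + 100, self.limit)


page = FakePage(limit=400)
rounds = scroll_until_stable(lambda: page.height, page.scroll, pause=0.0)
print(rounds, page.height)
```

In the original script, a loop like this would stand in for the fixed `num_of_end` loop, and the `html = driver.page_source` and `BeautifulSoup(html, "html.parser")` lines would move after it so that the parsed HTML includes the comments loaded by scrolling.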

Edit by Shin Gayeong

Date: 2020.01.22
