Edit history

Edited by 엽토군

Date: 2019.08.04

A question about web crawling with Python (edited)

python

crawling

div

html

api

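First, I would like to thank 엽토군 for leaving a comment.
I have edited the question.

Goals

1. Put the roughly 1,000 or more HTML pages into a list.
2. Use a for loop to extract the content I want from each page and put it into a new list.
3. Print the div with class "xxx" from the frame source.

Goals 1 and 3 already work.
All that remains is to put the results into a new list and save them to a CSV file.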

import urllib.request
import bs4

for informationurl in informationurls:
    i=[]
    url = informationurl
    html = urllib.request.urlopen(url)

    bsObj = bs4.BeautifulSoup(html, "html.parser")
    contents = bsObj.find("div", {"class":"user_content"})
    i+=(contents.text)
print(i)

I wasn't sure how the indentation should go, so I wrote it this way for now.

The list that holds the URLs is named informationurls.
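
A note on the loop above: because `i=[]` sits inside the for loop, the list is emptied on every pass, and `i += contents.text` adds the string to the list one character at a time instead of as a single item. A minimal sketch of collecting one entry per URL and then writing the result to a CSV file could look like this. The names `collect_texts`, `save_to_csv`, and `fetch_text`, as well as the column headers, are made up for illustration; `fetch_text` stands in for the `urlopen` and `bsObj.find(...)` steps from the question.

```python
import csv


def collect_texts(urls, fetch_text):
    """Call fetch_text(url) for every URL and collect the results in one list."""
    results = []                  # created once, before the loop starts
    for url in urls:
        text = fetch_text(url)    # hypothetical helper: download + extract one page
        results.append(text)      # append keeps each page's text as a single item
    return results


def save_to_csv(urls, texts, path):
    """Write one row per URL: the URL and the text extracted from it."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "content"])   # assumed column names
        writer.writerows(zip(urls, texts))
```

With this shape, `fetch_text` would contain the body of the original loop and return `contents.text`, and `save_to_csv(informationurls, texts, "result.csv")` would produce the file (the file name is only an example).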

I ran it again and got an error.

HTTPError                                 Traceback (most recent call last)
<ipython-input-49-d326d153da0a> in <module>
      5     i=[]
      6     url = informationurl
----> 7     html = urllib.request.urlopen(url)
      8 
      9     bsObj = bs4.BeautifulSoup(html, "html.parser")

~\Anaconda3\lib\urllib\request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    220     else:
    221         opener = _opener
--> 222     return opener.open(url, data, timeout)
    223 
    224 def install_opener(opener):

~\Anaconda3\lib\urllib\request.py in open(self, fullurl, data, timeout)
    529         for processor in self.process_response.get(protocol, []):
    530             meth = getattr(processor, meth_name)
--> 531             response = meth(req, response)
    532 
    533         return response

~\Anaconda3\lib\urllib\request.py in http_response(self, request, response)
    639         if not (200 <= code < 300):
    640             response = self.parent.error(
--> 641                 'http', request, response, code, msg, hdrs)
    642 
    643         return response

~\Anaconda3\lib\urllib\request.py in error(self, proto, *args)
    567         if http_err:
    568             args = (dict, 'default', 'http_error_default') + orig_args
--> 569             return self._call_chain(*args)
    570 
    571 # XXX probably also want an abstract factory that knows when it makes

~\Anaconda3\lib\urllib\request.py in _call_chain(self, chain, kind, meth_name, *args)
    501         for handler in handlers:
    502             func = getattr(handler, meth_name)
--> 503             result = func(*args)
    504             if result is not None:
    505                 return result

~\Anaconda3\lib\urllib\request.py in http_error_default(self, req, fp, code, msg, hdrs)
    647 class HTTPDefaultErrorHandler(BaseHandler):
    648     def http_error_default(self, req, fp, code, msg, hdrs):
--> 649         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    650 
    651 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: Forbidden

This is what came out.
I can tell that the problem is in html = urllib.request.urlopen(url), but I can't figure out how to fix it.
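
A 403 Forbidden response means the server refused the request. One common cause, though not necessarily the one here, is that some sites reject the default User-Agent that urllib sends. A minimal sketch of sending a browser-like User-Agent through `urllib.request.Request` is shown below. The helper names `build_request` and `fetch_html` and the `"Mozilla/5.0"` value are assumptions for illustration, and whether this is enough depends on the site and its terms of use.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Wrap the URL in a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Open the URL with the custom header and return the raw response bytes."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()
```

In the loop from the question, `html = fetch_html(url)` would take the place of `html = urllib.request.urlopen(url)`, since `bs4.BeautifulSoup` accepts the returned bytes as well.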

Edited by an unknown user

Date: 2019.08.04

A question about web crawling with Python (edited)

python

crawling

div

html

api

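First, I would like to thank 엽토군 for leaving a comment.

I have edited the question.

Goals

1. Put the roughly 1,000 or more HTML pages into a list.
2. Use a for loop to extract the content I want from each page and put it into a new list.
3. Print the div with class "xxx" from the frame source.

Goals 1 and 3 already work. All that remains is to put the results into a new list and save them to a CSV file.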

import urllib.request
import bs4

for informationurl in informationurls:
    i=[]
    url = informationurl
    html = urllib.request.urlopen(url)

    bsObj = bs4.BeautifulSoup(html, "html.parser")
    contents = bsObj.find("div", {"class":"user_content"})
    i+=(contents.text)
print(i)

I wasn't sure how the indentation should go, so I wrote it this way for now.

The list that holds the URLs is named informationurls.

I ran it again and got an error.

HTTPError                                 Traceback (most recent call last)
<ipython-input-49-d326d153da0a> in <module>
      5     i=[]
      6     url = informationurl
----> 7     html = urllib.request.urlopen(url)
      8 
      9     bsObj = bs4.BeautifulSoup(html, "html.parser")

~\Anaconda3\lib\urllib\request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    220     else:
    221         opener = _opener
--> 222     return opener.open(url, data, timeout)
    223 
    224 def install_opener(opener):

~\Anaconda3\lib\urllib\request.py in open(self, fullurl, data, timeout)
    529         for processor in self.process_response.get(protocol, []):
    530             meth = getattr(processor, meth_name)
--> 531             response = meth(req, response)
    532 
    533         return response

~\Anaconda3\lib\urllib\request.py in http_response(self, request, response)
    639         if not (200 <= code < 300):
    640             response = self.parent.error(
--> 641                 'http', request, response, code, msg, hdrs)
    642 
    643         return response

~\Anaconda3\lib\urllib\request.py in error(self, proto, *args)
    567         if http_err:
    568             args = (dict, 'default', 'http_error_default') + orig_args
--> 569             return self._call_chain(*args)
    570 
    571 # XXX probably also want an abstract factory that knows when it makes

~\Anaconda3\lib\urllib\request.py in _call_chain(self, chain, kind, meth_name, *args)
    501         for handler in handlers:
    502             func = getattr(handler, meth_name)
--> 503             result = func(*args)
    504             if result is not None:
    505                 return result

~\Anaconda3\lib\urllib\request.py in http_error_default(self, req, fp, code, msg, hdrs)
    647 class HTTPDefaultErrorHandler(BaseHandler):
    648     def http_error_default(self, req, fp, code, msg, hdrs):
--> 649         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    650 
    651 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: Forbidden

This is what came out. I can tell that the problem is in html = urllib.request.urlopen(url), but I can't figure out how to fix it.
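
With around 1,000 URLs, a single failing page stops the whole loop, because the `HTTPError` raised by `urlopen` is never caught. One possible approach is to catch the error per URL, record which ones failed, and keep going. The sketch below assumes a hypothetical helper `fetch_all`; the `opener` parameter is also an invention of this example, so the function can be tried without network access.

```python
import urllib.error
import urllib.request


def fetch_all(urls, opener=urllib.request.urlopen):
    """Fetch every URL, skipping those that raise HTTPError.

    Returns (pages, failed): pages maps url -> response bytes,
    failed is a list of (url, status_code) pairs.
    """
    pages = {}
    failed = []
    for url in urls:
        try:
            with opener(url) as response:
                pages[url] = response.read()
        except urllib.error.HTTPError as err:
            # Remember the failure and move on to the next URL.
            failed.append((url, err.code))
    return pages, failed
```

After the loop, `failed` shows which URLs returned 403 or another error status, which can help tell whether every page is blocked or only some of them.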