(Beginner) Web crawling works on some pages but not on others :(

Question

885 views

python

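I'm studying web crawling by going into the "auction" (경매) and "public sale" (공매) categories of the same website (for example, 두인경매 / Dooin Auction). Line 3 of the code is the auction URL and line 4 is the public-sale URL. When I run it with the auction URL (line 3), `tots` on line 8 gets a value, but when I run it with the public-sale URL (line 4), I get nothing for `tots`. (I run with only one of lines 3 and 4 active at a time.)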

I inspected the HTML of both pages, and `div.pagn` is unique in each of them.

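Ultimately I want to get the value that is boxed in the HTML code (14032 for the auction page, 2153 for the public-sale page). Extracting a value from the middle of a string isn't easy either. I'm not sure whether it's okay to ask a question like this, but I've been at it for hours and can't figure it out.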

import urllib.request
from bs4 import BeautifulSoup
url = 'http://www.dooinauction.com/auction/ca_list.php'  # auction category
# url = 'http://www.dooinauction.com/pubauct/list.php'  # public-sale category (swap the comment to test this one)
req = urllib.request.Request(url)
html = urllib.request.urlopen(req).read()
soup = BeautifulSoup(html, 'html.parser')
tots = soup.select('div.pagn')
print(tots)
print('Test end')

Auction page HTML (screenshot)

Public-sale page HTML (screenshot)

정영훈 15,709 points

Edited on 2019-12-28 20:47:04
Asked by: unknown user

score 1 · Accepted Answer

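The page that does not work fails because it receives its data dynamically and renders it on the client side, so the HTML that the server returns does not yet contain the values you are looking for.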
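One way to check this for yourself is to count the links inside `div.pagn` in the raw HTML you downloaded. The following is only a minimal sketch using the standard library: `PagnLinkCounter`, `count_pagn_links`, and both sample HTML strings are made up for illustration and are not taken from the real site.

```python
from html.parser import HTMLParser


class PagnLinkCounter(HTMLParser):
    """Counts <a> tags that appear inside <div class="pagn">."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # <div> nesting depth inside div.pagn (0 = outside)
        self.link_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            # Enter div.pagn, or go one level deeper if already inside it.
            if self.depth or "pagn" in classes:
                self.depth += 1
        elif tag == "a" and self.depth:
            self.link_count += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def count_pagn_links(html: str) -> int:
    parser = PagnLinkCounter()
    parser.feed(html)
    return parser.link_count


# Made-up samples: the first mimics pagination rendered by the server,
# the second mimics a page whose pagination is filled in later by JavaScript.
static_html = '<div class="pagn"><a href="?start=20&total_record=14568">2</a></div>'
dynamic_html = '<div class="pagn"></div><script>loadList();</script>'

print(count_pagn_links(static_html))   # 1
print(count_pagn_links(dynamic_html))  # 0
```

If the count is 0 for the HTML you fetched but the browser shows page links, the content is most likely being added after the page loads.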

import re
import requests
from bs4 import BeautifulSoup

url = 'http://www.dooinauction.com/auction/ca_list.php'  # auction category

html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
# Pagination links inside div.pagn; each href contains total_record=<count>.
tots = soup.select('div.pagn a')

# Pull the digits that follow "total_record=" out of each link's href.
results = [re.findall(r'total_record=([0-9]+)', link['href'])[0] for link in tots]

print(results)

['14568',
 '14568',
 '14568',
 '14568',
 '14568',
 '14568',
 '14568',
 '14568',
 '14568',
 '14568',
 '14568']
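Since every pagination link repeats the same total, you may want to reduce the list to a single number. Here is a rough sketch of one way to do that with `urllib.parse` instead of a regex; the function name `total_record_from_hrefs` and the sample hrefs are hypothetical, shaped like the links matched above.

```python
from urllib.parse import parse_qs, urlsplit


def total_record_from_hrefs(hrefs):
    """Return the one total_record value shared by the links, or None."""
    totals = set()
    for href in hrefs:
        values = parse_qs(urlsplit(href).query).get("total_record")
        # Skip links without the parameter or with a non-numeric value.
        if values and values[0].isdigit():
            totals.add(int(values[0]))
    # All links should agree on the total; refuse to guess if they do not.
    return totals.pop() if len(totals) == 1 else None


# Hypothetical hrefs for illustration only.
sample_hrefs = [
    "ca_list.php?start=20&total_record=14568",
    "ca_list.php?start=40&total_record=14568",
    "#",
]
print(total_record_from_hrefs(sample_hrefs))  # 14568
```

With the code above you could pass `[link.get('href', '') for link in tots]` as the argument.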

The public-sale page is the problem case. For that one, use the link below instead. It returns XML, so you can parse that and use the result.

http://www.dooinauction.com/xml/pubauct_list.php?pdNo=&pdStatus=1&sdate=&edate=&g_sprice=0&g_eprice=0&ctgr1=0&ctgr2=0&l_sprice=0&l_eprice=0&sido=0&gugun=0&dong=0&ref_page=&ref_sido=&ref_gugun=&ref_dong=&decrease=0&order_type=0&list_scale=20&page_scale=10&start=0&total_record=0
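A query string this long is easier to read and change when it is built from a dict. The sketch below reproduces the same URL with `urllib.parse.urlencode`; the names `DEFAULT_PARAMS` and `build_list_url` are made up here, and what each parameter means to the server (for example, whether `start` is a row offset) is an assumption the thread does not confirm.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.dooinauction.com/xml/pubauct_list.php"

# Parameter names and values copied from the link above, in the same order.
DEFAULT_PARAMS = {
    "pdNo": "", "pdStatus": 1, "sdate": "", "edate": "",
    "g_sprice": 0, "g_eprice": 0, "ctgr1": 0, "ctgr2": 0,
    "l_sprice": 0, "l_eprice": 0, "sido": 0, "gugun": 0, "dong": 0,
    "ref_page": "", "ref_sido": "", "ref_gugun": "", "ref_dong": "",
    "decrease": 0, "order_type": 0, "list_scale": 20, "page_scale": 10,
    "start": 0, "total_record": 0,
}


def build_list_url(**overrides):
    """Build the list URL, replacing any default parameter given in overrides."""
    params = {**DEFAULT_PARAMS, **overrides}
    return f"{BASE_URL}?{urlencode(params)}"


print(build_list_url())
```

Calling `build_list_url(start=20)` would change only the `start` value and leave the rest as in the original link.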

import requests
from bs4 import BeautifulSoup

url = 'http://www.dooinauction.com/xml/pubauct_list.php?pdNo=&pdStatus=1&sdate=&edate=&g_sprice=0&g_eprice=0&ctgr1=0&ctgr2=0&l_sprice=0&l_eprice=0&sido=0&gugun=0&dong=0&ref_page=&ref_sido=&ref_gugun=&ref_dong=&decrease=0&order_type=0&list_scale=20&page_scale=10&start=0&total_record=0'  # public-sale category

html = requests.get(url).content
# The 'lxml-xml' parser requires the lxml package to be installed.
soup = BeautifulSoup(html, 'lxml-xml')

print(soup.find('total_record').text)

2153
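If you would rather not install lxml, the same lookup can probably be done with the standard library's `xml.etree.ElementTree`. This is only a sketch: the helper `read_total_record` is hypothetical, and the sample XML is invented. The only thing known from the code above is that the response contains a `total_record` element; the rest of its structure is assumed.

```python
import xml.etree.ElementTree as ET


def read_total_record(xml_data):
    """Return total_record as an int, or None if the element is missing or empty."""
    root = ET.fromstring(xml_data)
    # iter() walks the whole tree, including the root element itself.
    node = next(root.iter("total_record"), None)
    if node is None or node.text is None:
        return None
    return int(node.text.strip())


# Invented sample; the real response layout may differ.
sample_xml = "<result><total_record>2153</total_record><item/></result>"
print(read_total_record(sample_xml))  # 2153
```

`ET.fromstring` also accepts bytes, so passing `requests.get(url).content` directly should work.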

2 answers

Answer 1 (score 1): 정영훈, 15,709 points, written 2019-12-28 18:22:57

Answer 2 (score 1): 정영훈, 15,709 points, written 2019-12-29 01:43:48
