[Web Crawling] Crawling a site with POST requests


내만 2022. 9. 30. 20:55


🙆‍♂️Concept

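# Import requests for the HTTP call
import requests
# Import BeautifulSoup for HTML parsing
from bs4 import BeautifulSoup

def notices(targetUrl, page_index):
    # Target URL
    base_url = targetUrl

    # POST parameters: the page number to request
    post_params = {'page': page_index}

    # Send the POST request; the response is stored in res
    res = requests.post(base_url, data=post_params)

    # If the request failed
    if res.status_code != 200:
        # Print a message and return an empty list so the caller's loop can stop
        print("Can't")
        return []
    # If the request succeeded
    else:
        # List that collects the results
        results = []
        # Build a BeautifulSoup object and use it as soup
        soup = BeautifulSoup(res.text, "html.parser")
        # Collect every td tag that holds a post title (find_all)
        tds = soup.find_all('td', class_="_artclTdTitle")
        for td in tds:
            # Extract the a tags inside the td
            anchors = td.find_all('a')
            # Clean up each extracted a tag
            for anchor in anchors:
                link = f"https://community.bu.ac.kr{anchor['href']}"
                # Find the span tag inside the a tag that holds the title
                title = anchor.find('span')
                # Fall back to the anchor's own text if there is no span
                title_text = (title or anchor).get_text(strip=True)
                rst = {
                    'title': title_text,
                    'link': link
                }
                results.append(rst)
        return results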

This is the revised .py code for the crawling function.

It still has plenty of problems.
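One of those rough edges, for example, is that the host is hard-coded when the link is built. A minimal sketch of an alternative, assuming the hrefs are ordinary relative or absolute paths: the standard library's urljoin can resolve them against the page URL. The helper name absolute_link and the sample href are made up for illustration and do not come from the real site.

```python
from urllib.parse import urljoin

def absolute_link(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)

page_url = "https://community.bu.ac.kr/info/1787/subview.do"

# The href below is a made-up example, not a real post path.
example = absolute_link(page_url, "/bbs/example/1/artclView.do")
print(example)
```

With this approach the function could keep working for a different targetUrl without touching the link-building line.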

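Looking at the traffic in Burp Suite, I found that the site moves between pages with a POST request.

So, as in the code above, I created the post_params variable: it takes the page index and sends that page number along with the request.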
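To see what actually goes over the wire, here is a small sketch that builds (but does not send) the same kind of POST request with only the standard library. When a dict is passed as data=, requests form-encodes it, and the body built below should be the equivalent. The field name page comes from the code above; the helper name build_page_request is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_page_request(url, page_index):
    """Build (without sending) a POST request that carries the page number as form data."""
    body = urlencode({"page": page_index}).encode("ascii")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )

req = build_page_request("https://community.bu.ac.kr/info/1787/subview.do", 3)
# Show the HTTP method and the raw form-encoded body
print(req.get_method(), req.data)
```

Comparing this body with what Burp Suite shows for a real page click is a quick way to check that the parameter name and value match.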

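# Import the crawling function from the scrap module
from scrapModule.scrap import notices

noticeList = []
i = 1
while True:
    # Request page i of the notice board
    notice = notices("https://community.bu.ac.kr/info/1787/subview.do", i)
    # A non-empty result means the page had posts: keep them and go to the next page
    if notice:
        noticeList.extend(notice)
        i += 1
    # An empty result means there are no more posts: stop
    else:
        break


print(noticeList)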

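And this is the code that actually runs the crawl.

The while loop starts at i=1 and keeps going until it reaches a page with no posts.

On a page with no posts, the list returned by the function has length 0, and the loop uses that to stop.

This way I was able to pull in a lot of results.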
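The stop condition can also be tried out without touching the network. The sketch below is only an illustration under stated assumptions: collect_all, fake_fetch, and the max_pages safety cap are hypothetical names that are not part of the original code, and fake_fetch stands in for notices().

```python
def collect_all(fetch_page, max_pages=500):
    """Call fetch_page(1), fetch_page(2), ... until a page yields no items."""
    collected = []
    for page_index in range(1, max_pages + 1):
        items = fetch_page(page_index)
        if not items:  # an empty list (or None) ends the crawl
            break
        collected.extend(items)
    return collected

# Stand-in for notices(): three fake pages, then nothing.
fake_site = {
    1: [{"title": "a"}],
    2: [{"title": "b"}, {"title": "c"}],
    3: [{"title": "d"}],
}

def fake_fetch(page_index):
    return fake_site.get(page_index, [])

print(len(collect_all(fake_fetch)))
```

The cap is there so that a site which never returns an empty page cannot keep the loop running forever.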
