The requests library

To fetch the contents of a web page from Python, use the requests library.

requests.get(url) : fetches the web page at url.

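This function returns a response object for the page. Its text attribute (a property, not a method call) holds the HTML decoded as a string, so printing it shows the page source.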

import requests

webpage = requests.get(
    "https://en.wikipedia.org/wiki/List_of_American_exchange-traded_funds")
print(webpage.text)

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of American exchange-traded funds - Wikipedia</title>
...
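
The request above assumes the server answers successfully. As a minimal sketch, the call can be wrapped so that HTTP errors surface immediately. The helper name fetch_html and the 10-second default timeout are assumptions made for illustration, not part of the original notes.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the decoded HTML at url, raising on an HTTP error status.

    fetch_html is a hypothetical helper; the timeout default is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses,
    # so an error page is not mistaken for the page we wanted.
    response.raise_for_status()
    # .text is the body decoded to str; .content is the same body as raw bytes.
    return response.text
```

Calling fetch_html with the Wikipedia URL used above should return the same string that webpage.text holds, provided the server accepts the request.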

BeautifulSoup

A Python library that navigates an HTML document and extracts only the parts you want.

Parser

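Once requests.get has returned the page, its HTML still has to be parsed. The component that performs this syntactic analysis is called a parser, and a parser name must be passed when creating a BeautifulSoup object.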

html.parser : the HTML parser included in Python's standard library.
lxml : a third-party parser that can also process XML files.
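
As a small sketch of how the parser name is supplied, the snippet below builds a BeautifulSoup object from an inline HTML string instead of a downloaded page. The sample HTML and the variable names are made up for illustration. Only html.parser is used, since lxml has to be installed separately.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, standing in for a downloaded page.
html = "<html><head><title>ETF list</title></head><body><p>Hello</p></body></html>"

# The second argument names the parser to use.
soup_from_str = BeautifulSoup(html, "html.parser")

# Bytes are accepted as well, which is why response.content can be passed directly.
soup_from_bytes = BeautifulSoup(html.encode("utf-8"), "html.parser")

print(soup_from_str.title.string)
print(soup_from_bytes.title.string)
```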

Accessing tags

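Once the BeautifulSoup object has been created, a tag can be accessed as if it were an attribute of that object.

Attributes of the resulting tag object then give its name, its string, and so on.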

name : the name of the tag
string : the string contained in the tag

import requests
from bs4 import BeautifulSoup

webpage = requests.get(
    "https://en.wikipedia.org/wiki/List_of_American_exchange-traded_funds")
soup = BeautifulSoup(webpage.content, "html.parser")

print("tag : h1 ------------------------")
print(soup.h1)
print()
print("tag : title ------------------------")
print(soup.title)
print()
print("tag name : ", soup.title.name)
print("tag string : ", soup.title.string)
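
Two details of attribute-style access are easy to miss, and the offline sketch below illustrates them: soup.tagname returns only the first matching tag, and it returns None when no such tag exists. The sample HTML and variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document with two h1 tags and no table.
html = """
<html><head><title>Sample page</title></head>
<body><h1>First heading</h1><h1>Second heading</h1></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

first_h1 = soup.h1            # only the first <h1> is returned
tag_name = soup.title.name    # the tag's name as a string
tag_text = soup.title.string  # the text inside <title>
missing = soup.table          # None, because the document has no <table>

print(first_h1)
print(tag_name, tag_text)
print(missing)
```

Because a missing tag yields None, chaining such as soup.table.string on a page without a table raises AttributeError, so it is worth checking for None first.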