Python Web Scraping


Notes from learning Python web scraping.

Static Web Page Scraping

requests

Calling requests.get(URL) fetches the content located at that URL.
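As a minimal illustration (example.com is just a placeholder URL, not one used in these notes), the returned response object exposes the page body through its text attribute and the HTTP status through status_code:

import requests

response = requests.get("https://example.com")  # fetch the page at this URL
print(response.status_code)   # 200 means the request succeeded
print(response.text[:200])    # first 200 characters of the returned HTML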

Dynamic Web Page Scraping

Still to be learned.

First Practice

  1. Scrape the Douban Top 250 movie titles
from bs4 import BeautifulSoup
import requests

# Without a User-Agent header the site responds with a 418 error: it only wants to
# serve browser users, so the header disguises this request as coming from a browser.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}

for start_num in range(0, 250, 25):  # page through the list, 25 movies per page
    response = requests.get(f"https://movie.douban.com/top250?start={start_num}", headers=headers)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    all_titles = soup.find_all("span", attrs={"class": "title"})
    for title in all_titles:
        title_string = title.string
        if "/" not in title_string:  # alternate titles contain "/", so keep only the main ones
            print(title_string)
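A small extension of the script above (not part of the original notes; the file name and the one-second delay are arbitrary choices) collects the titles into a list, pauses between pages so the crawler does not hit Douban too quickly, and writes the result to a text file:

import time
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}
movie_titles = []

for start_num in range(0, 250, 25):
    response = requests.get(f"https://movie.douban.com/top250?start={start_num}", headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    for title in soup.find_all("span", attrs={"class": "title"}):
        if "/" not in title.string:
            movie_titles.append(title.string)
    time.sleep(1)  # wait a second between page requests

# save the collected titles, one per line
with open("douban_top250.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(movie_titles))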
  2. Fetch blog information with requests
import requests

r = requests.get('https://github.com/HauUhang')  # send the request
m = r.status_code  # status code of the response
print(m)  # print it (200 means the request succeeded)
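Going one step further than the status check (this part is not in the original notes; the <title> tag is just a convenient thing to pull out), the same response body can be parsed with BeautifulSoup:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://github.com/HauUhang')
if r.status_code == 200:                        # only parse when the request succeeded
    soup = BeautifulSoup(r.text, "html.parser")
    print(soup.title.string)                    # text of the page's <title> tag
else:
    print(f"Request failed with status code {r.status_code}")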
