Learning Python Web Scraping
Scraping static pages
requests
Calling requests.get(URL) fetches the content at that URL.
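A minimal sketch of that call (the URL and the `timeout` value below are placeholders, not from these notes):

```python
import requests

# Fetch a page and read its HTML; example URL and timeout are assumptions.
response = requests.get("https://example.com", timeout=10)
response.raise_for_status()   # raise an exception on a 4xx/5xx response
print(response.text[:200])    # first 200 characters of the returned HTML
```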
Scraping dynamic pages
Still learning
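Dynamic pages render their data with JavaScript, so the HTML that requests.get returns often doesn't contain what the browser shows. One common approach, sketched below under assumptions, is to find the JSON endpoint the page calls (visible in the browser's Network panel) and request it directly; the endpoint URL and field names here are hypothetical placeholders, not a real API. Browser-automation tools such as Selenium or Playwright are another route when no clean endpoint exists.

```python
import requests

headers = {"User-Agent": "Mozilla/5.0"}  # pose as a browser, as in the notes above

# Hypothetical JSON endpoint discovered via the browser's Network panel.
api_url = "https://example.com/api/items?page=1"

data = requests.get(api_url, headers=headers, timeout=10).json()
for item in data.get("items", []):   # "items" and "title" are assumed field names
    print(item.get("title"))
```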
Initial practice
- Scrape the movie titles from the Douban Top 250
```python
from bs4 import BeautifulSoup
import requests

# Without headers the request fails with a 418 error: the site only wants to
# serve browser users, so we use a User-Agent header to pose as a browser.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}

for start_num in range(0, 250, 25):  # paginate: 25 movies per page
    response = requests.get(f"https://movie.douban.com/top250?start={start_num}", headers=headers)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    all_titles = soup.findAll("span", attrs={"class": "title"})
    for title in all_titles:
        title_string = title.string
        if "/" not in title_string:  # skip the alternate-language titles, which contain "/"
            print(title_string)
```
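A possible next step (not part of the original exercise) is to also pull each movie's rating and save everything to a CSV file. The sketch below assumes each entry sits in a `<div class="item">` block and that the score is in a `<span class="rating_num">`; both class names should be verified against the page source.

```python
import csv

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}

with open("douban_top250.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "rating"])
    for start_num in range(0, 250, 25):
        response = requests.get(
            f"https://movie.douban.com/top250?start={start_num}", headers=headers
        )
        soup = BeautifulSoup(response.text, "html.parser")
        # "item" and "rating_num" are assumed class names for each movie block
        for item in soup.find_all("div", attrs={"class": "item"}):
            title = item.find("span", attrs={"class": "title"}).string
            rating = item.find("span", attrs={"class": "rating_num"}).string
            writer.writerow([title, rating])
```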
- Use requests to fetch blog information
```python
import requests

r = requests.get('https://github.com/HauUhang')  # send the request
m = r.status_code                                # HTTP status code
print(m)                                         # print it (200 means success)
```
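Beyond the status code, the same response object exposes the page body. A small hedged extension, parsing the fetched HTML's `<title>` with BeautifulSoup (already used in the notes above):

```python
import requests
from bs4 import BeautifulSoup

r = requests.get('https://github.com/HauUhang')
r.raise_for_status()                      # stop early on a 4xx/5xx response
soup = BeautifulSoup(r.text, "html.parser")
print(soup.title.string if soup.title else "no <title> found")
```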