While reading the graphs chapter of Algorithms, 4th Edition, I realized that the web is also a graph: pages are the nodes and the hyperlinks between them are the edges. That led to the idea of using breadth-first search to traverse a website and scrape the information I wanted.

GitHub repository: the movies.txt in the repo was produced by running the crawler on a server for a few hours; it collected roughly thirty thousand resource download addresses.

Analyzing the page source

As the figure shows, every movie page on 电影天堂 is linked with a URL of the form '/html….html', so a regular expression is used to match these links.
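For illustration, here is a minimal sketch of that matching step. The HTML snippet is made up; the pattern is the same one used in the full source below.

import re

# Made-up fragment standing in for a page from the site
html = '<a href="/html/gndy/dyzz/20180101/12345.html">Movie</a>'

# Same pattern as in the full source below: any quoted path ending in .html
links = re.findall(r'[^\'"<>]+\.html', html)
print(links)  # ['/html/gndy/dyzz/20180101/12345.html']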

Algorithm and data structure

The script implements BFS with a queue and keeps every visited link in a set, so the same link is never crawled twice and the program cannot fall into an infinite loop.
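A minimal sketch of that queue-plus-set pattern is shown here; fetch_links is a hypothetical helper standing in for the requests/regex code in the full source, assumed to return the links found on a page.

import queue

def crawl(start, fetch_links):
    # fetch_links(url) is a hypothetical helper returning the links on a page
    visited = set()      # links already seen, so nothing is crawled twice
    q = queue.Queue()    # FIFO queue gives breadth-first order
    visited.add(start)
    q.put(start)
    while not q.empty():
        url = q.get()
        for link in fetch_links(url):
            if link in visited:  # skip known links to avoid cycles
                continue
            visited.add(link)
            q.put(link)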

Matching the download links

When the crawler reaches a movie page, another regular expression picks out the ftp download links.
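A minimal sketch of that step, with a made-up page fragment; the pattern is the same one used in getftp() in the full source below.

import re

# Made-up fragment of a movie page; the real pages embed the address in an href
page = '<a href="ftp://example.com/Some.Movie.2018.mkv">download</a>'

# Same pattern as getftp() below: capture the quoted ftp address
ftp_links = re.findall(r'"(ftp[^\'"]+)"', page)
print(ftp_links)  # ['ftp://example.com/Some.Movie.2018.mkv']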

Running

python3 main.py

Source code

import re, requests, queue

LINK = set()      # every link that has already been crawled
List = []         # formatted "name  ftp-address" lines, written to file at the end
times = 10009970  # stop after this many resources have been collected

URL = "http://www.dytt8.net"
headers = {
    'Referer': 'http://www.dytt8.net/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
}

def getftp(link):
    """Fetch one movie page and collect the ftp download links on it."""
    try:
        global num
        r = requests.get(URL + link, headers=headers)
        r.encoding = r.apparent_encoding
        web = r.text
        movies = re.findall(r'"(ftp[^\'"]+)"', web)                    # quoted ftp addresses
        name = re.search('<title>.+《(.+)》.+</title>', web).group(1)  # movie name from the page title
        tplt = "{0:{2}^10}\t{1:{2}^90}\n"  # format string, padded with full-width spaces (chr(12288))
        for movie in movies:
            List.append(tplt.format(name, movie, chr(12288)))
            num += 1
            print(num)
            print(movie, name)
    except:
        print("error getftp")
        pass

num = 0  # number of resources collected so far

def bfs(url):
    """Breadth-first traversal of the site, starting from url."""
    Q = queue.Queue()  # FIFO queue of pages waiting to be crawled
    Q.put(url)
    global num
    while not Q.empty():
        try:
            url = Q.get()
            r = requests.get(URL + url, headers=headers)
            r.encoding = r.apparent_encoding
            text = r.text
            links = re.findall(r'[^\'"<>]+\.html', text)  # every .html link on the page
        except:
            continue
        for link in links:
            if link in LINK:  # already visited: skip it to avoid re-crawling and cycles
                continue
            getftp(link)
            LINK.add(link)
            Q.put(link)
            if num > times:
                return

def main():
    # starting point of the crawl
    url = '/plus/sitemap.html'
    bfs(url)

    # write the collected resources to file
    with open('movies.txt', 'w+', encoding='utf-8') as movies:
        for strs in List:
            movies.write(strs)

    with open('urls.txt', 'w+', encoding='utf-8') as url_file:
        for link in LINK:
            url_file.write(URL + link + '\n')

if __name__ == '__main__':
    main()