Parsing HTML and XML documents with the BeautifulSoup library
BeautifulSoup
Installation:
~/Desktop$ sudo pip install beautifulsoup4
Test:
from bs4 import BeautifulSoup

if __name__ == "__main__":
    # The first argument is the HTML document text; the second names the parser to use
    soup = BeautifulSoup('data', 'html.parser')
    print(soup.prettify())
Output:
data
This shows that the installation succeeded.
The Beautiful Soup library, also known as bs4, is a library for parsing, traversing, and maintaining a "tag tree".
Beautiful Soup parsers:
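As a rough sketch of how the parser is chosen: the second argument to BeautifulSoup names the parser. `html.parser` ships with Python, while `lxml` and `html5lib` are third-party packages that have to be installed separately. The helper name `parse_with` and the sample markup are hypothetical, used only for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup for illustration
HTML = "<p>data</p>"


def parse_with(parser_name, markup=HTML):
    """Return the parsed soup, or None if the requested parser is unavailable."""
    try:
        return BeautifulSoup(markup, parser_name)
    except FeatureNotFound:
        # Raised by bs4 when no installed parser matches the requested name
        return None


if __name__ == "__main__":
    for name in ("html.parser", "lxml", "html5lib"):
        soup = parse_with(name)
        status = "not installed" if soup is None else soup.p.string
        print(f"{name}: {status}")
```

Catching `FeatureNotFound` keeps the script usable on a machine where only the built-in parser is available.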
Basic elements of the BeautifulSoup class
Example:
import requests
from bs4 import BeautifulSoup


def handle_url(url):
    try:
        r = requests.get(url)
        r.raise_for_status()
        # requests falls back to ISO-8859-1 when the response declares no charset;
        # in that case use the encoding detected from the content instead
        if r.encoding == 'ISO-8859-1':
            r.encoding = r.apparent_encoding
        demo = r.text
        soup = BeautifulSoup(demo, 'html.parser')
        # There are many <a> tags, but soup.a returns only the first one
        print(soup.a)
    except Exception:
        print("fail fail fail")


if __name__ == "__main__":
    url = "http://www.baidu.com"
    handle_url(url)
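The basic element types can also be inspected offline. The following is a minimal sketch on a small inline document (the markup is a made-up sample, not taken from any real page) that touches a Tag, its name and attributes, a NavigableString, and a Comment.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical sample markup for illustration
HTML = '<p class="title" id="intro"><b>bold text</b><!-- a comment --></p>'

soup = BeautifulSoup(HTML, "html.parser")

tag = soup.p                      # Tag: the first <p> element
print(isinstance(tag, Tag))       # True
print(tag.name)                   # p
print(tag.attrs)                  # {'class': ['title'], 'id': 'intro'}
print(tag["id"])                  # intro

text = tag.b.string               # NavigableString: the text inside <b>
print(isinstance(text, NavigableString), text)

comment = tag.contents[1]         # Comment: a special kind of NavigableString
print(isinstance(comment, Comment), repr(comment))
```

Note that `class` comes back as a list because HTML allows several class names on one tag, while `id` is a plain string.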
Traversing HTML elements with Beautiful Soup
HTML has a tree structure, so there are three kinds of traversal:
Downward traversal:
import requests
from bs4 import BeautifulSoup


def handle_url(url):
    try:
        r = requests.get(url)
        r.raise_for_status()
        if r.encoding == 'ISO-8859-1':
            r.encoding = r.apparent_encoding
        demo = r.text
        soup = BeautifulSoup(demo, 'html.parser')
        print(soup.head)
        # Child nodes of the <head> tag
        print(soup.head.contents)
        # .contents is a list
        print(type(soup.head.contents))
        # Number of child nodes of <head> (5 for this page when the example was written)
        print(len(soup.head.contents))
        # Take the fifth child node of <head>
        print(soup.head.contents[4])
        # Iterate over the direct children with .children
        for child in soup.head.children:
            print(child)
        # Iterate over all descendants with .descendants
        for child in soup.head.descendants:
            print(child)
    except Exception:
        print("fail fail fail")


if __name__ == "__main__":
    url = "http://www.baidu.com"
    handle_url(url)
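Because the live page can change, the same downward traversal is easier to follow on a fixed document. This is a minimal sketch; the inline HTML is a made-up sample.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample document, used here in place of a live page
HTML = (
    "<html><head><title>Demo</title><meta charset='utf-8'/></head>"
    "<body><p>Hi <b>there</b></p></body></html>"
)
soup = BeautifulSoup(HTML, "html.parser")

# .contents is a list of the direct children
head_children = [child.name for child in soup.head.contents]
print(head_children)  # ['title', 'meta']

# .children iterates over the same direct children lazily
for child in soup.body.children:
    print(repr(child))

# .descendants walks the whole subtree in document order, text nodes included
body_tags = [n.name for n in soup.body.descendants if isinstance(n, Tag)]
body_texts = [str(n) for n in soup.body.descendants if not isinstance(n, Tag)]
print(body_tags)   # ['p', 'b']
print(body_texts)  # ['Hi ', 'there']
```

The `isinstance(n, Tag)` check separates element nodes from text nodes, which `.descendants` yields side by side.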
Upward traversal:
import requests
from bs4 import BeautifulSoup


def handle_url(url):
    try:
        r = requests.get(url)
        r.raise_for_status()
        if r.encoding == 'ISO-8859-1':
            r.encoding = r.apparent_encoding
        demo = r.text
        soup = BeautifulSoup(demo, 'html.parser')
        # The parent of the <html> tag is the BeautifulSoup object (the whole document),
        # so this prints the full page
        print(soup.html.parent)
        # soup itself is a special kind of tag node; its parent is None
        print(soup.parent)
        # Parent of the <title> tag
        print(soup.title.parent)
        # Walk up through the ancestors of <title>; the last one is the
        # BeautifulSoup object, whose name is '[document]'
        for parent in soup.title.parents:
            if parent is None:
                print(parent)
            else:
                print(parent.name)
    except Exception:
        print("fail fail fail")


if __name__ == "__main__":
    url = "http://www.baidu.com"
    handle_url(url)
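The upward direction can be checked on a fixed document as well. A minimal sketch, assuming a made-up inline page:

```python
from bs4 import BeautifulSoup

# Hypothetical sample document for illustration
HTML = "<html><head><title>Demo</title></head><body></body></html>"
soup = BeautifulSoup(HTML, "html.parser")

# The parent of <html> is the BeautifulSoup object, not <html> itself
print(soup.html.parent is soup)  # True

# The BeautifulSoup object sits at the top of the tree, so it has no parent
print(soup.parent is None)       # True

# .parents yields every ancestor, from the nearest one up to the document
ancestor_names = [parent.name for parent in soup.title.parents]
print(ancestor_names)            # ['head', 'html', '[document]']
```

The last name, `[document]`, is the name Beautiful Soup gives to the BeautifulSoup object itself.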
Sibling traversal: must take place under the same parent node
import requests
from bs4 import BeautifulSoup


def handle_url(url):
    try:
        r = requests.get(url)
        r.raise_for_status()
        if r.encoding == 'ISO-8859-1':
            r.encoding = r.apparent_encoding
        demo = r.text
        soup = BeautifulSoup(demo, 'html.parser')
        # The sibling immediately before <title>
        print(soup.title.previous_sibling)
        # The sibling immediately after <link>
        print(soup.link.next_sibling)
        # Iterate over all siblings that follow the <meta> tag
        for sibling in soup.meta.next_siblings:
            print(sibling)
        # Iterate over all siblings that precede the <title> tag
        for sibling in soup.title.previous_siblings:
            print(sibling)
    except Exception:
        print("fail fail fail")


if __name__ == "__main__":
    url = "http://www.baidu.com"
    handle_url(url)
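One thing worth keeping in mind with sibling traversal is that text nodes, including the whitespace between tags, count as siblings too. The sketch below shows this on a made-up inline document and filters down to tags where needed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup; the newlines between tags become text nodes
HTML = (
    "<head><meta charset='utf-8'/>\n"
    "<title>Demo</title>\n"
    "<link rel='stylesheet' href='a.css'/></head>"
)
soup = BeautifulSoup(HTML, "html.parser")

# The node right before <title> is the newline text, not <meta>
print(repr(soup.title.previous_sibling))  # '\n'

# Keep only Tag siblings when the whitespace is not of interest
following_tags = [s.name for s in soup.meta.next_siblings if isinstance(s, Tag)]
print(following_tags)                     # ['title', 'link']

# find_previous_sibling / find_next_sibling search by tag name directly
print(soup.title.find_previous_sibling("meta").name)  # meta
print(soup.title.find_next_sibling("link").name)      # link
```

Filtering with `isinstance(s, Tag)` is a simple way to skip whitespace when only elements matter.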