I'm trying to use the clean_html function in nltk to strip the markup from an HTML page:
import nltk,re,pprint
import urllib2
html=''  # URL omitted: the forum does not allow posting HTML links
h=urllib2.urlopen(html)
c=h.read()
raw=nltk.clean_html(c)
But it raises the following error:
Traceback (most recent call last):
  File "E:\python_project\test1.py", line 7, in <module>
    raw=nltk.clean_html(c)
  File "E:\python2.7\lib\site-packages\nltk\util.py", line 346, in clean_html
    raise NotImplementedError ("To remove HTML markup, use BeautifulSoup's get_text() function")
NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function
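
As the traceback shows, clean_html in this NLTK version only raises NotImplementedError, and the message itself names the replacement: BeautifulSoup's get_text(). Below is a minimal sketch of that approach, not a definitive fix. It assumes the bs4 package is installed (pip install beautifulsoup4) and uses Python's built-in "html.parser" backend. The helper name strip_html is hypothetical, and dropping script and style elements is an extra step added here, not something the error message asks for.

```python
from bs4 import BeautifulSoup  # assumption: installed via `pip install beautifulsoup4`


def strip_html(markup):
    """Return the visible text of an HTML string (hypothetical helper)."""
    # "html.parser" is the parser bundled with Python, so no extra
    # parser library such as lxml is needed.
    soup = BeautifulSoup(markup, "html.parser")

    # Added step (an assumption, not from the original post): remove
    # <script> and <style> elements so their contents do not end up
    # in the extracted text.
    for tag in soup(["script", "style"]):
        tag.decompose()

    # get_text() concatenates all remaining text nodes.
    return soup.get_text()


if __name__ == "__main__":
    sample = "<p>Hello <b>world</b></p>"
    print(strip_html(sample))
```

With the code from the question, this would mean calling raw = strip_html(c) in place of raw = nltk.clean_html(c).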