Strip HTML from strings in Python

番长樱梅 2020-04-03

from mechanize import Browser

br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
    print(line)

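When printing a line of an HTML file, I'm trying to find a way to show only the contents of each HTML element and not the formatting itself. If it finds '<a href="whatever.com">some text</a>', it should print only 'some text'; '<b>hello</b>' should print 'hello', and so on. How would one go about doing this?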

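神无 2020/04/03 10:24:14

Most of the time, using BeautifulSoup, html2text, or the code from @Eloff still leaves some HTML elements, JavaScript code, and so on in the output.

So you can combine these libraries and then remove the markdown formatting (Python 3):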

import re
import html2text
from bs4 import BeautifulSoup

def html2Text(html):
    def removeMarkdown(text):
        # Leading heading/list markers, then numbered-list prefixes
        # (escaped as "1\." by html2text, or plain "1.")
        for current in [r"^[ #*]{2,30}", r"^[ ]{0,30}\d\\\.", r"^[ ]{0,30}\d\."]:
            markdown = re.compile(current, flags=re.MULTILINE)
            text = markdown.sub(" ", text)
        return text

    def removeAngular(text):
        # Template placeholders such as {| ... |}, {* ... *}, {{ ... }} and [[ ... ]]
        angular = re.compile(r"[{][|].{2,40}[|][}]|[{][*].{2,40}[*][}]|[{][{].{2,40}[}][}]|\[\[.{2,40}\]\]")
        text = angular.sub(" ", text)
        return text

    h = html2text.HTML2Text()
    h.images_to_alt = True
    h.ignore_links = True
    h.ignore_emphasis = False
    h.skip_internal_links = True
    text = h.handle(html)
    # Second pass: let BeautifulSoup drop any markup html2text left behind
    soup = BeautifulSoup(text, "html.parser")
    text = soup.text
    text = removeAngular(text)
    text = removeMarkdown(text)
    return text

It works well for me, but it can certainly be improved...

Mandy 2020/04/03 10:24:14

For a project, I needed to strip HTML as well as CSS and JS. So I made a variation of Eloff's answer:

from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # super().__init__() calls reset() and sets convert_charrefs;
        # the old strict flag no longer exists in current Python 3
        super().__init__(convert_charrefs=True)
        self.fed = []
        self.css = False  # True while inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        if tag == "style" or tag == "script":
            self.css = True

    def handle_endtag(self, tag):
        if tag == "style" or tag == "script":
            self.css = False

    def handle_data(self, d):
        # Keep text only when it is not CSS or JavaScript source
        if not self.css:
            self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

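卡卡西Near 2020/04/03 10:24:14

Here is a solution similar to the currently accepted answer (https://stackoverflow.com/a/925630/95989), except that it uses the HTMLParser class directly (i.e. without subclassing), which makes it considerably more concise: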

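from html.parser import HTMLParser

def strip_html(text):
    parts = []
    parser = HTMLParser()
    # Collect every text node by pointing the data callback at list.append
    parser.handle_data = parts.append
    parser.feed(text)
    return ''.join(parts)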
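
One caveat with calling only feed(): as far as I can tell from the standard library's behaviour, HTMLParser can hold back trailing text that contains an unfinished-looking `&` until close() is called. Below is a minimal sketch of the same no-subclass idea with an explicit close(), assuming Python 3. The name strip_html_flushed is made up for this example.

```python
from html.parser import HTMLParser


def strip_html_flushed(text):
    """Return the text content of an HTML fragment (hypothetical helper)."""
    parts = []
    parser = HTMLParser()  # convert_charrefs defaults to True on Python 3.5+
    # Route every text node into the list instead of subclassing
    parser.handle_data = parts.append
    parser.feed(text)
    # close() tells the parser no more input is coming, so buffered text is emitted
    parser.close()
    return "".join(parts)


if __name__ == "__main__":
    print(strip_html_flushed('<a href="whatever.com">some text</a>'))
    print(strip_html_flushed("<b>R&D"))
```

Without the close() call, the second input is the kind of case where the trailing text may never reach handle_data.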

梅 2020/04/03 10:24:13

You can write your own function:

def StripTags(text):
    finished = 0
    while not finished:
        finished = 1
        start = text.find("<")
        if start >= 0:
            stop = text[start:].find(">")
            if stop >= 0:
                text = text[:start] + text[start+stop+1:]
                finished = 0
    return text
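
A hand-rolled scan like this treats every "<" as the start of a tag, so it is worth knowing where it breaks. The sketch below is a single-pass variant of the same idea (the name strip_tags_naive is made up here, and it is my rewrite rather than the answerer's code), followed by inputs that show the limits: plain-text comparisons are eaten, and entities are left undecoded.

```python
def strip_tags_naive(text):
    """Remove every '<...>' span in one left-to-right pass (hypothetical helper)."""
    out = []
    pos = 0
    while True:
        start = text.find("<", pos)
        if start < 0:
            break  # no further '<'
        stop = text.find(">", start)
        if stop < 0:
            break  # unterminated '<': keep the rest unchanged
        out.append(text[pos:start])
        pos = stop + 1
    out.append(text[pos:])
    return "".join(out)


if __name__ == "__main__":
    print(strip_tags_naive('<a href="whatever.com">some text</a>'))  # some text
    # A bare '<' in ordinary text is mistaken for a tag opener:
    print(strip_tags_naive("if a < b and b > c"))  # if a  c
    # Entities are passed through untouched:
    print(strip_tags_naive("a &amp; b"))  # a &amp; b
```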

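卡卡西Near 2020/04/03 10:24:13

I have successfully used Eloff's answer with Python 3.1 [many thanks!].

I upgraded to Python 3.2.3 and ran into errors.

The solution, provided here thanks to the responder Thomas K, is to insert super().__init__() into the following code: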

def __init__(self):
    self.reset()
    self.fed = []

so that it looks like this:

def __init__(self):
    super().__init__()
    self.reset()
    self.fed = []

... and it will work for Python 3.2.3.

Thanks again to Thomas K for the fix and to Eloff for the original code provided above!
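
The snippets above show only __init__. For context, here is a sketch of what the whole class could look like on Python 3. It assumes the rest of Eloff's class matches the handle_data and get_data methods shown in Mandy's variant earlier in this thread, so treat it as a reconstruction rather than Eloff's verbatim code.

```python
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    """Collects only the text nodes of the markup it is fed."""

    def __init__(self):
        # The base initialiser sets up parser state and calls reset() itself
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return "".join(self.fed)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()


if __name__ == "__main__":
    print(strip_tags('<a href="whatever.com">some text</a>'))  # some text
```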

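Gil伽罗小宇宙 2020/04/03 10:24:13

You can use a different HTML parser (such as lxml or Beautiful Soup), one that offers functions to extract just the text. Or you can run a regular expression on your line strings to strip out the tags. See the Python docs for more.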
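
A minimal sketch of the regex route, assuming the input is an iterable of text lines like the one in the question. The pattern and the name strip_tags_from_lines are illustrative choices, not something this answer specifies. Note that a per-line regex cannot handle tags that span several lines, and it keeps the contents of script and style elements.

```python
import re

# Anything from '<' up to the next '>' is treated as a tag
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags_from_lines(lines):
    """Yield each line with tag-like spans removed (hypothetical helper)."""
    for line in lines:
        yield TAG_RE.sub("", line)


if __name__ == "__main__":
    sample = ["<b>hello</b>\n", '<a href="whatever.com">some text</a>\n']
    for cleaned in strip_tags_from_lines(sample):
        print(cleaned, end="")
```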

LGil 2020/04/03 10:24:13

Here is my solution for Python 3.

import html
import re

def html_to_txt(html_text):
    ## unescape html
    txt = html.unescape(html_text)
    tags = re.findall("<[^>]+>",txt)
    print("found tags: ")
    print(tags)
    for tag in tags:
        txt=txt.replace(tag,'')
    return txt

I'm not sure whether it is perfect, but it solved my use case and seems simple.
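
One detail that may matter for other use cases is the order of the two steps. Unescaping first turns escaped text such as &lt;b&gt; into something that looks like a tag, which the regex then removes. The sketch below compares both orders; the function names are made up for this comparison and the pattern is the same one used in the answer.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")


def unescape_then_strip(html_text):
    """Order used in the answer above: decode entities, then drop tags."""
    return TAG_RE.sub("", html.unescape(html_text))


def strip_then_unescape(html_text):
    """Alternative order: drop real tags first, then decode entities."""
    return html.unescape(TAG_RE.sub("", html_text))


if __name__ == "__main__":
    sample = "<p>Use &lt;b&gt; for bold</p>"
    print(unescape_then_strip(sample))  # Use  for bold
    print(strip_then_unescape(sample))  # Use <b> for bold
```

Which order is right depends on whether escaped markup in the input should survive as visible text.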

Mandy 2020/04/03 10:24:13

The Beautiful Soup package does this for you right away.

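from bs4 import BeautifulSoup

# html holds the markup as a string; naming the parser avoids bs4's
# "no parser was explicitly specified" warning
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()
print(text)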

西门JinJin 2020/04/03 10:24:13

If you want to strip all HTML tags, the easiest way I found is to use BeautifulSoup:

from bs4 import BeautifulSoup  # Or from BeautifulSoup import BeautifulSoup

def stripHtmlTags(htmlTxt):
    if htmlTxt is None:
        return None
    else:
        return ''.join(BeautifulSoup(htmlTxt).findAll(text=True))

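I tried the code from the accepted answer, but I was getting "RuntimeError: maximum recursion depth exceeded", which did not happen with the block of code above.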
