Is there an easy way to take an HTML string in JavaScript and strip out the HTML?
Strip HTML from text in JavaScript
The accepted answer works fine for the most part; however, in IE, if the html string is null you get "null" (instead of ''). Fixed:
function strip(html)
{
    if (html == null) return "";
    var tmp = document.createElement("DIV");
    tmp.innerHTML = html;
    return tmp.textContent || tmp.innerText || "";
}
I have created a working regular expression myself:
str=str.replace(/(<\?[a-z]*(\s[^>]*)?\?(>|$)|<!\[[a-z]*\[|\]\]>|<!DOCTYPE[^>]*?(>|$)|<!--[\s\S]*?(-->|$)|<[a-z?!\/]([a-z0-9_:.])*(\s[^>]*)?(>|$))/gi, '');
I just needed to strip out the <a> tags and replace them with the text of the link. This seems to work great.
// [^>]* keeps each match inside a single tag, so several links on one line are handled separately.
htmlContent = htmlContent.replace(/<a\b[^>]*>/gi, '');
htmlContent = htmlContent.replace(/<\/a>/gi, '');
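The two replacements above can also be folded into a single pass. This is a minimal sketch, assuming the goal is to drop every opening and closing <a> tag while keeping the link text. The function name unwrapLinks and the sample string are made up for illustration.

```javascript
// Hypothetical helper: removes <a ...> and </a> tags, keeps the text between them.
function unwrapLinks(html) {
  // \b stops the pattern from matching other tags that start with "a" (<abbr>, <article>).
  // [^>]* keeps each match inside a single tag.
  return String(html).replace(/<\/?a\b[^>]*>/gi, '');
}

console.log(unwrapLinks('See <a href="https://example.com">the docs</a> and <abbr>FAQ</abbr>.'));
// See the docs and <abbr>FAQ</abbr>.
```

Like any regex-based approach, this will cut a tag short if an attribute value itself contains a > character.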
A simple two-line jQuery approach to strip the HTML:
var content = "<p>checking the html source </p><p> </p><p>with </p><p>all</p><p>the html </p><p>content</p>";
var text = $(content).text(); // gets you the plain text
console.log(text); // check the data in your console
$("#text_area_id").val(text); // set your content in the textarea with id text_area_id
function stripHTML(my_string){
    var charArr = my_string.split(''),
        resultArr = [],
        htmlZone = 0,   // 1 while inside a tag
        quoteZone = 0;  // 1 inside "...", 2 inside '...' (only tracked within a tag)
    for( var x=0; x < charArr.length; x++ ){
        // The current character plus both state flags selects the transition.
        switch( charArr[x] + htmlZone + quoteZone ){
            case "<00" : htmlZone = 1;break;
            case ">10" : htmlZone = 0;resultArr.push(' ');break;
            case '"10' : quoteZone = 1;break;
            case "'10" : quoteZone = 2;break;
            case '"11' :
            case "'12" : quoteZone = 0;break;
            default : if(!htmlZone){ resultArr.push(charArr[x]); }
        }
    }
    return resultArr.join('');
}
This accounts for > inside quoted attribute values, and it avoids the problem of <img onerror="javascript"> running in newly created DOM elements, since no DOM elements are created.
usage:
clean_string = stripHTML("string with <html> in it")
demo:
https://jsfiddle.net/gaby_de_wilde/pqayphzd/
demo of top answer doing the terrible things:
If you want to keep the links and the structure of the content (h1, h2, etc.), then you should check out TextVersionJS. You can use it with any HTML, although it was created to convert an HTML email to plain text.
The usage is very simple. For example in node.js:
var createTextVersion = require("textversionjs");
var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
var textVersion = createTextVersion(yourHtml);
Or in the browser with pure js:
<script src="textversion.js"></script>
<script>
var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
var textVersion = createTextVersion(yourHtml);
</script>
It also works with require.js:
define(["textversionjs"], function(createTextVersion) {
    var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
    var textVersion = createTextVersion(yourHtml);
});
This should do the job in any JavaScript environment (Node.js included).
const text = `
<html lang="en">
<head>
<style type="text/css">*{color:red}</style>
</head>
<body><b>This is some text</b><br/></body>
</html>`;
// Rule to remove inline CSS.
const plainText = text.replace(/<style[^>]*>[\s\S]*?<\/style>/gm, '')
// Rule to remove all opening, closing and orphan HTML tags.
.replace(/<[^>]+>/gm, '')
// Rule to remove leading spaces and repeated CR/LF.
.replace(/([\r\n]+ +)+/gm, '');
myString.replace(/<[^>]*>?/gm, '');
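As a quick check of what this one-liner does and does not handle, here is a small sketch. The wrapper name stripTags and the sample strings are made up for illustration. Note that it removes tags only: character entities such as &amp; are left as they are, and a > inside a quoted attribute value ends the match early.

```javascript
// Hypothetical wrapper around the one-line regex above.
function stripTags(input) {
  // Matches "<", then everything up to the next ">". The ">" is optional,
  // so an unterminated tag at the end of the string is removed too.
  return String(input).replace(/<[^>]*>?/gm, '');
}

console.log(stripTags('<p>Hello <b>world</b></p>'));  // Hello world
console.log(stripTags('Tom &amp; Jerry<br'));         // Tom &amp; Jerry
console.log(stripTags('<a title="a > b">link</a>'));  //  b">link
```

The last line shows the limitation: the match stops at the first >, so part of the attribute leaks into the output.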
The simplest way:
jQuery(html).text();
This retrieves all the text from a string of HTML.
If you're running in a browser, then the easiest way is to just let the browser do it for you...
function stripHtml(html)
{
    var tmp = document.createElement("DIV");
    tmp.innerHTML = html;
    return tmp.textContent || tmp.innerText || "";
}
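Note: as people have pointed out in the comments, this is best avoided if you don't control the source of the HTML (for example, don't run this on anything that could have come from user input). For those cases, you can still let the browser do the work for you - see Saba's answer on using the now widely available DOMParser.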
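A minimal sketch of the DOMParser approach mentioned in the note, assuming a browser environment where DOMParser is available as a global. The function name stripHtmlSafely and the injectable Parser parameter are made up for illustration; the parameter is only there so the function can be exercised outside a browser. The markup is parsed into a separate document rather than assigned to an element's innerHTML, so handlers such as <img onerror=...> should not run.

```javascript
// Hypothetical helper; Parser defaults to the browser's DOMParser.
function stripHtmlSafely(html, Parser = globalThis.DOMParser) {
  if (html == null) return '';
  // Parse into a separate, inert document instead of touching an element
  // that belongs to the live page.
  const doc = new Parser().parseFromString(String(html), 'text/html');
  return (doc.body && doc.body.textContent) || '';
}
```

In a browser, stripHtmlSafely('<p>Hello <b>world</b></p>') returns the text content of the parsed body. Outside a browser, a DOM implementation would have to be passed in as the second argument.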
For escape characters as well, this will work using pattern matching: