Convert HTML to Plain Text while Preserving Certain Tags in JavaScript?

Hi Anki community!

As outlined in this post, I’m using the following code snippet provided by hkr to convert an HTML fragment to plain text:

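// convert an HTML fragment to plain text
var tempDiv = document.createElement('div');
// {{example}} is the Anki field placeholder; it is filled in with the field's HTML
tempDiv.innerHTML = '{{example}}';
// innerText depends on rendering, so attach the element before reading it
document.documentElement.appendChild(tempDiv);
var singleColumnCSV = tempDiv.innerText;
document.documentElement.removeChild(tempDiv);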

This conversion removes all HTML tags, but I would like to preserve certain ones, for example the <b> tags which I’m using to highlight stressed syllables.
Is there a way to preserve selected HTML tags such as <b>?

Thank you!

ssnoyes from the Anki subreddit kindly provided the solution:

tempDiv.innerHTML = '{{example}}';
tempDiv.innerHTML = tempDiv.innerHTML.replace(/(\<b>)(.*?)(\<\/b>)/g, "&lt;b&gt;$2&lt;&#47;b&gt;");

var singleColumnCSV = tempDiv.innerText;
singleColumnCSV = singleColumnCSV.replace(/&lt;b&gt;/g, "<b>").replace(/&lt;&#47;b&gt;/g, "</b>");

If you want to preserve more tags, add a line like this after the second line, replacing TAG with the tag name:
tempDiv.innerHTML = tempDiv.innerHTML.replace(/(\<TAG>)(.*?)(\<\/TAG>)/g, "&lt;TAG&gt;$2&lt;&#47;TAG&gt;");
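If several tags need to be preserved, the per-tag replace could be wrapped in a small helper so the list of tags lives in one place. This is only a sketch: `escapeAllowedTags` and its `tags` parameter are hypothetical names, not part of Anki or of ssnoyes's answer. It assumes the tags carry no attributes and that tag names are plain letters or digits. Unlike the regex above, it escapes opening and closing tags individually rather than as a pair.

```javascript
// Hypothetical helper: entity-escape the listed tags so they survive innerText.
// Assumes plain tags without attributes, e.g. <b>...</b>.
function escapeAllowedTags(html, tags) {
  return tags.reduce(function (result, tag) {
    // Matches the opening or the closing form of this tag, case-insensitively.
    var pattern = new RegExp('<(/?)' + tag + '>', 'gi');
    return result.replace(pattern, function (match, slash) {
      // Closing tags get the same &#47; encoding used in the snippet above.
      return '&lt;' + (slash ? '&#47;' : '') + tag + '&gt;';
    });
  }, html);
}
```

In the snippet above it would stand in for the second line, along the lines of `tempDiv.innerHTML = escapeAllowedTags(tempDiv.innerHTML, ['b', 'i', 'u']);`.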

See reddit post here:

I found another regex that seems to preserve all HTML tags.

source: https://stackoverflow.com/questions/1499889/remove-html-tags-in-javascript-with-regex

In this case:

tempDiv.innerHTML = '{{example}}';
var regex = /(<([^>]+)>)/ig;
tempDiv.innerHTML = tempDiv.innerHTML.replace(regex, "&lt;$2&gt;");
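To see what this regex does to a concrete string, here is a minimal sketch as a plain function. The name `escapeAllTags` and the sample input are made up for illustration. Note that it escapes every tag, including `<br>` and `<div>`, so those will show up as literal text instead of being turned into line breaks by innerText.

```javascript
// Hypothetical helper mirroring the regex above: entity-escape every tag.
function escapeAllTags(html) {
  // $1 is whatever sits between "<" and ">", e.g. "b", "/b" or "br".
  return html.replace(/<([^>]+)>/g, '&lt;$1&gt;');
}

var escaped = escapeAllTags('<b>Haus</b><br>');
// escaped is now '&lt;b&gt;Haus&lt;/b&gt;&lt;br&gt;'
```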