HTML Agility Pack: strip tags NOT IN whitelist

I'm trying to create a function that removes HTML tags and attributes that are not in a whitelist. I have the following HTML:

<b>first text </b> 
<b>second text here 
     <a>some text here</a> 
<a>some text here</a> 

</b> 
<a>some twxt here</a>

I am using the HTML Agility Pack, and the code I have so far is:

static List<string> WhiteNodeList = new List<string> { "b" }; 
static List<string> WhiteAttrList = new List<string> { }; 
static HtmlNode htmlNode; 
public static void RemoveNotInWhiteList(out string _output, HtmlNode pNode, List<string> pWhiteList, List<string> attrWhiteList) 
{ 

// remove all attributes not on white list 
foreach (var item in pNode.ChildNodes) 
{ 
    item.Attributes.Where(u => attrWhiteList.Contains(u.Name) == false).ToList().ForEach(u => RemoveAttribute(u)); 

} 

// remove all html and their innerText and attributes if not on whitelist. 
//pNode.ChildNodes.Where(u => pWhiteList.Contains(u.Name) == false).ToList().ForEach(u => u.Remove()); 
//pNode.ChildNodes.Where(u => pWhiteList.Contains(u.Name) == false).ToList().ForEach(u => u.ParentNode.ReplaceChild(ConvertHtmlToNode(u.InnerHtml),u)); 
//pNode.ChildNodes.Where(u => pWhiteList.Contains(u.Name) == false).ToList().ForEach(u => u.Remove()); 

for (int i = 0; i < pNode.ChildNodes.Count; i++) 
{ 
    if (!pWhiteList.Contains(pNode.ChildNodes[i].Name)) 
    { 
    HtmlNode _newNode = ConvertHtmlToNode(pNode.ChildNodes[i].InnerHtml); 
    pNode.ChildNodes[i].ParentNode.ReplaceChild(_newNode, pNode.ChildNodes[i]); 
    if (pNode.ChildNodes[i].HasChildNodes && !string.IsNullOrEmpty(pNode.ChildNodes[i].InnerText.Trim().Replace("\r\n", ""))) 
    { 
    HtmlNode outputNode1 = pNode.ChildNodes[i]; 
    for (int j = 0; j < pNode.ChildNodes[i].ChildNodes.Count; j++) 
    { 
    string _childNodeOutput; 
    RemoveNotInWhiteList(out _childNodeOutput, 
      pNode.ChildNodes[i], WhiteNodeList, WhiteAttrList); 
    pNode.ChildNodes[i].ReplaceChild(ConvertHtmlToNode(_childNodeOutput), pNode.ChildNodes[i].ChildNodes[j]); 
    i++; 
    } 
    } 
    } 
} 

// Console.WriteLine(pNode.OuterHtml); 
_output = pNode.OuterHtml; 
} 

private static void RemoveAttribute(HtmlAttribute u) 
{ 
u.Value = u.Value.ToLower().Replace("javascript", ""); 
u.Remove(); 

} 

public static HtmlNode ConvertHtmlToNode(string html) 
{ 
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument(); 
doc.LoadHtml(html); 
if (doc.DocumentNode.ChildNodes.Count == 1) 
    return doc.DocumentNode.ChildNodes[0]; 
else return doc.DocumentNode; 
}

The output I am trying to achieve is:

<b>first text </b> 
<b>second text here 
     some text here 
some text here 

</b> 
some twxt here

This means I only want to keep the <b> tags.
The reason I'm doing this is that some users copy-paste from MS Word into my WYSIWYG HTML editor.

Thanks!

2010-06-24 Dragos Durlut

Heh, apparently I almost found an answer in a blog post someone made...

using System.Collections.Generic; 
using System.Linq; 
using HtmlAgilityPack; 

namespace Wayloop.Blog.Core.Markup 
{ 
    public static class HtmlSanitizer 
    { 
     private static readonly IDictionary<string, string[]> Whitelist; 

     static HtmlSanitizer() 
     { 
      Whitelist = new Dictionary<string, string[]> { 
       { "a", new[] { "href" } }, 
       { "strong", null }, 
       { "em", null }, 
       { "blockquote", null }, 
       }; 
     } 

     public static string Sanitize(string input) 
     { 
      var htmlDocument = new HtmlDocument(); 

      htmlDocument.LoadHtml(input); 
      SanitizeNode(htmlDocument.DocumentNode); 

      return htmlDocument.DocumentNode.WriteTo().Trim(); 
     } 

     private static void SanitizeChildren(HtmlNode parentNode) 
     { 
      for (int i = parentNode.ChildNodes.Count - 1; i >= 0; i--) { 
       SanitizeNode(parentNode.ChildNodes[i]); 
      } 
     } 

     private static void SanitizeNode(HtmlNode node) 
     { 
      if (node.NodeType == HtmlNodeType.Element) { 
       if (!Whitelist.ContainsKey(node.Name)) { 
        node.ParentNode.RemoveChild(node); 
        return; 
       } 

       if (node.HasAttributes) { 
        for (int i = node.Attributes.Count - 1; i >= 0; i--) { 
         HtmlAttribute currentAttribute = node.Attributes[i]; 
         string[] allowedAttributes = Whitelist[node.Name]; 
         if (!allowedAttributes.Contains(currentAttribute.Name)) { 
          node.Attributes.Remove(currentAttribute); 
         } 
        } 
       } 
      } 

      if (node.HasChildNodes) { 
       SanitizeChildren(node); 
      } 
     } 
    } 
}

I got HtmlSanitizer from here. Apparently it does not strip the tags, but removes the element altogether.
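
To illustrate the difference: HtmlNode.RemoveChild has an overload with a keepGrandChildren flag. Without it the element goes away together with its content; with it the children are kept and only the tag itself is dropped. Below is a minimal sketch, assuming the HtmlAgilityPack package is referenced. UnwrapDemo and StripAnchors are made-up names for this illustration, not part of the library or of the blog code.

```csharp
using System;
using HtmlAgilityPack;

public static class UnwrapDemo
{
    // Removes every <a> element from the fragment.
    // keepGrandChildren == false: the element and its content are removed.
    // keepGrandChildren == true: the content stays, only the tag is dropped.
    public static string StripAnchors(string html, bool keepGrandChildren)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a");
        if (anchors != null)
        {
            foreach (HtmlNode anchor in anchors)
            {
                anchor.ParentNode.RemoveChild(anchor, keepGrandChildren);
            }
        }

        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        const string html = "<b>second <a>some text</a></b>";

        // Link text is lost together with the <a> element.
        Console.WriteLine(StripAnchors(html, false));

        // Link text is kept, only the <a> tag is stripped.
        Console.WriteLine(StripAnchors(html, true));
    }
}
```

The second call is the behaviour the question asks for, which is why the solution below ends up calling RemoveChild(node, true).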

OK, here is the solution for those who will need it later.

public static class HtmlSanitizer 
    { 
     private static readonly IDictionary<string, string[]> Whitelist; 
     private static List<string> DeletableNodesXpath = new List<string>(); 

     static HtmlSanitizer() 
     { 
      Whitelist = new Dictionary<string, string[]> { 
       { "a", new[] { "href" } }, 
       { "strong", null }, 
       { "em", null }, 
       { "blockquote", null }, 
       { "b", null}, 
       { "p", null}, 
       { "ul", null}, 
       { "ol", null}, 
       { "li", null}, 
       { "div", new[] { "align" } }, 
       { "strike", null}, 
       { "u", null},     
       { "sub", null}, 
       { "sup", null}, 
       { "table", null }, 
       { "tr", null }, 
       { "td", null }, 
       { "th", null } 
       }; 
     } 

     public static string Sanitize(string input) 
     { 
      if (input.Trim().Length < 1) 
       return string.Empty; 
      var htmlDocument = new HtmlDocument(); 

      htmlDocument.LoadHtml(input);    
      SanitizeNode(htmlDocument.DocumentNode); 
      string xPath = HtmlSanitizer.CreateXPath(); 

      return StripHtml(htmlDocument.DocumentNode.WriteTo().Trim(), xPath); 
     } 

     private static void SanitizeChildren(HtmlNode parentNode) 
     { 
      for (int i = parentNode.ChildNodes.Count - 1; i >= 0; i--) 
      { 
       SanitizeNode(parentNode.ChildNodes[i]); 
      } 
     } 

     private static void SanitizeNode(HtmlNode node) 
     { 
      if (node.NodeType == HtmlNodeType.Element) 
      { 
       if (!Whitelist.ContainsKey(node.Name)) 
       { 
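        // Rename every non-whitelisted element to a single marker name, so one
        // XPath-safe expression can find them all later (see StripHtml).
        //DeletableNodesXpath.Add(node.Name.Replace("?","")); 
        node.Name = "removeableNode"; 
        // Register the marker name only once; the list is static and would
        // otherwise grow with a duplicate entry for every stripped element.
        if (!DeletableNodesXpath.Contains(node.Name)) 
        { 
         DeletableNodesXpath.Add(node.Name); 
        } 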
        if (node.HasChildNodes) 
        { 
         SanitizeChildren(node); 
        }     

        return; 
       } 

       if (node.HasAttributes) 
       { 
        for (int i = node.Attributes.Count - 1; i >= 0; i--) 
        { 
         HtmlAttribute currentAttribute = node.Attributes[i]; 
         string[] allowedAttributes = Whitelist[node.Name]; 
         if (allowedAttributes != null) 
         { 
          if (!allowedAttributes.Contains(currentAttribute.Name)) 
          { 
           node.Attributes.Remove(currentAttribute); 
          } 
         } 
         else 
         { 
          node.Attributes.Remove(currentAttribute); 
         } 
        } 
       } 
      } 

      if (node.HasChildNodes) 
      { 
       SanitizeChildren(node); 
      } 
     } 

     private static string StripHtml(string html, string xPath) 
     { 
      HtmlDocument htmlDoc = new HtmlDocument(); 
      htmlDoc.LoadHtml(html); 
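      if (xPath.Length > 0) 
      { 
       HtmlNodeCollection invalidNodes = htmlDoc.DocumentNode.SelectNodes(xPath); 
       // SelectNodes returns null when nothing matches, so guard before iterating. 
       if (invalidNodes != null) 
       { 
        foreach (HtmlNode node in invalidNodes) 
        { 
         // true = keep the children, i.e. drop only the tag itself. 
         node.ParentNode.RemoveChild(node, true); 
        } 
       } 
      } 
      return htmlDoc.DocumentNode.WriteContentTo(); 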
     } 

     private static string CreateXPath() 
     { 
      string _xPath = string.Empty; 
      for (int i = 0; i < DeletableNodesXpath.Count; i++) 
      { 
       if (i != DeletableNodesXpath.Count - 1) 
       { 
        _xPath += string.Format("//{0}|", DeletableNodesXpath[i].ToString()); 
       } 
       else _xPath += string.Format("//{0}", DeletableNodesXpath[i].ToString()); 
      } 
      return _xPath; 
     } 
    }

I renamed the node because, if I had to parse an XML namespace node, it could crash the XPath parsing.

2010-06-24 06:08:07

The link for HtmlSanitizer is broken. This might be the code Meltdown is referring to: https://gist.github.com/814428

This is in no way the code from which I created the whitelist validation class. The original author did not use RegEx. The original author's code is the first piece of code I posted.

This code does not work: I can easily save a form with a submit button and a script section that contains malicious code.

Thanks for the code - great stuff!

I made a few optimizations...

class TagSanitizer 
{ 
    List<HtmlNode> _deleteNodes = new List<HtmlNode>(); 

    public static void Sanitize(HtmlNode node) 
    { 
     new TagSanitizer().Clean(node); 
    } 

    void Clean(HtmlNode node) 
    { 
     CleanRecursive(node); 
     for (int i = _deleteNodes.Count - 1; i >= 0; i--) 
     { 
      HtmlNode nodeToDelete = _deleteNodes[i]; 
      nodeToDelete.ParentNode.RemoveChild(nodeToDelete, true); 
     } 
    } 

    void CleanRecursive(HtmlNode node) 
    { 
     if (node.NodeType == HtmlNodeType.Element) 
     { 
      // Config.TagsWhiteList is not shown here: a tag name -> allowed attributes map.
      if (Config.TagsWhiteList.ContainsKey(node.Name) == false) 
      { 
       _deleteNodes.Add(node); 
      } 
      else if (node.HasAttributes) 
      { 
       for (int i = node.Attributes.Count - 1; i >= 0; i--) 
       { 
        HtmlAttribute currentAttribute = node.Attributes[i]; 

        string[] allowedAttributes = Config.TagsWhiteList[node.Name]; 
        if (allowedAttributes != null) 
        { 
         if (allowedAttributes.Contains(currentAttribute.Name) == false) 
         { 
          node.Attributes.Remove(currentAttribute); 
         } 
        } 
        else 
        { 
         node.Attributes.Remove(currentAttribute); 
        } 
       } 
      } 
     } 

     if (node.HasChildNodes) 
     { 
      node.ChildNodes.ToList().ForEach(v => CleanRecursive(v)); 
     } 
    } 
}

2011-10-06 12:13:58 Yacov

What is Config in this line? if (Config.TagsWhiteList.ContainsKey(node.Name) == false)

That is just another list; you can change it however you like :) – Yacov
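
For readers with the same question, here is one possible shape for that list. This is a hypothetical stand-in, not Yacov's actual class. It assumes only what TagSanitizer uses: ContainsKey on the tag name and an indexer returning a string[] of allowed attributes, where null means "no attributes allowed". The entries and the IsAttributeAllowed helper are illustrative.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the Config class that TagSanitizer refers to.
public static class Config
{
    // Tag name -> attributes allowed on that tag (null = strip all attributes).
    public static readonly Dictionary<string, string[]> TagsWhiteList =
        new Dictionary<string, string[]>
        {
            { "b", null },
            { "p", null },
            { "a", new[] { "href" } },
        };

    // Illustrative helper: true only if the tag is whitelisted and the
    // attribute appears in its allowed list.
    public static bool IsAttributeAllowed(string tag, string attribute)
    {
        string[] allowed;
        return TagsWhiteList.TryGetValue(tag, out allowed)
            && allowed != null
            && Array.IndexOf(allowed, attribute) >= 0;
    }
}

public static class ConfigDemo
{
    public static void Main()
    {
        Console.WriteLine(Config.TagsWhiteList.ContainsKey("b"));       // True
        Console.WriteLine(Config.TagsWhiteList.ContainsKey("script"));  // False
        Console.WriteLine(Config.IsAttributeAllowed("a", "href"));      // True
        Console.WriteLine(Config.IsAttributeAllowed("a", "onclick"));   // False
        Console.WriteLine(Config.IsAttributeAllowed("b", "style"));     // False
    }
}
```

HTML Agility Pack reports element and attribute names in lower case, so lower-case keys are enough for this sketch.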

As a side note, when I tried this I ran into problems with the resulting markup being inconsistent (sections out of order, not all formatting correctly removed), probably because of the multi-threading optimization combined with the recursion. – Elsa
