Friday, July 31, 2015

Double Encoding of HTML Entities in XML using VB.NET

I am trying to create an XML file using VB.NET for a system that accepts XML information for publication on their website. Even though special characters like © and ® are acceptable in the XML, they really should be encoded as entities (for example &copy; and &#174;).

The problem I am having is that when I insert &copy; into the XML, it comes out double-encoded as &amp;copy;. If I insert the literal ©, then © is written unencoded.

I created a bare-bones example below that replicates the problem; it has the same XML structure as what I need. I'm using HtmlAgilityPack to convert the entities.

Imports System.Xml
Imports HtmlAgilityPack

Private Sub webXML()

    Dim oXml As New XmlDocument

    oXml.LoadXml("<webTable xmlns=""http://ift.tt/1Jk9NWv"" xmlns:n1=""http://ift.tt/1IP9A1K"" xmlns:xsi=""http://ift.tt/1fQ42a7"" xsi:schemaLocation=""http://ift.tt/1IP9A1L""></webTable>")

    ''Add Namespace
    Dim NS As New Xml.XmlNamespaceManager(oXml.NameTable)
    NS.AddNamespace("ns", "http://ift.tt/1Jk9NWv")
    NS.AddNamespace("n1", "http://ift.tt/1IP9A1K")
    NS.AddNamespace("xsi", "http://ift.tt/1fQ42a7")

    ''Create XML declaration 
    Dim xmldecl As XmlDeclaration
    xmldecl = oXml.CreateXmlDeclaration("1.0", "UTF-8", Nothing)
    xmldecl.Encoding = "UTF-8"

    ''Add node to document 
    Dim root As XmlElement = oXml.DocumentElement
    oXml.InsertBefore(xmldecl, root)

    ''info
    Dim info As XmlNode = oXml.CreateNode("element", "info", "http://ift.tt/1Jk9NWv")

    ''data1
    Dim data1 As XmlNode = oXml.CreateNode("element", "data1", "http://ift.tt/1Jk9NWv")
    Dim data1Value As String = HtmlEntity.Entitize(Trim("Company Name 1 ©"), True)
    Dim data1Text As XmlText = oXml.CreateTextNode(data1Value)
    data1.AppendChild(data1Text)
    info.AppendChild(data1)
    Console.WriteLine("Data1 value: " + data1Value)
    Console.WriteLine("Data1 text node value: " + data1Text.Value)
    Console.WriteLine("Data1 node text value: " + data1.InnerText)
    Console.WriteLine("Data1 node XML value: " + data1.InnerXml)

    ''data2
    Dim data2 As XmlNode = oXml.CreateNode("element", "data2", "http://ift.tt/1Jk9NWv")
    Dim data2Value As String = Trim(HtmlEntity.Entitize("Company Name 2 ®", False))
    data2.InnerText = data2Value
    info.AppendChild(data2)
    Console.WriteLine("Data2 value: " + data2Value)
    Console.WriteLine("Data2 node text value: " + data2.InnerText)
    Console.WriteLine("Data2 node XML value: " + data2.InnerXml)

    ''data3
    Dim data3 As XmlNode = oXml.CreateNode("element", "data3", "http://ift.tt/1Jk9NWv")
    Dim data3value As String = Trim(HtmlEntity.Entitize("Company Name 3 ®", False))
    data3.InnerXml = data3value
    info.AppendChild(data3)
    Console.WriteLine("Data3 value: " + data3value)
    Console.WriteLine("Data3 node text value: " + data3.InnerText)
    Console.WriteLine("Data3 node XML value: " + data3.InnerXml)

    ''Add info to Root
    root.AppendChild(info)

    oXml.Save(Console.Out)
    oXml.Save("C:\Users\Chris\Dropbox\SECUREX\Junk\textXML.xml")

End Sub

The output I get is:

Data1 value: Company Name 1 &copy;
Data1 text node value: Company Name 1 &copy;
Data1 node text value: Company Name 1 &copy;
Data1 node XML value: Company Name 1 &amp;copy;
Data2 value: Company Name 2 &#174;
Data2 node text value: Company Name 2 &#174;
Data2 node XML value: Company Name 2 &amp;#174;
Data3 value: Company Name 3 &#174;
Data3 node text value: Company Name 3 ®
Data3 node XML value: Company Name 3 ®
<?xml version="1.0" encoding="Windows-1252"?>
<webTable xmlns="http://ift.tt/1Jk9NWv" xmlns:n1="http://ift.tt/1IP9A1K" xmlns:xsi="http://ift.tt/1fQ42a7" xsi:schemaLocation="http://ift.tt/1IP9A1L">
  <info>
    <data1>Company Name 1 &amp;copy;</data1>
    <data2>Company Name 2 &amp;#174;</data2>
    <data3>Company Name 3 ®</data3>
  </info>
</webTable>

As you can see, I have tried several different ways and I still can't get the entities encoded correctly.
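Here is a minimal sketch of what seems to be going on, using only System.Xml (the module name EscapeSketch and the element names asText and asMarkup are made up for illustration). A text node stores the string literally, so its ampersand is escaped when the node is serialized. InnerXml parses the string as markup, so the character reference is resolved to the real character before it is stored in the document.

```vbnet
Imports System
Imports System.Xml

Module EscapeSketch
    Sub Main()
        Dim doc As New XmlDocument()
        doc.LoadXml("<root/>")

        ' Text node: the string is literal text, so "&" is escaped on output.
        Dim asText As XmlElement = doc.CreateElement("asText")
        asText.AppendChild(doc.CreateTextNode("Name &#174;"))
        doc.DocumentElement.AppendChild(asText)

        ' InnerXml: the string is parsed as XML, so &#174; becomes U+00AE in the DOM.
        Dim asMarkup As XmlElement = doc.CreateElement("asMarkup")
        asMarkup.InnerXml = "Name &#174;"
        doc.DocumentElement.AppendChild(asMarkup)

        ' Serialized form of the text node: Name &amp;#174;
        Console.WriteLine(asText.InnerXml)

        ' Code point of the last character stored via InnerXml: 174
        Dim parsed As String = asMarkup.InnerText
        Console.WriteLine(Convert.ToInt32(parsed.Chars(5)))
    End Sub
End Module
```

If that reading is right, neither path is really double-encoding by mistake: the DOM only ever holds characters, and the choice of how to write them out is made by the writer at save time.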

Please note that oXml.Save(Console.Out) reports the encoding as Windows-1252, but my output file is otherwise identical and correctly declares UTF-8.
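As far as I can tell, this difference comes from XmlDocument.Save(TextWriter), which writes the encoding of the TextWriter into the declaration instead of the one set on the XmlDeclaration node. The sketch below illustrates that with a StringWriter and contrasts it with an XmlWriter created with explicit settings. The module name DeclarationEncodingSketch and the tiny document are assumptions for illustration only.

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports System.Xml

Module DeclarationEncodingSketch
    Sub Main()
        Dim doc As New XmlDocument()
        doc.LoadXml("<?xml version=""1.0"" encoding=""UTF-8""?><root/>")

        ' Saving to a TextWriter: the declaration follows the writer's encoding
        ' (a StringWriter reports UTF-16, Console.Out reports the console code page).
        Using sw As New StringWriter()
            doc.Save(sw)
            Console.WriteLine(sw.ToString())
        End Using

        ' Saving through an XmlWriter with explicit settings, here UTF-8 without a BOM.
        Dim settings As New XmlWriterSettings()
        settings.Encoding = New UTF8Encoding(False)
        Using ms As New MemoryStream()
            Using xw As XmlWriter = XmlWriter.Create(ms, settings)
                doc.Save(xw)
            End Using
            Console.WriteLine(Encoding.UTF8.GetString(ms.ToArray()))
        End Using
    End Sub
End Module
```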

I'm using VS 2012 Express.

Any idea what I can do to encode the HTML entities properly?
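One direction that may be worth trying is sketched below. I have not verified it against the receiving system. The idea is to store the real characters in the document with no manual entitizing, and then save through an XmlWriter whose encoding is ASCII. Characters the encoding cannot represent should then be written as numeric character references rather than raw characters. The module name NumericReferenceSketch and the cut-down document are assumptions for illustration. Note that the declaration will then name the ASCII encoding rather than UTF-8, which may or may not be acceptable.

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports System.Xml

Module NumericReferenceSketch
    Sub Main()
        Dim doc As New XmlDocument()
        doc.LoadXml("<info><data1/></info>")

        ' Store the actual copyright character (U+00A9), not an entity string.
        doc.DocumentElement.FirstChild.InnerText = "Company Name 1 " & ChrW(169)

        ' Assumption: an ASCII-encoded writer escapes non-ASCII text content
        ' as numeric character references when it serializes the document.
        Dim settings As New XmlWriterSettings()
        settings.Encoding = Encoding.ASCII
        settings.Indent = True

        Using ms As New MemoryStream()
            Using xw As XmlWriter = XmlWriter.Create(ms, settings)
                doc.Save(xw)
            End Using
            Console.WriteLine(Encoding.ASCII.GetString(ms.ToArray()))
        End Using
    End Sub
End Module
```

Named entities such as &copy; are not predefined in XML, so a numeric reference is probably the safer target unless the receiving schema declares the named ones.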

Thanks in advance.
