Introduction

Converting HTML pages to other formats, especially PDF, has become routine work for web developers. The process itself is fairly straightforward, because plenty of PDF libraries and services are available. However, one day you may need not just a PDF copy of a page but one whose content is modified automatically on the way (for example, you may want to access SVG data on the page). In this article I’m going to show a simple example of accomplishing this in ASP.NET, using a few .NET and XSL tips and the PD4ML PDF library.

Step 1: Searching for xHTML markup.

ASP.NET makes it easy to build complicated pages. However, the server controls and the rest of the page source have very little in common with the xHTML markup that is finally rendered and sent to the client. That’s why the first thing we need to do is bring that markup to light. It is produced by the page’s “Render” method during the rendering stage of the page life cycle, so we need to override this method.

    // Requires: using System.IO; using System.Web.UI;
    protected override void Render(HtmlTextWriter output)
    {
       // Render into an in-memory writer so the markup can be captured
       StringWriter writer = new StringWriter();
       HtmlTextWriter htmlWriter = new HtmlTextWriter(writer);
       base.Render(htmlWriter);
       // Copy the markup to a string and save it to disk
       // (the application needs write access to this folder)
       string htmlMarkup = writer.ToString();
       using (StreamWriter xmlWriter = new StreamWriter(Server.MapPath("HTMLoutput.xml")))
       {
          xmlWriter.Write(htmlMarkup);
       }
       // Send the same markup to the real response writer for display
       output.Write(htmlMarkup);
    }
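
The saved file is later fed to an XML reader, so it may be worth checking that the captured markup really is well-formed before relying on it. Below is a minimal sketch of such a check. The MarkupCheck class and the FindXmlError method are hypothetical names, and the sketch assumes .NET 4.0 or later, where XmlReaderSettings.DtdProcessing is available.

    using System;
    using System.IO;
    using System.Xml;

    static class MarkupCheck
    {
        // Returns null when the markup parses as XML,
        // otherwise the message reported by the parser.
        public static string FindXmlError(string markup)
        {
            XmlReaderSettings settings = new XmlReaderSettings();
            // Skip the DOCTYPE instead of loading the DTD. Side effect:
            // named entities such as &nbsp; are reported as errors.
            settings.DtdProcessing = DtdProcessing.Ignore;
            try
            {
                using (XmlReader reader = XmlReader.Create(new StringReader(markup), settings))
                {
                    while (reader.Read()) { }   // walk the whole document
                }
                return null;
            }
            catch (XmlException ex)
            {
                return ex.Message;
            }
        }

        static void Main()
        {
            Console.WriteLine(FindXmlError("<p>ok</p>") == null);      // well-formed
            Console.WriteLine(FindXmlError("<p>broken") == null);      // unclosed element
        }
    }

In the page itself, a call such as FindXmlError(htmlMarkup) could run right before the markup is written to disk.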

Step 2: Getting ready for XSL transformation.

Now we need to prepare our XSLT file. ASP.NET produces valid xHTML markup, so we only need to transform it according to our needs, but there are still some problems you may face:

  • First, don’t forget that xHTML markup uses the default namespace xmlns="http://www.w3.org/1999/xhtml", so we need to bind a prefix to that namespace in our XSLT file to reach the nodes. That’s why we add xmlns:xhtml="http://www.w3.org/1999/xhtml" to the stylesheet element and list xhtml in “exclude-result-prefixes” to keep the prefix out of the result document.
  • Second, now we are able to do transformations, but there is another problem: lots of xmlns="" declarations in the output document. They appear because elements written literally in the stylesheet belong to no namespace, while the xHTML elements around them do. To get rid of them, add xmlns="http://www.w3.org/1999/xhtml" to the namespace declarations of the XSLT file.
  • Third, XSLT’s built-in rules copy every text node they reach to the output, so plain text you did not ask for leaks into the result. To suppress it, put an empty <xsl:template match="xhtml:body//text()"/> template in your XSLT style sheet.

The beginning of the style sheet then looks like this:

    <?xml version="1.0" encoding="utf-8"?>
    <xsl:stylesheet version="1.0"
                   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                   xmlns:msxsl="urn:schemas-microsoft-com:xslt"
                   xmlns:xhtml="http://www.w3.org/1999/xhtml"
                   xmlns="http://www.w3.org/1999/xhtml"
                   exclude-result-prefixes="msxsl xhtml">
    <!-- the rest of the xsl file -->
    </xsl:stylesheet>
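
To see the three tips working together, here is a small console sketch. The sample markup, the templates and the class name are hypothetical and only illustrate the idea: the xhtml prefix reaches the nodes, the default namespace keeps xmlns="" out of the result, and the empty text template drops the loose text.

    using System;
    using System.IO;
    using System.Xml;
    using System.Xml.Xsl;

    static class XhtmlTransformSketch
    {
        // Hypothetical input standing in for the saved page markup
        const string Xhtml =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>" +
            "loose text<p>kept</p></body></html>";

        const string Xslt = @"<xsl:stylesheet version='1.0'
            xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
            xmlns:xhtml='http://www.w3.org/1999/xhtml'
            xmlns='http://www.w3.org/1999/xhtml'
            exclude-result-prefixes='xhtml'>
          <xsl:output method='xml' omit-xml-declaration='yes'/>
          <xsl:template match='/'>
            <div><xsl:apply-templates select='xhtml:html/xhtml:body/node()'/></div>
          </xsl:template>
          <xsl:template match='xhtml:p'>
            <span><xsl:value-of select='.'/></span>
          </xsl:template>
          <xsl:template match='xhtml:body//text()'/>
        </xsl:stylesheet>";

        public static string Transform()
        {
            XslCompiledTransform transform = new XslCompiledTransform();
            using (XmlReader xsltReader = XmlReader.Create(new StringReader(Xslt)))
            {
                transform.Load(xsltReader);
            }
            using (XmlReader input = XmlReader.Create(new StringReader(Xhtml)))
            using (StringWriter result = new StringWriter())
            {
                transform.Transform(input, null, result);
                return result.ToString();
            }
        }

        static void Main()
        {
            // The result should contain a span with "kept",
            // but neither "loose text" nor any xmlns="" declaration.
            Console.WriteLine(Transform());
        }
    }

Removing the xmlns='http://www.w3.org/1999/xhtml' line from the stylesheet puts the div and span elements in no namespace, which is the situation that produces xmlns="" once such elements are nested inside copied xHTML elements.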
    

Step 3: Creating PDF file.

This is where we reach our final goal. All we need to do is perform the XSL transformation and create the PDF file. I’ll use the PD4ML HTML-to-PDF converting library, because it can be used from different programming languages, such as Java, PHP and Ruby. I’m going to use a MemoryStream, because I don’t want to save any intermediate data to the hard drive.

    protected void MakePDFButton_Click(object sender, EventArgs e)
    {
       //Doing XSL transformation
       string XSLTFile = Server.MapPath("XSLTFile.xslt");
       string XMLFile = Server.MapPath("HTMLoutput.xml");
       //Allowing DTD in our xHTML markup. ProhibitDtd is obsolete since
       //.NET 4.0; there, use settings.DtdProcessing = DtdProcessing.Parse
       XmlReaderSettings settings = new XmlReaderSettings();
       settings.ProhibitDtd = false;
       XmlReader reader = XmlReader.Create(XMLFile, settings);
       //Transforming the initial HTML markup and outputting it to a
       //MemoryStream instance for further PDF conversion
       XslCompiledTransform XSLTransform = new XslCompiledTransform();
       XSLTransform.Load(XSLTFile);
       MemoryStream memoryStream = new MemoryStream();
       XSLTransform.Transform(reader, null, memoryStream);
       //Flushing the stream and positioning the cursor at the beginning
       //of the data in the stream
       memoryStream.Flush();
       memoryStream.Position = 0;
       reader.Close();

       //Showing the transformed markup on the page
       StreamReader streamReader = new StreamReader(memoryStream);
       string output = streamReader.ReadToEnd();
       HTMLoutput.Text = Server.HtmlEncode(output);

       //Converting the result HTML page to PDF; the stream was read to
       //the end above, so it is rewound once more before rendering
       PD4ML PDFcreator = new PD4ML();
       PDFcreator.PageSize = PD4Constants.A4;
       PDFcreator.DocumentTitle = "The result PDF file";
       string path = Server.MapPath("Output.pdf");
       StreamWriter streamWriter = new StreamWriter(path);
       memoryStream.Position = 0;
       PDFcreator.render(memoryStream, streamWriter);
       //Closing all the streams (closing streamReader also closes memoryStream)
       streamReader.Close();
       streamWriter.Close();
    }
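
The rewinding step is easy to overlook, so here is a minimal console sketch of the transform-to-memory pattern on its own, without ASP.NET or PD4ML. The class name, the method name and the tiny stylesheet are hypothetical; only the XslCompiledTransform and stream calls come from the code above.

    using System;
    using System.IO;
    using System.Xml;
    using System.Xml.Xsl;

    static class RewindSketch
    {
        // Hypothetical one-template stylesheet, inlined to keep the sketch short
        const string Xslt =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>" +
            "<xsl:template match='/'><out><xsl:value-of select='/in'/></out></xsl:template>" +
            "</xsl:stylesheet>";

        public static string TransformInMemory(string xml)
        {
            XslCompiledTransform transform = new XslCompiledTransform();
            using (XmlReader xsltReader = XmlReader.Create(new StringReader(Xslt)))
            {
                transform.Load(xsltReader);
            }
            using (XmlReader input = XmlReader.Create(new StringReader(xml)))
            using (MemoryStream memory = new MemoryStream())
            {
                transform.Transform(input, null, memory);
                // The transform has moved the position past the written data;
                // without this line the reader below would find nothing to read.
                memory.Position = 0;
                using (StreamReader reader = new StreamReader(memory))
                {
                    return reader.ReadToEnd();
                }
            }
        }

        static void Main()
        {
            Console.WriteLine(TransformInMemory("<in>hello</in>"));
        }
    }

The same rule applies each time the stream is handed to a new consumer, which is why the button handler resets Position once before reading the markup and once more before rendering the PDF.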

Conclusion

That's it! Here is a short summary:

  • Override the “Render” method to obtain and manipulate the xHTML markup.
  • Use a custom XML namespace prefix to reach the non-prefixed xHTML nodes.
  • Declare xmlns="http://www.w3.org/1999/xhtml" as the default namespace of the XSLT file to get rid of the numerous xmlns="" declarations.
  • Use an empty <xsl:template match="xhtml:body//text()"/> if you need to get rid of plain text that you don't want in the output.

I hope that the combination of valid xHTML markup, which Visual Studio developers take for granted, and the few easy tips described above will give you countless possibilities for manipulating your document's data.
