Migrating from ColdFusion to Railo Part 7: PDFs

12 January 2015

Continuing my series of notes on the nitty gritty of migrating from ColdFusion 9 to Railo 4.2.


Given the association between Adobe and PDF I was expecting Railo's support to fall short. Which it does, although not quite as badly as I feared.

Setting metadata on PDF binary variables

Issue

Adobe ColdFusion 9 lets you add metadata, such as author, title etc, to a PDF in-memory binary variable created by cfdocument or FileReadBinary() without saving it to disk.


<cfscript>
pdfBinary = FileReadBinary( ExpandPath( "test.pdf" ) );
info = { author:"ACME Ltd" };
</cfscript>
<cfpdf action="setInfo" info="#info#" source="pdfBinary">

In Railo this produces an error:

"source is not based on a resource, destination file is required".

In other words it requires you to save the resulting PDF binary to disk even though you may not wish to do so (say you want to perform other operations before saving, or you plan to stream it directly).

Fix

This issue doesn't apply to PDF objects created using <cfpdf action="read">. So if you were creating the binary with FileReadBinary(), you can simply use pdf action="read" source="test.pdf" name="pdfBinary"; (i.e. the script version of the tag) instead.
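
As a rough sketch of that approach, assuming the same test.pdf as in the example above (pdfObject is just an illustrative variable name):

```cfml
// Read the file as a PDF object rather than as a raw binary
pdf action="read" source="test.pdf" name="pdfObject";
info = { author:"ACME Ltd" };
// With a PDF object as the source, setInfo should no longer demand a destination file
pdf action="setInfo" info=info source="pdfObject";
```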

If it's created using <cfdocument> the following will get around the problem:

  1. Specify a temporary destination file path when performing "setInfo".
  2. Read the temp file back into another variable.
  3. Delete the temp file.

Wrapped up as a function:


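binary function setInfoOnPdfBinary( required binary pdfBinary,required struct info ){
	// Railo insists on a destination, so write to a uniquely named file in the in-memory file system
	var tempPath	=	"ram://#CreateUuid()#";
	pdf action="setInfo" info=info source=pdfBinary destination=tempPath;
	try{
		// read the updated PDF back into a binary variable
		return FileReadBinary( tempPath );
	}
	finally{
		// always remove the temp file, even if the read fails
		FileDelete( tempPath );
	}
}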
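
A possible way of calling it, as a sketch only: the HTML content and the variable names pdfBinary and pdfWithInfo are placeholders of mine, and the final line assumes you want to stream the PDF to the browser rather than save it.

```cfml
// Generate a PDF in memory (placeholder content)
document format="PDF" name="pdfBinary"{
	WriteOutput( "<h1>Test</h1>" );
}
// Add the metadata via the helper function
pdfWithInfo = setInfoOnPdfBinary( pdfBinary,{ author:"ACME Ltd" } );
// Stream the result straight to the browser
content type="application/pdf" variable=pdfWithInfo;
```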

Bug tracker

Raised as issue 3283.

Update: raised against Lucee as issue #92.

Metadata is lost when writing PDF objects to disk

Issue

In Railo if you use pdf action="write" to write a PDF object to disk, metadata such as "author" will be lost.


//create a blank PDF
document format="PDF" filename="test.pdf" overwrite=true{};
//set the author
info={ author:"ACME Ltd" };
pdf action="setInfo" info=info source="test.pdf" destination="test.pdf" overwrite=true;
// check author was set
pdf action="getInfo" source="test.pdf" name="infoBeforeWrite";
//read pdf into a variable
pdf action="read" source="test.pdf" name="pdfObjectBeforeWrite";
// write back to file
pdf action="write" source=pdfObjectBeforeWrite destination="testWithInfo.pdf" overwrite=true;
// now check the author
pdf action="getInfo" source="testWithInfo.pdf" name="infoAfterWrite";
dump( var=infoBeforeWrite.author,label="Author before write" );
dump( var=infoAfterWrite.author,label="Author after write" );

[Screenshot of the resulting dump]

Fix

Instead of using pdf action="write", use a combination of pdf action="getInfo" and pdf action="setInfo" to write the object with the metadata preserved:


pdf action="read" source="test.pdf" name="pdfObject";
pdf action="getInfo" source="pdfObject" name="info";
// overwrite is needed here because the destination file already exists
pdf action="setInfo" source="pdfObject" info=info destination="test.pdf" overwrite=true;

Bug tracker

Raised as issue 3289.

Update: raised against Lucee as issue #93.

cfdocument layout/styles

Issue

ACF9's support for CSS within <cfdocument> is notoriously limited, and Railo's seems to be too. Unfortunately it's also different, so that PDF designs created in ACF won't necessarily look the same in Railo. Very simple layouts have been fine in my experience, but anything more complex has required work to make them render the same.

Fix

An approach worth considering is to ditch <cfdocument> altogether and use the open source WebKit HTML to PDF (wkhtmltopdf) command line tool instead. This uses the WebKit rendering engine, so its CSS support is more up to date. It can be easily invoked within Railo using <cfexecute>.
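
For illustration, a minimal sketch of such a call. The binary location, the file names and the 30 second timeout are all assumptions to adjust for your own server, and the paths are assumed to contain no spaces.

```cfml
<!--- Hypothetical input and output files in the current directory --->
<cfset htmlPath = ExpandPath( "report.html" )>
<cfset pdfPath = ExpandPath( "report.pdf" )>
<!--- wkhtmltopdf takes the input HTML followed by the output PDF path --->
<cfexecute name="/usr/local/bin/wkhtmltopdf"
	arguments="#htmlPath# #pdfPath#"
	timeout="30"
	variable="commandOutput" />
```

The timeout makes the request wait for the conversion to finish, so the PDF should exist by the time the next line of code runs.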

However for some reason I couldn't get certain fonts to render acceptably using wkhtmltopdf, so I persevered with Railo's <cfdocument> implementation and after much trial and error with the CSS was able to get acceptable results.

Bug tracker

If you are having trouble with <cfdocument> rendering, then Issue 3112 logged by Nando is worth reading along with the associated Railo Group discussion.

TIFFs and cfdocument

Issue

Unlike ACF9, Railo's <cfdocument> implementation won't render images in TIFF format. TIFF is generally only relevant when images are to be printed, but since that's often the case with PDFs it's a pity the format isn't supported.

Fix

Convert TIFFs to PNGs.

Bug tracker

Raised as issue 3301.

Update: raised against Lucee as issue #94.

Extracting text from PDFs

Issue

Despite being mentioned in the docs for cfpdf (under the "type" attribute), action="extractText" is not supported in Railo.

Fix

My first attempt to plug this gap involved Saman's solution using the PDFBox library.

Saman's post is a few years old, but I was able to get the latest version of the library, now under the Apache aegis, working using JavaLoader.

But looking through the Railo core lib folder, I noticed that a version of PDFBox is already loaded with the current version of Railo. Why it's there but not implemented I'm not sure, but it means that text extraction is available without needing to add or load any additional jar files.

Update 25 February 2015: It seems the version of PDFBox included in Railo/Lucee will throw NullPointerExceptions in some circumstances. I've therefore gone back to loading the latest version, which apparently fixes this issue.
Update 18 June 2015: Code for calling a newer version of PDFBox via JavaLoader is below.

pdfTextExtractor.cfc


/* The latest pre-built standalone PDFBox jar file and the javaloader package are assumed to be in the same folder as the following component */
component{

	function init( javaLoaderPath="javaloader.JavaLoader" ){
		if( !server.KeyExists( "_pdfBoxLoader" ) ){
			var paths	= [];
			paths.append( GetDirectoryFromPath( GetCurrentTemplatePath() ) & "pdfbox-app-1.8.8.jar" );
			server._pdfBoxLoader =	New "#javaLoaderPath#"( paths );
		}
		variables.reader =	server._pdfBoxLoader.create( "org.apache.pdfbox.pdmodel.PDDocument" );
		variables.stripper =	server._pdfBoxLoader.create( "org.apache.pdfbox.util.PDFTextStripper" );
		return this;
	}

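	string function extractText( required string pdfPath,numeric startPage=0,numeric endPage=0 ){
		// The stripper is shared between calls, so always set the range explicitly: a range left over from an earlier call may otherwise persist
		stripper.setStartPage( JavaCast( "int",Val( startPage ) ? startPage : 1 ) );
		stripper.setEndPage( JavaCast( "int",Val( endPage ) ? endPage : 2147483647 ) );// Integer.MAX_VALUE, i.e. no upper limit
		var pdf	=	reader.load( pdfPath );
		try{
			return stripper.getText( pdf );
		}
		finally{
			// close the loaded document itself, not the PDDocument class reference
			pdf.close();
		}
	}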

}

Implementing as a CFC means it can be stored in memory, reducing the overhead of creating instances on each use.


application.pdfTextExtractor = New pdfTextExtractor();
extractedText = application.pdfTextExtractor.extractText( ExpandPath( "test.pdf" ),1,3 );// First 3 pages only

Bug tracker

This was logged a disappointingly long time ago as issue 1559.
