Menu

As simple as possible, as complex as necessary

Generating large spreadsheets more efficiently using SXSSF

25 November 2019

In a recent blog post, Matthew Clemente noted that by choosing the newer XML format (.xlsx) over the default pre-Excel 2007 binary format (.xls) you can significantly reduce file size when generating spreadsheets in CFML.

Another benefit of using the XML format is that there is no limit on the number of rows you can add to a sheet. Binary sheets can hold a maximum of 65535 rows so you usually need to go for .xlsx when creating really big spreadsheets.

However, depending on the resources available to the JVM, building and saving large spreadsheets can be slow, or lead to request timeouts or out-of-memory errors due to the fact that the entire workbook has to be loaded into memory for write access.

SXSSF

To address this, POI—the underlying Java library which powers spreadsheet functionality in both Adobe ColdFusion (ACF) and Lucee—makes available an additional "Streaming XML" format, or SXSSF as it's known. This allows .xlsx files to be written to disk in a much more efficient way by processing and flushing the data in batches instead of holding the entire contents in memory.

ACF's built-in spreadsheet functionality doesn't yet support SXSSF and neither does the Lucee cfspreadsheet extension. But users of either CFML engine can take advantage of it using the Lucee Spreadsheet Library, which despite the name also supports ACF2016+.

Basic usage is as follows.

// instantiate the library assuming it's in the same directory
spreadsheet = New spreadsheetLibrary.Spreadsheet();
// create a streaming XML workbook
workbook = spreadsheet.newStreamingXlsx();
// populate
spreadsheet.addRows( workbook, hugeQuery );
// write to disk
spreadsheet.write( workbook, ExpandPath( "large.xlsx" ) );

The same can achieved more simply using the writeFileFromQuery() convenience method:

// instantiate the library assuming it's in the same directory
spreadsheet = New spreadsheetLibrary.Spreadsheet();
spreadsheet.writeFileFromQuery( data=hugeQuery, filepath=ExpandPath( "large.xlsx" ), streamingXml=true );

The default batch or "window" size is 100 rows at a time, but this can be configured by adding the streamingWindowSize argument to the appropriate method. For example:

workbook = spreadsheet.newStreamingXlsx( streamingWindowSize=10 );

or

spreadsheet.writeFileFromQuery(
 data=hugeQuery
 ,filepath=ExpandPath( "large.xlsx" )
 ,streamingXml=true
 ,streamingWindowSize=10
);

It may be worth experimenting with the window size to find the optimum performance setting for your environment/use case.

Performance comparision

Here is a test which will run on either Lucee or ACF to compare processing times between standard and streaming XML formats for a generated 80,000 row query with 3 simple text columns.

streamingXmlTest.cfm

<cfscript>
// creates a large query
query function createData( numberOfColumns=3, numberOfRows=80000 ){
	var columns = [];
	var columnTypes = [];
	var row = [];
	var data = [];
	//create the row to be repeated
	for( var i=1; i <= arguments.numberOfColumns; i++ ){
		columns.Append( "col#i#" );
		row.Append( "Column #i#" );
		columnTypes.Append( "VarChar" );
	}
	//add the required number of rows
	for( var i=1; i<= arguments.numberOfRows; i++ ){
		data.Append( row );
	}
	return QueryNew( columns.ToList(), columnTypes.ToList(), data );
}
// generates a .xlsx spreadsheet with or without SXSSF
numeric function createSpreadsheet( required boolean useStreamingXml ){
	var filepath = ExpandPath( "large.xlsx" );
	var spreadsheetLib = New luceeSpreadsheet.SpreadSheet();
	var data = createData();
	var startTime = GetTickCount();
	spreadsheetLib.writeFileFromQuery( data=data, filepath=filepath, overwrite=true, streamingXml=arguments.useStreamingXml );
	var endTime = GetTickCount();
	return ( endTime - startTime );
}
//run the test if this boolean URL parameter is present
if( url.KeyExists( "useStreamingXml" ) ){
	result = createSpreadsheet( url.useStreamingXml );
	WriteOutput( "<p>Streaming Xml? #YesNoFormat( url.useStreamingXml )#. Milliseconds: #result#</p>" );
}
</cfscript>
<a href="streamingXmlTest.cfm?useStreamingXml=false">Generate spreadsheet <b>without streaming</b></a>
|
<a href="streamingXmlTest.cfm?useStreamingXml=true">Generate spreadsheet <b>with streaming</b></a>

On a CommandBox Lucee 5.3 instance with up to 512MB of memory assigned to the JVM, I'm seeing the following typical processing times:

Without streaming: 28 seconds
With streaming: 8 seconds

The Lucee Spreadsheet Library is available from GitHub.

Comments

  • Formatting comments: See this list of formatting tags you can use in your comments.
  • Want to paste code? Enclose within <pre><code> tags for syntax higlighting and better formatting and if possible use script. If your code includes "self-closing" tags, such as <cfargument>, you must add an explicit closing tag, otherwise it is likely to be mangled by the Disqus parser.
Back to the top