Streaming support for Excel source - High memory usage

Description

The Excel source consumes a large amount of memory and fails for large pipelines.

Release Notes

None

Activity

Ankit Jain April 12, 2024 at 5:27 AM
Edited

Context from offline discussion:

To clarify, this is the API I believe we should be using for Excel: https://poi.apache.org/components/spreadsheet/how-to.html#sxssf

The Streaming API allows us to read and write to a spreadsheet by working with the document in a "sliding window" fashion, which reduces memory usage by quite a bit.

However, the streaming API (SXSSF) provided by Apache POI only supports writing data.

The main reason we can't use it for reading is how Apache POI's standard API works: it gives us random access to the workbook, which requires loading the complete file into memory.

The docs recommend that, to read data with low memory usage, we use the low-level API to read the underlying XML directly. This obviously drops some features, such as formula evaluation, and files from before Excel 2007 won't work because they use a binary format.
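
A rough illustration of what "reading the XML directly" means, under stated assumptions. It deliberately uses only JDK classes (java.util.zip and StAX) instead of POI's own low-level API, so it is a stand-in for the idea, not the POI approach itself. The class name SheetXmlStreamSketch, the RowHandler callback, and the part name passed in (typically something like xl/worksheets/sheet1.xml, but the real name comes from the workbook's relationships) are all hypothetical. The sketch does not resolve shared strings (cells with t="s" come through as index numbers), does not evaluate formulas, and skips empty cells.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class SheetXmlStreamSketch {

    /** Hypothetical callback: invoked once per row, so only one row is held at a time. */
    public interface RowHandler {
        void onRow(List<String> cells);
    }

    /** Streams one worksheet part out of the .xlsx ZIP container; returns the row count. */
    public static int streamRows(InputStream xlsx, String sheetEntry, RowHandler handler)
            throws IOException, XMLStreamException {
        try (ZipInputStream zip = new ZipInputStream(xlsx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals(sheetEntry)) {
                    return parseSheet(zip, handler);
                }
            }
        }
        throw new IOException("Sheet part not found: " + sheetEntry);
    }

    private static int parseSheet(InputStream sheetXml, RowHandler handler)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Untrusted input: turn off DTDs and external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader reader = factory.createXMLStreamReader(sheetXml);

        int rows = 0;
        List<String> current = null;  // cells of the row being read
        StringBuilder text = null;    // non-null while inside <v> or <t>
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if (name.equals("row")) {
                    current = new ArrayList<>();
                } else if (name.equals("v") || name.equals("t")) {
                    text = new StringBuilder();
                }
            } else if (event == XMLStreamConstants.CHARACTERS
                    || event == XMLStreamConstants.CDATA) {
                if (text != null) {
                    text.append(reader.getText());
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                String name = reader.getLocalName();
                if ((name.equals("v") || name.equals("t")) && text != null) {
                    if (current != null) {
                        current.add(text.toString());
                    }
                    text = null;
                } else if (name.equals("row") && current != null) {
                    handler.onRow(current);  // hand the row off, then drop it
                    current = null;
                    rows++;
                }
            }
        }
        reader.close();
        return rows;
    }
}
```

A caller would pass the file's InputStream and a handler that emits each row downstream. Memory use is bounded by the widest row, not by the size of the file.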

I found one XML parser online that is a wrapper for POI, but it is not maintained. A fork of that library is being maintained here: https://github.com/pjfanning/excel-streaming-reader

This library is licensed under the Apache 2.0 license, so yes, we can use it: https://github.com/pjfanning/excel-streaming-reader/blob/main/LICENSE
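
For reference, a minimal sketch of how reading through the fork might look. The builder calls follow the pattern shown in the project's README as best I recall, so they should be checked against whichever version we adopt. The class name StreamingExcelReadSketch, the method countRowsAndCells, and the cache and buffer sizes are illustrative assumptions, not a proposed implementation for the Excel source.

```java
import com.github.pjfanning.xlsx.StreamingReader;
import java.io.IOException;
import java.io.InputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;

public class StreamingExcelReadSketch {

    /** Iterates every sheet forward-only and returns {rowCount, cellCount}. */
    public static long[] countRowsAndCells(InputStream xlsx) throws IOException {
        long rows = 0;
        long cells = 0;
        try (Workbook workbook = StreamingReader.builder()
                .rowCacheSize(100)   // assumed value: rows kept in memory at once
                .bufferSize(4096)    // assumed value: read buffer size in bytes
                .open(xlsx)) {
            for (Sheet sheet : workbook) {
                for (Row row : sheet) {      // forward-only iteration, no random access
                    rows++;
                    for (Cell cell : row) {
                        cells++;
                    }
                }
            }
        }
        return new long[] {rows, cells};
    }
}
```

The returned Workbook implements the usual POI interfaces, but only sequential iteration is expected to work. Code that relies on random access, such as fetching a row by index, would need to be reviewed.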

go/thirdpartylicenses#notice

Unresolved

Created March 20, 2024 at 3:50 AM
Updated August 30, 2024 at 5:03 AM