Streaming support for excel source - High memory usage
Ankit Jain, April 12, 2024 at 5:27 AM (edited)
Context from offline discussion:
To clarify, this is the API I believe we should be using for Excel: https://poi.apache.org/components/spreadsheet/how-to.html#sxssf
The Streaming API allows us to read and write to a spreadsheet by working with the document in a "sliding window" fashion, which reduces memory usage by quite a bit.
The Streaming API provided by Apache POI (SXSSF) only works for writing data.
The main reason we can't use it for reading is how Apache POI works: it gives us random access to the workbook, which requires loading the complete file into memory.
The docs recommend that, if we want to read data with low memory usage, we use the low-level API to read the XML directly. This will obviously drop some features, such as formula evaluation, and files from before Excel 2007 won't work because they use a binary format rather than XML.
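To illustrate what "reading the XML directly" means, here is a rough sketch that uses only the JDK (java.util.zip plus StAX), not POI's own low-level API. The class name SheetXmlStreamer and the worksheet entry path passed in are assumptions for illustration. It hands one row at a time to a callback, so only the current row is held in memory. It returns raw cell text only: shared-string indexes are not resolved, formulas are not evaluated, and empty cells are not padded.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Apache POI.
class SheetXmlStreamer {

    /** Streams the rows of one worksheet entry (e.g. "xl/worksheets/sheet1.xml") to rowHandler. */
    public static void streamRows(InputStream xlsx, String entryName,
                                  Consumer<List<String>> rowHandler)
            throws IOException, XMLStreamException {
        // An .xlsx file is a ZIP archive; walk its entries until the sheet is found.
        try (ZipInputStream zip = new ZipInputStream(xlsx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals(entryName)) {
                    parseSheet(zip, rowHandler);
                    return;
                }
            }
        }
        throw new IOException("Worksheet entry not found: " + entryName);
    }

    private static void parseSheet(InputStream sheetXml, Consumer<List<String>> rowHandler)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_COALESCING, true);
        XMLStreamReader reader = factory.createXMLStreamReader(sheetXml);

        List<String> row = null;      // cells of the row currently being read
        StringBuilder value = null;   // text of the cell currently being read
        boolean inValue = false;      // true while inside <v> or inline-string <t>

        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if (name.equals("row")) {
                    row = new ArrayList<>();
                } else if (name.equals("c")) {
                    value = new StringBuilder();
                } else if (name.equals("v") || name.equals("t")) {
                    inValue = true;   // <f> (formula text) is deliberately ignored
                }
            } else if (event == XMLStreamConstants.CHARACTERS) {
                if (inValue && value != null) {
                    value.append(reader.getText());
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                String name = reader.getLocalName();
                if (name.equals("v") || name.equals("t")) {
                    inValue = false;
                } else if (name.equals("c") && row != null && value != null) {
                    row.add(value.toString());
                    value = null;
                } else if (name.equals("row") && row != null) {
                    rowHandler.accept(row);   // emit the row, then let it be collected
                    row = null;
                }
            }
        }
        reader.close();
    }
}
```

A caller would pass something like a FileInputStream for the workbook and a lambda that processes each row. A cell that contains a formula yields only its cached <v> value, which is the kind of feature loss mentioned above.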
@Pushpender Saini found one streaming XML parser online that wraps POI, but it is no longer maintained. A fork of that library is being maintained here: https://github.com/pjfanning/excel-streaming-reader
The fork is licensed under the Apache 2.0 license, so yes, we can use it: https://github.com/pjfanning/excel-streaming-reader/blob/main/LICENSE
Excel source consumes a lot of memory and fails for large pipelines!