After less than a day, CDAP cluster runs out of disk space

Description

I have an app with a flow that fails to start properly: a flowlet throws an exception when loading its datasets. The flow is never reported as failed, though. It stays in the running state and keeps allocating new containers to retry that flowlet. After 2 hours, the cluster runs out of space because the YARN appcache fills the data disk. This is a single-node cluster with about 200GB of data space. There have now been 1227 attempts to start the container, and each attempt has its own copy of all the jars.

    # df -h
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sda1       9.8G  3.9G  5.4G  42% /
    tmpfs           7.3G     0  7.3G   0% /dev/shm
    /dev/sdb        197G  187G   16M 100% /data
    # du -sh /data/
    187G    /data/
    # du -sh /data/yarn/local/usercache/cdap/appcache/
    183G    /data/yarn/local/usercache/cdap/appcache/
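To see which application is responsible for the space, the appcache can be broken down per application directory. The following is only a minimal sketch, not part of CDAP or YARN: the function names `dir_size_bytes` and `appcache_report` are made up for illustration, and it assumes the usual NodeManager layout in which each `application_*` directory under the appcache holds one `container_*` subdirectory per container attempt. It sums apparent file sizes, so the totals can differ somewhat from what `du` reports.

```python
import os
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def appcache_report(appcache: Path) -> list[tuple[str, int, int]]:
    """Return (application dir, container dir count, bytes), largest first.

    Assumes each application directory contains one "container_*"
    subdirectory per container attempt (hypothetical helper, for illustration).
    """
    rows = []
    for app_dir in appcache.iterdir():
        if not app_dir.is_dir():
            continue
        containers = [
            d for d in app_dir.iterdir()
            if d.is_dir() and d.name.startswith("container_")
        ]
        rows.append((app_dir.name, len(containers), dir_size_bytes(app_dir)))
    # Largest consumers first, so a runaway application shows up at the top.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows
```

Calling `appcache_report(Path("/data/yarn/local/usercache/cdap/appcache"))` on the affected node would list each application with its container directory count and total bytes. If the 183G were spread evenly over the 1227 attempts, that would come to roughly 150MB per attempt.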

This is not good: A bad app can quickly render a whole cluster unusable.

Release Notes

None

Activity


Sreevatsan Raman August 11, 2015 at 2:27 AM

The issue happened because we log a lot of warnings and errors. We need to verify whether this is still an issue.

Andreas Neumann August 5, 2015 at 1:54 AM

We need to validate whether this still happens. Fixing it probably requires a fix in Twill.

Fixed

Details

Assignee

Reporter

Affects versions

Components

Fix versions

Priority

Created June 19, 2015 at 4:32 AM
Updated December 9, 2020 at 8:59 PM
Resolved September 17, 2015 at 6:27 PM