Many methods in program contexts use service discovery to find the URL to use when making remote calls. Most have hardcoded timeouts. These should be changed to use the retry policy configured for the program. Some examples are:
DatasetContext.getDataset() (if the cache is not hit)
SecureStore.listSecureData()
SecureStore.getSecureData()
Transactional.execute()
Admin.datasetExists()
Admin.getDatasetType()
Admin.getDatasetProperties()
Admin.createDataset()
Admin.updateDataset()
Admin.dropDataset()
Admin.truncateDataset()
StreamWriter.write() (only in Workers)
StreamWriter.writeFile() (only in Workers)
StreamBatchWriter (only in Workers)
CDAP context methods will now be retried according to a program's retry policy.
Transaction-related work likely requires TEPHRA-165 in Tephra.
Discovery with a timeout (via EndpointStrategy) is OK. The part that needs to change is the client that makes the call. The overall logic should be something like this:
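A minimal sketch of that logic, for illustration only and not the actual CDAP change. RetryingRemoteCall, RetryPolicy, ServiceUnavailableException and callWithRetries are hypothetical stand-ins defined locally, and the fixed delay between attempts is an assumption. The idea is that discovery runs again on every attempt, and both "nothing discovered within the timeout" and "the call failed to connect" surface as the same retryable exception:

```java
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Supplier;

class RetryingRemoteCall {

  /** Hypothetical marker for failures that are safe to retry (not the CDAP class). */
  static class ServiceUnavailableException extends RuntimeException {
    ServiceUnavailableException(String message) {
      super(message);
    }
  }

  /** Hypothetical stand-in for a program's retry policy: fixed delay, bounded attempts. */
  static class RetryPolicy {
    final int maxAttempts;
    final long delayMillis;

    RetryPolicy(int maxAttempts, long delayMillis) {
      if (maxAttempts < 1 || delayMillis < 0) {
        throw new IllegalArgumentException("need maxAttempts >= 1 and delayMillis >= 0");
      }
      this.maxAttempts = maxAttempts;
      this.delayMillis = delayMillis;
    }
  }

  /**
   * Discovers an endpoint and makes the call, retrying per the policy.
   * 'discover' is assumed to apply its own timeout and return empty on a miss.
   * 'remoteCall' is assumed to throw ServiceUnavailableException when it cannot connect.
   */
  static <T> T callWithRetries(Supplier<Optional<String>> discover,
                               Function<String, T> remoteCall,
                               RetryPolicy policy) {
    ServiceUnavailableException lastFailure = null;
    for (int attempt = 1; attempt <= policy.maxAttempts; attempt++) {
      try {
        // Discover on every attempt so an endpoint that has died is not reused.
        String url = discover.get().orElseThrow(
            () -> new ServiceUnavailableException("no endpoint discovered within the timeout"));
        return remoteCall.apply(url);
      } catch (ServiceUnavailableException e) {
        lastFailure = e;
      }
      if (attempt < policy.maxAttempts) {
        try {
          Thread.sleep(policy.delayMillis);
        } catch (InterruptedException ie) {
          // Preserve the interrupt flag and stop retrying.
          Thread.currentThread().interrupt();
          break;
        }
      }
    }
    // Reached only after at least one failed attempt, so lastFailure is set.
    throw lastFailure;
  }
}
```

In this sketch any other exception thrown by the call propagates immediately, so only service unavailability is retried.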
PR for transaction service unavailability: https://github.com/caskdata/cdap/pull/7829
PR for when the client discovers the service, but the service has already died or dies soon after: https://github.com/caskdata/cdap/pull/7949