SQL retry mechanism does not catch errors in aux calls (setTransactionIsolation/commit/etc)
Description
Activity

Ankit Jain July 18, 2023 at 6:27 PM
I was able to confirm after testing end-to-end that pipelines do not fail due to a CloudSQL restart.

Vitalii Tymchyshyn July 11, 2023 at 4:31 PM
I am reopening this to ensure we test it end-to-end and validate with error reporting. Please confirm once you have tested end-to-end (run some pipelines on an instance, restart CloudSQL, run more), as we need to be 100% sure it's fixed.

Vitalii Tymchyshyn July 7, 2023 at 11:50 PM
I also think the current approach would not work well for the following scenario:
1. There is a burst of activity, resulting in N connections opened and added to the pool.
2. There is a SQL restart, resulting in N closed connections remaining in the pool.
3. A single transaction tries to run.
In this case, if N < 20 it would work, but would wait N*2 seconds (assuming defaults). If N >= 20 it will fail after 40 seconds of trying to clean the pool.
Currently the maximum for N is 800.
To resolve this, a good fix would be to limit the number of idle connections in the pool to just a few (say 5). This would cap the wait in this scenario at 10 seconds before all failed connections are recycled.
We should also limit the time a connection can stay idle to, say, 5 minutes. This would bound the scenario above and would reduce unnecessary load on PostgreSQL. In PostgreSQL every open connection is a process that takes a few megabytes of RAM even when idle. There is also a limit on the total number of open connections; see https://cloud.google.com/sql/docs/postgres/quotas#maximum_concurrent_connections for CloudSQL.
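The stack trace below shows the pool is Apache Commons DBCP2 (DelegatingConnection). A minimal sketch of the two limits proposed above, assuming a DBCP2 BasicDataSource is the pool in question; the setter names are standard DBCP2, but the concrete values (5 idle connections, 5-minute idle timeout) are only the suggestions from this comment, not current CDAP configuration:

```java
import java.time.Duration;
import org.apache.commons.dbcp2.BasicDataSource;

public class PoolConfigSketch {
    public static BasicDataSource create(String jdbcUrl) {
        BasicDataSource ds = new BasicDataSource();
        ds.setUrl(jdbcUrl);
        // Hard cap on total connections (the current max for N).
        ds.setMaxTotal(800);
        // Keep only a few idle connections, so at most ~5 stale
        // connections need to be recycled after a SQL restart.
        ds.setMaxIdle(5);
        // Evict connections that have sat idle for 5 minutes.
        ds.setMinEvictableIdleTimeMillis(Duration.ofMinutes(5).toMillis());
        // The idle-eviction thread must be enabled for the timeout to apply.
        ds.setTimeBetweenEvictionRunsMillis(Duration.ofSeconds(30).toMillis());
        return ds;
    }
}
```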

Vitalii Tymchyshyn July 7, 2023 at 11:03 PM
Overall, we should not rely on the SQLException being at a particular depth of the cause list; instead we should get all causes and search the list for a SQLException (e.g. with com.google.common.base.Throwables#getCausalChain).
Note that we should not use com.google.common.base.Throwables#getRootCause, as the root cause can be some socket I/O exception rather than the SQLException.
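A minimal sketch of this check in plain Java (walking the full cause chain, equivalent to what Guava's Throwables.getCausalChain returns); the class and method names here are illustrative, not actual CDAP code:

```java
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class SqlFailureDetector {
    // Collect the full causal chain, outermost exception first
    // (what Guava's Throwables.getCausalChain produces).
    static List<Throwable> causalChain(Throwable t) {
        List<Throwable> chain = new ArrayList<>();
        while (t != null && !chain.contains(t)) { // guard against cause cycles
            chain.add(t);
            t = t.getCause();
        }
        return chain;
    }

    // Retryable if ANY link in the chain is a SQLException -- not just the
    // direct cause, and not just the root cause (which may be a plain
    // socket IOException underneath the SQLException).
    static boolean isRetryableSqlFailure(Throwable t) {
        return causalChain(t).stream().anyMatch(c -> c instanceof SQLException);
    }
}
```

With this, a RuntimeException wrapping a SQLException that is itself caused by an IOException is still detected, which a getRootCause check would miss.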
In we added retries for SQL exceptions. Unfortunately, exceptions coming from direct calls are not wrapped in
SqlTransactionException
because the SQLException is not the cause of the thrown exception, it is the thrown exception itself, resulting in the following stack trace (6.8 release line):

java.lang.RuntimeException: org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
    at io.cdap.cdap.spi.data.transaction.TransactionRunners.propagateThrowable(TransactionRunners.java:223)
    at io.cdap.cdap.spi.data.transaction.TransactionRunners.propagate(TransactionRunners.java:210)
    at io.cdap.cdap.spi.data.transaction.TransactionRunners.run(TransactionRunners.java:144)
    at io.cdap.cdap.store.DefaultNamespaceStore.list(DefaultNamespaceStore.java:99)
    at io.cdap.cdap.internal.app.namespace.DefaultNamespaceAdmin.list(DefaultNamespaceAdmin.java:430)
    at io.cdap.cdap.gateway.handlers.NamespaceHttpHandler.getAllNamespaces(NamespaceHttpHandler.java:67)
    at sun.reflect.GeneratedMethodAccessor164.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at io.cdap.http.internal.HttpMethodInfo.invoke(HttpMethodInfo.java:87)
    at io.cdap.http.internal.HttpDispatcher.channelRead(HttpDispatcher.java:45)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.cdap.cdap.common.http.AuthenticationChannelHandler.channelRead(AuthenticationChannelHandler.java:115)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:38)
    at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:353)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.cdap.http.internal.NonStickyEventExecutorGroup$NonStickyOrderedEventExecutor.run(NonStickyEventExecutorGroup.java:254)
    at io.netty.util.concurrent.UnorderedThreadPoolEventExecutor$NonNotifyRunnable.run(UnorderedThreadPoolEventExecutor.java:277)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:335)
    at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:441)
    at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:365)
    at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:307)
    at org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:293)
    at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:270)
    at org.postgresql.jdbc.PgConnection.execSQLUpdate(PgConnection.java:440)
    at org.postgresql.jdbc.PgConnection.setTransactionIsolation(PgConnection.java:856)
    at org.apache.commons.dbcp2.DelegatingConnection.setTransactionIsolation(DelegatingConnection.java:564)
    at org.apache.commons.dbcp2.DelegatingConnection.setTransactionIsolation(DelegatingConnection.java:564)
    at io.cdap.cdap.spi.data.sql.SqlTransactionRunner.run(SqlTransactionRunner.java:68)
    at io.cdap.cdap.spi.data.sql.RetryingSqlTransactionRunner.run(RetryingSqlTransactionRunner.java:76)
    at io.cdap.cdap.spi.data.transaction.TransactionRunners.run(TransactionRunners.java:141)