Google CloudSQL PostgreSQL Batch Source
Reads from one or more CloudSQL PostgreSQL database tables using a configurable SQL query. Outputs one record for each row returned by the query. For example, you can create daily snapshots of a database table by using this source and writing to Amazon S3.
Configuration
Property | Macro Enabled? | Version Introduced | Description |
---|---|---|---|
Use Connection | No | 6.7.0/1.8.0 | Optional. Whether to use an existing connection. If you use a connection, connection related properties do not appear in the plugin properties. |
Connection | Yes | 6.7.0/1.8.0 | Optional. Name of the connection to use. Project and service account information will be provided by the connection. You can also use the macro function |
CloudSQL Instance Type | No | | Optional. Whether the CloudSQL instance to connect to is private or public. Default is Public. |
Connection Name | Yes (6.10.0) | | Required. The CloudSQL instance to connect to in the format <PROJECT_ID>:<REGION>:<INSTANCE_NAME>. Can be found in the instance overview page. |
Port | Yes | 6.9.0/1.10.5 | Optional. Port that PostgreSQL is running on. Default is 5432. |
JDBC Driver Name | No | | Required. Name of the JDBC driver to use. Default is cloudsql-postgresql. |
Username | Yes | | Optional. User identity for connecting to the specified database. |
Password | Yes | | Optional. Password to use to connect to the specified database. |
Connection Arguments | Yes | | Optional. A list of arbitrary string key/value pairs as connection arguments. These arguments will be passed to the JDBC driver as connection arguments for JDBC drivers that may need additional configurations. |
Reference Name | No | | Required. Name used to uniquely identify this source for lineage, annotating metadata, etc. |
Database | Yes (6.9.0/1.10.5) | | Required. CloudSQL PostgreSQL database name. |
Import Query | Yes | | Required. The SELECT query to use to import data from the specified table. You can specify an arbitrary number of columns to import, or import all columns using *. The Query should contain the ‘$CONDITIONS’ string. For example, ‘SELECT * FROM table WHERE $CONDITIONS’. The ‘$CONDITIONS’ string will be replaced by Split Column field limits specified by the bounding query. The ‘$CONDITIONS’ string is not required if Number of Splits is set to 1. |
Bounding Query | Yes | | Bounding Query should return the minimum and maximum of the values of the Split Column field. For example, ‘SELECT MIN(id),MAX(id) FROM table’. Not required if Number of Splits is set to 1. |
Split Column | Yes | | Field Name which will be used to generate splits. Not required if Number of Splits is set to 1. |
Number of Splits | Yes | | Number of splits to generate. |
Fetch Size | Yes | 6.6.0/1.7.0 | Optional. The number of rows to fetch at a time per split. Larger Fetch Size can result in faster import with the trade-off of higher memory usage. Default is 1000. |
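The way Import Query, Bounding Query, Split Column, and Number of Splits fit together can be sketched in shell. This is only an illustration, assuming an integer split column and evenly sized inclusive ranges; the plugin's actual split boundaries may differ. The function names `split_conditions` and `render_query`, the `users` table, and the bounds 1 and 100 are hypothetical.

```shell
# Illustration only: divides the [MIN, MAX] range returned by a bounding
# query into NUM_SPLITS contiguous ranges and substitutes each one for the
# $CONDITIONS token. This is NOT the plugin's implementation.

# split_conditions COLUMN MIN MAX NUM_SPLITS
# Prints one WHERE-clause fragment per split (runs in a subshell, so its
# variables do not leak into the caller).
split_conditions() (
  column=$1 min=$2 max=$3 splits=$4
  total=$(( max - min + 1 ))
  size=$(( (total + splits - 1) / splits ))   # ceiling division
  lo=$min
  while [ "$lo" -le "$max" ]; do
    hi=$(( lo + size - 1 ))
    if [ "$hi" -gt "$max" ]; then hi=$max; fi
    printf '%s\n' "${column} >= ${lo} AND ${column} <= ${hi}"
    lo=$(( hi + 1 ))
  done
)

# render_query QUERY CONDITION
# Replaces the first literal $CONDITIONS token in QUERY with CONDITION.
# A query without the token is printed unchanged.
render_query() (
  query=$1 condition=$2 token='$CONDITIONS'
  case $query in
    *"$token"*)
      printf '%s\n' "${query%%"$token"*}${condition}${query#*"$token"}" ;;
    *)
      printf '%s\n' "$query" ;;
  esac
)

# Assumed bounding query result: MIN(id) = 1, MAX(id) = 100, with 4 splits.
import_query='SELECT * FROM users WHERE $CONDITIONS'
split_conditions id 1 100 4 | while IFS= read -r cond; do
  render_query "$import_query" "$cond"
done
```

With these assumed bounds, the loop prints four queries, the first being `SELECT * FROM users WHERE id >= 1 AND id <= 25`. When Number of Splits is 1 there is nothing to divide, which is why the ‘$CONDITIONS’ string, Bounding Query, and Split Column can be omitted in that case.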
Data Type Mapping
All PostgreSQL-specific data types are mapped to string and can have multiple input formats and one ‘canonical’ output form. To determine the proper formats, see the PostgreSQL data types documentation.
PostgreSQL Data Type | CDAP Schema Data Type |
---|---|
bigint | long |
bigserial | long |
bit(n) | string |
bit varying(n) | string |
boolean | boolean |
bytea | bytes |
character | string |
character varying | string |
double precision | double |
integer | int |
numeric(precision, scale)/decimal(precision, scale) | decimal |
real | float |
smallint | int |
smallserial | int |
serial | int |
text | string |
date | date |
time [ (p) ] [ without time zone ] | time |
time [ (p) ] with time zone | string |
timestamp [ (p) ] [ without time zone ] | timestamp |
timestamp [ (p) ] with time zone | timestamp |
xml | string |
tsquery | string |
tsvector | string |
uuid | string |
box | string |
cidr | string |
circle | string |
inet | string |
interval | string |
json | string |
jsonb | string |
line | string |
lseg | string |
macaddr | string |
macaddr8 | string |
money | string |
path | string |
point | string |
polygon | string |
Examples
Connecting to a public CloudSQL PostgreSQL instance
You want to read data from a CloudSQL PostgreSQL database named “prod” as the “postgres” user with the “postgres” password. Get the latest version of the CloudSQL socket factory jar with driver and dependencies from here, and then configure the plugin with:
Property | Value |
---|---|
Reference Name | src1 |
Driver Name | cloudsql-postgresql |
Database | prod |
CloudSQL Instance Type | Public |
Connection Name | [PROJECT_ID]:[REGION]:[INSTANCE_NAME] |
Import Query | "select id, name, email, phone from users;" |
Number of Splits | 1 |
Username | postgres |
Password | postgres |
For example, if the ‘id’ column is a primary key of type int and the other columns are non-nullable varchars, output records will have this schema:
Field Name | Type |
---|---|
id | int |
name | string |
email | string |
phone | string |
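For a public instance, the Connection Name placeholder `[PROJECT_ID]:[REGION]:[INSTANCE_NAME]` must be replaced with real values before the pipeline runs. A minimal shell sketch of a sanity check follows. The function name `is_connection_name` is hypothetical, and the check is deliberately loose: it only confirms that there are at least three non-empty colon-separated parts and that no bracket placeholders or spaces remain. It does not verify that the instance exists.

```shell
# is_connection_name VALUE
# Returns 0 when VALUE looks like <PROJECT_ID>:<REGION>:<INSTANCE_NAME>,
# and 1 otherwise. Illustrative check only, not part of the plugin.
is_connection_name() {
  case $1 in
    *'['*|*']'*|*' '*) return 1 ;;   # unfilled placeholder or stray space
    ?*:?*:?*)          return 0 ;;   # at least three non-empty parts
    *)                 return 1 ;;
  esac
}

# Example usage with a made-up connection name.
if is_connection_name "my-project:us-central1:my-instance"; then
  echo "connection name format looks valid"
fi
```

The check does not apply to the private instance setup, where the Connection Name property holds the proxy IP address instead.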
Connecting to a private CloudSQL PostgreSQL instance
If you want to connect to a private CloudSQL PostgreSQL instance, create a Compute Engine VM that runs the CloudSQL Proxy docker image using the following command:
# Set the environment variables
export PROJECT_ID=[PROJECT_ID]
export REGION=[vm-region]
export ZONE=`gcloud compute zones list --filter="name=${REGION}" --limit 1 --uri --project=${PROJECT_ID} | sed 's/.*\///'`
export SUBNET=[vpc-subnet-name]
export INSTANCE_NAME=[gce-vm-name]
export POSTGRESQL_CONNECTION_NAME=[postgresql-instance-connection-name]
# Create a Compute Engine VM that starts the CloudSQL Proxy container on boot
gcloud beta compute --project=${PROJECT_ID} instances create ${INSTANCE_NAME} \
  --zone=${ZONE} --machine-type=g1-small --subnet=${SUBNET} --no-address \
  --metadata=startup-script="docker run -d -p 0.0.0.0:5432:5432 \
  gcr.io/cloudsql-docker/gce-proxy:1.16 /cloud_sql_proxy \
  -instances=${POSTGRESQL_CONNECTION_NAME}=tcp:0.0.0.0:5432" --maintenance-policy=MIGRATE \
  --scopes=https://www.googleapis.com/auth/cloud-platform \
  --image=cos-69-10895-385-0 --image-project=cos-cloud
Optionally, you can promote the internal IP address of the VM running the Proxy image to a static IP using:
# Get the VM internal IP
export IP=`gcloud compute instances describe ${INSTANCE_NAME} --zone ${ZONE} |
grep "networkIP" | awk '{print $2}'`
# Promote the VM internal IP to static IP
gcloud compute addresses create postgresql-proxy --addresses ${IP} \
  --region ${REGION} --subnet ${SUBNET}
# Note down the IP to be used in PostgreSQL JDBC
# connection string
echo Proxy IP: ${IP}
echo "JDBC Connection string:"
echo "jdbc:postgresql://${IP}:5432/{PostgreSQL_DB_NAME}"
Get the latest version of the CloudSQL socket factory jar with driver and dependencies from here, and then configure the plugin with:
Property | Value |
---|---|
Reference Name | src1 |
Driver Name | cloudsql-postgresql |
Database | prod |
CloudSQL Instance Type | Private |
Connection Name | The proxy IP returned from the |
Import Query | "select id, name, email, phone from users;" |
Number of Splits | 1 |
Username | postgres |
Password | postgres |
Created in 2020 by Google Inc.