Amazon Redshift
Amazon Redshift is a data warehouse that lets users analyze data with standard SQL and existing business intelligence tools. It can run complex analytical queries over multiple petabytes of structured data.
Type Name
redshift
Connection Properties
Template name: redshift
Appropriate translator name: redshift
Properties:
- host (default: localhost)
- port (default: 5439)
- db
- user-name
- password (default: empty)
- driver (default: redshift)
- driver-class (default: com.amazon.redshift.jdbc.Driver)
- ssl (TRUE|FALSE; default: FALSE)
- jdbc-properties (arbitrary extra properties)
- new-connection-sql
- check-valid-connection-sql (default: select 0)
- min-pool-size (default: 2)
- max-pool-size (default: 70)
- cloudAgent (default: FALSE)
Here is an example:
CALL SYSADMIN.createConnection(name => 'redshift', jbossCLITemplateName => 'redshift', connectionOrResourceAdapterProperties => 'host=<host>,port=5439,db=<database>,user-name=<user_name>,password=<password>') ;;
CALL SYSADMIN.createDataSource(name => 'redshift', translator => 'redshift', modelProperties => 'importer.schemaPattern=test_nk,importer.useFullSchemaName=FALSE,importer.tableTypes="TABLE,VIEW"', translatorProperties => 'supportsNativeQueries=true') ;;
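If the cluster only accepts encrypted connections, the ssl property from the list above can be appended to the same call. This is a minimal sketch, not a verified configuration: the connection name redshift_ssl is hypothetical, and the placeholders have to be replaced with real values.

```sql
-- Hypothetical connection name; ssl=TRUE overrides the default (FALSE)
CALL SYSADMIN.createConnection(name => 'redshift_ssl', jbossCLITemplateName => 'redshift', connectionOrResourceAdapterProperties => 'host=<host>,port=5439,db=<database>,user-name=<user_name>,password=<password>,ssl=TRUE') ;;
```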
Translator Properties
Translator Properties Shared by All JDBC Connectors
(Properties listed in alphabetical order)
Name | Description | Default value |
---|---|---|
comparisonStringConversion | Sets a template to convert | |
| Database time zone, used when fetching date, time, or timestamp values | System default time zone |
| Specific database version, used to fine-tune pushdown support | Automatically detected by the server through the data source JDBC driver, if possible |
forbidNonMatchingNames | Only considered when importer.tableNamePattern is set. When set to FALSE, allows creation of tables with names that do not match the tableNamePattern. Tables with non-matching names will only be accessible until the server restarts | TRUE |
ForceQuotedIdentifiers | If | TRUE |
| Maximum size of prepared insert batch | |
OrderByStringConversion | Sets a template to convert | |
supportsConvertFromClobToString | If TRUE, indicates that the translator supports the CONVERT/CAST function from clob to string | |
| Forces a translator to issue a | |
supportsOrderByAlias | If | |
supportsOrderByString | If | |
TrimStrings | If | |
| If | |
| Embeds a /* comment */ leading comment with session/request id in the source SQL query for informational purposes | |
The names of the translator properties are case-sensitive.
Translator Properties Specific for Amazon Redshift
Name | Description | Default value |
---|---|---|
replaceNullCharsWith | String property. If set, the translator replaces all null characters in strings before executing INSERT / UPDATE with the string specified as the property value. You may specify an empty string as well | Single space |
uploadMode | Values: The legacy value | |
| Value: Uploads data as files on Amazon S3 storage. Translator properties | |
maxChunkSize | Sets the size limit of a temp file in bytes | 16000 |
numberOfThreads | Specifies the maximum number of uploader threads | 10 |
tempFolder | Value: a path to a folder (relative or absolute). Specifies the folder to be used for creating temporary files instead of the system-configured one. If not specified, the default OS tmp folder is used | |
bucketName | Value: a bucket name. Only for the | |
bucketPrefix | Value: a bucket prefix. Only for the | |
createBucket | Value: boolean. Specifies if the bucket set by the | FALSE |
region | Value: a region. Only for the | |
keyId, secretKey | Only for the | |
iamRole, awsAccountId | Only for the | |
EncryptDataOnS3 | Only for the | FALSE |
useDoubleSlashToEscapeRegex | Used to change the default escaping behaviour in | TRUE |
uploadZipped | Enables compression of temporary files before uploading them to S3. May be disabled in environments with high network bandwidth to save CPU time, at the cost of increased disk usage | FALSE |
truncateStrings | If | FALSE |
varcharReserveAdditionalSpacePercent | Redshift and Vertica measure varchar size in bytes rather than characters and store strings in UTF-8 encoding, so a character may be 1 to 4 bytes long. You can specify the percentage by which the byte size will exceed the original character size. The special value 65535 makes every varchar 65535 bytes long | 0 |
acceptInvChars | Value: any ASCII character except Only for the This property enables the loading of data into VARCHAR columns even if the data contains invalid UTF-8 characters. If it's specified, If acceptInvChars is not specified, an error will be thrown whenever an invalid UTF-8 character is encountered | |
copyParams | Arbitrary parameters to be passed to the COPY command when uploading data from S3 | NULL |
keepTempFiles | Keep temporary files after uploading | FALSE - can be enabled for debugging |
maxTableNameLength | Maximum length of a table name | 127 |
maxColumnNameLength | Maximum length of a column name. Five characters of the defined maximum length are reserved for internal purposes and cannot be used for the column identifier | 127 |
Here is an example:
CALL SYSADMIN.createConnection(name => 'redshift', jbossCLITemplateName => 'redshift', connectionOrResourceAdapterProperties => 'host=<host>,port=5439,db=<database>,user-name=<user_name>,password=<password>') ;;
CALL SYSADMIN.createDataSource(name => 'redshift', translator => 'redshift', modelProperties => 'importer.schemaPattern=test_nk,importer.useFullSchemaName=FALSE,importer.tableTypes="TABLE,VIEW"', translatorProperties => 'varcharReserveAdditionalSpacePercent=300,supportsNativeQueries=true,uploadMode=s3Load,region=<region>,bucketName=<bucket_name>,createBucket=true,keyId=<key_ID>,secretKey="<secret_key>"') ;;
Translator Properties for Amazon Redshift as Analytical Storage
If Amazon Redshift is used as analytical storage, we recommend loading data via Amazon S3 (S3LOAD), as inserting data into Redshift over the standard JDBC protocol can be very slow.
The following translator properties are required to configure S3LOAD:
Parameter | Description |
---|---|
uploadMode=s3load | Explicitly specifies S3LOAD mode |
region | AWS S3 region endpoint |
bucketName | Bucket name to upload data files to; optional |
bucketPrefix | Prefix of the temporary bucket to upload data files to if bucketName is not specified; must comply with Amazon S3 bucket naming convention (nb: 36 characters would be added to the bucket prefix when creating a temporary bucket); optional |
createBucket | Specifies if the bucket set in the bucketName parameter should be created if it does not exist; optional; default: FALSE |
keyId | AWS S3 key ID |
secretKey | AWS S3 secret key |
Here is an example:
CALL SYSADMIN.createConnection(name => 'redshift', jbossCLITemplateName => 'redshift', connectionOrResourceAdapterProperties => 'host=<host>,port=5439,db=<database>,user-name=<user_name>,password=<password>') ;;
CALL SYSADMIN.createDataSource(name => 'redshift', translator => 'redshift', modelProperties => 'importer.schemaPattern=test_nk,importer.useFullSchemaName=FALSE,importer.tableTypes="TABLE,VIEW"', translatorProperties => 'varcharReserveAdditionalSpacePercent=300,supportsNativeQueries=true,uploadMode=s3Load,region=<region>,bucketName=<bucket_name>,createBucket=true,keyId=<key_ID>,secretKey="<secret_key>"') ;;
Data Source Properties
Data Source Properties Shared by All JDBC Connectors
(Properties listed in alphabetical order)
Name | Description | Default |
---|---|---|
importer.autoCorrectColumnNames | Replaces . in a column name with _ as the period character is not supported by the CData Virtuality Server in column names | TRUE |
importer.catalog | Database catalogs to use. Can be used if the Only for Microsoft SQL Server and Snowflake: | Exasol: All others: empty |
importer.defaultSchema | Please note that writing into a data source is only possible if this parameter is set | Empty |
importer.enableMetadataCache | Turns on metadata cache for a single data source even when the global option is turned off. Together with importer.skipMetadataLoadOnStartup=true, it allows using materialized views after server restart when the original source is unavailable | FALSE |
importer.excludeProcedures | Case-insensitive regular expression that will exclude a matching fully qualified procedure name from import | Empty |
importer.excludeSchemas | Comma-separated list of schemas (no % or ? wildcards allowed) to exclude listed schemas from import. A schema specified in defaultSchema or schemaPattern will be imported despite being listed in excludeSchemas. Helps to speed up metadata loading | Oracle: All others: empty |
| Case-insensitive regular expression that will exclude a matching fully qualified table name from import. Does not speed up metadata loading. Here are some examples: 1. Excluding all tables in the (source) schemas 2. Excluding all tables except the ones starting with "public.br" and "public.mk" using a negative lookahead: 3. Excluding "tablename11" from the list ["tablename1", "tablename11", "company", "companies"]: | Empty |
| Fetch size assigned to a resultset on loading metadata | No default value |
| If set to | |
importer.importIndexes | If set to TRUE, imports index/unique key/cardinality information | FALSE |
importer.importKeys | If set to TRUE, imports primary and foreign keys | FALSE |
importer.importProcedures | If set to Please note that it is currently not possible to import procedures which use the same name for more than one parameter (e.g. same name for | |
importer.loadColumnsTableByTable | Set to TRUE to force table by table metadata processing | FALSE / TRUE only for Netsuite and SAP Advantage Database Server |
importer.loadMetadataWithJdbc | If set to TRUE, turns off all custom metadata load ways | FALSE |
importer.loadSourceSystemFunctions | If set to TRUE, data source-specific functions are loaded. Supported for Microsoft SQL Server and Azure | FALSE |
importer.procedureNamePattern | Procedure(s) to import. If omitted, all procedures will be imported. % as a wildcard is allowed: for example, importer.procedureNamePattern=foo% will import foo, foobar, etc. Works only in combination with importProcedures | Empty |
| If set to | |
importer.renameDuplicateColumns | If set to TRUE, renames duplicate columns caused by either mixed case collisions or autoCorrectColumnNames replacing . with _. The suffix _n where n is an integer will be added to make the name unique | TRUE |
importer.renameDuplicateTables | If set to TRUE, renames duplicate tables caused by mixed case collisions. The suffix _n where n is an integer will be added to make the name unique | TRUE |
importer.replaceSpecSymbsInColNames | If set to TRUE, replaces all special symbols (any symbols not in the ^A-Za-z0-9_ sequence) with the _ symbol in column names of tables | FALSE / TRUE only for BigQuery |
importer.schemaPattern | Schema(s) to import. If omitted or set to "", all schemas will be imported. % as a wildcard is allowed: for example, importer.schemaPattern=foo% will import foo, foobar, etc. To specify several schema names and/or patterns, values should be comma-separated and enclosed within double quotes: importer.schemaPattern="schema1,schema2,pattern1%,pattern2%". For proper escaping of special characters depending on the type of data source, check Escaping special characters in schema names or use wildcards instead: "[schema_name]" can be rewritten as "%schema%name%". Helps to speed up metadata loading | Empty |
importer.skipMetadataLoadOnStartup | If set to | FALSE |
importer.tableNamePattern | Table(s) to import. If omitted, all tables will be imported. % as a wildcard is allowed: for example, importer.tableNamePattern=foo% will import foo, foobar, etc | Empty |
importer.tableTypes | Comma-separated list (without spaces) of table types to import. Available types depend on the DBMS. Usual format: Other typical types are | Empty |
importer.useCatalogName | If set to TRUE, uses any non-null/non-empty catalogue name as part of the name in source, e.g. "catalogue"."table"."column", and in the CData Virtuality Server runtime name if useFullSchemaName is TRUE. If set to FALSE, will not use the catalogue name in either the name in source or the CData Virtuality Server runtime name. Should be set to FALSE for sources that do not fully support a catalogue concept, but return a non-null catalogue name in their metadata - such as HSQL | TRUE / FALSE only for Hive and EXASOL |
importer.useFullSchemaName | If set to Please note that this may lead to objects with duplicate names when importing from multiple schemas, which results in an exception | TRUE |
importer.useProcedureSpecificName | If set to TRUE, allows the import of overloaded procedures (which will normally result in a duplicate procedure error) by using the unique procedure specific name in the CData Virtuality Server. This option will only work with JDBC 4.0 compatible drivers that report specific names | FALSE |
importer.widenUnsignedTypes | If set to TRUE, converts unsigned types to the next widest type. For example, SQL Server reports tinyint as an unsigned type. With this option enabled, tinyint would be imported as a short instead of a byte | TRUE |
Escaping wildcards in importer.catalog available since v4.0.8
Default values importer.catalog='EXA_DB' and importer.useCatalogName=FALSE available since v4.4
importer.loadSourceSystemFunctions is available since v4.6
importer.importProcedures set to TRUE by default for CData connector since v4.7
Distribution and Sort Keys
Redshift does not support indexes, but it does support sort and distribution keys, which can be used to improve query performance. Unlike indexes, these keys must be defined when the table is created.
Sort Keys
SORTKEYs are created by analyzing the currently recommended indexes collected for each optimization. They can be specified at both the column and table levels: a single SORTKEY column at the column level, or multiple columns at the table level.
Since only one SORTKEY (with one or more columns) can be specified at the table level, the system creates a SORTKEY corresponding to the recommended index (of kind SINGLE or MULTIPLE) with the highest frequency. The resulting SORTKEY has one column if the highest-frequency index is SINGLE and multiple columns if it is MULTIPLE.
Columns that are normally recommended for index creation are used to define sort and distribution keys.
Distribution Keys
DISTKEYs are not automatically recommended by the system and need to be created manually by the user. Here are two things to keep in mind:
- It is not possible to specify more than one DISTKEY for each recommended optimization;
- IndexType of a DISTKEY must be set to MANUAL (which is the default, so you can skip this step).
Distribution Style
The data distribution style is defined for the whole table. Amazon Redshift distributes the rows of a table to the compute nodes according to the distribution style specified for the table. The distribution style that you select for tables affects the overall performance of your database.
Style | Description |
---|---|
EVEN | The data in the table is spread evenly across the nodes in a cluster in a round-robin distribution. Row IDs are used to determine the distribution, and roughly the same number of rows are distributed to each node. This is the default distribution method |
KEY | The data is distributed by the values in the DISTKEY column. When you set the joining columns of joining tables as distribution keys, the joining rows from both tables are collocated on the compute nodes. When data is collocated, the optimizer can perform joins more efficiently. If you specify DISTSTYLE KEY , you must name a DISTKEY column |
ALL | A copy of the entire table is distributed to every node. This distribution style ensures that all the rows required for any join are available on every node, but it multiplies storage requirements and increases the load and maintenance times for the table. ALL distribution can improve execution time when used with certain dimension tables where KEY distribution is not appropriate, but performance improvements must be weighed against maintenance costs |
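For orientation, the three styles can be written in native Redshift DDL roughly as follows. This is a sketch to be run directly on Redshift; the table and column names are hypothetical, and it is not the statement the CData Virtuality Server generates.

```sql
-- EVEN: rows are spread round-robin across the nodes
CREATE TABLE events_even (id integer, payload varchar(255)) DISTSTYLE EVEN;

-- KEY: rows are distributed by customer_id; naming a DISTKEY column is mandatory
CREATE TABLE orders_key (order_id integer, customer_id integer) DISTSTYLE KEY DISTKEY (customer_id) SORTKEY (order_id);

-- ALL: a full copy of the table is stored on every node
CREATE TABLE country_all (code char(2), name varchar(100)) DISTSTYLE ALL;
```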
Internal SORTKEY and DISTKEY created for a table in Redshift can be checked with a query like this (to be executed directly on Redshift):
SELECT tablename, "column", type, encoding, distkey, sortkey FROM pg_table_def WHERE tablename LIKE 'mat_table%';
Please note that the schema containing the table has to be in the search path.
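Because pg_table_def only lists tables from schemas on the search path, the path usually has to be extended first. A sketch, assuming the tables live in a schema called dwh (a hypothetical name); run it directly on Redshift in the same session as the check query:

```sql
-- Show the current search path
SHOW search_path;

-- Add the (hypothetical) schema dwh to the search path for this session
SET search_path TO '$user', public, dwh;

-- Tables in dwh are now listed in pg_table_def
SELECT tablename, "column", type, encoding, distkey, sortkey FROM pg_table_def WHERE schemaname = 'dwh' AND tablename LIKE 'mat_table%';
```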
Usage Examples
SORTKEY, DISTKEY, and DISTSTYLE are passed as OPTIONS. They are added to the CREATE TABLE or SELECT INTO command as shown below:
CREATE TABLE source.table_name (id integer, name varchar(255))
OPTIONS (DISTKEY 'id', SORTKEY 'id,name') ;;
SELECT * INTO target.table_name FROM source.table_name
OPTIONS (SORTKEY 'id', DISTSTYLE 'EVEN') ;;
Redshift Spectrum
The Redshift JDBC driver exposes Spectrum tables as EXTERNAL TABLE. In order to have your Redshift data source list Spectrum tables, adjust the data source parameter importer.tableTypes accordingly, e.g. by specifying importer.tableTypes="TABLE,VIEW,EXTERNAL TABLE".
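Combined with the data source definition used in the earlier examples, this might look as follows. It is only a sketch: the schema name spectrum_schema is hypothetical and stands for the external schema that holds the Spectrum tables.

```sql
-- Hypothetical schema name; EXTERNAL TABLE is added to the imported table types
CALL SYSADMIN.createDataSource(name => 'redshift', translator => 'redshift', modelProperties => 'importer.schemaPattern=spectrum_schema,importer.useFullSchemaName=FALSE,importer.tableTypes="TABLE,VIEW,EXTERNAL TABLE"', translatorProperties => 'supportsNativeQueries=true') ;;
```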
See Also
Using S3 to Ingest Data into Redshift instead of using SQL INSERT statements
Show Running Queries on Redshift via Native Interface for the 200 most recent queries