Built-in Advanced Replication Jobs
Pipes supports several types of built-in replication jobs:
- Full/CopyOver
- Incremental
- Batch Update (Pipes Pro feature)
- History Update (Pipes Pro feature)
- Upsert Update (Pipes Pro feature)
CopyOver
This job retrieves the current data from a given source and places the data into the local table.
If a table with the same name already exists from previous runs, it will be handled according to the selected cleanup method:
- drop: the table will be dropped and recreated;
- delete: the table will be emptied by deleting all rows;
- truncate: the table will be truncated (all rows removed at once).
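The effect of one CopyOver run can be pictured in plain SQL. The sketch below is illustrative only (the job itself is configured in Pipes, not written by hand), and the source and target table names are made up:

```sql
-- Illustrative only: hypothetical source "crm.customers" and target "dwh.customers".

-- cleanup method "drop": the target is dropped and recreated from the source
DROP TABLE dwh.customers;
CREATE TABLE dwh.customers AS SELECT * FROM crm.customers;

-- cleanup method "delete": the table definition is kept, rows are removed one by one
DELETE FROM dwh.customers;
INSERT INTO dwh.customers SELECT * FROM crm.customers;

-- cleanup method "truncate": the table definition is kept, all rows are removed at once
TRUNCATE TABLE dwh.customers;
INSERT INTO dwh.customers SELECT * FROM crm.customers;
```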
Incremental
With this type of job, only new rows are retrieved from the source and added to the target. The tricky part of incremental materialization is that it requires a column with a reliable timestamp-like value that makes it possible to determine which rows are new and which are not. This can be, for example, a Modified column recording exactly when a row was changed (newly inserted or updated), or a column providing an auto-incremented ID.
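Conceptually, an incremental run boils down to a filtered insert. The sketch below assumes a hypothetical source table crm.orders with a Modified timestamp column and a target table dwh.orders that already holds the previously replicated rows; it is not the actual job syntax:

```sql
-- Append only the rows that changed since the newest row already replicated.
INSERT INTO dwh.orders
SELECT *
FROM crm.orders
WHERE Modified > (SELECT MAX(Modified) FROM dwh.orders);
```

An auto-incremented ID column works the same way: the timestamp comparison is simply replaced by a comparison against the maximum replicated ID.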
Batch Update
This is a Pipes Pro feature
Batch replication is easier to implement than incremental replication (though less flexible) and more powerful than complete replication. It also differs from both in these ways:
- Old rows can be kept even if deleted from the source;
- Updates can be performed on current rows, and inserted rows will be added;
- No timestamp column is required to get new rows because all rows will be retrieved from the source;
- You can decide whether or not to allow duplicates in case of updates on rows;
- You can also decide how to name the analytical storage table, and there will be no different stages to distinguish, just one table containing the data.
The main concept of batch replication is one fixed table residing in the analytical storage, updated whenever the replication job runs. Unlike complete replication, the user provides the name for this table, and there are no stages. This offers transparency, as there is always exactly one version of the table, but it also carries a risk: if the table is accidentally lost (dropped, for example), its data is gone. The table is created and filled with data on the first run.
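The sketch below is a rough illustration of what one batch run does to the fixed analytical storage table, assuming a hypothetical key column id and value column amount; rows that have disappeared from the source are simply left in place:

```sql
-- Update rows that already exist in the target table.
UPDATE dwh.orders
SET amount = (SELECT s.amount FROM crm.orders s WHERE s.id = dwh.orders.id)
WHERE EXISTS (SELECT 1 FROM crm.orders s WHERE s.id = dwh.orders.id);

-- Insert rows that are new in the source; nothing is ever deleted from the target.
INSERT INTO dwh.orders (id, amount)
SELECT s.id, s.amount
FROM crm.orders s
WHERE NOT EXISTS (SELECT 1 FROM dwh.orders t WHERE t.id = s.id);
```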
History Update
This is a Pipes Pro feature
In some cases, we may need to keep track of changes made to data. Complete, incremental, and batch replication always yield a table containing all currently available data; incremental and batch updates may allow keeping 'old' rows that are no longer present in the data source. But in general, changes to the data cannot be tracked, and the user is always forced to work with the most current information.
If you need to know when and how the data has been changed, history update is the right replication type to use. There are several ways to implement this type of replication, and on this page, we describe the type used in the Data Virtuality Server.
It is based on performing a kind of versioning on the data records by adding two columns to the analytical storage table where the locally optimized data is stored. These columns, called fromtimestamp and totimestamp, provide information on the period during which the given record was the valid version. To use this feature, you need to do the following:
- Select the columns of the source object (either a table or a view) which will be considered the key columns. Please remember that the fields should be chosen so that duplicates (entire rows with the same values in all fields) cannot occur.
- Select columns to be tracked for updates when the replication job runs.
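As an illustration of the versioning idea, the sketch below shows what happens when a tracked column changes for one record. The table and column names (customers_history, customer_id, status) are made up, and the convention of marking the open version with a NULL totimestamp is an assumption; the server may use a different marker:

```sql
-- 1. Close the currently valid version of the changed record.
UPDATE dwh.customers_history
SET totimestamp = CURRENT_TIMESTAMP
WHERE customer_id = 42
  AND totimestamp IS NULL;   -- assumed marker for the open/current version

-- 2. Insert the new version, valid from now on.
INSERT INTO dwh.customers_history (customer_id, status, fromtimestamp, totimestamp)
VALUES (42, 'inactive', CURRENT_TIMESTAMP, NULL);
```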
Upsert Update
This is a Pipes Pro feature
In this case, the job retrieves the current data from a given source and places the data into the local table. The decision whether an INSERT or an UPDATE will be used is based on a set of key columns (keyColumnsArray) that has to be provided. For each of the source rows, the following operation is performed:
- If there is no row with the same values in the key columns, an INSERT will be done, and the new row will be added to the analytical storage table;
- Otherwise, the existing row in the analytical storage table will be updated based on the updateColumns and invertUpdateColumns parameters in the following way:
  - invertUpdateColumns=FALSE: all the columns from updateColumns will be updated;
  - invertUpdateColumns=TRUE: all columns except the ones in updateColumns will be updated.
Please note that columns from keyColumnsArray cannot be enumerated in columnsToCheckArray and will never be updated, because this would change the row to represent a completely new entity.
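In generic SQL terms, an upsert run behaves like a merge driven by the key columns. The sketch below is a simplified illustration, assuming a hypothetical key column id (keyColumnsArray), value columns status and amount, updateColumns containing only status, and invertUpdateColumns=FALSE; it is not the actual job configuration syntax:

```sql
MERGE INTO dwh.orders t
USING crm.orders s
  ON (t.id = s.id)                      -- match on the key columns
WHEN MATCHED THEN
  UPDATE SET t.status = s.status        -- only the configured update columns
WHEN NOT MATCHED THEN
  INSERT (id, status, amount)
  VALUES (s.id, s.status, s.amount);
```

With invertUpdateColumns=TRUE, the UPDATE branch would instead set every non-key column except those listed in updateColumns.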
See Also
- Replication Jobs Management for the relevant system procedures
- CopyOver Replication for a post in the Data Virtuality blog on CopyOver replication
- Incremental and Upsert Replication for a post in the Data Virtuality blog explaining incremental replication and its use cases
- Batch Replication for a blog post on batch replication and its use cases
- History Replication for a post in the Data Virtuality blog on history replication