Creating ETL process can pose so many challenges we will only fully discover them when really building the ETL.
Before you complain about modelling, the example here has no relation to data warehouse modelling. The ETL in question was a migration from a data source to another. The need for executing multiple times was mainly for a testing purpose.
Let’s get back a bit and understand what I’m talking about.
Imagine you are building an ETL. Importing a lot of data to a set of tables and implementing many rules inside the ETL. Your ETL will be executed many times, but the tables need to be always empty when the process starts. As a result your ETL needs a cleaning task, in the beginning, clearing all the tables.
Some tables have an identity field. Since you are clearing the tables and ensuring everything will start again from the same point, you need to reset the identity as well.
Probably your first thought is to use Truncate Table statement. The problem is Truncate Table is not supported in tables referenced by a foreign key.
Check the example below: The Truncate Table is blocked by the foreign key.
(
id INT IDENTITY(1, 1) PRIMARY KEY,
title VARCHAR(50)
)
go
CREATE TABLE students
(
id INT IDENTITY(1, 1) PRIMARY KEY,
NAME VARCHAR(50),
classid INT,
FOREIGN KEY (classid) REFERENCES class(id)
)
go
TRUNCATE TABLE class
You need to use Delete or drop and re-create the constraints. Choosing the simplest solution, let’s talk about using the Delete.
Let’s consider the scenario where your ETL will need to run many times and need to start with a cleaning task. This scenario, sometimes, is used for staging servers in the middle of a bigger ETL process. I used for data migration and had to make the execution many times during development, always returning to the start point before executing again.
That’s why you need a cleaning process, deleting all the records on the tables, but that’s not enough. The tables have identity fields and you also need to reset the identity seed before executing again the ETL.
Reseting the Identity Seed: Regular behaviour
Reset an identity seed seems simple:
The problem is the combination of Delete and the DBCC CheckIdent. This combination can lead to very strange results. Let’s see what can happen below:
The first execution will be in an empty table, never used before.
(
id INT IDENTITY(1, 1) PRIMARY KEY,
price NUMERIC(15, 2)
)
go
DELETE testtable
DBCC checkident(‘testTable’, reseed, 1)
INSERT INTO testtable
VALUES (10)
SELECT *
FROM testtable
The following executions will be in a table already used, having passed by a delete and dbcc checkident. It’s the same code executed again:
DBCC checkident(‘testTable’, reseed, 1)
INSERT INTO testtable
VALUES (10)
SELECT *
FROM testtable
As you may notice, even making the dbcc checkident using the same seed value, the initial identity value in the table becomes different, what can cause all sort of problems for your software.
The root cause of the problem is one statement, DBCC CheckIdent, behaving in different ways according to how the table is at the moment. Our cleaning script needs a consistent result and it’s not achieving this.
Investigating the root cause
We will need to search deeper in SQL Server to discover why the checkident has these differences and where is SQL Server storing the current number for the Identity column of a table.
The DMV sys.identity_columns has a column called LAST_VALUE which contains the last value used for the identity column. Let’s analyse the value this column stores during the execution of the script.
go
CREATE TABLE testtable
(
id INT IDENTITY(1, 1) PRIMARY KEY,
price NUMERIC(15, 2)
)
go
DELETE testtable
SELECT object_id,
NAME,
column_id,
last_value
FROM sys.identity_columns
WHERE Object_name(object_id) = ‘testTable’
Just after have been created, the Last_Value column contains NULL, even after the delete statement has been executed.
SELECT object_id,
NAME,
column_id,
last_value
FROM sys.identity_columns
WHERE Object_name(object_id) = ‘testTable’
Even after the execution of the checkident, the Last_Value column remains with NULL value.
Let’s insert a record:
VALUES (10)
go 2
SELECT *
FROM testtable
The Null value on Last_Value results in a first identity as 1, which is the exact value set as a seed for the identity.
Let’s repeat again:
SELECT object_id,
NAME,
column_id,
last_value
FROM sys.identity_columns
WHERE Object_name(object_id) = ‘testTable’
After the delete, the last_value is 2, the last identity used. We need the checkident:
SELECT object_id,
NAME,
column_id,
last_value
FROM sys.identity_columns
WHERE Object_name(object_id) = ‘testTable’
Now we can easily see the difference: When the table is empty the last_value is NULL and it continues to be NULL after the checkident. However, when the table already had some records included, the last_value is filled and the checkident will reseed it to the value we are providing. So, it’s a difference between last_value field being NULL or the seed value.
Solving the problem
Knowing these details, we can create a query to calculate the next identity value for a table. If the last_value is NULL the next value will be the seed value, if last_value has a value, the next one will be the last_value + the increment
+ CONVERT(INT, increment_value), CONVERT(INT, seed_value))
FROM sys.identity_columns
WHERE Object_name(object_id) = ‘testTable’
The Converts are needed because on this DMV the columns are of type sql_variant, so we need to convert them to an integer.
Knowing how to find the correct value we can now fix our cleaning code in order to always have the same result, with the identity starting in 1.
DELETE testtable
SELECT @lastValue = last_value
FROM sys.identity_columns
WHERE Object_name(object_id) = ‘testTable’
IF ( @lastValues IS NULL )
DBCC checkident(‘testTable’, reseed, 1)
ELSE
DBCC checkident(‘testTable’, reseed, 0)
The post Identity column vs ETL Process appeared first on Simple Talk.
from Simple Talk https://ift.tt/2TkKxw0
via
No comments:
Post a Comment