Change data capture was introduced in SQL Server 2008 and has been pretty popular ever since. The basic concept behind it is that all changes (inserts, updates, deletes) are captured and the details made available in a consumable manner. This means ultimately that all changes will be written to a table you can query to see everything that has been going on.
In SSIS 2008 it was a little cumbersome to maintain and get data out of the CDC tables. SSIS 2012 has made that much easier with the addition of three components specifically for change data capture. In this blog we will explore the CDC Control Task.
CDC Control Task Information
The initial load of the data is the most challenging part of the change data capture process from a component standpoint. According to books online the CDC components support three initial loading and change processing scenarios:
- Initial loading done with a database snapshot. In this case, change processing starts with the LSN of the snapshot event.
- Initial loading from a quiescent database. In this case, no changes are made during initial loading so the current LSN is sampled at sometime during the initial load and change processing starts with that LSN.
- Initial loading from an active database. In this case, as the initial load is in progress, changes are made to the database and there is no single LSN from which change processing can be precisely started. In this case, the initial load package developer can sample the source database current LSN before and after the initial load. Then, when processing changes, care should be taken when processing changes made in parallel to the initial load as some of the processed changes are already seen in the initial load (for example, an Insert change may fail with a duplicate key error because the inserted row was read by the initial load process).
The CDC Control Task supports two different types of operations:
- Sync of initial load and change processing
- Management of processing range
The following operations can be performed by the CDC Control Task (again, available in books online as well):
- Reset CDC State: I may be missing something, but this is listed online but not available in the actual product. So I will skip it for now.
- Mark Initial Load start: Records the LSN in the source database before the initial load reads any data from the source table. Used in conjunction with scenario 3 above.
- Mark Initial Load end: Records the LSN in the source data after the initial load is done loading data from the source table. Used in conjunction with scenario 3 above.
- Mark CDC Start: Records the current LSN and indicates that change processing should start from that point forward. Used in conjunction with scenario 1 and scenario 2 above.
- Get Processing Range: Used before a data flow so the CDC source knows what LSNs should be read from.
- Mark Processed Range: Used after a data flow to indicate the last LSN that was processed. This becomes the LSN that will be the start of the range next time a Get Processing Range operation is performed.
Configuring the CDC Control Task
The CDC Control Task can be found in the SSIS Toolbox inside Data Tools. This is an SSIS 2012 only feature. It will be listed under Other Tasks by default.
Here is a completed CDC Control Task with a brief description of what each option is for:
- SQL Server CDC database ADO.Net connection manager: Connection to a SQL database where CDC is enabled on the database level and on at least one table.
- CDC control operation: Operation for the task. The option chosen here will use the variable in the next option to store any required information.
- Variable containing the CDC state: Stores information from the previous option and passes it between the task, control flow source and potentially other CDC control tasks.
- SQL Server LSN to start the CDC or Snapshot database name: Current source database LSN or snapshot database name where the initial load performed from. This indicates where CDC should start. Only used when option 2 is set to Mark CDC Start. Leaving this blank will default to the current LSN.
- Connection manager for the database where the state is stored: Connection to the database where the state table is stored.
- Table to use for storing state: Table where the CDC state is stored. I know that was an amazing explanation. 🙂 What can I say, the name explains itself.
- State Name: Used to look up the state in the table from option 6 above.
Now for a quick step by step on actually configuring the task…
Requirements for using this task include a database where CDC is enabled, an ADO.NET connection to the database where changes are tracked, a table to store the current CDC state and a variable for storing the current CDC state in the package. All of these can be created from the task itself with the exception of enabling CDC on the database and table(s).
- Select a new ADO.Net connection manager or create a new one from the task.
- Set the CDC control operation based on the uses listed above
- Initial load start, initial load end or CDC start for the initial load
- Get processing range or mark processed range for subsequent loads
- Select or create a new a variable to store the CDC state
- You can likely leave the LSN to start CDC empty which will just use the current LSN and keep the store state in database box checked
- Choose the connection you created to the CDC database again in the connection manager for the database where the state is stored option, or if there is a separate database with this information create a connection to that database and select it.
- If this is the first time you are setting up the state click New next to the table to use for storing state option. If your table is already set up select it from the drop down menu.
- Finally choose the state name, if you are using all the defaults it will be CDC_State. There will be one record in the state table for each unique state. So if you want to maintain each table separately then you should create a different state for each table or group of tables.
There is a lot of information in here, so please ask questions if you have any. I think the most important piece to this whole task is when it is used to configure the initial load. I did an initial load using the mark initial load start and end options when first testing the task out. To my surprise when I ran the data flow to pick up changes it came back with nothing. The second time I ran it however all my changes were picked up. This was because I was loading data from a static database and should have just chosen to load the data then do an operation of Mark CDC start.
I will be posting a blog about each of the data flow components then I will post an SSIS package that sets up a database for CDC, does the initial load and then does a changes load. Be on the lookout for those in the next couple of weeks. I have the SSIS package already built, but I want to be sure to go through each of the three components first so that everyone understands what each option does. Then you can avoid my mistake of choosing the wrong options and spending an hour or two trying to figure it out before finally looking at books online.
One final note, at the time of writing this post there are several errors/clarifications needed in the books online documentation that are known issues and are supposed to be in the process of being fixed.
Enjoyed reading your post on CDC. Do you know if it’s possible to build a SSIS 2012 CDC flow with SQL Server 2008 R2 CDC enabled database and tables?
Nice post there… Just a quick thing I wanna ask you. Do you recommend CDC (in SSIS 2012) to handle SCD type 2 changes?
Please advise if you have any suggestions.
Great explination and really help full
I need to implement the incremenatal load in sql server 2008. Is that possible to achive it .
I’ve one confusion for Type 0 transformation.do SCD and CDC is same kind of?
SCD is more of a concept (unless you mean the SCD wizard which is an SSIS component) and CDC is a technology that tracks what changes are made in the database. That being said a type 0 attribute is one that doesn’t change.
These CDC components are provided by Atuunity? or Microsoft itself?
These are provided by Microsoft. They are not available from codeplex or anywhere outside the product though. They are built right into SSIS starting with the 2012 release. There is no way to get these components in 2008R2 or prior releases.
[…] SSIS 2012 Change Data Capture (CDC) Control Task … – Change data capture was introduced in SQL Server 2008 and has been pretty popular ever since. The basic concept behind it is that all changes (inserts, updates …… […]
Good morning Bradley – i came across your article this morning while researching something related to CDC in SSIS 2014. In #7 above you state “Finally choose the state name, if you are using all the defaults it will be CDC_State. There will be one record in the state table for each unique state. So if you want to maintain each table separately then you should create a different state for each table or group of tables.”
My question is how to make the CDC_State name dynamic? I have 3 packages in a project, each package doing CDC on it’s own table. I would like to have 3 rows in my db0.cdc_states table accordingly. However, I can’t figure out how to parameterize the “State Name” in the CDC Control Task Editor. I’ve tried using package configurations with no luck yet.
How can I implement CDC for multiple tables? I want to run all Data Flow Tasks for multiple tables in parallel. While doing initial load, do I need separate CDC Control Flow task for each table? Or, shall I just add multiple DFT’s between ‘CDC Control Flow – Start’ and ‘CDC Control Flow – End’.
CDC State holds the LSN at database level or table level?
Do I need distinct State Name for different tables if its in the same package with parallel DFT?
Hi Brad , do you know how it is with creating line lsn in sql server.
We have cdc component Oracle-SQL server (2016 SP1), everything was ok. But last week was created in table cdc.lsn_time_mapping in column start_lsn out of order.
0x0026DD1C523201000001 2017-07-10 11:16:46.010
0x0054000100000000AF05 2017-07-10 11:16:46.010
0x0026DD1C523601000001 2017-07-10 11:16:46.027
How is it possible?
Do you know ,how is it the solution ?