Friday, 18 July 2014

SCCM 2012 Data Replication Service (DRS)

The Data Replication Service (DRS) came as a new feature in SCCM 2012 for replication between sites, alongside the legacy file-based replication used in earlier versions. People sometimes refer to it as 'SQL replication', which is incorrect: SQL replication is a standalone SQL Server feature, and SCCM 2012 is designed not to use it. It would, however, be correct to say the replication is SQL based, as it uses SQL Server features such as SQL Service Broker (SSB) and Change Tracking, along with the Bulk Copy Program (BCP). The component is written partly in SQL and partly in ConfigMgr code (known as the Replication Configuration Monitor, or RCM).
Advantages of DRS implementation in SCCM 2012
SCCM 2007 was designed with multiple tiers in the hierarchy
One of the pain areas of SCCM 2007 was that hierarchies could have multiple tiers, and a big issue generator was 'Package status for a third/fourth tier site has not yet been reported to the central console; stuck in Install Pending state'. Generally in such cases, if any tier in between was stuck sending huge packages, the status messages (being the lowest priority) were delayed, and with many packages around it used to be a tedious task troubleshooting such issues, which would resolve themselves after days (once the sending was completed).
With SCCM 2012, the hierarchy is limited to a single tier of primary sites under the CAS, each of which can have secondaries under it. So it is fair to say the hierarchy was flattened for the better. There was also a clear need to differentiate between -

a. Data (status messages, inventory, collection changes, package changes, site-wide changes)
b. Content (the actual source files for the packages and applications)

So SCCM 2012 prioritized the data by moving it to a SQL based replication mechanism, known as DRS, which works independently of content replication; the content itself is still transferred via the legacy file based replication.

Redundancy in data processing with SCCM 2007
Consider inventory from a client on a tier 3 site. It is processed there and then moves up the hierarchy via the legacy replication, so it gets processed again at each of the upper tiers, and bandwidth is wasted in sending it.
With ConfigMgr 2012, we don't have multi-tier primary hierarchies; the data is processed only on the reporting primary site and replicated to the CAS via DRS.
What does DRS actually replicate?
a. Global data: The data which is common across the hierarchy; it is shared between the CAS and all the primaries. So what comes under global data? The best way to remember this is by answering the question 'What can an admin create?' viz. packages, collections, applications, etc. Note that this being common across the hierarchy was not the case in ConfigMgr 2007: a package created on a tier 2 site would only move down, not up.
A derivation of global data is 'global proxy' data, which is shared between the primaries and their respective secondaries. Another design change in ConfigMgr 2012 is that a secondary site now has a database; the reason is clear: having moved to DRS, a SQL database is needed there for replicating packages and other data.

b. Site data: The data which comes from the clients can be termed site data. Clients report directly to their corresponding primary site, and this becomes site data. It is shared between the CAS and the respective primary only. Meaning, Primary1 will not have the data of Primary2, but the CAS will have the data of both primaries.

How does DRS differentiate global data and site data?
Data is broken into global data and site data, and then further divided into categories known as 'replication groups' for better management.
In simple words, a replication group is nothing but a group of tables. SCCM 2012 has around 12 replication groups in global data. The division is simply by type, e.g. the alert tables combine to form the 'Alerts' group and configuration information combines to form the 'Configuration Data' group.
Similarly, site data is divided into replication groups. A few important ones are Hardware_Inventory_* (* is a number starting from 0, which can grow depending on the magnitude and types of hardware inventory information collected) and Software Metering.
There is one more concept, the 'replication pattern', which is simply the type of data, viz. global, site or global proxy.

How do we query the replication groups that belong to each pattern?

For Global Data –
 select * from vReplicationData where Replicationpattern = 'global'

For Site Data –
select * from vReplicationData where Replicationpattern = 'site'
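
For Global Proxy Data – the global proxy groups mentioned earlier can be listed the same way. The exact pattern string used below is an assumption based on the naming above, so check the distinct values in the view first:

select distinct ReplicationPattern from vReplicationData
select * from vReplicationData where ReplicationPattern = 'global_proxy'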


Below is a screenshot of the output of the first query (global data).


Now, once we know the groups, we can also find which tables are actually part of a replication group. For example, to find the tables for the 'Alerts' group:
select ArticleName from ArticleData where ReplicationID = (select ID from vReplicationData where ReplicationGroup = 'Alerts')


Assume a requirement where we want to find all hardware inventory related tables; then the query could be:
select ArticleName from ArticleData where ReplicationID in (select ID from vReplicationData where ReplicationGroup like 'Hardware_Inventory%')

Now that we have a bit of background on DRS, let's move on to how replication is fulfilled through DRS:

This can be broken down into two parts –

a. Site initialization – This is the first step, where data gets copied in bulk to the given primary site. Take the scenario of a new primary site being created: it needs all the global data from the CAS.
The process is simple –
- The primary asks the CAS for the init. This reaches the CAS as an RCM (Replication Configuration Management/Monitoring) message via the RCM service broker queue.
- Once the CAS gets the message, it calls the BCP OUT function to copy all the global data from the database into *.bcp files, along with the proper row counts (*.rowcount) for each table.
- This is then compressed by RCMCtrl on the CAS and sent to the primary via the legacy sender.
- The primary gets the data and then performs a BCP IN to insert the data into its SQL database.
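
While an initialization is in flight, the Service Broker queues involved can be peeked at directly (a plain SELECT against a queue is read-only and does not consume messages). A minimal sketch against the RCM queue mentioned later in this post; note that not every message body is readable as text:

select top (100) message_type_name, service_name,
       casted_message_body = cast(message_body as nvarchar(max))
from dbo.ConfigMgrRCMQueue with (nolock)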

What is maintenance mode? What are the different modes a site can have?
When a site is initializing its global/site data it goes into maintenance mode. There are two types of maintenance modes.
1. Site maintenance – When the primary site is being installed it is not usable and is in site maintenance mode. The console is in a read-only mode.
2. Replication maintenance – The CAS goes into replication maintenance when it is yet to get the site data from any of its primaries. During this time the site is usable, but it will not replicate any data to any other sites.
SiteStatus   Mode
100          'SITE_INSTALLING'
105          'SITE_INSTALL_COMPLETE'
110          'INACTIVE'
115          'INITIALIZING'
120          'MAINTENANCE_MODE'
125          'ACTIVE'
130          'DETACHING'
135          'READY_TO_DETACH'
199          'STATUS_UNKNOWN'
200          'SITE_RECOVERED'
205          'SITE_PREPARE_FOR_RECOVERY'
210          'SITE_PREPARED_FOR_RECOVERY'
215          'REPLCONFIG_REINITIALIZING'
220          'REPLCONFIG_REINITIALIZED'
225          'RECOVERY_IN_PROGRESS'
230          'RECOVERING_DELTAS'
250          'RECOVERY_RETRY'
255          'RECOVERY_FAILED'

It is clear that when the replication groups do not get initialized, the site stays in maintenance mode. When a primary does not have its global data initialized, it will be in maintenance mode and will not give its site data to the CAS, keeping the CAS in maintenance mode too.

Below are the different initialization states a replication group can be in:

InitializationStatus   Meaning
0                      'Unknown'
1                      'Required'
2                      'Requested'
3                      'PendingCreation'
4                      'PackageCreated'
5                      'PendingApplication'
6                      'Active'
7                      'Aborted'
9                      'Failed'

We can check the InitializationStatus for the replication group in the RCM_DRSInitializationTracking table.
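
A minimal sketch of such a check; the two column names are the ones referenced above, so verify them against the actual table definition in your database:

select * from RCM_DRSInitializationTracking
select ReplicationGroup, InitializationStatus from RCM_DRSInitializationTracking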

b. Site active – Once the site has initialized its global data, it is ready for production and goes active. All activities now use DRS replication. As an example, say we create a package on the primary. The sequence of activities is –
1. The provider writes to the corresponding tables for the package (SMSPackages_G etc.) on the primary site.
2. These tables are enabled for the SQL feature 'Change Tracking'. So when new data gets inserted into the SMSPackages_G table, only the primary key column information is stored in the change tracking tables. The point of keeping only the primary key information is to avoid growing the change tracking table; the full information for the package to be sent can be obtained later by joining the change tracking table to the original SMSPackages_G table on the primary key column (a small sketch of this join appears below, after the sys.dm_tran_commit_table query).
What is SQL Change Tracking? Change tracking identifies and keeps track of what changed in the database, such as a row insertion, update or deletion. We need this because we now only have to send the changes to the CAS, and not the whole global data again, as the CAS and the primary are already in sync.
How Change Tracking works in the background
For every table that is enabled for change tracking there is an additional internal table, change_tracking_<object_id>. When a row gets inserted into the actual table, a corresponding entry containing only the primary key information gets added to the internal table change_tracking_<object_id>. For every successful transaction committed against a tracked table, we see a row in the sys.syscommittab table. sys.syscommittab is a system table which can be referenced via the sys.dm_tran_commit_table view.
Let's take the example of SMSPackages_G. How do we find the corresponding internal change tracking table for it?

select OBJECT_NAME(OBJECT_ID) [ObjectName], * from sys.change_tracking_tables where OBJECT_NAME(OBJECT_ID) like '%smspackages%'



From the object ID we can get the change tracking table for it –

select name,internal_type_desc,* from sys.internal_tables where name like '%1675153013%'


Similarly, to get a view of sys.syscommittab, use the query below –

select * from sys.dm_tran_commit_table
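
To illustrate the join described in step 2 above, here is a minimal sketch of how tracked changes are typically read back using the CHANGETABLE function. The PkgID key column is an assumption for illustration, so substitute the table's actual primary key:

declare @last_sync_version bigint = 0  -- in practice, the last version already synced to the CAS
select ct.SYS_CHANGE_OPERATION, ct.SYS_CHANGE_VERSION, p.*
from CHANGETABLE(CHANGES dbo.SMSPackages_G, @last_sync_version) as ct
left join dbo.SMSPackages_G as p on p.PkgID = ct.PkgID  -- PkgID assumed; deleted rows have no matching base row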
Also, change tracking has to be enabled both for the database and for the tables.
To find the databases with change tracking enabled:

select * from sys.change_tracking_databases

To find the tables with change tracking enabled:

select * from sys.change_tracking_tables
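
For reference only, this is the standard T-SQL used to turn change tracking on. ConfigMgr setup enables it itself, so treat this as an illustration of the syntax rather than something to run against a production site database; the database name and retention values are placeholders:

alter database CM_PR1  -- placeholder database name
set change_tracking = on (change_retention = 5 days, auto_cleanup = on)

alter table dbo.SMSPackages_G
enable change_tracking with (track_columns_updated = off)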

 
3. The change is converted into a message and placed in the corresponding SQL Service Broker (SSB) queue. The message is then passed by the broker service to the corresponding queue on the other SSB endpoint.

For those who don't know what SQL Service Broker is, it is simply a mechanism to deliver messages end to end. So why SSB? There are many reasons why it is a good fit -
a. It provides fast transactional delivery, as it is part of the database engine itself.
b. It provides reliability and enhanced security. Reliability in the sense that any message that goes out of SSB stays in sys.transmission_queue until the acknowledgment for it is received; also, these queues are not held in memory but stored as tables, so they are not lost in outages. Security comes from the certificates used to encrypt messages going to the endpoint on the other SSB.
c. It uses asynchronous communication. Meaning, it never polls for messages; as and when messages arrive, they invoke the activation procedure configured on the queue on which they arrive, and that procedure takes care of how to process the message.
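
To see the Service Broker objects that ConfigMgr creates in the site database, the standard SQL Server catalog views can be queried; a small sketch (these are generic SQL Server views, so the output is not ConfigMgr specific):

-- queues and their activation procedures
select name, activation_procedure, is_receive_enabled from sys.service_queues

-- services and routes used for the site-to-site conversations
select name from sys.services
select name, address from sys.routes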


Wednesday, 9 July 2014

Checking Inbox Backlog and troubleshooting in SCCM 2012


A number of stored procedures can be used to find more details about a backlog, along with monitoring of the transmission queue.
The most useful one is the stored procedure "spDiagDRS". Run "EXEC spDiagDRS" to get the results shown in the screenshot below.
"spDiagDRS" fetches details about queued messages. Have a look at the columns named "OutgoingMessagesInQueue" and "IncomingMessagesInQueue".
In an ideal scenario there should NOT be any queued messages, and the values in those columns should be ZERO.

In my scenario, "OutgoingMessagesInQueue" is 257, which means something failed during sending and we have a backlog.
"spDiagDRS" also tells us the status and LastSyncTime of each replication group. In my example, the SiteSending is CAS and the SiteReceiving is PR1.


Apart from spDiagDRS, there are some very useful stored procedures that we can use when troubleshooting a backlog. See the list of stored procedures below; to filter them in SQL Server Management Studio, type spDiag in the Object Explorer filter settings.
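
Alternatively, the same list can be pulled with a simple query instead of the Object Explorer filter:

select name from sys.procedures where name like 'spDiag%' order by name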



You can check out the following examples of these, along with parameters, for in-depth troubleshooting.
EXEC spDiagMessagesInQueue
EXEC spDiagGetReplicationGroupStats 'Configuration Data', 'PR1'
EXEC spDiagGetProcedureStats '100'
EXEC spDiagGetQueryStats '10'
EXEC spDiagGetRunningQueries '10'
EXEC spDiagStartTrace
EXEC spDiagStopTrace

The transmission queue is another place we should look at in case of a backlog (or when the outgoing messages are stuck). All the other queues shown in the following pic (ConfigMgrDRSSiteQueue, ConfigMgrRCMQueue, ConfigMgrDRSMsgBuilderQueue, ConfigMgrDRSQueue, etc.) are application related queues.

To check the transmission queue, run the SQL query below. With it, we can check the transmissions destined for a particular site (in the query below, the CAS server's site code is CAS and the primary site code is PR1).

SELECT TOP 1000 *, casted_message_body =
CASE message_type_name WHEN 'X'
  THEN CAST(message_body AS NVARCHAR(MAX))
  ELSE message_body
END
FROM [CM_CAS].[sys].[transmission_queue]
WHERE to_service_name = 'ConfigMgrDRS_SitePR1'


In the figure below, you can see the records waiting to be transmitted. Have a look at the "transmission_status" column; it provides more details about any transmission errors, which is very helpful for further troubleshooting.
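
A quick way to summarize the backlog per destination and error state, as a small sketch built on the same system view:

SELECT to_service_name, transmission_status, COUNT(*) AS queued_messages
FROM [CM_CAS].[sys].[transmission_queue]
GROUP BY to_service_name, transmission_status
ORDER BY queued_messages DESC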




The vLogs view is the DRS (Data Replication Service) log. It provides more details about the DRS process and the backlog.
Run the following SQL query – "Select top 1000 * from vLogs order by LogTime desc" – to get the most recent of these log entries.
The RCM_ReplicationLinkStatus table can also provide more details about the replication link status between the sites.

Run the SQL query – "select * from RCM_ReplicationLinkStatus". Look at the StatusName column for details like Failed, Degraded, etc.
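
To cut down the noise, vLogs can also be filtered on its message text. The LogText column name used here is an assumption, so check the view definition (for example with sp_helptext 'vLogs') if it differs:

select top 1000 * from vLogs where LogText like '%fail%' or LogText like '%error%' order by LogTime desc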

TRACE stored procedures – for in-depth analysis of a backlog.
This can be performed using the following stored procedures.

Caution – this creates a lot of overhead on the SQL Server and also uses a lot of disk space because of the creation of the ConfigMgrDBTrace.trc file.
EXEC spDiagStartTrace
EXEC spDiagStopTrace

You can start the trace with "EXEC spDiagStartTrace". This creates a .trc (trace) file in the SQL installation location, "C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\DATA\ConfigMgrDBTrace.trc", and starts tracing each and every event on the SQL Server. To stop the trace use "EXEC spDiagStopTrace". Ensure that you STOP the trace as soon as possible, otherwise it may have an adverse impact on the server.
We need SQL Server Profiler to open the .trc file. It gives in-depth details about each event performed by the SQL Server during the trace, such as Duration, EventClass and StartTime.
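
If you prefer to stay in T-SQL rather than opening Profiler, the same trace file can be loaded into a result set with the built-in fn_trace_gettable function (the path below is the one from this example):

SELECT EventClass, TextData, Duration, StartTime, EndTime
FROM sys.fn_trace_gettable('C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\DATA\ConfigMgrDBTrace.trc', DEFAULT)
ORDER BY StartTime DESC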