Death By Patterns

SQL Express Change Tracking Framework – Part 3 : Sync Direction

 

 

Last time we created a basic client server demo showing how to synchronise data between a CE client and a server using Sync Services for ADO.Net and WCF for the communication.  The next step in the process is understanding what sync direction is and how to use it.

What is Sync Direction

Sync direction allows us to specify which tables must be synchronised in what direction.  A table’s sync direction can be specified to be either:

  • Download only
  • Upload only
  • Bidirectional
  • Snapshot

Download only means that changes made to the client table are not synchronised with the server.  The client can still make changes to the table, but those changes aren’t reflected on the server at all.

Upload only is the opposite.  Changes made on the server side are ignored, and only changes made on the client are synched.

Bidirectional means that changes made on either the client or the server are synched.

Snapshot means that when the table is synced it takes a snapshot of the server table and makes the client table the same.  All client data is overwritten.  This is useful in tables where you need the clients tables to match the server’s (e.g. lookup tables) regardless of what’s happening on the client.

How do we use Sync Direction?

Sync Direction is specified on the client sync provider.  In order to specify a direction you simply have to provide a value for the SyncDirection property on the SyncTable class.  If you want the server to have control over the sync direction of a table you have to work a whole lot of extra logic into the DbServerSyncProvider class to check if the changes to/from the client must be applied.

Remember that the wizard created a SyncTable class for each of the tables we specified.  Each of these classes was a strongly typed implementation of the SyncTable class, but we don’t have to extend SyncTable to benefit from the behaviour (more on this when we jump into the SQLExpress client though).

By default the SyncDirection on a table is set to Download Only, so if you need BiDirectional or need to upload changes then we need to change this behaviour.  Right click on the Client Sync Cache file (JoesDeliCache.sync in this case) and select view code.  In the partial class there will be an OnInitialized() method. Add the following code:

partial void OnInitialized()
{
BaseIngredients.SyncDirection = SyncDirection.Bidirectional;
}

Do the same for each table that needs it’s behaviour changed.  Now when you try and sync a download only table after making changes on the client side you’ll see that no changes are applied.

Simple as that!

Next time:
Filtering data based on the client.

Posted by CalmYourself
Filed under:

SQL Express Change Tracking Framework – Part 2

 

 

In Part 1 we took a look at what SQL Change Tracking was and had a brief glance at how we can take a look at the underlying table to see what data has changed.

This time we’re going to do something a little more practical and use SQL change tracking to actually synchronise data between a client and a server by throwing in some Sync Services for ADO.Net.  We’re going to synchronise the data to a SQL CE database to get the concepts and after that move it over to a SQL Express database.

The Database

Meet Joe.  Joe owns a deli and has a side interest in programming.  Joe’s gone and made an application which keeps track of all his recipes and all the orders he gets.

Joe's Deli

Each recipe has various ingredients, and may even require other recipes to be finished first (the self-join i need).  Each recipe has various ingredients with different quantities.  Each order is comprised of various quantities of recipes and the order has to be filled at a certain date.

Joe’s Deli app works fine, but over the past few months Joe has decided to increase his sales staff and he’s hired a couple of mobile sales reps.  These sales reps need to go to the clients and push through orders remotely… keeping in mind that the list of available recipes may change.  These sales reps will synchronise with the central DB whenever they are able to.

Assuming that Joe’s Deli app will work perfectly on mobile devices and the only change that he has to do is allow his sales teams to connect and resynchronize data with the central database then it’s simple:  Throw some SQL Change Tracking and ADO.Net Sync Services at the problem.

Daniel Moth has an excellent screencast on using Change Tracking with ADO.Net Sync Services which you can view here.  The screencast only shows the sync happening within one project so I’m going to show you how to sync between multiple projects (client server style).

1. Create two projects

The first of which will act as the server which the clients will synch with, and the second will act as the tool which runs on the clients and causes the sync with the server app.

 

App Connectivitiy

 

 

The server will host the data over a WCF endpoint which the clients will connect and authenticate to and use that connection to sync data across.  Each client will run a copy of the Client Sync App.

So again, your first project would be a WCF Service Library so you can go ahead and start one of those up.  After you’ve created it delete all the files in the project except the App.Config file.

1

Then add another project to the list, this one is going to be used by your client app to sync the data.  One thing you’ll notice is that for some reason only certain project types allow you to add a Local Database Cache to them, and since we don’t know what type of device or application is going to be initiating the sync I’m going to make the second project a Class Library so that it can be used in most projects.

2

After the Class Library is created delete the Class1.cs file that’s added for you.  OK, so we’ve got our 2 projects.  Now add a Local Database Cache to the Client Library.

3

 

A wizard will appear to help you configure the change tracking.  The only step here that’s different to Daniel’s screencast is that we’re splitting the implementation into two projects.  So go ahead and expand the advanced section and tell the wizard which project is the server project (the WCF project) and which is the client project (the class library).

4

Do the rest like Daniels video (add the tables, click OK, do your usual dataset stuff) and then go and open the <DataCacheName>.Server.SyncContract.cs file.  Right at the top you’ll see a commented out section which has a <service> element and a <behaviour> section.  Copy the <service> element and paste it over the existing <service> element in the Server project’s App.Config.  Do the same with the <behaviour> element.

If you’ve done this correctly your Server project should appear in the “Add Service Reference” dialog for the client library after clicking the “Discover” button.

5

So go ahead and add the service reference.  All we have to do now is expose a Synchronise method from the Class Library.

All I did was add a class file (ClientSyncManager.cs) which exposes the following static method:

public static SyncStatistics Synchronise()
{
using (JoesDeliCacheSyncAgent syncAgent = new JoesDeliCacheSyncAgent())
{
using (ServerSyncProviderProxy serverProxy =
new ServerSyncProviderProxy(new JoesDeliCacheSyncContractClient()))
{
syncAgent.RemoteProvider = serverProxy;
return syncAgent.Synchronize();
}
}
}

This allows us to make changes to the datawarehouse and a simple call to ClientSyncManager.Synchronise() will sync the data down to the client.

What about uploading changes?  Also simple, but we just have to tell the client that it must upload the changes too. To do that, open up the Client’s generated SyncAgent and add the following for each of the tables that need to upload changes too:

public partial class JoesDeliCacheSyncAgent
{
partial void OnInitialized()
{
BaseIngredients.SyncDirection = SyncDirection.Bidirectional;
}
}
Now we can go ahead and make changes on our data warehouse and our local cache, call the sync method and all the data will sync happily together.
 
Play around with the sample app, see what happens when you try and sync conflicting rows together, see how you can break it.  If you ever completely mess it up, simply delete the tables from the local SDF cache and they’ll be recreated when you try and sync again.

Next Time:

Next time we’ll add some additional functionality such as:

  • Filtering the data to be synched so that each agent only has a subset of the data
  • Progress reporting (so we know how far our sync is, progress bars are pretty)
  • Allowing the clients to delete on their local DB without the delete affecting the data warehouse
  • Conflict Handling (both the server and the client have new versions of the same row)

Download the Sample Code


Posted by CalmYourself

Creating a SQL Express Change Tracking Based Sync Framework – Part 1

 

 

Change tracking used to be a nightmare.  It involved creating tables to keep track of inserts, updates and deletes, triggers to keep these tables up to date, and a whole lot of logic to perform cleanups on these tables.  This usually (read: always) has the result of turning your nicely designed database into something that looks awful.  Enter SQL 2008.

SQL Server 2008 shipped with a wonderful feature called SQL Change Tracking which allows applications to keep track and take advantage of change tracking information without having to create and manage tables and triggers.  All the hard stuff is done for free, meaning that the developers can focus on application features rather than background tasks.  It’s not entirely for free though.

While SQL Change Tracking is a great product, you will probably want to use it in conjunction with a tool such as the Sync Services for ADO.Net which unfortunately only supports SQL CE at the moment.  There is a SQL Express client sample, but there’s no official support for it yet.  Hence this post.  Over the next few blog posts I’ll be documenting the process involved in creating a SQL Express Change Tracking Framework using Sync Services for ADO.Net.

Why use SQL Change Tracking?

Change tracking allows applications to (just about) seamlessly cater for occasionally connected clients or to allow for replication where a full replication solution may not be possible or required.  In the instance which prompted this series of posts, I need to synchronise data from various client databases to a central data warehouse.  Sounds simple, but the client databases are all running SQL Express which means no SQL Agent meaning no SQL Log Shipping.

Change tracking not only keeps a record of data changes, but it also keeps it all nicely versioned which was always a big problem with custom change tracking solutions.  Each time a client queries the change tracking records, the client would specify which version of the data it has would be able to see all the changes since that version.

Back to the current series:
The requirement was for one central data warehouse to house all the data from all the clients.  Each client would only ever have a subset of the data based on some or other filter.  Data could also be deleted from the client, but the data warehouse should retain all data indefinitely.  Some of the tables needed to be bi-directional, some download only, some upload only.  The solution also needed to work regardless of whether the client was connected 24/7, once a day, once a week, once a month, etc.  Again, all of the requirements would have been met by LOG Shipping or some other form of replication, but the client databases weren’t going to be big enough to warrant a full version of SQL.

Change Tracking Data Flow

 

 

We eventually decided on using SQL Change Tracking to keep track of the changes at the data level and to use Sync Services and WCF to communicate the changes and versions between the data warehouse and the client DBs.

We have a working solution in place, but it’s definitely not scalable or adaptable to different environments or changing requirements.  Hence the series.  I’m going to use this series of articles to document the process I followed to convert our current solution into a decent framework which can be used to synchronise data between different types of data stores (I don’t want to be limited to SQL databases), from different environments with different requirements to a central location.

There were a couple problems we experienced with our initial implementation (besides the lack of proper architecture):

  • Currently Sync Services only supports CE, not SQL Express
  • Large tables needed to be synced in batches
  • Self-Joins on tables doesn’t work

There are others, but these are the biggies which spring to mind at the moment.

Why SQL Express?

Sync Services supports SQL CE which doesn’t always provide a workable solution.  Some of the shortcomings of SQL CE include:

  • Lack of full security management
  • No stored procedures, triggers or views
  • No user defined types

There are others, but these are the typical deal-breakers for me.  And yes, there are work-arounds for most of them but I don’t think the fact that you can get by without these features is any excuse that you should.

Just for completeness, we did try and run the client apps using SQL CE but the performance was completely abysmal.  That and the lack of remote connections meant it wasn’t a workable solution at all.

How does it work?

In order to use change tracking it needs to be enabled at both the database and the table level.  Enabling change tracking at the DB level is as easy as:

   1: ALTER DATABASE ChangeTrackingDemo SET CHANGE_TRACKING = ON
   2: (
   3:     CHANGE_RETENTION = 31 DAYS,
   4:     AUTO_CLEANUP = ON
   5: )

There are 2 optional parameters:

  1. CHANGE_RETENTION (Default: 2 days) – specifies how long change tracking data must be stored for.  The longer you need to store the information, the more space you’ll use.  Keep in mind that if a client tries to synchronise with the server and the client hasn’t synced within the amount of time stored, the client won’t be able to sync. E.g. A client syncs on the 1st of the month and again on the 10th.  If the server only stores change tracking data for 5 days the client won’t be able to sync.
  2. AUTO_CLEANUP (Default: On) – determines whether or not the change tracking data must be cleared automatically after the retention period has passed.  You may want to put this off if your clients are having problems with connecting or synching and you need to keep the change tracking data for longer than the initial retention period (I haven’t yet worked out how to clear the change tracking data manually without truncating the table, so please let me know if you know a way).

Now that change tracking is enabled at the DB level we need to enable it at the table level.  Enable change tracking on a table like so:

   1: ALTER TABLE TableToBeTracked ENABLE CHANGE_TRACKING WITH 
   2:     ( 
   3:                 TRACK_COLUMNS_UPDATED = OFF  
   4:     )

One optional parameter TRACK_COLUMNS_UPDATED (Default: Off) which records the columns which were updated on the table.  This adds additional overhead and isn’t terribly useful unless you have really large rows in your table and want to be sure to only sync the columns that have changed.

Note that change tracking cannot be enabled for a table without a primary key.

So that’s it, you now have a database which supports change tracking and a table which is being tracked.

As soon as you enabled change tracking on the table, an internal table was created to record the changes to the data.  This table would be called sys.change_tracking_<object id of table being tracked>.  The size of this table depends on how long you’re keeping the change data for as well as whether the column mask is being recorded.  To see the current size of the table you can use the sp_spaceused store proc.

How do we use it?

Now that we’ve enable change tracking let’s play around with some data to check the changes.

There are a few functions which we’re going to use frequently when dealing with change tracking:

  • CHANGE_TRACKING_CURRENT_VERSION
  • CHANGE_TRACKING_MIN_VALID_VERSION
  • CHANGETABLE

CHANGE_TRACKING_CURRENT_VERSION is a database wide function which returns the version of the last committed transaction.  Every transaction that is committed on a change tracking enabled table will affect the result of this function.

 

CHANGE_TRACKING_MIN_VALID_VERSION accepts as a parameter the object ID of a table.  The result is the minimum change tracking version stored for the table.  These 2 functions give us the range of changes stored.  If the current version of the database is 100,000 and the min version for a table is 45,000 then all clients which have versions greater than 45,000 of the table in question can sync.  If the client’s version is less than 45,000 the client won’t be able to sync.

 

The CHANGETABLE function can be used to get a list of all the changes made to the table when compared to a version.  To do this you would specify that you want the changes to a particular table made since a particular version, like so:

   1: SELECT
   2:     *
   3: FROM
   4:     CHANGETABLE(CHANGES TableToBeTracked, @CurrentVersionNumber) ct

You’ll notice that the function is a table valued function and that it has to be aliased.

Also, prefixing the table name is the keyword CHANGES.  This specifies that you want a list of all the changes made to the table.  The result

You can also pass in the keyword VERSION there which will change the behaviour of the function to accept different parameters and to return the change tracking info for a specific row in the table.  If you use this version of the function you have to specify the primary key columns for the table as well as values for the primary key. E.g.

   1: SELECT
   2:     *
   3: FROM
   4:     CHANGETABLE(VERSION TableToBeTracked, (IDTableToBeTracked), (1)) ct

This returns the change tracking version information about the row with the PK (IDTableToBeTracked) equal to 1.  This will allow you to see the version of the row in question.  You can use this to determine what effect the changes on the row had on the version of the db.  You can use this to check if a client and a server have both made changes to the same row.

Tying it together

Let’s take these functions and try and do something useful.  Let’s assume that a client has synced with the server at version 100, has gone away for a while and come back when the server was at version 120.  How does the client go about trying to figure out what data has changed?

The easiest way would be to query the CHANGETABLE function and specify 100 as the current version number.  This would give us all the operations that have occurred on that table since version 100.

   1: SELECT
   2:     ct.SYS_CHANGE_VERSION,
   3:     ct.SYS_CHANGE_OPERATION,
   4:     ct.IDTableToBeTracked
   5: FROM
   6:     CHANGETABLE(CHANGES Table2, 100) ct

 

Which results in:

results

We now know all the ID’s which have been changed since version 100.

The SYS_CHANGE_OPERATION column tells us what the operation was (I = Insert, U = Update, D = Delete).  Careful though, the delete operations can occur on rows that were created after the previous sync and can therefore be ignored as we’re going to look at the data as it is now.  So if we change our query to this:

   1: SELECT
   2:     ct.SYS_CHANGE_VERSION,
   3:     ct.SYS_CHANGE_OPERATION,
   4:     ct.IDTableToBeTracked,
   5:     -- Select columns here
   6:     Table2.Data
   7: FROM
   8:     CHANGETABLE(CHANGES Table2, @PreviousSyncedVersion) ct
   9:     LEFT JOIN Table2 ON Table2.IDTableToBeTracked = ct.IDTableToBeTracked
  10: WHERE
  11:     CASE ct.SYS_CHANGE_OPERATION
  12:         WHEN 'D' THEN COALESCE(ct.SYS_CHANGE_CREATION_VERSION, @PreviousSyncedVersion)
  13:         ELSE @PreviousSyncedVersion
  14:     END <= @PreviousSyncedVersion

 

We get the following results:

results2

This tells us which rows have changed since the last sync and gives us the current values of the table.

These queries should probably be done in a single transaction set at snapshot isolation so that you can be sure that the data you’re working with isn’t changing while you’re processing it.

Round-up

We’ve taken a look at the basics of change tracking from a db perspective, why we’d want to use it as well as how to query the underlying change tracking tables but still have no idea of how to pull the pieces together to make a sync framework.

Next time

Next time we’ll be taking a look at tying the SQL change tracking into a simple app using Sync Services to get the data to synchronise between a client and server database.

Stay Tuned!

Posted by CalmYourself

Managing your SQL Log files - Part 2

<PreAmble type="rantish-apology">Well it seems my home internet just isn't playing along.  I wrote this (and the previous) post in Windows Live Writer but every time I wanted to publish, my internet at home stopped working.  To make matters worse, Windows Live Writer doesn't work behind my company firewall.  So anyway, I copied and pasted this and the previous post out of Live Writer, but the formatting gets kinda wonky.  Sorry about that, hope the content makes up for it.</PreAmble>

You'll be pleased to hear that I have gotten the hint.  The point I was trying to make last time didn't come across quite as clearly as I had hoped.  I'll attribute that to my poor writing skills, I'm working on it.

Jumping right in with the clarifications: The goal of this series of posts on log files is to aid you when choosing a backup and recovery method for your databases.  That is, after all, the main point of having a log file.  Ensuring maximum availability and minimum data loss exposure.  In order to discuss the data loss risk you're exposed to and how the log file plays a part we need to understand the various types of backups and recovery models available.  When you have that knowledge, you can apply it to your application and choose the most appropriate backup and recovery procedures for your scenarios.

 

backup and restore

There are numerous types of backups you can create for your database:

  • Full database backup
  • Differential database backup
  • Partial full database backup
  • Partial differential database backup
  • Full file backup
  • Differential file backup

A full database backup creates a full backup of the entire database - the data and the log files are backed up here.  This is the most common backup that I see in production databases (tragic - but we're coming to that later).

A Differential database backup is a backup of the data and log files in the database, but it only contains the data extents that have changed since the last full backup.  Creating a full backup on the 10th of June and a differential backup on the 12th will allow you to restore to the 12th of June.  Creating an additional differential backup on the 14th of June will mean that the differential backup created on the 12th is no longer needed during the restore process.  This is because a differential backup contains the data that has changed from the last FULL backup, not differential backup.

Partial backups were introduced in SQL 2005 (and are being carried through to 2008 as well) and are used to optimize backups of database which contain data that rarely changes (or read-only data).  If a database has several read-only tables, there is little purpose in backing these tables up after the initial full backup.  A partial backup contains all the read/write files and filegroups and any read-only filegroups that are additionally specified.

Partial differential backups are similar to Partial backups except that they only backup data extents that have changed since the last partial backup of the SAME SET of files and filegroups.

File backups allow you to backup and restore individual files or filegroups in your database.  If you have a database spanning multiple drives you can create a file backup of each drives filegroups.  Upon a hard drive failure, you will only need to restore the filegroups that were stored on the drive which crashed.  Note: if your database is in Simple Recovery mode, this won't help you since file backups of database in Simple Recovery mode are restricted to read-only secondary file groups.

Partial file backups are the same, but store the data extents that changed since the last full backup of each file group.  In simple recovery mode, the data is assumed to be read-only since the last full file backup.

 

Restoring a database involves 3 simple phases:

  1. Data Copy
  2. Roll Forward
  3. Roll Back

During the Data Copy phase, all filegroups are initialised and then the data from one or more full backups is copied.  If there are any differential backups which apply, they are copied after the full backup data has been copied.

The Roll Forward phase involves applying all the committed transactions in the transaction log but weren't committed to the data file.  Because all changes are written to the log file before the data file, we are guaranteed that all transactions that have been committed are in the log file.  The time taken for this phase is dependant on the recovery interval for the database.

After the roll forward phase, the database may contain changes which were made by uncommitted or incomplete transactions.  The Roll Back phase involves rolling these transactions back so the database is left in a valid state after the restore.

 

recovery models

Understanding the effect that your recovery model has on your backup and recovery procedures is important.  The recovery model that you set your database to affects how the transaction log is managed.  There are three recovery models:

  • Simple
  • Full
  • Bulk-Logged

Simple recovery is just that, simple.  There are no log backups taken and your transaction log file is truncated automatically to keep the size small.  The downside?  You can only restore to the time of the last backup.

Full recovery requries log backups.  The space used by your log file is only freed once a log backup is made, until then, the log file is maintained and the size is continually increased.

Bulk-Logged recovery is similar to Full recovery except that it only logs the extent allocations for the BCP, BULK INSERT, SELECT INTO, CREATE INDEX, ALTER INDEX REBUILD and DBCC DBREINDEX commands.  The transaction log file can be used to restore to a point in time unless one of those commands have been issued.  The Full recovery model logs each individual transaction the the bulk commands generate.

 

risk exposure

Now that we have an understanding of backups and recovery models, we can begin to talk about risk exposure.

Scenario time.  A production database has a full backup taking at 8AM, and a log backup (not differential) taken every 4 hours after that.  There is a SELECT INTO command issued at 12:47.  At 13:14, the drive containing the data file crashes and a tail-log backup is taken.  Let's see how far we can restore with each recovery model.

 

Simple Recovery Mode Bulk-logged Recovery Mode Full Recovery Mode

In Simple Recovery mode, the log file is not used.  This means that the full backup taking at 8AM can be used to restore the database to 8AM but no further.  All changes made to the database after that time have been lost.

Recovery time: 08:00

Log file size: Smallest

The log file is backed up, but the SELECT INTO that was performed is only minimally backed up.  As a result, the full backup and the transaction log backup can be applied but not the tail-log backup after the failure.

Recovery time: 12:00

Log file size: Between full and simple

The log file is backed up and maintained throughout all the transactions, SELECT INTO or otherwise.  When the drive fails, the full backup and the log backup can be applied bringing the databasee to the same state it was in at 12:00.  More than this, though, the tail-log backup can be applied meaning that ALL transactions can be recovered.

Recovery time: 13:14

Log file size: Largest

As shown, when using Full Recovery you can recover your database from an entire drive crash as long as the log file is on a seperate drive to the data file.  Full recovery minimizes your data loss exposure whilst simple and bulk-logged still leave you open to data loss.

 

log file size

Obviously there are bound to be tradeoffs, and size is the largest noticeable tradeoff between the recovery models.  Simple recovery doesn't back the transaction log up and as a result, the space is available for use by other transactions as soon as the entries in the log file are written to the data file (a CHECKPOINT is triggered)

Full and Bulk-logged use the transaction log to enable point in time recovery.  This is true even after the entries in the log file have been written to the data file.  As a result, the log file space is only reclaimed after a log backup has been taken.  For bulk-logged, tail-log backups are useless if there have been any bulk alterations performed on the data between the last log file backup and the failure.  If not, both models support point-in-time recovery.

 

next time

We've mentioned recovery quite a bit, but can we simulate it?  Next time we're going to have scripts and examples showing how to recover to a point in time on a database that breaks or drive that crashes.  I'll also cover a small section on how to choose the most appropriate recovery model for your environment.  Until then....


 

Managing your SQL Log files

Part 2, covering how the transaction log is affected by the recovery model and how it affects data loss, is now out 

OK, everyone who has put some thought into the setup of their transaction log file, please raise your hand...

 

...I thought so.

 

The sad truth is that there are very few people who think through the log files for their production systems - often with dramatic results.  Each and every single SQL Server database has both a data file and a log file, both of which serve an important function.  The data file stores the actual data in the database whilst the log file stores all changes made to the data stored in the data file.  Why?  What is the purpose of storing (basically) the same information twice?

why do we need a log file?

Log files help satisfy the ACID requirements of RDBMS's.  Just about every command that gets issued against your database is stored in the log file (there are some exceptions).  Some clarification: Every command you issue stores the following in the log file:

  • The start time for the transaction
  • Before and after images of the data
  • Whether each transaction was committed or not
  • All allocation changes to database pages
  • As well as a few other things...

Why? I mean, what is the point?  OK, example time.  You are a DBA responsible for maintaining a line of business database.  Being the conscientious DBA you are, you have scheduled regular full backups - twice a day in fact!  Now, image a situation where 3 minutes and 12 seconds before a backup is scheduled to run some twit of a developer forgets to put a where clause in his update statement (or you get a torn page or something, anything bad).  What do you do?

I'm going to assume that the developers transaction has been committed.  First and foremost, you decide to restore your last full backup... from a full half day ago.  Doing well so far, but where do you get the rest of the days transactions from?  The transaction log file, of course!  After you've restored your latest backup you can roll the transaction log forward to recover the rest of your transactions meaning that even though your most up to date backup was out of date, you haven't lost anything.  True story!

how does it work?

The reason this works is very simple.  SQL Server does not commit any changes to the data file unless they have been committed to the log file first.  This ensures that no matter what happens, no data is written to disk without there being a record in the transaction log.  This is known as a Write Ahead Log (WAL).  The process behind WALing works as follows: SQL maintains a cache of data pages which are to be read or updated.  Whenever data is modified, the cache is modified NOT the physical data file.  The modification is only written to the physical file once a checkpoint is reached or SQL needs to use the cache for something else.  Every modification results in an entry in the transaction log.  The transaction log entry is written to disk BEFORE the data modification.  If it were to happen the other way around, the data modification could create a transaction which you could not roll back from.

In this way, you have a guaranteed source of data as well as an exact plan of how to get there unless you mess your backups and checkpoints up... which brings us to the next point:

"my log file is too large!"

The most common complaint regarding log files is "My log file is too large!"  Granted, this is a problem, and I'm fairly sure many of you have already experience this problem, but I'm willing to bet that this has become a problem due to the way you've decided to manage your log file.  If you've left your database setup to someone who doesn't understand how log files work you are almost guaranteed to have this problem.

As mentioned before, log files are used to store a history of all transactions that have been performed against your database.  Straight away you would think this would cause a problem.  A record of every single transaction?  How is that NOT going to get extremely large?  Valid question, and the answer is in how the log file works.  Every time you make a backup of your database you are invalidating the need for a record of all the transactions before the backup.  Granted, you can't rollback transactions anymore (not entirely true) but you have a copy of the data somewhere other than in the data file.

In order to help SQL Server maintain the transactions which can have their logs discarded and the transactions which can't, the transaction log file is split into an Active and a Inactive portion.  The active portion cannot be altered as this is the portion of the log file which is used to roll-forward transactions that were not saved at the last checkpoint.  SQL Server keeps a pointer to the start of the active portion using a MinLSN (Minimum recovery Log Sequence Number).  This is a pointer to a transaction in the log file (not necessarily the start) which tells SQL Server that everything after the pointer is an Active Transaction, everything before the MinLSN has been written to disk and can be altered.

The MinLSN gets moved when one of the following occurs:

  • A BACKUP LOG command completes
  • A CHECKPOINT command completes (only in Simple recover model) - either by using the CHECKPOINT command, or by the system issuing a checkpoint.

CHECKPOINT's are cause by the following:

  • Executing the CHECKPOINT command
  • A transaction not logged completely is executed (Bulk inserts, Select Into's...)
  • A database backup is taken
  • The database is shutdown
  • Database files are altered in some way (ALTER DATABASE commands)
  • The database issues checkpoints based on the Recover Interval (SQL estimates how many logs can be processed during the Recovery Interval)
  • In Simple Recovery mode, when the log file becomes 70% full

Don't expect your log file to shrink in size simply because you issue a CHECKPOINT command even if you are in simple recovery mode, if you have any long running transactions SQL cannot commit it and therefore cannot update the MinLSN.

 

OK, so we know that log files help ensure that we have an up-to-date copy of the data at all times and we know how it does this.  We also know that we need to be careful about choosing our recovery model and our backup procedures, but we don't have any guidelines yet as to how to do this.  To that end, next time we'll be taking a look at the various recovery models and the way they affect the growth of the transaction log.  Until then though, take a look at your log files and the general sizes they go through during the day.  See if you can get it down to a relatively consistent size regardless of the type of transactions being issued against it.

What exactly is 4th (fourth) normal form?

Haven’t posted in a while, sorry about that.  My planned series on data-mining seems to have lost some steam, but I will try and get back to it sometime in the future, rest assured.

There was a SQL Question of the Day asked on this topic over at SADeveloper.net 

What is database normalization?

Database normalization is a design techniques aimed at reducing redundancy and inconsistency of data.  There are various stages of normalization, starting with first normal form and proceeding through to sixth normal form (including Boyce-Codd normal form and domain/key normal form along the way).

1st Normal form removes duplicate records and allows for unique identification of a row in the table

2nd Normal form states that all other fields in a table are dependant on the ENTIRE primary key, not a part of it.  If a field is dependant on a part of the primary key, it can be removed into a separate table.

3rd Normal form enforces that all the values in a row of the table are dependent on the primary key only and no other fields.  An example would be a table where EmployerID is the primary key, but there is a field [Dialing Code] which is dependant on the [Phone Number] field (contrived example, but work with me here) then that is a violation of 3rd Normal Form.

Boyce-Codd normal form is similar to 3rd normal form except for the additional requirement that there must be no other dependencies on fields in the tables other than the fields that are a superkey.

English: A superkey is a collection of fields which can be used to uniquely identify a row.  There must be no columns in your table that are dependant on a subset of a superkey (as long as the subset is not also a superkey).

Example: Consider the table Animals(AnimalType, AnimalBreed, AnimalDescription, Type Code).  The table satisfies 3rd normal form, however Type Code is dependant on the AnimalType column only (AnimalType is a subset of the superkey but is not a superkey itself).  Thus, Animals is not in Boyce-Codd normal form and there is redundant, duplicated data.

These initial forms of normalization are standard and many database designers find them almost second nature.  In an effort to avoid redundancy, this is a noble effort but not enough – there is still the possibility for data redundancy.

Remaining redundancy

Consider the situation where a ternary relationship exists in a database, for our example, we’ll consider a 3-way relationship.

I realize that the example is completely contrived and unrealistic, but it serves to show that redundancy can still exist when a database is in Boyce-Codd normal form.

Given three tables Company, Area and Service; and the services offered do not differ by area, having an associative entity relating the three tables together would result in redundant data.  

The table below satisfies Boyce-Codd form as there are no non-key elements in the table for there to be other dependencies.

Company

Area

Service

Green Pastures Sandton Carpeting
Green Pastures Sandton Cleaning
Green Pastures Fourways Carpeting
Green Pastures Fourways Cleaning
Green Pastures Midrand Carpeting
Green Pastures Midrand Cleaning
Strong Arm Carpeting SeaPoint Armed Response
Strong Arm Carpeting SeaPoint Neighborhood Patrol
Strong Arm Carpeting Camps Bay Armed Response
Strong Arm Carpeting Camps Bay Neighborhood Patrol
Strong Arm Carpeting Stellenbosch Armed Response
Strong Arm Carpeting Stellenbosch Neighborhood Patrol

The table does, however, have redundant data.  If Green Pastures decided to add Upholstery as one of their services there would have to be a new entry for each area that Green Pastures services.

What is 4th normal form?

4th Normal form is another step towards removing redundancy in data that may be left over after Boyce-Codd normal form.  The main point is to remove multiple dependencies – for our example, the dependencies between company and area and company and service.

The rule is that a Boyce-Codd normal form database is in 4th normal form if all the multivalued dependencies in the database are functional dependencies.

A dependency (e.g. column x can be determined by using column a, b and c) is a functional dependency if for every a, b and c; there is only value for x.  In the table above Service is dependent on the Company, but given Company you cannot determine only 1 value of Carpeting and Cleaning.

If we split the table into two separate tables linking Company to Area and Company to Service, our database is then in 4th normal form (assuming the other tables are also ok)

Hope that sheds some light on the little-known 4th normal form.  It does help reduce data redundancy but normalization to this extent might not always be desirable.  Code tends to get more complicated with extreme normalization.  There are other, stronger forms of normalization too, but these are used even less and complicate the matter even more.

Posted by CalmYourself | 4 comment(s)
Filed under: ,

SQL Rand() trap - Same value per row?

While preparing the next article in my data mining series I needed to use the RAND() function in SQL but got quickly reminded that not everyone will know that RAND() will only generate one random number per query.  To illustrate:

The Problem:
Execute the following SQL: SELECT RAND()

Gives you a nice random number between 1 and 0. Perfect.  Now, run the query multiple times. A different random number each time, great.  Now, create a temporary table, populate it with some data and run the SELECT from the table (code below)

SET NOCOUNT ON

CREATE TABLE #Temp
(
    IDItem INT NOT NULL PRIMARY KEY IDENTITY(1, 1),
    ItemDescription VARCHAR(30) NOT NULL
)

INSERT INTO #Temp (ItemDescription) VALUES ('Fish')
INSERT INTO #Temp (ItemDescription) VALUES ('Meat')
INSERT INTO #Temp (ItemDescription) VALUES ('Bread')
INSERT INTO #Temp (ItemDescription) VALUES ('Milk')

SELECT RAND() FROM #Temp

DROP TABLE #Temp

Oops, same value for each and every row!  What if we wanted a new random number for each row?  An example would be generating quantities for items, for each item in the table generate random numbers that represent the quantity of that particular item.

A Step In The Right Direction:
A fix would be to have the random function called once per row. There are a of standard couple ways we can achieve this:

  • Processing the data in a cursor, row by row
  • Processing the data in a while loop, row by row

The issue with these is that cursors and while loops are horribly inefficient.  They have their place, but if they can be avoided it's probably best to do so.  There is another way of achieving this which doesn't involve row by row operations.  What needs to happen is we need to force RAND to choose a new seed for every row, but we need to trick SQL into letting us do that.

Problem with the step:
If we can call RAND from a user defined scalar function, the function will initialize the seed for RAND everytime it is called.  This will result in a new random number for every row in the result if we call the function for every row.  But there's a catch, we can't call RAND from within a user defined function.  If you try the following:

CREATE FUNCTION dbo.UDF_Q_RandomNumber()
RETURNS NUMERIC(18, 10)
AS
BEGIN

    RETURN (SELECT RAND())
END

you will get the error: Invalid use of 'rand' within a function.

Fix for the problem:
That''s not a problem though, because once again, we can trick SQL into working for us.  Create a view which selects a random number, and inside the user defined function, simply select a result from the view, observe:

CREATE VIEW dbo.vRandomNumber
AS
SELECT
RAND() AS RandomNumber
GO

CREATE FUNCTION dbo.UDF_Q_RandomNumber()
RETURNS NUMERIC(18, 10)
AS
BEGIN

    RETURN (SELECT TOP 1 RandomNumber FROM vRandomNumber)
END
GO

So now we have a user defined function which returns a random number each time it is called, which is perfect.  Lets throw that into our table query and see what happens:

SET NOCOUNT ON 

CREATE FUNCTION dbo.UDF_Q_RandomNumber()
RETURNS NUMERIC(18, 10)
AS
BEGIN

    RETURN (SELECT TOP 1 RandomNumber FROM vRandomNumber)
END
GO

CREATE TABLE #Temp
(
    IDItem INT NOT NULL PRIMARY KEY IDENTITY(1, 1),
    ItemDescription VARCHAR(30) NOT NULL
)

INSERT INTO #Temp (ItemDescription) VALUES ('Fish')
INSERT INTO #Temp (ItemDescription) VALUES ('Meat')
INSERT INTO #Temp (ItemDescription) VALUES ('Bread')
INSERT INTO #Temp (ItemDescription) VALUES ('Milk')

SELECT *, CAST(10 *  dbo.UDF_Q_RandomNumber() + 1 AS INT) RandomNumber FROM #Temp

DROP TABLE #Temp

If you run this query you will get all the values from the #Temp table as well as a random integer between 1 and 10 (including) for each row.

Happy randoming! 

Posted by CalmYourself | 6 comment(s)
Filed under: , ,

SQL Server 2005 Data Mining Series - Part 1 - Versions

This is the first part of a multi-part series covering data mining in SQL Server 2005.  This edition covers the absolute basics of what you will need, i.e. making sure you have the right version of SQL to do what you want.

There are numerous versions on SQL Server 2005 available, each of which has certain features, limitations and costs.  Each edition may have certain OS limitations which need to be taken into account too.  For the purpose of this entry, however, I will be focusing only on which features each addition brings to the table which relate directly to data mining and the other business intelligence features of SQL 2005.  As far as SQL Business Intelligence goes, there are a number of features which are available:

  • Data Mining
  • Perspectives
  • Translations
  • Proactive Caching
  • Partitioned Cubes
  • Integration Services

There are more SQL features which fit under the hat of Business Intelligence, but these are the features I will be using to demonstrate basic data mining in SQL 2005.  SQL Data Mining is the process of analysing data in order to find patterns and rules pertaining to the data.  Perspectives is the means by which a data warehouse can define how a particular Data Model will look based on who is looking at it.  Translations provides the capability of all data in a data warehouse being stored in one localization whilst analysts viewing the data in other markets would see the data formatted for their currency etc.  Proactive caching means that you can specify when each of your data cubes will be updated, based on either an event (new data being inserted) or a time lapse.  Partitioned cubes allow for great performance gains by allowing a single cube to employ partitions on multiple analysis servers.  And, of course, Integration Services (which replaces DTS) is Microsofts ETL (extract, transform, load) product allowing you to transfer data from multiple data sources, perform data cleaning and a host of other nice data transformation steps (this is an entirely extra topic on it's own)

Each edition of SQL 2005 (Express, Workgroup, Standard and Enterprise) supports a certain set of these features.  See matrix below:

Feature Express Workgroup Standard Enterprise
Data Mining No No Yes Yes
Perspectives No No No Yes
Translations No No No Yes
Proactive Caching No No No Yes
Partitioned Cubes No No No Yes
Integration Services No Yes Yes Yes

SQL Server 2005 Express edition is pretty much useless when it comes to doing any analysis wich the Workgroup edition only providing SSIS features.  As far as basic data mining and analysis goes, the Standard edition is more than capable but for hard-core, distributed Business Intelligence analysis you need the Enterprise Edition.  The rest of the entries in this series will be working with the SQL Server 2005 Enterprise Edition as this is basically the developer edition.  I will, however, only be using features available in the standard edition as this provides the minimum features required and allows for a basic introduction to data mining in SQL Server 2005.

Next time round, I'll cover loading various databases with random (but still useful - it contains loaded patterns) information and extracting all the info into a single data warehouse.  Stay tuned!!

Posted by CalmYourself | with no comments
Filed under: , ,

Practical threading practices

How many times have you needed to do some quick multithreading only to realise it isn't quite as quick?  Setting a thread up and getting it going is an easy task, but managing that thread isn't.  The majority of times that I'm using threads I'm using them to call a method which blocks until informat