T-SQL Vs. CLR

By , May 16, 2010 17:07

A colleague of mine asked me if I knew how to convert a character string to a decimal(18,3). Our system was receiving flat files from a third-party system, containing master data that is also needed by our application. There is no point in entering the data twice, so we import it; one of the methods of doing so is to load the file contents into temporary tables in the database and then take over with T-SQL procedures.
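
To make the staging step concrete, the import looks something like the sketch below. This is only a sketch: the file path, delimiters and column layout are placeholders, not the actual feed definition.

--Sketch of loading a delimited flat file into a staging temp table.
--Path, delimiters and columns are placeholders.
if OBJECT_ID('tempdb..#MasterDataStage') is not null
 drop table #MasterDataStage
create table #MasterDataStage (
 [PartNo] varchar(35),
 [Price] varchar(35) --numeric values arrive as text, hence the conversion problem below
 )
bulk insert #MasterDataStage
 from 'C:\Import\masterdata.txt'
 with (fieldterminator = ';', rowterminator = '\n', firstrow = 2)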

Instincts Reign

“No big deal,” I thought, and my immediate response was to go with a CAST or CONVERT, because I do like to solve problems in T-SQL whenever possible. Then my colleague told me that the character values contain a digit grouping symbol (,) and a decimal symbol (.), with three digits after the decimal symbol, e.g. 1,684.582 (in words: one thousand six hundred and eighty-four point five hundred and eighty-two).

I knew that my idea wasn’t going to work, but I tried it anyway:

declare @val varchar(35) --this is the input format
set @val = '1,684.582';
select CAST(@val as decimal(18,3))

And got this error message (as had my colleague before me):

Msg 291, Level 16, State 1, Line 3
CAST or CONVERT: invalid attributes specified for type ‘varchar’
 

I told him that I could write a CLR user function to do the conversion in .NET and would get back to him.

using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
  [Microsoft.SqlServer.Server.SqlFunction]
  public static SqlDecimal ufn_StringToDecimal(SqlString inputString)
  {
    // Return NULL for NULL input rather than letting .Value throw.
    if (inputString.IsNull)
      return SqlDecimal.Null;

    try
    {
      // Convert.ToDecimal uses the current culture's digit grouping and
      // decimal symbols; the resulting decimal converts implicitly to SqlDecimal.
      return Convert.ToDecimal(inputString.Value);
    }
    #region error handling
    catch (FormatException fex)
    {
      throw new FormatException(fex.Message);
    }
    catch (InvalidCastException icex)
    {
      throw new InvalidCastException(icex.Message);
    }
    catch (OverflowException oex)
    {
      throw new OverflowException(oex.Message);
    }
    catch (Exception ex)
    {
      throw new Exception(ex.Message);
    }
    #endregion error handling
  }
};

I am by no means a C# expert, so if anybody out there has ideas on how to go about this better, feel free to let me know. In particular, I would like to know whether there is a better way to pass exception information back to the SQL client.

Once I had completed the coding and had deployed the assembly and user function to the Dev-Database, I went to my colleague’s office to let him know he could do some testing with his import procedure. This was when I found out that I had gone to the effort for nothing (or so I thought).
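
For reference, the manual deployment to the dev database went along these lines. This is a hedged sketch: the assembly name and file path are placeholders rather than the real ones, and Visual Studio can also deploy the assembly and function for you.

--Enable CLR integration once per instance (if not already enabled).
exec sp_configure 'clr enabled', 1
reconfigure
go
--Register the assembly and wrap the .NET method in a T-SQL function.
--Assembly name and path are placeholders.
create assembly StringUtils
 from 'C:\Assemblies\StringUtils.dll'
 with permission_set = safe
go
create function dbo.ufn_StringToDecimal (@inputString nvarchar(4000))
returns decimal(18,3)
as external name StringUtils.[UserDefinedFunctions].ufn_StringToDecimal
go
--Quick smoke test; the expected result assumes English-style regional settings on the server.
select dbo.ufn_StringToDecimal('1,684.582') --1684.582
go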

Another one of my colleagues had had the same problem in a project he had dealt with. This is the code he provided the first colleague with:

declare @val varchar(35)
set @val = '1.123,456';
select CAST(REPLACE(REPLACE(@val, '.', ''), ',', '.') as decimal(18,3))

Not very pretty, but it works (well, that depends – see Regional Problems below), and to be honest it was one of the things I had considered before deciding to resort to CLR.

Regional Problems

One of the reasons I decided to go for CLR was that I thought I was going to have to read the operating system’s regional settings to get the various symbols, in order to replace them in the string being passed into the function. That is not possible using T-SQL, but it is easily done in .NET. These symbols vary depending on the regional settings of the server running the SQL Server service, which was something my two colleagues hadn’t thought of 🙂 Besides, I was looking for a reason to do a bit of coding in my freshly installed Visual Studio 2010 Ultimate, which I don’t get to do too often. As it turns out, .NET does all of the work of finding the correct symbols based on the system’s regional settings for you, so all I had to do was a simple Convert.ToDecimal() and .NET did the rest.

Was my effort all for nothing?
I hate doing something for nothing, and I was sure that my solution was going to perform better than two REPLACEs and a CAST, particularly when large amounts of data were being imported in one session.

So I got on to my ThinkTank database and did a bit of testing.

I ran the code that the second colleague provided and the result was 1123.456, which is what I expected. Then I had a closer look at the data that my first colleague provided me with and ran the select with one of his values. Notice the different grouping symbols.

declare @val varchar(35)
set @val = '1,684.582';
select CAST(REPLACE(REPLACE(@val, '.', ''), ',', '.') as decimal(18,3))

The result of the select was 1.685. So we had lost the digits after the decimal symbol, and the value was 1000 times smaller than the actual input (talk about data compression)! The reason is that here in Germany (where we work and have our development systems) the digit grouping symbol and the decimal symbol are exactly the reverse of those used in the English-speaking world.
So to a German system 1.684,582 means what 1,684.582 means to an “English” system. (The database my colleague was importing data into is situated in the USA.) The second colleague had simply assumed that he would always be dealing with input in a certain format and was not allowing for his routine to run with different regional settings.

This was the first knockout for the REPLACE routine. I would hate to have to go through all the import routines when deploying a database to another country, just to swap ‘,’ for ‘.’ and vice versa.

Next I compared the T-SQL routine (using REPLACE()) with my user function, using 10,000, 100,000 and one million rows respectively in a temp table. I started off doing simple selects but found the overall run times of the two to be very similar, although my solution was the clear winner as far as CPU time was concerned; most of the time was being used up by the client to display the results (see the performance graphs below). Then I decided to be a little more realistic and, rather than just selecting, I inserted the results into physical tables. After all, I don’t know too many people who want to visualise a million rows at any one time.
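
The test setup looked roughly like the sketch below. This is not my exact script: the #TestStrings table, the row count and the SELECT … INTO targets are stand-ins for what I actually used, and it assumes the CLR function has been deployed as dbo.ufn_StringToDecimal.

--Minimal sketch of the comparison harness.
if OBJECT_ID('tempdb..#TestStrings') is not null
 drop table #TestStrings
go
create table #TestStrings ([Val] varchar(35))
insert into #TestStrings ([Val])
 select top 1000000 '1,684.582' --the problematic English-formatted value
 from master.sys.all_columns a cross join master.sys.all_columns b
go
set statistics time on
go
--REPLACE + CAST version (returns the wrong value for this input, as shown above;
--here we are only interested in the timings)
select CAST(REPLACE(REPLACE(Val, '.', ''), ',', '.') as decimal(18,3)) as Converted
 into #Result_Replace
 from #TestStrings
go
--CLR version
select dbo.ufn_StringToDecimal(Val) as Converted
 into #Result_CLR
 from #TestStrings
go
set statistics time off
go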

This is a summary of what I observed:
(I ran each query 5 times and these are the averages)

Comparison Results

The performance of the CLR solution was much better than that of the REPLACE in T-SQL, not only in elapsed time but also in CPU usage.

The performance graphs for the CLR solution

These are the performance graphs that I was monitoring while running my tests on the CLR solution for one million rows. I used Process Explorer from Sysinternals to monitor CPU and memory usage as well as IO. Sysinternals’ website is well worth a visit if you don’t already know it. I am not including these graphs to show the difference between the two solutions, as I have already done that above, but rather to show how closely performance indicators should be monitored, depending on what and how you are testing.

Performance graphs: System Information, System Performance, SQL Server Performance, Management Studio Performance

The first batch (the first set of performance peaks) was the insert, and the IO activity testifies that data was being written to disk. We can also see that the CPU was used by the SQL Server service and not by the client application (which, in this test, were running on one and the same machine).

The second batch (the second set of performance peaks) was just the select, in which the results were displayed in Management Studio. Here we can clearly see a shift in CPU usage from the SQL Server service to Management Studio.

Displaying a million rows in a client application is a fairly unlikely scenario, while building a set of data that contains 1 million rows to work with isn’t. Keep this in mind when monitoring CPU usage while testing your code. Depending on how your solution is built, you may need to consider that the client-app will be hogging the lion’s share of the CPU time. I find this to be a valuable piece of information.

The performance graphs for the replace solution were very similar so it doesn’t serve any purpose to include the screenshots here. The major differences were that CPU usage was higher than for CLR and the runtime, as indicated in the table above, was almost double that of the CLR version.

What does all this mean?

I can only conclude that the CLR solution wins across the board: it is faster, uses less CPU, and is safe across different regional settings.

What do you think?

The ever elusive Dynamic Pivot

By , April 10, 2010 17:47

Introduction
Recently, I have read many forum entries by frustrated SQL users who have had problems dynamically building a list of columns and rows when the number of these is variable. I have also read, in the same forums, the argument that this sort of functionality belongs in the reporting application and not at the database level. Although that makes perfect sense, there are situations in which there is simply no reporting application available or, as weird as it may seem, the reporting services built into SQL Server since version 2005 are explicitly not to be used. These are the typical day-to-day problems many of us face in our jobs.

Some time ago I was faced with exactly this problem and was required to build a “report” for a customer who needed a list of ordered part numbers and their required quantities for any given day. At first this sounded like a simple SELECT and GROUP BY, but it turned out that what the customer wanted was a list with part numbers as column names, dates as row names, and the accumulated quantities of parts per date as the body of information.

At the time, I was unable to find any useful information in the usual forums (although I generally try to find a solution without help from outside, this doesn’t always work), so this is what I came up with.

Let’s go for it
First let’s do a bit of DDL. Ordinarily there would be a series of other tables along with FK relationships between the tables as well as a range of primary/unique keys and indexes. For the sake of simplicity, I am not going to reproduce the entire range of DDL here.

use ThinkTank
go

if  OBJECT_ID('tempdb..#PartNumbers') is not null
 drop table #PartNumbers
go
create table #PartNumbers (
 [ID] int identity (1,1),
 [PartNo] varchar(35),
 --[PartUsage] etc...
 )
go

if  OBJECT_ID('tempdb..#VehicleOrders') is not null
 drop table #VehicleOrders
go
create table #VehicleOrders (
 [ID] int identity(1,1),
 [ProdNr] varchar(35),
 [DeliveryDate] datetime,
 --[OrderStatus] etc...
 )
go

if  OBJECT_ID('tempdb..#VehicleOrdersItem') is not null
 drop table #VehicleOrdersItem
go
create table #VehicleOrdersItem (
 [VehicleOrders_ID] int, --references #VehicleOrders.ID
 [PartNo_ID] int
 )
go

Now for the data

--let's work with 10 part numbers
insert into #PartNumbers (PartNo)
values
('AW93 54502B98 ADWAMN'),
('AW93 54502B98 ADWNUG'),
('AW93 54502B98 AF0AMN'),
('AW93 54502B98 AF8NUG'),
('AW93 54502B98 AG0AMN'),
('AW93 54502B98 AG8NUG'),
('AW93 54502B98 BD0AMN'),
('AW93 54502B98 BDWAMN'),
('AW93 54502B98 BDWNUG'),
('AW93 54502B98 BF0AMN')
go

--Each vehicle can consist of 1 - n parts.
insert into #VehicleOrders (ProdNr, DeliveryDate)
 select top 150000
  --Random production number
  'P1021' + cast(dbo.ufn_getRandomInt(100000,250000) as Varchar(35)),
  --add random number of days to date so that we have a 10 day range
  DATEADD(dd, dbo.ufn_getRandomTinyint(1,10), GETDATE())
  from ufn_MillionTally()
go

insert into #VehicleOrdersItem( VehicleOrders_ID, PartNo_ID)
 select top 1000000
  --Random production number id between 1 and 150000
  dbo.ufn_getRandomInt(1,150000),
  --random id from #PartNumbers; we want to avoid PartNo_ID 7
  --so that one part number ends up with no orders at all
  replace(dbo.ufn_getRandomTinyint(1,10), 7, 1)
  from ufn_MillionTally()
go

The core problem here is that there could be any number of part numbers (an ever-changing list in the automobile industry), and the number of dates is also variable (no fixed Monday-to-Friday schedule: trade union meetings, public holidays, strikes, extra shifts, etc.). Furthermore, there could be part numbers that have not yet been ordered, or that have been discontinued but are still in the master data table #PartNumbers.

This was my first simple select and group by mentioned earlier.

--we only want the date info, time is not relevant for this report
select CAST(vo.DeliveryDate as date)as [Date],
  pn.PartNo,
  count(vi.PartNo_ID) as Quantity   
  from #VehicleOrdersItem vi inner join #VehicleOrders vo
 on vo.ID = vi.VehicleOrders_ID
  inner join #PartNumbers pn
 on pn.ID = vi.PartNo_ID   
  group by CAST(vo.DeliveryDate as date), pn.PartNo
order by  [Date], pn.PartNo
go

Essentially, the information returned is what was required, but not formatted in a way that is easy to use.

What was required was the information from the above select, but formatted in such a way that each date is accumulated into a single row and the part numbers are the names of the columns. As it is impossible to predict what and when the customer is going to order, both of these are variable (well, the “what” is predictable to a certain degree, but I don’t want to delve into the ins and outs of suppliers’ nightmares in the automobile industry). Thus it is not possible to define a fixed set of column and row names.
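
For contrast, if the list of part numbers were fixed and known up front, a static PIVOT would be enough. Here is a simplified sketch with just two hard-coded part number columns (it counts the order items directly rather than pre-aggregating, but the shape of the result is the same):

--Static PIVOT sketch: only feasible when the column list is known in advance.
select DeliveryDate as [Date of Usage],
  [AW93 54502B98 ADWAMN], [AW93 54502B98 ADWNUG]
 from
 (
  select cast(vo.DeliveryDate as date) as DeliveryDate, pn.PartNo, vi.PartNo_ID
   from #VehicleOrdersItem vi
   inner join #VehicleOrders vo on vo.ID = vi.VehicleOrders_ID
   inner join #PartNumbers pn on pn.ID = vi.PartNo_ID
 ) as RSet
 pivot
 (
  count(RSet.PartNo_ID)
  for RSet.PartNo in ([AW93 54502B98 ADWAMN], [AW93 54502B98 ADWNUG])
 ) as PTable
 order by DeliveryDate
go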

Enter the dynamic pivot!

I used dynamic T-SQL to build the actual PIVOT statement, so that the variable parts could be inserted in a separate step and the whole thing executed at the end.

declare @PartNoList nvarchar(max) = ''
declare @PartNoListIsNull nvarchar(max) = ''
declare @PivotString nvarchar(max)  =
'select cast(DeliveryDate as date) as [Date of Usage], ?FieldListIsnull   
 from   
 (   
  select cast(vo.DeliveryDate as date) as [DeliveryDate],
   pn.PartNo, isnull(count(*),0) as Quantity   
   from #VehicleOrdersItem vi inner join #VehicleOrders vo
 on vo.ID = vi.VehicleOrders_ID
   inner join #PartNumbers pn
 on pn.ID = vi.PartNo_ID   
   group by CAST(vo.DeliveryDate as date), pn.PartNo
 ) as RSet
 pivot
 (
  sum(RSet.Quantity)   
  for RSet.PartNo in   
   (
    ?FieldList   
   )
 ) as PTable
 order by DeliveryDate'

Build a list of part numbers:

--@PartNoListIsNull is used to display the part numbers and is required
--to allow for part numbers that have no orders within the viewed time frame.
--Remember, we avoided orders for PartNo_ID = 7 while building test data earlier.
select @PartNoListIsNull =
  @PartNoListIsNull + 'ISNULL(' + QUOTENAME(PartNo) + ', 0) as '
  + QUOTENAME(PartNo) + ', '    
 from #PartNumbers   
order by PartNo   

select @PartNoList =
  @PartNoList + QUOTENAME(PartNo) + ', '    
 from #PartNumbers   
order by PartNo  

--now we cut off the trailing comma
select  @PartNoListIsNull =
  SUBSTRING(@PartNoListIsNull, 1, LEN(@PartNoListIsNull) - 1),
@PartNoList =
  SUBSTRING(@PartNoList, 1, LEN(@PartNoList) - 1)
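
As an aside, concatenating into a variable with an ORDER BY, as above, relies on behaviour that is not guaranteed. A more robust way of building the same lists is FOR XML PATH; here is a sketch for @PartNoList (the ISNULL list can be built in the same way):

--Alternative: build the comma-separated column list with FOR XML PATH
--instead of concatenating into a variable row by row.
select @PartNoList = STUFF((
  select ', ' + QUOTENAME(PartNo)
   from #PartNumbers
   order by PartNo
   for xml path(''), type).value('.', 'nvarchar(max)'),
  1, 2, '')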

Replace the variables in the PIVOT command and execute to get the required result:

select	@PivotString = REPLACE(@PivotString, '?FieldListIsnull', @PartNoListIsNull),
		@PivotString = REPLACE(@PivotString, '?FieldList', @PartNoList)
execute  sp_executesql @PivotString

There are surely better ways of doing this, but it did work for me and fulfilled the customer’s requirements. I have yet to investigate whether or not this sort of code would be prone to SQL injection, but I don’t think so: the column names are wrapped in QUOTENAME(), and this code, packed into a stored procedure, would only take a typed parameter of type datetime, which, as far as I am aware, would not allow malicious code to be injected.
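
For what it is worth, if a date filter were added to such a procedure, it could be passed to the dynamic statement as a typed parameter rather than concatenated into the string. A quick sketch (the @OrderDate parameter and the query itself are made up for illustration):

--Sketch: a typed parameter passed into dynamic SQL via sp_executesql,
--rather than being concatenated into the statement text.
declare @OrderDate date = '2010-04-10'
declare @sql nvarchar(max) =
 N'select count(*) as OrdersForDay
   from #VehicleOrders
   where cast(DeliveryDate as date) = @OrderDate'
execute sp_executesql @sql, N'@OrderDate date', @OrderDate = @OrderDate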

Credit where credit is due:

1. The random function used above was extracted from Eric Fickes’ Blog. Thanks!

2. I wrote the ufn_MillionTally() tally function after closely following an interesting editorial and the resulting discussion on SQLServerCentral.com
This could also be replaced by a simple cross join on the master.sys.all_columns table, as in the sketch that follows. (See also the code download below.)
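
For completeness, the cross-join alternative looks something like this (any sufficiently large catalog view will do as a row source):

--Sketch: generate a million sequential numbers without a helper function
--by cross joining a large catalog view with itself.
select top 1000000
  ROW_NUMBER() over (order by (select null)) as N
 from master.sys.all_columns a
 cross join master.sys.all_columns b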

Code Download
The code is provided as is and should not be run in a production environment before sufficient testing has taken place. Feel free to use this solution if you see fit.

Finally
Since having the idea to write this article I have briefly read about cross table queries, but haven’t had a chance to compare these approaches to my solution yet. I will do so in the not too distant future and will report on my findings.

In the beginning…

By , February 20, 2010 12:03

While this is the first actual post on my blog, it has no reference to SQL or any other technology for that matter.

I am actually working on the first entry, but it is not anywhere near being complete and I didn’t want to just leave the standard “Sorry, you’re looking for something that isn’t here” on the front page of KeepITSQL.com. So here is a short introduction to really say thank you to the online community which I have used a great deal in the last few years. This community is ultimately the reason for me creating this little bit of web in the first place.

There are so many really good quality online sources out there and, among them, a few have inspired me to take the plunge into blogging. I would like to mention just two of the few (I hope they don’t mind):

  1. Brad McGehee has a brilliant knack of being able to explain the sometimes very complex concepts of database administration and putting them into context. He is also brilliant at getting that information out into the public. Check out his website.
  2. Gail Shaw tipped the scales for me when I read her post “Getting here from there”. Gail can really get down to the nitty gritty when it comes to performance issues and in a way that’s easy to understand. (I was really stoked to find a fellow South African right up there on the “International SQL Scene”)

I don’t even hope to be able to compete with Gail or Brad (that would be presumptuous indeed), but I am where I am and would like to start the journey of getting to where they are.

With a little help from my friends, I don’t see why that shouldn’t be possible.

Happy reading

Sean
