Friday, July 27, 2012

MS Sql and my pain.



SQL Server has some really nice features. It’s best isn’t even really a part of the core database; it’s the Management Studio used to access it. If MS Sql is the only thing you’ve ever used, you’re spoiled. If you’ve never used MSSQL, you don’t know what you’re missing. It really does provide great tools for scripting data constructs, exporting and importing data, database profiling, etc.

But it always annoys me what is missing from MS tools and how we’re told we don’t really want it anyway. In SQL, it’s the “group by”. Suppose you have a set of sales records in your database. You want to find any that don’t have master account records (because something failed) and add the account record, so you won’t have dangling transactions. (Looking at the dangling transactions, dangling participles seem harmless.)

Now, I know… I know. You should have a database constraint that creates a foreign key between the sales table and the accounts table so this can never happen. But, the world should be at peace too. You can’t always control those things.

So you sit down to write the query: “select account_number from sales where [it doesn’t exist in accounts] [then insert it]“. And you realize that there can be multiple sales on that account, but account number is unique in the accounts table. So all you want to do is de-dupe a little.

So far this isn’t too bad. Until you realize that “Customer_Name” is in both sales and account. From a pure normalization perspective, maybe that doesn’t make sense, but it actually does make some sense from a practical  view, since not all the sales on the account may be done by the same person. And the world still isn’t at peace, so you shrug it off and move on.

But how do you do it? You need one of the names from sales to put into the account. Frankly, which one doesn’t really matter. So you try the obvious:

Select account_number, Customer_Name from sales group by account_number.

This works in MySql. It gives one row per account number with a somewhat randomly chosen name. In SQL Server, it throws a compile error and complains that the Customer_Name needs to be part of the group by.

Easy to fix.
Select account_number, Customer_Name from sales group by account_number, Customer_name, right?

No. Now it will give you duplicate account numbers, if there are multiples sales with different names.

After a week of cussing, you end up coming up with a solution, but it’s ugly. It will involve inner selects, MAX/MINs (which won’t work the way you’d think because you’ll have to be concerned with NULLS and the like) or getting the row id with a “Partition Over”. It’s all ugly. It gets worse when you have a combined key, so “account_number” alone isn’t unique, but it’s account_number + store_id + district_code + random_silly_grouping_value.

If you ask MS (or one of their fanbois) they'll tell you that you don't really want to do that. I love it when they say that. They'll say "you don't want that because you can't guarantee which Customer_Name you're getting." When you say you don't care, they'll tell you that the database is mis-designed then. OK... and the world is not at peace. I usually want to give them my boss' number and have them explain how he doesn't really want what he wants, and if he does, he needs to shell out a few million $$ to re-design the data.

It’s the kind of thing that drives me nuts about Microsoft. Doing simple things is so easy. Doing complex things is  easy too. But varying off the path – easy or hard – is so painful.

If you don't believe me, try this: Go to Management Studio and right click on a database and run the wizard to script all the objects in the database. Now, try to do the same thing from a command line, so you can actually.... you know, schedule it. Turns out to be impossible (without rolling your own tool).
Getting up at 3am each night to script things is on the path. Automating it is off.

And the world is not at peace.
--kevin

No comments:

Post a Comment