Jul 25 2016

Monitor Query Performance

Blog post #7 in support of Tim Ford’s (b|t) #iwanttohelp, #entrylevel. Read about it here.

Sooner or later when you’re working with SQL Server, someone is going to complain that the server is slow. I already pointed out the first place you should look when this comes up. But what if they’re more precise? What if, you know, or at least suspect, you have a problem with a query? How do you get information about how queries are behaving in SQL Server?

Choices For Query Metrics

It’s not enough to know that you have a slow query or queries. You need to know exactly how slow they are. You must measure. You need to know how long they take to run and you need to know how many resources are used while they run. You need these numbers so that, after you do something to try to help the query, you can determine whether or not you’ve actually improved performance. To measure the performance of queries, you have a number of choices. Each choice has positives and negatives associated with it. I’m going to run through my preferred mechanisms for measuring query performance and outline why. I’ll also list some of the other mechanisms you have available and tell you why I don’t like them. Let’s get started.

Dynamic Management Views

Since SQL Server 2005, Dynamic Management Views (DMV) and Functions (DMF) have been available for accessing all sorts of information about the server. Specifically, there are a few DMVs that are focused on queries and query performance. If you go back through my blog, you can find tons of examples where I illustrate their use. You can also see them at work in commercial tools and free tools. Adam Machanic’s sp_WhoIsActive, a free tool, makes extensive use of DMVs. To learn more about DMVs, you can download a free book, Performance Tuning with SQL Server Dynamic Management Views. DMVs are available in Azure SQL Database, Azure SQL Data Warehouse, and all editions of SQL Server.

The information captured by DMVs is an aggregation of all the times the query has been run. This means you can’t find out how long the query ran at 3PM yesterday. You can, though, see the minimum and maximum time the query took, as well as the average. The ability to see this information is what makes DMVs useful. However, another important point about DMVs is that they only collect information while a query is in memory. As soon as the query leaves the cache (the area of memory it is stored in), so does all the aggregated information about it in the DMVs.
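
A typical query against these DMVs looks something like this (a minimal sketch; which columns you pull and how you order them depends on what you’re chasing):

-- Aggregated metrics for queries currently in cache; times are in microseconds
SELECT TOP 10
       dest.text AS QueryText,
       deqs.execution_count,
       deqs.total_elapsed_time,
       deqs.min_elapsed_time,
       deqs.max_elapsed_time,
       deqs.total_logical_reads
FROM   sys.dm_exec_query_stats AS deqs
CROSS APPLY sys.dm_exec_sql_text(deqs.sql_handle) AS dest
ORDER BY deqs.total_elapsed_time DESC;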

You use the DMVs for a general understanding of how a query is behaving. They’re not meant for detailed, long-term collection of information about queries. For that we use other tools.

Extended Events

Introduced in SQL Server 2008, Extended Events (ExEvents) are a mechanism for capturing detailed information about SQL Server and the processes within. One of those processes is how queries behave. I have multiple examples on this blog on using ExEvents. You can’t go wrong reading about them on Jonathan Kehayias’ blog. Extended events are available in Azure SQL Database and all editions of SQL Server.

When you need to know every query against a database, or each time a particular query is called, and all the details associated with the query (reads, writes, duration), ExEvents are the way to go. ExEvents are very lightweight on the server (but not free) and can be filtered so that you capture just the information you need. The information is detailed and not aggregated. Instead it’s raw. The real issue with capturing this data is the amount of data you’ll be capturing. Testing and careful filtering to ensure you’re not dealing with too much information are important. Prior to SQL Server 2012, there was no graphical user interface for reading ExEvent data, so you would have been forced to run queries against the XML that the information is captured within. With the tools available in SQL Server Management Studio, this is no longer the case.
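
A bare-bones session looks something like this (a sketch only; the database name, duration filter and file target are stand-ins you’d adjust, and in Azure SQL Database the session is created ON DATABASE rather than ON SERVER):

-- Capture completed batches for one database, keeping only the slower ones
CREATE EVENT SESSION QueryPerformance
ON SERVER
ADD EVENT sqlserver.sql_batch_completed
    (WHERE (sqlserver.database_name = N'MyDatabase'
            AND duration > 100000)) -- duration is in microseconds
ADD TARGET package0.event_file
    (SET filename = N'QueryPerformance.xel')
WITH (STARTUP_STATE = OFF);
GO

-- Start it when you're ready to capture
ALTER EVENT SESSION QueryPerformance ON SERVER STATE = START;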

You use ExEvents when you need specific and detailed information about a query. ExEvents are not so good for generalized monitoring.

Query Store

Introduced in Azure SQL Database, and first released in SQL Server with 2016, Query Store is another mechanism for capturing aggregated information about queries. As before, I have examples on how to work with Query Store on my blog. You can also find quite a bit on it over at Simple-Talk. Query Store is pretty specialized still and only available in Azure and SQL Server 2016, but it is in all editions of SQL Server 2016.

Query Store captures information similar to what is available in the DMVs. However, unlike the DMVs, the information that Query Store captures is kept around, even after a query ages out or is removed from cache. This persistence makes Query Store very exciting. You do have to choose to turn it on for each database you wish to capture queries for. It’s not automatic like the DMVs. The capture processes are asynchronous, so they should be relatively lightweight for most databases.
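
Turning it on and getting the aggregated metrics back out looks roughly like this (a sketch; the database name is a placeholder):

-- Enable Query Store for the database
ALTER DATABASE MyDatabase SET QUERY_STORE = ON;

-- Aggregated runtime metrics, retained even after the plan leaves cache
SELECT TOP 10
       qsqt.query_sql_text,
       qsrs.count_executions,
       qsrs.avg_duration,
       qsrs.max_duration,
       qsrs.avg_logical_io_reads
FROM   sys.query_store_runtime_stats AS qsrs
JOIN   sys.query_store_plan AS qsp
       ON qsp.plan_id = qsrs.plan_id
JOIN   sys.query_store_query AS qsq
       ON qsq.query_id = qsp.query_id
JOIN   sys.query_store_query_text AS qsqt
       ON qsqt.query_text_id = qsq.query_text_id
ORDER BY qsrs.avg_duration DESC;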

You use the Query Store when you need to capture query metrics over the long term, but you don’t need detailed information and aggregation works well for you.

Others

There are other ways to measure query performance. You can use the Profiler GUI, but that actually has a serious, negative impact on the server. You can bring a server down by using it, so it should be avoided. Profiler can also generate server-side trace scripts, which can be used to monitor your server. However, they have a much higher impact than ExEvents and they’re on the deprecation list. Microsoft is not adding new trace events for new functionality, so they’re becoming less and less useful with each release. You also can’t use trace against Azure. If you’re writing a query and you just want to see how long it takes to run, you can use SET STATISTICS TIME ON to capture the execution time. This is a handy way to quickly measure performance. There is also the ability to capture reads and writes using SET STATISTICS IO ON, but, while this does capture the metrics we need, it adds considerable overhead to the query, skewing performance measurement. This is why I stick to ExEvents when I need an accurate measure.
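
For that quick check, something like this does the trick (a sketch against AdventureWorks; any query of your own works the same way, and the timing shows up on the Messages tab):

SET STATISTICS TIME ON;

SELECT soh.SalesOrderID,
       soh.OrderDate
FROM   Sales.SalesOrderHeader AS soh
WHERE  soh.OrderDate >= '20140101';

SET STATISTICS TIME OFF;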

Conclusion

Honest people can disagree about the best way to capture query performance. I have my preferences as you can see. However, I’m fairly certain that everyone would agree that it’s important to know how to capture performance metrics in order to be able to assert that performance has increased or decreased in a measured fashion. You don’t want to guess at query performance, you want to know.

Jun 06 2016

Elastic Query in Azure SQL Database and Views

The question came up: how do the constructs necessary for Elastic Query within Azure SQL Database affect your ability to create views that join across databases? Since I haven’t tested this myself, and I haven’t seen anyone else doing anything with it yet, I decided to set up a test. I recreated my cross database queries using the scripts posted here. Let’s create a view:

CREATE VIEW dbo.JoinedView
AS
SELECT  dt.Val,
        dt2.Val AS Val2
FROM    dbo.DB1Table AS dt
LEFT JOIN dbo.DB2Table AS dt2
        ON dt2.ID = dt.ID;
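
Querying the view is then nothing special (shown just for completeness):

SELECT jv.Val,
       jv.Val2
FROM   dbo.JoinedView AS jv;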

If I run the query, I get back results. Done. We can create views that join between Azure SQL Databases… But, views are all about masking, right? What if I wanted to change the name of the table in my database? Could I do something like this?

CREATE EXTERNAL TABLE dbo.CanICheatThisName (
	ID int, 
	Val varchar(50)) 
WITH (
	DATA_SOURCE = DB2Access);

The CREATE script runs just fine with no errors. When I try to query it though:

Msg 46823, Level 16, State 2, Line 55
Error retrieving data from one or more shards.  The underlying error message received was: ‘Invalid object name ‘dbo.CanICheatThisName’.’.

Understanding how Elastic Query works, of course this fails. So… what about faking it the other way? I’ll create a view in my second database:

CREATE VIEW dbo.FakeTable
AS
SELECT  *
FROM    dbo.DB2Table AS dt;

Now, I’ll create the EXTERNAL TABLE using the name of the view:

CREATE EXTERNAL TABLE dbo.FakeTable (
	ID int, 
	Val varchar(50)) 
WITH (
	DATA_SOURCE = DB2Access);

When I query this, the data comes across just fine.

Conclusion

Creating a view, or any other query, that joins across databases using Elastic Query works just fine. However, if you want to mask things using a view, you might need to get a little creative in how you implement Elastic Query. The good news is, Elastic Query is somewhat, shall we say, elastic in how you set it up. More so than it immediately appears.

Jun 01 2016

Independent Azure Data Platform Instructors

The Azure Data Platform is taking off. I’m seeing more and more interest on the forums, at conferences and in my personal interactions. I’ve been teaching the data platform for six years. Almost as soon as it was available, I started working with it, putting up blog posts and setting up sessions. I’ve had stuff in production on the platform for almost that long too. I’m an advocate and, I hope, an independent voice on the topic. By independent in this case, I mean non-Microsoft. Don’t get me wrong, most of the people I learn from work for Microsoft. They are excellent instructors and more knowledgeable on the topic than I’ll ever be. I’m not questioning the ability of Microsoft people to deliver the very best Data Platform content. I just believe that people also want voices that are not directly beholden to the company.

That brings up my list. I have a list of Azure Data Platform Instructors. I’ve been maintaining this for about eight weeks now. I’m adding to it regularly. At this point, the list is small and in alphabetical order. As it grows, I will begin the process of curating the list. I’ll probably start with categories only, but eventually, I’ll start to pick and choose who goes on the list based on my knowledge of the individuals involved. For now, these are the independent voices who are teaching Azure and the Azure Data Platform. If you’re getting started in Azure, this is the list you need to consult to find the very best independent voices.

Apr 27 2016

Customizing Your Azure Portal

Not all of us are going to use Azure in the same way. That’s just common sense. Therefore, not all of us are going to want it to look the same. Here’s a default view of the portal for a login I created within my own domain:

[Image: AzurePortalDefault]

You can contrast that with this login to my portal that I use for most of my testing and training (this isn’t my company’s production or development portal, which looks still more different):

[Image: AzurePortalCustomized]

Clicking on either of these will open them so you can look at details. Not only are the color schemes different, but you’ll note that the selection lists on the left are different as are the grids on the dashboard. All this can be customized for each login, and, within a login, on each Subscription to which that login belongs.

All this means that you can customize the portal for your individual needs, even between your production and development areas within the portal (assuming you’re keeping them separate, which you ought to be). Let’s explore a few of those options.

Favorites

Over on the left of the screen is a listing of favorite resources. At the bottom of this list you’ll see a “Browse >” button. Selecting this pops out a list of all available resources:

[Image: AzureportalFavorties]

This is a complete listing of the resources available to your subscription, including anything that is in Preview. Note, on the right hand side of the screen are stars. One of the stars in the list above shows up as gold and filled in. This indicates that it’s currently on my list of resources as a favorite. Selecting one of the resources that is not filled in will add it to the list and selecting one that is filled in will remove it. This is your first customization. Add the resources to favorites that you use all the time. You can even drag them around to rearrange the order in which they appear.

Portal Settings

If you want to change the color scheme, you can do that too. At the top of the screen is a little icon of a gear. Clicking on that opens the Portal Settings blade:

[Image: azureportalsettings]

Not only can you modify the color scheme, but you also get to set labels, animations & the language of the portal.

Dashboard

This is where you can get crazy. You have the ability to add all sorts of output, readouts, and shortcuts to your dashboard. You get to set their size, rearrange them, even create additional dashboards. Additional dashboards give you the ability to customize a view so that you could, for example, have your most important Azure SQL Databases displaying their DTU readouts as a single view:

[Image: azureportalNewDashboard]

Yeah, yeah, it reads zero. Test databases, remember. Anyway, customizing a dashboard is easy. If you click on “+New dashboard” or “edit dashboard” at the top of the screen, a new blade opens. This allows you to add and rearrange tiles. You’ll see a grid that has dashed lines for the resolution of a screen. It makes it very easy to figure out what’s going to be displayed and what’s not. On the left you get the ability to track down different kinds of tiles. In the example below I have a list of the tiles for Azure SQL Database (you will need to click on this one, it’s big):

[Image: azureportaldashboardedit]

You can drag and drop the tiles from the left on to the screen. Once there you can rearrange where they are located, resize them, remove them, pretty much what you would expect. Further, you can customize each tile individually. The tile will have a clickable link called “Edit”. Clicking on that will bring up unique options depending on the tile. In the example below I’m showing the options for the Resource Utilization tile:

[Image: azureportaleditchart]

I’ve decided to just show DTU percentage and DTU used for the last hour as a Line chart. You can rearrange this as necessary to fit your own needs. You can rename the dashboard, and then, when you’re done, just click on the “Done Customizing” button and you’ll see the dashboard. You can pick which dashboard to look at. You can even share a dashboard with your team so that everyone has access to the same information you do.

Conclusion

This was just a bare bones run-through of what’s possible with the Azure portal. You really can achieve a very high level of customization to get just the information you need, in the way that you need it. Take advantage of this because, at times, it can be quite difficult to manage information on the portal.

Apr 21 2016

Microsoft’s Commitment to Azure

For several years, many of us who were involved with Azure from the early days were concerned that Microsoft had bet the farm and could possibly lose it all. They may well have bet the farm, but more and more it’s extremely clear that there is zero likelihood of them losing the farm. In fact, to stick with the farming analogy, it’s looking more and more like they’re turning into an agro-corp. Azure is taking off.

You need to start working on adding Azure knowledge to your skill set. If you have access to an MSDN license, getting into Azure is easy because of the credits available. However, not everyone works for a company that provides MSDN or has purchased a license. In that case, getting into Azure, just for testing and learning, could be expensive (I frequently “spend” $150/month with my MSDN credits). However, Microsoft is very serious about getting everyone moved into this space. They’ve launched a new free program called IT Pro Cloud Essentials. Not only does it give you Azure credit, but you also get access to O365, another set of skills and understanding you need to acquire.

If cost was one of the reasons you were choosing to not explore Azure, time to give that one up. Explore the offering and then get your learn on.

Mar 29 2016

Do You Teach Azure Data Platform?

I offer instruction on the Azure Data Platform, and have for about six years, since shortly after it came out. I started using Azure SQL Database (although it had a different name then) from Day 1.

I know a few other people who don’t work for Microsoft, but have been actively pursuing Azure SQL Database, SQL Server on Azure VMs, and pretty much all the Microsoft Data Platform. I’m not counting the BI people who have dived into PowerBI and related tech. The BI people, who are generally pretty smart, jumped on Azure with both feet. I’m talking about the data platform aspect of Azure. The people that I know who regularly teach classes are (in no particular order, sheesh, you people):

Karen Lopez (b|t)
Denny Cherry (b|t)
Jes Borland (b|t)
Thomas LaRock (b|t)
Joe D’Antoni (b|t)
Ron Dameron (b|t)
Aaron Bertrand (b|t)
Tim Radney (b|t)

I’m sure I overlooked someone who is active in this space. Please help me out. Let’s create a good list of active educators on the Azure Data Platform. I believe this is needed so that people know where to go, besides the excellent Microsoft resources (and they are excellent), to get more information. Please, no Microsoft employees. Yeah, many of them are great educators and I’m sure going to go and sit in their classes, as you should. I’m just trying to get the fundamental list of non-Microsoft speakers together and share it out.

Azure interest is growing, fast. Independent voices are valued and needed. Let’s get this list together, published, and maintained. Send me your input through the comments or my email (grant – at – scarydba -dot- com). I’ll get things published ASAP.

Oh, and if I missed you from the initial list and you were an obvious inclusion, my apologies. I’m old.

And, thinking about it, let’s get the BI people in Azure listed too. I was being lazy, not exclusive. Lazy & old. The intent is still good.

If you, or someone you know, is actively teaching Azure Data Platform, I want to know about it so I can add them to the list that I’ll maintain.

Mar 21 2016

Cross Database Query in Azure SQL Database

You can’t query across databases in Azure SQL Database… or can you?

Let’s check. I’ve created two new databases on an existing server:

[Image: dblist]

I’ve created a table in each respective database:

CREATE TABLE dbo.DB1Table (
     ID INT IDENTITY(1, 1)
            NOT NULL
            PRIMARY KEY,
     Val VARCHAR(50)
    );


CREATE TABLE dbo.DB2Table (
     ID INT IDENTITY(1, 1)
            NOT NULL
            PRIMARY KEY,
     Val VARCHAR(50)
    );

Now, let’s query the DB2 table from the DB1 database:

SELECT  *
FROM    DB2.dbo.DB2Table AS dt;

And here’s the lovely error message:

Msg 40515, Level 15, State 1, Line 35
Reference to database and/or server name in ‘DB2.dbo.DB2Table’ is not supported in this version of SQL Server.

So, like I said, you can’t do three part name cross-database queries in Azure SQL Database… oh wait, that’s not quite what I said, is it? Let’s do this. Let’s create a new security credential within DB1 for a login that can get us into DB2:

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'securitymatters';

CREATE DATABASE SCOPED CREDENTIAL DB2Security 
WITH IDENTITY = 'Grant',
SECRET = 'securitymatters';

Then, we’ll use that to define an external data source:

CREATE EXTERNAL DATA SOURCE DB2Access
WITH (
	TYPE=RDBMS,
	LOCATION='myservernotyours.database.secure.windows.net',
	DATABASE_NAME='DB2',
	CREDENTIAL= DB2Security);

With this, we can put Elastic Query (corrected from Polybase, see note below) to work and create an external table:

CREATE EXTERNAL TABLE dbo.DB2Table (
	ID int,
	Val varchar(50))
WITH
(
	DATA_SOURCE = DB2Access);
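
Querying it from DB1 is then just an ordinary two-part name (shown for completeness):

SELECT dt.ID,
       dt.Val
FROM   dbo.DB2Table AS dt;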

And that’s it. If I query dbo.DB2Table from DB1, I get to see data in DB2. In short, you can do a cross database query within Azure SQL Database. Yeah, it’s going to require some setup and possibly some code modifications since you can’t use the old three part naming for performing the query, but, you can do it. Further, note that these are Standard, not Premium databases. Further further, they’re not a part of an elastic pool. It’s just using the external data source and external table to connect the two databases. However, if the one thing keeping you from moving into Azure SQL Database is the ability to query across databases, that’s gone.

 

Feb 29 2016

Wait Statistics in Azure SQL Database

You need to be aware that you’re going to see differences when you’re working with Azure SQL Database when it comes to wait statistics. If you’re running a v12 Azure SQL Database (and if you’re not, go fix it), you can run a query against sys.dm_os_wait_stats. Here’s the output for the TOP 10 waits ordered by wait time on one of my databases:
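
The query itself is nothing fancy; a minimal sketch of the sort of thing that produces that output:

SELECT TOP 10
       dows.wait_type,
       dows.waiting_tasks_count,
       dows.wait_time_ms,
       dows.max_wait_time_ms,
       dows.signal_wait_time_ms
FROM   sys.dm_os_wait_stats AS dows
ORDER BY dows.wait_time_ms DESC;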

[Image: waits_os]

Everything you’re used to seeing, right? Well… not quite. This is Azure SQL Database. So, let’s use sys.dm_db_wait_stats, a DMO that is only available within Azure. This lists waits by database. The TOP 10 here are:
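
Same shape of query, just pointed at the database-level DMV (again, a sketch):

SELECT TOP 10
       ddws.wait_type,
       ddws.waiting_tasks_count,
       ddws.wait_time_ms,
       ddws.max_wait_time_ms,
       ddws.signal_wait_time_ms
FROM   sys.dm_db_wait_stats AS ddws
ORDER BY ddws.wait_time_ms DESC;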

[Image: waits_db]

You’ll notice that these results are wildly different from those above. What we’re looking at is largely a server versus a database, but not completely. I mean that sys.dm_os_wait_stats is showing the waits for the instance on which my primary Azure SQL Database is currently running. Most of those waits are mine, but because it’s part of the management structure of Azure, sys.dm_os_wait_stats shows some information that’s not applicable, directly, to me. The “server” is not really that. It’s a logical container holding your database. There’s a lot more to it under the covers. To get the waits that are absolutely applicable to me and my databases, I have to go to sys.dm_db_wait_stats.

It goes further. One of the options for sys.dm_os_wait_stats is the ability to reset the waits. You have to call a DBCC command to do it:

DBCC SQLPERF ('sys.dm_os_wait_stats', CLEAR);

However, if you run this on an Azure SQL Database, you’re going to get an error:

Msg 297, Level 16, State 11, Line 15
The user does not have permission to perform this action.

Makes sense. Azure SQL Database is a Platform as a Service. You don’t have access to the server level objects to make these types of changes.

So, we’ll fix it. I want to reset the database wait stats:

DBCC SQLPERF ('sys.dm_db_wait_stats', CLEAR);

This results in:

Msg 2526, Level 16, State 12, Line 15
Incorrect DBCC statement. Check the documentation for the correct DBCC syntax and options.

If you read the documentation, you can’t reset the wait stats for the database. However, the wait stats will be reset if your database goes through a failover, which is actually hard to detect (I’m still trying to figure this out). You can’t currently control the reset.

With the understanding of what you’re seeing, you can use sys.dm_db_wait_stats in the same way you would sys.dm_os_wait_stats.

Oh, and if you’re looking to filter the unnecessary wait stats, Paul Randal’s script works fine. Just substitute ‘db’ for ‘os’ and you’ll be happy.

EDIT: Just so we’re clear, sys.dm_db_wait_stats shows you just the waits in a single database. The database you’re running them from. Further, this DMV is only available in Azure SQL Database. I hope that helps the clarity of the message a little.

EDIT2: I’ve changed the description after some conversations. I think I’m closer to reality based on what I’ve been told and Jeremiah’s comment below. It’s hard because I’m explaining something I don’t directly support because I don’t work at Microsoft.


For lots more information on query tuning, I’m presenting an all-day pre-con at SQL Day in Wroclaw, Poland on May 16.

Feb 18 2016

Azure SQL Database Error

I was on SQL Cruise where I was scheduled to present a session on Azure SQL Database. I recorded all my demonstrations before we went to sea, but I planned to attempt live demos. Yeah, yeah. Well, anyway, I got a unique error when I attempted to connect:

Forced connection closes from remote host

That’s a partial message from the whole error. I don’t have a good screen capture. I wasn’t able to find anything on it through web searches, but this week I was at Microsoft for a training course on Azure. I asked the room. The rough answer is (paraphrasing):

The IP address I was attempting to connect from is not on the approved list

Interesting. I didn’t realize there were blackout zones. The really interesting thing was, this only occurred when I attempted to connect through SQL Server Management Studio. I was able to connect through the Azure Portal. One more wrinkle in the interesting challenges offered by Azure and the Data Platform.

Feb 15 2016

Loading Data into Azure SQL Data Warehouse

Ouch.

Let’s start with the level set. I’m not an ETL expert. In fact, I haven’t done any professional ETL work for several years. My skills are, at best, rusty. With this in mind, I knew I’d have a hard time extracting data from a local database in order to move it up to Azure SQL Data Warehouse. I expected to be fumbling and slow and to constantly feel more than a little stupid as I made one mistake after another. All of this came to pass.

Yet… OMG! THAT WAS DIFFICULT!

Here’s how I started. I defined a bcp command for the tables I was interested in. I ensured it was working correctly, then wrote a little PowerShell script so I could supply a list of tables and get them all exported at once. Here’s an example:

# $tables holds schema-qualified table names; strip the schema to build each file name
$tables | foreach-object {
$filename = $_.Substring($_.IndexOf(".") + 1,($_.Length - ($_.IndexOf(".") + 1)))
# Export the table to a flat file using a trusted connection (-T)
bcp $_ out "c:\data\bcp\$filename.rtf" -S'WIN-3SRG45GBF97\dojo' -d'ODS' -T -n -c
}

That worked fine. Next up, I used azcopy to load the files I created up to blob storage. Again, a little Powershell work to move all the files at once:

# Push each exported file up to blob storage; /Y suppresses the confirmation prompts
$tables | foreach-object {
.\AzCopy.exe /Source:c:\data\bcp /Dest:"https://xxx.blob.core.windows.net/sales/data/$_" /DestKey:xx /Pattern:"$_.rtf" /Z:"C:\data\bcp" /Y
}

No issues. Well, OK, I screwed up both these commands like 20 times until I got them right and then did the quick automation. However, the screw-ups were mine, not external issues. Next up, setting up external tables. There’s a good article on how to do this provided by Microsoft. Up until I tried to create the table, everything worked just fine. Then I saw this:

Msg 104051, Level 16, State 1, Line 11
Unsupported data type error. Statement references a data type that is unsupported in Parallel Data Warehouse, or there is an expression that yields an unsupported data type. Modify the statement and re-execute it.

Oh boy.

Luckily, there’s another blog post from Microsoft on getting the schema right. NUMERIC out and DECIMAL in, I got my next error:

Msg 105001, Level 16, State 1, Line 11
CREATE EXTERNAL TABLE failed because the location URI could not be reached. Revise the URI and try again.

My fault. I tried to get fancy and did this:

create external data source AzureStoreageGF
with (type = Hadoop,
location='wasbs:[email protected]', 
credential=AzureStorageCredential);

Changing it to just [email protected] and then making my WITH statement on the CREATE EXTERNAL TABLE command read like this:

WITH (
LOCATION='data/Account/',
DATA_SOURCE=AzureStorage,
FILE_FORMAT=TextFile);
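
As an aside, the FILE_FORMAT referenced there has to exist before the CREATE EXTERNAL TABLE will run. I’m not showing the one I used, but a minimal sketch, assuming the tab-delimited output that bcp -c produces by default, looks like this:

CREATE EXTERNAL FILE FORMAT TextFile
WITH (
	FORMAT_TYPE = DELIMITEDTEXT,
	FORMAT_OPTIONS (FIELD_TERMINATOR = '\t'));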

With that change, the CREATE EXTERNAL TABLE worked! I’m a winner. All I have to do now is query the Polybase table I just created.

Msg 107090, Level 16, State 1, Line 132
Query aborted– the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed.
(/data/Account/Account.rtf)Column ordinal: 1, Expected data type: NVARCHAR(1300) collate SQL_Latin1_General_CP1_CI_AS, Offending value: xxx (Tokenization failed), Error: Not enough columns in this line.

RASSIN FRASSIN… Except I didn’t use those words. I substituted some other Anglo-Saxon terms. Anyway, what’s it mean? It means, probably, although it’s hard to know, that I must have messed up the data export in some fashion. Or that there’s something in the data that’s breaking the interpretation of the code…. only I can’t find it. So, old school, I opened this up in Excel and found two columns that had carriage returns in them. Fixed that. Going again:

Query aborted– the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 33 rows processed.
(/data/Account/Account.rtf)Column ordinal: 0, Expected data type: DECIMAL(18,2), Offending value: , Error: UTF8 decode failed.

A WIN! What? You didn’t see the win? You’re looking at the error message. 33 rows processed. Yeah, that’s out of 96,000 rows, but it’s more than 1. That means that 32 rows processed successfully. Now, you might be wondering about that file format. The last time I got that error, it was because I had bcp output in a non-character format. That suggests there’s a data type in there that is giving me more problems in Polybase. Back to SQL Server to see what’s up with this data (because everything worked perfectly in Excel this time). And found it. This is the problem:

‘å’

I’m using bcp. It goes to a UTF-16 file format. Polybase supports the Hadoop standard of UTF-8. This must be blocking it……. Great. There it is in black & white:

bcp does not support exporting to UTF-8. This may prevent using PolyBase on bcp exported data

Fine. Powershell to the rescue:

# Re-save every exported file as UTF-8 so Polybase can read it
Get-ChildItem -rec -fi *.rtf | %{(get-content $_ -enc string) | set-content -enc utf8 $_.fullname}

This time I processed 197 rows before I found another problem:

Query aborted– the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 197 rows processed.
(/data/Account/Account.rtf)Column ordinal: 110, Expected data type: INT NOT NULL, Offending value: |45178|1043866 (Tokenization failed), Error: Too many columns in the line.

What follows is a tedious and dull march through various & sundry text columns that, for some strange reason, have a pipe (|) character in them (why the phone number, by Odin’s Eye Patch, why the phone number). Suddenly… I have 96,000 rows returning from my COUNT(*) query (using Visual Studio, SSMS was twitchy when connected to ADW). Run a CREATE TABLE AS SELECT (CTAS) query and I have a new table. Ta-da.
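
For reference, that statement is roughly this shape (a sketch; the table names and the distribution choice are placeholders):

CREATE TABLE dbo.Account_Staged
WITH (DISTRIBUTION = ROUND_ROBIN) -- or HASH(some_column) once you know the data
AS
SELECT *
FROM dbo.Account_External; -- the external table over the files in blob storage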

Let’s bottom line this. Data is dirty. Any ETL process is likely to expose the trashy data, so what are the real issues here? First up, Powershell is my bestest buddy. Scripting everything through Posh right out of the gate made a huge difference in my ability to constantly reset. The fact that our most common processes spit out UTF16, while not a problem, is something you need to get on top of (see Posh above). AZCopy works really well and the command line is easy to implement, but you’ll need to go through the process of setting up all the appropriate connections. Polybase is wicked cool. Yes, I had to hop through hoops to get the data clean, but, what you should note in the above descriptions, at no point did I need to go and recreate my EXTERNAL TABLE from Azure SQL Data Warehouse. I just kept updating the files through my scripts and as soon as they were loaded up to blob storage, I ran a query against them. No additional work required. That’s nice. You’re going to want to go through your data types on your data before you start the migration. You’re going to have to plan for a lot of time cleaning your data if you haven’t already.

In summary, this process is well documented, but has a lot of moving parts and lots of places for things to break down. If you’re like me and ETL is a distant memory, plan for a very steep learning curve. If you’re good at ETL, but not so hot at automation, your mouse clicking finger is going to get tired. Microsoft has a decent overview of the process posted here.

Now, I need to get started cleaning the data on the next table… pity me.

(Kidding. I’m having a blast)