Mar 21 2016

Cross Database Query in Azure SQL Database

You can’t query across databases in Azure SQL Database… or can you?

Let’s check. I’ve created two new databases on an existing server:

[Image: the server’s database list showing the two new databases]

I’ve created a table in each respective database:

CREATE TABLE dbo.DB1Table (
     ID INT IDENTITY(1, 1)
            NOT NULL
            PRIMARY KEY,
     Val VARCHAR(50)
    );


CREATE TABLE dbo.DB2Table (
     ID INT IDENTITY(1, 1)
            NOT NULL
            PRIMARY KEY,
     Val VARCHAR(50)
    );

Now, let’s query the DB2 table from the DB1 database:

SELECT  *
FROM    DB2.dbo.DB2Table AS dt;

And here’s the lovely error message:

Msg 40515, Level 15, State 1, Line 35
Reference to database and/or server name in ‘DB2.dbo.DB2Table’ is not supported in this version of SQL Server.

So, like I said, you can’t do three-part name cross-database queries in Azure SQL Database… oh wait, that’s not quite what I said, is it? Let’s do this. Let’s create a new security credential within DB1 for a login that can get us into DB2:

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'securitymatters';

CREATE DATABASE SCOPED CREDENTIAL DB2Security 
WITH IDENTITY = 'Grant',
SECRET = 'securitymatters';

Then, we’ll use that to define an external data source:

CREATE EXTERNAL DATA SOURCE DB2Access
WITH (
	TYPE=RDBMS,
	LOCATION='myservernotyours.database.windows.net',
	DATABASE_NAME='DB2',
	CREDENTIAL= DB2Security);

With this, we can put Elastic Query (corrected from Polybase; see note below) to work and create an external table:

CREATE EXTERNAL TABLE dbo.DB2Table (
	ID int,
	Val varchar(50))
WITH
(
	DATA_SOURCE = DB2Access);
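
With the external table in place, the query from DB1 is just a plain two-part name. Here’s the shape of the check I ran (a sketch of the working counterpart to the earlier failed three-part query):

SELECT  *
FROM    dbo.DB2Table AS dt;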

And that’s it. If I query dbo.DB2Table from DB1, I get to see data in DB2. In short, you can do a cross database query within Azure SQL Database. Yeah, it’s going to require some setup and possibly some code modifications since you can’t use the old three-part naming for performing the query, but you can do it. Further, note that these are Standard, not Premium, databases. Further further, they’re not a part of an elastic pool. It’s just using the external data source and external table to connect the two databases. However, if the one thing keeping you from moving into Azure SQL Database is the ability to query across databases, that’s gone.

 

Mar 17 2016

Opportunities To Talk

It’s weird being an introvert who likes to talk to people, but what can I do. I like talking to people. I have a number of upcoming trips, quite literally all over the world, that provide us with the opportunities to get together and have a chat.

First, I’ll be at SQL Saturday Boston (the 500th SQL Saturday event, HUZZAH!), this weekend, March 19th 2016. I’ll be talking about the Query Store and I’ll be doing a presentation for PASS since this is a milestone event. The first SQL Saturday event in Boston was #34, six years ago, which I helped organize. It’s been quite the journey.

I’m going to SQL Saturday Madison on April 9th. I’ll be talking about the Query Store and how to automate your database deployments. I haven’t been in Wisconsin for years.

Also in April, on the 19th, I’ll be heading down to Orlando. I’m pleased to be able to say I have the honor (and I really do consider it that way) to be able to take part in SQL Intersection. Check out the speakers there. Amazing. I’m doing a couple of new sessions on improving your T-SQL and on hybrid Azure environments.

Then things get busy. First, on May 2nd and 3rd, I’ll be at the PASS Business Analytics Conference. I’m going there to learn as well as support the event in my role as the PASS EVP. I’m very excited about it. Last year the BAC was great. This year looks even better.

On May 4th, yes, leaving one to get to the next, I fly out to merry old England where I’m presenting at the SQLBits conference. Bits is hands down one of the great events each year. I truly look forward to it and to getting to talk with all my friends from over the pond.

I get to come home for a few days, and then, something completely new. I’m off to Wroclaw Poland for the SQL Day Poland conference, May 16-18. This will be the furthest from home I’ve ever travelled to present. It will be my first time ever in Poland. I’m excited like a puppy dog about this event. I’m doing a pre-conference seminar and a couple of sessions, all about query tuning and execution plans. I don’t know when, or if, I’ll be back over there again, so please, take advantage of this special opportunity.

Back in the states, in June, I’m doing a road trip (still unnamed, I need help with that) through the state of Ohio hitting multiple SQL Server user groups. I’ll do another couple of posts on this event as we get it slightly more nailed down (I still haven’t picked a topic).

The last thing I have scheduled currently this year is another new trip. Remember that record I’m going to set by flying off to Poland in May? Yeah, well, it’s only going to stand for three months. In August, I’m travelling to India for the SQL Server Geeks Conference. There I’ll be presenting a pre-conference, all-day, seminar as well as a couple of sessions. And yeah, puppy dog time again.

I’m going to try to get to a SQL Saturday event in July and maybe another in August. Nothing picked yet. I’m open to suggestions.

Please, if you come to one of these events, introduce yourself. I do want to talk to you. That’s why I’m there.

Mar 14 2016

Leadership Lessons

Not for you, for me.

I’m sure you’ve heard the statement: Praise in public. Criticize in private.

I agree with this approach. However, I find it extremely difficult to do. It’s one of the fundamental proofs that all leadership, all life for that matter, is about constant practice and discipline. It’s not enough to know something. It’s not enough to practice something occasionally. To get good at this stuff, you need to practice a lot.

Let me tell you about a recent failure on my part. My 17-year-old daughter had friends over for a sleepover (yeah, they still do that). She makes her own breakfast and starts eating. I remind her to ask her friends what they want. She does so in this really irritated manner. Of course, the friends don’t want anything because she’s so clearly put out. I proceed to lecture her on the mistake and how she should have done it. She’s embarrassed and I realize I screwed up.

Now, you can say that’s just parenting, and you wouldn’t be completely wrong. However, the same lessons apply in the business world. It’s so easy to see people doing stuff that is wrong and openly correct them. The hard way, the right way, is to get the correction in, but to do so without being critical in public. You can, you should, be critical of people. You just need to be cautious about how, when, and most importantly, where.

I’m typing this up because, in addition to my screw up as a parent, I’ve been a little too openly critical of others of late and I need to remind myself of the right way to get things done. There. I’ve been warned. I hope you enjoyed this little chat.

Mar 07 2016

SQL Server Backups Are A Business Decision

Blog post #3 in support of Tim Ford’s (b|t) #iwanttohelp, #entrylevel

Read more about Tim’s challenge here.

It’s very easy to think of SQL Server backups as a technical problem. You have so much stuff going on: BACKUP DATABASE commands, recovery models, BACKUP LOG commands, differential backups. Getting them all into the correct order and automating the processes sure seems like a technical problem. It isn’t. It’s all about the business. If you’re taking on the duties of a DBA, whether you’re an accidental DBA, a reluctant DBA, or you were voluntold into the DBA position, you need to plan to sit down with responsible parties from the business and get an understanding with them regarding RPO and RTO.

RPO is a TLA for Recovery Point Objective. The easiest way to describe RPO is to ask, “In terms of time, how much data are we willing to lose?” The immediate answer is always going to be zero. Here is where we have to be honest. You won’t be able to guarantee zero data loss (yeah, there are probably ways to do this, but #entrylevel). Talk with the business. Most of the time, you’ll find that they’d actually be OK with 15 minutes, or maybe 5 minutes, or even an hour of lost data. It really varies, not only from business to business, but from database to database within the business (allow for this flexibility). You need to establish this number. RPO is going to help you figure out how to set up your backups, your recovery model, your logs and their backups. All that stuff that seemed so technical, it’s all based on this extremely important number that you’re going to work with the business to arrive at.

Oh, but we’re not done. Once you’ve managed to get the business comfortable (as comfortable as they can be) with the idea that they could lose data, you also have to prepare them for the idea that, in the event of a disaster, restoring the database from backups is not going to be instant. It’s going to take time. This is where we have to define the RTO, or Recovery Time Objective. This is our goal for how quickly we can restore the database. RTO is not so much a negotiation with the business as it is an education for the business. You see, you can only restore so fast on your hardware. Further, the RESTORE DATABASE process is dependent on the size of the backups. Even further, it’s dependent on the types of restore operations we’re running and whether or not we use WITH RECOVERY in the RESTORE operations. You may have to test a few restores to get an idea of how fast things are with your system. Regardless, the RTO has to be arrived at and agreed on. You may also have to readdress the RTO as the number, size, and volume of your data change over time. Be prepared for this as well.
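
To give the RTO conversation some shape, here’s a minimal sketch of a restore sequence (the database name and file paths are placeholders of mine, not anything from a real system):

-- Restore the last full backup and the most recent differential without recovery,
-- then apply the log backups (only one shown here), recovering on the final step.
RESTORE DATABASE SalesDB
FROM DISK = 'C:\Backups\SalesDB_Full.bak'
WITH NORECOVERY;

RESTORE DATABASE SalesDB
FROM DISK = 'C:\Backups\SalesDB_Diff.bak'
WITH NORECOVERY;

RESTORE LOG SalesDB
FROM DISK = 'C:\Backups\SalesDB_Log.trn'
WITH RECOVERY;

Timing a sequence like this against full-sized backups is the only way to get a realistic RTO number.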

With the RPO established, you can now decide on the recovery model. Let’s take an example. If the business says that they can afford a day of data loss, then, depending on the size of your database, you can put this database into SIMPLE recovery, run a full backup once a day, and walk away a winner. Another example: the business decides that it could live with up to 15 minutes worth of data loss. Now you have to go to FULL recovery and you have to set up log backups in addition to your full backups. Then you start to factor the RTO into the mix. Let’s say your outage was to occur at 8PM and you run your backups at midnight. You now have to restore 20 hours worth of log backups. That can take a long time. So, in order to make the RTO as short as possible, you toss in a differential backup every day at noon. Now you’ll only ever have to restore 12 hours worth of backups, so you can define a rough RTO for the business.
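
As a sketch of what that second example might look like in T-SQL (the database name and paths are placeholders, and in practice these commands would live in scheduled jobs):

-- FULL recovery, a full backup at midnight, a differential at noon,
-- and log backups every 15 minutes to meet a 15-minute RPO.
ALTER DATABASE SalesDB SET RECOVERY FULL;

BACKUP DATABASE SalesDB
TO DISK = 'C:\Backups\SalesDB_Full.bak';        -- midnight

BACKUP DATABASE SalesDB
TO DISK = 'C:\Backups\SalesDB_Diff.bak'
WITH DIFFERENTIAL;                              -- noon

BACKUP LOG SalesDB
TO DISK = 'C:\Backups\SalesDB_Log.trn';         -- every 15 minutes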

These simplified, and somewhat simplistic, examples are just the start of the process of figuring out how best to do your backups. However, that’s the technical part of the problem. The fundamental definitions that you have to have in order to start solving this technical issue are business decisions that you must get your business people involved with. Define the RPO and RTO, then start defining your recovery strategy.

Mar 04 2016

Speaker of the Month: March 2016

I’m finally getting back out to community events, so I should be able to avoid giving this gigantic honor to professional speakers for a month or two. My Speaker of the Month for March 2016 is Ron Dameron (b|t).

I saw Ron’s presentation, Monitoring & Alerting for Azure SQL Database, at SQL Saturday Tampa. It was a good presentation (or it wouldn’t be here, would it). I especially liked the topic. As more people move on to the Azure platform, this is the sort of information they’ll need. Ron spent a lot of time setting up slides to avoid overusing bullets. Nicely done. His delivery was clear and solid. He used ZoomIt quickly and smoothly; it looked like it was just part of the presentation. His demos worked (unlike a couple of mine at this event).

The feedback I’d have for Ron would be to intersperse the demos through the presentation, rather than gathering them at the end. There were so many questions during the session (a great thing, the audience was engaged) that it bit into demo time a little. I liked the use of screen captures for the slides, but I’d make them the whole width of the slide so that they’re more visible. That’s it. Like I said, a good presentation.

Thanks Ron

Feb 29 2016

Wait Statistics in Azure SQL Database

You need to be aware that you’re going to see differences when you’re working with Azure SQL Database when it comes to wait statistics. If you’re running a v12 Azure SQL Database (and if you’re not, go fix it), you can run a query against sys.dm_os_wait_stats. Here’s the output for the TOP 10 waits ordered by wait time on one of my databases:

[Image: top 10 waits from sys.dm_os_wait_stats]
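
The query behind that output is nothing special; something along these lines (my version, not necessarily character-for-character what produced the screen shot):

SELECT TOP 10
        dows.wait_type,
        dows.waiting_tasks_count,
        dows.wait_time_ms,
        dows.max_wait_time_ms,
        dows.signal_wait_time_ms
FROM    sys.dm_os_wait_stats AS dows
ORDER BY dows.wait_time_ms DESC;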

Everything you’re used to seeing, right? Well… not quite. This is Azure SQL Database. So, let’s use sys.dm_db_wait_stats, a DMO that is only available within Azure. This lists waits by database. The TOP 10 here are:

[Image: top 10 waits from sys.dm_db_wait_stats]

You’ll notice that these results are wildly different from those above. What we’re looking at is largely a server versus a database, but not completely. I mean that sys.dm_os_wait_stats is showing the waits for the instance on which my primary Azure SQL Database is currently running. Most of those waits are mine, but because it’s part of the management structure of Azure, sys.dm_os_wait_stats shows some information that’s not applicable, directly, to me. The “server” is not really that. It’s a logical container holding your database. There’s a lot more to it under the covers. To get the waits that are absolutely applicable to me and my databases, I have to go to sys.dm_db_wait_stats.

It goes further. One of the options for sys.dm_os_wait_stats is the ability to reset the waits. You have to call a DBCC command to do it:

DBCC SQLPERF ('sys.dm_os_wait_stats', CLEAR);

However, if you run this on an Azure SQL Database, you’re going to get an error:

Msg 297, Level 16, State 11, Line 15
The user does not have permission to perform this action.

Makes sense. Azure SQL Database is a Platform as a Service. You don’t have access to the server level objects to make these types of changes.

So, we’ll fix it. I want to reset the database wait stats:

DBCC SQLPERF ('sys.dm_db_wait_stats', CLEAR);

This results in:

Msg 2526, Level 16, State 12, Line 15
Incorrect DBCC statement. Check the documentation for the correct DBCC syntax and options.

If you read the documentation, you’ll see that you can’t reset the wait stats for the database. However, the wait stats will be reset if your database goes through a failover, which is actually hard to detect (I’m still trying to figure that out). You can’t currently control the reset.

With the understanding of what you’re seeing, you can use sys.dm_db_wait_stats in the same way you would sys.dm_os_wait_stats.

Oh, and if you’re looking to filter the unnecessary wait stats, Paul Randal’s script works fine. Just substitute ‘db’ for ‘os’ and you’ll be happy.

EDIT: Just so we’re clear, sys.dm_db_wait_stats shows you just the waits in a single database. The database you’re running them from. Further, this DMV is only available in Azure SQL Database. I hope that helps the clarity of the message a little.

EDIT2: I’ve changed the description after some conversations. I think I’m closer to reality based on what I’ve been told and Jeremiah’s comment below. It’s hard because I’m explaining something I don’t directly support because I don’t work at Microsoft.


For lots more information on query tuning, I’m presenting an all day pre-con at SQL Day in Wroclaw Poland on May 16.

Feb 24 2016

Query Store and Recompile

One of the many advantages of SQL Cruise is the ability to have enough time during a presentation to be able to answer questions from the people there in great detail. One question came up while I was showing the new functionality of Query Store (available soon in SQL Server 2016, available right now in Azure SQL Database).

What happens to plan forcing when you have OPTION RECOMPILE on a query?

Great question. I have a favorite procedure I use to illustrate the functionality of parameter sniffing:

ALTER PROC dbo.spAddressByCity @City NVARCHAR(30)
AS
SELECT  a.AddressID,
        a.AddressLine1,
        a.AddressLine2,
        a.City,
        sp.Name AS StateProvinceName,
        a.PostalCode
FROM    Person.Address AS a
JOIN    Person.StateProvince AS sp
        ON a.StateProvinceID = sp.StateProvinceID
WHERE   a.City = @City;

If this procedure is called with the value of ‘Mentor’ you get an execution plan that looks like this:

[Image: execution plan for the ‘Mentor’ parameter value]

If you remove that plan from cache and then call the procedure with the value of ‘London’ then the plan looks like this:

[Image: execution plan for the ‘London’ parameter value]
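
By the way, removing that first plan from cache between the calls can be done with something like this (my sketch, not part of the original demo script):

-- Find the plan handle for the procedure and flush just that plan from cache.
DECLARE @PlanHandle VARBINARY(64);

SELECT  @PlanHandle = deps.plan_handle
FROM    sys.dm_exec_procedure_stats AS deps
WHERE   deps.object_id = OBJECT_ID('dbo.spAddressByCity');

IF @PlanHandle IS NOT NULL
    DBCC FREEPROCCACHE (@PlanHandle);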

It’s parameter sniffing in action. One of the cool pieces of functionality that comes with the Query Store is the ability to choose a plan and make that plan get used, regardless of the parameter values. You do this through the following command:

sys.sp_query_store_force_plan

You pass it a query_id and a plan_id  from the information in the Query Store and now that plan will get used. But, what happens when you have attempted to fix your query using a query hint, such as RECOMPILE? Let’s modify our procedure:

ALTER PROC dbo.spAddressByCity @City NVARCHAR(30)
AS
SELECT  a.AddressID,
        a.AddressLine1,
        a.AddressLine2,
        a.City,
        sp.Name AS StateProvinceName,
        a.PostalCode
FROM    Person.Address AS a
JOIN    Person.StateProvince AS sp
        ON a.StateProvinceID = sp.StateProvinceID
WHERE   a.City = @City
OPTION  (RECOMPILE);

If I run the procedure, regardless of the parameter values passed in, I get a new plan for each value. Nothing is stored in cache, but it is stored in the Query Store.

Let’s pull the query_id and plan_id out of Query Store for this procedure:

SELECT  qsq.query_id,
        qsp.plan_id,
        CAST(qsp.query_plan AS XML) AS sqlplan
FROM    sys.query_store_query AS qsq
JOIN    sys.query_store_plan AS qsp
        ON qsp.query_id = qsq.query_id
WHERE   qsq.object_id = OBJECT_ID('dbo.spAddressByCity');

Currently, on my system, this brings back four rows: two distinct query_id values and four different plan_id values. This is because, when I modified the text of the query within the procedure, I kept the object_id the same, but the query text is different (the RECOMPILE hint). This means a different record in the Query Store. We’ll use the newer query_id and pick one of the plans to force it:

EXEC sys.sp_query_store_force_plan
    42,
    44;

Now, no matter what value I pass it, the RECOMPILE hint is effectively ignored. I have the same plan every time (in my case, the London plan). Ah, but is the RECOMPILE hint ignored? In fact, no, it isn’t. If I capture the extended event sql_statement_recompile, I can see every single time I execute this procedure that it’s going through a full recompile… which is then tossed and the plan I chose is forced upon the optimizer. Every time.
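
For the curious, here’s a minimal sketch of an event session that will show those recompiles (the session name and file target are my choices; on Azure SQL Database you’d create it ON DATABASE rather than ON SERVER):

CREATE EVENT SESSION QueryStoreRecompile
ON SERVER
ADD EVENT sqlserver.sql_statement_recompile
ADD TARGET package0.event_file
    (SET filename = 'QueryStoreRecompile');

ALTER EVENT SESSION QueryStoreRecompile
ON SERVER
STATE = START;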

This isn’t a revelation. It makes sense. However, the Query Store represents one more thing that we have to think through.


If you want to talk query tuning, execution plans, the Query Store, and more, I’ll be presenting an all day pre-conference seminar at SQLDay in Wroclaw Poland on May 16th, 2016. Let’s talk.

Feb 22 2016

Why Is The Server Slow?

This is blog post #2 in support of Tim Ford’s (b|t) #iwanttohelp, #entrylevel.

If you haven’t been working in SQL Server for very long, you may not have got this phone call yet, but you will:

Hi, yeah, the server is slow. Thanks. Bye.

Let’s pretend for a moment that you know which server they’re referring to (because just finding out that piece of information can be a challenge). Now what?

The list of tools and mechanisms within SQL Server for gathering metrics is extremely long:

Performance Monitor
Dynamic Management Views & Functions
System Views
Extended Events
Trace Events
Activity Monitor
Data Collector
Execution Plans
3rd Party Tools

I’m leaving out lots of stuff in that list. So where do you start when you get this phone call? Where is the server slow?

The best place to start is by looking at the wait statistics.

A simple, and simplified, explanation for how processes work within SQL Server is that each process gets access to the various resources for a little while, then has to pass off access to another process, each of them working together to get the work done. But some processes take longer than others. When a process has to wait for anything in SQL Server, this information gets logged. Depending on the system you’re working with, you can access these wait statistics from one of three locations:

sys.dm_os_wait_stats: for SQL Server
sys.dm_db_wait_stats: for Azure SQL Database
sys.dm_pdw_nodes_os_wait_stats: for Azure SQL Data Warehouse

Getting information from these system views is extremely simple. Here’s an example query:

SELECT  *
FROM    sys.dm_os_wait_stats AS dows
ORDER BY dows.wait_time_ms DESC;

The results of this query look like this:

[Image: output of the wait statistics query]

The first column lists the wait types. The second column provides a count of the tasks that have had that wait type. The third column, wait_time_ms, is the amount of time in milliseconds that tasks have been waiting within the system, cumulative. Next is the maximum wait time that any one task has waited, max_wait_time_ms. Finally, in signal_wait_time_ms, you see the time that tasks have had to wait for access to the CPU (also known as time spent on the Runnable queue). The importance of a wait is not simply the time spent waiting, but also the number of tasks and the max time. Using all these values gives meaning to the individual wait.
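
Since the signal wait time is included in wait_time_ms, one way to look at it (a sketch of mine, not part of the original query) is to split the resource portion out from the CPU portion:

SELECT  dows.wait_type,
        dows.waiting_tasks_count,
        dows.wait_time_ms - dows.signal_wait_time_ms AS resource_wait_time_ms,
        dows.signal_wait_time_ms,
        dows.max_wait_time_ms
FROM    sys.dm_os_wait_stats AS dows
ORDER BY resource_wait_time_ms DESC;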

Here is where things actually get difficult. The wait types are arcane, difficult to understand, and difficult to interpret. Further, a lot of the wait types actually don’t mean anything at all. The waits are not indicative of an issue. So, while the query above is simple, the results it provides are very weak. Instead, I strongly recommend you use the query provided by Paul Randal, located here. It will filter out the wait statistics that you shouldn’t care about.

Now, you have a meaningful list of wait statistics that will tell you exactly why, if not where, your server is running slow. Unfortunately, these waits still need to be interpreted. If you read further on Paul’s blog, you’ll see he has a number of waits and their causes documented. That’s your best bet to start understanding what’s happening on your system (although, I hear, Paul might be creating a more complete database of wait stats. I’ll update this blog post should that become available).

One other thing to consider: these waits are since the last time the server was started (or failed over, or the database was failed over in Azure, or the values were reset). This means that simply looking at the list doesn’t necessarily give you enough information. Instead, running this more than once during a day can show you what’s been slow over time by comparing the two data sets. You can also use sys.dm_exec_session_wait_stats to see the waits for any given session, if you want to know what a particular user or process is experiencing. Don’t just look at the list and think you’re done. The best thing to do is get to a point of proactive monitoring (for this, monitoring tools make it easier).
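
Here’s a sketch of that session-level view (the session_id is just an example; plug in whichever session you care about):

SELECT  desws.wait_type,
        desws.waiting_tasks_count,
        desws.wait_time_ms,
        desws.signal_wait_time_ms
FROM    sys.dm_exec_session_wait_stats AS desws
WHERE   desws.session_id = 52
ORDER BY desws.wait_time_ms DESC;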

So, when you get the phone call that says the server is slow, you know how to get started understanding exactly why that may be the truth.

Feb 18 2016

Azure SQL Database Error

I was on SQL Cruise, where I was scheduled to present a session on Azure SQL Database. I recorded all my demonstrations before we went to sea, but I planned to attempt live demos. Yeah, yeah. Well, anyway, I got a unique error when I attempted to connect:

Forced connection closes from remote host

That’s a partial message from the whole error. I don’t have a good screen capture. I wasn’t able to find anything on it through Ging searches, but this week I was at Microsoft for a training course on Azure. I asked the room. The rough answer is (paraphrasing):

The IP address I was attempting to connect from is not on the approved list

Interesting. I didn’t realize there were blackout zones. The really interesting thing was, this only occurred when I attempted to connect through SQL Server Management Studio. I was able to connect through the Azure Portal. One more wrinkle in the interesting challenges offered by Azure and the Data Platform.

Feb 15 2016

Loading Data into Azure SQL Data Warehouse

Ouch.

Let’s start with the level set. I’m not an ETL expert. In fact, I haven’t done any professional ETL work for several years. My skills are, at best, rusty. With this in mind, I knew I’d have a hard time extracting data from a local database in order to move it up to Azure SQL Data Warehouse. I expected to be fumbling and slow and to constantly feel more than a little stupid as I made one mistake after another. All of this came to pass.

Yet… OMG! THAT WAS DIFFICULT!

Here’s how I started. I defined a bcp command for the tables I was interested in. I ensured it was working correctly, then wrote a little PowerShell script so I could supply a list of tables and get them all exported at once. Here’s an example:

$tables | foreach-object {
$filename = $_.Substring($_.IndexOf(".") + 1,($_.Length - ($_.IndexOf(".") + 1)))
bcp $_ out "c:\data\bcp\$filename.rtf" -S'WIN-3SRG45GBF97\dojo' -d'ODS' -T -n -c 
}

That worked fine. Next up, I used AzCopy to load the files I created up to blob storage. Again, a little PowerShell work to move all the files at once:

$tables | foreach-object {
.\AzCopy.exe /Source:c:\data\bcp /Dest:"https://xxx.blob.core.windows.net/sales/data/$_" /DestKey:xx /Pattern:"$_.rtf" /Z:"C:\data\bcp" /Y
}

No issues. Well, OK, I screwed up both these commands like 20 times until I got them right and then did the quick automation. However, the screw-ups were mine, not external issues. Next up, setting up external tables. There’s a good article on how to do this provided by Microsoft. Up until I tried to create the table, everything worked just fine. Then I saw this:

Msg 104051, Level 16, State 1, Line 11
Unsupported data type error. Statement references a data type that is unsupported in Parallel Data Warehouse, or there is an expression that yields an unsupported data type. Modify the statement and re-execute it.

Oh boy.

Luckily, there’s another blog post from Microsoft on getting the schema right. NUMERIC out and DECIMAL in, I got my next error:

Msg 105001, Level 16, State 1, Line 11
CREATE EXTERNAL TABLE failed because the location URI could not be reached. Revise the URI and try again.

My fault. I tried to get fancy and did this:

create external data source AzureStoreageGF
with (type = Hadoop,
location='wasbs:[email protected]', 
credential=AzureStorageCredential);

Changing it to just [email protected] and then making my WITH statement on the CREATE EXTERNAL TABLE command read like this:

WITH (
LOCATION='data/Account/',
DATA_SOURCE=AzureStorage,
FILE_FORMAT=TextFile);
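
For context, the full set of statements ended up shaped roughly like this (a sketch; the file format definition and the column list are illustrative, not the real Account schema):

-- bcp -c writes tab-delimited files, so the external file format matches that.
CREATE EXTERNAL FILE FORMAT TextFile
WITH (
	FORMAT_TYPE = DELIMITEDTEXT,
	FORMAT_OPTIONS (FIELD_TERMINATOR = '\t'));

CREATE EXTERNAL TABLE dbo.Account_External (
	AccountID DECIMAL(18, 2),
	AccountName NVARCHAR(1300)
	-- ...plus the rest of the columns, matching the exported file exactly
	)
WITH (
	LOCATION='data/Account/',
	DATA_SOURCE=AzureStorage,
	FILE_FORMAT=TextFile);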

Worked! I’m a winner. All I have to do now is query from the Polybase table I just created.

Msg 107090, Level 16, State 1, Line 132
Query aborted– the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed.
(/data/Account/Account.rtf)Column ordinal: 1, Expected data type: NVARCHAR(1300) collate SQL_Latin1_General_CP1_CI_AS, Offending value: xxx (Tokenization failed), Error: Not enough columns in this line.

RASSIN FRASSIN… Except I didn’t use those words. I substituted some other Anglo-Saxon terms. Anyway, what’s it mean? It means, probably, although it’s hard to know, that I must have messed up the data export in some fashion. Or that there’s something in the data that’s breaking the interpretation of the code…. only I can’t find it. So, old school, I opened this up in Excel and found two columns that had carriage returns in them. Fixed that. Going again:

Query aborted– the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 33 rows processed.
(/data/Account/Account.rtf)Column ordinal: 0, Expected data type: DECIMAL(18,2), Offending value: , Error: UTF8 decode failed.

A WIN! What? You didn’t see the win? You’re looking at the error message. 33 rows processed. Yeah, that’s out of 96,000 rows, but it’s more than 1. That means that 32 rows processed successfully. Now, you might be wondering about that file format. Last time I got that was because I had bcp out to a non-character format. That suggests there’s a data type in there that is giving me more problems in Polybase. Back to SQL Server to see what’s up with this data (because everything worked perfectly in Excel this time). And found it. This is the problem:

‘å’

I’m using bcp. It goes to a UTF-16 file format. Polybase supports the Hadoop standard of UTF-8. This must be blocking it……. Great. There it is in black & white:

bcp does not support exporting to UTF-8. This may prevent using PolyBase on bcp exported data

Fine. PowerShell to the rescue:

Get-ChildItem -rec -fi *.rtf | %{(get-content $_ -enc string) | set-content -enc utf8 $_.fullname}

This time I processed 197 rows before I found another problem:

Query aborted– the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 197 rows processed.
(/data/Account/Account.rtf)Column ordinal: 110, Expected data type: INT NOT NULL, Offending value: |45178|1043866 (Tokenization failed), Error: Too many columns in the line.

What follows is a tedious and dull march through various & sundry text columns that, for some strange reason, have a pipe (|) character in them (why the phone number, by Odin’s Eye Patch, why the phone number). Suddenly… I have 96,000 rows returning from my COUNT(*) query (using Visual Studio; SSMS was twitchy when connected to ADW). Run a CREATE TABLE AS SELECT (CTAS) query and I have a new table. Ta-da.
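
That last step is CTAS; a sketch of the shape of it (the distribution choice and the table names here are mine):

CREATE TABLE dbo.Account
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT  *
FROM    dbo.Account_External;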

Let’s bottom line this. Data is dirty. Any ETL process is likely to expose the trashy data, so what are the real issues here? First up, PowerShell is my bestest buddy. Scripting everything through Posh right out of the gate made a huge difference in my ability to constantly reset. The fact that our most common processes spit out UTF-16, while not a problem, is something you need to get on top of (see Posh above). AzCopy works really well and the command line is easy to implement, but you’ll need to go through the process of setting up all the appropriate connections. Polybase is wicked cool. Yes, I had to hop through hoops to get the data clean, but, as you should note in the descriptions above, at no point did I need to go and recreate my EXTERNAL TABLE from Azure SQL Data Warehouse. I just kept updating the files through my scripts, and as soon as they were loaded up to blob storage, I ran a query against them. No additional work required. That’s nice. You’re going to want to go through your data types on your data before you start the migration. You’re going to have to plan for a lot of time cleaning your data if you haven’t already.

In summary, this process is well documented, but has a lot of moving parts and lots of places for things to break down. If you’re like me and ETL is a distant memory, plan for a very steep learning curve. If you’re good at ETL, but not so hot at automation, your mouse clicking finger is going to get tired. Microsoft has a decent overview of the process posted here.

Now, I need to get started cleaning the data on the next table… pity me.

(Kidding. I’m having a blast)