Friday, September 30, 2022

Generating Repeatable Sets Of Test Rows

In order to test graph structures, I needed a large set of random data. In some ways, this data will resemble the IMDB database I will include later in this chapter, but to make it both controllable in size and random, I created this random dataset. I loaded a set of values for accounts into one table and a set of interests into another. I then wanted to be able to load a set of random data into edges, relating one account to another (one follows the other), and relating an account to a random set of interests.
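For context, the kind of structure being loaded looks roughly like the following sketch using SQL Server graph tables. These are illustrative names only, not the actual tables from the book:

CREATE TABLE Demo.Account
(
    AccountId INT NOT NULL CONSTRAINT PKAccount PRIMARY KEY,
    AccountHandle NVARCHAR(30) NOT NULL
) AS NODE;

CREATE TABLE Demo.Interest
(
    InterestId INT NOT NULL CONSTRAINT PKInterest PRIMARY KEY,
    InterestName NVARCHAR(30) NOT NULL
) AS NODE;

--one account follows another account
CREATE TABLE Demo.Follows AS EDGE;
--an account is interested in an interest
CREATE TABLE Demo.InterestedIn AS EDGE;

The random data generated later in this article is what would be inserted into edge tables like Demo.Follows and Demo.InterestedIn.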

In this article, I will discuss a few techniques I used, starting with the simplest method, ordering by NEWID(), and then using RAND() to allow the code to generate a fixed set of data.

The Setup

As the base data, I am going to create a table named Demo.FetchRandomDataFrom that will serve as my stand-in for any table you want to randomly fetch data from.

USE tempdb; 
GO
CREATE SCHEMA Demo;
GO
SET NOCOUNT ON;
DROP TABLE IF EXISTS Demo.FetchRandomDataFrom;
GO
CREATE TABLE Demo.FetchRandomDataFrom
(
    --this value should be the primary key of your table, 
    --ideally using the clustered index key for best 
    --performance, but definitely unique.
    FetchRandomDataFromId INT IDENTITY(2,4) 
        CONSTRAINT PKFetchRandomDataFrom PRIMARY KEY,
    SomeOtherData CHAR(1000) NOT NULL
);
GO
--Make the rows of a consequential size. A small row size is often
--an issue with random sets of data you want to use, because 
--it performs differently than real tables.
INSERT INTO Demo.FetchRandomDataFrom
( SomeOtherData )
VALUES (REPLICATE('a',1000));
GO 100000 --create 100000 rows

Let’s take a look at a few rows in my table. Note that I made the surrogate key start at 2 and increment by 4. The goal was to make good and sure that the data was obviously not a sequential number. If your keys are essentially sequential, you may not need some of the steps I have included, but unless the set is perfect and without gaps, the gaps will come into play when trying to produce repeatable outcomes from the same data.

SELECT TOP 10 *
FROM   Demo.FetchRandomDataFrom
ORDER BY FetchRandomDataFromId

This returns:

FetchRandomDataFromId SomeOtherData
--------------------- ----------------------------------
2                     aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
6                     aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
10                    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
14                    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
18                    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
22                    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
26                    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
30                    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
34                    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
38                    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

This will be my stand-in for a more realistic set of data where SomeOtherData would actually have been numbers, dates, values, etc.

Using NEWID()

If you just need a random set of rows from a table, you can simply do something like this:

SELECT TOP 10 FetchRandomDataFromId
FROM   Demo.FetchRandomDataFrom
ORDER BY NEWID();

This just sorts the data in a random order based on the NEWID() GUID values being generated. It is a wonderful technique you can use with almost any set of data. Executing the previous statement will return a different random set of rows each time you execute it, such as:

FetchRandomDataFromId
---------------------
177818
138446
272154
171854
195266
105330
314422
203530
395666
365046

There is also data sampling syntax built into SQL Server (the TABLESAMPLE clause), which you can read about in an article by Derek Colley, but as that article also notes, if you want a truly random set, most basic methods use NEWID() to get truly random outputs.
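As a quick illustration of that sampling syntax (just a sketch; TABLESAMPLE samples data pages rather than rows, so the row count is approximate, and REPEATABLE only returns the same sample for a given seed until the data changes):

--roughly 1000 rows; sampling is page based, so the count is approximate
SELECT FetchRandomDataFromId
FROM   Demo.FetchRandomDataFrom TABLESAMPLE (1000 ROWS);

--REPEATABLE returns the same sample for a given seed value, until the data changes
SELECT FetchRandomDataFromId
FROM   Demo.FetchRandomDataFrom TABLESAMPLE (1000 ROWS) REPEATABLE (42);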

The biggest pro of that method is that it is simple and very much random. However, there are two downsides to this method. First, no duplicated values will be found in the set. For most cases, that is great, but in other cases you may want the chance of the same value being picked multiple times (in my real case, I absolutely want the same node to end up connected to more than one node). Second, it is non-repeatable. If you send this code to someone else, they will always get a different answer. Obviously, this may or may not be an issue, but I can’t count how many times I have generated a set, found something interesting about it, and then lost the set.

Repeatable Random Values

In order to make a random enough set that was repeatable, I decided to take advantage of the RAND() function and one of its most interesting capabilities: it isn’t truly random, because you can pass a number into it as a seed, and then all of the subsequent values generated are deterministic.

For example, execute the following code on any SQL Server:

SELECT RAND(1)
UNION ALL
SELECT RAND()
UNION ALL
SELECT RAND()
UNION ALL
SELECT RAND()
UNION ALL
SELECT RAND()
UNION ALL
SELECT RAND()

The output will be:

----------------------
0.713591993212924
0.472241415009636
0.943996112912269
0.0088524383111658
0.343443392823765
0.210903950196124

I got the same output on SQL Server 2019 and 2022 RC0. Even if you execute RAND(1) and then, an hour later, the rest of the code, the “random” values that are output will be the same. So, in the following code, I will use this by letting the user set a seed (or just take a random seed when exploring data sets, but then telling the user the seed in case they wish to use it again). The following code builds on this concept, using RAND() and loading a temp table of random data.

It is a lot of code, but I included quite a few comments to make it clear:

DROP TABLE IF EXISTS #HoldFetchRandomDataFrom
GO
CREATE TABLE #HoldFetchRandomDataFrom
(
    FetchRandomDataFromId INT PRIMARY KEY
)
DECLARE @FetchRandomDataFromId INT, --hold the fetched value
        @RowsLoopCounter INT = 0, --number of rows fetched
        @LoopCounter INT = 0, --number of loops made, 
                              --to stop if too many
        @RowsInTableCount INT, --used to scale the 
                               --RAND value
        @SafetyFactor NUMERIC(3,1) = 2, --this times the desired row 
                                        --count gives the max loops
        @DesiredRowCount INT = 100, --number of rows you want 
                                    --in the output
        @Seed INT = 1; --set a specific value if you want 
                       --repeatability

--default the seed if one was not set (this gives a repeatable seed; 
--note that seeding RAND affects the connection, so it will carry into 
--the next execution unless you reconnect or set a new seed)
SET @Seed = COALESCE(@Seed, 2147483647 * RAND());
--sets the seed to start generating
DECLARE @SetSeed FLOAT = RAND(@Seed);

SELECT CONCAT('Seed used: ', @Seed);

--load all of the data into this table. The PrepId column will 
--hold a sequential value to make random row selection easier
DROP TABLE IF EXISTS #PrepFetchRandomDataFrom 
CREATE TABLE #PrepFetchRandomDataFrom
(
     --Start at 0 because RAND can produce 0
        PrepId INT IDENTITY(0,1), 
        FetchRandomDataFromId INT NOT NULL
)
INSERT INTO #PrepFetchRandomDataFrom
(
    FetchRandomDataFromId
)
SELECT FetchRandomDataFromId
FROM  Demo.FetchRandomDataFrom;
--the random row number will be this count times RAND(). (It has to be 
--captured generically rather than hardcoded because your desired row 
--count could be greater than the number of rows in the table.)
SET @RowsInTableCount = (SELECT COUNT(*) 
                         FROM #PrepFetchRandomDataFrom);
WHILE 1=1
 BEGIN
        --this tells you the number of loops done.
        SET @Loopcounter = @LoopCounter + 1;
        --this way, if some error keeps happening, you will 
     --eventually stop
        IF @LoopCounter >= @DesiredRowCount * @SafetyFactor
          BEGIN
                 RAISERROR('Loop stopped due to safety counter',10,1); 
                 BREAK;
          END;
        --pick one random row number (RAND() scaled to the row count)
        SELECT @FetchRandomDataFromId = RAND() * @RowsInTableCount;
        --do the insert in a try. You may have criteria in your 
     --table that needs to be met (uniqueness, for example)
        BEGIN TRY
        --join to get the real value from the table
        INSERT INTO #HoldFetchRandomDataFrom(FetchRandomDataFromId)
        SELECT FetchRandomDataFromId
        FROM    #PrepFetchRandomDataFrom
        WHERE  PrepId = @FetchRandomDataFromId
        
        --this is set inside the TRY, because the row has been 
        --added successfully
        SET @RowsLoopCounter = @RowsLoopCounter + 1;
        --end the loop if the rows inserted match the desired 
        --rowcount
        IF @RowsLoopCounter = @DesiredRowCount
         BREAK;
        END TRY
        BEGIN CATCH
           --most likely error is a duplicated value
           DECLARE @msg NVARCHAR(1000);
           --display the message in case you do something such 
           --that the loop can never complete
           SET @msg = CONCAT('Some error occurred:' 
                                             ,ERROR_MESSAGE())
           RAISERROR(@msg,10,1) WITH NOWAIT; --immediate message 
           --Note, we keep going. This is what the safety factor 
          --was for, because an error could cause infinite loop
        END CATCH;
END;

Output the rows:

SELECT *
FROM   #HoldFetchRandomDataFrom
ORDER BY FetchRandomDataFromId;

Now since I set the seed value to 1, you will see the same output:

-----------------------
Seed used: 1

FetchRandomDataFromId
---------------------
1670
2870
3542
4606
6982
10626
12846

Clearly this is a LOT more code than simply TOP 10 … ORDER BY NEWID(). Am I saying this is always necessary? Not at all. But it is a technique you can use to generate a repeatable, random set of data (as random as you need it to be).

For example, you could test some code with a random set of data and find an error. Instead of having to save that data off for the next person, you could just send them the script and the seed. My reason for doing this is a book I am working on. I want to be able to send a set of edges to the reader without including a huge script of literal values. So I needed two correlated random values, and this method made that pretty easy.

Multiple Related Repeatable Random Values

What made me first try this was that I wanted to generate edges, which need a from value and a to value. When I tried to do this with the NEWID() method in a loop 300,000 times, it took close to forever to complete (I gave up at 150,000 rows after it had executed all night).

That is how this next piece of code was written:

DROP TABLE IF EXISTS #HoldFetchRandomDataFrom
GO
CREATE TABLE #HoldFetchRandomDataFrom
(
         FetchRandomDataFromId1 INT NOT NULL,
         FetchRandomDataFromId2 INT NOT NULL,
         --no duplicate rows
         PRIMARY KEY (FetchRandomDataFromId1,
                                 FetchRandomDataFromId2),
         --no self relationship. Needs to be different values
         CHECK (FetchRandomDataFromId1 <> FetchRandomDataFromId2)
)
DECLARE @FetchRandomDataFromId1 INT, --hold the first fetched 
                                     --value
        @FetchRandomDataFromId2 INT, --the second fetched value
        @RowsLoopCounter INT = 0, --number of rows fetched
        @LoopCounter INT = 0, --number of loops made, to stop if 
                              --too many
        @RowsInTableCount INT, --used to scale the RAND 
                               --value
        @SafetyFactor NUMERIC(3,1) = 2, --this times the desired row 
                                        --count gives the max loops
        @DesiredRowCount INT = 100, --number of rows you want 
                                    --in the output
        @Seed INT = 1; --set a specific value if you want 
                       --repeatability

--default the seed if one was not set (this gives a repeatable seed; 
--note that seeding RAND affects the connection, so it will carry into 
--the next execution unless you reconnect or set a new seed)
SET @Seed = COALESCE(@Seed, 2147483647 * RAND());
--sets the seed to start generating
DECLARE @SetSeed FLOAT = RAND(@Seed);
SELECT CONCAT('Seed used: ', @Seed);
--load all of the data into this table. The PrepId column will 
--hold a sequential value to make random row selection easier
DROP TABLE IF EXISTS #PrepFetchRandomDataFrom 
CREATE TABLE #PrepFetchRandomDataFrom
(
      --Start at 0 because RAND can produce 0
        PrepId INT IDENTITY(0,1), 
        FetchRandomDataFromId INT NOT NULL
)
INSERT INTO #PrepFetchRandomDataFrom
(
    FetchRandomDataFromId
)
SELECT FetchRandomDataFromId
FROM  Demo.FetchRandomDataFrom;
--the random row number will be this count times RAND(). (It has to be 
--captured generically rather than hardcoded because your desired row 
--count could be greater than the number of rows in the table.)
SET @RowsInTableCount = (SELECT COUNT(*) 
                         FROM #PrepFetchRandomDataFrom);
WHILE 1=1
 BEGIN
        --this tells you the number of loops done.
        SET @Loopcounter = @LoopCounter + 1;
        --this way, if some error keeps happening, you will 
     --eventually stop
        IF @LoopCounter >= @DesiredRowCount * @SafetyFactor
          BEGIN
                 RAISERROR('Loop stopped due to safety counter',10,1); 
                 BREAK;
          END;
    --pick two random row numbers (RAND() scaled to the row count)
    SELECT @FetchRandomDataFromId1 = RAND() * @RowsInTableCount;
    SELECT @FetchRandomDataFromId2 = RAND() * @RowsInTableCount;
        --do the insert in a try. You may have criteria in your 
     --table that needs to be met (uniqueness, for example)
        BEGIN TRY
        --join to get the real value from the table
        INSERT INTO #HoldFetchRandomDataFrom
              (FetchRandomDataFromId1,FetchRandomDataFromId2)
        VALUES((SELECT FetchRandomDataFromId
                        FROM    #PrepFetchRandomDataFrom
                        WHERE  PrepId = @FetchRandomDataFromId1),
                                        (SELECT FetchRandomDataFromId
                                        FROM    #PrepFetchRandomDataFrom
                                        WHERE  PrepId = 
                                   @FetchRandomDataFromId2));
        
        --this is set inside the TRY, because the row has 
        --been added successfully
        SET @RowsLoopCounter = @RowsLoopCounter + 1;
        --end the loop if the rows inserted match the desired 
        --rowcount
        IF @RowsLoopCounter = @DesiredRowCount
         BREAK;
        END TRY
        BEGIN CATCH
                --most likely error is a duplicated value
                DECLARE @msg NVARCHAR(1000);
                --display the message in case you do something such 
          --that the loop can never complete
                SET @msg = CONCAT('Some error occurred:' 
                                             ,ERROR_MESSAGE())
                RAISERROR(@msg,10,1) WITH NOWAIT; --immediate message 
                --Note, we keep going. This is what the safety factor 
          --was for, because an error could cause infinite loop
        END CATCH
END;
--Output the rows:
SELECT *
FROM   #HoldFetchRandomDataFrom
ORDER BY FetchRandomDataFromId1;

Executing this returned:

---------------------
Seed used:1
FetchRandomDataFromId1 FetchRandomDataFromId2
---------------------- ----------------------
3542                   137378
4606                   301534
9346                   31374
10626                  110734
14702                  106954
14978                  279338

If you do a little digging, you can see the values from the previous generation in here since I used the same seed:

SELECT FetchRandomDataFromId1
FROM #HoldFetchRandomDataFrom
UNION ALL
SELECT FetchRandomDataFromId2
FROM #HoldFetchRandomDataFrom
ORDER BY FetchRandomDataFromId1

This returns:

FetchRandomDataFromId1

----------------------
1670
2870
3542
4606
6982
9346
10626

And the first, single-value query returned (copied from earlier in the document):

FetchRandomDataFromId
---------------------
1670
2870
3542
4606
6982
10626
12846

You can see a new value in the sequence (which makes sense because there are 2X the number of random values generated), but otherwise the same values.

 

The post Generating Repeatable Sets Of Test Rows appeared first on Simple Talk.




Thursday, September 22, 2022

Mighty Tester: Why it doesn’t need to be fixed…

Image showing comic strip. In each frame, the tester is speaking with another developer. She says "I found a bug." Dev1: "It was working yesterday." Dev2: "It works on my machine" Dev3: "It's not a bug, it's a feature" Dev4: "It's not in the requirements." Dev5: "It's only a nice-to-have" Dev6: "Did you try restarting your machine?" Dev7: "The user would never do that" Finally in the last frame, the tester says to herself: "And that concludes our developer bingo"

Commentary Competition

Enjoyed the topic? Have a relevant anecdote? Disagree with the author? Leave your two cents on this post in the comments below, and our favourite response will win a $50 Amazon gift card. The competition closes two weeks from the date of publication, and the winner will be announced in the next Simple Talk newsletter.

The post Mighty Tester: Why it doesn’t need to be fixed… appeared first on Simple Talk.




Wednesday, September 14, 2022

T-SQL Tuesday #154 – SQL Server 2022, IS DISTINCT FROM

Despite my recent job change from full-time T-SQL code jockey/architect to website editor for Simple-Talk, I will always be a T-SQL programmer at heart. While most of the code I write professionally in the foreseeable future will be in support of an article, I will also continue writing code for several databases that I use for multiple personal reasons, housed on a SQL Server Express server on the desk beside me.

The Invitation: T-SQL Tuesday #154 Invitation – SQL Server 2022 (glennsqlperformance.com)

When I saw that the topic of this T-SQL Tuesday was to write about what I have been doing with SQL Server 2022, I figured I would note the two things I have been doing recently. First, I wrote a chapter about security changes in SQL Server 2022 for a forthcoming book entitled “SQL Server 2022 Administration Inside Out” for Pearson, with a great group of people; it should be out later this year, or early next at the latest. There are a few things I found out writing that chapter that I am keen to use, one of them being the more granular UNMASK permissions for the Dynamic Data Masking feature, but since I won’t be writing multi-user production code in the future, I probably won’t be masking any data, much less need granular masking capabilities.

The other thing I have been doing is trying out some of the new features coming in SQL Server 2022 that I will absolutely be using, even in my hobby databases. There are tons of new features in 2022, as there always are. But the ones that excite me are the T-SQL improvements. In this article I am going to highlight one feature that is immediately a standout.

IS (NOT) DISTINCT FROM

The MOST exciting change from a T-SQL standpoint is IS NOT DISTINCT FROM. This feature solves an age-old issue for T-SQL programmers and is worth its weight in gold. It is basically an equality comparison operator like =, but it treats NULL as an individual value. Unlike =, this new operator returns only TRUE or FALSE, never UNKNOWN. Writing queries that compare values that can contain NULL is tedious, mostly because of code like the following:

SELECT CASE WHEN 1 = NULL THEN 'True' ELSE 'False' END,
       CASE WHEN NOT(1 = NULL) THEN 'True' ELSE 'False' END;

The fact that both of these comparisons return False is confusing at times even to me, and I have written about comparing to NULL and negating NULL comparisons about as many times as I have dealt with it in production code. But using IS DISTINCT FROM, this is no longer the case:

SELECT CASE WHEN 1 IS DISTINCT FROM NULL 
               THEN 'True' ELSE 'False' END,
       CASE WHEN NOT 1 IS DISTINCT FROM NULL
               THEN 'True' ELSE 'False' END;
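And to complete the picture, two NULL values are considered not distinct from each other, so this small extra example (added here for illustration) returns True:

SELECT CASE WHEN NULL IS NOT DISTINCT FROM NULL
               THEN 'True' ELSE 'False' END;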

Where this is really important is when doing a query where you are looking for differences between two sets of data (often for a merge type operation). So consider the following query against the Application.Cities table (from the WideWorldImporters sample database):

SELECT COUNT(*), 
       SUM(CASE WHEN LatestRecordedPopulation IS NULL 
           THEN 1 ELSE 0 END)
FROM Application.Cities;

This returns 37940 total rows and 11048 rows with a NULL population value. Now, let’s join the table to itself on the PK and the population value:

SELECT *
FROM   Application.Cities
          JOIN Application.Cities AS C2
            ON C2.CityID = Cities.CityID
WHERE  C2.LatestRecordedPopulation 
                         = Cities.LatestRecordedPopulation;

This returns 26892 rows which, if you do the math, is 37940 - 11048. Looking at this without thinking about NULL values (who does, initially?), you would expect it to return every row in the table. But clearly it does not. Usually this becomes obvious when a few customers living in one of those cities aren’t showing up on a report (or maybe even not getting their shipments).

The pre-SQL Server 2022 way of handling this properly was to do something like this:

SELECT *
FROM   Application.Cities
          JOIN Application.Cities AS C2
               ON C2.CityID = Cities.CityID
                  AND (C2.LatestRecordedPopulation = 
                                      Cities.LatestRecordedPopulation 
                    OR (C2.LatestRecordedPopulation IS NULL
                        AND Cities.LatestRecordedPopulation IS NULL));

Now we have checked that either they are the same value, or they both have a value of NULL. This query returns every row in the table, but it is kind of tricky code. And looking for differences is even more difficult, because you have to check whether the values are different, whether column1 is NULL and column2 is not, and vice versa. Another way this is often done is to change the population comparison to:

AND COALESCE(C2.LatestRecordedPopulation,-100) = 
             COALESCE(Cities.LatestRecordedPopulation,-100)

This is safe from a correctness standpoint (assuming you can coalesce your values to something that cannot possibly occur in the data), but not from a performance one. It eliminates index seeks on these columns and makes the query slower. That isn’t always an issue, but for larger data sets you may end up with more scans than you hoped.

Using the new syntax, we can simply write this as:

SELECT *
FROM   Application.Cities
         JOIN Application.Cities AS C2
           ON C2.CityID = Cities.CityID
              AND C2.LatestRecordedPopulation 
                    IS NOT DISTINCT FROM 
                           Cities.LatestRecordedPopulation;

The name of the operator might be a little confusing because of the words FROM and DISTINCT, but it really does make sense. DISTINCT has a seemingly different usage here, but it is really the same meaning: if two values are the same, they are not distinct from one another, and if they are different, they are distinct. And DISTINCT in the SELECT clause treats NULL values as a single bucket too. Now, go back and change the previous query to IS DISTINCT FROM and 0 rows will be returned.
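That query (just the previous one with the operator swapped) would look like this; because the self-join is on the primary key, no row can have a population value that differs from its own, so nothing is returned:

SELECT *
FROM   Application.Cities
         JOIN Application.Cities AS C2
           ON C2.CityID = Cities.CityID
              AND C2.LatestRecordedPopulation 
                    IS DISTINCT FROM 
                           Cities.LatestRecordedPopulation;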

This feature would have saved me many, many hours over the years! Is this alone a reason to upgrade to SQL Server 2022? Since I have never been the one to write those checks, and I use the free Express edition for my hobby databases… I can give an unqualified “Yes” to that!

 

The post T-SQL Tuesday #154 – SQL Server 2022, IS DISTINCT FROM appeared first on Simple Talk.




Monday, September 5, 2022

Synapse Analytics: When you should use (or not use) Synapse

Synapse is a great data lakehouse tool. This means that in a single tool we have resources to manage both a data lake and a data warehouse.

The Synapse Serverless Pool is great for managing data lakes, and at a great price: around US$5.00 for each TB of data queried. This makes it a great choice.
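For instance, the serverless pool can query files sitting in the lake directly with OPENROWSET; something roughly like this sketch (the storage account and path here are made up for illustration):

SELECT TOP 100 *
FROM OPENROWSET(
        BULK 'https://mydatalake.dfs.core.windows.net/files/sales/*.parquet',
        FORMAT = 'PARQUET'
     ) AS SalesRows;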

For the data warehouse, on the other hand, it’s a bit different. Before Synapse, there was Azure SQL Data Warehouse. This product was rebranded as the SQL Dedicated Pool. The change of product name helped put to rest an old misconception: people believed that a data warehouse in the cloud required Azure SQL Data Warehouse.

The misconception was reduced, but a lot of people still think they need to use Synapse for a data warehouse. This is a mistake. A data warehouse is not tied to a tool; it’s a database. Some tools may be better or worse for hosting the data warehouse. In relation to Synapse, the usual answer from every specialist applies: “It depends”.

What’s the Synapse SQL Dedicated Pool

Understanding what the SQL Dedicated Pool is will be the key to understanding when you need it and when you don’t. The most common misconception is to believe the SQL Dedicated Pool is just a simple database. It’s not.

The SQL Dedicated Pool is an MPP tool. MPP stands for Massively Parallel Processing. This is an architecture intended to break a processing request down across many processing nodes, join the results from the individual nodes, and return the final result. It’s much more than simple parallel processing; it’s distributed processing broken down across many physical nodes.

 

The data is stored in a structure optimized for MPP execution. What you may see as a single database is in fact a total of 60 databases. Each table is broken down into 60 pieces. This can be a great organization for MPP execution, but there are more details to analyse.
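You can see this for yourself in a dedicated pool; DBCC PDW_SHOWSPACEUSED, for example, returns one row per distribution for a table (the table name here is just an illustration):

--returns 60 rows, one per distribution, showing row counts and space used
DBCC PDW_SHOWSPACEUSED('dbo.FactSale');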

Synapse Dedicated Pool Service level

The service level of the SQL Dedicated Pool defines the number of physical nodes the dedicated pool will have. Any service level below DW 1000 will use only one node. DW 1000 is the first level with 2 nodes and the number grows up to 60 nodes.

The number of databases used to break down the data is fixed: always 60. These databases are spread among the physical nodes the dedicated pool contains. For example, if the dedicated pool has two nodes (DW 1000), each node will contain 30 databases.

The first important point is the service level: you should never use a dedicated pool with less than DW 1000.

 

Below DW 1000, there is no real MPP. However, your data is still optimized for MPP: it is broken down into 60 pieces. Without real MPP execution, you will be losing performance instead of gaining it.

If for any reason you believe you need a SQL Dedicated Pool below DW 1000 in production, that’s because you don’t need a SQL Dedicated Pool at all; a simple Azure SQL Database may do the job.

Data Volume

The Synapse documentation makes an interesting recommendation: never use a clustered columnstore index on a table with fewer than 60 million records in Synapse. If we analyse this recommendation, we may discover some rules about when to use or not use Synapse.

 

First, it’s important to better understand the columnstore indexes and how important they are.

Let’s look at some columnstore index features and behaviours:

  • They are specially optimized for analytical queries
  • The index is compressed
  • They are in memory
  • Allow batch mode execution and optimization

Columnstore indexes are very important for data warehouses. We would be losing a lot by not using them. So why doesn’t Synapse recommend them for tables smaller than 60 million rows?

The physical structure of a columnstore index uses segments of roughly 1 million rows each. A segment is “closed” and compressed once it reaches a million records. That’s the reason for the 60 million recommendation: Synapse breaks each table down into 60 pieces, and 60 pieces with 1 million rows each makes a total of 60 million. Any table with fewer than 60 million rows will have fewer than 1 million rows in each of its pieces, resulting in incomplete segments and hurting the effectiveness of the columnstore index.

Is this a rule to always be followed? Not at all; that’s why it’s just a recommendation. A data warehouse uses star schemas, and usually the fact table is way bigger than the dimension tables. It’s ok to have some small dimension tables around a very big fact table; that is not a problem. The problem is when your entire model has no table of the size which requires Synapse for better processing.
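In practice, this often translates into table design choices something like the following sketch (hypothetical table names): a large fact table gets a hash distribution and a clustered columnstore index, while a small dimension is replicated and given a regular clustered index instead.

--large fact table: hash distributed, clustered columnstore
CREATE TABLE dbo.FactSale
(
    SaleKey BIGINT NOT NULL,
    CustomerKey INT NOT NULL,
    SaleAmount DECIMAL(18,2) NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX);

--small dimension table: replicated, no columnstore
CREATE TABLE dbo.DimCustomer
(
    CustomerKey INT NOT NULL,
    CustomerName NVARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED INDEX (CustomerKey));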

Consumers

Synapse can be very good as a central data source for BI systems such as Power BI, Qlik, and more. However, if there is no BI system directly consuming the data, and most of the consumers are other applications receiving data through ETL processes, maybe Synapse is not the best option. A data lake could perform the same task for a lower cost in this situation.

Summary

Should you use Synapse? Your volume of data, the need for MPP, and the consumers of your data can provide good guidance about whether Synapse is a good choice for your solution or not.

The post Synapse Analytics: When you should use (or not use) Synapse appeared first on Simple Talk.




Thursday, September 1, 2022

Starting my dream job…

A new adventure, there in your eyes.

It’s just beginning,

Feel your heart beat faster.

Reach out and find your

Happily ever after!

Well, here I am. My first new job in over 20 years and it is a dream come true. I have always wanted to do more to help people learn about technology and being the Editor of the Simple Talk website is going to be a great place to do that. I have admired the heck out of the people who I am following and I have worked with them all over the years.

How that dream comes out, who knows? I won’t pretend the back of my head isn’t filled with the little voices telling me “you are the imposter,” like we are playing a game of Clue, only we are looking for the person who shouldn’t be there rather than a murderer. As I said, I have known the previous editors and they were all the greatest… can I keep up?

I have come to realize that practically everyone has that same voice echoing frustrations in their cavernous skull boxes. And it is okay. It probably is the only thing standing between some of us trying to fly without wings over the edge of the Grand Canyon (or since Redgate is headquartered across the pond from me, the White Cliffs of Dover). Doubt makes us work smarter along with harder when it isn’t making us doubt our ability to do anything.

I don’t want to be better than my predecessors, or worse, or even the same. I want to be me and do the things that I am good at and surround myself with people who make me look (but not feel) dumb.

I wrote this the night before I start at Redgate and am publishing it 10 minutes before my first meeting. As I often do, I read this to my editor (my wife Dr Valerie Davidson, a real Dr of Education, not like my drsql Twitter handle!), and she asked, “do you really want to come off sounding so scared?” I just want to be honest about how I feel the night before I start.

I want to say that I feel like a 10 year old kid about to board their first roller coaster with an inversion. Excited beyond words and also scared to death. Will this be the most amazing thing, or will it kill me? I have experienced first roller coaster rides with several people who were scared of it for too long, some with tears (ok, maybe that was me). Most get finished with that first ride and it becomes something they like. A few hate it. Others, it becomes an obsession (this was me.)

Did I mention I am a theme park fanatic? I often do! This was the picture I took for the company directory. Fun fact, my first interview was from Dollywood where I was about to do some article interviews of some of their hosts. I have a YouTube channel and a theme park blog too.

 

My experience with roller coasters is very similar to my experience with the SQL Server community for the past 20 years (or maybe more, you can see I am not 22 in the picture). I was scared silly to get involved, then I did, then it changed my life forever. I hope I can live up to the task of helping others do the same. Either way, I will give it my all and do my best to have no regrets.

Let’s go!

The post Starting my dream job… appeared first on Simple Talk.




Do not be surprised

For the first time in eleven years of travel, I became profoundly sick while on the road. No, I’m not sharing details of any kind. What I will share is just this: I wasn’t prepared.

When I travel, I look at the weather, where I’m going, how long, and I pack accordingly. I have my presentations and code backed up, locally, at home, and on the cloud. I’m ready for flight cancellations and all sorts of travel mishaps. But I was not ready for being ill. I really should have been.

I think a lot of our emergency preparedness is like this. We think we have our backups, high availability, disaster recovery all set. After all, we’ve experienced a lot of different outages and have learned from them. Further, we know that things go wrong, so we look at what happens to others and try to learn from them, rather than on our own. Yet, chances are, something is going to take you by surprise.

What can you do?

Well, same as you’ve done mostly. Examine and test your preparations. Learn from others. Keep an eye out for changing circumstances. Most of all, practice recovery. Finally, try to develop a resilient attitude about things. That should be as much a part of your preparations as anything else.

Commentary Competition

Enjoyed the topic? Have a relevant anecdote? Disagree with the author? Leave your two cents on this post in the comments below, and our favourite response will win a $50 Amazon gift card. The competition closes two weeks from the date of publication, and the winner will be announced in the next Simple Talk newsletter.

The post Do not be surprised appeared first on Simple Talk.


